Annotating variants relative to your reference database

March 12, 2024 03:26
Updated

This article describes how to turn on the option for annotating differences relative to the reference database(s) used in Antibody Annotator, NGS Antibody Annotator and Single Cell Analysis, and where to view these differences in your analysis results. Differences are always reported relative to whichever Reference Database(s) the analysis used. To learn how to make your own reference databases, view this article.

How to turn on Annotate variants from reference database

To annotate the differences between the input sequence and reference sequence, select the Annotate variants from reference database option under Analysis Options in the Annotation pipelines.

It can also be helpful to turn on Always annotate entire regions (except CDR3) in the Advanced Options tab, to ensure that variants are identified across the full variable region.

Note: If you have selected Long reads (PacBio/Nanopore) under the Advanced Options when running Antibody Annotator or NGS Antibody Annotator, only the amino acid variants will be recorded.

How to annotate variants relative to an Antibody Template Sequence

Note that the Annotate variants from reference database option will annotate any differences relative to the Reference Database(s) used. Often these reference sequences will be the germline genes, but using a custom reference database containing a Template sequence (e.g. full VDJ) allows you to annotate the variants relative to the target sequence, including along the CDR3s.

Learn how to make your own custom Antibody Template Reference Database here: How to make a Custom Reference Database.

Viewing variants in the Sequence Viewer

To view the variants, select any Biologics Annotator Result document and select the sequence that you are interested in.

Note that:

In general, for every sequence with variants, there will be a DNA (nucleotide) and an AA (amino acid) variant annotation track.
- If you have selected Long reads (PacBio/Nanopore) under the Advanced Options when running Antibody Annotator or NGS Antibody Annotator, only the amino acid variants will be recorded.
When equally good matches for the same gene would have exactly the same nucleotide or amino acid differences, these annotations are reduced to just a single track named after both (or more) genes.
Amino acid differences use the frame of the closest FR/CDR that starts at or before the start of the gene. For example, a frameshift in FR1 has no effect on the J Gene amino acid differences, which use the frame that CDR3 starts in.
To turn off the annotation tracks, simply uncheck the DNA Variant and AA Variant annotation tracks in the Sequence Viewer right-hand panel.
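
The merging of equally good matches described above can be pictured as grouping genes by their list of differences. The following is only an illustrative sketch, not how Biologics implements it: the function name `merge_tracks`, the gene names, and the separator used to build the combined track name are all assumptions.

```python
from collections import defaultdict


def merge_tracks(differences_by_gene):
    """Group genes whose differences are identical into one track.

    differences_by_gene maps a gene name to its ordered list of
    differences. Hypothetical helper, for illustration only.
    """
    grouped = defaultdict(list)
    for gene, differences in differences_by_gene.items():
        # Genes with exactly the same differences share one key.
        grouped[tuple(differences)].append(gene)
    # Name each merged track after every gene it represents
    # (the ", " separator is an assumption).
    return {", ".join(genes): list(differences)
            for differences, genes in grouped.items()}


# Made-up example: two equally good matches with the same difference.
tracks = merge_tracks({
    "IGHV1-2*01": ["Q17E"],
    "IGHV1-2*02": ["Q17E"],
    "IGHV1-3*01": ["Q17E", "T30S"],
})
```

Here the first two genes collapse into a single track named after both, while the third keeps its own track because its differences are not identical.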

Viewing variants in the Sequences Table

When Annotate variants from reference database is selected, a number of additional columns are present in the All Sequences table, containing information about variants across different regions of interest.

Any mismatches between the best-matched gene in the germline reference database and your annotated sequences are output to the Sequences Table in the standard format, e.g. Q17E for a substitution of Gln to Glu at position 17.

Numbering is based on the IMGT system unless otherwise specified under Analysis Options when running Antibody Annotator.

Variant Nomenclature

The HGVS nomenclature is used in Biologics for mutations other than substitutions:

Deletions
- L4del - Leu at position 4 deleted
- P9_G11del - Pro at position 9 through to Gly at position 11 deleted
Insertions
- T18_L19insK - Lys has been inserted between Thr18 and Leu19
Deletion-Insertions
- S88_L91delinsT - Residues Ser88 through to Leu91 have been deleted and replaced with a Thr
Frameshifts
- X142fs - Frameshift at position 142. The amino acid is given as X because the number of nucleotides deleted or inserted is not a multiple of 3. No whole codon was deleted, inserted or substituted, so the amino acid cannot be determined.

Note: IMGT numbering adds letter suffixes to some positions, e.g. 98a, 98b, 98c. A mutation that includes this notation may look like P98a_D98bDel. This means that the two amino acids (Pro and Asp) found at positions 98a and 98b in the germline have been deleted.
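
If you need to process these variant strings outside Biologics (for example after exporting the Sequences Table), the notation above can be recognised with a few regular expressions. This is a minimal sketch under stated assumptions, not an official parser: the names `Variant` and `parse_variant` are hypothetical, and it only covers the forms listed in this article, including lowercase position suffixes and both `del` and `Del` spellings.

```python
import re
from typing import NamedTuple, Optional

# One residue: a one-letter amino acid plus a position with an
# optional lowercase IMGT suffix, e.g. "P98a".
RES = r"([A-Z])(\d+[a-z]?)"

# Each pattern must match the whole string (see fullmatch below).
PATTERNS = [
    ("substitution", re.compile(rf"{RES}([A-Z])")),
    ("frameshift", re.compile(rf"{RES}fs")),
    ("deletion-insertion", re.compile(rf"{RES}(?:_{RES})?[Dd]elins([A-Z]+)")),
    ("deletion", re.compile(rf"{RES}(?:_{RES})?[Dd]el")),
    ("insertion", re.compile(rf"{RES}_{RES}ins([A-Z]+)")),
]


class Variant(NamedTuple):
    kind: str
    start: str              # first affected residue, e.g. "S88"
    end: Optional[str]      # last residue of a range, if any
    new: Optional[str]      # substituted or inserted residues, if any


def parse_variant(text: str) -> Variant:
    """Parse one variant string such as 'Q17E' or 'P9_G11del'."""
    text = text.strip()
    for kind, pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match is None:
            continue
        g = match.groups()
        start = g[0] + g[1]
        if kind == "substitution":
            return Variant(kind, start, None, g[2])
        if kind == "frameshift":
            return Variant(kind, start, None, None)
        # Remaining kinds may carry a second residue marking a range end.
        end = g[2] + g[3] if g[2] else None
        new = g[4] if kind != "deletion" else None
        return Variant(kind, start, end, new)
    raise ValueError(f"Unrecognised variant: {text!r}")


examples = ["Q17E", "L4del", "P9_G11del", "T18_L19insK",
            "S88_L91delinsT", "X142fs", "P98a_D98bDel"]
parsed = [parse_variant(example) for example in examples]
```

Keeping one pattern per variant type, rather than a single combined expression, makes it easier to see which form a string was matched as and to add further forms later.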

Screen_Shot_2022-02-15_at_4.52.54_PM.png

In the above image, we can see that in the Heavy FR4 region of this sequence there are a total of two AA Germline Mismatches, corresponding to the substitutions T122I and L123V. These Germline Mismatches indicate the differences between the sequenced FR4 region and the best-fit candidate gene found in the germline: IGHJ4-02.

Notes:

Where a region (e.g. FR1) is longer than the corresponding gene region (e.g. Heavy V Gene), or lies within the highly variable CDR3, nucleotides that are not covered by a gene annotation are counted as mismatches in any mismatch statistics.
- When using an Antibody Template Reference Database, the CDR3 region variants will be annotated.
The number of mismatches in the DNA with respect to the best-fit germline gene is also recorded in the Sequences Table.
All or any parts of the Sequences Table can be exported as an .xlsx or .csv file by selecting the columns under Table Preferences and then selecting Export in the Table Viewer.

See Varnomen for more information on HGVS nomenclature.

Germline Gene Statistics

In addition to the variant annotations, you can also view the closest V, D, and J gene matches for each of your sequences. Biologics also calculates the percentage identity of your sequence relative to the closest gene (Heavy V Gene Identity below), and the percentage of the complete gene that the target sequence matches (Heavy V Gene Coverage).

Screen_Shot_2019-10-14_at_2.11.20_PM.png

As shown in the image above, filtering and sorting on these gene match statistics can aid in investigating the characteristics of your sequence data set. There is more detailed information about filtering here.
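
The same kind of filtering and sorting can also be applied to a Sequences Table exported as a .csv file. The sketch below is a rough illustration only. It assumes the exported column headers match the names shown in the table (Heavy V Gene Identity, Heavy V Gene Coverage), that there is a Name column, and that values are written as percentages with or without a trailing % sign. The helper names `as_percent` and `filter_by_gene_match` and the sample data are made up.

```python
import csv
import io


def as_percent(value):
    """Convert '97.5' or '97.5%' to a float; blank cells become 0.0."""
    text = value.strip().rstrip("%")
    return float(text) if text else 0.0


def filter_by_gene_match(rows, min_identity, min_coverage):
    """Keep rows meeting both thresholds, highest identity first."""
    kept = [
        row for row in rows
        if as_percent(row["Heavy V Gene Identity"]) >= min_identity
        and as_percent(row["Heavy V Gene Coverage"]) >= min_coverage
    ]
    return sorted(
        kept,
        key=lambda row: as_percent(row["Heavy V Gene Identity"]),
        reverse=True,
    )


# Made-up stand-in for an exported .csv file; in practice you would
# open the exported file instead of using io.StringIO.
exported = io.StringIO(
    "Name,Heavy V Gene Identity,Heavy V Gene Coverage\n"
    "seq1,98.6%,100%\n"
    "seq2,85.0%,100%\n"
    "seq3,99.3%,72.5%\n"
    "seq4,99.0%,100%\n"
)
rows = list(csv.DictReader(exported))
selected = filter_by_gene_match(rows, min_identity=95.0, min_coverage=90.0)
names = [row["Name"] for row in selected]
```

With these thresholds, seq2 is dropped for low identity and seq3 for low coverage, leaving seq4 and seq1 in that order.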