What is Antibody Annotator?
The Antibody Annotator annotates your input data using the reference germline sequences in the Reference Database. This pipeline is suitable for the analysis of both NGS and Sanger type data. It supports annotation of most IgG-like molecules, including scFvs and Fabs and a variety of different species. It may also be used to annotate similar molecules, so long as they have defined FR and CDR regions in a variable region. If you are unsure whether this includes your molecule or are wondering what other solutions we have available, please contact us directly.
In addition to identifying CDR and FR regions in raw antibody sequence data, the Antibody Annotator also offers deeper analysis capabilities:
- Identification of closest germline match(es) for V, D, and J genes as appropriate
- Visualisation of both silent and non-silent variation present compared to the germline match
- Identification of sequence-based liabilities in the variable region, and a liability and sequence quality scoring system
- Region-based clustering of sequences to identify broader trends in the sequence dataset. Regions available include FR, CDR, VDJ, and VJ regions, plus more depending on options selected. See our article for some examples: What is a cluster?
- A rich visualisation suite including a sequence explorer, alignment views, and a broad range of graphs to help you gain further observations from your data.
How do I run Antibody Annotator?
To run the Antibody Annotator, select a file in your folder and then select Antibody Annotator in the Annotation dropdown.
To start the Antibody Annotator operation, adjust your options in the pop-up as desired and then click Run. This operation will output a Biologics Annotator Result file.
Antibody Annotator Options
The following sections outline the steps to successfully carry out an analysis using the Antibody Annotator and how each section and option works.
In the Antibody Annotator analysis popup window, select a reference database from the Reference Database dropdown. The Reference Database here should be a database of annotated germline sequences, and is used to help identify the correct FR and CDR regions in the new sequences being analysed. To learn more about reference databases, please refer to the following article.
The Genetic code dropdown allows you to select the genetic code used to translate nucleotide sequences. The codes are obtained directly from NCBI. One additional genetic code, "Amber readthrough", allows the amber stop codon not to be treated as a stop codon during translation.
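As a rough illustration of what amber readthrough means for translation, here is a minimal Python sketch. It is not the Antibody Annotator's implementation. It builds the NCBI standard code (translation table 1) and optionally reads the amber codon (TAG) as an amino acid instead of a stop. The function name `translate` and the choice of glutamine (Q) as the readthrough residue are assumptions made for the example; the article does not state which residue the product inserts.

```python
from itertools import product

BASES = "TCAG"
# NCBI standard genetic code (translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, amber_readthrough=False, amber_residue="Q"):
    """Translate a nucleotide sequence in frame 1; '*' marks a stop codon."""
    table = dict(STANDARD_CODE)
    if amber_readthrough:
        # Assumption for illustration: the amber codon (TAG) is read as glutamine.
        table["TAG"] = amber_residue
    seq = seq.upper()
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


print(translate("ATGTAGGGC"))                          # M*G
print(translate("ATGTAGGGC", amber_readthrough=True))  # MQG
```

With readthrough enabled, the sequence downstream of the amber codon stays in the same open reading frame instead of ending at the stop.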
The Selected sequences are dropdown specifies the expected chain type(s) found in the input sequences.
Single chain (either heavy or light) tells the annotator to only look for one chain in each sequence. The annotator will determine whether each sequence more closely matches the Heavy or the Light chains in your database.
- In the Antibody Annotator pipeline, the reference database is split into heavy and light sections. By specifying the expected chain (either all Heavy or all Light), only the appropriate database section is used. This improves performance and potentially accuracy too.
- When selecting "Single chain" options, you can send paired reads as input, for example paired reads that had no overlap and were not able to be merged. The Antibody Annotator will attempt to use both reads in a pair to generate a single V(D)J region. If successful, there will be a linker section of Amino Acid ambiguities joining the two ends.
- If you expect your sequences to have both heavy and light chains in the same sequence (scFv sequences), select either the Both chains in a single sequence or Both chains in a single sequence with linker (scFv) value in the dropdown. There is not a lot of difference between these two options.
If you have two chains in separate sequences that you have associated together, select the Both chains in associated sequences option to ensure that these separate sequences are analysed together. You can pair associated sequences in three ways:
- by specifying a Name Scheme that has a Chain field in the Antibody Annotator Name Scheme option. There is more information about Name Schemes here.
- by specifying a Chain field in your Name Scheme when assembling Sanger sequences by Name Scheme prior to running Antibody Annotator (pairing then happens automatically, as long as you select "Chain type: both chains in associated sequences"). There are examples here.
- by using the Pair Heavy/Light Chain operation (the older method)
Note: Single Heavy or Light chain sequences will be annotated as usual when using associated Heavy/Light Chains; however, they will be marked as "Not Fully Annotated" in the analysis, as only one chain could be found.
This option allows you to specify custom nucleotide features (such as Signal Peptides or HisTags) that will be annotated on each sequence. To learn more about annotation of additional features, please refer to the following article.
The Name scheme dropdown lets you select a Name Scheme that can read the selected sequence names to extract information of interest. Antibody Annotator will use this information to pair Heavy/Light chains (if required) and output all Name Scheme fields as columns in the results table. For more information about Name Schemes, see What Is a Name Scheme and Why Is It Useful?
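To make the idea of pairing by Name Scheme concrete, the sketch below splits sequence names into fields and groups heavy and light chains that share the remaining fields. The name layout (`sample_clone_chain`), the underscore delimiter, and the functions `parse_name` and `pair_chains` are all hypothetical; real Name Schemes are configured in the application and may differ.

```python
from collections import defaultdict


def parse_name(name, fields=("sample", "clone", "chain"), delimiter="_"):
    """Split a sequence name into named fields (hypothetical layout)."""
    return dict(zip(fields, name.split(delimiter)))


def pair_chains(names):
    """Group sequence names that share every field except the chain field."""
    groups = defaultdict(dict)
    for name in names:
        info = parse_name(name)
        key = (info["sample"], info["clone"])
        groups[key][info["chain"]] = name
    return dict(groups)


pairs = pair_chains(["S1_A01_H", "S1_A01_L", "S1_B02_H"])
for key, chains in pairs.items():
    status = "paired" if {"H", "L"} <= chains.keys() else "single chain only"
    print(key, chains, status)
```

In this example the clone `B02` has only a heavy chain, so it would be reported on its own rather than as a pair.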
Pseudogenes and ORF genes
The bundled human and mouse immunoglobulin databases may contain a number of pseudogenes and ORF genes along with fully functional genes. You can choose to Include pseudogenes from database and/or the ORF genes from database in the analysis as mutations in these genes might be corrected prior to expression. If these are not included, then each sequence will be classified according to the most closely related functional gene in the database.
Germline gene difference annotation
To annotate and see the differences between your input sequences and reference sequences, select the Annotate germline gene differences option. With this option selected, the nucleotide and amino acid differences will be annotated on your input sequences. For more information on viewing these differences and what they mean, see this article.
The database is assumed to contain IMGT style annotations. To change the annotation scheme from IMGT to Kabat, select Kabat in the Results Annotation Scheme dropdown. The Kabat style results are produced by adjusting the end points of the IMGT CDRs.
Note: Custom FR/CDR offsets are also available if you have your own preferred annotation scheme. Please contact us and we can enable this extra option for you.
Liability and asset search
To search and score motifs liable to post-translational modifications or any other types of modifications or beneficial motifs, select the Find liabilities and assets option. The Antibody Annotator pipeline has a default set of sequence liability checks. To learn more about configuring this option, please refer to the following article.
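A sequence-based liability search can be thought of as pattern matching on the translated variable region. The sketch below is only an illustration: the three motifs are commonly cited examples and are not taken from the Antibody Annotator's default liability set, and the function `find_liabilities` is a hypothetical name. No scoring is attempted.

```python
import re

# Illustrative motifs only; not the product's default liability set.
MOTIFS = {
    "N-linked glycosylation (N-X-S/T, X not P)": r"N[^P][ST]",
    "Asparagine deamidation (NG)": r"NG",
    "Aspartate isomerisation (DG)": r"DG",
}


def find_liabilities(aa_seq):
    """Return (label, 0-based start, matched text) for every motif hit."""
    hits = []
    for label, pattern in MOTIFS.items():
        # The lookahead lets overlapping occurrences be reported too.
        for match in re.finditer(f"(?=({pattern}))", aa_seq):
            hits.append((label, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


for hit in find_liabilities("QNGSDGA"):
    print(hit)
```

Here the asparagine at position 1 is reported twice, once as part of a glycosylation sequon (NGS) and once as a deamidation site (NG), which is why per-motif scoring is useful.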
Clustering provides a way to group your sequences based on genes or regions of interest. To learn more about clustering and how it can help with interpreting your dataset, see this article.
Several default clusters will already be listed, and further clusters can be added using the "Plus" sign. Any clusters that are not applicable to a particular analysis, such as Light CDR3 in a Heavy chain data set, will be skipped.
It is possible to cluster up to six regions or genes (FR1, CDR3, Heavy D gene etc.) together based on shared identity across sequences in the regions selected. It is also possible to allow mismatches across one region and to cluster based on amino acid similarity. To learn more about configuring this option, please refer to the following article.
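The simplest case, exact shared identity across the selected regions, can be sketched as grouping sequences by a combined key. This is a simplified illustration and not the product's clustering algorithm; it does not cover mismatches or amino acid similarity. The record layout (a dict with a `name` plus one entry per region) and the function `cluster_by_regions` are assumptions.

```python
from collections import defaultdict


def cluster_by_regions(records, regions):
    """Group records whose selected regions are all identical."""
    if not 1 <= len(regions) <= 6:
        raise ValueError("select between one and six regions or genes")
    clusters = defaultdict(list)
    for rec in records:
        # Skip records missing a selected region (e.g. no Light CDR3 in heavy-only data).
        if any(region not in rec for region in regions):
            continue
        key = tuple(rec[region] for region in regions)
        clusters[key].append(rec["name"])
    return dict(clusters)


records = [
    {"name": "seq1", "Heavy CDR3": "ARDYW", "Heavy V gene": "IGHV1-2"},
    {"name": "seq2", "Heavy CDR3": "ARDYW", "Heavy V gene": "IGHV1-2"},
    {"name": "seq3", "Heavy CDR3": "ARGYW", "Heavy V gene": "IGHV1-2"},
]
print(cluster_by_regions(records, ["Heavy CDR3", "Heavy V gene"]))
```

Clustering on `Heavy V gene` alone would place all three sequences in one cluster, while adding `Heavy CDR3` splits off `seq3`.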
In general, most NGS data are relatively large and contain low quality sequences and noise. In order to improve the meaningfulness of clusters in your results, select one or more of the following options:
- Only cluster results with asset and liability score of at least - This will cluster the sequences based on the score specified. For example, if you specify a score of 1000, only sequences that have a liability and asset score of 1000 or more will be included in the clusters.
- Only cluster results which are - This restricts clustering to sequences that are either: Fully annotated; Fully annotated and in frame; or Fully annotated, in frame, and without stop codons. For example, if you choose Fully annotated and in frame, only sequences that are both fully annotated and in frame will be clustered. Sequences that are not fully annotated or that are frameshifted will be excluded from the clustering operation.
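The two filters above can be pictured as a simple check applied to each sequence before clustering. The sketch below is an assumption-laden illustration: the field names (`score`, `fully_annotated`, `in_frame`, `has_stop`), the level names, and the function `passes_filters` are hypothetical and do not correspond to the product's internals.

```python
# Hypothetical filter levels, from least to most strict; each level includes the ones before it.
LEVELS = ("fully_annotated", "in_frame", "no_stop_codons")


def passes_filters(rec, min_score=None, level=None):
    """Return True if a record should be included in clustering."""
    if min_score is not None and rec["score"] < min_score:
        return False
    if level is None:
        return True
    checks = {
        "fully_annotated": rec["fully_annotated"],
        "in_frame": rec["in_frame"],
        "no_stop_codons": not rec["has_stop"],
    }
    needed = LEVELS[: LEVELS.index(level) + 1]
    return all(checks[name] for name in needed)


rec = {"score": 1200, "fully_annotated": True, "in_frame": True, "has_stop": True}
print(passes_filters(rec, min_score=1000, level="in_frame"))        # True
print(passes_filters(rec, min_score=1000, level="no_stop_codons"))  # False
```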
Database gene classification
When multiple genes are equally close to a query sequence, there are three ways to count that sequence in the list showing the number of sequences matching each gene. For example, if a query sequence is equally close to IGHD1-26 and IGHD2-15, each option below handles it differently:
- Each gene with partial frequency - This will assign that query sequence to all matching genes with partial frequency. Based on the example above, this sequence will add 0.5 to IGHD1-26's and 0.5 to IGHD2-15's total number of sequences.
- Group of genes - This will create a separate entry in the list of genes that represents this combination of genes. Based on the example above, this sequence will contribute 0 towards the total for each of IGHD2-15 and IGHD1-26, and instead add 1 to the total for a gene called "IGHD1-26/IGHD2-15".
- Unknown - This will treat this as an unknown gene. Based on the example above, this sequence will add nothing to the totals of IGHD1-26 and IGHD2-15.
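The three options differ only in how a tied sequence is counted, which the following sketch makes explicit. The function `tally_genes` and the mode names are hypothetical. Counting tied sequences under an "Unknown" label in the third mode is also an assumption; the text above only states that they add nothing to the individual gene totals.

```python
from collections import Counter


def tally_genes(best_matches, mode="partial"):
    """Count sequences per gene; each item lists the equally close genes for one sequence."""
    totals = Counter()
    for genes in best_matches:
        if len(genes) == 1:
            totals[genes[0]] += 1
        elif mode == "partial":
            # Each gene with partial frequency: split one count evenly across the tied genes.
            for gene in genes:
                totals[gene] += 1 / len(genes)
        elif mode == "group":
            # Group of genes: one count for the combined entry.
            totals["/".join(sorted(genes))] += 1
        elif mode == "unknown":
            # Assumption: tied sequences are tallied under an "Unknown" label.
            totals["Unknown"] += 1
        else:
            raise ValueError(f"unrecognised mode: {mode}")
    return dict(totals)


matches = [["IGHD1-26", "IGHD2-15"], ["IGHD1-26"]]
print(tally_genes(matches, "partial"))  # {'IGHD1-26': 1.5, 'IGHD2-15': 0.5}
print(tally_genes(matches, "group"))    # {'IGHD1-26/IGHD2-15': 1, 'IGHD1-26': 1}
print(tally_genes(matches, "unknown"))  # {'Unknown': 1, 'IGHD1-26': 1}
```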
Fully annotated region
To define what a "fully annotated" sequence is, select the start and end regions from the dropdown menus. The default values, FR1 to FR4, mean that a sequence is considered fully annotated if it contains all of the regions FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4. In addition to affecting the "Fully Annotated" column in your "All sequences" result table, this may also determine which sequences are used to create the cluster tables. See the section on clustering above for more information.
To automatically trim the non-annotated regions upstream and downstream of your annotated sequences, select the Trim each side of fully annotated region if over option and specify the number of nucleotides to leave untrimmed on each side.
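The "fully annotated" rule amounts to checking that every region between the chosen start and end regions was found. A minimal sketch, assuming a hypothetical function `is_fully_annotated` that receives the set of region names annotated on a sequence:

```python
REGION_ORDER = ["FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4"]


def is_fully_annotated(found_regions, first="FR1", last="FR4"):
    """True if every region from `first` to `last` (inclusive) was annotated."""
    start = REGION_ORDER.index(first)
    end = REGION_ORDER.index(last)
    required = REGION_ORDER[start:end + 1]
    return all(region in found_regions for region in required)


found = {"CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4"}  # FR1 is missing
print(is_fully_annotated(found))                # False with the default FR1 to FR4
print(is_fully_annotated(found, first="CDR1"))  # True once FR1 is no longer required
```

Narrowing the range is useful when, for example, reads routinely start partway into the variable region.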
- The default is set to 10. This means that your annotated sequence will have 10 base pairs flanking the 5' and 3' ends of the sequence.
- For single chain sequences, if the start of FR1 is untruncated, the start is trimmed. However, if the V-Gene starts before FR1, trimming starts outside of the V-Gene. Similar rules are applied for FR4.
- For scFv data, this applies to the first FR1 and the last FR4.
Note that trimming only applies to fully annotated sequences. Sequences that the Antibody Annotator operation classifies as not fully annotated are left untrimmed.
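The trimming behaviour described above can be approximated as follows. This is a simplified sketch: it ignores the V-Gene and scFv rules, and the function `trim_flanks` with its 0-based, end-exclusive coordinates is an assumption made for the example.

```python
def trim_flanks(seq, region_start, region_end, keep=10, fully_annotated=True):
    """Trim sequence outside the annotated region, leaving up to `keep` bases per side.

    region_start and region_end are 0-based, end-exclusive coordinates of the
    fully annotated region (hypothetical convention for this sketch).
    """
    if not fully_annotated:
        # Sequences that are not fully annotated are left untrimmed.
        return seq
    left = max(0, region_start - keep)
    right = min(len(seq), region_end + keep)
    return seq[left:right]


seq = "A" * 25 + "C" * 30 + "G" * 15   # 25 bases upstream, 30 annotated, 15 downstream
trimmed = trim_flanks(seq, region_start=25, region_end=55, keep=10)
print(len(trimmed))  # 50: 10 flanking bases + 30 annotated + 10 flanking bases
```

A flank that is already shorter than `keep` is left as it is, which matches the "if over" wording of the option.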
Entire regions only (Alpha)
With the Always annotate entire regions (except CDR3) option selected, a CDR or FR annotation that would otherwise be truncated due to mismatches is instead extended to the boundary of the respective CDR/FR region, so that the region is annotated in full. This allows you to see germline variants across the whole sequence, even in divergent or poor quality areas.