Antibody Annotator

Updated: October 02, 2024 01:27

This article outlines all of the main options available in Antibody Annotator. The videos in our Getting Started Series may also be helpful. Below is our video on using the Antibody Annotator Tool.

If you don't know how to process your Sanger sequences prior to antibody annotation, please see Workflows for Sanger Antibody Analysis.

What is Antibody Annotator?

The Antibody Annotator annotates your input data using the reference sequences in your chosen Reference Database. These can be germline genes or a custom-made database of antibody template sequence(s); see Understanding Reference Databases. This pipeline is suitable for the analysis of Sanger and smaller NGS datasets (fewer than 5 million sequences).

If you have datasets larger than 5 million sequences, we recommend using NGS Antibody Annotator instead. NGS Antibody Annotator uses an alternative algorithm that collapses your dataset down to representative sequences.

Antibody Annotator supports annotation of most IgG-like molecules, including scFvs, VHH-VHH and Fabs across a variety of different species. It may also be used to annotate similar molecules, like TCRs, so long as they have defined FR and CDR regions in a variable region. If you are unsure whether this includes your molecule or are wondering what other solutions we have available, please contact us directly.

In addition to identifying CDR and FR regions in raw antibody sequence data, the Antibody Annotator also offers deeper analysis capabilities:

Identification of closest germline match(es) for V, D, and J genes as appropriate
Visualisation of both silent and non-silent variation present compared to the germline match
Identification of sequence-based liabilities in the variable region, and a liability and sequence quality scoring system
Region-based clustering of sequences to identify broader trends in the sequence dataset. Please see this article to learn more: Understanding "clusters"
Custom clonotype identification via clustering: Clustering Options.
A rich visualisation suite including a sequence explorer, alignment views, and a broad range of graphs to help you gain further observations from your data.

How do I run Antibody Annotator?

To run the Antibody Annotator, select a file in your folder and choose Antibody Annotator in the Annotation dropdown.

[Screenshot: Antibody Annotator in the Annotation dropdown]

To start the Antibody Annotator operation, adjust your options in the pop-up as desired and then click Run. This operation will output a Biologics Annotator Result file.

Settings

The following sections outline the steps to successfully carry out an analysis using the Antibody Annotator and how each section and option works.

Main Options

[Screenshot: Antibody Annotator Main Options]

Reference database(s):

The Reference Database(s) selected can be of antibody germline sequences or of template (variable region) sequences. A reference database is used to help identify the correct FR and CDR regions in the new sequences being analyzed. See Understanding Reference Databases.
- Your Biologics account will include Human, Mouse and Alpaca Ig germline databases. If you would like access to other germlines, please either contact us or see How to make a Custom Reference Database to easily make your own.
- Multiple reference databases can be selected, for example to compare hybridized datasets including Human and Mouse germline genes.

Selected Sequences are:

Single Chain Options:
- Either Heavy or Light: The annotator will determine whether each sequence more closely matches the Heavy or the Light chains in your database.
- Single chain (light) identifies single light chains per read.
- Single chain (heavy) identifies single heavy chains per read.

Note: when selecting "Single chain" options, you can send paired reads as input, for example paired reads that had no overlap and could not be merged. The Antibody Annotator will attempt to use both reads in a pair to generate a single V(D)J region. If successful, there will be a linker section of Amino Acid ambiguities joining the two ends.

Two chain/scFv Options:
- Both chains in a single sequence expects to find a heavy and light chain per read/sequence.
- Both chains in a single sequence (opposite directions) also expects to find a heavy and light chain per read/sequence, but specifically for the case in which the reading frame for one chain is on the forward DNA strand, while the other chain is on the reverse DNA strand.
- Both chains in a single sequence with linker (scFv) expects to find a heavy and light chain per read/sequence, and will place an annotation spanning the linker between the two chains.

Associated/Paired Heavy/Light chain options:
- Both chains in associated sequences expects two or more heavy and light chains on separate reads that have been paired or "associated" together. This ensures that these separate reads get analysed together.
- The pairing/association can be designated in a couple of different ways:
  - 1. Via a Name Scheme:
    To find and pair heavy-kappa and heavy-lambda pairs from the same sample, the sample and chain information will need to be captured somewhere in the sequence name. You can then make a Name Scheme, and specify this in the Main Options. This Name Scheme will tell the annotator to enumerate chains that are found within the same sample (via the Common Identifier). See How to Create a Name Scheme.
    - Note: if you Batch assembled your Sanger sequences using a Name Scheme, this Name Scheme will be automatically entered in the Main Options.
  - 2. Via the Pair Heavy/Light Chain operation
    This allows you to pair heavy and light chains on separate sequences without "merging" Sanger reads. If you have paired reads using this operation, you do not need to use a Name Scheme.
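
As a rough illustration of what a Name Scheme captures, the sketch below parses a hypothetical naming convention, `<sample>_<chain>` (for example `A01_H` and `A01_K`), and groups sequence names by their common identifier. The naming pattern, `NAME_PATTERN` and `group_by_common_identifier` are assumptions made for this example; real Name Schemes are configured in the Biologics interface and may look quite different.

```python
import re
from collections import defaultdict

# Hypothetical naming convention: "<sample>_<chain>", e.g. "A01_H" or "A01_K".
NAME_PATTERN = re.compile(r"^(?P<sample>[^_]+)_(?P<chain>[HKL])$")


def group_by_common_identifier(names):
    """Group sequence names by the sample field parsed from each name."""
    groups = defaultdict(list)
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # names that do not fit the scheme are skipped
        groups[match.group("sample")].append((match.group("chain"), name))
    return dict(groups)


groups = group_by_common_identifier(["A01_H", "A01_K", "A02_H", "bad-name"])
print(groups)
```

Here `A01` ends up with a heavy and a kappa chain that could be analysed together, while `A02` contains only a single heavy chain.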

Note: Single Heavy or Light chain sequences will be annotated as usual when using associated Heavy/Light Chains; however, they will be marked as "Not Fully Annotated" in the analysis, as only one chain could be found.

VHH-VHH options:
- Two heavy chains in a single sequence expects to find two heavy chains per read/sequence.
- Two heavy chains in a single sequence with linker (scFv) expects to find two heavy chains per read/sequence, and will place an annotation spanning the linker between the two chains.

Sequence region of interest is between:

To define what a "fully annotated" sequence is, select the start and end regions from the dropdown menus. The default values, FR1 and FR4, mean that a sequence is considered fully annotated if it contains all of the regions FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4. In addition to affecting the "Fully Annotated" column in your "All sequences" result table, this may also determine which sequences are used to create the cluster tables. See the section on clustering below for more information.
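
The rule can be pictured with a minimal sketch, assuming each sequence is represented simply by the set of region names found on it. `REGION_ORDER` and `is_fully_annotated` are hypothetical names for illustration, not part of the product.

```python
# Variable-region order used to decide which regions lie between the two bounds.
REGION_ORDER = ["FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4"]


def is_fully_annotated(found_regions, start="FR1", end="FR4"):
    """Return True if every region from start to end (inclusive) was found."""
    first = REGION_ORDER.index(start)
    last = REGION_ORDER.index(end)
    required = REGION_ORDER[first:last + 1]
    return all(region in found_regions for region in required)


partial = {"CDR1", "FR2", "CDR2", "FR3", "CDR3"}
print(is_fully_annotated(set(REGION_ORDER)))                 # all seven regions present
print(is_fully_annotated(partial))                           # FR1 and FR4 are missing
print(is_fully_annotated(partial, start="CDR1", end="CDR3")) # narrower region of interest
```

Narrowing the region of interest (for example CDR1 to CDR3) lets shorter reads count as fully annotated.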

Name Scheme:

This option will be filled in automatically if you used Batch Assemble Sanger Sequences with a Name Scheme. Name Schemes are highly customisable, and allow you to use information contained within the sequence names (e.g. well or donor/sample) to classify sequences and groups of sequences. A Name Scheme is crucial for the option below, which can enumerate pairs (for example: Heavy-Kappa and Heavy-Lambda) within the same sample/well.

If there are three or more sequences in a pair:

This option, in conjunction with a Name Scheme (above), allows you to enumerate the possible heavy-light pairs within a common identifier/sample. It will only be available if the option Selected Sequences are: Both chains in associated sequences is chosen. Options:
- Leave sequences unpaired
  This will pair any doublets (a single heavy and light chain with the same common identifier), but leave any triplets or singlets unpaired. Unpaired sequences will be classed as Not Fully Annotated.
- Show all possible Heavy/Light combinations
  This will enumerate all the possible heavy/light pairings within the same common identifier. For example, if a kappa and lambda chain can be found within the same common identifier as a heavy chain, two pairings will be made:
  Heavy-lambda
  Heavy-kappa
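
The two behaviours can be sketched as follows, assuming the chains sharing one common identifier are given as `(chain_type, name)` tuples. The function name and data layout are hypothetical and only illustrate the pairing logic described above.

```python
from itertools import product


def enumerate_pairs(chains, show_all_combinations):
    """Pair heavy and light chains that share one common identifier.

    chains: list of (chain_type, name), with chain_type in
    {"heavy", "kappa", "lambda"}.
    """
    heavy = [name for kind, name in chains if kind == "heavy"]
    light = [name for kind, name in chains if kind in ("kappa", "lambda")]
    if len(heavy) == 1 and len(light) == 1:
        return [(heavy[0], light[0])]  # a doublet is paired under either option
    if show_all_combinations:
        return list(product(heavy, light))  # every heavy/light combination
    return []  # "Leave sequences unpaired"


triplet = [("heavy", "H1"), ("kappa", "K1"), ("lambda", "L1")]
print(enumerate_pairs(triplet, show_all_combinations=True))
print(enumerate_pairs(triplet, show_all_combinations=False))
```

With the triplet above, the first call yields the heavy-kappa and heavy-lambda pairings, while the second leaves all three sequences unpaired.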

Analysis Options

[Screenshot: Antibody Annotator Analysis Options]

Antibody numbering:

The default scheme is IMGT CDR definitions and numbering. We support the IMGT, Kabat, Chothia, Martin and AHo schemes. To learn more, see Numbering Schemes. You can also turn the numbering annotations on your sequences on or off here.
- Custom FR/CDR offsets are also available if you have your own preferred annotation scheme, see Advanced Options.
- If you have selected Long reads (PacBio/Nanopore) under the Advanced Options, this option will be disabled.

Annotate variants from reference database

To annotate and see the differences between your input sequences and reference sequences, select this option. The nucleotide and amino acid differences relative to your reference database(s) will be annotated on your input sequences. For more information on viewing these differences and what they mean, see this article.
- If you have selected Long reads (PacBio/Nanopore) under the Advanced Options, only the amino acid variants will be recorded.

Calculate protein statistics

This will calculate the Molecular Weight (kDa), the Isoelectric point, the charge at pH 7 and the Extinction Coefficient across the VDJ region, the VJ region, or both. If a full V(D)J region cannot be found, these values will not be calculated.
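
To give a feel for one of these statistics, the sketch below estimates the molar extinction coefficient at 280 nm from amino acid composition, using the widely used values of 5500 per tryptophan, 1490 per tyrosine and 125 per cystine. This is an assumption about a common approach; the article does not state which formula the Antibody Annotator uses, and `extinction_coefficient_280` is a hypothetical name.

```python
def extinction_coefficient_280(protein, cystines_formed=True):
    """Estimate the molar extinction coefficient at 280 nm (M^-1 cm^-1).

    Uses per-residue contributions of 5500 (Trp), 1490 (Tyr) and
    125 per disulfide-bonded cysteine pair (cystine).
    """
    seq = protein.upper()
    coefficient = 5500 * seq.count("W") + 1490 * seq.count("Y")
    if cystines_formed:
        # Assume every pair of cysteines forms one cystine.
        coefficient += 125 * (seq.count("C") // 2)
    return coefficient


print(extinction_coefficient_280("WYCC"))
print(extinction_coefficient_280("WYCC", cystines_formed=False))
```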

Find liabilities and assets:

Select this option to search for and score amino acid or nucleotide motifs. Liabilities are motifs associated with deleterious post-translational modifications or any other reduction in antibody function; assets are desirable motifs. The Antibody Annotator pipeline has a default set of sequence liability checks. These can also be customized, see How to Customize Sequence Liabilities and Assets.
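
Motif searching of this kind can be sketched with regular expressions. The three motifs below are commonly cited examples chosen for illustration; they are not the tool's default liability set, and the names `LIABILITY_MOTIFS` and `find_liabilities` are hypothetical.

```python
import re

# Illustrative motifs only; NOT the Antibody Annotator's default liability set.
LIABILITY_MOTIFS = {
    "N-linked glycosylation": r"N[^P][ST]",
    "Asn deamidation": r"NG",
    "Asp isomerisation": r"DG",
}


def find_liabilities(protein):
    """Return (name, 1-based start, matched text) for each motif hit."""
    hits = []
    for name, pattern in LIABILITY_MOTIFS.items():
        # The lookahead allows overlapping occurrences to be reported too.
        for match in re.finditer(f"(?=({pattern}))", protein):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


hits = find_liabilities("QVNGSDGK")
for hit in hits:
    print(hit)
```

A scoring step could then weight each hit, for example by motif type or by the region it falls in; that part is left out here because the scoring rules are described in the customization article.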

Additional features:

This option allows you to specify custom features (such as fusions, constant regions, or Signal Peptides) that will be annotated on each sequence. To learn more about annotation of additional features, please refer to Using Feature Databases to identify Constant Regions and Fusion Proteins.

Clustering Options

[Screenshot: Clustering Options, showing the default clusters and the blue "Plus" sign]

Clustering provides a way to specify clonotype parameters and to group your sequences based on genes or regions of interest. To learn more about clustering and how it can help with interpreting your dataset, see Understanding "Clusters".

Several default clusters will already be listed, and further clusters can be added using the blue "Plus" sign as seen above. Any clusters that are not applicable to a particular analysis, such as Light CDR3 in a Heavy chain dataset, will be skipped.

It is possible to cluster up to six regions or genes (FR1, CDR3, Heavy D gene etc.) together based on shared identity across sequences in the regions selected. It is also possible to allow mismatches across a region and to cluster based on amino acid similarity. To learn more about configuring this option, please refer to Clustering Options.
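
Clustering on exact shared identity across several regions or genes can be pictured as grouping on a composite key. The sketch below assumes each sequence is a dictionary of region or gene names to values; it covers only the exact-match case (no mismatches or similarity thresholds), and `cluster_by_regions` is a hypothetical name.

```python
from collections import defaultdict


def cluster_by_regions(sequences, regions):
    """Group sequence names that are identical across all selected regions."""
    clusters = defaultdict(list)
    for name, annotated in sequences.items():
        key = tuple(annotated.get(region) for region in regions)
        if None in key:
            continue  # a selected region is missing, so the sequence is skipped
        clusters[key].append(name)
    return dict(clusters)


sequences = {
    "s1": {"Heavy V gene": "IGHV1-2", "Heavy CDR3": "ARDY"},
    "s2": {"Heavy V gene": "IGHV1-2", "Heavy CDR3": "ARDY"},
    "s3": {"Heavy V gene": "IGHV1-2", "Heavy CDR3": "ARGY"},
    "s4": {"Heavy V gene": "IGHV1-2"},  # no CDR3 found
}
clusters = cluster_by_regions(sequences, ["Heavy V gene", "Heavy CDR3"])
print(clusters)
```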

Cluster Filters

Sequencing data can contain low quality sequences and noise. In order to improve the meaningfulness of clusters in your results, select one or more of the following options:

Only cluster results with asset and liability score of at least - This will cluster the sequences based on the score specified. For example, if you specify a score of -1000, only sequences that have a liability and asset score of -1000 or more will be included in the clusters.
Only cluster results which are - This will cluster only the sequences that are either: Fully annotated; Fully annotated and in frame; or Fully annotated, in frame, and without stop codons. For example, if you choose to cluster Fully annotated and in frame sequences, only sequences that are both fully annotated and in frame will be clustered. Sequences that are not fully annotated or that have frameshifts will not be included in the clustering operation.
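
The two filters combine as a simple predicate. The sketch below assumes each result is a dictionary with `score`, `fully_annotated`, `in_frame` and `has_stop_codon` fields; these field names and the `level` values are hypothetical and chosen only to mirror the options above.

```python
# Each level also requires everything in the levels listed before it.
LEVELS = ["fully_annotated", "in_frame", "no_stop_codons"]


def passes_cluster_filters(record, min_score=None, level=None):
    """Return True if a result would be included in clustering."""
    if min_score is not None and record["score"] < min_score:
        return False  # liability and asset score is below the threshold
    if level is None:
        return True
    checks = {
        "fully_annotated": record["fully_annotated"],
        "in_frame": record["in_frame"],
        "no_stop_codons": not record["has_stop_codon"],
    }
    required = LEVELS[:LEVELS.index(level) + 1]
    return all(checks[name] for name in required)


record = {"score": -500, "fully_annotated": True,
          "in_frame": True, "has_stop_codon": True}
print(passes_cluster_filters(record, min_score=-1000, level="in_frame"))
print(passes_cluster_filters(record, min_score=-1000, level="no_stop_codons"))
```

The example record passes a threshold of -1000 and the "in frame" level, but is excluded once stop codons are disallowed.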

Advanced Options

[Screenshot: Antibody Annotator Advanced Options]

Genetic code:

The Genetic code dropdown allows you to select the genetic code to use for translating nucleotide sequences. The codes are obtained directly from NCBI. One additional genetic code, "Amber readthrough", allows certain stop codons not to be treated as stop codons during translation.
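
The effect of a readthrough code can be illustrated with a small translation sketch based on the standard genetic code (NCBI translation table 1). Reading the amber codon TAG as glutamine (Q) is an assumption made for this example, as is the `translate` helper; the article does not say which codons or amino acids the "Amber readthrough" code uses.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons ordered TTT, TTC, TTA, ...
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(codon): aa
            for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}


def translate(dna, amber_as=None):
    """Translate DNA in frame 1; optionally read the amber codon (TAG) through."""
    table = dict(STANDARD)
    if amber_as is not None:
        table["TAG"] = amber_as  # assumption: amber is read as this amino acid
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(table.get(codon, "X") for codon in codons)  # X = unknown codon


print(translate("ATGTAGGGT"))
print(translate("ATGTAGGGT", amber_as="Q"))
```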

Record equal gene matches as:

Each gene with partial frequency - This will assign the query sequence to all matching genes with partial frequency. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will add 0.5 to the total count for both IGHD1-26 and IGHD2-15.
Group of genes - This will create a separate entry in the list of genes that represents this combination of genes. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will contribute 0 towards the total for each of IGHD2-15 and IGHD1-26, and instead add 1 to the total for a gene called "IGHD1-26/IGHD2-15".
Unknown - This will treat this as an unknown gene. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will add nothing to the totals of IGHD1-26 and IGHD2-15.
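
The three counting modes can be compared side by side with a short sketch. It assumes each query sequence comes with a list of its equally good gene matches; the mode names and the `"Unknown"` label are hypothetical stand-ins for the options above.

```python
from collections import Counter


def tally_gene_matches(matches_per_sequence, mode):
    """Count gene usage; mode is 'partial', 'group' or 'unknown'."""
    counts = Counter()
    for genes in matches_per_sequence:
        if not genes:
            continue  # no gene match at all
        if len(genes) == 1:
            counts[genes[0]] += 1  # unambiguous match
        elif mode == "partial":
            for gene in genes:
                counts[gene] += 1 / len(genes)  # split one count between ties
        elif mode == "group":
            counts["/".join(genes)] += 1  # separate entry for the combination
        elif mode == "unknown":
            counts["Unknown"] += 1  # assumption: ties are tallied as "Unknown"
        else:
            raise ValueError(f"unrecognised mode: {mode}")
    return dict(counts)


matches = [["IGHD1-26"], ["IGHD1-26", "IGHD2-15"]]
for mode in ("partial", "group", "unknown"):
    print(mode, tally_gene_matches(matches, mode))
```

With one unambiguous IGHD1-26 sequence and one tie, the partial mode gives IGHD1-26 a total of 1.5 and IGHD2-15 a total of 0.5, whereas the other two modes leave the individual gene totals unaffected by the tie.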

Long reads (PacBio/Nanopore)

This option will optimise the analysis for performance on long read sequences. Choosing this will turn off numbering and only record the amino acid variants if the Analysis option Annotate variants from reference database has been selected (rather than both aa and nt variants).

Pseudogenes and ORF genes

The human and mouse immunoglobulin databases may contain a number of pseudogenes and ORF genes along with fully functional genes. You can choose to Include pseudogenes from database and/or the ORF genes from database in the analysis as mutations in these genes might be corrected prior to expression. If these are not included, then each sequence will be classified according to the most closely related functional gene in the database.

Trim each side of fully annotated region if over:

This setting trims off extra bases on either side of the sequence region of interest. The default is 10, which means that your annotated sequence will keep at most 10 bases flanking the 5' and 3' ends of the region.
- The region that is kept is determined by your Sequence region of interest, specified in the Main Options.

Note that trimming only applies to fully annotated sequences. Sequences that are classified as not fully annotated by the Antibody Annotator operation are left untrimmed.
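
The trimming rule amounts to clamping the flanks around the region of interest. The sketch below assumes 0-based, end-exclusive coordinates for the region; `trim_flanks` and its parameters are hypothetical names for illustration.

```python
def trim_flanks(sequence, region_start, region_end,
                max_flank=10, fully_annotated=True):
    """Keep at most max_flank bases on each side of the region of interest.

    region_start and region_end are 0-based, end-exclusive coordinates.
    """
    if not fully_annotated:
        return sequence  # not fully annotated sequences are left untrimmed
    start = max(0, region_start - max_flank)
    end = min(len(sequence), region_end + max_flank)
    return sequence[start:end]


read = "A" * 15 + "CCCC" + "G" * 3  # 15-base 5' flank, 3-base 3' flank
trimmed = trim_flanks(read, region_start=15, region_end=19)
print(trimmed)
```

The 15-base 5' flank is cut back to 10 bases, while the 3-base 3' flank is already under the limit and is kept whole.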

Always annotate entire regions (except CDR3)

This option is on by default, and is recommended. If a CDR or FR annotation would be truncated due to mismatches, this option instead forces it to end at the boundary of the respective CDR/FR region, so that the region is complete rather than truncated. This allows you to see germline variants across the whole sequence, even in divergent or poor quality areas.

Heavy and Light adjustments

Please see our main article, Adjusting CDR definitions.

Saving different settings as Profiles

Geneious Biologics allows you to save Profiles which can be used to record and re-run alternative settings depending on the dataset. This means that you can specify custom sequence liabilities, custom clusters and other settings depending on what dataset you are working with.

Profiles can be saved and applied at the bottom of all our Annotation analysis pipelines:

[Screenshot: Apply or save a Profile]

What next?

To learn more about the tables produced, see these articles:

Like any other Biologics Annotator Result document, you can also:

Filter your Sequences
View the Graphs for Quality Assurance and Graphs to interpret Clusters and Clonotypes
Perform Sequence Alignments
View the "Clusters" in your dataset
Add new Clusters to your Results
Subset your sequences and re-calculate clusters
Add Assay Data to your Analysis Results
Compare Results across Multiple Experiments to monitor clonal expansion or to identify sequences present across multiple datasets.
Edit your Sequences
Repair low-quality sequences after Annotation