NGS Antibody Annotator

Updated: April 15, 2026 02:24

If you have datasets larger than 5 million sequences, we recommend using NGS Antibody Annotator. The NGS Antibody Annotator pipeline uses a different algorithm from Antibody Annotator: it collapses your dataset down to unique sequences while retaining their counts.

If you don't know what sorts of sequences can be run through NGS Antibody Annotator, see Workflows for NGS Antibody Analysis.

If you have Single Cell/10X data, please see Single Cell Analysis Workflows.

What is the NGS Antibody Annotator?

The NGS Antibody Annotator annotates and collapses your input data using the reference sequences in your chosen Reference Database. These can be germline genes, or a custom-made database of template sequence(s) - see Understanding Reference Databases. This pipeline is suitable for the analysis of NGS datasets, and is the default for datasets with more than 10 million sequences.

The NGS Antibody Annotator supports annotation of most IgG-like molecules, including scFvs and Fabs across a variety of different species. It may also be used to annotate similar molecules, like TCRs, as long as they have FR and CDR regions in a defined variable region. If you are unsure whether this includes your molecule or are wondering what other solutions we have available, please contact us directly.

In addition to identifying CDR and FR regions in raw antibody sequence data, the NGS Antibody Annotator also offers deeper analysis capabilities:

Identification of closest germline match(es) for V, D, and J genes as appropriate
Visualization of both silent and non-silent variation present compared to the germline match
Identification of sequence-based liabilities in the variable region, and a liability and sequence quality scoring system
Region-based clustering of sequences to identify broader trends in the sequence dataset. Please see this article to learn more: Understanding "clusters"
Custom clonotype identification via clustering: Clustering Options.
A rich visualization suite including a sequence explorer, alignment views, and a broad range of graphs to help you gain further insights from your data.

How do I run NGS Antibody Annotator?

To run the NGS Antibody Annotator, select the file(s) you want to analyze within a folder and go to Annotation > NGS Antibody Analysis in the dropdown:

Annotation > NGS AA.png

To start the NGS Antibody Annotator operation, adjust your options in the pop-up as desired and then click Run. This operation will output a Biologics Annotation Result file.

Settings

The following sections outline the steps to successfully carry out an analysis using the NGS Antibody Annotator and how each section and option works.

Main Options

Reference database(s):

The Reference Database(s) selected can be of antibody germline sequences or of template (variable region) sequences. Reference database(s) are used to help identify the correct FR and CDR regions in the new sequences being analyzed. See Understanding Reference Databases.
- Your Biologics account will include Human (Ig and TCR), Mouse (Ig and TCR), and Alpaca Ig germline databases. If you would like access to other germlines, please either contact us or see How to make a Custom Reference Database.
- Multiple reference databases can be selected, for example to compare hybridized datasets including Human and Mouse germline genes.

Selected sequences are:

Single Chain Options:
- Either Heavy or Light: The annotator will determine whether each sequence more closely matches the Heavy or the Light chains in your database.
- Single chain (light) identifies single light chains per read.
- Single chain (heavy) identifies single heavy chains per read.

Note: when selecting "Single chain" options, you can use paired reads that had no overlap and could not be merged. The annotator will attempt to use both reads in a pair to generate a single V(D)J region. If successful, there will be a linker section of aa ambiguities joining the two ends.

Two chain/scFv Options:
- Both chains in a single sequence expects to find a heavy and light chain per read/sequence.
- Both chains in a single sequence (opposite directions) also expects to find a heavy and light chain per read/sequence, but specifically for the case in which the reading frame for one chain is on the forward DNA strand, while the other chain is on the reverse DNA strand.
- Both chains in a single sequence with linker (scFv) expects to find a heavy and light chain per read/sequence, and will place an annotation spanning the linker between the two chains.

VHH-VHH options:
- Two heavy chains in a single sequence expects to find two heavy chains per read/sequence.
- Two heavy chains in a single sequence with linker (scFv) expects to find two heavy chains per read/sequence, and will place an annotation spanning the linker between the two chains.

Linker database:

Linker databases are optional databases for formats that contain a flexible linker between variable domains. They are only available when selecting the "Both chains in single sequence with linker (scFv)" and "Two heavy chains in single sequence with linker" options. The Linker database option allows you to select a pre-made linker database and set the maximum number of mismatches allowed in the linker.

Sequence region of interest:

This allows you to set the boundaries of your sequence. For example, if your forward sequencing primer was located in FR2 and your reverse sequencing primer was located in the Constant region, you might set this to CDR2 -> FR4.

Collapse Sequences at least:

This option collapses sequences whose identity to each other meets the specified percentage threshold (at 100%, only identical sequences are collapsed). The most frequently occurring sequence within each group is carried forward into the results, and the number of sequences collapsed into it is listed.
- If multiple datasets are submitted, sequences are only collapsed within each dataset.
- Note that the collapsed sequence includes both the "Sequence region of interest is between" option above and the "Retain upstream/downstream of fully annotated region" bp sequence, if selected below.
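The collapsing behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual algorithm: the function names `identity` and `collapse` are hypothetical, identity is assumed to be the fraction of matching positions between equal-length sequences, and groups are built greedily from the most abundant sequence down.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def collapse(sequences, min_identity=1.0):
    """Greedy collapse (assumed strategy): most frequent sequence represents each group."""
    counts = Counter(sequences)
    representatives = {}  # representative sequence -> number of sequences collapsed into it
    # Visit sequences from most to least frequent so each group keeps
    # its most abundant member as the representative.
    for seq, n in counts.most_common():
        for rep in representatives:
            if len(rep) == len(seq) and identity(rep, seq) >= min_identity:
                representatives[rep] += n
                break
        else:
            representatives[seq] = n
    return representatives


reads = ["ACGTACGTAC"] * 3 + ["ACGTACGTAA"] + ["TTTTTTTTTT"] * 2
exact = collapse(reads, min_identity=1.0)   # three unique sequences
relaxed = collapse(reads, min_identity=0.9)  # the single-mismatch read joins the top group
```

With a 90% threshold, the read that differs at one of ten positions is counted under the most abundant sequence instead of appearing as its own row.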

Retain upstream and downstream of fully annotated region:

These options retain up to the specified number of nucleotides (bp) upstream/downstream of the Sequence region of interest (explained above) when trimming the ends of contigs. Sequences will only be collapsed if they have the same length after the retained bp are included, in addition to meeting the identity threshold.
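To make the "up to" behaviour concrete, here is a small sketch of trimming a read to a region of interest while keeping flanking bases. The function name and the 0-based, end-exclusive coordinates are assumptions made for illustration only; they are not taken from the product.

```python
def trim_with_flanks(read: str, region_start: int, region_end: int,
                     upstream: int = 0, downstream: int = 0) -> str:
    """Keep the region of interest plus up to `upstream`/`downstream` flanking bases.

    Coordinates are 0-based and end-exclusive (an assumption for this sketch).
    Fewer flanking bases are kept when the read ends before the limit is reached.
    """
    start = max(0, region_start - upstream)
    end = min(len(read), region_end + downstream)
    return read[start:end]


read = "AAACCCGGGTTT"
region_only = trim_with_flanks(read, 3, 9)                      # just the region
with_flanks = trim_with_flanks(read, 3, 9, upstream=2, downstream=5)
```

Here only three bases exist downstream of the region, so three are retained even though five were requested. Two reads with different numbers of available flanking bases would therefore differ in length and would not be collapsed together.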

Analysis Options

Antibody numbering:

The default scheme is IMGT CDR definitions and numbering. We support the IMGT, Kabat, Chothia, Martin and AHo schemes. To learn more, see Numbering Schemes. You can also turn the numbering annotations on your sequences on or off here.
- Custom FR/CDR offsets are also available if you have your own preferred annotation scheme, see Advanced Options.
- If you have selected Long reads (PacBio/Nanopore) under the Advanced Options, this option will be disabled.

Calculate protein statistics

This will calculate the Molecular Weight (kDa), the Isoelectric point, the charge at pH 7 and the Extinction Coefficient across the VDJ region, the VJ region, or both. If a full V(D)J Region cannot be found, these values will not be calculated. See Protein Statistics for more info.

Annotate variants from reference database

To annotate and see the differences between your input sequences and reference sequences, select this option. The nucleotide and amino acid differences relative to your reference database(s) will be annotated on your input sequences. For more information on viewing these differences and what they mean, see this article.
- If you have selected Long reads (PacBio/Nanopore) under the Advanced Options, only the amino acid variants will be recorded.

Trim primers:

This option allows you to select a primer database from the dropdown for trimming any of the primers in the database from your sequences, if they are present.

Find liabilities and assets:

Select this option to search for and score amino acid or nucleotide motifs that are associated with deleterious post-translational modifications or other causes of reduced antibody function (liabilities), as well as desirable motifs (assets). Biologics has a default set of sequence liability checks, which include cleavage, deamidation, glycosylation, hydrolysis, isomerization and oxidation. To learn how to specify your own liabilities, see this article: How to customize antibody sequence liabilities and assets
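As a rough illustration of what motif-based liability searching involves, the sketch below scans a protein sequence with regular expressions. The motif set is an assumption based on commonly cited sequence liabilities; it is not the Biologics default list, and the scoring system is not modelled here.

```python
import re

# Illustrative motifs only (assumption): NOT the Biologics default definitions.
LIABILITY_MOTIFS = {
    "N-glycosylation": r"N[^P][ST]",  # N-X-S/T sequon, X not proline
    "deamidation": r"N[GS]",
    "isomerization": r"D[GS]",
    "oxidation": r"M",
}


def find_liabilities(protein: str):
    """Return (liability name, 1-based start position, matched motif) tuples."""
    hits = []
    for name, pattern in LIABILITY_MOTIFS.items():
        # A lookahead wrapper reports overlapping occurrences as well.
        for match in re.finditer(f"(?=({pattern}))", protein):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


hits = find_liabilities("QVNGTM")
```

In this example the asparagine at position 3 is reported twice, once as part of an N-glycosylation sequon (NGT) and once as a deamidation motif (NG), which shows why a single residue can carry more than one liability annotation.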

Additional features:

This option allows you to specify custom features (such as fusions, constant regions, or Signal Peptides) that will be annotated on each sequence. To learn more about annotation of additional features, please see Using Feature Databases to identify Constant Regions and Fusion Proteins.

Filtering Options


Discard Sequences shorter than:

This can be used to discard sequences that do not meet the required length.

Discard sequences with a chance of error over:

This will discard low-quality sequences that have a chance of error over X%. The default is 50%, and we recommend turning this on for large datasets (> 1 million sequences).
- When sequences contain quality scores, this chance of error is calculated by converting the quality score (Q) of each base call to an error probability using the formula 10^(-Q/10). For example, a base with a quality score of 30 has an error probability of 0.001. The expected errors value is the sum of the error probabilities for each base in the sequence.
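The quality-score conversion can be sketched as below. The function names are hypothetical, and the FASTQ quality string is assumed to use the common Phred+33 encoding. The `expected_errors` helper uses the standard statistical definition (the sum of the per-base error probabilities). How Biologics converts this into the percentage threshold is not shown here.

```python
def error_probability(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred33_scores(quality_string: str):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def expected_errors(scores) -> float:
    """Expected number of erroneous bases: the sum of per-base error probabilities."""
    return sum(error_probability(q) for q in scores)


scores = phred33_scores("I?5+")   # decodes to Q40, Q30, Q20, Q10
total = expected_errors(scores)   # 0.0001 + 0.001 + 0.01 + 0.1
```

A single Q10 base contributes far more to the total than many Q30 or Q40 bases, which is why a few poor base calls can dominate the estimate for a read.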

Clustering Options

clustering options NGS AA.png

Clustering provides a way to specify clonotype parameters and to group your sequences based on genes or regions of interest. To learn more about clustering and how it can help with interpreting your dataset, see Understanding "Clusters".

Several default clusters will already be listed, and further clusters can be added using the blue "Plus" sign as seen above. Any clusters that are not applicable for a particular analysis, such as Light CDR3 in a Heavy chain dataset, will be omitted.

It is possible to create custom clusters with up to six regions or genes (FR1, CDR3, Heavy D gene etc.), allow mismatches across a region, and/or cluster based on amino acid similarity. To learn more about configuring advanced clustering options, please refer to Clustering Options.

Cluster Filters

In general, most NGS datasets are relatively large and contain low-quality sequences and noise. To make the clusters in your results more meaningful, you can select the following option:

Only cluster results with asset and liability score of at least - Only sequences that meet the specified score will be clustered. For example, if you specify a score of -1000, only sequences that have a liability and asset score of -1000 or more will be included in the clusters.

Advanced Options

Document name scheme:

This option lets you optionally select a Name Scheme that can read the collapsed sequence names to extract information of interest. The Name Scheme field will be populated as a column in the results table.

Genetic code:

The Genetic code dropdown allows you to select the genetic code used to translate nucleotide sequences. The codes are obtained directly from NCBI. One additional genetic code, "Amber readthrough", allows the amber stop codon (TAG) not to be treated as a stop codon during translation.
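The effect of an amber readthrough code can be illustrated with a small translation sketch built on the NCBI standard code (translation table 1). The amino acid inserted at TAG is an assumption here: glutamine (Q) is used because that is what supE suppressor strains insert, but the article does not state which residue Biologics uses.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 1 (standard code), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumption: the amber codon (TAG) is read as glutamine.
AMBER_READTHROUGH_CODE = {**STANDARD_CODE, "TAG": "Q"}


def translate(dna: str, code=STANDARD_CODE) -> str:
    """Translate complete codons in frame 1; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3))
    return "".join(code.get(codon, "X") for codon in codons)


standard = translate("ATGTAGGGC")
readthrough = translate("ATGTAGGGC", AMBER_READTHROUGH_CODE)
```

Under the standard code the TAG codon ends the open reading frame, whereas with the readthrough table the same sequence translates as one continuous chain.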

Record equal gene matches as:

Each gene with partial frequency - This will assign the query sequence to all matching genes with partial frequency. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will add 0.5 to the total count for both IGHD1-26 and IGHD2-15.
Group of genes - This will create a separate entry in the list of genes that represents this combination of genes. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will contribute 0 towards the total for each of IGHD2-15 and IGHD1-26, and instead add 1 to the total for a gene called "IGHD1-26/IGHD2-15".
Unknown - This will treat this as an unknown gene. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will add nothing to the totals of IGHD1-26 and IGHD2-15.
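The three modes above can be compared with a short sketch. The function name and mode labels are hypothetical. Two further assumptions are made: ambiguous sequences in "Unknown" mode are tallied under an "Unknown" entry, and group names are built from the sorted gene names joined with "/".

```python
from collections import defaultdict


def tally_gene_matches(matches_per_sequence, mode: str) -> dict:
    """Count gene usage when some sequences match several genes equally well.

    mode is one of "partial", "group" or "unknown" (labels invented for this sketch).
    """
    totals = defaultdict(float)
    for genes in matches_per_sequence:
        if len(genes) == 1:
            totals[genes[0]] += 1
        elif mode == "partial":
            # Split one count evenly between all equally matching genes.
            for gene in genes:
                totals[gene] += 1 / len(genes)
        elif mode == "group":
            # One combined entry per combination of genes.
            totals["/".join(sorted(genes))] += 1
        elif mode == "unknown":
            # Assumption: ambiguous sequences are counted under "Unknown".
            totals["Unknown"] += 1
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return dict(totals)


matches = [["IGHD1-26", "IGHD2-15"], ["IGHD1-26"]]
partial = tally_gene_matches(matches, "partial")
grouped = tally_gene_matches(matches, "group")
unknown = tally_gene_matches(matches, "unknown")
```

In every mode the unambiguous second sequence adds 1 to IGHD1-26. Only the treatment of the ambiguous first sequence changes.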

Long reads (PacBio/Nanopore)

This option will optimise the analysis for performance on long read sequences. Choosing this will turn off numbering and only record the amino acid variants if the Analysis option Annotate variants from reference database has been selected (rather than both aa and nt variants).

Pseudogenes and ORF genes

The human and mouse immunoglobulin databases may contain a number of pseudogenes and ORF genes along with fully functional genes. You can choose to Include pseudogenes from database and/or the ORF genes from database in the analysis as mutations in these genes might be corrected prior to expression. If these are not included, then each sequence will be classified according to the most closely related functional gene in the database.

Keep unmerged reads

This specifies that paired reads which failed to merge should be used in the next step of the pipeline. In use cases where pairs are expected to overlap, discarding unmerged pairs is recommended in order to improve assembly accuracy.

De novo assembly required

Select this option to perform de novo assembly if the reads constitute fragments of the sequenced region of interest - these will be stitched together to form full sequences.
- This is only recommended if you have performed a barcoded analysis, as reads will be assembled into full sequences within the same barcode. Performing this on non-barcoded sequences will give undesired results, as read assembly will be performed on the entire dataset.

Heavy & Light FR adjustments:

This option is explained in more detail here: Adjusting CDR definitions

Saving different settings as Profiles

Geneious Biologics allows you to save Profiles which can be used to record and re-run alternative settings depending on the dataset. This means that you can specify custom sequence liabilities, custom clusters and other settings depending on what dataset you are working with.

Profiles can be saved and applied at the bottom of all our Annotation analysis pipelines:

apply or save profile 2.png

Viewing your results

The result will be output in the same folder as the sequence list(s), unless otherwise specified. The description of the file will give a brief summary, indicating how many unique chains were identified in the dataset - according to your collapse settings (default 100% sequence identity).

After opening the NGS Antibody Annotation result, the layout will look similar to an Antibody Annotator result, but each row of the All Sequences Table will represent multiple sequences that were "collapsed" together because they met the % identity threshold. It is therefore more similar to a Cluster Table.

To learn more about the cluster tables produced, see Exploring the Cluster Table Columns.

NGS AA result overview.png

In the above example output, the most abundant heavy chain in the dataset (Heavy-1) is selected. You can see the sequence in the Sequence Viewer below. % of Sequences is the percentage of the entire dataset that this sequence makes up, and # Sequences is the exact count, which was 497 in this example. See Exploring the tables produced by NGS Antibody Analysis for a description of all the columns produced.

Like any other Biologics Annotator Result document, you can also:

Filter your Sequences
View the Graphs for Quality Assurance and Graphs to interpret Clusters and Clonotypes
Perform Sequence Alignments
View the "Clusters" in your dataset
Add new Clusters to your Results
Subset your sequences and re-calculate clusters
Add Assay Data to your Analysis Results
Compare Results across Multiple Experiments to monitor clonal expansion or to identify sequences present across multiple datasets.
Edit your Sequences
Repair low-quality sequences after Annotation