Single Cell Antibody Annotator

Updated: October 02, 2025 21:42

The Single Cell Antibody Annotator operation is an alternate way to annotate and analyze the variable regions of standard IgG-like molecules. It combines the sequences analyzed into fewer representative sequences and can pull out the dominant chain(s) within your dataset.

The Single Cell Antibody Annotator operation is particularly suited to analysing barcoded sequences from both "single clones in wells" and "single cells in droplets" experiments. It is also suitable for analysing datasets from NGS technologies that incorporate UMIs. To learn more about Barcodes and UMIs, see this article: Understanding Single Cell technologies: Barcodes and UMIs. The video below provides an overview of these concepts:

For Barcode and/or UMI analysis, you will first need to run the Collapse UMI Duplicates and Separate Barcodes tool. For more discussion about possible workflows, see the Single Cell Analysis workflows article.

How do I run Single Cell Antibody Annotator?

Select one or more nucleotide sequences or lists in your folder and select Annotation > Single Cell Antibody Annotator:

[Screenshot: Annotation > Single Cell Antibody Annotator menu item]

To run the Single Cell Antibody Annotator operation, select options relevant to your sequence data in the dialog that appears and click Run.

Each selected input sequence list will be analysed independently. The pipeline works by:

- Trimming sequences and then applying filtering steps to reduce total data size (configurable)
- De novo assembling reads together (optional)
- Identifying and annotating Ig-like antibody regions (FR/CDR)
- Collapsing sequences with similar V(D)J regions into a single dominant sequence. Heavy and light chains are compared separately.

When run on barcoded sequences in combination with the Collapse UMI Duplicates and Separate Barcodes operation, it can identify heavy-light chains pairs and determine dominant chains within each cell/barcode. For an example of a workflow for analysing barcoded data, please refer to the UMI/Barcode and Single Cell Analysis tutorials.

Single Cell Antibody Annotator options

Main Options

Reference database:

The Reference Database(s) selected can be of antibody germline sequences or of template (variable region) sequences. Reference database(s) are used to help identify the correct FR and CDR regions in the new sequences being analyzed. See Understanding Reference Databases.
- Your Biologics account will include Human, Mouse, and Alpaca Ig germline databases. If you would like access to other germlines, please either contact us or see How to make a Custom Reference Database.
- Multiple reference databases can be selected, for example to compare hybridized datasets including Human and Mouse germline genes.

Sequence region of interest is between:

To define what a "fully annotated" sequence is, select the start and end regions from the dropdown menus. The default values, FR1 and FR4, mean that a sequence is considered fully annotated if it contains all of the regions FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4.
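The "fully annotated" rule can be pictured with a minimal Python sketch. This is an illustration only, not how Biologics implements the check; the function names and the idea of passing the annotated regions as a list are assumptions made for the example.

```python
# Canonical order of variable-domain regions, as listed in the text above.
REGION_ORDER = ["FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4"]


def required_regions(start="FR1", end="FR4"):
    """Return every region from start to end, inclusive (hypothetical helper)."""
    i, j = REGION_ORDER.index(start), REGION_ORDER.index(end)
    if i > j:
        raise ValueError("start region must not come after end region")
    return REGION_ORDER[i:j + 1]


def is_fully_annotated(annotated, start="FR1", end="FR4"):
    """True if every required region is among the annotated regions."""
    return set(required_regions(start, end)) <= set(annotated)
```

With the defaults, a sequence that is missing FR4 is not fully annotated, whereas narrowing the region of interest to CDR1 through CDR3 would accept it.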

Collapse regions at least:

This option collapses sequences that are "identical" to within a percentage identity threshold. The most frequently occurring sequence within each group is carried forward to the results, and the count of sequences collapsed is listed.
- Note that the collapsed sequence includes both the region set by the "Sequence region of interest is between" option above and any upstream/downstream bp retained by the "Retain upstream and downstream of fully annotated region" option below, if selected.
It is recommended to use an identity percentage low enough to capture sequencing errors but high enough to preserve true variation. 97% is a reasonable default.

Retain upstream and downstream of fully annotated region:

These options retain up to the specified number of nucleotides upstream/downstream of the Sequence region of interest (explained above) when trimming the ends of contigs. Sequences are only collapsed if they have the same length after including the retained bp, in addition to meeting the identity threshold.
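To make the collapsing rule concrete, here is a minimal Python sketch. It assumes a simple greedy strategy in which the most frequent sequence becomes the representative, and it measures identity position by position on equal-length sequences. The actual algorithm used by Biologics is not described in this article, so `percent_identity` and `collapse` are hypothetical names for illustration only.

```python
from collections import Counter


def percent_identity(a, b):
    """Percent of matching positions between two equal-length sequences."""
    if not a:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def collapse(sequences, threshold=97.0):
    """Greedily collapse sequences; return {representative: collapsed count}."""
    counts = Counter(sequences)
    groups = {}
    # Visit the most frequent sequences first so they become representatives.
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in groups:
            # Only same-length sequences that meet the threshold are collapsed.
            if len(rep) == len(seq) and percent_identity(rep, seq) >= threshold:
                groups[rep] += n
                break
        else:
            groups[seq] = n
    return groups
```

In this sketch a lower threshold absorbs more sequencing errors into the dominant sequence, while a higher threshold keeps near-identical variants as separate entries.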

Associate significant dominant heavy and light pair

This option is recommended for barcoded data or sequencing data from separate sequence lists that represent identical samples. If heavy and light chains are present under the same barcode/sequence list, a Chain Combinations table will be generated. This table will enumerate the possible heavy/light chain pairings found under the same barcode/sequence list.
- Note that this only pairs the 4 most prevalent heavy and light chains that meet the significance threshold (see the filtering section).
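The pairing described above can be sketched in Python as follows. This is a hedged illustration: it assumes the limit of 4 applies to heavy and light chains separately and that the inputs have already been filtered to significant chains. The function name and input format are hypothetical.

```python
from itertools import product


def chain_combinations(heavy, light, top_n=4):
    """Enumerate heavy/light pairings for one barcode or sequence list.

    heavy and light map a chain identifier to its read count and are
    assumed to contain only chains that met the significance thresholds.
    """
    def top(chains):
        ranked = sorted(chains.items(), key=lambda kv: (-kv[1], kv[0]))
        return [name for name, _ in ranked[:top_n]]

    return list(product(top(heavy), top(light)))
```

If either chain type is missing, the result is empty, which mirrors the fact that no Chain Combinations table is produced without at least one heavy and one light chain.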

Keep unmerged reads

This option specifies that paired reads which failed to merge should be used in the next step of the pipeline. In use cases where sequence reads are expected to overlap, discarding unmerged reads is recommended in order to improve assembly accuracy.

De novo assembly

Select this option to perform de novo assembly if the reads constitute fragments of the sequenced region of interest - these will be stitched together to form full sequences.
- This is only recommended if you have performed a barcoded analysis, as reads will then be assembled into full sequences within the same barcode. Performing this on non-barcoded sequences will produce undesirable results, because sequence assembly is performed on the entire dataset.

Analysis Options

[Screenshot: Analysis Options]

Antibody numbering:

The default scheme is IMGT CDR definitions and numbering. We support IMGT, Kabat, Chothia, Martin and AHo schemes. To learn more, see Numbering Schemes. You can also turn on and off annotating the numbers on your sequences here.

Annotate variants from reference database

To annotate and see the differences between your input sequences and reference sequences, select the Annotate germline differences option. When this option is selected, the nucleotide and amino acid differences are annotated on your input sequences.

Calculate protein statistics

This will calculate the molecular weight (kDa), the isoelectric point, the charge at pH 7 and the extinction coefficient across the VDJ region, the VJ region, or both. If a full V(D)J region cannot be found, these values will not be calculated.
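As an illustration of one of these statistics, the sketch below estimates the molar extinction coefficient at 280 nm from the amino acid sequence using the widely used residue contributions (5500 per Trp, 1490 per Tyr, 125 per cystine). This article does not state which formula Biologics uses, so treat this as an assumption; the function name is hypothetical.

```python
def extinction_coefficient(protein, cystines=True):
    """Estimate the molar extinction coefficient at 280 nm (M^-1 cm^-1)."""
    seq = protein.upper()
    value = 5500 * seq.count("W") + 1490 * seq.count("Y")
    if cystines:
        # Assumption: every pair of cysteines forms one disulfide bond.
        value += 125 * (seq.count("C") // 2)
    return value
```

Passing `cystines=False` gives the estimate for fully reduced cysteines.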

Find liabilities and assets:

Select this option to search for and score amino acid or nucleotide motifs: either liabilities (motifs associated with deleterious post-translational modifications or any type of reduced antibody function) or assets (desirable motifs). Biologics has a default set of sequence liability checks, which include cleavage, deamidation, glycosylation, hydrolysis, isomerization and oxidation. To learn how to specify your own liabilities, see this article: How to customize antibody sequence liabilities and assets
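A motif search of this kind can be sketched with regular expressions. The motifs below are common textbook examples chosen for illustration; they are not the Biologics default liability set, and no scoring is attempted. `LIABILITY_MOTIFS` and `find_liabilities` are hypothetical names.

```python
import re

# Illustrative motifs only; NOT the Biologics default liability set.
LIABILITY_MOTIFS = {
    "N-linked glycosylation": r"N[^P][ST]",
    "deamidation": r"NG",
    "isomerization": r"DG",
    "oxidation": r"M",
}


def find_liabilities(protein):
    """Return sorted (start, end, liability name, matched text), 0-based."""
    hits = []
    for name, pattern in LIABILITY_MOTIFS.items():
        # The lookahead lets overlapping occurrences of a motif be reported.
        for m in re.finditer(f"(?=({pattern}))", protein):
            hits.append((m.start(1), m.end(1), name, m.group(1)))
    return sorted(hits)
```

For example, the sequence `QNGSADGM` yields a deamidation hit and a glycosylation hit that both start at the same asparagine, followed by one isomerization and one oxidation hit.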

Additional features:

This option allows you to specify custom features (such as fusions, constant regions, or Signal Peptides) that will be annotated on each sequence. To learn more about annotation of additional features, please see Using Feature Databases to identify Constant Regions and Fusion Proteins.

Filtering Options

[Screenshot: Filtering Options]

Only keep sequences longer than:

All sequences that are shorter than the threshold defined will be discarded. This is useful to remove likely low quality sequences. You may wish to set this parameter lower if your sequences have adapters, UMIs and/or barcodes which have been trimmed in the Collapse UMI Duplicates and Separate Barcodes operation.

Only use longest:

This option lets you specify the number of reads that will be used from each list or barcode (after sorting by length), with any additional reads discarded. This helps improve performance on large datasets where excessive data significantly slows down analysis. 500 reads is generally sufficient.
- Generally this is not recommended for de novo assembly
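The two read filters above can be combined in a short Python sketch. The default `min_length` of 100 is a placeholder chosen for the example, not a documented Biologics default, and the function name is hypothetical.

```python
def filter_reads(reads, min_length=100, max_reads=500):
    """Keep reads longer than min_length, then keep the max_reads longest."""
    kept = [r for r in reads if len(r) > min_length]
    # list.sort() is stable, so equal-length reads keep their input order.
    kept.sort(key=len, reverse=True)
    return kept[:max_reads]
```

The function would be applied once per sequence list or barcode, since the limit on the number of reads is counted separately for each.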

Only keep regions with at least:

This input field lets you discard regions with too few reads relative to the most prevalent chain within the barcode/dataset. The threshold is a percentage of that chain's read count: regions whose read count is equal to or less than this percentage are permanently discarded because they do not have enough supporting data for further analysis.

Significant regions

Significant region thresholds allow you to filter out infrequent regions to retain statistically significant regions for downstream analysis. Regions deemed insignificant are retained but annotated as not significant.

The Significant regions have at least (percentage read count of the cell/barcode) input field lets you flag sequences with low numbers of reads relative to the total reads in the cell, specified as a percentage.
The Significant regions have at least (reads) input field lets you flag sequences with low numbers of reads as not significant. This is useful for filtering out sequences that were only defined due to reads of very low frequency or due to sequencing errors.
The Significant regions have at least (percentage read count of the dominant same chain region) input field lets you flag sequences with few reads relative to the dominant same chain region (the one with the most reads). The threshold is a percentage of that region's read count: sequences whose read count is equal to or less than this percentage are flagged as not significant. This setting allows you to filter out regions that do not have enough supporting data for further analysis.
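The three significance thresholds can be pictured together in the following Python sketch. It assumes a region is significant only if it passes all three checks, and that a value exactly at a threshold counts as meeting it (following the "at least" wording of the option names). The default values, field names and function name are hypothetical.

```python
def flag_significant(regions, min_cell_pct=1.0, min_reads=2, min_dominant_pct=10.0):
    """Flag each region in one cell/barcode as significant or not.

    regions is a list of dicts with 'id', 'chain' and 'reads' keys.
    Returns {id: True/False}. Nothing is discarded, only flagged.
    """
    total = sum(r["reads"] for r in regions)
    # Read count of the dominant region for each chain type.
    dominant = {}
    for r in regions:
        dominant[r["chain"]] = max(dominant.get(r["chain"], 0), r["reads"])
    flags = {}
    for r in regions:
        flags[r["id"]] = (
            r["reads"] >= min_reads
            and 100.0 * r["reads"] >= min_cell_pct * total
            and 100.0 * r["reads"] >= min_dominant_pct * dominant[r["chain"]]
        )
    return flags
```

In a cell with a heavy chain of 80 reads and a second heavy chain of 4 reads, the second chain falls below 10% of the dominant heavy chain and would be flagged as not significant.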

Note: The chain combinations table will only be produced if at least one heavy and one light chain reaches the significance thresholds above, and can be found in the same barcode/well.

Clustering Options

[Screenshot: Clustering Options]

Clustering provides a way to specify clonotype parameters and to group your sequences based on genes or regions of interest. To learn more about clustering and how it can help with interpreting your dataset, see Understanding "Clusters".

Several default clusters will already be listed, and further clusters can be added using the blue "Plus" sign as seen above. Any clusters that are not applicable for a particular analysis, such as Light CDR3 in a Heavy chain dataset, will be omitted.

Note: at present, combination clusters across paired Heavy-Light chains (paired under the same barcode) like Heavy-Light CDR3 can only be generated when running the initial annotation. If you are interested in specific clusters that combine regions from the heavy and light chains, please make sure to include these in your annotation run.

It is possible to create custom clusters with up to six regions or genes (FR1, CDR3, Heavy D gene etc.), allow mismatches across a region, and/or cluster based on amino acid similarity. To learn more about configuring advanced clustering options, please refer to Clustering Options.
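Conceptually, a cluster groups sequences that share the same values for the chosen regions or genes. The Python sketch below shows exact-match grouping only; it does not implement mismatch tolerance or amino acid similarity, and the record format and function name are assumptions made for the example.

```python
from collections import defaultdict


def cluster_by_regions(records, regions=("Heavy CDR3",)):
    """Group record names by the exact values of the chosen regions/genes."""
    if not 1 <= len(regions) <= 6:
        raise ValueError("a cluster uses between one and six regions or genes")
    clusters = defaultdict(list)
    for rec in records:
        # Records lacking any chosen region are left out of this grouping.
        if all(rec.get(r) for r in regions):
            clusters[tuple(rec[r] for r in regions)].append(rec["name"])
    return dict(clusters)
```

Adding a second region, such as the Heavy V gene, splits a CDR3 cluster into finer groups, which is the effect of defining a combination cluster.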

Cluster Filters

In general, most NGS datasets are relatively large and contain low quality sequences and noise. In order to improve the meaningfulness of clusters in your results, select the following options:

Only cluster results with asset and liability score of at least - This option can help to speed up the run-time for large datasets. Sequences will be clustered based on whether they meet the score specified. For example, if you specify a score of -1000, only sequences that have a liability and asset score of -1000 or more will be included in the clusters.

Advanced Options

[Screenshot: Advanced Options]

Document name scheme:

This dropdown lets you optionally select a Name Scheme that can read the collapsed single clone names to extract information of interest. Single Cell Antibody Annotator will use this information to pair Heavy/Light chains (if the Name Scheme has chain values) and output all Name Scheme fields as columns in the results table. For more information about Name Schemes, see Using Name Schemes.

Genetic code:

The Genetic code dropdown allows you to select the genetic code to use for translating nucleotide sequences. The codes are obtained directly from NCBI. One additional genetic code, "Amber readthrough", translates through the amber stop codon (TAG) instead of treating it as a stop codon.
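The effect of an amber readthrough code can be sketched in Python using the standard genetic code (NCBI translation table 1). This article does not say which residue Biologics assigns to the amber codon, so the default of glutamine (Q) below is an assumption (it is what common supE suppressor strains insert), and the function name is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons ordered TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, amber_readthrough=False, amber_residue="Q"):
    """Translate DNA in frame 1; optionally read through the amber codon TAG."""
    table = dict(CODON_TABLE)
    if amber_readthrough:
        # Assumption: the amber codon is translated as amber_residue.
        table["TAG"] = amber_residue
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    # Codons containing ambiguous bases are reported as X.
    return "".join(table.get(codon, "X") for codon in codons)
```

Only TAG is affected; the other stop codons (TAA and TGA) still terminate translation in this sketch.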

Record equal gene matches as:

Each gene with partial frequency - This will assign the query sequence to all matching genes with partial frequency. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will add 0.5 to the total count for both IGHD1-26 and IGHD2-15.
Group of genes - This will create a separate entry in the list of genes that represents this combination of genes. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will contribute 0 towards the total for each of IGHD2-15 and IGHD1-26, and instead add 1 to the total for a gene called "IGHD1-26/IGHD2-15".
Unknown - This will treat this as an unknown gene. Based on the example of a query sequence matching two genes: IGHD1-26 and IGHD2-15, the query sequence will add nothing to the totals of IGHD1-26 and IGHD2-15.
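The three modes can be compared side by side with a small Python sketch. It follows the worked example above; counting ambiguous queries under an "Unknown" label in the third mode is an assumption, and the function and mode names are hypothetical.

```python
from collections import Counter


def count_gene_matches(matches_per_query, mode="partial"):
    """Tally gene usage; each query contributes a list of equally good genes."""
    if mode not in ("partial", "group", "unknown"):
        raise ValueError(f"unknown mode: {mode}")
    totals = Counter()
    for genes in matches_per_query:
        if not genes:
            continue
        if len(genes) == 1:
            totals[genes[0]] += 1
        elif mode == "partial":
            # Each tied gene receives an equal share of one count.
            for gene in genes:
                totals[gene] += 1 / len(genes)
        elif mode == "group":
            # The combination is counted as its own entry.
            totals["/".join(genes)] += 1
        else:
            # Assumption: ambiguous queries are tallied under "Unknown".
            totals["Unknown"] += 1
    return dict(totals)
```

For one query matching only IGHD1-26 and another matching both IGHD1-26 and IGHD2-15, the partial mode gives IGHD1-26 a total of 1.5 and IGHD2-15 a total of 0.5, while the group mode leaves IGHD1-26 at 1 and adds 1 to "IGHD1-26/IGHD2-15".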

Pseudogenes and ORF genes

The human and mouse immunoglobulin databases may contain a number of pseudogenes and ORF genes along with fully functional genes. You can choose to Include pseudogenes from database and/or the ORF genes from database in the analysis as mutations in these genes might be corrected prior to expression. If these are not included, then each sequence will be classified according to the most closely related functional gene in the database.

Trim primers:

The Trim primers option allows you to select a primer database from the dropdown for trimming any of the primers in the database from your sequences, if they are present.
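As a rough picture of what primer trimming does, the Python sketch below removes a primer only when it matches the 5' end of a sequence exactly. Real primer trimming typically tolerates mismatches and may also handle the 3' end; how Biologics does this is not described here, so this is a simplified assumption with a hypothetical function name.

```python
def trim_primers(seq, primers):
    """Remove the longest primer that exactly matches the 5' end of seq."""
    for primer in sorted(primers, key=len, reverse=True):
        if primer and seq.upper().startswith(primer.upper()):
            return seq[len(primer):]
    # No primer from the database is present, so the sequence is unchanged.
    return seq
```

Sequences that do not start with any primer in the database are returned as they are, matching the "if they are present" behaviour described above.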

Always annotate entire regions (except CDR3)

If a CDR or FR annotation would be truncated due to mismatches, this option instead forces it to end at the boundary of the respective CDR/FR region. That region will be complete and not truncated.

Heavy & Light FR adjustments:

This option is explained in more detail here: Adjusting CDR definitions

Saving different settings as Profiles

Geneious Biologics allows you to save Profiles which can be used to record and re-run alternative settings depending on the dataset. This means that you can specify custom sequence liabilities, custom clusters and other settings depending on what dataset you are working with.

Profiles can be saved and applied at the bottom of all our Annotation analysis pipelines:

[Screenshot: Apply or save profile]

What next?

To learn more about the tables produced, see these articles:

Like any other Biologics Annotator Result document, you can also:

Filter your Sequences
View the Graphs for Quality Assurance and Graphs to interpret Clusters and Clonotypes
Perform Sequence Alignments
View the "Clusters" in your dataset
Add new Clusters to your Results
Subset your sequences and re-calculate clusters
Add Assay Data to your Analysis Results
Compare Results across Multiple Experiments to monitor clonal expansion or to identify sequences present across multiple datasets.
Edit your Sequences
Repair low-quality sequences after Annotation