Peptide Annotator

April 30, 2025 22:51
Updated

Our newest annotation analysis pipeline allows you to annotate, analyze, and cluster non-antibody-like sequences. The sequences can be nucleotide or protein sequences of any kind. This article outlines the main options available in the Peptide Annotator.

What is the Peptide Annotator?

The Peptide Annotator annotates and clusters input query sequences. You can choose to annotate the query sequences without a reference, or you can create and use a known template sequence as your reference - see Understanding Reference Databases.

The Peptide Annotator accepts nucleotide and protein* sequences of any nature - it is not dependent on a specific molecule or protein type. Some examples of sequences may include:

Panning rounds of peptides 5-40+ amino acids long
Domains/regions of proteins
Lists of HCDR3 sequences

* Note: Protein input sequences are only supported for non-NGS analysis

In addition to Clustering your input sequences, the Peptide Annotator also offers deeper analysis capabilities:

Visualization and recording of the variants present relative to the reference database (if a reference database is used)
Identification of sequence-based liabilities and a liability and sequence quality scoring system
- See how to customize sequence liabilities and assets
Custom clustering options
A rich visualization suite including a sequence explorer, alignment views, and a broad range of graphs to help you gain further insights from your dataset

To view a tutorial demonstrating the use of the Peptide Annotator, see Peptide Tutorial 1. Phage Display Libraries.

How do I run Peptide Annotator?

To run the Peptide Annotator, select a file in your folder and go to Annotation > Peptide Annotator from the dropdown:

annotation > peptide anno.png

To start the Peptide Annotator operation, adjust your parameters in the pop-up as desired and then click Run. This operation will output a Biologics Annotator Result file.

Settings

The following sections outline the steps to successfully carry out an analysis using the Peptide Annotator and how each section and option works.

Main Options

peptide anno main options.png

Reference database(s):

The Peptide Annotator supports using no reference database, or using a General Template Reference Database. See General Template Databases to make your own - this could be a full protein, a protein domain or a peptide sequence.
- Multiple reference databases can be selected.

Name Scheme:

This option will be filled in automatically if you used Batch Assemble Sanger Sequences with a Name Scheme. Name schemes are highly customizable. They allow you to use information contained within the sequence names (e.g. well or donor/sample) to classify sequences and groups of sequences, or to pull information from the sequence names into columns in your result.
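To illustrate the idea behind a name scheme, here is a minimal Python sketch that pulls fields out of a sequence name. The naming format (donor_well_round), the pattern, and the function name `parse_name` are hypothetical examples; in Geneious Biologics, name schemes are configured in the application rather than written as code.

```python
import re

# Hypothetical naming convention: <donor>_<well>_<round>, e.g. "Donor1_A01_R2"
NAME_SCHEME = re.compile(r"^(?P<donor>[^_]+)_(?P<well>[A-H]\d{2})_(?P<round>R\d+)$")

def parse_name(name):
    """Return the fields encoded in a sequence name, or None if it does not match."""
    match = NAME_SCHEME.match(name)
    return match.groupdict() if match else None

fields = parse_name("Donor1_A01_R2")
```

Each extracted field corresponds to what would appear as a column in the result, and names that do not follow the scheme yield no fields.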

Handle Input Sequences:

NOTE: Your selection here will influence the following Collapsing and Filtering section.

Keep individual sequences (e.g. Sanger)
- All submitted sequences will be retained in the result document. This option will not be available if you have selected more than 10,000 input reads for annotation.
Collapse duplicates (e.g. NGS)
- This will collapse sequences that meet an identity threshold at the nucleotide level and retain a count of how many sequences were collapsed. This will be the default option when more than 10,000 reads were selected for annotation.
Collapse duplicates with Barcodes (e.g. Single Cell)
- This will collapse sequences that meet an identity threshold at the nucleotide level and retain a count of how many sequences were collapsed. It enables extra options that are relevant for sequencing campaigns that require full sequences to be assembled from multiple reads i.e. de novo assembly. See Understanding Single Cell technologies: Barcodes and UMIs and Collapse UMI Duplicates and Separate Barcodes.

Collapsing and Filtering

The options available in this section will depend on your analysis mode (Handle Input Sequences) above.

Keep individual sequences/Sanger

None available

Collapse duplicates/NGS

Screenshot 2025-05-01 at 10.49.28 AM.png

Collapse Sequences at least: X % identical

Sequences that meet the identity threshold will be collapsed together and a count of how many sequences were collapsed is retained. The default is 100% identical.
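As a rough illustration of collapsing at the default 100% identity threshold, the Python sketch below groups identical nucleotide sequences and keeps a count for each. It is not the tool's implementation: the function name `collapse_identical` is hypothetical, and thresholds below 100% would need similarity-based grouping, which is not shown.

```python
from collections import Counter

def collapse_identical(reads):
    """Collapse exact duplicates (ignoring case) and keep a count for each."""
    counts = Counter(read.upper() for read in reads)
    # Return (sequence, count) pairs, most abundant first
    return counts.most_common()

reads = ["ACGTAC", "acgtac", "ACGTTT", "ACGTAC"]
collapsed = collapse_identical(reads)
```

Here four input reads collapse to two unique sequences, each with the number of reads it represents.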

Only keep reads longer than:

This can be used to discard sequences that do not meet the required length.

Discard sequences with a chance of error over:

This will discard low-quality sequences whose chance of error is over X%. The default is 50%, and we recommend turning this on for large datasets (more than 1 million sequences).
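This article does not describe how the chance of error is calculated. One common approach, assumed here purely for illustration, is to derive it from per-base Phred quality scores as the probability that a read contains at least one miscalled base. The function names below are hypothetical.

```python
def chance_of_error(phred_scores):
    """Probability that a read contains at least one miscalled base."""
    p_all_correct = 1.0
    for q in phred_scores:
        # A Phred score Q corresponds to an error probability of 10^(-Q/10)
        p_all_correct *= 1.0 - 10 ** (-q / 10)
    return 1.0 - p_all_correct

def passes_filter(phred_scores, max_error_percent=50.0):
    """Keep a read unless its chance of error is over the threshold."""
    return chance_of_error(phred_scores) * 100 <= max_error_percent

good_read = [30] * 100   # 100 bases at Q30
poor_read = [3] * 10     # 10 bases at Q3
```

Under this assumption a read of 100 bases at Q30 has roughly a 10% chance of error and is kept, while a short read at Q3 is discarded.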

Retain upstream and downstream of sequenced region:

These options are only available if a reference database was selected. The two options enable you to retain a portion of the sequence on either side of the targeted domain/protein/peptide. This additional sequence will be included when the collapsing step is undertaken.

Collapse duplicates with Barcodes/Single Cell

peptide single cell filtering.png

Collapse Sequences at least: X % identical
- Sequences that meet the identity threshold will be collapsed together and a count of how many sequences were collapsed is retained. The default is 100% identical.
Retain upstream and downstream of sequenced region: X bp
- These options are only available if a reference database was selected. The two options enable you to retain up to the specified length of bp on either side of the targeted reference. This additional sequence will be included when the collapsing step is performed.
Keep unmerged reads
- This option specifies that paired reads which failed to merge should be used in the next step of the analysis. In use cases where sequence reads are expected to overlap, discarding unmerged reads is recommended in order to improve assembly accuracy.
De novo assembly required
- Select this option to perform de novo assembly if the reads constitute fragments of a larger sequenced region of interest - these will be assembled together to form full sequences.
  - This is only recommended if you have performed a barcoded analysis, as reads will be assembled into full sequences within the same barcode. Performing this on non-barcoded sequences will produce undesired results, as sequence assembly is then performed on the entire dataset.
Only keep reads longer than: X bp
- All sequences that are shorter than the threshold defined will be discarded. This is useful to remove likely low quality sequences. You may wish to set this parameter lower if your sequences have adapters, UMIs and/or barcodes which have been trimmed in the Collapse UMI Duplicates and Separate Barcodes operation.
Only use longest: X reads from each list/barcode
- This option lets you specify the number of reads that will be used from each sequence list or barcode (after sorting by length), with any additional reads discarded. This helps improve performance on large datasets where excessive data significantly slows down analysis. 500 reads is generally sufficient.
  - Generally this is not recommended for de novo assembly
Significant sequences have at least: X% of the read count of the cell/barcode
- This input field lets you flag sequences with low numbers of reads relative to the total reads in the cell (if barcodes were used), or relative to the input sequence list, specified as a percentage.
Significant sequences have at least: X reads
- This input field lets you flag sequences with low numbers of reads as not significant. This is useful for filtering out sequences that were only defined due to reads of very low frequency or due to sequencing errors.
Only keep sequences with at least: X% of the read count of the dominant sequence
- This input field lets you discard sequences whose read count falls below the specified percentage of the read count of the most prevalent sequence within the barcode/dataset. This setting allows you to permanently discard sequences that do not have enough supporting data for further analysis.
  - We recommend not setting this higher than 5%
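The three read-count thresholds above can be sketched in Python as follows. This is an illustration only: the function name, the default values, and the choice to require both significance thresholds at once are assumptions, not the documented behavior of the tool.

```python
def classify_sequences(read_counts, min_percent_of_total=1.0, min_reads=2,
                       min_percent_of_dominant=5.0):
    """Discard weakly supported sequences and flag the rest as significant or not.

    read_counts maps each sequence in one barcode (or list) to its read count.
    """
    if not read_counts:
        return {}
    total = sum(read_counts.values())
    dominant = max(read_counts.values())
    results = {}
    for seq, count in read_counts.items():
        # Permanently discard sequences below X% of the dominant sequence
        if count < dominant * min_percent_of_dominant / 100:
            continue
        # Flag (but keep) sequences that miss either significance threshold
        significant = (count >= total * min_percent_of_total / 100
                       and count >= min_reads)
        results[seq] = {"reads": count, "significant": significant}
    return results

barcode_counts = {"seqA": 900, "seqB": 80, "seqC": 19, "seqD": 1}
kept = classify_sequences(barcode_counts, min_percent_of_total=5.0,
                          min_reads=10, min_percent_of_dominant=2.0)
```

In this example seqD is discarded because it has fewer than 2% of the reads of seqA, while seqC is retained but flagged as not significant because it has under 5% of the barcode's total reads.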

Analysis Options

peptide anno analysis options.png

Annotate variants from reference database

To annotate and see the differences between your input sequences and reference sequences, select this option. The nucleotide and amino acid differences relative to your reference database(s) will be annotated on your input sequences and recorded in the All Sequences Table. For more information about viewing these differences and what they mean, see this article.

Calculate protein statistics

This will calculate the Molecular Weight (kDa), the Isoelectric point, the charge at pH 7 and the Extinction Coefficient across the Full Sequence (no reference database) or the Template Region (when using a reference database).
- If a full Template Region cannot be found, these values will not be calculated.
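For a sense of what two of these statistics represent, the Python sketch below estimates molecular weight from average residue masses and the extinction coefficient at 280 nm from the Trp, Tyr, and cystine content. The exact methods used by the Peptide Annotator are not described in this article, so treat the function names and the assumption that all cysteines are paired as illustrative only.

```python
# Average residue masses in Da (amino acid minus one water)
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528

def molecular_weight_kda(protein):
    """Sum of residue masses plus one water for the free termini, in kDa."""
    return (sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER_MASS) / 1000

def extinction_coefficient(protein):
    """Estimated molar extinction coefficient at 280 nm (M^-1 cm^-1).

    Assumes every pair of cysteines forms a cystine (disulfide bond).
    """
    protein = protein.upper()
    cystines = protein.count("C") // 2
    return 5500 * protein.count("W") + 1490 * protein.count("Y") + 125 * cystines

mw = molecular_weight_kda("G")
ext = extinction_coefficient("WYCC")
```

A sequence containing characters outside the 20 standard amino acids raises a `KeyError` in this sketch, which loosely mirrors the note above that statistics are only calculated when a full region is available.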

Find liabilities and assets:

To search for and score amino acid or nucleotide motifs, select this option. Liabilities are motifs associated with deleterious post-translational modifications or any type of reduced antibody function, and assets are desirable motifs. The Peptide Annotator pipeline has a default set of sequence liability checks. These can also be customized; see how to Customize Sequence Liabilities and Assets.
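The general idea of a motif-based liability search can be sketched in Python with regular expressions. The motifs and scores below are illustrative examples of commonly cited liabilities, not the Peptide Annotator's default set, and the function name is hypothetical.

```python
import re

# Illustrative motifs and scores only (not the tool's defaults)
LIABILITY_MOTIFS = {
    "N-linked glycosylation": (r"N[^P][ST]", -10),
    "Deamidation": (r"NG", -5),
    "Isomerization": (r"DG", -5),
}

def find_liabilities(protein):
    """Return (name, 1-based position, matched motif, score) hits and the total score."""
    hits = []
    for name, (pattern, score) in LIABILITY_MOTIFS.items():
        # The lookahead allows overlapping occurrences of a motif to be reported
        for match in re.finditer(f"(?=({pattern}))", protein.upper()):
            hits.append((name, match.start() + 1, match.group(1), score))
    total_score = sum(hit[3] for hit in hits)
    return hits, total_score

hits, total_score = find_liabilities("AANGSAA")
```

In this example the same position triggers two checks (a glycosylation site and a deamidation site), so both contribute to the total score.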

Clustering Options

peptide anno clustering options.png

Clustering provides a way to group your sequences based on shared identity/similarity across your sequences. To learn more about clustering and how it can help with interpreting your dataset, see Understanding "Clusters".

Several default clusters will already be listed, and further clusters can be added using the blue "Plus" sign as seen above.

It is possible to cluster up to six regions together based on shared identity across sequences in the regions selected. It is also possible to allow mismatches across a region and to cluster based on amino acid similarity. To learn more about configuring this option, please refer to Clustering Options.
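As a conceptual sketch of clustering on a region with or without mismatches, the Python example below uses a simple greedy approach in which each sequence joins the first cluster whose representative is within the allowed number of mismatches. The algorithm used by Geneious Biologics is not described in this article, so both the approach and the function names are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cluster_region(regions, max_mismatches=0):
    """Greedy clustering of region sequences against cluster representatives."""
    clusters = []  # list of (representative, members)
    for region in regions:
        for representative, members in clusters:
            same_length = len(representative) == len(region)
            if same_length and hamming(representative, region) <= max_mismatches:
                members.append(region)
                break
        else:
            # No existing cluster was close enough, so start a new one
            clusters.append((region, [region]))
    return clusters

peptides = ["ACDEFG", "ACDEFG", "ACDEYG", "WWWWWW"]
exact = cluster_region(peptides)
relaxed = cluster_region(peptides, max_mismatches=1)
```

With exact matching the peptides fall into three clusters; allowing one mismatch merges ACDEYG into the ACDEFG cluster.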

Cluster Filters

Sequencing data can contain low quality sequences and noise. In order to improve the meaningfulness of clusters in your results, select one or more of the following options:

Only cluster results with asset and liability score of at least - This will only cluster sequences that meet the score specified. For example, if you specify a score of -1000, only sequences that have a liability and asset score of -1000 or more will be included in the clusters.
Only cluster results which are - This will only cluster sequences that are either: Fully annotated; Fully annotated and In Frame; or Fully annotated, In Frame, and Without Stop Codons. For example, if you choose to cluster Fully annotated and In Frame sequences, only sequences that are both fully annotated and in frame will be clustered. Sequences that are not fully annotated or that have frameshifts will not be included in the clustering operation.
- Note: this option is only available for a Sanger-type analysis.
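The two filters amount to a simple eligibility check before clustering, sketched below in Python. The `Result` record, its field names, and the level names are hypothetical stand-ins for the corresponding columns and options in the result document.

```python
from dataclasses import dataclass

@dataclass
class Result:
    name: str
    score: float
    fully_annotated: bool
    in_frame: bool
    has_stop_codon: bool

def eligible_for_clustering(result, min_score=-1000, level="in_frame"):
    """level is one of: 'fully_annotated', 'in_frame', 'no_stop_codons'."""
    if result.score < min_score or not result.fully_annotated:
        return False
    if level in ("in_frame", "no_stop_codons") and not result.in_frame:
        return False
    if level == "no_stop_codons" and result.has_stop_codon:
        return False
    return True

results = [
    Result("seq1", -50, True, True, False),
    Result("seq2", -2000, True, True, False),   # score too low
    Result("seq3", 0, True, False, False),      # frameshift
    Result("seq4", 0, True, True, True),        # contains a stop codon
]
clustered = [r.name for r in results if eligible_for_clustering(r)]
```

With the "in_frame" level, seq4 is still clustered because stop codons are only excluded at the strictest level.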

Advanced Options

peptide anno advanced options.png

Genetic code:

The Genetic code dropdown allows you to select the genetic code to use for translating nucleotide sequences. The codes are obtained directly from NCBI. One additional genetic code, "Amber readthrough", allows certain stop codons not to be treated as stop codons during translation.
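The effect of a readthrough code can be sketched in Python by translating the same sequence with the standard code (NCBI translation table 1) and with a modified copy. Which codon is read through and which amino acid is inserted are assumptions here (amber codon TAG read as glutamine, Q); this article does not specify them.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of NCBI translation table 1, listed in TCAG codon order
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumption: the amber stop codon (TAG) is translated as glutamine (Q)
AMBER_READTHROUGH_CODE = {**STANDARD_CODE, "TAG": "Q"}

def translate(nucleotides, code=STANDARD_CODE):
    """Translate in frame 1; trailing partial codons are ignored."""
    seq = nucleotides.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    # Codons containing ambiguous bases are reported as X
    return "".join(code.get(codon, "X") for codon in codons)

standard = translate("ATGTAGGGC")
readthrough = translate("ATGTAGGGC", AMBER_READTHROUGH_CODE)
```

Under the standard code the TAG codon ends the translation with a stop, whereas the readthrough code yields a continuous protein sequence. The other stop codons (TAA and TGA) are unaffected in this sketch.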

Record equal reference sequence match as:

Each sequence with partial frequency - This will assign the query sequence to all matching references with partial frequency. Based on the example of a query sequence matching two reference sequences equally, the query sequence will add 0.5 to the total count for both Reference-1 and Reference-2.
Groups of sequences - This will create a separate entry in the list of reference matches that represents this combination of reference sequences. Based on the example of a query sequence matching two references equally (Reference-1 and Reference-2), the query sequence will contribute 0 towards the total for each of Reference-1 and Reference-2, and will instead add 1 to the total for a reference called "Reference-1/Reference-2".
Unknown - This will treat this as an unknown match. Based on the example of a query sequence matching two references equally: Reference-1 and Reference-2, the query sequence will add nothing to the totals of Reference-1 and Reference-2.
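The three counting modes can be compared with a short Python sketch. The function and mode names are hypothetical, and recording tied queries under an "Unknown" entry in the third mode is an assumption; the description above only states that nothing is added to the individual references.

```python
from collections import defaultdict

def tally_reference_matches(best_matches, mode):
    """Count reference matches; best_matches holds the equally best references per query."""
    totals = defaultdict(float)
    for refs in best_matches:
        if len(refs) == 1:
            totals[refs[0]] += 1            # unambiguous match
        elif mode == "partial":
            for ref in refs:
                totals[ref] += 1 / len(refs)
        elif mode == "groups":
            totals["/".join(refs)] += 1
        elif mode == "unknown":
            totals["Unknown"] += 1          # assumed label for tied queries
        else:
            raise ValueError(f"Unrecognized mode: {mode}")
    return dict(totals)

# One query matches Reference-1 only; another ties between two references
queries = [["Reference-1"], ["Reference-1", "Reference-2"]]
partial = tally_reference_matches(queries, "partial")
groups = tally_reference_matches(queries, "groups")
unknown = tally_reference_matches(queries, "unknown")
```

Comparing the three results shows how the tied query shifts between the individual references, a combined "Reference-1/Reference-2" entry, and neither reference.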

Trim each side of fully annotated region if over:

This setting trims off extra bases on either side of the sequence region of interest. The default is 10, which means that your annotated sequence will retain at most 10 base pairs flanking the 5' and 3' ends of the region.
- This option is only available when performing a Sanger-type analysis

Note that trimming only applies to fully annotated sequences. Sequences that the Peptide Annotator operation classifies as not fully annotated are left untrimmed.
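A minimal Python sketch of this trimming behavior is shown below. The function name and the coordinate convention (0-based start, end-exclusive) are assumptions made for the example.

```python
def trim_flanks(sequence, region_start, region_end, max_flank=10, fully_annotated=True):
    """Keep the annotated region plus at most max_flank bases on each side.

    region_start is 0-based and region_end is exclusive.
    """
    if not fully_annotated:
        return sequence  # sequences that are not fully annotated are left untrimmed
    start = max(0, region_start - max_flank)
    end = min(len(sequence), region_end + max_flank)
    return sequence[start:end]

# 25 bases upstream, a 6-base region of interest, 4 bases downstream
read = "A" * 25 + "CCCGGG" + "T" * 4
trimmed = trim_flanks(read, region_start=25, region_end=31)
```

The upstream flank is cut back to 10 bases, while the 4-base downstream flank is already under the limit and is left as it is.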

Saving different settings as Profiles

Geneious Biologics allows you to save Profiles which can be used to record and re-run alternative settings depending on the dataset. This means that you can specify custom sequence liabilities, custom clusters and other settings depending on what dataset you are working with.

Profiles can be saved and applied at the bottom of all our Annotation analysis pipelines:

apply or save profile 2.png

What next?

Like any other Biologics Annotator Result document, you can also:

Filter your Sequences
Go to the Graphs Tab to view visualizations of your results like Sequence Logos
Perform Sequence Alignments
View the "Clusters" in your dataset
Add new Clusters to your Results
Subset your sequences and re-calculate clusters
Add Assay Data to your Analysis Results
Compare Results across Multiple Experiments to monitor enrichment across panning rounds or to identify sequences present across multiple datasets.
Edit your Sequences