Our newest annotation analysis pipeline allows you to submit non-antibody-like sequences for annotation, analysis and clustering. The sequences you submit can be nucleotide or protein sequences of any kind. This article outlines all of the main options available in the Peptide Annotator.
Jump to:
- What is the Peptide Annotator?
- How do I run Peptide Annotator?
- Settings
- Saving different settings as Profiles
What is the Peptide Annotator?
The Peptide Annotator annotates and clusters your input query sequences. You can choose to annotate the query sequences without a reference, or you can use a known template sequence as your reference - see Understanding Reference Databases.
The Peptide Annotator accepts nucleotide and protein* sequences of any nature - it is not dependent on a specific molecule or protein type. Some examples of sequences may include:
- Panning rounds of peptides 5-40+ amino acids long
- Domains/regions of proteins
- Lists of HCDR3 sequences
* Note: Protein input sequences are only supported for non-NGS analysis
In addition to Clustering your input sequences, the Peptide Annotator also offers deeper analysis capabilities:
- Visualisation and recording of variants present compared to the reference database (if a reference database is used)
- Identification of sequence-based liabilities and a liability and sequence quality scoring system
- Custom clustering options
- A rich visualisation suite including a sequence explorer, alignment views, and a broad range of graphs to help you gain further insights from your dataset
To view a tutorial demonstrating the use of the Peptide Annotator, see Peptide Tutorial 1. Phage Display Libraries.
How do I run Peptide Annotator?
To run the Peptide Annotator, select a file in your folder and go to Annotation > Peptide Annotator from the dropdown:
To start the Peptide Annotator operation, adjust your options in the pop-up as desired and then click Run. This operation will output a Biologics Annotator Result file.
Settings
The following sections outline the steps to successfully carry out an analysis using the Peptide Annotator and how each section and option works.
Main Options
Reference database(s):
- The Peptide Annotator supports using no reference database, or using a General Template Reference Database. See General Template Databases to easily make your own; this could be a full protein, a protein domain or a peptide sequence.
- Multiple reference databases can be selected.
Name Scheme:
- This option will be automatically filled in if you used Batch Assemble Sanger Sequences with a Name Scheme. Name schemes are highly customisable, and allow you to use information contained within the sequence names (eg. well or donor/sample) to classify sequences and groups of sequences, or pull out information from the sequence names into columns in your result.
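As a rough illustration of what a name scheme does, the sketch below pulls fields out of a sequence name with a regular expression. The name layout (`<donor>_<well>_<primer>`), the field names and the `parse_name` helper are all hypothetical; real Name Schemes are configured in the Geneious Biologics interface, which is not shown here.

```python
import re

# Hypothetical name layout: <donor>_<well>_<primer>, e.g. "DonorA_B07_Fwd".
NAME_PATTERN = re.compile(
    r"^(?P<donor>[^_]+)_(?P<well>[A-H]\d{2})_(?P<primer>[^_]+)$"
)

def parse_name(name):
    """Return the fields encoded in a sequence name, or None if it does not match."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None

print(parse_name("DonorA_B07_Fwd"))
```

Each extracted field corresponds to what would appear as a column in the result, which is what makes grouping by well or donor possible.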
Handle Input Sequences:
NOTE: Your selection here will influence the following Collapsing and Filtering section.
Keep individual sequences (e.g. Sanger)
- All submitted sequences will be retained in the outputted result. This option will not be available if you have selected more than 10,000 input reads for annotation.
Collapse duplicates (e.g. NGS)
- This will collapse sequences that meet an identity threshold at the nucleotide level and retain a count of how many sequences were collapsed. This will be the default option when more than 10,000 reads were selected for annotation.
Collapse duplicates with Barcodes (e.g. Single Cell)
- This will collapse sequences that meet an identity threshold at the nucleotide level and retain a count of how many sequences were collapsed. It enables extra options that are relevant for sequencing campaigns that require full sequences to be assembled from multiple reads i.e. de novo assembly. See Understanding Single Cell technologies: Barcodes and UMIs and Collapse UMI Duplicates and Separate Barcodes.
Collapsing and Filtering
The options available in this section will depend on your analysis mode (Handle Input Sequences) above.
Keep individual sequences/Sanger
None available
Collapse duplicates/NGS
Collapse Sequences at least: X % identical
- Sequences that meet the identity threshold will be collapsed together and a count of how many sequences were collapsed is retained. The default is 100% identical.
Retain upstream and downstream of sequenced region:
- These options are only available if a reference database was selected. The two options enable you to retain a portion of the sequence on either side of the targeted domain/protein/peptide. This additional sequence will be included when the collapsing step is undertaken.
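To make the collapsing step concrete, here is a minimal sketch of the default 100% identity case, where collapsing amounts to counting exact duplicates. It is not the Peptide Annotator's implementation: the `collapse_exact_duplicates` helper is hypothetical, treating reads as case-insensitive is an assumption, and thresholds below 100% would need a sequence similarity comparison that is not shown.

```python
from collections import Counter

def collapse_exact_duplicates(reads):
    """Collapse identical nucleotide reads and keep a count for each."""
    # Assumption: upper- and lower-case bases are treated as the same read.
    counts = Counter(read.upper() for read in reads)
    # most_common() sorts by descending count.
    return counts.most_common()

reads = ["ATGGCC", "atggcc", "ATGGCA", "ATGGCC"]
for sequence, count in collapse_exact_duplicates(reads):
    print(sequence, count)
```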
Collapse duplicates with Barcodes/Single Cell
Collapse Sequences at least: X % identical
- Sequences that meet the identity threshold will be collapsed together and a count of how many sequences were collapsed is retained. The default is 100% identical.
Retain upstream and downstream of sequenced region: X bp
- These options are only available if a reference database was selected. The two options enable you to retain up to the specified length of bp on either side of the targeted reference. This additional sequence will be included when the collapsing step is performed.
Keep unmerged reads
- This option specifies that paired reads which failed to merge should be used in the next step of the analysis. In use cases where sequence reads are expected to overlap, discarding unmerged reads is recommended in order to improve assembly accuracy.
De novo assembly required
- Select this option to perform de novo assembly if the reads constitute fragments of a larger sequenced region of interest - these will be stitched together to form full sequences.
- This is only recommended if you have performed a barcoded analysis, as reads will be assembled into full sequences within the same barcode. Performing this on non-barcoded sequences will give undesired results, as sequence assembly is performed on the entire dataset.
Only keep reads longer than: X bp
- All sequences that are shorter than the threshold defined will be discarded. This is useful to remove likely low quality sequences. You may wish to set this parameter lower if your sequences have adapters, UMIs and/or barcodes which have been trimmed in the Collapse UMI Duplicates and Separate Barcodes operation.
Only use longest: X reads from each list/barcode
- This option lets you specify the number of reads that will be used from each sequence list or barcode (after sorting by length), with any additional reads discarded. This helps improve performance on large data sets where excessive data significantly slows down analysis. 500 reads is generally sufficient.
- Generally this is not recommended for de novo assembly
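The two read filters above (minimum length, then longest X reads per list or barcode) can be sketched as follows. This is only an illustration under stated assumptions: the `filter_reads` helper and the dictionary layout are hypothetical, and reads exactly at the length threshold are kept here because the description says only shorter reads are discarded.

```python
def filter_reads(reads_by_barcode, min_length, max_reads):
    """Drop short reads, then keep the longest `max_reads` reads per barcode."""
    kept = {}
    for barcode, reads in reads_by_barcode.items():
        # Assumption: only reads shorter than the threshold are discarded.
        long_enough = [read for read in reads if len(read) >= min_length]
        # Sort by length, longest first, before taking the top reads.
        long_enough.sort(key=len, reverse=True)
        kept[barcode] = long_enough[:max_reads]
    return kept

example = {"BC1": ["AAAA", "AA", "AAAAAA", "AAAAA"]}
print(filter_reads(example, min_length=3, max_reads=2))
```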
Significant sequences have at least: X% of the read count of the cell/barcode
- This input field lets you flag sequences with low numbers of reads relative to the total reads in the cell (if barcodes were used), or relative to the input sequence list, specified as a percentage.
Significant sequences have at least: X reads
- This input field lets you flag sequences with low numbers of reads as not significant. This is useful for filtering out sequences that were only defined due to reads of very low frequency or due to sequencing errors.
Only keep sequences with at least: X% of the read count of the dominant sequence
- This input field lets you discard sequences whose read count falls below the specified percentage of the read count of the most prevalent sequence within the barcode/dataset. This setting allows you to permanently discard sequences that do not have enough supporting data for further analysis.
- We recommend not setting this higher than 5%
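The difference between flagging (the two "Significant sequences" options) and discarding (the "Only keep sequences" option) can be sketched as below. The `classify_sequences` helper and its parameter names are hypothetical, and requiring both significance thresholds to pass is an assumption, since the article does not state how the two are combined.

```python
def classify_sequences(read_counts, min_percent_of_total, min_reads,
                       min_percent_of_dominant):
    """Discard weakly supported sequences and flag the rest as significant or not."""
    if not read_counts:
        return {}
    total = sum(read_counts.values())
    dominant = max(read_counts.values())
    results = {}
    for sequence, count in read_counts.items():
        # Discarded sequences are removed permanently, not just flagged.
        if count < dominant * min_percent_of_dominant / 100:
            continue
        # Assumption: a sequence must pass both thresholds to be significant.
        significant = (count >= total * min_percent_of_total / 100
                       and count >= min_reads)
        results[sequence] = {"reads": count, "significant": significant}
    return results

counts = {"SEQ_A": 900, "SEQ_B": 80, "SEQ_C": 20}
print(classify_sequences(counts, min_percent_of_total=10, min_reads=10,
                         min_percent_of_dominant=5))
```

In this example SEQ_C is discarded (20 reads is below 5% of 900), while SEQ_B is kept but flagged as not significant (80 reads is below 10% of the 1000 total reads).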
Analysis Options
Annotate variants from reference database
- To annotate and see the differences between your input sequences and reference sequences, select this option. The nucleotide and amino acid differences relative to your reference database(s) will be annotated on your input sequences and recorded in the All Sequences Table. For more information about viewing these differences and what they mean, see this article.
Calculate protein statistics
- This will calculate the Molecular Weight (kDa), the Isoelectric point, the charge at pH 7 and the Extinction Coefficient across the Full Sequence (no reference database) or the Template Region (when using a reference database).
- If a full Template Region cannot be found these values will not be calculated.
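For orientation, two of these statistics can be approximated with widely used textbook formulas, as in the sketch below. The article does not state which methods the Peptide Annotator uses, so this is an assumption: molecular weight is taken as the sum of average residue masses plus one water, and the extinction coefficient at 280 nm uses the common 5500 (Trp), 1490 (Tyr) and 125 (cystine) values. The function names are hypothetical.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528

def molecular_weight_kda(protein):
    """Approximate molecular weight in kDa; raises KeyError on unknown residues."""
    mass = sum(RESIDUE_MASS[residue] for residue in protein.upper()) + WATER_MASS
    return mass / 1000

def extinction_coefficient(protein, cystines=0):
    """Approximate molar extinction coefficient at 280 nm (M^-1 cm^-1).

    `cystines` is the number of disulfide bonds; 0 assumes reduced cysteines.
    """
    protein = protein.upper()
    return 5500 * protein.count("W") + 1490 * protein.count("Y") + 125 * cystines

peptide = "ACDWYK"  # hypothetical peptide used only for illustration
print(round(molecular_weight_kda(peptide), 4), extinction_coefficient(peptide))
```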
Find liabilities and assets:
- To search for and score amino acid or nucleotide motifs, select this option. Liabilities are motifs associated with deleterious post-translational modifications or any type of reduced antibody function; assets are desirable motifs. The Peptide Annotator pipeline has a default set of sequence liability checks. These can also be customized; see how to Customize Sequence Liabilities and Assets.
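A motif-based liability search can be pictured as a set of patterns scanned across each sequence, as in this minimal sketch. The three motifs are commonly cited examples chosen for illustration, not the Peptide Annotator's default set, and the scores, names and helper functions are hypothetical.

```python
import re

# Illustrative motifs only; NOT the Peptide Annotator defaults. Scores are made up.
LIABILITY_MOTIFS = {
    "N-linked glycosylation (N-X-S/T, X not P)": (r"N[^P][ST]", -10),
    "Asn deamidation (NG)": (r"NG", -5),
    "Asp isomerisation (DG)": (r"DG", -5),
}

def find_liabilities(protein):
    """Return (position, name, matched text, score) for every motif hit."""
    hits = []
    for name, (pattern, score) in LIABILITY_MOTIFS.items():
        # A lookahead reports overlapping occurrences of the same motif.
        for match in re.finditer(f"(?=({pattern}))", protein):
            hits.append((match.start(), name, match.group(1), score))
    return sorted(hits)

def liability_score(protein):
    """Sum the scores of all hits; more negative means more liabilities."""
    return sum(hit[3] for hit in find_liabilities(protein))

for hit in find_liabilities("AANGSDGA"):
    print(hit)
print(liability_score("AANGSDGA"))
```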
Clustering Options
Clustering provides a way to group your sequences based on shared identity/similarity across your sequences. To learn more about clustering and how it can help with interpreting your dataset, see Understanding "Clusters".
Several default clusters will already be listed, and further clusters can be added using the blue "Plus" sign.
It is possible to cluster up to six regions together based on shared identity across sequences in the regions selected. It is also possible to allow mismatches across a region and to cluster based on amino acid similarity. To learn more about configuring this option, please refer to Clustering Options.
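As a simplified picture of clustering several regions together, the sketch below groups sequences that are identical across every selected region. It covers exact identity only (no mismatches or amino acid similarity), and the record layout, region names and `cluster_by_regions` helper are hypothetical.

```python
from collections import defaultdict

def cluster_by_regions(records, regions):
    """Group record names by the combined sequence of the selected regions."""
    clusters = defaultdict(list)
    for record in records:
        # Sequences share a cluster only if every selected region is identical.
        key = tuple(record[region] for region in regions)
        clusters[key].append(record["name"])
    return dict(clusters)

records = [
    {"name": "seq1", "Region-1": "ACDE", "Region-2": "KLM"},
    {"name": "seq2", "Region-1": "ACDE", "Region-2": "KLM"},
    {"name": "seq3", "Region-1": "ACDE", "Region-2": "KLV"},
]
print(cluster_by_regions(records, ["Region-1", "Region-2"]))
```

Clustering on "Region-1" alone would place all three sequences in one cluster, which shows how the choice of regions changes the grouping.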
Cluster Filters
Sequencing data can contain low quality sequences and noise. In order to improve the meaningfulness of clusters in your results, select one or more of the following options:
- Only cluster results with asset and liability score of at least - This will cluster the sequences based on the score specified. For example, if you specify a score of -1000, only sequences that have a liability and asset score of -1000 or more will be included in the clusters.
- Only cluster results which are - This will cluster the sequences that are either: Fully annotated, Fully annotated and In Frame, or Fully annotated, In Frame, and Without Stop Codons. For example, if you chose to cluster Fully annotated and In Frame sequences, only sequences that meet the specifications of being fully annotated and in frame will be clustered; sequences that are not fully annotated or have frameshifts will not be included in the clustering operation.
- Note: this option is only available for a Sanger-type analysis.
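The two cluster filters can be read as a pre-check applied to each sequence before clustering, roughly as sketched here. The record fields, level names and `passes_cluster_filters` helper are hypothetical; only the behaviour described above (a minimum score, and cumulative annotation levels) is taken from this article.

```python
# Each level also requires every level listed before it.
LEVELS = ("fully_annotated", "in_frame", "without_stop_codons")

def passes_cluster_filters(record, min_score=None, required_level=None):
    """Return True if a sequence record should be included in clustering."""
    if min_score is not None and record["score"] < min_score:
        return False
    if required_level is not None:
        needed = LEVELS[: LEVELS.index(required_level) + 1]
        if not all(record[flag] for flag in needed):
            return False
    return True

record = {"score": -500, "fully_annotated": True, "in_frame": True,
          "without_stop_codons": False}
print(passes_cluster_filters(record, min_score=-1000, required_level="in_frame"))
```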
Advanced Options
Genetic code:
- The Genetic code dropdown allows you to select the genetic code to use for translating nucleotide sequences. The codes are obtained directly from NCBI. One additional Genetic code, "Amber readthrough", allows certain stop codons not to be treated as stop codons during translation.
Record equal reference sequence match as:
- Each sequence with partial frequency - This will assign the query sequence to all matching references with partial frequency. Based on the example of a query sequence matching to two unique reference sequences, the query sequence will add 0.5 to the total count for both Reference-1 and Reference-2.
- Groups of sequences - This will create a separate entry in the list of reference matches that represents this combination of reference sequences. Based on the example of a query sequence matching two references: Reference-1 and Reference-2, the query sequence will contribute 0 towards the total for each of Reference-1 and Reference-2, and instead add 1 to the total for a reference called "Reference-1/Reference-2".
- Unknown - This will treat this as an unknown match. Based on the example of a query sequence matching two references equally: Reference-1 and Reference-2, the query sequence will add nothing to the totals of Reference-1 and Reference-2.
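The three counting modes can be compared side by side with a small worked example. This is a sketch of the arithmetic described above, not the product's code: the `tally_reference_matches` helper and mode names are hypothetical, and recording tied queries under an "Unknown" entry is an assumption (the article only says nothing is added to the matching references).

```python
from collections import Counter

MODES = ("partial", "group", "unknown")

def tally_reference_matches(matches_per_query, mode):
    """Count reference matches; `matches_per_query` lists the best references per query."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    totals = Counter()
    for references in matches_per_query:
        if len(references) == 1:
            totals[references[0]] += 1
        elif mode == "partial":
            # Split one count evenly across all equally matching references.
            for reference in references:
                totals[reference] += 1 / len(references)
        elif mode == "group":
            totals["/".join(references)] += 1
        else:
            # Assumption: tied queries are tallied under an "Unknown" entry.
            totals["Unknown"] += 1
    return dict(totals)

queries = [["Reference-1"], ["Reference-1", "Reference-2"]]
for mode in MODES:
    print(mode, tally_reference_matches(queries, mode))
```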
Trim each side of fully annotated region if over:
- This setting trims off extra bases on either side of the sequence region of interest. The default is 10, which means that your annotated sequence will have 10 base pairs flanking the 5' and 3' ends of the sequence.
- This option is only available when performing a Sanger-type analysis
Note that trimming only applies to fully annotated sequences; sequences that are classified as not fully annotated by the Peptide Annotator operation are left untrimmed.
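The trimming rule can be sketched as keeping at most a fixed number of flanking bases around the annotated region. The `trim_flanks` helper, its parameters and the 0-based, end-exclusive coordinates are assumptions made for illustration.

```python
def trim_flanks(sequence, region_start, region_end, max_flank=10,
                fully_annotated=True):
    """Keep at most `max_flank` bases on each side of the annotated region.

    `region_start` and `region_end` are 0-based, end-exclusive coordinates.
    """
    if not fully_annotated:
        # Sequences that are not fully annotated are left untrimmed.
        return sequence
    start = max(0, region_start - max_flank)
    end = min(len(sequence), region_end + max_flank)
    return sequence[start:end]

read = "A" * 15 + "GGGCCC" + "T" * 4   # 15-base 5' flank, 4-base 3' flank
print(trim_flanks(read, region_start=15, region_end=21))
```

Here the 5' flank is cut from 15 bases to 10, while the 3' flank is already shorter than 10 bases and is kept whole.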
Saving different settings as Profiles
Geneious Biologics allows you to save Profiles which can be used to record and re-run alternative settings depending on the dataset. This means that you can specify custom sequence liabilities, custom clusters and other settings depending on what dataset you are working with.
Profiles can be saved and applied at the bottom of all our Annotation analysis pipelines:
What next?
Like any other Biologics Annotator Result document, you can also:
- Filter your Sequences
- Go to the Graphs Tab to view visualisations of your results like Sequence Logos
- Perform Sequence Alignments
- View the "Clusters" in your dataset
- Add new Clusters to your Results
- Subset your sequences and re-calculate clusters
- Add Assay Data to your Analysis Results
- Compare Results across Multiple Experiments to monitor enrichment across panning rounds or to identify sequences present across multiple datasets.
- Edit your Sequences