In this tutorial, you will learn how to annotate and cluster short peptide sequences and compare panning rounds over an experiment. See our article on the Peptide Annotator to learn more about what kinds of datasets you can use this analysis pipeline for.
The Peptide Annotator is a general-purpose tool and is not antibody-specific; if you are interested in antibody analysis, see our Getting Started page for antibody-specific tutorials.
This tutorial will cover the following exercises:
Get Started: To start this tutorial, you will need the input data. If you have recently started Geneious Biologics, your organization may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the input sequences here and then uploading them into Geneious Biologics. Note that the data and images used in this tutorial are obtained from this research article.
Sequence Annotation
In this exercise, you will learn how to analyze sequences from multiple Illumina MiSeq biopanning libraries. These sequence libraries have already been trimmed to full regions that are 36 nucleotides long (12 amino acids). To annotate these short peptide sequences, select the pan1_SRR2050319 document in the Input data folder and click Annotation > Peptide Annotator.
In the Peptide Annotator dialog box, select the options described in the sections below (see also the image below).
Main Options
Select the following options:
- Reference database(s): No database
- Name Scheme: None
- Handle Input Sequences: Collapse duplicates (e.g. NGS)
Collapsing and Filtering Options
Leave defaults - Collapse sequences at least: 100% identical
Analysis Options
Leave all defaults
Clustering Options
Leave all defaults
Advanced Options
Leave all defaults
Click Run to start the analysis. This operation will produce a pan1_SRR2050319 Annotated & Clustered Biologics Annotator Result document (also available in the Sequence Annotation folder).
Repeat this step with the other libraries (pan2-4). Ultimately, this will result in 4 individual Biologics Annotator Result documents (one per library).
Note: Run each panning round individually. If you run them all in one go, all the libraries are analyzed as a single document and the library categorization is lost.
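Conceptually, collapsing duplicates at 100% identity means that identical reads are merged into one entry that carries a count. The Python sketch below illustrates that idea on a few made-up reads. It is not the Peptide Annotator's implementation; the function name `collapse_duplicates` and the example reads are assumptions for illustration only.

```python
from collections import Counter


def collapse_duplicates(reads):
    """Merge identical reads into (sequence, count) pairs, most abundant first."""
    counts = Counter(reads)
    return counts.most_common()


# Hypothetical 12-residue peptide reads (illustrative, not the tutorial data).
reads = [
    "SGVYKVAYDWQH",
    "WPTDHQMLRIPM",
    "SGVYKVAYDWQH",
    "SGVYKVAYDWQH",
    "WPTDHQMLRIPM",
    "SGVYKVAYEWQH",
]

collapsed = collapse_duplicates(reads)
for sequence, count in collapsed:
    print(sequence, count)
```

Because only exact matches are merged, the read that differs by one residue stays as its own entry; grouping near-identical sequences is a separate clustering step.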
Viewing the results
To view a Peptide Annotation result, select one of the SRR2050319 Annotated & Clustered Biologics Annotator Result documents. This will bring up the info tab, which documents the analysis and also allows you to open the document:
Clicking Open Full Document will open the result which consists of a tabular view with your sequences organised into rows - the Sequences Table. Clicking on a sequence will allow you to view the sequence in the Sequence Viewer below.
The Sequences Table contains details for each individual sequence, such as the amino acid sequence and (if selected under Analysis Options) the reference database mismatches, score and protein statistics. The Sequence Viewer lets you view the annotated sequences and search for motifs and annotations.
Other options for sequence analysis within a result include:
- Filtering your results to pull out sequences that meet certain metrics you specify
- Extract and Re-cluster to take a subset of sequences out of an existing Biologics Annotator Result Document and make a new document with re-calculated clusters
- Adding Assay Data (ELISA values etc.) to further inform your results: Adding Assay Data to your Analysis Results
- Aligning sequences to compare the amino acid diversity across a region or multiple regions: Sequence Alignment
- Editing your Sequences to perform point mutations that might increase developability
Adding Clusters
Clustering is used to group sequences together based on shared identity, and allows you to view the counts of unique or related sequences. If you are unsure what a cluster is, see Understanding "Clusters".
To group together sequences that have a single amino acid mismatch, go to Post-processing > Add Clusters:
This will bring up the Add Clusters dialogue. Click the blue "+" sign to add a new cluster. Then, switch to the Advanced tab and make sure to select the region "Full Sequence" and the Cluster Method "Identity (by count)".
After adding the cluster, select Run.
See our Add Cluster page for more on these options.
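To make the idea of grouping sequences that differ by a single amino acid more concrete, here is a minimal Python sketch of one possible greedy approach: the most abundant unassigned sequence seeds a cluster, and every unassigned sequence within one mismatch of the seed joins it. This is an assumption made for illustration and is not necessarily how the "Identity (by count)" method works internally; the function names and the counts are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_by_mismatch(counts, max_mismatches=1):
    """Greedy clustering seeded by the most abundant unassigned sequence."""
    # Sort by descending count, then alphabetically for a stable order.
    unassigned = sorted(counts, key=lambda s: (-counts[s], s))
    clusters = []
    while unassigned:
        seed = unassigned[0]
        members = [s for s in unassigned if hamming(seed, s) <= max_mismatches]
        clusters.append({
            "seed": seed,
            "members": members,
            "total_count": sum(counts[s] for s in members),
        })
        unassigned = [s for s in unassigned if s not in members]
    return clusters


# Hypothetical collapsed counts (illustrative, not the tutorial data).
counts = {
    "SGVYKVAYDWQH": 120,
    "SGVYKVAYEWQH": 4,   # one mismatch from the most abundant sequence
    "SGVYRVAYDWQH": 2,   # one mismatch from the most abundant sequence
    "WPTDHQMLRIPM": 30,  # unrelated sequence
}

clusters = cluster_by_mismatch(counts)
for cluster in clusters:
    print(cluster["seed"], len(cluster["members"]), cluster["total_count"])
```

In this toy input, the two single-mismatch variants join the cluster seeded by the most abundant sequence, and the unrelated sequence forms a cluster of its own.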
Viewing the new Cluster
To view the new cluster, change the Cluster Table: dropdown to the new cluster:
When viewing this cluster table, we can see that the most common sequence (SGVYKVAYDWQH) also had 5 related sequences that had a single amino acid difference.
The Cluster Contents columns will list the top 100 related amino acid sequences and their percent abundance or count in the cluster, while the # Exact Clusters column will list the number of unique sequences in the cluster.
Graphs
When viewing any of the tables, you can also switch to the Graphs tab in the Sequence Viewer panel, as shown below. Of particular interest is the Cluster Similarity Network plot, which lets you investigate the relationships between clusters of varying size and abundance. To view this, navigate to the Full Sequence cluster in the Cluster Table: dropdown and select the appropriate graph:
Each node represents a cluster, and clusters that are more similar in terms of their sequence will be connected together on the network. The relative size of the nodes represents the # of Sequences.
You can learn more about this plot here: Network and Tree plots: Identifying clonotype and sequence relationships or learn more about our graphs here: Using Graphs to interpret Clusters and Clonotypes.
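As a rough mental model of such a network, the Python sketch below builds a node list (cluster representative and its number of sequences) and an edge list (pairs of clusters whose representatives are similar). The similarity rule used here, at most two mismatches between equal-length representatives, is an assumption chosen for illustration; the tutorial does not state the criterion the plot uses, and the function names and cluster sizes are hypothetical.

```python
from itertools import combinations


def mismatches(a, b):
    """Count differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_network(cluster_sizes, max_mismatches=2):
    """Return (nodes, edges): nodes map representative -> size, edges link similar clusters."""
    nodes = dict(cluster_sizes)
    edges = [
        (a, b)
        for a, b in combinations(sorted(nodes), 2)
        if len(a) == len(b) and mismatches(a, b) <= max_mismatches
    ]
    return nodes, edges


# Hypothetical cluster representatives and their "# of Sequences".
cluster_sizes = {
    "SGVYKVAYDWQH": 126,
    "SGVYKVAFEWQH": 9,   # two mismatches from the cluster above
    "WPTDHQMLRIPM": 30,  # unrelated cluster
}

nodes, edges = build_network(cluster_sizes)
print("nodes:", nodes)
print("edges:", edges)
```

Node sizes would then be drawn in proportion to the values in `nodes`, and unconnected nodes correspond to clusters with no close relatives.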
Comparing panning rounds
This dataset consists of four rounds of sequencing on a phage display library. To find enriched peptides, we can use Compare Results to compare the relative frequencies of peptide sequences across the panning rounds. Exit the result document, select all four results in the main folder, go to Post-processing and click Compare Results:
Select the following settings and click Run:
Filtering
- Filter out sequences where the sum of counts for all samples is lower than: 5
Normalization
- Method: Total count
Additional Clustering
- Group similar sequences across all samples: ON
- Method: Identity-based clustering
- Threshold: 100%
- Region: Full Sequence
Experiment
- Reference sample: pan1_SRR2050319 Annotated & Clustered
To learn more about these settings and what they mean, see our main article Comparing Results across Multiple Experiments.
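The Filtering and Normalization settings can be pictured with a small Python sketch: sequences whose summed count across all samples is below the threshold are dropped, and each remaining count is divided by its sample's total. This is only an illustration under stated assumptions. The order of operations (filter first, then compute totals), the scaling to fractions, and the function names are assumptions, and the counts are made up; Compare Results may differ in these details.

```python
def filter_low_counts(table, min_total=5):
    """Drop sequences whose summed count across all samples is below min_total."""
    return {seq: row for seq, row in table.items() if sum(row.values()) >= min_total}


def normalize_total_count(table):
    """Express each count as a fraction of its sample's total count."""
    totals = {}
    for row in table.values():
        for sample, count in row.items():
            totals[sample] = totals.get(sample, 0) + count
    return {
        seq: {
            sample: (count / totals[sample] if totals[sample] else 0.0)
            for sample, count in row.items()
        }
        for seq, row in table.items()
    }


# Hypothetical counts per sequence and sample (illustrative only).
table = {
    "WPTDHQMLRIPM": {"pan1": 1, "pan4": 40},
    "SGVYKVAYDWQH": {"pan1": 9, "pan4": 10},
    "AAAAAAAAAAAA": {"pan1": 2, "pan4": 1},  # summed count 3, below the threshold of 5
}

filtered = filter_low_counts(table, min_total=5)
normalized = normalize_total_count(filtered)
for seq, row in normalized.items():
    print(seq, row)
```

Normalizing by total count makes samples with different sequencing depths comparable, since each value becomes a share of its own sample.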
Viewing the comparison results
After the comparison result has finished running, opening the document will bring up the Summary Table. Navigate to the Full Sequence (100% Identity) table as shown below:
A frequency plot is populated automatically, showing the top 10 peptide sequences by score and the rates at which they were found within each of the four samples. These graphs are interactive: hovering over the columns reveals that the sequence WPTDHQMLRIPM made up around 44% of pan four, while only making up 0.024% of pan one.
You can also make more complex scatterplot graphs. The image below shows a plot of the Normalized count versus the log2 fold change in sequences, after selecting Graph Type: Scatterplot from the left-hand drop-down.
This graph is useful for determining which sequences were enriched relative to the first panning round and were also found at high counts in the last panning round.
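For intuition about the log2 fold change axis, the Python sketch below computes the log2 ratio of a sequence's frequency in one sample to its frequency in the reference sample, using the WPTDHQMLRIPM frequencies quoted above (around 44% in pan four versus 0.024% in pan one). The function name and the optional `pseudocount` parameter are assumptions for illustration; the tutorial does not describe how Compare Results treats sequences that are absent from the reference sample.

```python
import math


def log2_fold_change(sample_freq, reference_freq, pseudocount=0.0):
    """log2 of (sample frequency / reference frequency).

    pseudocount is a hypothetical knob for avoiding division by zero when a
    sequence is absent from one of the samples.
    """
    numerator = sample_freq + pseudocount
    denominator = reference_freq + pseudocount
    if numerator <= 0 or denominator <= 0:
        raise ValueError("frequencies must be positive (consider a pseudocount)")
    return math.log2(numerator / denominator)


# Frequencies (in percent) quoted in the tutorial for WPTDHQMLRIPM.
lfc = log2_fold_change(44.0, 0.024)
print(round(lfc, 2))
```

With these inputs the value is roughly 10.8, which corresponds to an enrichment of about 1,800-fold relative to the first panning round.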