In this tutorial, you will learn how to annotate and cluster short peptide sequences and compare panning rounds over an experiment. See our article on the Peptide Annotator to learn more about what kinds of datasets you can use this analysis pipeline for.
The Peptide Annotator is a general-purpose tool that is not specific to antibodies; if you are interested in antibody analysis, see our Getting Started page for antibody-specific tutorials.
This tutorial will cover the following exercises:
Get Started: To start this tutorial, you will need the input data. If you have recently started using Geneious Biologics, your organization may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the input sequences here and then uploading them into Geneious Biologics. Note that the data and images used in this tutorial are obtained from this research article.
Sequence Annotation
In this exercise, you will learn how to analyze sequences from multiple Illumina MiSeq biopanning libraries. These sequence libraries have already been trimmed to full regions that are 36 nucleotides long (12 amino acids). To annotate these short peptide sequences, select the pan1_SRR2050319 document in the Input data folder and click Annotation > Peptide Annotator.
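The 36-nucleotide to 12-amino-acid relationship follows from three nucleotides per codon. As a minimal sketch outside Geneious Biologics, the Python below translates a trimmed, in-frame read with the standard genetic code. The `translate` helper is hypothetical, and the example read is a made-up back-translation of a peptide quoted later in this tutorial, not a sequence taken from the dataset.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate an in-frame DNA string (A/C/G/T only); '*' marks a stop codon."""
    seq = nucleotides.upper()
    if len(seq) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


# Hypothetical 36-nt read, for illustration only (not from the tutorial data).
read = "TCTGGTGTTTATAAAGTTGCTTATGATTGGCAACAT"
peptide = translate(read)
print(len(read), len(peptide), peptide)
```

Reads containing ambiguous bases such as N would raise a `KeyError` in this sketch; a real pipeline would need to decide how to handle them.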
Select the following options in the Peptide Annotator dialog box (see the sections and image below).
Main Options
Select the following options:
- Reference database(s): No database
- Name Scheme: None
Analysis Options
Leave all defaults
Clustering Options
Leave all defaults
Advanced Options
Leave all defaults
Click Run to start the analysis. This operation will produce a pan1_SRR2050319 Annotated & Clustered Biologics Annotator Result document (also available in the Sequence Annotation folder).
Repeat this step with the other libraries (pan2-4). Ultimately, this will result in four individual Biologics Annotator Result documents (one per library).
Note: You will have to run each library individually. Running them all at once analyzes all the libraries as one document, so the library categorization is lost.
Viewing the results
To view a Peptide Annotator result, select one of the Annotated & Clustered Biologics Annotator Result documents, for example pan1_SRR2050319 Annotated & Clustered. This will bring up the Info tab, which documents the analysis and also allows you to open the document:
Clicking Open Full Document will open the result which consists of a tabular view with your sequences organised into rows - the Sequences Table. Clicking on a sequence will allow you to view the sequence in the Sequence Viewer.
The Sequences Table contains details for each individual sequence such as the amino acid sequence and (if selected under Analysis Options) the reference database mismatches, score and protein statistics. On the other hand, the Sequence Viewer allows you to view the annotated sequences and search for motifs and annotations.
Other options for sequence analysis within a result include:
- Filtering your results to pull out sequences that meet certain metrics you specify
- Extract and Re-cluster to take a subset of sequences out of an existing Biologics Annotator Result Document and make a new document with re-calculated clusters
- Adding Assay Data (ELISA values etc.) to further inform your results: Adding Assay Data to your Analysis Results
- Aligning sequences to compare the amino acid diversity across a region or multiple regions: Sequence Alignment
- Editing your Sequences to perform point mutations that might increase developability
Clusters
Clustering is used to group sequences together based on shared identity, and allows you to view the counts of unique or related sequences. If you are unsure what a cluster is, see Understanding "Clusters".
To access the clustered full sequences, go to Cluster Table and select Full Sequence from the dropdown.
In the above image, the most common peptide is SGVYKVAYDWQH with 1133 sequences making up 0.15% of this dataset.
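For intuition about what the Full Sequence cluster table reports, here is a minimal sketch that groups identical peptides and computes each group's count and share of the dataset. It is not the Geneious Biologics implementation; `full_sequence_clusters` is a hypothetical helper and the toy counts are invented.

```python
from collections import Counter


def full_sequence_clusters(peptides):
    """Group identical peptides; return (sequence, count, percent) rows, largest first."""
    counts = Counter(peptides)
    total = sum(counts.values())
    return [(seq, n, 100.0 * n / total) for seq, n in counts.most_common()]


# Toy input: the counts are invented for illustration.
toy = ["SGVYKVAYDWQH"] * 3 + ["AAAAAAAAAAAA"] * 2 + ["WPTDHQMLRIPM"]
for seq, n, pct in full_sequence_clusters(toy):
    print(f"{seq}\t{n}\t{pct:.2f}%")

# Back-of-envelope check on the figures quoted above: 1133 sequences at
# 0.15% implies a dataset of roughly 1133 / 0.0015 sequences (the 0.15%
# is rounded, so this is only an estimate).
approx_total = round(1133 / 0.0015)
print(approx_total)
```

The estimate works out to roughly three quarters of a million sequences in this library.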
You can also add new clusters, including "fuzzy" or inexact clusters based on shared identity up to a specified threshold with the Add Cluster function.
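To illustrate the idea of an inexact cluster, the sketch below groups equal-length peptides whose position-by-position identity to a cluster representative meets a threshold. This is a simple greedy scheme chosen for illustration; the source does not describe how Add Cluster works internally, and `identity`, `greedy_cluster`, and the variant sequence are all hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions; 0.0 if lengths differ (no alignment is attempted)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(peptides, threshold=0.9):
    """Assign each peptide to the first cluster whose representative meets the threshold."""
    clusters = {}  # representative sequence -> list of member sequences
    for pep in peptides:
        for rep in clusters:
            if identity(pep, rep) >= threshold:
                clusters[rep].append(pep)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[pep] = [pep]
    return clusters


# The second peptide is an invented one-residue variant of the first.
peptides = ["SGVYKVAYDWQH", "SGVYKVAYDWQN", "WPTDHQMLRIPM"]
fuzzy = greedy_cluster(peptides, threshold=0.9)   # 11/12 identity joins the variant
exact = greedy_cluster(peptides, threshold=1.0)   # only identical sequences group
print(len(fuzzy), len(exact))
```

With 12-residue peptides, a 90% threshold tolerates a single mismatch (11/12 is about 91.7%), while two mismatches (10/12) fall below it.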
Graphs
When viewing the Full Sequence cluster table, you can also switch to the graphs tab in the Sequence Viewer panel, as shown below. Some graphs of interest are the Cluster Diversity and Cluster Sizes charts, which can also be accessed in the Graphs tab at the top of the result.
The Cluster Diversity chart above shows that the majority of clusters (approximately 300,000) contained only one sequence. The following Cluster Sizes chart shows the amino acid sequence for the top 25 largest clusters:
Here you can see the aforementioned most populous cluster of sequence SGVYKVAYDWQH.
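The two charts summarize the same cluster counts in different ways. As a rough sketch, assuming exact-identity clusters, the first function below tallies how many clusters exist at each size (the shape of a diversity chart) and the second lists the largest clusters (the shape of a sizes chart). Both helper names are hypothetical and the toy data is invented.

```python
from collections import Counter


def cluster_diversity(peptides):
    """Map cluster size -> number of clusters of that size."""
    sizes = Counter(peptides)        # sequence -> cluster size
    return Counter(sizes.values())   # cluster size -> how many clusters have it


def largest_clusters(peptides, n=25):
    """Return the n largest clusters as (sequence, size) pairs."""
    return Counter(peptides).most_common(n)


# Toy data: one dominant peptide and three singletons (counts are invented).
toy = ["SGVYKVAYDWQH"] * 4 + ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "DDDDDDDDDDDD"]
diversity = cluster_diversity(toy)
top = largest_clusters(toy, n=2)
print(dict(diversity))
print(top[0])
```

A library dominated by singletons, as in this tutorial's first panning round, shows up as a very tall bar at cluster size 1.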
Comparing panning rounds
This dataset consists of four rounds of sequencing on a phage display library. To compare the relative frequencies of peptide sequences within each panning round to find enriched peptides, we can use Compare Results. Exit out of the result document and select all four results in the main folder, go to Post-processing and click Compare Results:
Select the following settings and click Run:
Filtering
- Filter out sequences where the sum of counts for all samples is lower than: 5
Normalization
- Method: Total count
Additional Clustering
- Group similar sequences across all samples: ON
- Method: Identity-based clustering
- Threshold: 100%
- Region: Full Sequence
Experiment
- Reference sample: pan1_SRR2050319 Annotated & Clustered
To learn more about these settings and what they mean, see our main article Comparing Results across Multiple Experiments.
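The settings above can be read as three steps: drop rare sequences, normalize each sample by its total count, and express each sample relative to the reference. The sketch below follows that reading under stated assumptions; it is not the Compare Results implementation. In particular, the small `eps` added to avoid dividing by zero, computing totals before filtering, and all names and toy counts are assumptions made for illustration.

```python
import math


def compare_samples(counts, reference, min_total=5, eps=1e-6):
    """counts: {sample: {sequence: raw count}}.

    Returns {sequence: {sample: {"normalized": float, "log2fc": float}}}.
    """
    samples = list(counts)
    # Assumption: per-sample totals are taken before filtering.
    totals = {s: sum(counts[s].values()) for s in samples}
    sequences = set().union(*(counts[s] for s in samples))

    result = {}
    for seq in sequences:
        raw = {s: counts[s].get(seq, 0) for s in samples}
        if sum(raw.values()) < min_total:
            continue  # sum of counts across all samples is below the cutoff
        norm = {s: raw[s] / totals[s] for s in samples}  # total-count normalization
        result[seq] = {
            s: {
                "normalized": norm[s],
                # eps is an assumed pseudo-frequency so absent sequences do not divide by zero.
                "log2fc": math.log2((norm[s] + eps) / (norm[reference] + eps)),
            }
            for s in samples
        }
    return result


# Invented toy counts for two samples.
toy_counts = {
    "pan1": {"WPTDHQMLRIPM": 1, "SGVYKVAYDWQH": 99},
    "pan4": {"WPTDHQMLRIPM": 80, "SGVYKVAYDWQH": 20, "AAAAAAAAAAAA": 2},
}
comparison = compare_samples(toy_counts, reference="pan1")
print(sorted(comparison))
```

By construction, the reference sample has a log2 fold change of 0 for every sequence, and sequences that grow in relative frequency get positive values.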
Viewing the comparison results
After the comparison has finished running, opening the document will bring up the Summary Table. Navigate to the Full Sequence (100% Identity) table as shown below:
A frequency plot is generated automatically, showing the top 10 peptide sequences by score and the frequency at which each was found in the four samples. These graphs are interactive: hovering over the columns reveals that the sequence WPTDHQMLRIPM made up around 44% of pan4, while making up only 0.024% of pan1.
You can also make more complex scatterplot graphs. The image below shows a plot of the Normalized count versus the log2 fold change of sequences, after selecting Graph Type: Scatterplot from the left-hand drop-down.
This graph is useful for determining which sequences were enriched relative to the first panning round and were also found at high counts in the last panning round.
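As a worked example of reading such a plot, the sketch below computes a log2 fold change from two frequencies and picks out points that are both abundant and enriched. It uses the approximate WPTDHQMLRIPM frequencies quoted earlier (about 44% versus 0.024%); the helper names, thresholds, and the other data points are hypothetical.

```python
import math


def log2_fold_change(freq_sample: float, freq_reference: float) -> float:
    """log2 of the ratio between a sample frequency and the reference frequency."""
    return math.log2(freq_sample / freq_reference)


def enriched(points, min_count, min_log2fc):
    """points: iterable of (sequence, normalized count, log2 fold change)."""
    return [seq for seq, count, fc in points if count >= min_count and fc >= min_log2fc]


# Frequencies in percent, as quoted in the text (approximate values).
fc = log2_fold_change(44.0, 0.024)
print(round(fc, 2))

# Invented points: only the first combines a high count with a high fold change.
points = [
    ("WPTDHQMLRIPM", 0.44, fc),
    ("SGVYKVAYDWQH", 0.01, -2.0),
    ("AAAAAAAAAAAA", 0.0001, 8.0),
]
hits = enriched(points, min_count=0.01, min_log2fc=2.0)
print(hits)
```

Sequences in the upper right of the scatterplot correspond to the points this filter keeps: abundant in the last round and strongly enriched relative to the first.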