It is advisable to read the Quick Start Guide to help you get familiarised with Geneious Biologics before proceeding with the following tutorial.
NOTE: A previous version of this tutorial using Antibody Annotator instead of NGS Antibody Annotator can be found here.
In this tutorial, you will learn how to merge and annotate next-generation sequencing (NGS) reads produced by sequencing variable gene repertoires from immunized mice. You will also learn how to assess antibody repertoire diversity through sequence clustering.
This tutorial will cover the following exercises:
- Merge overlapping paired-end NGS reads
- Sequence Annotation
- Pipeline Report
- Understanding Sequence Clusters
- Sequence Filtering
- Extract and Recluster
- Further Analysis
To start this tutorial, you will need input data. If you have recently started Geneious Biologics, your organization may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the latest input sequences here and then uploading them into Geneious Biologics.
Note: Both the full dataset and a reduced dataset are provided in the above link. The full dataset takes about 3-4 hours to process and the reduced (5k sequences) dataset takes a few mins. This tutorial uses the full dataset.
The videos in our Getting Started series may also be helpful, linked here. Below is our video on Pre-processing NGS Sequences.
Merge overlapping paired-end NGS reads
Merging paired reads, also known as overlapping or assembly of read pairs, converts a read pair into a single read containing a sequence and a set of quality scores. A read pair must overlap a significant fraction of its length for the reads to be merged.
In this exercise you will learn how to merge paired-end Illumina MiSeq reads. Immunoglobulin heavy chains are approximately 300-350 bp long and because this example read library was obtained by 250 bp paired-end sequencing, it is important to merge the read in order to obtain full length heavy chain sequences. To merge these paired-end reads, select both the paired-end documents in the Input data folder and click Pre-processing > Set & Merge Paired Reads (see image below).
As the read libraries are paired-end, select the following options in the Set & Merge Paired Reads dialog box and click Run to start the analysis (see sections and image below).
- Pairs of lists
- Forward/Reverse (inward pointing, e.g. Illumina paired end)
- Set and merge paired reads using BBMerge
Once the operation is completed, 2 new documents will be generated in the Set and merge paired reads folder; a full_ERR346598 (merged) and a full_ERR346598 (couldn't be merged) document. The ERR346598 (merged) document consists of reads that were successfully paired and merged while the ERR346598 (couldn't be merged) document consists of reads that are paired but couldn’t be merged.
**Note that the number of merged and unmerged reads are dependent on the read quality and increasing the merge rate may result in higher false positives. Read more on Set & Merge Paired Reads here.
NGS Antibody Annotator identifies immunoglobulin framework regions, complementary determining regions (CDR) and V(D)JC genes, and annotates input sequences against a selected reference database.
In this exercise, you will learn how to annotate variable heavy immunoglobulin genes in mice produced by PCR amplification and how to analyze the results with the help of the Pipeline Report and Graphs. To annotate these heavy IgG genes, select the full_ERR346598 (merged) document in the Set and merge paired reads folder and click Annotation > NGS Antibody Annotator:
Select the following options from the Antibody Annotator dialog box and click Run to start the analysis (see sections and image below).
- Reference database: Mouse Ig 2022
- Selected sequences are: Single chain (heavy)
- Sequence region of interest is between: FR1 and FR4
- Collapse sequences at least: 98% identical
- Retain downstream of sequenced region: 27 bp
- Turn on Find liabilities and assets
You can leave all other sections as the default. Click Run to start the analysis. This operation will produce a full_ERR346598 (merged) Annotated & Clustered Biologics Annotator Result document in the Sequence annotation folder.
**Note that the bundled IgG reference databases are split into light and heavy sections. If the sequence type (Selected sequence are: option) is specified for a sequence, only the appropriate database section is used thus, improving performance and potentially annotation accuracy.
*** The "Find liabilities" and "Annotate variants from reference databases" options can cause delays on large datasets. The liabilities and assets box is also customisable: How to Customize Sequence Liabilities and Assets.
Below is our introductory video on how to understand your results using the Graphs tab and the Pipeline Report tab.
A pipeline report is generated for every Biologics Annotator Result document. This report provides an indication of the annotation rate of the input data, region cluster diversity, and gene mutation distribution among others which are derived from the Antibody Annotator analysis.
In the following section, we will determine how well the sequences are annotated. To get a quick overview of the Antibody Annotator analysis, select the full_ERR346598 (merged) Annotated & Clustered document and click the Pipeline Report tab along the top of the viewer:
The below report shows that within the total dataset of 2.54 million sequences, 2.37 million could be fully annotated according to our sequence specifications: Full region spans FR1 > FR4, with 27bp extending into the constant region. The 2.37 million annotated reads were then collapsed down to 216,000 unique sequences (98% identity).
Approximately 66% of the sequences were identified as Heavy Chain and these sequences, consisting of all of the framework regions (FR) and CDRs were fully annotated, in-frame and without stop codons.
Another graph that is particularly useful for quality control is the Mutation distribution by gene plot found in the pipeline report:
If the values are hovering around or above 25% this can indicate that the incorrect germline reference database was used. Here we can see that the values are generally distributed around 5%, indicating that this dataset is not very divergent from the germline.
The Pipeline Report can be exported as a PDF document. Click Export to PDF to export the report as a PDF document.
The Graphs option is available for every Biologics Annotator Result document. In the following sections, we will learn more about clusters and assess the cluster diversity of the Heavy CDR3 region. Our main article on clusters may also be useful: Understanding "Clusters".
The Immunoglobulin Heavy CDR3 sequence has a outsized contribution to antibody diversity and for this reason, the region has been widely used as unique identifiers. To assess the HCDR3 cluster diversity, click the Graphs tab at the top left of the result. Then, click Graph Type: Annotations rates and select Cluster Diversity in the dropdown. Finally, select Heavy CDR3 in right hand Gene/Region dropdown:
**Note that these graphs can be exported as image (png) or table (csv) files that can be used for publication or as laboratory documentation. To export a graph, click Export to the left of the graphs dropdown.
Among the CD regions, CDR3 varies the most. The Cluster Diversity and Cluster Lengths graphs provide a quick indirect comparison of the CDR clusters.
The Number of Clusters graph showed that Heavy CDR3 is the CD region with the highest cluster diversity with approximately 38,500 clusters while Heavy CDR1 and CDR2 consist of approximately 5910 and 7740 clusters respectively.
Additionally, the majority of the Heavy CDR3 clusters in this dataset consist of a single unique CDR3 amino acid sequence suggesting high sequence diversity. The Heavy CDR3 cluster diversity is also reflected in its cluster length where the top 5 cluster lengths range from 10-14 amino acids long:
Below is our introductory video on how to understand and find clusters in your NGS Antibody Annotator result document. Our main article on clusters may also be useful: Understanding "Clusters".
Next-generation sequencing enables the discovery of the great diversity of natural antibody repertoires bringing about vast volumes of sequencing data for a fraction of the cost of Sanger sequencing on a similar sized dataset. Sequence clustering is the process of grouping similar sequences into clusters resulting in reduced sequence redundancy and making data analysis more straightforward.
In this exercise you will learn how to view Heavy CDR3 region clusters and identify the most abundant associated region (CDR1 and CDR2, and FR1-FR4). To view sequences within a cluster switch table views using the Cluster Table > Heavy CDR3 dropdown.
The table will automatically sort the clusters in descending order, with the most abundant Heavy CDR3 cluster at the top. Select the first in-frame cluster to view the cluster of sequences consisting of identical Heavy CDR3 “ARHHRYAYYFDY” sequence (see image below).
**Note that all the sequences within a region cluster consist of identical regions unless specified. For example, when sequences are grouped by Heavy CDR3, all the sequences within a cluster will consist of an identical Heavy CDR3 sequence but they may consist of distinct CDR1 and CDR2, and FR1-FR4 regions.
The Sequences Table can be used to quickly identify the most frequent associated FR and CDR sequences for a selected cluster. Scroll to the right of the Sequences Table or search the Table Preferences panel to bring up columns that match. You can then click the Focus column button which allows you to quickly navigate to your column of interest.
The Sequences Table above demonstrates that the most abundant associated Heavy CDR1 sequences for the Heavy CDR3 “ARHHRYAYYFDY” cluster were “GYTFTSSW” (47.35%) and “GYTFTNSW” (37.38%) respectively.
**Note that you can create custom cluster combinations. Up to six regions or genes (FR1, CDR3, Heavy D gene etc.) can be clustered together based on shared identity across sequences in the regions selected. It is also possible to specify a percent threshold of mismatches allowed and to cluster based on amino acid similarity. To explore these advanced clustering methods, see Clustering Options.
NGS data generally comprises of a large number of reads making antibody candidate selection difficult. Sequence filtering coupled with assets and liability score, may aid in identifying suitable candidates for further downstream analyses.
In this exercise, you will learn how to filter the All Sequences table for sequences that meet a set of conditions. To filter the sequences for sequences that are fully annotated, in-frame and without stop codons with a score greater or equal to -100, first, right click a cell in the Without Stop Codons & In Frame & Fully Annotated column and click the Filter syntax. Then, right click a cell in the Score column and click the Filter syntax. Finally, in the Filter box, ensure that the filter syntax is as below and click Filter.
['Without Stop Codons & In Frame & Fully Annotated'] = 'Yes' AND ['Score'] >= -100
A total of 25,276 sequences that are without stop codons, in-frame and fully annotated, and have a Score of ≥ -100 were identified from the original dataset of 216,441. The high score suggests low sequence annotation errors and few liability sites such as post-translational modifications (PTM) sites.
You can filter sequences on all of the columns within a Biologics Annotator Result document. Learn more about sequence filtering here: Filtering your Sequences.
Extract and Recluster
Sequence clustering is commonly used to group highly similar immunoglobulin sequences together with the assumption that their sequence similarity is the result of them sharing the same initial B cell. You may want to take a subset of your sequences and specify more clusters or just re-calculate existing clusters to identify trends within subsets of your sequences.
In this exercise, you will extract and recluster the above filtered sequences to a new Antibody Annotator result, with added clustering on the Heavy V Gene, Heavy J Gene and Heavy CDR3 region with 85% similarity. This is to represent grouping into Clonotypes. See Understanding Clonotypes and how to find them in your data or the video below for more information:
Ensure you are still filtering on:
['Without Stop Codons & In Frame & Fully Annotated'] = 'Yes' AND ['Score'] >= -100
In the All Sequences Table, and select all of the filtered sequences. Go to the Export/Extract dropdown and select Extract and Recluster.
Click the blue "plus" icon in the Extract and Recluster menu below to add a new cluster.
This will open a box where you can add clusters. To add a new multi-region and/or similarity threshold cluster click the advanced tab. Hold down command (Mac) or control (PC) to select three regions: Heavy CDR3, Heavy V Gene and Heavy J Gene. Then select the following options:
- Cluster By: Amino Acids
- Cluster Method: Similarity (BLOSUM)
- Similarity Threshold: 85%
- Allow Mismatches in: Heavy CDR3
Add the three region cluster. This will group sequences together that are identified as being from the same V and J Genes, while having at least 85% sequence similarity in the Heavy CDR3 region (as judged by the BLOSUM matrix). Click Run to produce the new Biologics Annotator Result document.
To view the re-clustered sequences, select the 25276 nucleotide sequences Annotated & Clustered document in the parent folder. First, we can check what the most common Heavy CDR3 sequence is by selecting Heavy CDR3 from the Cluster Table dropdown menu.
Recall from above section that the most common Heavy CDR3 sequence was ARHHRYAYYFDY. After taking a subset of the sequences that were Without Stop Codons & In Frame & Fully Annotated and had a Score above -100, the most common Heavy CDR3 region sequence is AREARTTARFAY.
We can also check out the combination cluster for identifying Clonotypes that we specified (Heavy V Gene, Heavy J Gene and 85% similarity in the Heavy CDR3 region). We can see that the largest sequence cluster is labelled ARHAYYDQTEVSFVY & IGHV5-9-2 & IGHJ3 with 300 total sequences. Since this is an inexact cluster, there will be different (but related) HCDR3 amino acid sequences within the cluster, with 45% being ARHAYYDQTEVSFVY (Primary Sequence Frequency). In total, there were 78 different Unique HCDR3 sequences in this cluster. All sequences will however be derived from the IGHV5-9-2 & IGHJ3 genes:
All the graphs mentioned previously in this tutorial are also available, and will be recalculated according to the subset of sequences and the clusters specified.
The other tools in Biologics provide a stepping point for analysis.
- Compare two or more Annotation Result Documents from separate experiments to monitor clonal expansion etc.
- You can add your Assay Data (ELISA values etc.) to further inform your results: Adding Assay Data to your analysis results
- You can align sequences to compare the amino acid diversity across a region or multiple regions: Sequence alignment
- Edit your Sequences to perform point mutations that might increase developability