It is advisable to read this article to help you get familiarised with Geneious Biologics before proceeding with the following tutorial.
In this tutorial, you will learn how to merge and annotate next-generation sequencing (NGS) reads produced by sequencing variable gene repertoires from immunized mice. You will also learn how to assess antibody repertoire diversity through sequence clustering.
This tutorial will cover the following exercises:
- Merge overlapping paired-end NGS reads
- Sequence annotation
- Understanding sequence clusters
- Sequence filtering
- Similarity clustering
To start this tutorial, you will need input data. If you have recently started Geneious Biologics, your organization may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the latest input sequences here and then uploading them into Geneious Biologics.
Note: Version 1 of this tutorial used a slightly different dataset. The workflow is exactly the same, but some numbers may not match up.
Merging paired reads, also known as overlapping or assembling read pairs, converts a read pair into a single read with one sequence and one set of quality scores. The two reads of a pair must overlap over a significant fraction of their length to be merged.
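To make the idea concrete, here is a minimal Python sketch of merging one inward-pointing read pair. It is an illustration only and not how BBMerge works internally: it assumes an exact, mismatch-free overlap and keeps the higher quality score at each overlapping position, whereas real mergers tolerate mismatches and use statistical scoring. The function names and the `min_overlap` parameter are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=10):
    """Merge a forward read with an inward-pointing reverse read.

    Qualities are lists of integers, one per base. Returns a tuple
    (sequence, qualities), or None when no exact overlap of at least
    `min_overlap` bases exists (assumed simplification: no mismatches).
    """
    rev_rc = reverse_complement(rev)
    rev_rc_qual = rev_qual[::-1]
    # Try the longest possible overlap first.
    for overlap in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        if fwd[-overlap:] == rev_rc[:overlap]:
            # In the overlap, keep the higher of the two quality scores.
            overlap_qual = [
                max(a, b)
                for a, b in zip(fwd_qual[-overlap:], rev_rc_qual[:overlap])
            ]
            merged_seq = fwd + rev_rc[overlap:]
            merged_qual = fwd_qual[:-overlap] + overlap_qual + rev_rc_qual[overlap:]
            return merged_seq, merged_qual
    return None


# Toy 16 bp fragment sequenced with two 10 bp reads (made-up data).
fragment = "ACGTTGCAAGGCTTAC"
forward_read = fragment[:10]
reverse_read = reverse_complement(fragment[6:])
merged = merge_pair(forward_read, [30] * 10, reverse_read, [20] * 10, min_overlap=4)
print(merged[0] == fragment)
```

A pair for which the function returns `None` corresponds to a read pair that ends up in the "couldn't be merged" output.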
In this exercise, you will learn how to merge paired-end Illumina MiSeq reads. The immunoglobulin heavy chain variable region sequenced here is approximately 300-350 bp long, and because this example read library was obtained by 250 bp paired-end sequencing, the reads must be merged to obtain full-length heavy chain variable sequences. To merge these paired-end reads, select both paired-end documents in the Input data folder and click Pre-processing > Set & Merge Paired Reads (see image below).
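The overlap you can expect between the two reads follows directly from these numbers. The short Python sketch below works it out; the function name is hypothetical and the calculation assumes both reads are full length with no trimming.

```python
def expected_overlap(read_length, amplicon_length):
    """Bases covered by both reads of a pair, assuming untrimmed reads.

    The overlap cannot exceed the amplicon itself and is zero when the
    amplicon is longer than the two reads combined.
    """
    overlap = 2 * read_length - amplicon_length
    return max(0, min(amplicon_length, overlap))


for amplicon_length in (300, 350):
    print(amplicon_length, "bp amplicon:", expected_overlap(250, amplicon_length), "bp overlap")
```

With 250 bp reads, amplicons of 300-350 bp leave 150-200 bp of overlap, which is ample for merging.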
As the read libraries are paired-end, select the following options in the Set & Merge Paired Reads dialog box and click Run to start the analysis (see sections and image below).
- Pairs of lists
- Forward/Reverse (inward pointing, e.g. Illumina paired end)
- Set and merge paired reads using BBMerge
Once the operation is complete, two new documents will be generated in the Set and merge paired reads folder: ERR346598 (merged) and ERR346598 (couldn't be merged). The ERR346598 (merged) document contains read pairs that were successfully merged, while the ERR346598 (couldn't be merged) document contains reads that are paired but could not be merged.
**Note that the numbers of merged and unmerged reads depend on read quality, and that increasing the merge rate may result in more false positives (incorrectly merged pairs). Read more on Set & Merge Paired Reads here.
The Antibody Annotator identifies immunoglobulin framework regions (FRs), complementarity-determining regions (CDRs) and V(D)JC genes, and annotates input sequences against a selected reference database.
In this exercise, you will learn how to annotate PCR-amplified mouse immunoglobulin heavy chain variable genes and how to analyze the results with the help of the Pipeline Report and Graphs. To annotate these heavy IgG genes, select the ERR346598 (merged) document in the Set and merge paired reads folder and click Annotation > Antibody Annotator (see image below).
Select the following options from the Antibody Annotator dialog box and click Run to start the analysis (see sections and image below).
Select the following options:
- Reference database: Mouse Ig
- Selected sequences are: Single chain (heavy)
Select the following options:
- Include pseudo genes from database
- Find liabilities and assets ***
This operation will produce an ERR346598 (merged) Annotated & Clustered Biologics Annotator Result document in the Sequence annotation folder.
**Note that the bundled IgG reference databases are split into light and heavy sections. If the sequence type (the Selected sequences are: option) is specified, only the appropriate database section is used, which improves performance and potentially annotation accuracy. Read more about the Antibody Annotator here.
*** The "Find liabilities" and "Annotate germline gene differences" options can cause delays on large datasets. If you are analysing more than 1 million sequences, it is recommended that you leave these options off unless absolutely necessary.
A Pipeline Report is generated for every Biologics Annotator Result document. Derived from the Antibody Annotator analysis, this report indicates, among other things, the annotation rate of the input data, region cluster diversity and gene mutation distribution.
In the following section, we will determine how well the sequences are annotated. To get a quick overview of the Antibody Annotator analysis, select the ERR346598 (merged) Annotated & Clustered document and click Pipeline Report.
Approximately 88% of the sequences were identified as Heavy Chain and these sequences, consisting of all of the framework regions (FR) and CDRs, were fully annotated, in-frame and without stop codons (Figure 1.1).
Figure 1.1 | The number of sequences without stop codons, in-frame and fully annotated identified by the Antibody Annotator.
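This tutorial does not describe how the Antibody Annotator decides that a sequence is in-frame and free of stop codons. As a rough intuition only, the Python sketch below applies an assumed simplification: a nucleotide sequence counts as in-frame when its length is a multiple of three, and as stop-free when none of its codons, read from the first base, is TAA, TAG or TGA. All function names and the toy sequences are hypothetical, and the "fully annotated" criterion is not modelled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(seq):
    """Split a nucleotide sequence into complete codons, starting at base 0."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def is_in_frame(seq):
    """Assumed simplification: in-frame means length divisible by three."""
    return len(seq) % 3 == 0


def has_stop_codon(seq):
    """True if any complete codon in frame 0 is a stop codon."""
    return any(codon in STOP_CODONS for codon in split_codons(seq))


def passing_fraction(seqs):
    """Fraction of sequences that are in-frame and without stop codons."""
    if not seqs:
        return 0.0
    passing = sum(1 for s in seqs if is_in_frame(s) and not has_stop_codon(s))
    return passing / len(seqs)


# Made-up sequences: one passes, one has a stop codon, one is out of frame.
toy_sequences = ["ATGGCCTGG", "ATGTAAGCC", "ATGGC"]
print(f"{passing_fraction(toy_sequences):.0%} of toy sequences pass")
```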
**Note that the Pipeline Report can be exported as a PDF document. Click Export to PDF to export the report as a PDF document.
The Graphs option is available for every Biologics Annotator Result document. Graphs is a collection of graphs derived from the Antibody Annotator analysis.
In the following sections, we will learn more about clusters and assess the cluster diversity of the Heavy CDR3 region. The immunoglobulin CDR3 region has been reported to contribute to antibody diversity, and for this reason CDR3 sequences are widely used as unique identifiers. To assess the Heavy CDR3 cluster diversity, first select the ERR346598 (merged) Annotated & Clustered document and click Graphs. Then, click Graph Type: Annotations rates and select Cluster diversity in the dropdown. Finally, select Heavy CDR3 in the Filter By: dropdown (see image below).
**Note that these graphs can be exported as image (png) or table (csv) files that can be used for publication or as laboratory documentation. To export a graph, click Export.
The three CDRs, which interact with the antigen, are more diverse than the FRs. Among the CDRs, CDR3 varies the most. The CDR cluster diversity and cluster length graphs provide a quick, indirect comparison of the CDR clusters.
The Heavy CDR cluster diversity graphs show that Heavy CDR3 is the CDR with the highest cluster diversity, with approximately 2,400 clusters, while Heavy CDR1 and CDR2 have approximately 560 and 840 clusters respectively. Additionally, the majority of the Heavy CDR3 clusters in this dataset consist of a single unique CDR3 amino acid sequence, suggesting high sequence diversity. The Heavy CDR3 diversity is also reflected in its length distribution: the top 5 cluster lengths range from 10 to 14 amino acids, while most Heavy CDR1 and CDR2 clusters are 8 amino acids long, as shown in the Heavy CDR length graphs (Figure 1.2).
Figure 1.2 | Heavy CDRs cluster diversity and cluster length distribution. The graphs on the left show the CDR cluster diversity and the graphs on the right show the CDR cluster length.
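If you export the annotated sequences and want to reproduce this kind of summary yourself, the counting can be sketched in a few lines of Python. This is not the code behind the Graphs view: it assumes a cluster is simply a group of reads with an identical region sequence, and it counts clusters (not reads) per length. Apart from “ARWEYYAMDY”, which appears later in this tutorial, the CDR3 sequences are made up.

```python
from collections import Counter


def cluster_by_region(region_sequences):
    """Group identical region sequences; maps each sequence to its read count."""
    return Counter(region_sequences)


def length_distribution(clusters):
    """Count clusters per region length (one count per unique sequence)."""
    return Counter(len(seq) for seq in clusters)


# One Heavy CDR3 amino acid sequence per read (toy data).
cdr3_per_read = [
    "ARWEYYAMDY", "ARWEYYAMDY", "ARGGYDY",
    "ARDYYGSSYFDY", "ARGGYDY", "ARWEYYAMDY",
]

clusters = cluster_by_region(cdr3_per_read)
print("Number of clusters:", len(clusters))
print("Largest cluster:", clusters.most_common(1)[0])
print("Clusters per length:", dict(length_distribution(clusters)))
```

The number of clusters is the diversity figure, and the per-length counts correspond to the cluster length graph.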
Next-generation sequencing enables the discovery of the great diversity of natural antibody repertoires, producing vast volumes of sequencing data for a fraction of the cost of Sanger sequencing. Sequence clustering is the process of grouping similar sequences into clusters, which reduces sequence redundancy and makes data analysis more straightforward.
In this exercise, you will learn how to view Heavy CDR3 region clusters and identify the most abundant regions associated with them (CDR1, CDR2 and FR1-FR4). To view sequences within a cluster with an identical Heavy CDR3, first select ERR346598 (merged) Annotated & Clustered in the Sequence annotation folder. Then switch table views using the Cluster Table dropdown: click All Sequences and scroll down to select Heavy CDR3.
The table automatically sorts the clusters in descending order, with the most abundant Heavy CDR3 cluster at the top. Select the first in-frame cluster to view the sequences that share the identical Heavy CDR3 sequence “ARWEYYAMDY” (see image below).
**Note that all the sequences within a region cluster consist of identical regions unless specified. For example, when sequences are grouped by Heavy CDR3, all the sequences within a cluster will consist of an identical Heavy CDR3 sequence but they may consist of distinct CDR1 and CDR2, and FR1-FR4 regions. Learn more about clusters here.
The Sequences Table can be used to quickly identify the most frequent FR and CDR clusters for a selected cluster. To view the most abundant regions associated with the selected Heavy CDR3 cluster, scroll to the right of the Sequences Table or use the Focus column button in the Table Preferences panel to quickly navigate to your column of interest.
The Sequences Table demonstrated that the most abundant associated Heavy CDR1 and CDR2 sequences for the Heavy CDR3 “ARWEYYAMDY” cluster were “GFNIKDTY” (94.44%) and “IDPANGNT” (97.22%) respectively (Figure 1.3).
Figure 1.3 | The Heavy CDR3 “ARWEYYAMDY” cluster and its most abundant associated CDR1 and CDR2 sequences.
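The percentages in the Sequences Table can be understood as simple within-cluster frequencies. The Python sketch below shows the calculation on made-up rows; the dictionary keys are hypothetical stand-ins for table columns, the second CDR1 sequence is invented, and the resulting numbers are not those of the tutorial dataset.

```python
from collections import Counter


def associated_region_frequencies(rows, cluster_key, cluster_value, region_key):
    """Rank the values of `region_key` among rows belonging to one cluster.

    Returns a list of (sequence, count, percent of cluster) tuples,
    most frequent first.
    """
    members = [row for row in rows if row[cluster_key] == cluster_value]
    counts = Counter(row[region_key] for row in members)
    total = len(members)
    return [
        (seq, n, round(100 * n / total, 2))
        for seq, n in counts.most_common()
    ]


# Toy rows: four reads share the Heavy CDR3, one read belongs elsewhere.
rows = [
    {"Heavy CDR3": "ARWEYYAMDY", "Heavy CDR1": "GFNIKDTY"},
    {"Heavy CDR3": "ARWEYYAMDY", "Heavy CDR1": "GFNIKDTY"},
    {"Heavy CDR3": "ARWEYYAMDY", "Heavy CDR1": "GYTFTSYW"},
    {"Heavy CDR3": "ARWEYYAMDY", "Heavy CDR1": "GFNIKDTY"},
    {"Heavy CDR3": "ARGGYDY", "Heavy CDR1": "GYTFTSYW"},
]

ranking = associated_region_frequencies(rows, "Heavy CDR3", "ARWEYYAMDY", "Heavy CDR1")
for seq, count, percent in ranking:
    print(seq, count, f"{percent}%")
```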
**Note that you can create custom cluster combinations. Up to six regions or genes (FR1, CDR3, Heavy D gene etc.) can be clustered together based on shared identity across sequences in the regions selected. It is also possible to specify a percent threshold of mismatches allowed and to cluster based on amino acid similarity. To explore these advanced clustering methods, see this article.
NGS data generally comprise a large number of reads, making antibody candidate selection difficult. Sequence filtering, coupled with the assets and liabilities score, may aid in identifying suitable candidates for downstream analyses.
In this exercise, you will learn how to filter the All Sequences table for sequences that meet a set of conditions. To find sequences that are fully annotated, in-frame and without stop codons, and that have a score of at least -100, first right-click a cell in the Without Stop Codons & In Frame & Fully Annotated column and click the Filter syntax. Then right-click a cell in the Score column and click the Filter syntax. Finally, in the Filter box, ensure that the filter syntax is as below and click Filter.
['Without Stop Codons & In Frame & Fully Annotated'] = 'Yes' AND ['Score'] >= -100
A total of 32 sequences that are without stop codons, in-frame and fully annotated, and that have a Score of ≥ -100, were identified (Figure 1.4). A high score suggests low sequence annotation error and a low number of liability sites, such as post-translational modification (PTM) sites.
Figure 1.4 | A total of 32 of the 4211 sequences meet the conditions of having a score ≥ -100 and being fully annotated, in-frame and without stop codons.
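For readers who prefer to see the logic spelled out, the same two conditions can be expressed in plain Python over rows of an exported table. This is only an illustration of the filter's meaning, not of Geneious Biologics' filter engine; the row layout, sequence names and scores are made up, and only the two column names are taken from the filter expression above.

```python
ANNOTATED_COLUMN = "Without Stop Codons & In Frame & Fully Annotated"


def passes_filter(row, min_score=-100):
    """True when the row is flagged 'Yes' and its Score is at least min_score."""
    return row[ANNOTATED_COLUMN] == "Yes" and row["Score"] >= min_score


# Toy rows standing in for the All Sequences table.
rows = [
    {"Name": "seq_1", ANNOTATED_COLUMN: "Yes", "Score": -50},
    {"Name": "seq_2", ANNOTATED_COLUMN: "Yes", "Score": -250},
    {"Name": "seq_3", ANNOTATED_COLUMN: "No", "Score": 0},
    {"Name": "seq_4", ANNOTATED_COLUMN: "Yes", "Score": -100},
]

kept = [row["Name"] for row in rows if passes_filter(row)]
print(kept)
```

Note that a score of exactly -100 passes, because the comparison is "greater than or equal to".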
**Note that you can filter sequences on all of the columns within a Biologics Annotator Result document. Learn more about sequence filtering and filtering using scripts here.
Sequence clustering is commonly used to group highly similar immunoglobulin sequences together with the assumption that their sequence similarity is the result of them sharing the same initial B cell. Re-clustering is the process of grouping sequences sharing a similar region into clusters based on a set threshold. Note that these similarity threshold clusters can also be specified when running Antibody Annotator by using the advanced clustering options.
In this exercise, you will learn how to re-cluster an Antibody Annotator result to produce Heavy CDR3 clusters at 90% similarity. To re-cluster the Heavy CDR3 region, select the ERR346598 (merged) Annotated & Clustered document in the Sequence annotation folder and click Post-processing > Re-cluster.
Select the following options from the Additional Clustering dialog box and click Run to start the analysis (see image below).
This analysis will produce an ERR346598 (merged) Annotated & Clustered (Reclustered) document in the Similarity clustering folder.
To view the re-clustered sequences, select the ERR346598 (merged) Annotated & Clustered (Reclustered) document and select the Heavy CDR3 90 Percent Similarity cluster from the Cluster Table: dropdown.
Prior to re-clustering, a total of 2,421 clusters of Heavy CDR3 were identified (top) and upon re-clustering a total of 2,101 clusters of Heavy CDR3 were identified (bottom) (Figure 1.5).
Figure 1.5 | Sequence re-clustering groups similar sequences into clusters based on a threshold.
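The drop in cluster count happens because near-identical Heavy CDR3 sequences that previously formed separate clusters are now grouped together. The Python sketch below illustrates the effect with a simple greedy scheme. The clustering algorithm used by the Re-cluster operation is not described in this tutorial, so everything here is an assumption: identity is the fraction of matching positions between equal-length sequences, sequences of different lengths never cluster together, and each sequence joins the first sufficiently similar representative, visiting the most abundant sequences first. “ARWEYYAMDF” is an invented variant.

```python
from collections import Counter


def identity(a, b):
    """Fraction of identical positions; 0.0 for sequences of different length."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold=0.9):
    """Greedy similarity clustering (assumed scheme, for illustration only).

    Returns a dict mapping each representative sequence to the list of
    unique sequences assigned to it.
    """
    counts = Counter(sequences)
    clusters = {}
    # Visit unique sequences from most to least abundant.
    for seq, _ in counts.most_common():
        for representative in clusters:
            if identity(seq, representative) >= threshold:
                clusters[representative].append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


cdr3_per_read = ["ARWEYYAMDY"] * 3 + ["ARWEYYAMDF", "ARGGYDY"]
exact_clusters = Counter(cdr3_per_read)
similar_clusters = greedy_cluster(cdr3_per_read, threshold=0.9)
print("Clusters at 100% identity:", len(exact_clusters))
print("Clusters at 90% similarity:", len(similar_clusters))
```

In this toy example the single-residue variant joins the abundant cluster because 9 of its 10 positions match, so three exact clusters become two.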
**Note that the Re-cluster operation essentially produces another Biologics Annotator Result document with the additional cluster table. This new document will not include the Graphs and Pipeline Report options.