NGS Tutorial 1. Sequence Analysis (Version 1)

Updated: November 11, 2024 21:48

Note: A new version of this tutorial is available here: NGS Tutorial 1. Sequence Analysis.

It is advisable to read this article to help you get familiarised with Geneious Biologics before proceeding with the following tutorial.

In this tutorial, you will learn how to merge and annotate next-generation sequencing (NGS) reads produced by sequencing variable gene repertoires from immunized mice. You will also learn how to assess antibody repertoire diversity through sequence clustering.

This tutorial will cover the following exercises:

Merge overlapping paired-end NGS reads
Sequence Annotation
Pipeline Report
Graphs
Understanding Sequence Clusters
Sequence Filtering
Extract and Recluster

To start this tutorial, you will need input data. If you have recently started Geneious Biologics, your organization may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the latest input sequences here and then uploading them into Geneious Biologics.

The videos in our Getting Started series may also be helpful, linked here. Below is our video on Pre-processing NGS Sequences.

Read merging

Merging paired reads, also known as overlapping or assembly of read pairs, converts a read pair into a single read containing a sequence and a set of quality scores. A read pair must overlap a significant fraction of its length for the reads to be merged.
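To make the idea of merging concrete, the following Python sketch merges a single read pair. It reverse-complements the second read, searches for the longest acceptable overlap and keeps the higher-quality base call at each overlapping position. It is an illustration only and is not the BBMerge algorithm used by Geneious Biologics. The function names, the min_overlap and max_mismatch_frac parameters and their default values are assumptions made for this example.

```python
# Illustrative sketch of paired-read merging. This is NOT the BBMerge algorithm;
# all names and default thresholds here are assumptions for the example.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with its inward-pointing mate.

    Qualities are lists of integer Phred scores, one per base.
    Returns (sequence, qualities), or None if no acceptable overlap exists.
    """
    rc = reverse_complement(rev)
    rc_qual = rev_qual[::-1]
    # Try the longest overlap first: suffix of the forward read vs prefix of the mate.
    for overlap in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        a, b = fwd[-overlap:], rc[:overlap]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches > max_mismatch_frac * overlap:
            continue
        seq = list(fwd[:-overlap])
        qual = list(fwd_qual[:-overlap])
        for i in range(overlap):
            fq = fwd_qual[len(fwd) - overlap + i]
            rq = rc_qual[i]
            # Keep whichever base call has the higher quality score.
            if fq >= rq:
                seq.append(a[i])
                qual.append(fq)
            else:
                seq.append(b[i])
                qual.append(rq)
        seq.extend(rc[overlap:])
        qual.extend(rc_qual[overlap:])
        return "".join(seq), qual
    return None
```

A pair for which merge_pair returns None corresponds to the "couldn't be merged" case: the reads are paired, but no sufficiently long and clean overlap was found.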

In this exercise you will learn how to merge paired-end Illumina MiSeq reads. Immunoglobulin heavy chain variable regions are approximately 300-350 bp long, and because this example read library was obtained by 250 bp paired-end sequencing, the reads must be merged in order to obtain full-length heavy chain variable region sequences. To merge these paired-end reads, select both of the paired-end documents in the Input data folder and click Pre-processing > Set & Merge Paired Reads (see image below).
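As a quick check on why these reads can be merged, the expected overlap between two inward-pointing reads is twice the read length minus the fragment length. The short Python sketch below applies this to the 250 bp read length and the 300-350 bp range quoted above. The function name is hypothetical, and the calculation assumes that each read runs from one end of the fragment.

```python
def expected_overlap(read_length, fragment_length):
    """Bases covered by both reads of an inward-pointing pair (0 if they do not meet)."""
    return max(0, 2 * read_length - fragment_length)


for fragment_length in (300, 350):
    # Prints 200 for a 300 bp fragment and 150 for a 350 bp fragment.
    print(fragment_length, expected_overlap(250, fragment_length))
```

An overlap of 150-200 bp is a large fraction of each 250 bp read, which is why most pairs in a library like this are expected to merge.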

As the read libraries are paired-end, select the following options in the Set & Merge Paired Reads dialog box and click Run to start the analysis (see sections and image below).

Pair By

Pairs of lists

Relative Orientation

Forward/Reverse (inward pointing, e.g. Illumina paired end)

Output Format

Set and merge paired reads using BBMerge

Once the operation is complete, two new documents are generated in the Set and merge paired reads folder: ERR346598 (merged) and ERR346598 (couldn't be merged). The ERR346598 (merged) document contains the reads that were successfully paired and merged, while the ERR346598 (couldn't be merged) document contains the reads that were paired but could not be merged.

**Note that the numbers of merged and unmerged reads depend on read quality, and that increasing the merge rate may result in more false-positive merges. Read more on Set & Merge Paired Reads here.

Sequence Annotation

The Antibody Annotator identifies immunoglobulin framework regions (FRs), complementarity-determining regions (CDRs) and V(D)JC genes, and annotates input sequences against a selected reference database.

In this exercise, you will learn how to annotate mouse variable heavy immunoglobulin genes produced by PCR amplification and how to analyze the results with the help of the Pipeline Report and Graphs. To annotate these heavy IgG genes, select the ERR346598 (merged) document in the Set and merge paired reads folder and click Annotation > Antibody Annotator (see image below).

Select the following options from the Antibody Annotator dialog box and click Run to start the analysis (see sections and image below).

Input Options

Reference database: Mouse Ig 2022
Selected sequences are: Single chain (heavy)

Analysis Options

Include pseudo genes from database
Find liabilities and assets ***

This operation will produce an ERR346598 (merged) Annotated & Clustered Biologics Annotator Result document in the Sequence annotation folder.

**Note that the bundled IgG reference databases are split into light and heavy sections. If the sequence type (the Selected sequences are: option) is specified for a sequence, only the appropriate database section is used, thus improving performance and potentially annotation accuracy. Read more about the Antibody Annotator here.

*** The "Find liabilities" and "Annotate germline gene differences" options can cause delays on large datasets. If you are analyzing more than 1 million sequences, it is recommended that you leave these options off unless absolutely necessary. The liabilities and assets box is also customisable: How to Customize Sequence Liabilities and Assets.

Pipeline Report

Below is our introductory video on how to understand your results using the Graphs tab and the Pipeline Report tab.

A Pipeline Report is generated for every Biologics Annotator Result document. This report summarizes the Antibody Annotator analysis, including the annotation rate of the input data, the region cluster diversity and the gene mutation distribution.

In the following section, we will determine how well the sequences are annotated. To get a quick overview of the Antibody Annotator analysis, select the ERR346598 (merged) Annotated & Clustered document and click Pipeline Report.

Approximately 88% of the sequences were identified as Heavy Chain. These sequences contain all of the framework regions (FRs) and CDRs, and are fully annotated, in-frame and without stop codons.

Another graph that is particularly useful for quality control is the Mutation distribution by gene plot found in the Pipeline Report.

If the mutation values hover around or above 25%, this can indicate that an incorrect germline reference database was used. Here the values are generally distributed around 5-20%, indicating that this dataset is not very divergent from the germline.
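The quality-control check described above can be sketched in Python as follows. This is a minimal illustration that assumes each sequence has already been aligned to its closest germline gene. How Geneious Biologics itself computes the values in the plot is not described here, and the function names and gene labels are hypothetical. Only the 25% threshold is taken from the text.

```python
def percent_mutation(query, germline):
    """Percentage of aligned, non-gap positions at which query differs from germline.

    Assumes both strings are already aligned to the same length, with '-' for gaps.
    """
    if len(query) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for q, g in zip(query, germline):
        if q == "-" or g == "-":
            continue  # skip gap columns
        compared += 1
        if q != g:
            mismatches += 1
    return 100.0 * mismatches / compared if compared else 0.0


def flag_divergent(rates_by_gene, threshold=25.0):
    """Return the genes whose mutation percentage is at or above the threshold."""
    return {gene: rate for gene, rate in rates_by_gene.items() if rate >= threshold}
```

If most genes were flagged by a check like this, it would be worth confirming that the reference database matches the species and chain type of the input data.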

To export the Pipeline Report as a PDF document, click Export to PDF.

Graphs

The Graphs option is available for every Biologics Annotator Result document. It provides a collection of graphs that are derived from the Antibody Annotator analysis.

In the following sections, we will learn more about clusters and assess the cluster diversity of the Heavy CDR3 region. Our main article on clusters may also be useful: Understanding "Clusters". The immunoglobulin CDR3 region has been reported to contribute to antibody diversity and, for this reason, CDR3 sequences have been widely used as unique identifiers. To assess the Heavy CDR3 cluster diversity, first select the ERR346598 (merged) Annotated & Clustered document and click Graphs. Then click Graph Type: Annotations rates and select Cluster diversity in the dropdown. Finally, select Heavy CDR3 in the Filter By: dropdown (see image below).

**Note that these graphs can be exported as image (png) or table (csv) files that can be used for publication or as laboratory documentation. To export a graph, click Export.

The three CDRs, which interact with antigen, are more diverse than the FRs. Among the CDRs, CDR3 varies the most. The CDR cluster diversity and cluster length graphs provide a quick, indirect comparison of the CDR clusters.

The Heavy CDR cluster diversity graphs show that Heavy CDR3 is the CDR with the highest cluster diversity, with approximately 2,400 clusters, while Heavy CDR1 and CDR2 consist of approximately 560 and 840 clusters respectively. Additionally, the majority of the Heavy CDR3 clusters in this dataset consist of a single unique CDR3 amino acid sequence, suggesting high sequence diversity. The Heavy CDR3 diversity is also reflected in its cluster lengths: the top 5 cluster lengths range from 10 to 14 amino acids, whereas the majority of Heavy CDR1 and CDR2 sequences are 8 amino acids long, as shown in the Heavy CDR length graphs (Figure 1).

Figure 1 | Heavy CDRs cluster diversity and cluster length distribution. The graphs on the left show the CDR cluster diversity and the graphs on the right show the CDR cluster length.
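The two kinds of graph in Figure 1 can be approximated with a few lines of Python. The sketch below treats each distinct CDR amino acid sequence as one cluster and counts how many reads fall in each cluster and at each length. The function names and the toy input list are assumptions for illustration. The list reuses two CDR3 sequences that appear in this tutorial plus one made-up sequence, and is not the tutorial dataset.

```python
from collections import Counter


def cluster_diversity(cdr_sequences):
    """Map each distinct CDR sequence (one cluster) to its number of reads."""
    return Counter(seq for seq in cdr_sequences if seq)


def length_distribution(cdr_sequences):
    """Map each CDR length in amino acids to its number of reads."""
    return Counter(len(seq) for seq in cdr_sequences if seq)


# Toy input: the last non-empty sequence is made up; empty strings are unannotated reads.
cdr3s = ["ARWEYYAMDY", "ARWEYYAMDY", "ARDYGSSHFDH", "ARGGY", ""]

clusters = cluster_diversity(cdr3s)
single_read_clusters = sum(1 for reads in clusters.values() if reads == 1)
print(len(clusters), "clusters,", single_read_clusters, "containing a single read")
print(sorted(length_distribution(cdr3s).items()))
```

A region with many clusters relative to its number of reads, and with a broad spread of lengths, is more diverse than one dominated by a few clusters of a single length.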

Sequence Clusters

Below is our introductory video on how to understand and find clusters in your Antibody Annotator result document. Our main article on clusters may also be useful: Understanding "Clusters".

Next-generation sequencing enables the discovery of the great diversity of natural antibody repertoires, generating vast volumes of sequencing data for a fraction of the cost of Sanger sequencing. Sequence clustering is the process of grouping similar sequences into clusters, which reduces sequence redundancy and makes data analysis more straightforward.

In this exercise you will learn how to view Heavy CDR3 region clusters and identify the most abundant associated regions (CDR1, CDR2 and FR1-FR4). To view sequences within a cluster with identical Heavy CDR3s, first select ERR346598 (merged) Annotated & Clustered in the Sequence annotation folder. Then switch table views using the Cluster Table dropdown: click All Sequences and scroll down to select Heavy CDR3.

The table will automatically sort the clusters in descending order, with the most abundant Heavy CDR3 cluster at the top. Select the first in-frame cluster to view the sequences that share the identical Heavy CDR3 sequence “ARWEYYAMDY” (see image below).

**Note that all the sequences within a region cluster have identical sequences in that region unless otherwise specified. For example, when sequences are grouped by Heavy CDR3, all the sequences within a cluster will have an identical Heavy CDR3 sequence, but they may have distinct CDR1, CDR2 and FR1-FR4 regions.

The Sequences Table can be used to quickly identify the most frequent FR and CDR clusters for a selected cluster. To view the most abundant regions associated with the selected Heavy CDR3 cluster, scroll to the right of the Sequences Table or hover over any column in the Table Preferences panel to bring up the Focus column button which allows you to quickly navigate to your column of interest.

The Sequences Table shows that the most abundant associated Heavy CDR1 and CDR2 sequences for the Heavy CDR3 “ARWEYYAMDY” cluster are “GFNIKDTY” (94.44%) and “IDPANGNT” (97.22%) respectively (Figure 2).

Figure 2 | The Heavy CDR3 “ARWEYYAMDY” cluster and its most abundant associated CDR1 and CDR2 sequences.
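The logic behind the cluster table and the percentages in Figure 2 can be sketched in Python: group the sequences by one region, sort the groups by size, and then count the most frequent value of another region within a group. The record layout (dictionaries with cdr1, cdr2 and cdr3 keys) and the function names are assumptions for illustration and do not reflect how Geneious Biologics stores its data.

```python
from collections import Counter, defaultdict


def cluster_by(records, region):
    """Group records that share an identical value for `region`, largest cluster first."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record[region]].append(record)
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


def most_abundant(members, region):
    """Return the most frequent value of `region` in a cluster and its percentage."""
    counts = Counter(record[region] for record in members)
    value, count = counts.most_common(1)[0]
    return value, 100.0 * count / len(members)


# Toy records: sequences are taken from this tutorial except "GFNIKDYY", which is made up.
records = [
    {"cdr1": "GFNIKDTY", "cdr2": "IDPANGNT", "cdr3": "ARWEYYAMDY"},
    {"cdr1": "GFNIKDTY", "cdr2": "IDPANGNT", "cdr3": "ARWEYYAMDY"},
    {"cdr1": "GFNIKDTY", "cdr2": "IDPANGNT", "cdr3": "ARWEYYAMDY"},
    {"cdr1": "GFNIKDYY", "cdr2": "IDPANGNT", "cdr3": "ARWEYYAMDY"},
    {"cdr1": "GFNIKDTY", "cdr2": "IDPANGNT", "cdr3": "ARDYGSSHFDH"},
]

top_cdr3, members = cluster_by(records, "cdr3")[0]
print(top_cdr3, most_abundant(members, "cdr1"), most_abundant(members, "cdr2"))
```

In this toy input, three of the four sequences in the largest Heavy CDR3 cluster share the same CDR1, so that CDR1 is reported at 75% for the cluster.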

**Note that you can create custom cluster combinations. Up to six regions or genes (FR1, CDR3, Heavy D gene etc.) can be clustered together based on shared identity across sequences in the regions selected. It is also possible to specify a percent threshold of mismatches allowed and to cluster based on amino acid similarity. To explore these advanced clustering methods, see Clustering Options.

Sequence Filtering

NGS data generally comprises a large number of reads, making antibody candidate selection difficult. Sequence filtering, coupled with the liabilities and assets score, may aid in identifying suitable candidates for further downstream analyses.

In this exercise, you will learn how to filter the All Sequences table for sequences that meet a set of conditions. To keep only the sequences that are fully annotated, in-frame and without stop codons, and that have a score greater than or equal to -100, first right-click a cell in the Without Stop Codons & In Frame & Fully Annotated column and click the filter syntax. Then right-click a cell in the Score column and click the filter syntax. Finally, in the Filter box, ensure that the filter syntax matches the line below and click Filter.

['Without Stop Codons & In Frame & Fully Annotated'] = 'Yes' AND ['Score'] >= -100

A total of 383 sequences that are without stop codons, in-frame and fully annotated, and that have a Score of ≥ -100, were identified (Figure 3). A high score suggests few sequence annotation errors and few liability sites, such as post-translational modification (PTM) sites.

Figure 3 | A total of 383 of the 4208 sequences meet the conditions of having a score ≥ -100 and are fully annotated, in-frame and without stop codons.
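For readers who export the table and want to reproduce the filter outside Geneious Biologics, the same two conditions can be written as a Python predicate. This is a sketch that assumes the table has been loaded as a list of dictionaries keyed by the column names used in the filter syntax above. The function name and the example rows are made up for illustration.

```python
FLAG_COLUMN = "Without Stop Codons & In Frame & Fully Annotated"


def passes_filter(row, min_score=-100):
    """Mirror of the filter: the flag column equals 'Yes' AND Score >= min_score."""
    return row[FLAG_COLUMN] == "Yes" and row["Score"] >= min_score


# Made-up example rows; only the column names come from the tutorial.
rows = [
    {FLAG_COLUMN: "Yes", "Score": -40},
    {FLAG_COLUMN: "Yes", "Score": -100},
    {FLAG_COLUMN: "Yes", "Score": -250},
    {FLAG_COLUMN: "No", "Score": 0},
]

kept = [row for row in rows if passes_filter(row)]
print(len(kept), "of", len(rows), "rows pass the filter")
```

Because the comparison is greater than or equal to, a row with a Score of exactly -100 is kept.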

**Note that you can filter sequences on all of the columns within a Biologics Annotator Result document. Learn more about sequence filtering here: Filtering your Sequences.

Extract and Recluster

Sequence clustering is commonly used to group highly similar immunoglobulin sequences together, on the assumption that their sequence similarity results from their descent from the same initial B cell. You may want to take a subset of your sequences and specify more clusters, or simply recalculate existing clusters, to identify trends within subsets of your sequences.

In this exercise, you will extract and recluster the filtered sequences from the previous exercise into a new Antibody Annotator result, with added clustering on the Heavy V Gene, Heavy J Gene and Heavy CDR3 region with 85% similarity. This clustering represents grouping into clonotypes. See Understanding Clonotypes and how to find them in your data for more information.

Ensure that the All Sequences table is still filtered on:

['Without Stop Codons & In Frame & Fully Annotated'] = 'Yes' AND ['Score'] >= -100

Then select all of the filtered sequences, go to the Export/Extract dropdown and select Extract and Recluster.

Click the blue "plus" icon in the Extract and Recluster menu below to add a new cluster.

This will open a box where you can add clusters. To add a new multi-region and/or similarity threshold cluster click the advanced tab. Hold down command (Mac) or control (PC) to select three regions: Heavy CDR3, Heavy V Gene and Heavy J Gene. Then select the following options:

Cluster By: Amino Acids
Cluster Method: Similarity (BLOSUM)
Similarity Threshold: 85%
Allow Mismatches in: Heavy CDR3

Add the three-region cluster. This will group together sequences that are identified as being from the same V and J genes and that have at least 85% sequence similarity in the Heavy CDR3 region (as judged by the BLOSUM matrix). Click Run to produce the new Biologics Annotator Result document.
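The grouping described above can be illustrated with a simplified Python sketch. Two substitutions are made and should be kept in mind: similarity is measured here as plain percent identity between CDR3 sequences of equal length, not with a BLOSUM matrix, and clusters are built greedily against the first member of each cluster. Neither is claimed to match the Geneious Biologics implementation. The record keys, function names and gene labels are hypothetical.

```python
from collections import defaultdict


def identity(a, b):
    """Fraction of identical positions; 0.0 if lengths differ or a sequence is empty.

    Stand-in for BLOSUM-based similarity, used only to keep the sketch short.
    """
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def clonotypes(records, threshold=0.85):
    """Group records by identical V and J gene, then by CDR3 similarity >= threshold."""
    by_genes = defaultdict(list)
    for record in records:
        by_genes[(record["v_gene"], record["j_gene"])].append(record)

    clusters = []
    for members in by_genes.values():
        gene_clusters = []
        for record in members:
            # Greedy assignment: join the first cluster whose representative is similar enough.
            for cluster in gene_clusters:
                if identity(record["cdr3"], cluster[0]["cdr3"]) >= threshold:
                    cluster.append(record)
                    break
            else:
                gene_clusters.append([record])
        clusters.extend(gene_clusters)
    return clusters
```

With this sketch, two sequences with the same V and J genes whose 11-residue CDR3s differ at one position (10/11, about 91% identity) fall into the same cluster, while a sequence with the same CDR3 but a different V gene starts a cluster of its own.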

To view the re-clustered sequences, select the 383 nucleotide sequences Annotated & Clustered document. First, we can check what the most common Heavy CDR3 sequence is by selecting Heavy CDR3 from the Cluster Table dropdown menu.

Recall from above that the most common Heavy CDR3 sequence was ARWEYYAMDY. After taking the subset of sequences that were Without Stop Codons & In Frame & Fully Annotated and had a score of at least -100, the most common Heavy CDR3 region sequence is ARDYGSSHFDH.

We can also inspect the combination cluster that we specified for identifying clonotypes (Heavy V Gene, Heavy J Gene and 85% similarity in the Heavy CDR3 region). All 11 sequences that had the same Heavy CDR3 sequence of ARDYGSSHFDH also had the same Heavy V and J genes identified as the closest germline gene match.

All the graphs mentioned previously in this tutorial are also available, and will be recalculated according to the subset of sequences and the clusters specified.

*** Tutorial reference: Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice. BMC Immunol. 2014 Oct 16;15:40.