In this tutorial, you will learn how to assemble and annotate raw sequences produced by Sanger sequencing. You will also learn how to find heterozygous bases.
This tutorial will cover the following sections:
Get started: To start this tutorial, you will need the input data. If you have recently started using Geneious Biologics, your organization may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the input sequences here and then uploading them into Geneious Biologics.
The first few videos in our Getting Started series may also be helpful, linked here. Below is our video on Pre-processing Sanger Sequences.
Trimming low-quality ends of sequences is normally performed before assembling a contig. This is because the noise introduced by low-quality regions and vector contamination can produce incorrect assemblies.
In this exercise, you will learn how to trim low-quality bases from both ends of chromatogram sequences. To trim the poor-quality bases off the ends of the sequences, select all the chromatograms in the Input data folder, then click Pre-processing > Trim Ends.
Select the Error Probability Limit option located under Trim By Quality and click Run. This option trims bases from the ends until trimming further bases would improve the error rate by less than the limit (see image below).
This will produce 8 documents in the Trim ends folder, in which the sequences have been trimmed and are shorter than the original input sequences.
**Note that sequence trimming is not necessary for downstream analysis such as sequence assembly and annotation.
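To make the Error Probability Limit idea more concrete, here is a minimal Python sketch of one common way to trim by quality: score each base as the limit minus its error probability, then keep the highest-scoring contiguous stretch (a maximum-scoring-segment approach, sometimes described as a modified Mott algorithm). This is an illustration under stated assumptions, not the Geneious Biologics implementation. The function names, the 0.05 limit and the quality values are all hypothetical.

```python
from typing import List, Tuple


def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def trim_by_error_limit(quals: List[int], limit: float = 0.05) -> Tuple[int, int]:
    """Return (start, end) of the region to keep, with end exclusive.

    Each base scores `limit - error probability`. The kept region is the
    contiguous run with the highest total score, so low-quality ends whose
    removal raises the total are trimmed away. Returns (0, 0) if no base
    is good enough to keep. The default limit is an assumed value.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += limit - phred_to_error(q)
        if run_sum <= 0:
            # This prefix only hurts the score, so restart after it.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end


# Hypothetical Phred scores: poor ends, good middle.
quals = [5, 8, 30, 35, 40, 38, 32, 9, 4]
print(trim_by_error_limit(quals))  # (2, 7)
```

In this made-up read, the two leading and two trailing bases are trimmed and the five high-quality bases in the middle are kept.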
This section explains how to find heterozygous bases in sequences without first assembling the sequences into a contig. If you are planning to assemble your sequences prior to running the Antibody Annotator, you can skip this section and proceed to the next section.
Heterogeneity in a base sequence occurs when a single position within a trace contains more than one peak. The Find Heterozygotes operation identifies heterozygotes in sequences with trace information by comparing the relative heights of the traces at each peak position.
In this exercise, you will learn how to identify heterozygous bases in chromatograms. To annotate heterozygous bases, select all of the chromatograms in the Trim ends folder, click Pre-processing and select Find Heterozygotes in the dropdown.
Set the Peak Similarity to 50%, the Action to take option to Annotate in the Find Heterozygotes dialog box, and click Run (see image below).
With the above settings, when an alternative peak is ≥ 50% as high as the best peak, this base will be annotated as heterozygous. This analysis will produce 8 documents with (found heterozygotes) appended to the end of the document Name.
A heterozygous base was identified at interval 105 of 310819a_P1_T2_Kappa-2_C2.F1.ab1 heterozygotes. The presence of an A and G at interval 105 where A is ≥ 50% as high as G resulted in a possible ambiguous R base (Figure 1.1).
Figure 1.1 | A possible heterozygous base was identified in interval 105 of the 310819a_P1_T2_Kappa-2_C2.F1.ab1 heterozygotes.
**Note that you can also run this operation within the Batch Assemble Sanger Sequences operation. Learn more about heterozygous base identification and annotation here.
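The Peak Similarity rule used in this section can be sketched in a few lines of Python. This is a simplified illustration, not the Geneious Biologics algorithm: the function name and the peak heights are hypothetical, and only the two tallest peaks at a position are compared. The two-base ambiguity codes are the standard IUPAC ones.

```python
from typing import Dict

# Standard IUPAC codes for two-base ambiguities.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_base(peaks: Dict[str, float], similarity: float = 0.5) -> str:
    """Call a base from the trace heights at one peak position.

    If the second-highest peak is at least `similarity` times the height
    of the best peak, return the IUPAC ambiguity code for the two bases;
    otherwise return the base with the best peak.
    """
    ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
    (best_base, best_height), (second_base, second_height) = ranked[0], ranked[1]
    if best_height > 0 and second_height >= similarity * best_height:
        return IUPAC_TWO_BASE[frozenset((best_base, second_base))]
    return best_base


# Hypothetical heights: the A peak is 62% as high as the G peak.
print(call_base({"A": 620, "C": 15, "G": 1000, "T": 10}))  # R
# Here the A peak is only 30% as high as G, so G is called.
print(call_base({"A": 300, "C": 15, "G": 1000, "T": 10}))  # G
```

The first call mirrors the situation at interval 105, where an A peak at least 50% as high as the G peak yields an ambiguous R.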
Sequence assembly is used to align and merge overlapping fragments of a DNA sequence to form contig(s) that can be used to reconstruct the original sequence.
In this exercise, you will learn how to assemble chromatograms (i.e. forward and reverse reads of the same sequence) to form contigs. To assemble Sanger sequencing reads, first select all of the sequences in the Trim ends folder, then click Pre-processing > Batch Assemble Sanger Sequences.
Select the following options from the Batch Assemble Sanger Sequences dialog box, then click Run to start the analysis (see image below):
- Batch by Name
- Name part: 4th
- Name separator: _ (underscore)
- Consensus: call Sanger heterozygotes > 50 %
- Save list of unused reads
- Generate a contig for each assembly
- Output consensus sequences as list
In the example above, sequences whose names share an identical 4th part when split on underscores will be matched together (see Example). Consequently, the reads for Heavy-1, Heavy-2, Kappa-1 and Kappa-2 will each be grouped and assembled separately, resulting in 4 individual contig files: Heavy-1 Assembly, Heavy-2 Assembly, Kappa-1 Assembly and Kappa-2 Assembly.
**Note that for chromatogram assembly, the orientation of fragments will be determined automatically, and they will be reverse complemented where necessary. Learn more about batch assembly by name and how to assemble chromatograms here.
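The Batch by Name grouping can be pictured with a short Python sketch that splits each read name on the separator and groups reads by the chosen name part. The function is a hypothetical illustration, not part of Geneious Biologics. Only the first file name below comes from this tutorial; the other three are made-up examples that follow the same pattern.

```python
from collections import defaultdict
from typing import Dict, List


def batch_by_name(names: List[str], part: int = 4, separator: str = "_") -> Dict[str, List[str]]:
    """Group read names by their Nth part (1-based) when split on `separator`."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        fields = name.split(separator)
        if len(fields) < part:
            continue  # assumption: names with too few parts are left unbatched
        groups[fields[part - 1]].append(name)
    return dict(groups)


reads = [
    "310819a_P1_T2_Kappa-2_C2.F1.ab1",  # from the tutorial
    "310819a_P1_T2_Kappa-2_C2.R1.ab1",  # hypothetical reverse read
    "310819a_P1_T2_Heavy-1_A1.F1.ab1",  # hypothetical
    "310819a_P1_T2_Heavy-1_A1.R1.ab1",  # hypothetical
]
groups = batch_by_name(reads)
for key, members in groups.items():
    print(key, members)
```

Each key (for example Kappa-2) corresponds to one batch of reads that would be assembled into its own contig.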
The Antibody Annotator is a versatile pipeline that identifies and annotates both NGS-type and Sanger-type sequences in reference to an immunoglobulin reference database.
In this exercise, you will learn how to annotate heavy and light sequences. To annotate heavy and light chain assemblies, select the 310819a_P1_T2 Assembly Consensus Sequences document in the Batch assembly folder and click Annotation > Antibody Annotator.
Select the following options from the Antibody Annotator dialog box (see sections and image below).
Select the following options:
- Reference database: Human Ig 2022
- Selected sequences are: Single chain (either heavy or light)
Select the following options:
- Annotate Numbers (IMGT)
- Find liabilities and assets
You can leave all other sections as the default. Click Run to start the analysis.
This operation will produce a 310819a_P1_T2 Assembly Consensus Sequences Annotated & Clustered Biologics Annotator Result document.
Sequence and Data Analysis
This section demonstrates how to use the Sequences Table together with the Sequence Viewer to analyze sequences. Used together, they can aid rapid candidate selection for downstream analysis such as humanization.
The Sequences Table contains details of each individual sequence such as chain type, sequence and region lengths, FR and CDR nucleotide and amino acid sequences, and score, to name a few. The Sequence Viewer, in turn, allows you to view the annotated sequences and search for motifs and annotations.
In this exercise, you will learn how to interpret a Biologics Annotator Result document and search for specific motifs within the sequences. First, select the 310819a_P1_T2 Assembly Consensus Sequences Annotated & Clustered document in the Sequence annotation folder to view the Sequences Table. Subsequently, select Kappa-1 Assembly and Kappa-2 Assembly to view the annotated sequences in the Sequence Viewer.
Kappa-2 Assembly has a distinctly lower liability score compared to Kappa-1 Assembly (-18,120 and -401 respectively) as observed in the Sequences Table and Sequence Viewer (Figure 1.2). The presence of multiple ambiguous bases resulted in a truncated CDR1 region.
Figure 1.2 | Kappa-1 Assembly and Kappa-2 Assembly sequence annotation. Truncation of the CDR1 region in Kappa-2 and the presence of multiple ambiguous bases and stop codons throughout contributed towards its low liability score.
**Note that liabilities and assets with their associated scores are highly customizable. Learn how to customize your own set of sequence liabilities and assets here.
Heterogeneity may result in bases being called as ambiguous, and these ambiguous bases may affect sequence annotation, as observed in the previous analysis (Figure 1.2). To search for ambiguous bases within the selected sequence, enter “Type:Contamination” (without the quotes) in the Find text box within the Sequence Viewer and press Enter.
An ambiguous R base is found at interval 105 of Kappa-2 Assembly (Figure 1.3). This contamination annotation is a result of the presence of an alternative base at position 105 of the 310819a_P1_T2_Kappa-2_C2.F1.ab1 chromatogram sequence that was used to generate the Kappa-2 Assembly sequence (Figure 1.3B).
Figure 1.3 | An ambiguous base is detected in the Kappa-2 Assembly sequence. (A) Kappa-1 and Kappa-2 Assembly Sequence Annotation & Clustering document showing the annotated ambiguous R base. (B) The Kappa-2 Assembly sequence, an output of the Batch Assembly function. The presence of an A peak that is ≥ 50% the height of the G peak resulted in a heterozygous base being called at interval 105.
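If you export a consensus sequence as plain text, a similar check for ambiguous bases can be done outside the Sequence Viewer. The Python sketch below is a hypothetical helper, not what the Type:Contamination search does internally: it simply reports the 1-based positions of IUPAC ambiguity codes in a nucleotide string. The example sequence is made up.

```python
from typing import List, Tuple

# IUPAC nucleotide codes that stand for more than one possible base.
AMBIGUITY_CODES = set("RYSWKMBDHVN")


def find_ambiguous(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, code) for every ambiguous base in `sequence`."""
    return [
        (position, base)
        for position, base in enumerate(sequence.upper(), start=1)
        if base in AMBIGUITY_CODES
    ]


# Hypothetical short sequence with an R at position 4 and an N at position 7.
print(find_ambiguous("ACGRTTN"))  # [(4, 'R'), (7, 'N')]
```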
**Note that you can also use the Find operation within the Sequence Viewer to search for nucleotide and amino acid motifs. Read more on motif search here.