Sanger Tutorial 3. Assembling Chromatograms and Flagging Heterozygotes

August 14, 2025 04:49
Updated

In this tutorial, you will learn how to assemble and annotate raw sequences produced by Sanger sequencing, and how to find heterozygous bases. Note: This was previously Sanger Tutorial 1.

This tutorial will cover the following sections:

Sequence Trimming
Find Heterozygotes
Batch Assembly
Sequence Annotation
Sequence and Data Analysis
Further Analysis

Get started: To follow this tutorial, you will need the input data. If you have recently started using Geneious Biologics, your organization may already have the tutorial folders set up as described below. If not, you can still follow this tutorial by first downloading the input sequences here and then uploading them into Geneious Biologics.

The first few videos in our Getting Started series may also be helpful, linked here. Below is our video on Pre-processing Sanger Sequences.

Sequence Trimming

Trimming low-quality ends of sequences is normally performed before assembling a contig. This is because the noise introduced by low-quality regions and vector contamination can produce incorrect assemblies.

In this exercise, you will learn how to trim low-quality bases from both ends of chromatogram sequences. To trim the poor-quality bases, select all the chromatograms in the Input data folder, then click Pre-processing > Trim Ends.

Select the Error Probability Limit option located under Trim By Quality and click Run. This option trims bases from each end until trimming any further bases would improve the error rate by less than the limit (see image below).

This will produce 8 documents in the Trim ends folder, each containing a trimmed sequence that is shorter than its original input sequence.
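To make the idea of an error probability limit concrete, here is a minimal Python sketch of one common approach to quality trimming, the modified Mott algorithm. Whether Geneious Biologics implements exactly this is an assumption; the function names, the example quality scores and the 0.05 limit are illustrative only and do not come from this tutorial.

```python
def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_by_error_limit(quals: list[int], limit: float = 0.05) -> tuple[int, int]:
    """Return (start, end) of the region to keep, as a half-open slice.

    Each base scores (limit - error probability). The kept region is the
    contiguous run with the highest total score (Kadane's algorithm).
    Returns (0, 0) when no base has an error probability below the limit.
    """
    best_sum = 0.0
    best_start = best_end = 0
    run_sum = 0.0
    run_start = 0
    for i, q in enumerate(quals):
        run_sum += limit - phred_to_error(q)
        if run_sum <= 0:
            # The run so far is not worth keeping; start again after this base.
            run_sum = 0.0
            run_start = i + 1
        elif run_sum > best_sum:
            best_sum = run_sum
            best_start, best_end = run_start, i + 1
    return best_start, best_end


# Hypothetical quality scores: poor at both ends, good in the middle.
quals = [5, 8, 30, 35, 40, 38, 9, 6]
start, end = trim_by_error_limit(quals)
trimmed = quals[start:end]
```

With these illustrative scores, the two low-quality bases at each end fall outside the kept slice, while the four high-quality bases in the middle are retained.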

**Note that sequence trimming is not necessary for downstream analysis such as sequence assembly and annotation.

Find Heterozygotes

This section explains how to find heterozygous bases in sequences without first assembling the sequences into a contig. If you are planning to assemble your sequences prior to running the Antibody Annotator, you can skip this section and proceed to the next section.

Heterogeneity in a base sequence appears when a single position within a trace contains more than one peak. Find Heterozygotes identifies heterozygous bases in sequences with trace information by comparing the relative heights of the traces at each peak position.

To annotate heterozygous bases, select all of the chromatograms in the Trim ends folder, click Pre-processing and select Find Heterozygotes in the dropdown.

In the Find Heterozygotes dialog box, set Peak Similarity to 50% and the Action to take option to Annotate, then click Run (see image below).

With the above settings, a base is annotated as heterozygous when an alternative peak is ≥ 50% as high as the best peak. This analysis will produce 8 documents with (found heterozygotes) appended to the end of the document Name.
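The threshold rule can be illustrated with a short Python sketch. It is not the Geneious Biologics implementation: the function name, the dictionary of peak heights and the example values are hypothetical, and the sketch only handles mixtures of two bases. The IUPAC ambiguity codes themselves are standard.

```python
# IUPAC ambiguity codes for two-base mixtures.
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_base(peaks: dict[str, float], similarity: float = 0.5) -> str:
    """Call one position from the four trace heights at a peak.

    Returns an IUPAC ambiguity code when the second-highest peak is at
    least `similarity` times the highest peak; otherwise the top base.
    """
    ranked = sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top_height), (alt_base, alt_height) = ranked[0], ranked[1]
    if top_height > 0 and alt_height >= similarity * top_height:
        return IUPAC_PAIRS[frozenset(top_base + alt_base)]
    return top_base


# Hypothetical trace heights at a single peak position.
mixed = call_base({"A": 600, "C": 20, "G": 1000, "T": 15})
clean = call_base({"A": 400, "C": 20, "G": 1000, "T": 15})
```

In the first call the A peak is 60% as high as the G peak, so the position is called as R (A or G). In the second call the A peak is only 40% as high, so the position stays G.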

A heterozygous base was identified at interval 105 of 310819a_P1_T2_Kappa-2_C2.F1.ab1 heterozygotes. The presence of both an A peak and a G peak at interval 105, with the A peak ≥ 50% as high as the G peak, resulted in the base being annotated as a possible ambiguous R base, as seen below:


**Note that you can also run this operation within the Batch Assemble Sanger Sequences operation. Learn more about heterozygous base identification and annotation here.

Batch Assembly

Sequence assembly is used to align and merge overlapping fragments of a DNA sequence to form contig(s) that can be used to reconstruct the original sequence.

In this exercise, you will learn how to assemble chromatograms (i.e. forward and reverse reads of the same sequence) to form contigs. To assemble Sanger sequencing reads, first select all of the sequences in the Trim ends folder, then click Pre-processing > Batch Assemble Sanger Sequences.

Select the following options in the Batch Assemble Sanger Sequences dialog box (see image below), then click Run to start the analysis:

Batch by Name

Name part: 4th
Name separator: _ (underscore)

Assembly Options

Consensus: call Sanger heterozygotes > 50 %
Save list of unused reads
Generate a contig for each assembly
Output consensus sequences as list

With the settings above, sequences whose 4th name part (when the name is split on underscores) is identical will be matched together (see Example). Consequently, the reads for Heavy-1, Heavy-2, Kappa-1 and Kappa-2 will each be matched together, resulting in 4 individual contig files: Heavy-1 Assembly, Heavy-2 Assembly, Kappa-1 Assembly and Kappa-2 Assembly.
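The Batch by Name grouping can be pictured with a small Python sketch. This only illustrates the naming logic, not how Geneious Biologics implements it. The function name is hypothetical, and only the Kappa-2 forward read name comes from this tutorial; the other read names are made up to follow the same pattern.

```python
from collections import defaultdict


def batch_by_name(names: list[str], part: int = 4, separator: str = "_") -> dict[str, list[str]]:
    """Group read names by their Nth (1-based) separator-delimited part."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        fields = name.split(separator)
        if len(fields) < part:
            raise ValueError(f"{name!r} has fewer than {part} name parts")
        groups[fields[part - 1]].append(name)
    return dict(groups)


# Only the first name appears in the tutorial; the rest are hypothetical.
reads = [
    "310819a_P1_T2_Kappa-2_C2.F1.ab1",
    "310819a_P1_T2_Kappa-2_C2.R1.ab1",
    "310819a_P1_T2_Heavy-1_A1.F1.ab1",
    "310819a_P1_T2_Heavy-1_A1.R1.ab1",
]
groups = batch_by_name(reads)
for key, members in groups.items():
    print(f"{key} Assembly <- {members}")
```

Splitting 310819a_P1_T2_Kappa-2_C2.F1.ab1 on underscores gives 310819a, P1, T2, Kappa-2 and C2.F1.ab1, so the 4th part, Kappa-2, is the key that pairs the forward and reverse reads.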

**Note that for chromatogram assembly, the orientation of fragments will be determined automatically, and they will be reverse complemented where necessary. Learn more about batch assembly by name and how to assemble chromatograms here.
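As an illustration of what reverse complementing a read involves, here is a minimal Python sketch. It does not show how the orientation is detected, and the function name is hypothetical. The complement table uses the standard IUPAC codes, so ambiguous bases such as R are handled as well.

```python
# Complement table covering A/C/G/T plus the IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (uppercased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# An ambiguous R (A or G) becomes Y (C or T) on the opposite strand.
rc = reverse_complement("ACGTR")
```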

Sequence Annotation

The Antibody Annotator is a versatile pipeline that identifies and annotates both NGS-type and Sanger-type sequences in reference to an immunoglobulin reference database.

In this exercise, you will learn how to annotate heavy and light sequences. To annotate heavy and light chain assemblies, select the 310819a_P1_T2 Assembly Consensus Sequences document in the Batch assembly folder and click Annotation > Antibody Annotator.

Select the following options from the Antibody Annotator dialog box (see sections and image below).

Main Options

Select the following options:

Reference database: Human Ig 2022
Selected sequences are: Single chain (either heavy or light)


Analysis Options

Select the following options:

Annotate Numbers (IMGT)
Find liabilities and assets


You can leave all other sections as the default. Click Run to start the analysis.

This operation will produce a 310819a_P1_T2 Assembly Consensus Sequences Annotated & Clustered Biologics Annotator Result document.

Sequence and Data Analysis

This section shows how to use the Sequences Table together with the Sequence Viewer to analyze sequences. Used together, they can speed up candidate selection for downstream analysis such as humanization.

The Sequences Table contains details of each individual sequence, such as chain type, sequence and region lengths, FR and CDR nucleotide and amino acid sequences, and score, to name a few. A full list is given in Exploring the Columns of the All Sequences Table. The Sequence Viewer, in turn, allows you to view the annotated sequences and search for motifs and annotations.

In this exercise, you will learn how to interpret a Biologics Annotator Result document and search for specific motifs within the sequences. First, select the 310819a_P1_T2 Assembly Consensus Sequences Annotated & Clustered document in the Sequence annotation folder to view the Sequences Table. Subsequently, select Kappa-1 Assembly and Kappa-2 Assembly to view the annotated sequences in the Sequence Viewer.

Kappa-2 Assembly has a distinctly lower liability score than Kappa-1 Assembly (-18,120 and -401, respectively), as observed in the Sequences Table and Sequence Viewer below. The presence of multiple ambiguous bases resulted in a truncated CDR1 region.

**Note that liabilities and assets, with their associated scores, are highly customizable. Learn how to customize your own set of sequence liabilities and assets here.

Heterogeneity may result in bases being called as ambiguous, and these ambiguous bases may affect sequence annotation, as observed in the previous analysis. To search for ambiguous bases within the selected sequence, enter this search term in the Filter box:

['Error'] LIKE '%Contamination%'

An ambiguous R base is found at interval 105 of Kappa-2 Assembly below. This contamination annotation is a result of the presence of an alternative base at position 105 of the 310819a_P1_T2_Kappa-2_C2.F1.ab1 chromatogram sequence that was used to generate the Kappa-2 Assembly sequence.
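The Filter box query works on annotations inside Geneious Biologics. For a sequence handled outside the application, a comparable check is to scan it for IUPAC ambiguity codes. The Python sketch below is only an illustration: the function name and the example sequence are hypothetical and are not taken from the tutorial data.

```python
# IUPAC codes that stand for more than one possible base.
AMBIGUITY_CODES = set("RYSWKMBDHVN")


def find_ambiguous(seq: str) -> list[tuple[int, str]]:
    """Return (1-based position, base) for every ambiguous base in seq."""
    return [
        (position, base)
        for position, base in enumerate(seq.upper(), start=1)
        if base in AMBIGUITY_CODES
    ]


# Hypothetical sequence with an R at position 4 and an N at position 7.
hits = find_ambiguous("ACGRTTNA")
```

Positions are reported 1-based so that they line up with the way positions are quoted in this tutorial.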


Further Analysis

The other tools in Geneious Biologics provide a starting point for further analysis.

Antibody Annotator produces clusters by default. To learn more about clusters and how to specify clonotype clusters, see this article: What is a "cluster"?
Filtering is a very powerful tool that allows you to pull out sequences that meet certain metrics you specify: Filtering your sequences
Extract and Re-cluster lets you take a subset of sequences out of an existing Biologics Annotator Result document and make a new document with re-calculated clusters
You can compare two or more Biologics Annotator Result documents from separate experiments, for example to monitor clonal expansion
You can add your Assay Data (ELISA values etc.) to further inform your results: Adding Assay Data to your analysis results
You can align sequences to compare the amino acid diversity across a region or multiple regions: Sequence alignment
You can edit your sequences to introduce point mutations that might increase developability