This article uses typical raw sequence data from a Sanger sequencing run to show how to edit and align chromatograms for downstream Biologics annotator analysis. It covers assembling Sanger sequences with traces, building consensus sequences, and calling heterozygous bases.
Note: The "Set and Merge Paired Reads" preprocessing operation can also merge sequences. It uses BBMerge and can be less accurate than batch assembly, but it can be run on very large datasets and is therefore better suited to high-throughput or NGS use.
Select all the raw sequences and select Batch Assemble Sanger Sequences in the dropdown.
Assemble by Sequence Name
You can assemble sequences based on their names. Enter an appropriate separator, then choose the name part that corresponds to a unique identifier; sequences that share that identifier are assembled together. In the example below, this means selecting 4th in the Name part dropdown and entering "_" (underscore) as the Name separator, so that the selected sequences are assembled by matching well IDs. In the example, sequences are matched on the 4th part of the sequence name when the name is split on underscores.
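The grouping logic described above can be illustrated with a short Python sketch. It is not the tool's implementation; the function name `group_by_name_part` and the read names are hypothetical, chosen only so that the well ID sits in the 4th position.

```python
from collections import defaultdict


def group_by_name_part(names, separator="_", part_index=4):
    """Group sequence names by the part at 1-based position `part_index`.

    Names with fewer parts than `part_index` are collected under the key None.
    """
    groups = defaultdict(list)
    for name in names:
        parts = name.split(separator)
        if len(parts) < part_index:
            groups[None].append(name)
        else:
            groups[parts[part_index - 1]].append(name)
    return dict(groups)


# Hypothetical read names: the 4th "_"-separated part is the well ID.
names = [
    "Run1_Plate1_VH_A01_F.ab1",
    "Run1_Plate1_VH_A01_R.ab1",
    "Run1_Plate1_VH_B02_F.ab1",
    "Run1_Plate1_VH_B02_R.ab1",
]
groups = group_by_name_part(names, separator="_", part_index=4)
for well_id, members in groups.items():
    print(well_id, members)
```

With these names, the forward and reverse reads from well A01 fall into one group and those from B02 into another, which mirrors how reads sharing an identifier would be assembled together.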
Assemble by Name Scheme
Sequences can be assembled by matching common parts of the sequence names to identify which sequences to assemble together. This involves using a name scheme defined by an administrator of your organization. The name scheme will be applied to the sequence names to extract fields defined in the name scheme, such as the Common Identifier name, chain and sequencing direction. Assembly will be carried out on a combination of the Common Identifier and Chain fields (if present). For more information about name schemes, see What Is a Name Scheme and Why Is It Useful?
Assemble by List
Assembling sequences by list is useful if your sequences are interlaced and have been grouped into a list. The assembly will be carried out on each pair of sequences, starting from the first, e.g. sequences 1 and 2 will be assembled together, sequences 3 and 4 will be assembled together, and so forth.
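As a rough illustration of the pairing order, the Python sketch below pairs consecutive entries of a list. The function name `pair_consecutive` and the read names are hypothetical, and returning an unpaired trailing read separately is an assumption of this sketch; the article does not state how the operation treats an odd number of sequences.

```python
def pair_consecutive(sequences):
    """Pair items 1 and 2, 3 and 4, and so on.

    Returns (pairs, leftover), where leftover is the final item when the
    list has an odd length, otherwise None. This handling is an assumption.
    """
    even_length = len(sequences) - (len(sequences) % 2)
    pairs = [
        (sequences[i], sequences[i + 1])
        for i in range(0, even_length, 2)
    ]
    leftover = sequences[-1] if len(sequences) % 2 else None
    return pairs, leftover


# Hypothetical interlaced list: forward read followed by its reverse read.
reads = ["A01_F", "A01_R", "B02_F", "B02_R"]
pairs, leftover = pair_consecutive(reads)
print(pairs)
print(leftover)
```

Because pairing depends only on position, a single missing or misplaced read shifts every pair after it, so it is worth checking the list order before assembling.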
Click Run to start the operation. This operation with the above settings should produce 6 Contigs, 6 consensus sequences and an assembly report.
To call heterozygous bases, select Consensus: call Sanger heterozygotes in the Batch Assembly options and enter 50%. To learn more about heterozygous base calling and annotation, please refer to this article.
Select an output Assembly contig to check whether heterozygous bases are present in the assembled sequences.
To quickly identify potential heterozygous bases, click Zoom out to full View in the Sequence Viewer and, in the sidebar, ensure that Annotations > trimmed, Highlightings, Consensus and Graphs > Pairwise identity are selected. Look for low points in the Identity graph, then select the region around position 900-950 bp and zoom in. A heterozygous base, M, is called at position 923 because the second peak is at least 50% of the height of the first peak.
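The 50% rule can be sketched in a few lines of Python. This is a simplified illustration rather than the tool's algorithm: the function name `call_base` and the peak heights are hypothetical, and real trace data would need quality and trimming checks that are not shown here. The two-base IUPAC ambiguity codes are standard, with M standing for A or C.

```python
# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC_TWO_BASE = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def call_base(peaks, threshold=0.5):
    """Call one position from a dict of base -> peak height.

    If the second-highest peak is at least `threshold` times the highest
    peak, return the IUPAC ambiguity code for the two bases; otherwise
    return the base with the highest peak.
    """
    ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
    (first_base, first_height), (second_base, second_height) = ranked[0], ranked[1]
    if first_height > 0 and second_height >= threshold * first_height:
        return IUPAC_TWO_BASE[frozenset((first_base, second_base))]
    return first_base


# Hypothetical peak heights at a single position.
mixed = call_base({"A": 1000, "C": 620, "G": 30, "T": 10})
clean = call_base({"A": 1000, "C": 400, "G": 30, "T": 10})
print(mixed, clean)
```

In the first call the C peak is 62% of the A peak, so the position is reported as M; in the second it is only 40%, so the position stays A.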
Viewing consensus sequences
To view the consensus sequences of all assembled sequences, select the Assembly Consensus Sequences document output. This is the sequence list that you will use as input for downstream operations such as Antibody Annotator.
You can also view the Assembly Report for a summary of what did and didn't assemble successfully.
What if the sequences didn't assemble successfully?
This can happen because of incorrect name scheme or name pattern settings. If only some sequences merge correctly, it may also be that the remaining sequences do not overlap enough to be merged.
Tip: If you are getting a lot of unassembled reads and you don't think it is due to name scheme variations, all is not lost! One trick that you can use is to pair the unassembled reads using the Set Paired Reads tool.
Once paired, you can select both the assembled and unassembled reads together and run Antibody Annotator. Antibody Annotator will attempt to find an overlap between the paired reads in the VDJ/VJ region; if it can't, it will leave ambiguities in the gap. For example, you may see XXXX in the FR3 region, but the aim is still to connect the reads into a single sequence.
Note: This method is not compatible with the Both Chains in Associated Sequences Annotator option, which assumes paired reads are matching heavy and light chains.