This article provides an overview of the Batch Assemble Sanger Sequences pipeline, found under the Pre-processing options. The pipeline assembles raw Sanger sequencing reads into full-length sequences for downstream annotation. This article covers the main options, including consensus sequence building and heterozygous base calling.
Note: The Set and Merge Paired Reads pre-processing operation also lets you set or merge paired sequences. It uses BBMerge and can be less accurate than batch assembly. However, it can be run on very large datasets and is therefore better suited to high-throughput or NGS data.
The video below gives a general introduction to pre-processing in Geneious Biologics. The first few videos in our Getting Started series may also be helpful, linked here.
- How to run Batch Assembly
- Batch Assemble Sanger Sequences Options
- Viewing Consensus Sequences
- Identifying Heterozygotes
- What if the sequences didn't assemble successfully?
How to run Batch Assembly
If you have many sequences (e.g. more than 100), we recommend that you first use Group Sequences to combine them into a sequence list.
Select all the raw sequences or sequence lists and choose Pre-processing > Batch Assemble Sanger Sequences from the dropdown.
Batch Assemble Sanger Sequences Options
Assemble by Sequence Name
You can assemble sequences based on their names. Enter an appropriate separator, then choose the part of the name that holds the unique identifier. Reads that share the same value in that part are assembled together.
In the example below, this means selecting 4th in the Name part dropdown and entering "_" (underscore) as the Name separator, so that the selected sequences are assembled by matching well IDs. Only sequences with the same value in the 4th position of the name, when the name is split on underscores, are assembled together.
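The grouping logic described above can be illustrated with a short Python sketch. This is not the Geneious Biologics implementation; the function name `group_by_name_part`, the example file names, and the handling of names with too few parts are all assumptions made for illustration.

```python
from collections import defaultdict


def group_by_name_part(names, separator="_", part=4):
    """Group names by the 1-based `part` obtained after splitting on `separator`.

    Names with fewer than `part` fields are returned separately
    (an assumption; the actual tool may treat them differently).
    """
    groups = defaultdict(list)
    unmatched = []
    for name in names:
        fields = name.split(separator)
        if len(fields) < part:
            unmatched.append(name)
            continue
        groups[fields[part - 1]].append(name)
    return dict(groups), unmatched


# Hypothetical read names: the 4th underscore-separated part is the well ID.
reads = [
    "Plate1_mAb_VH_A01_F.ab1",
    "Plate1_mAb_VH_A01_R.ab1",
    "Plate1_mAb_VH_A02_F.ab1",
    "Plate1_mAb_VH_A02_R.ab1",
    "control.ab1",
]
groups, unmatched = group_by_name_part(reads, separator="_", part=4)
for well_id, members in groups.items():
    print(well_id, members)
print("unmatched:", unmatched)
```

Here the forward and reverse reads for wells A01 and A02 fall into two groups, while the name without a 4th part is left ungrouped.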
Assemble by Name Scheme
Sequences can also be assembled by matching common parts of their names, as defined by a Name Scheme. The video below explains how this works in Geneious Biologics:
This option requires a previously created Name Scheme (see How to Create a Name Scheme). The name scheme is applied to the sequence names to extract the fields it defines, such as the Common Identifier, chain and sequencing direction. Sequences are then assembled according to the combination of the Common Identifier and Chain fields (if present). See this article for more information about what a Name Scheme is and how it works.
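As a rough illustration of grouping on the Common Identifier and Chain fields, the sketch below uses a regular expression to stand in for a Name Scheme. The pattern, the field layout (`identifier_chain_direction`) and the example names are hypothetical; real Name Schemes are configured in Geneious Biologics and may work differently.

```python
import re
from collections import defaultdict

# Hypothetical scheme: <identifier>_<chain>_<direction>, e.g. "mAb01_H_F".
SCHEME = re.compile(r"^(?P<identifier>[^_]+)_(?P<chain>[HL])_(?P<direction>[FR])$")


def group_by_scheme(names):
    """Group names by (identifier, chain); names that do not match are skipped."""
    groups = defaultdict(list)
    for name in names:
        match = SCHEME.match(name)
        if match is None:
            continue  # assumption: non-matching names are simply not assembled
        key = (match.group("identifier"), match.group("chain"))
        groups[key].append(name)
    return dict(groups)


names = ["mAb01_H_F", "mAb01_H_R", "mAb01_L_F", "mAb01_L_R", "badname"]
scheme_groups = group_by_scheme(names)
for key, members in scheme_groups.items():
    print(key, members)
```

In this sketch the heavy and light chain reads of the same identifier end up in separate groups, which mirrors assembling on the combination of Common Identifier and Chain.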
Assemble by List
Assembling sequences by list is useful if your sequences are interlaced and have been grouped into a list. The assembly will be carried out on each pair of sequences, starting from the first, e.g. sequences 1 and 2 will be assembled together, sequences 3 and 4 will be assembled together, and so forth.
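The pairing order described above (1 with 2, 3 with 4, and so on) can be sketched in a few lines of Python. The function name `pair_consecutive` and the treatment of a trailing unpaired sequence are assumptions for illustration only.

```python
def pair_consecutive(sequences):
    """Pair items in list order: (1st, 2nd), (3rd, 4th), ...

    If the list has an odd length, the last item is returned as a leftover
    (an assumption about how an unpaired read would be handled).
    """
    pairs = [
        (sequences[i], sequences[i + 1])
        for i in range(0, len(sequences) - 1, 2)
    ]
    leftover = sequences[-1:] if len(sequences) % 2 else []
    return pairs, leftover


# Hypothetical interlaced list: forward read followed by its reverse read.
interlaced = ["s1_F", "s1_R", "s2_F", "s2_R", "s3_F"]
pairs, leftover = pair_consecutive(interlaced)
print(pairs)
print(leftover)
```

Because pairing depends only on position, the list must already be in the correct interlaced order before assembly.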
To call heterozygous bases, select Consensus: call Sanger heterozygotes in the Batch Assembly options and enter 50%. To learn more about heterozygous base calling and annotation, please refer to this article.
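With a 50% setting, a heterozygous base is called when the second-highest peak at a position is at least 50% of the height of the highest peak. The Python sketch below illustrates that rule using the standard IUPAC two-base ambiguity codes. It is a simplified illustration, not the algorithm used by the pipeline; the function name `call_base` and the peak heights are hypothetical.

```python
# Standard IUPAC ambiguity codes for two-base mixtures.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def call_base(peaks, threshold=0.5):
    """Call a base from per-base peak heights at one trace position.

    `peaks` maps at least two bases to peak heights. If the second-highest
    peak is at least `threshold` times the highest, an ambiguity code is
    returned; otherwise the base with the highest peak is returned.
    """
    ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
    (base1, height1), (base2, height2) = ranked[0], ranked[1]
    if height1 > 0 and height2 >= threshold * height1:
        return IUPAC[frozenset((base1, base2))]
    return base1


# Hypothetical peak heights at a single position.
mixed = call_base({"A": 1000, "C": 620, "G": 30, "T": 15})
clean = call_base({"A": 1000, "C": 300, "G": 30, "T": 15})
print(mixed, clean)
```

In the first call the C peak is 62% of the A peak, so the position is reported as M (A or C); in the second it is only 30%, so A is called.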
Save unused reads
This outputs any reads that did not have a pair or could not be assembled.
Select Generate a contig for each assembly to check whether heterozygous bases are present in the assembled sequences.
Viewing Consensus Sequences
To view the consensus sequences of all assembled sequences, select the Assembly Consensus Sequences document output. This is the sequence list that you will use as input for downstream operations such as Antibody Annotator.
You can also view the Assembly Report for a summary of what did and didn't assemble successfully.
Identifying Heterozygotes
To quickly identify potential heterozygous bases, click Zoom Out to full view in the Sequence Viewer. In the sidebar, ensure that Annotations: trimmed, Highlighting, Consensus and Graphs: Pairwise identity are selected. Look for regions of the Identity graph with low peaks. In this example, select the region around positions 900-950 bp and zoom in. A heterozygous base, M, is called at position 923 because the second peak is at least 50% of the height of the first peak.
What if the sequences didn't assemble successfully?
This can happen when the name scheme or name pattern settings are incorrect. If only some sequences merge correctly, the remaining reads may not overlap sufficiently to be merged.
Tip: If you are getting a lot of unassembled reads and you don't think it is due to name scheme variations, all is not lost! One trick that you can use is to pair the unassembled reads using the Set Paired Reads tool.
Once paired, you can select both the assembled and unassembled reads together and run Antibody Annotator. Antibody Annotator will attempt to find an overlap between the paired reads in the VDJ/VJ region; if it can't, it will leave ambiguities in the gap. For example, you may see XXXX in the FR3 region, but where possible the reads are still connected into a single sequence.
Note: This method is not compatible with the Both Chains in Associated Sequences Annotator option, which assumes paired reads are matching heavy and light chains.