Aligning sequences is a good way to compare them visually. An alignment can also include a tree, assay data, and cluster information.
Pairwise and multiple sequence alignment can be performed both before and after sequence annotation. Alignment can be performed using the Align operation under the Post-processing tab.
- Aligning raw sequences
- Aligning from the All Sequences result table
Aligning Raw Sequences
To align sequences directly from the Files table, (1) select more than one sequence, (2) click Post-processing and (3) select Align in the dropdown.
When you align sequences you have several options, as shown below.
You can always align your entire sequences, whether they are nucleotide or protein. You may also wish to extract a certain region or translate your nucleotide sequences prior to alignment. The Expected Output options allow you to add a tree or customise your alignment algorithm. The Result Name option allows you to choose a custom file name for your new alignment.
Translating before Alignment
When aligning nucleotide sequences, you may also choose to translate the sequences and align the amino acid sequences rather than the underlying nucleotides. To do this, select the Translate nucleotide sequence(s) prior to alignment option. This operation will translate each nucleotide sequence according to the selected genetic code and translation frame before aligning the sequences. The available genetic codes are obtained from NCBI.
The standard start codon (AUG) codes for Methionine in eukaryotes and a modified Met (fMet) in prokaryotes. Alternative start codons are still translated as Met when they are located at the start of a coding sequence. When Consider Alternative Start Codons is selected, you can select from the following options:
- Auto-detect: Alternative start codons are translated as M when the annotation is of type CDS, ORF or gene
- Always consider: Alternative start codons are translated as M regardless of annotation type
- Always ignore: Alternative start codons are not translated as M
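The three options above can be pictured with a small sketch. This is only an illustration, not the application's own code: the function name translate, the mode values and the annotation_type argument are hypothetical, and the alternative start codons used here (TTG and CTG) are the ones listed for the standard genetic code (NCBI table 1). Other genetic codes list different ones.

```python
# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Alternative start codons for NCBI table 1 (assumption: other codes differ).
ALTERNATIVE_STARTS = {"TTG", "CTG"}
# Annotation types that enable the Auto-detect behaviour described above.
CODING_TYPES = {"CDS", "ORF", "gene"}


def translate(seq, mode="auto", annotation_type=None):
    """Translate a DNA/RNA string; optionally read an alternative start codon as M.

    mode is a hypothetical parameter: "auto", "always" or "ignore".
    """
    seq = seq.upper().replace("U", "T")
    # Split into complete codons, dropping any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    protein = [CODON_TABLE.get(codon, "X") for codon in codons]

    use_alt_start = mode == "always" or (
        mode == "auto" and annotation_type in CODING_TYPES
    )
    # Only the first codon is affected by the alternative start setting.
    if codons and use_alt_start and codons[0] in ALTERNATIVE_STARTS:
        protein[0] = "M"
    return "".join(protein)
```

For example, translate("TTGGCT", "always") reads the leading TTG as M, while translate("TTGGCT", "ignore") reads it as L.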
To translate an entire sequence, you will need to specify a translation frame. If you have an annotation, you can choose to align that sequence region, which will automatically set the translation frame in the following manner:
- By default, it will translate from the beginning (frame 1) of the selected region for each sequence.
- If the selected annotation has a "codon_start" qualifier then that will be used as the translation frame. Antibody Annotator sometimes puts this qualifier on V(D)J regions if there is a frameshift very early in the variable region, and the majority of the sequence is in a different frame to the beginning.
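As a rough sketch of the two rules above, the snippet below picks the translation frame for an annotated region. It is an illustration under stated assumptions: region_for_translation and the qualifiers dictionary are hypothetical names, and codon_start is taken to be 1-based (1, 2 or 3), as in GenBank-style feature qualifiers.

```python
def region_for_translation(region_seq, qualifiers=None):
    """Return the part of an annotated region that would be translated.

    qualifiers is a hypothetical dict of annotation qualifiers,
    for example {"codon_start": "2"}.
    """
    qualifiers = qualifiers or {}
    # Without a codon_start qualifier, translate from frame 1.
    frame = int(qualifiers.get("codon_start", 1))
    if frame not in (1, 2, 3):
        raise ValueError(f"codon_start must be 1, 2 or 3, got {frame}")
    in_frame = region_seq[frame - 1:]
    # Drop a trailing partial codon so only complete codons remain.
    return in_frame[: len(in_frame) - len(in_frame) % 3]
```

With codon_start set to 2, the first base of the region is skipped, so "AATGGCC" would be translated as "ATGGCC".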
If you are having trouble with translation frames, we recommend annotating first using Antibody Annotator to add CDR, FR, and V(D)J regions to your sequences, and then following the steps for aligning from the All Sequences or Cluster tables below.
Aligning by Annotations
If your selected sequences have annotations on them, you can choose to extract and align an annotated region rather than the whole sequence. This is particularly useful when you are also translating, because the translation will then start from the beginning of your selected annotation.
If your selected sequences do not have annotations, then the Align regions option will not be available (shown above). In this case, we recommend annotating first using Antibody Annotator, and then following the steps for aligning from the All Sequences table below.
Alignment and Tree Algorithms
The Alignment algorithm dropdown allows you to select the alignment algorithm you wish to use. We currently support the iterative alignment method MUSCLE (Multiple Sequence Comparison by Log-Expectation) and MAFFT (Multiple Alignment using Fast Fourier Transform). MAFFT will be selected automatically if you select more than 1,000 sequences, as it is much faster on large datasets.
To build a tree from the alignment, select Build tree from alignment with in the Expected Output section. You can then select one of the following tree builder algorithms: RAxML (Randomized Axelerated Maximum Likelihood) or Geneious (Neighbour-joining algorithm, Saitou & Nei 1987).
*Note: "Align with reference" is an alpha feature - please contact us if you would like early access.
To view the aligned sequences in a tree view, (1) select a Tree file and (2) ensure that Show Tree in the Sequence Viewer Sidebar is selected.
*Note: The tree is hidden when the sequences are sorted. To view the alignment in a tree format again, select Show Tree in the Sequence Viewer Sidebar.
Combining Duplicate Sequences
The Combine duplicate sequences option will combine all identical input sequences into representative sequences before producing the alignment. This can be very useful if you are aligning based on a single region, or expect there to be many duplicate sequences in your dataset.
Identification of a duplicate sequence is based on an exact match of the sequence string. If you are aligning amino acid translations, then sequences count as duplicates when the translation is the same, even if the underlying nucleotide sequences are different. Similarly, when aligning by regions, sequences with the same region will be 'combined' together, even if the sequences are different outside of the chosen aligned regions.
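The exact-match rule can be sketched as a simple grouping step. This is a minimal illustration, not the application's implementation: combine_duplicates is a hypothetical name, and using the first sequence seen as the representative is an assumption. The string passed in is whatever will be aligned (the translation or the extracted region), which is why sequences that differ elsewhere can still be combined.

```python
def combine_duplicates(records):
    """Group (name, sequence) pairs by exact sequence string.

    Returns a list of (representative_name, sequence, duplicate_count),
    in order of first appearance.
    """
    groups = {}
    for name, seq in records:
        # Exact string match: identical strings land in the same group.
        groups.setdefault(seq, []).append(name)
    # Assumption: the first name seen represents the whole group.
    return [(names[0], seq, len(names)) for seq, names in groups.items()]


# Hypothetical translated CDR3 regions; s1 and s2 are identical at this level.
records = [("s1", "CARDY"), ("s2", "CARDY"), ("s3", "CTRDY")]
combined = combine_duplicates(records)
```

Here s1 and s2 collapse into one representative with a duplicate count of 2, and s3 remains on its own.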
When looking at alignments with duplicate sequences, you will be able to display the number of duplicates alongside the alignment so that it is clear how many original sequences each sequence represents.
Combined sequences currently will not contain other metadata in the Alignment view.
Aligning from the All Sequences Result Table
To align sequences within a Biologics Annotator Result document, (1) select a Biologics Annotator Result document, (2) select two or more sequences in the All Sequences table, (3) click Post-processing and (4) select Align in the dropdown.
The Alignment Options dialog will now pop up, allowing you to customise your alignment:
You can choose to either align the entire sequence or just a region of your sequences. To align the entire sequence, click Entire sequence under Regions to align. This operation will align all the selected sequences from the 5' end to the 3' end. If you choose to translate, you will need to specify a translation frame (relative to the start of the sequence). If your frame varies sequence-to-sequence, you may wish to align by the VDJ or VJ Region instead (see below). This will automatically set the best frame for each sequence.
Aligning by Region
To align a selected region, click the checkbox by Align regions and select the appropriate region name in the drop-down. This operation will align the sequences from the 5' end to the 3' end of the selected region rather than the entire sequence.
In order to align on a region, at least two of your selected sequences need to have that region annotated. You can see how many sequences contain the region beside the region name. Sequences that don't have that region will be skipped, and won't appear in the resulting alignment. Antibody Annotator and Single Clone Antibody Analysis will add some of these regions by default, but you can also choose your own custom annotations that you may have added to the sequences.
There is the option to align multiple regions at once. To do this, hold down the SHIFT key or Command/Control key as you are making your region selections. This will concatenate all the selected regions for each sequence, and then show the resulting alignment. The help text below the selector will show you what order your regions appear in.
If you select multiple regions, these will always align internally. In the example above, this means that all the Heavy CDR1s will align together, all the CDR2s will align together, and so on. You will never get a base from CDR1 aligning in the same position as a base from CDR2.
When scFv sequences and associated Heavy/Light chain pairs are aligned, it is possible to include regions from both chains in the same alignment.
The resulting alignment will contain the relevant sections of each chain, concatenated in the same row. The example below was created using the associated Heavy/Light chains from the Sanger Tutorial 2 data.
If some of your sequences are missing one or more of your selected regions, this will be indicated by a region not found annotation (see below). Sequences that do not have any of the specified regions will be skipped and won't show in the alignment.
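The way multiple regions are laid out can be pictured with the sketch below. It is an illustration only: concatenate_regions and the input dictionary are hypothetical, and simple right-padding with gaps stands in for the real per-region alignment step (MUSCLE or MAFFT). What it shows is that each region gets its own block of columns, that a missing region becomes a gap-only block, and that sequences with none of the selected regions are skipped.

```python
def concatenate_regions(sequences, region_names, gap="-"):
    """Build one row per sequence from the selected regions, block by block.

    sequences: hypothetical dict of name -> {region_name: residues}.
    Right-padding is a placeholder for the real per-region alignment.
    """
    # Skip sequences that have none of the selected regions.
    kept = {
        name: regions
        for name, regions in sequences.items()
        if any(region in regions for region in region_names)
    }
    rows = {name: "" for name in kept}
    for region in region_names:
        # Each region gets its own block, as wide as its longest member.
        width = max(
            (len(regions.get(region, "")) for regions in kept.values()),
            default=0,
        )
        for name, regions in kept.items():
            # A missing region becomes an all-gap block of the same width.
            rows[name] += regions.get(region, "").ljust(width, gap)
    return rows


example = {
    "a": {"CDR1": "GFTF", "CDR2": "ISGS"},
    "b": {"CDR1": "GF"},    # CDR2 missing
    "c": {"FR1": "QVQL"},   # none of the selected regions
}
rows = concatenate_regions(example, ["CDR1", "CDR2"])
```

Because every block has a fixed width, a residue from CDR1 can never share a column with a residue from CDR2.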
Adding Metadata to Alignments
It is possible to view any of the information from the table alongside your sequences in the Alignment view. If you aligned from the All Sequences table, then all the columns from the All Sequences table will be available, including Sequence name, notes, labels and analysis data.
To include external assay data or metadata in the alignment document as a heatmap, you will first need to import the assay data into the All Sequences result table, before aligning. More on how to add assay data to your results can be found in the following article.
After importing assay data, select the sequences you would like to align and proceed with the alignment operation as described in the sections above. To view the metadata in the tree or alignment view, click Sidebar in the right Sequence Viewer panel and select the assay data you would like to include in the view. To view the values of the metadata, click Show values for metadata and hover over the heatmap to see the details.
Numerical columns will show as heatmaps, while text columns will simply show the relevant word(s). Boolean (yes/no) columns, including labels, will show as ticks.
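The display rule above amounts to a choice based on the column's value type. The sketch below is a hypothetical illustration (display_style is not part of the application) of how a value could be mapped to its display style.

```python
def display_style(value):
    """Pick how a metadata value would be shown next to the alignment."""
    # Check booleans first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "tick"
    if isinstance(value, (int, float)):
        return "heatmap"
    return "text"
```

For example, a binding score of 0.42 would map to a heatmap cell, a gene name such as "IGHV1-69" to plain text, and a label set to True to a tick.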
The metadata columns can be resized horizontally by hovering the mouse pointer over the metadata column headers and dragging. The metadata headers can also be resized vertically by dragging the bottom of the column header down. The tree can also be resized by clicking and dragging on the bar with triple dots.
Aligned sequences can be sorted by metadata values, by clicking on the desired Metadata header. When you sort on a metadata column, the tree will be hidden. To get the tree back, simply click twice on the metadata header to un-sort.
When looking at alignments with combined duplicate sequences, there will be additional metadata options to turn on columns for statistics such as the number of duplicate sequences. Combined (duplicate) sequences currently will not contain other metadata in the Alignment view, as the duplicate sequences may have had different column values.