Aligning sequences is a good way to compare them visually, and an alignment can also display a tree, assay data, and cluster information. This article outlines how to perform a Sequence Alignment in Geneious Biologics. A brief video tutorial is embedded below. The other videos in our Getting Started series may also be helpful, linked here.
Jump to:
- Aligning from the All Sequences Table
- Aligning from Clusters
- Alignment Options
- Viewing Alignments
- Alignment and Tree Algorithms
- Aligning Raw Sequences
Aligning from the All Sequences Table
To align sequences, navigate to a Biologics Annotator Result document, select two or more sequences in the All Sequences table, and click Post-processing > Align in the dropdown.
You might find it useful to set up Filters so that you align only sequences that meet certain requirements, for example only sequences from the IGHV1 family. See Filtering your Sequences to learn about filtering syntax.
You can align a single region, multiple regions at once, or the entire sequence (ignoring any annotations). If you have added Assay Data to your sequences prior to alignment, these values can be shown alongside your alignment. To learn more, jump to this section: Visualizing assay data/metadata.
Alignment Options
Regions to Align
You can align either a single region, or align multiple regions at once. You can also align across the entire sequence, ignoring any annotations.
- Align Regions: To align multiple regions, hold down Command (macOS) or Control (PC) and click the region(s) you would like to align. This will concatenate all the selected regions for each sequence, and then show the resulting alignment.
The help text below the selector will show you what order your regions will appear in, and you can see how many sequences contain the region beside the region name. Sequences that don't have that region will be skipped, and won't appear in the resulting alignment.
Any custom regions added via annotation using a feature database can also be selected for alignment. Learn more here: Using Feature Databases to identify Fusion Proteins
- Entire sequence: This operation will align all the selected sequences from the 5' end to the 3' end, disregarding any annotations. If you choose to translate, you will need to specify a translation frame (relative to the start of the sequence).
Translate Nucleotide Sequence(s) Prior to Alignment
This operation will translate nucleotide sequences according to the selected genetic code and translation frame before aligning the sequences. By default, it will auto-detect the start of the selected region(s), as described below:
If the selected annotation has a "codon_start" qualifier then that will be used as the translation frame. Antibody Annotator sometimes puts this qualifier on V(D)J regions if there is a frameshift very early in the variable region, and the majority of the sequence is in a different frame to the beginning.
- Genetic Code: The available genetic codes are obtained from NCBI.
- Consider Alternative Start Codons:
- Auto-detect: Recommended. This option will use either the codon_start qualifier or translate from the beginning of the selected region.
- Always consider: Alternative start codons are translated as M regardless of annotation type
- Always ignore: Alternative start codons are not translated as M
- Use Frame: Manually select the frame for translation
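To get a feel for how the translation frame changes the result, here is a minimal sketch that translates a nucleotide region with the standard genetic code (NCBI translation table 1), starting from a 1-based `codon_start`. The helper name `translate_region` and its parameters are hypothetical and are not part of Geneious Biologics. Alternative start codons are not handled in this sketch.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_region(nucleotides: str, codon_start: int = 1) -> str:
    """Translate a region, reading from the 1-based codon_start position.

    Hypothetical helper: codon_start mirrors the qualifier of the same
    name (1, 2 or 3). An incomplete trailing codon is dropped, and codons
    containing ambiguous bases are translated as 'X'.
    """
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    seq = nucleotides.upper().replace("U", "T")[codon_start - 1:]
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

Calling `translate_region("GATGGCCTAA", codon_start=2)` skips the first base and reads `ATG GCC TAA`, whereas the default frame would read `GAT GGC CTA` and give a different protein.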
Expected Output
- Alignment Algorithm: The dropdown allows you to select the alignment algorithm you wish to use. We currently support the iteration-based alignment method MUSCLE (multiple sequence comparison by log-expectation) and MAFFT (Multiple Alignment using Fast Fourier Transform).
- MAFFT will be chosen automatically if you select more than 1,000 sequences as it performs much faster on large datasets.
- Build tree from alignment with:
- RAxML (Randomized Axelerated Maximum Likelihood)
- Geneious Tree Builder (Neighbour-joining algorithm, Saitou & Nei 1987).
- Align with reference*: Select this to choose any single sequence contained within the parent folder of your Annotation result as the reference sequence.
*Note: "Align with reference" is an alpha feature - please contact us if you would like advanced access.
- Combine duplicate sequences
This option will combine all identical input sequences into representative sequences before producing the alignment. This can be very useful if you are aligning based on a single region, or expect there to be many duplicate sequences in your dataset.
- Identification of a duplicate sequence is based on an exact match of the sequence.
- When looking at alignments with duplicate sequences, you will be able to display the number of duplicates alongside the alignment so that it is clear how many original sequences each sequence represents.
- Combined sequences currently will not contain other metadata (like assay data) in the Alignment view.
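The idea behind combining duplicates can be sketched in a few lines. The example below collapses exactly matching sequences into one representative each and keeps a duplicate count. The function name `combine_duplicates` is hypothetical, and using the first name seen as the representative is an assumption of this sketch, not documented behaviour.

```python
from collections import Counter


def combine_duplicates(sequences: dict[str, str]) -> list[tuple[str, str, int]]:
    """Collapse exactly matching sequences into representatives.

    Returns (representative_name, sequence, duplicate_count) tuples,
    most frequent first. The first name seen for each sequence is used
    as the representative (an assumption of this sketch).
    """
    counts = Counter(sequences.values())
    first_name: dict[str, str] = {}
    for name, seq in sequences.items():
        first_name.setdefault(seq, name)  # keep only the first name per sequence
    return [(first_name[seq], seq, n) for seq, n in counts.most_common()]
```

Because only the sequence itself is compared, two inputs with different assay values still collapse into one row, which fits with combined sequences not carrying other metadata.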
Aligning From Clusters
Sequences can be grouped into "clusters" based on shared identity across specific regions or combinations of regions, like the Heavy CDR3 or VDJ region. In Geneious Biologics, you can specify both exact and threshold clusters to help you sift through large datasets. To learn more about clusters, please see: Understanding "Clusters".
Note: Alignments from clusters will only produce an alignment across the Cluster Region. For example, if you choose the Heavy CDR3 cluster, only the Heavy CDR3 regions will be aligned.
To align sequences from a cluster, go to any Annotation Result Document and select your Cluster Table of choice. Select either a single row/cluster (inexact/threshold clusters only) or multiple clusters to align. You can then click Post-Processing > Align to bring up the alignment dialogue.
In the above example using the NGS Tutorial 1 data, the top 10 clusters (excluding frameshifted sequences) have been selected from the Heavy CDR3 (85% Similarity) cluster table.
The resulting alignment from these clusters looks like this when aligning only those sequences that meet a 14% threshold by frequency:
This means that within each cluster, only those HCDR3 sequences that made up at least 14% of the sequences within that cluster will be aligned.
You can also choose which columns are shown in the Sidebar, as seen in the above example. The Identity Cluster Count and Identity Cluster % of Similarity Cluster are both shown as metadata in the alignment.
- Identity Cluster Count refers to the total number of identical HCDR3 sequences found in the whole dataset
- Identity Cluster % of Similarity Cluster refers to what percentage the HCDR3 sequence makes up of the parent cluster (remember that these are inexact HCDR3 85% threshold clusters)
You can also perform alignments on clusters made up of multiple regions (e.g. Heavy CDR1, CDR2, CDR3) or clusters that contain both genes and regions (e.g. Heavy CDR3, V Gene, J Gene). The options for choosing which sequences from each cluster will be aligned are described in the following section on Alignment Options.
Alignment Options for Clusters
Expected Output
- Alignment Algorithm: The dropdown allows you to select the alignment algorithm you wish to use. We currently support the iteration-based alignment method MUSCLE (multiple sequence comparison by log-expectation) and MAFFT (Multiple Alignment using Fast Fourier Transform).
- MAFFT will be chosen automatically if you select more than 1,000 sequences as it performs much faster on large datasets.
- Build tree from alignment with:
- RAxML (Randomized Axelerated Maximum Likelihood)
- Geneious Tree Builder (Neighbour-joining algorithm, Saitou & Nei 1987).
- Align with reference*: Select this to choose any single sequence contained within the parent folder of your Annotation result as the reference sequence.
*Note: "Align with reference" is an alpha feature - please contact us if you would like advanced access.
Align Cluster Regions
If you produce an alignment from clusters, you can choose how strict to be when determining which sequences within each cluster should be included. There are four options to choose from.
1. Majority Sequence Only
This will take only the most common sequence for the region(s) from each cluster. If you select 10 clusters, this means that the alignment produced will have 10 sequences.
2. Threshold by Count
This will take only the sequences contained within each cluster that have met a threshold number.
To get an idea of what to set this value to, you can look at the Cluster Contents (Top 100) column:
In the first cluster, the HCDR3 sequence ASYYYGSSSFAY was found 9 times, while another sequence in this cluster, AMYYYGSSSLFAY, was found 7 times. Setting the threshold by count to 7 will include both these sequences in the resulting alignment, as well as any other HCDR3 sequences in the other clusters that have a count equal to or greater than 7.
3. Threshold by Frequency
This will take only the sequences contained within each cluster that have met a threshold frequency. To get an idea of what to set this value to, you can look at the Cluster Contents % (Top 100) column:
In the first cluster, the sequence ASYYYGSSSFAY makes up 16.4% of the HCDR3 sequences within that cluster. Other clusters have a more "dominant" sequence, for example ARWEYYAMDY makes up 94.7% of the HCDR3 sequences within the second cluster. Setting the threshold by frequency to 16% will include any sequences within each cluster that make up at least 16% of their corresponding cluster.
4. All Sequences
This will align every unique sequence contained within each cluster, up to a maximum of 100 per cluster. For example, if the top 10 clusters for HCDR3 have been selected as shown below:
A total of 59 sequences will be aligned (this is the sum of the unique sequence counts for all the selected clusters).
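The four options above can be sketched as a single selection function applied to each cluster in turn. The function name, the mode strings and the `example_cluster` data are hypothetical. Only the counts of 9 and 7 come from the example above; the other two entries are invented, so the percentages will not match the screenshots.

```python
def select_from_cluster(counts: dict[str, int], mode: str, threshold: float = 0) -> list[str]:
    """Pick which unique sequences of one cluster go into the alignment.

    counts maps each unique region sequence in the cluster to how often it
    occurs. mode is "majority", "count", "frequency" or "all" (hypothetical
    names for the four options).
    """
    ranked = sorted(counts, key=counts.get, reverse=True)  # most common first
    total = sum(counts.values())
    if mode == "majority":
        return ranked[:1]
    if mode == "count":
        return [s for s in ranked if counts[s] >= threshold]
    if mode == "frequency":  # threshold is a percentage of the cluster
        return [s for s in ranked if 100 * counts[s] / total >= threshold]
    if mode == "all":
        return ranked[:100]  # the article states a maximum of 100 per cluster
    raise ValueError(f"unknown mode: {mode}")


# First two counts are taken from the article; the last two are invented.
example_cluster = {
    "ASYYYGSSSFAY": 9,
    "AMYYYGSSSLFAY": 7,
    "ASYYYGSSSLDY": 2,
    "ASYYFGSSSFAY": 2,
}
```

With this data, `select_from_cluster(example_cluster, "count", 7)` keeps the two most common sequences, while `"majority"` keeps only the first. Running the function over every selected cluster and pooling the results gives the set of sequences to align.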
Viewing Alignments
If you select multiple regions, these will always align internally. In the example below, this means that all the Heavy CDR1s will align together, all the CDR2s will align together, and so on. You will never get a residue from CDR1 aligning in the same position as a residue from CDR2.
If some of your sequences are missing one or more of your selected regions, this will be indicated by a region not found annotation (see below). Sequences that do not have any of the specified regions will be skipped and won't show in the alignment.
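The behaviour described above can be illustrated with a small sketch. `collect_regions` gathers the selected regions per sequence, skips sequences that have none of them, and marks individual gaps as missing. `pad_blocks` then lays the regions out as side-by-side blocks. Both function names are hypothetical, and the padding is only a stand-in for a real alignment, used to show that each region keeps its own column range.

```python
MISSING = None  # stands in for the "region not found" annotation


def collect_regions(annotated: dict[str, dict[str, str]], regions: list[str]) -> dict[str, list]:
    """Gather the selected regions for each sequence, in the selected order.

    Sequences with none of the regions are skipped; a region absent from an
    otherwise usable sequence is recorded as MISSING.
    """
    rows = {}
    for name, found in annotated.items():
        parts = [found.get(region, MISSING) for region in regions]
        if any(part is not MISSING for part in parts):
            rows[name] = parts
    return rows


def pad_blocks(rows: dict[str, list], gap: str = "-") -> dict[str, str]:
    """Lay the regions out as side-by-side blocks of fixed width.

    This is padding only, not a real alignment: it just shows that every
    region occupies its own column range.
    """
    n_regions = len(next(iter(rows.values()), []))
    widths = [max(len(parts[i] or "") for parts in rows.values()) for i in range(n_regions)]
    return {
        name: "".join((part or "").ljust(width, gap) for part, width in zip(parts, widths))
        for name, parts in rows.items()
    }
```

In a real alignment each block would be aligned by MUSCLE or MAFFT rather than padded on the right, but the block boundaries would be preserved in the same way.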
Sequence Logo
You can also view the sequence logo for any alignment by switching to the Sequence Logo tab:
To the right of the Sequence Logo, you can select to plot the amino acids by Frequency or Entropy, and colour the amino acids by a variety of options.
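For readers curious about the difference between the two plotting modes, the sketch below computes per-column residue frequencies and the Shannon entropy of a column. This is the textbook definition only. The article does not state exactly how Geneious Biologics scales the logo, so treat the function names and the gap handling as assumptions.

```python
import math
from collections import Counter


def column_frequencies(column: str) -> dict[str, float]:
    """Relative frequency of each residue in one alignment column (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    counts = Counter(residues)
    return {r: n / len(residues) for r, n in counts.items()} if residues else {}


def column_entropy(column: str) -> float:
    """Shannon entropy of the column in bits (0 means fully conserved)."""
    return sum(-p * math.log2(p) for p in column_frequencies(column).values())
```

A fully conserved column such as `"AAAA"` has an entropy of 0 bits, while a column split evenly between two residues, such as `"AACC"`, has an entropy of 1 bit.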
Visualizing assay data/metadata
It is possible to view any of the information from the table alongside your sequences in the Alignment view. If you aligned from the All Sequences table, then all the columns from the All Sequences table will be available, including Sequence name, notes, labels and Assay Data.
To include external assay or metadata in the alignment document as a heatmap, you will need to first import the assay data into the All Sequences result table prior to alignment: Adding Assay Data to your Analysis Results.
Upon importing assay data, select the sequences you would like to align and proceed with the alignment operation as described in the sections above. To view the metadata in the tree or alignment view, click Sidebar in the right Sequence Viewer panel and select the assay data you would like to be included in the view. To view the values of the metadata, click Show values for metadata and hover over the heatmap to see the details of the metadata.
The above image shows an alignment with a tree, as well as the Tm, ELISA, BVP ELISA and Liability Score as calculated by Geneious Biologics.
- Numerical columns will show as heatmaps
- Text columns will simply show the relevant word(s).
- Boolean (yes/no) columns, including labels, will show as ticks.
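The three display rules above amount to a simple dispatch on the column's value type. The following sketch shows one way to express that; the function names are hypothetical and the linear heatmap scaling is an assumption, since the article does not describe the colour scale.

```python
def render_style(value) -> str:
    """Return how a metadata value would be shown next to the alignment."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "tick"
    if isinstance(value, (int, float)):
        return "heatmap"
    return "text"


def heatmap_intensity(value: float, low: float, high: float) -> float:
    """Scale a numeric value to the range 0..1 for colouring (linear scaling assumed)."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))
```

For instance, a Tm column would go through `heatmap_intensity` with the column's minimum and maximum as `low` and `high`, while a label column would be rendered as ticks.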
The metadata columns can be resized horizontally by hovering the mouse pointer over the metadata column headers and dragging. The metadata headers can also be resized vertically by dragging the bottom of the column header down.
Aligned sequences can be sorted by metadata values, by clicking on the desired Metadata header. When you sort on a metadata column, the tree will be hidden. To get the tree back, simply click twice on the metadata header to un-sort. The below image shows an alignment sorted by descending ELISA values:
Note: When looking at alignments with combined duplicate sequences, there will be additional metadata options to turn on columns for statistics such as number of duplicate sequences. Combined (duplicate) sequences currently will not contain other metadata in the Alignment view, as the duplicate sequences may have had various different column values.
Circular Alignment viewer
To view the Circular tree view, click on the Tree View tab:
The above Tree view was generated from an alignment on the Heavy CDR3 sequences from Sanger Tutorial 2. To customise the Tree, use the right panel options:
- Color Branches Automatically: Generates unique colouring based on the option selected in the dropdown. The above is coloured by the Heavy V Gene family.
- Show Tip Labels: Toggles whether the tip labels are shown.
- Tip Labels: Selects what represents each aligned sequence; the default is the Name. Metadata and added assay data can also be selected here.
- Max Label Length: Sets the number of characters shown in the label. This can be useful for very long labels.
- Branch Transform: Options include Cladogram (default), Equal and No Transform.
Alignment and Tree Algorithms
We currently support the iteration-based alignment method MUSCLE (multiple sequence comparison by log-expectation) and MAFFT (Multiple Alignment using Fast Fourier Transform).
MAFFT will be selected automatically if you select more than 1,000 sequences, as it performs much faster on large datasets.
The tree builder algorithms currently available are: RAxML (Randomized Axelerated Maximum Likelihood) and Geneious (Neighbour-joining algorithm, Saitou & Nei 1987).
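The automatic switch to MAFFT for large selections can be expressed as a one-line rule. The sketch below is illustrative only: `choose_aligner` and `MAFFT_THRESHOLD` are hypothetical names, and the only fact taken from this article is the 1,000-sequence cut-off.

```python
MAFFT_THRESHOLD = 1000  # from the article: more than 1,000 sequences selects MAFFT


def choose_aligner(n_sequences: int, requested: str) -> str:
    """Return the aligner to use; large selections override the request with MAFFT."""
    if n_sequences < 2:
        raise ValueError("an alignment needs at least two sequences")
    return "MAFFT" if n_sequences > MAFFT_THRESHOLD else requested
```

Note that exactly 1,000 sequences still honours the requested algorithm, since the rule applies to selections of more than 1,000.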
Aligning Raw Sequences
To align sequences directly from the Files table, (1) select more than one sequence, (2) click Post-processing and (3) Align in the dropdown.
You will always be able to produce an alignment of your entire sequences (whether your sequences are nucleotide or protein). See the above sections for more on Alignment Options.