Sequence Alignment

Updated July 22, 2024 03:28

Aligning sequences is a good way to compare them visually. An alignment can include a tree, circular viewer, assay data heatmaps, and cluster information. This article outlines how to perform a Sequence Alignment in Geneious Biologics. A brief video tutorial is embedded below. The other videos in our Getting Started series may also be helpful, linked here.

Aligning from the All Sequences Table
- Alignment Options
Aligning from Clusters
- Alignment Options for Clusters
Viewing Alignments
- Visualizing assay data/metadata
- Sequence Logo
- Circular tree viewer
Exporting Sequences from Alignments
Alignment and Tree Algorithms
Advanced Alignment Options
Aligning Raw Sequences

Aligning from the All Sequences Table

To align sequences, navigate to a Biologics Annotator Result document, select two or more sequences in the All Sequences table, and click Post-processing > Align in the dropdown.

You might find it useful to set up Filters to select only sequences that meet certain requirements prior to aligning, for example only taking sequences that were from the IGHV1 family. See Filtering your Sequences to learn about filtering syntax.

post processing >alignment.png

You can align either a single region, or align multiple regions at once. You can also align across the entire sequence, ignoring any annotations.

If you have added Assay Data to your sequences prior to alignment, these values can be represented alongside your alignment if specified in the Visualization Options while running an alignment. Jump to this section to learn how to bring up the values in the alignment view: Visualizing assay data/metadata.

Alignment Options

Main Options

alignment allseq main options.png

Align Regions:
To align multiple regions, hold down command (macOS) or control (PC) and click on the region(s) you would like to align. This will concatenate all the selected regions for each sequence, and then show the resulting alignment.

The help text below the selector will show you what order your regions will appear in, and you can see how many sequences contain the region beside the region name. Sequences that don't have that region will be skipped, and won't appear in the resulting alignment.
Any custom regions added via annotation using a feature database can also be selected for alignment. Learn more here: Using Feature Databases to identify Fusion Proteins

Translate nucleotides before aligning
This operation will translate nucleotide sequences according to the selected genetic code and translation frame before aligning the sequences. By default, it will autodetect the start of the region(s) selected.

If the selected annotation has a "codon_start" qualifier then that will be used as the translation frame. Our annotation pipelines sometimes put this qualifier on V(D)J regions if there is a frameshift very early in the variable region, and the majority of the sequence is in a different frame to the beginning.
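
As a rough illustration of how a translation frame is applied, the following Python sketch translates a region starting from a 1-based codon_start value. The function name translate_region and the handling of trailing or ambiguous bases are assumptions made for this example and are not taken from Geneious Biologics. Only the standard genetic code (NCBI translation table 1) is included.

```python
from itertools import product

# NCBI translation table 1 (standard genetic code), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_region(nucleotides: str, codon_start: int = 1) -> str:
    """Translate a region, starting at the 1-based frame given by codon_start."""
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    seq = nucleotides.upper().replace("U", "T")[codon_start - 1:]
    usable = len(seq) - len(seq) % 3  # assumption: drop a trailing partial codon
    # Assumption: codons containing ambiguous bases are shown as X.
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Frame 1 reads from the first base; frame 2 skips one base before translating.
print(translate_region("GAGGTGCAGCTG"))
print(translate_region("CGAGGTGCAGCTG", codon_start=2))
```

Both calls read the same codons (GAG GTG CAG CTG), which shows why a codon_start qualifier matters when the annotated region begins one or two bases before the first full codon.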

Combine duplicate sequences
This option combines all input sequences that are identical across the selected regions into a single representative sequence before producing the alignment. This can be very useful if you are aligning based on a single region, or expect there to be many duplicate sequences in your dataset.

Identification of a duplicate sequence is based on an exact match of the sequence across the region(s).
When looking at alignments with duplicate sequences, you will be able to display the number of duplicates alongside the alignment so that it is clear how many original sequences each sequence represents.
You can choose how metadata (like assay data) is represented for combined sequences with the Represent values as setting, described under Visualisation Options below.
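
The exact-match rule can be sketched in a few lines of Python. The record layout (an id plus a dictionary of region sequences) and the function name combine_duplicates are hypothetical and only illustrate the idea; they do not describe how Geneious Biologics stores or processes sequences.

```python
def combine_duplicates(records, selected_regions):
    """Group records whose selected regions match exactly.

    Returns a dict mapping the tuple of region sequences to the list of
    record ids that share it, in input order.
    """
    groups = {}
    for record in records:
        key = tuple(record["regions"][name] for name in selected_regions)
        groups.setdefault(key, []).append(record["id"])
    return groups


# Hypothetical input: three sequences, two of which share the same Heavy CDR3.
records = [
    {"id": 1, "regions": {"Heavy CDR3": "ARWEYYAMDY"}},
    {"id": 2, "regions": {"Heavy CDR3": "ASYYYGSSSFAY"}},
    {"id": 3, "regions": {"Heavy CDR3": "ARWEYYAMDY"}},
]
groups = combine_duplicates(records, ["Heavy CDR3"])
for key, ids in groups.items():
    print(key, "represents", len(ids), "sequence(s)")
```

The length of each group corresponds to the number of original sequences that a combined row stands for.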

Represent duplicate sequences by:
This option allows you to select what sequence is chosen as the representative when identical sequences are collapsed. The options are:

Representative Sequence (by Liability Score)
- This will select the best ranked sequence according to liability score. Antibody Sequence Liabilities needs to be turned on when running an annotation for this option to work.
  Note: if multiple sequences have the same lowest score, the sequence with the lowest ID will be selected.
First occurrence (for the current sort order)
- This will select the first listed sequence from within the duplicates. The first listed sequence is dependent on how the sequences are sorted.
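
The two selection rules can be expressed as a small Python sketch. The field names id and liability_score and the function pick_representative are hypothetical. The sketch assumes that a lower liability score ranks better, which is how the tie-break note above reads.

```python
def pick_representative(duplicates, method="liability"):
    """Pick one record from a list of identical (duplicate) sequences.

    duplicates is assumed to be in the current sort order of the table.
    """
    if not duplicates:
        raise ValueError("no sequences to choose from")
    if method == "liability":
        # Lowest score wins; ties are broken by the lowest ID.
        return min(duplicates, key=lambda s: (s["liability_score"], s["id"]))
    if method == "first":
        # First occurrence for the current sort order.
        return duplicates[0]
    raise ValueError(f"unknown method: {method}")


# Hypothetical duplicates, listed in table order.
duplicates = [
    {"id": 12, "liability_score": 3.0},
    {"id": 7, "liability_score": 1.5},
    {"id": 4, "liability_score": 1.5},
]
print(pick_representative(duplicates, "liability")["id"])
print(pick_representative(duplicates, "first")["id"])
```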

Alignment Algorithm:
The dropdown allows you to select the alignment algorithm you wish to use. We currently support the iteration-based alignment method MUSCLE (multiple sequence comparison by log-expectation) and MAFFT (Multiple Alignment using Fast Fourier Transform).

MAFFT will be chosen automatically if you select more than 1,000 sequences as it performs much faster on large datasets.
Align entire sequence
This operation will align all the selected sequences from the 5' end to the 3' end, disregarding any annotations. If you choose to translate, you will need to specify a translation frame (relative to the start of the sequence).

Visualisation Options

Alignment visualization options.png

Build tree from alignment with:

RAxML (Randomized Axelerated Maximum Likelihood)
Geneious Tree Builder (Neighbour-joining algorithm, Saitou & Nei 1987).

Show table columns in alignment:
This option allows you to pull across metadata values from the All Sequences Table such as gene matches, any attached assay data and any other column in the table. Some defaults will be populated already.

To specify a metadata column, click on the blue plus to bring up the selector:
Metadata field:
Select the column in the All Sequences Table that will be displayed alongside the aligned sequence(s)
Field value type:
Select numeric or categorical. This will influence the options available for Represent values as:
Represent values as:
- Categorical value options:
  - Most common
  - All Values
  - Unique Values
- Numeric value options:
  - Most Common
  - All Values
  - Unique Values
  - Mean
  - Median
  - Min
  - Max
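
To make the difference between these summaries concrete, here is a minimal Python sketch that reduces the values of several combined sequences to a single displayed value. The function name represent_values, the list-based output for All Values and Unique Values, and the tie handling for Most Common are assumptions for illustration, not a description of the application's behaviour.

```python
from collections import Counter
from statistics import mean, median


def represent_values(values, how):
    """Summarise the metadata values of a set of combined sequences."""
    if how == "Most Common":
        # Assumption: on a tie, the value seen first is reported.
        return Counter(values).most_common(1)[0][0]
    if how == "All Values":
        return list(values)
    if how == "Unique Values":
        return list(dict.fromkeys(values))  # keeps first-seen order
    numeric = {"Mean": mean, "Median": median, "Min": min, "Max": max}
    if how in numeric:
        return numeric[how](values)  # only meaningful for numeric fields
    raise ValueError(f"unknown option: {how}")


# Hypothetical ELISA readings and gene calls for four combined sequences.
elisa = [1.0, 2.0, 2.0, 5.0]
genes = ["IGHV1", "IGHV3", "IGHV1", "IGHV1"]
print(represent_values(elisa, "Mean"), represent_values(elisa, "Median"))
print(represent_values(genes, "Most Common"), represent_values(genes, "Unique Values"))
```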

Aligning from Clusters

Sequences can be grouped into "clusters" based on shared identity across specific regions or combinations of regions, like the Heavy CDR3 or VDJ region. In Geneious Biologics, you can specify both exact and threshold clusters to help you sift through large datasets. To learn more about clusters, please see: Understanding "Clusters".

To align sequences from within the clusters of a region, go to any Annotation Result Document and select your Cluster Table of choice. Select either a single row/cluster (inexact/threshold clusters only) or multiple clusters to align. You can then click Post-processing > Align to bring up the alignment dialogue.

Screenshot_2023-02-02_at_3.29.49_PM.png

In the above example using the NGS Tutorial 1 data, the top 10 clusters (excluding frameshifted sequences) have been selected from the Heavy CDR3 (85% Similarity) cluster table.

The resulting alignment from these clusters looks like this when aligning only those sequences that meet a 14% threshold by frequency:

Screenshot_2023-02-02_at_3.35.48_PM.png

This means that within each cluster, only those HCDR3 sequences that made up at least 14% of the sequences within that cluster will be aligned.

You can also choose which columns are shown in the Sidebar, as seen in the above example. The Identity Cluster Count and Identity Cluster % of Similarity Cluster are both shown as metadata in the alignment.

Identity Cluster Count refers to the total number of identical HCDR3 sequences found in the whole dataset.
Identity Cluster % of Similarity Cluster refers to the percentage of the parent cluster that the HCDR3 sequence makes up (remember that these are inexact HCDR3 85% threshold clusters).

You can also perform alignments on clusters made up of multiple regions (e.g. Heavy CDR1, CDR2, CDR3) or clusters that contain both genes and regions (e.g. Heavy CDR3, V Gene, J Gene). The options for how to choose which sequences from each cluster will be aligned are described in the following section on Alignment Options.

Alignment Options for Clusters

Main Options

alignment clusters main options.png

Represent cluster by:
If you produce an alignment from clusters, you can choose how strict to be when determining which sequences from each cluster are included in the alignment. The options are:

Majority Sequence Only
This will take only the most common sequence for the region(s) from each cluster. If you select 10 clusters, this means that the alignment produced will have 10 sequences. Only the clustered region is aligned.
Threshold by Count (inexact clusters only)
This will take only the sequences contained within each cluster that have met a threshold number. Only the clustered region is aligned.
To get an idea of what to set this value to, you can look at the Cluster Contents (Top 100) column. In the first cluster, the HCDR3 sequence ASYYYGSSSFAY was found 9 times, while another sequence in this cluster, AMYYYGSSSLFAY, was found 7 times. Setting the threshold by count to 7 will include both of these sequences in the resulting alignment, as well as any HCDR3 sequences in the other clusters that have a count of 7 or greater.
Threshold by Frequency (inexact clusters only)
This will take only the sequences contained within each cluster that have met a threshold frequency. Only the clustered region is aligned. To get an idea of what to set this value to, you can look at the Cluster Contents % (Top 100) column. In the first cluster, the sequence ASYYYGSSSFAY makes up 16.4% of the HCDR3 sequences within that cluster. Other clusters have a more "dominant" sequence, for example ARWEYYAMDY makes up 94.7% of the HCDR3 sequences within the second cluster. Setting the threshold by frequency to 16% will include any sequences within each cluster that make up at least 16% of their corresponding cluster.
All Sequences (inexact clusters only)
This will align every unique sequence contained within each cluster, up to a maximum of 100 per cluster. Only the clustered region is aligned. For example, if the top 10 clusters for HCDR3 have been selected as shown below, a total of 59 sequences will be aligned (this is the sum of the unique sequence counts across all the selected clusters).
Representative Sequence (by Liability Score)
If this option is selected, you can choose to align a different region of the sequence, not just the clustered region. The best ranked sequence from within each cluster (each row) according to liability score will be aligned. Antibody Sequence Liabilities needs to be turned on when running an annotation for this option to work.
- To align multiple regions, hold down command (macOS) or control (PC) and click on the region(s) you would like to align. This will concatenate all the selected regions for each sequence, and then show the resulting alignment.
- The help text below the selector will show you what order your regions will appear in, and you can see how many sequences contain the region beside the region name. Sequences that don't have that region will be skipped, and won't appear in the resulting alignment.

Note: if multiple sequences have the same lowest score, the sequence with the lowest ID will be selected.
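
The majority, count, frequency and all-sequences rules described above can be sketched as a single Python function. This is only an illustration under stated assumptions: the function name sequences_to_align is hypothetical, a cluster is modelled as a mapping from each unique region sequence to its count, the frequency threshold is given as a percentage, and the 100-sequence cap is assumed to keep the most frequent sequences. The first two entries of the toy cluster use the counts quoted above; the third entry is invented so the example has a total to work with.

```python
def sequences_to_align(cluster, mode, threshold=None, cap=100):
    """Choose which unique sequences from one cluster go into the alignment.

    cluster maps each unique region sequence to its count in the cluster.
    """
    # Stable sort: sequences with equal counts keep their original order.
    ranked = sorted(cluster.items(), key=lambda item: item[1], reverse=True)
    total = sum(cluster.values())
    if mode == "majority":
        return [ranked[0][0]]
    if mode == "count":
        return [seq for seq, n in ranked if n >= threshold]
    if mode == "frequency":
        return [seq for seq, n in ranked if 100 * n / total >= threshold]
    if mode == "all":
        return [seq for seq, _ in ranked[:cap]]
    raise ValueError(f"unknown mode: {mode}")


# Toy cluster: the third sequence and its count are hypothetical.
cluster = {"ASYYYGSSSFAY": 9, "AMYYYGSSSLFAY": 7, "ASYYYGSSSLAY": 4}
print(sequences_to_align(cluster, "majority"))
print(sequences_to_align(cluster, "count", threshold=7))
print(sequences_to_align(cluster, "frequency", threshold=40))
```

Running the function once per selected cluster and pooling the results gives the set of sequences that would be aligned under each option.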

Alignment Algorithm
The dropdown allows you to select the alignment algorithm you wish to use. We currently support the iteration-based alignment method MUSCLE (multiple sequence comparison by log-expectation) and MAFFT (Multiple Alignment using Fast Fourier Transform).

MAFFT will be chosen automatically if you select greater than 1,000 sequences as it performs much faster on large datasets.

Visualisation Options

cluster align visualization options.png

Build tree from alignment with:

RAxML (Randomized Axelerated Maximum Likelihood)
Geneious Tree Builder (Neighbour-joining algorithm, Saitou & Nei 1987).

Viewing Alignments

If you select multiple regions, these will always align internally. In the example below, this means that all the Heavy CDR1s will align together, all the CDR2s will align together, and so on. You will never get a residue from CDR1 aligning in the same position as a residue from CDR2.

If some of your sequences are missing one or more of your selected regions, this will be indicated by a region not found annotation (see below). Sequences that do not have any of the specified regions will be skipped and won't show in the alignment.

Screen Shot 2020-07-13 at 8.08.21 PM.png

Sequence Logo

You can also view the sequence logo for any alignment by switching to the Sequence Logo tab:

Screenshot 2023-02-03 at 11.13.36 AM.png

To the right of the Sequence Logo, you can select to plot the amino acids by Frequency or Entropy, and colour the amino acids by a variety of options. To learn more about these options see Identifying variation within regions.
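
For readers unfamiliar with the two plotting modes, the following Python sketch computes per-column residue frequencies and Shannon entropy (in bits) for a small alignment. It is an assumption that the Entropy option is based on Shannon entropy, and the exact scaling and gap handling used by the Sequence Logo viewer are not documented here. The function name column_stats and the choice to ignore gaps are hypothetical.

```python
import math
from collections import Counter


def column_stats(alignment):
    """Return (frequencies, entropy in bits) for each alignment column."""
    stats = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # assumption: gaps ignored
        counts = Counter(residues)
        total = sum(counts.values())
        freqs = {r: n / total for r, n in counts.items()}
        entropy = sum(-p * math.log2(p) for p in freqs.values())
        stats.append((freqs, entropy))
    return stats


# Toy alignment of four equally long, already aligned sequences.
alignment = ["ARW", "ARY", "A-W", "ARY"]
for position, (freqs, entropy) in enumerate(column_stats(alignment), start=1):
    print(position, freqs, round(entropy, 3))
```

A fully conserved column has an entropy of zero, while a column split evenly between two residues has an entropy of one bit, so higher values indicate more variation at that position.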

Visualizing assay data/metadata

It is possible to view any of the information from the table alongside your sequences in the Alignment view. If you aligned from the All Sequences table, then all the columns from the All Sequences table will be available, including Sequence name, notes, labels and Assay Data.

To include external assay or metadata in the alignment document as a heatmap, you will need to first import the assay data into the All Sequences result table prior to alignment: Adding Assay Data to your Analysis Results.

Upon importing assay data, select the sequences you would like to align and proceed with the alignment operation as described in the sections above.

To view the assay data in the tree or alignment view, click Sidebar in the right Sequence Viewer panel. This will bring up the chosen Alignment Metadata specified in your Alignment Options.

Note: if metadata is not showing, check that Show values for metadata is selected.

example alignment assay data.png

The above image shows an alignment with a tree, as well as the Tm, ELISA, BVP ELISA and Liability Score as calculated by Geneious Biologics.

Numerical columns will show as heatmaps
Text columns will simply show the relevant word(s).
Boolean (yes/no) columns, including labels, will show as ticks.

The metadata columns can be resized horizontally by hovering the mouse pointer over the metadata column headers and dragging. The metadata headers can also be resized vertically by dragging the bottom of the column header down.

Aligned sequences can be sorted by metadata values, by clicking on the desired Metadata header. When you sort on a metadata column, the tree will be hidden. To get the tree back, simply click twice on the metadata header to un-sort. The below image shows an alignment sorted by descending ELISA values:

Screenshot 2023-10-10 at 4.48.24 PM.png

Note: When looking at alignments with combined duplicate sequences, there will be additional metadata options to turn on columns for statistics such as Total combined sequences. Combined (duplicate) sequences can contain assay/metadata information if an appropriate Represent values as option has been specified (e.g. most common). See the section on Alignment Options for how to represent metadata.

Circular tree viewer

To view the Circular tree view, click on the Tree View tab:

Heat maps with assay data example.png

The above Tree view was generated from an alignment on the Heavy CDR3 sequences from Sanger Tutorial 2. To customise the Tree, use the right panel options:

Color Based On:
- Colors the tree based on the field selected in the dropdown. The tree above is colored by the Heavy V Gene family.
Color Palette:
- Allows you to select from a variety of color schemes for the above
Show Legend
- This allows you to toggle whether to display a legend of the selected assay data or other data
Legend to Show:
- Select data to display as the legend. In the above picture, the legend displays the color scale for median ELISA values
Show Tip Labels
- This allows you to toggle whether the tip labels show
Tip Labels:
- Select what the aligned sequences are represented by, the default is the Name. Metadata/added assay data can be selected here.
Max Label Length
- Determines the number of characters that are shown in the label. This can be useful for very long labels.
Branch Transform
- Options include: Cladogram (default), Equal and No Transform
Show Heatmap
- This allows you to toggle whether to display heatmaps along the outside of the tree (cladogram only)
Heatmap Rows:
- The blue + Add button allows you to add up to three rows of assay data or other data
- Rows can be edited by first expanding the card using the arrow at the top left and then selecting the data and color scale from the drop-downs that appear.

Exporting Sequences from Alignments

Alignments and aligned sequences can be exported via the Export dropdown at the top left of the document view:

export alignment.png

Alignments can be exported as:

Full Document - this exports the alignment itself.
- Geneious
- Genbank
- Fasta
- Fasta compressed
- Newick
Image (.png)
Table
- This option will export your alignment as a table document (.xlsx or .csv) with each residue of a sequence in a single cell. The consensus can also be exported, along with any metadata columns. Any sorting on the alignment (e.g. on ascending assay data values) will be preserved in the table.
Selected Sequences
- This will allow you to export the selected sequences in the alignment. There are a couple of options for what can be exported:
  - The current aligned region
  - In some cases, the original full heavy/light chain that the aligned sequence came from. Further support for this will be coming in the future.
- Depending on which settings were used in the alignment, the table rows for the parent sequences may be able to be exported as well.
- See more on this here: Exporting Annotated Sequences and Sequence Tables
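
To give a feel for the Table export, where each residue occupies its own cell, the Python sketch below writes a small alignment to CSV text in that shape. The column layout (name, metadata columns, then one numbered column per alignment position) and the function name alignment_to_csv are assumptions for illustration. The file produced by Geneious Biologics may be laid out differently.

```python
import csv
import io


def alignment_to_csv(names, aligned, metadata=None):
    """Write aligned sequences as CSV text with one residue per cell."""
    metadata = metadata or {}
    meta_cols = sorted({key for row in metadata.values() for key in row})
    width = len(aligned[0])  # all aligned sequences share the same length
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Name", *meta_cols, *range(1, width + 1)])
    for name, seq in zip(names, aligned):
        meta = metadata.get(name, {})
        writer.writerow([name, *[meta.get(col, "") for col in meta_cols], *seq])
    return buffer.getvalue()


# Hypothetical two-sequence alignment with one assay column.
text = alignment_to_csv(
    ["seq1", "seq2"],
    ["AR-", "ARW"],
    {"seq1": {"ELISA": 1.2}, "seq2": {"ELISA": 0.4}},
)
print(text)
```

Rows are written in the order given, which mirrors the note above that any sorting applied to the alignment is preserved in the exported table.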

Alignment and Tree Algorithms

We currently support the iteration-based alignment method MUSCLE (multiple sequence comparison by log-expectation) and MAFFT (Multiple Alignment using Fast Fourier Transform).

MAFFT will be selected automatically if you select greater than 1,000 sequences as it performs much faster on large datasets.

The tree builder algorithms currently available are: RAxML (Randomized Axelerated Maximum Likelihood) and Geneious (Neighbour-joining algorithm, Saitou & Nei 1987).

Advanced Alignment Options

alignment all seq advanced.png

Genetic Code:
The available genetic codes are obtained from NCBI.

Consider Alternative Start Codons:

Auto-detect: Recommended. This option will use either the codon_start qualifier or translate from the beginning of the selected region.
Always consider: Alternative start codons are translated as M regardless of annotation type
Always ignore: Alternative start codons are not translated as M
Use Frame: Manually select the frame for translation

Align entire sequence
Only available when aligning from the All Sequences Table.
This operation will align all the selected sequences from the 5' end to the 3' end, disregarding any annotations. If you choose to translate, you can specify a translation frame below (relative to the start of the sequence).

Use frame:
Specify which frame to start translation across the region (1, 2 or 3).

Aligning Raw Sequences

To align sequences directly from the Files table, (1) select more than one sequence, (2) click Post-processing, and (3) select Align in the dropdown.

You will always be able to produce an alignment of your entire sequences (whether your sequence is nucleotide or protein). See the above sections for more on Alignment Options.