Identifying variation within regions

Updated: December 19, 2024 21:11

This article outlines how to explore and plot amino acid variation by position within regions, such as the FRs, CDRs and even the entire V(D)J region. There are multiple ways to access these plots in Geneious Biologics.

What variation are you interested in?

If you would like to view the variation across a single region (e.g. the HCDR3) but include CDR3s of different lengths, there are a couple of ways to do this:
- Network and Tree plots: Geneious Biologics automatically produces these plots on regions like the V(D)J and CDR3:
  
  These plots are generated across a maximum of 1000 clusters for a given region and aim to plot the majority of unique sequences in your dataset. See the Network and Tree plots section for how to access these.
- Alignments: To investigate the variability across a group of selected sequences or clusters, we recommend running an Alignment which also produces a Sequence Logo.
If you are interested in viewing the variation found within a single inexact cluster (e.g. HCDR3 clustered at 85% identity), you can view the Sequence Logo from the cluster table.
If you would like to view the variation across the whole dataset for all regions of a fixed length (say, 11-residue HCDR3s), you can view the Amino Acid Distribution Chart.

Network and Tree plots

The Network and Tree plots are available pre-computed on an annotation result. First, navigate to your cluster table of interest - e.g. Heavy CDR3. Then bring up the lower panel which contains the Graphs tab. The dropdown selector enables you to select the graph of interest:

finding graphs.png

Sequence Logo

The sequence logo represents the proportion of each amino acid found at a given position in a region across a set of sequences.
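To make the idea concrete, the per-position proportions behind a logo can be sketched in a few lines of Python. This is only an illustration of the concept, not how Geneious Biologics computes its plots. The function name `position_frequencies` is hypothetical, and the example sequences are made-up variants of the ASYYYGSSSFAY CDR3 discussed later in this article. It assumes all sequences have the same length (for example, after alignment).

```python
from collections import Counter


def position_frequencies(sequences):
    """Return one dict per position mapping residue -> fraction of sequences.

    Hypothetical helper for illustration; assumes equal-length sequences.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must share one length (align them first)")
    total = len(sequences)
    freqs = []
    # zip(*sequences) walks the sequences column by column.
    for column in zip(*sequences):
        counts = Counter(column)
        freqs.append({aa: n / total for aa, n in counts.items()})
    return freqs


# Made-up example cluster: one reference CDR3 plus three single-residue variants.
example = ["ASYYYGSSSFAY", "ASYYYGSTSFAY", "ARYYYGSSSFAY", "ASYYYGSSSFDY"]
freqs = position_frequencies(example)
print(freqs[0])  # first position is fully conserved
print(freqs[1])  # second position is split between S and R
```

In a logo plotted by frequency, each residue at a position would be drawn in proportion to these fractions.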

Screenshot_2023-02-22_at_11.42.58_AM.png

The sequence logo can be accessed in two ways: for individual inexact clusters (below) or for an alignment document (following section).

Sequence logos for Inexact Clusters

Inexact clusters include both identity and similarity based clusters. If you are unsure what a cluster refers to, see Understanding "Clusters". You can also Add Clusters to your Results if you do not have an inexact cluster available.

To access the Sequence Logo for the region of a similarity or identity based cluster:

Navigate to an inexact cluster table of your choice in a Biologics Annotator Result - Heavy CDR3 (85% Similarity) in the example below.
Select a single cluster. This will display all the sequences contained in that cluster in the Sequence Viewer below.
Switch from the Sequence Viewer tab to the Sequence Logo tab.

Screenshot 2023-03-07 at 3.19.53 PM.png

In the example above, the largest cluster for the Heavy CDR3 region (with an 85% similarity threshold) has been selected. This cluster contains all the sequences in the dataset that had a Heavy CDR3 of ASYYYGSSSFAY or a sequence at least 85% similar. Using the sequence logo we can see that certain residues in this Heavy CDR3 cluster are highly conserved, while other residues show greater variation.

All plots can be exported as images or .csv files using the Export dropdown in the top left.

Note: you can only view the contents of one cluster at a time with the sequence logo. If you would like to see the Sequence Logo for multiple clusters with regions of varying length, you can perform an alignment. This is outlined below.

Sequence logos for Alignments

Performing an alignment before viewing the Sequence Logo allows you to compare the variation in amino acids across regions of varying lengths. To perform an alignment on the sequenced region of each cluster, first navigate to your cluster table of choice and select the clusters you would like to include in your alignment.

In the below example, the Heavy CDR3 (85% Similarity) cluster table has been used.

Screenshot 2023-03-08 at 9.53.31 PM.png

A filter has been applied to find only those clusters that had a total number of sequences greater than 15. You can learn more about filtering here: Filtering your Sequences. The 2nd largest cluster containing frameshifted sequences has been de-selected.

To align the sequence(s) in these clusters:

Go to Post-processing (highlighted in orange above)
Select Align... from the dropdown
Choose the relevant options (see our main Sequence Alignment article) and click Run

After the alignment job has completed, open the alignment document. To view the sequence logo for the alignment, click on the Sequence Logo tab, as shown below:

alignment sequence logo.png

All plots can be exported as images or .csv files using the Export dropdown in the top left.

Amino Acid Distribution Charts

Graphs are populated for all Annotator Result Documents, and will display under the Sequences Table automatically. To learn more about the graphs produced by Geneious Biologics see:

The plots are also accessible in a larger view via the Graphs tab of any Biologics Annotator Result document:

Screenshot 2023-05-23 at 12.34.52 AM.png

It might first be useful to look at the Cluster Lengths graph to determine the most common length for the region you are interested in. In the below example, we can see that the most common VDJ region length is 119 residues.

Screenshot 2023-03-08 at 10.36.56 PM.png
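If you are working with region sequences outside the application, the same "find the most common length, then keep only that length" step can be sketched as below. This is a minimal illustration under stated assumptions: the helper names `most_common_length` and `sequences_of_length` are hypothetical, and the input is assumed to be a plain list of amino acid strings rather than any particular Geneious Biologics export format.

```python
from collections import Counter


def most_common_length(sequences):
    """Return (length, count) for the most frequent sequence length."""
    lengths = Counter(len(s) for s in sequences)
    return lengths.most_common(1)[0]


def sequences_of_length(sequences, length):
    """Keep only the sequences with exactly the requested length."""
    return [s for s in sequences if len(s) == length]


# Made-up toy data; real region sequences would be much longer.
toy = ["AAA", "AAAA", "AAAA", "AA"]
length, count = most_common_length(toy)
fixed = sequences_of_length(toy, length)
print(length, count, fixed)
```

Restricting to one length matters because a per-position distribution is only meaningful when every sequence has a residue at every position.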

We can then navigate to the Amino Acid Distribution Chart via the Graph Type: dropdown and specify a VDJ region length of 119. The chart below is colored according to Polarity.

Screenshot 2023-03-08 at 10.44.41 PM.png

All plots can be exported as images or .csv files using the Export dropdown to the right of the graph dropdown.

Coloring and plotting options

Plot by (Sequence Logo only)

Frequency
- Each position is given the same weighting, and each residue is shown in proportion to the fraction of sequences it makes up at that position.
Entropy
- Entropy is a metric that quantifies uncertainty, with more variation at a given position contributing to higher entropy. The calculation used is Shannon's entropy. Positions that are highly conserved (low entropy) will appear larger, while positions with more variation (high entropy) will appear smaller.
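For reference, Shannon's entropy of a single position can be computed as in the sketch below. A fully conserved position has an entropy of 0 bits, and the value grows as the residues become more evenly mixed. The `information_content` helper reflects a scaling that many sequence logo tools use (maximum possible entropy for 20 amino acids minus the observed entropy). Whether Geneious Biologics uses exactly this scaling is an assumption, as are both function names.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one position."""
    counts = Counter(column)
    total = len(column)
    # Sum of p * log2(1/p) over the observed residues.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def information_content(column, alphabet_size=20):
    """Assumed logo scaling: maximum entropy minus observed entropy."""
    return math.log2(alphabet_size) - shannon_entropy(column)


conserved = "AAAA"   # one residue only
split = "AASS"       # two residues, evenly mixed
varied = "ACDE"      # four residues, evenly mixed

for name, col in [("conserved", conserved), ("split", split), ("varied", varied)]:
    print(name, shannon_entropy(col), round(information_content(col), 3))
```

Under this scaling the conserved column gets the tallest stack and the varied column the shortest, which matches the behavior described for the Entropy option.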

Colour by (Sequence Logo and Amino Acid Distribution Chart)

Default
- Geneious Biologics color scheme for proteins
Geneious
- Default amino acid colors used in Geneious Prime
Rasmol
- The Rasmol scheme colors amino acids according to traditional amino acid properties. Amino acids associated with the outer surface of a protein are given bright colors and non-polar residues are darker. Most colors are hallowed by tradition.
```
Bright red: D, E
Blue: K, R
Mid blue: F, Y
Light grey: G
Dark grey: A
Pale blue: H
Yellow: C, M
Orange: S, T
Cyan: N, Q
Green: L, V, I
Pink: W
Flesh: P
```
Hydrophobicity
- This colors amino acids from red through to blue according to their hydrophobicity value, where red is the most hydrophobic and blue is the most hydrophilic. The values of the color scale are given in the figure below. These values are taken from Expasy:

Polarity
- This colors amino acids according to their polarity as follows:

Yellow: Non-polar (G, A, V, L, I, F, W, M, P)
Green: Polar, uncharged (S, T, C, Y, N, Q)
Red: Polar, acidic (D, E)
Blue: Polar, basic (K, R, H)
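If you want to apply the same grouping to exported sequences in your own scripts, the polarity classes listed above can be written as a simple lookup. This is a sketch only: the names `POLARITY_CLASSES`, `AA_TO_CLASS` and `polarity_profile` are hypothetical, and the residue groupings are copied from the list above.

```python
# Polarity classes as listed in this article (class name -> residues).
POLARITY_CLASSES = {
    "Non-polar": "GAVLIFWMP",
    "Polar, uncharged": "STCYNQ",
    "Polar, acidic": "DE",
    "Polar, basic": "KRH",
}

# Invert to a residue -> class lookup.
AA_TO_CLASS = {
    aa: name for name, residues in POLARITY_CLASSES.items() for aa in residues
}


def polarity_profile(sequence):
    """Return the polarity class of each residue; unknown symbols map to 'Unknown'."""
    return [AA_TO_CLASS.get(aa, "Unknown") for aa in sequence]


print(polarity_profile("ASYD"))
```

The four classes together cover all 20 standard amino acids, so only gaps or ambiguity codes would be reported as "Unknown".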

Clustal
- This colors amino acids according to their properties and is adapted from Clustal to include acidic residues as follows:
```
Orange: G, P, S, T 
Red: H, K, R 
Blue: F, W, Y 
Green: I, L, M, V 
Purple: D, E
```
Structural AAs
- This colors amino acids F, Y, W, P and G light green.
Cysteines
- This colors cysteines yellow.