Clustering options

Updated: February 01, 2024 03:05

This article outlines the different options available for creating custom clusters when using the Antibody Annotator, NGS Antibody Annotator or Single Cell Antibody Analysis tool. If you are unsure what a cluster is, see Understanding "Clusters".

Overview

It is possible to cluster up to six regions or genes (FR1, CDR3, Heavy D Gene etc.) together. Sequences can be grouped into clusters based on shared identity across sequences in the regions selected, or in the case of genes, sequences that have the same "best-fit" germline gene. Using the Advanced clustering options, it is also possible to allow mismatches in a region up to a specified percent threshold or count and to cluster based on amino acid similarity. Below are the default clustering options:

Screen Shot 2022-02-17 at 4.22.57 PM.png

Adding simple clusters

To add new single-region clusters, click the blue "Plus" icon in the Clustering Options box. This will bring up the Quick Select menu, where you can specify a single region to generate clusters for.

Screen_Shot_2022-08-11_at_4.59.48_PM.png

In the example above, adding this Heavy FR4 cluster option will generate groups of sequences or "clusters" in your annotation result that have identical Heavy FR4 amino acid sequences.
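As a rough illustration of what an exact single-region cluster means, the sketch below groups records by their Heavy FR4 amino acid sequence. The record layout, the function name `cluster_exact` and the sequences are made up for this example and are not part of Geneious Biologics.

```python
from collections import defaultdict


def cluster_exact(records, region):
    """Group record names by the exact sequence found in one region."""
    clusters = defaultdict(list)
    for record in records:
        # Records with an identical region sequence share a dictionary key.
        clusters[record[region]].append(record["name"])
    return dict(clusters)


# Hypothetical annotated records; only the clustered region is shown.
records = [
    {"name": "seq1", "Heavy FR4": "WGQGTLVTVSS"},
    {"name": "seq2", "Heavy FR4": "WGQGTLVTVSS"},
    {"name": "seq3", "Heavy FR4": "WGQGTMVTVSS"},
]

clusters = cluster_exact(records, "Heavy FR4")
for sequence, names in clusters.items():
    print(sequence, names)
```

Here seq1 and seq2 fall into one cluster because their Heavy FR4 sequences are identical, while seq3 differs by one residue and forms its own cluster.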

Adding advanced clusters (combo-regions, identity and similarity clusters)

To add new clusters with the advanced options, click the blue "Plus" icon in the Clustering Options box and then select the Advanced tab, as shown below. To select multiple regions and/or genes, hold down the Command/Control key on your keyboard while selecting.

Screen_Shot_2022-02-23_at_10.13.29_AM.png

As you add clusters, the Cluster Name will be automatically generated. You can change this to a name of your choosing by editing the Cluster Name box. The clustering can be performed based on the amino acid or nucleotide sequences.

The Clustering Method setting specifies whether clusters are based on exact identity across sequences in the selected regions, or whether mismatches up to a specified threshold or count are permitted. Clustering based on shared amino acid characteristics (polarity etc.) is also possible via similarity clustering. These clustering methods are described below.

Saving different clusters for different datasets

Geneious Biologics allows you to save Profiles which can be used to record and re-run alternative settings depending on the dataset. This means that you can specify different clusters and other settings depending on what dataset you are working with.

Profiles can be saved and applied at the bottom of our Annotation analysis pipelines:

apply or save profile.png

Clonotype cluster example

Screen_Shot_2022-02-25_at_3.28.37_PM.png

In the above example, a cluster will be generated called "Heavy CDR3 V Gene J Gene (85% Identity on Heavy CDR3)" when the annotation result has finished running. This cluster will group sequences together based on common germline gene ancestry in the Heavy V and J genes, while allowing for some difference across the CDR3 sequence (at least 85% identity).
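The idea behind this kind of clonotype cluster can be sketched as follows: sequences must share the same best-fit Heavy V and J genes, and their Heavy CDR3 sequences must be the same length and at least 85% identical. This is a simplified greedy grouping written for illustration only. The record fields, function names and sequences are assumptions, and the way Geneious Biologics actually builds and merges clusters is described later in this article.

```python
def percent_identity(a, b):
    """Percent of matching positions; 0 if lengths differ or input is empty."""
    if len(a) != len(b) or not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def clonotype_clusters(records, min_identity=85.0):
    """Greedy sketch: same V gene, same J gene, CDR3 identity >= threshold."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            same_genes = (cluster["v"], cluster["j"]) == (rec["v_gene"], rec["j_gene"])
            if same_genes and percent_identity(cluster["rep"], rec["cdr3"]) >= min_identity:
                cluster["members"].append(rec["name"])
                break
        else:
            # No existing cluster matched, so this record starts a new one.
            clusters.append({"v": rec["v_gene"], "j": rec["j_gene"],
                             "rep": rec["cdr3"], "members": [rec["name"]]})
    return clusters


# Hypothetical records for illustration.
records = [
    {"name": "seq1", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "cdr3": "ARDYYGSSYWYFDV"},
    {"name": "seq2", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "cdr3": "ARDYYGSSYWYFDY"},
    {"name": "seq3", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3": "ARDYYGSSYWYFDV"},
    {"name": "seq4", "v_gene": "IGHV1-69", "j_gene": "IGHJ4", "cdr3": "ARGGGGSSYWYFDV"},
]

result = clonotype_clusters(records)
for cluster in result:
    print(cluster["v"], cluster["j"], cluster["members"])
```

In this example seq2 joins seq1 (one CDR3 mismatch in 14 residues), seq3 is kept apart because its V gene differs, and seq4 is kept apart because three CDR3 mismatches in 14 residues fall below 85% identity.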

If you are unsure what a clonotype is, see Understanding Clonotypes.

Cluster method overview

Exact: 100% match in sequence across regions (either amino acids or nucleotides).
- Note that for clustering on genes this is the only option, as sequences identified as best-matched to the same germline gene will be clustered together, but may contain differences in the nucleotide/amino acid sequence.
Identity (by percent): Allows mismatches in either the amino acids or nucleotides across a region. When selected, the threshold can be set (as a percentage) and the region it applies to can be specified (mismatches can be allowed in one or all of the regions/genes).
Identity (by count): Allows a discrete number of mismatches in the specified region(s), ranging from 1 to 5 amino acids or bases. This can be used to ensure that, for example, only 1 mismatch is allowed, no matter the length of the regions selected.
Similarity (BLOSUM): Amino acids only. Groups regions of different sequences based on the similarity of their amino acids (polarity, hydrophobicity etc.) via the BLOSUM matrix. The percentage threshold to be met can be specified.

Note: The default cluster combinations can be edited or removed. To restore them, click the Reset to default button on the bottom left of the Antibody Annotator popup.

Clustering options in depth

Identity (by percent)

Clustering is only performed on sequences/regions of identical length. Sequences are first assigned to clusters which consist of only identical sequences. Then, starting with a maximum of 1 difference and increasing up to the maximum number of amino acid differences allowed by the threshold setting, the following steps are repeated:

Clusters are sorted by decreasing size. Clusters of equal size are sorted so that less diverse clusters (i.e. containing fewer different sequences) are first. When still tied, clusters are sorted alphabetically.
A new set of clusters is then built by iterating over the above sorted clusters.
Among the newly generated clusters, find those whose representative sequence differs by at most maxDifferences (as defined by the threshold) from the representative sequence of the initial parent cluster.
- If the above condition is met, merge the two clusters together and add to the set of new clusters
- If the above is not met, this cluster is not added to the wider cluster
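The steps above can be sketched in Python as follows. This is one possible reading of the description, not the product's actual implementation: the function names, the cluster dictionaries, the choice of the first sequence as the cluster representative, and the use of a simple position-by-position mismatch count are all assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_identity(sequences, max_allowed):
    """Iteratively merge clusters, relaxing the allowed differences from 1 up."""
    # Start with clusters that contain only identical sequences.
    counts = Counter(sequences)
    clusters = [{"rep": s, "size": n, "members": {s}} for s, n in counts.items()]

    for max_diff in range(1, max_allowed + 1):
        # Largest first, then least diverse, then alphabetical.
        clusters.sort(key=lambda c: (-c["size"], len(c["members"]), c["rep"]))
        merged = []
        for cluster in clusters:
            for parent in merged:
                same_length = len(parent["rep"]) == len(cluster["rep"])
                if same_length and hamming(parent["rep"], cluster["rep"]) <= max_diff:
                    # Close enough to an existing representative: merge.
                    parent["size"] += cluster["size"]
                    parent["members"] |= cluster["members"]
                    break
            else:
                # Not within range of any representative: keep it separate.
                merged.append(cluster)
        clusters = merged
    return clusters


# Hypothetical CDR3-like sequences.
reads = ["CARDY"] * 3 + ["CARDF"] * 2 + ["CAKDF", "CTTTT"]
one_diff = cluster_by_identity(reads, max_allowed=1)
two_diff = cluster_by_identity(reads, max_allowed=2)
print([(c["rep"], c["size"]) for c in one_diff])
print([(c["rep"], c["size"]) for c in two_diff])
```

With at most 1 difference, CARDF merges into the larger CARDY cluster, but CAKDF stays separate because it is compared with the representative CARDY (2 differences), not with CARDF. Allowing 2 differences brings CAKDF into the same cluster.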

Identity (by count)

Clusters are sorted by decreasing size. Clusters of equal size are sorted so that less diverse clusters (i.e. containing fewer different sequences) are first. When still tied, clusters are sorted alphabetically.
A new set of clusters is then built by iterating over the above sorted clusters.
Among the newly generated clusters, find those whose representative sequence differs by at most the maximum number of differences (as defined by the count setting) from the representative sequence of the initial parent cluster.
- If the above condition is met, merge the two clusters together and add to the set of new clusters
- If the above is not met, this cluster is not added to the wider cluster
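The practical difference between the count and percent settings is how the maximum number of differences is obtained. The sketch below compares the two for regions of different lengths. The function names are hypothetical, and rounding the percent-based limit down to a whole number is an assumption: this article does not state how the percentage is converted.

```python
def max_differences_by_percent(region_length, identity_percent):
    """Mismatch limit for a percent threshold (ASSUMPTION: rounded down)."""
    # Integer arithmetic avoids floating point rounding surprises.
    return region_length * (100 - identity_percent) // 100


def max_differences_by_count(region_length, count):
    """Mismatch limit for a count setting: independent of region length."""
    return count


# Compare an 85% identity threshold with a fixed count of 1.
comparison = [
    (length,
     max_differences_by_percent(length, 85),
     max_differences_by_count(length, 1))
    for length in (8, 14, 20)
]
for length, by_percent, by_count in comparison:
    print(f"length {length}: percent allows {by_percent}, count allows {by_count}")
```

Under this assumption the percent-based limit grows with the region length, while the count-based limit stays at 1 for every length.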

Similarity (BLOSUM matrix)

The same process as described above for percent identity clusters is used for similarity clusters, except that the number of differences between two amino acids is defined as the BLOSUM62 score reduction (see the matrix below) from the maximum of a perfect match for either amino acid, summed over the mismatched positions.

Screen_Shot_2022-10-21_at_10.00.03_AM.png

For example, for a Leu/Lys mismatch the BLOSUM62 entry is -2, while a Leu/Leu match scores 4 and a Lys/Lys match scores 5. The maximum difference between a match and the mismatch is therefore 7 (from Lys/Lys = 5 to Leu/Lys = -2), so a Leu/Lys mismatch scores 7 'differences'. A Leu/Ile mismatch has a score reduction of 2 (from Leu/Leu = 4 to Leu/Ile = 2). This means two sequences with 3 Leu to Ile mismatches (6 'differences') will be clustered together before two sequences with just a single Leu to Lys mismatch (7 'differences').
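The arithmetic in this example can be reproduced with a short sketch. Only the handful of BLOSUM62 entries needed here are included, and the function names are hypothetical. The formula used, the larger of the two self-match scores minus the mismatch score, is an interpretation of the description above rather than a confirmed implementation detail.

```python
# Subset of BLOSUM62 entries, keyed by one-letter amino acid codes.
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("K", "K"): 5, ("I", "I"): 4,
    ("W", "W"): 11, ("V", "V"): 4,
    ("L", "K"): -2, ("L", "I"): 2, ("W", "V"): -3,
}


def score(a, b):
    """Look up a symmetric BLOSUM62 entry (raises KeyError if not in subset)."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def residue_differences(a, b):
    """Score reduction from the better of the two perfect matches."""
    if a == b:
        return 0
    return max(score(a, a), score(b, b)) - score(a, b)


def sequence_differences(seq1, seq2):
    """Sum of per-position 'differences' for two equal-length sequences."""
    return sum(residue_differences(a, b) for a, b in zip(seq1, seq2))


print(residue_differences("L", "K"))       # Leu/Lys
print(residue_differences("L", "I"))       # Leu/Ile
print(sequence_differences("LLL", "III"))  # three Leu to Ile mismatches
```

Three Leu to Ile mismatches total 6 'differences', which is less than the 7 scored by a single Leu to Lys mismatch.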

The similarity threshold percentage is transformed into a maximum number of 'differences' by comparing the difference between a match and mismatch score with the mean score difference over all pairwise amino acid comparisons. The mean score difference is 8.21, so, for example, 90% similarity over a 20 amino acid sequence allows up to 16.42 'differences' (10% of 20 positions, multiplied by 8.21). This would allow up to 8 Leu/Ile mismatches (each scoring 2 'differences'), or 2 Leu/Lys mismatches (each scoring 7 'differences'), or a single Trp/Val mismatch (scoring 14 'differences').
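The worked numbers above are consistent with the following calculation. The formula is inferred from the example (fraction of positions allowed to differ, multiplied by the region length and the mean score difference of 8.21) and the function name is hypothetical, so treat this as a sketch rather than the exact conversion used by the software.

```python
MEAN_SCORE_DIFFERENCE = 8.21  # value quoted in this article


def max_blosum_differences(region_length, similarity_percent):
    """Convert a similarity threshold into a budget of BLOSUM 'differences'."""
    allowed_fraction = (100 - similarity_percent) / 100
    return allowed_fraction * region_length * MEAN_SCORE_DIFFERENCE


budget = max_blosum_differences(20, 90)
print(round(budget, 2))

# How many mismatches of each kind fit in the budget (costs from the text).
for label, cost in (("Leu/Ile", 2), ("Leu/Lys", 7), ("Trp/Val", 14)):
    print(label, int(budget // cost))
```

The budget works out to 16.42, which covers 8 Leu/Ile mismatches, 2 Leu/Lys mismatches or 1 Trp/Val mismatch, matching the figures in the text.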

This method is available for clustering of amino acid regions only.

Gene clusters

When clustering by Genes, it is important to understand that these behave differently to region-based clusters. A gene cluster represents all sequences or reads that 'matched' a particular gene. It does not necessarily mean that the gene was a good representation of the read, simply that the gene was the closest match amongst the reference sequences.

'Identity' and 'Similarity' options are not available for gene clusters. When used for combinations that include genes, the Identity and Similarity options apply to the sequence regions in the chosen combination, but do not influence the way genes are handled.