This article outlines the different options available for creating custom clusters when using the Antibody Annotator, NGS Antibody Annotator or Single Cell Antibody Analysis tool. To find out more about what a cluster is and why it is useful, see this article.
It is possible to cluster up to six regions or genes (FR1, CDR3, Heavy D Gene etc.) together. Sequences can be grouped into clusters based on shared identity across sequences in the regions selected, or in the case of genes, sequences that have the same "best-fit" germline gene. Using the Advanced Clustering Options, it is also possible to allow mismatches in a region up to a specified threshold and to cluster based on amino acid similarity. Below are the default clustering options:
- Adding Simple Clusters
- Adding Advanced Clustering Options (%Identity and Similarity Clusters)
- Cluster Method Overview
- Clonotype Clustering Example
- Clustering Options in depth
Adding Simple Clusters
To add new single-region clusters, click the blue "Plus" icon in the Clustering Options box. This will bring up the Quick Select menu, where you can specify a single region to generate clusters for.
In the example above, adding this Heavy FR4 cluster option will generate groups of sequences or "clusters" in your annotation result that have identical Heavy FR4 sequences.
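As a rough illustration of what an exact single-region cluster represents, the sketch below groups records by identical Heavy FR4 sequence. The record layout, the `exact_clusters` helper and the example sequences are hypothetical and are not part of the Antibody Annotator itself.

```python
from collections import defaultdict

def exact_clusters(records, region):
    """Group record names by the exact sequence found in one region."""
    clusters = defaultdict(list)
    for rec in records:
        # Records with an identical region sequence land in the same cluster.
        clusters[rec[region]].append(rec["name"])
    return dict(clusters)

# Hypothetical annotated sequences (illustrative values only).
records = [
    {"name": "seq1", "Heavy FR4": "WGQGTLVTVSS"},
    {"name": "seq2", "Heavy FR4": "WGQGTLVTVSS"},
    {"name": "seq3", "Heavy FR4": "WGQGTMVTVSS"},
]

clusters = exact_clusters(records, "Heavy FR4")
for fr4, names in clusters.items():
    print(fr4, names)
```

Here seq1 and seq2 share a cluster because their Heavy FR4 sequences are identical, while seq3 differs at one position and forms its own cluster.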
Adding Advanced Clustering Options (%Identity and Similarity Clusters)
To add new clusters with the advanced options, click the blue "Plus" icon in the Clustering Options box and then select the Advanced tab, as shown below. To select multiple regions and/or genes, hold down the Command/Control key on your keyboard while selecting.
As you add clusters, the Cluster Name will be automatically generated. You can change this to a name of your choosing by editing the Cluster Name box. The clustering can be performed based on the amino acid or nucleotide sequences.
The Clustering Method allows the user to specify whether clusters are based on exact identity across sequences in the regions, or if mismatches up to a specified threshold are permitted. Clustering based on shared amino acid characteristics (Polarity etc.) is also possible. These Cluster Methods are described below.
Cluster Method Overview
- Exact: 100% match in sequence across regions (either amino acids or nucleotides). Note that for clustering on genes this is the only option: sequences best-matched to the same germline gene are clustered together, even though their nucleotide or amino acid sequences may differ.
- Identity: Allows mismatches in either the amino acids or nucleotides across a region. When selected, you can choose the percentage threshold and specify which region it applies to (mismatches can be allowed in one or all of the selected regions).
- Similarity (BLOSUM): Amino acids only. Groups regions of different sequences based on similarity of the amino acids (Polarity, Hydrophobicity etc.) via the BLOSUM matrix. The percentage threshold to be met can be specified.
Note: The default cluster combinations can be edited or removed. To restore them, click the Reset to default button on the bottom left of the Antibody Annotator popup.
Clonotype Clustering Example
In the above example, a cluster will be generated called "Heavy CDR3 V Gene J Gene (85% Identity on Heavy CDR3)" when Antibody Annotator has finished running. This cluster will group sequences together based on common germline gene ancestry in the Heavy V and J genes, while also requiring that their Heavy CDR3 regions share at least 85% identity (that is, some mismatches are allowed).
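The condition behind such a cluster can be sketched as a pairwise check: same best-fit V gene, same best-fit J gene, and CDR3 regions of equal length with at least 85% identity. This is a simplified sketch only; the record fields, function names and example values are hypothetical, and the tool itself compares cluster representative sequences rather than arbitrary pairs.

```python
def percent_identity(a, b):
    """Percent of matching positions, or None if the lengths differ or are zero."""
    if len(a) != len(b) or not a:
        return None
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def could_share_cluster(rec_a, rec_b, threshold=85.0):
    """True if two records meet the V gene / J gene / CDR3 identity condition."""
    # Genes are always compared exactly (same best-fit germline gene).
    if rec_a["v_gene"] != rec_b["v_gene"] or rec_a["j_gene"] != rec_b["j_gene"]:
        return False
    # Only the CDR3 region is allowed to contain mismatches.
    identity = percent_identity(rec_a["cdr3"], rec_b["cdr3"])
    return identity is not None and identity >= threshold

# Hypothetical records (illustrative values only).
a = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3": "ARDYYGSSYWYFDV"}
b = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3": "ARDYYGSSYWYFDI"}  # 1 mismatch in 14
c = {"v_gene": "IGHV1-69", "j_gene": "IGHJ4", "cdr3": "ARDYYGSSYWYFDV"}  # different V gene

print(could_share_cluster(a, b))
print(could_share_cluster(a, c))
```

Records a and b differ at one of 14 CDR3 positions (about 92.9% identity), so they satisfy the condition; record c has an identical CDR3 but a different V gene, so it does not.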
If you are unsure what a clonotype is, see Understanding Clonotypes or watch the video below:
Clustering Options in depth
Clustering is only performed on sequences/regions of identical length. Sequences are first assigned to clusters which consist of only identical sequences. Then, starting with a maximum of 1 difference and increasing up to the maximum number of amino acid differences allowed by the threshold setting, the following steps are repeated:
- Clusters are sorted by decreasing size. Clusters of equal size are sorted so that less diverse clusters (i.e. containing fewer different sequences) are first. When still tied, clusters are sorted alphabetically.
- A new set of clusters is then built by iterating over the above sorted clusters.
- Among the newly generated clusters, find those whose representative sequence has at most maxDifferences (as defined by the threshold) from the representative sequence of the initial parent cluster.
- If the above condition is met, merge the two clusters together and add the result to the set of new clusters.
- If the above is not met, this cluster is not added to the wider cluster
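The steps above can be sketched roughly as follows. This is one possible reading of the description, not the tool's actual implementation: it assumes the representative of a cluster is the sequence it started from, that a cluster merges into the first new cluster whose representative is close enough, and that differences are counted position by position. All names are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def cluster_by_differences(sequences, max_allowed):
    # Start with one cluster per distinct sequence (identical sequences only).
    clusters = [
        {"rep": seq, "members": Counter({seq: n})}
        for seq, n in Counter(sequences).items()
    ]
    # Repeat with 1 difference, then 2, ... up to the allowed maximum.
    for max_diff in range(1, max_allowed + 1):
        # Largest first, then less diverse first, then alphabetical.
        clusters.sort(
            key=lambda c: (-sum(c["members"].values()), len(c["members"]), c["rep"])
        )
        new_clusters = []
        for cluster in clusters:
            # Look for an already-built cluster with a close enough representative.
            target = next(
                (
                    n for n in new_clusters
                    if len(n["rep"]) == len(cluster["rep"])
                    and hamming(n["rep"], cluster["rep"]) <= max_diff
                ),
                None,
            )
            if target is None:
                new_clusters.append(cluster)  # stays a separate cluster
            else:
                target["members"].update(cluster["members"])  # merge
        clusters = new_clusters
    return clusters

# Hypothetical input sequences (illustrative values only).
seqs = ["CARDY"] * 3 + ["CARDF"] * 2 + ["CAKDF", "CTTTT"]
one_diff = cluster_by_differences(seqs, 1)
two_diff = cluster_by_differences(seqs, 2)
print([c["rep"] for c in one_diff])
print([c["rep"] for c in two_diff])
```

With at most 1 difference, CARDF merges into the larger CARDY cluster, while CAKDF stays separate because it is compared with the representative CARDY (2 differences) rather than with CARDF. Allowing 2 differences merges CAKDF as well.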
Similarity (BLOSUM matrix)
The same process as described above for percent identity clusters is used for similarity clusters, except for how differences are counted. The number of differences between two amino acids is defined as the reduction in BLOSUM62 score relative to the maximum perfect-match score for either amino acid, and these reductions are summed across the sequence, as illustrated in the example below.
For example, the BLOSUM62 entry for a Leu/Lys mismatch is -2, while a Leu/Leu match scores 4 and a Lys/Lys match scores 5. The maximum difference between a match and the mismatch is therefore 7 (from Lys/Lys = 5 to Leu/Lys = -2), so a Leu/Lys mismatch counts as 7 'differences'. A Leu/Ile mismatch has a score reduction of only 2 (from Leu/Leu = 4 to Leu/Ile = 2). This means two sequences with 3 Leu to Ile mismatches (6 'differences') will be clustered together before two sequences with just a single Leu to Lys mismatch (7 'differences').
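The per-position calculation in this example can be sketched as below. Only the handful of BLOSUM62 entries needed for the amino acid pairs mentioned in this article are hard-coded; the function and variable names are hypothetical.

```python
# Subset of the BLOSUM62 matrix, keyed by one-letter amino acid codes.
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("K", "K"): 5, ("I", "I"): 4, ("W", "W"): 11, ("V", "V"): 4,
    ("L", "K"): -2, ("L", "I"): 2, ("W", "V"): -3,
}

def blosum_score(a, b):
    """Look up a score; the matrix is symmetric, so try both key orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def position_differences(a, b):
    """Score reduction from the best perfect match of either residue."""
    if a == b:
        return 0
    best_match = max(blosum_score(a, a), blosum_score(b, b))
    return best_match - blosum_score(a, b)

def sequence_differences(seq_a, seq_b):
    """Sum of per-position 'differences' for two equal-length sequences."""
    return sum(position_differences(x, y) for x, y in zip(seq_a, seq_b))

print(position_differences("L", "K"))     # Leu/Lys
print(position_differences("L", "I"))     # Leu/Ile
print(sequence_differences("LLL", "III")) # three Leu to Ile mismatches
```

Under this reading, three Leu to Ile mismatches total 6 'differences', which is less than the 7 'differences' of a single Leu to Lys mismatch.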
The similarity threshold percentage is transformed into a maximum number of 'differences' using the mean score difference between a match and a mismatch over all pairwise amino acid comparisons. The mean score difference is 8.21, so, for example, 90% similarity over a 20 amino acid sequence allows up to 16.42 'differences' (2 positions at 8.21 each). This would allow up to 8 Leu/Ile mismatches (each scoring 2 'differences'), or 2 Leu/Lys mismatches (each scoring 7 'differences'), or a single Trp/Val mismatch (scoring 14 'differences').
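The arithmetic in this example can be reproduced with a short sketch. The formula below is inferred from the worked numbers above (sequence length, times the fraction of positions allowed to differ, times the mean score difference of 8.21); the function names are hypothetical.

```python
MEAN_SCORE_DIFFERENCE = 8.21  # mean match/mismatch score difference quoted above

def max_differences(threshold_percent, length):
    """Convert a similarity threshold into a budget of 'differences'."""
    positions_allowed = length * (100 - threshold_percent) / 100
    return positions_allowed * MEAN_SCORE_DIFFERENCE

def mismatches_allowed(budget, cost_per_mismatch):
    """How many mismatches of a given cost fit within the budget."""
    return int(budget // cost_per_mismatch)

budget = max_differences(90, 20)
print(round(budget, 2))
print(mismatches_allowed(budget, 2))   # Leu/Ile, 2 'differences' each
print(mismatches_allowed(budget, 7))   # Leu/Lys, 7 'differences' each
print(mismatches_allowed(budget, 14))  # Trp/Val, 14 'differences' each
```

Mixed mismatches would draw on the same budget, so one Leu/Lys mismatch (7) plus four Leu/Ile mismatches (8) would still fit within 16.42 under this reading.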
This method is available for clustering of amino acid regions only.
When clustering by Genes, it is important to understand that these behave differently to region-based clusters. A gene cluster represents all sequences or reads that 'matched' a particular gene. It does not necessarily mean that the gene was a good representation of the read, simply that the gene was the closest match amongst the reference sequences.
'Identity' and 'Similarity' options are not available for gene clusters. When used for combinations that include genes, the Identity and Similarity options apply to the sequence regions in the chosen combination, but do not influence the way genes are handled.