This article is about how to view and interpret clusters to aid in understanding your dataset. To find out how to define custom clusters, including how to allow mismatches up to a specified threshold and to cluster based on amino acid similarity, see this article.
"Clusters" is a term we generally use to refer to when we group sequences in a dataset together by a particular region. For example, when viewing result documents, in the "Cluster Table" selection dropdown you can collapse all the annotated sequences together by annotated region. For example if you chose "Heavy CDR3" from the Cluster Table dropdown, this would show you a table where sequences with the exact same CDR3 region sequence get "clustered" together into one row.
In the picture above, 40,104 sequences (1.67% of the total) have a Heavy CDR3 of "ARWEYYAMDY". We refer to this as a single CDR3 "cluster". We generally cluster by the amino acid translations of regions rather than the underlying nucleotides, so the exact nucleotide sequences may differ within a single amino acid cluster.
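The grouping described above can be sketched in a few lines of Python. This is only an illustration of exact-match clustering on an amino acid region, not the software's own implementation; the `reads` data and the `cluster_table` function are hypothetical names made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical annotated reads: (CDR3 nucleotide sequence, CDR3 amino acid translation).
reads = [
    ("GCTAGATGGGAATACTACGCTATGGACTAC", "ARWEYYAMDY"),
    ("GCCAGATGGGAATACTACGCTATGGACTAC", "ARWEYYAMDY"),  # synonymous change, same protein
    ("GCTAGAGGGGACTAC", "ARGDY"),
]


def cluster_table(reads):
    """Group reads by identical amino acid region, largest cluster first."""
    counts = Counter(aa for _, aa in reads)

    # Track the distinct nucleotide sequences behind each amino acid cluster.
    nt_variants = defaultdict(set)
    for nt, aa in reads:
        nt_variants[aa].add(nt)

    total = len(reads)
    return [
        {
            "region": aa,
            "count": n,
            "percent": 100.0 * n / total,
            "nt_variants": len(nt_variants[aa]),
        }
        for aa, n in counts.most_common()
    ]


for row in cluster_table(reads):
    print(row)
```

In this toy data the "ARWEYYAMDY" row contains two reads whose nucleotide sequences differ, which is the situation described above for amino acid clusters.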
If you have a lot of analysed sequences, looking at clusters can be a good way of understanding the variability in your dataset. It can also tell you whether a single CDR3 region is dominant within your dataset. If you align the CDR3 regions together, you can also see how they relate to each other sequence-wise.
Another way to visualise clusters is to use our graphing suite in the "Graphs" tab of Antibody Annotator result documents. For more detail, see our main article on graphs: Using graphs to interpret clusters and clonotypes.
The Number of Clusters graph shows which regions have the most variability within your annotated sequences. For example, in the above graph we can see that the FR/CDR region with the largest number of different protein sequences is the Heavy FR3 region, closely followed by the Heavy FR1 and CDR3 regions. There are far fewer unique sequences in Heavy FR4.
The cluster size graph can be a good way of looking at the distribution of clusters within your sequences. We can see from the graph above that the majority of Heavy CDR3 clusters in this dataset (close to 70k) contain only a single sequence, meaning that their CDR3 is unique in the dataset. On the other hand, we can also see that there is at least one Heavy CDR3 cluster that contains around 40,000 sequences (this is our "ARWEYYAMDY" cluster from above). If we wanted to double check that, we could look at the CDR3 count graph below, which shows the 25 most abundant Heavy CDR3 regions in the dataset:
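As a rough illustration of what these two graphs summarise, the sketch below derives a cluster size distribution and a "most abundant" list from a table of cluster counts. The `cluster_counts` data and both function names are hypothetical and chosen for this example only.

```python
from collections import Counter

# Hypothetical cluster table: region sequence -> number of reads in that cluster.
cluster_counts = {
    "ARWEYYAMDY": 5,
    "TRSGY": 2,
    "ARGDY": 1,
    "ARDYW": 1,
}


def cluster_size_distribution(cluster_counts):
    """Map each cluster size to the number of clusters that have that size."""
    return Counter(cluster_counts.values())


def top_clusters(cluster_counts, n=25):
    """Return the n most abundant clusters as (region, count) pairs."""
    return Counter(cluster_counts).most_common(n)


print(cluster_size_distribution(cluster_counts))
print(top_clusters(cluster_counts, n=2))
```

Here two clusters have a size of 1 (unique sequences), while a single large cluster dominates the top of the abundance list, mirroring the pattern described above.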
One final graph that may be of interest is the "Cluster length" graph, which shows what variation in length is present within the Heavy CDR3 region. Below we can see that the most frequent CDR3 length is 11 amino acids.
Note: These graphs are also available for other cluster regions; we used Heavy CDR3 only as an example above.
% Identity and Similarity Clusters
Similarity or identity clustering results can be viewed in a similar way to regular clusters, but they do have some special columns and visualisation options.
Once you have selected your cluster combination in the drop-down menu, the table will refresh and the specified clusters will appear in the results table. For each cluster, a representative sequence is listed in the table. The cluster can be inspected further as a sequence logo, where amino acid frequencies are reported as stacked histograms. To view the sequences within a cluster as a sequence logo, select a cluster and click Sequence Logo. Hovering the mouse over a residue in the sequence logo graph shows the exact percentage composition of each amino acid at that position.
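The per-position percentages shown in the sequence logo can be thought of as a simple column-wise frequency count. The sketch below is an illustration only: it assumes the sequences in the cluster are already aligned to the same length, and `position_composition` is a hypothetical helper, not part of the software.

```python
from collections import Counter


def position_composition(sequences):
    """Return, for each position, the percentage of each amino acid observed there.

    Assumes all sequences are aligned to the same length.
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be non-empty and aligned to the same length")

    n = len(sequences)
    return [
        {aa: 100.0 * count / n for aa, count in Counter(column).items()}
        for column in zip(*sequences)  # iterate over alignment columns
    ]


# Hypothetical members of one similarity cluster.
cluster = ["ARWEY", "ARWDY", "ARWEY", "TRWEY"]
for position, composition in enumerate(position_composition(cluster), start=1):
    print(position, composition)
```

Each dictionary corresponds to one stacked column of the logo: a fully conserved position has a single entry at 100%, while a variable position is split between several residues.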
To make it easier to compare the diversity of clusters, you can use the Evenness field. The Evenness column can be found in the Sequences Table, as highlighted above. The value is calculated using Pielou's evenness index, which ranges from 0 (uneven) to 1 (even) and describes how evenly the reads in a cluster are distributed across its unique sequences.
The field Unique states how many different unique sequence regions are represented in the cluster. For protein clusters, this is the number of different amino acid sequences present (there may be more nucleotide variations). In the picture above, you can see the selected Heavy CDR3 cluster ATARRR... actually represents a Total of 620 reads, containing 55 different unique Heavy CDR3 amino acid sequences.
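To make the Total, Unique and Evenness columns more concrete, here is a minimal sketch of how they could be derived from the region sequences in one cluster. Pielou's evenness index is the Shannon entropy H of the sequence frequencies divided by its maximum possible value, ln(S), where S is the number of unique sequences. The function name `cluster_summary` is hypothetical, and returning `None` for a cluster with a single unique sequence (where the index is undefined) is an assumption of this sketch; the software may handle that case differently.

```python
import math
from collections import Counter


def cluster_summary(region_sequences):
    """Compute total reads, unique sequences and Pielou's evenness for one cluster."""
    counts = Counter(region_sequences)
    total = sum(counts.values())
    unique = len(counts)

    if unique < 2:
        # ln(1) == 0, so the index is undefined; None is an assumption of this sketch.
        evenness = None
    else:
        # Shannon entropy of the relative abundance of each unique sequence.
        shannon = -sum((c / total) * math.log(c / total) for c in counts.values())
        evenness = shannon / math.log(unique)

    return {"total": total, "unique": unique, "evenness": evenness}


# Two unique sequences seen equally often: evenness close to 1.
print(cluster_summary(["ARWEY", "ARWEY", "ARWDY", "ARWDY"]))

# One sequence dominates: evenness well below 1.
print(cluster_summary(["ARWEY"] * 9 + ["ARWDY"]))
```

A cluster in which every unique sequence is seen about equally often scores close to 1, whereas a cluster dominated by one sequence with a few rare variants scores much lower.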