Understanding "Clusters"

Updated: February 14, 2024 04:27

This article explains how to view and interpret clusters to help you understand your dataset. To find out how to define custom clusters, including how to allow mismatches up to a specified threshold or count and how to cluster based on amino acid similarity, see Clustering Options.

What is a Cluster?

A "cluster" is a group of sequences in a dataset that share a particular region. It is often useful to group sequences into Clonotype clusters; see our main article Understanding Clonotypes or the video below for more information.

When viewing an annotated result document, you can use the "Cluster Table" dropdown to collapse the annotated sequences together by annotated region. If you choose "Heavy CDR3" from the Cluster Table dropdown, you will see a table in which sequences with exactly the same CDR3 region sequence are "clustered" together into one row.

Screen_Shot_2022-02-21_at_1.26.09_PM.png

In the picture above, 40,104 sequences, comprising 1.67% of the total, have a Heavy CDR3 of "ARWEYYAMDY". We would refer to this as a single CDR3 "cluster". We generally cluster by the amino acid translations of regions rather than the underlying nucleotides. This means that the exact nucleotide sequences may differ within a single amino acid cluster.
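
As a rough illustration of exact clustering, the sketch below groups a handful of made-up records by their Heavy CDR3 amino acid sequence and reports each cluster's count and share of the total. The records, their layout and the function name cluster_by_region are hypothetical and are not part of the product. Two of the invented nucleotide strings differ but translate to the same CDR3, so they land in the same amino acid cluster.

```python
from collections import defaultdict

# Hypothetical annotated records: (CDR3 nucleotides, CDR3 amino acids).
# The first two nucleotide strings differ but both translate to ARWEYYAMDY.
records = [
    ("GCTAGATGGGAATATTATGCTATGGATTAT", "ARWEYYAMDY"),
    ("GCCAGATGGGAGTACTATGCCATGGACTAC", "ARWEYYAMDY"),
    ("GCTAGATGGGAATATTATGCTATGGATTAT", "ARWEYYAMDY"),
    ("GCTAGAGGTGATTATTTTGATTAT", "ARGDYFDY"),
]


def cluster_by_region(records):
    """Group nucleotide sequences by their exact amino acid region."""
    clusters = defaultdict(list)
    for nucleotides, amino_acids in records:
        clusters[amino_acids].append(nucleotides)
    return dict(clusters)


clusters = cluster_by_region(records)
total = len(records)

# Largest cluster first, similar to sorting a cluster table by count.
for cdr3, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    share = 100 * len(members) / total
    print(f"{cdr3}: {len(members)} sequences ({share:.2f}% of total), "
          f"{len(set(members))} distinct nucleotide sequences")
```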

If you have a lot of analysed sequences, looking at clusters can be a good way of understanding the variability in your dataset. It can also tell you whether a single CDR3 region is dominant within your dataset. If you align the CDR3 regions, you can also see how they relate to each other at the sequence level.

Cluster Graphs

Another way to visualise clusters is via the graphs, which will automatically populate under any cluster table:

cluster table graph example.png

You can also view the graphs in a full window via the Graphs tab at the top of a result:

Screenshot 2023-05-23 at 12.34.52 AM.png

See our main article Using graphs to interpret clusters and clonotypes for a list of the types of graphs available.

For example, the Number of Clusters graph will show which regions have the most variability within your annotated sequences:

Screen_Shot_2022-02-21_at_1.35.43_PM.png

In the above graph we can see that the FR/CDR region with the largest number of different protein sequences is the Heavy FR3 region, closely followed by the Heavy FR1/CDR3 region. There are far fewer unique sequences in Heavy FR4.
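
To make the idea behind this graph concrete, here is a minimal sketch that counts how many distinct amino acid sequences (exact clusters) each region has. The rows and their sequences are invented for illustration, and count_clusters_per_region is a hypothetical helper, not a product function.

```python
# Hypothetical annotated sequences: one dict per sequence,
# mapping region name -> amino acid sequence (all values are made up).
rows = [
    {"Heavy FR3": "RFTISRDNSK", "Heavy CDR3": "ARWEYYAMDY", "Heavy FR4": "WGQGTSVTVSS"},
    {"Heavy FR3": "RFTISRDNAK", "Heavy CDR3": "ARWEYYAMDY", "Heavy FR4": "WGQGTSVTVSS"},
    {"Heavy FR3": "RVTISRDTSK", "Heavy CDR3": "ARGDYFDY", "Heavy FR4": "WGQGTSVTVSS"},
]


def count_clusters_per_region(rows):
    """Return the number of distinct amino acid sequences seen for each region."""
    distinct = {}
    for row in rows:
        for region, sequence in row.items():
            distinct.setdefault(region, set()).add(sequence)
    return {region: len(sequences) for region, sequences in distinct.items()}


clusters_per_region = count_clusters_per_region(rows)
for region, n in sorted(clusters_per_region.items(), key=lambda kv: -kv[1]):
    print(f"{region}: {n} clusters")
```

In this toy data the Heavy FR3 region is the most variable and Heavy FR4 the least, which is the kind of comparison the graph is meant to show.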

Screen_Shot_2022-02-21_at_1.39.37_PM.png

The Cluster Size graph can be a good way of looking at the distribution of clusters within your sequences. We can see from the graph above that the majority of Heavy CDR3 clusters in this dataset (close to 70k) contain only a single sequence, so their CDR3 occurs just once. On the other hand, we can also see that there is at least one Heavy CDR3 cluster that contains roughly 40,000 sequences (this is our "ARWEYYAMDY" cluster from above). To double check that, we can look at the CDR3 count graph below, which shows the 25 most abundant Heavy CDR3 regions in the dataset:

Screen_Shot_2022-02-21_at_1.43.14_PM.png
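
The two views above can be approximated outside the application with a short sketch like the following. It assumes you have one Heavy CDR3 amino acid string per sequence (the list here is made up) and uses Python's standard collections.Counter; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical input: the Heavy CDR3 of every sequence in a small dataset.
cdr3_per_sequence = (
    ["ARWEYYAMDY"] * 5
    + ["ARGDYFDY"] * 2
    + ["ARDLGY", "ARSNWFAY", "TRGGYFDV"]
)

# Size of each exact cluster: CDR3 -> number of sequences.
cluster_sizes = Counter(cdr3_per_sequence)

# Cluster size distribution: cluster size -> number of clusters of that size.
size_distribution = Counter(cluster_sizes.values())

# Most abundant CDR3 regions (the graph above shows the top 25; we take 2 here).
most_abundant = cluster_sizes.most_common(2)

for size, n_clusters in sorted(size_distribution.items()):
    print(f"{n_clusters} cluster(s) contain {size} sequence(s)")
print("Most abundant:", most_abundant)
```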

Inexact Clusters

Using the Advanced Clustering Options, sequences can be grouped into clusters that allow mismatches up to a specified % identity threshold or count, or that are based on amino acid similarity.
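
To give a feel for what allowing mismatches means, the sketch below uses a simple greedy scheme: each sequence joins the first cluster whose representative is within a given number of mismatches, otherwise it starts a new cluster. This is an illustration only and is not a description of the algorithm the application uses. The function names, the restriction to equal-length sequences and the example sequences are all assumptions.

```python
def mismatches(a, b):
    """Count differing positions; return None if the lengths differ (assumption)."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(sequences, max_mismatches):
    """Greedy inexact clustering sketch (hypothetical, not the product's method).

    The first sequence placed in a cluster acts as its representative.
    """
    clusters = []  # list of (representative, members)
    for seq in sequences:
        for representative, members in clusters:
            distance = mismatches(representative, seq)
            if distance is not None and distance <= max_mismatches:
                members.append(seq)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((seq, [seq]))
    return clusters


# Made-up Heavy CDR3 amino acid sequences.
cdr3s = ["ARWEYYAMDY", "ARWEYYAMDF", "ARWDYYAMDY", "ARGDYFDY", "ARGDYFDV"]

inexact = greedy_cluster(cdr3s, max_mismatches=1)
exact = greedy_cluster(cdr3s, max_mismatches=0)

for representative, members in inexact:
    print(f"{representative}: {len(members)} member(s) -> {members}")
```

With one mismatch allowed, the five sequences collapse into two clusters; with zero mismatches allowed, each distinct sequence stays in its own cluster, which matches the exact clustering described earlier.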

Similarity or identity clustering results can be viewed in a similar way to regular clusters, but they do have some special columns and visualisation options. See Exploring the Cluster Table Columns.

Screen_Shot_2022-02-24_at_1.14.33_PM.png

Once you have selected your cluster combination in the drop-down menu, the table will refresh and the specified clusters will appear in the results table. For each inexact cluster, a representative sequence is listed in the table.

Sequence Logos and Finding the "Evenness" of an Inexact Cluster

An inexact cluster can be further inspected as a sequence logo, where amino acid frequencies are reported as stacked histograms. To view the sequences within a cluster as a sequence logo, select a cluster and click Sequence Logo. Hovering the mouse over a residue in the sequence logo graph shows the exact percentage composition of each amino acid in that particular position.

Screen_Shot_2022-02-24_at_1.19.11_PM.png
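
The per-position percentages that a sequence logo displays can be sketched as follows. This assumes the member sequences of a cluster are already the same length (aligned); position_frequencies is a hypothetical helper and the sequences are invented.

```python
from collections import Counter


def position_frequencies(sequences):
    """Return, for each position, a dict of amino acid -> percentage.

    Assumes all sequences have the same length (i.e. are already aligned).
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        columns.append({aa: 100 * n / len(sequences) for aa, n in counts.items()})
    return columns


# Made-up members of one inexact cluster.
members = ["ARWEYY", "ARWDYY", "ARWEYF", "ARWEYY"]
frequencies = position_frequencies(members)

for position, column in enumerate(frequencies, start=1):
    summary = ", ".join(f"{aa} {pct:.0f}%" for aa, pct in sorted(column.items()))
    print(f"Position {position}: {summary}")
```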

To make it easier to compare diversity across clusters, use the Evenness field. The Evenness column can be found in the Sequences Table, as highlighted above. The value is calculated using Pielou's evenness index, which ranges from 0 (uneven) to 1 (even) and describes how evenly the sequences within the cluster are represented.
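
For reference, Pielou's evenness index is the Shannon entropy of the member counts divided by its maximum possible value, J = H / ln(S), where S is the number of distinct sequences in the cluster. The sketch below computes it from a list of per-sequence counts. How the application handles a cluster with only one distinct sequence is not stated here; returning 1.0 in that case is an assumption of this sketch, and pielou_evenness is a hypothetical function name.

```python
import math


def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(S) for a list of per-sequence counts."""
    counts = [c for c in counts if c > 0]
    if len(counts) < 2:
        # Assumption: J is undefined for a single distinct sequence;
        # this sketch reports 1.0 rather than dividing by ln(1) = 0.
        return 1.0
    total = sum(counts)
    shannon = -sum((c / total) * math.log(c / total) for c in counts)
    return shannon / math.log(len(counts))


# Four distinct sequences seen equally often: perfectly even.
even = pielou_evenness([10, 10, 10, 10])

# One sequence dominates the cluster: evenness close to 0.
uneven = pielou_evenness([97, 1, 1, 1])

print(f"even cluster:   {even:.3f}")
print(f"uneven cluster: {uneven:.3f}")
```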

The field Unique states how many distinct sequence regions are represented in the cluster. For protein clusters, this is the number of different amino acid sequences present (there may be more variation at the nucleotide level). In the picture above, you can see that the selected Heavy CDR3 cluster ATARRR... represents a Total of 620 reads, containing 55 distinct Heavy CDR3 amino acid sequences.