Sequence similarity clustering helps identify groups of closely related sequences and reduces sequence redundancy by collapsing them into a tractable number of clusters.
To cluster similar or identical sequences, (1) select a Biologics Annotator Result document, (2) click Post-processing and (3) select Re-cluster in the dropdown.
You can cluster sequences on a region so that every sequence in a cluster has similarity above the selected threshold. To cluster your sequences, (1) select a clustering method from the dropdown, (2) set your preferred threshold, (3) select the region to cluster the sequences by from the dropdown, and click Run.
Clustering Algorithms
Identity Clustering
Clustering is only performed on sequences or regions of identical length. Sequences are first assigned to clusters that consist only of identical sequences. Then, starting with a maximum of 1 difference and increasing up to the maximum number of amino acid differences allowed by the threshold setting, the following steps are repeated:
- Sort clusters by decreasing size - clusters of equal size are sorted so that less diverse clusters (i.e. containing fewer different sequences) are first. When still tied, clusters are sorted alphabetically.
- Build a new set of clusters by iterating over the sorted clusters:
  - Find the first new cluster whose representative sequence has at most the current maximum number of differences from the representative sequence of this old cluster.
    - If one is found, merge the two clusters together and keep the merged cluster in the set of new clusters.
    - If one is not found, add the old cluster to the set of new clusters.
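The steps above can be sketched in Python as follows. This is a minimal illustration and not the product's implementation: the names `Cluster`, `hamming` and `identity_cluster` are hypothetical, the maximum number of differences is passed in directly rather than derived from a threshold percentage, and it is assumed that a merged cluster keeps the representative sequence of the cluster it was merged into.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    representative: str
    members: Counter = field(default_factory=Counter)  # sequence -> copies

    @property
    def size(self):
        return sum(self.members.values())


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def identity_cluster(sequences, max_differences):
    # Start with one cluster per distinct sequence.
    counts = Counter(sequences)
    clusters = [Cluster(seq, Counter({seq: n})) for seq, n in counts.items()]

    # Relax the allowed number of differences one step at a time.
    for allowed in range(1, max_differences + 1):
        # Largest first, then fewer distinct sequences, then alphabetical.
        clusters.sort(key=lambda c: (-c.size, len(c.members), c.representative))
        new_clusters = []
        for old in clusters:
            target = next(
                (
                    new
                    for new in new_clusters
                    if len(new.representative) == len(old.representative)
                    and hamming(new.representative, old.representative) <= allowed
                ),
                None,
            )
            if target is None:
                new_clusters.append(Cluster(old.representative, Counter(old.members)))
            else:
                # Assumption: the earlier (larger) cluster keeps its representative.
                target.members.update(old.members)
        clusters = new_clusters
    return clusters
```

For example, calling `identity_cluster(["AAAA"] * 3 + ["AAAT"] * 2 + ["CCCC"], 1)` merges the `AAAT` copies into the `AAAA` cluster and leaves `CCCC` on its own.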
Similarity Clustering
This algorithm works the same way as the identity clustering algorithm, but the number of differences between two amino acids is defined as the reduction in BLOSUM62 score (see below) of the mismatch relative to the higher of the two perfect-match scores for either amino acid. These per-position differences are summed over the sequence.
For example, the BLOSUM62 entry for a Leu/Lys mismatch is -2, a Leu/Leu match is 4, and a Lys/Lys match is 5. The maximum difference between a match and the mismatch is therefore 5 - (-2) = 7, so a Leu/Lys mismatch scores 7 'differences'. A Leu/Ile mismatch has a score reduction of 2 (from Leu/Leu = 4 to Leu/Ile = 2). This means two sequences with 3 Leu to Ile mismatches (6 'differences') will be clustered together before two sequences with just a single Leu to Lys mismatch (7 'differences').
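The Leu/Lys and Leu/Ile arithmetic can be reproduced with a short Python sketch. The function names are hypothetical, and only the handful of BLOSUM62 entries needed for the examples in this article are hard-coded (using standard BLOSUM62 values); a real implementation would load the full matrix.

```python
# Subset of BLOSUM62, keyed by one-letter amino acid codes.
# L = Leu, K = Lys, I = Ile, W = Trp, V = Val.
BLOSUM62_SUBSET = {
    ("L", "L"): 4,
    ("K", "K"): 5,
    ("I", "I"): 4,
    ("W", "W"): 11,
    ("V", "V"): 4,
    ("L", "K"): -2,
    ("L", "I"): 2,
    ("W", "V"): -3,
}


def score(a, b):
    """Look up a symmetric substitution score."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def residue_differences(a, b):
    """Score reduction from the better of the two perfect matches."""
    return max(score(a, a), score(b, b)) - score(a, b)


def sequence_differences(seq1, seq2):
    """Sum the per-position 'differences' of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have identical length")
    return sum(residue_differences(a, b) for a, b in zip(seq1, seq2))


print(residue_differences("L", "K"))       # Leu/Lys
print(residue_differences("L", "I"))       # Leu/Ile
print(sequence_differences("LLL", "III"))  # three Leu/Ile mismatches
```

Under this definition a perfect match always scores 0 'differences', so the identity clustering loop can be reused with `sequence_differences` in place of a plain mismatch count.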
The similarity threshold percentage is transformed into a maximum number of 'differences' using the mean score difference between a match and a mismatch over all pairwise amino acid comparisons, which is 8.21. For example, 90% similarity over a 20 amino acid sequence corresponds to 2 mismatches of average cost, which allows up to 16.42 'differences' (2 × 8.21). This would allow up to 8 Leu/Ile mismatches (each scoring 2 'differences'), 2 Leu/Lys mismatches (each scoring 7 'differences'), or a single Trp/Val mismatch (scoring 14 'differences').
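The 90% example can be checked with a small Python sketch. The formula here is inferred from the worked example (allowed fraction × sequence length × 8.21) rather than taken from a published specification, and the function names are hypothetical.

```python
MEAN_SCORE_DIFFERENCE = 8.21  # mean match/mismatch score difference, from the text


def max_allowed_differences(similarity_percent, sequence_length):
    """Convert a similarity threshold into a budget of 'differences'."""
    allowed_fraction = (100 - similarity_percent) / 100
    return allowed_fraction * sequence_length * MEAN_SCORE_DIFFERENCE


def max_mismatches(budget, cost_per_mismatch):
    """How many mismatches of a given cost fit within the budget."""
    return int(budget // cost_per_mismatch)


budget = max_allowed_differences(90, 20)
print(round(budget, 2))            # budget for 90% over 20 residues
print(max_mismatches(budget, 2))   # Leu/Ile mismatches that fit
print(max_mismatches(budget, 7))   # Leu/Lys mismatches that fit
print(max_mismatches(budget, 14))  # Trp/Val mismatches that fit
```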
Viewing reclustered sequences
Similarity or identity clustering results can be viewed by (1) selecting the corresponding reclustered Biologics Annotator Result document. To view the reclustered table, (2) click Group By in the Sequence Table and (3) select the clustered region from the dropdown.
For each cluster, a representative sequence is listed in the table. The cluster can be further inspected as a sequence logo, where amino acid frequencies are reported as stacked histograms. To view the sequences within a cluster as a sequence logo, (1) select a cluster and (2) click Sequence Logo. Hovering the mouse over a bar (3) shows the exact percentage composition of each amino acid at that particular position.
To make it easier to compare diversity across clusters, use the Evenness field. The value is calculated using Pielou's evenness index, ranges from 0 (uneven) to 1 (even), and describes how evenly the sequences within the cluster are represented.
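As an illustration, Pielou's evenness index is the Shannon entropy of the abundances divided by its maximum possible value, the logarithm of the number of distinct items. The Python sketch below assumes the inputs are the copy counts of each distinct sequence in a cluster; how the software derives those counts is not stated here, and the function name is hypothetical. The index is undefined for a single distinct sequence, so this sketch returns NaN in that case as an arbitrary choice.

```python
import math


def pielou_evenness(counts):
    """Pielou's J = H / ln(S) for a list of per-sequence counts."""
    counts = [c for c in counts if c > 0]
    if len(counts) < 2:
        return math.nan  # undefined when only one distinct sequence exists
    total = sum(counts)
    shannon = -sum((c / total) * math.log(c / total) for c in counts)
    return shannon / math.log(len(counts))


print(pielou_evenness([5, 5, 5, 5]))   # equal abundances, close to 1
print(pielou_evenness([97, 1, 1, 1]))  # one dominant sequence, close to 0
```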