Network and Tree plots: Identifying clonotype and sequence relationships

Updated January 24, 2025 03:14

Analyzing clonal relationships can be challenging in NGS datasets with hundreds of millions of sequences. Tools that let researchers identify and extract meaningful trends from such data are key to the discovery and development of novel therapeutics.

Geneious Biologics has developed and optimized an in-house alignment algorithm that will enable researchers to interrogate the relationships between unique V(D)Js, CDRs or other sequenced regions of interest. Clustered unique sequences are aligned and can be plotted as either a minimum-spanning network diagram or an alignment tree.

focused_library_trimer_pan1 Annotated & Clustered.png

Fig. 1: An example of a Minimum-Spanning Network (A) and a Cladogram-Transformed Tree visualisation (B) of unique Heavy CDR3 amino acid sequences, colored by Heavy CDR3 length. The dataset used was from a phage-display panning experiment undertaken on human-derived scFvs exposed to the HIV-1 envelope glycoprotein trimer gp140 (see data source in References). The node sizes in the Network visualisation represent the total count of each unique Heavy CDR3 cluster.

How to view the Network and Tree in Geneious Biologics

Both the Tree and Network visualizations are available in Geneious Biologics, a cloud-based antibody and peptide therapeutics discovery software application.

After producing an Annotation Result (using our Antibody Annotator, NGS Antibody Annotator, Single Cell Antibody Annotator or Peptide Annotator analysis pipelines), navigate to a Cluster table. If you are unsure what a cluster is, see Understanding "Clusters" or Understanding Clonotypes and how to find them in your data.

In the image below, which shows the Heavy CDR3 cluster table, both graphs can be selected from the Graph dropdown on the right-hand panel:

finding graphs.png

Note: These plots are not generated if the region length exceeds ~200 amino acids. Currently we support these two graphs for sequences as long as typical V(D)J regions (~100 aa), but not scFvs (~250 aa).

Both plots are generated via the following steps:

First, an alignment is performed on the most populous primary sequence in each cluster row. Up to 1000 individual clusters can be selected for alignment.
- For example, if the VDJ Region cluster table had 5,000 rows or "clusters", only the top 1000 would be selected for alignment. If the 1000th cluster contained the same number of sequences as the 999th, clusters of that size would be omitted, so that the selection ends at a clear cut-off in the number of clustered sequences.
Following the alignment, either the Neighbour-joining Tree algorithm (for the Tree graph) or a Force-directed layout (for the Network graph) is applied.
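
The cluster selection step can be illustrated with a minimal Python sketch. It is not the Geneious Biologics implementation. The function name select_clusters, the (sequence, count) tuple layout and the exact tie rule are assumptions made for illustration: here, when the cluster size at the cut-off is shared with a cluster just beyond the limit, every cluster of that size is dropped.

```python
from typing import List, Tuple

Cluster = Tuple[str, int]  # (representative sequence, number of sequences in the cluster)


def select_clusters(clusters: List[Cluster], limit: int = 1000) -> List[Cluster]:
    """Pick the largest clusters for alignment, up to `limit` (hypothetical helper)."""
    # Rank clusters from most to least populous.
    ranked = sorted(clusters, key=lambda c: c[1], reverse=True)
    if len(ranked) <= limit:
        return ranked

    selected = ranked[:limit]
    boundary_size = selected[-1][1]

    # Assumed tie rule: if the first excluded cluster is the same size as the
    # last included one, drop every cluster of that size so the cut-off is clean.
    if ranked[limit][1] == boundary_size:
        selected = [c for c in selected if c[1] > boundary_size]
    return selected


example = [("CARDY", 5), ("CARGY", 4), ("CAKDY", 3), ("CTRDY", 3), ("CARDW", 1)]
top = select_clusters(example, limit=3)
```

With limit=3 in this toy example, the third and fourth clusters are tied at 3 sequences, so only the two larger clusters are kept.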

Developing the Alignment Algorithm

We iterated through multiple different approaches to the algorithm, including the scoring and distance calculations. Choosing the correct algorithm and parameters is crucial for balancing ease of interpretability and accurately representing true sequence similarity. The various algorithms were evaluated by plotting the following relationships:

Alignment Distance versus Network (Shortest Path) Distance for the minimum-spanning network diagram (Figure 3)
Alignment Distance versus Tree Distance for the tree diagram (Figure 2)
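
As a rough illustration of how such a comparison could be prepared, the Python sketch below computes shortest-path distances through a small network and pairs them with the original alignment distances, ready for a scatter plot. The function names, the toy distance matrix and the use of the Floyd-Warshall algorithm are assumptions for illustration, not a description of the evaluation code actually used.

```python
from typing import List, Tuple


def shortest_path_distances(n: int, edges: List[Tuple[int, int, float]]) -> List[List[float]]:
    """All-pairs shortest-path distances over an undirected weighted graph (Floyd-Warshall)."""
    inf = float("inf")
    d = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i, j, w in edges:
        d[i][j] = min(d[i][j], w)
        d[j][i] = min(d[j][i], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def distance_pairs(align: List[List[float]], graph: List[List[float]]) -> List[Tuple[float, float]]:
    """One (alignment distance, graph distance) point per unordered pair of sequences."""
    n = len(align)
    return [(align[i][j], graph[i][j]) for i in range(n) for j in range(i + 1, n)]


# Toy pairwise alignment distances for four sequences (made-up values).
align_dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]
# Edges kept in a hypothetical network: (node, node, alignment distance).
network_edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3)]

points = distance_pairs(align_dist, shortest_path_distances(4, network_edges))
```

Points lying close to the diagonal indicate that distances read off the diagram track the underlying alignment distances.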

The final algorithm used is a Needleman-Wunsch pairwise alignment with BLOSUM-62 scoring, a gap opening penalty of 10, a gap extension penalty of 5, and no free end gaps. Both the minimum-spanning Network and the Tree use the same underlying alignment algorithm.
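
For readers unfamiliar with this kind of alignment, the following Python sketch computes a global (Needleman-Wunsch style) alignment score with affine gap penalties and no free end gaps. It is a minimal sketch under stated assumptions, not the Geneious Biologics code: the placeholder_score function is a simple stand-in for a BLOSUM-62 lookup, and the gap convention (the first gap position costs gap_open, each further position costs gap_extend) is assumed.

```python
from typing import Callable

NEG_INF = float("-inf")


def placeholder_score(a: str, b: str) -> int:
    """Stand-in for a BLOSUM-62 lookup (assumption): +4 for a match, -2 otherwise."""
    return 4 if a == b else -2


def global_align_score(
    s: str,
    t: str,
    score: Callable[[str, str], int] = placeholder_score,
    gap_open: int = 10,
    gap_extend: int = 5,
) -> float:
    """Best global alignment score with affine gaps; end gaps are penalised."""
    n, m = len(s), len(t)
    # M: alignment ends with a residue pair; X: ends with a gap in t; Y: ends with a gap in s.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):  # leading gap in t (no free end gaps)
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):  # leading gap in s
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + score(s[i - 1], t[j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open, Y[i - 1][j] - gap_open, X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open, X[i][j - 1] - gap_open, Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


identical = global_align_score("ARD", "ARD")    # three matches
one_gap = global_align_score("ARDY", "ARD")     # three matches, one end gap
two_gaps = global_align_score("ARDYY", "ARD")   # three matches, end gap of length two
```

Because end gaps are penalised, sequences of different lengths score lower than they would under a free end gap scheme, which is relevant for regions such as CDR3 where length varies.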

Fig. 2: Series of plots showing the relationship between Alignment and Tree distances for Heavy CDR3 sequences (see data source in References) using modified Needleman-Wunsch algorithms. (A) No free end gaps, Jukes-Cantor distances, initial scoring method. (B) Free end gaps, Jukes-Cantor distances, identity scoring. (C) Free end gaps, Jukes-Cantor distances, normalised scoring. (D) Free end gaps (shortest only), Jukes-Cantor distances, improved normalised scoring. (E) Final algorithm: no free end gaps, linear distances, improved normalised scoring.

Fig. 3: Series of plots showing the relationship between Alignment and Network (Shortest Path) distances for Heavy CDR3 sequences (see data source in References) using modified Needleman-Wunsch algorithms. (A) No free end gaps, Jukes-Cantor distances, initial scoring method. (B) Free end gaps, Jukes-Cantor distances, identity scoring. (C) Free end gaps, Jukes-Cantor distances, normalised scoring. (D) Free end gaps (shortest only), Jukes-Cantor distances, improved normalised scoring. (E) Final algorithm: no free end gaps, linear distances, improved normalised scoring.

Alternate Network Visualisations

Initially, a network visualization was generated which plotted all node connections up to an adjustable alignment score threshold (Figure 4, below). This visualization was overly complex and difficult to interpret. We found that the minimum spanning method (shown in Figure 1A) struck the best balance between capturing the sequence relationships and being easy to understand.
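
The difference between the two approaches can be seen in a small Python sketch. It assumes a symmetric matrix of pairwise distances (the values below are made up) and uses Prim's algorithm to build a minimum spanning tree; the function names are hypothetical and a distance cut-off stands in for the alignment score threshold, so this illustrates the idea rather than the production code.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def threshold_edges(dist: List[List[float]], max_dist: float) -> List[Edge]:
    """Connect every pair of nodes whose distance is within the cut-off."""
    n = len(dist)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] <= max_dist]


def minimum_spanning_edges(dist: List[List[float]]) -> List[Edge]:
    """Prim's algorithm: grow a tree from node 0, always adding the cheapest outgoing edge."""
    n = len(dist)
    if n == 0:
        return []
    in_tree = {0}
    edges: List[Edge] = []
    while len(in_tree) < n:
        _, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Toy pairwise distances between four unique sequences (made-up values).
dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]

dense = threshold_edges(dist, max_dist=5)
sparse = minimum_spanning_edges(dist)
```

The thresholded network keeps five of the six possible edges for these four nodes, while the spanning tree always keeps exactly one fewer edge than there are nodes, which is why it stays readable as the number of clusters grows.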

network tree original.svg

Fig. 4: The initial Network Visualization prior to using the minimum spanning tree method.

References

He, L., Lin, X., De Val, N., Saye-Francisco, K. L., Mann, C. J., Augst, R., Morris, C. D., Azadnia, P., Zhou, B., Sok, D., Ozorowski, G., Ward, A. B., Burton, D. R., & Zhu, J. (2017). Hidden Lineage Complexity of Glycan-Dependent HIV-1 Broadly Neutralizing Antibodies Uncovered by Digital Panning and Native-Like gp140 Trimer. Frontiers in Immunology, 8. https://doi.org/10.3389/fimmu.2017.01025