Comparing Results across Multiple Experiments

Updated March 12, 2024 19:51

Repertoire comparison by next-generation sequencing gives insight into the breadth of the antibody repertoire used in an immune response. Compare Results can be used to identify which sequences in a query dataset are also present in a reference dataset, or to follow the enrichment of a clone across datasets or panning rounds.

Running Compare Results

To compare results, select 2 or more Biologics Annotator Result documents and click Post-processing > Compare Results in the dropdown (see image below).

Comparison Parameters

Depending on your sample, you can customize your comparison run by selecting or adjusting the following parameters:

Filtering

Filter out sequences which are not: "Without Stop Codons & In Frame & Fully Annotated"
- Removes sequences from the analysis that contain stop codons, are not in frame, or could not be fully annotated
Filter out sequences where the sum of counts for all samples is lower than:
- Excludes sequences whose total count across all samples is lower than the specified number. This filter is applied per region analyzed, as each region (e.g. FR2, CDR3, VDJ Region) is output as a separate table in the results.
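
As a rough illustration of the count filter described above, the following Python sketch keeps only the clusters whose count, summed over all samples, reaches a minimum. The data, sample names, and the function name `filter_by_total_count` are hypothetical and are not part of Biologics; this only mirrors the described behaviour for a single region table.

```python
# Hypothetical counts for one region table: region sequence -> count per sample.
counts = {
    "ARDYW": {"round_0": 1, "round_3": 0},
    "ARGGYFDV": {"round_0": 4, "round_3": 30},
    "ATSLDY": {"round_0": 2, "round_3": 1},
}


def filter_by_total_count(counts, min_total):
    """Keep clusters whose count summed over all samples is at least min_total."""
    return {
        region: per_sample
        for region, per_sample in counts.items()
        if sum(per_sample.values()) >= min_total
    }


# Clusters with a summed count lower than 3 are dropped.
kept = filter_by_total_count(counts, min_total=3)
print(sorted(kept))  # ['ARGGYFDV', 'ATSLDY']
```

Because each region is reported in its own table, a filter like this would be applied once per table rather than once for the whole dataset.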

Normalization

In order to accurately compare samples, normalization or scaling is often performed to remove variation between samples that prevents direct comparison of data. The choice of the normalization method used affects what the normalized ratio of each cluster will be as well as the P-Value. Depending on your experimental design and data, you can choose a normalization method from the following options:

Total count
- This is a simplistic way to compare samples by scaling the raw frequency counts based on the total number of sequences in the sample relative to the other sample. Total count normalization effectively just compares the raw frequency percentages of each cluster. However, this is prone to problems. For example, if one sample has a single new cluster that makes up 50% of sequences, then all other clusters will appear to have half their usual frequency, despite them not actually being any less frequent.
Total count excluding upper quartile
- This is another approach used for normalization in differential gene expression analysis, and it is usually better suited to immune system data comparison than the DESeq2 normalization method (Median of frequency ratios). This is because there are often only a few regions in common between samples that have significant numbers of sequences, so the median ratio of these regions is quite sensitive to a change in only a few sequence frequencies.
Median of frequency ratios
- This method uses the DESeq2 approach to sample normalization in differential gene expression analysis, but with an additional heuristic to exclude clusters with very low frequencies when calculating the normalization ratio. This is because the median cluster may often contain only 1 or 2 sequences, which can lead to inaccurate normalization ratios. For example, if the median cluster has 1 sequence in one sample and 2 sequences in the other, this would produce a normalization ratio of 2 even though the counts differ by a single sequence. Instead, we take the median ratio of those clusters where both samples have at least as many sequences as specified by the 'of frequencies at least' setting.
  - Note that the Median of frequency ratios normalization method is not supported for comparisons of more than 2 Biologics Annotator Result documents.
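
The two effects described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Biologics implementation: the cluster names and counts are made up, `frequency_percent` and `median_ratio` are hypothetical helper names, and `min_count` stands in for the 'of frequencies at least' setting.

```python
from statistics import median


def frequency_percent(sample):
    """Total count view: each cluster as a percentage of its sample total."""
    total = sum(sample.values())
    return {cluster: 100.0 * n / total for cluster, n in sample.items()}


def median_ratio(sample, reference, min_count):
    """Median of sample/reference count ratios over shared clusters where
    both samples have at least min_count sequences."""
    ratios = [
        sample[c] / reference[c]
        for c in sample
        if c in reference and sample[c] >= min_count and reference[c] >= min_count
    ]
    return median(ratios) if ratios else None


# Total count: a single new cluster making up 50% of the sample halves the
# apparent frequency of every other cluster, although their counts are unchanged.
before = {"a": 100, "b": 100}
after = {"a": 100, "b": 100, "new": 200}
print(frequency_percent(before)["a"], frequency_percent(after)["a"])  # 50.0 25.0

# Median of ratios: many clusters going from 1 to 2 sequences dominate the median
# unless low-count clusters are excluded.
reference = {"a": 100, "b": 100, "c": 50, "d": 1, "e": 1, "f": 1, "g": 1}
sample = {"a": 110, "b": 90, "c": 50, "d": 2, "e": 2, "f": 2, "g": 2}
print(median_ratio(sample, reference, min_count=1))   # 2.0
print(median_ratio(sample, reference, min_count=10))  # 1.0
```

With the threshold at 10, only the well-populated clusters contribute, and the normalization ratio reflects that those clusters are essentially unchanged between the two samples.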

Additional Clustering

To reduce sequence redundancy, you can cluster similar sequences based on a specific region and similarity threshold by selecting the Group similar sequences across all samples option. The clustering region must be one that is found in all datasets selected for comparison (e.g. HCDR3 or VDJ). See Understanding "Clusters" if you are unsure what clustering is.

Reference Sample

This option asks you to set a reference sample for sample comparisons. To set a reference sample, select one of the Biologics Annotator Result documents in the Reference sample dropdown.

This will often be the time 0 panning round, or the reference you want all counts to be relative to.

Viewing the Results

Summary Page

The default landing page is the Summary page, which displays a non-selectable table that gives an overview of the clusters and counts for each region:

Comparison example summary table.png

The summary page has the following columns:

Total Clusters
- Number of clusters across all datasets for a given region/gene
Clusters in [Sample]
- Number of clusters in this dataset for a given region/gene
Clusters in Both/All Samples
- Number of clusters that are common to all datasets for a given region/gene
Total in [Sample]
- Total number of sequences found for a region/gene in the given sample
Total Count Excluding Upper Quartile Normalized Ratio / Median Ratio Normalized Ratio / Frequency % Ratio
- The ratio between the normalized counts for each of the 3 normalization methods. When a normalized count is less than 1, both counts are incremented by the same amount before the ratio is calculated, so that both values are at least 1. For example, if the normalized counts are 0.3 and 1.2, both are first increased by 0.7 to give 1.0 and 1.9, and the resulting ratio is 1.9.
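
The adjustment for normalized counts below 1 can be written as a short Python sketch. The function name `padded_ratio` and its argument order are assumptions made for illustration; the worked example above does not say which of the two counts is the reference, so here the larger value is treated as the sample.

```python
def padded_ratio(sample_count, reference_count):
    """Ratio of two normalized counts, after shifting both by the same amount
    so that the smaller one is at least 1."""
    shift = max(0.0, 1.0 - min(sample_count, reference_count))
    return (sample_count + shift) / (reference_count + shift)


# 0.3 and 1.2 are both increased by 0.7, giving 1.9 / 1.0.
print(round(padded_ratio(1.2, 0.3), 6))  # 1.9

# Counts that are already at least 1 are left untouched.
print(padded_ratio(200, 100))  # 2.0
```

Without the shift, the same pair would give a ratio of 4, which overstates a difference of less than one sequence.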

Exploring specific regions or gene groupings

To investigate specific regions, gene groupings or region lengths across samples, go to the Cluster Table: dropdown and select any of the available tables:

Comparisons table dropdown.png

The tables produced display more granular information on specific regions.

Filtering

Right-clicking on any cell in the table will allow you to add that cell to the filter bar automatically:

Comparisons filtering example.png

Filters can be edited according to SQL syntax to define specific searches. Please see our main article Filtering your Sequences for more on SQL syntax and filtering.

Columns for Cluster Tables

Region / Gene / Region length
- This column displays the amino acid sequence of the region, the gene name, or the length of the region (in amino acids), depending on the table selected.
Count before filtering [Sample]
- The number of sequences prior to filtering for a given sample
Count [Sample]
- The number of sequences after filtering for a given sample
Normalized Count [Sample]
- The number of reads in this cluster, scaled according to the normalization method. For example, if the 'frequency % (total count)' normalization method is used (which is not recommended), then the ratio between the normalized counts of two samples will be the same as the ratio between the frequency % of those two samples.
Frequency % [Sample]
- The percentage of reads in this cluster out of the total number of reads from the sample used during analysis. If the `Only use sequences that are fully annotated, in frame, and without stop codons` setting is off, this total is the number of reads in the sample. If that setting is on, then the frequency is out of all reads in the sample which meet those conditions.
Fold Change (FC) Norm. Count [Sample] / [Reference Sample]
- This is the fold change between the normalized counts of a given sample, as compared to the reference sample
Log2 Fold Change (FC) Norm. Count [Sample] / [Reference Sample]
- The base 2 log of the fold change ratio above, which depends on the normalization method selected in the options.
Score
- A higher score indicates the differences between counts in this cluster are more interesting. This is a combination of the normalized count ratio and the p-value. For example, a normalized count change from 100 to 200 (ratio 2) is more interesting than a normalized count change from 1 to 4 (ratio 4), because the latter case is likely to have happened by chance rather than reflecting any real difference between the samples.
P-value
- The probability that the difference between observed normalized counts would happen by chance if there is actually no difference in the levels of these clones between the samples.
Adjusted P-Value
- The P-Value adjusted upwards to account for the fact that many different clusters are being analyzed. For example, when there are 1000 clusters, we would expect about one of these to have a p-value of 0.001 or lower by chance, even when there is no actual difference between the samples.
Cluster ID [Sample]
- The cluster ID assigned to the region/gene/region length from the given sample
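
To make the fold change and adjusted p-value columns more concrete, here is a small Python sketch. The article does not state which multiple-testing correction Biologics applies, so the Bonferroni correction below is only an assumed stand-in chosen because it is the simplest to show; the function names are hypothetical.

```python
import math


def log2_fold_change(sample_norm, reference_norm):
    """Base 2 log of the fold change between two normalized counts."""
    return math.log2(sample_norm / reference_norm)


def bonferroni(p_values):
    """Assumed illustration only: multiply each p-value by the number of tests,
    capped at 1. The correction actually used by Biologics may differ."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


# A doubling gives log2 FC = 1, a four-fold increase gives log2 FC = 2.
print(log2_fold_change(200, 100))  # 1.0
print(log2_fold_change(4, 1))      # 2.0

# With 1000 clusters, about 1000 * 0.001 = 1 cluster is expected to reach
# p <= 0.001 by chance, so a raw p-value of 0.001 is no longer convincing.
adjusted = bonferroni([0.001] * 1000)
print(round(adjusted[0], 6))
```

Whatever correction is used, the adjusted value is never smaller than the raw p-value, which is why it is the safer column to filter on when many clusters are compared.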

Finding sequences in common across samples

To identify a list of sequences that were present in two different samples (e.g. a set of query sequences vs a set of reference sequences), navigate to the Cluster Table of interest, e.g. the VDJ region or Heavy V Gene table. In the example below we will identify all the HCDR3 (90% Similarity) clusters that were present in both the query and reference dataset.

We can do this by filtering on the count columns. Add filters for the count columns of both the query and reference sequences by right-clicking on a cell in the appropriate count column:

Screenshot 2023-08-03 at 11.12.59 AM.png

Once filters are added, you can edit the syntax before filtering.
To make a filter that finds sequences from both documents that were grouped together under the same HCDR3, you can use a filter in this format:

['Count query sequences (Reclustered)'] > 0 AND ['Count reference sequences (Reclustered)'] > 0

The AND operator will find sequences where the counts were above 0 in both samples. In this example, only one query sequence was grouped with a reference:

Screenshot 2023-08-03 at 11.18.35 AM.png

You can then export the selected filtered sequences via the Export/Extract dropdown.
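
If you export the whole cluster table instead, the same AND logic can be reproduced outside Biologics. The Python sketch below assumes the exported rows have been loaded as dictionaries keyed by column name; the column names are taken from the filter above, while the rows and the `region` key are made up for illustration.

```python
QUERY = "Count query sequences (Reclustered)"
REFERENCE = "Count reference sequences (Reclustered)"

# Hypothetical rows from an HCDR3 cluster table.
rows = [
    {"region": "cluster_1", QUERY: 3, REFERENCE: 0},
    {"region": "cluster_2", QUERY: 1, REFERENCE: 12},
    {"region": "cluster_3", QUERY: 0, REFERENCE: 7},
]

# Keep only clusters with a count above 0 in both samples.
shared = [row for row in rows if row[QUERY] > 0 and row[REFERENCE] > 0]
print([row["region"] for row in shared])  # ['cluster_2']
```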

Generating frequency graphs across samples

The Graphs tab, next to the Sequence Viewer tab, allows you to plot the frequency of selected clusters across all samples by selecting Graph Type: Frequency. The picture below shows how you could identify trends in the Heavy CDR3 region between 5 successive panning rounds. The Heavy CDR3 regions selected in the table will be shown in the graph below. These regions all have high fold change values, and the increased frequency indicates that they have been enriched in the later panning rounds compared to the original sample.

You can mouse over the bars to reveal more information (e.g. region sequence and frequencies across the different samples).

Finding enriched sequences/clones using Scatterplots

The Graphs tab, next to the Sequence Viewer tab, allows you to generate scatterplots by selecting Graph Type: Scatterplot from the dropdown. You can choose to plot a variety of metrics by changing the X Axis and Y Axis dropdowns to the right of the graph. Biologics plots the selected rows, or all rows if none are selected.

A few common plots can be generated, including the Volcano Plot and a Normalized count vs Log2 Fold change plot.

Volcano Plot Example:

Log2 fold change on the X-axis
-Log10 P-value on the Y-axis

Screenshot 2023-08-31 at 5.18.03 PM.png
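
For reference, the volcano plot coordinates are simple transformations of the table columns. The Python sketch below computes them for a few made-up clusters; the cluster names, the values, and the cut-offs of log2 FC at least 1 and p-value at most 0.05 are assumptions for illustration, not Biologics defaults.

```python
import math


def volcano_point(log2_fc, p_value):
    """Return (x, y) for a volcano plot: log2 fold change vs -log10 p-value."""
    return (log2_fc, -math.log10(p_value))


# Hypothetical clusters: name -> (log2 fold change, p-value).
clusters = {
    "cluster_a": (3.2, 0.0001),
    "cluster_b": (0.4, 0.6),
    "cluster_c": (2.5, 0.3),
}

points = {name: volcano_point(fc, p) for name, (fc, p) in clusters.items()}

# Enriched candidates sit in the upper right: large fold change, small p-value.
enriched = [
    name
    for name, (fc, p) in clusters.items()
    if fc >= 1.0 and p <= 0.05
]
print(enriched)  # ['cluster_a']
```

A cluster such as `cluster_c` has a large fold change but a high p-value, so it appears low on the Y-axis and would usually not be treated as reliably enriched.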

Normalized count vs Log2 Fold change example plot:

Mousing over a point in the graph will bring up the region sequence for that point, e.g. Heavy CDR3 (90% Similarity): ATARRGQRIYGVVSFGEFFYYYYMDV