NGS Tutorial 4. Single Cell Antibody Analysis

Updated December 05, 2023 03:22

In this tutorial, we will look at the Single Cell Antibody Analysis pipeline. It takes NGS data from antibody chains sequenced from Single Clones in wells or Single Cells in droplets and produces a list of associated heavy and light chains per barcode/cell/well.

You can quickly filter out the noise from the dominant sequences identified; handle two or more prevalent sequences for the same chain by filtering based on properties such as frameshift issues; identify “doublets”; screen for contamination; compare across an experiment for enriched clones, etc.

The video below gives an introduction to the concepts behind single cell analysis, and our article Understanding Single Cell technologies: Barcodes and UMIs may also be useful.

This tutorial will continue the analysis of the reduced 10X Genomics BCR Single Cell dataset from NGS Tutorial 3, after having run the Collapse UMI Duplicates & Separate Barcodes tool.

In this tutorial we demonstrate the tool and results on a 10X Genomics dataset; however, it can also be used with other kinds of Single Cell data. As discussed in Tutorial 3, you may have clones within 96-well plates that have been mass sequenced after barcoding each well with plate, row, and column barcodes. See Single Cell Analysis Workflows to learn more about what sorts of data formats can be analyzed.

This tutorial will cover the following exercises:

Setting up Single Cell Antibody Analysis
Understanding the results from Single Cell Antibody Analysis
Exploring the dataset using Filtering
Screening for enriched clones using Clusters
Identifying Heavy-Light pairs using the Chain Combinations Table

Setting up Single Cell Antibody Analysis

The input in this tutorial is the pre-processed data stemming from NGS Tutorial 3, where 10X Genomics read pairs were assigned their corresponding Barcode, and read pairs within each Barcode collection collapsed according to their UMI and sequence identity.

Select the file and click Annotation > Single Cell Antibody Annotator (see image below).

annotation -> single cell anti anno.png

This brings up a dialogue box with a number of settings that need to be adjusted depending on the technology used to generate the Single Cell sequences, the expected sequence count per clone, and the desired outcome. For this tutorial, select the following options in the Single Cell Antibody Annotator dialogue box and click Run to start the analysis (see sections and image below).

Main Options

Reference database: Human Ig
Select the proper database for your data in the pull-down menu, here Human Ig 2022. Note that you can also make your own custom databases; see How to make a Custom Reference Database.
Sequence region of interest is between: FR1 and FR4 inclusive
As not all technologies provide coverage across the entire V(D)J region, this option lets you keep a given (consensus) sequence even if it only covers the ‘fully annotated region’ defined by these settings. Thus, if you're only interested in the CDR3 region, you can set both the start and the end of the region of interest to CDR3. Here we have chosen the entire V(D)J region, that is, FR1 to FR4.
Collapse Regions at least: 97% identical
In addition to assembling the heavy and light chain individually within a cell, you sometimes find two or more different sequences for the same chain. For example, a common phenomenon is two transcribed light chains in the same cell, either one Kappa and one Lambda, or two Kappa or two Lambda chains. You might also have two unique cells in the same well or droplet, or a significant level of contamination. Thus here we choose to only combine determined regions if they are at least 97% identical.
Retain upstream/downstream of fully annotated region: 100 bp
Combined with the above setting for defining the sequence region of interest, one can obtain more sequence information outside the V(D)J. This is useful if you want to look at leader sequences or the Constant region. Or if the ‘fully annotated region’ has been set to a bare minimum (e.g. CDR3) to recoup as many sequences as possible, but you still would like to know everything there is to know about the surrounding data, you can then have the ‘upstream and downstream’ data retained. Here, as our ‘fully annotated region’ is the entire V(D)J, 100 bp are retained both upstream and downstream to allow for leader sequence and constant region identification.
Associate significant dominant heavy and light pair: On
This will pair the dominant heavy and light chains within a barcode in the Chain Combinations table produced by the analysis.
Keep unmerged reads: On
Switch this on, as the input consists of read pairs that have not been merged.
De novo assembly required: On
Check this on to allow for assembly of the multiple sequences that represent a single chain. Here the input sequences are represented by the consensus sequences from the UMI collapsed read pairs. They include both heavy and light chain sequences for the given clone.
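
To make the "Collapse Regions at least: 97% identical" setting more concrete, here is a minimal Python sketch of identity-based collapsing. It is only an illustration and not the tool's actual algorithm: the function names `percent_identity` and `collapse_regions` are hypothetical, the sequences are made up, and the sketch assumes the regions are already aligned and of equal length, whereas a real pipeline would align them first.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned sequences of equal length."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def collapse_regions(seq_counts, min_identity=97.0):
    """Greedily merge near-identical regions (illustrative assumption only).

    seq_counts: iterable of (sequence, read_count) pairs for one chain type.
    Returns a list of dicts holding a representative sequence and summed count.
    """
    groups = []
    # Visit the most abundant sequences first so they become representatives.
    for seq, count in sorted(seq_counts, key=lambda sc: sc[1], reverse=True):
        for group in groups:
            rep = group["representative"]
            if len(rep) == len(seq) and percent_identity(rep, seq) >= min_identity:
                group["count"] += count
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append({"representative": seq, "count": count})
    return groups


# Made-up 100 nt light chain regions for a single barcode.
dominant = "ACGT" * 25
near_copy = "TT" + dominant[2:]        # 2 mismatches, 98% identical
second_chain = "T" * 8 + dominant[8:]  # 6 mismatches, 94% identical

groups = collapse_regions([(dominant, 50), (near_copy, 3), (second_chain, 10)])
for g in groups:
    print(g["count"], g["representative"][:12])
```

In this toy example the near copy is folded into the dominant sequence, while the 94% identical sequence stays separate and would be reported as a second light chain for the barcode.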

Analysis Options

analysis options ugly.png

Annotate variants from reference database: On
These are parameters as defined in the normal Antibody Annotation pipelines, see Annotating differences relative to reference sequences.
Find liabilities and assets: On
This setting will find and annotate the defined liabilities and assets as shown in the draggable box. See our list of default liabilities, or learn how to customize sequence liabilities and assets.

Filtering Options

single cell filtering options.png

Only keep sequences longer than: 50 bp
As many of the sequences have been trimmed for Barcode and UMI, we have lowered this number from the default, to only 50 bp.
Only use longest: OFF
Due to the nature of the 10X Genomics read pairs, both short and long reads are needed to bridge the entire V(D)J region. Thus here we are not using this option.
Only keep regions with at least: 5% of the read count of the dominant same chain
This parameter avoids reporting every possible noise sequence while still letting you see medium-level contamination. Regions that pass this cutoff are kept in the output even if they are not judged Significant by the three parameters that follow, so you can investigate these lower-level calls for potential issues and still filter them out of the final result. Here we have set it to 5%.
Significant regions have at least: 5% of the read count of the cell/barcode
In order to assist in the filtering of noise from real sequence, a number of parameters can be set to tag a given region with the label Significant: Yes/No.
This is the minimum percentage of the total input reads for the cell/barcode that a chain must reach to be considered "significant". Here we have set it to 5%, as we often find that cells produce a greater diversity of heavy chains. This also accounts for possible multiple flavors of the same chain. If this threshold is not met, the region will be labelled as Significant: No in the output.
Significant regions have at least: 10 reads
This should be adjusted according to the expected depth of sequences for a given region. Here we keep it relatively low at 10 UMI collapsed consensus sequences, as the data is a reduced dataset. If this is not fulfilled, the region will be labelled as Significant: No in the output.
Significant regions have at least: 20% of the read count of the dominant same chain region
As mentioned above, even with single cells one often finds multiple flavors of the same chain. To help distinguish whether a minority light or heavy chain is real or noise, one can compare the read count that makes up this sequence with the read count of the most common chain in the cell/well/barcode. Here we have set it to 20%. If this is not fulfilled, the region will be labelled as Significant: No in the output.
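
The way the four read-count thresholds above interact can be sketched in a few lines of Python. This is a hedged illustration, not the tool's implementation: `classify_region` and its parameter names are hypothetical, the example values are made up, and the sketch assumes that all three "Significant regions have at least" conditions must hold together, which is how the option descriptions above read.

```python
def classify_region(reads, pct_of_cell, pct_of_dominant, *,
                    keep_min_pct_of_dominant=5.0,  # "Only keep regions with at least"
                    min_pct_of_cell=5.0,           # "% of the read count of the cell/barcode"
                    min_reads=10,                  # "at least: 10 reads"
                    min_pct_of_dominant=20.0):     # "% of the dominant same chain region"
    """Return the label a region would get under the tutorial settings."""
    # Regions far below the dominant same chain are treated as noise and dropped.
    if pct_of_dominant < keep_min_pct_of_dominant:
        return "not reported"
    # Assumption: a region is Significant only if it meets all three thresholds.
    significant = (
        pct_of_cell >= min_pct_of_cell
        and reads >= min_reads
        and pct_of_dominant >= min_pct_of_dominant
    )
    return "Significant: Yes" if significant else "Significant: No"


# Made-up regions: (reads, % of reads in cell, % of dominant same chain).
examples = {
    "dominant light chain": (200, 57.1, 100.0),
    "minor light chain": (30, 8.6, 15.0),
    "low-depth chain": (8, 30.0, 100.0),
    "noise": (4, 1.1, 2.0),
}
labels = {name: classify_region(*values) for name, values in examples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

The minor light chain shows why the two dominant-chain cutoffs differ: at 15% of the dominant chain it is still reported (above 5%) but is not called Significant (below 20%).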

Clustering Options

Leave all defaults

Advanced Options

Leave all defaults

Click Run. Once the operation is completed, a new document containing all Single Cell results is generated and is ready for further downstream processing (see below). The top panel shows the Name and Description of this file.

In the Description, one can quickly see how many total chains have been found and of what kind. For example here, from the 330 Barcode group of sequences, a total of 834 entire V(D)J regions were identified: 334 Heavy chains and 500 Light chains.

Screenshot_2023-02-09_at_2.17.30_PM.png

Understanding the results from Single Cell Antibody Analysis

The results table from the Single Cell Antibody Analysis is similar to a regular NGS Antibody Annotator result table and includes the columns, tools, functions, and graphical viewers described in NGS Tutorial 1.

Each row in the All Sequences Table, however, represents a consensus of multiple sequences that share the same barcode and are sufficiently similar in sequence identity (here we chose a threshold of 97% identical).

Because of this, there are additional columns that will be of interest:

Chain Ranking
- The prevalence of each chain found within the same barcode. For example, a chain ranking of Light-1 means that within the given barcode, this was the most common light chain sequence.
Barcode
- This column denotes the barcode for that chain. E.g. CTAGTGACACACGCTG
% of Dominant Same Chain
- The read count of each chain sequence as a percentage of the top-ranked chain of the same type within that barcode. Therefore, Heavy-1 and Light-1 will always be 100%, while any other ranked chains (e.g. Heavy-2, Light-2) will be shown as a percentage of the 1st-ranked chain.
% of Sequences in Cell
- This column denotes the percentage of that chain relative to all the sequences found with the same barcode (the same cell).
# Sequences
- The number of sequences that made up a particular determined chain. Remember that all these sequences have been collapsed together based on the user-specified 97% identical threshold.
Significant (above threshold)
- A Yes or No call based on the settings used at runtime. Our settings required at least 5% of the read count of the cell (the same barcode), at least 10 sequences, and at least 20% of the read count of the dominant same chain.
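
As a worked example of how the ranking and percentage columns relate to one another, the following Python sketch derives them from raw sequence counts for one barcode. The function `summarise_barcode` is hypothetical and the counts are made up; it only mirrors the column definitions above and is not the tool's code.

```python
from collections import defaultdict


def summarise_barcode(chains):
    """Derive ranking and percentage columns for one barcode.

    chains: list of (chain_type, n_sequences) pairs, e.g. ("Light", 200).
    """
    total = sum(n for _, n in chains)
    counts_by_type = defaultdict(list)
    for chain_type, n in chains:
        counts_by_type[chain_type].append(n)

    rows = []
    for chain_type, counts in counts_by_type.items():
        counts.sort(reverse=True)  # most common chain of this type first
        top = counts[0]
        for rank, n in enumerate(counts, start=1):
            rows.append({
                "Chain Ranking": f"{chain_type}-{rank}",
                "# Sequences": n,
                "% of Dominant Same Chain": round(100 * n / top, 1),
                "% of Sequences in Cell": round(100 * n / total, 1),
            })
    return rows


# Made-up counts: one heavy chain and two light chains in the same barcode.
rows = summarise_barcode([("Heavy", 120), ("Light", 200), ("Light", 30)])
for row in rows:
    print(row)
```

With 350 sequences in the barcode, Light-2 comes out at 15.0% of the dominant light chain but only 8.6% of all sequences in the cell, which is why the two percentage columns should not be read interchangeably.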

Screenshot_2023-02-09_at_3.34.36_PM.png

Exploring the dataset using Filtering

You can add filters by manually typing them in or by right clicking on a cell within the column and selecting Filter, as shown below. To add further criteria to a Filter, the easiest way is to keep right-clicking on a cell in the columns desired and pick the Filter option. This will now show up in the Filter line, by default with an ‘AND’ to the previous filter. Here you can edit the actual criteria. Learn more here: Filtering your Sequences

Screenshot_2023-02-09_at_4.21.41_PM.png

Several clones have the typical one Heavy and one Light chain pattern, but you will also notice a number of clones with two or more light chains called. Some of these secondary sequences contain an issue that makes them non-functional, for example Frameshifts, Stop Codons, or Liabilities, whereas others look to be non-problematic sequences.

To filter out these potentially undesirable sequences, we have made the following filter, which finds sequences that have no errors, are significant, and make up more than 60% of the Dominant Same Chain.

['Error'] NOT LIKE '%' AND ['Significant (above threshold)'] = 'Yes' AND ['% of Dominant Same Chain'] > 60

After right-clicking to add these filters, we have edited some of the syntax. Here we have asked it to only keep rows that do not contain (NOT LIKE) ‘anything’ (%) in the Error column. We then right clicked on cells in the Significant (above threshold) and % of Dominant Same Chain columns to add these filters, and changed the % of Dominant Same Chain value to > 60.

Screenshot_2023-02-09_at_4.43.10_PM.png
Note that this has narrowed down our sequences from a total of 834 to 646 sequences.
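
If you export the table and want to reproduce the same selection outside the application, the filter logic can be expressed in plain Python. This is a sketch under assumptions: the rows and the Name values are made up, the column names are copied from the filter above, and "NOT LIKE '%'" is interpreted as "the Error column is empty", as described in the text.

```python
# Made-up rows using the column names from the filter expression.
table = [
    {"Name": "cell1 Heavy-1", "Error": "", "Significant (above threshold)": "Yes",
     "% of Dominant Same Chain": 100.0},
    {"Name": "cell1 Light-1", "Error": "", "Significant (above threshold)": "Yes",
     "% of Dominant Same Chain": 100.0},
    {"Name": "cell1 Light-2", "Error": "Frameshift", "Significant (above threshold)": "Yes",
     "% of Dominant Same Chain": 75.0},
    {"Name": "cell2 Light-2", "Error": "", "Significant (above threshold)": "No",
     "% of Dominant Same Chain": 15.0},
    {"Name": "cell3 Light-2", "Error": "", "Significant (above threshold)": "Yes",
     "% of Dominant Same Chain": 60.0},
]


def passes_filter(row, min_pct=60):
    """Mirror of: no Error AND Significant = 'Yes' AND % of Dominant Same Chain > 60."""
    return (
        not row["Error"]                                  # Error column is empty
        and row["Significant (above threshold)"] == "Yes"
        and row["% of Dominant Same Chain"] > min_pct     # strictly greater than
    )


kept = [row["Name"] for row in table if passes_filter(row)]
print(kept)
```

Note that the comparison is strict, so a chain at exactly 60% of the dominant same chain is excluded, just as it would be by "> 60" in the filter box.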

Screening for enriched clones using Clusters

As with the results table from the normal NGS Antibody Annotator (Tutorial 1), you can use the results from the Single Cell Antibody Analysis to cluster sequences based on certain regions and so screen for enriched clones. If you are unsure what a cluster is, see: Understanding "Clusters".

To access the Cluster Tables go to Cluster Table in the top left and select from the dropdown menu:

Screenshot_2023-02-03_at_4.06.33_PM.png

The image above shows the grouping of the Single Clones based on the Light CDR3 region. In this case the data is from a BCR repertoire from a healthy donor, so there is not that much redundancy.
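
Conceptually, a cluster table on the Light CDR3 region groups together all chains that share the same CDR3 and counts them. The Python sketch below illustrates that idea on made-up data. It assumes exact-match clustering on the amino acid sequence; the barcode labels and CDR3 sequences are hypothetical, and the application may offer other cluster definitions.

```python
from collections import defaultdict

# Made-up (barcode, Light CDR3) pairs.
records = [
    ("BC01", "QQYNSYPLT"),
    ("BC02", "QQSYSTPLT"),
    ("BC03", "QQYNSYPLT"),
    ("BC04", "QSYDSSLSGV"),
    ("BC05", "QQYNSYPLT"),
]

# Group barcodes by identical CDR3 sequence.
clusters = defaultdict(set)
for barcode, cdr3 in records:
    clusters[cdr3].add(barcode)

# Largest clusters first; ties are broken alphabetically for stable output.
ranked = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
for cdr3, barcodes in ranked:
    print(cdr3, len(barcodes), sorted(barcodes))
```

A CDR3 shared by many barcodes points to an enriched clone, whereas a repertoire from a healthy donor would mostly produce clusters of size one.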

Identifying Heavy-Light pairs using the Chain Combinations Table

This table is accessed via the Cluster Table dropdown, and shows the automatically-paired Heavy and Light chains from each barcode on one row.

Having two cells in one droplet (and hence one barcode) can be a common issue when generating Single Cell data. The Chain Combinations table can be used to quickly identify cells (barcodes) where this may be the case. We can identify any chains that weren't ranked 1st but were found at very high levels in the cell (more than 70% of the dominant same chain) by using the following Filter:

['Heavy Chain-Ranking'] = 'Heavy-2' AND ['Light Chain-Ranking'] = 'Light-2' AND ['Heavy % of Dominant Same Chain'] > 70 AND ['Light % of Dominant Same Chain'] > 70

Screenshot_2023-02-10_at_10.05.17_AM.png

Note: you can easily apply filters by right-clicking on any cell in the table and selecting Filter. This will add the filter syntax automatically to the filter box, where you can edit the syntax before pressing 'Enter' or clicking filter. See Filtering your Sequences to learn more.

There are two secondary chain pairs that were found at very high levels within the same cell: Barcodes ACATCAGCAGCATGAG and AGTCTTTAGGAGTTTA. This may indicate that there were two cells in their respective droplets.
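
The same doublet screen can be run on an exported Chain Combinations table. The sketch below is illustrative only: the rows and barcode labels are made up (they are not the barcodes reported above), the column names are taken from the filter expression, and `is_secondary_pair` is a hypothetical helper.

```python
# Made-up Chain Combinations rows, one heavy/light pairing per row.
combinations = [
    {"Barcode": "BARCODE_A", "Heavy Chain-Ranking": "Heavy-1", "Light Chain-Ranking": "Light-1",
     "Heavy % of Dominant Same Chain": 100.0, "Light % of Dominant Same Chain": 100.0},
    {"Barcode": "BARCODE_A", "Heavy Chain-Ranking": "Heavy-2", "Light Chain-Ranking": "Light-2",
     "Heavy % of Dominant Same Chain": 85.0, "Light % of Dominant Same Chain": 92.0},
    {"Barcode": "BARCODE_B", "Heavy Chain-Ranking": "Heavy-1", "Light Chain-Ranking": "Light-1",
     "Heavy % of Dominant Same Chain": 100.0, "Light % of Dominant Same Chain": 100.0},
    {"Barcode": "BARCODE_B", "Heavy Chain-Ranking": "Heavy-2", "Light Chain-Ranking": "Light-2",
     "Heavy % of Dominant Same Chain": 12.0, "Light % of Dominant Same Chain": 75.0},
]


def is_secondary_pair(row, min_pct=70):
    """True for a Heavy-2/Light-2 pair where both chains exceed min_pct."""
    return (
        row["Heavy Chain-Ranking"] == "Heavy-2"
        and row["Light Chain-Ranking"] == "Light-2"
        and row["Heavy % of Dominant Same Chain"] > min_pct
        and row["Light % of Dominant Same Chain"] > min_pct
    )


# Barcodes with a strong second heavy/light pair are candidate doublets.
candidate_doublets = sorted({row["Barcode"] for row in combinations if is_secondary_pair(row)})
print(candidate_doublets)
```

Requiring both the second heavy and the second light chain to be abundant keeps out barcodes like BARCODE_B, where only an extra light chain is prominent, a pattern that can also occur in a single cell.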