In this tutorial, you will learn how to use Single Cell Analysis to annotate NGS sequences that have already been partially processed and demultiplexed. You may have been provided sequences in lists that have already been grouped according to barcodes, and would like these barcodes to only be associated within and not across sequence lists.
- This tutorial can also be used to generate a single annotation document from multiple sequence lists that have already had their UMIs and Barcodes processed according to NGS Tutorial 3. Using Barcodes and UMIs.
Get started: To start this tutorial, you will need the input data. If you have recently started Geneious Biologics, your organisation may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the input sequences here and then uploading them into Geneious Biologics. See here for more information about where the data was sourced.
This tutorial will cover the following exercises:
- Setting up the Single Cell Antibody Analysis run on multiple sequence lists
- Understanding the results from Single Cell Antibody Analysis
- Further analysis
Setting up the Single Cell Antibody Analysis run
The input data for this tutorial is two sequence lists, each containing sequences assigned to one of four barcodes. The two sequencing runs used the same 4 barcodes, but we do not want to associate sequences with the same barcode from both lists together. Instead, we want each barcode to be associated within each list separately, before carrying out Single Cell Analysis. When Single Cell Analysis is run on multiple lists, barcodes are not compared between lists, only within lists.
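The within-list grouping described above can be pictured with a small Python sketch. It is only an illustration of the idea, not how Geneious Biologics implements it: the record layout, the barcode labels, and the `group_within_lists` helper are all hypothetical. The point is that each cell is keyed on the sequence list and the barcode together, so an identical barcode in two lists yields two separate groups.

```python
from collections import defaultdict

# Hypothetical records: (sequence_list_name, barcode, sequence_id).
# The list names reuse the tutorial's sample names; barcode labels and IDs are made up.
records = [
    ("SRR1056423", "barcode_1", "seq_a"),
    ("SRR1056423", "barcode_1", "seq_b"),
    ("SRR1056424", "barcode_1", "seq_c"),
    ("SRR1056424", "barcode_2", "seq_d"),
]


def group_within_lists(records):
    """Group sequence IDs by (list, barcode) so equal barcodes in different lists stay apart."""
    cells = defaultdict(list)
    for list_name, barcode, seq_id in records:
        cells[(list_name, barcode)].append(seq_id)
    return dict(cells)


cells = group_within_lists(records)
for key, seq_ids in sorted(cells.items()):
    print(key, seq_ids)
```

Keying on the barcode alone would instead merge `seq_a`, `seq_b` and `seq_c` into one group, which is the cross-list association this tutorial avoids.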
Select the two files in the Input folder and click Annotation > Single Cell Antibody Analysis (see image below).
This brings up the dialogue box below, with a number of settings that need to be adjusted depending on the technology used to generate the Single Cell sequences, the expected sequence count per clone, and the desired outcome. To find out more about what these options mean, click here. For this tutorial, select the options below in the Single Cell Antibody Analysis dialogue box and click Run to start the analysis.
Main Options
- Reference database: Human Ig
  Select the appropriate database for your data from the pull-down menu, here Human Ig 2022. Note that you can make your own custom databases: How to make a Custom Reference Database
- Sequence region of interest is between: FR1 and FR4 inclusive
  As not all technologies provide coverage across the entire V(D)J region, this option allows you to keep a given (consensus) sequence even if it only covers the ‘fully annotated region’ defined by these settings. Thus, if you're only interested in the CDR3 region, you can set both X and Y to CDR3. Here we have chosen the entire V(D)J region, that is, FR1 to FR4.
- Collapse Regions at least: 97% identical
  In addition to assembling the heavy and light chain individually within a cell, you sometimes find two or more different sequences for the same chain. For example, a common phenomenon is two transcribed light chains in the same cell: either one Kappa and one Lambda, or two Kappa or two Lambda chains. Or you might have two unique cells in the same well or droplet. Finally, you might have a significant level of contamination. Thus, here we choose to only combine determined regions if they are at least 97% identical.
- Retain upstream/downstream of fully annotated region: 50 bp
  Combined with the above setting for defining the sequence region of interest, this lets you obtain more sequence information outside the V(D)J. This is useful if you want to look at leader sequences or the Constant region. Alternatively, if the ‘fully annotated region’ has been set to a bare minimum (e.g. CDR3) to recoup as many sequences as possible, but you would still like to see the surrounding data, you can have the ‘upstream and downstream’ data retained. Here, as our ‘fully annotated region’ is the entire V(D)J, 50 bp are retained both upstream and downstream to allow for leader sequence and constant region identification.
- Associate significant dominant heavy and light pair: On
  This will pair the dominant heavy and light chains within a barcode in the Chain Combinations table produced by the analysis.
- Keep unmerged reads: On
  Check this on, as the input consists of read pairs that have not been merged.
- De novo assembly required: On
  Check this on to allow for assembly of the multiple sequences that represent a single chain. Here the input sequences are represented by the consensus sequences from the UMI collapsed read pairs. They include both heavy and light chain sequences for the given clone.
Analysis Options
- Find liabilities and Assets
Filtering Options:
- Only keep sequences longer than: 50 bp
- Only use longest: OFF
- Only keep regions with at least: 5% of the read count of the dominant same chain region
- Significant regions have at least: 1% of the read count of the cell
- Significant regions have at least: 20 reads
- Significant regions have at least: 20% of the read count of the dominant same chain region
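To get a feel for how the read-count thresholds above interact, here is a minimal Python sketch. It rests on assumptions this tutorial does not confirm: that a region must meet all three "Significant regions" criteria at once, that the comparisons are inclusive ("at least"), and that the 5% rule is applied first. The `Thresholds` class and `classify_region` function are hypothetical names for illustration, not part of Geneious Biologics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Values taken from the Filtering Options used in this tutorial."""
    keep_pct_of_dominant: float = 5.0   # Only keep regions with at least ...
    sig_pct_of_cell: float = 1.0        # Significant: % of the cell's read count
    sig_min_reads: int = 20             # Significant: absolute read count
    sig_pct_of_dominant: float = 20.0   # Significant: % of dominant same chain region


def classify_region(reads, dominant_reads, cell_reads, t=Thresholds()):
    """Return 'discarded', 'kept' or 'significant' for one region.

    Assumption: all three significance criteria must hold together.
    """
    if reads * 100 < t.keep_pct_of_dominant * dominant_reads:
        return "discarded"
    significant = (
        reads >= t.sig_min_reads
        and reads * 100 >= t.sig_pct_of_cell * cell_reads
        and reads * 100 >= t.sig_pct_of_dominant * dominant_reads
    )
    return "significant" if significant else "kept"


# Hypothetical cell with 500 reads whose dominant light chain region has 200 reads.
print(classify_region(200, 200, 500))  # the dominant region itself
print(classify_region(30, 200, 500))   # 15% of dominant: kept, but below the 20% rule
print(classify_region(5, 200, 500))    # 2.5% of dominant: below the 5% keep rule
```

Under these assumptions, raising the 20% value makes fewer secondary chains count as significant without removing them from the results, whereas raising the 5% value removes them altogether.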
Once the operation is completed (~1 hour), a new document will be generated containing all Single Cell results for the two input sequence lists. The top panel shows the Name and Description of this file. In the Description, you can quickly see how many chains have been found in total and their type (heavy or light). In this case, a total of 237 Heavy chains were found (including 27 that were significant) and 236 Light chains were found (with 20 being significant).
Understanding the results from Single Cell Antibody Analysis
A useful first pass to explore your data is to go to the Cluster Table dropdown and select Chain Combinations (1). This brings up the combinations of each heavy and light chain assigned to the same barcode (remember that the barcodes from different lists are not grouped together).
As seen in the image above with the combinations ordered by heavy chain ranking, the two samples (SRR1056423 and SRR1056424) and their sets of barcodes are kept separate (2). Within barcode 4 of Sample SRR1056423, the top ranked heavy chain (Heavy-1 indicates the most prevalent heavy chain) contains a frame shift in the CDR2 region (3).
We can also see the diversity of light chains that pair with the top ranked heavy chain assigned to barcode 2 in the SRR1056424 data set (4, also highlighted in blue). The dominant Heavy-1 chain pairs with 4 different light chains (ranked 1-4) according to their being grouped into the same barcode. The column Light% of Dominant Same Chain shows the percentage of each of these light chains relative to the top ranked light chain. Therefore, Light-1 is ranked 100%, while the number of Light-2 sequences found in barcode 2 is 43.12% of the number of Light-1 chains.
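The arithmetic behind the Light% of Dominant Same Chain column can be sketched as follows. The read counts are hypothetical, chosen only so that the ratio reproduces the 43.12% quoted above, since the actual counts are not given in this tutorial. The `percent_of_dominant` helper is likewise an illustrative name.

```python
def percent_of_dominant(counts):
    """Express each chain's count as a percentage of the most abundant chain."""
    dominant = max(counts.values())
    return {name: round(100 * n / dominant, 2) for name, n in counts.items()}


# Hypothetical sequence counts for the light chains in one barcode.
light_counts = {"Light-1": 109, "Light-2": 47}

percentages = percent_of_dominant(light_counts)
print(percentages)  # {'Light-1': 100.0, 'Light-2': 43.12}
```

The top ranked chain is always 100% by construction, so the column is best read as a measure of how close each secondary chain comes to the dominant one.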
If any of these sequences in the Chain Combinations dropdown table is selected, you can view the two chains (Heavy and Light) in the Sequence Viewer:
The above displays the sequence of Heavy-1 (the top ranked heavy chain for barcode 3 in the SRR1056423 dataset) with the second ranked Light-2 pair (the 2nd ranked light chain for barcode 3 in the same dataset). This light chain is found at a rate of 72.42% of the sequences for the top ranked light chain in this barcode and dataset.
Further Analysis
The other tools in Biologics provide a starting point for further analysis.
- Single Cell Analysis produces clusters, just like Antibody Annotator. To learn more about clusters, see this article: Understanding "clusters"
- You can add your Assay Data (ELISA values etc.) to further inform your results: Adding Assay Data to your analysis results
- Filtering is a very powerful tool that allows you to pull out sequences that meet certain metrics you specify: Filtering your sequences
- You can align sequences to compare the amino acid diversity across a region or multiple regions: Sequence alignment
- Compare two or more Annotation Result Documents from separate experiments to monitor clonal expansion etc.
Data Source
The example included here is a reduced and modified dataset from the following paper with publicly available data: