In this tutorial, we will look at the Single Clone Antibody Analysis pipeline. It lets users who have NGS-sequenced antibody chains from Single Clones in wells or Single Cells in droplets (both referred to here as Single Clone) obtain a list of associated heavy and light chains per clone. You can quickly filter out the noise from the dominant sequences identified; handle two or more prevalent sequences for the same chain by filtering on properties such as frameshift issues; identify “doublets”; screen for contamination; and compare across an experiment for enriched clones.
This tutorial will continue the analysis of the reduced 10X Genomics BCR Single Cell dataset from NGS Tutorial 3, after having run the Collapse UMI Duplicates & Separate Barcodes tool.
In this tutorial we demonstrate the tool and its results on a 10X Genomics dataset, but it can also be used with other kinds of Single Clone data. As discussed in Tutorial 3, you may have Single Clones in plates that have been mass-sequenced after barcoding each well with plate, row, and column barcodes. This pipeline is also applicable to individual clones that have each been sequenced in their own NGS reaction.
This tutorial will cover the following exercises:
- Setting up the Single Clone Antibody Analysis run
- Understanding the results from the Single Clone Antibody Analysis
- Filter, sort, and cluster sequences for the desired output
Setting up the Single Clone Antibody Analysis run
The input in this tutorial is the pre-processed data stemming from Tutorial 3, where 10X Genomics read pairs were assigned their corresponding Barcode, and read pairs within each Barcode collection collapsed according to their UMI and sequence identity.
Select the file in the 1. Input folder and click Annotation > Single Clone Antibody Analysis (see image below).
This brings up the dialogue box below, with a number of settings that need to be adjusted depending on the technology used to generate the Single Clone sequences, the expected sequence count per clone, and the desired outcome. For this tutorial, select the following options in the Single Clone Antibody Analysis dialogue box and click Run to start the analysis (see sections and image below).
- Discard sequences shorter than: 50 bp
Because many of the sequences have had their Barcode and UMI trimmed off, we have lowered this value from the default to 50 bp.
- Only use longest X reads from each list/barcode
Due to the nature of the 10X Genomics read pairs, both short and long reads are needed to bridge the entire V(D)J region. Thus here we are not using this option.
- De novo assembly required
Check this option to allow assembly of the different sequences representing a single clone. Here the input sequences are the consensus sequences from the UMI-collapsed read pairs. They include both heavy and light chain sequences for the given clone.
- Keep unmerged reads
Check this option, as the input consists of read pairs that have not been merged.
- Antibody annotator database: Human Ig 2022
Select the proper database for the data in the pull-down menu, here Human Ig. See here for how to make custom databases.
- Associate significant dominant heavy and light pair
This will pair the dominant heavy and light chains with a barcode in the Chain Combinations table produced by the analysis.
- Include pseudogenes from database
- Include ORF genes from database
- Annotate germline differences
These parameters are the same as in the normal Antibody Annotation pipelines; see Tutorial 1, Antibody Annotator and Germline differences. Here we use the default settings for each of them.
- Find liabilities and assets
This setting will find and annotate the defined liabilities and assets as shown in the draggable box. See this article for a description of the default liabilities, and this article to learn how to define your own custom liabilities.
Sequence Region of Interest
- A fully annotated region is between FR1 and FR4 inclusive
As not all technologies provide coverage across the entire V(D)J region, this option lets you keep a given (consensus) sequence even if it covers only the ‘fully annotated region’ defined by these settings. For example, if you are only interested in the CDR3 region, you can set both ends of the region (X and Y) to CDR3. Here we have chosen the entire V(D)J region, that is, FR1 to FR4. See also how this setting works together with the next two options.
- Retain upstream and downstream of fully annotated region: 100 bp
Combined with the above setting for defining a ‘fully annotated region’, this option retains additional sequence outside that region. This is useful if you want to look at leader sequences or call the Constant region. It also helps if the ‘fully annotated region’ has been set to a bare minimum (e.g. CDR3) to recover as many sequences as possible, but you still want to know as much as possible about the surrounding sequence: you can then retain, for example, up to 400 bp upstream and 200 bp downstream of the CDR3. Here, as our ‘fully annotated region’ is the entire V(D)J, 100 bp are retained both upstream and downstream to allow for leader sequence and clonotype identification.
- Always annotate entire regions (except CDR3)
- Combine regions at least X% identical
Even after the heavy and light chains have been assembled individually within the Single Clone, you sometimes find two or more different sequences for the same chain. A common example is two transcribed light chains, which could be one Kappa and one Lambda, or two Kappa chains in the same cell. Alternatively, you might have two clones in the same well or droplet. Finally, you might have a significant level of contamination. To keep such variants apart, here we choose not to combine determined regions if they are less than 97% identical.
- Significant regions have at least X% of the read count of the cell
To help separate noise from real sequence, a number of parameters can be set that tag a given region with the label Significant: Yes/No. This one sets the minimum percentage of the total input reads for the Single Clone that must support a particular region for it to be called significant. Here we have set it to 5%, as we often find the Heavy chain supported by fewer sequences, and we also need to account for possible multiple flavors of the same chain. If this is not fulfilled, the region will be labelled as Significant: No in the output.
- Significant regions have at least X reads
This should be adjusted according to the expected depth of sequences for a given region. Here we keep it relatively low, at 10 reads (UMI-collapsed consensus sequences), as the data is a reduced dataset. If this is not fulfilled, the region will be labelled as Significant: No in the output.
- Significant regions have at least X% of the read count of the dominant same chain region
As mentioned, even with Single Clones one often finds multiple flavors of the same chain. To help decide whether a minority region is real or noise, the read count supporting the region in the minority cluster is compared with the read count supporting the same region in the cluster with the most sequences, defined as the dominant cluster. Here we have set it to 20%. If this is not fulfilled, the region will be labelled as Significant: No in the output.
- Only keep regions with at least X% of the read count of the dominant same chain region
This parameter avoids reporting every possible noise sequence while still letting you see medium-level contamination. Regions that are not judged Significant according to the three parameters above, but that pass this threshold, are kept in the output so that you can investigate these lower-level calls for potential issues and filter them away for the final result. Here we have set it to 5%.
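The three Significant thresholds and the keep threshold can be read as a small decision rule. The Python sketch below only illustrates that reading and is not the tool's actual implementation: the function name, the argument names, and the order of the checks are assumptions made for this example. The default values are the ones chosen in this tutorial (5%, 10 reads, 20%, and 5% for keeping a region).

```python
def classify_region(region_reads, cell_reads, dominant_reads,
                    min_pct_of_cell=5.0, min_reads=10,
                    min_pct_of_dominant=20.0, keep_pct_of_dominant=5.0):
    """Classify one region as 'discard', 'Significant: No' or 'Significant: Yes'.

    Hypothetical helper for illustration only.
    region_reads   -- reads supporting this region
    cell_reads     -- total input reads for the Single Clone (barcode)
    dominant_reads -- reads supporting the dominant region of the same chain
    """
    pct_of_cell = 100.0 * region_reads / cell_reads
    pct_of_dominant = 100.0 * region_reads / dominant_reads

    # Regions far below the dominant region are not reported at all.
    if pct_of_dominant < keep_pct_of_dominant:
        return "discard"

    # Failing any one of the three criteria gives Significant: No.
    significant = (pct_of_cell >= min_pct_of_cell
                   and region_reads >= min_reads
                   and pct_of_dominant >= min_pct_of_dominant)
    return "Significant: Yes" if significant else "Significant: No"
```

With these defaults, a region supported by 6 of 100 reads in a cell whose dominant region of the same chain has 60 reads is kept (10% of the dominant region) but labelled Significant: No, because it has fewer than 10 reads.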
Once the operation is completed, a new document is generated that contains all Single Clone results and is ready for further downstream processing; see Figure 4.1 below. The top panel shows the Name and Description of this file. In the Description, one can quickly see how many chains have been found in total and of what kind. For example here, from the 330 Barcode groups of sequences, a total of 834 entire V(D)J regions were identified (1): 334 Heavy chains (2) and 500 Light chains (3).
Understanding the results from the Single Clone Antibody Analysis
The results table from the Single Clone Antibody Analysis is similar to a regular Antibody Annotation result table and includes the columns, tools, functions, and graphical viewers described in NGS Tutorial 1. It also has some extra columns, see Figure 4.1 below. These include the No. of Sequences that made up a particular determined chain; the percentage of these sequences out of the total number of sequences used as input to that Single Clone; the percentage relative to the Dominant Same Chain, where the most prevalent sequence for that chain shows 100%; and a Significance column that holds a Yes or No call according to the settings used at run time.
Figure 4.1 | Result table as produced from the Single Clone Antibody Analysis pipeline.
In the result table in Figure 4.1 above, you will find a number of expected Single Clone results, specifically cell Barcodes each with one Heavy and one Light chain, without any secondary sequences called for the same chain (4).
Also highlighted are some typical cases where secondary sequences have been called.
One example includes the cell Barcode ACATCAGCAGCATGAG, where two Heavy and two Light chains have been called at roughly the same levels (5). This may indicate that two cells were present in the same droplet.
The bottom highlighted row indicates a secondary Light chain, but one will also notice that it is producing an aberrant transcript as indicated by the FrameShift (Light CDR3) in the Error column (6).
Figure 4.2 below shows another snippet from the same Single Clone Antibody Analysis result table, with examples of other typical issues you can find when secondary sequences are determined for the same chain of the same clone. Again, several clones have the typical one Heavy and one Light chain pattern, but you will also notice a number of clones with two Light chains called. In Figure 4.2, in addition to the Error column (1), we have moved the Missing Cysteine column (2) into view.
One of the secondary Light chains (3) is missing a Cysteine residue in FR3 (this call stems from looking for liabilities). This is also indicated by an annotation of the graphical display of the sequence (4).
The other two secondary Light chains (5) and (6) have been found with a Frameshift or Stop codon, respectively.
One can now use these different calls and annotations to filter non-functional or otherwise questionable sequences away. See next section.
Figure 4.2 | Another example of the Result table as produced from the Single Clone Antibody Analysis pipeline, here with a focus on columns that indicate issues with a given sequence.
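If the result table is exported for scripting, possible doublets such as the case marked (5) in Figure 4.1 can be flagged with a few lines of Python. This is a minimal sketch under stated assumptions: the row keys (barcode, chain, pct_of_dominant), the helper name, and the 60% cut-off used to mean "roughly the same level" are all hypothetical, and the rows are toy data, not values from the tutorial dataset.

```python
from collections import Counter, defaultdict

def find_possible_doublets(rows, min_pct_of_dominant=60.0):
    """Return barcodes with at least two Heavy and two Light chains
    whose read support is close to that of the dominant chain.

    Hypothetical helper; the key names and the cut-off are assumptions.
    """
    per_barcode = defaultdict(Counter)
    for row in rows:
        # Only count chains supported at a level comparable to the dominant one.
        if row["pct_of_dominant"] >= min_pct_of_dominant:
            per_barcode[row["barcode"]][row["chain"]] += 1
    return sorted(
        barcode for barcode, n in per_barcode.items()
        if n["Heavy"] >= 2 and n["Light"] >= 2
    )

# Toy rows for illustration only.
toy_rows = [
    {"barcode": "BC01", "chain": "Heavy", "pct_of_dominant": 100.0},
    {"barcode": "BC01", "chain": "Light", "pct_of_dominant": 100.0},
    {"barcode": "BC02", "chain": "Heavy", "pct_of_dominant": 100.0},
    {"barcode": "BC02", "chain": "Heavy", "pct_of_dominant": 92.0},
    {"barcode": "BC02", "chain": "Light", "pct_of_dominant": 100.0},
    {"barcode": "BC02", "chain": "Light", "pct_of_dominant": 88.0},
    {"barcode": "BC03", "chain": "Heavy", "pct_of_dominant": 100.0},
    {"barcode": "BC03", "chain": "Light", "pct_of_dominant": 100.0},
    {"barcode": "BC03", "chain": "Light", "pct_of_dominant": 7.0},
]
possible_doublets = find_possible_doublets(toy_rows)
```

A barcode with a single low-level secondary Light chain (BC03 in the toy data) is not flagged; that pattern is more likely noise or a non-functional rearrangement than a second cell.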
Filter, sort, and cluster sequences for the desired output
As described in the above section and seen in Figures 4.1 and 4.2, many of the Single Clone Antibody Analysis results contain one Heavy and one Light chain for each clone, but quite a few clones produce two or more sequences per chain. Some of these secondary sequences, as seen above, contain an issue (for example Frameshifts, Stop Codons, or Liabilities) that makes them non-functional, whereas others look to be non-problematic sequences.
To end up with a result with trustworthy sequences per clone, you can now filter away the problematic sequences and determine how to handle otherwise good secondary sequences.
Figure 4.3 shows how you can add a criterion to a Filter by right-clicking in a cell of a column and selecting Filter By at the bottom of the pull-down menu. This cannot be done from the title of the column, as that has a number of other column functions. In Figure 4.3 we picked a particular Frameshift error, but because we want the filter to include all errors, we change what we filter upon to ‘anything’. See Figure 4.4 for the first filter option (1): here we have asked it to keep only rows that do not contain (NOT LIKE) ‘anything’ (%) in the Error column.
To add further criteria to a Filter, the easiest way is to keep right-clicking on a cell in the columns desired and pick the Filter option. This will now show up in the Filter line, by default with an ‘AND’ to the previous filter. Here you can edit the actual criteria. For more information and options to Filter, please refer to Geneious Biologics’ Knowledge Base: How do I filter my results?
Figure 4.3 | An easy way to make a Filter is to right-click in a cell of the column one would like to filter on. Here we first sorted on the Error column, then right-clicked and picked the Filter option in the pull-down menu.
Figure 4.4 shows an example in which we filtered away sequences that had any kind of error (1), that did not meet the Significance criteria specified at run time (2), that were missing a Cysteine residue in certain regions as specified in the liabilities (3), or whose read count was less than 60% of the number of sequences making up the Dominant contig sequence of the same chain.
Figure 4.4 | Example of a composite Filter to get rid of many of the secondary called sequences from the Single Clone Antibody Analysis, often due to re-arranged, but non-functional sequences.
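The logic of the composite filter in Figure 4.4 can be written out as a single predicate, with each criterion joined by AND. The sketch below assumes the table has been exported as a list of rows; the key names (name, error, significant, missing_cysteine, pct_of_dominant) and the toy rows are hypothetical and do not reproduce the tool's own filter syntax.

```python
def passes_filter(row, min_pct_of_dominant=60.0):
    """Return True if a row survives all four criteria of the composite filter.

    Hypothetical helper; key names are assumptions made for this example.
    """
    return (
        not row["error"]                       # Error column is empty (NOT LIKE '%')
        and row["significant"] == "Yes"        # Significance criteria met at run time
        and not row["missing_cysteine"]        # no Missing Cysteine liability
        and row["pct_of_dominant"] >= min_pct_of_dominant
    )

# Toy rows for illustration only.
toy_rows = [
    {"name": "clone1_H", "error": "", "significant": "Yes",
     "missing_cysteine": "", "pct_of_dominant": 100.0},
    {"name": "clone1_L", "error": "", "significant": "Yes",
     "missing_cysteine": "", "pct_of_dominant": 100.0},
    {"name": "clone1_L2", "error": "FrameShift (Light CDR3)", "significant": "Yes",
     "missing_cysteine": "", "pct_of_dominant": 45.0},
    {"name": "clone2_L2", "error": "", "significant": "Yes",
     "missing_cysteine": "FR3", "pct_of_dominant": 70.0},
    {"name": "clone3_L2", "error": "", "significant": "No",
     "missing_cysteine": "", "pct_of_dominant": 8.0},
]
kept = [row["name"] for row in toy_rows if passes_filter(row)]
```

Because every criterion must hold, a secondary chain is removed as soon as it fails one of them, which is why most of the non-functional secondary sequences disappear with this one filter.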
As in the results table from the normal Antibody Annotator (Tutorial 1), you can use the ‘Group By’ function on the results from the Single Clone Antibody Analysis to group sequences based on certain regions and so screen for enriched clones. Figure 4.5 shows grouping of the Single Clones based on the Light CDR3 region. In this case the data is a BCR repertoire from a healthy donor, so there is not much redundancy. For selection experiments, you will be able to use the ‘Group By’ function to isolate enriched clones.
Figure 4.5 | Example of a Group By operation of the Light CDR3 region on the results stemming from Single Clone Antibody Analysis to identify enriched clones.
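Conceptually, Group By counts how many rows share the same value in the chosen region and lists the largest groups first. A minimal Python sketch of that idea follows; the key name light_cdr3, the helper name, and the toy CDR3 strings are assumptions for illustration and are not taken from the tutorial dataset.

```python
from collections import Counter

def group_by_region(rows, column="light_cdr3"):
    """Count rows per distinct value of `column`, largest group first.

    Hypothetical helper; rows without a value in `column` are skipped.
    """
    counts = Counter(row[column] for row in rows if row.get(column))
    return counts.most_common()

# Toy rows for illustration only.
toy_rows = [
    {"barcode": "BC01", "light_cdr3": "CQQYNSYPLTF"},
    {"barcode": "BC02", "light_cdr3": "CQQSYSTPLTF"},
    {"barcode": "BC03", "light_cdr3": "CQQYNSYPLTF"},
    {"barcode": "BC04", "light_cdr3": ""},
]
groups = group_by_region(toy_rows)
```

In a repertoire from a healthy donor most groups will have size 1, whereas after a selection experiment enriched clones show up as the large groups at the top of the list.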