In this tutorial, you will learn how to group sequences according to Barcodes and collapse sequences together if they have the same UMI (Unique Molecular Identifier) and the same sequence. If you'd like to learn more about what barcodes and UMIs are, see Understanding Single Cell technologies: Barcodes and UMIs. The video below on using Barcodes and UMIs in Single Cell Analysis gives an overview of these concepts:
The tools here are very flexible and can also be used for other technologies. For barcodes, you also have the option of providing lists of known barcodes, for example, if plate, column, and/or row barcodes are being used for single clone mass sequencing. The UMI tools are also applicable for normal repertoire sequencing if UMIs have been incorporated.
This tutorial will cover the following exercises:
- Pairing paired-end NGS reads
- Separating Barcodes and Collapsing UMIs from 10X Genomics data
- Understanding the results from Collapse UMI Duplicates & Separate Barcodes
Get started: To start this tutorial, you will need the input data. If you have recently started Geneious Biologics, your organisation may already have the tutorial folders set up as described in the tutorial below. If not, you can still follow this tutorial by first downloading the input sequences here and then uploading them into Geneious Biologics. See here for more information about where the data was sourced.
Note: The Barcode and UMI functionality is available as an add-on. If your organisation does not have Barcode Separation, UMI Collapse, or Single Clone Analysis pipelines, please contact us to try them out.
The videos in our Getting Started series may also be helpful, linked here. Below is our video on Pre-processing NGS Sequences.
Pairing paired-end NGS reads
Tutorial 1 includes an example of pairing and merging NGS paired reads. Here in Tutorial 3, however, the 10X Genomics technology means that not all read pairs are expected to overlap, and the non-overlapping pairs are crucial for reaching the 3’ end of the V(D)J region and the Constant region. We therefore want to pair the reads, but not merge them.
To pair these paired-end reads, select both the paired-end documents in the 1. Input folder and click Pre-processing > Set & Merge Paired Reads (see image below).
As we only want to pair these reads, select the following options in the Set & Merge Paired Reads dialog box and click Run to start the analysis (see sections and image below).
- Pairs of lists
- Forward/Reverse (inward pointing, e.g. Illumina paired end)
- Set paired reads only
Once the operation is completed, a new document will be generated containing all read pairs. To check the pairing, open the document and hover your cursor over one sequence: both that sequence and its partner will be highlighted (see Figure 3.1).
Figure 3.1 | A paired reads sequence list. The cursor is placed on the top sequence, and the two sequences in the read pair are now highlighted in red. In addition, the dumbbell icon in front of the sequence names indicates pairing.
Read more on Set & Merge Paired Reads here.
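The difference between setting pairs and merging them can be illustrated with a minimal Python sketch. It assumes that the "Pairs of lists" option matches the forward and reverse lists by position, and the function name set_paired_reads is hypothetical, not part of Geneious Biologics. The point is only that each read keeps its own sequence and is linked to its mate, rather than being merged into one sequence.

```python
def set_paired_reads(forward_reads, reverse_reads):
    """Link forward and reverse reads by position without merging them.

    Assumption: the i-th forward read is the mate of the i-th reverse read.
    """
    if len(forward_reads) != len(reverse_reads):
        raise ValueError("forward and reverse lists must have the same length")
    # Each pair keeps both original sequences; nothing is overlapped or merged.
    return list(zip(forward_reads, reverse_reads))


# Illustrative (made-up) reads
forward = ["ACGTACGTAC", "GGGTTTAAAC"]
reverse = ["TTGACCATGA", "CCATGGTACA"]
pairs = set_paired_reads(forward, reverse)
```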
Separating Barcodes and Collapsing UMIs from 10X Genomics data
The paired read pairs now need to be internally annotated and grouped according to their Barcode sequence. Sequences with the same UMI within a Barcode group will also be collapsed if there is sufficient overlap and identity.
Select the paired-end document in the 2. Paired reads folder and click Pre-processing > Collapse UMI Duplicates & Separate Barcodes (see image below).
For 10X Genomics data that starts with a 16 bp Barcode followed by a 10 bp UMI, select the following options in the Collapse UMI Duplicates & Separate Barcodes dialog box and click Run to start the analysis (see sections and image below).
Barcode (for separation)
- By length: 16 bp
- Discard barcodes containing fewer than: 100 sequences or 5% of dominant barcode
- These parameters will most likely need to be adjusted depending on the dataset, the expected number of sequences per single cell sequenced, and the evenness across the single cells. The lower these numbers, the more results are returned, including both real results and results from noise.
- Length: 10 bp (the length of the UMI)
- Adapter/Barcode/UMI are present at: 5' end
- Other options include the 3' end or both ends. This refers to where the Barcodes/UMIs are present relative to the fully assembled sequence. For Illumina sequencing, if your Barcodes/UMIs are on your R2 read then choose the 3' end, while if they are on your R1 read select the 5' end.
- Allow single mismatch in UMI, barcode, and TSO
- ‘Allow single mismatch in UMI, barcode, and TSO’ also works with sequences where the first base may be missing, as well as with normal mismatches.
- Trim adapter, barcode, UMI, and TSO from results
- ‘Trim adapter, barcode, UMI, and TSO from results’ is critical in the case of 10X Genomics, because a later step assembles the UMI-collapsed sequences from a given barcode, and it is best for those sequences not to carry a 26 bp tail that differs in the UMI.
- Collapse sequences with the same UMI if they are more than: 90 % identical
- ‘Collapse sequences with the same UMI if they are more than X% identical’ is primarily used to ensure that different templates that inadvertently share the same UMI are not collapsed together. In the image below, this was set to 90% in the overlap region*.
*The data in this tutorial is very sparse, with only a few sequences per UMI per Barcode. In addition, due to the nature of 10X Genomics data, the second read of a pair often does not overlap. As a result, many sequences will not collapse even though they share the same UMI.
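The by-length settings above can be sketched in Python. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, and the discard rule is assumed to drop a Barcode that falls below either threshold (100 sequences, or 5% of the dominant Barcode's count).

```python
from collections import defaultdict

BARCODE_LEN = 16  # 10X Genomics Barcode length used in this tutorial
UMI_LEN = 10      # 10X Genomics UMI length used in this tutorial


def split_read(seq):
    """Split a read whose 5' end starts with Barcode + UMI into its parts."""
    barcode = seq[:BARCODE_LEN]
    umi = seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    insert = seq[BARCODE_LEN + UMI_LEN:]  # the 26 bp prefix is trimmed off
    return barcode, umi, insert


def group_by_barcode(seqs):
    """Group (UMI, trimmed insert) tuples under their Barcode."""
    groups = defaultdict(list)
    for seq in seqs:
        barcode, umi, insert = split_read(seq)
        groups[barcode].append((umi, insert))
    return dict(groups)


def filter_barcodes(groups, min_count=100, min_fraction=0.05):
    """Keep Barcodes that meet both thresholds (assumed interpretation)."""
    if not groups:
        return {}
    dominant = max(len(members) for members in groups.values())
    cutoff = max(min_count, min_fraction * dominant)
    return {bc: members for bc, members in groups.items() if len(members) >= cutoff}


# The Barcode and UMI are taken from the example later in this tutorial;
# the 8 bp insert is made up for illustration.
example = "GATCGCGAGAATGTGT" + "CGAGCGCACA" + "ACGTACGT"
barcode, umi, insert = split_read(example)
```

Lowering min_count or min_fraction keeps more Barcode groups, which matches the trade-off between real results and noise described above.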
Read more about how to use Collapse UMI Duplicates & Separate Barcodes. If you have specific Barcodes with known sequences, for example for Single Clones in wells in plates, see our article How to Specify Barcodes. Also be aware that, for other technologies, it is possible to define different ‘oligo segments’ around the insert on Read1 and Read2 of a read pair.
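When a list of known Barcodes is supplied, allowing a single mismatch amounts to accepting an observed Barcode that is within one substitution of exactly one known Barcode. The Python sketch below illustrates that idea only. The function names are hypothetical, ambiguous matches are assumed to be rejected, and the missing-first-base case that the tool also handles is not modelled.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, known_barcodes, max_mismatches=1):
    """Return the known Barcode matching `observed`, or None.

    An exact match wins outright. Otherwise the observed Barcode is accepted
    only if exactly one known Barcode lies within `max_mismatches`.
    """
    if observed in known_barcodes:
        return observed
    hits = [
        bc for bc in known_barcodes
        if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Illustrative (made-up) 8 bp well Barcodes
known = ["ACGTACGT", "TTTTCCCC"]
match = assign_barcode("ACGTACGA", known)     # one mismatch to the first Barcode
no_match = assign_barcode("GGGGGGGG", known)  # too far from both
```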
Understanding the Results from Collapse UMI Duplicates & Separate Barcodes
Once the operation is completed, two new documents will be generated:
- Sequence List with all collapsed UMI contig consensus sequences, each assigned to a particular Barcode. With the limited dataset used here, run with the parameters described above, one obtains a total of 330 Barcode groups, each with between 100 and 348 collapsed UMI consensus sequences, for a total of 81,560 sequences. Figure 3.3 shows the content of this file. Each UMI consensus sequence is assigned a name that includes the Barcode and UMI in question (here Barcode: GATCGCGAGAATGTGT, UMI: CGAGCGCACA), as well as how many sequences with that same Barcode and UMI ended up in the same consensus sequence. Here, collapsing the five read pairs with this Barcode-UMI combination gives rise to two consensus read pairs, one containing four read pairs (4 of 5 instances) and one containing a single read pair (1 of 5 instances).
Also note the lengths of the sequences in each pair: the Barcode and UMI (16 bp + 10 bp = 26 bp) have been trimmed off the first consensus sequence of each pair.
In addition, although it is not visible, the file records the corresponding Barcode for each UMI consensus sequence, so that sequences from the same Barcode can be kept together for subsequent steps such as Single Clone Antibody Annotation (see Tutorial 4).
Figure 3.3 | Top of sequence list generated by the ‘Collapse UMI Duplicates & Separate Barcodes’ tool run on the 10X Genomics data.
- UMI Stats Report includes information about the collapsing of read pairs with the same UMI. Here (see Figure 3.4) we have 500,000 read pairs as input, with a total of 347,614 different UMIs. When allowing for one mismatch, this reduces to 341,849 UMI sets. The Reads per UMI set graph shows the distribution of how many UMI sets contain how many read pairs. Here, as we have a reduced dataset, most UMI sets have just one read pair, a significant fraction have two, three or four read pairs, and a few UMI sets contain tens or hundreds of read pairs. The bottom graph, Consensus sequences per UMI set, shows how many UMI sets contain how many consensus sequences as a result of collapsing the underlying read pairs. You will notice that in this case not all UMI sets are collapsed into one consensus sequence. Here this is primarily because Read2 may overlap little or not at all, as a result of the reduced dataset.
Figure 3.4 | UMI Stats Report generated by the ‘Collapse UMI Duplicates & Separate Barcodes’ tool run on the 10X Genomics data.
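The two ideas behind these results, collapsing by identity within a UMI set and counting reads per UMI set, can be sketched in Python. This is a simplified illustration, not the tool's algorithm: it compares single sequences position by position instead of aligning read pairs over their overlap region, the clustering is a simple greedy pass, and all function names are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions over the shared length (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def collapse_umi_set(seqs, min_identity=0.90):
    """Greedily cluster sequences that share a UMI.

    A sequence joins the first cluster whose first member it matches at more
    than `min_identity`; otherwise it starts a new cluster.
    """
    clusters = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) > min_identity:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def reads_per_umi_set(umis):
    """Map a UMI set size to the number of UMI sets of that size."""
    per_umi = Counter(umis)
    return Counter(per_umi.values())


# Five made-up reads sharing one UMI: four from one template (one of them
# with a single mismatch, 95% identical) and one from a different template.
template = "ACGTACGTACGTACGTACGT"
reads = [template, template, "ACGTACGTACGTACGTACGA", template, "T" * 20]
clusters = collapse_umi_set(reads)
sizes = [len(c) for c in clusters]  # consensus groups of 4 and 1 reads
distribution = reads_per_umi_set(["u1", "u1", "u2", "u3"])
```

With these inputs the five reads fall into two groups, mirroring the "4 of 5 instances" and "1 of 5 instances" naming described above.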
Now that you have preprocessed the barcoded data, you can annotate and analyse the sequences using the Single Clone Analysis pipeline tool. This workflow is covered in NGS Tutorial 4.
The example included here shows both Barcode separation and UMI collapsing at the same time. It is a reduced dataset from one of 10X Genomics’ public Single Cell V(D)J BCR datasets (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_cd19_b). CD19+ B cells were isolated from PBMCs of a healthy donor, and NGS sequencing libraries were prepared following the Single Cell V(D)J Reagent Kits User Guide (CG000086 RevC).