Collapse UMI Duplicates and Separate Barcodes

Updated: April 21, 2026 22:11

The Collapse UMI Duplicates and Separate Barcodes operation lets you group and collapse your sequences based on their barcodes and UMIs, after trimming adapters. To learn more about what UMIs and barcodes are, see Understanding Single Cell technologies: Barcodes and UMIs.

The video below explains the concepts behind UMIs and barcodes in the context of single cell analysis. The video on NGS pre-processing in our Getting Started series, linked here, may also be helpful.

What is the Collapse UMI Duplicates and Separate Barcodes operation?

Depending on the sequencing platform and laboratory preparation, your sequences may contain adapters, barcodes, unique molecular identifiers (UMIs) and template switching oligos (TSOs). To learn more about these, see Understanding Single Cell technologies: Barcodes and UMIs. The Collapse UMI Duplicates and Separate Barcodes operation allows you to group and collapse your sequences based on the barcodes and UMIs, after trimming adapters and TSOs.

The operation first trims adapters from all sequences, then separates sequences based on defined or fixed-length barcodes. The sequences within each barcode are then grouped based on the identity of their fixed-length UMI sequences. Finally, TSOs are trimmed and sequence similarity and quality thresholds are used to collapse sequences for each UMI.
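The order of these steps can be illustrated with a minimal Python sketch. It is not the operation's actual implementation: the function names, the `ParsedRead` type and the fixed 5' layout (adapter, barcode, UMI, TSO, insert) are assumptions made for illustration, and the final collapse step is left out.

```python
from collections import defaultdict
from typing import NamedTuple


class ParsedRead(NamedTuple):
    """Hypothetical container for the parts of one read."""
    barcode: str
    umi: str
    insert: str


def parse_read(seq, adapter_len=0, barcode_len=0, umi_len=0, tso=""):
    """Split a read laid out as adapter | barcode | UMI | TSO | insert.

    A length of 0 (or an empty TSO) means that element is absent,
    so only the relevant steps are carried out.
    """
    pos = adapter_len                      # skip (trim) the adapter
    barcode = seq[pos:pos + barcode_len]   # fixed-length barcode
    pos += barcode_len
    umi = seq[pos:pos + umi_len]           # fixed-length UMI
    pos += umi_len
    rest = seq[pos:]
    if tso and rest.startswith(tso):       # trim the TSO if it is there
        rest = rest[len(tso):]
    return ParsedRead(barcode, umi, rest)


def group_reads(seqs, **layout):
    """Group insert sequences by their (barcode, UMI) pair."""
    groups = defaultdict(list)
    for seq in seqs:
        read = parse_read(seq, **layout)
        groups[(read.barcode, read.umi)].append(read.insert)
    return dict(groups)


# Toy reads: 2-base adapter, 4-base barcode, 3-base UMI, TSO "GGG".
reads = [
    "TTAAAACCCGGGACGTACGT",
    "TTAAAACCCGGGACGTACGA",
    "TTCCCCTTTGGGTTTTGGGG",
]
for key, inserts in group_reads(
    reads, adapter_len=2, barcode_len=4, umi_len=3, tso="GGG"
).items():
    print(key, inserts)
```

Each group produced here would then be collapsed using the similarity and quality thresholds described in the Options section.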

Note that the operation does not require both barcodes and UMIs to be present in your sequences. You may choose which of these are present, and only the relevant steps will be carried out. This operation was designed with raw 10X Genomics data in mind.

How do I run Collapse UMI Duplicates and Separate Barcodes?

To run Collapse UMI Duplicates and Separate Barcodes, select one or more nucleotide sequences or lists in your folder and select Collapse UMI Duplicates & Separate Barcodes... in the Pre-processing dropdown.

Screen_Shot_2020-03-26_at_4.42.22_PM.png

In the dialog that appears, select the options relevant to your sequence data and click Run. The operation will output a Nucleotide Sequence List of collapsed sequences and a report detailing how sequences were grouped, collapsed and discarded (based on quality options).

What does the output look like?

Each output sequence of the Collapse UMI Duplicates and Separate Barcodes operation will usually represent a unique parent mRNA molecule, because all reads that share a UMI derive from the same parent molecule. Any barcodes present will be marked on the output sequences. You can see the barcode by enabling it in the "Sidebar" menu as shown below. Occasionally several sequences may be output for a single UMI/barcode combination. This occurs when the sequences are too different to be combined into a single sequence. The similarity threshold can be adjusted as an option (see "Collapse sequences with the same UMI" in the Options section below).

Screen_Shot_2020-04-16_at_12.18.56_PM.png

The pipeline also creates a UMI Stats Report document that can be used to perform a sanity check on your dataset.

Screen_Shot_2022-10-19_at_3.42.28_PM.png

Once barcodes have been identified, the barcoded sequences can be annotated by using the Single Clone Analysis operation, which will identify the dominant chain(s) present for each barcode. For an example of a workflow for analysing barcoded data, refer to NGS tutorial 3 here.

Options

Adapter Options

Important: Tick the Adapter checkbox to specify that your sequences have a stretch of nucleotides before the barcode and/or UMI. Enter the number of bases your adapters span in the Length input field; that many bases will be disregarded and/or trimmed from the beginning of each of your sequences.

Barcode Options

Important: Tick the Barcode checkbox to specify that your sequences have barcodes after adapters (if present) and before UMIs (if present). There are two options available:

Specify Barcodes By length
- Select the By length radio option and enter the number of bases your barcodes span in the Length input field. That many bases, taken from the start of each sequence following the adapter (if present), will be used as the barcode for separating sequences.
Specify Barcodes By name
- Select the By name radio option if you would like to explicitly define your barcodes with a name and sequence. For instructions on how to specify your named barcodes, please visit this support page.

You may choose to discard all sequences that don't contain any of the named barcodes you specified by selecting the Discard sequences not matching named barcodes option. The number of sequences discarded is given in the report produced by the operation.

You can also discard low-abundance barcodes. A barcode and its sequences are discarded if the barcode contains fewer than a specified number of sequences (the Discard barcodes containing fewer than input field) or fewer than a specified percentage of the size of the dominant barcode (the of dominant barcode (whichever is higher) input field). The higher of the two thresholds applies. The number of sequences discarded by this option is shown in the report produced by the operation.
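The "whichever is higher" rule can be sketched in Python as follows. This is an illustration only: the function name and parameters are hypothetical, and it assumes the dominant barcode is the one containing the most sequences.

```python
def barcodes_to_keep(barcode_counts, min_count, min_percent_of_dominant):
    """Return the barcodes that pass the low-abundance filter.

    barcode_counts maps each barcode to its number of sequences.
    Assumption: the dominant barcode is the one with the most sequences.
    """
    if not barcode_counts:
        return set()
    dominant = max(barcode_counts.values())
    # The effective cutoff is the higher of the two thresholds.
    cutoff = max(min_count, dominant * min_percent_of_dominant / 100)
    # Barcodes with fewer sequences than the cutoff are discarded.
    return {bc for bc, n in barcode_counts.items() if n >= cutoff}


counts = {"AAAA": 1000, "CCCC": 40, "GGGG": 8}
# 5% of 1000 is 50, which is higher than 10, so the cutoff is 50.
print(barcodes_to_keep(counts, min_count=10, min_percent_of_dominant=5))
```

With these toy counts, lowering the percentage to 1 makes both thresholds equal to 10, so "CCCC" would be kept as well.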

UMI Options

Important: Tick the UMI checkbox to specify that your sequences have UMIs after barcodes (if present) and before TSOs (if present).

Specify the length of your UMIs in the Length input field. The number of bases specified will be taken from each sequence following the barcode (if specified) and used for grouping sequences. Sequences that share the same UMI and Barcode will be analysed for similarity and collapsed to give as few sequences per UMI as possible.

TSO Options

Important: TSOs are sometimes used as part of barcoding/10X Genomics processes. Tick the TSO checkbox to specify that your sequences have TSOs after UMIs (if present) and before the insert sequence.

Enter a sequence into the Sequence input box to specify your TSO sequence. It will be ignored when determining insert sequence similarity while collapsing UMIs.

Additional Settings

Adapter/Barcode/UMI are present at:

Specify which end of the sequence(s) your Adapter and/or Barcode and/or UMIs are present on.

5' end: Choose this if your Barcodes/UMIs are on the R1 read of Illumina sequences.
3' end: Choose this if your Barcodes/UMIs are on the R2 read of Illumina sequences.
Both ends: This will look for a barcode at the 5' end of the R1 reads and the 3' end of the R2 reads. The two barcodes are then joined and treated as a single, unique barcode (e.g. CAAGGGATGGGCAGAT+GAGGTGAAGCTGATGG).
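As a rough illustration of the Both ends option, the Python sketch below joins a 5' barcode from R1 and a 3' barcode from R2 with a "+". The function name is hypothetical, and the sketch assumes fixed-length barcodes with no adapter and no orientation handling, which may differ from what the operation does internally.

```python
def combined_barcode(r1, r2, barcode_len):
    """Join the 5' barcode of R1 and the 3' barcode of R2 with '+'.

    Assumption: both barcodes have the same fixed length and sit at the
    very ends of the reads.
    """
    if barcode_len <= 0:
        raise ValueError("barcode_len must be positive")
    five_prime = r1[:barcode_len]     # start of the R1 read
    three_prime = r2[-barcode_len:]   # end of the R2 read
    return f"{five_prime}+{three_prime}"


r1 = "CAAGGGATGGGCAGAT" + "ACGTACGT"
r2 = "TTGGCCAA" + "GAGGTGAAGCTGATGG"
print(combined_barcode(r1, r2, barcode_len=16))
```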

Allow single mismatch in UMI, barcode, and TSO

Sequencing errors can introduce mismatches in UMI, barcode, and TSO sequences. Select this option to loosen the matching criteria so that a single mismatch is allowed.
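Allowing a single mismatch can be thought of as accepting a Hamming distance of at most 1. The Python sketch below applies this to named barcodes. The function names are hypothetical, and returning no match when more than one barcode qualifies is an assumption of this sketch, not documented behaviour.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed, known, max_mismatches=1):
    """Return the name of the known barcode matching `observed`, or None.

    `known` maps barcode names to sequences. Ambiguous matches (more
    than one candidate) return None; this is an assumption.
    """
    hits = [
        name
        for name, seq in known.items()
        if len(seq) == len(observed)
        and hamming(observed, seq) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


known = {"bc1": "ACGTAC", "bc2": "TTGCAA"}
print(match_barcode("ACGTAT", known))                    # one mismatch to bc1
print(match_barcode("ACGTAT", known, max_mismatches=0))  # exact match required
```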

Trim adapter, barcode, UMI, and TSO from results

Select this option to remove the adapter, barcode, UMI, and TSO sequences from the output collapsed sequences.

Discard sequences shorter than:

You may discard all sequences shorter than a specified length by ticking this checkbox and entering a number of bases in the corresponding input field.

Discard sequences with a chance of error over:

You may discard sequences that have a chance of errors over a specified percentage by selecting this checkbox and entering a percentage in the corresponding input field.
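This article does not state how the chance of error is calculated. As an illustration only, the Python sketch below assumes it is the mean per-base error probability derived from Phred quality scores (P = 10^(-Q/10)), and combines it with the minimum length filter. All names are hypothetical.

```python
def mean_error_percent(quals):
    """Mean per-base error probability, as a percentage.

    Assumption: `quals` are Phred scores, where P(error) = 10 ** (-Q / 10).
    """
    if not quals:
        raise ValueError("no quality scores given")
    return 100 * sum(10 ** (-q / 10) for q in quals) / len(quals)


def passes_filters(seq, quals, min_length, max_error_percent):
    """Keep a sequence only if it is long enough and accurate enough."""
    if len(seq) < min_length:
        return False  # shorter than the minimum length
    return mean_error_percent(quals) <= max_error_percent


# Phred 30 corresponds to roughly a 0.1% chance of error per base.
print(passes_filters("ACGT", [30, 30, 30, 30], min_length=4, max_error_percent=1.0))
# Phred 10 corresponds to roughly a 10% chance of error per base.
print(passes_filters("ACGT", [10, 10, 10, 10], min_length=4, max_error_percent=5.0))
```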

Collapse sequences with the same UMI if they are more than:

This input field lets you specify the minimum sequence similarity (identity) required for collapsing sequences within each UMI. Choose a value low enough to tolerate sequencing errors but high enough to retain real variation.
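The effect of this threshold can be illustrated with a simple greedy clustering sketch in Python. It is not the operation's actual algorithm: the ungapped identity measure, the use of the first sequence in each cluster as its reference, the treatment of the threshold as inclusive and the choice of the most frequent sequence as the output are all assumptions made to keep the example short.

```python
from collections import Counter


def percent_identity(a, b):
    """Ungapped identity: matching positions over the longer length."""
    if not a and not b:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def collapse(seqs, min_identity):
    """Greedily cluster sequences sharing one UMI, one output per cluster."""
    clusters = []  # the first member of each cluster is its reference
    for seq in seqs:
        for cluster in clusters:
            # Assumption: the threshold is inclusive.
            if percent_identity(seq, cluster[0]) >= min_identity:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # too different from every cluster
    # Report the most frequent sequence of each cluster.
    return [Counter(c).most_common(1)[0][0] for c in clusters]


same_umi = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT"]
print(collapse(same_umi, min_identity=90))
print(collapse(same_umi, min_identity=95))
```

At 90% the single-base variant is absorbed into the first cluster, while at 95% it is output as a separate sequence. This is the trade-off between tolerating sequencing errors and retaining real variation.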