What is the Collapse UMI Duplicates and Separate Barcodes operation?
Depending on sequencing platform and laboratory preparation, your sequences may have adapters, barcodes, unique molecular identifiers (UMIs) and/or template switching oligos (TSOs). To learn more about these, see this article. The Collapse UMI Duplicates and Separate Barcodes operation allows you to group and collapse your sequences based on their barcodes and UMIs, after trimming adapters and TSOs.
The operation first trims adapters from all sequences, then separates sequences based on defined or fixed-length barcodes. The sequences within each barcode are then grouped based on the identity of their fixed-length UMI sequences. Finally, TSOs are trimmed and sequence similarity and quality thresholds are used to collapse sequences for each UMI.
- Note that the operation does not require adapters, barcodes, UMIs or TSOs to be present in your sequences. You choose which of these are present, and only the relevant steps will be carried out. This operation can also be used with raw 10X Genomics data in some cases.
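The order of steps described above can be illustrated with a minimal Python sketch. It is not the operation's actual implementation: the function names (`split_read`, `group_reads`) and parameters are hypothetical, and it assumes all elements sit at the 5' end of each read with fixed lengths and no mismatches.

```python
from collections import defaultdict


def split_read(read, adapter_len=0, barcode_len=0, umi_len=0, tso=""):
    """Split a read into (barcode, umi, insert).

    Assumed layout (5' to 3'): adapter, barcode, UMI, TSO, insert.
    Any element with length 0 (or an empty TSO) is treated as absent.
    """
    pos = adapter_len                       # skip the adapter bases
    barcode = read[pos:pos + barcode_len]   # fixed-length barcode
    pos += barcode_len
    umi = read[pos:pos + umi_len]           # fixed-length UMI
    pos += umi_len
    insert = read[pos:]
    if tso and insert.startswith(tso):      # trim the TSO if it is present
        insert = insert[len(tso):]
    return barcode, umi, insert


def group_reads(reads, **layout):
    """Group inserts first by barcode, then by UMI."""
    groups = defaultdict(lambda: defaultdict(list))
    for read in reads:
        barcode, umi, insert = split_read(read, **layout)
        groups[barcode][umi].append(insert)
    return groups
```

For example, `split_read("GGAAAACCCTTTACGT", adapter_len=2, barcode_len=4, umi_len=3, tso="TTT")` separates the read into the barcode `AAAA`, the UMI `CCC` and the insert `ACGT`. Collapsing the inserts within each UMI group is a separate step that depends on the similarity and quality options.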
The Barcode and UMI functionality is available as an add-on. If your organisation does not have Barcode Separation, UMI Collapse, or Single Clone Analysis pipelines enabled, please contact us to try them out.
- How do I run the Collapse UMI Duplicates and Separate Barcodes operation?
- What does the output look like?
- Options for Collapse UMI Duplicates and Separate Barcodes
How do I run the Collapse UMI Duplicates and Separate Barcodes operation?
To specify Collapse UMI Duplicates and Separate Barcodes options, select one or more nucleotide sequences or lists in your folder and select Collapse UMI Duplicates & Separate Barcodes... in the Pre-processing dropdown.
To run the Collapse UMI Duplicates and Separate Barcodes operation, select options relevant to your sequence data in the dialog that appears and click Run. The operation will output a Nucleotide Sequence List of collapsed sequences and a report detailing how sequences were grouped, collapsed and discarded (based on quality options).
What does the output look like?
The output sequences of the Collapse UMI Duplicates and Separate Barcodes operation will usually each represent a unique parent mRNA molecule. Any barcodes present will be marked on the output sequences. You can see the barcode by enabling it in the "Sidebar" menu. Occasionally, several sequences may be output for a single UMI/barcode combination. This occurs when the sequences are too different to be combined into a single sequence. The similarity threshold can be adjusted (see the "Collapse sequences with the same UMI" option in the Options section below).
The pipeline also creates a UMI Stats Report document that can be used to perform a sanity check on your dataset.
Once barcodes have been identified, the barcoded sequences can be annotated by using the Single Clone Analysis operation, which will identify the dominant chain(s) present for each barcode. For an example of a workflow for analysing barcoded data, refer to NGS tutorial 3 here.
Options for Collapse UMI Duplicates and Separate Barcodes
Important: Tick the Adapter checkbox to specify that your sequences have a stretch of nucleotides before the barcode and/or UMI. Specify the number of bases your adapters span in the Length input field; this number of bases will be disregarded and/or trimmed from the beginning of each of your sequences.
Important: Tick the Barcode checkbox to specify that your sequences have barcodes after adapters (if present) and before UMIs (if present). There are two options available:
- Specify Barcodes By length
- Specify the number of bases your barcodes span by selecting the By length radio option and entering a number in the Length input field. That number of bases will be taken from the start of each sequence, following the adapter (if present), and used as the barcode for separating sequences.
- Specify Barcodes By name
- Select the By name radio option if you would like to explicitly define your barcodes with a name and sequence. For instructions on how to specify your named barcodes, please visit this support page.
You may choose to discard all sequences that don't contain any of the named barcodes you specified by selecting the Discard sequences not matching named barcodes option. The report produced by the operation shows how many sequences were discarded.
- You can also discard sequences whose barcode contains fewer than a specified number of sequences (the Discard barcodes containing fewer than input field) or fewer than a specified percentage of the number of sequences in the dominant barcode (the of dominant barcode (whichever is higher) input field). Whichever of the two thresholds is higher is applied. The number of sequences discarded by this option is shown in the report produced by the operation.
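The "whichever is higher" rule for discarding small barcodes can be sketched as follows. This is an illustration only: the function name `filter_barcodes` and its parameters are hypothetical, and it assumes that a barcode is kept when its sequence count is at least the effective threshold.

```python
def filter_barcodes(barcode_counts, min_count=0, min_percent_of_dominant=0.0):
    """Keep barcodes whose sequence count meets the effective threshold.

    The effective threshold is the higher of:
      - min_count, an absolute number of sequences, and
      - min_percent_of_dominant percent of the largest barcode's count.
    """
    if not barcode_counts:
        return {}
    dominant = max(barcode_counts.values())
    threshold = max(min_count, dominant * min_percent_of_dominant / 100.0)
    return {bc: n for bc, n in barcode_counts.items() if n >= threshold}
```

With counts of 1000, 40 and 5 sequences, an absolute minimum of 10 and a 5% setting give an effective threshold of 50 (5% of 1000), so only the dominant barcode is kept. Lowering the percentage to 1% makes the threshold 10, and the barcode with 40 sequences is kept as well.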
Important: Tick the UMI checkbox to specify that your sequences have UMIs after barcodes (if present) and before TSOs (if present).
Specify the length of your UMIs in the Length input field. The number of bases specified will be taken from each sequence following the barcode (if specified) and used for grouping sequences. Sequences that share the same UMI will be analysed for similarity and collapsed to give as few sequences per UMI as possible.
Important: TSOs are sometimes used as part of barcoding/10X Genomics processes. Tick the TSO checkbox to specify that your sequences have TSOs after UMIs (if present) and before the insert sequence.
Enter a sequence into the Sequence input box to specify your TSO sequence. The TSO will be ignored when determining insert sequence similarity during UMI collapsing.
Adapter/Barcode/UMI are present at:
Specify which end of the sequence(s) your Adapter and/or Barcode and/or UMIs are present on. For example, if your Barcodes/UMIs are on the R1 read of Illumina sequences, select the 5' end. The other options are 3' end and both ends.
Mismatches in adapters
Sequencing errors can introduce mismatches in adapter sequences. To loosen the matching criteria and allow for a single mismatch, select the Allow single mismatch in UMI, barcode, and TSO option.
Select the Trim adapter, barcode, UMI, and TSO from results option to remove these elements from the output collapsed sequences.
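As an illustration of what allowing a single mismatch means, the sketch below matches an observed barcode against a set of named barcodes using the Hamming distance (the number of differing positions). The function names are hypothetical, and returning no match when more than one named barcode fits is an assumption of this sketch, not documented behaviour of the operation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed, named_barcodes, allow_single_mismatch=False):
    """Return the name of the matching barcode, or None.

    named_barcodes maps a name to its sequence. Ambiguous matches
    (more than one candidate) return None in this sketch.
    """
    max_mismatches = 1 if allow_single_mismatch else 0
    hits = [
        name
        for name, seq in named_barcodes.items()
        if len(seq) == len(observed) and hamming(observed, seq) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None
```

With named barcodes `{"bc1": "ACGT", "bc2": "TTTT"}`, the observed sequence `ACGA` matches nothing under exact matching but is assigned to `bc1` when a single mismatch is allowed.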
You may discard all sequences shorter than a specified length by ticking the Discard sequences shorter than checkbox and entering a number of bases in the corresponding input field.
You may discard sequences that have a chance of errors over a specified percentage by ticking the Discard sequences with a chance of error over checkbox and entering a percentage in the corresponding input field.
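The two discard options above can be pictured with the following sketch. The source does not state how the chance of error is calculated, so this example assumes it is the mean per-base error probability derived from Phred quality scores (an error probability of 10^(-Q/10) for a score Q). The function names and parameters are hypothetical.

```python
def mean_error_percent(phred_scores):
    """Mean per-base error probability, as a percentage.

    Assumes standard Phred scaling: P(error) = 10 ** (-Q / 10).
    """
    if not phred_scores:
        raise ValueError("at least one quality score is required")
    probs = [10 ** (-q / 10) for q in phred_scores]
    return 100.0 * sum(probs) / len(probs)


def passes_filters(sequence, phred_scores, min_length=None, max_error_percent=None):
    """Return True if the sequence survives both optional filters."""
    if min_length is not None and len(sequence) < min_length:
        return False  # shorter than the minimum length
    if (max_error_percent is not None
            and mean_error_percent(phred_scores) > max_error_percent):
        return False  # chance of error is over the threshold
    return True
```

Under this assumption, a read whose bases all have a Phred score of 20 has a 1% chance of error per base, so it would pass a 5% threshold, while a read of all Q10 bases (10% per base) would be discarded.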
Sequence similarity for collapsing
The Collapse sequences with the same UMI if they are more than input field lets you specify the minimum sequence similarity (identity) required for collapsing sequences within each UMI. Choose a value low enough to tolerate sequencing errors but high enough that real variation is not collapsed.
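To show how the similarity threshold affects the number of output sequences per UMI, here is a deliberately simplified sketch. It is not the algorithm used by the operation: it assumes a greedy grouping in which each sequence joins the first group whose representative it resembles closely enough, and it measures identity position by position without alignment. All names are hypothetical.

```python
def percent_identity(a, b):
    """Position-wise identity (no alignment), relative to the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / longest


def collapse_umi_group(sequences, min_identity=90.0):
    """Greedily collapse sequences that share a UMI.

    Returns a list of (representative, member_count) tuples. A sequence
    joins a group when its identity to the representative is strictly
    greater than min_identity; otherwise it starts a new group.
    """
    clusters = []  # each entry is [representative, members]
    for seq in sequences:
        for cluster in clusters:
            if percent_identity(seq, cluster[0]) > min_identity:
                cluster[1].append(seq)
                break
        else:
            clusters.append([seq, [seq]])
    return [(rep, len(members)) for rep, members in clusters]
```

Given the sequences `ACGTACGTAC`, `ACGTACGTAC`, `ACGTACGTAA` and `TTTTTTTTTT` under one UMI, a threshold of 85% yields two output sequences, because the third sequence is 90% identical to the first and is collapsed into it. Raising the threshold to 95% yields three, because the single-base difference is then treated as real variation.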