This article outlines how to process NGS data, such as Illumina (HiSeq, MiSeq, etc.) and PacBio reads, in Geneious Biologics.
The flow chart below displays the general workflow for working with NGS data:
If you have Single Cell data (such as 10X) or Sanger sequences, please see our other articles: Single Cell Analysis Workflows or Sanger Analysis Workflows.
Pre-processing
NGS sequences with UMIs
These include sequences from NGS kits such as the SMART-Seq Human TCR (with UMIs) kit from Takara Bio. If you are unsure what UMIs are, see Understanding Single Cell technologies: Barcodes and UMIs.
Set & Merge Paired Reads
Raw reads from NGS technologies will often need to be either merged or paired. The Set & Merge Paired Reads operation can be found under Pre-processing:
Choose the appropriate option for your paired reads. Often this will be either:
- Interlaced sequences within each list
- Pairs of lists (as with R1 and R2 files)
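The difference between the two layouts can be pictured with a short Python sketch. This is only a conceptual illustration, not how Geneious Biologics handles pairing internally; the function names `split_interlaced` and `pair_lists` are hypothetical, and reads are represented as plain strings.

```python
def split_interlaced(reads):
    """Split an interlaced list [R1_a, R2_a, R1_b, R2_b, ...] into (R1, R2) pairs."""
    if len(reads) % 2 != 0:
        raise ValueError("An interlaced list must contain an even number of reads")
    # Even positions hold the forward reads, odd positions hold their mates.
    return list(zip(reads[0::2], reads[1::2]))


def pair_lists(r1_reads, r2_reads):
    """Pair two separate lists (for example, R1 and R2 files) by position."""
    if len(r1_reads) != len(r2_reads):
        raise ValueError("R1 and R2 lists must be the same length")
    return list(zip(r1_reads, r2_reads))


# Toy data: the same two read pairs in both layouts.
interlaced = ["ACGT", "TTGA", "GGCA", "CCAT"]
pairs = split_interlaced(interlaced)
same_pairs = pair_lists(["ACGT", "GGCA"], ["TTGA", "CCAT"])
```

Both layouts carry the same information; the option you choose simply tells the operation where to find each read's mate.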
To decide whether to merge your reads or only set them as pairs, consider the following:
- If not all of your read pairs are expected to overlap, but you would like to retain this sequence information, you can choose to Set Paired Reads without merging them.
- If you choose to Merge Paired Reads, the operation will produce two files. One file will contain all your successfully merged read pairs, while the other will contain the reads that could not be merged due to a lack of overlapping sequence.
Note that Set & Merge Paired Reads can be skipped if you are working with long-read technologies, such as PacBio sequences.
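To illustrate why merging produces two outputs, here is a minimal Python sketch of overlap-based merging. It is an assumption-laden simplification: it requires an exact overlap between the end of R1 and the start of the reverse-complemented R2, whereas real read mergers typically tolerate mismatches and use quality scores. The names `merge_pair`, `merge_all`, and `min_overlap` are hypothetical and do not correspond to Geneious Biologics settings.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(r1, r2, min_overlap=10):
    """Merge a read pair on an exact overlap, or return None if there is none."""
    r2_rc = reverse_complement(r2)
    longest = min(len(r1), len(r2_rc))
    # Try the longest candidate overlap first: suffix of R1 vs prefix of R2 (rev-comp).
    for size in range(longest, min_overlap - 1, -1):
        if r1[-size:] == r2_rc[:size]:
            return r1 + r2_rc[size:]
    return None


def merge_all(pairs, min_overlap=10):
    """Split read pairs into merged sequences and pairs that could not be merged."""
    merged, unmerged = [], []
    for r1, r2 in pairs:
        result = merge_pair(r1, r2, min_overlap)
        if result is None:
            unmerged.append((r1, r2))
        else:
            merged.append(result)
    return merged, unmerged


# Toy example: one pair sequenced from an overlapping fragment, one pair with no overlap.
fragment = "AAACCCGGGTTTACGTACGTTAGC"
overlapping = (fragment[:16], reverse_complement(fragment[8:]))
non_overlapping = ("AAAAAAAAAA", "CCCCCCCCCC")
merged, unmerged = merge_all([overlapping, non_overlapping], min_overlap=6)
```

In this toy run the first pair is reconstructed into the original fragment and lands in `merged`, while the second pair ends up in `unmerged`, mirroring the two output files described above.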
Collapse UMIs
Because these sequences have UMIs, next go to Pre-processing > Collapse UMI Duplicates & Separate Barcodes:
The following options are applicable to most datasets. Note that only the relevant sections need to be ticked: if your data only has UMIs, you can leave the Adapter and Barcode sections unticked. For details, see our main article, Collapse UMI Duplicates and Separate Barcodes.
UMI Section
- Length:
- Input the length of the UMI in bp on your sequences.
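As a rough picture of what the UMI length is used for, the Python sketch below separates a fixed-length UMI from the rest of a read. It is a conceptual illustration only; the function `split_umi` and its `end` parameter are hypothetical, it handles only a single 5' or 3' UMI (not both ends), and it ignores adapters, barcodes, and TSOs.

```python
def split_umi(read, umi_length, end="5'"):
    """Return (umi, remaining_sequence) for a UMI of fixed length at one end of a read."""
    if umi_length <= 0:
        raise ValueError("UMI length must be a positive number of bases")
    if len(read) < umi_length:
        raise ValueError("Read is shorter than the UMI length")
    if end == "5'":
        return read[:umi_length], read[umi_length:]
    if end == "3'":
        return read[-umi_length:], read[:-umi_length]
    raise ValueError("end must be \"5'\" or \"3'\"")


# Toy read with a 12 bp UMI at the 5' end.
umi, insert = split_umi("ACGTACGTACGTTTGACC", 12)
```

If the length entered does not match the UMI actually present on your reads, bases from the UMI would be left on the sequence (or bases from the sequence would be treated as UMI), which is why this value needs to match your library design.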
Settings Section
- Adapter/Barcode/UMI are present at:
- Specify which end of the sequence(s) your UMIs are present on. For example, if your UMIs are on the R1 read of Illumina sequences, select the 5' end. The other options are: 3' end or both ends.
- Allow single mismatch in UMI, barcode, and TSO
- Sequencing errors can introduce mismatches in adapter sequences. To loosen the criteria for similarity to allow for a single mismatch, select this option.
- Trim adapter, barcode, UMI, and TSO from results
- Select this option to remove these adapter, barcode, UMI, and TSO sequences from the output collapsed sequences. This can be turned off for quality assurance purposes, but should be on for sequences taken through to annotation and analysis.
- Discard sequences shorter than:
- You may discard all sequences shorter than a specified length by ticking this checkbox and entering a number of bases in the corresponding input field.
- Discard sequences with a chance of error over:
- You may discard sequences that have a chance of error over a specified percentage by ticking this checkbox and entering a percentage in the corresponding input field.
- Collapse sequences with the same UMI if they are more than:
- This input field sets the minimum sequence similarity (identity) required to collapse sequences within each UMI. Choose a value low enough to tolerate sequencing errors, but high enough that real variation is not collapsed.
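The way the filtering and collapsing settings interact can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the Geneious Biologics algorithm: "chance of error" is taken to be the probability that a read contains at least one error, computed from Phred quality scores; identity is a position-by-position comparison with no alignment; and sequences are collapsed greedily onto the first sufficiently similar representative. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def chance_of_error(phred_scores):
    """Probability that a read contains at least one error (assumed definition)."""
    p_all_correct = 1.0
    for q in phred_scores:
        p_all_correct *= 1.0 - 10 ** (-q / 10)
    return 1.0 - p_all_correct


def identity(a, b):
    """Fraction of matching positions, without alignment (a simplification)."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def collapse_by_umi(reads, min_identity=0.9, min_length=0, max_error=1.0):
    """Collapse reads that share a UMI and are similar enough.

    reads: iterable of (umi, sequence, phred_scores)
    returns: list of (umi, representative_sequence, count)
    """
    clusters = defaultdict(list)  # umi -> list of [representative, count]
    for umi, seq, quals in reads:
        # Discard reads that are too short or too likely to contain an error.
        if len(seq) < min_length or chance_of_error(quals) > max_error:
            continue
        for cluster in clusters[umi]:
            if identity(seq, cluster[0]) >= min_identity:
                cluster[1] += 1
                break
        else:
            clusters[umi].append([seq, 1])
    return [(umi, rep, n) for umi, groups in clusters.items() for rep, n in groups]


good = [30] * 10
reads = [
    ("AAAA", "ACGTACGTAC", good),
    ("AAAA", "ACGTACGTAT", good),       # one mismatch: collapses at 90% identity
    ("AAAA", "TTTTTTTTTT", good),       # same UMI, but a genuinely different sequence
    ("CCCC", "ACGTACGTAC", good),       # different UMI: never collapsed with the first
    ("CCCC", "ACG", [30] * 3),          # shorter than min_length: discarded
    ("CCCC", "ACGTACGTAC", [10] * 10),  # chance of error too high: discarded
]
collapsed = collapse_by_umi(reads, min_identity=0.9, min_length=5, max_error=0.5)
```

Raising `min_identity` to 1.0 in this toy example would keep the two near-identical reads apart, which shows the trade-off described above: too strict a threshold treats sequencing errors as real variants, while too loose a threshold merges real variants.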
You can now proceed to NGS Antibody Annotation below.
Processing NGS sequences without UMIs
Set & Merge Paired Reads
Raw reads from NGS technologies will often need to be either merged or paired. The Set & Merge Paired Reads operation can be found under Pre-processing:
Choose the appropriate option for your paired reads. Often this will be either:
- Interlaced sequences within each list
- Pairs of lists (as with R1 and R2 files)
To decide whether to merge your reads or only set them as pairs, consider the following:
- If not all of your read pairs are expected to overlap, but you would like to retain this sequence information, you can choose to Set Paired Reads without merging them.
- If you choose to Merge Paired Reads, the operation will produce two files. One file will contain all your successfully merged read pairs, while the other will contain the reads that could not be merged due to a lack of overlapping sequence.
Note that Set & Merge Paired Reads can be skipped if you are working with long-read technologies, such as PacBio sequences.
You can now proceed to NGS Antibody Annotation below.
NGS Antibody Annotation
For NGS data, we recommend using the NGS Antibody Annotator, because it collapses the input down to representative sequences according to a specified threshold while retaining the original count.
To run the NGS Antibody Annotator, select one or more sequence documents and go to the Annotation > NGS Antibody Annotator dropdown:
The Main Options are listed below:
Reference database:
- The Reference Database can be a database of antibody germline sequences or of template (variable region) sequences. Your Biologics account will include a Human and Mouse Ig germline database. If you would like access to other germlines, please either contact us or see How to make a Custom Reference Database.
The reference database is used to help identify the correct FR and CDR regions in the new sequences being analysed. See Understanding Reference Databases.
Selected sequences are:
- This asks which chains you expect on each read or pair of reads. For more information, see the main NGS Antibody Annotator article.
Sequence region of interest:
- This allows you to set the boundaries of your sequence. For example, if your forward sequencing primer was located in FR2 while your reverse sequencing primer was located in the Constant region, you might set this to CDR2 -> FR4.
Collapse Sequences at least:
- This is the threshold at which sequences will be assumed to be identical, and collapsed down (while retaining the count). The default is 100%.
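At the default threshold of 100%, collapsing amounts to counting exact duplicates. The Python sketch below shows that idea on toy data; it is not the NGS Antibody Annotator's implementation, the function name `collapse_identical` is hypothetical, and thresholds below 100% would need a similarity-based grouping step that is not shown here.

```python
from collections import Counter


def collapse_identical(sequences):
    """Collapse exact duplicates, keeping one representative and its count.

    Returns (sequence, count) tuples ordered from most to least abundant.
    """
    return Counter(sequences).most_common()


# Toy region-of-interest sequences from six reads.
reads = ["ACGTTG", "ACGTTG", "GGCATA", "ACGTTG", "GGCATA", "TTAGCC"]
collapsed = collapse_identical(reads)
```

Six input reads become three rows, and the counts still add up to six, which is what "collapsed down while retaining the count" means in practice.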
Retain upstream and downstream of fully annotated region:
- These options retain the specified number of nucleotides upstream/downstream of the Sequence region of interest (explained above) when trimming the ends of contigs. This is in order to identify duplicates when ignoring incorrect contig ends. If a contig is not long enough to cover the specified range, it will be excluded from the next step in the pipeline.
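The effect of these options can be sketched in Python as below. This is a hedged illustration of the behaviour described above, not the product's code: the function `trim_to_region` is hypothetical, and the region of interest is assumed to be already known as 0-based, end-exclusive coordinates within each contig.

```python
def trim_to_region(seq, region_start, region_end, upstream=0, downstream=0):
    """Trim a contig to its region of interest plus fixed-size flanks.

    Returns the trimmed sequence, or None if the contig is too short to
    cover the requested flanks (such contigs are excluded).
    """
    start = region_start - upstream
    end = region_end + downstream
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


# Two contigs covering the same region (ACGTACGT) but with different ragged ends.
contig_1 = "TTTT" + "ACGTACGT" + "GGGG"   # region at positions 4..12
contig_2 = "ATT" + "ACGTACGT" + "GGC"     # region at positions 3..11
trimmed_1 = trim_to_region(contig_1, 4, 12, upstream=2, downstream=2)
trimmed_2 = trim_to_region(contig_2, 3, 11, upstream=2, downstream=2)
too_short = trim_to_region(contig_2, 3, 11, upstream=5, downstream=2)
```

After trimming, the two contigs are identical and could be collapsed as duplicates even though their untrimmed ends differ. Asking for more flanking sequence than a contig has causes it to be dropped, so larger values trade retained context for fewer usable contigs.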
For more options, see the main NGS Antibody Annotator article.
Inputting multiple sequence lists
Multiple sequence lists can be selected and run through NGS Antibody Annotator. This will result in one output file (an Annotation Result Document), which will contain all the sequences collapsed only within the same list, not across lists.
All graphs and clustering will be across the whole dataset (multiple lists).
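The distinction between collapsing within lists and collapsing across lists can be shown with a small Python sketch. It assumes exact-match (100%) collapsing and uses hypothetical list names and a hypothetical function, `collapse_within_lists`; it is an illustration of the behaviour described above rather than the actual implementation.

```python
from collections import Counter


def collapse_within_lists(sequence_lists):
    """Collapse exact duplicates inside each list, never across lists.

    sequence_lists: dict mapping list name -> list of sequences
    returns: rows of (list_name, sequence, count) in a single combined result
    """
    rows = []
    for name, sequences in sequence_lists.items():
        for seq, count in Counter(sequences).most_common():
            rows.append((name, seq, count))
    return rows


# Two toy input lists that share one sequence (ACGT).
inputs = {
    "sample_A": ["ACGT", "ACGT", "GGCC"],
    "sample_B": ["ACGT", "TTAA"],
}
rows = collapse_within_lists(inputs)
```

The shared sequence appears as two rows, one per input list, each with its own count, while all rows still sit together in one result.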
Viewing your results
The results will look similar to an Antibody Annotator result, but each row of the All Sequences Table will represent multiple sequences that have been "collapsed" together due to having 100% identity (or whatever percentage you set in the Main Options section). It is therefore more similar to a Cluster Table; see Understanding "Clusters".
In the above example output, the most abundant heavy chain in the dataset (Heavy-1) is selected, and its sequence is shown in the Sequence Viewer below. % of Sequences is the percentage of the entire dataset that this sequence makes up, and # Sequences is the exact number of sequences that were collapsed into it (497 in this example). See Exploring the tables produced by NGS and Single Clone Antibody Analysis for a description of all the columns produced.
Like any other Biologics Annotator Result document, you can also:
- Filter your Sequences
- View the Graphs for Quality Assurance and Graphs to interpret Clusters and Clonotypes
- Perform Sequence Alignments
- View the Clusters in your dataset
- Add Assay Data to your Analysis Results
- Compare Results across Multiple Experiments
- Export Annotated Sequences and Sequence Tables