Merging Paired Reads

Updated: July 12, 2024 01:49

This article covers merging NGS paired reads. Setting paired reads without merging is closely related and is discussed here. You can also read about the best way to assemble Sanger sequences here.

The video below gives a general introduction to pre-processing in Biologics. The first few videos in our Getting Started series, linked here, may also be helpful.

Using Set & Merge Paired Reads

Next-generation sequencers often generate paired-end reads. When the forward and reverse reads of a pair overlap, they can be merged into a single, longer read. Geneious Biologics merges paired reads using BBMerge from BBTools. Read more about BBMerge here.

To merge paired reads, select one or more sequence list documents and choose the Set & Merge Paired Reads option in the Pre-processing dropdown. Depending on your sequencing data, the reads may be stored as parallel sets of sequences or interlaced, so you need to specify how the reads should be paired.

Interlaced sequences within each document - Both reads in a pair are in the same file, one after the other: the first sequence is paired with the second, the third with the fourth, and so on. The sequence list you select must contain an even number of sequences.
Pairs of documents - The sequence lists are paired together, and each sequence in one list is paired with the sequence at the same position in the other list. You need to select 2 or more sequence lists to use this option.
- Note: Lists are assigned to "forward" or "reverse" according to their alphabetical order, following the order in which those roles appear in the selected Relative Orientation option. For example, with the Illumina mate pairs - outward pointing option, the list that comes first alphabetically is treated as the list containing the reverse reads.
Split each sequence in half - Use this when both reads in a pair have been concatenated into a single sequence in the FASTQ file. Each sequence is split into two equal halves, which are treated as a pair. All sequences must be the same length and have an even number of nucleotides to use this option.
Match sequence names within each document - Pairs are identified by sorting the sequences by name and then pairing adjacent sequences whose names differ by a single character, provided the two differing characters are in this list of possible pairs. Use this when the pairs may be in random order or some reads may be missing their mate.
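The four pairing options above can be pictured with a short Python sketch. This is only an illustration of the rules as described in this article, not the code Geneious Biologics runs. The function names, the equal-length check in `pair_documents`, and the contents of `ALLOWED_CHAR_PAIRS` are assumptions made for the example; the actual list of permitted character pairs is not reproduced here.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Hypothetical stand-in for the list of permitted differing characters.
ALLOWED_CHAR_PAIRS = {("1", "2"), ("F", "R")}


def pair_interlaced(seqs: List[str]) -> List[Pair]:
    """Pair the 1st sequence with the 2nd, the 3rd with the 4th, and so on."""
    if len(seqs) % 2 != 0:
        raise ValueError("interlaced pairing needs an even number of sequences")
    return [(seqs[i], seqs[i + 1]) for i in range(0, len(seqs), 2)]


def pair_documents(first: List[str], second: List[str]) -> List[Pair]:
    """Pair sequences that sit at the same position in two lists."""
    if len(first) != len(second):  # assumption: both lists have equal length
        raise ValueError("both lists must contain the same number of sequences")
    return list(zip(first, second))


def split_in_half(seqs: List[str]) -> List[Pair]:
    """Split each concatenated sequence into two equal halves."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must be the same length")
    if any(len(s) % 2 for s in seqs):
        raise ValueError("sequences must have an even number of nucleotides")
    return [(s[: len(s) // 2], s[len(s) // 2 :]) for s in seqs]


def _names_match(a: str, b: str) -> bool:
    """True if the names differ at exactly one position by an allowed pair."""
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and diffs[0] in ALLOWED_CHAR_PAIRS


def pair_by_name(named: Dict[str, str]) -> Tuple[List[Pair], List[str]]:
    """Sort by name, pair adjacent matching names, report reads without a mate."""
    names = sorted(named)
    pairs: List[Pair] = []
    unpaired: List[str] = []
    i = 0
    while i < len(names):
        if i + 1 < len(names) and _names_match(names[i], names[i + 1]):
            pairs.append((named[names[i]], named[names[i + 1]]))
            i += 2
        else:
            unpaired.append(names[i])
            i += 1
    return pairs, unpaired
```

In this sketch, only `pair_by_name` tolerates reads in arbitrary order or with a missing mate, which is why that option suits data where pairs may be shuffled or incomplete.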

Select the appropriate Relative Orientation for your data, as different sequencing technologies orient their paired reads differently.

To choose the merge rate, select Set and merge paired reads as the Output format, then pick a rate from the dropdown. A Very High merge rate merges more read pairs but produces more false positives, while a Low merge rate merges fewer pairs with fewer false positives.
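The trade-off between merge rate and false positives can be illustrated with a toy overlap merger in Python. This is a deliberately simplified sketch and is not how BBMerge decides whether to merge: it assumes inward-pointing reads, ignores quality scores, and uses two hypothetical parameters, `min_overlap` and `max_mismatch_fraction`, as a stand-in for a strictness setting.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(
    forward: str,
    reverse: str,
    min_overlap: int = 10,
    max_mismatch_fraction: float = 0.0,
) -> Optional[str]:
    """Merge an inward-pointing pair, or return None if no overlap is accepted."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    rev = reverse_complement(reverse)  # bring the reverse read onto the forward strand
    # Try the longest candidate overlap first: end of forward vs start of rev.
    for overlap in range(min(len(forward), len(rev)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(forward[-overlap:], rev[:overlap]))
        if mismatches <= max_mismatch_fraction * overlap:
            # Keep the forward read's bases in the overlap, then append the rest.
            return forward + rev[overlap:]
    return None


# A 30 bp fragment read as two 20 bp reads that overlap by 10 bp.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
forward_read = fragment[:20]
clean_reverse = reverse_complement(fragment[10:])
# The same reverse read with one sequencing error inside the overlap.
noisy_reverse = reverse_complement(fragment[10:15] + "T" + fragment[16:])

strict_result = merge_pair(forward_read, noisy_reverse, min_overlap=8)
loose_result = merge_pair(
    forward_read, noisy_reverse, min_overlap=8, max_mismatch_fraction=0.2
)
```

With the strict setting, the pair containing a sequencing error is left unmerged, while the looser setting merges it. The same leniency that rescues this pair would also accept unrelated reads that happen to share a short, imperfect overlap, which is the source of the extra false positives at higher merge rates.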

To start the operation, click Run. Depending on your data, this operation will output a merged reads file and an unmerged reads file.

Note that we currently only support sequence list documents.

Troubleshooting low merge rates

Set paired reads only and proceed to Annotation

Our Set & Merge Paired Reads pipeline uses BBMerge, which is an external, general-purpose algorithm.

If you encounter low merge rates, consider letting annotation do the merging instead. We often find that the underlying algorithm used by our Annotation pipelines is better at merging paired reads to construct full antibody sequences, as it is more specialized.

To set paired reads without merging them, go to Pre-processing > Set & Merge Paired Reads and select Set paired reads only.

[Screenshot: the Set paired reads only option]

You can then take the paired reads through to annotation. All of our Annotation pipelines accept paired reads as input.