Geneious Biologics uses specific methods to extract Representative Sequences from clustered or combined sequences when performing alignments and sequence exports. This article outlines the Representative Sequence selection methods and where they apply.
Jump to:
- What is a Representative Sequence?
- When are Representative Sequences used?
- Types of Representative Sequences
What is a Representative Sequence?
When large numbers of sequences are grouped together, for example in Clusters, you can often select a Representative Sequence from within the group to represent it in downstream analysis, such as a Sequence Alignment or a sequence export.
Representative sequences can be selected using liability scores or VDJ abundance ranking as described below. If clusters have neither liability scores nor complete VDJ sequences, the first sequence in the cluster list will serve as the representative.
When are Representative Sequences used?
There are two situations where sequences are grouped: 1) Clusters are sequences grouped by similarity at regions of interest, and 2) an Alignment can combine duplicate sequences across the region chosen for alignment. In both of these cases it is useful to be able to select one sequence to represent that grouping.
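As a rough illustration of the second case, the sketch below combines sequences whose chosen alignment region is identical. It is not the Geneious Biologics implementation: the dictionary layout, the function name `collapse_duplicates` and the sequence `ARDGYFDV` are assumptions made for the example.

```python
from collections import defaultdict


def collapse_duplicates(sequences, region):
    """Group sequences that are identical across the chosen region.

    `sequences` is a list of dicts and `region` is the key holding the
    region used for alignment (both are hypothetical structures).
    """
    groups = defaultdict(list)
    for seq in sequences:
        groups[seq[region]].append(seq)
    return dict(groups)


# Illustrative data only.
sequences = [
    {"name": "seq1", "HCDR3": "ARWEYYAMDY"},
    {"name": "seq2", "HCDR3": "ARWEYYAMDY"},
    {"name": "seq3", "HCDR3": "ARDGYFDV"},
]

groups = collapse_duplicates(sequences, "HCDR3")
for region_value, members in groups.items():
    print(region_value, [m["name"] for m in members])
```

Each resulting group then needs a single sequence to stand in for it, which is where a Representative Sequence comes in.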
Clusters
When looking at a cluster table, there is an option to extract a Representative Sequence from each cluster row when exporting sequences, or when aligning multiple clusters.
You can also view a Representative Sequence from each cluster in the Sequence Viewer, rather than every sequence contained in that cluster.
Collapsed regions in an Alignment
When performing an alignment across one or more regions, you have the option to combine duplicate sequences. In this case, one of those sequences, a Representative Sequence, is used to represent the combined sequences. This option appears in the Alignment options when selecting sequences from the All Sequences table.
Types of Representative Sequences
There are two methods used to determine which sequence from within the cluster or combined group is taken forward as the representative.
Representative Sequence (by Liability Score)
This option will extract the sequence that has the highest liability score from within the grouping.
Example:
- A cluster containing 2 sequences
- Sequence A (Liability Score = -126)
- Sequence B (Liability Score = -183)
-> Sequence A would be chosen as the representative.
The score of a sequence is determined according to your Liability Settings.
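The selection rule in this example can be sketched in a few lines. This is a minimal illustration, not the product's code: the dictionary fields and the function name `representative_by_liability` are assumptions, and how ties between equal scores are resolved is not stated in this article (here the first such sequence wins).

```python
def representative_by_liability(cluster):
    """Return the sequence with the highest liability score.

    Scores in the example are negative, so "highest" means closest to zero.
    `cluster` is a hypothetical list of dicts with a "liability_score" key.
    """
    return max(cluster, key=lambda seq: seq["liability_score"])


# The two sequences from the example above.
cluster = [
    {"name": "Sequence A", "liability_score": -126},
    {"name": "Sequence B", "liability_score": -183},
]

rep = representative_by_liability(cluster)
print(rep["name"])  # Sequence A
```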
Representative Sequence (by Most Common)
This option will extract the sequence with the most common VDJ Region from within the grouping - regardless of whether a sequence with a less common VDJ Region has a higher score.
Example:
- The first row of the Heavy CDR3 cluster table is chosen, containing 5 underlying sequences
- All sequences in this cluster have a HCDR3 of ARWEYYAMDY.
- 3 of the parent sequences have the same VDJ Region -> VDJ ID 1
- The highest scored parent sequence in VDJ 1 is -178
- 2 of the parent sequences have a different VDJ Region -> VDJ ID 2
- The highest scored parent sequence in VDJ 2 is -140
-> The parent sequence chosen as the representative is the -178 scored sequence that has the more common VDJ Region of ID 1, even though a sequence with a different (but less common) VDJ has a higher score.
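The example above can be sketched as a two-step selection: find the most common VDJ Region, then take the highest scored sequence within it. This is only an illustration under stated assumptions. The field names, the function name `representative_by_most_common` and the scores marked as hypothetical are invented for the example, and the second step is inferred from the example rather than from a documented rule.

```python
from collections import Counter


def representative_by_most_common(cluster):
    """Pick the highest scored sequence among those sharing the most common VDJ.

    `cluster` is a hypothetical list of dicts with "vdj_id" and
    "liability_score" keys.
    """
    counts = Counter(seq["vdj_id"] for seq in cluster)
    most_common_vdj, _ = counts.most_common(1)[0]
    candidates = [seq for seq in cluster if seq["vdj_id"] == most_common_vdj]
    return max(candidates, key=lambda seq: seq["liability_score"])


cluster = [
    {"name": "P1", "vdj_id": 1, "liability_score": -178},
    {"name": "P2", "vdj_id": 1, "liability_score": -190},  # hypothetical score
    {"name": "P3", "vdj_id": 1, "liability_score": -205},  # hypothetical score
    {"name": "P4", "vdj_id": 2, "liability_score": -140},
    {"name": "P5", "vdj_id": 2, "liability_score": -155},  # hypothetical score
]

rep = representative_by_most_common(cluster)
print(rep["name"], rep["liability_score"])  # P1 -178
```

Note that P4 has a higher score than P1 but is not selected, because its VDJ Region is the less common one.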
Caveats to Most Common:
- This option is only available if your sequences have full VDJ regions (or a full VJ region for light-only datasets). Please reach out to us if you would like us to support Most Common for truncated sequences (e.g. FR2 -> FR4).
- In general, Most Common will be calculated according to the VDJ Region, except in these cases:
- When working from a Light region cluster (e.g. Light CDR3) -> the most common VJ Region will be used
- When working with Heavy-Heavy (dual nanobody) sequences, the behaviour is as follows:
- When working from a Heavy chain 1 cluster -> the most common H1 VDJ is used
- When working from a Heavy chain 2 cluster -> the most common H2 VDJ is used
- When working from a combination cluster (e.g. H1CDR3-H2CDR3) -> the most common H1 VDJ is used
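The cases above amount to a lookup from the kind of cluster to the region that is counted for Most Common. The sketch below restates them as a table. The `ClusterKind` enum, its member names and the returned labels are hypothetical and exist only for this example.

```python
from enum import Enum


class ClusterKind(Enum):
    """Hypothetical labels for the kind of cluster being worked from."""
    HEAVY = "heavy"
    LIGHT = "light"
    HEAVY_CHAIN_1 = "heavy chain 1"
    HEAVY_CHAIN_2 = "heavy chain 2"
    HEAVY_COMBINATION = "H1-H2 combination"


# Region counted when determining Most Common, per the caveats above.
REGION_FOR_MOST_COMMON = {
    ClusterKind.HEAVY: "VDJ",                 # the general case
    ClusterKind.LIGHT: "VJ",                  # e.g. a Light CDR3 cluster
    ClusterKind.HEAVY_CHAIN_1: "H1 VDJ",      # Heavy-Heavy, chain 1 cluster
    ClusterKind.HEAVY_CHAIN_2: "H2 VDJ",      # Heavy-Heavy, chain 2 cluster
    ClusterKind.HEAVY_COMBINATION: "H1 VDJ",  # e.g. H1CDR3-H2CDR3
}


def region_for_most_common(kind):
    """Return the region used for Most Common for a given cluster kind."""
    return REGION_FOR_MOST_COMMON[kind]


print(region_for_most_common(ClusterKind.LIGHT))              # VJ
print(region_for_most_common(ClusterKind.HEAVY_COMBINATION))  # H1 VDJ
```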