Collections are a way of saving curated lists of annotated sequences, complete with assay data and other relevant sequence information. This article explains how to search Collections for sequences that are closely related to your selected candidates.
If you don't have any Collections yet, you may wish to create one first. See our Creating a Collection article for more details.
Performing a search
Collections can be searched directly from Antibody Annotator, NGS Antibody Annotator and Single Cell Antibody Annotator result documents. To do this, open your Biologics Annotator Result document and make sure you are viewing the All Sequences table. First, select the sequences you want to search for in the Collection (up to 200 at a time). Then, choose the Search Collections option in the Post-processing dropdown.
Search Options
In the search options dialog you are able to choose which region of the sequence you would like to compare (such as VDJ Region or Light CDR3). You can also choose whether you would like to do the sequence comparisons by nucleotide or amino acid alphabets.
Collections: Select one or more Collections to search. By using the SHIFT or CMD/CTRL key, you can select multiple Collections to search at once.
Search Region: Select a region of the sequence you would like to compare. This will extract the chosen region from both your selected sequences and the Collection sequences before the search runs.
Compare By: Choose whether the search compares sequences as Nucleotides or as Amino Acids. If you choose to compare by Amino Acids and you have nucleotide sequences in your analysis or Collection, these sequence regions will be automatically translated before the search runs. The Nucleotide option won't be available if any of the selected sequences or Collection sequences are protein sequences.
Match Similarity Threshold: The minimum Total % Match score of the selected regions required for a match to be included in the Search Result. This is calculated as the percentage of residue matches over the selected region. Total % Match works differently for very short sequences (3 amino acids or fewer, or 10 nucleotides or fewer). Please see the Search Algorithm Details section below for more details.
Maximum number of matches: The number of top matches to include per sequence (sorted by Total % Match as described above).
Only compare regions of the same length: Restricts the search to matches that are the same length as the query region.
Output: Allows you to give your search a custom name.
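To make the interaction between the Match Similarity Threshold, Maximum number of matches and Only compare regions of the same length options concrete, here is a minimal Python sketch. It is an illustration only, not the product's implementation: the Match class, the select_matches function and all parameter names are hypothetical, and it assumes each candidate match already has a Total % Match score.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical record for one candidate match from a Collection."""
    name: str
    total_percent_match: float
    region_length: int


def select_matches(matches, query_length, threshold, max_matches,
                   same_length_only=False):
    """Apply the three search options to a list of candidate matches."""
    kept = [
        m for m in matches
        # Match Similarity Threshold: drop anything below the minimum score.
        if m.total_percent_match >= threshold
        # Only compare regions of the same length (optional).
        and (not same_length_only or m.region_length == query_length)
    ]
    # Rank by Total % Match, best first, then keep the top N.
    kept.sort(key=lambda m: m.total_percent_match, reverse=True)
    return kept[:max_matches]


candidates = [
    Match("seq_a", 100.0, 14),
    Match("seq_b", 92.3, 12),
    Match("seq_c", 95.0, 12),
    Match("seq_d", 80.0, 12),
]
top = select_matches(candidates, query_length=12, threshold=90.0, max_matches=2)
same_length = select_matches(candidates, query_length=12, threshold=90.0,
                             max_matches=2, same_length_only=True)
```

In this example, top contains seq_a and seq_c, while same_length contains seq_c and seq_b, because seq_a is excluded once the lengths must agree.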
Search Result
When the Collection Search is finished running, it will produce a Collection Search Result which you can see on the Collection Search Results tab in your Annotation Result document. This tab will automatically appear when there is at least one search result.
The Collection Search Results tab will always show the latest search result by default.
This table lists the query sequences used in the search, each with its selected Best Match from the Collections that were searched. By default, the Best Match is the sequence with the highest Total % Match score. Selecting a row will show an alignment of the query sequence and its Best Match sequence in the sequence viewer below.
Any matches for a query sequence can be shown by expanding the sub-table under the query sequence row. This Matches sub-table is ordered by the Total % Match by default, and shows detailed information for the top match(es). If you select a match row, the sequence viewer will show an alignment of the query sequence region and matching fragment of the Collection sequence region.
Changing 'Best Matches'
When a sequence is identified as a closest match, or 'Best Match', that match is also recorded in the corresponding Collections table. The Number of Matches column in the Collection table shows how many times that sequence (or a region of that sequence) was considered the Best Match in a search.
You can un-assign Best Matches in the Search Results view if you do not consider that sequence a match. If there are multiple matches for a single query sequence, you can also assign one of the other matches as the 'Best Match' using the Set Best Match button. Changing the Best Match will also change the match information shown in the corresponding query sequence row in the search result table.
Add to Collection
If a query sequence does not have any matches (or even if it does), you can choose to add that query sequence to a Collection. You can do this directly in the Search Results view, by selecting the query sequence(s) and then clicking the Add sequences button above the table. For more information about the Add sequences options, see here.
Search Algorithm Details
Under the hood, the search uses a customized BLAST-like algorithm. This means several things:
- Very short query sequences (<4 Amino Acids or <11 nucleotides in length) will have no matches. This corresponds to the alignment word size. We are currently updating the search algorithm, so in the future very short sequences will only match if the sequence match is exact (no residue differences tolerated).
- Sequences require a certain level of local similarity in order to be considered a match (the word size). So if you have two sequences where every second amino acid matches, but 3 consecutive amino acids never match, then they will not be considered a possible match.
- Sequences in the Collection that do not have the specified match region annotated will be skipped during the search. For example, a light chain sequence will never appear as a match for a Heavy CDR3 search.
- Total % Match is used to sort and rank matches. It is calculated as the percentage of residue matches over the selected region. Total % Match is calculated relative to the query sequence, which means that if the query sequence is an exact subsequence of a match, it will currently show up as matching 100%. If this is an issue, you can turn on the "Only compare regions of the same length" option.
BLAST score and coverage metrics are also available in the search results.
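The word size requirement and the query-relative Total % Match described above can be illustrated with a short Python sketch. This is a simplified, ungapped approximation written for this article, not the actual search algorithm: the function names are hypothetical, and the word size of 3 used in the example is an assumption taken from the amino acid example above.

```python
def has_shared_word(query: str, target: str, word_size: int) -> bool:
    """Return True if any run of word_size residues in the query also
    occurs in the target (a simplified BLAST-style seed check)."""
    if len(query) < word_size or len(target) < word_size:
        return False  # too short to contain even one word
    target_words = {
        target[i:i + word_size] for i in range(len(target) - word_size + 1)
    }
    return any(
        query[i:i + word_size] in target_words
        for i in range(len(query) - word_size + 1)
    )


def percent_match_relative_to_query(query: str, target: str) -> float:
    """Best ungapped percentage of identical residues, divided by the
    query length (an assumption that ignores gaps for simplicity)."""
    if not query:
        return 0.0
    best = 0
    # Slide the query along the target and count identical positions.
    for offset in range(-(len(query) - 1), len(target)):
        identical = sum(
            1
            for i, residue in enumerate(query)
            if 0 <= offset + i < len(target) and target[offset + i] == residue
        )
        best = max(best, identical)
    return 100.0 * best / len(query)


# A query that is an exact subsequence of the target scores 100%.
subsequence_seed = has_shared_word("ARDY", "CARDYW", word_size=3)
subsequence_score = percent_match_relative_to_query("ARDY", "CARDYW")

# Every second residue matches, but no 3 consecutive residues do.
alternating_seed = has_shared_word("ACACAC", "AGAGAG", word_size=3)
alternating_score = percent_match_relative_to_query("ACACAC", "AGAGAG")
```

Here the first pair shares a word and scores 100% even though the target is longer than the query, which is the case the "Only compare regions of the same length" option addresses. The second pair has half of its residues in common but shares no word, so under this sketch it would never be considered a candidate match.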