A reference sequence database provides the basis for annotating query sequences, so it must contain annotated reference sequences. The pipelines use these reference sequences to annotate your input sequences automatically. There are three types of reference database: Annotated Germline Databases, Feature Databases, and Scaffold Databases.
How to Create a Reference Database
Anyone can create a reference sequence database by following these steps:
- Select Reference sequence under Organization databases, located in the Navigation panel.
- Click on the menu button (vertical ellipsis) or New database in the Menu bar to create a new database.
- Select your chosen database type and provide a name. The database type determines the format your sequences are expected to be in and what the database can be used for. If you are unsure, the database types are described later in this article.
- Click Create. A new empty database will be created.
- Share your database with other users by opening the context menu next to your new database and choosing "Manage sharing". By default, everyone in your organisation will have read-only access to your new database, which means they will be able to use it for their analyses.
- Add annotated sequences in GenBank or Geneious format to the newly created database by dragging and dropping them or by using the Upload button in the Menu bar. You can add individual sequences or multi-sequence files (sequence lists). A multi-sequence file will show as a single entry in the database.
- If you get the database type wrong the first time, you can change it in the database folder context menu.
Note that files cannot be uploaded directly to the Reference sequence database root; you must first create a new database to add your files to.
Annotated germline databases for Antibody Annotator
Antibody Annotator uses (germline) reference sequences to determine two key aspects of sequence annotation:
- The closest matching reference gene(s) to each of your target sequences, and what the silent and non-silent variations in the target sequence are relative to its closest match.
- The most appropriate FR/CDR region boundaries, calculated from the whole dataset, not just the closest match.
The human and mouse immunoglobulin germline sequences are bundled with Geneious Biologics, but custom reference sequences can be used as well. These custom sequences can represent the Ig/TCR variable region of a different species (or of multiple species). They can also represent non-germline parent sequences or scaffolds, as long as these sequences are Ig-like in nature and structure. Some ambiguity is tolerated, particularly in the centre of CDR regions.
To learn more about custom germline reference sequences and how to set up a germline-style database using correctly formatted custom reference sequences, please refer to the following article.
Feature Databases
These databases are less strictly formatted, and allow you to augment your standard antibody annotations with other custom annotations such as fusion proteins, signal peptides or other sequence features. To read more about the benefits of using a feature database see here.
A feature database can contain individual sequences or a sequence list. Each sequence fragment should have a single annotation on it. The annotation needs to have two properties:
- 'Name' which should be a readable, unique name describing the annotated region.
- 'Type', which determines how the region is treated during analysis. You may have a mix of multiple feature types within your database, such as both Primer and Signal_Peptide types.
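The two requirements above can be checked before upload with a short script. The following is only a minimal sketch in Python, not part of Geneious Biologics: the `Annotation` and `FeatureSequence` classes and the `check_feature_database` function are hypothetical names used for illustration, and it assumes that "unique" means unique across the whole database. It does not read GenBank or Geneious files; it only shows the rule on in-memory records.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Annotation:
    name: str  # the 'Name' property (hypothetical in-memory representation)
    type: str  # the 'Type' property, e.g. "Signal_Peptide" or "Primer"


@dataclass
class FeatureSequence:
    sequence: str
    annotations: list = field(default_factory=list)


def check_feature_database(records):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for i, rec in enumerate(records):
        # Each sequence should carry exactly one annotation.
        if len(rec.annotations) != 1:
            problems.append(
                f"sequence {i}: expected 1 annotation, found {len(rec.annotations)}"
            )
            continue
        ann = rec.annotations[0]
        if not ann.name.strip():
            problems.append(f"sequence {i}: annotation has no Name")
        if not ann.type.strip():
            problems.append(f"sequence {i}: annotation has no Type")

    # Assumption: names must be unique across the whole database.
    names = Counter(
        rec.annotations[0].name for rec in records if len(rec.annotations) == 1
    )
    for name, count in names.items():
        if name.strip() and count > 1:
            problems.append(f"Name '{name}' is used {count} times")
    return problems


if __name__ == "__main__":
    example = [
        FeatureSequence("ATGAAA", [Annotation("Example signal peptide", "Signal_Peptide")]),
        FeatureSequence("GGCTTA", [Annotation("Example forward primer", "Primer")]),
    ]
    print(check_feature_database(example))
```

A database that mixes feature types, as in the example records, passes the check as long as every sequence has exactly one named and typed annotation.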