As a user of Biologics, you can make your own custom Reference Databases for annotating your sequences. For more detail on what a custom reference database is, please see this article: Understanding Reference Databases.
The article on Feature Databases may also be useful for those who would like to track and annotate extra features (like fusion proteins) as feature databases can be used in addition to a reference database.
Note: Biologics comes with three fully supported reference databases: Human, Alpaca (Vicugna pacos) and Mouse (Mus musculus). Other species are available on an as-is basis. Please contact us if you would like access to any of these.
This article outlines how to create:
1. A Germline Genes Reference Database composed of gene sequences. Differences are recorded relative to the genes.
2. An Antibody Template Reference Database containing full variable region sequences. This type of database enables you to track changes against the full template sequences, rather than against genes.
Jump to:
- What is a Germline Reference Database?
- How to make an Antibody Template Reference Database
- Other Tips
What is a Germline Reference Database?
Our annotation pipelines use reference gene sequences to determine two key aspects of sequence annotation:
- The closest matching reference gene(s) to each of your target sequences, and what the silent and non-silent variations in the target sequence are relative to its closest match.
- The most appropriate FR/CDR region boundaries, calculated from the whole dataset and not just the closest match.
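To make the distinction between silent and non-silent variations concrete, here is a minimal sketch in Python. It is not the Biologics pipeline: it assumes the target and reference are already aligned, in frame, the same length, and contain only unambiguous uppercase DNA. The function name `classify_variations` is hypothetical, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_variations(reference, target):
    """Compare two aligned, in-frame sequences codon by codon.

    Returns a list of (codon number, reference codon, target codon, kind),
    where kind is "silent" (same amino acid) or "non-silent".
    Hypothetical helper for illustration only.
    """
    if len(reference) != len(target) or len(reference) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    variations = []
    for i in range(0, len(reference), 3):
        ref_codon = reference[i:i + 3]
        tgt_codon = target[i:i + 3]
        if ref_codon == tgt_codon:
            continue
        same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[tgt_codon]
        kind = "silent" if same_residue else "non-silent"
        variations.append((i // 3 + 1, ref_codon, tgt_codon, kind))
    return variations


# GCT -> GCC keeps alanine (silent); GAA -> GAT changes E to D (non-silent).
print(classify_variations("GCTGAAAAA", "GCCGATAAA"))
```

In practice the closest reference gene has to be found and aligned first, which this sketch does not attempt.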
To create a custom reference sequence database, your reference sequences must meet certain formatting requirements so that the Biologics Annotation pipelines can read them. This guide outlines the steps needed to correctly format gene sequences for a reference database.
Note that currently, creating a TCR germline reference database requires a slightly different method than the one below. See Analyzing TCR sequences in Geneious Biologics to learn how to do this.
Steps to create a Germline Gene Reference Database
NOTE: The output of this method is at best an approximation. The correct annotation types will be added automatically to your sequences, but the boundaries of the CDRs/FRs may be incorrect. We strongly advise you to inspect and curate the output.
First, you will need to obtain the reference sequences. These could be germline V, D, J and constant genes or VHH genes. For uploading to Geneious Biologics, your sequences will need to be arranged in one of the two following formats:
- As a single sequence list, where the sequence name of each gene follows the IMGT nomenclature (e.g. "IGHV1-2")
  - Names must be prefixed with IGH, IGK, or IGL to indicate their chain, followed by V, D, J, or C.
  - If the name is only 4 letters long (after removing an optional allele suffix) and ends with D, it is treated as a constant gene.
- As separate sequence lists, where the gene type (Heavy V Genes etc.) is indicated by the sequence list name (e.g. "IGHV")
  - Names must be prefixed with IGH, IGK, or IGL to indicate their chain, followed by V, D, J, or C.
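Before uploading, it can help to check your names against the rules above. The following is a minimal sketch in Python, not the logic Biologics itself uses. The function name `classify_gene_name` is hypothetical, and it assumes an allele suffix begins with `*` (as in IMGT names such as "IGHV1-2*01") and that names are uppercase.

```python
CHAINS = ("IGH", "IGK", "IGL")
GENE_TYPES = {"V": "V gene", "D": "D gene", "J": "J gene", "C": "constant gene"}


def classify_gene_name(name):
    """Return (chain, gene type) for a name, or None if it breaks the rules.

    Hypothetical helper that follows only the naming rules listed above.
    """
    # Assumption: an optional allele suffix starts with "*", e.g. "*01".
    base = name.strip().split("*", 1)[0]
    if len(base) < 4 or base[:3] not in CHAINS:
        return None
    chain = base[:3]
    # A 4-letter name ending in D (e.g. IGHD) is treated as a constant gene.
    if len(base) == 4 and base.endswith("D"):
        return chain, "constant gene"
    gene_type = GENE_TYPES.get(base[3])
    if gene_type is None:
        return None
    return chain, gene_type


for example in ("IGHV1-2", "IGHD", "IGHD1-1*01", "IGKJ1", "TRBV1"):
    print(example, "->", classify_gene_name(example))
```

Names that return `None` here would need renaming before they fit the format described above.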
To auto-annotate your list(s) of genes, navigate to the Organization Databases on the left panel and click on the 3 vertical dots next to Reference Sequences to bring up the New database option:
Next, follow the prompts to upload your reference sequences. Make sure you have selected Germline Gene as the Database type and that the option to Automatically format germline genes is selected.
In the next step, select an existing gene reference database as the basis for estimating the FR/CDR boundaries. Choosing a reference database from a species closely related to your target sequences will give better results, but the output will still require curation.
Select the gene name source. This will depend on whether your gene sequences are in a single sequence list, or separated into sequence lists according to gene family. See the first steps of this guide if you are unsure.
After clicking next, a summary of the Reference Database to be created will be displayed. The option to proceed will not be available until you have confirmed that you understand the need for inspecting and curating the output by selecting the following box: I understand that I need to inspect and curate the formatted output.
Curating the output
Biologics will attempt to place various annotations on the raw germline sequences, according to rough estimates of the CDR and FR boundaries as determined by the chosen reference database. However, this method will not necessarily produce scientifically correct region boundaries. Expert knowledge of the sequences will need to be applied to curate the output.
To curate and inspect the output germline reference database, Export the output to a sequence file. Make sure you select a file type that will preserve the annotations, like .geneious. You can then inspect and edit the annotations in Geneious Prime.
In particular, a few assumptions are made by Biologics when generating the formatted output:
- Constant genes will be assumed to start in reading frame 3. If your constant regions are in frame (i.e. start at the beginning of the sequence), you can adjust your results by deleting the qualifier codon_start: 3 from the c_segment annotation on constant gene sequences. This is shown below in Geneious Prime when editing the annotations:
You can also edit this qualifier to codon_start: 2 if the true reading frame starts from the second nucleotide.
- The CDR/FR boundaries and the general length of all annotations are estimated based on the existing reference database chosen. This means they may need adjusting. We recommend inspecting each individual sequence to check if the annotations begin and end where you expect.
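To see what the codon_start qualifier changes, the minimal Python sketch below splits the same sequence into codons for each of the three possible values. It illustrates the qualifier's meaning only, not Biologics code. The function name `codons_in_frame` and the example sequence are made up for illustration.

```python
def codons_in_frame(sequence, codon_start=1):
    """Split a nucleotide sequence into complete codons.

    codon_start is the 1-based position of the first base of the first
    complete codon (1, 2 or 3). Trailing bases that do not fill a codon
    are dropped. Hypothetical helper for illustration only.
    """
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    offset = codon_start - 1
    usable = len(sequence) - offset
    end = offset + usable - usable % 3
    return [sequence[i:i + 3] for i in range(offset, end, 3)]


sequence = "GAGCCTCCAC"  # made-up example, not a real constant gene
for start in (1, 2, 3):
    print(start, codons_in_frame(sequence, start))
```

If the codons read with codon_start 3 do not match the protein you expect, try the other two values and update the qualifier on the c_segment annotation accordingly.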
For more information on what format the annotations need to be in, you can refer to this article: How to Manually Annotate Reference Sequences.
How to make an Antibody Template Reference Database
You may find it useful to create a Reference Database that contains full variable region sequences. This enables you to track changes against these target sequences rather than against genes.
Navigate to the Organization Databases heading on the left navigation panel, hover over Reference Sequences and click on the vertical three dots that appear. Then select New Database.
Select "Antibody Template" as the Database type.
You can either upload previously annotated variable region sequence(s) in the dialogue box, or click Create to make an empty folder to upload your template reference sequence(s) into.
What kind of template sequences are accepted?
In this case, you can use either the output of an Annotation result (with the corresponding annotations) as a sequence in your template database, or you can annotate the sequence(s) yourself ensuring that the following annotations are present. Geneious Prime is used as an example for editing annotations below:
- A VDJ-REGION or VJ-REGION annotation that spans the whole length of the variable region
  - Name must be either VDJ-REGION or VJ-REGION
  - Type must be CDS
  - Track must be No Track
- FR region annotations. It is important that the:
  - Name is FR1/2/3/4
  - Type is FR
  - Track is No Track
  - These annotations should be a length that's a multiple of 3 (i.e. can be translated to an amino acid sequence)
- CDR region annotations. It is important that the:
  - Name is CDR1/2/3
  - Type is CDR
  - Track is No Track
  - These annotations should be a length that's a multiple of 3 (i.e. can be translated to an amino acid sequence)
  - If the annotation is for a CDR3 region, the annotation needs to be Truncated as shown above
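If you manage annotations outside Geneious Prime, the requirements above can be expressed as a simple check. The following Python sketch is illustrative only: the `Annotation` class and `check_template` function are hypothetical and are not part of any Geneious API. It assumes 1-based inclusive coordinates, and it does not check the Truncated setting on CDR3 or whether the region annotation spans the whole variable region.

```python
from dataclasses import dataclass

# Expected annotation type for each required annotation name.
EXPECTED_TYPES = {
    "VDJ-REGION": "CDS", "VJ-REGION": "CDS",
    "FR1": "FR", "FR2": "FR", "FR3": "FR", "FR4": "FR",
    "CDR1": "CDR", "CDR2": "CDR", "CDR3": "CDR",
}


@dataclass
class Annotation:
    """Hypothetical stand-in for one annotation on a template sequence."""
    name: str
    ann_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    track: str = "No Track"

    @property
    def length(self):
        return self.end - self.start + 1


def check_template(annotations):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not any(a.name in ("VDJ-REGION", "VJ-REGION") for a in annotations):
        problems.append("missing VDJ-REGION or VJ-REGION annotation")
    for a in annotations:
        expected = EXPECTED_TYPES.get(a.name)
        if expected is None:
            continue  # other annotations are not checked in this sketch
        if a.ann_type != expected:
            problems.append(f"{a.name}: type is {a.ann_type}, expected {expected}")
        if a.track != "No Track":
            problems.append(f"{a.name}: track is {a.track}, expected No Track")
        if expected in ("FR", "CDR") and a.length % 3 != 0:
            problems.append(f"{a.name}: length {a.length} is not a multiple of 3")
    return problems


template = [
    Annotation("VDJ-REGION", "CDS", 1, 36),
    Annotation("FR1", "FR", 1, 12),
    Annotation("CDR1", "CDR", 13, 21),
    Annotation("FR2", "FR", 22, 36),
]
print(check_template(template))
```

The coordinates in `template` are invented to keep the example short, and real variable regions are much longer.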
Other Tips
- Some ambiguity is tolerated, particularly in the CDRs. For example, you might leave ambiguous residues (X) at positions across the CDR3 region to indicate mutation points in a mutant library.
- For an example of what a correctly annotated database looks like, you could refer to the Human and Mouse databases already set up by default, or contact us directly for a downloadable example.
- If there is more than one reference sequence per database, these reference sequences must be grouped into a sequence list. To group multiple sequences into a single sequence list, select the sequences and click Group Sequences in the Pre-processing dropdown.
- A reference database can contain multiple sequence lists, such as lists with different species or different chains.
- Geneious Prime is a great tool to use for manipulating sequence annotations, and makes creating databases simple.
Please contact support if you require further assistance with setting up your databases. We can also help recommend which database type(s) are most suitable for your purposes.