As a user of Biologics, you can select custom Reference Databases for annotating your sequences. If you are unsure what a reference database is, please refer to this article: Understanding Reference Databases. If you want to analyse TCR sequences, please see our main article, Analyzing TCR sequences in Geneious Biologics, to learn how to make a TCR Reference Database.
The article on Feature Databases may be useful if you would like to track and annotate additional features (like fusion proteins), as Feature Databases can be used alongside a Reference Database.
This article outlines how to create:
1. A Germline Reference Database
2. A Target Reference Database which allows you to track changes against a target sequence
What is a Germline Reference Database?
Antibody Annotator uses (germline) reference sequences to determine two key aspects of sequence annotation:
- The closest matching reference gene(s) to each of your target sequences, and what the silent and non-silent variations in the target sequence are relative to its closest match.
- The most appropriate FR/CDR region boundaries, calculated from the whole dataset, not just the closest match.
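To make the distinction between silent and non-silent variations concrete, here is a minimal Python sketch. It assumes the target sequence and its closest reference are already aligned, in frame, of equal length and free of gaps or ambiguous bases. The function name `classify_variations` is hypothetical, and this is an illustration of the concept rather than a description of how Antibody Annotator is implemented.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_variations(reference, target):
    """Compare two in-frame, equal-length nucleotide sequences codon by codon.

    Returns a list of (codon_index, reference_codon, target_codon, kind),
    where kind is "silent" if the amino acid is unchanged and "non-silent"
    otherwise. Ambiguous bases are not handled in this sketch.
    """
    if len(reference) != len(target) or len(reference) % 3 != 0:
        raise ValueError("sequences must be the same length and a multiple of 3")
    variations = []
    for i in range(0, len(reference), 3):
        ref_codon = reference[i:i + 3].upper()
        tgt_codon = target[i:i + 3].upper()
        if ref_codon == tgt_codon:
            continue
        same_aa = CODON_TABLE[ref_codon] == CODON_TABLE[tgt_codon]
        kind = "silent" if same_aa else "non-silent"
        variations.append((i // 3, ref_codon, tgt_codon, kind))
    return variations
```

For example, `classify_variations("GCTAAAGGT", "GCCAGAGGT")` reports the first codon (GCT to GCC, both alanine) as silent and the second (AAA to AGA, lysine to arginine) as non-silent.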
The custom sequences used could represent germline genes of the Ig/TCR variable region of one or more species. They could also represent non-germline parent sequences or scaffolds, so long as these sequences are Ig-like in nature and structure. Some amount of sequence ambiguity is tolerated, particularly in the centre of CDR/FR regions.
To create a custom reference sequence database, the reference sequences must meet certain requirements so that they can be read by Antibody Annotator/Single Clone Analysis. In particular, the annotator needs to know whether sequences are from heavy or light chains, and where the Framework/CDR regions are within each sequence, to enable the annotation of novel sequences.
Steps to create a Germline Reference Database
A general process that you could use for making a germline reference database is outlined below:
- Obtain the reference sequences. These could be germline V, D, J and constant genes or VHH genes.
- Determine where the FR and CDR boundaries are on each sequence. This is not required for D genes or constant genes. One way to estimate where the CDR regions are is to run the sequences through Antibody Annotator with a suitably similar species chosen for the database.
- Annotate the sequences correctly according to the formatting described in the section below. Note that this format is not quite the same as the Antibody Annotator output. Geneious Prime would be a good choice to add/adjust annotations, and the bulk annotation editing feature could speed things up.
- Upload these sequences to a database in Geneious Biologics. You can upload multiple files/sequence lists to the same database; they will be combined during analysis.
- Test the database to make sure it behaves as expected, for example that its formatting is accepted. Antibody Annotator should give you helpful error messages in the jobs table if something is wrong with the database format. Alternatively, you can report the job and ask us.
- Curate, validate, and adjust as required. In particular, you may want to adjust the CDR boundary estimates from the second step, based on expert knowledge of where you would expect them to be. This is important for Antibody Annotator to be accurate.
- The database is finished and ready to be used in analysis! You can always come back and edit it later, but we suggest uploading the next version to a new database in Geneious Biologics, so that you can maintain consistency and track the changes made.
How to correctly format Germline Reference Sequences
Three kinds of annotations are required for Germline Reference Sequences to be used in Geneious Biologics:
- A single annotation which denotes the gene and segment (V_segment, D_segment, J_segment or C_segment)
- Any FR regions present with an annotation of type FR
- Any CDR regions present with an annotation of type CDR (note that CDR3 should be truncated)
Examples
All examples are given using Geneious Prime to edit/create annotations.
- Gene/Name Annotation
It is important that these properties are present (and are case sensitive):
- The Name starts with IGH/IGL/IGK to denote the chain type (Heavy or Light)
- The Type is either V_segment, D_segment, J_segment or C_segment
- The Track is No Track
- An additional property of GeneRegion is added and defined
- An additional property of gene is added and defined
- FR region Annotations
It is important that the:
- Name is FR1/2/3/4
- Type is FR
- Track is No Track
- These annotations should be a length that's a multiple of 3 (i.e. can be translated to an amino acid sequence)
- CDR region Annotations
It is important that the:
- Name is CDR1/2/3
- Type is CDR
- Track is No Track
- These annotations should be a length that's a multiple of 3 (i.e. can be translated to an amino acid sequence)
- If the annotation is for a CDR3 region, the annotation is Truncated as shown above
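The annotation rules above can be summarised as a small pre-upload check. The following Python sketch is only an illustration: the `Annotation` class, its fields, and `check_germline_annotations` are hypothetical names that are not part of Geneious Prime or Geneious Biologics, and coordinates are assumed to be 1-based and inclusive. The property values in the usage example are placeholders.

```python
from dataclasses import dataclass, field

SEGMENT_TYPES = {"V_segment", "D_segment", "J_segment", "C_segment"}
CHAIN_PREFIXES = ("IGH", "IGL", "IGK")
FR_NAMES = {"FR1", "FR2", "FR3", "FR4"}
CDR_NAMES = {"CDR1", "CDR2", "CDR3"}


@dataclass
class Annotation:
    """Hypothetical stand-in for one annotation on a reference sequence."""
    name: str
    type: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive
    track: str = "No Track"
    truncated: bool = False
    properties: dict = field(default_factory=dict)

    @property
    def length(self):
        return self.end - self.start + 1


def check_germline_annotations(annotations):
    """Return a list of problems found; an empty list means none were found."""
    errors = []
    genes = [a for a in annotations if a.type in SEGMENT_TYPES]
    if len(genes) != 1:
        errors.append(f"expected exactly one gene/segment annotation, found {len(genes)}")
    for a in genes:
        if not a.name.startswith(CHAIN_PREFIXES):
            errors.append(f"{a.name}: Name must start with IGH, IGL or IGK")
        for key in ("GeneRegion", "gene"):
            if not a.properties.get(key):
                errors.append(f"{a.name}: missing property {key}")
    for a in annotations:
        if a.track != "No Track":
            errors.append(f"{a.name}: Track must be No Track")
        if a.type == "FR" and a.name not in FR_NAMES:
            errors.append(f"{a.name}: FR annotations must be named FR1 to FR4")
        if a.type == "CDR" and a.name not in CDR_NAMES:
            errors.append(f"{a.name}: CDR annotations must be named CDR1 to CDR3")
        if a.type in ("FR", "CDR") and a.length % 3 != 0:
            errors.append(f"{a.name}: length {a.length} is not a multiple of 3")
        if a.type == "CDR" and a.name == "CDR3" and not a.truncated:
            errors.append("CDR3: annotation must be Truncated")
    return errors


# Usage with placeholder values for a heavy-chain V gene.
example = [
    Annotation("IGHV1-2", "V_segment", 1, 296,
               properties={"GeneRegion": "V", "gene": "IGHV1-2"}),
    Annotation("FR1", "FR", 1, 75),
    Annotation("CDR1", "CDR", 76, 99),
    Annotation("CDR3", "CDR", 289, 294, truncated=True),
]
print(check_germline_annotations(example))
```

A check like this does not replace testing the database in Antibody Annotator, which remains the authority on whether the format is accepted.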
Note: It is also possible to use un-annotated sequences as a database. However, the database name must then contain the words Heavy-Chain or Light-Chain to identify the chain of all sequences in the database.
How to make a Target Reference Database
You may find it useful to create a Reference Database that contains the full variable-region sequence of your target antibody, so that you can track variation against this sequence. This type of reference database is still of type "Annotated Germline".
In this case, you can either use the output of an Antibody Annotator result (with the corresponding annotations) as a sequence in your Reference Database, or annotate the sequence yourself, ensuring that the following annotations are present:
- A VDJ-REGION or VJ-REGION annotation that spans the whole length of the variable region
- Name must be either VDJ-REGION or VJ-REGION
- Type must be CDS
- Track must be No Track
- FR and CDR annotations, as described in the section above for How to correctly format Germline Reference Sequences.
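As a rough illustration of the requirements above, the Python sketch below checks a manually annotated target sequence before upload. The dictionary layout and the name `check_target_annotations` are hypothetical. Requiring exactly one region annotation, and requiring FR/CDR annotations to lie inside it, are assumptions added here as sanity checks rather than documented rules.

```python
REGION_NAMES = {"VDJ-REGION", "VJ-REGION"}


def check_target_annotations(annotations):
    """Check a list of annotation dicts with name, type, track, start, end keys.

    Returns a list of problems found; an empty list means none were found.
    """
    regions = [a for a in annotations if a["name"] in REGION_NAMES]
    if len(regions) != 1:
        return [f"expected one VDJ-REGION or VJ-REGION annotation, found {len(regions)}"]

    errors = []
    region = regions[0]
    if region["type"] != "CDS":
        errors.append(f"{region['name']}: Type must be CDS")
    if region["track"] != "No Track":
        errors.append(f"{region['name']}: Track must be No Track")

    sub_regions = [a for a in annotations if a["type"] in ("FR", "CDR")]
    if not sub_regions:
        errors.append("no FR or CDR annotations found")
    for a in sub_regions:
        # Assumption: FR/CDR annotations should sit inside the variable region.
        if a["start"] < region["start"] or a["end"] > region["end"]:
            errors.append(f"{a['name']}: lies outside {region['name']}")
    return errors


# Usage with placeholder coordinates.
target = [
    {"name": "VDJ-REGION", "type": "CDS", "track": "No Track", "start": 1, "end": 360},
    {"name": "FR1", "type": "FR", "track": "No Track", "start": 1, "end": 75},
    {"name": "CDR1", "type": "CDR", "track": "No Track", "start": 76, "end": 99},
]
print(check_target_annotations(target))
```

The FR and CDR annotations themselves still need to follow the germline formatting rules described earlier in this article.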
Other Tips
- For an example of what a correctly annotated database looks like, you could refer to the Human and Mouse databases already set up by default, or contact us directly for a downloadable example.
- If there is more than one reference sequence per database, these reference sequences must be grouped into a sequence list. To group multiple sequences into a single sequence list, select the sequences and click Group Sequences in the Pre-processing dropdown.
- A reference database can contain multiple sequence lists, such as lists with different species or different chains.
- Geneious Prime is a great tool to use for manipulating sequence annotations, and makes creating databases simple.
Please contact support if you require further assistance with setting up your databases. We can also help recommend which database type(s) are most suitable for your purposes.