How to Manually Annotate Reference Sequences

Updated: May 16, 2023 02:03

This article explains how to manually annotate reference sequences using Geneious Prime. If you want Biologics to annotate your germline reference genes automatically, please see our main article How to make a Custom Reference Database. If you are unsure what a reference database is, see Understanding Reference Databases.

This article outlines how to create:
1. A Germline Reference Database composed of gene sequences
2. A Target Reference Database containing full variable region sequence(s), which lets you track changes against these target sequence(s)

What is a Germline Reference Database?

Antibody Annotator uses (germline) reference sequences to determine two key aspects of sequence annotation:

- The closest matching reference gene(s) to each of your target sequences, and the silent and non-silent variations in the target sequence relative to its closest match.
- The most appropriate FR/CDR region boundaries, calculated from the whole dataset, not just the closest match.
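
To illustrate what silent and non-silent variations mean, here is a minimal Python sketch that compares a query sequence with a reference codon by codon. It is not how Antibody Annotator computes variations; it assumes two in-frame, equal-length, ungapped DNA strings containing only A, C, G and T, and the function name `classify_variations` is hypothetical.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_variations(reference, query):
    """Return (codon_index, ref_codon, query_codon, kind) for each differing codon.

    kind is "silent" when both codons encode the same amino acid,
    otherwise "non-silent". Assumes unambiguous, in-frame, ungapped DNA.
    """
    if len(reference) != len(query) or len(reference) % 3 != 0:
        raise ValueError("sequences must be the same length and a multiple of 3")
    variations = []
    for start in range(0, len(reference), 3):
        ref_codon = reference[start:start + 3].upper()
        qry_codon = query[start:start + 3].upper()
        if ref_codon == qry_codon:
            continue
        same_aa = CODON_TABLE[ref_codon] == CODON_TABLE[qry_codon]
        kind = "silent" if same_aa else "non-silent"
        variations.append((start // 3, ref_codon, qry_codon, kind))
    return variations


# GCT -> GCC keeps alanine (silent); GAT -> GAA changes D to E (non-silent).
print(classify_variations("GCTGAT", "GCCGAA"))
```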

The custom sequences used could represent germline genes of the Ig/TCR variable region of one or more species. Some amount of sequence ambiguity is tolerated, particularly in the centre of CDR/FR regions.

To be readable by Antibody Annotator/Single Clone Analysis, the sequences in a custom reference database must satisfy some requirements. In particular, the annotator needs to know whether each sequence is from a heavy or light chain, and where the Framework/CDR regions are within each sequence, so that it can annotate novel sequences.

Steps to manually create a Germline Reference Database

A general process that you could use for making a germline reference database is outlined below:

1. Obtain the reference sequences. These could be germline V, D, J and constant genes, or VHH genes.
2. Determine where the FR and CDR boundaries are on each sequence. This is not required for D genes or constant genes. One way to estimate where the CDR regions are is to run the sequences through Antibody Annotator with a suitably similar species chosen for the database.
3. Annotate the sequences according to the formatting described in the section below. Note that this format is not quite the same as the Antibody Annotator output. Geneious Prime is a good choice for adding and adjusting annotations, and its bulk annotation editing feature can speed things up.
4. Upload these sequences to a database in Geneious Biologics. You can upload multiple files/sequence lists to the same database; they will be combined during analysis.
5. Test the database to make sure it behaves as expected. Antibody Annotator should give you helpful error messages in the jobs table if something is wrong with the database format. Alternatively, you can report the job and ask us.
6. Curate, validate, and adjust as required. In particular, you may want to adjust the CDR boundary estimates from step 2, based on expert knowledge of where you would expect them to be. This is important for accurate annotation.
7. The database is now ready to be used in analysis. You can always come back and edit it later, but we suggest uploading the next version to a new database in Geneious Biologics so that you can maintain consistency and track changes.

How to correctly format Germline Reference Sequences

Three kinds of annotations are required for Germline Reference Sequences to be used in Geneious Biologics:

- A single annotation that denotes the gene and segment (V_segment, D_segment, J_segment or C_segment)
- An annotation of type FR for each FR region present
- An annotation of type CDR for each CDR region present (note that CDR3 should be truncated)

Examples

All examples are given using Geneious Prime to edit/create annotations.

Gene/Name Annotation

These properties must be present, and they are case sensitive:
- The Name starts with IGH/IGL/IGK to denote the chain type (Heavy or Light)
- The Type is either V_segment, D_segment, J_segment or C_segment
- The Track is No Track
- An additional property of GeneRegion is added and defined
- An additional property of gene is added and defined
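
The rules above can be expressed as a small checklist in code. The sketch below is plain Python, not a Geneious Prime or Geneious Biologics API: the `GeneAnnotation` class and `check_gene_annotation` function are hypothetical, and the example name and the values given for the GeneRegion and gene properties are illustrative placeholders only.

```python
from dataclasses import dataclass, field

SEGMENT_TYPES = {"V_segment", "D_segment", "J_segment", "C_segment"}
CHAIN_PREFIXES = ("IGH", "IGL", "IGK")


@dataclass
class GeneAnnotation:
    """Hypothetical stand-in for one gene/name annotation on a reference sequence."""
    name: str
    type: str
    track: str = "No Track"
    properties: dict = field(default_factory=dict)


def check_gene_annotation(ann):
    """Return a list of problems; an empty list means every rule is met."""
    problems = []
    # Comparisons are exact, because the properties are case sensitive.
    if not ann.name.startswith(CHAIN_PREFIXES):
        problems.append("Name must start with IGH, IGL or IGK")
    if ann.type not in SEGMENT_TYPES:
        problems.append("Type must be V_segment, D_segment, J_segment or C_segment")
    if ann.track != "No Track":
        problems.append("Track must be No Track")
    for key in ("GeneRegion", "gene"):
        if not ann.properties.get(key):
            problems.append(f"property {key!r} must be added and defined")
    return problems


# Property values here are placeholders, not values prescribed by the article.
example = GeneAnnotation(
    name="IGHV1-2*01",
    type="V_segment",
    properties={"GeneRegion": "V", "gene": "IGHV1-2"},
)
print(check_gene_annotation(example))  # prints []
```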

FR region Annotations

Each FR annotation must have:
- A Name of FR1, FR2, FR3 or FR4
- A Type of FR
- A Track of No Track
- A length that is a multiple of 3 (i.e. it can be translated to an amino acid sequence)

CDR region Annotations

Each CDR annotation must have:
- A Name of CDR1, CDR2 or CDR3
- A Type of CDR
- A Track of No Track
- A length that is a multiple of 3 (i.e. it can be translated to an amino acid sequence)
- For a CDR3 region, the annotation marked as Truncated
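
As a rough pre-upload check, the FR and CDR rules can be sketched in plain Python. This is an illustration under stated assumptions, not part of Geneious: `RegionAnnotation` and `check_region_annotation` are hypothetical names, coordinates are assumed to be 1-based and inclusive, and the Truncated setting is modelled as a simple boolean.

```python
from dataclasses import dataclass

ALLOWED_NAMES = {
    "FR": {"FR1", "FR2", "FR3", "FR4"},
    "CDR": {"CDR1", "CDR2", "CDR3"},
}


@dataclass
class RegionAnnotation:
    """Hypothetical stand-in for one FR or CDR annotation."""
    name: str
    type: str
    start: int  # 1-based, inclusive (assumption)
    end: int    # 1-based, inclusive (assumption)
    track: str = "No Track"
    truncated: bool = False


def check_region_annotation(ann):
    """Return a list of problems; an empty list means every rule is met."""
    problems = []
    allowed = ALLOWED_NAMES.get(ann.type)
    if allowed is None:
        problems.append("Type must be FR or CDR")
    elif ann.name not in allowed:
        problems.append(f"Name {ann.name!r} is not valid for type {ann.type}")
    if ann.track != "No Track":
        problems.append("Track must be No Track")
    length = ann.end - ann.start + 1
    if length % 3 != 0:
        problems.append(f"length {length} is not a multiple of 3")
    if ann.name == "CDR3" and not ann.truncated:
        problems.append("CDR3 must be marked as Truncated")
    return problems


print(check_region_annotation(RegionAnnotation("FR1", "FR", 1, 75)))       # prints []
print(check_region_annotation(RegionAnnotation("CDR3", "CDR", 289, 297)))  # one problem: not truncated
```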

Note: It is also possible to use un-annotated sequences as a database. In that case, the database name must contain the words Heavy-Chain or Light-Chain to identify the chain of all sequences in the database.
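
The naming rule for un-annotated databases can be checked before upload with a sketch like the following. The function name is hypothetical, and it assumes the match is exact and case sensitive, which the article does not state; a name containing both phrases is treated as ambiguous.

```python
def chain_from_database_name(database_name):
    """Return "Heavy", "Light", or None if the name does not identify one chain.

    Assumes an exact, case-sensitive match on "Heavy-Chain" / "Light-Chain".
    """
    has_heavy = "Heavy-Chain" in database_name
    has_light = "Light-Chain" in database_name
    if has_heavy == has_light:
        # Neither phrase is present, or both are: the chain cannot be inferred.
        return None
    return "Heavy" if has_heavy else "Light"


print(chain_from_database_name("Llama Heavy-Chain v2"))  # prints Heavy
print(chain_from_database_name("Llama germline v2"))     # prints None
```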

How to make a Target Reference Database

You may find it useful to create a Reference Database that contains full variable region sequence(s). This enables you to track changes against these target sequence(s). To create one, navigate to the Reference Sequences heading in the left navigation panel, click the three vertical dots that appear, then select New Database.


In this case, you can either use the output of an Antibody Annotator result (with the corresponding annotations) as a sequence in your Reference Database, or annotate the sequence yourself, ensuring that the following annotations are present:

- A VDJ-REGION or VJ-REGION annotation that spans the whole length of the variable region:
  - Name must be either VDJ-REGION or VJ-REGION
  - Type must be CDS
  - Track must be No Track
- FR and CDR annotations, as described in the section above, How to correctly format Germline Reference Sequences.
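
The target sequence requirements can be sketched as a check in plain Python. This is an illustration only: annotations are represented as hypothetical dictionaries with 1-based inclusive coordinates, the check assumes exactly one VDJ-REGION or VJ-REGION annotation per sequence, and it interprets "spans the whole length of the variable region" as covering every FR and CDR annotation.

```python
REGION_NAMES = {"VDJ-REGION", "VJ-REGION"}


def check_target_annotations(annotations):
    """Return a list of problems for one target sequence's annotations.

    Each annotation is a dict with keys: name, type, track, start, end.
    """
    regions = [a for a in annotations if a["name"] in REGION_NAMES]
    if len(regions) != 1:
        return [f"expected one VDJ-REGION or VJ-REGION annotation, found {len(regions)}"]
    region = regions[0]
    problems = []
    if region["type"] != "CDS":
        problems.append("region Type must be CDS")
    if region["track"] != "No Track":
        problems.append("region Track must be No Track")
    # Assumption: the region must enclose every FR and CDR annotation.
    for a in annotations:
        if a["type"] in ("FR", "CDR"):
            if a["start"] < region["start"] or a["end"] > region["end"]:
                problems.append(f"{a['name']} lies outside {region['name']}")
    return problems


target = [
    {"name": "VDJ-REGION", "type": "CDS", "track": "No Track", "start": 1, "end": 360},
    {"name": "FR1", "type": "FR", "track": "No Track", "start": 1, "end": 75},
    {"name": "CDR1", "type": "CDR", "track": "No Track", "start": 76, "end": 99},
]
print(check_target_annotations(target))  # prints []
```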

Other Tips

- For an example of what a correctly annotated database looks like, you could refer to the Human and Mouse databases already set up by default, or contact us directly for a downloadable example.
- If there is more than one reference sequence per database, these reference sequences must be grouped into a sequence list. To group multiple sequences into a single sequence list, select the sequences and click Group Sequences in the Pre-processing dropdown.
- A reference database can contain multiple sequence lists, such as lists with different species or different chains.
- Geneious Prime is a great tool to use for manipulating sequence annotations, and makes creating databases simple.

Please contact support if you require further assistance with setting up your databases. We can also help recommend which database type(s) are most suitable for your purposes.