As a user of Biologics, you can select custom reference sequence databases for use in analyses. The analysis operations that allow or expect a reference database will contain a dropdown listing the reference databases available to you. Only administrator users within your organisation can create custom databases, but everyone in the organisation can use them.
This article deals specifically with creating germline-style reference sequences for Antibody Annotator. It describes how to annotate and package the sequences in the correct format. Please refer to the following article for more information about the different types of reference database, and how to upload correctly formatted sequences into a reference database.
Creating reference databases for Antibody Annotator
Antibody Annotator uses (germline) reference sequences to determine two key aspects of sequence annotation:
- The closest matching reference gene(s) to each of your target sequences, and what the silent and non-silent variations in the target sequence are relative to its closest match.
- The most appropriate FR/CDR region boundaries, calculated from the whole dataset, not just the closest match.
These custom sequences could represent germline genes of the Ig/TCR variable region of one or more species. They could also represent non-germline parent sequences or scaffolds, so long as these sequences are Ig-like in nature and structure. Some sequence ambiguity is tolerated, particularly in the centre of CDR/FR regions.
To be readable by Antibody Annotator, the reference sequences in a custom database must satisfy some requirements. In particular, Antibody Annotator needs to know whether each sequence is from a heavy or light chain, and where the Framework/CDR regions are within each sequence, so that it can annotate novel sequences.
Denoting heavy/light chain
Antibody Annotator requires all sequences to be identifiable as heavy or light chain. This can be accomplished in one of the following ways:
- Annotate the entire length of each heavy chain sequence with an annotation named VDJ-region, and the entire length of each light chain sequence with an annotation named VJ-region (both using a CDS annotation type).
- Alternatively, each reference sequence should contain an annotation named V-region or J-region or D-region (V_segment, J_segment and D_segment type, respectively) as well as a gene property starting with the prefix IGH (for heavy chain) or IGL or IGK (for light chain). For example: IGHV1-1*01 or IGL_myGeneName would both work.
- If the sequences have no annotations, the database name must contain the words Heavy-Chain or Light-Chain to identify the chain of all sequences in the database.
Example
This is an example of a correctly annotated sequence, as seen in Geneious Prime:
The V-region annotation in purple stretches from the start to the end of the sequence.
When we mouse over the V-region annotation, we can see that the type is correctly specified as V_segment and there is a gene property starting with IGH:
Note that the gene annotation qualifier and annotation Type are crucial for heavy/light chain identification.
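The three identification options described in this section can be summarised in a short Python sketch. It is illustrative only and is not Antibody Annotator's actual implementation: the function name infer_chain, the order in which the options are tried, the case-sensitive prefix match, and the fallback to the database name are all assumptions made for this example.

```python
from typing import Optional

# Annotation names used by the second option (V_segment, D_segment, J_segment types).
SEGMENT_NAMES = {"V-region", "D-region", "J-region"}


def infer_chain(annotation_name: Optional[str],
                gene: Optional[str],
                database_name: str) -> Optional[str]:
    """Return "Heavy", "Light", or None (hypothetical helper, assumed precedence)."""
    # Option 1: a full-length CDS annotation named VDJ-region or VJ-region.
    if annotation_name == "VDJ-region":
        return "Heavy"
    if annotation_name == "VJ-region":
        return "Light"

    # Option 2: a V/D/J-region annotation whose gene property starts with
    # IGH (heavy) or IGL/IGK (light).
    if annotation_name in SEGMENT_NAMES and gene:
        if gene.startswith("IGH"):
            return "Heavy"
        if gene.startswith(("IGL", "IGK")):
            return "Light"

    # Option 3: fall back to the words in the database name.
    if "Heavy-Chain" in database_name:
        return "Heavy"
    if "Light-Chain" in database_name:
        return "Light"
    return None


print(infer_chain("V-region", "IGHV1-1*01", "My database"))
print(infer_chain("V-region", "IGL_myGeneName", "My database"))
print(infer_chain(None, None, "Llama Heavy-Chain v1"))
```

A result of None means the sequence could not be identified by any of the three options, which is the situation to avoid before uploading.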
Adding region annotations
Within each sequence, Antibody Annotator expects Framework and CDR regions to be annotated. Framework annotations should be named FR1-4 and be of type FR, while CDR annotations should be named CDR1-3 and be of type CDR. The length of each annotation should be a multiple of 3, so that it can be translated to an amino acid sequence.
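Before uploading, it can help to check these naming and length rules programmatically. The following Python sketch shows one way to do so. The Region class, the region_problems function, and the 0-based, end-exclusive coordinates are assumptions made for this example; they do not correspond to any Geneious API.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical stand-in for one FR/CDR annotation."""
    name: str
    type: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


# Allowed annotation names for each annotation type.
ALLOWED_NAMES = {
    "FR": re.compile(r"FR[1-4]"),
    "CDR": re.compile(r"CDR[1-3]"),
}


def region_problems(region: Region) -> List[str]:
    """Return a list of human-readable problems; empty if the region looks fine."""
    problems = []
    pattern = ALLOWED_NAMES.get(region.type)
    if pattern is None:
        problems.append(f"{region.name}: type {region.type!r} is not FR or CDR")
    elif not pattern.fullmatch(region.name):
        problems.append(f"{region.name}: name does not match type {region.type}")

    # The annotation must cover whole codons so it can be translated.
    length = region.end - region.start
    if length % 3 != 0:
        problems.append(f"{region.name}: length {length} is not a multiple of 3")
    return problems


for region in (Region("FR1", "FR", 0, 75), Region("CDR1", "CDR", 75, 100)):
    print(region.name, region_problems(region) or "OK")
```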
Example
This is an example of a correctly annotated FR1 region:
When we mouse over the FR1 annotation, we can see that the type is correctly specified as FR:
Example steps to create a germline database:
- Obtain the reference sequences. These could be EITHER germline V, D, J and constant genes, OR full VDJ and VJ regions that you have worked with previously.
- Figure out where the FR and CDR boundaries are on each sequence. This is not required for D genes or constant genes. One way you could get an estimate of where the CDR regions are is to run them through Antibody Annotator with a suitably similar species chosen for the database.
- Annotate the sequences correctly according to the formatting described in the section above. Note that this format is not quite the same as the Antibody Annotator output. Geneious Prime would be a good choice to add/adjust annotations, and the bulk annotation editing feature could speed things up.
  The most important aspects are:
  - Each gene sequence needs a full-length annotation (so that we can "recognise" it). This annotation must have a qualifier called "gene", which is where our annotators will look to find the gene name. Gene names should be unique; I recommend following the IMGT gene name format.
  - You can also add a "GeneRegion" qualifier to this annotation to ensure the gene type is identified correctly. VHH genes can just be treated like regular heavy chains for this.
  - The V and J genes should have FR and CDR annotations. The CDR3 annotations should be marked as "truncated" on the outside edge, to show that they are not complete.
  - Some correctly formatted examples are downloadable here
- Upload these sequences to a database in Geneious Biologics. You can upload multiple files/sequence lists to the same database; they will be combined during analysis.
- Test the database to make sure it behaves as expected. Antibody Annotator should give you helpful error messages in the jobs table if something is wrong with the database format. Alternatively, you can report the job and ask us.
- Curate, validate, and adjust as required. In particular, you may want to adjust the CDR boundary estimates from step 2, based on expert knowledge of where you would expect them to be. This is important for Antibody Annotator to be accurate.
- The database is finished and ready to be used in analysis! You can always come back and edit it later, but I would suggest uploading the next version to a new database in Geneious Biologics, so that you can maintain consistency and track changes made.
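As a rough pre-upload check for the annotation step above, the Python sketch below looks for two of the requirements: every sequence has a full-length annotation carrying a "gene" qualifier, and gene names are unique across the database. The Annotation and ReferenceSequence classes and the database_problems function are hypothetical and exist only for this example; in practice you would populate them from whatever file format you export from your annotation tool.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Annotation:
    """Hypothetical annotation record; coordinates are 0-based, end-exclusive."""
    name: str
    start: int
    end: int
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class ReferenceSequence:
    name: str
    length: int
    annotations: List[Annotation]


def full_length_gene(seq: ReferenceSequence) -> Optional[str]:
    """Return the gene name from a full-length annotation, or None if absent."""
    for ann in seq.annotations:
        covers_all = ann.start == 0 and ann.end == seq.length
        if covers_all and ann.qualifiers.get("gene"):
            return ann.qualifiers["gene"]
    return None


def database_problems(seqs: List[ReferenceSequence]) -> List[str]:
    """Collect missing-gene and duplicate-gene problems for a set of sequences."""
    problems, genes = [], []
    for seq in seqs:
        gene = full_length_gene(seq)
        if gene is None:
            problems.append(f"{seq.name}: no full-length annotation with a 'gene' qualifier")
        else:
            genes.append(gene)
    for gene, count in Counter(genes).items():
        if count > 1:
            problems.append(f"gene name {gene!r} is used by {count} sequences")
    return problems
```

An empty list from database_problems only means these two checks passed; Antibody Annotator's own error messages in the jobs table remain the authoritative test.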
Other Tips
- For an example of what a correctly annotated database looks like, you could refer to the Human and Mouse databases already set up by default, or contact us directly for a downloadable example.
- If there is more than one reference sequence per database, these reference sequences must be grouped into a sequence list. To group multiple sequences into a single sequence list, select the sequences and click Group Sequences in the Pre-processing dropdown.
- A reference database can contain multiple sequence lists, such as lists with different species or different chains.
- Geneious Prime is a great tool to use for manipulating sequence annotations, and makes creating databases simple.
Please contact support if you require further assistance with setting up your databases. We can also help recommend which database type(s) are most suitable for your purposes.
Using Antibody Annotator with TCR sequences
- Most of the help articles written with Ig sequences in mind will still apply for TCR sequences, as their variable regions have a similar layout.
- Antibody Annotator can be used with a TCR database to analyse TCR sequences, and the same functionality will be available as with Ig sequences.
- Antibody Annotator will treat TRA and TRG chains as "Light" chains, and TRB and TRD chains as "Heavy" chains. By default, it determines the chain by checking the first three letters of the gene name.
- When creating your TCR database, if you need to force the annotator to treat a gene as heavy or light, you can add a GeneRegion qualifier to the Gene Annotation. For example, you could add the gene TRAV14/DV4*01 to your database twice, once with a GeneRegion of TRAV and once with a GeneRegion of TRDV.
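The TCR chain rules above can be illustrated with a small Python sketch. The mapping comes from the text; the function name tcr_chain and the assumption that a GeneRegion qualifier simply replaces the gene name when reading the first three letters are illustrative guesses, not a description of Antibody Annotator's internals.

```python
from typing import Optional

# TRA and TRG are treated as "Light" chains, TRB and TRD as "Heavy" chains.
TCR_CHAIN_BY_PREFIX = {
    "TRA": "Light",
    "TRG": "Light",
    "TRB": "Heavy",
    "TRD": "Heavy",
}


def tcr_chain(gene: str, gene_region: Optional[str] = None) -> Optional[str]:
    """Return "Heavy" or "Light" for a TCR gene, or None if the prefix is unknown.

    If a GeneRegion qualifier is supplied, it is assumed to take priority
    over the gene name.
    """
    prefix = (gene_region or gene)[:3]
    return TCR_CHAIN_BY_PREFIX.get(prefix)


# The same gene added twice, once per GeneRegion, as in the example above.
print(tcr_chain("TRAV14/DV4*01", gene_region="TRAV"))
print(tcr_chain("TRAV14/DV4*01", gene_region="TRDV"))
```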