Analyzing TCR sequences in Geneious Biologics

May 29, 2024 22:55
Updated

Although initially designed to annotate IgG and BCR sequences, Geneious Biologics can also annotate TCR variable regions. All you need is a correctly formatted Reference Database containing the TCR germline sequences. If you are unsure about what a Reference Database is, see Understanding Reference Databases.

Geneious Biologics comes bundled with a Human TCR database. You can use this article to make your own custom TCR databases for other species.

How TCR sequences are represented in Geneious Biologics

By default, Biologics determines the chain by checking the first three letters of the gene name. Antibody Annotator and Single Clone Analysis will label chains accordingly:

- TRB and TRD chains as Heavy chains
- TRA and TRG chains as Light chains

Note that Biologics will only pair TRA and TRB chains together, or TRG and TRD chains together. We do allow recombination between certain V and J genes from the TRA and TRD families, as the V and J recombination mechanisms are less strict in TCRs.
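
The labelling and pairing rules above can be sketched in a few lines of Python. This is an illustration only, not Biologics code: the function names are hypothetical, and the sketch assumes the three-letter prefix is matched case-sensitively.

```python
# Illustrative sketch of the chain rules described above.
# chain_label and can_pair are hypothetical names, not part of any Geneious API.
HEAVY_PREFIXES = {"TRB", "TRD"}
LIGHT_PREFIXES = {"TRA", "TRG"}
ALLOWED_PAIRS = {frozenset({"TRA", "TRB"}), frozenset({"TRG", "TRD"})}


def chain_label(gene_name: str) -> str:
    """Return 'Heavy' or 'Light' based on the first three letters of the gene name."""
    prefix = gene_name[:3]
    if prefix in HEAVY_PREFIXES:
        return "Heavy"
    if prefix in LIGHT_PREFIXES:
        return "Light"
    raise ValueError(f"{gene_name!r} does not start with TRA, TRB, TRD or TRG")


def can_pair(gene_a: str, gene_b: str) -> bool:
    """True only for TRA with TRB, or TRG with TRD."""
    return frozenset({gene_a[:3], gene_b[:3]}) in ALLOWED_PAIRS


print(chain_label("TRBV7-2*01"))              # Heavy
print(can_pair("TRAV1-1*01", "TRBV7-2*01"))   # True
print(can_pair("TRAV1-1*01", "TRDV1*01"))     # False
```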

Most of the help articles written with Ig sequences in mind will still apply for TCR sequences, as their variable regions have a similar layout.

Auto-annotating your TCR germline genes to make a Database

NOTE: The output format of your germline sequences using this method is at best an approximation. The correct annotation types will be added automatically to your sequences, but the boundaries of the CDRs and FRs may be incorrect. We strongly advise that you inspect and curate the output.

First, you will need to obtain the reference sequences. Before uploading to Geneious Biologics, your sequences will need to be arranged into separate sequence lists:

Heavy Sequences
TRB and TRD gene sequences should be broken up into at least three separate sequence lists:

A sequence list containing all the Heavy V Genes. This will contain all the V genes from the TRB and TRD gene families.
- Name this sequence list "IGHV"
A sequence list containing all the Heavy D Genes. This will contain all the D genes from the TRB and TRD gene families.
- Name this sequence list "IGHD"
A sequence list containing all the Heavy J Genes. This will contain all the J genes from the TRB and TRD gene families.
- Name this sequence list "IGHJ"
Optional: A sequence list containing all the Heavy C Genes. This will contain all the C genes from the TRB and TRD gene families.
- Name this sequence list "IGHC"

Light Sequences
TRA and TRG gene sequences should be broken up into at least two separate sequence lists:

A sequence list containing all the Light V Genes. This will contain all the V genes from the TRA and TRG gene families.
- Name this sequence list "IGLV" or "IGKV"
A sequence list containing all the Light J Genes. This will contain all the J genes from the TRA and TRG gene families.
- Name this sequence list "IGLJ" or "IGKJ"
Optional: A sequence list containing all the Light C Genes. This will contain all the C genes from the TRA and TRG gene families.
- Name this sequence list "IGLC" or "IGKC"
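
If you are sorting a large set of germline genes, the grouping above can be scripted before upload. The following is a minimal sketch, assuming gene names follow the usual TRxV/TRxD/TRxJ/TRxC pattern (for example TRBV7-2*01). The function names are hypothetical, and the sketch always uses the "IGL" names for Light lists, although the "IGK" names are equally valid.

```python
from collections import defaultdict

# Hypothetical helpers; they only work out which sequence list a gene belongs in.
HEAVY_LOCI = {"TRB", "TRD"}
LIGHT_LOCI = {"TRA", "TRG"}
HEAVY_SEGMENTS = {"V", "D", "J", "C"}
LIGHT_SEGMENTS = {"V", "J", "C"}  # Light chains have no D genes


def sequence_list_name(gene_name: str) -> str:
    """Map a TCR gene name to the sequence list name it should be placed in."""
    locus, segment = gene_name[:3], gene_name[3:4]
    if locus in HEAVY_LOCI and segment in HEAVY_SEGMENTS:
        return "IGH" + segment
    if locus in LIGHT_LOCI and segment in LIGHT_SEGMENTS:
        return "IGL" + segment
    raise ValueError(f"Cannot assign {gene_name!r} to a sequence list")


def group_genes(gene_names):
    """Group gene names by target sequence list name."""
    groups = defaultdict(list)
    for name in gene_names:
        groups[sequence_list_name(name)].append(name)
    return dict(groups)


genes = ["TRBV7-2*01", "TRDD1*01", "TRBJ1-1*01", "TRAV14/DV4*01", "TRGJ1*01"]
for list_name, members in sorted(group_genes(genes).items()):
    print(list_name, members)
```

Each resulting group would then be saved as its own sequence list (for example one FASTA file per group) and named after the group key.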

To auto-annotate your list(s) of genes, navigate to Reference Databases in the left panel and click on the three vertical dots to bring up the New database option:

Screenshot

Next, follow the prompts to upload your reference sequences. Make sure you have selected Annotated Germline as the Database type and that the option to Automatically format germline genes is selected.

Screenshot

In the next step, select an existing gene reference database to use as the basis for estimating the FR/CDR boundaries and annotation lengths. Choosing a reference database from a species more closely related to your target sequences will give better results. However, the output will still require curation.

Make sure to select Sequence List Names as the Gene name source:

Screenshot

After clicking next, a summary of the Reference Database to be created will be displayed. The option to proceed will not be available until you have confirmed that you understand the need for inspecting and curating the output by selecting the following box: I understand that I need to inspect and curate the formatted output.

Screenshot

Curating the output

Biologics will attempt to place various annotations on the raw germline sequences, according to rough estimates of the CDR and FR boundaries as determined by the chosen reference database. However, this method will not necessarily produce scientifically correct region boundaries. Expert knowledge of the sequences will need to be applied to curate the output.

To curate and inspect the output germline reference database, Export the output to a sequence file. Make sure you select a file type that will preserve the annotations, like .geneious. You can then inspect and edit the annotations in Geneious Prime.

In particular, a few assumptions are made by Biologics when generating the formatted output:

Constant genes will be assumed to start in reading frame 3. If your constant regions are in frame (i.e. start at the beginning of the sequence), you can adjust your results by deleting the qualifier codon_start: 3 from the C_segment annotation on constant gene sequences. You can make this change in Geneious Prime when editing the annotations. You can also edit this qualifier to codon_start: 2 if the true reading frame starts from the second nucleotide.
The CDR/FR boundaries and the general length of all annotations are estimated based on the existing reference database chosen. This means they may need adjusting. We recommend inspecting each individual sequence to check that the annotations begin and end where you expect.
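
To see what the codon_start qualifier changes, the sketch below splits a nucleotide sequence into codons for a given codon_start value (1, 2 or 3, counted from the first nucleotide). The function name is hypothetical and the sequences are made-up examples, not real constant genes.

```python
def split_codons(sequence: str, codon_start: int = 1) -> list[str]:
    """Split a sequence into complete codons, starting at the 1-based codon_start."""
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    in_frame = sequence[codon_start - 1:]
    usable = len(in_frame) - len(in_frame) % 3  # drop any trailing partial codon
    return [in_frame[i:i + 3] for i in range(0, usable, 3)]


# The same coding sequence with zero, one or two extra leading nucleotides:
print(split_codons("GAGGACCTG", codon_start=1))    # ['GAG', 'GAC', 'CTG']
print(split_codons("AGAGGACCTG", codon_start=2))   # ['GAG', 'GAC', 'CTG']
print(split_codons("AGGAGGACCTG", codon_start=3))  # ['GAG', 'GAC', 'CTG']
```

If the codons you get with the default codon_start: 3 do not match the expected translation of your constant gene, that is a sign the qualifier needs to be removed or changed.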

For more information on what format the annotations need to be in, you can refer to the section below.

Manually annotating TCR gene sequences to make a Database

Overview

For more information on creating a custom reference database, see this article. The general process for making a Germline Reference Database is outlined below:

1. Obtain the reference TCR genes.
2. Determine where the FR and CDR boundaries are on each sequence. This is not required for D genes. One way to estimate where the CDR regions are is to run the sequences through Antibody Annotator with a suitably similar species chosen for the database.
3. Annotate the sequences according to the formatting described in the section below. Note that this format is not quite the same as the Antibody Annotator output. Geneious Prime is a good choice for adding and adjusting annotations, and its bulk annotation editing feature can speed things up.
4. Upload these sequences to a database in Geneious Biologics. You can upload multiple files/sequence lists to the same database. They will be combined during analysis.
5. Test the database to make sure it is formatted correctly and behaves as expected. Antibody Annotator should give you helpful error messages in the jobs table if something is wrong with the database format. Alternatively, you can report the job and ask us.
6. Curate, validate, and adjust as required. In particular, you may want to adjust the CDR boundary estimates from step #2, based on expert knowledge of where you would expect them to be. This is important for annotation in Geneious Biologics to be accurate.
7. The database is finished and ready to be used in analysis! You can always come back and edit it later, but we suggest uploading the next version to a new database in Geneious Biologics, so that you can maintain consistency and track any changes made.

Correctly formatting the Germline Genes

Three kinds of annotations on your sequences are required for the Germline Reference Sequences to be used in Geneious Biologics:

A single annotation that denotes the gene and segment (type V_segment, D_segment, J_segment or C_segment) and spans the whole sequence
Any FR regions present, each with an annotation of type FR
Any CDR regions present, each with an annotation of type CDR (note that CDR3 should be truncated)

All examples are given using Geneious Prime to edit and create the annotations.

Gene/Name Annotation

It is important that these properties are present (their values are case sensitive):
- The Name starts with TRB/TRD to denote a Heavy chain type
  OR TRA/TRG to denote Light chain type
- The Type is either V_segment, D_segment, J_segment or C_segment
- The Track is No Track
- An additional property of GeneRegion is added and defined
- An additional property of gene is added and defined

Tip: When creating your TCR database, if you need to force the annotator to treat a gene as heavy or light, you can add a GeneRegion qualifier to the Gene Annotation. For example, you could add the gene TRAV14/DV4*01 to your database twice, once with a GeneRegion of TRAV and once with a GeneRegion of TRDV.

FR region Annotations

It is important that the:
- Name is FR1/2/3/4
- Type is FR
- Track is No Track
- These annotations should be a length that's a multiple of 3 (i.e. can be translated to an amino acid sequence)

CDR region Annotations

It is important that the:
- Name is CDR1/2/3
- Type is CDR
- Track is No Track
- These annotations should be a length that's a multiple of 3 (i.e. can be translated to an amino acid sequence)
- If the annotation is for a CDR3 region, the annotation is Truncated as shown above
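
Before uploading, it can help to run a quick sanity check over your annotations. The sketch below encodes the name, type and length rules listed above against a simple, hypothetical Annotation record (1-based inclusive coordinates). It does not read .geneious files or call any Geneious API, and it does not check the Track, the GeneRegion and gene properties, or CDR3 truncation.

```python
from dataclasses import dataclass

SEGMENT_TYPES = {"V_segment", "D_segment", "J_segment", "C_segment"}
TCR_PREFIXES = ("TRA", "TRB", "TRD", "TRG")
FR_NAMES = {"FR1", "FR2", "FR3", "FR4"}
CDR_NAMES = {"CDR1", "CDR2", "CDR3"}


@dataclass
class Annotation:
    """Hypothetical stand-in for one annotation on a germline sequence."""
    name: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def check_annotation(annotation: Annotation) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if annotation.type in SEGMENT_TYPES:
        if not annotation.name.startswith(TCR_PREFIXES):
            problems.append(f"{annotation.name}: name should start with TRA, TRB, TRD or TRG")
    elif annotation.type == "FR":
        if annotation.name not in FR_NAMES:
            problems.append(f"{annotation.name}: FR name should be FR1, FR2, FR3 or FR4")
    elif annotation.type == "CDR":
        if annotation.name not in CDR_NAMES:
            problems.append(f"{annotation.name}: CDR name should be CDR1, CDR2 or CDR3")
    else:
        problems.append(f"{annotation.name}: unexpected type {annotation.type!r}")
    if annotation.type in ("FR", "CDR") and annotation.length % 3 != 0:
        problems.append(f"{annotation.name}: length {annotation.length} is not a multiple of 3")
    return problems


print(check_annotation(Annotation("FR1", "FR", 1, 78)))     # []
print(check_annotation(Annotation("CDR1", "CDR", 79, 95)))  # one length problem
```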

Note: It is also possible to use un-annotated sequences as a database. In that case, the database name must contain the words Heavy-Chain or Light-Chain to identify the chain of all sequences in the database.
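
As a small illustration of the naming rule in the note above, the sketch below checks a proposed database name. The function name and the example database names are hypothetical, and the sketch assumes the words are matched case-sensitively and that a name containing both (or neither) is ambiguous.

```python
from typing import Optional


def chain_from_database_name(database_name: str) -> Optional[str]:
    """Return 'Heavy' or 'Light' if the name identifies exactly one chain, else None."""
    has_heavy = "Heavy-Chain" in database_name
    has_light = "Light-Chain" in database_name
    if has_heavy == has_light:  # neither or both: the chain cannot be identified
        return None
    return "Heavy" if has_heavy else "Light"


print(chain_from_database_name("Mouse TCR Heavy-Chain"))  # Heavy
print(chain_from_database_name("Mouse TCR germline"))     # None
```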