Why do VectorBase gene IDs change? (finding missing gene IDs)


Change is inevitable. As gene predictions are improved based on new evidence, some will have a modified exon structure, new predictions will be created, old ones will be deleted, some will be split into multiple separate predictions, and some will be merged into a single new prediction. During this process the annotation identifiers can change. So how can you find out about these changes, and how do you find the new IDs for your genes?

First, let's look at the kinds of identifiers that may be used for a given locus:

  • A gene symbol, e.g., CPR148. Symbols are assigned by the community and used as the common name for a locus. This name should be stable and maintained between releases. There are times when the name can change but that will not be initiated by us. We will usually keep the old name as a synonym in cases where genes are renumbered or renamed.
  • An annotation ID, e.g., AGAP010102. These are defined during the VectorBase annotation process and form the basic identifier for the locus, with which all other information is associated. External database references and citations are therefore associated with the AGAP identifier and not with the symbol.
  • Manual annotations submitted by the community via Apollo are given a new gene ID by VectorBase. These identifiers are updated when the gene set is updated at a release.

    Note 1: The six-digit ordinal number is assigned to the locus arbitrarily; no location information is inherent in the name (i.e. AGAP004053 is not necessarily 5' of AGAP004054). Subsequent re-annotations of a genome will use the next available ordinal number for that species. Once a number has been used, it will not be re-used.

    Note 2: A locus can have multiple Apollo submissions from the community, which will show as multiple gene model annotations. That is why we advise naming models with your last name in the Apollo Information Editor.
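
The numbering rules in Note 1 can be illustrated with a short Python sketch. It assumes that an annotation ID is an uppercase prefix followed by a six-digit ordinal (inferred from examples such as AGAP010102) and that "next available" means one more than the highest ordinal ever used. The function names are hypothetical and are not part of any VectorBase tool.

```python
import re

# Assumed pattern: an uppercase species prefix followed by a six-digit ordinal,
# inferred from examples such as AGAP010102; real IDs may differ.
ID_PATTERN = re.compile(r"^([A-Z]+)(\d{6})$")


def split_annotation_id(annotation_id):
    """Split an ID such as 'AGAP010102' into (prefix, ordinal)."""
    match = ID_PATTERN.match(annotation_id)
    if match is None:
        raise ValueError(f"unrecognised annotation ID: {annotation_id!r}")
    return match.group(1), int(match.group(2))


def next_annotation_id(prefix, ever_used_ids):
    """Return the next unused ID for a prefix (hypothetical helper).

    `ever_used_ids` must include retired IDs, so that a number is never re-used.
    """
    ordinals = [
        ordinal
        for p, ordinal in map(split_annotation_id, ever_used_ids)
        if p == prefix
    ]
    return f"{prefix}{max(ordinals, default=0) + 1:06d}"


# The ordinal is just a counter; it says nothing about genomic position.
print(split_annotation_id("AGAP010102"))  # ('AGAP', 10102)

# AGAP000001 may be retired, but it still blocks its number from being re-used.
print(next_annotation_id("AGAP", ["AGAP000001", "AGAP000002"]))  # AGAP000003
```

Because retired IDs stay in the "ever used" set, an old ID found in a paper can never silently point to a different, newer locus.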

Here is an example.

"Why, when I search for KIBRLG, do I get AGAP000002? This gene used to be AGAP000001."

This result shows the annotation ID AGAP000002. When the updated gene prediction was integrated into a new gene set, the ID was changed from AGAP000001 to AGAP000002 as a consequence of the modification to the predicted exon structure. The gene symbol KIBRLG has been maintained.

Notifications of an updated gene set for your species can be found in news items and in the release notes.

To visualize both the current canonical and the pending new gene models, go to the Genome Browser. In the Location tab, select the page 'Region in detail' and click on 'Configure this page'. Under the category "Genes and transcripts", activate the track called "Current Apollo annotation".

Follow this link for a step-by-step guide, with images, to activating this Apollo track in the Genome Browser.

How do I find a gene whose ID has changed?

You have four options. We recommend trying the first one and then working through the others in order:

  • Type your "old" gene ID into VectorBase Search; the results page should give you the "current/new" gene ID.
  • Enter the genomic coordinates, i.e., the scaffold (or supercontig) or chromosome and the base pair range, in the Genome Browser.
  • Use the nucleotide or protein sequence for a BLAST search.
  • Use the lists of identifier changes presented in some gene set pages; follow this link for the latest one for Culex quinquefasciatus. Note: This may not be available for all organisms' gene sets.

How can I cite a specific gene set in my thesis/paper? Follow the link to this FAQ.
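
If you have many IDs to update, the fourth option above (working from a list of identifier changes) can be scripted. The following Python sketch assumes a hypothetical tab-separated list with one "old ID, new ID" pair per line; the lists published on gene set pages may use a different layout, so treat the parsing as a placeholder. It follows chains of changes and also copes with a gene that was split into several new IDs.

```python
import csv
import io

# Hypothetical change list: tab-separated old ID and new ID, one change per line.
# The real lists on gene set pages may use a different layout.
CHANGES_TSV = "AGAP000001\tAGAP000002\n"


def load_changes(handle):
    """Read (old_id, new_id) rows into a dict mapping old ID -> list of new IDs."""
    changes = {}
    for old_id, new_id in csv.reader(handle, delimiter="\t"):
        changes.setdefault(old_id, []).append(new_id)
    return changes


def current_ids(gene_id, changes):
    """Follow recorded changes until reaching IDs with no further change."""
    resolved, pending, seen = [], [gene_id], set()
    while pending:
        candidate = pending.pop()
        if candidate in seen:
            continue  # guard against circular entries in the list
        seen.add(candidate)
        if candidate in changes:
            pending.extend(changes[candidate])
        else:
            resolved.append(candidate)
    return sorted(resolved)


changes = load_changes(io.StringIO(CHANGES_TSV))
print(current_ids("AGAP000001", changes))  # ['AGAP000002']
```

An ID that does not appear in the list is returned unchanged, which only means that no change was recorded for it in that particular list, not that the ID is necessarily current.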