
Introduction to Genome Annotation


Anthony Bretaudeau, Helena Rasche
Last modification: Dec 8, 2021

Genome Annotation

Structural Annotation

Positions of genomic features along the genome

Functional Annotation

Assigning functions to those features

Speaker Notes

Genome annotation has two parts: structural and functional. Structural annotation can come from ab initio predictions or from structural data. Functional annotation often comes from the analysis of protein domains or, in rare cases, from experimental data.
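To make the two parts concrete, here is a minimal sketch of how one annotated feature could be represented. The `Feature` class, its field names, and the example values are hypothetical and chosen for illustration only; they are not taken from any particular annotation tool or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """One annotated feature (hypothetical structure for illustration)."""
    # Structural annotation: where the feature sits on the genome.
    seqid: str   # chromosome or contig name
    kind: str    # e.g. "gene", "exon", "CDS"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    # Functional annotation: what the feature is thought to do.
    functions: List[str] = field(default_factory=list)

    def length(self) -> int:
        """Length in bases, counting both start and end positions."""
        return self.end - self.start + 1


# The structural part comes first; functions are attached afterwards.
gene = Feature("chr1", "gene", 1200, 3500, "+")
gene.functions.append("putative DNA-binding protein")
print(gene.length(), gene.functions)
```

The point of the sketch is only the ordering: a feature needs a position before a function can be assigned to it.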

Structural Annotation

Types of elements:

Structural Annotation: Why?

Screenshot of Apollo, with genome sequence at the top, and several gene models shown below. These gene models are coding sequences + protein sequences.

Structural Annotation: Why?

Locate your favorite gene + see what’s next to it

Basis for other analysis, e.g.:

Compare with other species

Prokaryotic Genes

.pull-left[
Promoter:
- -35 Region
- -10 Region (TATA Box)
- Initiation site (TSS)
]
.pull-right[ ![Cartoon of a promoter region with a -35 region box reading TTGACA, a -10 region TATAAT, and that makes up the promoter. Then at +1 is the TSS or transcription start site.](../images/intro-prok1.png) ]

.pull-left[
Operons:
- Promoter
- Some genes
- A terminator
]
.pull-right.image-90[ ![A cartoon of an operon with promoter, gene 1, gene 2, and gene 3 boxes before a terminator. This produces a polycistronic mRNA and that is cut up into protein 1, 2, and 3](../images/intro-prok2.png) ]

Speaker Notes

Prokaryotic genes often have a well conserved structure, with a promoter, one or a few genes and a terminator.
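Because the promoter structure is well conserved, a naive search for it can be sketched with a regular expression. The hexamers TTGACA and TATAAT are the ones shown in the promoter cartoon; the 15 to 19 bp spacer between them and the function name `find_promoters` are assumptions made for this sketch. Real promoters rarely match the consensus exactly, so this illustrates the idea and is not a usable promoter predictor.

```python
import re

# -35 box, a spacer (assumed 15-19 bp), then the -10 box.
# The lookahead lets overlapping candidates be reported.
PROMOTER = re.compile(r"(?=(TTGACA)([ACGT]{15,19})(TATAAT))")


def find_promoters(seq):
    """Return (start of -35 box, start of -10 box) pairs, 0-based.

    Only exact consensus matches on the given strand are reported.
    """
    return [(m.start(1), m.start(3)) for m in PROMOTER.finditer(seq.upper())]


# Toy sequence built for the example: both boxes with a 17 bp spacer.
dna = "GGC" + "TTGACA" + "A" * 17 + "TATAAT" + "GCATCGA"
print(find_promoters(dna))
```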

Eukaryotic Genes

A cartoon of a eukaryotic gene with a UTR, Exon 1, an intron, Exon 2, another intron, and Exon 3 before a final UTR shown along a line representing DNA. The process of Transcription extracts the region between and including UTRs, exons, and introns. Splicing produces the next product which is just the UTRs and 3 exons, before translation into a protein.

Speaker Notes

Things are a little more complicated for eukaryotic genes because of splicing.
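Splicing can be pictured as cutting the introns out of the transcript and joining the exons that remain. The sketch below does exactly that on a toy pre-mRNA; the sequence, the exon coordinates, and the `splice` helper are all made up for illustration, and UTRs are left out to keep it short.

```python
def splice(pre_mrna, exons):
    """Join the exon slices of a transcript.

    exons: (start, end) pairs, 0-based and end-exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy transcript: exon 1, one intron, exon 2.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUAA"
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
print(mature)
```

An annotation tool has to solve the reverse problem: it sees only the genomic sequence and must work out where the exon boundaries are.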

Automatic Structural Annotation

Very difficult problem

The previous cartoon of a eukaryotic gene with TSS pointing to the UTR, the ATG codon at exon 1, the three exons, and a final UTR. The letters GT appear at the end of exons 1 and 2, and AG appear at the beginning of Exon 2 and 3, indicating where it will be spliced.
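The GT and AG dinucleotides marked in the cartoon give a simple sanity check for a proposed gene model: every intron implied by the exon coordinates should start with GT and end with AG. The sketch below assumes a forward-strand gene, 0-based end-exclusive coordinates, and a toy sequence; the helper names are hypothetical. Non-canonical splice sites do exist, so a failed check flags a model for review and does not prove it wrong.

```python
def introns_from_exons(exons):
    """Yield (start, end) of the gaps between consecutive exons."""
    ordered = sorted(exons)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        yield prev_end, next_start


def has_canonical_splice_sites(gene_dna, exons):
    """True if every intron starts with GT and ends with AG."""
    for start, end in introns_from_exons(exons):
        intron = gene_dna[start:end].upper()
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


# Toy gene: exon 1, a GT...AG intron, exon 2.
gene_dna = "ATGGCC" + "GTAAGTTTTCAG" + "GAATAA"

print(has_canonical_splice_sites(gene_dna, [(0, 6), (18, 24)]))  # correct boundaries
print(has_canonical_splice_sites(gene_dna, [(0, 7), (18, 24)]))  # exon 1 one base too long
```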


Multiple pieces of evidence

Screenshot of genome browser with a new gene at the top, a row with protein/mRNA from other species used as evidence, and a final row with RNASeq coverage indicating mapped reads.

But such evidence is unavailable for novel or very distant genes, and for unexpressed genes

Ab initio Gene Calling

.pull-left[ Predictions using:


.pull-right.image-90[ Cartoon of ab initio prediction with several rows, genome, coding potential, atg and stop codons, splice sites, then these rows are duplicated for the reverse strand. Examples are listed like genefinder, augustus, glimmer, snap, and fgenesh. ]
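The signal rows in the cartoon (ATG and stop codons, scanned on both strands) can be illustrated with a naive open reading frame scan. This is a deliberately simplified sketch: the tools named above rely on statistical models of coding potential and splice sites, none of which appears here, and the function names and the `min_len` threshold are assumptions for the example.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# ATG, then whole codons (lazily), then the first in-frame stop codon.
# The lookahead allows ORFs that overlap or nest to be reported too.
ORF = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")


def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=9):
    """Return (strand, start, orf) for ATG...stop stretches on both strands.

    start is 0-based on the strand that was scanned; for "-" it counts
    from the start of the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in ORF.finditer(s):
            orf = m.group(1)
            if len(orf) >= min_len:
                orfs.append((strand, m.start(1), orf))
    return orfs


forward = "CCATGAAATGACC"
print(find_orfs(forward))
print(find_orfs(reverse_complement(forward)))
```

In a eukaryotic genome such a scan would miss most genes, because introns interrupt the reading frame; that is one reason splice site signals are part of the prediction.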

Data Reconciliation


.pull-right.image-90[ Multiple gene models are shown labelled current evidence, which is complemented by three more gene models from ab initio prediction. Below a final gene model is constructed and labelled current assembly. Source: Maker documentation ]

Evaluation of annotation: metrics

Evaluation of annotation: BUSCO

Benchmarking Universal Single-Copy Orthologs
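BUSCO reports how many of the expected single-copy orthologs were found complete (single-copy or duplicated), fragmented, or missing. The sketch below shows how such counts turn into percentages, written in a one-line notation modelled on BUSCO's short summary. The function name and the counts are hypothetical; this does not run BUSCO or read its output files.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Format ortholog counts as percentages in a BUSCO-like one-liner."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no orthologs were searched")

    def pct(n):
        return 100.0 * n / total

    complete = single + duplicated
    return (
        f"C:{pct(complete):.1f}%"
        f"[S:{pct(single):.1f}%,D:{pct(duplicated):.1f}%],"
        f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{total}"
    )


# Made-up counts for a lineage set of 200 orthologs.
print(busco_summary(single=180, duplicated=10, fragmented=4, missing=6))
```

A high share of complete orthologs suggests that few genes were missed, while many duplicated ones can hint at redundancy in the assembly or annotation.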

Visualisation of Results

Genome Browsers (JBrowse, UCSC, …)

Screenshot of Apollo showing a track list on left, and then several rows of evidence like maker annotation and blastn showing gene models, followed by my bam file.

JBrowse: Slides - Tutorial

Repeated Elements

Exotic Elements


Manual Annotation


Apollo (based upon JBrowse), Artemis, others

Screenshot of Apollo with dozens of tracks in the list. Several gene model predictions are shown as well as a few bigwig XY density tracks from Illumina sequencing data.

Check out the Apollo tutorial for more details.


.pull-left[ Annotation steps

.pull-right[ A report showing 15 errors like gene having no alternate alleles or missing a symbol, or missing a human readable name. 2 warnings about model structures are shown, and the other 119 genes are shown as valid. ]

Functional Annotation

Collection of information on the function of identified genes

Data Sources





Gene Ontology

.pull-left[ Controlled vocabulary to describe:

A tree browser showing the top level groups (MF, CC, BP) followed by a drill down into molecular function with several groups like acetylcholine receptor activator activity, inhibitor, and regulator activity. antioxidant activity, binding, etc. Each group shows large numbers of children. ]

.pull-right[ A stacked bar chart with species along the bottom and the number of annotations on the vertical axis. Each bar is split into experimental and non-experimental annotations, at roughly 15% experimental to 85% non-experimental for most species. Human has the highest counts, with 200k experimental and 250k non-experimental. ]
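The tree structure described above means that annotating a gene with a specific term also implies all of that term's ancestors. The sketch below walks a small hand-written excerpt of the molecular function branch. The `GO_PARENTS` table is a simplified assumption: intermediate terms may be missing, and real analyses should load the full ontology released by the Gene Ontology project.

```python
# child term -> parent terms (simplified excerpt, for illustration only)
GO_PARENTS = {
    "GO:0003674": [],              # molecular_function (root)
    "GO:0005488": ["GO:0003674"],  # binding
    "GO:0016209": ["GO:0003674"],  # antioxidant activity
    "GO:0003676": ["GO:0005488"],  # nucleic acid binding
    "GO:0003677": ["GO:0003676"],  # DNA binding
}


def ancestors(term, parents=GO_PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# A gene annotated with "DNA binding" is implicitly annotated with these too.
print(sorted(ancestors("GO:0003677")))
```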

Gene Ontology

Various sources


Diagram of orthology with species A and species B as two circles. Inside A are a1, a2, and a3, labelled as in-paralogs. Inside B are b1 and b2, also in-paralogs. a1 and b1 are orthologs, and the rest of the potential connections between A and B are labelled co-orthologs.


Comparing annotations

Genome Annotation

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

This material is licensed under the Creative Commons Attribution 4.0 International License.