name: inverse layout: true class: center, middle, inverse
# Introduction to Genome Annotation
Updated: Jul 26, 2021
Or view the plain-text slides without JS
to view the presenter notes
??? Presenter notes contain extra information which might be useful if you intend to use these slides for teaching. Press `P` again to switch presenter notes off Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other. Useful when presenting. --- ## Requirements Before diving into this slide deck, we recommend you to have a look at: - [Introduction to Galaxy Analyses](/archive/2021-11-01/topics/introduction) --- # Genome Annotation **Structural Annotation** Positions of genomic features along the genome **Functional Annotation** Assigning functions to those features ??? Two parts, structural and function. Structural can come from ab-initio predictions or structural data. Functional annotation often comes from analysis of protein domains or in rare cases from experimental data. --- ## Structural Annotation Types of elements: - genes - regulatory regions - ncRNA - repeat elements - pseudogenes and paralogs --- ### Structural Annotation: Why? ![Screenshot of apollo, with genome sequence at the top, and several gene models shown below. These gene models are coding sequences + protein sequences.](../images/intro-structural-annotation.png) --- ### Structural Annotation: Why? Locate your favorite gene + see what's next to it Basis for other analysis, e.g.: - Transcriptomic data (count reads mapping inside exons) - Variants detection (SNP, indels, …) and their effects - Epigenomic (ChIPSeq, FAIRESeq, …) Compare with other species - Presence/absence/mutations of genes - Family reduction or expansion - Structural variants --- ### Prokaryotic Genes <div> .pull-left[ Promoter: - -35 Region - TATA Box - Initiation site (TSS) ] .pull-right[ ![Cartoon of a promoter region with a -35 region box reading TTGACA, a -10 region TATAAT, and that makes up the promoter. Then at +1 is the TSS or transcription start site.](../images/intro-prok1.png) ] </div> <div> .pull-left[ Operons: - Promoter - Some genes - A terminator ] .pull-right.image-90[ ![A cartoon of gene with promoter, gene 1, gene 2, and gene 3 boxes before a terminator. This produces a polycistronic mRNA and that is cut up into protein 1, 2, and 3](../images/intro-prok2.png) ] </div> ??? Prokaryotic genes often have a well conserved structure, with a promoter, one or a few genes and a terminator. --- ### Eukaryotic Genes ![A cartoon of a eukaryotic gene with a UTR, Exon 1, an intro, Exon 2, another intro, and Exon 3 before a final UTR shown along a line representing DNA. The process of Transcription extracts the region between and including UTRs, exons, and introns. Splicing produces the next product which is just the UTRs and 3 exons, before translation into a protein.](../images/intro-eukaryotic-genes.png) ??? Things are a little more complicated for eukaryotic: splicing --- ## Automatic Structural Annotation Very difficult problem - Short, variable, unspecific motifs - Need data to support predictions ![The previous cartoon of a eukaryotic gene with TSS pointing to the UTR, the ATG codon at exon 1, the three exons, and a final UTR. The letters GT appear at the end of exons 1 and 2, and AG appear at the beginning of Exon 2 and 3, indicating where it will be spliced.](../images/intro-tss.png) --- ### Evidence Multiple pieces of evidence - Alignment of RNASeq reads - Alignment of EST or transcripts (same species or closely related species) - Alignment of proteins (closely related species) ![Screenshot of genome browser with a new gene at the top, a row with protein/mRNA from other species used as evidence, and a final row with RNASeq coverage indicating mapped reads.](../images/intro-evidences.png) *But* data unavailable for novel or very distant genes, or unexpressed genes --- ### *ab initio* Gene Calling .pull-left[ Predictions using: - Genome sequence - Statistical model (specific to organism) Models: - Built from training on evidence-based gene calls (2-3 iterations) ] .pull-right.image-90[ ![Cartoon of ab initio prediction with several rows, genome, coding potential, atg and stop codons, splice sites, then these rows are duplicated for the reverse strand. Examples are listed like genefinder, augustus, glimmer, snap, and fgenesh.](../images/intro-ab-initio.png) ] --- ### Data Reconcilliation .pull-left[ - Integration of evidence and *ab initio* predictions - "Consensus" of multiple sources - Automated pipelines: Maker, Braker, Pasa, Prokka ] .pull-right.image-90[ ![Multiple gene models are shown labelled current evidence, which is complemented by three more gene models from ab initio prediction. Below a final gene model is constructed and labelled current assembly.](../images/intro-consensus.png) ] <small> Source: [Maker tutorial](http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018) </small> --- ### Evaluation of Evidence: metrics - Number of genes - Average number of exons - Average gene length - ... --- ### Evaluation of Evidence: BUSCO #### Benchmarking Universal Single-Copy Orthologs * Sets of genes having single-copy orthologs in all species of a clade (insects, plants, bacteria, ...) * Genes supposed to be vital for the species * Expected to be found in a good annotation * Results: * Found genes * Fragmented genes * Duplicated genes --- ### Visualisation of Results Genome Browsers (JBrowse, UCSC, ...) ![Screenshot of Apollo showing a track list on left, and then several rows of evidence like maker annotation and blastn showing gene models, followed by my bam file.](../images/intro-viz.png) --- ### Repeat Elements - Transposons, low complexity regions - Disrupt gene calling - Prediction pipelines: - RepeatMasker - REPET - RepBase --- ### Exotic Elements - tRNA, rRNA, ncRNA, ... - Dedicated tools for prediction - Aragorn - tRNAscan - ... --- ### Summary - Difficult - Never perfect - Missing/incomplete genes - Split/fused genes - Pseudogenes --- ## Manual Annotation - Recruit experts of some gene families - Manual curation of their favorite genes - Better annotation - Things to say in the genome paper - Limits - There aren't experts for all genes - They can only annotate what is in the sequence - Poor assembly ⇒ Poor annotation - We need a user-friendly environment --- ### Editors Apollo (based upon JBrowse), Artemis, others ![Screenshot of apollo with dozens of tracks in the list. Several gene model predictions are shown as well as a few bigwig XY density from illumina sequencing data.](../images/intro-apollo.png) Check out the [Apollo tutorial](/archive/2021-11-01/topics/genome-annotation/tutorials/apollo/tutorial.html) for more details. --- ### Steps .pull-left[ Annotations steps - Check structure (exons, introns, start, stop, utr, ...) - Search for isoforms - Ensure consistent naming conventions - Add functional annotations (based on homologies with other species) ] .pull-right[ ![A report showing 15 errors like gene having no alternate alleles or missing a symbol, or missing a human readable name. 2 warnings about model structures are shown, and the other 119 genes are shown as valid.](../images/intro-steps.png) ] --- ## Functional Annotation Collection of information on the function of identified genes - biological function - regulations, expressions, ... Data Sources - wet lab experiments (reliable but long and expensive) - manual assignment (cf Apollo) - **automatic assignment** --- ### Methods - similarity search / homology - pattern search - orthologies - synteny to related organisms - Comparison against databases: - GenBank, NR: sequence bank - InterPro: pattern library (active sites, protein families, peptide signal ...) --- ### Blast - Blast against NR - For each protein (or CDS) of the annotation - Find the best xx hits - Huge database, good chances to have a match - Risk: - Spread of "putative xx protein" - Spread of low-evidence annotations --- ### InterProScan - For each protein (or CDS) of the annotation - Search for all InterPro patterns - Many motifs - Some of them manually curated - Gene Ontology Terms available for domains --- ### Gene Ontology .pull-left[ Controlled vocabulary to describe: - molecular function - biological process - cellular component - e.g.: `GO:0044430` = cytoskeletal part ![A tree browser showing the top level groups (MF, CC, BP) followed by a drill down into molecular function with several groups like acetylcholine receptor activator activity, inhibitor, and regulator activity. antioxidant activity, binding, etc. Each group shows large numbers of children.](../images/intro-go.png) ] .pull-right[ ![A stacked bar chart with species along the bottom and the number of annotations. The split is experimental and non-experimental annotations which is split ~15% experimental/85% non, for most species. Human has the highest with 200k experimental/250k non-experimental.](../images/intro-go-source.png) ] --- ### Blast2GO - Blast2GO - For each protein (or CDS) of the annotation, tag with GO terms - Based on Blast and InterProScan results --- ### Orthology - For each annotated gene - Search of orthologous genes in related species - Search for paralogues - Bioinformatics method: - Blast all against all transcripts - Filtering the best hits - Clustering - OrthoFinder, OrthoMCL, ... ![Diagram of orthology with species a and species B as two circles. Inside A is a1, 2, and 3 labelled as in-paralogs. Inside B is b1/b2 also in-paralogs. A1 and B1 are orthologs, and the rest of the potential connections between A and B are labelled co-orthologs.](../images/intro-ortho.png) --- ### Visualisation - Genomic databases (NCBI, FlyBase, etc.) - Other sites (Tripal sites) - reference data (assembly, annotation, ...) - interfaces to visualize this data - interfaces for querying e.g. [bipaa.genouest.org](https://bipaa.genouest.org) --- ## Genome Annotation - Very difficult - Automatic: - Structural: - *Ab initio* methods are improving - EST/RNA-Seq data provides good evidence - Functional: - Concrete evidence cost-prohibitive to obtain - Risks of automatically spreading "putative" annotations - Manual: - Slow - No expert or evidence available --- ## Related tutorials --- ## Thank You! This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!
This material is licensed under the Creative Commons Attribution 4.0 International License