name: inverse
layout: true
class: center, middle, inverse

---

# Introduction to Genome Annotation
---

## Requirements

Before diving into this slide deck, we recommend that you have a look at:

- [Introduction to Galaxy Analyses](/archive/2021-06-01/topics/introduction)

.footnote[Tip: press `P` to view the presenter notes]

???

Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off.

Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other.

Useful when presenting.

---

# Genome Annotation

**Structural Annotation**

Positions of genomic features along the genome

**Functional Annotation**

Assigning functions to those features

???

Two parts: structural and functional.

Structural annotation can come from *ab initio* predictions or structural data.

Functional annotation often comes from analysis of protein domains or, in rare cases, from experimental data.

---

## Structural Annotation

Types of elements:

- genes
- regulatory regions
- ncRNA
- repeat elements
- pseudogenes and paralogs

---

### Structural Annotation: Why?

![](../images/intro-structural-annotation.png)

---

### Structural Annotation: Why?

Locate your favorite gene + see what's next to it

Basis for other analyses, e.g.:
- Transcriptomic data (count reads mapping inside exons)
- Variant detection (SNPs, indels, …) and their effects
- Epigenomics (ChIP-Seq, FAIRE-Seq, …)

Compare with other species
- Presence/absence/mutations of genes
- Family reduction or expansion
- Structural variants

---

### Prokaryotic Genes

<div>
.pull-left[
Promoter:
- -35 Region
- TATA Box
- Transcription initiation site (TSS)
]

.pull-right[
![](../images/intro-prok1.png)
]
</div>

<div>
.pull-left[
Operons:
- Promoter
- Some genes
- A terminator
]

.pull-right.image-90[
![](../images/intro-prok2.png)
]
</div>

???

Prokaryotic genes often have a well-conserved structure, with a promoter, one or a few genes, and a terminator.
---

### Eukaryotic Genes

![](../images/intro-eukaryotic-genes.png)

???

Things are a little more complicated for eukaryotes: splicing.

---

## Automatic Structural Annotation

Very difficult problem
- Short, variable, nonspecific motifs
- Need data to support predictions

![](../images/intro-tss.png)

---

### Evidence

Multiple pieces of evidence
- Alignment of RNA-Seq reads
- Alignment of ESTs or transcripts (same species or closely related species)
- Alignment of proteins (closely related species)

.image-50[
![](../images/intro-evidences.png)
]

*But* such data are unavailable for novel, highly divergent, or unexpressed genes

---

### *Ab initio* Gene Calling

.pull-left[
Predictions using:
- Genome sequence
- Statistical model (specific to organism)

Models:
- Built from training on evidence-based gene calls (2-3 iterations)
]

.pull-right.image-90[
![](../images/intro-ab-initio.png)
]

---

### Data Reconciliation

.pull-left[
- Integration of evidence and *ab initio* predictions
- "Consensus" of multiple sources
- Automated pipelines: MAKER, BRAKER, PASA, Prokka
]

.pull-right.image-90[
![](../images/intro-consensus.png)
]

<small>
Source: [MAKER tutorial](http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018)
</small>

---

### Evaluation of Evidence: Metrics

- Number of genes
- Average number of exons
- Average gene length
- ...

---

### Evaluation of Evidence: BUSCO

#### Benchmarking Universal Single-Copy Orthologs

* Sets of genes having single-copy orthologs in all species of a clade (insects, plants, bacteria, ...)
* Genes presumed to be essential for the species
* Expected to be found in a good annotation
* Results:
  * Found genes
  * Fragmented genes
  * Duplicated genes

---

### Visualisation of Results

Genome Browsers (JBrowse, UCSC, ...)
![](../images/intro-viz.png)

---

### Repeat Elements

- Transposons, low complexity regions
- Disrupt gene calling
- Prediction tools and databases:
  - RepeatMasker
  - REPET
  - Repbase (repeat database)

---

### Exotic Elements

- tRNA, rRNA, ncRNA, ...
- Dedicated tools for prediction
  - ARAGORN
  - tRNAscan-SE
  - ...

---

### Summary

- Difficult
- Never perfect
  - Missing/incomplete genes
  - Split/fused genes
  - Pseudogenes

---

## Manual Annotation

- Recruit experts on specific gene families
  - Manual curation of their favorite genes
  - Better annotation
  - Things to say in the genome paper
- Limits
  - There aren't experts for all genes
  - They can only annotate what is in the sequence
    - Poor assembly ⇒ Poor annotation
- We need a user-friendly environment

---

### Editors

Apollo (based upon JBrowse), Artemis, others

![](../images/intro-apollo.png)

---

### Steps

.pull-left[
Annotation steps
- Check structure (exons, introns, start, stop, UTR, ...)
- Search for isoforms
- Ensure consistent naming conventions
- Add functional annotations (based on homologies with other species)
]

.pull-right[
![](../images/intro-steps.png)
]

---

## Functional Annotation

Collection of information on the function of identified genes
- biological function
- regulation, expression, ...

Data Sources
- wet lab experiments (reliable but slow and expensive)
- manual assignment (cf. Apollo)
- **automatic assignment**

---

### Methods

- similarity search / homology
- pattern search
- orthologies
- synteny to related organisms
- Comparison against databases:
  - GenBank, NR: sequence databases
  - InterPro: pattern library (active sites, protein families, signal peptides, ...)
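---

### Pattern Search: a Sketch

To make "pattern search" concrete, here is a minimal sketch, not how InterProScan or any specific tool works internally. It assumes the PROSITE-style N-glycosylation pattern `N-{P}-[ST]-{P}` translated by hand into a regular expression. The function name `find_motifs` and the protein fragment are hypothetical, invented for this example.

```python
import re

# PROSITE-style pattern for N-glycosylation sites: N-{P}-[ST]-{P}
# ({P} means "any residue except proline").
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motifs(protein: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for each motif hit."""
    return [
        (match.start() + 1, match.group(1))
        for match in N_GLYCOSYLATION.finditer(protein)
    ]


# Hypothetical protein fragment, invented for this example.
protein = "MKNGTLAANPSGNVSA"

for start, residues in find_motifs(protein):
    print(f"motif {residues} at position {start}")
```

Short patterns like this one match by chance very often, which is one reason curated pattern libraries and supporting evidence matter.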
---

### BLAST

- BLAST against NR
  - For each protein (or CDS) of the annotation
  - Find the best *N* hits
  - Huge database, good chance of finding a match
- Risks:
  - Spread of "putative xx protein"
  - Spread of low-evidence annotations

---

### InterProScan

- For each protein (or CDS) of the annotation
  - Search for all InterPro patterns
- Many motifs
  - Some of them manually curated
- Gene Ontology terms available for domains

---

### Gene Ontology

.pull-left[
Controlled vocabulary to describe:
- molecular function
- biological process
- cellular component
  - e.g.: `GO:0044430` = cytoskeletal part

![](../images/intro-go.png)
]

.pull-right[
![](../images/intro-go-source.png)
]

---

### Blast2GO

- For each protein (or CDS) of the annotation, tag with GO terms
- Based on BLAST and InterProScan results

---

### Orthology

- For each annotated gene
  - Search for orthologous genes in related species
  - Search for paralogs
- Bioinformatics method:
  - All-against-all BLAST of transcripts
  - Filtering the best hits
  - Clustering
- OrthoFinder, OrthoMCL, ...

![](../images/intro-ortho.png)

---

### Visualisation

- Genomic databases (NCBI, FlyBase, etc.)
- Other sites (Tripal sites)
  - reference data (assembly, annotation, ...)
  - interfaces to visualize this data
  - interfaces for querying

e.g. [bipaa.genouest.org](https://bipaa.genouest.org)

---

## Genome Annotation

- Very difficult
- Automatic:
  - Structural:
    - *Ab initio* methods are improving
    - EST/RNA-Seq data provide good evidence
  - Functional:
    - Concrete evidence is cost-prohibitive to obtain
    - Risk of automatically spreading "putative" annotations
- Manual:
  - Slow
  - Experts and evidence are not always available

---

## Related tutorials

---

## Thank you!

This material is the result of collaborative work. Thanks to the [Galaxy Training Network](https://wiki.galaxyproject.org/Teach/GTN) and all the contributors!
This material is licensed under the Creative Commons Attribution 4.0 International License.