name: inverse
layout: true
class: center, middle, inverse
---

# Introduction to Genome Annotation

---

## Requirements

Before diving into this slide deck, we recommend that you have a look at:

- [Introduction to Galaxy Analyses](/archive/2019-11-01/topics/introduction)

.footnote[Tip: press `P` to view the presenter notes]

???

Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off.

---

# Genome Annotation

**Structural Annotation**

Positions of genomic features along the genome

**Functional Annotation**

Assigning functions to those features

???

Two parts: structural and functional. Structural annotation can come from *ab initio* predictions or structural data. Functional annotation often comes from analysis of protein domains or, in rare cases, from experimental data.

---

## Structural Annotation

Types of elements:

- genes
- regulatory regions
- ncRNA
- repeat elements
- pseudogenes and paralogs

---

### Structural Annotation: Why?

![](../images/intro-structural-annotation.png)

---

### Structural Annotation: Why?

Locate your favorite gene + see what's next to it

Basis for other analyses, e.g.:

- Transcriptomic data (count reads mapping inside exons)
- Variant detection (SNPs, indels, …) and their effects
- Epigenomics (ChIP-Seq, FAIRE-Seq, …)

Compare with other species

- Presence/absence/mutations of genes
- Family reduction or expansion
- Structural variants

---

### Prokaryotic Genes

<div>
.pull-left[
Promoter:

- -35 Region
- TATA Box
- Initiation site (TSS)
]

.pull-right[
![](../images/intro-prok1.png)
]
</div>

<div>
.pull-left[
Operons:

- Promoter
- Some genes
- A terminator
]

.pull-right.image-90[
![](../images/intro-prok2.png)
]
</div>

???

Prokaryotic genes often have a well-conserved structure, with a promoter, one or a few genes and a terminator.

---

### Eukaryotic Genes

![](../images/intro-eukaryotic-genes.png)

???
Things are a little more complicated for eukaryotes: splicing

---

## Automatic Structural Annotation

Very difficult problem

- Short, variable, unspecific motifs
- Need data to support predictions

![](../images/intro-tss.png)

---

### Evidence

Multiple pieces of evidence

- Alignment of RNA-Seq reads
- Alignment of ESTs or transcripts (same species or closely related species)
- Alignment of proteins (closely related species)

.image-50[
![](../images/intro-evidences.png)
]

*But* data unavailable for novel or very distant genes, or unexpressed genes

---

### *ab initio* Gene Calling

.pull-left[
Predictions using:

- Genome sequence
- Statistical model (specific to organism)

Models:

- Built from training on evidence-based gene calls (2-3 iterations)
]

.pull-right.image-90[
![](../images/intro-ab-initio.png)
]

---

### Data Reconciliation

.pull-left[
- Integration of evidence and *ab initio* predictions
- "Consensus" of multiple sources
- Automated pipelines: Maker, Braker, Pasa, Prokka
]

.pull-right.image-90[
![](../images/intro-consensus.png)
]

<small>
Source: [Maker tutorial](http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018)
</small>

---

### Evaluation of Evidence: metrics

- Number of genes
- Average number of exons
- Average gene length
- ...

---

### Evaluation of Evidence: BUSCO

#### Benchmarking Universal Single-Copy Orthologs

* Sets of genes having single-copy orthologs in nearly all species of a clade (insects, plants, bacteria, ...)
* Genes supposed to be vital for the species
* Expected to be found in a good annotation
* Results:
    * Found genes
    * Fragmented genes
    * Duplicated genes

---

### Visualisation of Results

Genome Browsers (JBrowse, UCSC, ...)

![](../images/intro-viz.png)

---

### Repeat Elements

- Transposons, low complexity regions
- Disrupt gene calling
- Prediction pipelines:
    - RepeatMasker
    - REPET
    - RepBase

---

### Exotic Elements

- tRNA, rRNA, ncRNA, ...
- Dedicated tools for prediction
    - Aragorn
    - tRNAscan
    - ...

---

### Summary

- Difficult
- Never perfect
    - Missing/incomplete genes
    - Split/fused genes
    - Pseudogenes

---

## Manual Annotation

- Recruit experts of some gene families
- Manual curation of their favorite genes
    - Better annotation
    - Things to say in the genome paper
- Limits
    - There aren't experts for all genes
    - They can only annotate what is in the sequence
        - Poor assembly ⇒ Poor annotation
- We need a user-friendly environment

---

### Editors

Apollo (based upon JBrowse), Artemis, others

![](../images/intro-apollo.png)

---

### Steps

.pull-left[
Annotation steps

- Check structure (exons, introns, start, stop, UTR, ...)
- Search for isoforms
- Ensure consistent naming conventions
- Add functional annotations (based on homologies with other species)
]

.pull-right[
![](../images/intro-steps.png)
]

---

## Functional Annotation

Collection of information on the function of identified genes

- biological function
- regulation, expression, ...

Data Sources

- wet lab experiments (reliable but long and expensive)
- manual assignment (cf. Apollo)
- **automatic assignment**

---

### Methods

- similarity search / homology
- pattern search
- orthologies
- synteny to related organisms
- Comparison against databases:
    - GenBank, NR: sequence databases
    - InterPro: pattern library (active sites, protein families, signal peptides, ...)
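???

The "pattern search" method listed above can be illustrated with a few lines of Python. This is only a minimal sketch and not how InterProScan works internally: it converts a PROSITE-style pattern into a regular expression and scans a protein sequence for hits. The helper names (`prosite_to_regex`, `find_motifs`) and the example sequence are made up for this illustration; the pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site motif (PS00001).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex.

    Supported elements: x (any residue), [ABC] (one of), {ABC} (none of),
    and repetition counts such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repetition suffix, e.g. "x(2,4)".
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."
        elif core.startswith("{") and core.endswith("}"):
            regex = f"[^{core[1:-1]}]"
        else:
            regex = core  # a literal residue or a [..] class
        if low is not None:
            regex += f"{{{low},{high}}}" if high else f"{{{low}}}"
        parts.append(regex)
    return "".join(parts)


def find_motifs(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched text) for every hit, overlaps included."""
    # A lookahead with a capture group lets overlapping hits be reported.
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from any real protein.
    print(find_motifs("MKNGSAANPTQ", "N-{P}-[ST]-{P}"))
```

Real pattern libraries also rely on profiles and hidden Markov models, which a regular expression cannot express, so treat this as a teaching aid only.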
---

### Blast

- Blast against NR
    - For each protein (or CDS) of the annotation
    - Find the best xx hits
    - Huge database, good chances to have a match
- Risk:
    - Spread of "putative xx protein"
    - Spread of low-evidence annotations

---

### InterProScan

- For each protein (or CDS) of the annotation
- Search for all InterPro patterns
    - Many motifs
    - Some of them manually curated
    - Gene Ontology terms available for domains

---

### Gene Ontology

.pull-left[
Controlled vocabulary to describe:

- molecular function
- biological process
- cellular component
- e.g.: `GO:0044430` = cytoskeletal part

![](../images/intro-go.png)
]

.pull-right[
![](../images/intro-go-source.png)
]

---

### Blast2GO

- Blast2GO
    - For each protein (or CDS) of the annotation, tag with GO terms
    - Based on Blast and InterProScan results

---

### Orthology

- For each annotated gene
    - Search for orthologous genes in related species
    - Search for paralogues
- Bioinformatics method:
    - Blast all against all transcripts
    - Filtering the best hits
    - Clustering
    - OrthoFinder, OrthoMCL, ...

![](../images/intro-ortho.png)

---

### Visualisation

- Genomic databases (NCBI, FlyBase, etc.)
- Other sites (Tripal sites)
    - reference data (assembly, annotation, ...)
    - interfaces to visualize this data
    - interfaces for querying

e.g. [bipaa.genouest.org](https://bipaa.genouest.org)

---

## Genome Annotation

- Very difficult
- Automatic:
    - Structural:
        - *Ab initio* methods are improving
        - EST/RNA-Seq data provide good evidence
    - Functional:
        - Concrete evidence cost-prohibitive to obtain
        - Risks of automatically spreading "putative" annotations
- Manual:
    - Slow
    - No expert or evidence available

---

## Related tutorials

---

## Thank you!

This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://wiki.galaxyproject.org/Teach/GTN) and all the contributors!