Introduction to Genome Annotation

name: inverse
layout: true
class: center, middle, inverse

</span></div>

</span></div>

---

# Introduction to Genome Annotation

<div class="contributors-line">
		
	
<ul class="text-list">
			
			<li>
				<a href="/training-material/hall-of-fame/abretaud/" class="contributor-badge contributor-abretaud"><img src="/training-material/assets/images/orcid.png" alt="orcid logo" width="36" height="36"/><img src="https://avatars.githubusercontent.com/abretaud?s=36" alt="Anthony Bretaudeau avatar" width="36" class="avatar" />
    Anthony Bretaudeau</a>
			<li>
				<a href="/training-material/hall-of-fame/hexylena/" class="contributor-badge contributor-hexylena"><img src="/training-material/assets/images/orcid.png" alt="orcid logo" width="36" height="36"/><img src="https://avatars.githubusercontent.com/hexylena?s=36" alt="Helena Rasche avatar" width="36" class="avatar" />
    Helena Rasche</a></li>
</ul>

</div>

<div class="footnote" style="bottom: 8em;">
  <i class="far fa-calendar" aria-hidden="true"></i><span class="visually-hidden">last_modification</span> Updated:   
  <i class="fas fa-fingerprint" aria-hidden="true"></i><span class="visually-hidden">purl</span><abbr title="Persistent URL">PURL</abbr>: <a href="https://gxy.io/GTN:S00069">gxy.io/GTN:S00069</a>
</div>

<div class="footnote" style="bottom: 5em;">

<i class="fas fa-file-alt" aria-hidden="true"></i><span class="visually-hidden">text-document</span><a href="slides-plain.html"> Plain-text slides</a> |

</div>

<div class="footnote" style="bottom: 2em;">
    <strong>Tip: </strong>press <kbd>P</kbd> to view the presenter notes
    | <i class="fa fa-arrows" aria-hidden="true"></i><span class="visually-hidden">arrow-keys</span> Use arrow keys to move between slides

</div>

???
Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off

Press `C` to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.

Useful when presenting.

---

## Requirements

Before diving into this slide deck, we recommend you to have a look at:

- [Introduction to Galaxy Analyses](/training-material/topics/introduction)

---

# Genome Annotation

**Structural Annotation**

Positions of genomic features along the genome

**Functional Annotation**

Assigning functions to those features

???

Two parts, structural and function. Structural can come from ab-initio predictions or structural data. Functional annotation often comes from analysis of protein domains or in rare cases from experimental data.

---

## Structural Annotation

Types of elements:

- genes
- regulatory regions
- ncRNA
- repeats 
- pseudogenes and paralogs

---

### Structural Annotation: Why?

![Screenshot of apollo, with genome sequence at the top, and several gene models shown below. These gene models are coding sequences + protein sequences.](../../images/intro-structural-annotation.png)

---

### Structural Annotation: Why?

Locate your favorite gene + see what's next to it

Basis for other analysis, e.g.:

- Transcriptomic data (count reads mapping inside exons)
- Variants detection (SNP, indels, …) and their effects
- Epigenomic (ChIPSeq, FAIRESeq, …)

Compare with other species

- Presence/absence/mutations of genes
- Family reduction or expansion
- Structural variants

---

### Prokaryotic Genes

<div>
.pull-left[
Promoter:

- -35 Region
- TATA Box
- Initiation site (TSS)
]
.pull-right[
![Cartoon of a promoter region with a -35 region box reading TTGACA, a -10 region TATAAT, and that makes up the promoter. Then at +1 is the TSS or transcription start site.](../../images/intro-prok1.png)
]
</div>

<div>
.pull-left[
Operons:

- Promoter
- Some genes
- A terminator
]
.pull-right.image-90[
![A cartoon of gene with promoter, gene 1, gene 2, and gene 3 boxes before a terminator. This produces a polycistronic mRNA and that is cut up into protein 1, 2, and 3](../../images/intro-prok2.png)
]
</div>

???

Prokaryotic genes often have a well conserved structure, with a promoter, one or a few genes and a terminator.

---

### Eukaryotic Genes

![A cartoon of a eukaryotic gene with a UTR, Exon 1, an intro, Exon 2, another intro, and Exon 3 before a final UTR shown along a line representing DNA. The process of Transcription extracts the region between and including UTRs, exons, and introns. Splicing produces the next product which is just the UTRs and 3 exons, before translation into a protein.](../../images/intro-eukaryotic-genes.png)

???

Things are a little more complicated for eukaryotic: splicing

---

## Automatic Structural Annotation

Very difficult problem

- Short, variable, unspecific motifs
- Need data to support predictions

![The previous cartoon of a eukaryotic gene with TSS pointing to the UTR, the ATG codon at exon 1, the three exons, and a final UTR. The letters GT appear at the end of exons 1 and 2, and AG appear at the beginning of Exon 2 and 3, indicating where it will be spliced.](../../images/intro-tss.png)

---

### Evidence

Multiple pieces of evidence

- Alignment of RNASeq reads
- Alignment of EST or transcripts (same species or closely related species)
- Alignment of proteins (closely related species)

![Screenshot of genome browser with a new gene at the top, a row with protein/mRNA from other species used as evidence, and a final row with RNASeq coverage indicating mapped reads.](../../images/intro-evidences.png)

*But* data unavailable for novel or very distant genes, or unexpressed genes

---

### *Ab initio* Gene Calling

.pull-left[
Predictions using:
- Genome sequence
- Statistical model (specific to organism)

Models:
- Training on the best evidence-based gene calls
- "Best" = strong evidence, highly conserved
- Training can be iterative:
  - train, predict, select best genes, retrain, etc
]

.pull-right.image-90[
![Cartoon of ab initio prediction with several rows, genome, coding potential, atg and stop codons, splice sites, then these rows are duplicated for the reverse strand. Examples are listed like genefinder, augustus, glimmer, snap, and fgenesh.](../../images/intro-ab-initio.png)
]

---

### Data Reconciliation

.pull-left[

- Integration of evidence and *ab initio* predictions
- "Consensus" of multiple sources
- Automated pipelines
  - [**Maker**](/training-material/topics/genome-annotation/tutorials/annotation-with-maker-short/tutorial.html), **Braker**, [**Braker3**](/training-material/topics/genome-annotation/tutorials/braker3/tutorial.html), [**Funannotate**](/training-material/topics/genome-annotation/tutorials/funannotate/tutorial.html), **Pasa**, **Prokka**, ...
  - Align evidences (or use pre-aligned)
  - Run *ab initio* predictors
  - Reconciliate gene models
]

.pull-right.image-90[
![Multiple gene models are shown labelled current evidence, which is complemented by three more gene models from ab initio prediction. Below a final gene model is constructed and labelled current assembly.](../../images/intro-consensus.png)
]

>>>>>>> 7259342d100 (Remove slides folder and move introduction decks):topics/genome-annotation/tutorials/introduction/slides.html
<small>
Source: [Maker documentation](http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018)
</small>
]

---

### Helixer: a new and different approach

#### Why is this approach so different?

- Ab-initio annotation of genes in large eukaryotic genomes
- Based on a cross-species deep learning model
- No need for any evidences (RNASeq, aligments, etc)
- Uses GPUs, fast runtime (few hours max)
- A tutorial is available: Genome annotation with [**Helixer**](/training-material/topics/genome-annotation/tutorials/helixer/tutorial.html)
---

### Evaluation of annotation: metrics

- Number of genes
- Average number of exons
- Average gene length
- Average protein length
- ...

---

### Evaluation of annotation: BUSCO

#### Benchmarking Universal Single-Copy Orthologs

* Sets of genes having single-copy orthologs in all species of a clade (insects, plants, bacteria, ...)
* Genes supposed to be vital for the species
* Expected to be found in a good annotation
* Results
  * Found genes
  * Fragmented genes
  * Duplicated genes
  * Missing genes

---

### Evaluation of annotation: Compleasm
* "A faster and more accurate reimplementation of BUSCO" 
* Similar results as BUSCO:
  * Found genes
  * Fragmented genes
  * Duplicated genes

---

### Evaluation of annotation: OMArk

* Assign proteins to HOGs using k-mer composition
* HOG = Hierarchical Orthologous Groups (gene families from OMA db)
* Differences vs BUSCO:
  * Completeness: also considers genes conserved in multiple copies
  * Consistency: checks if (all) proteins matches the auto-detected lineage or not

---

### Visualisation of Results

Genome Browsers (JBrowse, UCSC, ...)

![Screenshot of Apollo showing a track list on left, and then several rows of evidence like maker annotation and blastn showing gene models, followed by my bam file.](../../images/intro-viz.png)

<small>
**JBrowse**: [Slides](/training-material/topics/visualisation/tutorials/jbrowse/slides.html) - [Tutorial](/training-material/topics/visualisation/tutorials/jbrowse/tutorial.html)
</small>

---

### Repeated Elements

- Transposons, Retrotransposons, low complexity regions
- Disrupt gene calling
- Prediction pipelines:
  - [RepeatMasker](/training-material/topics/genome-annotation/tutorials/repeatmasker/tutorial.html)
  - RepeatModeler
  - REPET
  - Red
- Databases of repeated elements
  - Can be used by pipelines
  - Dfam
  - RepBase (non free)

---

### Exotic Elements

- tRNA, rRNA, ncRNA, ...
- Dedicated tools for prediction
  - Aragorn
  - tRNAscan
  - ...

---

### Summary

- Difficult problem
- Automated pipelines
- Need for evidences
- Never perfect
  - Missing/incomplete genes
  - Split/fused genes
  - Pseudogenes

---

## Manual Annotation

- Recruit experts of some gene families
- Manual curation of their favorite genes
  - Better annotation
  - Things to say in the genome paper
- Limits
  - There aren't experts for all genes
  - They can only annotate what is in the sequence
  - Poor assembly ⇒ Poor annotation
  - We need a user-friendly environment

---

### Editors

Apollo (based upon JBrowse), Artemis, others

![Screenshot of apollo with dozens of tracks in the list. Several gene model predictions are shown as well as a few bigwig XY density from illumina sequencing data.](../../images/intro-apollo.png)

Check out the Apollo tutorials for more details: [prokaryotes](/training-material/topics/genome-annotation/tutorials/apollo/tutorial.html) - [eukaryotes](/training-material/topics/genome-annotation/tutorials/apollo-euk/tutorial.html)

---

### Steps

.pull-left[
Annotations steps

- Check structure (exons, introns, start, stop, utr, ...)
- Search for isoforms
- Ensure consistent naming conventions
- Add functional annotations (based on homologies with other species)
]

.pull-right[
![A report showing 15 errors like gene having no alternate alleles or missing a symbol, or missing a human readable name. 2 warnings about model structures are shown, and the other 119 genes are shown as valid.](../../images/intro-steps.png)
]

---

## Functional Annotation

Collection of information on the function of identified genes
- Biological function
- Regulation, expression, ...

Data Sources
- Wet lab experiments (reliable but long and expensive)
- Manual assignment (cf Apollo)
- **Automatic assignment**

---

### Methods

- Similarity search / homology
- Pattern search
- Orthologies
- Comparison against databases:
  - GenBank, NR: sequence databanks
  - InterPro: pattern databank (active sites, protein families, peptide signal ...)
  - EggNOG: databank of orthology relationships + functional annotation

---

### Blast

- Blast/Diamond against NR
  - For each protein (or CDS) of the annotation
  - Find the best xx hits
- Huge database, good chances to have a match
- Risk:
  - Spread of "putative xx protein"
  - Spread of low-evidence annotations

---

### InterProScan

- For each protein (or CDS) of the annotation
  - Search for all InterPro patterns
- Many motifs
- Some of them manually curated
- Gene Ontology Terms available for domains

---

### EggNOG

- EggNOG: >4M known orthology groups in >5000 organisms
- Tool using it: **EggNOG-mapper**
- For each protein of the annotation
  - Search for matches with known orthology groups
- Assign corresponding gene name and functional annotation

---

### Gene Ontology

.pull-left[
Controlled vocabulary to describe:
- molecular function
- biological process
- cellular component
  - e.g.: `GO:0044430` = cytoskeletal part

![A tree browser showing the top level groups (MF, CC, BP) followed by a drill down into molecular function with several groups like acetylcholine receptor activator activity, inhibitor, and regulator activity. antioxidant activity, binding, etc. Each group shows large numbers of children.](../../images/intro-go.png)
]

.pull-right[
![A stacked bar chart with species along the bottom and the number of annotations. The split is experimental and  non-experimental annotations which is split ~15% experimental/85% non, for most species. Human has the highest with 200k experimental/250k non-experimental.](../../images/intro-go-source.png)
]

---

### Gene Ontology

Various sources

- InterProScan
  - Assigned from matches with annotated motifs
- EggNOG-mapper
  - Assigned from matches with annotated orthology groups
- Blast2GO
  - Based on Blast (and InterProScan) results
  - For each protein (or CDS) of the annotation, tag with GO terms

---

### Orthology

- For each annotated gene
  - Search of orthologous genes in related species
  - Search for paralogues

- Bioinformatics method:
  - Blast all against all transcripts
  - Filtering the best hits
  - Clustering
  - OrthoFinder, OrthoMCL, ...

![Diagram of orthology with species a and species B as two circles. Inside A is a1, 2, and 3 labelled as in-paralogs. Inside B is b1/b2 also in-paralogs. A1 and B1 are orthologs, and the rest of the potential connections between A and B are labelled co-orthologs.](../../images/intro-ortho.png)

---

### Visualisation

- Genomic databases (NCBI, FlyBase, etc.)
- Other sites (Tripal sites)
  - reference data (assembly, annotation, ...)
  - interfaces to visualize this data
  - interfaces for querying (e.g. [bipaa.genouest.org](https://bipaa.genouest.org))

---

### Comparing annotations

- Needed to choose between different results on a same genome sequence
- Compare general statistics
  - Number of genes
  - Average number of exons
  - Average gene length
  - Average protein length
  - ...
- Compare gene content
  - Alignment of gene structures
  - Functional annotation
    - over/underrepresentation of functions
- Tools: **AEGeAN**, **Funannotate compare**

---

## Genome Annotation

- Difficult problem
- Automatic:
	- Structural:
    - EST/RNA-Seq data provides good evidence
	  - *Ab initio* methods are improving
	- Functional:
	  - Concrete evidence cost-prohibitive to obtain
    - Various sources of automatic assignment
	  - Risks of automatically spreading "putative" annotations
- Manual:
	- Slow
	- Requires experts and evidences

---

## Related tutorials

---

## Thank You!

This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!

<div class="contributors-line">
		
<table class="contributions">
	
	<tr>
		<td><abbr title="These people wrote the bulk of the tutorial, they may have done the analysis, built the workflow, and wrote the text themselves.">Author(s)</abbr></td>
		<td>
			<a href="/training-material/hall-of-fame/abretaud/" class="contributor-badge contributor-abretaud"><img src="/training-material/assets/images/orcid.png" alt="orcid logo" width="36" height="36"/><img src="https://avatars.githubusercontent.com/abretaud?s=36" alt="Anthony Bretaudeau avatar" width="36" class="avatar" />
    Anthony Bretaudeau</a><a href="/training-material/hall-of-fame/hexylena/" class="contributor-badge contributor-hexylena"><img src="/training-material/assets/images/orcid.png" alt="orcid logo" width="36" height="36"/><img src="https://avatars.githubusercontent.com/hexylena?s=36" alt="Helena Rasche avatar" width="36" class="avatar" />
    Helena Rasche</a>
		</td>
	</tr>

<tr class="reviewers">
		<td><abbr title="These people reviewed this material for accuracy and correctness">Reviewers</abbr></td>
		<td>
			<a href="/training-material/hall-of-fame/hexylena/" class="contributor-badge contributor-badge-small contributor-hexylena"><img src="https://avatars.githubusercontent.com/hexylena?s=36" alt="Helena Rasche avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/bgruening/" class="contributor-badge contributor-badge-small contributor-bgruening"><img src="https://avatars.githubusercontent.com/bgruening?s=36" alt="Björn Grüning avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/bebatut/" class="contributor-badge contributor-badge-small contributor-bebatut"><img src="https://avatars.githubusercontent.com/bebatut?s=36" alt="Bérénice Batut avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/abretaud/" class="contributor-badge contributor-badge-small contributor-abretaud"><img src="https://avatars.githubusercontent.com/abretaud?s=36" alt="Anthony Bretaudeau avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/rlibouba/" class="contributor-badge contributor-badge-small contributor-rlibouba"><img src="https://avatars.githubusercontent.com/rlibouba?s=36" alt="Romane LIBOUBAN avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/shiltemann/" class="contributor-badge contributor-badge-small contributor-shiltemann"><img src="https://avatars.githubusercontent.com/shiltemann?s=36" alt="Saskia Hiltemann avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/gallardoalba/" class="contributor-badge contributor-badge-small contributor-gallardoalba"><img src="https://avatars.githubusercontent.com/gallardoalba?s=36" alt="Cristóbal Gallardo avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/stephanierobin/" class="contributor-badge contributor-badge-small contributor-stephanierobin"><img src="https://avatars.githubusercontent.com/stephanierobin?s=36" alt="Stéphanie Robin avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/simonbray/" class="contributor-badge contributor-badge-small contributor-simonbray"><img src="https://avatars.githubusercontent.com/simonbray?s=36" alt="Simon Bray avatar" width="36" class="avatar" /></a><a href="/training-material/hall-of-fame/scorreard/" class="contributor-badge contributor-badge-small contributor-scorreard"><img src="https://avatars.githubusercontent.com/scorreard?s=36" alt="Solenne Correard avatar" width="36" class="avatar" /></a></td>
	</tr>

</table>

</div>

</div>

Tutorial Content is licensed under <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/>