Genome annotation with Maker (short)

Overview
Questions:
• How to annotate a eukaryotic genome?

• How to evaluate and visualize annotated genomic features?

Objectives:

• Annotate genome with Maker

• Evaluate annotation quality with BUSCO

• View annotations in JBrowse

Requirements:
Time estimation: 2 hours
Level: Intermediate
Last modification: Aug 22, 2022

Introduction

Genome annotation of eukaryotes is a little more complicated than for prokaryotes: eukaryotic genomes are usually larger than prokaryotic ones and contain more genes. The sequences that mark the beginning and the end of a gene are generally less conserved than in prokaryotes. Many genes also contain introns, and the boundaries of these introns (acceptor and donor sites) are not highly conserved.

In this tutorial we will use a software tool called Maker (Campbell et al. 2014) to annotate the genome sequence of a small eukaryote: Schizosaccharomyces pombe (a yeast).

Maker is able to annotate both prokaryotes and eukaryotes. It works by aligning as much evidence as possible along the genome sequence, and then reconciling all these signals to determine probable gene structures.

The evidence can be transcript or protein sequences from the same (or a closely related) organism. These sequences can come from public databases (like NR or GenBank) or from your own experimental data (a transcriptome assembly from an RNA-Seq experiment, for example). Maker is also able to take repeated elements into account.

Maker uses ab-initio predictors (like Augustus or SNAP) to improve its predictions: these software tools are able to make gene structure predictions by analysing only the genome sequence with a statistical model.

In this tutorial you will learn how to perform a genome annotation, and how to evaluate its quality. Finally, you will learn how to use the JBrowse genome browser to visualise the results.

This tutorial was inspired by the MAKER Tutorial for WGS Assembly and Annotation Winter School 2018. Don’t hesitate to consult it for more information on Maker and on how to run it from the command line.

Note: Two versions of this tutorial

Because this tutorial consists of many steps, we have made two versions of it, one long and one short.

This is the shortened version. We will skip the training of ab-initio predictors and use pre-trained data instead. We will also annotate only the third chromosome of the genome. If you would like to learn how to perform the training steps, please see the longer version of this tutorial.


To annotate a genome using Maker, you need the following files:

• The genome sequence in fasta format
• A set of transcripts or EST sequences, preferably from the same organism.
• A set of protein sequences, usually from closely related species or from a curated sequence database like UniProt/SwissProt.

Maker will align the transcript and protein sequences on the genome sequence to determine gene positions.

1. Create and name a new history for this tutorial.

Click the new-history icon at the top of the history panel.

If the new-history icon is missing:

1. Click on the galaxy-gear icon (History options) on the top of the history panel
2. Select the option Create New from the menu
2. Import the following files from Zenodo or from the shared data library

https://zenodo.org/api/files/647ad552-19a8-46d9-aad8-f81f56860582/S_pombe_chrIII.fasta

• Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

• Select Paste/Fetch Data
• Paste the link into the text field

• Press Start

• Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

• Go into Shared data (top panel) then Data libraries
• Navigate to the correct folder as indicated by your instructor
• Select the desired files
• Click on the To History button near the top and select as Datasets from the dropdown menu
• In the pop-up window, select the history you want to import the files to (or create a new one)
• Click on Import
3. Rename the datasets
4. Check that the datatype for augustus_training_2.tar.gz is set to augustus

• Click on the galaxy-pencil icon for the dataset to edit its attributes
• In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
• Select augustus
• tip: you can start typing the datatype into the field to filter the dropdown menu
• Click the Save button

You have the following main datasets:

• S_pombe_trinity_assembly.fasta contains EST sequences from S. pombe, assembled from RNASeq data with Trinity
• Swissprot_no_S_pombe.fasta contains a subset of the SwissProt protein sequence database (public sequences from S. pombe were removed to stay as close as possible to real-life analysis)
• S_pombe_chrIII.fasta contains only the third chromosome from the full genome of S. pombe

The other datasets will be used later in the tutorial.

Genome quality evaluation

The quality of a genome annotation is highly dependent on the quality of the genome sequences. It is impossible to obtain a good quality annotation with a poorly assembled genome sequence. Annotation tools will have trouble finding genes if the genome sequence is highly fragmented, if it contains chimeric sequences, or if there are a lot of sequencing errors.

Before running the full annotation process, you first need to evaluate the quality of the genome sequence. This will give you a good idea of what you can expect at the end of the annotation.

Get genome sequence statistics
1. Fasta Statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/1.0.1 with the following parameters:
• param-file “fasta or multifasta file”: select S_pombe_chrIII.fasta from your history

Have a look at the statistics:

• num_seq: the number of contigs (or scaffolds or chromosomes); compare it to the expected number of chromosomes
• len_min, len_max, len_N50, len_mean, len_median: the distribution of contig sizes
• num_bp_not_N: the number of bases that are not N; it should be as close as possible to the total number of bases (num_bp)
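To make these statistics concrete, here is a minimal Python sketch that recomputes a few of them from FASTA text. It is only an illustration: the function name `fasta_stats` and the toy sequences are made up for this example, and the N50 definition used here (the length of the sequence at which the cumulative length, counted from the longest sequence down, first reaches half of the total) is a common convention that may differ slightly from what the Fasta Statistics tool implements.

```python
def fasta_stats(fasta_text):
    """Compute basic assembly statistics from FASTA-formatted text."""
    lengths = []      # one length per sequence record
    non_n = 0         # number of bases that are not N/n
    current = None    # length of the record currently being read
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record and opens a new one
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
            non_n += sum(1 for base in line if base not in "Nn")
    if current is not None:
        lengths.append(current)

    total = sum(lengths)
    # N50 (assumed definition, see the text above)
    n50 = 0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "num_seq": len(lengths),
        "num_bp": total,
        "num_bp_not_N": non_n,
        "len_min": min(lengths) if lengths else 0,
        "len_max": max(lengths) if lengths else 0,
        "len_N50": n50,
    }


# Toy input: three short made-up contigs, one of them ending in two Ns
example = ">contig1\nACGTACGTNN\n>contig2\nACGT\n>contig3\nACGTAC\n"
stats = fasta_stats(example)
print(stats)
```

On a single complete chromosome such as S_pombe_chrIII.fasta, num_seq is expected to be 1, so len_min, len_max and len_N50 all equal the chromosome length.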

These statistics are useful to detect obvious problems in the genome assembly, but they give no information about the quality of the sequence content. We want to evaluate whether the genome sequence contains all the genes we expect to find in the species under study, and whether their sequences are correct.

Keep in mind that we are running this tutorial only on chromosome III instead of the whole genome.

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool that helps answer this question. By comparing genomes from various more or less related species, the authors determined sets of orthologous genes that are present in single copy in (almost) all the species of a clade (Bacteria, Fungi, Plants, Insects, Mammals, …). Most of these genes are essential for the organism to live, and are expected to be found in any newly sequenced genome from the corresponding clade. Using this data, BUSCO is able to evaluate the proportion of these essential genes (also named BUSCOs) found in a genome sequence or in a set of (predicted) transcript or protein sequences. This is a good evaluation of the “completeness” of the genome or annotation.

We will first run this tool on the genome sequence to evaluate its quality.

Run Busco on the genome
1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/4.1.4 with the following parameters:
• param-file “Sequences to analyse”: select S_pombe_chrIII.fasta from your history
• “Mode”: Genome
• “Lineage”: Fungi

We select Fungi because we will annotate the genome of Schizosaccharomyces pombe, which belongs to the Fungi kingdom. It is usually better to select the most specific lineage for the species you study. Large lineages (like Metazoa) will consist of fewer genes, but with strong support. More specific lineages (like Hymenoptera) will have more genes, but with weaker support (as they are found in fewer genomes).

BUSCO produces three output datasets:

• A short summary: summarizes the results of BUSCO (see below)
• A full table: lists all the BUSCOs that were searched for, with the corresponding status (was it found in the genome? how many times? where?)
• A table of missing BUSCOs: this is the list of all genes that were not found in the genome
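As an illustration of how the full table relates to the short summary, the sketch below counts BUSCOs per status. It assumes, and this is an assumption about the file layout rather than something stated in this tutorial, that the full table is tab-separated, that comment lines start with `#`, and that the first two columns hold the BUSCO identifier and its status. The function name and the identifiers in the toy table are hypothetical.

```python
from collections import Counter


def count_busco_statuses(full_table_text):
    """Count BUSCOs per status, counting each BUSCO identifier once."""
    status_by_id = {}
    for line in full_table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a data row in the assumed layout
        busco_id, status = fields[0], fields[1]
        # A duplicated BUSCO may appear on several rows (one per copy);
        # keying on the identifier avoids counting it more than once.
        status_by_id[busco_id] = status
    return Counter(status_by_id.values())


# Toy table with made-up identifiers and coordinates
example_table = "\n".join([
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tScore\tLength",
    "1001at4751\tComplete\tchrIII\t100\t900\t512.3\t260",
    "1002at4751\tMissing",
    "1003at4751\tDuplicated\tchrIII\t2000\t2900\t300.1\t280",
    "1003at4751\tDuplicated\tchrIII\t5000\t5900\t298.7\t279",
    "1004at4751\tFragmented\tchrIII\t7000\t7300\t120.0\t95",
])
counts = count_busco_statuses(example_table)
print(dict(counts))
```

Check the header of your own full table before reusing this, since column order can change between BUSCO versions.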

Do you think the genome quality is good enough for performing the annotation?

The genome consists of the expected number of chromosome sequences (1), with very few N, which is the ideal case. As we only analysed chromosome III, many BUSCO genes are missing, but ~100 are still found as complete single copy, and very few are found fragmented, which means that our genome sequence is of good quality, at least on this single chromosome. This is very good material for an annotation.

Keep in mind that we are running this tutorial only on chromosome III instead of the whole genome. The BUSCO result will therefore show a lot of missing genes: this is expected, as the BUSCO genes that are not on chromosome III cannot be found by the tool.

Maker

Let’s run Maker to predict gene models! Maker will align ESTs and proteins to the genome, and it will run ab initio predictors (SNAP and Augustus) using pre-trained models for this organism (have a look at the longer version of this tutorial to understand how they were trained).

Annotation with Maker
1. Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 with the following parameters:
• param-file “Genome to annotate”: select S_pombe_chrIII.fasta from your history
• “Organism type”: Eukaryotic
• “Re-annotate using an existing Maker annotation”: No
• In “EST evidences (for best results provide at least one of these)”:
• param-file “ESTs or assembled cDNA”: S_pombe_trinity_assembly.fasta
• In “Protein evidences (for best results provide at least one of these)”:
• param-file “Protein sequences”: Swissprot_no_S_pombe.fasta
• In “Ab-initio gene prediction”:
• “SNAP model”: snap_training_2.snaphmm
• “Prediction with Augustus”: Run Augustus with a custom prediction model
• param-file “Augustus model”: augustus_training_2.tar.gz
• “Repeat library source”: Disable repeat masking (not recommended)

For this tutorial repeat masking is disabled, which is not the recommended setting. When doing a real-life annotation, you should either use Dfam or provide your own repeats library.

Maker produces three GFF3 datasets:

• The final annotation: the final consensus gene models produced by Maker
• The evidences: the alignments of all the data Maker used to construct the final annotation (ESTs and proteins that we used)
• A GFF3 file containing both the final annotation and the evidences

Annotation statistics

We now need to evaluate the annotation produced by Maker.

First, use the Genome annotation statistics tool, which computes some general statistics on the annotation.

Get annotation statistics
1. Genome annotation statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/jcvi_gff_stats/jcvi_gff_stats/0.8.4 with the following parameters:
• param-file “Annotation to analyse”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 )
• “Reference genome”: Use a genome from history
• param-file “Corresponding genome sequence”: select S_pombe_chrIII.fasta from your history
1. How many genes were predicted by Maker?
2. What is the mean gene locus size of these genes?

Answers:
1. 864 genes
2. 1793 bp
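To see where numbers like these come from, the following Python sketch counts `gene` features in GFF3 text and averages their lengths. It is a simplified stand-in, not the method used by the Genome annotation statistics tool: it assumes that a gene locus corresponds to one `gene` feature line and that its size is `end - start + 1` (GFF3 coordinates are 1-based and inclusive). The function name and the toy records are made up.

```python
def gene_stats(gff3_text):
    """Return (number of genes, mean gene length) from GFF3 text."""
    gene_lengths = []
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        # GFF3 has 9 tab-separated columns; column 3 is the feature type
        if len(fields) < 9 or fields[2] != "gene":
            continue
        start, end = int(fields[3]), int(fields[4])
        gene_lengths.append(end - start + 1)
    count = len(gene_lengths)
    mean_length = sum(gene_lengths) / count if count else 0.0
    return count, mean_length


# Toy annotation with two made-up genes on chrIII
example_gff = "\n".join([
    "##gff-version 3",
    "chrIII\tmaker\tgene\t1001\t2000\t.\t+\t.\tID=gene1",
    "chrIII\tmaker\tmRNA\t1001\t2000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chrIII\tmaker\tgene\t5001\t7000\t.\t-\t.\tID=gene2",
])
n_genes, mean_size = gene_stats(example_gff)
print(n_genes, mean_size)
```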

Busco

Just as we did for the genome at the beginning, we can use BUSCO to check the quality of this Maker annotation. Instead of looking for known genes in the genome sequence, BUSCO will inspect the transcript sequences of the genes predicted by Maker. This will allow us to see if Maker was able to properly identify the set of genes that BUSCO found in the genome sequence at the beginning of this tutorial.

First we need to compute all the transcript sequences from the Maker annotation, using GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 . This tool will compute the sequence of each transcript that was predicted by Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 and write them all in a FASTA file.

Extract transcript sequences
1. GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 with the following parameters:
• param-file “Input GFF3 or GTF feature file”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 )
• “Reference Genome”: select S_pombe_chrIII.fasta from your history
• “Select fasta outputs”:
• fasta file with spliced exons for each GFF transcript (-w exons.fa)
• “full GFF attribute preservation (all attributes are shown)”: Yes
• “decode url encoded characters within attributes”: Yes
• “warn about duplicate transcript IDs and other potential problems with the given GFF/GTF records”: Yes
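The "spliced exons" output can be pictured with a short Python sketch: for each transcript, the exon intervals are cut out of the chromosome sequence, concatenated in genomic order, and reverse-complemented if the transcript is on the minus strand. This is a conceptual illustration and not GFFread's actual code; the function names and the toy sequence are made up, and coordinates are assumed to be 1-based and inclusive as in GFF3.

```python
# Translation table used to complement DNA bases (case preserved)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_transcript(chrom_seq, exons, strand):
    """Build a transcript sequence from exon coordinates.

    exons: list of (start, end) tuples, 1-based inclusive.
    strand: "+" or "-".
    """
    # Slice each exon out of the chromosome, in genomic order
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(exons)]
    spliced = "".join(pieces)
    # Minus-strand transcripts are read from the opposite strand
    return reverse_complement(spliced) if strand == "-" else spliced


# Toy chromosome: positions 4-6 are CCC and positions 13-16 are ACGT
chrom = "AAACCCGGGTTTACGT"
plus_tx = spliced_transcript(chrom, [(4, 6), (13, 16)], "+")
minus_tx = spliced_transcript(chrom, [(13, 16), (4, 6)], "-")
print(plus_tx, minus_tx)
```

The intron between the two exons (positions 7 to 12 here) is absent from both transcripts, which is why these sequences are suitable input for BUSCO in transcriptome mode.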

Now run BUSCO with the predicted transcript sequences:

Run BUSCO
1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/4.1.4 with the following parameters:
• param-file “Sequences to analyse”: exons (output of GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 )
• “Mode”: Transcriptome
• “Lineage”: Fungi

How do the BUSCO statistics compare to the ones at the genome level?

128 complete single-copy, 0 duplicated, 10 fragmented, 620 missing. This is in fact better than what BUSCO found in the genome sequence, which means the quality of this annotation is very good. (By default, BUSCO in genome mode can miss some genes; the advanced options can improve this at the cost of computing time.) Results can differ very slightly in your own history; this is normal.
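The counts above can be turned into percentages with a few lines of Python. The sketch below formats them in a compact one-line style similar to the one BUSCO prints in its short summary; the function name is made up, and the exact formatting should be treated as an approximation rather than a reproduction of BUSCO's output.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Format BUSCO counts as a compact percentage summary string."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one BUSCO is required")

    def pct(count):
        return 100.0 * count / total

    # "Complete" covers both single-copy and duplicated BUSCOs
    complete = single + duplicated
    return (
        f"C:{pct(complete):.1f}%"
        f"[S:{pct(single):.1f}%,D:{pct(duplicated):.1f}%],"
        f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{total}"
    )


# Counts reported in this tutorial for the Maker transcripts
summary = busco_summary(single=128, duplicated=0, fragmented=10, missing=620)
print(summary)
```

With only chromosome III annotated, a completeness of about 17% is expected: most of the 758 BUSCOs searched for are located on the other chromosomes.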

Improving gene naming

If you look at the content of the final annotation dataset, you will notice that the gene names are long, complicated, and not very readable. That’s because Maker assigns them automatic names based on the way it computed each gene model. We are now going to automatically assign more readable names.

Change gene names
1. Map annotation ids Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker_map_ids/maker_map_ids/2.31.11 with the following parameters:
• param-file “Maker annotation where to change ids”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 )
• “Prefix for ids”: TEST_
• “Justify numeric ids to this length”: 6

Genes will be renamed to look like: TEST_001234. You can replace TEST_ by anything you like, usually an uppercase short prefix.
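The naming scheme is simply a prefix followed by a zero-padded number. The Python sketch below illustrates it; the function names, the numbering starting at 1, and the example Maker-style names are assumptions made for this illustration, not the behaviour of the Map annotation ids tool itself.

```python
def make_gene_id(index, prefix="TEST_", width=6):
    """Build an identifier such as TEST_001234 from a number."""
    # The nested format spec zero-pads the number to `width` digits
    return f"{prefix}{index:0{width}d}"


def build_id_map(old_names, prefix="TEST_", width=6):
    """Map each original name to a new sequential identifier."""
    return {
        name: make_gene_id(i, prefix, width)
        for i, name in enumerate(old_names, start=1)
    }


# Hypothetical automatic names, for illustration only
old_names = [
    "maker-chrIII-snap-gene-0.1",
    "maker-chrIII-augustus-gene-0.2",
]
id_map = build_id_map(old_names)
print(make_gene_id(1234))
print(id_map)
```

A width of 6 leaves room for up to 999999 genes before identifiers grow longer than the padded length.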

Look at the generated dataset: it should be much more readable, and ready for an official release.

Visualising the results

With Galaxy, you can visualize the annotation you have generated using JBrowse. This allows you to navigate along the chromosomes of the genome and see the structure of each predicted gene.

Visualize annotations in JBrowse
1. JBrowse Tool: toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.10+galaxy0 with the following parameters:
• “Reference genome to display”: Use a genome from history
• param-file “Select the reference genome”: select S_pombe_chrIII.fasta from your history
• “JBrowse-in-Galaxy Action”: New JBrowse Instance
• In “Track Group”:
• Click on “Insert Track Group”:
• In “1: Track Group”:
• “Track Category”: Maker annotation
• In “Annotation Track”:
• Click on “Insert Annotation Track”:
• In “1: Annotation Track”:
• “Track Type”: GFF/GFF3/BED Features

• param-files “GFF/GFF3/BED Track Data”: select the output of Map annotation ids Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker_map_ids/maker_map_ids/2.31.11

Enable the track on the left side of JBrowse, then navigate along the genome and look at the genes that were predicted by Maker.

Conclusion

Congratulations, you finished this tutorial! You learned how to annotate a eukaryotic genome using Maker, how to evaluate the quality of the annotation, and how to visualize it using the JBrowse genome browser.

What’s next?

After generating your annotation, you will probably want to automatically assign functional annotation to each predicted gene model. You can do it by using Blast, InterProScan, or Blast2GO for example.

An automatic annotation of a eukaryotic genome is rarely perfect. If you inspect some predicted genes, you will probably find some mistakes made by Maker, e.g. wrong exon/intron boundaries, split genes, or merged genes. Setting up a manual curation project using Apollo helps a lot to fix these errors manually. Check out the Apollo tutorial for more details.

Key points
• Maker can be used to annotate a eukaryotic genome.

• BUSCO and JBrowse help you inspect the quality of an annotation.

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Genome Annotation topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

References

1. Campbell, M. S., C. Holt, B. Moore, and M. Yandell, 2014 Genome annotation and curation using MAKER and MAKER-P. Current Protocols in Bioinformatics 48: 4–11. 10.1002/0471250953.bi0411s48


Citing this Tutorial

1. Anthony Bretaudeau, Genome annotation with Maker (short) (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/annotation-with-maker-short/tutorial.html Online; accessed TODAY
2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012



@misc{genome-annotation-annotation-with-maker-short,
author = "Anthony Bretaudeau",
title = "Genome annotation with Maker (short) (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/annotation-with-maker-short/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Computational Biology}
}
