Genome annotation with Maker

Overview

Questions:
  • How to annotate a eukaryotic genome?

  • How to evaluate and visualize annotated genomic features?

Objectives:
  • Load genome into Galaxy

  • Annotate genome with Maker

  • Evaluate annotation quality with BUSCO

  • View annotations in JBrowse

Requirements:
Time estimation: 4 hours
Level: Advanced
Supporting Materials:
Last modification: Dec 31, 2020
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

Genome annotation of eukaryotes is a little more complicated than for prokaryotes: eukaryotic genomes are usually larger than prokaryotic ones, with more genes. The sequences determining the beginning and the end of a gene are generally less conserved than in prokaryotes. Many genes also contain introns, and the limits of these introns (acceptor and donor sites) are not highly conserved.

In this tutorial we will use a software tool called Maker (Campbell et al. 2014) to annotate the genome sequence of a small eukaryote: Schizosaccharomyces pombe (a yeast).

Maker is able to annotate both prokaryotes and eukaryotes. It works by aligning as much evidence as possible along the genome sequence, and then reconciling all these signals to determine probable gene structures.

The evidence can be transcript or protein sequences from the same (or a closely related) organism. These sequences can come from public databases (like NR or GenBank) or from your own experimental data (a transcriptome assembly from an RNASeq experiment, for example). Maker is also able to take repeated elements into account.

Maker uses ab-initio predictors (like Augustus or SNAP) to improve its predictions: these software tools are able to make gene structure predictions by analysing only the genome sequence with a statistical model.

In this tutorial you will learn how to perform a genome annotation, and how to evaluate its quality. You will see how training ab-initio predictors is an important step to produce good results. Finally, you will learn how to use the JBrowse genome browser to visualise the results.

More information about Maker can be found here.

This tutorial was inspired by the MAKER Tutorial for WGS Assembly and Annotation Winter School 2018. Don’t hesitate to consult it for more information on Maker, and on how to run it from the command line.

comment Note: Two versions of this tutorial

Because this tutorial consists of many steps, we have made two versions of it, one long and one short.

This is the extended version. We will perform the complete training of ab-initio predictors and discuss the results in detail. If you would like to run through the tutorial a bit quicker and focus on the main analysis steps, please see the shorter version of this tutorial.

Agenda

In this tutorial, we will cover:

  1. Data upload
  2. Genome quality evaluation
  3. First Maker annotation round
    1. Maker
    2. Annotation statistics
    3. Busco
  4. Ab-initio predictors first training
  5. Second Maker annotation round
    1. Maker
    2. Annotation statistics
    3. Busco
  6. Ab-initio predictors second training
  7. Third (last) Maker annotation round
    1. Maker
    2. Annotation statistics
    3. Busco
  8. Improving gene naming
  9. Visualising the results
    1. More visualisation
  10. What’s next?

Data upload

To annotate a genome using Maker, you need the following files:

  • The genome sequence in fasta format
  • A set of transcripts or EST sequences, preferably from the same organism.
  • A set of protein sequences, usually from closely related species or from a curated sequence database like UniProt/SwissProt.

Maker will align the transcript and protein sequences on the genome sequence to determine gene positions.

hands_on Hands-on: Data upload

  1. Create and name a new history for this tutorial.

    Tip: Creating a new history

    Click the new-history icon at the top of the history panel.

    If the new-history icon is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
  2. Import the following files from Zenodo or from the shared data library

    https://zenodo.org/api/files/647ad552-19a8-46d9-aad8-f81f56860582/S_pombe_chrIII.fasta
    https://zenodo.org/api/files/647ad552-19a8-46d9-aad8-f81f56860582/S_pombe_genome.fasta
    https://zenodo.org/api/files/647ad552-19a8-46d9-aad8-f81f56860582/S_pombe_trinity_assembly.fasta
    https://zenodo.org/api/files/647ad552-19a8-46d9-aad8-f81f56860582/Swissprot_no_S_pombe.fasta
    https://zenodo.org/api/files/647ad552-19a8-46d9-aad8-f81f56860582/augustus_training_1.tar.gz
    https://zenodo.org/api/files/647ad552-19a8-46d9-aad8-f81f56860582/augustus_training_2.tar.gz
    
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    • By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

    Tip: Importing data from a data library

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries
    • Navigate to the correct folder as indicated by your instructor
    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import
  3. Rename the datasets
  4. Check that the datatype for augustus_training_1.tar.gz and augustus_training_2.tar.gz is set to augustus

    Tip: Changing the datatype

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
    • Select augustus
    • Click the Change datatype button

You have the following main datasets:

  • S_pombe_trinity_assembly.fasta contains EST sequences from S. pombe, assembled from RNASeq data with Trinity
  • Swissprot_no_S_pombe.fasta contains a subset of the SwissProt protein sequence database (public sequences from S. pombe were removed to stay as close as possible to real-life analysis)
  • S_pombe_chrIII.fasta contains only the third chromosome from the full genome of S. pombe
  • S_pombe_genome.fasta contains the full genome sequence of S. pombe

hands_on Hands-on: Choose your Genome

  1. You need to choose between S_pombe_chrIII.fasta and S_pombe_genome.fasta:

    • If you have time: use the full genome (S_pombe_genome.fasta). It will take more computing time, but the results will be closer to real-life data.
    • If you want to get results faster: use chromosome III only (S_pombe_chrIII.fasta).
  2. Rename the file you will use to genome.fasta. For example, if you are using S_pombe_chrIII.fasta, rename it to genome.fasta

    Tip: Renaming a dataset

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field to genome.fasta
    • Click the Save button

The other datasets will be used later in the tutorial.

Genome quality evaluation

The quality of a genome annotation is highly dependent on the quality of the genome sequences. It is impossible to obtain a good quality annotation with a poorly assembled genome sequence. Annotation tools will have trouble finding genes if the genome sequence is highly fragmented, if it contains chimeric sequences, or if there are a lot of sequencing errors.

Before running the full annotation process, you first need to evaluate the quality of the sequence. This will give you a good idea of what you can expect at the end of the annotation.

hands_on Hands-on: Get genome sequence statistics

  1. Fasta Statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/1.0.1 with the following parameters:
    • param-file “fasta or multifasta file”: select genome.fasta from your history

Have a look at the statistics:

  • num_seq: the number of contigs (or scaffolds or chromosomes); compare it to the expected number of chromosomes
  • len_min, len_max, len_N50, len_mean, len_median: the distribution of contig sizes
  • num_bp_not_N: the number of bases that are not N; it should be as close as possible to the total number of bases (num_bp)

These statistics are useful to detect obvious problems in the genome assembly, but they give no information about the quality of the sequence content. We want to evaluate whether the genome sequence contains all the genes we expect to find in the considered species, and whether their sequences are correct.
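To make these statistics concrete, here is a minimal Python sketch that computes them from FASTA lines. It is only an illustration: the function names (`read_fasta`, `n50`, `fasta_stats`) and the toy sequences are made up for this example, and the Galaxy tool may handle edge cases differently.

```python
from statistics import mean, median


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def fasta_stats(lines):
    """Compute a few assembly statistics, named after the ones listed above."""
    seqs = [seq for _, seq in read_fasta(lines)]
    if not seqs:
        raise ValueError("no sequence found")
    lengths = [len(seq) for seq in seqs]
    num_bp = sum(lengths)
    num_n = sum(seq.upper().count("N") for seq in seqs)
    return {
        "num_seq": len(seqs),
        "num_bp": num_bp,
        "num_bp_not_N": num_bp - num_n,
        "len_min": min(lengths),
        "len_max": max(lengths),
        "len_N50": n50(lengths),
        "len_mean": mean(lengths),
        "len_median": median(lengths),
    }


# Toy input, invented for illustration only.
example = [">chr1", "ACGTNNACGT", ">chr2", "ACGT", ">chr3", "AC"]
stats = fasta_stats(example)
print(stats)
```

To try it on a real file, you could pass an open file handle instead of the toy list, for example `fasta_stats(open("genome.fasta"))`.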

comment Comment

If you chose to use only the chromosome III sequence (S_pombe_chrIII.fasta), the statistics will be different. You will only see 1 contig.

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool designed to answer this question: by comparing genomes from various more or less related species, the authors determined sets of orthologous genes that are present in single copy in (almost) all the species of a clade (Bacteria, Fungi, Plants, Insects, Mammals, …). Most of these genes are essential for the organism to live, and are expected to be found in any newly sequenced genome from the corresponding clade. Using this data, BUSCO is able to evaluate the proportion of these essential genes (also named BUSCOs) found in a genome sequence or a set of (predicted) transcript or protein sequences. This is a good evaluation of the “completeness” of the genome or annotation.
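The "proportion of essential genes found" is simple arithmetic once each BUSCO has been given a status. The sketch below assumes one status label per BUSCO, using the four categories BUSCO reports (complete single-copy, duplicated, fragmented, missing). The function name and the identifiers are hypothetical, and the sketch does not parse any real BUSCO output file.

```python
from collections import Counter

CATEGORIES = ("Complete", "Duplicated", "Fragmented", "Missing")


def busco_summary(statuses):
    """Return the percentage of BUSCOs in each category.

    `statuses` maps each BUSCO identifier to exactly one status label.
    """
    total = len(statuses)
    if total == 0:
        raise ValueError("no BUSCO status provided")
    counts = Counter(statuses.values())
    return {label: round(100 * counts[label] / total, 1) for label in CATEGORIES}


# Hypothetical identifiers and statuses, for illustration only.
toy = {f"busco_{i}": "Complete" for i in range(7)}
toy.update({"busco_7": "Duplicated", "busco_8": "Fragmented", "busco_9": "Missing"})

summary = busco_summary(toy)
print(summary)
```

A high "Complete" percentage with few fragmented or missing entries is what you hope to see for a good assembly.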

We will first run this tool on the genome sequence to evaluate its quality.

hands_on Hands-on: Run Busco on the genome

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/4.1.4 with the following parameters:
    • param-file “Sequences to analyse”: select genome.fasta from your history
    • “Mode”: Genome
    • “Lineage”: Fungi

    comment Comment

    We select Fungi as we will annotate the genome of Schizosaccharomyces pombe, which belongs to the Fungi kingdom. It is usually better to select the most specific lineage for the species you study. Large lineages (like Metazoa) will consist of fewer genes, but with strong support. More specific lineages (like Hymenoptera) will have more genes, but with weaker support (as they are found in fewer genomes).

BUSCO produces three output datasets:

  • A short summary: summarizes the results of BUSCO (see below)
  • A full table: lists all the BUSCOs that were searched for, with the corresponding status (was it found in the genome? how many times? where?)
  • A table of missing BUSCOs: this is the list of all genes that were not found in the genome

BUSCO genome summary

question Questions

Do you think the genome quality is good enough for performing the annotation?

solution Solution

The genome consists of the expected number of chromosome sequences (4, or 1 if you chose to annotate chromosome III only), with very few N bases, which is the ideal case.

If you used the full genome, most of the BUSCO genes are found as complete single copy, and very few are fragmented, which means that our genome is of good quality, as it contains most of the expected content. That’s very good material for an annotation. If you only analysed chromosome III, many BUSCO genes are missing, but ~100 are still found as complete single copy, and very few are found fragmented, which means that our genome is of good quality, at least on this single chromosome.

comment Comment

If you chose to use only the chromosome III sequence (S_pombe_chrIII.fasta), the statistics will be different. The genome size will be lower, with only 1 chromosome. The BUSCO result will also show a lot of missing genes: it is expected as all the BUSCO genes that are not on the chromosome III cannot be found by the tool. Keep it in mind when comparing these results with the other BUSCO results later.

First Maker annotation round

Maker

For this first round, we configure Maker to construct gene models only by aligning ESTs and proteins to the genome. This will produce a first draft annotation that we will improve in the next steps.

hands_on Hands-on: First draft annotation with Maker

  1. Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 with the following parameters:
    • param-file “Genome to annotate”: select genome.fasta from your history
    • “Re-annotate using an existing Maker annotation”: No
    • “Organism type”: Eukaryotic
    • In “EST evidences (for best results provide at least one of these)”:
      • “Infer gene predictions directly from all ESTs”: Yes
      • param-file “ESTs or assembled cDNA”: S_pombe_trinity_assembly.fasta
    • In “Protein evidences (for best results provide at least one of these)”:
      • “Infer gene predictions directly from all protein alignments”: Yes
      • param-file “Protein sequences”: Swissprot_no_S_pombe.fasta
    • In “Ab-initio gene prediction”:
      • “Prediction with Augustus”: Don't use Augustus to predict genes
    • In “Repeat masking”:
      • “Repeat library source”: Disable repeat masking (not recommended)

    comment Comment

    For this tutorial repeat masking is disabled, which is not the recommended setting. When doing a real-life annotation, you should either use Dfam or provide your own repeats library.

Maker produces three GFF3 datasets:

  • The final annotation: the final consensus gene models produced by Maker
  • The evidences: the alignments of all the data Maker used to construct the final annotation (ESTs and proteins that we used)
  • A GFF3 file containing both the final annotation and the evidences
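GFF3 is a tab-separated text format with nine columns (sequence name, source, type, start, end, score, strand, phase, attributes), so it is easy to inspect with a short script. The sketch below counts records per (source, type) pair, which is a quick way to see what a Maker output contains. The function name and the toy records are invented for illustration.

```python
from collections import Counter


def count_features(gff_lines):
    """Count GFF3 records per (source, type) pair."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section may follow the features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        counts[(cols[1], cols[2])] += 1
    return counts


# Toy records, invented for illustration only.
toy_gff = [
    "##gff-version 3",
    "chrIII\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chrIII\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chrIII\tmaker\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1",
    "chrIII\tmaker\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1",
    "chrIII\test2genome\texpressed_sequence_match\t90\t910\t.\t+\t.\tID=match1",
]

feature_counts = count_features(toy_gff)
for (source, feature_type), n in sorted(feature_counts.items()):
    print(source, feature_type, n)
```

On the final annotation you would expect records with gene-structure types such as gene, mRNA and exon, while the evidences dataset would mostly contain alignment (match) records.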

Annotation statistics

We now need to evaluate this first draft annotation produced by Maker.

First, use the Genome annotation statistics tool, which computes some general statistics on the annotation.

hands_on Hands-on: Get annotation statistics

  1. Genome annotation statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/jcvi_gff_stats/jcvi_gff_stats/0.8.4 with the following parameters:
    • param-file “Annotation to analyse”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 )
    • “Reference genome”: Use a genome from history
      • param-file “Corresponding genome sequence”: select genome.fasta from your history

question Questions

  1. How many genes were predicted by Maker?
  2. What is the mean gene locus size of these genes?

solution Solution

  1. 712 genes (503 for chromosome III only)
  2. 1570 bp (1688 bp for chromosome III only)

Busco

Just as we did for the genome at the beginning, we can use BUSCO to check the quality of this first Maker annotation. Instead of looking for known genes in the genome sequence, BUSCO will inspect the transcript sequences of the genes predicted by Maker. This will allow us to see if Maker was able to properly identify the set of genes that Busco found in the genome sequence at the beginning of this tutorial.

First we need to compute all the transcript sequences from the Maker annotation, using GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 . This tool will compute the sequence of each transcript that was predicted by Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 and write them all in a FASTA file.
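Conceptually, computing a transcript sequence means cutting each exon out of the genome sequence, joining the pieces, and reverse-complementing the result for genes on the minus strand. The sketch below illustrates that idea only; it is not GFFread's implementation, and the function names and toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_transcript(chrom_seq, exons, strand):
    """Join exon sequences into one transcript sequence.

    `exons` is a list of (start, end) pairs using GFF3 conventions
    (1-based, inclusive), given in genomic coordinates.
    """
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(exons)]
    seq = "".join(pieces)
    return reverse_complement(seq) if strand == "-" else seq


# Toy chromosome with two exons separated by a 3 bp intron (invented data).
chrom = "AAACCCGGGTTT"
exons = [(1, 3), (7, 9)]

print(spliced_transcript(chrom, exons, "+"))
print(spliced_transcript(chrom, exons, "-"))
```

The intron ("CCC" here) is absent from both outputs, which is what "spliced exons" means in the GFFread output option used below.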

hands_on Hands-on: Extract transcript sequences

  1. GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 with the following parameters:
    • param-file “Input GFF3 or GTF feature file”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 )
    • “Reference Genome”: select genome.fasta from your history
      • “Select fasta outputs”:
        • fasta file with spliced exons for each GFF transcript (-w exons.fa)
    • “full GFF attribute preservation (all attributes are shown)”: Yes
    • “decode url encoded characters within attributes”: Yes
    • “warn about duplicate transcript IDs and other potential problems with the given GFF/GTF records”: Yes

Now run BUSCO with the predicted transcript sequences:

hands_on Hands-on: Run BUSCO

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/4.1.4 with the following parameters:
    • param-file “Sequences to analyse”: exons (output of GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 )
    • “Mode”: Transcriptome
    • “Lineage”: Fungi

question Questions

How do the BUSCO statistics compare to the ones at the genome level?

solution Solution

Around 100 complete single-copy, and 650 missing. As the quality of this first draft is not yet very good, you should see better results after the next rounds of Maker.

The statistics are not really satisfactory at this stage, but it’s normal: Maker only used the EST and protein evidences to guess the gene positions. Let’s see now how this first draft can be improved.

Ab-initio predictors first training

Maker can use several ab-initio predictors to annotate a genome. “Ab-initio” means that these predictors are able to predict the structure of genes in a genome based only on its sequence and on a species-specific statistical model. They don’t use any evidence (e.g. EST or proteins) to predict genes, but they need to be trained with a set of already predicted genes.

Maker is able to use the EST and protein evidence, and to combine it with the results of several ab-initio predictors to predict consensus gene models. This allows Maker to detect genes in regions where no EST or protein aligns, and also to refine gene structures in regions where there are both EST and/or protein evidence and ab-initio predictions.

We will use two of the most widely used ab-initio predictors: SNAP and Augustus. Before using them within Maker, we need to train them with the first annotation draft we produced in the previous steps. We know the quality of this draft is not perfect, but only the best scoring genes (i.e. the ones having the strongest evidence) will be retained to train the predictors.
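To give an idea of what "best scoring" can mean, the sketch below keeps only transcripts whose Annotation Edit Distance (AED) is low. Maker writes an `_AED` attribute on mRNA records in its GFF3 output, where 0 means full agreement with the evidence and 1 means no support. The threshold of 0.25, the function names and the toy records are assumptions made for illustration; the selection criteria actually applied by the Galaxy training tools are not described in this tutorial.

```python
def parse_attributes(column9):
    """Parse the GFF3 attributes column into a dictionary."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def well_supported_mrnas(gff_lines, max_aed=0.25):
    """Return IDs of mRNA records whose _AED is at most `max_aed`.

    The default threshold is a hypothetical value chosen for this example.
    """
    kept = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            kept.append(attrs.get("ID", ""))
    return kept


# Toy records, invented for illustration only.
toy_gff = [
    "##gff-version 3",
    "chrIII\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1;_AED=0.05",
    "chrIII\tmaker\tmRNA\t2000\t2600\t.\t-\t.\tID=mrna2;Parent=gene2;_AED=0.60",
]

print(well_supported_mrnas(toy_gff))
```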

hands_on Hands-on: Train SNAP

  1. Train SNAP Tool: toolshed.g2.bx.psu.edu/repos/iuc/snap_training/snap_training/2013_11_29+galaxy1 with the following parameters:
    • param-file “Genome to annotate”: select genome.fasta from your history
    • param-file “Maker annotation to use for training”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 )

Augustus is trained in a very similar way.

hands_on Hands-on: Train Augustus

  1. Train Augustus Tool: toolshed.g2.bx.psu.edu/repos/bgruening/augustus_training/augustus_training/3.3.3 with the following parameters:
    • param-file “Genome to annotate”: select genome.fasta from your history
    • param-file “Annotation to use for training”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 )

Both SNAP and Augustus produce a statistical model representing the observed general structure of genes in the analysed genome. Maker will use these models to create a new annotation for our genome.

The Augustus training usually takes around 2 hours to complete. To continue this tutorial without waiting for the result, you can use the file augustus_training_1.tar.gz imported from Zenodo.

Second Maker annotation round

Maker

We now need to run a new round of Maker. As the evidence was already aligned to the genome in the first run, we can reuse these alignments as is. But this time, enable ab-initio gene prediction, and provide the outputs of the Train SNAP Tool: toolshed.g2.bx.psu.edu/repos/iuc/snap_training/snap_training/2013_11_29+galaxy1 and Train Augustus Tool: toolshed.g2.bx.psu.edu/repos/bgruening/augustus_training/augustus_training/3.3.3 tools as input. We also disable inferring gene predictions directly from all ESTs and proteins: now we want Maker to infer gene predictions by reconciling evidence alignments and ab-initio gene predictions.

hands_on Hands-on: Second draft annotation with Maker

  1. Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 with the following parameters:
    • param-file “Genome to annotate”: select genome.fasta from your history
    • “Organism type”: Eukaryotic
    • “Re-annotate using an existing Maker annotation”: Yes
      • param-file “Previous Maker annotation”: evidences (output of the previous Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 run)
      • “Re-use ESTs”: Yes
      • “Re-use alternate organism ESTs”: Yes
      • “Re-use protein alignments”: Yes
      • “Re-use repeats”: Yes
    • In “EST evidences (for best results provide at least one of these)”:
      • “Infer gene predictions directly from all ESTs”: No
    • In “Protein evidences (for best results provide at least one of these)”:
      • “Infer gene predictions directly from all protein alignments”: No
    • In “Ab-initio gene prediction”:
      • “SNAP model”: snap model (output of Train SNAP Tool: toolshed.g2.bx.psu.edu/repos/iuc/snap_training/snap_training/2013_11_29+galaxy1 )
      • “Prediction with Augustus”: Run Augustus with a custom prediction model
        • param-file “Augustus model”: augustus model (output of Train Augustus Tool: toolshed.g2.bx.psu.edu/repos/bgruening/augustus_training/augustus_training/3.3.3 )
    • In “Repeat masking”:
      • “Repeat library source”: Disable repeat masking (not recommended)

Annotation statistics

Do we get a better result from Maker after this second run? Let’s run the same tools as after the first run, and compare the results.

hands_on Hands-on: Get annotation statistics

  1. Genome annotation statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/jcvi_gff_stats/jcvi_gff_stats/0.8.4 with the following parameters:
    • param-file “Annotation to analyse”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 second run)
    • “Reference genome”: Use a genome from history
      • param-file “Corresponding genome sequence”: select genome.fasta from your history

Busco

hands_on Hands-on: Extract transcript sequences

  1. GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 with the following parameters:
    • param-file “Input GFF3 or GTF feature file”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 second run)
    • “Reference Genome”: select genome.fasta from your history
      • “Select fasta outputs”:
        • fasta file with spliced exons for each GFF transcript (-w exons.fa)
    • “full GFF attribute preservation (all attributes are shown)”: Yes
    • “decode url encoded characters within attributes”: Yes
    • “warn about duplicate transcript IDs and other potential problems with the given GFF/GTF records”: Yes

Now run BUSCO with the predicted transcript sequences:

hands_on Hands-on: Run BUSCO

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/4.1.4 with the following parameters:
    • param-file “Sequences to analyse”: exons (output of GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 )
    • “Mode”: Transcriptome
    • “Lineage”: Fungi

question Questions

  1. How does the second annotation compare to the previous one? Did the ab-initio predictors training improve the results?
  2. How do you explain these changes?

solution Solution

  1. The annotation looks much better: more complete single-copy BUSCOs, and more genes.
  2. Using ab-initio predictors allowed Maker to find many more genes in regions where EST or protein alignments were not sufficient to predict genes.

Ab-initio predictors second training

To get better results, we are going to perform a second training of SNAP and Augustus, and then run Maker for a third (final) time.

hands_on Hands-on: Train SNAP and Augustus

  1. Train SNAP Tool: toolshed.g2.bx.psu.edu/repos/iuc/snap_training/snap_training/2013_11_29+galaxy1 with the following parameters:
    • param-file “Genome to annotate”: select genome.fasta from your history
    • param-file “Maker annotation to use for training”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 , second run)
    • “Number of gene model to use for training”: "1000"
  2. Train Augustus Tool: toolshed.g2.bx.psu.edu/repos/bgruening/augustus_training/augustus_training/3.3.3 with the following parameters:
    • param-file “Genome to annotate”: select genome.fasta from your history
    • param-file “Annotation to use for training”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 , second run)

The Augustus training usually takes around 2 hours to complete. To continue this tutorial without waiting for the result, you can use the file augustus_training_2.tar.gz imported from Zenodo.

Third (last) Maker annotation round

Maker

Let’s run the final round of Maker, in the same way as we did for the second run.

hands_on Hands-on: Final annotation with Maker

  1. Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 with the following parameters:
    • param-file “Genome to annotate”: select genome.fasta from your history
    • “Organism type”: Eukaryotic
    • “Re-annotate using an existing Maker annotation”: Yes
      • param-file “Previous Maker annotation”: evidences (output of the second Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 run)
      • “Re-use ESTs”: Yes
      • “Re-use alternate organism ESTs”: Yes
      • “Re-use protein alignments”: Yes
      • “Re-use repeats”: Yes
    • In “EST evidences (for best results provide at least one of these)”:
      • “Infer gene predictions directly from all ESTs”: No
    • In “Protein evidences (for best results provide at least one of these)”:
      • “Infer gene predictions directly from all protein alignments”: No
    • In “Ab-initio gene prediction”:
      • param-file “SNAP model”: snap model (output of Train SNAP Tool: toolshed.g2.bx.psu.edu/repos/iuc/snap_training/snap_training/2013_11_29+galaxy1 )
      • “Prediction with Augustus”: Run Augustus with a custom prediction model
        • param-file “Augustus model”: augustus model (output of Train Augustus Tool: toolshed.g2.bx.psu.edu/repos/bgruening/augustus_training/augustus_training/3.3.3 )
    • In “Repeat masking”:
      • “Repeat library source”: Disable repeat masking (not recommended)

Annotation statistics

Do we get a better result from Maker after this third run? Let’s run the same tools as after the first and second runs, and compare the results.

hands_on Hands-on: Get annotation statistics

  1. Genome annotation statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/jcvi_gff_stats/jcvi_gff_stats/0.8.4 with the following parameters:
    • param-file “Annotation to analyse”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 third run)
    • “Reference genome”: Use a genome from history
      • param-file “Corresponding genome sequence”: select genome.fasta from your history

Busco

hands_on Hands-on: Extract transcript sequences

  1. GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 with the following parameters:
    • param-file “Input GFF3 or GTF feature file”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 third run)
    • “Reference Genome”: select genome.fasta from your history
      • “Select fasta outputs”:
        • fasta file with spliced exons for each GFF transcript (-w exons.fa)
        • fasta file with spliced CDS for each GFF transcript (-x cds.fa)
        • protein fasta file with the translation of CDS for each record (-y pep.fa)
    • “full GFF attribute preservation (all attributes are shown)”: Yes
    • “decode url encoded characters within attributes”: Yes
    • “warn about duplicate transcript IDs and other potential problems with the given GFF/GTF records”: Yes

Now run BUSCO with the predicted transcript sequences:

hands_on Hands-on: Run BUSCO

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/4.1.4 with the following parameters:
    • param-file “Sequences to analyse”: exons (output of GFFread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.1 )
    • “Mode”: Transcriptome
    • “Lineage”: Fungi

question Questions

How does the third annotation compare to the previous ones? Did the second ab-initio predictors training improve the results?

solution Solution

Depending on whether you annotated the full genome or only chromosome III, you should see nearly the same number of genes as in the previous Maker round, or even fewer. But you’ll notice that the number of multi-exon genes has increased. It means that in this third round, Maker was able to predict more complex genes, for example by merging some genes that were considered separate beforehand.
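If you want to check the multi-exon count yourself, one way is to count the exon records attached to each transcript through the `Parent` attribute of the GFF3 file. The sketch below does that on toy records; the function names and data are invented, and the statistics tool used above may count things differently (for example per gene rather than per transcript).

```python
from collections import Counter


def exon_counts(gff_lines):
    """Count exon records per parent transcript ID."""
    counts = Counter()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        for field in cols[8].split(";"):
            if field.startswith("Parent="):
                # An exon may list several parents, separated by commas.
                for parent in field[len("Parent="):].split(","):
                    counts[parent] += 1
    return counts


def count_multi_exon(gff_lines):
    """Return (transcripts with more than one exon, transcripts with any exon)."""
    counts = exon_counts(gff_lines)
    multi = sum(1 for n in counts.values() if n > 1)
    return multi, len(counts)


# Toy records, invented for illustration only.
toy_gff = [
    "chrIII\tmaker\texon\t100\t900\t.\t+\t.\tID=e1;Parent=mrna1",
    "chrIII\tmaker\texon\t2000\t2200\t.\t+\t.\tID=e2;Parent=mrna2",
    "chrIII\tmaker\texon\t2300\t2500\t.\t+\t.\tID=e3;Parent=mrna2",
    "chrIII\tmaker\texon\t2700\t2900\t.\t+\t.\tID=e4;Parent=mrna2",
]

multi, total = count_multi_exon(toy_gff)
print(multi, "of", total, "transcripts have more than one exon")
```

Running this on the final annotation of each Maker round gives three numbers you can compare directly.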

Usually no more than two rounds of training are needed to get the best results from the ab-initio predictors. You can try to retrain Augustus and SNAP, but you will probably notice very few changes. We will keep the final annotation we obtained for the rest of this tutorial.

Improving gene naming

If you look at the content of the final annotation dataset, you will notice that the gene names are long, complicated, and not very readable. That’s because Maker assigns them automatic names based on the way it computed each gene model. We are now going to automatically assign more readable names.

hands_on Hands-on: Change gene names

  1. Map annotation ids Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker_map_ids/maker_map_ids/2.31.11 with the following parameters:
    • param-file “Maker annotation where to change ids”: final annotation (output of Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 )
    • “Prefix for ids”: TEST_
    • “Justify numeric ids to this length”: 6

    comment Comment

    Genes will be renamed to look like: TEST_001234. You can replace TEST_ by anything you like, usually an uppercase short prefix.
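The naming scheme is a prefix followed by a number padded with zeros to a fixed width. The sketch below reproduces that scheme in Python; the function names and the "old" identifiers are hypothetical, and the order in which the Galaxy tool numbers the genes is not specified here.

```python
def make_gene_id(prefix, number, width=6):
    """Build an ID such as TEST_001234 from a prefix and a number."""
    return f"{prefix}{number:0{width}d}"


def build_id_map(old_ids, prefix, width=6):
    """Map each old ID to a new, shorter ID, numbered in input order."""
    return {
        old: make_gene_id(prefix, index, width)
        for index, old in enumerate(old_ids, start=1)
    }


# Hypothetical old identifiers, for illustration only.
old_ids = ["long-automatic-name-0.1", "long-automatic-name-0.2"]

print(make_gene_id("TEST_", 1234))
print(build_id_map(old_ids, "TEST_"))
```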

Look at the generated dataset: it should be much more readable, and ready for an official release.

Visualising the results

With Galaxy, you can visualize the annotation you have generated using JBrowse. This allows you to navigate along the chromosomes of the genome and see the structure of each predicted gene.

hands_on Hands-on: Visualize annotations in JBrowse

  1. JBrowse Tool: toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.10+galaxy0 with the following parameters:
    • “Reference genome to display”: Use a genome from history
      • param-file “Select the reference genome”: select genome.fasta from your history
    • “JBrowse-in-Galaxy Action”: New JBrowse Instance
    • In “Track Group”:
      • Click on “Insert Track Group”:
      • In “1: Track Group”:
        • “Track Category”: Maker annotation
        • In “Annotation Track”:
          • Click on “Insert Annotation Track”:
          • In “1: Annotation Track”:
            • “Track Type”: GFF/GFF3/BED Features

            • param-files “GFF/GFF3/BED Track Data”: select the final annotation of each Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 run

Enable the three different tracks on the left side of JBrowse, then navigate along the genome and compare the three different annotations. You should see how Maker progressively produced more complex gene models.

question Questions

Navigate to the position NC_003421.2:143850..148763 (meaning: on the sequence named NC_003421.2 (NCBI identifier for Chromosome III), between positions 143850 and 148763).

JBrowse navigation

  1. How did the annotation improve in this region after each Maker round?

solution Solution

  1. At the end of the first round, a first short gene model was predicted by Maker in this region. After the second round, Maker was able to predict a second gene model in this region. Notice the name of the model beginning with snap_masked: it means that Maker mainly used a gene prediction from SNAP to construct this gene model. After the third round, the two genes were merged into a single one. Training Augustus and SNAP allowed Maker to refine the gene structures found in this region.
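The location string used above follows the pattern `name:start..end`. If you want to look up the same region in the GFF3 files outside JBrowse, a small helper like the one sketched below can split the string and test whether a feature overlaps the region. The function names are made up for this example.

```python
def parse_location(location):
    """Split a 'name:start..end' string into (name, start, end)."""
    seq_name, _, span = location.rpartition(":")
    start_text, _, end_text = span.partition("..")
    start, end = int(start_text), int(end_text)
    if not seq_name or start > end:
        raise ValueError(f"invalid location: {location!r}")
    return seq_name, start, end


def overlaps(feature_start, feature_end, start, end):
    """True if a feature shares at least one base with the region (inclusive)."""
    return feature_start <= end and feature_end >= start


name, start, end = parse_location("NC_003421.2:143850..148763")
print(name, start, end, end - start + 1)
```

Combined with a GFF3 reader, `overlaps` can be used to list the gene models of each Maker round that fall in this region.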

More visualisation

You might want to understand how a specific gene model was predicted by Maker. JBrowse can also display the evidence used by Maker (EST alignments, protein alignments, ab-initio predictions, …).

Hands-on: Visualize evidence in JBrowse

  1. JBrowse Tool: toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.10+galaxy0 with the following parameters:
    • “Reference genome to display”: Use a genome from history
      • “Select the reference genome”: select genome.fasta from your history
    • “JBrowse-in-Galaxy Action”: Update existing JBrowse Instance
    • “Previous JBrowse Instance”: select the result from the previous JBrowse Tool: toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.10+galaxy0 run
    • In “Track Group”:
      • Click on “Insert Track Group”:
      • In “1: Track Group”:
        • “Track Category”: Maker evidences
        • In “Annotation Track”:
          • Click on “Insert Annotation Track”:
          • In “1: Annotation Track”:
            • “Track Type”: GFF/GFF3/BED Features
            • “GFF/GFF3/BED Track Data”: select the “evidences” output of each Maker Tool: toolshed.g2.bx.psu.edu/repos/iuc/maker/maker/2.31.11 run
            • “This is match/match_part data”: Yes

You will now see new tracks displaying all the evidence that Maker used to generate the consensus gene models.
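To get an idea of what these evidence tracks contain before loading them, you can tally the features of an “evidences” GFF3 file by source (column 2) and type (column 3). The sketch below is an illustration only: the function name is hypothetical and the example rows are invented, although source labels such as snap_masked follow the naming seen in Maker output.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_evidence(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Count GFF3 features per (source, type) pair."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # stop before any embedded sequence section
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        counts[(cols[1], cols[2])] += 1
    return counts


# Hypothetical example rows: coordinates and IDs are invented.
EXAMPLE_ROWS = [
    ("NC_003421.2", "est2genome", "expressed_sequence_match", "144000", "145200", ".", "+", ".", "ID=ev1"),
    ("NC_003421.2", "est2genome", "match_part", "144000", "144400", ".", "+", ".", "ID=ev1.p1;Parent=ev1"),
    ("NC_003421.2", "protein2genome", "protein_match", "144100", "145000", ".", "+", ".", "ID=ev2"),
    ("NC_003421.2", "snap_masked", "match", "146000", "147500", ".", "+", ".", "ID=ev3"),
    ("NC_003421.2", "snap_masked", "match_part", "146000", "146800", ".", "+", ".", "ID=ev3.p1;Parent=ev3"),
]
EXAMPLE_EVIDENCE = ["##gff-version 3"] + ["\t".join(row) for row in EXAMPLE_ROWS]

evidence_counts = count_evidence(EXAMPLE_EVIDENCE)
for (source, feature_type), n in sorted(evidence_counts.items()):
    print(f"{source}\t{feature_type}\t{n}")
```

Features of type match_part are the individual aligned blocks of a parent match, which is why the “This is match/match_part data” option is set to Yes for these tracks.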

Conclusion

Congratulations, you finished this tutorial! You learned how to annotate a eukaryotic genome using Maker, how to evaluate the quality of the annotation, and how to visualize it using the JBrowse genome browser.

What’s next?

After generating your annotation, you will probably want to automatically assign functional annotation to each predicted gene model. You can do it by using Blast, InterProScan, or Blast2GO for example.

An automatic annotation of a eukaryotic genome is rarely perfect. If you inspect some predicted genes, you will probably find mistakes made by Maker, e.g. wrong exon/intron boundaries, split genes, or merged genes. Setting up a manual curation project with Apollo makes it much easier to fix these errors by hand. Check out the Apollo tutorial for more details.
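Before starting manual curation, it can help to shortlist loci where two annotations disagree on gene boundaries. The sketch below is one simple way to do this, assuming gene coordinates have already been extracted from two GFF3 files: it reports genes of one annotation that overlap two or more genes of the other on the same sequence and strand. The names and the example coordinates are hypothetical, and an overlap is only a hint that a merge or split may have occurred, not proof of an error.

```python
from typing import Dict, List, NamedTuple


class Gene(NamedTuple):
    gene_id: str
    seqid: str
    strand: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a: Gene, b: Gene) -> bool:
    """True if both genes share sequence, strand and at least one base."""
    return (a.seqid == b.seqid and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def merge_candidates(earlier: List[Gene], later: List[Gene]) -> Dict[str, List[str]]:
    """Map each gene in `later` to the genes in `earlier` it spans, if 2 or more."""
    candidates: Dict[str, List[str]] = {}
    for new in later:
        spanned = [old.gene_id for old in earlier if overlaps(old, new)]
        if len(spanned) >= 2:
            candidates[new.gene_id] = spanned
    return candidates


# Hypothetical gene models: IDs and coordinates are invented.
round2 = [
    Gene("r2_gene_a", "NC_003421.2", "+", 144000, 145200),
    Gene("r2_gene_b", "NC_003421.2", "+", 146000, 147500),
]
round3 = [
    Gene("r3_gene_a", "NC_003421.2", "+", 144000, 147500),
]

merged = merge_candidates(round2, round3)   # genes merged in the later round
split = merge_candidates(round3, round2)    # swapping arguments finds splits
print("merge candidates:", merged)
print("split candidates:", split)
```

The pairwise comparison is quadratic in the number of genes, which is acceptable for a small genome; a sorted sweep or an interval tree would be a better fit for larger annotations.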

Key points

  • Maker allows you to annotate a eukaryotic genome.

  • BUSCO and JBrowse allow you to inspect the quality of an annotation.

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Genome Annotation topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.

References

  1. Campbell, M. S., C. Holt, B. Moore, and M. Yandell, 2014 Genome annotation and curation using MAKER and MAKER-P. Current Protocols in Bioinformatics 48: 4–11. 10.1002/0471250953.bi0411s48

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

Citing this Tutorial

  1. Anthony Bretaudeau, 2020 Genome annotation with Maker (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/annotation-with-maker/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

BibTeX

@misc{genome-annotation-annotation-with-maker,
author = "Anthony Bretaudeau",
title = "Genome annotation with Maker (Galaxy Training Materials)",
year = "2020",
month = "12",
day = "31",
url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/annotation-with-maker/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                
