Genome Annotation

Overview

time Time estimation: 2 hours

Introduction

Genome annotation is the process of attaching biological information to sequences. It consists of three main steps:

Agenda

In this tutorial, we will deal with:

  1. Introduction to File Formats
  2. Structural Annotation
    1. Sequence Features
    2. Gene Prediction
  3. Functional Annotation
    1. Similarity Searches (BLAST)
    2. More Similarity Search Tools in Galaxy
    3. Identification of Gene Clusters

Introduction to File Formats

FASTA

DNA and protein sequences are written in FASTA format: the first line of a record starts with “>” followed by the description, and the sequence begins on the second line.

FASTA file
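To make the layout concrete, here is a minimal sketch of reading FASTA records in plain Python. The function name `parse_fasta` and the two example records are invented for illustration; for real work a dedicated library is usually the better choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header line (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # a sequence may be wrapped over several lines
            records[header] += line
    return records


# Invented example data, not taken from the tutorial dataset
example = """>seq1 hypothetical example record
ATGCGTAC
GGCTA
>seq2 another record
TTGACA
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Note that the sequence of one record may continue over many lines until the next “>” line starts a new record.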

GFF3

The general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences.

GFF3 overview
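A GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The following sketch splits one such line into named fields. The class and function names, as well as the example line, are assumptions made for illustration only.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes: Dict[str, str] = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Invented example line, not real Augustus output
example_line = "contig1\tAUGUSTUS\tgene\t101\t950\t0.87\t+\t.\tID=g1"
feature = parse_gff3_line(example_line)
print(feature.feature_type, feature.start, feature.end, feature.attributes["ID"])
```

Coordinates in GFF3 are 1-based and inclusive, so the feature above spans `end - start + 1` bases.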

GenBank

The GenBank sequence format is a rich format for storing sequences together with their associated annotations.

GenBank file

Structural Annotation

For the genome annotation, we use a piece of the Aspergillus fumigatus genome sequence as the input file.

Sequence Features

First we want to get some general information about our sequence.

hands_on Hands-on: Sequence composition

  1. Count the number of bases in your sequence (compute sequence length)
  2. Check for sequence composition and GC content (geecee).
  3. Plot the sequence composition as a bar chart.

Bar chart output of the sequence
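The three steps above can also be sketched outside Galaxy. This is only an illustration of what the tools compute, not their actual implementation; the function name `sequence_stats` and the short example sequence are invented.

```python
from collections import Counter


def sequence_stats(seq: str):
    """Return (length, base counts, GC fraction) for a nucleotide sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    length = len(seq)
    gc_fraction = (counts["G"] + counts["C"]) / length if length else 0.0
    return length, counts, gc_fraction


# Invented toy sequence
length, counts, gc = sequence_stats("ATGCGCTTAA")
print("length:", length)
print("GC content:", gc)

# A simple text bar chart of the base composition
for base in "ACGT":
    print(f"{base} {'#' * counts[base]} {counts[base]}")
```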

Gene Prediction

First, you need to identify the regions of the genome that code for proteins. This step is called “structural annotation”. It comprises the identification and location of open reading frames (ORFs), the identification of gene structures and coding regions, and the location of regulatory motifs. Galaxy contains several tools for structural annotation. Tools for gene prediction include Augustus (for eukaryotes and prokaryotes) and glimmer3 (only for prokaryotes).

hands_on Hands-on: Gene prediction

We use Augustus for gene prediction.

  1. Use the genome sequence (FASTA file) as input.
  2. Choose the appropriate model organism and GFF as the output format.
  3. Select all possible output options.

augustus

Augustus will provide three output files: GFF3, coding sequences (CDS) and protein sequences.

question Question

How many genes are predicted?

solution Solution

Check the output: augustus_output
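One way to double-check the answer outside Galaxy is to count the gene features in the GFF output. This is a minimal sketch, assuming that each predicted gene appears as one line with `gene` in the third column; the function name and the example lines are invented.

```python
from typing import Iterable


def count_features(lines: Iterable[str], feature_type: str = "gene") -> int:
    """Count GFF lines whose third column equals feature_type."""
    total = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] == feature_type:
            total += 1
    return total


# Invented example lines, not real Augustus output
gff_lines = [
    "##gff-version 3",
    "contig1\tAUGUSTUS\tgene\t101\t950\t0.87\t+\t.\tID=g1",
    "contig1\tAUGUSTUS\tCDS\t101\t950\t0.87\t+\t0\tParent=g1.t1",
    "contig1\tAUGUSTUS\tgene\t1500\t2300\t0.65\t-\t.\tID=g2",
]
print(count_features(gff_lines))
```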

hands_on Hands-on: tRNA and tmRNA Prediction

Use Aragorn for tRNA and tmRNA prediction.

  1. As input file use the Aspergillus genome sequence. You can choose the genetic code (e.g. bacteria).
  2. Select the topology of your genome (circular or linear).

    question Question

    Are there tRNAs or tmRNAs in the sequence?

tip Tip:

Read more about Aragorn here.

Functional Annotation

Similarity Searches (BLAST)

Functional gene annotation describes the biochemical and biological function of proteins. Possible analyses to annotate genes include, for example:

For similarity searches we use NCBI BLAST+ blastp to find similar proteins in a protein database.

  1. tool As input file, select the protein sequences from Augustus.
  2. Choose the protein BLAST database SwissProt and the output format XML.

blastp tool interface and parameters

  1. Convert the XML output into tabular format with the tool Parse blast XML output.

    question Questions

    What information do you see in the BLAST output?

From BLAST search results we want to get only the best hit for each protein.

  1. tool To do so, apply the tool BLAST top hit descriptions with number of descriptions = 1 to the XML output file.

    question Question

    For how many proteins do we not get a BLAST hit?
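The idea of keeping only the best hit per protein can be sketched as follows. This is not how the Galaxy tool works internally. The sketch assumes the standard 12-column BLAST tabular layout (query ID in column 1, subject ID in column 2, bit score in column 12), which may differ from the columns your Galaxy output contains, and all identifiers are invented.

```python
from typing import Dict, Iterable, Tuple


def best_hits(rows: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Keep, for each query, the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in rows:
        cols = row.rstrip("\n").split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Invented example rows in the assumed 12-column layout
rows = [
    "g1.t1\tsp|P00001|X\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0",
    "g1.t1\tsp|P00002|Y\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-10\t80.0",
    "g2.t1\tsp|P00003|Z\t70.0\t50\t15\t0\t1\t50\t1\t50\t1e-20\t120.0",
]
hits = best_hits(rows)

# Proteins that were searched but returned no hit at all
all_queries = ["g1.t1", "g2.t1", "g3.t1"]
without_hit = [q for q in all_queries if q not in hits]
print(hits)
print(without_hit)
```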

  1. tool Choose the tool Select lines that match an expression and enter the following information: Select lines from [select the BLAST top hit descriptions result file]; that [not matching]; the pattern [gi].

Select lines that match an expression tool interface and parameters

tip The result file will contain all proteins that have no entry in the second column and therefore no similar protein in the SwissProt database.
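The selection step amounts to an inverted pattern match. Here is a minimal sketch, assuming a two-column table in which the second column holds the hit description; the table content and the function name are invented.

```python
import re
from typing import Iterable, List


def select_not_matching(lines: Iterable[str], pattern: str) -> List[str]:
    """Return the lines in which the regular expression does not occur."""
    regex = re.compile(pattern)
    return [line for line in lines if not regex.search(line)]


# Invented table: query ID, then hit description (empty when there is no hit)
table = [
    "g1.t1\tgi|0000001|sp|P00001|hypothetical hit",
    "g2.t1\t",
    "g3.t1\tgi|0000002|sp|P00002|hypothetical hit",
]
no_hit_lines = select_not_matching(table, "gi")
print(no_hit_lines)
```

A short pattern such as `gi` is matched anywhere in the line, so it would also discard a protein whose own identifier happens to contain these letters. A stricter pattern such as `gi\|` reduces that risk.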

tip To describe the function of those proteins, we want to search for motifs or domains that may classify them further. To get a protein sequence FASTA file containing only the unannotated proteins, use the tool Filter sequences by ID from a tabular file. For Sequence file to filter on the identifiers, select [Augustus protein sequences]; for Tabular file containing sequence identifiers, select the file listing the unannotated proteins. The output is a FASTA file containing only the sequences without a description.
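What this filtering step does can be sketched in a few lines. This is an illustration under the assumption that the sequence identifier is the first word after “>”. It is not the Galaxy tool's implementation, and the records and IDs are invented.

```python
from typing import Iterable, List, Set


def filter_fasta_by_ids(lines: Iterable[str], wanted_ids: Set[str]) -> List[str]:
    """Keep only FASTA records whose identifier is in wanted_ids."""
    keep = False
    kept: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split()
            seq_id = words[0] if words else ""
            keep = seq_id in wanted_ids
        if keep:
            kept.append(line)
    return kept


# Invented protein records
proteins = [
    ">g1.t1",
    "MKTAYIAKQR",
    ">g2.t1",
    "MSLLTEVETY",
    "VLSIIPSGPL",
    ">g3.t1",
    "MADEEKLPPG",
]
unannotated = filter_fasta_by_ids(proteins, {"g2.t1"})
print("\n".join(unannotated))
```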

This file will be the input for more detailed analysis:

BLAST Programs

BLAST programs

BLAST databases

tip Tip:

If your organism is not available in a BLAST database, you can use its genome sequence in a FASTA file for “sequence file against sequence file” BLAST searches. If you need to search these sequences on a regular basis, you can create your own BLAST database from the sequences of the organism. The advantage of having your own database is that BLAST searches run much faster.

NCBI BLAST+ makeblastdb creates a BLAST database from your own FASTA sequence file. The molecule type of the input is either protein or nucleotide.
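Outside Galaxy, the same step corresponds to a call of the `makeblastdb` command-line program. The sketch below only assembles the command and does not run it; the file name and database name are placeholders, and it assumes a local BLAST+ installation if you decide to execute it.

```python
from typing import List


def makeblastdb_command(fasta_path: str, dbtype: str, out_name: str) -> List[str]:
    """Build the argument list for makeblastdb (dbtype is 'prot' or 'nucl')."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", out_name]


# Placeholder file and database names
cmd = makeblastdb_command("my_proteins.fasta", "prot", "my_proteins_db")
print(" ".join(cmd))

# To actually run it (requires BLAST+ to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```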

tip Tip: Further Reading about BLAST Tools in Galaxy

Cock et al. (2015): NCBI BLAST+ integrated into Galaxy

Cock et al. (2013): Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology

More Similarity Search Tools in Galaxy

tip Tip:

See here for the vsearch documentation.

tip Tip:

Buchfink et al. (2015): Fast and sensitive protein alignment using Diamond.

Identification of Gene Clusters

antiSMASH is used to identify gene clusters. The tool takes a GenBank file as input and predicts gene clusters. The output files are an HTML visualization and the gene cluster proteins.

hands_on Hands-on: antiSMASH analysis

tool Import this dataset into your Galaxy history and run antiSMASH to detect gene clusters. The GenBank file contains a part of the Streptomyces coelicolor genome sequence.

question Questions

Which gene clusters are identified?

For a whole-genome antiSMASH analysis, your result may look like this:

The result of antiSMASH

Finally, you can extract a reproducible workflow from your history. The workflow should look like this:

GenomeAnnotation Workflow

congratulations Congratulations on successfully completing this tutorial!