GO Enrichment Analysis
Author(s) | IGC Bioinformatics Unit Maria Doyle |
Overview
Questions:
- How can I functionally interpret a list of genes of interest that I obtained from my experiment?
Objectives:
- How to perform a GO Enrichment Analysis
- How to interpret and simplify the results
Requirements:
- Introduction to Galaxy Analyses
- Slides: Quality Control
- Hands-on: Quality Control
- Slides: Mapping
- Hands-on: Mapping
Time estimation: 1 hour
Published: Jan 23, 2019
Last modification: Mar 5, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
PURL: https://gxy.io/GTN:T00291
Revision: 12
When we have a large list of genes of interest, such as a list of differentially expressed genes obtained from an RNA-Seq experiment, how do we extract biological meaning from it?
One way to do so is to perform functional enrichment analysis. This method applies statistical tests to check whether the genes of interest are associated with certain biological functions more often than would be expected in a random set of genes. In this tutorial you will learn about enrichment analysis and how to perform it.
What is the Gene Ontology?
The Gene Ontology (GO) is a structured, controlled vocabulary for the classification of gene function at the molecular and cellular level. It is divided into three separate sub-ontologies or GO types: biological process (e.g., signal transduction), molecular function (e.g., ATPase activity) and cellular component (e.g., ribosome). These sub-ontologies are structured as directed acyclic graphs (a hierarchy with multi-parenting) of GO terms.
The Gene Ontology, like other ontologies, is usually encoded in the OBO or OWL format. It can be downloaded from the Gene Ontology website or from the OBO Foundry. Galaxy also offers tools for manipulating and extracting information from OBO files, but these are outside the scope of this tutorial.
Comment: Take note of when and where you obtained your ontology file, as ontologies are constantly being updated.
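To make the OBO format more concrete, here is a minimal Python sketch that reads the `[Term]` stanzas of an OBO file and collects each term's `id`, `name`, `namespace` and `is_a` parents. It is an illustration only, not the parser used by any Galaxy tool. The function name `parse_obo_terms` and the `EX:` identifiers in the example are hypothetical, and many OBO tags (for example `is_obsolete` or `relationship`) are deliberately ignored.

```python
def parse_obo_terms(text):
    """Return {term_id: {"name": str, "namespace": str, "is_a": [parent ids]}}."""
    terms = {}
    current = None  # the [Term] stanza being read, or None outside a term
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"name": "", "namespace": "", "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, or non-term stanzas
        tag, _, value = line.partition(":")
        value = value.split("!")[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            terms[value] = current
        elif tag in ("name", "namespace"):
            current[tag] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


# Hypothetical two-term ontology, written in OBO syntax for illustration.
EXAMPLE_OBO = """format-version: 1.2

[Term]
id: EX:0000001
name: example root process
namespace: biological_process

[Term]
id: EX:0000002
name: example child process
namespace: biological_process
is_a: EX:0000001 ! example root process

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(EXAMPLE_OBO)
for term_id, info in terms.items():
    print(term_id, info["name"], info["is_a"])
```

The `is_a` lists collected here are what define the parent links of the directed acyclic graph.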
What are GO annotations?
Genes are associated to GO terms via GO annotations. Each gene can have multiple annotations, even of the same GO type. An important notion to take into account when using GO is that, according to the true path rule, a gene annotated to a term is also implicitly annotated to each ancestor of that term in the GO graph. GO annotations have evidence codes that encode the type of evidence supporting them: only a small minority of genes have experimentally verified annotations; the large majority have annotations inferred electronically based on sequence homology or known patterns.
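The true path rule can be sketched in Python as a propagation of each direct annotation to all ancestors of the annotated term. This is a minimal illustration under assumed data structures: the ontology is a dictionary from each term to its direct parents, and the term and gene names (`T:leaf`, `geneX`, and so on) are hypothetical placeholders, not real GO identifiers.

```python
def ancestors(term, parents):
    """All ancestors of `term` in a DAG given as {term: [direct parents]}."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents.get(node, []))
    return found


def propagate(direct_annotations, parents):
    """Extend {gene: {terms}} with every ancestor of each directly annotated term."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Hypothetical graph: T:leaf has two parents (multi-parenting), both under T:root.
PARENTS = {
    "T:leaf": ["T:mid_a", "T:mid_b"],
    "T:mid_a": ["T:root"],
    "T:mid_b": ["T:root"],
    "T:root": [],
}
DIRECT = {"geneX": {"T:leaf"}, "geneY": {"T:mid_a"}}

FULL = propagate(DIRECT, PARENTS)
for gene in sorted(FULL):
    print(gene, sorted(FULL[gene]))
```

Because of this propagation, terms near the root of the graph accumulate many genes, which is one reason very generic terms carry little specific meaning.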
GO annotations can be obtained from the Gene Ontology website or from species-specific databases. One useful resource for obtaining GO annotations is Ensembl BioMart. Again, take note of when and where you obtained your annotations. For example, if you obtained your data from Ensembl, record the release you used.
Functional Enrichment Analysis
To perform functional enrichment analysis, we need to have:
- A set of genes of interest (e.g., differentially expressed genes): study set
- A set with all the genes to consider in the analysis: population set (which must contain the study set)
- GO annotations, associating the genes in the population set to GO terms
- The GO ontology, with the description of GO terms and their relationships
For each GO term, we count the number (k) of genes in the study set (of size n) that are associated with the term, and the number (K) of genes in the population set (of size N) that are associated with the same term. We then test how likely it would be to obtain at least k genes associated with the term if n genes were randomly sampled from the population, given the frequency K and the size N of the population.
The appropriate statistical test is the one-tailed variant of Fisher’s exact test, also known as the hypergeometric test for over-representation. The one-tailed version computes the probability of observing at least the sample frequency, given the population frequency. The hypergeometric distribution gives precisely the probability of k successes in n draws, without replacement, from a finite population of size N that contains exactly K successful objects:
$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
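The test described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library (`math.comb`, available from Python 3.8), not the implementation used by the GOEnrichment tool. The function names and the toy gene identifiers are hypothetical. The one-tailed p-value is obtained by summing the hypergeometric probabilities for every count from k up to the largest possible value, min(n, K).

```python
from math import comb


def hypergeom_pmf(k, n, K, N):
    """P(X = k): k annotated genes among n draws from N genes, K of them annotated."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_pvalue(k, n, K, N):
    """One-tailed p-value P(X >= k) for over-representation."""
    return sum(hypergeom_pmf(i, n, K, N) for i in range(k, min(n, K) + 1))


def term_counts(study, population, annotated):
    """Derive (k, n, K, N) from gene sets; `annotated` holds the genes with the term."""
    return (
        len(study & annotated),
        len(study),
        len(population & annotated),
        len(population),
    )


# Toy example with hypothetical gene identifiers.
population = {f"g{i}" for i in range(20)}   # N = 20
annotated = {f"g{i}" for i in range(5)}     # K = 5 genes carry the term
study = {"g0", "g1", "g2", "g10"}           # n = 4, of which k = 3 carry the term

k, n, K, N = term_counts(study, population, annotated)
p_value = enrichment_pvalue(k, n, K, N)
print(k, n, K, N, p_value)
```

In this toy case the p-value is the sum of two terms, for exactly 3 and exactly 4 annotated genes in the study set, which equals 155/4845.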
For this first exercise we will use data from Trapnell et al. 2014. In this work, the authors created an artificial dataset of gene expression in Drosophila melanogaster, where 300 random genes were set (in silico) to be differentially expressed between two conditions.
Hands-on: The data for this tutorial is available for download from Zenodo. For convenience and reproducibility of results, the GO ontology and annotations are already included in the Zenodo repository.
- Create a new history. To do so, click the new-history icon at the top of the history panel.
- Upload the following files to Galaxy:
  - go.obo
  - drosophila_gene_association.fb
  - trapnellPopulation.tab
  To upload a file:
  - Click on the upload button in the upper left of the interface.
  - Press Choose local file and search for your file.
  - Press Start and wait for the upload to finish.
- Rename the go.obo file to GO and the drosophila_gene_association.fb file to GO annotations Drosophila melanogaster.
- After uploading the files, press the galaxy-eye (eye) icon of trapnellPopulation.tab to inspect its contents.
Comment: The study set represents the differentially expressed genes. These were chosen as having an adjusted p-value for the differential expression test (last column) smaller than a given threshold. In this case, we want to select the genes with an adjusted p-value < 0.05.
- Filter data on any column using simple expressions tool with the following parameters:
This generates one file. Rename it to trapnellStudy.
Comment: Both files contain the same type of information; they differ only in the number of genes, since the genes in the study set are a subset of the population.
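Outside Galaxy, the same selection can be sketched in Python. This is an illustration only, under stated assumptions: the input is tab-separated, the adjusted p-value is in the last column (as described in the comment above), and the first line is a header. The function name `filter_significant`, the `has_header` parameter and the example rows are hypothetical, not taken from the actual file. Rows whose last column is not a number (such as `NA`) are skipped.

```python
def filter_significant(lines, threshold=0.05, has_header=True):
    """Keep rows whose last tab-separated column is a number below `threshold`."""
    kept = []
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if has_header and index == 0:
            kept.append(line)  # assumption: the first line is a header
            continue
        if not line:
            continue
        try:
            padj = float(line.split("\t")[-1])
        except ValueError:
            continue  # non-numeric values such as 'NA' cannot pass the filter
        if padj < threshold:
            kept.append(line)
    return kept


# Hypothetical rows; the real file has its own columns and values.
population_rows = [
    "gene_id\tlog2FC\tpadj",
    "geneA\t1.8\t0.001",
    "geneB\t0.1\t0.70",
    "geneC\t-2.0\tNA",
    "geneD\t-1.2\t0.049",
]

study_rows = filter_significant(population_rows)
for row in study_rows:
    print(row)
```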
- GOEnrichment tool with the following parameters:
  - “Gene Ontology File”: GO
  - “Gene Product Annotation File”: GO annotations Drosophila melanogaster
  - “Study set File”: trapnellStudy
  - “Population set File”: trapnellPopulation.tab
  - Use the default options for the rest.
Question: What were the results from running GOEnrichment?
This will generate 6 files with the following default names: goenrichment on trapnellStudy MF Table, goenrichment on trapnellStudy BP Table, goenrichment on trapnellStudy CC Table, goenrichment on trapnellStudy MF Graph, goenrichment on trapnellStudy BP Graph and goenrichment on trapnellStudy CC Graph. The three Table files list the results of the statistical test for each GO term, ordered by p-value, and the three Graph files are image files displaying a graph view of the enriched GO terms.
Comment: For each GO term we obtain a p-value corresponding to a single, independent test. Since we are performing many similar tests, the probability that at least one of them is a false positive increases. Therefore, we need to correct for multiple testing.
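As an illustration of what a multiple-testing correction does, here is a minimal Python sketch of the Benjamini-Hochberg procedure, one commonly used correction that controls the false discovery rate. This tutorial does not state which correction GOEnrichment applies by default, so treat the choice of method as an assumption; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per GO term tested.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(p, q)
```

Each adjusted value is at least as large as the raw p-value, so fewer terms pass a fixed threshold such as 0.05 after correction.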
Question: How many significant terms do we get?
Significant terms are GO terms with a p-value < 0.05. According to the results, we have 5 GO terms for Molecular Function, 43 for Biological Process and 10 for Cellular Component.
If you press the galaxy-eye (eye) icon of the Molecular Function file (MF Trapnell), you can inspect the results.
Question: Did you expect to see significant terms?
The ~300 genes should be random, so we wouldn’t expect to see any enriched terms. Nonetheless, we still have significant terms.
Comment: Let’s go back a little and reopen the trapnellPopulation.tab file. If you go through the file, you’ll see genes with ‘NA’ as an adjusted p-value. This means that there are genes in our background population for which the differential expression test was not even performed (usually genes that were not expressed in any sample). These genes are irrelevant for this functional enrichment analysis. We should also note that the study set does not include those genes!
Let’s remove the irrelevant genes from the background population (trapnellPopulation.tab) to see how the results change.
Hands-on
- Filter data on any column using simple expressions tool with the following parameters:
- This generates one file. Rename it to trapnellFilteredPopulation.
- GOEnrichment tool with the following parameters:
- Rename all 6 output files by appending FilteredPop to the name, to distinguish them from the previous outputs. Let’s check the new graph, goenrichment on trapnellStudy MF Graph FilteredPop.
Question
- How many significant terms do we get now?
- Why do you see these differences?
- According to the results, Molecular Function and Biological Process have 0 significant GO terms, and Cellular Component has just 1.
- The background population genes that we removed are not random: they are usually genes that are expressed only in specific conditions, tissues or time points. If they are included in the test, we obtain false enrichments, as we saw.
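The effect of the background can be illustrated numerically with a small Python sketch. All counts below are hypothetical and chosen only to show the mechanism; they are not taken from the Trapnell dataset. The study set is the same in both cases (300 genes, 30 of them annotated to a term). Against a filtered background in which 10% of genes carry the term, 30 out of 300 is exactly what chance predicts. Adding 8000 untested genes, of which only 100 carry the term, lowers the background frequency to 5% and makes the same study set look enriched.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """One-tailed hypergeometric p-value P(X >= k)."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favourable / comb(N, n)


# Hypothetical counts: same study set, two different backgrounds.
n, k = 300, 30

# Filtered background: 6000 tested genes, 600 annotated to the term (10%).
p_filtered = enrichment_pvalue(k, n, K=600, N=6000)

# Unfiltered background: 8000 extra untested genes, only 100 of them annotated,
# giving 700 annotated genes out of 14000 (5%).
p_unfiltered = enrichment_pvalue(k, n, K=700, N=14000)

print("filtered background:  ", p_filtered)
print("unfiltered background:", p_unfiltered)
```

With these made-up numbers, the p-value against the filtered background is far above 0.05, while the p-value against the unfiltered background falls well below it, even though nothing about the study set has changed.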
Simplification of graphs
Graphical views are essential, but sometimes the graph view can become overwhelming due to the size of the results. To exemplify this issue, we will next perform functional enrichment analysis using a more realistic dataset from a study using the mouse model organism. The original dataset can be found on NCBI. In this study, the authors compared the gene expression of several tissues. Here, we will use results from the comparison between heart and brain.
Hands-on: For the first exercise we will use the differentially expressed genes (adjusted p-value < 0.05) as the study set.
- Upload the mouse_brain_vs_heart.txt, Mus_musculus_annotations_biomart_e92.tab and mouse_brain_vs_heart.difgenes.txt files to Galaxy.
- Rename the mouse_brain_vs_heart.txt file to Mouse population, the Mus_musculus_annotations_biomart_e92.tab file to GO annotations Mus musculus and the mouse_brain_vs_heart.difgenes.txt file to Mouse diff.
- GOEnrichment tool with the following parameters:
- This will generate 6 files with the names: goenrichment on Mouse diff MF Table, goenrichment on Mouse diff BP Table, goenrichment on Mouse diff CC Table, goenrichment on Mouse diff MF Graph, goenrichment on Mouse diff BP Graph and goenrichment on Mouse diff CC Graph.
- Analyze the table and graph from Biological Process.
Comment: As you can see, the three graphs are very complex and difficult to analyze.
As you see, the number of enriched GO Terms is very high, with graphs that are too extensive to analyze manually. And this is despite the fact that GOEnrichment ignores singletons and skips dependent tests by default, precisely to avoid enrichment results that are too extensive and not informative.
The Summarize Output option in the GOEnrichment tool addresses this problem by conflating branches/families of enriched GO terms and selecting the most representative term(s) from them (usually 1-2 terms per family). This greatly simplifies the results while retaining branch information, thus ensuring that every enriched family of functions is present in the results. Some specificity is necessarily lost, but the trade-off is that the results become easier and more intuitive to analyze.
Hands-on
- GOEnrichment tool with the following parameters:
- Analyze again the table and graph from Biological Process.
Question
- Are there differences in complexity comparing the graph with and without the summarize output option?
- Yes, there are differences. As you can see, the activation of the Summarize option reduces the size of the graph because this parameter causes families of GO terms to be conflated. Each major branch in the full results is still present in the summarized results, but now is reduced to 1 or 2 most representative terms, leading to a graph that is much easier to interpret while still containing all the key functional information.
Another approach to reduce the complexity of the results is to use a shallower version of GO, the GO slims. GO slims are transversal cuts of GO that cover all key branches but lack specific terms. Thus, using them leads to much simpler results than using the full GO, but also leads to a substantial loss in specificity, which is greater than that of the Summarize Output option. You can download slimmed versions of GO from the Gene Ontology website.
To test the GO slim approach, let us use the mouse dataset again. First, however, we need to use the GOSlimmer tool to convert the annotations file from full GO to GO slim (as GO annotations are typically made to terms that are too specific to be in the GO slim, and thus need to be extended by the true path rule).
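The idea behind this conversion can be sketched in Python: each annotated term is extended to itself plus all of its ancestors, and only the terms that belong to the slim are kept. This is a simplified illustration, not GOSlimmer's actual algorithm; in particular, it keeps every slim ancestor rather than only the closest one. The function name, the term names and the gene names are hypothetical.

```python
def map_to_slim(annotations, parents, slim_terms):
    """Map {gene: {full-GO terms}} onto the slim via the true path rule."""
    slimmed = {}
    for gene, terms in annotations.items():
        reachable = set()      # the annotated terms and all their ancestors
        stack = list(terms)
        while stack:
            term = stack.pop()
            if term not in reachable:
                reachable.add(term)
                stack.extend(parents.get(term, []))
        slimmed[gene] = reachable & slim_terms
    return slimmed


# Hypothetical ontology fragment: {term: [direct parents]}.
PARENTS = {
    "T:specific": ["T:intermediate"],
    "T:intermediate": ["T:broad"],
    "T:broad": ["T:root"],
    "T:other": ["T:root"],
}
SLIM = {"T:broad", "T:root"}   # the slim keeps only the shallow terms
ANNOTATIONS = {"geneX": {"T:specific"}, "geneY": {"T:other"}}

SLIM_ANNOTATIONS = map_to_slim(ANNOTATIONS, PARENTS, SLIM)
for gene in sorted(SLIM_ANNOTATIONS):
    print(gene, sorted(SLIM_ANNOTATIONS[gene]))
```

In this sketch, `geneX` loses its specific annotation and is represented only by the broader slim terms, which is the loss of specificity discussed above.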
Hands-on
- Upload the goslim_generic.obo file to Galaxy.
- Rename the goslim_generic.obo file to GO Slim.
- Run GOSlimmer tool with the following parameters:
  - “Full Gene Ontology File”: GO
  - “GOSlim File”: GO Slim
  - “Gene Product Annotation File”: GO annotations Mus musculus
- This will generate one file called Slim Annotations.
Now we will use the GOEnrichment tool with the new Slim Annotations file and the same study set.
Hands-on
- GOEnrichment tool with the following parameters:
Question
- What differences do you observe when comparing the results obtained with the GO Slim to those obtained with the full GO, with the Summarize Output option?
[Figures: Cellular Component with full GO; Cellular Component with GO Slim]
- The differences that you observe are due to the ontology used. When we apply the summarize option with the full GO, the GOEnrichment tool will return a summarized output (as we have seen previously). When we opted for GO Slim, the original annotation was already summarized, resulting in an even more summarized output, but with a consequent loss of specificity.
Interpretation of the results
The interpretation of the results will depend on the biological information that we intend to extract. Enrichment analysis can be used in validation (e.g., of a protocol for extracting membrane proteins), characterization (e.g., of the effects of a stress on an organism) and elucidation (e.g., of the functions impacted by the knock-out of a transcription factor).
There is one important point to keep in mind during the analysis: statistically significant is different from biologically meaningful. That said, it is typically possible to obtain some biological or technical insight about the underlying experiment from statistically enriched terms, even if it isn’t readily apparent.
Terms that are very generic tend to be difficult to interpret, because the meaning they convey is shallow. On the other hand, very specific terms are generally not integrative and thus not useful in interpreting a gene set collectively. The interesting terms are those that are sufficiently specific to transmit substantial biological meaning, yet generic enough to integrate multiple genes.
For the second exercise, we will continue to work with the same study set as before, but now we will analyze the over- and under-expressed genes separately, and see which GO terms are enriched in the brain and in the heart of the mouse.
Hands-on
- Upload the mouseOverexpressed.txt and mouseUnderexpressed.txt files to Galaxy.
Comment: The differentially expressed genes can be identified using the adjusted p-value (also known as FDR). The logFC values indicate whether genes are more expressed (logFC>0) or less expressed (logFC<0) in one condition compared with another.
- GOEnrichment tool with the following parameters for both study files (mouseOverexpressed.txt and mouseUnderexpressed.txt).
This will generate 12 files, 6 for each study file, as in the previous cases.
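The two study files are provided ready-made, but the split described in the comment above can be sketched in Python as follows. This is an illustration under assumptions: each row is a `(gene, logFC, adjusted p-value)` tuple, missing p-values are represented as `None`, and the function name and example values are hypothetical.

```python
def split_by_direction(rows, threshold=0.05):
    """Split significant genes into over-expressed (logFC > 0) and under-expressed (logFC < 0)."""
    over, under = [], []
    for gene, log_fc, padj in rows:
        if padj is None or padj >= threshold:
            continue  # not tested, or not significant
        if log_fc > 0:
            over.append(gene)
        elif log_fc < 0:
            under.append(gene)
    return over, under


# Hypothetical differential expression results.
ROWS = [
    ("geneA", 2.3, 0.001),
    ("geneB", -1.7, 0.020),
    ("geneC", 0.4, 0.600),
    ("geneD", -3.1, None),
    ("geneE", 1.1, 0.049),
]

overexpressed, underexpressed = split_by_direction(ROWS)
print("over:", overexpressed)
print("under:", underexpressed)
```

Each of the two resulting gene lists is then used as a separate study set, with the same population set for both.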
Question: Analyze both Biological Process tables. According to the results, which tissue do the over- and underexpressed genes correspond to?
The samples come from brain and heart tissue, so the results in the tables (and in the graphs) should reflect the specific functions of each organ. When we analyze the tables of enriched functional terms, we see that the underexpressed genes reveal functions related to the brain, whereas the overexpressed genes reveal functions related to muscle/heart function.
Conclusion
Functional enrichment is a good way to look for patterns in gene lists, but interpretation of results can become a complicated process. One way to reduce this complexity is to use the GOEnrichment tool. This tool not only performs the GO Enrichment test, showing us enriched GO terms from our sets, but also contains functionality to simplify the results and make them more easily interpretable. Independently of this, we need to be careful when choosing not only our genes of interest but also the background set of genes against which we compare them.