What is Gene Ontology (GO) enrichment analysis, and why should I perform it on my marker genes?
How can I use GO enrichment analysis to better understand the biological functions of the genes in my clusters?
Can GO enrichment analysis help me confirm that my clusters represent distinct cell types or states?
How can I visualize my GO enrichment results to make them easier to understand and interpret?
Objectives:
Understand the role of GO Enrichment in Single-Cell Analysis.
Use marker genes from different cell clusters or conditions for GO enrichment analysis.
Compare enrichment across experimental conditions (e.g., wild type vs. knockout) to uncover functional changes associated with genetic or environmental perturbations.
Link GO enrichment results with previously annotated cell clusters, providing a clearer picture of the functional roles of different cell populations.
In the tutorial Filter, plot and explore single-cell RNA-seq data with Scanpy, we took an important step in our single-cell RNA sequencing analysis by identifying marker genes for each of the clusters in our dataset. These marker genes are crucial, as they help us distinguish between different cell types and states, giving us a clearer picture of the cellular diversity within our samples.
However, simply identifying marker genes is just the beginning. To truly understand what makes each cluster unique, we need to dive deeper into the biological functions these genes are involved in. This is where Gene Ontology (GO) enrichment analysis comes into play.
We will perform GO enrichment analysis as a type of over-representation analysis (ORA). ORA is a statistical method that determines whether genes from pre-defined sets (e.g. genes belonging to a specific GO term) are present in a subset of your data (e.g. a list of marker genes) more often than would be expected by chance. The most commonly used statistical tests are Fisher’s exact test and the hypergeometric test; more details about them are explained in the tutorial slides.
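As a minimal sketch of the hypergeometric test behind ORA, the function below computes the probability of seeing at least a given number of term-annotated genes in a study set drawn from the population. The function name, parameter names, and the numbers in the example are illustrative assumptions, not values from the tutorial data or the implementation used by any Galaxy tool.

```python
from math import comb


def hypergeom_enrichment_pvalue(study_hits, study_size, pop_hits, pop_size):
    """One-sided hypergeometric p-value, P(X >= study_hits).

    study_hits: genes in the study set annotated to the GO term
    study_size: total genes in the study set
    pop_hits:   genes in the population (background) annotated to the term
    pop_size:   total genes in the population
    """
    max_hits = min(study_size, pop_hits)
    total = comb(pop_size, study_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


# Illustrative numbers only (hypothetical, not taken from the tutorial data):
# 8 of 50 marker genes carry a term that annotates 200 of 15000 background genes.
p = hypergeom_enrichment_pvalue(study_hits=8, study_size=50, pop_hits=200, pop_size=15000)
print(f"p = {p:.3g}")
```

With these numbers fewer than one annotated gene would be expected by chance (50 × 200 / 15000 ≈ 0.67), so observing 8 gives a very small p-value. The one-sided Fisher's exact test on the corresponding 2×2 table yields the same probability.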
We’ll start with two input datasets of marker genes (Study sets):
Marker genes per cell cluster: This dataset lists the genes that are significantly different in each cell cluster.
Marker genes per condition (wt and ko): This dataset lists the genes that are significantly different between the wild-type (wt) and knockout (ko) conditions.
Note: Marker genes were obtained using the Scanpy FindMarkers tool, which selects marker genes based on their log2 fold change and p-values. The top 50 marker genes were included in the downstream GO enrichment analysis. Focusing on the top-ranked genes filters out less relevant genes, which helps limit the false positives that can otherwise arise in enrichment results.
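The selection of top-ranked markers can be sketched as follows. This is one plausible way to do it, not necessarily what the Scanpy FindMarkers tool does internally: the column names (`gene`, `logfoldchanges`, `pvals_adj`), the significance cutoff, and the toy rows are assumptions made for illustration.

```python
def top_markers(rows, n=50, max_padj=0.05):
    """Keep significant genes, then return the n with the largest log2 fold change.

    rows: list of dicts with hypothetical keys 'gene', 'logfoldchanges', 'pvals_adj'.
    """
    significant = [r for r in rows if r["pvals_adj"] <= max_padj]
    ranked = sorted(significant, key=lambda r: r["logfoldchanges"], reverse=True)
    return [r["gene"] for r in ranked[:n]]


# Toy rows (made-up values, for illustration only)
rows = [
    {"gene": "GeneA", "logfoldchanges": 3.2, "pvals_adj": 0.001},
    {"gene": "GeneB", "logfoldchanges": 5.0, "pvals_adj": 0.2},
    {"gene": "GeneC", "logfoldchanges": 1.1, "pvals_adj": 0.01},
    {"gene": "GeneD", "logfoldchanges": 4.0, "pvals_adj": 0.04},
]
print(top_markers(rows, n=2))
```

GeneB has the largest fold change but is dropped because it does not pass the significance cutoff.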
GO Enrichment Files:
We’ll also use three additional files for GO enrichment analysis.
Gene Ontology file: This file contains information about Gene Ontology terms.
GO Annotations file: This file maps genes to their corresponding GO terms.
Population set file: This file provides a list of all genes involved in the experiment and is used as a background gene set for the analysis.
Note: There are several online databases available for downloading GO and GO Annotations files, including the Gene Ontology website, Ensembl, and the UCSC Genome Browser.
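To make the role of the GO Annotations file concrete, here is a small sketch that reads annotation lines into a gene-to-terms mapping. It assumes the file is in the tab-separated GAF format distributed by the Gene Ontology website, where comment lines start with `!`, column 2 holds the gene product ID, column 4 the qualifier, and column 5 the GO ID. The example lines and IDs are placeholders, and an annotation file using other identifiers (such as Ensembl gene IDs) or another layout would need different column choices.

```python
from collections import defaultdict


def parse_gaf(lines):
    """Map each gene product ID (GAF column 2) to the set of GO IDs (column 5)."""
    gene_to_terms = defaultdict(set)
    for line in lines:
        # Skip blank lines and header/comment lines, which start with '!'
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        # Skip negated annotations (qualifier column contains NOT)
        if "NOT" in fields[3].split("|"):
            continue
        gene_to_terms[fields[1]].add(fields[4])
    return dict(gene_to_terms)


# Placeholder lines (hypothetical IDs, truncated after the GO ID column)
example = [
    "!gaf-version: 2.2\n",
    "DB\tGENE1\tSym1\tinvolved_in\tGO:0000001\n",
    "DB\tGENE1\tSym1\tenables\tGO:0000002\n",
    "DB\tGENE2\tSym2\tNOT|involved_in\tGO:0000001\n",
    "DB\tGENE3\tSym3\tlocated_in\tGO:0000003\n",
]
annotations = parse_gaf(example)
print(annotations)
```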
Comment: Concept behind GO Enrichment Analysis
The goal of GO enrichment analysis is to interpret the biological significance of long lists of marker genes by summarizing these genes into a shorter list of enriched GO terms. The analysis works by comparing each GO term between your list of marker genes and a background gene set. Statistical tests are then used to calculate a p-value that indicates whether a particular GO term is significantly enriched in the marker gene list compared to the background.
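The comparison described above can be sketched as a counting step: for each GO term, count how many marker genes and how many background genes carry it, and compare the two proportions. The function name, the toy gene and term labels, and the fold-enrichment column are assumptions for illustration. A statistical test on these counts would then give the p-value.

```python
def term_counts(study_genes, population_genes, gene_to_terms):
    """Return (term, study_hits, study_size, pop_hits, pop_size, fold) per term.

    Only terms found in at least one study gene are reported, sorted by
    fold enrichment (proportion in study / proportion in population).
    """
    population = set(population_genes)
    study = set(study_genes) & population  # study genes must be in the background
    terms = {t for g in population for t in gene_to_terms.get(g, ())}
    rows = []
    for term in sorted(terms):
        pop_hits = sum(1 for g in population if term in gene_to_terms.get(g, ()))
        study_hits = sum(1 for g in study if term in gene_to_terms.get(g, ()))
        if study_hits == 0:
            continue
        fold = (study_hits / len(study)) / (pop_hits / len(population))
        rows.append((term, study_hits, len(study), pop_hits, len(population), fold))
    return sorted(rows, key=lambda r: r[-1], reverse=True)


# Toy data (hypothetical gene and term labels)
gene_to_terms = {
    "g1": {"TERM_A"},
    "g2": {"TERM_A", "TERM_B"},
    "g3": {"TERM_B"},
}
population = ["g1", "g2", "g3", "g4", "g5", "g6"]
study = ["g1", "g2"]
for row in term_counts(study, population, gene_to_terms):
    print(row)
```

Here TERM_A annotates both study genes but only 2 of 6 background genes, so it is three times more frequent in the study set than in the background.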
Get data
You can access the data for this tutorial in multiple ways:
Importing from a history - You can import this history
Open the link to the shared history
Click on the Import this history button on the top left
Enter a title for the new history
Click on Copy History
Uploading from Zenodo (see below)
Hands On: Data Upload from Zenodo
Create a new history for this tutorial
To create a new history, simply click the new-history icon at the top of the history panel.
Click Upload Data at the top of the tool panel
Select Paste/Fetch Data
Paste the link(s) into the text field
Press Start
Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
Go into Libraries (left panel)
Navigate to the correct folder as indicated by your instructor.
On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
Select the desired files
Click on Add to History near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
“Select history”: the history you want to import the data to (or create a new one)
Click on Import
Rename the datasets
Check that the datatype is tabular
Click on the pencil icon for the dataset to edit its attributes
In the central panel, click the Datatypes tab on the top
In the Assign Datatype section, select tabular from the “New type” dropdown
Tip: you can start typing the datatype into the field to filter the dropdown menu
Click the Save button
Important tips for easier analysis
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
Open your Galaxy server
Click on the curriculum icon on the top menu; this will open the GTN inside Galaxy.
Navigate to your tutorial
Tool names in tutorials will be blue buttons that open the correct tool for you
Note: this does not work for all tutorials (yet)
You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
We’ve had some issues with Tutorial mode on Safari for Mac users.
Try a different browser if you aren’t seeing the button.
Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.
The Single Cell Omics Lab is a different view of the underlying Galaxy server that organises tools and resources better for single-cell users! It also provides a platform for communities to engage and connect; distribute more targeted news and events; and highlight community-specific funding sources.
When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.
Expand the red history dataset by clicking on it.
Sometimes you can already see an error message here
View the error message by clicking on the bug icon
Check the logs. Output (stdout) and error logs (stderr) of the tool are available:
Expand the history item
Click on the details icon
Scroll down to the Job Information section to view the two logs (stdout and stderr).
To perform GO enrichment analysis on each cell cluster individually, we need to split the “Markers_clusters Dataset” into seven files, one for each cluster, and the “Markers_genotype Dataset” into two files, one for each condition. We’ll use the “Split file” tool for this step.
Hands On: File splitting
Split file (Galaxy version 0.4) with the following parameters:
“File to select”: Markers_cluster (Input dataset)
“on column”: c1
“Include the header in all splitted files?”: Yes
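Outside Galaxy, the same splitting step could be sketched as below. This is an illustration of the idea (group rows by the value in the first column and repeat the header in every output file), not the Split file tool's own code. The function name, the `.tabular` output naming, and the example path in the comment are assumptions.

```python
from collections import defaultdict
from pathlib import Path


def split_by_column(path, out_dir, column=0):
    """Split a tab-separated file into one file per distinct value in `column`.

    `column` is 0-based, so column=0 corresponds to c1 in Galaxy.
    The header line is copied into every output file.
    Returns a dict mapping each value to the path of its output file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    groups = defaultdict(list)
    with open(path) as handle:
        header = handle.readline()
        for line in handle:
            if line.strip():
                key = line.rstrip("\n").split("\t")[column]
                groups[key].append(line if line.endswith("\n") else line + "\n")
    written = {}
    for key, lines in groups.items():
        out_path = out_dir / f"{key}.tabular"
        out_path.write_text(header + "".join(lines))
        written[key] = out_path
    return written


# Hypothetical usage: split_by_column("Markers_cluster.tabular", "split_output")
```

Applied to a marker table whose first column holds the cluster label, this would give one file per cluster, each starting with the original header line.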
Comment: Input Dataset
We have two datasets: one with the marker genes for all seven clusters and one with the marker genes for the knockout (KO) and wild-type (WT) conditions. Make sure to repeat the analysis for each dataset. Alternatively, you can run this workflow to analyze the datasets in parallel: under Marker genes, choose the second icon to select multiple datasets.
Next, we need to isolate the Ensembl gene IDs column from each file. We’ll use the “Cut Columns” tool to achieve this.
Hands On: Extract Ensembl IDs
Cut with the following parameters:
“Cut columns”: c4
“From”: split_output (output of Split file tool)
Comment: The gene format to use
In this example we extract column 4 because it contains the Ensembl gene IDs, which the subsequent steps expect as input. Other gene identifier formats exist, such as gene symbols and Entrez gene IDs, so make sure to check the specific format accepted by the tool you are using. There are also tools available to convert between different gene formats if needed.
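The column extraction can be sketched in a few lines. The function name and the toy rows are hypothetical, and the IDs are placeholders rather than real Ensembl identifiers. Whether the header line should be kept depends on the downstream tool, so it is left as an option here.

```python
def cut_column(lines, column=4, skip_header=False):
    """Return the values of a 1-based column from tab-separated lines."""
    values = []
    for index, line in enumerate(lines):
        if skip_header and index == 0:
            continue
        fields = line.rstrip("\n").split("\t")
        # Ignore rows that are too short or have an empty value in the column
        if len(fields) >= column and fields[column - 1]:
            values.append(fields[column - 1])
    return values


# Toy rows (placeholder IDs, hypothetical column layout)
table = [
    "cluster\trank\tsymbol\tensembl_id\n",
    "1\t0\tSym1\tENSMUSG_EXAMPLE_1\n",
    "1\t1\tSym2\tENSMUSG_EXAMPLE_2\n",
]
print(cut_column(table, column=4, skip_header=True))
```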
GO Analysis using GOEnrichment tool
Now we will perform the GO enrichment analysis on the list of Ensembl gene IDs.
Hands On: GOEnrichment
GOEnrichment (Galaxy version 2.0.1) with the following parameters:
“Gene Ontology File”: GO (Input dataset)
“Gene Product Annotation File”: GO annotations Mus musculus (Input dataset)
“Study Set File”: out_file1 (output of Cut tool)
“Population Set File (Optional)”: Background gene set (Input dataset)
Comment: Population Set File Selection for GO Enrichment
When choosing a background for GO enrichment analysis (Population Set File), it’s important to consider the context of your data. While using a broad background (like all genes in the organism) is common, it might be more informative to limit the background to genes expressed in the specific tissue or cell type being profiled. In this tutorial, the background consists of all genes present in the experiment before the marker genes were selected.
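As a rough sketch of building such a background, the helper below keeps genes detected in at least a minimum number of cells, and a second helper reports study genes missing from the background, since every study gene should also be in the population set. The data structure, the `min_cells` threshold, and the gene names are hypothetical, and this is not how the tutorial's background file was produced.

```python
def expressed_genes(counts, min_cells=1):
    """Return genes with a non-zero count in at least `min_cells` cells.

    counts: dict mapping gene ID -> list of per-cell counts (toy structure).
    """
    return sorted(
        gene
        for gene, per_cell in counts.items()
        if sum(1 for c in per_cell if c > 0) >= min_cells
    )


def missing_from_population(study_genes, population_genes):
    """Return study genes that are absent from the population (background) set."""
    return sorted(set(study_genes) - set(population_genes))


# Toy counts (made-up values): geneB is never detected
counts = {"geneA": [0, 3, 1], "geneB": [0, 0, 0], "geneC": [2, 0, 0]}
population = expressed_genes(counts, min_cells=1)
print(population)                                           # ['geneA', 'geneC']
print(missing_from_population(["geneA", "geneB"], population))  # ['geneB']
```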
Question
Take a look at the enriched terms for the different clusters. Can you find any GO terms that are specific to cluster 7?
Can we perform manual annotation of cluster 7 based on GO enrichment results?
Cluster 7 is enriched for terms like “regulation of cell death”, “T cell-mediated cytotoxicity”, and “peptidase activator activity involved in the apoptotic process”.
By looking at the most enriched functions and using our biological knowledge, we can infer the cell types for many clusters. For example, since the data comes from thymus tissue, we already have an idea of the cell types we might find. The enriched terms in cluster 7 are consistent with macrophages, which support thymocyte maturation by cleaning up dead cells and debris.
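Finding terms that are specific to one cluster can be sketched with set operations: take the terms enriched in that cluster and subtract every term enriched in any other cluster. The function name and the contents of the second cluster are placeholders for illustration, and only the two cluster 7 term names come from the answer above.

```python
def cluster_specific_terms(enriched, cluster):
    """Return terms enriched in `cluster` and in no other cluster.

    enriched: dict mapping cluster name -> set of enriched GO term names.
    """
    others = set().union(
        *(terms for name, terms in enriched.items() if name != cluster)
    )
    return sorted(set(enriched[cluster]) - others)


# Illustrative input: cluster_1 and "shared placeholder term" are hypothetical
enriched = {
    "cluster_7": {
        "regulation of cell death",
        "T cell-mediated cytotoxicity",
        "shared placeholder term",
    },
    "cluster_1": {"shared placeholder term", "another placeholder term"},
}
print(cluster_specific_terms(enriched, "cluster_7"))
```

Terms that survive this filter are good starting points for manual annotation, although a term shared between clusters can still be biologically meaningful.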
GO Analysis using gProfiler GOSt tool
g:GOSt, the functional profiling tool of g:Profiler (named gProfiler GOSt in Galaxy), is another popular tool for Gene Ontology (GO) enrichment analysis. In addition to providing enrichment results for the standard GO categories of Biological Process (BP), Cellular Component (CC), and Molecular Function (MF), the tool also analyzes enrichment across several other functional annotation databases, including KEGG Pathways, Reactome Pathways, WikiPathways and TF Targets. It also produces a plot to help visualize the results.
Hands On: gProfiler GOSt
gProfiler GOSt (Galaxy version 0.1.7+galaxy11) with the following parameters:
“Input is whitespace-separated list of genes, proteins, probes, term IDs or chromosomal regions.”: out_file1 (output of Cut tool)
“Organism”: Common organisms
“Common organisms”: Mus musculus (Mouse)
In “Tool settings”:
“Export plot”: Yes
Comment: Picking the right species matters
The species you select should match the species your genes come from. If you choose the wrong species, the tool will look up your genes in the wrong annotation: identifiers may not be recognized, or they may be matched to annotations that do not apply, leading to inaccurate results. For example, mouse Ensembl gene IDs (prefixed ENSMUSG) differ from human ones (prefixed ENSG), so selecting the correct species ensures the analysis is relevant to your data.
Question
Can you find enriched GO terms in the KO results file that are in line with the published study findings?
What might be happening to the stem cells in the KO mice compared to the WT mice?
In the KO g:GOSt result file, the GO term “Negative regulation of stem cell differentiation” is enriched.
This suggests that the KO condition delays the differentiation of stem cells into mature T cells in the thymus, which is in line with the study findings.
Conclusion
In this tutorial, we have performed GO enrichment analysis on the differentially expressed genes between 2 different conditions and between different cell types. This analysis provided valuable insights into the biological processes, molecular functions, and cellular components associated with the gene sets, enhancing our understanding of the underlying mechanisms involved in the studied conditions.
Key points
GO enrichment helps make sense of your data and understand what makes each cell cluster/condition unique.
GO enrichment analysis is used to discover new insights about how cells work, which can lead to better understanding of biological processes and diseases.
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{single-cell-GO-enrichment,
author = "Menna Gamal",
title = "GO Enrichment Analysis on Single-Cell RNA-Seq Data (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/GO-enrichment/tutorial.html}",
note = "[Online; accessed TODAY]"
}