Single-cell RNA sequencing can be sensitive to both biological and technical variation, which is why preparing your data carefully is an important part of the analysis. You want the results to reflect the interesting differences in expression between cells that relate to their type or state. Other sources of variation can conceal or confound this, making it harder for you to see what is going on.
One common biological confounder is the cell cycle (Luecken and Theis 2019). Cells express different genes during different parts of the cell cycle, depending on whether they are in their growing phase (G1), duplicating their DNA (the S or Synthesis phase), or dividing in two (G2/M or Mitosis phase). If these cell cycle genes are having a big impact on your data, then you could end up with separate clusters that actually represent cells of the same type that are just at different stages of the cycle.
In this tutorial, we will identify the genes whose expression is known to vary during the cell cycle so that we can use them to regress out (or remove) the effects of the cell cycle on the clustering.
Comment: Other Scanpy and Seurat tutorials
This tutorial is based on the Scanpy cell cycle regression tutorial (Cittaro 2018), which was itself based on the Seurat vignette addressing the same issue (Paul Hoffman 2022). However, we will be using a different dataset for this tutorial.
The data used in this tutorial is from a mouse dataset of fetal growth restriction (Bacon et al. 2018). Cell cycle regression should be performed after the data has been filtered, normalised, and scaled. You can download the dataset below or import the history with the starting data.
Comment
If you’ve been working through the Single-cell RNA-seq: Case Study then you can use your dataset from the Filter, Plot and Explore Single-cell RNA-seq Data tutorial here. Select the Use_me_Scaled dataset from your history. Rename that dataset as Processed_Anndata. You will still need to import the S and G2/M gene lists below through Zenodo.
At the end of this tutorial, you can return to the main tutorial to plot and explore your data with reduced effects from the cell cycle.
In addition to the scRNA-seq dataset, we will also need lists of the genes that are known to be expressed at different points in the cell cycle. The lists used in this tutorial are part of the HBC tinyatlas and can be downloaded from Zenodo below (Kirchner and HBC 2018). Between them, they include 97 genes that are expressed during the S and G2/M phases. The expression levels of these cell cycle genes are mostly determined by the phase of the cell cycle. Make sure that the file type is tabular (not just the name of the file) - you can choose this when you download the files or change it after the files are in your history.
Hands On: Option 1: Data upload - Import history
Sometimes data upload can take a while, so a faster route is to import a history.
Click Upload Data at the top of the tool panel
Select Paste/Fetch Data
Paste the link(s) into the text field
Press Start
Close the window
Rename the datasets sPhase and g2mPhase respectively - be careful not to mix them up!
Click on the pencil icon for the dataset to edit its attributes
In the central panel, change the Name field
Click the Save button
Check that the datatype for both is tabular
Click on the pencil icon for the dataset to edit its attributes
In the central panel, click the Datatypes tab at the top
Under Assign Datatype, select tabular from the “New type” dropdown
Tip: you can start typing the datatype into the field to filter the dropdown menu
Click the Save button
Important tips for easier analysis
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
Open your Galaxy server
Click on the curriculum icon on the top menu. This will open the GTN inside Galaxy.
Navigate to your tutorial
Tool names in tutorials will be blue buttons that open the correct tool for you
Note: this does not work for all tutorials (yet)
You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
We’ve had some issues with Tutorial mode on Safari for Mac users.
Try a different browser if you aren’t seeing the button.
Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.
The Single Cell Omics Lab is a different view of the underlying Galaxy server that organises tools and resources better for single-cell users! It also provides a platform for communities to engage and connect; distribute more targeted news and events; and highlight community-specific funding sources.
The first step towards reducing the effects of the cell cycle on our dataset is cell cycle scoring. The cell cycle scoring algorithm will look at each cell in turn and calculate an S score based on the difference between the mean expression of the S Phase genes and the mean expression of a random sample of the same number of non-cell cycle genes from the same cell. It will do the same for the G2M genes in order to calculate the G2M score. The cells will then be assigned to the most likely phase: S, G2M, or G1 if neither the S nor the G2M score is high. Three columns will be added to the AnnData dataset: S_score, G2M_score and phase.
Question
Why don’t we need a list of genes expressed in the G1 Phase?
Since we know which genes are expressed in the S and G2/M phases, we can classify cells that are expressing these genes into the S and G2/M phases respectively. Cells that aren’t expressing either the S or G2/M genes must be in the other phase of the cell cycle, so we can classify them as in G1 phase.
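The scoring and phase assignment described above can be sketched for a single cell in plain Python. This is a simplified illustration, not the code the Galaxy tool runs: the expression values are made up, the helper names `score_cell` and `assign_phase` are hypothetical, and the rule that a cell is G1 when neither score is above zero is an assumption made for this sketch.

```python
import random
from statistics import mean


def score_cell(expression, gene_set, control_pool, rng):
    """Mean expression of gene_set minus the mean of an equally sized random control sample."""
    present = [g for g in gene_set if g in expression]
    controls = rng.sample(control_pool, k=len(present))
    return mean(expression[g] for g in present) - mean(expression[g] for g in controls)


def assign_phase(s_score, g2m_score):
    """Assumed rule: G1 when neither score is positive, otherwise the higher-scoring phase."""
    if s_score <= 0 and g2m_score <= 0:
        return "G1"
    return "S" if s_score > g2m_score else "G2M"


# Hypothetical toy cell: the expression values are invented for illustration.
cell = {"Mcm5": 2.0, "Pcna": 1.8, "Top2a": 0.1, "Cdk1": 0.0,
        "Actb": 0.5, "Gapdh": 0.4, "Rpl3": 0.6, "Eef1a1": 0.5}
s_genes = ["Mcm5", "Pcna"]
g2m_genes = ["Top2a", "Cdk1"]
# Control genes are everything that is not in either cell cycle list.
control_pool = [g for g in cell if g not in s_genes + g2m_genes]

rng = random.Random(0)  # fixed seed so the random control sample is reproducible
s_score = score_cell(cell, s_genes, control_pool, rng)
g2m_score = score_cell(cell, g2m_genes, control_pool, rng)
phase = assign_phase(s_score, g2m_score)
print(s_score, g2m_score, phase)
```

No G1 gene list appears anywhere in the sketch: G1 is simply the outcome when neither of the two scores stands out.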
Comment: When should we regress out the effects of the cell cycle?
Cell cycle regression can be particularly important if we are planning to do trajectory analysis down the line or if we have a dataset that is very strongly influenced by the cell cycle (Luecken and Theis 2019). However, it isn’t always appropriate to remove the effects of the cell cycle - sometimes they can be useful for distinguishing between dividing and non-dividing cell types. When you are analysing your own data, you might need to try it both ways to determine if the effects of the cell cycle are helpful or not. You could also check whether the cell cycle genes are among the top scoring genes expressed by your cell clusters to get an idea of how strong the effects are.
Hands On: Score the cell cycle genes
Inspect and manipulate (Galaxy version 1.7.1+galaxy0) with the following parameters:
“Annotated data matrix”: Processed_Anndata (Input dataset)
“Method used for inspecting”: Score cell cycle genes, using 'tl.score_genes_cell_cycle'
“Format for the list of genes associated with S phase”: File
“File with the list of genes associated with S phase”: sPhase (Input dataset)
“Format for the list of genes associated with G2M phase”: File
“File with the list of genes associated with G2M phase”: g2mPhase (Input dataset)
Rename the output CellCycle_Annotated
Cell Cycle Regression
The second step, after scoring the cell cycle genes, is to use these scores to regress out the effects of the cell cycle. Now that we know which phase each of our cells is in, we can work out how this is affecting gene expression in our cells. We can subtract these effects from our data so that they won’t influence our cell clustering. You will need to type the phase variable into the Scanpy RegressOut tool to enable it to regress out the cell cycle effects.
The Scanpy RegressOut tool will create a linear model of the relationship between gene expression and the cell cycle scores we assigned in the previous step. Basically, this model is a line that shows how gene expression changes as the S or G2M score changes. Each gene will have its own line, so for any S score or G2M score, we could look at the corresponding point on the line to see the expected expression level of that gene.
Scanpy RegressOut will then regress out or remove this expected effect for the genes expressed by each cell, according to the cell’s S and G2M scores. The expected effect is subtracted from the expression data, leaving behind the difference between the expected position on the line and the actual position of each data point. The data points won’t sit exactly on the line because their expression levels aren’t determined completely by the cell cycle - the linear model only tells us what we would expect based on the cell cycle scores alone.
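To make the idea of "subtracting the expected effect" concrete, here is a minimal sketch for one gene and one numeric score, written in plain Python. It fits a straight line by ordinary least squares and keeps only the residuals. The numbers and the function name `regress_out` are hypothetical, and the sketch deliberately uses a single numeric covariate for simplicity, so it is not a reproduction of what the Scanpy RegressOut tool does internally.

```python
from statistics import mean


def regress_out(expression, covariate):
    """Fit expression = intercept + slope * covariate and return the residuals."""
    x_bar, y_bar = mean(covariate), mean(expression)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed expression minus the value expected from the line.
    return [y - (intercept + slope * x) for x, y in zip(covariate, expression)]


# Hypothetical values for one gene across five cells.
s_scores = [-0.5, -0.2, 0.0, 0.4, 0.8]
gene_expr = [0.1, 0.5, 0.6, 1.4, 1.9]

residuals = regress_out(gene_expr, s_scores)
print(residuals)
```

The residuals no longer trend with the score, but they are not all zero: what remains is the part of the expression that the line could not explain.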
Understanding how the regression works should help you to see why we’re not just deleting the cell cycle genes from the dataset. We are using these genes that are known to be expressed during different phases to calculate the cell cycle scores and determine the phase each cell is in. We then use this information to work out the effect of the cell cycle on all of the other genes expressed by the cells. Even if they are not cell cycle genes, their expression can still be affected by the cycle. Finally, we remove or regress out the effect of the cell cycle on all the genes, leaving behind the variation that we’re interested in.
Hands On: Regress out the effects of the cell cycle
Scanpy RegressOut (Galaxy version 1.8.1+galaxy0) with the following parameters:
“Input object in AnnData/Loom format”: CellCycle_Annotated (output of Inspect and manipulate tool)
“Variables to regress out”: phase
Rename the output CellCycle_Regressed
Plotting the Effects of Cell Cycle Regression
Your data is now ready for further analysis, so you can return to the Filter, Plot and Explore Single-cell RNA-seq Data tutorial and move on to the Preparing coordinates step there. Make sure that you use the CellCycle_Regressed dataset (you may want to rename it as Use_me_Scaled as that is the name used in the main tutorial). However, if you want to understand how cell cycle regression has affected your data then you can work through the following steps first, to visualise where the cell cycle genes are expressed - with or without regression.
In order to look at the cell cycle genes, we first need to label them in our AnnData dataset so that we can select them for plotting. To add a new annotation to the genes or variables, we need a column with entries for each gene, in the same order as in the dataset, and with a header at the top that will become the key for identifying these entries in the AnnData dataset. We want to end up with a column that reads TRUE for the 97 cell cycle genes and FALSE for all the other genes.
You might find it easier to create this new column using a spreadsheet and then upload it as a tabular dataset, but it is possible to complete all the steps on Galaxy.
Prepare a table of cell cycle genes
If we’re going to mark all the cell cycle genes, we’ll need a single list of all 97 genes instead of the two separate lists for S Phase and G2/M Phase. We’ll combine the two lists into a single column with 97 entries. We’ll then add a second column that simply reads TRUE in every row, which we’ll use later to mark these as cell cycle genes in the main dataset.
Hands On: Create a list of all cell cycle genes
Concatenate datasets with the following parameters:
Add column (Galaxy version 1.0.0) with the following parameters:
“Add this value”: TRUE
“to Dataset”: out_file1 (output of Concatenate datasets tool)
Rename the dataset CC_Genes
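The result of these two steps is a two-column table. As a rough illustration in Python, using short hypothetical excerpts in place of the real sPhase and g2mPhase lists:

```python
# Hypothetical excerpts standing in for the sPhase and g2mPhase datasets.
s_phase = ["Mcm5", "Pcna"]
g2m_phase = ["Top2a", "Cdk1"]

# Concatenate the two lists, then add a second column that reads TRUE in every row.
cc_genes = [(gene, "TRUE") for gene in s_phase + g2m_phase]

for gene, flag in cc_genes:
    print(f"{gene}\t{flag}")
```

With the full lists, the table would have one row for each of the 97 genes.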
Create an ordered list of gene names
Next, we’ll need a list of all the genes in our dataset, so that we can mark the ones that are in our cell cycle list. We’ll also add a column of numbers as this will help us keep the gene names in order.
Hands On: Get the gene names from your dataset
Inspect AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
“Annotated data matrix”: CellCycle_Regressed (Input dataset)
“What to inspect?”: Key-indexed annotation of variables/features (var)
Table Compute (Galaxy version 1.2.4+galaxy0) with the following parameters:
“Input Single or Multiple Tables”: Single Table
“Table”: var (output of Inspect AnnData tool)
“Type of table operation”: Drop, keep or duplicate rows and columns
“List of columns to select”: 1
“Output formatting options”: Unselect all
Add column (Galaxy version 1.0.0) with the following parameters:
“to Dataset”: table (output of Table Compute tool)
“Iterate?”: YES
Comment: Keeping the genes in order
Adding these numbers will enable us to keep the genes in their original order. This is essential for adding the cell cycle gene annotation back into the AnnData dataset.
Rename the output Dataset_Genes
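In Python terms, these steps amount to taking the first column of the var table and pairing each gene name with its row number. The table contents below are hypothetical, and starting the numbering at 1 is an assumption made for this sketch.

```python
# Hypothetical tab-separated export of the var annotation, without a header row.
var_table = "Xkr4\tTrue\t0.12\nMcm5\tFalse\t0.80\nSox17\tTrue\t0.33\n"

# Keep only the first column (the gene names), in their original order.
gene_names = [line.split("\t")[0] for line in var_table.splitlines()]

# Pair every gene with its position so the order can be restored later.
dataset_genes = [(gene, number) for number, gene in enumerate(gene_names, start=1)]
print(dataset_genes)
```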
Mark the cell cycle genes
We can now combine our table of cell cycle genes CC_Genes with the table of gene names Dataset_Genes.
Hands On: Combine the two tables
Join (Galaxy version 1.1.2) with the following parameters:
“1st file”: Dataset_Genes (output of Add column tool)
“Column to use from 1st file”: c1
“2nd File”: CC_Genes (output of Add column tool)
“Column to use from 2nd file”: c1
“Output lines appearing in”: All lines [-a 1 -a 2]
“Value to put in unpaired (empty) fields”: FALSE
Comment: How the cell cycle genes are marked
When we join the two tables, we’ll ask for any empty fields to be filled in with FALSE. The cell cycle gene table has an extra column where they are all marked as TRUE - they will retain these entries when we join the tables, but since there are no entries for the rest of the genes in this column, their rows will be filled in as FALSE. This will enable us to pick out the cell cycle genes later.
Sort with the following parameters:
“Sort Dataset”: output (output of Join tool)
“on column”: c2
“everything in”: Ascending order
Comment: Putting the genes in order again
Sorting the genes using the column of numbers we added earlier will put them back in their original order - make sure to sort them in ascending order, otherwise they’ll end up the opposite way around.
Question
What would happen if any of the cell cycle genes were not present in the dataset?
How would we remove these genes from the table?
Any cell cycle genes that weren’t in the dataset would have an empty field in the numbered column, which would be filled in with FALSE when we created the table with the Join tool. These rows would appear at the top of the table after it was sorted.
We should check the first rows of the table for any unnumbered genes and then cut these rows out in the next step.
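The join, the FALSE fill and the sort can be sketched together in Python. The gene names and positions are hypothetical, and one cell cycle gene is deliberately left out of the dataset to show what happens to it. Unlike the Galaxy steps, which leave unnumbered rows at the top of the sorted table to be cut by hand, this sketch separates them out before sorting.

```python
# Hypothetical inputs: numbered dataset genes and the cell cycle gene table.
dataset_genes = [("Xkr4", 1), ("Mcm5", 2), ("Sox17", 3), ("Top2a", 4)]
cc_genes = {"Mcm5": "TRUE", "Top2a": "TRUE", "Cdk1": "TRUE"}  # Cdk1 is not in the dataset

positions = dict(dataset_genes)

# Full outer join on gene name, filling any unpaired field with FALSE.
joined = [
    (gene, positions.get(gene, "FALSE"), cc_genes.get(gene, "FALSE"))
    for gene in sorted(set(positions) | set(cc_genes))
]

# Rows without a position belong to cell cycle genes missing from the dataset.
missing = [row[0] for row in joined if row[1] == "FALSE"]

# Keep the numbered rows and sort them back into the original dataset order.
numbered = sorted((row for row in joined if row[1] != "FALSE"), key=lambda row: row[1])

print(missing)
print(numbered)
```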
Create the annotation column
We now have a table with all the gene names in the same order as the main dataset and a column indicating which ones are cell cycle genes. If we cut this column out of the table, then we can add it as a new annotation to the main dataset. We’ll also need to add a column header, which will be used as the key for this annotation in the AnnData dataset.
Hands On: Create the cell cycle annotation column
Table Compute (Galaxy version 1.2.4+galaxy0) with the following parameters:
“Input Single or Multiple Tables”: Single Table
“Table”: out_file1 (output of Sort tool)
“Input data has:”: Unselect all
“Type of table operation”: Drop, keep or duplicate rows and columns
“List of columns to select”: 3
“Output formatting options”: Unselect all
Comment: Removing rows for missing genes
If there were any cell cycle genes that weren’t present in the main dataset, we could remove them at this stage by excluding them from the List of rows to select. As before, if we were using a dataset of a different size, we would need to change this parameter to include all the rest of the rows.
Create a new tabular file from the following
CC_genes
Click Upload Data at the top of the tool panel
Select Paste/Fetch Data at the bottom
Paste the file contents into the text field
Change Type from “Auto-detect” to tabular
Press Start and Close the window
Concatenate datasets with the following parameters:
“Select”: table (output of Table Compute tool)
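The finished annotation file is simply the header followed by one TRUE or FALSE per gene, in dataset order. A small Python sketch, using hypothetical sorted rows:

```python
# Hypothetical rows from the sorted table: (gene, position, cell cycle flag).
sorted_rows = [("Xkr4", 1, "FALSE"), ("Mcm5", 2, "TRUE"), ("Sox17", 3, "FALSE")]

header = "CC_genes"  # becomes the key of the new annotation
column = [header] + [row[2] for row in sorted_rows]

# One value per line, header first.
annotation_text = "\n".join(column) + "\n"
print(annotation_text)
```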
Add an annotation to the AnnData
We will need to add the annotation to both the annotated dataset CellCycle_Annotated and to the one that we created by regressing out the effects of the cell cycle, CellCycle_Regressed. This will allow us to plot the cell cycle genes before and after regression. We can do this using the Manipulate AnnData tool and selecting the correct function from the dropdown menu.
Hands On: Add the new annotations
Manipulate AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
“Annotated data matrix”: CellCycle_Annotated (output of Inspect and manipulate tool)
“Function to manipulate the object”: Add new annotation(s) for observations or variables
“Table with new annotations”: out_file1 (output of Concatenate datasets tool)
Rename the output CellCycle_Annotated_CC
Manipulate AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
“Annotated data matrix”: CellCycle_Regressed (output of Scanpy RegressOut tool)
“Function to manipulate the object”: Add new annotation(s) for observations or variables
“Table with new annotations”: out_file1 (output of Concatenate datasets tool)
Rename the output CellCycle_Regressed_CC
Filter the cell cycle genes
To demonstrate the power of cell cycle regression, we’re going to reduce our expression matrices to contain only the 97 cell cycle genes. This will force our dimension reduction and plotting to be based entirely on cell cycle genes. You wouldn’t do this during analysis, but for proof of principle, let’s go for it!
Hands On: Filter the AnnData datasets
Manipulate AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
“Annotated data matrix”: CellCycle_Annotated_CC (output of Manipulate AnnData tool)
“Function to manipulate the object”: Filter observations or variables
“Type of filtering?”: By key (column) values
“Key to filter”: CC_genes
“Type of value to filter”: Boolean
Rename the output CellCycle_Annotated_CC_Only
Manipulate AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
“Annotated data matrix”: CellCycle_Regressed_CC (output of Manipulate AnnData tool)
“Function to manipulate the object”: Filter observations or variables
“Type of filtering?”: By key (column) values
“Key to filter”: CC_genes
“Type of value to filter”: Boolean
Rename the output CellCycle_Regressed_CC_Only
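Filtering by a Boolean key keeps only the columns of the expression matrix whose annotation is true. The sketch below shows the idea with plain Python lists; the gene names, flags and expression values are hypothetical.

```python
gene_names = ["Xkr4", "Mcm5", "Sox17", "Top2a"]
cc_flags = [False, True, False, True]  # the CC_genes annotation, read as Booleans

# Hypothetical expression matrix: rows are cells, columns are genes.
matrix = [
    [0.0, 2.1, 0.3, 0.1],
    [0.2, 0.0, 0.0, 1.7],
]

# Indices of the genes flagged as cell cycle genes.
keep = [i for i, flag in enumerate(cc_flags) if flag]

filtered_names = [gene_names[i] for i in keep]
filtered_matrix = [[row[i] for i in keep] for row in matrix]

print(filtered_names)
print(filtered_matrix)
```

The number of cells is unchanged; only the genes are reduced.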
Plot the cell cycle genes before regression
Now that we have a dataset that only includes the cell cycle genes, we can visualise their effects in a PCA plot. We first calculate the PCA coordinates, which are a measure of how similar each pair of cells is in terms of the expression of the 97 cell cycle genes we’ve included in the filtered dataset. We will then visualise the cells on a PCA plot where the axes represent the principal components, which reflect the genes (or groups of genes) that had the biggest impact in these calculations.
You will learn more about plotting your data in the Filter, Plot and Explore tutorial. For now, it is enough to know that each dot on the plot represents a cell and the closer two cells are together, the more similar they are.
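For readers who want to see what a principal component is in practice, here is a small, self-contained sketch that finds the first principal component of a toy matrix by power iteration. It is only an illustration of the idea: the data are made up, the function name `first_principal_component` is hypothetical, and Scanpy's 'tl.pca' uses its own, more robust solvers.

```python
from statistics import mean


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component via power iteration."""
    n_genes = len(rows[0])
    means = [mean(row[j] for row in rows) for j in range(n_genes)]
    centred = [[row[j] - means[j] for j in range(n_genes)] for row in rows]
    # Covariance between genes (columns of the matrix).
    cov = [[sum(r[a] * r[b] for r in centred) / (len(rows) - 1)
            for b in range(n_genes)] for a in range(n_genes)]
    # Simple starting vector; a degenerate start would need a different choice.
    vec = [1.0] + [0.0] * (n_genes - 1)
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(n_genes)) for a in range(n_genes)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    # Project each centred cell onto the component to get its coordinate.
    scores = [sum(r[j] * vec[j] for j in range(n_genes)) for r in centred]
    return vec, scores


# Hypothetical cells: the first two express gene 1 highly, the last two gene 2.
cells = [[2.0, 0.1], [1.8, 0.2], [0.1, 1.9], [0.2, 2.1]]
direction, scores = first_principal_component(cells)
print(direction)
print(scores)
```

Cells with similar expression patterns receive similar coordinates, which is why they end up close together on the plot.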
Hands On: Create a PCA Plot of cell cycle genes
Cluster, infer trajectories and embed (Galaxy version 1.7.1+galaxy0) with the following parameters:
“Annotated data matrix”: CellCycle_Annotated_CC_Only (output of Manipulate AnnData tool)
“Method used”: Computes PCA (principal component analysis) coordinates, loadings and variance decomposition, using 'tl.pca'
“Type of PCA?”: Full PCA
Comment: Plot all the genes
Make sure that you de-select the option for the Cluster, infer trajectories and embed tool to use highly variable genes only - some of the cell cycle genes are also HVGs, but we want our plots to include the cell cycle genes that aren’t HVGs too.
Plot (Galaxy version 1.7.1+galaxy1) with the following parameters:
“Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
“Method used for plotting”: PCA: Plot PCA results, using 'pl.pca_overview'
“Keys for annotations of observations/cells or variables/genes”: phase
In “Plot attributes”:
“Colors to use for plotting categorical annotation groups”: rainbow (Miscellaneous)
Question
Does the plot look as you expected?
The PCA plot shows that the three groups of cells are separated out according to what phase of the cell cycle they are in. This is what we would expect to see as we are only looking at the cell cycle genes, which by definition are expressed during particular phases.
Figure 1: PCA Plot of Cell Cycle Genes before regression
Plot the cell cycle genes after regression
We will now repeat the same steps to create a PCA plot of the filtered dataset after the effects of the cell cycle have been regressed out.
Hands On: Recreate the PCA plot of cell cycle genes after regression
Cluster, infer trajectories and embed (Galaxy version 1.7.1+galaxy0) with the following parameters:
“Annotated data matrix”: CellCycle_Regressed_CC_Only (output of Manipulate AnnData tool)
“Method used”: Computes PCA (principal component analysis) coordinates, loadings and variance decomposition, using 'tl.pca'
“Type of PCA?”: Full PCA
Plot (Galaxy version 1.7.1+galaxy1) with the following parameters:
“Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
“Method used for plotting”: PCA: Plot PCA results, using 'pl.pca_overview'
“Keys for annotations of observations/cells or variables/genes”: phase
In “Plot attributes”:
“Colors to use for plotting categorical annotation groups”: rainbow (Miscellaneous)
Question
Does the plot look as you expected?
The cells in different phases are now all mixed up together. This makes sense because we are only plotting the cell cycle genes, but the previously strong effects of the cell cycle on these genes have now been regressed out. There are still some differences between the cells (they don’t all end up at the same point on the PCA chart) because the regression only removes the expected effects of the cell cycle, leaving behind any individual variation in the expression of the cell cycle genes.
Figure 2: PCA Plot of Cell Cycle Genes after regression
Comparing the before and after plots, we can clearly see that the effects of the cell cycle have been removed. Although you wouldn’t usually need to filter out the cell cycle genes or create these plots when analysing your own data, hopefully you have found doing it now to be helpful for understanding the impact of cell cycle regression.
Question
What impact do you think the cell cycle regression will have when you analyse the whole dataset? What would happen if we plotted all of the genes from the main dataset?
The regression reduces the impact of the cell cycle on the data - this is why the cells are less separated by phase afterwards. When we analyse the whole CellCycle_Regressed dataset, with all of the genes, this could allow other differences in gene expression to become more apparent.
We wouldn’t expect to see such clear distinctions in PCA plots created using all of the genes (not just the cell cycle ones), even before the regression. Although the cell cycle genes can have a significant effect, it won’t be as obvious when other genes are also being taken into account. However, we will still see a difference after we regress out the effects of the cell cycle - the cells in different phases will become more mixed up together. How much of a difference the regression makes will depend on how strong the effects of the cell cycle are in a particular dataset - you can see the effects on this dataset below. You can also replicate these plots after completing the rest of the Filter, Plot and Explore tutorial by colouring your PCA plots by phase.
Figure 4: PCA Plot using all genes after regression
Conclusion
In this tutorial, you have annotated and scored the cell cycle genes and regressed out the effects of the cell cycle. You have also created PCA plots of the data before and after regression to visualise the effects.
You might want to check your results against this example history.
You can now continue to analyse this data by returning to the Preparing coordinates step in the Filter, Plot and Explore tutorial. If you use the CellCycle_Regressed dataset (which you may now want to rename as Use_me_Scaled since that is the name used in the main tutorial), you should notice some differences in your results compared to those shown there because the effects of the cell cycle have been regressed out.
If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
References
Bacon, W. A., R. S. Hamilton, Z. Yu, J. Kieckbusch, D. Hawkes et al., 2018 Single-Cell Analysis Identifies Thymic Maturation Delay in Growth-Restricted Neonatal Mice. Frontiers in Immunology 9: 10.3389/fimmu.2018.02523
Luecken, M. D., and F. J. Theis, 2019 Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology 15: 10.15252/msb.20188746
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{single-cell-scrna-case_cell-cycle,
author = "Marisa Loach",
title = "Removing the effects of the cell cycle (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_cell-cycle/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol}
}