Removing the effects of the cell cycle
Author(s) | Marisa Loach |
Editor(s) | Wendi Bacon |
Tester(s) | Graeme Tyson |
Reviewers |
Overview
Questions:
- How can I reduce the effects of the cell cycle on my scRNA-seq data?
Objectives:
- Identify the cell cycle genes
- Use the cell cycle genes to regress out the effects of the cell cycle
- Create PCA plots to understand the impact of the regression
Requirements:
- Introduction to Galaxy Analyses
- Hands-on: Generating a single cell matrix using Alevin
- Hands-on: Combining single cell datasets after pre-processing
- Hands-on: Filter, plot and explore single-cell RNA-seq data with Scanpy
Time estimation: 1 hour
Published: Jan 25, 2023
Last modification: Jun 14, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00248
Revision: 11
Single-cell RNA sequencing can be sensitive to both biological and technical variation, which is why preparing your data carefully is an important part of the analysis. You want the results to reflect the interesting differences in expression between cells that relate to their type or state. Other sources of variation can conceal or confound this, making it harder for you to see what is going on.
One common biological confounder is the cell cycle (Luecken and Theis 2019). Cells express different genes during different parts of the cell cycle, depending on whether they are in their growing phase (G1), duplicating their DNA (the S or Synthesis phase), or dividing in two (G2/M or Mitosis phase). If these cell cycle genes are having a big impact on your data, then you could end up with separate clusters that actually represent cells of the same type that are just at different stages of the cycle.
In this tutorial, we will identify the genes whose expression is known to vary during the cell cycle so that we can use them to regress out (or remove) the effects of the cell cycle on the clustering.
Comment: Other Scanpy and Seurat tutorials
This tutorial is based on the Scanpy cell cycle regression tutorial (Cittaro 2018), which was itself based on the Seurat vignette addressing the same issue (Paul Hoffman 2022). However, we will be using a different dataset for this tutorial.
Get Data
The data used in this tutorial is from a mouse dataset of fetal growth restriction (Bacon et al. 2018). Cell cycle regression should be performed after the data has been filtered, normalised, and scaled. You can download the dataset below or import the history with the starting data.
Comment
- If you’ve been working through the Single-cell RNA-seq: Case Study then you can use your dataset from the Filter, Plot and Explore Single-cell RNA-seq Data tutorial here. Select the Use_me_Scaled dataset from your history. Rename that dataset as Processed_Anndata. You will still need to import the S and G2/M gene lists below through Zenodo.
- At the end of this tutorial, you can return to the main tutorial to plot and explore your data with reduced effects from the cell cycle.
In addition to the scRNA-seq dataset, we will also need lists of the genes that are known to be expressed at different points in the cell cycle. The lists used in this tutorial are part of the HBC tinyatlas and can be downloaded from Zenodo below (Kirchner and HBC 2018). Between them, they include 97 genes that are expressed during the S and G2/M phases. The expression levels of these cell cycle genes are mostly determined by the phases of the cell cycle. Make sure that the file type is tabular (not just the name of the file) - you can choose this when you download the files or change it after the files are in your history.
Hands-on: Option 1: Data upload - Import history
Sometimes data upload can take a while, so a faster route is to import a history.
Import history from: example input history
- Open the link to the shared history
- Click on the Import this history button on the top left
- Enter a title for the new history
- Click on Copy History
Rename the history to your name of choice.
Hands-on: Option 2: Data upload - Add to history
Create a new history for this tutorial
Import the AnnData object from Zenodo
https://zenodo.org/record/7311628/files/Processed_AnnData.h5ad
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
Rename the dataset Processed_Anndata
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
Check that the datatype is h5ad
- Click on the pencil icon for the dataset to edit its attributes
- In the central panel, click the Datatypes tab at the top
- Under Assign Datatype, select h5ad from the “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Hands-on: Option 2 continued: Data upload - Add to history
Import the files from Zenodo
https://zenodo.org/record/7311628//files/sPhase.tabular
https://zenodo.org/record/7311628//files/g2mPhase.tabular
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
Rename the datasets sPhase and g2mPhase respectively - be careful not to mix them up!
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
Check that the datatype for both is tabular
- Click on the pencil icon for the dataset to edit its attributes
- In the central panel, click the Datatypes tab at the top
- Under Assign Datatype, select tabular from the “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Important tips for easier analysis
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
- Open your Galaxy server
- Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
- Navigate to your tutorial
- Tool names in tutorials will be blue buttons that open the correct tool for you
- Note: this does not work for all tutorials (yet)
- You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
- We’ve had some issues with Tutorial mode on Safari for Mac users.
- Try a different browser if you aren’t seeing the button.
Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.
The Single Cell Omics Lab currently uses the main European Galaxy infrastructure and power, it’s just organised better for users of particular analyses…like single cell!
Try it out!
- subdomain Europe: Single Cell Omics Lab
- subdomain USA: Single Cell Omics Lab
- subdomain Australia: Single Cell Omics Lab
Cell Cycle Scoring
The first step towards reducing the effects of the cell cycle on our dataset is cell cycle scoring. The cell cycle scoring algorithm will look at each cell in turn and calculate an S score based on the difference between the mean expression of the S phase genes and the mean expression of a random sample of the same number of non-cell cycle genes from the same cell. It will do the same for the G2M genes in order to calculate the G2M score. The cells will then be assigned to the most likely phase: S, G2M, or G1 if neither the S nor the G2M score is high. Three columns will be added to the AnnData dataset: S_score, G2M_score and phase.
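The scoring idea described above can be sketched in a few lines of plain Python. This is only an illustration and not the code that the Galaxy tool runs: the gene names, expression values and helper functions (`gene_set_score`, `assign_phase`) are invented for the example, and the rule of calling a cell G1 when both scores are below zero is an assumption of this sketch.

```python
import random
from statistics import mean


def gene_set_score(cell, gene_set, control_pool, rng):
    """Mean expression of the gene set minus that of a same-sized random control sample."""
    genes = [g for g in gene_set if g in cell]
    controls = rng.sample(control_pool, k=min(len(genes), len(control_pool)))
    return mean(cell[g] for g in genes) - mean(cell[g] for g in controls)


def assign_phase(s_score, g2m_score):
    """Return G1 when both scores are negative (assumed threshold), else the higher-scoring phase."""
    if s_score < 0 and g2m_score < 0:
        return "G1"
    return "S" if s_score >= g2m_score else "G2M"


# Toy gene names and expression values, invented for illustration only.
S_GENES = ["s1", "s2"]
G2M_GENES = ["m1", "m2"]
OTHER_GENES = ["h1", "h2", "h3"]  # pool of non-cell cycle genes

cells = {
    "cell_a": {"s1": 5.0, "s2": 4.0, "m1": 0.0, "m2": 1.0, "h1": 1.0, "h2": 1.0, "h3": 1.0},
    "cell_b": {"s1": 0.0, "s2": 0.0, "m1": 3.0, "m2": 5.0, "h1": 1.0, "h2": 1.0, "h3": 1.0},
    "cell_c": {"s1": 0.5, "s2": 0.0, "m1": 0.0, "m2": 0.5, "h1": 1.0, "h2": 1.0, "h3": 1.0},
}

rng = random.Random(0)  # fixed seed so the control sample is reproducible
annotations = {}
for name, cell in cells.items():
    s = gene_set_score(cell, S_GENES, OTHER_GENES, rng)
    g2m = gene_set_score(cell, G2M_GENES, OTHER_GENES, rng)
    # The same three columns that are added to the AnnData dataset.
    annotations[name] = {"S_score": s, "G2M_score": g2m, "phase": assign_phase(s, g2m)}
```

In this toy data, the first cell expresses the S genes strongly, the second expresses the G2M genes strongly, and the third expresses neither above the control level, so it falls into G1 without needing a G1 gene list.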
- Why don’t we need a list of genes expressed in the G1 Phase?
- Since we know which genes are expressed in the S and G2/M phases, we can classify cells that are expressing these genes into the S and G2/M phases respectively. Cells that aren’t expressing either the S or G2/M genes must be in the other phase of the cell cycle, so we can classify them as in G1 phase.
Comment: When should we regress out the effects of the cell cycle?
Cell cycle regression can be particularly important if we are planning to do trajectory analysis down the line or if we have a dataset that is very strongly influenced by the cell cycle (Luecken and Theis 2019). However, it isn’t always appropriate to remove the effects of the cell cycle - sometimes they can be useful for distinguishing between dividing and non-dividing cell types. When you are analysing your own data, you might need to try it both ways to determine if the effects of the cell cycle are helpful or not. You could also check whether the cell cycle genes are among the top scoring genes expressed by your cell clusters to get an idea of how strong the effects are.
Hands-on: Score the cell cycle genes
- Inspect and manipulate (Galaxy version 1.7.1+galaxy0) with the following parameters:
  - “Annotated data matrix”: Processed_Anndata (Input dataset)
  - “Method used for inspecting”: Score cell cycle genes, using 'tl.score_genes_cell_cycle'
  - “Format for the list of genes associated with S phase”: File
  - “File with the list of genes associated with S phase”: sPhase (Input dataset)
  - “Format for the list of genes associated with G2M phase”: File
  - “File with the list of genes associated with G2M phase”: g2mPhase (Input dataset)
- Rename the output CellCycle_Annotated
Cell Cycle Regression
The second step after scoring the cell cycle genes is to use these scores to regress out the effects of the cell cycle. Now that we know which phase each of our cells is in, we can work out how this is affecting gene expression in our cells. We can subtract these effects from our data so that they won’t influence our cell clustering. You will need to type the phase variable into the Scanpy RegressOut tool to enable it to regress out the cell cycle effects.
The Scanpy RegressOut tool will create a linear model of the relationship between gene expression and the cell cycle scores we assigned in the previous step. Basically, this model is a line that shows how gene expression changes as the S or G2M score changes. Each gene will have its own line, so for any S score or G2M score, we could look at the corresponding point on the line to see the expected expression level of that gene.
Scanpy RegressOut will then regress out or remove this expected effect for the genes expressed by each cell, according to the cell’s S and G2M scores. The expected effect is subtracted from the expression data, leaving behind the difference between the expected position on the line and the actual position of each data point. The data points won’t sit exactly on the line because their expression levels aren’t determined completely by the cell cycle - the linear model only tells us what we would expect based on the cell cycle scores alone.
Understanding how the regression works should help you to see why we’re not just deleting the cell cycle genes from the dataset. We are using these genes that are known to be expressed during different phases to calculate the cell cycle scores and determine the phase each cell is in. We then use this information to work out the effect of the cell cycle on all of the other genes expressed by the cells. Even if they are not cell cycle genes, their expression can still be affected by the cycle. Finally, we remove or regress out the effect of the cell cycle on all the genes, leaving behind the variation that we’re interested in.
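The line-fitting description above can be illustrated with a small NumPy sketch. It fits one least-squares line per gene on the two numeric scores and keeps only the residuals. This is an illustration under stated assumptions, not the code run by the Scanpy RegressOut tool: the helper name `remove_linear_effect` and all of the numbers are invented, and the hands-on step below passes the `phase` annotation rather than the two scores.

```python
import numpy as np


def remove_linear_effect(expression, covariates):
    """Fit one least-squares line per gene (column) and return the residuals."""
    # Design matrix: an intercept column followed by the covariates (here S and G2M scores).
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coefficients, *_ = np.linalg.lstsq(design, expression, rcond=None)
    expected = design @ coefficients  # the "point on the line" for every cell and gene
    return expression - expected      # what is left after subtracting the expected effect


# Toy data: 6 cells with (S_score, G2M_score) pairs, invented for illustration.
scores = np.array([
    [0.9, -0.2], [0.7, -0.1],   # S-like cells
    [-0.3, 0.8], [-0.2, 0.6],   # G2M-like cells
    [-0.5, -0.4], [-0.4, -0.6], # G1-like cells
])
# Cell-specific variation that is unrelated to the cell cycle.
other_variation = np.array([0.3, -0.3, 0.2, -0.2, 0.1, -0.1])

gene_a = 1.0 + 2.0 * scores[:, 0] + other_variation  # rises with the S score
gene_b = 0.5 - 1.5 * scores[:, 1] - other_variation  # falls with the G2M score
expression = np.column_stack([gene_a, gene_b])

residuals = remove_linear_effect(expression, scores)
```

After the fit, the residuals of every gene have no remaining linear relationship with either score, but they still differ from cell to cell because the expression was never determined by the cell cycle alone.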
Hands-on: Regress out the effects of the cell cycle
- Scanpy RegressOut (Galaxy version 1.8.1+galaxy0) with the following parameters:
  - “Input object in AnnData/Loom format”: CellCycle_Annotated (output of Inspect and manipulate tool)
  - “Variables to regress out”: phase
- Rename the output CellCycle_Regressed
Plotting the Effects of Cell Cycle Regression
Your data is now ready for further analysis, so you can return to the Filter, Plot and Explore Single-cell RNA-seq Data tutorial and move on to the Preparing coordinates step there. Make sure that you use the CellCycle_Regressed dataset (you may want to rename it as Use_me_Scaled as that is the name used in the main tutorial). However, if you want to understand how cell cycle regression has affected your data then you can work through the following steps first, to visualise where the cell cycle genes are expressed - with or without regression.
In order to look at the cell cycle genes, we first need to label them in our AnnData dataset so that we can select them for plotting. To add a new annotation to the genes or variables, we need a column with entries for each gene, in the same order as in the dataset, and with a header at the top that will become the key for identifying these entries in the AnnData dataset. We want to end up with a column that reads TRUE for the 97 cell cycle genes and FALSE for all the other genes.
You might find it easier to create this new column using a spreadsheet and then upload it as a tabular dataset, but it is possible to complete all the steps on Galaxy.
Prepare a table of cell cycle genes
If we’re going to mark all the cell cycle genes, we’ll need a single list of all 97 genes instead of the two separate lists for S Phase and G2/M Phase. We’ll combine the two lists into a single column with 97 entries. We’ll then add a second column that simply reads TRUE in every row, which we’ll use later to mark these as cell cycle genes in the main dataset.
Hands-on: Create a list of all cell cycle genes
- Concatenate datasets with the following parameters:
  - “Concatenate Dataset”: sPhase (Input dataset)
  - In “Dataset”:
    - “Insert Dataset”
      - “Select”: g2mPhase (Input dataset)
- Add column (Galaxy version 1.0.0) with the following parameters:
  - “Add this value”: TRUE
  - “to Dataset”: out_file1 (output of Concatenate datasets tool)
- Rename the dataset CC_Genes
Create an ordered list of gene names
Next, we’ll need a list of all the genes in our dataset, so that we can mark the ones that are in our cell cycle list. We’ll also add a column of numbers as this will help us keep the gene names in order.
Hands-on: Get the gene names from your dataset
- Inspect AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
  - “Annotated data matrix”: CellCycle_Regressed (Input dataset)
  - “What to inspect?”: Key-indexed annotation of variables/features (var)
- Table Compute (Galaxy version 1.2.4+galaxy0) with the following parameters:
  - “Input Single or Multiple Tables”: Single Table
  - “Table”: var (output of Inspect AnnData tool)
  - “Type of table operation”: Drop, keep or duplicate rows and columns
  - “List of columns to select”: 1
  - “Output formatting options”: Unselect all
- Add column (Galaxy version 1.0.0) with the following parameters:
  - “to Dataset”: table (output of Table Compute tool)
  - “Iterate?”: YES
  Comment: Keeping the genes in order
  Adding these numbers will enable us to keep the genes in their original order. This is essential for adding the cell cycle gene annotation back into the AnnData dataset.
- Rename the output Dataset_Genes
Mark the cell cycle genes
We can now combine our table of cell cycle genes CC_Genes with the table of gene names Dataset_Genes.
Hands-on: Combine the two tables
- Join (Galaxy version 1.1.2) with the following parameters:
  - “1st file”: Dataset_Genes (output of Add column tool)
  - “Column to use from 1st file”: c1
  - “2nd File”: CC_Genes (output of Add column tool)
  - “Column to use from 2nd file”: c1
  - “Output lines appearing in”: All lines [-a 1 -a 2]
  - “Value to put in unpaired (empty) fields”: FALSE
  Comment: How the cell cycle genes are marked
  When we join the two tables, we’ll ask for any empty fields to be filled in with FALSE. The cell cycle gene table has an extra column where they are all marked as TRUE - they will retain these entries when we join the tables, but since there are no entries for the rest of the genes in this column, their rows will be filled in as FALSE. This will enable us to pick out the cell cycle genes later.
- Sort with the following parameters:
  - “Sort Dataset”: output (output of Join tool)
  - “on column”: c2
  - “everything in”: Ascending order
  Comment: Putting the genes in order again
  Sorting the genes using the column of numbers we added earlier will put them back in their original order - make sure to sort them in ascending order, otherwise they’ll end up the opposite way around.
Question
- What would happen if any of the cell cycle genes were not present in the dataset?
- How would we remove these genes from the table?
- Any cell cycle genes that weren’t in the dataset would have an empty field in the numbered column, which would be filled in with FALSE when we created the table with the Join tool. These rows would appear at the top of the table after it was sorted.
- We should check the first rows of the table for any unnumbered genes and then cut these rows out in the next step.
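The numbering, joining and sorting steps above can be mirrored in plain Python to show what the resulting table contains. This is a minimal sketch, assuming unique gene names. The function name `mark_cell_cycle_genes` and the toy gene names are invented for the example and are not part of any Galaxy tool.

```python
def mark_cell_cycle_genes(dataset_genes, cc_genes):
    """Return one TRUE/FALSE flag per dataset gene, in dataset order, plus unmatched genes."""
    # Number the dataset genes so their original order can be restored after the join.
    position = {gene: i for i, gene in enumerate(dataset_genes, start=1)}
    cc = set(cc_genes)

    # Outer join on gene name; unpaired (empty) fields are filled with "FALSE".
    joined = [
        (gene, position.get(gene, "FALSE"), "TRUE" if gene in cc else "FALSE")
        for gene in set(dataset_genes) | cc
    ]

    # Rows without a number are cell cycle genes that are absent from the dataset.
    missing = sorted(gene for gene, number, _ in joined if number == "FALSE")
    # Sort the numbered rows in ascending order to recover the dataset order.
    numbered = sorted((row for row in joined if row[1] != "FALSE"), key=lambda row: row[1])
    return [flag for _, _, flag in numbered], missing


# Toy gene names, invented for illustration only.
dataset_genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
cc_genes = ["gene_d", "gene_b", "gene_x"]  # gene_x is not in the dataset

flags, missing = mark_cell_cycle_genes(dataset_genes, cc_genes)

# The annotation column needs a header that becomes the key in the AnnData dataset.
annotation_column = ["CC_genes"] + flags
```

Here `missing` plays the role of the unnumbered rows that would need to be cut out, and `annotation_column` has exactly one entry per dataset gene below its header.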
Create the annotation column
We now have a table with all the gene names in the same order as the main dataset and a column indicating which ones are cell cycle genes. If we cut this column out of the table, then we can add it as a new annotation to the main dataset. We’ll also need to add a column header, which will be used as the key for this annotation in the AnnData dataset.
Hands-on: Create the cell cycle annotation column
- Table Compute (Galaxy version 1.2.4+galaxy0) with the following parameters:
  - “Input Single or Multiple Tables”: Single Table
  - “Table”: out_file1 (output of Sort tool)
  - “Input data has:”: Unselect all
  - “Type of table operation”: Drop, keep or duplicate rows and columns
  - “List of columns to select”: 3
  - “Output formatting options”: Unselect all
  Comment: Removing rows for missing genes
  If there were any cell cycle genes that weren’t present in the main dataset, we could remove them at this stage by excluding them from the List of rows to select. As before, if we were using a dataset of a different size, we would need to change this parameter to include all the rest of the rows.
- Create a new tabular file from the following:
  CC_genes
  - Click Upload Data at the top of the tool panel
  - Select Paste/Fetch Data at the bottom
  - Paste the file contents into the text field
  - Change Type from “Auto-detect” to tabular
  - Press Start and Close the window
- Concatenate datasets with the following parameters:
  - “Concatenate Dataset”: Pasted Entry dataset
  - In “Dataset”:
    - “Insert Dataset”
      - “Select”: table (output of Table Compute tool)
Add an annotation to the AnnData
We will need to add the annotation to both the annotated dataset CellCycle_Annotated and to the one that we created by regressing out the effects of the cell cycle, CellCycle_Regressed. This will allow us to plot the cell cycle genes before and after regression. We can do this using the Manipulate AnnData tool and selecting the correct function from the dropdown menu.
Hands-on: Add the new annotations
- Manipulate AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
  - “Annotated data matrix”: CellCycle_Annotated (output of Inspect and manipulate tool)
  - “Function to manipulate the object”: Add new annotation(s) for observations or variables
  - “Table with new annotations”: out_file1 (output of Concatenate datasets tool)
- Rename the output CellCycle_Annotated_CC
- Manipulate AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
  - “Annotated data matrix”: CellCycle_Regressed (output of Scanpy RegressOut tool)
  - “Function to manipulate the object”: Add new annotation(s) for observations or variables
  - “Table with new annotations”: out_file1 (output of Concatenate datasets tool)
- Rename the output CellCycle_Regressed_CC
Filter the cell cycle genes
To demonstrate the power of cell cycle regression, we’re going to reduce our expression matrices to contain only the 97 cell cycle genes. This will force our dimension reduction and plotting to be based entirely on cell cycle genes. You wouldn’t do this during analysis, but for proof of principle, let’s go for it!
Hands-on: Filter the AnnData datasets
- Manipulate AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
  - “Annotated data matrix”: CellCycle_Annotated_CC (output of Manipulate AnnData tool)
  - “Function to manipulate the object”: Filter observations or variables
  - “Type of filtering?”: By key (column) values
  - “Key to filter”: CC_genes
  - “Type of value to filter”: Boolean
- Rename the output CellCycle_Annotated_CC_Only
- Manipulate AnnData (Galaxy version 0.7.5+galaxy1) with the following parameters:
  - “Annotated data matrix”: CellCycle_Regressed_CC (output of Manipulate AnnData tool)
  - “Function to manipulate the object”: Filter observations or variables
  - “Type of filtering?”: By key (column) values
  - “Key to filter”: CC_genes
  - “Type of value to filter”: Boolean
- Rename the output CellCycle_Regressed_CC_Only
Plot the cell cycle genes before regression
Now that we have a dataset that only includes the cell cycle genes, we can visualise their effects in a PCA plot. We first calculate the PCA coordinates, which are a measure of how similar each pair of cells is in terms of the expression of the 97 cell cycle genes we’ve included in the filtered dataset. We will then visualise the cells on a PCA plot where the axes represent the principal components, which reflect the genes (or groups of genes) that had the biggest impact in these calculations.
You will learn more about plotting your data in the Filter, Plot and Explore tutorial. For now, it is enough to know that each dot on the plot represents a cell and the closer two cells are together, the more similar they are.
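To make the idea of PCA coordinates more concrete, the sketch below computes them for a tiny invented matrix using a singular value decomposition in NumPy. It is an illustration only and is not necessarily how the Galaxy tool performs the calculation. The function name `pca_coordinates` and all of the values are assumptions made for the example.

```python
import numpy as np


def pca_coordinates(matrix, n_components=2):
    """Project cells (rows) onto the leading principal components of a cells x genes matrix."""
    centred = matrix - matrix.mean(axis=0)           # centre each gene on zero
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    coordinates = u[:, :n_components] * s[:n_components]
    variance_ratio = (s ** 2) / np.sum(s ** 2)       # share of variance per component
    return coordinates, variance_ratio[:n_components]


# Toy matrix: 6 cells x 3 genes, invented for illustration.
# The first three cells express gene 0 strongly, the last three express gene 1 strongly.
matrix = np.array([
    [5.0, 0.1, 1.0],
    [5.2, 0.0, 1.1],
    [4.8, 0.2, 0.9],
    [0.1, 5.0, 1.0],
    [0.0, 5.1, 1.2],
    [0.2, 4.9, 0.8],
])

coordinates, variance_ratio = pca_coordinates(matrix)
```

In this toy example, the first principal component captures almost all of the variance, and the two groups of cells land on opposite sides of it, which is the kind of separation we would look for when colouring a PCA plot by phase.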
Hands-on: Create a PCA Plot of cell cycle genes
- Cluster, infer trajectories and embed (Galaxy version 1.7.1+galaxy0) with the following parameters:
  - “Annotated data matrix”: CellCycle_Annotated_CC_Only (output of Manipulate AnnData tool)
  - “Method used”: Computes PCA (principal component analysis) coordinates, loadings and variance decomposition, using 'tl.pca'
  - “Type of PCA?”: Full PCA
  Comment: Plot all the genes
  Make sure that you de-select the option for the Cluster, infer trajectories and embed tool to use highly variable genes only - some of the cell cycle genes are also HVGs, but we want our plots to include the cell cycle genes that aren’t HVGs too.
- Plot (Galaxy version 1.7.1+galaxy1) with the following parameters:
  - “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
  - “Method used for plotting”: PCA: Plot PCA results, using 'pl.pca_overview'
  - “Keys for annotations of observations/cells or variables/genes”: phase
  - In “Plot attributes”:
    - “Colors to use for plotting categorical annotation groups”: rainbow (Miscellaneous)
Question
- Does the plot look as you expected?
- The PCA plot shows that the three groups of cells are separated out according to what phase of the cell cycle they are in. This is what we would expect to see as we are only looking at the cell cycle genes, which by definition are expressed during particular phases.
Plot the cell cycle genes after regression
We will now repeat the same steps to create a PCA plot of the filtered dataset after the effects of the cell cycle have been regressed out.
Hands-on: Recreate the PCA plot of cell cycle genes after regression
- Cluster, infer trajectories and embed (Galaxy version 1.7.1+galaxy0) with the following parameters:
  - “Annotated data matrix”: CellCycle_Regressed_CC_Only (output of Manipulate AnnData tool)
  - “Method used”: Computes PCA (principal component analysis) coordinates, loadings and variance decomposition, using 'tl.pca'
  - “Type of PCA?”: Full PCA
- Plot (Galaxy version 1.7.1+galaxy1) with the following parameters:
  - “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
  - “Method used for plotting”: PCA: Plot PCA results, using 'pl.pca_overview'
  - “Keys for annotations of observations/cells or variables/genes”: phase
  - In “Plot attributes”:
    - “Colors to use for plotting categorical annotation groups”: rainbow (Miscellaneous)
Question
- Does the plot look as you expected?
- The cells in different phases are now all mixed up together. This makes sense because we are only plotting the cell cycle genes, but the previously strong effects of the cell cycle on these genes have now been regressed out. There are still some differences between the cells (they don’t all end up at the same point on the PCA chart) because the regression only removes the expected effects of the cell cycle, leaving behind any individual variation in the expression of the cell cycle genes.
Comparing the before and after plots, we can clearly see that the effects of the cell cycle have been removed. Although you wouldn’t usually need to filter out the cell cycle genes or create these plots when analysing your own data, hopefully you have found doing it now to be helpful for understanding the impact of cell cycle regression.
Question
- What impact do you think the cell cycle regression will have when you analyse the whole dataset? What would happen if we plotted all of the genes from the main dataset?
- The regression reduces the impact of the cell cycle on the data - this is why the cells are less separated by phase afterwards. When we analyse the whole CellCycle_Regressed dataset, with all of the genes, this could allow other differences in gene expression to become more apparent.
- We wouldn’t expect to see such clear distinctions in PCA plots created using all of the genes (not just the cell cycle ones), even before the regression. Although the cell cycle genes can have a significant effect, this won’t be as obvious when other genes are also being taken into account. However, we will still see a difference after we regress out the effects of the cell cycle - the cells in different phases will become more mixed up together. How much of a difference the regression makes will depend on how strong the effects of the cell cycle are in a particular dataset - you can see the effects on this dataset below. You can also replicate these plots after completing the rest of the Filter, Plot and Explore tutorial by colouring your PCA plots by phase.
Conclusion
In this tutorial, you have annotated and scored the cell cycle genes and regressed out the effects of the cell cycle. You have also created PCA plots of the data before and after regression to visualise the effects.
You might want to check your results against this example history.
You can now continue to analyse this data by returning to the Preparing coordinates step in the Filter, Plot and Explore tutorial. If you use the CellCycle_Regressed dataset (which you may now want to rename as Use_me_Scaled since that is the name used in the main tutorial), you should notice some differences in your results compared to those shown there because the effects of the cell cycle have been regressed out.
To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and discuss with fellow users of Galaxy single cell analysis tools on #single-cell-users
We also post new tutorials / workflows there from time to time, as well as any other news.
If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.
You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet