Comparing inferred cell compositions using MuSiC deconvolution
Author(s): Wendi Bacon, Mehmet Tekman
Tester(s): Marisa Loach
Overview
Questions:
- How do the cell type distributions vary in bulk RNA samples across my variable of interest?
- For example, are beta cell proportions different in the pancreas data from diabetes and healthy patients?
Objectives:
- Apply the MuSiC deconvolution to samples and compare the cell type distributions
- Compare the results from analysing different types of input, for example, whether combining disease and healthy references or not yields better results
Requirements:
- Introduction to Galaxy Analyses
- Hands-on: Bulk RNA Deconvolution with MuSiC
- Hands-on: Matrix Exchange Format to ESet | Creating a single-cell RNA-seq reference dataset for deconvolution
- Hands-on: Bulk matrix to ESet | Creating the bulk RNA-seq dataset for deconvolution
Time estimation: 1 hour
Published: Jan 20, 2023
Last modification: Nov 9, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
PURL: https://gxy.io/GTN:T00243
Revision: 8
The goal of this tutorial is to apply bulk RNA deconvolution techniques to a problem with multiple variables: in this case, a model of diabetes is compared with its healthy counterparts. All you need to compare inferred cell compositions are well-annotated, high-quality reference scRNA-seq datasets and your bulk RNA samples of choice, both transformed into MuSiC-friendly Expression Set objects. For more information on how MuSiC works, you can check out the MuSiC GitHub site or the published article (Wang et al. 2019).
Comment: Research question
- How does variable X impact the cell distributions in my samples?
- Needs: scRNA-seq reference dataset; bulk RNA-seq samples of interest to compare
Data
In the standard MuSiC tutorial, we used human pancreas data. We will now use the same single cell reference dataset (Segerstolpe et al. 2016), with its 10 samples from 6 healthy subjects and 4 subjects with Type-II diabetes (T2D), as well as the bulk RNA samples from the same lab (3 healthy, 4 diseased). Both of these datasets were accessed from the public EMBL-EBI repositories and transformed into Expression Set objects in the previous two tutorials. For both the single cell reference and the bulk samples of interest, you have generated Expression Set objects with only T2D samples and with only healthy samples; for the scRNA reference, you have also generated a final everything-combined object. We won’t need the combined bulk RNA dataset. The plan is to analyse this data in three ways, all to identify differences in cellular composition: using a combined reference (altogether); using a healthy and a diseased reference separately, each matched to its own bulk samples (like4like); or using only the healthy single cell reference (healthyscref).
If you have followed the previous tutorials, you will have built your single cell ESet objects and your bulk ESet objects, and you can copy these into a new history now. Otherwise, follow the steps below to import the datasets you’ll need.
There are 3 ways to copy datasets between histories:
From the original history
- Click on the gear icon at the top of the list of datasets in the history panel
- Click on Copy Datasets
- Select the desired files
- Give a relevant name to the “New history”
- Validate by ‘Copy History Items’
- Click on the new history name in the green box that has just appeared to switch to this history
Using Show Histories Side-by-Side
- Click on the dropdown arrow at the top right of the history panel (History options)
- Click on Show Histories Side-by-Side
- If your target history is not present
  - Click on ‘Select histories’
  - Click on your target history
  - Validate by ‘Change Selected’
- Drag the dataset to copy from its original history
- Drop it in the target history
From the target history
- Click on User in the top bar
- Click on Datasets
- Search for the dataset to copy
- Click on its name
- Click on Copy to current History
Get data
Hands-on: Data upload
- Create a new history for this tutorial “Deconvolution: Compare”
- Import the files from Zenodo
Human single cell RNA ESet objects (tag: #singlecell):
https://zenodo.org/record/7319925/files/ESet_object_sc_combined.rdata
https://zenodo.org/record/7319925/files/ESet_object_sc_T2D.rdata
https://zenodo.org/record/7319925/files/ESet_object_sc_healthy.rdata
Human bulk RNA ESet objects (tag: #bulk):
https://zenodo.org/record/7319925/files/ESet_object_bulk_healthy.rdata
https://zenodo.org/record/7319925/files/ESet_object_bulk_T2D.rdata
- Copy the link location
- Click Upload Data at the top of the tool panel
- Select Paste/Fetch Data
- Paste the link(s) into the text field
- Press Start
- Close the window
- Rename the datasets as needed
- Add to each file the tag corresponding to its type: #bulk or #scrna
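If you prefer to keep the list of links in one place before pasting them into the upload box, the import step above can be sketched in Python. This is only an illustration: the helper name `upload_manifest` and the dictionary layout are hypothetical, while the URLs and tags are the ones listed in the import step.

```python
# Hypothetical helper: assemble the Zenodo links listed in the import step.
ZENODO_FILES_URL = "https://zenodo.org/record/7319925/files"

# File names grouped by the tag given next to them in the import step.
FILES_BY_TAG = {
    "#singlecell": [
        "ESet_object_sc_combined.rdata",
        "ESet_object_sc_T2D.rdata",
        "ESet_object_sc_healthy.rdata",
    ],
    "#bulk": [
        "ESet_object_bulk_healthy.rdata",
        "ESet_object_bulk_T2D.rdata",
    ],
}


def upload_manifest(files_by_tag=FILES_BY_TAG, base_url=ZENODO_FILES_URL):
    """Return a list of (url, tag) pairs, one per file to import."""
    return [
        (f"{base_url}/{name}", tag)
        for tag, names in files_by_tag.items()
        for name in names
    ]


# Print one tab-separated line per file: tag, then URL.
for url, tag in upload_manifest():
    print(f"{tag}\t{url}")
```

The printed URLs can be pasted together into the Paste/Fetch Data text field; the tags still need to be added in Galaxy as described above.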
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
- Click on the dataset to expand it
- Click on Add Tags
- Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
- Press Enter
- Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):
- a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
- dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
- datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
- datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted, datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2, the resulting datasets 6 and 8 automatically inherited them, and so on. As a result it is straightforward to trace both branches (plus and minus) of this analysis.
More information is in a dedicated #nametag tutorial.
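The inheritance rule for name tags can be illustrated with a toy model in Python. This is a sketch of the idea only, not Galaxy's implementation: the function `run_tool` and the dataset names are hypothetical, and the model simply assumes that an output inherits the union of its inputs' name tags.

```python
# Toy model of name-tag inheritance; NOT how Galaxy is implemented.
def run_tool(history, output, inputs):
    """Add `output` to the history with the union of its inputs' name tags."""
    inherited = set()
    for name in inputs:
        inherited |= history[name]
    history[output] = inherited
    return history


# Untagged starting datasets (hypothetical names).
history = {"reads_forward": set(), "reads_reverse": set(), "genes": set()}

run_tool(history, "mapped", ["reads_forward", "reads_reverse"])
run_tool(history, "coverage_plus", ["mapped"])
run_tool(history, "coverage_minus", ["mapped"])

# Tag the two coverage datasets by hand, as in the tagged analysis.
history["coverage_plus"].add("#plus")
history["coverage_minus"].add("#minus")

# Every later step inherits the tag without further manual work.
run_tool(history, "peaks_plus", ["coverage_plus"])
run_tool(history, "peaks_minus", ["coverage_minus"])
run_tool(history, "genes_plus", ["peaks_plus", "genes"])
run_tool(history, "genes_minus", ["peaks_minus", "genes"])

for name, tags in history.items():
    print(name, sorted(tags))
```

In this tutorial the same mechanism is what lets a tag added to an input ESet object follow that object through to the MuSiC outputs.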
Infer cellular composition & compare
It’s finally time!
Altogether: Deconvolution with a combined sc reference
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
- Open your Galaxy server
- Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
- Navigate to your tutorial
- Tool names in tutorials will be blue buttons that open the correct tool for you
- Note: this does not work for all tutorials (yet)
- You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
- We’ve had some issues with Tutorial mode on Safari for Mac users.
- Try a different browser if you aren’t seeing the button.
Hands-on: Comparing: altogether
- MuSiC Compare (Galaxy version 0.1.1+galaxy4) with the following parameters:
  - In “New scRNA Group”:
    - “Insert New scRNA Group”
      - “Name of scRNA Dataset”: scRNA_set
      - “scRNA Dataset”: ESet_object_sc_combined.rdata (Input dataset)
      - In “Advanced scRNA Parameters”:
        - “Cell Types Label from scRNA dataset”: Inferred cell type - author labels
        - “Samples Identifier from scRNA dataset”: Individual
        - “Comma list of cell types to use from scRNA dataset”: alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell
      - In “Bulk Datasets in scRNA Group”:
        - “Insert Bulk Datasets in scRNA Group”
          - “Name of Bulk Dataset”: Bulk_set:Normal
          - “Bulk RNA Dataset”: ESet_object_bulk_healthy.rdata (Input dataset)
          - “Factor Name”: Disease
        - “Insert Bulk Datasets in scRNA Group”
          - “Name of Bulk Dataset”: Bulk_set:T2D
          - “Bulk RNA Dataset”: ESet_object_bulk_T2D.rdata (Input dataset)
          - “Factor Name”: Disease
- To each of the outputs, add the #altogether tag.
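The parameter form above is nested: one scRNA group holds the reference, and the bulk datasets sit inside that group. As a way of reading the form, the same settings can be written as a nested Python structure. This is a hypothetical representation for illustration only (the key names and the helper `deconvolution_pairs` are invented here and are not the tool's API); the values are the ones entered in the form.

```python
# Hypothetical nested representation of the "altogether" form settings.
CELL_TYPES = "alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell"

altogether = [
    {
        "name": "scRNA_set",
        "dataset": "ESet_object_sc_combined.rdata",
        "cell_types_label": "Inferred cell type - author labels",
        "samples_identifier": "Individual",
        "cell_types": CELL_TYPES.split(","),
        "bulk_datasets": [
            {
                "name": "Bulk_set:Normal",
                "dataset": "ESet_object_bulk_healthy.rdata",
                "factor": "Disease",
            },
            {
                "name": "Bulk_set:T2D",
                "dataset": "ESet_object_bulk_T2D.rdata",
                "factor": "Disease",
            },
        ],
    }
]


def deconvolution_pairs(groups):
    """List (bulk dataset, scRNA reference) pairs, one per bulk dataset.

    Assumes each bulk dataset is deconvolved against the reference of the
    scRNA group it is nested in.
    """
    return [
        (bulk["dataset"], group["dataset"])
        for group in groups
        for bulk in group["bulk_datasets"]
    ]


for bulk, reference in deconvolution_pairs(altogether):
    print(f"{bulk} <- {reference}")
```

Here both bulk datasets point at the combined reference, which is what makes this the "altogether" analysis.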
There are four sets of output files.
- Summarised Plots <- This is the most interesting output, because it has the pretty pictures!
- Individual Heatmaps <- This kind of does what standard (non-Comparing) MuSiC does for each sample, rather than combining them.
- Stats <- This will be very handy if you want to make any statistical calculations, as it contains medians and quartiles
- Tables <- This contains the cell proportions found within each sample as well as the number of reads.
Summarised Plots
Examine the output file Summarised Plots (MuSiC). The first few pages are similar to the output of the standard deconvolution tool, but now compare across the factor of interest (disease). Among the many visualisations available, our favourite is on page 5: a comparison of inferred cell proportions across disease.
Here we can see that the bulk RNA-seq samples from the T2D patients contain markedly fewer beta cells than their healthy counterparts. This makes sense, so that’s good!
Individual Heatmaps
Examine the output file Individual heatmaps (MuSiC). This shows the cell distribution across each of the individual samples, separated by disease factor into two plots, but ultimately it isn’t particularly informative.
Stats
If you select the Stats dataset, you’ll find it contains four sets of data: Bulk_disease: Read Props, Bulk_disease: Sample Props, Bulk_healthy: Read Props and Bulk_healthy: Sample Props. Examine the file Bulk_disease: Sample Props. This contains summary statistics (min, quartiles, median, mean, etc.) for each phenotype. This could be quite helpful if you’re trying to statistically identify differences across samples.
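To make the kind of summary in the Stats output concrete, here is a minimal Python sketch that computes the same statistics for one cell type. The proportions below are made-up placeholder values, not results from this tutorial, and the function name `summarise` is hypothetical. Whether the tool uses the same quartile convention as `statistics.quantiles(..., method="inclusive")` is an assumption that is not confirmed here.

```python
import statistics


def summarise(proportions):
    """Return min, quartiles, median, mean and max for a list of proportions."""
    q1, median, q3 = statistics.quantiles(proportions, n=4, method="inclusive")
    return {
        "min": min(proportions),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(proportions),
        "q3": q3,
        "max": max(proportions),
    }


# Made-up placeholder proportions for one cell type across four samples.
toy_proportions = [0.10, 0.20, 0.30, 0.40]

for statistic, value in summarise(toy_proportions).items():
    print(f"{statistic}: {value:.3f}")
```

Running the same function on the real proportions for each phenotype would give numbers you can set side by side when comparing healthy and T2D samples.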
Tables
Finally, if you select the Tables dataset, you’ll find it contains three sets of data: Data Table, Matrix of Cell Type Read Counts, and Matrix of Cell Type Sample Proportions.
Examine the file Data Table. This contains the inferred proportions and reads associated with each sample and cell type, along with its important factor of interest (Disease). In this tutorial, we tend to use sample proportions rather than read count, but either works. The two other matrix files are just portions of this data table.
Question
- Why does the data table contain 42 rows?
- The data table contains a row for each cell type within each sample. Since there are 6 cell types and 7 samples, there are 6*7 = 42 rows.
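The row count, and the relationship between the long data table and the matrix files, can be checked with a short Python sketch. The sample identifiers and the `proportion` placeholder are hypothetical (the real table holds the inferred values); only the counts of 6 cell types, 3 healthy samples and 4 T2D samples come from this tutorial.

```python
from itertools import product

CELL_TYPES = [
    "alpha cell", "beta cell", "delta cell",
    "gamma cell", "acinar cell", "ductal cell",
]

# Hypothetical sample identifiers: 3 healthy and 4 T2D bulk samples.
SAMPLES = [f"healthy_{i}" for i in range(1, 4)] + [f"T2D_{i}" for i in range(1, 5)]


def long_table(samples, cell_types):
    """One row per (sample, cell type) pair; proportion left as a placeholder."""
    return [
        {"sample": sample, "cell_type": cell_type, "proportion": None}
        for sample, cell_type in product(samples, cell_types)
    ]


def pivot(rows, value):
    """Wide matrix: one entry per sample, holding one value per cell type."""
    matrix = {}
    for row in rows:
        matrix.setdefault(row["sample"], {})[row["cell_type"]] = row[value]
    return matrix


rows = long_table(SAMPLES, CELL_TYPES)
matrix = pivot(rows, "proportion")
print(len(rows), "rows in the long table")
print(len(matrix), "samples x", len(CELL_TYPES), "cell types in the matrix")
```

The long table and the matrix hold the same values; the matrix is simply the long table rearranged so that each sample is a row and each cell type a column.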
Hopefully, this has been illuminating! Now let’s try two other ways of inferring from a reference and see if it makes a difference.
Like4like: Deconvolution of healthy samples with a healthy reference and diseased samples with a diseased reference
Hands-on: Like4like Inference
- MuSiC Compare (Galaxy version 0.1.1+galaxy4) with the following parameters:
  - In “New scRNA Group”:
    - “Insert New scRNA Group”
      - “Name of scRNA Dataset”: scRNA_set:Normal
      - “scRNA Dataset”: ESet_object_sc_healthy.rdata (Input dataset)
      - In “Advanced scRNA Parameters”:
        - “Cell Types Label from scRNA dataset”: Inferred cell type - author labels
        - “Samples Identifier from scRNA dataset”: Individual
        - “Comma list of cell types to use from scRNA dataset”: alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell
      - In “Bulk Datasets in scRNA Group”:
        - “Insert Bulk Datasets in scRNA Group”
          - “Name of Bulk Dataset”: Bulk_set:Normal
          - “Bulk RNA Dataset”: ESet_object_bulk_healthy.rdata (Input dataset)
          - “Factor Name”: Disease
    - “Insert New scRNA Group”
      - “Name of scRNA Dataset”: scRNA_set:T2D
      - “scRNA Dataset”: ESet_object_sc_T2D.rdata (Input dataset)
      - In “Advanced scRNA Parameters”:
        - “Cell Types Label from scRNA dataset”: Inferred cell type - author labels
        - “Samples Identifier from scRNA dataset”: Individual
        - “Comma list of cell types to use from scRNA dataset”: alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell
      - In “Bulk Datasets in scRNA Group”:
        - “Insert Bulk Datasets in scRNA Group”
          - “Name of Bulk Dataset”: bulk_set:T2D
          - “Bulk RNA Dataset”: ESet_object_bulk_T2D.rdata (Input dataset)
          - “Factor Name”: Disease
- Add the #like4like tag to each of the outputs.
Question
- How have the cell inferences changed, now that we have changed the scRNA references used?
- Overall, our interpretation here is that the differences are less pronounced. It’s interesting to conjecture whether this is an artefact of the analysis, or whether, possibly, the beta cells in the diseased samples are not only fewer but also contain fewer beta-cell-specific transcripts (and thereby have impaired beta cell function). A diseased reference would then lower the bar for inferring a beta cell, leading to a higher proportion of inferred beta cells.
Let’s try one more inference - this time, we’ll use only healthy cells as a reference, to (theoretically) make a more consistent analysis across the two phenotypes.
healthyscref: Deconvolution using only healthy cells as a reference
Hands-on: Healthy sc reference only inference
- MuSiC Compare (Galaxy version 0.1.1+galaxy4) with the following parameters:
  - In “New scRNA Group”:
    - “Insert New scRNA Group”
      - “Name of scRNA Dataset”: scRNA_set:Normal
      - “scRNA Dataset”: ESet_object_sc_healthy.rdata (Input dataset)
      - In “Advanced scRNA Parameters”:
        - “Cell Types Label from scRNA dataset”: Inferred cell type - author labels
        - “Samples Identifier from scRNA dataset”: Individual
        - “Comma list of cell types to use from scRNA dataset”: alpha cell,beta cell,delta cell,gamma cell,acinar cell,ductal cell
      - In “Bulk Datasets in scRNA Group”:
        - “Insert Bulk Datasets in scRNA Group”
          - “Name of Bulk Dataset”: Bulk_set:Normal
          - “Bulk RNA Dataset”: ESet_object_bulk_healthy.rdata (Input dataset)
          - “Factor Name”: Disease
        - “Insert Bulk Datasets in scRNA Group”
          - “Name of Bulk Dataset”: Bulk_set:T2D
          - “Bulk RNA Dataset”: ESet_object_bulk_T2D.rdata (Input dataset)
          - “Factor Name”: Disease
- Add the #healthyscref tag to each of the outputs.
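With all three analyses set up, it can help to see side by side which reference each bulk dataset is paired with. The following Python sketch records that pairing as read from the three parameter forms in this tutorial; the dictionary layout and the helper `shares_reference` are hypothetical and only serve as a summary.

```python
# Which scRNA reference each bulk dataset is paired with, per analysis
# (hypothetical summary structure; file names are those used in the forms).
DESIGNS = {
    "altogether": {
        "ESet_object_bulk_healthy.rdata": "ESet_object_sc_combined.rdata",
        "ESet_object_bulk_T2D.rdata": "ESet_object_sc_combined.rdata",
    },
    "like4like": {
        "ESet_object_bulk_healthy.rdata": "ESet_object_sc_healthy.rdata",
        "ESet_object_bulk_T2D.rdata": "ESet_object_sc_T2D.rdata",
    },
    "healthyscref": {
        "ESet_object_bulk_healthy.rdata": "ESet_object_sc_healthy.rdata",
        "ESet_object_bulk_T2D.rdata": "ESet_object_sc_healthy.rdata",
    },
}


def shares_reference(design):
    """True when every bulk dataset in the design uses the same reference."""
    return len(set(design.values())) == 1


for tag, design in DESIGNS.items():
    print(f"#{tag}: same reference for both phenotypes = {shares_reference(design)}")
    for bulk, reference in design.items():
        print(f"  {bulk} <- {reference}")
```

Only like4like uses a different reference for each phenotype, so it is the one analysis where differences between the references themselves can feed into the comparison.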
Question
- How have the cell inferences changed this time?
- If using a like4like inference reduced the difference between the phenotypes, aligning both phenotypes to the same (healthy) reference exacerbated it: even fewer beta cells are inferred in this analysis.
Overall, it’s important to remember that the inference changes depending on the reference used. For example, a combined reference might contain mostly healthy samples or mostly diseased samples, which would impact the inferred cellular compositions.
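For the combined reference used in this tutorial, the balance can be worked out from the subject counts given in the Data section (6 healthy, 4 T2D). The Python sketch below does that arithmetic; the function name `composition` is hypothetical, and the calculation counts subjects only, so it says nothing about how many cells each subject contributes.

```python
# Subject counts for the combined single cell reference (from the Data section).
REFERENCE_SUBJECTS = {"healthy": 6, "T2D": 4}


def composition(counts):
    """Return each group's share of the total number of subjects."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


for group, share in composition(REFERENCE_SUBJECTS).items():
    print(f"{group}: {share:.0%} of reference subjects")
```

By subject count, then, the combined reference leans towards healthy samples, which is worth keeping in mind when reading the altogether results.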
Conclusion
Congrats! You’ve made it to the end of this suite of deconvolution tutorials! You’ve learned how to find quality data for reference and for analysis, how to reformat it for deconvolution using MuSiC, and how to compare cellular inferences using multiple kinds of reference datasets. You can find the workflow for this tutorial and an example history.
We hope this helps you in your research!
This tutorial is part of the https://singlecell.usegalaxy.eu portal (Tekman et al. 2020).
To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and discuss with fellow users of Galaxy single cell analysis tools on #single-cell-users.
We also post new tutorials / workflows there from time to time, as well as any other news.
If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.
You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet.