Bulk RNA Deconvolution with MuSiC

Author(s)	Mehmet Tekman Wendi Bacon
Editor(s)	Morgan Howells
Reviewers

Overview
Questions:

How do we infer cell type proportions from bulk RNA-seq data?

How are these cell types grouped together?

Objectives:

Construct Bulk and scRNA Expression Set Objects

Inspect these objects for various properties

Measure the abundance of certain cell type cluster markers compared to others

Requirements:

Introduction to Galaxy Analyses

slides Slides: An introduction to scRNA-seq data analysis

Time estimation: 2 hours

Supporting Materials:

Datasets

Workflows

FAQs

video Recordings

video Tutorial (March 2021) - 20m

video View All

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

UseGalaxy.cz ✅

Published: Feb 11, 2022

Last modification: Feb 14, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00240

rating Rating: 5.0 (0 recent ratings, 2 all time)

version Revision: 10

Bulk RNA-seq data contains a mixture of transcript signatures from several types of cells. We wish to deconvolve this mixture to obtain estimates of the proportions of cell types within the bulk sample. To do this, we can use single cell RNA-seq data as a reference for estimating the cell type proportions within the bulk data.

In this tutorial, we will use bulk and single-cell RNA-seq data, including matrices of similar tissues from different sources, to illustrate how to infer cell type abundances from bulk RNA-seq.

Agenda

In this tutorial, we will cover:

Bulk RNA-seq Cell Type Deconvolution

Cell Proportion Estimation

Get data

Building the Expression Set objects

Estimating Cell Type proportions

Estimation of cell type proportions with pre-grouping of cell types

Get data

Conclusion

Bulk RNA-seq Cell Type Deconvolution

The heterogeneity that exists in the cellular composition of bulk RNA-seq can add bias to the results from differential expression analysis. In order to circumvent this limitation, RNA-seq deconvolution aims to infer cell type abundances by modelling the gene expressions levels as ‘weighted sums’ of cell type specific expression profiles.

So… You fancy some maths do you? Good! This is important, as you’ll see variations of the phrase ‘weighted sums’ if you ever look at any papers in the field! Let’s think about just the ‘sums’ here. If we think about the total expression of a given gene you might get from bulk RNA-seq, you could also think about it as the sum of the expression of each cell, for example,

Total = cell_a expression + cell_b expression … (and so forth)

This is a ‘sum’, because we’re adding up all the cells. If we now think about how we expect similar expression from similar cell types, we could change this to the following:

Total = cell_type_a expression + cell_type_b expression … (and so forth)

BUT WAIT! That’s not strictly true, because the cell types are not all in equal proportion (if only!). So we have to take into account another variable, the proportion a given cell type takes up in a sample. So we now have:

Total = Proportion_a x cell_type_a expression + Proportion_b x cell_type_b expression

So now, we are using the sums of expression based off of the cell proportion. And if you’re a mathematician, you might instead put this as

T = C x P or some fancier formulas…(read more in-depth here if you like!)

The point is, if we have an idea of what the average expression should be for each gene (what we can get from single cell RNA-seq data, C), and we have the total expression (from the bulk RNA-seq, T), then we can infer the cell proportions (P).

Many different computational methods have been developed to estimate these cell type proportions, but in this tutorial we will be using the MuSiC tool suite (Wang et al. 2019) to estimate the proportion of individual cell types in our bulk RNA-seq datasets.

MusiC

MuSiC uses cell-type specific gene expression from single-cell RNA seq data to characterize cell type compositions (proportions) from bulk RNA-seq data in complex tissues. By appropriate weighting of genes showing cross-subject (sample to sample) and cross-cell (cells of the same cell type within a sample) consistency, MuSiC enables the transfer of cell type-specific gene expression information from one dataset to another.

Question

What is a weighted sum? Hint: What kinds of genes would be best for distinguishing cell types, and what kinds of genes would make it difficult?

So you know what the Sum is from above - the total expression of a given gene in a bulk RNA-seq sample depends on the proportion of cell types and the average expression level of each of those cell types (T = C x P). However, single cell RNA-seq data is highly variable. Cell-type specific expression in genes with lower variation from sample to sample (i.e. person to person or organism to organism) and cell to cell (i.e. within a sample) will be the most useful for distinguishing cells, while genes that vary heavily (i.e. high in cell type_a in one sample, but low in cell type_a in another sample) will be the least useful in accurately distinguishing cells. Therefore, to accurately use the mean expression level in a cell_type, MusiC weights the sums, favouring more consistently expressed genes in cell types.

Solid tissues often contain closely related cell types that are difficult to distinguish from one another, a phenomenon known as “collinearity”. To deal with collinearity, MuSiC can also employ a tree-guided procedure that recursively zooms in on closely related cell types. Briefly, MuSic first groups similar cell types into the same cluster and estimates cluster proportions, then recursively repeats this procedure within each cluster.

muse1. — **Figure 1**: Overview of MuSiC Suite

Cell Proportion Estimation

Here we will extract cell proportions from a bulk data of human pancreas data from Fadista et al. 2014 concerning 56 638 genes across 89 samples, using a single cell human pancreas dataset from Segerstolpe et al. 2016 containing 25 453 genes across 2209 cells, clustered into 14 cell types, from 6 healthy subjects and 4 with Type-II diabetes (T2D). If the deconvolution is good, and the datasets are compatible with sufficient enough overlap, we should be able to identify the same cell types from the bulk data.

Get data

Hands On: Data upload
Create a new history for this tutorial “Deconvolution: Cell Type inference of Human Pancreas Data”
Import the files from Zenodo or from the shared data library (GTN - Material -> single-cell -> Bulk RNA Deconvolution with MuSiC):
Human pancreas bulk RNA datasets (tag: #bulk)
https://zenodo.org/record/5719228/files/GSE50244bulkeset.expression.tabular
https://zenodo.org/record/5719228/files/GSE50244bulkeset.phenotype.tabular
Human pancreas single-cell RNA datasets (tag: #scrna)
https://zenodo.org/record/5719228/files/EMTABesethealthy.expression.tabular
https://zenodo.org/record/5719228/files/EMTABesethealthy.phenotype.tabular
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

Check the datatype

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, click galaxy-chart-select-data Datatypes tab on the top

In the galaxy-chart-select-data Assign Datatype, select tabular from “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Add to each expression file a tag corresponding to #bulk and #scrna

Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

To tag a dataset:

Click on the dataset to expand it

Click on Add Tags galaxy-tags

Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).

Press Enter

Check that the tag appears below the dataset name

Tags beginning with # are special!

They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;

dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);

datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;

datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

More information is in a dedicated #nametag tutorial.

Exploring the Datasets

We are told before we download the data, that:

The bulk human pancreas dataset is 89 samples across 56 638 genes
The single cell human pancreas datasets is 2209 cells across 24 453 genes.

But what does this actually look like in the data?

Comment: Inspecting Datasets

Note that at any time you can visually inspect the input datasets yourself by either:

Expanding the dataset in the history panel clicking on the name of the dataset

Or clicking on galaxy-eye icon to load them into the main window.

Let’s start exploring the datasets.

Hands On: Exploring the Datasets

Inspect the #scrna expression file

Question

What do the rows and columns correspond to?

What do the values at each position mean?

Has the data been normalised?

Why are there so many zeroes?

Open image in new tab

Figure 2: Peeking at the tabular scRNA expression dataset

Rows correspond to gene names, and columns to cell identifiers

These are the number of reads or counts for each gene for each cell

The data has not been normalised since the counts are integer and not decimal

Single-cell datasets can be very sparse for a variety of reasons relating to dropouts and biological factors. For more information, please see the introduction single-cell RNA-seq slides#41.

Inspect the #bulk expression file

Question

What do the rows and columns correspond to?

In which field is there likely to be overlap with the #scrna dataset?

Open image in new tab

Figure 3: Peeking at the tabular bulk RNA-seq expression dataset

As before with the #scrna dataset, rows correspond to gene names and columns to sample identifiers.

The sample identifiers and the cell identifiers are completely different, but the gene names appear to be using the same symbols as the #scrna dataset, so the gene field is the common factor here.

Inspect the #scrna phenotype file

Question

Does the phenotypes file describe genes or cells?

What does the SubjectName field describe?

What does the cellType field describe?

Open image in new tab

Figure 4: Peeking at the tabular scRNA phenotype dataset

The first column is the index column, which uses the cell identifiers in the header of the #scrna expression file, so the phenotypes file describes the cells.

SubjectName tells us whether the cell (on that row) is labelled as Type-II diabetes or not, “Non T2D” or “T2D” respectively.

cellType tells us which cell type the cell was assigned to. This could be the result of prior-clustering and then labelling, or the cells could be labelled before analysis.

Inspect the #bulk phenotype file

Question

Does the phenotypes file describe genes or samples?

Is the SubjectName field related to the SubjectName field in the #scrna phenotypes file?

What other factors are in the phenotypes file?

Is the tissue field related to the cellType field in the #scrna phenotypes file?

Open image in new tab

Figure 5: Peeking at the tabular bulk RNA-seq phenotype dataset

The first column is the index column, which uses the sample identifiers in the header of the #bulk expression file, so the phenotypes file describes the samples.

The SubjectName column uses a completely different set of identifiers from the #scrna phenotypes file so they should be assumed to be unrelated.

We see age, bmi, hba1c, gender, tissue. HbA1c appears to be a gene of interest related to a known phenotype.

This is the question we wish to answer in the deconvolution. Visually, there appears to be no overlap, but the “pancreatic islets” tissue likely consists of several cell types that show expressions profiles with some affinity to the single cell types described in the #scrna phenotypes file.

The bulk RNA-seq phenotype file lists the main factors of interest, and HbA1c appears to be a specific gene associated with a phenotype. It is well known that the beta cell proportions are related to T2D disease status. In the progress of T2D, the number of beta cells decreases. One of the most important tests for T2D is the HbA1c (hemoglobin A1c) test. When the HbA1c level is greater than 6.5%, the patient is diagnosed as T2D. We will look later at the beta cell proportions relationship with HbA1c level in this deconvolution analysis.

Building the Expression Set objects

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

Open your Galaxy server

Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.

Navigate to your tutorial

Tool names in tutorials will be blue buttons that open the correct tool for you

Note: this does not work for all tutorials (yet)

You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface

Warning: Not all browsers work!

We’ve had some issues with Tutorial mode on Safari for Mac users.

Try a different browser if you aren’t seeing the button.

For now we need to construct our Expression set objects that will be consumed by MuSiC.

Open image in new tab

Figure 6: Image from Stefano Monti

Expression Set objects are a container for high-throughput assays and experimental metadata. The ExpressionSet class is derived from eSet, and requires a matrix named exprs.

The ExpressionSet class is designed to combine several different sources of information into a single convenient structure. An ExpressionSet can be manipulated (e.g., subsetted, copied) conveniently, and is the input or output from many Bioconductor functions.

The data in an ExpressionSet is complicated, consisting of:

expression data (exprs, or assayData; assayData is used to hint at the methods used to access different data components, as we will see below);

metadata describing samples in the experiment (phenoData, pData)

information related to the protocol used for processing each sample and usually extracted from manufacturer files (protocolData);

annotations and metadata about the features on the chip or technology used for the experiment (featureData, annotation, fData);

and a flexible structure to describe the experiment (experimentData).

The ExpressionSet class coordinates all of this data, so that you do not usually have to worry about the details.

For more information please read the specification as well as the image source.

Here we shall build two ExpressionSet objects corresponding to the bulk and single-cell datatypes.

Construct Expression Set Object

Hands On: Build the Expression Set inputs

Construct Expression Set Object ( Galaxy version 0.1.1+galaxy4) with the following parameters:

param-file “Assay Data”: GSE50244bulkeset.expression.tabular (Input dataset)

param-file “Phenotype Data”: GSE50244bulkeset.phenotype.tabular (Input dataset)

Comment

An ExpressionSet object has many data slots, the principle of which are the experiment data (exprs), the phenotype data (pData), as well metadata pertaining to experiment information and additional annotations (fData).

Construct Expression Set Object ( Galaxy version 0.1.1+galaxy4) with the following parameters:

param-file “Assay Data”: EMTABesethealthy.expression.tabular (Input dataset)

param-file “Phenotype Data”: EMTABesethealthy.phenotype.tabular (Input dataset)

Inspect Expression Set Object

We will now inspect these objects we just created to see what information we can extract out of them, and how these multiple datasets are summarized within the object.

Hands On: Inspect and Describe the scRNA ExpressionSet Object

Obtain General Info about the data set

galaxy-eye Click on the #scrna General Info dataset in the history view (output of Construct Expression Set Object tool)

Obtain Feature Information about the data set

Inspect Expression Set Object ( Galaxy version 0.1.1+galaxy4) with the following parameters:

Warning: Danger: This tool has needs!

You may need to click the dataset from your history and drag it into the input of this tool. Some browsers don’t allow this.

param-file “ESet Dataset”: #scrna (output of Construct Expression Set Object tool)

“Inspect”: Feature Data Table

Comment: Features or Genes?

“Features” are synonymous with “Genes” in a genomic setting, but data scientists tend to prefer to use the former term, as it can be used in other non-genomic settings.

By inspecting the Feature Data Table, you should see a list of gene names.

Obtain the dimensions of the data set

Inspect Expression Set Object ( Galaxy version 0.1.1+galaxy4) with the following parameters:

param-file “ESet Dataset”: #scrna (output of Construct Expression Set Object tool)

“Inspect”: Dimension

Question

How many samples are in the dataset, and how many genes?

Does this agree with the original input tabular expression data set?

1097 samples and 25 453 genes

Yes!

Estimating Cell Type proportions

Instead of selecting marker genes, MuSiC gives weights to each gene. The weighting scheme is based on cross-subject variation, by up-weighing genes with low variation and down-weighing genes with high variation in a cell type. Here we demonstrate this step-by-step with the human pancreas datasets.

The deconvolution of 89 subjects from Fadista et al. 2014 is performed with the bulk data GSE50244 expression set and single cell reference EMTAB. The estimation was constrained on 6 major cell types: alpha, beta, delta, gamma, acinar and ductal, which make up over 90% of the whole islet.

Cell Type estimation with MuSiC

The deconvolution process can be performed by simply inputting the bulk RNA-seq & scRNA-seq datasets and then hitting execute. However, in this section we will be tracing the effect of a disease phenotype across both datasets.

Here we will use one of the factors from the bulk RNA-seq phenotypes related to the T2D disease status. Any bulk RNA-seq sample coming from a patient with the HbA1c factor above 6.5% would be clinically classified as having T2D, so we want to compare these datasets with non-diseased samples. As diabetes affects Beta cells, we’re particularly interested in these cells proportions - mostly to prove that the deconvolution has worked well!

It’s also important that input scRNA-seq datasets (ideally) have some representation of cell populations from diseased patients.

Question

Why should the scRNA-seq dataset contain diseased cells and healthy cells?

For instance, there might be a weird extra cell type that appears in the disease phenotype. Or perhaps there is a missing cell type in the disease phenotype. Without both healthy & diseased datasets, the cells in the bulk datasets might not be fully identifiable. Saying that, such datasets are not always available - something to keep in mind when interpreting the results!

Hands On: Task description

MuSiC ( Galaxy version 0.1.1+galaxy4) with the following parameters:

*MuSiC sometimes does not show up from the tool search box. You may need to look for it under the Single Cell heading

param-file “scRNA Dataset”: #scrna (output of Construct Expression Set Object tool)

param-file “Bulk RNA Dataset”: #bulk (output of Construct Expression Set Object tool)

“Purpose”: Estimate Proportions

“Methods to use”: MuSiC and NNLS

“Cell Types Label from scRNA dataset”: cellType

“Samples Identifier from scRNA dataset”: sampleID

“Comma list of cell types to use from scRNA dataset”: alpha,beta,delta,gamma,acinar,ductal

“Phenotype factors”: (leave blank)

“Excluded phenotype factors”: sampleID,SubjectName

Under “Show proportions of a disease factor?”: Yes

“scRNA Phenotype Cell Target”: beta

“Bulk Phenotype Target”: hba1c (Factor from Bulk Phenotype)

“Bulk Phenotype Target Threshold”: 6.5

“scRNA Sample Disease Group”: T2D (Ideally a Factor from scRNA Phenotype)

“scRNA Sample Disease Group (Scale)”: 5

Comment

It’s important to set a phenotype target threshold, otherwise no cells will be matched for the phenotype target. In this case, when the HbA1c level is greater than 6.5%, the patient is diagnosed as T2D.

The estimated proportions are normalized such that the proportions of cell types within each sample sum to 1. MuSic compares itself against a previous method of deconvolution known as Non-negative Least-Squares (NNLS), which MuSic supercededs via its Weighted Non-negative Least-Squares (W-NNLS) methodology. You can remove this if you wish from within the tool parameters when running.

jitter_plot. — **Figure 7**: Jitter plot of Estimated Proportions

In the above image you can see (a) the estimated proportion of cells for each of the 6 declared types, as calculated by MuSiC and the NNLS methods, respectively. In the (b) section, this information is better represented as a box plot to show you the distribution of cell type proportions.

ctprop. — **Figure 8**: (top) Cell type proportions by disease factor, and (bottom) HbA1c factor expression against beta cell type proportion

As stated previously, it is well known that the beta cell proportions are related to T2D disease status. As T2D progresses, the number of beta cells decreases. In the above image we can see in (a) that we have the same information as previous, but we also distinguish between cells that from patients with T2D status over the Normal cell phenotypes. Section (b) further explores this with a linear regression showing the cell type proportion of cells with HbA1c expression, where we see that there is a significant negative correlation between HbA1c level and beta cell proportions.

Comment

We can extract the coefficients of this fitting by looking at the Log of Music Fitting Data in the Summaries and Logs output collection:
 Coefficients:
              Estimate Std. Error t value Pr(>|t|)
 (Intercept)  0.797148   0.194757   4.093  0.00011 ***
 age          0.002639   0.001772   1.489  0.14087
 bmi         -0.013620   0.007276  -1.872  0.06529 .
 hba1c       -0.061396   0.025403  -2.417  0.01819 *
 genderMale   0.079874   0.039274   2.034  0.04566 *
In addition to HbA1c levels, gender has also correlated with beta cell proportions. This is unsurprising, as more men have diabetes than women and sex is known to impact HbA1c levels.

Proportions of Cell Type to each Bulk RNA sample

One question we might wish to ask is: what affinity did each of the 6 single cell types have to each of the 89 subjects in the bulk data? For this we can look at the raw data galaxy-eye MuSiC Estimated Proportions of Cell Types in the Proportion Matrices, to get a glimpse of cell type compositions on a bulk RNA sample level.

Both the MuSiC and the NNLS calculations of this data is best represented in the below heatmap, with RNA samples as rows and cell types as columns:

subject_heatmap. — **Figure 9**: Heatmap of cell type proportions at the RNA sample level.

Question

Which cell types are under-represented in the NNLS method?

Which cell types do not appear to be present in both?

Here it is evident that the previous NNLS method over-represents the Alpha cell type compared to the MuSiC method which gives more weight to the Beta and Ductal cell types, which were under-represented in the NNLS method.

The Delta and Gamma cell types appear empty in both.

Estimation of cell type proportions with pre-grouping of cell types

In the previous section we estimated cell types under the assumption that that the gene expression between cell types was largely independent. However, solid tissues often contain closely related cell types. This correlation of gene expression between these cell types is termed ‘collinearity’, which makes it difficult to resolve their relative proportions in bulk data.

To deal with collinearity, MuSiC can also employ a tree-guided procedure that recursively zooms in on closely related cell types.

Briefly, similar cell types are grouped into the same cluster and their cluster proportions are estimated, then this procedure is recursively repeated within each cluster. At each recursion stage, only genes that have low within-cluster variance are used, as they should be consistent within a cell type. This is critical as the mean expression estimates of genes with high variance are affected by the pervasive bias in cell capture of scRNA-seq experiments, and thus cannot serve as reliable reference.

To perform this analysis, we will use mouse kidney single-cell RNA-seq data from Park et al. 2018 described by 16 273 genes over a trimmed subset of 10 000 cells, giving 16 unique cell type (2 of which are novel) across 7 subjects. The bulk RNA-seq dataset is from Beckerman et al. 2017 and contains mouse kidney tissue described by 19 033 genes over 10 samples.

Get data

Hands On: Data upload
Create a new history for this tutorial “Deconvolution: Dendrogram of Mouse Data”
Import the files from Zenodo or from the shared data library (GTN - Material -> single-cell -> Bulk RNA Deconvolution with MuSiC):
Mouse kidney bulk RNA datasets (tag: #bulk)
https://zenodo.org/record/5719228/files/Mousebulkeset.expression.tabular
https://zenodo.org/record/5719228/files/Mousebulkeset.phenotype.tabular
Mouse kidney single-cell RNA datasets (tag: #scrna)
https://zenodo.org/record/5719228/files/Mousesubeset.expression.tabular
https://zenodo.org/record/5719228/files/Mousesubeset.phenotype.tabular
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

Check that the datatype

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, click galaxy-chart-select-data Datatypes tab on the top

In the galaxy-chart-select-data Assign Datatype, select tabular from “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Add to each expression file a tag corresponding to #bulk and #scrna

Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

To tag a dataset:

Click on the dataset to expand it

Click on Add Tags galaxy-tags

Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).

Press Enter

Check that the tag appears below the dataset name

Tags beginning with # are special!

They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;

dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);

datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;

datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

More information is in a dedicated #nametag tutorial.

As before, you may choose to explore the bulk and scrna datasets and try to determine their factors from the phenotypes as well as any overlapping fields that will be used to guide the deconvolution.

You will need to again create ExpressionSet objects, as before.

Construct Expression Set Object

Hands On: Build the Expression Set inputs

Construct Expression Set Object ( Galaxy version 0.1.1+galaxy4) with the following parameters:

param-file “Assay Data”: Mousebulkeset.expression.tabular (Input dataset)

param-file “Phenotype Data”: Mousebulkeset.phenotype.tabular (Input dataset)

Comment

An ExpressionSet object has many data slots, the principle of which are the experiment data (exprs), the phenotype data (pData), as well metadata pertaining to experiment information and additional annotations (fData).

Construct Expression Set Object ( Galaxy version 0.1.1+galaxy4) with the following parameters:

param-file “Assay Data”: Mousesubeset.expression.tabular (Input dataset)

param-file “Phenotype Data”: Mousesubeset.phenotype.tabular (Input dataset)

Colinearity Dendrogram with MuSiC to determine cell type similarities

Determining cell type similarities requires first producing a design matrix as well as a cross-subject mean of relative abundance, using a tree-based clustering method of the cell types we wish to cluster.

Hands On: Task description

MuSiC ( Galaxy version 0.1.1+galaxy4) with the following parameters:

param-file “scRNA Dataset”: #scrna (output of Construct Expression Set Object tool)

param-file “Bulk RNA Dataset”: #bulk (output of Construct Expression Set Object tool)

“Purpose”: Compute Dendrogram

“Cell Types Label from scRNA Dataset”: cellType

“Cluster Types Label from scRNA dataset”: clusterType

“Samples Identifier from scRNA dataset”: sampleID

“Comma list of cell types to use from scRNA dataset”: Endo,Podo,PT,LOH,DCT,CD-PC,CD-IC,Fib,Macro,Neutro,B lymph,T lymph,NK

dendrogram. — **Figure 10**: Dendrogram of Design Matrix and Cross-Subject Mean of Relative Abundance

Question

What do you notice about the cells clustering?

How many clusters can you see with a height threshold above 650 in the “Cluster log(Design Matrix)”?
The immune cells are clustered together and the kidney specific cells are clustered together. Notice that DCT and PT are within the same high-level grouping.
The cut-off of 650. Here we cut 13 cell types into 4 groups:
C1: Neutro
C2: Podo
C3: Endo, CD-PC, CD-IC, LOH, DCT, PT
C4: Fib, Macro, NK, B lymph, T lymph

Heatmap of Cell Type Similarities using MuSiC

We shall use the 4 cell type groups determined by the cut off threshold in the above question box. To guide the clustering, we shall upload known epithelial and immune cell markers to improve the more diverse collection of cell types in the C3 and C4 groups.

Hands On: Upload marker genes and generate heatmap
Import the files from Zenodo or from the shared data library (GTN - Material -> single-cell -> Bulk RNA Deconvolution with MuSiC):
https://zenodo.org/record/5719228/files/epith.markers
https://zenodo.org/record/5719228/files/immune.markers
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
MuSiC ( Galaxy version 0.1.1+galaxy4) with the following parameters:

Note

Warning: Shortcut!

Here we need to re-use all the inputs from the previous MuSiC tool step, plus add a few extra. To speed this up, you can simply click on the re-run icon galaxy-refresh under any of its outputs.

param-file “scRNA Dataset”: #scrna (output of Construct Expression Set Object tool)

param-file “Bulk RNA Dataset”: #bulk (output of Construct Expression Set Object tool)

“Purpose”: Compute Dendrogram

“Cell Types Label from scRNA Dataset”: cellType

“Cluster Types Label from scRNA dataset”: clusterType

“Samples Identifier from scRNA dataset”: sampleID

“Comma list of cell types to use from scRNA dataset”: Endo,Podo,PT,LOH,DCT,CD-PC,CD-IC,Fib,Macro,Neutro,B lymph,T lymph,NK

In “Cluster Groups”:

param-repeat “Insert Cluster Groups”

“Cluster ID”: C1

“Comma list of cell types to use from scRNA dataset”: Neutro

param-repeat “Insert Cluster Groups”

“Cluster ID”: C2

“Comma list of cell types to use from scRNA dataset”: Podo

param-repeat “Insert Cluster Groups”

“Cluster ID”: C3

“Comma list of cell types to use from scRNA dataset”: Endo,CD-PC,LOH,CD-IC,DCT,PT

“Marker Gene Group Name”:Epithelial

param-file “List of Gene Markers”: epith.markers (Input dataset)

param-repeat “Insert Cluster Groups”

“Cluster ID”: C4

“Comma list of cell types to use from scRNA dataset”: Macro,Fib,B lymph,NK,T lymph

“Marker Gene Group Name”:Immune

param-file “List of Gene Markers”: immune.markers (Input dataset)

Comment

The C1 (Neutrophil) and C2 (Podocyte) clusters do not use marker genes for the dendrogram clustering in this dataset.

dendro_jitboxheat. — **Figure 11**: Jitter and Boxplots of all cell types followed by a heatmap of each RNA sample against cell type. Note that the y-axis for each of the plots above are not constant across cell types.

Question

Most of the expression in the above plot appears to be derived from one cell type.

Which cell type dominates the plot?

What does this tell you about the bulk RNA?

The PT cells appear to dominate.

Most of the expression in the bulk RNA dataset is derived solely from the PT cells, and could be a monogenic cell line.

Conclusion

In this tutorial we constructed ExpressionSet objects, inspected and annotated them, and then finally processed them with the MuSiC RNA-Deconvolution analysis suite.

Below is an overview of the workflow that was used throughout this tutorial.

workflow1. — **Figure 12**: Workflow for estimating cell types

Note how two ExpressionSet objects are constructed: one from bulk RNA-seq tabular assay data, and the other from single-cell RNA-seq tabular assay data. A blind analysis of cell proportion estimation is performed first. Then in the second half of the tutorial, we performed a guided analysis using pre-grouped cell types.

workflow_dendro.png. — **Figure 13**: Workflow for generating a dendrogram and clustering cells upon it

This tutorial is part of the https://singlecell.usegalaxy.eu portal (Tekman et al. 2020).

feedback To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and discuss with fellow users of Galaxy single cell analysis tools on #single-cell-users

We also post new tutorials / workflows there from time to time, as well as any other news.

point-right If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.

tool You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet

You've Finished the Tutorial

Key points

Deconvolution tools show individual cell type proportions in bulk RNA-seq data

Bulk RNA-seq can be complimented by scRNA-seq data

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Fadista, J., P. Vikman, E. O. Laakso, I. G. Mollet, J. L. Esguerra et al., 2014 Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism. Proceedings of the National Academy of Sciences 111: 13924–13929. 10.1073/pnas.1402665111
Segerstolpe, Å., A. Palasantza, P. Eliasson, E.-M. Andersson, A.-C. Andréasson et al., 2016 Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism 24: 593–607. 10.1016/j.cmet.2016.08.020
Beckerman, P., J. Bi-Karchin, A. S. D. Park, C. Qiu, P. D. Dummer et al., 2017 Transgenic expression of human APOL1 risk variants in podocytes induces kidney disease in mice. Nature medicine 23: 429–438. 10.1038/nm.4287
Park, J., R. Shrestha, C. Qiu, A. Kondo, S. Huang et al., 2018 Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 360: 758–763. 10.1126/science.aar2131
Wang, X., J. Park, K. Susztak, N. R. Zhang, and M. Li, 2019 Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nature communications 10: 1–9. 10.1038/s41467-018-08023-x
Tekman, M., B. Batut, A. Ostrovsky, C. Antoniewski, D. Clements et al., 2020 A single-cell RNA-sequencing training and analysis suite using the Galaxy framework. GigaScience 9: giaa102. 10.1093/gigascience/giaa102 https://academic.oup.com/gigascience/article/9/10/giaa102/5931798

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Mehmet Tekman, Wendi Bacon, Bulk RNA Deconvolution with MuSiC (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/bulk-music/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{single-cell-bulk-music,
author = "Mehmet Tekman and Wendi Bacon",
	title = "Bulk RNA Deconvolution with MuSiC (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/bulk-music/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

Do you want to extend your knowledge?
Follow one of our recommended follow-up trainings:

tutorial Hands-on: Matrix Exchange Format to ESet | Creating a single-cell RNA-seq reference dataset for deconvolution

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/single-cell/tutorials/bulk-music/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: music_construct_eset
  owner: bgruening
  revisions: 48f0fb3061b1
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: music_deconvolution
  owner: bgruening
  revisions: 8cd2ecfa2e61
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: music_inspect_eset
  owner: bgruening
  revisions: fde4c972e7c5
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

t{ hist[0] | to_stars }} 2

March 2022

5 stars: Liked: the creation of plots and after the process of visualization and the try of understanding what I saw Disliked: i think nothing

5 stars: Liked: The tutorial was very informative and steps were defined clearly to perform the analysis... Disliked: Overall, it was quite informative. However, it would be preferable to include a video tutorial for performing the analysis, where results can be thoroughly explained and the steps would be more easier to follow to perform the analysis for new users.