Clustering 3K PBMCs with Scanpy

Author(s)	Bérénice Batut Hans-Rudolf Hotz Mehmet Tekman Pavankumar Videm Diana Chiang Jurado
Reviewers

Overview
Questions:

What are the steps to prepare single-cell RNA-Seq data for clustering?

How to cluster cells in single-cell RNA-Seq data?

How cell type annotation can be assigned to cell clusters?

Objectives:

Describe an AnnData object to store single-cell data

Explain the preprocessing steps for single-cell data

Evaluate quality of single-cell data and apply steps to select and filter cells and genes based on QC

Execute data normalization and scaling

Identify highly variable genes

Construct and run a dimensionality reduction using Principal Component Analysis

Perform a graph-based clustering for cells

Identify marker genes for the clusters

Construct and run a cell type annotation for the clusters

Requirements:

Introduction to Galaxy Analyses

slides Slides: Pre-processing of Single-Cell RNA Data

tutorial Hands-on: Pre-processing of Single-Cell RNA Data

tutorial Hands-on: Pre-processing of 10X Single-Cell RNA Datasets

Time estimation: 8 hours

Supporting Materials:

Slides

Datasets

Workflows

galaxy-history-answer Answer Histories

UseGalaxy.eu
2024-04-29

UseGalaxy.org
2024-04-29

UseGalaxy.org.au
2024-04-29

UseGalaxy.fr
2024-04-29

help How to Use This

FAQs

video Recordings

video Tutorial (September 2024) - 1h39m

video Tutorial (March 2021) - 45m

video View All

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

Published: Dec 19, 2019

Last modification: May 5, 2026

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00254

rating Rating: 5.0 (1 recent ratings, 17 all time)

version Revision: 17

Single-cell RNA-seq analysis is a rapidly evolving field at the forefront of transcriptomic research, used in high-throughput developmental studies and rare transcript studies to examine cell heterogeneity within a populations of cells. The cellular resolution and genome wide scope make it possible to draw new conclusions that are not otherwise possible with bulk RNA-seq.

In this tutorial, we will investigate clustering of single-cell data from 10x Genomics, including preprocessing, clustering and the identification of cell types via known marker genes, using Scanpy (Wolf et al. 2018). It will be illustrated using a dataset of Peripheral Blood Mononuclear Cells (PBMC), containing 2,700 single cells.

Comment

This tutorial is significantly based on “Clustering 3K PBMCs” tutorial from Scanpy, “Seurat - Guided Clustering Tutorial” and “Orchestrating Single-Cell Analysis with Bioconductor” Amezquita et al. 2019.

Agenda

In this tutorial, we will cover:

Data

Data upload

AnnData

Preprocessing

Quality control

Normalization and scaling

Selection of features

Scaling the data

Dimensionality reduction

Principal Component Analysis

Visualization of PCA

Determination of the number of PCs to keep

Clustering of the cells

Computation of a neighborhood graph

Visualization of the neighborhood graph

Clustering of the neighborhood graph

Finding marker genes

Using the t-test

Using Wilcoxon rank sum test

Visualization of expression of the marker genes

Comparison of the marker genes between clusters

Cell type annotation

Conclusion

Data

For this tutorial, we analyze a dataset of Peripheral Blood Mononuclear Cells (PBMC) extracted from a healthy donor, freely available from 10X Genomics. The dataset contains 2,700 single cells sequenced using Illumina NextSeq 500. The raw sequences have been processed by the cellranger pipeline from 10X to extract a unique molecular identifier (UMI) count matrix, in a similar way to that as explained in “Pre-processing of 10X Single-Cell RNA Datasets” tutorial.

In this matrix, the values represent the number for each feature (i.e. gene; row) that are detected in each cell (column). Such matrices can be quite large, where here there are 2,700 columns with 32,738 lines, with mostly zero values, i.e. an extremely sparse matrix. To optimize the storage of such a table and the information about the genes and cells, cellranger creates 3 files:

genes.tsv: a tabular file with information about the 32,738 genes in 2 columns (Ensembl gene id and the gene symbol)
barcodes.tsv: a tabular file with the barcode for each of the 2,700 cells
matrix.mtx: a condensed version of the count matrix

The count matrix is therefore represented by its non-zero values. Each non-zero value is indicated by its line number (1st column), its column number (2nd column) and its value (3rd column). The first row gives indication about the number of lines, column and non-zero values. More information on the Matrix Market Exchange (mtx) format can be found in this documentation

Data upload

Hands On: Data upload
Create a new history for this tutorial
Import the genes.tsv, barcodes.tsv and matrix.mtx from Zenodo or from the shared data library
https://zenodo.org/record/3581213/files/genes.tsv
https://zenodo.org/record/3581213/files/barcodes.tsv
https://zenodo.org/record/3581213/files/matrix.mtx
Copy the link location

Click galaxy-upload Upload at the top of the activity panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

Inspect the matrix file

Question
32738	2700	2286884
32709	1	4
32707	1	1
32706	1	10
32704	1	1
How many non-zero values are in the matrix?

How many counts are found for the 32,706th gene in the 1st cell?

There are 2,286,884 (2.6%) non-zero values for the 88,392,600 possible counts of the 32,738 genes (rows) and 2,700 cells (columns).

10 counts are found for the 32,706th row and 1st column.

The representation of the matrix with 3 files is convenient to share the data but not to process them. Single-cell analysis packages have attempted to solve the problem of storage and analysis by inventing their own standard, which has led to the proliferation of many different “standards” in the scRNA-seq package ecosystem.

AnnData

The most common format, called AnnData, stores the matrix as well as gene and cell annotations in a concise, compressed and extremely readable manner:

This format is used by Scanpy (Wolf et al. 2018), the tool suite for analyzing single-cell gene expression data that we will use in this tutorial. So we need first to import the matrix and annotations of genes and cells into an AnnData object.

Hands On: Transform matrix and all into AnnData object

Import Anndata ( Galaxy version 0.10.9+galaxy0) with the following parameters:

“Format for the annotated data matrix”: Matrix Market (mtx), from Cell ranger or not

param-file “Matrix”: matrix.mtx

“Use 10x Genomics formatted mtx”: Output from Cell Ranger v2 or earlier versions

param-file “2-column genes file with gene id and gene name”: genes.tsv

param-file “Barcodes (whitelist) file with one barcode per line”: barcodes.tsv

“Variables index”: gene_symbols

“Make the variable index unique by appending ‘-1’, ‘-2’?”: Yes

Rename the generated file to Input 3k PBMC

Check that the format is h5ad

Because the AnnData format is an extension of the HDF5 format, i.e. a binary format, an AnnData object can not be inspected directly in Galaxy by clicking on the galaxy-eye (View data) icon. Instead we need to use a dedicated tool from the AnnData suite.

Hands On: Inspect an AnnData object
Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Input 3k PBMC

“What to inspect?”: General information about the object
Inspect the generated file
Question
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'
How many observations are there? What do they represent?

How many variables are there? What do they represent?

There are 2,700 observations, representing the cells.

There are 32,738 variables, representing the genes.
Comment: Faster Method for General Information
The ScanPy toolset in Galaxy produces AnnData formats as output, where general information is provided for each dataset:
Click on the name of the dataset in the history to expand it.
General Anndata information would be given in the expanded box:

e.g.
[n_obs x n_vars]
-    2700 x 32738
For more specific queries, Inspect AnnData ( Galaxy version 0.10.9+galaxy0) is required.
Warning: Large compute resource consumption! Please run the next step responsibly.

Extracting the complete matrix requires large compute resources and long time to finish. Use this step on matrices with less than 10k cells. To inspect a subset of matrix contents on huge matrices, please change “What to inspect?”: Random chunk of defined size
Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Input 3k PBMC

“What to inspect?”: The full data matrix
Inspect the generated file
Question
	MIR1302-10	FAM138A	OR4F5	RP11-34P13.7	RP11-34P13.8 ...
AAACATACAACCAC-1	0.0	0.0	0.0	0.0
AAACATTGAGCTAC-1	0.0	0.0	0.0	0.0
AAACATTGATCAGC-1	0.0	0.0	0.0	0.0
AAACCGTGCTTCCG-1	0.0	0.0	0.0	0.0
AAACCGTGTATGCG-1	0.0	0.0	0.0	0.0
What is stored in the generated file?

The file is a table with 2,700 lines (observations or cells) and 32,738 columns (variables or genes): the count matrix for each of the 32,738 genes and 2,700 cells. The 1st row contains the gene symbol as annotation of the columns and the 1st column the barcodes of the cells as annotation of the rows.
Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Input 3k PBMC

“What to inspect?”: Key-indexed observations annotation (obs)
Inspect the generated file
Question
""
AAACATACAACCAC-1
AAACATTGAGCTAC-1
AAACATTGATCAGC-1
AAACCGTGCTTCCG-1
AAACCGTGTATGCG-1
What is stored in the generated file?

The file is a tabular with annotations of the observations. i.e. the cells. Here there are only the barcodes as annotation, so only one column, being also the index for the count matrix.
Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Input 3k PBMC

“What to inspect?”: Key-indexed annotation of variables/features (var)
Inspect the generated file
Question
	gene_ids
MIR1302-10	ENSG00000243485
FAM138A	ENSG00000237613
OR4F5	ENSG00000186092
RP11-34P13.7	ENSG00000238009
RP11-34P13.8	ENSG00000239945
What is stored in the generated file?

The file is a tabular with annotations of the variables, i.e. the genes. Here the files has 2 columns: the gene symbols (also the index in the count matrix), gene_ids or the Ensemble gene ids.

Preprocessing

The first step when we get our counts is to prepare them before we perform the clustering. This includes (Amezquita et al. 2019):

Selection and filtration of cells and genes based on quality metrics

Low-quality cells would interfere with downstream analyses. These cells may have been damaged during processing and may not have been fully captured by the sequencing protocol.
Data normalization and scaling

It will eliminate cell-specific biases (e.g., in capture efficiency), allowing us to perform explicit comparisons across cells afterwards. Some transformations, typically log, are also applied to adjust for the mean-variance relationship.
Selection of features

This selection picks a subset of interesting features for downstream analysis, by modelling the variance across cells for each gene and retaining genes that are highly variable. This step is done to reduce computational overhead and noise from uninteresting genes.

Quality control

Genes that appear in less than a few cells can be considered noise and thus removed.

Hands On: Remove genes found in less than 3 cells
Scanpy filter ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Input 3k PBMC

“Method used for filtering”: Filter genes based on number of cells or counts, using 'pp.filter_genes'

“Filter”: Minimum number of cells expressed

“Minimum number of cells expressed required for a gene to pass filtering”: 3

Rename the generated file 3k PBMC
Inspect the dataset.
Question
[n_obs × n_vars]
-    2700 × 13714
[var]
-    gene_ids
-    n_cells
How many genes have been removed because they are expressed in less than 3 expressed cells?

There are now 13,714 genes. So 19,024 (32,738 - 13,714) genes have been removed.

Low-quality cells may be due to a variety of sources such as cell damage during dissociation or failure in library preparation (e.g., inefficient reverse transcription or PCR amplification, Ilicic et al. 2016). Usually these low-quality cells have low total counts, few expressed genes and high mitochondrial or spike-in proportions. Low-quality cells are problematic for downstream analyses and may contribute to misleading results.

Formation of their own distinct cluster(s)

Increased mitochondrial proportions or enrichment for nuclear RNAs after cell damage are the most obvious causes for this phenomenon. Low-quality cells in different cell types can, in the worst case, cluster together (because of similarities in the damage-induced expression profiles) leading to artificial intermediate states or trajectories between subpopulations that should be otherwise distinct. This phenomenon makes the results difficult to interpret.

Distortion of population heterogeneity during variance estimation or principal components analysis

With low-quality cells, the main differences captured by the first few principal components will be based on quality rather than biology, reducing then the effectiveness of dimensionality reduction. Similarly, differences between low- and high-quality cells will create genes with the largest variances. For example, in low-quality cells with very low counts, the scaling normalization increases the variance of genes that happen to have a non-zero count in those cells.

Misidentification of upregulated genes

Due to aggressive scaling to normalize for small cell sizes, low-quality cells shows genes that appear to be strongly “upregulated”. For example, contaminating transcripts may be present in all cells with a low but constant levels. With the increased scaling normalization in low-quality cells, the small counts for these transcripts may become large normalized expression values, i.e. an apparent upregulation compared to other cells.

To mitigate these problems, we need to remove these low-quality cells at the start of the analysis. Several common QC metrics can be used to identify these cells based on their expression profile:

Cell size, i.e. the total sum of counts across all genes for each cells

When cells are very degraded or absent from the library preparation, the number of reads sequenced from that library will be very low. Cells with small sizes are then considered to be of low quality: the RNA has been lost during the library preparation (cell lysis, inefficient cDNA capture and amplification, etc.)
Number of expressed genes, i.e. the number of genes with non-zero counts for each cells

A low number of expressed genes may be a result of poor-quality cells (e.g. dying, degraded, damaged, etc.), followed by high PCR amplification of the remaining RNA. The diverse transcript population has not been successfully captured so the cell is considered of low quality. Cell doublets or multiplets may also exhibit an aberrantly high gene count.
Proportion of reads mapped to genes in the mitochondrial genome

High concentrations of mitochondrial genes is often a result of damaged cells (Islam et al. 2014, Ilicic et al. 2016) where the endogenous RNA escapes or degrades. As mitochondria has its own cell membranes, it is often the last DNA/RNA in damaged cells to degrade and hence occurs in high quantities during sequencing.

We will make the key assumption that these metrics are independent of the biological state of each cell. Technical factors rather than biological processes are presumed to drive poor values (e.g., low cell sizes, high mitochondrial proportions). The subsequent removal of impacted cells do not misrepresent the biology in downstream analyses.

To identify low-quality cells, the simplest approach is to apply thresholds on the QC metrics. This strategy, while simple, is required to determine appropriate thresholds that will change given the experimental protocol and the biological system.

Computation of QC metrics

To estimate the thresholds, we first need to look at the distribution of QC metrics for our dataset.

The first 2 QC metrics (cell size and number of expressed genes) can be easily estimated from the count table. For the third metric (proportion of reads mapped to genes in the mitochondrial genome), we need the information about which genes are mitochondrial or not.

Hands On: Extract gene annotation

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC

“What to inspect?”: Key-indexed annotation of variables/features (var)

Rename the generated file Gene annotation

In the gene annotation, we currently have the gene symbol the Ensembl gene ids and the number of cells in which the genes are expressed, but have no information on whether a gene is mitochondrial. This can be extracted from the gene symbol as all mitochondrial genes have a name starting with MT- and added to the var of the AnnData object using Manipulate AnnData tool.

A new annotation for var should be new column named mito on the 1st line and then True for mitochondrial gene or False for non-mitochondrial genes.

Hands On: Add mitochondrial gene annotation

Manipulate Anndata ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC

“Function to manipulate the object”: Flag genes start with a pattern

Click on “Insert Flag genes that start with these names”

“Text that you expect the genes to be flagged to start with?”: MT-

“Name of the column in var.names where this boolean flag is stored”: mito

Rename the generated file 3k PBMC with mito annotation

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with mito annotation

“What to inspect?”: Key-indexed annotation of variables/features (var)

Inspect the generated file and check if the mitochondrial annotation has been added

Question

How can I annotate genes like ribosomal protein genes or non-protein coding genes that might interfere with my downstream analysis?

What care should be taken while annotating such genes?

To annotate genes such as ribosomal protein genes or non-protein coding genes, follow these steps:

Manipulate Anndata ( Galaxy version 0.10.9+galaxy0) tool as above with

“Function to manipulate the object”: Flag genes start with a pattern

Click on “Insert Flag genes that start with these names” - “Text that you expect the genes to be flagged to start with?”: RP for ribosomal proteins or - “Name of the column in var.names where this boolean flag is stored”: rpgene

Click on “Insert Flag genes that start with these names” - “Text that you expect the genes to be flagged to start with?”: LINC for long non-protein-coding RNAs - “Name of the column in var.names where this boolean flag is stored”: lincrna

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with mito annotation

“What to inspect?”: Key-indexed annotation of variables/features (var)

Inspect your AnnData to confirm the new gene annotations are correctly added.

Here we are annotating genes starting with a pattern. Long Non-Coding RNAs often start with “LINC” followed by a number, while Ribosomal Protein Genes typically start with “RP”. Be aware that some non-ribosomal genes also start with “RP” and are associated with conditions like Retinitis Pigmentosa. If these genes are relevant, annotate ‘RP’ genes carefully. Maybe use more precise naming like RPL and RPS. In general, make sure that your name/pattern does not annotate any genes that you did not want annotate.

We can now compute QC metrics on the AnnData object.

Hands On: Compute QC metrics
Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with mito annotation

“Method used for inspecting”: Calculate quality control metrics, using 'pp.calculate_qc_metrics'

“Name of kind of values in X”: counts

“The kind of thing the variables are”: genes

“Keys for boolean columns of .var which identify variables you could want to control for”: mito

Rename the generated file 3k PBMC with mito annotation and qc metrics
Inspect the dataset
Question
[n_obs × n_vars]
-    2700 × 13714
[obs]
-    n_genes_by_counts
-    log1p_n_genes_by_counts
-    total_counts
-    log1p_total_counts
-    pct_counts_in_top_50_genes
-    pct_counts_in_top_100_genes
-    pct_counts_in_top_200_genes
-    pct_counts_in_top_500_genes
-    total_counts_mito
-    log1p_total_counts_mito
-    pct_counts_mito
[var]
-    gene_ids
-    n_cells
-    mito
-    n_cells_by_counts
-    mean_counts
-    log1p_mean_counts
-    pct_dropout_by_counts
-    total_counts
-    log1p_total_counts
Which QC metrics have been computed?

The tool computed several QC metrics at both cell and gene levels. These metrics are added to the annotation tables.

At gene levels (stored in var):

total_counts: sum of counts for a gene

mean_counts: mean expression for a gene over all cells

n_cells_by_counts: number of cells with non-zero counts for a gene

pct_dropout_by_counts: percentage of cells this gene does not appear in

At cell levels (stored in obs):

total_counts: total number of counts for a cell

total_counts_mito: total number of counts for the mitochondrial genes in a cell

n_genes_by_counts: number of genes with non-zero counts

pct_counts_mito: proportion of total counts for a cell which are from mitochondrial genes

pct_counts_in_top_50_genes: cumulative percentage of counts for 50 most expressed genes in a cell

pct_counts_in_top_100_genes: cumulative percentage of counts for 100 most expressed genes in a cell

pct_counts_in_top_200_genes: cumulative percentage of counts for 200 most expressed genes in a cell

pct_counts_in_top_500_genes: cumulative percentage of counts for 500 most expressed genes in a cell

We would like to visualize 3 of the more informative QC metrics:

the cell size, i.e. total_counts
the number of expressed genes, i.e. n_genes_by_counts
the proportion of reads mapped to mitochondrial genes, i.e. pct_counts_mito

Hands On: Visualize QC metrics

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with mito annotation and qc metrics

“Method used for plotting”: Generic: Violin plot, using 'pl.violin'

“Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields of '.obs'

“Keys for accessing variables”: n_genes_by_counts, total_counts, pct_counts_mito

In “Violin plot attributes”:

“Display keys in multiple panels”: Yes

Inspect the generated file

Question

How do the distributions of the 3 QC metrics look?

For the cell size, i.e. total_counts, most of the values are between 1,000 reads and 4,000 reads, with some extremely high values skewing the distribution.

The numbers of expressed genes, i.e. n_genes_by_counts, are mostly between 500 genes and 1,200 genes, with also some extremely high values skewing the distribution.

The distribution of the proportions of reads mapped to mitochondrial genes, i.e. pct_counts_mito, is even more narrow with some cells having no counts from mitochondrial genes but also having some really extreme values (above 5%).

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with mito annotation and qc metrics

“Method used for plotting”: Generic: Scatter plot along observations or variables axes, using 'pl.scatter'

“Plotting tool that computed coordinates”: Using coordinates

“x coordinate”: total_counts

“y coordinate”: n_genes_by_counts

“Use the layers attribute?”: No

Inspect the generated file

Question

Is there any relationship between the cell size and the number of expressed genes?

On the plot, we can see a strong correlation between the total number of counts for a cell and the number of genes with positive counts.

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with mito annotation and qc metrics

“Method used for plotting”: Generic: Scatter plot along observations or variables axes, using 'pl.scatter'

“Plotting tool that computed coordinates”: Using coordinates

“x coordinate”: n_genes_by_counts

“y coordinate”: pct_counts_mito

“Use the layers attribute?”: No

Inspect the generated file

Question

Is there any relationship between the number of expressed genes and the proportion of reads mapped to mitochondrial genes?

What could be a good threshold to filter for cells with high concentrations of mitochondrial genes?

What could be good thresholds to filter for cells based on the number of expressed genes?

Cells with a high proportion of mitochondrial genes are not also cells that have many expressed genes. There is no visible correlation.

The cells with a percentage of mitochondrial counts above 5% have also few genes. So 5% may be a good threshold.

As discussed before, a low number of expressed genes may be a sign of poor-quality cells. Any cells with less than 200 genes is definitely out of the main distribution so this could be a good threshold to use. High gene count may also be a sign for cell multiplets. We will choose here a threshold of 2,500 genes.

Filtering of low-quality cells

As explained before, we would like now to filter for low-quality cells based on the 3 previous metrics (cell size, number of expressed genes, and proportion of reads mapped to mitochondrial genes). As the cell size is highly correlated with the number expressed genes, we can focus only on the number expressed genes and the proportion of reads mapped to mitochondrial genes.

Based on the previous plot, we would like to remove cells that have:

a number of expressed genes below 200 or above 2,500
a percentage of reads mapped to mitochondrial genes above 5%

Hands On: Remove low-quality cells
Scanpy filter ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with mito annotation and qc metrics

“Method used for filtering”: Filter cell outliers based on counts and numbers of genes expressed, using 'pp.filter_cells'

“Filter”: Minimum number of genes expressed

“Minimum number of genes expressed required for a cell to pass filtering”: 200
Inspect the dataset
Question
[n_obs × n_vars]
-    2700 × 13714
How many cells have been removed because they have less than 200 expressed genes?

There are still 2,700 cells. So no cells have been removed because they have less than 200 expressed genes.
Scanpy filter ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: output of Scanpy filter tool

“Method used for filtering”: Filter cell outliers based on counts and numbers of genes expressed, using 'pp.filter_cells'

“Filter”: Maximum number of genes expressed

“Maximum number of genes expressed required for a cell to pass filtering”: 2500
Inspect the dataset
Question
[n_obs × n_vars]
-    2695 × 13714
How many cells have been removed because they have more than 2,500 expressed genes?

There are now 2,695 cells. So 5 cells have been removed because they have more than 2,500 expressed genes.
Manipulate Anndata ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: output of Scanpy filter tool

“Function to manipulate the object”: Filter observations or variables

“What to filter?”: Observations (obs)

“Type of filtering?”: By key (column) values

“Key to filter”: pct_counts_mito

“Type of value to filter”: Number

“Filter”: less than

“Value”: 5.0

Rename the generated file 3k PBMC after QC filtering
Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC after QC filtering

“What to inspect?”: General information about the object
Question
AnnData object with n_obs × n_vars = 2638 × 13714
How many cells have been removed because they have more than 5% of reads mapped to mitochondrial genes?

There are now 2,638 cells. So 57 cells have been removed because they have more than 5% of reads mapped to mitochondrial genes.

Normalization and scaling

In scRNA-seq, we can observe systematic differences in sequencing coverage between cells (Stegle et al. 2015), because of technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. After removing low-quality cells, normalization of the counts removes these differences to avoid that they interfere with comparisons of the expression profiles between cells. Any observed heterogeneity or differential expression within the cell population after normalization are then driven by biology and not technical biases.

Scaling normalization is the simplest and most commonly-used class of normalization strategies. All counts for each cell are divided by a cell-specific scaling factor, often called a “size factor”. The assumption behind this process is that any technical biases tend to affect genes in a similar manner. The relative bias for each cell is represented by the size factor for that cell. Dividing the cell counts by its size factor should then remove the bias.

The simplest strategy for scaling normalization is the cell size normalization, i.e. similar total sum of counts across all genes for each cell. The “cell size factor” for each cell is computed directly for its cell size and transformed such that the mean size factor across all cells is equal to 1. The normalized expression values are then kept on the same scale as the original counts.

Here we would to normalize our count table such that each cell have 10,000 reads.

Hands On: Normalize for cell size

Scanpy normalize ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC after QC filtering

“Method used for normalization”: Normalize counts per cell, using 'pp.normalize_total'

“Target sum”: 10000.0

“Exclude (very) highly expressed genes for the computation of the normalization factor (size factor) for each cell”: No

“Name of the field in ‘adata.obs’ where the normalization factor is stored”: norm

The normalized counts should be log-transformed afterwards to adjust for the mean-variance relationship.

With log-transformation, the differences in the log-values represent log-fold changes in expression. Log-transformation focuses on promoting contributions from genes with strong relative differences (e.g. a gene that is expressed at an average count of 50 in cell type A and 10 in cell type B rather than a gene that is expressed at an average count of 1100 in A and 1000 in B).

Hands On: Log-transform the counts

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: output of Scanpy normalize tool

“Method used for inspecting”: Logarithmize the data matrix, using 'pp.log1p'

We will freeze the current state of the AnnData object, i.e. the logarithmized raw gene expression, in the a raw attribute. This information will be used later in differential testing and visualizations of gene expression.

Hands On: Freeze the state of the AnnData object

Manipulate Anndata ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: output of Scanpy Inspect and manipulate tool

“Function to manipulate the object”: Freeze the current state into the 'raw' attribute

Rename the generated output 3k PBMC after QC filtering and normalization

Selection of features

Data from scRNA-Seq is often used in exploratory analyses to characterize heterogeneity across cells. With clustering and dimensionality reduction, cells are compared based on their gene expression profiles. The choice of genes to use may have a major impact on the behaviour of the clustering and the dimensionality reduction. We need then to remove genes with random noise while keeping genes containing useful information about the biology of the system. This reduces the data size but still highlights any interesting biological signal without the noise that obscures that structure.

Selecting the most variable genes based on their expression across the cells (i.e. genes highly expressed in some cells, and lowly expressed in others) is the simplest approach for feature selection. With this approach, we assume that increased variation in some genes compared to other genes are genuine biological differences and not technical noise or a baseline level of “uninteresting” biological variation.

To quantify the per-gene variation, the simplest approach consists of computing the variance of the log-normalized expression values for each gene across all cells in the population. With this approach, the feature selection is based on the same log-values as the ones used in clustering, ensuring then that the quantitative definition of heterogeneity is consistent throughout the entire analysis.

Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship, as done in the Seurat procedure (Stuart et al. 2019) we will use.

Once the per-gene variation has been quantified, we need to select the subset of highly variable genes that we will use in downstream analyses. A large subset reduces the risk of discarding any interesting biological signal, but increases the noise from irrelevant genes on the signal. The optimal trade-off is difficult to determine but there are several common strategies routinely used. You can read the Chapter 8 from “Orchestrating Single-Cell Analysis with Bioconductor” for a nice presentation of the different strategies. Here, we will define the set of highly variable genes as those which, after normalization, have a normalized dispersion amount higher than 0.5.

Hands On: Identify the highly variable genes

Scanpy filter ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC after QC filtering and normalization

“Method used for filtering”: Annotate (and filter) highly variable genes, using 'pp.highly_variable_genes'

“Flavor for computing normalized dispersion”: seurat

“Minimal mean cutoff”: 0.0125

“Maximal mean cutoff”: 3

“Minimal normalized dispersion cutoff”: 0.5

“Inplace subset to highly-variable genes?”: No

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: output of the last Scanpy filter tool

“Method used for plotting”: Preprocessing: Plot dispersions versus means for genes, using 'pl.highly_variable_genes'

Highly variable genes.

Both highly variable genes and other genes are still in the AnnData object. We can now actually keep only the highly variable genes.

Hands On: Keep the highly variable genes
Inspect the output of the last Scanpy filter tool
Question
[n_obs × n_vars]
-    2638 × 13714
[obs]
-    n_genes_by_counts
-    log1p_n_genes_by_counts
-    total_counts
-    log1p_total_counts
-    pct_counts_in_top_50_genes
-    pct_counts_in_top_100_genes
-    pct_counts_in_top_200_genes
-    pct_counts_in_top_500_genes
-    total_counts_mito
-    log1p_total_counts_mito
-    pct_counts_mito
-    n_genes
-    norm
[var]
-    gene_ids
-    n_genes
-    mito
-    n_cells_by_counts
-    mean_counts
-    log1p_mean_counts
-    pct_dropout_by_counts
-    total_counts
-    log1p_total_counts
-    highly_variable
-    means
-    dispersions
-    dispersions_norm
How many genes are in the AnnData object?

Where is the stored the information about the genes and if they are highly variable or not?

There are now 13,714 genes, as before.

Extra annotations have been added to var, whose a boolean annotation highly_variable for highly variable genes.
Manipulate Anndata ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: output of the last Filter tool

“Function to manipulate the object”: Filter observations or variables

“What to filter?”: Variables (var)

“Type of filtering?”: By key (column) values

“Key to filter”: highly_variable

“Type of value to filter”: Boolean

“Value to keep”: Yes

Rename the generated output 3k PBMC with only HVG
Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG

“What to inspect?”: General information about the object
Question
AnnData object with n_obs × n_vars = 2638 × 1838
How many genes have been removed?

Only 1,838 (over the 13,714) genes are kept.

Scaling the data

Prior to any downstream analysis like dimensional reduction, we need to apply a linear transformation or scaling to:

Regress out unwanted sources of variation in the total counts per cell and the percentage of mitochondrial genes expressed.
Scale data to unit variance and zero mean, i.e. the variance across cells is 1 and the mean expression is 0, in order to give equal weight in downstream analyses and ensure that highly-expressed genes do not dominate.

Hands On: Scale the data

Scanpy remove confounders ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG

“Method used for plotting”: Regress out unwanted sources of variation, using 'pp.regress_out'

“Keys for observation annotation on which to regress on”: total_counts, pct_counts_mito

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: output of Scanpy remove confounders tool

“Method used for inspecting”: Scale data to unit variance and zero mean, using 'pp.scale'

“Zero center?”: Yes

“Maximum value”: 10.0

It clips values exceeding a standard deviation of 10.

Rename the generated output 3k PBMC with only HVG, after scaling

Dimensionality reduction

We aim in scRNA-seq to compare cells based on their expressions across genes, e.g. to identify similar transcriptomic profiles. Each gene represents then a dimension of the data.

With a dataset of 2 genes, we could make a 2-dimensional plot where each point is a cell and each axis is the expression of one gene. For datasets with thousands of genes, the concept is the same: each cell’s expression profile defines its location in the high-dimensional expression space.

Expressions of different genes are correlated if they are affected by the same biological process. The separate information for these individual genes do not need to be stored, but can instead be compressed into a single dimension, e.g. an “eigengene”. Dimensionality reduction aims then to reduce the number of separate dimensions in the data and then:

reduces the computational work in downstream analyses to only a few dimensions
reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data
enables effective plotting of the data.

Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique consisting in the identification of axes in high-dimensional space that capture the largest amount of variation. This simple, highly effective strategy is widely used in data science.

In PCA, the first axis (or Principal Component (PC)) is chosen such that it captures the greatest variance across cells. The next PC should be orthogonal to the first and capture the greatest remaining amount of variation, and so on.

By applying PCA to scRNA-Seq, we assume that multiple genes are affected by the same biological processes in a coordinated way and random technical or biological noise affects each gene independently. As more variation can be captured by considering the correlated behaviour of many genes, the top PCs are likely to represent the biological signal and the noise are concentrated into the later PCs. The dominant factors of heterogeneity are then captured by the top PCs. Restricting downstream analyses to the top PCs will then reduce the dimensionality of the data whilst focusing on the biological signal and removing the noise.

Here we perform the PCA on the log-normalized expression values and compute the first 50 PCs.

Hands On: Perform the PCA

Scanpy cluster, embed ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling

“Method used for plotting”: Computes PCA (principal component analysis) coordinates, loadings and variance decomposition, using 'pp.pca'

“Number of principal components to compute”: 50

“Type of PCA?”: Full PCA

“Compute standard PCA from covariance matrix?”: Yes

Rename the generated output 3k PBMC with only HVG, after scaling and PCA

PCA information are stored in the AnnData object in a quite complex way.
Hands On: Inspect the PCA inside an `AnnData` object
Inspect the 3k PBMC with only HVG, after scaling and PCA dataset
Question
[n_obs × n_vars]
-    2638 × 1838
[obs]
-    n_genes_by_counts
-    log1p_n_genes_by_counts
-    total_counts
-    log1p_total_counts
-    pct_counts_in_top_50_genes
-    pct_counts_in_top_100_genes
-    pct_counts_in_top_200_genes
-    pct_counts_in_top_500_genes
-    total_counts_mito
-    log1p_total_counts_mito
-    pct_counts_mito
-    n_genes
-    norm
[var]
-    gene_ids
-    mito
-    n_cells_by_counts
-    mean_counts
-    log1p_mean_counts
-    pct_dropout_by_counts
-    total_counts
-    log1p_total_counts
-    highly_variable
-    means
-    dispersions
-    dispersions_norm
[uns]
-    pca
[obsm]
-    X_pca
[varm]
-    PCs
How is the PCA stored in the AnnData object?

3 new objects (uns, obsm, varm) have been added to the AnnData object with information that seem related to PCA.

uns is an unstructured annotation, obsm multi-dimensional annotation of the observations (i.e. genes) and varm multi-dimensional annotation of the variables (i.e. cells).
Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling and PCA

“What to inspect?”: Unstructured annotation (uns)

“What to inspect in uns?”: PCA

Question

What information is stored in uns regarding the PCA?

uns is storing:

the ratio of explained variance by the PCs, sorted by PC, as a one-column table

the explained variance for each PCs, equivalent to the eigenvalues of the covariance matrix, as a one-column table

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling and PCA

“What to inspect?”: Multi-dimensional observations annotation (obsm)

“Which annotation to inspect for the observations?”: PCA coordinates (X_pca)

Question

What are the information stored in obsm regarding the PCA?

obsm is storing the PCA coordinates for the cells as a table with the rows being the cells, the columns the PCs and the values the PCs coordinate for cell and PC.

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling and PCA

“What to inspect?”: Multi-dimensional variables annotation (varm)

“Which annotation to inspect for the variables?”: Principal components containing the loadings

Question

What information is stored in varm regarding the PCA?

varm is storing the PCA coordinates for the genes as a table with the rows being the genes, the columns the PCs and the values the PCs coordinate for gene and PC.

Visualization of PCA

One application of dimensionality reduction and PCA is to compress the data for plotting into 2 (sometimes 3) dimensions with the most salient features of the data.

Scanpy provides several useful ways of visualizing both cells and genes that define the PCA. The simplest approach for visualization is to plot the top 3 PCs on 2D plots (PC1 vs PC2 and PC2 vs PC3)

Hands On: Plot the top 2 PCs the PCA

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling and PCA

“Method used for plotting”: PCA: Plot PCA results, using 'pl.pca_overview'

In “Plot attributes”

In “Component”

Click on param-repeat “Insert Component”

In “1: Component”

“X-axis”: 1

“Y-axis”: 2

Click on param-repeat “Insert Component”

In “1: Component”

“X-axis”: 2

“Y-axis”: 3

“Number of panels per row”: 2

PCA overview.

On these plots we see the different cells projected onto the first 3 PCs. We can already see subpopulations of cells, but only 3 PCs are represented there and plot like these are not so informative. It may be more interesting to project also the values for the genes, since perhaps these are the genes most involved in the 3 PCs.

Hands On: Visualize the top genes associated with PCs

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling and PCA

“Method used for plotting”: PCA: Rank genes according to contributions to PCs, using 'pl.pca_loadings'

“List of comma-separated components”: 1,2,3

Question

What are the top genes for each of the 3 first PCs? What do they represent?

CST3 is the gene the most associated with the 1st PC, NKG7 the one for the 2nd PC, and PF4 and PPBP for the 3rd PC (for consistency with the published scanpy and Seurat documentation, we will use PPBP).

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling and PCA

“Method used for plotting”: PCA: Plot PCA results, using 'pl.pca_overview'

“Keys for annotations of observations/cells or variables/genes”: CST3, NKG7, PPBP

In “Plot attributes”

In “Component”

Click on param-repeat “Insert Component”

In “1: Component”

“X-axis”: 1

“Y-axis”: 2

Click on param-repeat “Insert Component”

In “1: Component”

“X-axis”: 2

“Y-axis”: 3

“Number of panels per row”: 2

PCA scatter for CST3, NKG7 and PPBP.

Question

Where are the differences in expression for CST3, NKG7, and PPBP?

For CST3, the differences are mostly projected on PC1 (expected as CST3 is the top gene for PC1), and not visible on the PC3 vs PC2 plot.

For NKG7, the differences in expression are seen on PC2.

For PPBP, the differences in expression are seen on PC3.

Determination of the number of PCs to keep

We performed the PCA analyses using 50 PCS. They represent a robust compression of the dataset, but we may not need to keep all of them. How many components should we choose to include?

The choice of the number of PCs is a similar question to the choice of the number of highly variable genes to keep. More PCs means more noise but also more biological signal.

A simple heuristic for choosing the number of PCs generates an “Elbow plot”: a ranking of the PCs based on the percentage of variance they explain.

Hands On: Generate an Elbow plot

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling and PCA

“Method used for plotting”: PCA: Scatter plot in PCA coordinates, using 'pl.pca_variance_ratio'

“Use the log of the values?”: Yes

PCA variance ratio.

To determine the elbow point, we assume that each of the PCs should explain much more variance than the remaining PCS. So after the last PCs we choose, the percentage of variance explained should not drop much.

Question

How many PCs should we keep given the previous plot?

We choose here to keep 10 PCs.

We encourage users to repeat downstream analyses with different number of PCs.

Clustering of the cells

In the PC projection, we saw some subpopulations of cells emerging. We would like now to empirically define these subpopulation of cells with similar expression profiles using unsupervised clustering.

Clustering summarizes the data and allows us to describe the population heterogeneity in terms of discrete and easily understandable labels. The subpopulations can be afterwards treated as proxies for biological objects like cell types or states.

Graph-based clustering has been popularized for clustering large scRNA-Seq datasets by its use in Seurat (Butler et al. 2018, Stuart et al. 2019). Such approaches like the K-nearest neighbor (KNN) graph works in 2 steps:

Computation of a neighborhood graph

A graph is first built with each node being a cell connected to its nearest neighbours having similar expression patterns. Edges are weighted based on the similarity between the cells: higher weight is given to cells that are more closely related, i.e. have similar expression profiles.
Clustering of the neighborhood graph

The graph is then partitioned into highly interconnected “quasi-cliques” or “communities”, i.e. cells that are more connected to cells in the same community than they are to cells of different communities. Each community represents a cluster that we can use for downstream interpretation.

Computation of a neighborhood graph

When running graph-based clustering, we need to think about:

How many neighbors are considered when constructing the graph.
What scheme is used to weight the edges.

Here, to reproduce original results, we choose 10 neighbors for a KNN graph, the Euclidian distance metrics and the UMAP method (McInnes et al. 2018) to compute the connectivities. The values will change depending on the data and can not easily predefined: testing different values is the only solution.

Hands On: Compute the neighborhood graph

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling and PCA

“Method used for inspecting”: Compute a neighborhood graph of observations, using 'pp.neighbors'

“The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation”: 10

“Number of PCs to use”: 10

“Use a hard threshold to restrict the number of neighbors to n_neighbors?”: Yes

“Method for computing connectivities”: umap (McInnes et al, 2018)

“Distance metric”: euclidean

Rename the generated output 3k PBMC with only HVG, after scaling, PCA and KNN graph

Inspect the dataset

Question

How is the neighborhood graph stored in the AnnData object?

An extra object neighbors has been added to uns with 2 mtx objects (similar format to the original count table):

Distance between each cells

Weighted adjacency matrix between cells

This information can be accessed using Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA and KNN graph

“What to inspect?”: Unstructured annotation (uns)

“What to inspect in uns?”: Neighbors

Visualization of the neighborhood graph

To visualize and explore the neighborhood graph, we can apply an extra step of non-linear dimensional reduction techniques to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells that will be in the same clusters, i.e. cells with similar local neighborhoods in high-dimensional space, should co-localize on these low-dimensional plots.

Two techniques are commonly used for the non-linear dimensional reduction: t-SNE (t-distributed stochastic neighbor embedding) and UMAP. Scanpy authors recommend to use here UMAP as it better preserves trajectories (check Becht et al. 2019 for a review) and easily accommodates new data.

Here, we will reduce the neighborhood to 2 UMAP components and then we will check to see how the cells are projected on them given the top genes

Hands On: Embed and plot the neighborhood graph

Scanpy cluster, embed ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA and KNN graph

“Method used for plotting”: Embed the neighborhood graph using UMAP, using 'tl.umap'

Rename the generated output 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP

Question

How is the UMAP reduction stored in the AnnData object?

An extra object X_umap has been added to obsm with the 2 UMAP coordinates for each cell, as a table of 2 columns and 2,638 lines.

This information can be accessed using:

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP

“What to inspect?”: Generalinformation about the object

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA and KNN graph, UMAP

“What to inspect?”: Multi-dimensional observations annotation (obsm)

“What to inspect in for the observations?”: UMAP coordinates (X_umap)

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Keys for annotations of observations/cells or variables/genes”: CST3, NKG7, PPBP

UMAP coordinates before clustering.

Question

Are clusters identifiable on these graphs? Might they be linked to PCs?

There seem to be at least 3 big clusters (the 3 blobs) before looking at the expression of some representative genes.

With the plot colored with NKG7 expression, we clearly see that the blob on the bottom right could be divided in 3 sub-clusters (light green, blue, purple). Given the expression of PPBP, the “arm” on the blob on the left should be a different cluster.

The static plots above give us a first overview of the data, but it can be difficult to explore all dimensions at once. Vitessce is an interactive visualization framework for single-cell data that allows you to explore multiple linked views simultaneously — for example, selecting cells in a UMAP and seeing their gene expression highlighted in a heatmap at the same time.

We can generate a Vitessce configuration file directly from Scanpy plot by enabling the “Make an interactive plot?” option.

Hands On: Explore marker genes interactively with Vitessce

Scanpy plot ( Galaxy version 1.11.5+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Make an interactive plot?”: Yes

“Keys for annotations of observations/cells or variables/genes”: CST3, NKG7, PPBP

Rename the vitessce.json output to Vitessce config - marker genes

Click on the galaxy-eye (View data) icon of the Vitessce config - marker genes dataset to explore the marker genes interactively in Vitessce

Open image in new tab

Figure 2: Vitessce showing CST3, NKG7 and PPBP expression in UMAP space.

Question

Explore the marker genes in Vitessce. Can you see differences in expression between cells? How does this compare to the static plot above?

In Vitessce you can hover over individual cells to see their exact expression values, and select groups of cells to highlight them across views. The expression patterns of CST3, NKG7 and PPBP should match the static UMAP above, but now you can interactively explore which cells express each gene and compare them simultaneously.

Clustering of the neighborhood graph

Given the first visualization, we can now cluster the cells within a neighborhood graph.

Which community detection algorithm should we use to define the clusters? Several modularity optimization techniques such as the SLM (Blondel et al. 2008), Louvain algorithm (Levine et al. 2015) or the Leiden algorithm (Traag et al. 2019) are available to iteratively group cells together while optimizing the standard modularity function.

Currently, the Louvain graph-clustering method (community detection based on optimizing modularity) is the one recommended. We need to define a value for the resolution parameter, i.e. the ‘granularity’ of the downstream clustering. High values lead to a greater number of clusters. For single-cell datasets of around 3K cells, we recommend to use a value between 0.4 and 1.2. For larger datasets, the optimal resolution will be higher.

Hands On: Cluster the neighborhood graph

Scanpy cluster, embed ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP

“Method used for plotting”: Cluster cells into subgroups, using 'tl.louvain'

“Flavor for the clustering”: vtraag (much more powerful)

“Resolution”: 0.45

Rename the generated output 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering

Question

How is the clustering information stored in the AnnData object?

An extra column louvain has been added to the obs object with the cluster id for each cell.

This information can be accessed using:

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering

“What to inspect?”: Generalinformation about the object

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering

“What to inspect?”: Key-indexed observations annotation (obs)

The cells in the same clusters should be co-localized in the UMAP coordinate plots.

Hands On: Plot the neighborhood graph and the clusters

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Keys for annotations of observations/cells or variables/genes”: louvain, CST3, NKG7, PPBP

In “Plot attributes”

“Number of panels per row”: 2

UMAP coordinates after clustering.

Question

How many clusters have been identified? Do they fit with the ones quickly identified with UMAP plots?

In which clusters do you expected to have CST3, NKG7 and PPBP as representative?

8 clusters are identified, more or less corresponding to the ones we could see on the UMAP plots.

We expect that:

CST3 should be representative of clusters 1, 3, 4, 6

NKG7 for clusters 0, 3 and 5

PPBP for cluster 7

Finding marker genes

To give sense to the clusters, we need to identify the genes that drive separation between clusters. These marker genes can then be used to assign biological sense (e.g. cell type) to each cluster based on their functional annotation, but also to identify subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behaviour of genes in the affected pathways.

Marker genes are usually detected by their differential expression between clusters, as the more strongly DE genes are more likely to have caused separate clustering of cells. To quantify the differences in expression profiles, several different statistical tests can be used.

The identification of the marker genes for each cluster is made not only on the highly variable genes, but on the whole set (currently stored in the raw attribute).

Using the t-test

The simplest and fastest method is the Welch t-test. It has good statistical properties for large numbers of cells (Soneson and Robinson 2018).

Hands On: Rank the highly differential genes using t-test

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering

“Method used for inspecting”: Rank genes for characterizing groups, using 'tl.rank_genes_groups'

“The key of the observations grouping to consider”: louvain

“Use ‘raw’ attribute of input if present”: Yes

“Comparison”: Compare each group to the union of the rest of the group

“The number of genes that appear in the returned tables”: 100

“Method”: t-test

“P-value correction method”: Benjamini-Hochberg

Rename the generated output 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with t-test

Question

How is the marker gene information stored in the AnnData object?

An extra object rank_genes_groups has been added to the uns object with 5 tables for the 100 top ranked genes (rows) for each cluster (column):

Names of the rank genes

Z-scores for the rank genes

Log2 Fold changes

P-values

Adjusted p-values

This information can be accessed using:

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with t-test

“What to inspect?”: Generalinformation about the object

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with t-test

“What to inspect?”: Unstructured annotation (uns)

“What to inspect in uns?”: Rank gene groups (rank_genes_groups)

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with t-test

“Method used for plotting”: Marker genes: Plot ranking of genes using dotplot plot, using 'pl.rank_genes_groups'

“Number of genes to show”: 20

“Number of panels per row”: 3

“Should the y-axis of each panels be shared?”: No

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with t-test

“What to inspect?”: Unstructured annotation (uns)

“What to inspect in uns?”: Rank gene groups (rank_genes_groups)

Inspect the Names for rank genes file

Rank genes with t-test.

We can quickly look at the 5 top ranked genes per cluster:

0	1	2	3	4	5	6	7
LDHB	LYZ	CD74	CCL5	LST1	NKG7	CD74	GPX1
CD3D	FTL	HLA-DRA	NKG7	AIF1	GZMB	HLA-DRA	PF4
RPS12	TYROBP	HLA-DPB1	B2M	FCER1G	PRF1	CST3	PPBP
RPS27	CST3	CD79A	IL32	FTL	CTSW	HLA-DPA1	CALM3
RPS25	S100A9	HLA-DRB1	CST7	FTH1	GNLY	HLA-DRB1	NRGN

Question

Are CST3, NKG7 and PPBP in the set of marker genes? If yes, are they assigned to the clusters we guessed before?

CST3 is a marker gene for clusters 1, 4, 6 (not 3 as guessed previously)

NKG7 for clusters 3 and 5 (not 0 as guessed previously)

PPBP for cluster 7, as guessed previously

Using Wilcoxon rank sum test

Another widely used method for pairwise comparisons between groups of observations is the Wilcoxon rank sum test (also known as the Wilcoxon-Mann-Whitney test). This test directly assesses separation between the expression distributions of different clusters. It is the recommended approach to use in publication (Soneson and Robinson 2018).

Hands On: Rank the highly differential genes using Wilcoxon rank sum

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering

Note: Please pay attention to the dataset name.

“Method used for inspecting”: Rank genes for characterizing groups, using 'tl.rank_genes_groups'

“The key of the observations grouping to consider”: louvain

“Use ‘raw’ attribute of input if present”: Yes

“Comparison”: Compare each group to the union of the rest of the group

“The number of genes that appear in the returned tables”: 100

“Method”: Wilcoxon-Rank-Sum

“P-value correction method”: Benjamini-Hochberg

Rename the generated output 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

“Method used for plotting”: Marker genes: Plot ranking of genes using dotplot plot, using 'pl.rank_genes_groups'

“Number of genes to show”: 20

“Number of panels per row”: 3

“Should the y-axis of each panels be shared?”: No

Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

“What to inspect?”: Unstructured annotation (uns)

“What to inspect in uns?”: Rank gene groups (rank_genes_groups)

Rename the Names for rank genes file to Ranked genes with Wilcoxon test

Inspect Ranked genes with Wilcoxon test file

Rank genes with Wilcoxon.

We can quickly look at the 5 top ranked genes per cluster:

0	1	2	3	4	5	6	7
LDHB	LYZ	CD74	CCL5	LST1	NKG7	HLA-DPA1	PF4
RPS12	S100A9	CD79A	NKG7	FCER1G	GZMB	HLA-DPB1	GNG11
RPS25	S100A8	HLA-DRA	CST7	AIF1	PRF1	HLA-DRB1	SDPR
RPS27	TYROBP	CD79B	CTSW	COTL1	GNLY	HLA-DRA	PPBP
RPS6	FTL	HLA-DPB1	B2M	FCGR3A	CTSW	CD74	NRGN
RPS3	CST3	HLA-DQA1	GZMA	IFITM2	GZMA	CST3	SPARC
CD3D	FCN1	MS4A1	HLA-C	FTH1	CST7	HLA-DQA1	GPX1

Question

Are the 5 top ranked genes different than the one for the t-test?

Are CST3, NKG7 and PPBP in the marker genes? If yes, are they assigned to the clusters we guessed before?

The 5 top ranked genes are slightly different, at least in their order.

We see that:

CST3 is a ranked genes for clusters 1, 4, 6 (not 3 as guessed previously)

NKG7 for clusters 3 and 5 (not 0 as guessed previously)

PPBP for cluster 7, as we guessed previously.

Hands On: Compare differential expression for CST3, NKG7 and PPBP in the different clusters

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

“Method used for plotting”: Generic: Violin plot, using 'pl.violin'

“Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields in '.obs'

“Keys for accessing variables”: CST3, NKG7, PPBP

“The key of the observation grouping to consider”: louvain

“Use ‘raw’ attribute of input if present”: Yes

Violin plot for CST3, NKG7 and PPBP after clustering.

Question

Are CST3, NKG7 and PPBP more expressed in the clusters for which they are marker genes?

CST3 is more expressed in clusters 1, 4, 6, the ones for which it was previously found as a marker gene, but also in cluster 7, which is unexpected.

NKG7 is more expressed in clusters 3 and 5, the ones for which it was previously found as a marker gene.

PPBP is very pronounced in cluster 7, for which it was previously found as a marker gene.

Visualization of expression of the marker genes

The marker genes should be more expressed in the clusters for which they are markers. We would like to confirm this idea using some visualization.

Expression of the top marker genes in each cluster

The assumption should be even more true for the top marker genes. The first way to visualize the expression of the top marker genes is to look at the distribution of the expression of each marker gene in cells for each cluster.

Hands On: Plot expression probability distributions across clusters of top marker genes

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

“Method used for plotting”: Generic: Stacked violin plot, using 'pl.stacked_violin'

“Variables to plot (columns of the heatmaps)”: Subset of variables in 'adata.var_names'

“List of variables to plot”: LDHB, LYZ, CD74, CCL5, LST1, NKG7, HLA-DPA1, PF4

“The key of the observation grouping to consider”: louvain

“Number of categories”: 8

“Use ‘raw’ attribute of input if present”: Yes

“Custom figure size: Yes

“Swap axes?”: Yes

In “Violin plot attributes”:

“Add a stripplot on top of the violin plot”: No

“Colors to use in each of the stacked violin plots”: Accent

Violin plot for top marker genes.

Question

Are the top marker genes only expressed in the cluster for which they are markers?

LDHB, LYZ and CD74, even if they are top markers genes for the cluster 0, 1, 2 respectively, are also expressed in all other clusters (and also found in the top 100 marker genes for other clusters), but with higher level in the cluster for they are markers.

CCL5, LST1, NKG7 and HLA-DPA1 are not expressed in all clusters but also not only in the one they are markers.

PF4 is only expressed in cluster 7.

Another approach consists of displaying the mean expression of the marker genes for the cells on the neighborhood graph.

Hands On: Plot top marker gene expression on an UMAP plot

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Keys for annotations of observations/cells or variables/genes”: louvain, LDHB, LYZ, CD74, CCL5, LST1, NKG7, HLA-DPA1, PF4

“Use ‘raw’ attribute of input if present”: Yes

In “Plot attributes”

“Number of panels per row”: 2

Top marker gene expression on a UMAP plot.

Question

Are the top marker genes clearly associated to their clusters?

For most genes, we can clearly see for which cluster they are marker genes. But for physically close clusters, the differences are not so obvious.

Expression of the top 20 markers genes in cells for each cluster

We would like now to have a look at the expression of the top 20 marker genes in the different cells for each cluster

Hands On: Plot heatmap of the gene expression in cells

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

“Method used for plotting”: Marker genes: Plot ranking of genes as heatmap plot, using 'pl.rank_genes_groups_heatmap'

“Number of genes to show”: 20

“Use ‘raw’ attribute of input if present”: Yes

“Compute and plot a dendrogram?”: Yes

Heatmap of the gene expression in cells.

Each cells is shown in a row and in columns are the marker genes for each cluster.

Question

How are clusters close to each other in terms of expression of the top 20 marker genes?

Clusters 0, 3, and 5 are similar in term of expression. This was expected as they are physically close on the neighborhood graph.

Clusters 1 and 4 are together and then 2 and 6 are together after 7. These observations are less expected given the neighborhood graph: 1 and 4 are physically close, but 2 is far from 7 and 6.

Comparison of the marker genes between clusters

Previously we identified marker genes by taking gene expression in cells for each cluster individually and comparing them to all remaining cells.

In some cases, it may also be interesting to find marker genes distinguishing one given cluster from one or several clusters. For example, the marker genes distinguishing cluster 0 from cluster 1.

Hands On: Identify the marker genes distinguishing cluster 0 from cluster 1 using Wilcoxon rank sum

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

“Method used for inspecting”: Rank genes for characterizing groups, using 'tl.rank_genes_groups'

“The key of the observations grouping to consider”: louvain

“Use ‘raw’ attribute of input if present”: Yes

“Subset of groups to which comparison shall be restricted”: 0

“Comparison”: Compare with respect to a specific group

“Group identifier with respect to which compare”: 1

“The number of genes that appear in the returned tables”: 100

“Method”: Wilcoxon-Rank-Sum

“P-value correction method”: Benjamini-Hochberg

Rename the generated output 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes for 0 vs 1 with Wilcoxon test

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes for 0 vs 1 with Wilcoxon test

“Method used for plotting”: Marker genes: Plot ranking of genes using dotplot plot, using 'pl.rank_genes_groups'

“Number of genes to show”: 20

“Should the y-axis of each panels be shared?”: No

Rank genes 0 vs 1 with Wilcoxon.

Question

How can we interpret this plot?

In this graph are the marker genes distinguishing cluster 0 from cluster 1, ranked based on their difference of expression of the genes between cells in both clusters. So MALAT1 is the most differentially expressed gene between cells in cluster 0 and cells in cluster 1, even though it was not in the set of top marker genes for cluster 0.

The marker genes distinguishing cluster 0 from cluster 1 are extracted based on their differences in expression, which can be easily visualized.

Hands On: Plot expression difference for the marker genes distinguishing cluster 0 from cluster 1

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes for 0 vs 1 with Wilcoxon test

“Method used for plotting”: Marker genes: Plot ranking of genes as violin plot, using 'pl.rank_genes_groups_violin'

“Which genes to plot?”: A number of genes

“Number of genes to show”: 10

“Use ‘raw’ attribute of input if present”: Yes

Violin plot for marker genes for clusters 0 vs 1 with Wilcoxon.

Previous visualizations like the heatmap can also be used to represent the differential expression of the marker genes between both clusters.

In the next steps, we are mostly interested in the marker genes for each cluster individually by comparison one cluster to the rest, instead of a 1-to-1 comparison. So we will use again the AnnData object called 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test.

Cell type annotation

Obtaining clusters of cells is quite straightforward. Determining what biological state is represented by each of those clusters is likely the most challenging task in scRNA-Seq data analysis. To do so, we need to bridge the gap between our current dataset and prior biological knowledge.

Hands-on: Choose Your Own Tutorial

This is a 'Choose Your Own Tutorial' (CYOT) section (also known as 'Choose Your Own Analysis' (CYOA)), where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

There are two approaches for cell type annotation. Choose the one that suits you best!

Manual CellTypist

This biological knowledge is not always available in a consistent and quantitative manner. For example, the concept of “cell type” is not clearly defined. The interpretation of scRNA-seq data is often then quite manual.

Fortunately in the case of our dataset, we can use canonical markers to known cell types:

Cell type	Marker genes
CD4+ T cells	IL7R, CCR7
CD8+ T cells	CD8A
CD14+ Monocytes	CD14, LYZ
B cells	MS4A1, CD79A
Natural killer (NK) cells	GNLY, NKG7, KLRB1
Dendritic Cells	FCER1A, CST3
Megakaryocytes	PPBP
FCGR3A+ Monocytes	FCGR3A, MS4A7

Canonical markers are usually found in the literature and are also aggregated into dedicated database like the PanglaoDB Franzén et al. 2019

Question

Is it possible to match the clusters to cell types using the previous table?

Hint: search for the marker genes in the table in the Ranked genes with Wilcoxon test dataset

By searching the marker genes in the the Ranked genes with Wilcoxon test dataset:

Marker genes Cell type Cluster

IL7R CD4+ T cells 0

CCR7 CD4+ T cells 0

CD8A CD8+ T cells 3

CD14 CD14+ Monocytes 1

LYZ CD14+ Monocytes 1 (6)

MS4A1 B cells 2

CD79A B cells 2

GNLY Natural killer (NK) cells 5

NKG7 Natural killer (NK) cells 5 (3)

KLRB1 Natural killer (NK) cells -

FCER1A Dendritic cells 6

CST3 Dendritic cells 6 (4)

PPBP Megakaryocytes 7

FCGR3A FCGR3A+ Monocytes 4 (5)

We can then aggregate the results by clusters

Cell type Marker genes Cluster

CD4+ T cells IL7R, CCR7 0

CD8+ T cells CD8A 3

CD14+ Monocytes CD14, LYZ 1

B cells MS4A1, CD79A 2

Natural killer (NK) cells GNLY, NKG7, KLRB1 5

Dendritic Cells FCER1A, CST3 6

Megakaryocytes PPBP 7

FCGR3A+ Monocytes FCGR3A, MS4A7 4

Marker genes	Cell type	Cluster
IL7R	CD4+ T cells	0
CCR7	CD4+ T cells	0
CD8A	CD8+ T cells	3
CD14	CD14+ Monocytes	1
LYZ	CD14+ Monocytes	1 (6)
MS4A1	B cells	2
CD79A	B cells	2
GNLY	Natural killer (NK) cells	5
NKG7	Natural killer (NK) cells	5 (3)
KLRB1	Natural killer (NK) cells	-
FCER1A	Dendritic cells	6
CST3	Dendritic cells	6 (4)
PPBP	Megakaryocytes	7
FCGR3A	FCGR3A+ Monocytes	4 (5)

Cell type	Marker genes	Cluster
CD4+ T cells	IL7R, CCR7	0
CD8+ T cells	CD8A	3
CD14+ Monocytes	CD14, LYZ	1
B cells	MS4A1, CD79A	2
Natural killer (NK) cells	GNLY, NKG7, KLRB1	5
Dendritic Cells	FCER1A, CST3	6
Megakaryocytes	PPBP	7
FCGR3A+ Monocytes	FCGR3A, MS4A7	4

These canonical marker genes can be easily match the clusters to known cell types:

Cluster	Cell type
0	CD4+ T cells
1	CD14+ Monocytes
2	B cells
3	CD8+ T cells
4	FCGR3A+ Monocytes
5	Natural killer (NK) cells
6	Dendritic Cells
7	Megakaryocytes

Hands On: Add the cell type as cluster names

Manipulate Anndata ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test

Note: Take note that this is not the “0 vs 1 Wilcoxon” dataset

“Function to manipulate the object”: Rename categories of annotation

“Key for observations or variables annotation”: louvain

“Comma-separated list of new categories”: CD4+ T, CD14+, B, CD8+ T, FCGR3A+, NK, Dendritic, Megakaryocytes

Rename the generated output 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test, annotation

Question

How can we check that the cell type has been correctly added?

The cluster information is available in the obs attribute of the AnnData object. So to check, we should run Inspect AnnData ( Galaxy version 0.10.9+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA and KNN graph

“What to inspect?”: Key-indexed observations annotation (obs)

In the last column (called louvain), we should now have cell type.

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test, annotation

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Keys for annotations of observations/cells or variables/genes”: louvain

“Use ‘raw’ attribute of input if present”: Yes

In “Plot attributes”

“Location of legend”: on data

“Draw a frame around the scatter plot?”: No

UMAP plot with annotated clusters.

Question

How well are the cell types clustered in the neighborhood graph?

T cells (CD4+ and CD8+) are clustered together with NK cells. Monocytes cells (CD14+ and FCGR3A+) are close to each others, with Dendritic and Megakaryocytes. B cells are physically independent.

With the annotated cell types, we can also visualize the expression of their canonical marker genes.

Hands On: Plot expression of canonical marker genes for the annotated cell types

Scanpy plot ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test, annotation

“Method used for plotting”: Generic: Makes a dot plot of the expression values, using 'pl.dotplot'

“Variables to plot (columns of the heatmaps)”: Subset of variables in 'adata.var_names'

“List of variables to plot”: IL7R, CCR7, CD8A, CD14, LYZ, MS4A1, CD79A, GNLY, NKG7, KLRB1, FCER1A, CST3, PPBP, FCGR3A

“The key of the observation grouping to consider”: louvain

“Number of categories”: 8

“Use ‘raw’ attribute of input if present”: Yes

“Compute and plot a dendrogram?”: Yes

In “Group of variables to highlight”

Click on param-repeat “Group of variables to highlight”

In “1: Group of variables to highlight”

“Start”: 0

“End”: 1

“Label”: CD4+ T

Click on param-repeat “Group of variables to highlight”

In “2: Group of variables to highlight”

“Start”: 2

“End”: 2

“Label”: CD8+

Click on param-repeat “Group of variables to highlight”

In “3: Group of variables to highlight”

“Start”: 3

“End”: 4

“Label”: CD14+

Click on param-repeat “Group of variables to highlight”

In “4: Group of variables to highlight”

“Start”: 5

“End”: 6

“Label”: B

Click on param-repeat “Group of variables to highlight”

In “5: Group of variables to highlight”

“Start”: 7

“End”: 9

“Label”: NK

Click on param-repeat “Group of variables to highlight”

In “6: Group of variables to highlight”

“Start”: 10

“End”: 11

“Label”: Dendritic

Click on param-repeat “Group of variables to highlight”

In “7: Group of variables to highlight”

“Start”: 12

“End”: 12

“Label”: Megakaryocytes

Click on param-repeat “Group of variables to highlight”

In “8: Group of variables to highlight”

“Start”: 13

“End”: 13

“Label”: FCGR3A+

“Color palette”: viridis (Perceptually Uniform Sequential)

Dotplot plot with annotated clusters.

Question

Are the canonical marker genes only expressed in their cell type?

Are the previous physical cluster (UMAP graph) confirmed?

Some canonical marker genes like LYZ or CST3 are not only highly expressed in their cell type but also on closely related other cell types.

As seen on the UMAP graph, T cells (CD4+ and CD8+) are clustered together with NK cells, but NK cells and CD8+ closer to each other. Monocytes cells (CD14+ and FCGR3A+) are close to each others, but B cells are close to Dendritic and then Megakaryocytes despite B cells being physically independent on the neighborhood graph.

The automated approach uses CellTypist, a tool that applies pre-trained logistic classifiers to predict cell identities directly from the normalized expression data, without requiring prior knowledge of canonical marker genes.

Comment

CellTypist requires a log1p-normalized expression matrix (normalized to 10,000 counts per cell), which is already stored in the raw attribute of our AnnData object from the preprocessing steps above.

Hands On: Automated cell type annotation with CellTypist

CellTypist ( Galaxy version 1.7.1+galaxy0) with the following parameters:

param-file “Input AnnData file”: 3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test, annotation

“Select model from”: Cached

“Choose CellTypist model”: immune sub-populations combined from 20 tissues of 18 studies (v2)

“Refine the predicted labels by running the majority voting classifier after over-clustering”: Yes

“Annotation mode”: Choose the cell type with the largest score/probability as the final prediction

“Probability threshold”: 0.5

“Generate a dotplot of the predicted cell types”: Yes

“Reference column in AnnData.obs for dotplot”: louvain

“Prediction label in AnnData.obs for dotplot”: predicted_labels

“Dotplot format”: png

Rename the generated output 3k PBMC CellTypist annotated

Inspect the dotplot output

Open image in new tab

Figure 3: CellTypist label transfer dotplot using the cached immune model, showing predicted cell type labels against the Louvain clusters.

If you are working with a dataset from a tissue or organism not covered by the available CellTypist models, you can train a custom model using your own annotated AnnData. To do so, select “Select model from”: History → “Train a model on an existing AnnData and use it”, and provide your AnnData file along with the column name in .obs that contains your cell type labels (e.g. louvain). This is particularly useful when studying tissues with unique or poorly characterized cell populations.

Hands On: Explore CellTypist annotations interactively with Vitessce

Scanpy plot ( Galaxy version 1.11.5+galaxy0) with the following parameters:

param-file “Annotated data matrix”: 3k PBMC CellTypist annotated

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Make an interactive plot?”: Yes

“Keys for annotations of observations/cells or variables/genes”: louvain

Rename the vitessce.json output to Vitessce config - CellTypist

Click on the galaxy-eye (View data) icon of the Vitessce config - CellTypist dataset to explore the annotations interactively

Open image in new tab

Figure 4: Vitessce showing the UMAP with CellTypist-annotated cell types and Cell Sets panel.

Question

Compare this Vitessce view with the one generated before cell type annotation. What has changed?

The Cell Sets panel now shows the annotated cell type names (B, CD14+, CD4+ T, CD8+ T, Dendritic, FCGR3A+, Megakaryocytes, NK) with their cell counts, instead of the numbered Louvain clusters. This allows you to interactively explore the biological identity of each cell population.

Conclusion

congratulations Well done, you’ve made it to the end! In this tutorial, we investigated clustering and annotation of single-cell data from 10x Genomics using Scanpy. This workflow used here was typical for scRNA-seq data analysis:

Preprocessing with
1. Selection and filtration of cells and genes based on quality metrics
2. Data normalization and scaling
3. Selection of features, i.e. marker genes
Reduction of the dimensionality via a Principal Component Analysis
Clustering of the cells by
1. Computation of a neighborhood graph
2. Clustering of the neighborhood graph into 8 clusters of cells
Identification of marker genes for the clusters
Annotation of the clusters with cell types

This tutorial is part of the https://singlecell.usegalaxy.eu portal (Tekman et al. 2020).

You've Finished the Tutorial

Key points

scRNA-seq data analysis is complex and exploratory process, still in development

Different tools and parameters should be tested for each step of the process

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Blondel, V. D., J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, 2008 Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment 2008: P10008. https://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008/meta
Islam, S., A. Zeisel, S. Joost, G. La Manno, P. Zajac et al., 2014 Quantitative single-cell RNA-seq with unique molecular identifiers. Nature methods 11: 163. https://www.nature.com/nmeth/journal/v11/n2/abs/nmeth.2772.html
Levine, J. H., E. F. Simonds, S. C. Bendall, K. L. Davis, D. A. El-ad et al., 2015 Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162: 184–197. https://www.sciencedirect.com/science/article/pii/S0092867415006376
Stegle, O., S. A. Teichmann, and J. C. Marioni, 2015 Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics 16: 133–145. https://www.nature.com/articles/nrg3833
Ilicic, T., J. K. Kim, A. A. Kolodziejczyk, F. O. Bagger, D. J. McCarthy et al., 2016 Classification of low quality cells from single-cell RNA-seq data. Genome biology 17: 29.
Butler, A., P. Hoffman, P. Smibert, E. Papalexi, and R. Satija, 2018 Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology 36: 411. https://www.nature.com/articles/nbt.4096?draft=journal
McInnes, L., J. Healy, and J. Melville, 2018 Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426
Soneson, C., and M. D. Robinson, 2018 Bias, robustness and scalability in single-cell differential expression analysis. Nature methods 15: 255. https://www.nature.com/nmeth/journal/v15/n4/abs/nmeth.4612.html
Wolf, F. A., P. Angerer, and F. J. Theis, 2018 SCANPY: large-scale single-cell gene expression data analysis. Genome biology 19: 15. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0
Amezquita, R. A., V. J. Carey, L. N. Carpp, L. Geistlinger, A. T. L. Lun et al., 2019 Orchestrating Single-Cell Analysis with Bioconductor. bioRxiv 590562. https://www.biorxiv.org/content/10.1101/590562v1
Becht, E., L. McInnes, J. Healy, C.-A. Dutertre, I. W. H. Kwok et al., 2019 Dimensionality reduction for visualizing single-cell data using UMAP. Nature biotechnology 37: 38. https://www.nature.com/articles/nbt.4314
Franzén, O., L.-M. Gan, and J. L. M. Björkegren, 2019 PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019: https://academic.oup.com/database/article-abstract/doi/10.1093/database/baz046/5427041
Stuart, T., A. Butler, P. Hoffman, C. Hafemeister, E. Papalexi et al., 2019 Comprehensive Integration of Single-Cell Data. Cell. https://www.sciencedirect.com/science/article/pii/S0092867419305598
Traag, V. A., L. Waltman, and N. J. van Eck, 2019 From Louvain to Leiden: guaranteeing well-connected communities. Scientific reports 9: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6435756/
Tekman, M., B. Batut, A. Ostrovsky, C. Antoniewski, D. Clements et al., 2020 A single-cell RNA-sequencing training and analysis suite using the Galaxy framework. GigaScience 9: giaa102. 10.1093/gigascience/giaa102 https://academic.oup.com/gigascience/article/9/10/giaa102/5931798

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Bérénice Batut, Hans-Rudolf Hotz, Mehmet Tekman, Pavankumar Videm, Diana Chiang Jurado, Clustering 3K PBMCs with Scanpy (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-scanpy-pbmc3k/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{single-cell-scrna-scanpy-pbmc3k,
author = "Bérénice Batut and Hans-Rudolf Hotz and Mehmet Tekman and Pavankumar Videm and Diana Chiang Jurado",
	title = "Clustering 3K PBMCs with Scanpy (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-scanpy-pbmc3k/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

Do you want to extend your knowledge?
Follow one of our recommended follow-up trainings:

tutorial Hands-on: Importing files from public atlases

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/single-cell/tutorials/scrna-scanpy-pbmc3k/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: anndata_import
  owner: iuc
  revisions: 6f143dfe902a
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_inspect
  owner: iuc
  revisions: 14c576ea6148
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_manipulate
  owner: iuc
  revisions: c4209ea387d4
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: celltypist
  owner: iuc
  revisions: 7518638a7b75
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: datamash_ops
  owner: iuc
  revisions: 4c07ddedc198
  tool_panel_section_label: Join, Subtract and Group
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_cluster_reduce_dimension
  owner: iuc
  revisions: 38db3fef7aad
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_filter
  owner: iuc
  revisions: 8cb783c81905
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_inspect
  owner: iuc
  revisions: d86cb2ce311b
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_normalize
  owner: iuc
  revisions: 56c0d4acef02
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_plot
  owner: iuc
  revisions: '0185bd174ea0'
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_plot
  owner: iuc
  revisions: 7f478b3258ec
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_remove_confounders
  owner: iuc
  revisions: 9fb627495ec2
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

5 stars 9

4 stars 2

3 stars 1

1 stars 1

0 stars 4

August 2025

5 stars: Liked: Complete tutorial which is beginner friendly Disliked: Some tools not working at moments

October 2024

3 stars: Liked: Following the instructions. Disliked: There should more reminder on the type of version to be use.

July 2024

4 stars: Liked: The step-by-step guide and questions

May 2023

5 stars: Liked: The module scRNA analysis is really complete and very well explained !!! Disliked: - In this step: Plot expression difference for the marker genes distinguishing cluster 0 from cluster 1 - There is no the suggested option (even using the same version from Galaxy Training Materials ): “Method used for plotting”: Marker genes: Plot ranking of genes as violin plot, using 'pl.rank_genes_groups_violin' - The option available is: pl.rank_genes_groups_stacked_violin - So I don't know if there is another way to plot.

July 2021

1 stars: Disliked: There is no code. For instance, how do you search your clusters for the marker genes?

April 2021

0 stars: Liked: clearly described, can be followed step by step, super tutorial for new users.

March 2021

0 stars: Liked: clearly described, can be followed step by step, idea for new-users.

0 stars: Disliked: I wanted to follow "Hands-on: Remove genes found in less than 3 cells" with tool "filter with scanpy", had given the every parameter correctly, but could not be successful every time.

February 2021

5 stars: Liked: Easy to follow tutorial. I learnt about so many tools used for data analysis.

July 2020

0 stars: Disliked: Theres a copy-paste error in Cell type annotations in the first hands-on in the solution to the question. The "Annotated data matrix" should be another file.

4 stars: Disliked: The workflow for download has 4 imports (barcodes.tsv, genes.tsv, matrix.mtx and Pasted Entry) but ONLY Pasted Entry connects to the remainder of the workflow. This doesn't seem to match up to Import Data and the remainder of the workflow. Or, there is no obvious place when one would execute the workflow.

5 stars: Liked: Independent self-paced learning. Assistance through the gitter chat (Galaxy Training Network/galaxy-single-cell) was appreciated twice. Disliked: Two steps that I could not make work because of an incompatible newer version of a "plot with scanpy tool" and one step with a collection output (from Method used for plotting”: Marker genes: Plot ranking of genes as violin plot, using 'pl.rank_genes_groups_violin') that was unclear .