Analysis of plant scRNA-Seq Data with Scanpy

Overview

question Questions
  • Can we reclaim cell markers using a different analysis method?

  • Are highly variable genes paramount to the analysis?

objectives Objectives
  • Perform filtering, dimensionality reduction, and clustering

  • Generate a DotPlot emulating the original paper using a different analysis tool

  • Determine robust clusters across scRNA-seq pipelines

requirements Requirements
time Time estimation: 2 hours
Supporting Materials
last_modification Last modification: May 27, 2021

Single cell RNA-seq analysis is a cornerstone of developmental research and provides a great level of detail in understanding the underlying dynamic processes within tissues. In the context of plants, this highlights some of the key differentiation pathways that root cells undergo.

The various cell subpopulations that are expected within this tutorial

This tutorial replicates the paper “Spatiotemporal Developmental Trajectories in the Arabidopsis Root Revealed Using High-Throughput Single-Cell RNA Sequencing” (Denyer et al. 2019), where the major plant cell types are recovered in the data as well as distinguishing between QC and meristematic cells. The original paper used the Seurat analysis suite (Satija et al. 2015), but here we will use the ScanPy analysis suite (Wolf et al. 2018) integrated within the single-cell resources in Galaxy (Tekman et al. 2020).

Agenda

In this tutorial, we will cover:

  1. Data
    1. Data upload
    2. CSV to AnnData
    3. Merge Batches and Relabel
  2. Quality Control
    1. Filtering the Matrix
    2. Confounder Removal
    3. Dimensionality Reduction
  3. Finding Cell Types
    1. Clustering
    2. DotPlot and Validating Cell Types
    3. Relabel clusters
  4. Conclusions

comment Comment

Please familiarise yourself with the “Clustering 3K PBMCs with ScanPy” tutorial first, as much of the process is the same, and the accompanying slide deck better explains some of the methods and concepts better.

Data

The Arabidopsis root cells come from two biological replicates which were isolated and profiles using droplet-based scRNA-seq (please see: “Pre-processing of 10X Single-Cell RNA Datasets”); a “short root” mutant (labelled: shr) and a wild-type (labelled: wt):

  • GSE123818_Root_single_cell_shr_datamatrix.fixednames.transposed.csv.gz
  • GSE123818_Root_single_cell_wt_datamatrix.fixednames.transposed.csv.gz

As explained in the Zenodo link, the datasets have been modified to use more common gene names (hence the “fixednames” suffix) and the datasets have been transposed to better suit the AnnData format (hence the “transposed” suffix). The datasets are CSV tabular files which have been compressed using gunzip. We will now import these for analysis.

Data upload

hands_on Hands-on: Data upload

  1. Create a new history for this tutorial
  2. Import the two datasets from Zenodo or from the shared data library

    https://zenodo.org/record/4597857/files/GSE123818_Root_single_cell_shr_datamatrix.fixednames.transposed.csv.gz
    https://zenodo.org/record/4597857/files/GSE123818_Root_single_cell_wt_datamatrix.fixednames.transposed.csv.gz
    
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    • By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

    Tip: Importing data from a data library

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries

    • Find the correct folder (ask your instructor)

    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import
  3. Apply the tags #shr and #wt to each of the datasets, with respect to the initial labels. These tags will be inherited for each task that the dataset undergoes.

    Tip: Adding a tag

    • Click on the dataset
    • Click on galaxy-tags Edit dataset tags
    • Add a tag starting with #

      Tags starting with # will be automatically propagated to the outputs of tools using this dataset.

    • Check that the tag is appearing below the dataset name

question Questions

We can peek at the #shr file, and determine the dimensionality and naming scheme of the data. The rows and the columns depict different variables.

  1. Which are the cell barcodes? Rows or Columns?
  2. Which are the gene names? Rows or Columns?
  3. How many cells in the dataset?
  4. How many genes in the dataset?

solution Solution

  1. Rows, notice the long A-C-T-G based names.
  2. Columns, “ATXG” are the gene prefixes commonly used by the Arabidopsis genomes.
  3. There are 1 099 lines and therefore 1 099 cells
  4. If we scroll the preview window all the way to the right we see 27 630 as the last number. Given that the first column are the cell names, we must subtract by 1 to get 27 629 cells.

If the above feels like a convoluted way to get the dimensionality, that’s because we haven’t imported the data into the right format. For this we need to import both datasets into a single AnnData object (see the AnnData section in the previous tutorial).

CSV to AnnData

hands_on Hands-on: Converting the Data

  1. Import Anndata and loom Tool: toolshed.g2.bx.psu.edu/repos/iuc/anndata_import/anndata_import/0.7.5+galaxy0 with the following parameters:
    • “hd5 format to be created”: Anndata file
      • “Format for the annotated data matrix”: Tabular, CSV, TSV
        • param-file “Annotated data matrix”: Multi-select both SHR and WT datasets

    comment Checking Dimensionality

    We can now inspect the dimensionality of the dataset by “peeking” at the dataset in the history and observing the general information, simply by clicking on the name of the dataset.

    The #shr dataset should show:

       [n_obs x n_vars]
           1099 x 27629
    

    and the #wt should show:

       [n_obs x n_vars]
           4727 x 27629
    

    where obs refers to our observed cells, and vars references the genes/variables.

Currently we have two seperate datasets, but we can merge them into one single AnnData object seperated by batch identifiers “shr” and “wt”. To do this, we simply manipulate one of the datasets and concatenate the other onto it.

Merge Batches and Relabel

hands_on Hands-on: Merging Data

  1. Manipulate AnnData Tool: toolshed.g2.bx.psu.edu/repos/iuc/anndata_manipulate/anndata_manipulate/0.7.5+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: SHR dataset (output of Import Anndata and loom tool)
    • “Function to manipulate the object”: Concatenate along the observations axis
      • param-file “Annotated data matrix to add”: WT dataset (output of Import Anndata and loom tool)

    tip Tip: Scheduling

    You can queue on the next task without the previous one being complete yet, just ensure that the input dataset for the next task is at least orange and not grey.

    comment Comment

    At this point, the AnnData dataset batch information has a value of 0 for the “SHR” cells and 1 for the “WT” cells. This is fine if you wish to remember these two values for the remainder of the analysis since the step below is optional, however for plotting purposes it can just be easier to relabel the batch information with the action text labels shr and wt.

  2. Manipulate AnnData Tool: toolshed.g2.bx.psu.edu/repos/iuc/anndata_manipulate/anndata_manipulate/0.7.5+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata (output of Manipulate AnnData tool)
    • “Function to manipulate the object”: Rename categories of annotation
      • “Key for observations or variables annotation”: batch
      • “Comma-separated list of new categories”: shr, wt

    warning Warning: Order matters!

    The order in which the datasets are concatenated can affect the accuracy of the new labelling. If the #wt dataset was first selected and then the #shr dataset concatenated onto it, then the comma-seperated list of new categories above should be inverted.

Quality Control

We can confirm that our datasets have been combined into a single object by peeking at the dataset in the history and confirming two things:

  1. The number of observations is equal to the total of the two initial AnnData datasets.
  2. The file size of the new AnnData dataset is approximately equal to the total of the two initial AnnData datasets.
Batch Number of Cells File Size
shr 1099 117.2 MB
wt 4727 499.7 MB
Total 5826 615.6 MB

A happy coincidence here is that both datasets already had the exact same variables, so that the merging of the two datasets still yields the same number of genes.

Filtering the Matrix

hands_on Hands-on: Generating some metrics

  1. Inspect and manipulate Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_inspect/scanpy_inspect/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata (output of Manipulate AnnData tool)
    • “Method used for inspecting”: Calculate quality control metrics, using 'pp.calculate_qc_metrics'
  2. Plot with scanpy Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_plot/scanpy_plot/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata (output of Manipulate AnnData tool)
    • “Method used for plotting”: Generic: Violin plot, using 'pl.violin'
      • “Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields of '.obs'
        • “Keys for accessing variables”: n_genes_by_counts, total_counts
      • “The key of the observation grouping to consider”: batch
      • In “Violin plot attributes”:
        • “Add a stripplot on top of the violin plot”: Yes
          • “Add a jitter to the stripplot”: Yes
            • “Size of the jitter points”: 0.4
        • “Display keys in multiple panels”: Yes

This should result in the following violin plot, where we see the distribution of the number of genes per cell (n_genes_by_counts) and the library size per cell, where the library size is the total number of mRNA in a cell, regardless of which gene it came from (total_counts).

"Violin Plots per Batch"

Each dot is a cell, and the x-axis has two groups showing the absolute values of for each batch (shr and wt). The y-values correspond to the actual value of each cell, but the x-vales within each group are just random jitter to help see the cells more clearly. The outline of the violin plot shows us the density of each batch, where we can see that most of the cells in the shr group have an average number of features just below 2000 genes, compared to wt which has an average number of features just below 4000 genes. If we compare the library size, however, we notice that the outlines are very similar and that the average library sizes across batches are comparable.

Filter

Most scRNA-seq datasets need to set a minimum threshold of detectability in order to ensure that the cells are of high enough quality to be used for analysis. Conversely, sometimes an upper threshold also should be set to counter the case where a cell is a cell-doublet (in the sense that two cells were captured and sequenced accidentally as a single cell).

The above violin plots suggest quite a few outliers, mostly on the upper-end of the density plots, suggesting a few cell-doublets in the data. The effect of these is not something we cannot easily regress out, and so we must filter them out instead. By looking at the plots, we can see that a good maximum library size here would be 100 000 transcripts, and a good maximum feature size would be 10 000 transcripts.

For this analysis, we will set a minimum threshold of detectability that each cell should count transcripts from at least 200 genes, and that each gene should be expressed in at least 5 cells. These lower-bound thresholds are not so easy to derive from such QC plots, but depend mostly on the type of data you are using. If the number of features and the library sizes were one order of magnitude higher, one might consider also scaling the lower-bound thresholds by one order of magnitude.

hands_on Hands-on: Filtering

  1. Filter Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_filter/scanpy_filter/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Inspect and manipulate tool)
    • “Method used for filtering”: Filter cell outliers based on counts and numbers of genes expressed, using 'pp.filter_cells'
      • “Filter”: Minimum number of genes expressed
        • “Minimum number of genes expressed required for a cell to pass filtering”: 100
  2. Filter Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_filter/scanpy_filter/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Filter tool)
    • “Method used for filtering”: Filter genes based on number of cells or counts, using 'pp.filter_genes'
      • “Filter”: Minimum number of cells expressed
        • “Minimum number of cells expressed required for a gene to pass filtering”: 2

    We now set the upper limits to the analysis.

  3. Manipulate AnnData Tool: toolshed.g2.bx.psu.edu/repos/iuc/anndata_manipulate/anndata_manipulate/0.7.5+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Filter tool)
    • “Function to manipulate the object”: Filter observations or variables
      • “What to filter?”: Observations (obs)
      • “Type of filtering?”: By key (column) values
        • “Key to filter”: n_genes_by_counts
        • “Type of value to filter”: Number
          • “Filter”: less than
          • “Value”: 12000
  4. Manipulate AnnData Tool: toolshed.g2.bx.psu.edu/repos/iuc/anndata_manipulate/anndata_manipulate/0.7.5+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata (output of Manipulate AnnData tool)
    • “Function to manipulate the object”: Filter observations or variables
      • “What to filter?”: Observations (obs)
      • “Type of filtering?”: By key (column) values
        • “Key to filter”: total_counts
        • “Type of value to filter”: Number
          • “Filter”: less than
          • “Value”: 120000

If we inspect the resulting dataset, we see that the number of cells remaining is 5826 and the number of genes is 18913, which is still a good number of cells and genes. It is good practice to save the raw data in case we need to use it data, so we “freeze the raw data” into a slot for later safe keeping.

hands_on Hands-on: Saving the original matrix

  1. Manipulate AnnData Tool: toolshed.g2.bx.psu.edu/repos/iuc/anndata_manipulate/anndata_manipulate/0.7.5+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Filter tool)
    • “Function to manipulate the object”: Freeze the current state into the 'raw' attribute

Confounder Removal

Now we have a set of cells we are reasonably confident contain desirable biological variation (please view the following segment from “An introduction to scRNA-seq data analysis” as a reference for the types of wanted and unwanted variation). We wish to now regress out the library size variation, as it is not indicative of cell type. To do this, we will first ensure that all cells are of the same comparable library size by applying individual size factors to each cell.

We will also apply a log transformation to the cells, as this compresses the variation into a less extreme space, making it easier to see the relative differences between cells.

After normalising and regressing out unwanted factors, we will then scale the data to have unit variance with a zero mean, so that the mean expression of each gene is not a factor in the analysis and only the variation is, ensuring that the later clustering is driven only by the relative variation of the genes, and not neccesarily how expressive those genes are.

tip Highly Variable Genes?

Those of you who are familiar with the ScanPy Tutorial might wonder why we have not reduced the number of genes by performing a highly variable gene selection. The answer is simply that it did not help with this particular dataset, and that by removing the least variable genes in the analysis, it did help us replicate the analysis in the paper. Try it for yourself as an intermediate step (after this analysis) and see!

hands_on Hands-on: Normalize and Scale

  1. Normalize Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_normalize/scanpy_normalize/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata (output of Manipulate AnnData tool)
    • “Method used for normalization”: Normalize counts per cell, using 'pp.normalize_total'
      • “Target sum”: 10000.0
      • “Exclude (very) highly expressed genes for the computation of the normalization factor (size factor) for each cell”: No
  2. Inspect and manipulate Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_inspect/scanpy_inspect/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Normalize tool)
    • “Method used for inspecting”: Logarithmize the data matrix, using 'pp.log1p'
  3. Remove confounders Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_remove_confounders/scanpy_remove_confounders/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Inspect and manipulate tool)
    • “Method used for plotting”: Regress out unwanted sources of variation, using 'pp.regress_out'
      • “Keys for observation annotation on which to regress on”: total_counts
  4. Inspect and manipulate Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_inspect/scanpy_inspect/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Remove confounders tool)
    • “Method used for inspecting”: Scale data to unit variance and zero mean, using 'pp.scale'
      • “Maximum value”: 10.0

Dimensionality Reduction

Dimensionality reduction is the art of reducing a high dimensional dataset into a low dimensional “embedding” that humans can actually see (i.e. 2 or 3 dimensions), ideally such that the relationships or distances between data points are preserved in this embedding. In the context of single-cell datasets, this essentially means compressing > 10 000 genes into just 2 X/Y variables.

tip Tip: What is Dimensionality Reduction?

You can learn more about dimensionality reduction by consulting the following segment from “An introduction to scRNA-seq data analysis”.

This is usually a two step process:

  1. A Principal Component Analysis (PCA) is used to perform an optimized “rotation” of the ~ 20 000 “unit” gene axes that we have, in order to better fit the data. These new axes or “principal components” are then linear combinations of the original unit axes, with an associated score of how variable each new axis is. By sorting these principal components by most variable to least variable, we can select the top N components and discard the rest, leaving us with most of the variation still in the data. Usually we select the top 40 principal components, which from 20 000 is a huge reduction in the dimensionality of the dataset.
  2. We then perform a more complex kind of dimensionality reduction, one that does not assume any linearity in the axes. For scRNA-seq, this is either tSNE or UMAP, with UMAP being the main choice due to how flexible it is in incorporating new data. Though UMAP is capable of working on extremely high dimensional datasets, it is often limited by time and space constraints (read: the computer does not respond in a reasonable timeframe, or it crashes), and so therefore it is quite normal to feed UMAP the output of the PCA as input. This will then be a dimensionality reduction from 50 to 2. With these final 2 dimensions, we can plot the data.

hands_on Hands-on: PCA and UMAP

  1. Cluster, infer trajectories and embed Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_cluster_reduce_dimension/scanpy_cluster_reduce_dimension/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Inspect and manipulate tool)
    • “Method used”: Computes PCA (principal component analysis) coordinates, loadings and variance decomposition, using 'tl.pca'
      • “Type of PCA?”: Full PCA
        • “SVD solver to use”: ARPACK wrapper in SciPy
  2. Inspect and manipulate Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_inspect/scanpy_inspect/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
    • “Method used for inspecting”: Compute a neighborhood graph of observations, using 'pp.neighbors'
      • “The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation”: 10
      • “Number of PCs to use”: 40

    comment Comment

    UMAP relies on a connected graph of cells to operate. Please view the following segment from “An introduction to scRNA-seq data analysis” for more information on how this process works.

  3. Cluster, infer trajectories and embed Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_cluster_reduce_dimension/scanpy_cluster_reduce_dimension/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Inspect and manipulate tool)
    • “Method used”: Embed the neighborhood graph using UMAP, using 'tl.umap'

With our data now sufficiently “flat” and ready for human consumption, we can now do a naive plot of both the PCA and the UMAP and show cell assignments coloured by batch.

hands_on Hands-on: Plot the PCA and UMAP by Batch

  1. Plot with scanpy Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_plot/scanpy_plot/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
    • “Method used for plotting”: Embeddings: Scatter plot in PCA coordinates, using 'pl.pca
      • “Keys for annotations of observations/cells or variables/genes”: batch
      • In “Plot attributes”
        • “Colors to use for plotting categorical annotation groups”: rainbow (Miscellaneous)
  2. Plot with scanpy Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_plot/scanpy_plot/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
    • “Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'
      • “Keys for annotations of observations/cells or variables/genes”: batch
      • “Show edges?”: No
      • In “Plot attributes”
        • “Legend font size”: 14
        • “Colors to use for plotting categorical annotation groups”: rainbow (Miscellaneous)
PCA UMAP
PCA UMAP

From this, we can see that there is a reasonable overlap in our batches shown both in the PCA and UMAP embeddings. This is good because it shows that though there is some batch effect (i.e. cells from one batch appear to cluster on a different side of the plot than the other) it is not significant enough for there not to be some commonality between the batches.

There are batch correction algorithms for cases where one batch clusters completely seperate from the other, but this is not necessary here.

Finding Cell Types

Clustering

Let us cluster the cells and see what cell types we can discover in the plots. The cell types that we wish to recover are those shown in the original paper, as highlighted by the gene markers in the DotPlot below.

UMAP DotPlot
PCA UMAP

hands_on Hands-on: Generate and Plots Clusters

  1. Cluster, infer trajectories and embed Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_cluster_reduce_dimension/scanpy_cluster_reduce_dimension/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
    • “Method used”: Cluster cells into subgroups, using 'tl.leiden'
      • “Coarseness of the clusterin”: 0.35
  2. Plot with scanpy Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_plot/scanpy_plot/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
    • “Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'
      • “Keys for annotations of observations/cells or variables/genes”: leiden, batch
      • “Show edges?”: No
      • In “Plot attributes”:
        • “Location of legend”: on data
        • “Legend font size”: 14
        • “Colors to use for plotting categorical annotation groups”: rainbow (Miscellaneous)

    comment Comment

    We print the legend on the data because it’s easier to see where the cluster labels apply.

UMAP Leiden Clusters

Here we have recovered 13 clusters but we don’t yet know what types they describe. If we have a list of genes that we know are indicative of a certain cell type (i.e. are marker genes) then we can use this to assign labels to our clusters.

Let us here try to recreate the DotPlot from the paper using the clusters we have discovered.

DotPlot and Validating Cell Types

hands_on Hands-on: Dot Plot

  1. Plot with scanpy Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_plot/scanpy_plot/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
    • “Method used for plotting”: Generic: Makes a dot plot of the expression values, using 'pl.dotplot'
      • “Variables to plot (columns of the heatmaps)”: Subset of variables in 'adata.var_names'
        • “List of variables to plot”: AT1G26680,COBL2,AT3G55180,NCED2,AT5G22550,ATL63,RGF8,TEL1,DUF9,AGL42,PLT1,PLT2,PER3,PER9,PBL15,TBL30,AT2G48130,AT3G22620,AED3,T3P18-7,MBK21-8,NPF6-4,CYP71B34,IAMT1,AT1G31950,F14G6-22,GL2,ANL2,PMEI4,MUD21-4,EXPA7,MES15,ADF8,COBL9,TPR14,BHLH144,MYB83,SCPL35,MYB46,COBL4,PIN7,PIN3
      • “The key of the observation grouping to consider”: leiden
      • “Number of categories”: 9
      • “Use ‘raw’ attribute of input if present”: Yes
      • In “Group of variables to highlight”:
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 0
          • “End”: 5
          • “Label”: Columella
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 6
          • “End”: 9
          • “Label”: QC
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 10
          • “End”: 11
          • “Label”: NC
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 12
          • “End”: 17
          • “Label”: Endodermis
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 18
          • “End”: 23
          • “Label”: Cortex
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 24
          • “End”: 29
          • “Label”: Atrichoblast
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 30
          • “End”: 34
          • “Label”: Trichoblast
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 35
          • “End”: 39
          • “Label”: Xylem
        • param-repeat “Insert Group of variables to highlight”
          • “Start”: 40
          • “End”: 41
          • “Label”: VC
      • “Custom figure size”: No: the figure width is set based on the number of variable names and the height is set to 10.

By running the above we end up with the following DotPlot:

Dotplot In Analyssis

Notice how we have the Columella, QC and NC sharing the same cluster 11, that the Endodermis is localised in cluster 8, the Cortex in cluster 9, the Trichoblast cells in cluster 3, the Xylem showing strong expression in cluster 12, VC in clusters 4 and 6, and the Atrichoblasts have a mixed expression in a range of clusters. These characteristics are the same as the Dotplot in the original paper.

We can use this Dotplot as a guide to relabel our clusters and give more meaningful annotations to the plots.

Relabel clusters

hands_on Hands-on: Cluster Re-labelling

  1. Manipulate AnnData Tool: toolshed.g2.bx.psu.edu/repos/iuc/anndata_manipulate/anndata_manipulate/0.7.5+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata_out (output of Cluster, infer trajectories and embed tool)
    • “Function to manipulate the object”: Rename categories of annotation
      • “Key for observations or variables annotation”: leiden
      • “Comma-separated list of new categories”: 0, 1, 2, trichoblasts, 4, 5, 6, atrichoblasts, endodermis, cortex, 10, columella+QC+NC, xylem
  2. Plot with scanpy Tool: toolshed.g2.bx.psu.edu/repos/iuc/scanpy_plot/scanpy_plot/1.7.1+galaxy0 with the following parameters:
    • param-file “Annotated data matrix”: anndata (output of Manipulate AnnData tool)
    • “Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'
      • “Keys for annotations of observations/cells or variables/genes”: leiden, batch
      • “Show edges?”: No
      • In “Plot attributes”:
        • “Location of legend”: on data
        • “Legend font size”: 14
        • “Colors to use for plotting categorical annotation groups”: rainbow (Miscellaneous)

Relabelled clusters

If we look at the clusters now, and compare them to the original image in the paper, we can infer that the meristematic cells are likely to be derived from cluster 1, which shares soft clustering with the trichoblasts suggesting a trajectory pathway which could be explored.

Conclusions

In this tutorial, we have recapitulated the same clustering analysis in the “Spatiotemporal Developmental Trajectories in the Arabidopsis Root Revealed Using High-Throughput Single-Cell RNA Sequencing” Denyer et al. 2019 paper, and validated them by comparing DotPlots for specific genes that were used as markers in that paper.

From this point, we can perform a lineage analysis to infer a differentiation pathway between the clusters. For ScanPy there is the PAGA option, however, this does not work so well with the current dataset, so it is encouraged that users use the original Seurat trajectory suite that was given in the paper, or to experiment with Monocle.

Both libraries are available within the RStudio and Jupyter Notebook libraries in the interactive Galaxy envronments which can be found in the “Miscellaneous Tools” section under the “Interactive Tools” subheading. An excellent follow-up tutorial to perform a trajectory analysis in Galaxy using Jupyter notebooks would be the “Trajectory Analysis using Python (Jupyter Notebook) in Galaxy.

This entire tutorial can be invoked from the scRNA Plant Workflow shown below:

scRNA Plant

The input parameters take two datasets as input, and 5 parameters using the defaults shown in this tutorial: min cells, min genes, max features, max library size, and Leiden clustering resolution. Play with these options to see how robust the analysis really is!

keypoints Key points

  • Filtering parameters are dependent on the dataset, and should be explored using scatter or violin plots

  • A DotPlot is a fantastic way to validate clusters across different analyses

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Transcriptomics topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

  1. Satija, R., J. A. Farrell, D. Gennert, A. F. Schier, and A. Regev, 2015 Spatial reconstruction of single-cell gene expression data. Nature biotechnology 33: 495–502.
  2. Wolf, F. A., P. Angerer, and F. J. Theis, 2018 SCANPY: large-scale single-cell gene expression data analysis. Genome biology 19: 15. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0
  3. Denyer, T., X. Ma, S. Klesen, E. Scacchi, K. Nieselt et al., 2019 Spatiotemporal developmental trajectories in the Arabidopsis root revealed using high-throughput single-cell RNA sequencing. Developmental cell 48: 840–852.
  4. Tekman, M., B. Batut, A. Ostrovsky, C. Antoniewski, D. Clements et al., 2020 A single-cell RNA-sequencing training and analysis suite using the Galaxy framework. GigaScience 9: giaa102.

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

Click here to load Google feedback frame

Citing this Tutorial

  1. Mehmet Tekman, Beatriz Serrano-Solano, Cristóbal Gallardo, Pavankumar Videm, 2021 Analysis of plant scRNA-Seq Data with Scanpy (Galaxy Training Materials). https://training.galaxyproject.org/archive/2021-07-01/topics/transcriptomics/tutorials/scrna-plant/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

details BibTeX

@misc{transcriptomics-scrna-plant,
    author = "Mehmet Tekman and Beatriz Serrano-Solano and Cristóbal Gallardo and Pavankumar Videm",
    title = "Analysis of plant scRNA-Seq Data with Scanpy (Galaxy Training Materials)",
    year = "2021",
    month = "05",
    day = "27"
    url = "\url{https://training.galaxyproject.org/archive/2021-07-01/topics/transcriptomics/tutorials/scrna-plant/tutorial.html}",
    note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
        doi = {10.1016/j.cels.2018.05.012},
        url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
        year = 2018,
        month = {jun},
        publisher = {Elsevier {BV}},
        volume = {6},
        number = {6},
        pages = {752--758.e1},
        author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
        title = {Community-Driven Data Analysis Training for Biology},
        journal = {Cell Systems}
}
                    

congratulations Congratulations on successfully completing this tutorial!