Unsupervised Analysis of Bone Marrow Cells with Flexynesis

Overview
Questions:
  • How can we identify distinct cell populations in bone marrow single-cell data without prior labels?

  • What cellular patterns and relationships can be discovered through unsupervised deep learning approaches?

  • How does variational autoencoder (VAE) architecture help in dimensionality reduction and feature learning for single-cell data?

Objectives:
  • Apply Flexynesis VAE architecture for unsupervised analysis of single-cell bone marrow data

  • Perform dimensionality reduction and feature learning using deep learning methods

  • Identify and interpret cellular clusters and patterns in high-dimensional single-cell datasets

  • Evaluate the quality of unsupervised representations through visualization and clustering metrics

Time estimation: 2 hours
Published: Aug 10, 2025
Last modification: Aug 10, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 1

Traditional dimensionality reduction techniques, while useful, often fail to capture the complex non-linear relationships present in high-dimensional data. Deep learning approaches, particularly Variational Autoencoders (VAEs), have emerged as powerful tools for unsupervised analysis of single-cell transcriptomic data (Zhao et al. 2017). VAEs combine the representational power of neural networks with probabilistic modeling, enabling them to learn meaningful latent representations while accounting for the inherent uncertainty in biological data. The key advantage of VAEs lies in their ability to encode high-dimensional gene expression profiles into a lower-dimensional latent space that preserves the most informative biological variation. This latent representation can then be used for various downstream analyses, including clustering, trajectory inference, and data integration.
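The probabilistic part of a VAE can be illustrated in a few lines of plain Python. The sketch below assumes the common setup of a diagonal Gaussian encoder and a standard normal prior; the function names and the example values are hypothetical, and this is not the Flexynesis implementation. It shows how a latent vector is sampled from the encoder outputs and how the KL term that regularises the latent space is computed.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


rng = random.Random(0)
mu = [0.5, -1.0]       # hypothetical encoder means for one cell
logvar = [0.0, -2.0]   # hypothetical encoder log-variances for one cell
z = reparameterize(mu, logvar, rng)
print("latent sample:", z)
print("KL term:", kl_to_standard_normal(mu, logvar))
```

The KL term is zero only when the encoder output matches the prior exactly, which is why it acts as a penalty that keeps the latent space compact.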

Flexynesis is a deep learning framework designed for multi-modal data integration in biological research (Uyar et al. 2024). It offers a range of deep learning architectures, including supervised and unsupervised VAEs, that handle various data integration scenarios and support feature selection and hyperparameter optimization.

When no outcome variable is available, or when unsupervised training is preferred, the supervised_vae model in Flexynesis can still be used. The supervised variational autoencoder class can be trained on the input dataset without a supervisor head: if the user passes no target variables, batch variables, or survival variables, the class behaves as a plain variational autoencoder.

This tutorial is based on the original Flexynesis analysis notebook: unsupervised_analysis_single_cell.ipynb.

Here, we demonstrate the capabilities of Flexynesis on a single-cell CITE-seq dataset of bone marrow samples (Stuart et al. 2019). The dataset was downloaded and processed using Seurat (v5.1.0) (Hao et al. 2021). From it, 5000 cells were randomly sampled for training and 5000 cells for testing.

Warning: LICENSE

Flexynesis is only available for NON-COMMERCIAL use. Permission is only granted for academic, research, and educational purposes. Before using it, be sure to review, agree to, and comply with the license. For commercial use, please review the Flexynesis license on GitHub and contact the copyright holders.

Agenda

In this tutorial, we will cover:

  1. Data upload
    1. Get data
  2. Unsupervised Training of Flexynesis
  3. Clustering and visualisation
    1. Louvain clustering
    2. Get optimal clusters
    3. Compute AMI, ARI
    4. UMAP visualisation of true and Louvain labels
  4. Conclusion

Data upload

In the first part of this tutorial we will upload processed CITE-seq data from bone marrow tissue.

All data are in tabular format and they include:

  • ADT (antibody-derived tag) data, which quantify cell surface proteins
  • RNA expression data
  • Clinical data, which hold per-cell information such as the number of RNAs and genes (in later steps we will use the cell type labels in “celltype_l2”)
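Before training, it can help to confirm that the same cells appear in the omics and clinical tables. The sketch below is only an illustration: it assumes that omics tables list features in rows and cells in columns, that the clinical table has one row per cell with the cell ID in the first column, and that every table has a labelled first header field. The helper names and the tiny inline tables are made up; check the layout of the downloaded files before relying on it.

```python
import csv
import io


def read_tabular(text):
    """Parse tab-separated text into a header row and a list of data rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    return rows[0], rows[1:]


def shared_cells(omics_text, clin_text):
    """Return cell IDs present in both tables, in omics column order.

    Assumes omics columns are cells (the first header field names the feature
    column) and clinical rows are cells (the first field is the cell ID).
    """
    omics_header, _ = read_tabular(omics_text)
    _, clin_rows = read_tabular(clin_text)
    clin_ids = {row[0] for row in clin_rows if row}
    return [cell for cell in omics_header[1:] if cell in clin_ids]


# Tiny made-up tables standing in for the RNA and clinical files.
rna = "gene\tcell_1\tcell_2\tcell_3\nCD3E\t0.0\t1.2\t0.4\nCD19\t2.1\t0.0\t0.0\n"
clin = "cell\tcelltype_l2\ncell_1\tB\ncell_2\tT\n"
print(shared_cells(rna, clin))
```

Cells that are missing from either table (here the made-up cell_3) would be worth investigating before the data are passed to a training tool.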

Get data

Hands On: Data Upload
  1. Create a new history for this tutorial

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Import the files from Zenodo:

    https://zenodo.org/records/16287482/files/test-ADT_BMscRNAseq.tabular
    https://zenodo.org/records/16287482/files/test-clin_BMscRNAseq.tabular
    https://zenodo.org/records/16287482/files/test-RNA_BMscRNAseq.tabular
    https://zenodo.org/records/16287482/files/train-ADT_BMscRNAseq.tabular
    https://zenodo.org/records/16287482/files/train-clin_BMscRNAseq.tabular
    https://zenodo.org/records/16287482/files/train-RNA_BMscRNAseq.tabular
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel
    • Select Paste/Fetch Data
    • Paste the link(s) into the text field
    • Press Start
    • Close the window

  3. Rename the datasets
  4. Check that the datatype is tabular

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab on the top
    • In Assign Datatype, select tabular from the “New Type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  5. Add to each dataset a representative tag (RNA, ADT, clin)

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall, generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

Unsupervised Training of Flexynesis

Hands On: Train unsupervised model
  1. Flexynesis (Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Type of Analysis”: Unsupervised Training
      • param-file “Training clinical data”: train-clin_BMscRNAseq.tabular
      • param-file “Test clinical data”: test-clin_BMscRNAseq.tabular
      • param-file “Training omics data”: train-RNA_BMscRNAseq.tabular
      • param-file “Test omics data”: test-RNA_BMscRNAseq.tabular
      • “What type of assay is your input?”: RNA
      • In “Multiple omics layers?”:
        • param-repeat “Insert Multiple omics layers?”
          • param-file “Training omics data”: train-ADT_BMscRNAseq.tabular
          • param-file “Test omics data”: test-ADT_BMscRNAseq.tabular
          • “What type of assay is your input?”: ADT
      • In “Advanced Options”:
        • “How many epochs to wait when no improvements in validation loss are observed.”: 5
        • “Number of iterations for hyperparameter optimization.”: 1
    Comment: Advanced options

    In this tutorial, for the sake of time, we are using 1 iteration for hyperparameter optimization. In a real-life analysis you might want to increase this number according to your dataset.
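The option “How many epochs to wait when no improvements in validation loss are observed” is an early-stopping patience. The generic logic behind such a setting can be sketched as follows; this is an illustration with a hypothetical function name and made-up loss values, not the code the tool runs.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the waiting counter.
            best = loss
            waited = 0
        else:
            # No improvement: count this epoch and stop once patience is used up.
            waited += 1
            if waited >= patience:
                return epoch
    return None


losses = [1.00, 0.90, 0.95, 0.96, 0.97, 0.98, 0.99, 0.50]  # made-up validation losses
print(early_stopping_epoch(losses, patience=5))
```

With a patience of 5, training in this made-up run stops before the late improvement in the last epoch is reached, which illustrates the trade-off between run time and the chance of missing a later improvement.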

Question
  1. What are the outputs from Flexynesis?
  1. There are two tabular files for the latent space embeddings and two feature log files for each of the modalities.

Clustering and visualisation

Now, we extract the sample embeddings from the test dataset, cluster the cells using Louvain clustering, and visualize the clusters along with known cell type labels.

Hands On: Extract test embeddings
  1. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: The first dataset
Question
  1. What are other options to extract datasets from a collection?
  1. It is also possible to use the index (here index 0) or the dataset name (here job.embeddings_test) to extract the data. Please always check your collection before extraction.

Louvain clustering

Hands On: Cluster cells by Louvain method
  1. Flexynesis utils (Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis utils”: Louvain Clustering
      • param-file “Matrix”: job.embeddings_test (output of Extract dataset tool)
      • param-file “Predicted labels”: test-clin_BMscRNAseq.tabular (Input dataset)
      • “Number of nearest neighbors to connect for each node”: 15
Question
  1. What is the output of this tool?
  1. The output is the test-clin_BMscRNAseq.tabular file with a column added containing Louvain clustering values.
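Louvain clustering works on a graph rather than directly on the embeddings, which is why the tool asks for a number of nearest neighbours. The sketch below is a simplified illustration with hypothetical helper names and made-up 2D points: it builds a k-nearest-neighbour graph and computes modularity, the quantity that the Louvain method tries to maximise. It does not implement the Louvain optimisation itself.

```python
import math


def knn_edges(points, k):
    """Undirected edge set linking every point to its k nearest neighbours (Euclidean)."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def modularity(edges, labels):
    """Newman modularity of a partition on an unweighted, undirected graph."""
    m = len(edges)
    inside = {}   # edges with both ends in the same community
    degree = {}   # summed node degrees per community
    for i, j in edges:
        for node in (i, j):
            degree[labels[node]] = degree.get(labels[node], 0) + 1
        if labels[i] == labels[j]:
            inside[labels[i]] = inside.get(labels[i], 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Two well-separated made-up groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = knn_edges(points, k=2)
print("by group:", modularity(edges, [0, 0, 0, 1, 1, 1]))
print("mixed:   ", modularity(edges, [0, 1, 0, 1, 0, 1]))
```

A partition that follows the graph structure scores higher than one that cuts across it. A larger number of neighbours makes the graph denser, which tends to produce fewer and larger communities.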

Get optimal clusters

Now we will use k-means clustering with a varying number of expected clusters and pick the best one based on silhouette scores.

Hands On: Get optimal clusters
  1. Flexynesis utils (Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis utils”: Get Optimal Clusters
      • param-file “Matrix”: job.embeddings_test (output of Extract dataset tool)
      • param-file “Predicted labels”: louvain_clustering (output of Flexynesis utils tool)
      • “Minimum number of clusters to try”: 5
      • “Maximum number of clusters to try”: 15
    Comment: Predicted labels

    Please make sure to use the output of Louvain clustering. We need those values in one table for the next steps.

  2. Rename the output to labels with optimal clusters
Question
  1. What is the output of this tool?
  1. Another column is added to the previous table for k-means clustering values.
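Choosing the number of clusters by silhouette score can be sketched as follows. This is a plain-Python illustration with hypothetical function names and made-up points; the candidate labelings are written by hand as stand-ins for the k-means results that the tool would compute for each number of clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Needs at least two clusters; points in singleton clusters contribute 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def pick_best(points, candidates):
    """Return the k whose labeling has the highest silhouette score."""
    return max(candidates, key=lambda k: silhouette_score(points, candidates[k]))


points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]  # made-up embeddings
candidates = {
    2: [0, 0, 0, 0, 1, 1],   # hand-written stand-in for a k=2 result
    3: [0, 0, 1, 1, 2, 2],   # hand-written stand-in for a k=3 result
}
print(pick_best(points, candidates))
```

The silhouette score rewards labelings in which points are close to their own cluster and far from the nearest other cluster, so it can be compared across different numbers of clusters without ground truth labels.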

In the next step, we will calculate the concordance between the known cell types and the unsupervised cluster labels using AMI (Adjusted Mutual Information) and ARI (Adjusted Rand Index).

Compute AMI, ARI

AMI (Adjusted Mutual Information) and ARI (Adjusted Rand Index) are used to compare clustering results with ground truth labels. Both measure concordance (agreement) between two clusterings and are corrected for chance: a score of 1 indicates a perfect match, a score near 0 indicates agreement no better than random labeling, and negative values indicate agreement worse than expected by chance.
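To make the ARI concrete, here is a minimal plain-Python sketch of the standard pair-counting formula. The function name and the toy labels are made up for illustration. AMI is not covered because it additionally needs an expected mutual information term; in Python, scikit-learn provides both scores as adjusted_rand_score and adjusted_mutual_info_score.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))   # contingency table cells
    same_both = sum(comb(c, 2) for c in pairs.values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = same_true * same_pred / comb(n, 2)    # agreement expected by chance
    maximum = 0.5 * (same_true + same_pred)
    if maximum == expected:      # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (same_both - expected) / (maximum - expected)


# Cluster names do not matter, only which items are grouped together.
print(adjusted_rand_index(["B", "B", "T", "T"], [1, 1, 0, 0]))
# A labeling that splits every true group scores below zero.
print(adjusted_rand_index(["B", "B", "T", "T"], [0, 1, 0, 1]))
```

Because the score depends only on which cells are grouped together, Louvain cluster numbers and cell type names can be compared directly without first matching them up.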

Hands On: Louvain vs true labels
  1. Flexynesis utils (Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis utils”: Compute AMI and ARI
      • param-file “Predicted labels”: labels with optimal clusters (output of Flexynesis utils tool)
      • “Column name in the labels file to use for the true labels”: c10 (celltype_l2)
      • “Column name in the labels file to use for the predicted labels”: c12 (louvain_cluster)
Hands On: k-means vs true labels
  1. Flexynesis utils (Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis utils”: Compute AMI and ARI
      • param-file “Predicted labels”: labels with optimal clusters (output of Flexynesis utils tool)
      • “Column name in the labels file to use for the true labels”: c10 (celltype_l2)
      • “Column name in the labels file to use for the predicted labels”: c13 (optimal_kmeans_cluster)
Question
  1. Which clustering has better concordance with the known cell types: Louvain or k-means?
  1. Louvain clustering has AMI = 0.66 and ARI = 0.49, while k-means has AMI = 0.55 and ARI = 0.43. Louvain clustering yields better AMI and ARI scores, so we use the Louvain labels for the following visualisations.

UMAP visualisation of true and Louvain labels

Hands On: Dimension reduction plot
  1. Flexynesis plot (Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis plot”: Dimensionality reduction
      • param-file “Embeddings”: job.embeddings_test (output of Extract dataset tool)
      • param-file “Predicted labels”: labels with optimal clusters (output of Flexynesis utils tool)
      • “Column in the labels file to use for coloring the points in the plot”: c10 (celltype_l2)
      • “Transformation method”: UMAP
Hands On: Dimension reduction plot
  1. Flexynesis plot (Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis plot”: Dimensionality reduction
      • param-file “Embeddings”: job.embeddings_test (output of Extract dataset tool)
      • param-file “Predicted labels”: labels with optimal clusters (output of Flexynesis utils tool)
      • “Column in the labels file to use for coloring the points in the plot”: c12 (louvain_cluster)
      • “Transformation method”: UMAP
Question
  1. Compare the two UMAP plots. Is the unsupervised clustering close to the ground truth labels?
  1. As with the true labels, each group of cells in the UMAP is assigned a largely distinct Louvain cluster. This shows that clustering based on the latent space is close to the ground truth. However, we still don’t know which Louvain cluster corresponds to which true label.

Figure 1: UMAP plot of test Embeddings colored by true labels

Figure 2: UMAP plot of test Embeddings colored by predicted labels

To see how the Louvain clusters map onto the true labels, we can tabulate the concordance between them (each row sums to 1).

Hands On: Concordance plot
  1. Flexynesis plot (Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis plot”: Label concordance heatmap
      • param-file “Predicted labels”: labels with optimal clusters (output of Flexynesis utils tool)
      • “Column in the labels file to use for true labels”: c10 (celltype_l2)
      • “Column in the labels file to use for predicted labels”: c12 (louvain_cluster)

Now it is easier to see which Louvain cluster corresponds to which true label.


Figure 3: Concordance plot of true values vs predicted values
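The table behind a concordance heatmap of this kind can be sketched as a row-normalised cross-tabulation. The sketch below assumes that rows are true labels and columns are predicted clusters, which the tutorial does not state explicitly; the function name and the toy labels are made up for illustration.

```python
from collections import Counter


def row_normalised_crosstab(true_labels, pred_labels):
    """Fraction of each true label's cells that falls into each predicted cluster."""
    counts = Counter(zip(true_labels, pred_labels))
    row_totals = Counter(true_labels)
    columns = sorted(set(pred_labels))
    return {
        t: {p: counts[(t, p)] / row_totals[t] for p in columns}
        for t in sorted(row_totals)
    }


true = ["B", "B", "B", "T", "T", "NK"]   # made-up cell types
pred = [0, 0, 1, 1, 1, 2]                # made-up cluster labels
for label, row in row_normalised_crosstab(true, pred).items():
    print(label, row)
```

Dividing by the row totals makes cell types of different sizes comparable: a row with a single value close to 1 means that the cell type is captured almost entirely by one cluster.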

Conclusion

Here we demonstrated the power of Flexynesis for unsupervised analysis of multi-modal single-cell data. We explored how variational autoencoders can capture cellular heterogeneity without requiring labeled training data.