Modeling Breast Cancer Subtypes with Flexynesis

Overview
Questions:
  • How can we model breast cancer subtypes using transcriptomics and genomic data?

  • What are the key expression patterns that distinguish different BRCA subtypes?

  • How can we interpret the learned features from a deep neural network classifier?

Objectives:
  • Apply Flexynesis to model and visualize BRCA subtypes

  • Interpret the learned representations from the DirectPred model

  • Use UMAP and clustering to explore learned features

Time estimation: 2 hours
Published: Aug 13, 2025
Last modification: Aug 13, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 2

Flexynesis represents a state-of-the-art deep learning framework specifically designed for multi-modal data integration in biological research (Uyar et al. 2024). What sets Flexynesis apart is its comprehensive suite of deep learning architectures that can handle various data integration scenarios while providing robust feature selection and hyperparameter optimization.

Here, we use the Flexynesis tool suite on a multi-omics dataset of breast cancer samples from the METABRIC consortium (metabric community), one of the landmark breast cancer genomics studies available through cBioPortal (cbioPortal community). This dataset contains comprehensive molecular and clinical data from over 2,000 breast cancer patients, including gene expression profiles, copy number alterations, mutation data, and clinical outcomes. The data was downloaded from cBioPortal and randomly split into train (70% of the samples) and test (30% of the samples) data folders. The data files were processed to follow the same nomenclature.

This training is inspired by the original Flexynesis analysis notebook: brca_subtypes.ipynb.

Warning: LICENSE

Flexynesis is only available for NON-COMMERCIAL use. Permission is only granted for academic, research, and educational purposes. Before using, be sure to review, agree to, and comply with the license. For commercial use, please review the Flexynesis license on GitHub and contact the copyright holders.

Agenda

In this tutorial, we will cover:

  1. Get data
  2. Classification task with Flexynesis (1 iteration)
    1. Generate UMAP plot of the embeddings
  3. Longer training
    1. Generate UMAP plot of the embeddings
  4. Conclusion
  5. Prepare data for TabPFN
    1. CNA data
    2. GEX data
  6. TabPFN
  7. Conclusion

Get data

In this training we will use a multi-omics dataset from the METABRIC database, downloaded through cBioPortal and split into 70% train and 30% test samples.
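As a rough illustration of what a random 70/30 split of samples looks like, here is a minimal Python sketch. The sample IDs, the seed, and the `split_samples` helper are all hypothetical; the procedure actually used to prepare the Zenodo files is not described in this tutorial.

```python
import random

def split_samples(sample_ids, train_fraction=0.7, seed=42):
    """Shuffle sample IDs reproducibly and split them into train and test lists."""
    ids = sorted(sample_ids)          # fixed starting order, independent of input order
    random.Random(seed).shuffle(ids)  # seeded shuffle so the split can be reproduced
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# Hypothetical sample IDs, for illustration only
samples = [f"sample_{i:03d}" for i in range(10)]
train_ids, test_ids = split_samples(samples)
print(len(train_ids), len(test_ids))
```

The important property is that every sample ends up in exactly one of the two sets, so the test samples are never seen during training.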

The multi-omics data consists of three data types:

  • Gene Expression (GEX): Measures the activity levels of genes by quantifying mRNA abundance. This data captures which genes are being actively transcribed in each sample.

  • Copy Number Alterations (CNA): Detects genomic regions that have been duplicated or deleted compared to the normal genome. Cancer cells often have chromosomal instabilities leading to gene amplifications (gains) or deletions (losses), which can drive tumor development and progression.

  • Clinical data (CLIN): Contains patient metadata including demographics, treatment history, tumor characteristics, and, importantly for this tutorial, the breast cancer molecular subtypes (such as Luminal A, Luminal B, HER2-enriched, and Basal-like). These subtypes are based on the expression patterns of key biomarkers, including claudin proteins, which are important for cell adhesion and tumor behavior.

These three data modalities provide complementary views of the same tumor samples - the clinical data provides the biological context and classification labels, while the molecular data (GEX and CNA) capture the underlying genomic alterations that drive these clinical phenotypes.

Flexynesis tries to learn the patterns in these molecular alterations and connect them to the phenotype to predict cancer subtypes based on the genomic and transcriptomic profiles.

Hands On: Data Upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo or from the shared data library (GTN - Material -> statistics -> Modeling Breast Cancer Subtypes with Flexynesis):

    https://zenodo.org/records/16287482/files/train_cna_brca.tabular
    https://zenodo.org/records/16287482/files/train_gex_brca.tabular
    https://zenodo.org/records/16287482/files/train_clin_brca.tabular
    https://zenodo.org/records/16287482/files/test_cna_brca.tabular
    https://zenodo.org/records/16287482/files/test_gex_brca.tabular
    https://zenodo.org/records/16287482/files/test_clin_brca.tabular
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Libraries (left panel)
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  3. Rename the datasets
  4. Check that the datatype is tabular

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select tabular from “New Type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  5. Add to each dataset a tag corresponding to the modality (#gex, #cna, #clin)

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall, generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

Classification task with Flexynesis (1 iteration)

Flexynesis automatically handles key preprocessing steps such as:

  • Data processing:
    • removing uninformative features (e.g. features with near-zero variance)
    • removing samples with too many NA values
    • removing features with too many NA values and imputing the remaining NA values using median imputation (replacing missing values with the median of each feature)
  • Feature selection only on training data for each omics layer separately:
    • Features are sorted by Laplacian score
    • Features that make it into the top_percentile are kept
    • Highly redundant features are further removed (for a pair of highly correlated features, keep the one with the higher Laplacian score).
  • Harmonize the training data with the test data (subset the test data features to those that are kept for the training data)

  • Normalize the datasets: normalize the training data (standard scaling) and apply the same scaling factors to the test data.

  • Log transform the final matrices (Optional)

  • Distinguish numerical and categorical variables in the clinical data file. For categorical variables, create a numerical encoding of the labels for training data. Use the same encoders to map the test samples to the same numerical encodings.
Comment: No manual preprocessing is required

Flexynesis performs all data cleaning internally as part of its pipeline.
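To make the "fit on train, apply to test" idea behind median imputation and standard scaling more concrete, here is a minimal, dependency-free Python sketch for a single feature. It is only an illustration under simplifying assumptions, not the Flexynesis implementation; the helper names `fit_feature` and `transform_feature` and the toy values are hypothetical.

```python
import math
import statistics

def fit_feature(values):
    """Learn the median, mean and standard deviation from one training feature."""
    observed = [v for v in values if not math.isnan(v)]
    median = statistics.median(observed)
    filled = [median if math.isnan(v) else v for v in values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled) or 1.0  # avoid dividing by zero for constant features
    return {"median": median, "mean": mean, "std": std}

def transform_feature(values, params):
    """Impute with the training median, then scale with the training mean and std."""
    filled = [params["median"] if math.isnan(v) else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]

# Toy values for one feature; NaN marks a missing measurement
train = [1.0, 2.0, float("nan"), 3.0]
test = [float("nan"), 4.0]

params = fit_feature(train)                    # statistics come from training data only
train_scaled = transform_feature(train, params)
test_scaled = transform_feature(test, params)  # test data reuses the training statistics
print(params)
print(test_scaled)
```

Reusing the training statistics for the test data keeps information about the test samples out of the preprocessing step.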

Now it’s time to train a model using Flexynesis:

  • We choose whether we want to concatenate the data matrices (early integration) or not (intermediate integration) before running them through the neural networks.
  • We want to apply feature selection and keep only the top 10% of the features. In the end, we want to keep at least 100 features per omics layer.
  • We apply a variance threshold (for simplicity of demonstration, we want to keep only a few of the most variable features). Setting this to 80 will remove the 80% of features with the lowest variation from each modality.
  • We choose which model architecture to use:
    • DirectPred: a fully connected network (standard multilayer perceptron) with supervisor heads (one MLP for each target variable)
  • target_variables: Column headers in clinical data.
  • hpo_iter: How many hyperparameter search steps to implement:
    • This example runs 1 hyperparameter search step using the DirectPred architecture and a hyperparameter configuration space defined for “DirectPred”, with a supervisor head for the “CLAUDIN_SUBTYPE” variable. At the end of the parameter optimization, the best model will be selected and returned.
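The interplay of the variance threshold, the top percentile, and the minimum number of features can be worked through with a little arithmetic. The sketch below encodes one plausible reading of the options described above (variance filter first, then the top percentile, never fewer than the minimum unless fewer are available). The exact rounding and ordering inside Flexynesis may differ, and the function name and feature counts are hypothetical.

```python
import math

def features_kept(n_features, variance_drop_pct, top_percentile, min_features):
    """Estimate how many features survive filtering, under the stated assumptions."""
    # Step 1: drop the given percentage of lowest-variance features
    after_variance = n_features - math.floor(n_features * variance_drop_pct / 100)
    # Step 2: keep the top percentile of what remains
    by_percentile = math.ceil(after_variance * top_percentile / 100)
    # Step 3: keep at least min_features, but never more than are available
    return min(after_variance, max(by_percentile, min_features))

# Hypothetical layer with 20,000 features: 4,000 pass the variance filter, 10% of those is 400
print(features_kept(20000, variance_drop_pct=80, top_percentile=10.0, min_features=100))
# Hypothetical small layer with 500 features: 10% of the remaining 100 is below the minimum
print(features_kept(500, variance_drop_pct=80, top_percentile=10.0, min_features=100))
```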

Training a model longer than needed causes it to overfit and yields worse validation performance. It also takes longer, which matters when we have to run a long hyperparameter optimization routine of, say, more than 100 steps rather than just 1.

It is possible to set early stopping criteria in Flexynesis, which is basically a simple callback that is handled by PyTorch Lightning. This is regulated using the early_stop_patience parameter. When set to e.g. 10, the training will stop if the validation loss has not improved in the last 10 epochs.
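The patience rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the PyTorch Lightning callback itself (which offers further options); the `stopping_epoch` helper and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss                      # new best validation loss
            epochs_without_improvement = 0   # reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value (0.60) is reached at epoch 2
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63]
print(stopping_epoch(losses, patience=3))   # 5: three epochs in a row without improvement
print(stopping_epoch(losses, patience=10))  # None: patience is never exhausted
```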

Hands On: Flexynesis
  1. Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Type of Analysis”: Supervised training
      • param-file “Training clinical data”: train_clin_brca.tabular
      • param-file “Test clinical data”: test_clin_brca.tabular
      • param-file “Training omics data”: train_gex_brca.tabular
      • param-file “Test omics data”: test_gex_brca.tabular
      • “What type of assay is your input?”: gex
      • In “Multiple omics layers?”:
        • param-repeat “Insert Multiple omics layers?”
          • param-file “Training omics data”: train_cna_brca.tabular
          • param-file “Test omics data”: test_cna_brca.tabular
          • “What type of assay is your input?”: cna
      • “Model class”: DirectPred
      • “Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)
      • In “Advanced Options”:
        • “Variance threshold (as percentile) to drop low variance features.”: 0.8
        • “Minimum number of features to retain after feature selection.”: 100
        • “Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0
        • “Number of iterations for hyperparameter optimization.”: 1
      • In “Visualization”:
        • “Generate embeddings plot?”: Yes
        • “Generate precision-recall curves plot?”: Yes
Question
  1. What are the main outputs of Flexynesis?
  2. What do the generated plots show?
  1. The results collection contains the following data:
    • job.embeddings_test (latent space of test data)
    • job.embeddings_train (latent space of train data)
    • job.feature_importance.GradientShap (feature importance calculated by GradientShap method)
    • job.feature_importance.IntegratedGradients (feature importance calculated by IntegratedGradients method)
    • job.feature_logs.cna
    • job.feature_logs.gex
    • job.predicted_labels (Prediction of the created model)
    • job.stats
  2. There are three figures in the plot collection:
    • A PCA plot of the Embeddings colored by known, true subtypes

Figure 1: PCA plot - colored by known subtypes
  • A PCA plot of the Embeddings colored by predicted subtypes

Figure 2: PCA plot - colored by predicted subtypes

We can see that the predicted subtypes closely match the known subtypes.

  • A precision-recall curves plot, showing the accuracy of the prediction

Figure 3: PR-curve plot - showing prediction accuracy

This plot also confirms that most of the subtypes were classified with good precision. The exception is Class 4 (LumB): in the PCA plot you can see, for example, that some Her2 samples were classified as LumB.
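As a reminder of what precision and recall measure for a single subtype, here is a small Python sketch on made-up labels. It computes one precision/recall pair from hard labels, whereas a precision-recall curve traces these values across probability thresholds. The labels and the `precision_recall` helper are hypothetical and are not taken from the Flexynesis output.

```python
def precision_recall(known, predicted, positive_class):
    """Compute precision and recall for one class from known and predicted labels."""
    pairs = list(zip(known, predicted))
    tp = sum(k == positive_class and p == positive_class for k, p in pairs)
    fp = sum(k != positive_class and p == positive_class for k, p in pairs)
    fn = sum(k == positive_class and p != positive_class for k, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: one Her2 sample is wrongly predicted as LumB
known = ["LumA", "LumB", "Her2", "LumB", "Basal", "Her2"]
predicted = ["LumA", "LumB", "LumB", "LumB", "Basal", "Her2"]

print(precision_recall(known, predicted, "LumB"))  # precision drops because of the false positive
print(precision_recall(known, predicted, "Her2"))  # recall drops because one Her2 sample is missed
```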

Generate UMAP plot of the embeddings

To generate a UMAP plot, we need to extract the embeddings and the predicted labels table.

Hands On: Extract embeddings and prediction
  1. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: job.embeddings_test
  2. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: job.predicted_labels
Question
  1. What information do we have in predicted labels?
  1. It is a tabular file which contains the probability of each subtype for each sample and a final predicted label based on the probability.

As you have seen in the predicted labels, each sample has a prediction value for each subtype. For the UMAP plot we only need the final predicted label.

Hands On: Remove probabilities (duplicates)
  1. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • “Sort Query”: job.predicted_labels (output of Extract dataset tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: Column: 1
        • “Flavor”: Alphabetical sort
    • “Output unique values”: Yes
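The effect of sorting on the sample column and keeping unique values can be sketched as follows. The column layout used here (sample ID, class label, probability, known label, predicted label) is a simplified, hypothetical stand-in for the real predicted labels table, and `one_row_per_sample` is an illustrative helper, not a Galaxy tool.

```python
def one_row_per_sample(rows):
    """Sort rows by sample ID and keep only the first row seen for each sample."""
    first_seen = {}
    for row in sorted(rows, key=lambda r: r[0]):  # stable sort on the sample ID column
        first_seen.setdefault(row[0], row)
    return list(first_seen.values())

# Hypothetical rows: one line per sample and per subtype probability
rows = [
    ("S2", "LumA", 0.10, "LumB", "LumB"),
    ("S1", "LumA", 0.85, "LumA", "LumA"),
    ("S2", "LumB", 0.90, "LumB", "LumB"),
    ("S1", "LumB", 0.15, "LumA", "LumA"),
]

unique = one_row_per_sample(rows)
# The known and predicted labels are identical in every row of a sample, so any row will do
labels = {row[0]: (row[3], row[4]) for row in unique}
print(labels)
```

Because the known and predicted labels repeat on every row of a sample, dropping the extra rows loses only the per-class probabilities.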

And finally the UMAP plot.

Hands On: UMAP plot
  1. Flexynesis plot ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis plot”: Dimensionality reduction
      • param-file “Predicted labels”: table (output of Sort tool)
      • param-file “Embeddings”: job.embeddings_test (output of Extract dataset tool)
      • “Column in the labels file to use for coloring the points in the plot”: Column: 5 (known_label)
      • “Transformation method”: UMAP
Question
  1. What does the generated plot show?

Figure 4: UMAP plot - colored by known subtypes

This is a UMAP representation of the embeddings, which shows a better separation between subtypes.

Longer training

In reality, hyperparameter optimization should run for multiple steps so that the parameter search space is large enough to find a good set. However, for demonstration purposes, we only run it for 5 steps here:

  • “Number of iterations for hyperparameter optimization.”: 5
Hands On: Flexynesis
  1. Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Type of Analysis”: Supervised training
      • param-file “Training clinical data”: train_clin_brca.tabular
      • param-file “Test clinical data”: test_clin_brca.tabular
      • param-file “Training omics data”: train_gex_brca.tabular
      • param-file “Test omics data”: test_gex_brca.tabular
      • “What type of assay is your input?”: gex
      • In “Multiple omics layers?”:
        • param-repeat “Insert Multiple omics layers?”
          • param-file “Training omics data”: train_cna_brca.tabular
          • param-file “Test omics data”: test_cna_brca.tabular
          • “What type of assay is your input?”: cna
      • “Model class”: DirectPred
      • “Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)
      • In “Advanced Options”:
        • “Variance threshold (as percentile) to drop low variance features.”: 0.8
        • “Minimum number of features to retain after feature selection.”: 100
        • “Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0
        • “Number of iterations for hyperparameter optimization.”: 5
      • In “Visualization”:
        • “Generate embeddings plot?”: Yes
        • “Generate precision-recall curves plot?”: Yes
Question
  1. What do the generated plots show?
  1. There are three figures in the plot collection:
    • A PCA plot of the Embeddings colored by known, true subtypes

Figure 5: PCA plot - colored by known subtypes
  • A PCA plot of the Embeddings colored by predicted subtypes

Figure 6: PCA plot - colored by predicted subtypes

We can see that the predicted subtypes closely match the known subtypes.

  • A precision-recall curves plot, showing the accuracy of the prediction

Figure 7: PR-curve plot - showing prediction accuracy

This plot also confirms that most of the subtypes were classified with good precision. The precision has also increased (compare Class 4 = 0.23 with 1 iteration to Class 4 = 0.41 with 5 iterations).

Generate UMAP plot of the embeddings

To generate a UMAP plot, we need to extract the embeddings and the predicted labels table.

Hands On: Extract embeddings and prediction
  1. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: job.embeddings_test
  2. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: job.predicted_labels
Hands On: Remove probabilities (duplicates)
  1. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • “Sort Query”: job.predicted_labels (output of Extract dataset tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: Column: 1
        • “Flavor”: Alphabetical sort
    • “Output unique values”: Yes

And finally the UMAP plot.

Hands On: UMAP plot
  1. Flexynesis plot ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis plot”: Dimensionality reduction
      • param-file “Predicted labels”: table (output of Sort tool)
      • param-file “Embeddings”: job.embeddings_test (output of Extract dataset tool)
      • “Column in the labels file to use for coloring the points in the plot”: Column: 5 (known_label)
      • “Transformation method”: UMAP
Question
  1. What does the generated plot show?

Figure 8: UMAP plot - colored by known subtypes

This is a UMAP representation of the embeddings, which shows a better separation between subtypes.

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" (CYOT) section (also known as "Choose Your Own Analysis" (CYOA)), where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

We have another tool for the classification task called TabPFN which is a foundation model for tabular data. We can repeat the classification task using TabPFN and compare it with Flexynesis. Please note that TabPFN is slow on CPU and it will take a lot of time to finish on this dataset. What do you say? Should we continue with TabPFN?

Conclusion

In this tutorial, we demonstrated how Flexynesis can model breast cancer subtypes from multi-omics data using a deep learning framework. The tool handles preprocessing, feature selection, and latent space learning automatically, making it efficient and robust.

By training on METABRIC breast cancer data, we:

  • Visualized meaningful subtype separation in the latent space

  • Identified subtype-specific genes

  • Showed how learned features relate to known clinical labels

Flexynesis provides an accessible way to explore complex omics datasets and uncover biological structure without extensive manual tuning.

Alright, let’s see how TabPFN can predict the subtypes.

Prepare data for TabPFN

Currently, TabPFN supports up to 10,000 samples and 500 features (genes) per table. We will filter our data by gene variance and use the 500 most variable genes as input for TabPFN.

The train and test tables for TabPFN should be transposed so that samples are in rows and genes are in columns. The last column should contain the labels, and the train and test data should have the same set of features in the same order.

Since TabPFN does not support data integration, we should try gene expression and copy number alteration data separately.
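The variance-based gene filter can be sketched in plain Python before going through the Galaxy steps. This is only an illustration of the idea on a toy matrix, with hypothetical gene names and a hypothetical `top_variable_genes` helper; the tutorial itself uses Galaxy tools and keeps 500 genes rather than 2.

```python
import statistics

def top_variable_genes(matrix, k):
    """Return the names of the k highest-variance genes, sorted alphabetically.

    `matrix` maps a gene name to its list of values across training samples.
    """
    ranked = sorted(matrix, key=lambda gene: statistics.variance(matrix[gene]), reverse=True)
    return sorted(ranked[:k])  # alphabetical order gives train and test the same gene order

# Toy training matrix: genes in rows, samples in columns
train = {
    "GENE_A": [1.0, 1.0, 1.0],   # constant, variance 0
    "GENE_B": [0.0, 5.0, 10.0],
    "GENE_C": [2.0, 3.0, 4.0],
    "GENE_D": [0.0, 0.0, 9.0],
}

selected = top_variable_genes(train, k=2)
print(selected)
```

Selecting the genes on the training matrix only, and then reusing that list for the test matrix, keeps both tables aligned on the same features.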

CNA data

First, let’s filter the CNA data by variance.

Hands On: Prepare CNA data
Comment: Short explanation of steps:

Here we will:

  • Calculate the variance of each gene in train matrix
  • Add the variance back to the matrix
  • Sort the matrix by variance in descending order
  • Filter the matrix by top 500 genes
  • Sort the matrix by gene name
  1. Table Compute ( Galaxy version 1.2.4+galaxy2) with the following parameters:
    • “Input Single or Multiple Tables”: Single Table
      • param-file “Table”: train_cna_brca.tabular
      • “Type of table operation”: Compute expression across rows or columns
        • “Calculate”: Variance
        • “For each”: Row
  2. Join two Datasets with the following parameters:
    • param-file “Join”: table (output of Table Compute tool)
    • “using column”: Column: 1
    • param-file “with”: train_cna_brca.tabular
    • “and column”: Column: 1
    • “Fill empty columns”: No
    • “Keep the header lines”: Yes
  3. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Sort Query”: table (output of Join two Datasets tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: Column: 2
        • “in”: Descending order
        • “Flavor”: Fast numeric sort (-n)
  4. Select first with the following parameters:
    • “Select first”: 500
    • param-file “from”: table (output of Sort tool)
    • “Dataset has a header”: Yes
  5. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: table (output of Select first tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1, Column: 2
  6. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Sort Query”: table (output of Advanced Cut tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: Column: 1
        • “Flavor”: Alphabetical sort
  7. Rename the output file to train_cna_brca_500gene.tabular
Comment: Short explanation of steps:

Here we will:

  • Extract the list of genes from the train matrix
  • Filter the test data by extracted genes
  • Sort the matrix by gene name
  • Transpose both train and test data
  1. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: train_cna_brca_500gene.tabular (output of Sort tool)
    • “Operation”: Keep
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1
  2. Join ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “1st file”: table (output of Advanced Cut tool)
    • “Column to use from 1st file”: Column: 1
    • param-file “2nd File”: test_cna_brca.tabular
    • “Column to use from 2nd file”: Column: 1
    • “First line is a header line”: Yes
  3. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Sort Query”: output (output of Join tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: Column: 1
        • “Flavor”: Alphabetical sort
  4. Rename the output file to test_cna_brca_500gene.tabular

  5. Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:
    • param-file “Input tabular dataset”: train_cna_brca_500gene.tabular (output of Sort tool)
  6. Rename the output file to train_cna_brca_500gene_transposed.tabular

  7. Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:
    • param-file “Input tabular dataset”: test_cna_brca_500gene.tabular (output of Sort tool)
  8. Rename the output file to test_cna_brca_500gene_transposed.tabular
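The filtering and transposing steps above amount to the following, shown here as a minimal Python sketch on a toy test matrix. The gene names, sample IDs, and the `subset_and_transpose` helper are hypothetical. Unlike an inner join, which would silently drop absent genes, this sketch raises an error when a training gene is missing.

```python
def subset_and_transpose(matrix, genes, sample_ids):
    """Keep only `genes` (in that order) and return a table with samples in rows."""
    missing = [gene for gene in genes if gene not in matrix]
    if missing:
        raise KeyError(f"genes absent from matrix: {missing}")
    header = ["sample_id"] + list(genes)
    rows = [[sid] + [matrix[gene][i] for gene in genes] for i, sid in enumerate(sample_ids)]
    return [header] + rows

# Genes selected on the training data (already sorted by name)
train_genes = ["GENE_B", "GENE_D"]

# Toy test matrix: genes in rows, samples in columns
test_matrix = {
    "GENE_A": [1.0, 1.5],
    "GENE_B": [4.0, 6.0],
    "GENE_D": [0.0, 8.0],
}

test_table = subset_and_transpose(test_matrix, train_genes, ["T1", "T2"])
for row in test_table:
    print(row)
```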
Comment: Short explanation of steps:

Here we will:

  • Extract sample_id and CLAUDIN_SUBTYPE from the train and test clinical data
  • Add the subtype to the train and test matrix
  • And finally remove the sample_id from the matrices.
  1. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: train_clin_brca.tabular (Input dataset)
    • “Operation”: Keep
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1, Column: 16
  2. Rename the output to Train annotation

  3. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: test_clin_brca.tabular (Input dataset)
    • “Operation”: Keep
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1, Column: 16
  4. Rename the output to Test annotation

  5. Join ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “1st file”: train_cna_brca_500gene_transposed.tabular (output of Transpose tool)
    • “Column to use from 1st file”: Column: 1
    • param-file “2nd File”: Train annotation (output of Advanced Cut tool)
    • “Column to use from 2nd file”: Column: 1
    • “First line is a header line”: Yes
  6. Rename the output to Annotated train matrix cna

  7. Join ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “1st file”: test_cna_brca_500gene_transposed.tabular (output of Transpose tool)
    • “Column to use from 1st file”: Column: 1
    • param-file “2nd File”: Test annotation (output of Advanced Cut tool)
    • “Column to use from 2nd file”: Column: 1
    • “First line is a header line”: Yes
  8. Rename the output to Annotated test matrix

  9. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: Annotated train matrix (output of Join tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1
  10. Rename the output to TabPFN ready train data - CNA

  11. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: Annotated test matrix (output of Join tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1
  12. Rename the output to TabPFN ready test data - CNA
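The annotation steps above (join on the sample ID, then drop the sample ID column) can be summarized with this small Python sketch. It assumes the join keeps only samples present in both tables; the table contents and the `attach_labels` helper are hypothetical.

```python
def attach_labels(table, labels, label_name="CLAUDIN_SUBTYPE"):
    """Append the label as the last column and drop the leading sample ID column."""
    header, *rows = table
    out = [header[1:] + [label_name]]
    for row in rows:
        sample_id = row[0]
        if sample_id in labels:  # keep only samples that have a label
            out.append(row[1:] + [labels[sample_id]])
    return out

# Toy transposed matrix: samples in rows, genes in columns
table = [
    ["sample_id", "GENE_B", "GENE_D"],
    ["T1", 4.0, 0.0],
    ["T2", 6.0, 8.0],
    ["T3", 5.0, 1.0],  # no label available, so this sample is dropped
]
labels = {"T1": "LumA", "T2": "Basal"}

ready = attach_labels(table, labels)
for row in ready:
    print(row)
```

The result has genes as columns and the subtype as the last column, which is the layout described for the TabPFN input.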

Now the CNA data is ready for TabPFN. Let’s do the same for GEX!

GEX data

Hands On: Prepare GEX data
Comment: Short explanation of steps:

Here we will:

  • Calculate the variance of each gene in train matrix
  • Add the variance back to the matrix
  • Sort the matrix by variance in descending order
  • Filter the matrix by top 500 genes
  • Sort the matrix by gene name
  1. Table Compute ( Galaxy version 1.2.4+galaxy2) with the following parameters:
    • “Input Single or Multiple Tables”: Single Table
      • param-file “Table”: train_gex_brca.tabular
      • “Type of table operation”: Compute expression across rows or columns
        • “Calculate”: Variance
        • “For each”: Row
  2. Join two Datasets with the following parameters:
    • param-file “Join”: table (output of Table Compute tool)
    • “using column”: Column: 1
    • param-file “with”: train_gex_brca.tabular
    • “and column”: Column: 1
    • “Fill empty columns”: No
    • “Keep the header lines”: Yes
  3. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Sort Query”: table (output of Join two Datasets tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: Column: 2
        • “in”: Descending order
        • “Flavor”: Fast numeric sort (-n)
  4. Select first with the following parameters:
    • “Select first”: 500
    • param-file “from”: table (output of Sort tool)
    • “Dataset has a header”: Yes
  5. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: table (output of Select first tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1, Column: 2
  6. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Sort Query”: table (output of Advanced Cut tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: Column: 1
        • “Flavor”: Alphabetical sort
  7. Rename the output file to train_gex_brca_500gene.tabular
Comment: Short explanation of steps:

Here we will:

  • Extract the list of genes from the train matrix
  • Filter the test data by extracted genes
  • Sort the matrix by gene name
  • Transpose both train and test data
  1. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: train_gex_brca_500gene.tabular (output of Sort tool)
    • “Operation”: Keep
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1
  2. Join ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “1st file”: table (output of Advanced Cut tool)
    • “Column to use from 1st file”: Column: 1
    • param-file “2nd File”: test_gex_brca.tabular
    • “Column to use from 2nd file”: Column: 1
    • “First line is a header line”: Yes
  3. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Sort Query”: output (output of Join tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: Column: 1
        • “Flavor”: Alphabetical sort
  4. Rename the output file to test_gex_brca_500gene.tabular

  5. Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:
    • param-file “Input tabular dataset”: train_gex_brca_500gene.tabular (output of Sort tool)
  6. Rename the output file to train_gex_brca_500gene_transposed.tabular

  7. Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:
    • param-file “Input tabular dataset”: test_gex_brca_500gene.tabular (output of Sort tool)
  8. Rename the output file to test_gex_brca_500gene_transposed.tabular
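The same sequence (restrict the test matrix to the genes kept in the train matrix, sort by gene name, then transpose so that samples become rows) can be sketched in plain Python. The helper `restrict_and_transpose` and the toy table are hypothetical, and the sketch assumes a genes-by-samples table whose first column holds the gene name.

```python
def restrict_and_transpose(header, rows, keep_genes):
    """Filter rows to keep_genes, sort by gene name, and transpose.

    header: ["gene", sample_id, ...]
    rows:   [[gene_name, value, ...], ...]
    """
    keep = set(keep_genes)
    # Keep only genes present in the train matrix (an inner join on gene name).
    kept = sorted((row for row in rows if row[0] in keep), key=lambda r: r[0])
    # Transpose: each output row is one original column, so samples become rows.
    return [list(column) for column in zip(header, *kept)]


# Toy test matrix with made-up values; TP53 is absent from the train gene list.
train_genes = ["ESR1", "GATA3"]
test_header = ["gene", "S1", "S2"]
test_rows = [["TP53", 1.0, 1.2], ["GATA3", 2.0, 4.0], ["ESR1", 0.1, 5.0]]

transposed = restrict_and_transpose(test_header, test_rows, train_genes)
for row in transposed:
    print(row)
```

Sorting both matrices by gene name before transposing keeps the feature columns in the same order in the train and test data.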
Comment: Short explanation of steps:

Here we will:

  • Extract sample_id and CLAUDIN_SUBTYPE from the train and test clinical data
  • Add the subtype to the train and test matrices
  • Finally, remove the sample_id column from both matrices
  1. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: train_clin_brca.tabular (Input dataset)
    • “Operation”: Keep
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1, Column: 16
  2. Rename the output to Train annotation

  3. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: test_clin_brca.tabular (Input dataset)
    • “Operation”: Keep
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1, Column: 16
  4. Rename the output to Test annotation

  5. Join ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “1st file”: train_gex_brca_500gene_transposed.tabular (output of Transpose tool)
    • “Column to use from 1st file”: Column: 1
    • param-file “2nd File”: Train annotation (output of Advanced Cut tool)
    • “Column to use from 2nd file”: Column: 1
    • “First line is a header line”: Yes
  6. Rename the output to Annotated train matrix - gex

  7. Join ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “1st file”: test_gex_brca_500gene_transposed.tabular (output of Transpose tool)
    • “Column to use from 1st file”: Column: 1
    • param-file “2nd File”: Test annotation (output of Advanced Cut tool)
    • “Column to use from 2nd file”: Column: 1
    • “First line is a header line”: Yes
  8. Rename the output to Annotated test matrix - gex

  9. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: Annotated train matrix - gex (output of Join tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1
  10. Rename the output to TabPFN ready train data - GEX

  11. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: Annotated test matrix - gex (output of Join tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: Column: 1
  12. Rename the output to TabPFN ready test data - GEX
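As an illustration of the annotation steps above, the following Python sketch joins a samples-by-genes matrix with a sample-to-subtype mapping and then drops the sample identifier, leaving the label as the last column. The function `attach_labels`, the toy values, and the inner-join behaviour (samples without a label are dropped) are assumptions of this sketch rather than a description of the Galaxy Join tool.

```python
def attach_labels(matrix, annotation, label_name):
    """Append a label column to each sample row and drop the sample id.

    matrix:     header row first, then [sample_id, feature, ...] rows
    annotation: dict mapping sample_id -> label
    """
    header, *rows = matrix
    # New header: feature names only, followed by the label column.
    out = [header[1:] + [label_name]]
    for sample_id, *features in rows:
        # Inner join (assumed): skip samples that have no annotation.
        if sample_id in annotation:
            out.append(features + [annotation[sample_id]])
    return out


# Toy transposed matrix and annotation with made-up values.
matrix = [
    ["sample_id", "ESR1", "GATA3"],
    ["S1", 0.1, 2.0],
    ["S2", 5.0, 4.0],
    ["S3", 0.2, 1.0],  # no annotation available for S3
]
annotation = {"S1": "Basal", "S2": "LumA"}

ready = attach_labels(matrix, annotation, "CLAUDIN_SUBTYPE")
for row in ready:
    print(row)
```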

Now it is time to run TabPFN for CNA and GEX data.

TabPFN

Warning: High computation time

TabPFN takes a long time to run the classification task on CPU (about 7 hours with our data). You can instead import the outputs (datasets 55, 56, 59, and 60) from this archived history.

Hands On: TabPFN on CNA
  1. Tabular data prediction using TabPFN ( Galaxy version 2.0.9+galaxy0) with the following parameters:
    • “Select a machine learning task”: Classification
    • param-file “Train data”: TabPFN ready train data - CNA (output of Advanced Cut tool)
    • param-file “test data”: TabPFN ready test data - CNA (output of Advanced Cut tool)
    • “Does test data contain labels?”: Yes
Hands On: TabPFN on GEX
  1. Tabular data prediction using TabPFN ( Galaxy version 2.0.9+galaxy0) with the following parameters:
    • “Select a machine learning task”: Classification
    • param-file “Train data”: TabPFN ready train data - GEX (output of Advanced Cut tool)
    • param-file “test data”: TabPFN ready test data - GEX (output of Advanced Cut tool)
    • “Does test data contain labels?”: Yes

To make the comparison between TabPFN and Flexynesis fair, we should run Flexynesis on the same 500-feature GEX and CNA data, each separately.

Hands On: Flexynesis on CNA
  1. Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Type of Analysis”: Supervised training
      • param-file “Training clinical data”: train_clin_brca.tabular
      • param-file “Test clinical data”: test_clin_brca.tabular
      • param-file “Training omics data”: train_cna_brca_500gene.tabular (output of Advanced Cut tool)
      • param-file “Test omics data”: test_cna_brca_500gene.tabular (output of Advanced Cut tool)
      • “What type of assay is your input?”: cna
      • “Model class”: DirectPred
      • “Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)
      • In “Advanced Options”:
        • “Variance threshold (as percentile) to drop low variance features.”: 0.8
        • “Minimum number of features to retain after feature selection.”: 100
        • “Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0
        • “Number of iterations for hyperparameter optimization.”: 5
      • In “Visualization”:
        • “Generate embeddings plot?”: Yes
        • “Generate precision-recall curves plot?”: Yes
Hands On: Flexynesis on GEX
  1. Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Type of Analysis”: Supervised training
      • param-file “Training clinical data”: train_clin_brca.tabular
      • param-file “Test clinical data”: test_clin_brca.tabular
      • param-file “Training omics data”: train_gex_brca_500gene.tabular (output of Advanced Cut tool)
      • param-file “Test omics data”: test_gex_brca_500gene.tabular (output of Advanced Cut tool)
      • “What type of assay is your input?”: gex
      • “Model class”: DirectPred
      • “Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)
      • In “Advanced Options”:
        • “Variance threshold (as percentile) to drop low variance features.”: 0.8
        • “Minimum number of features to retain after feature selection.”: 100
        • “Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0
        • “Number of iterations for hyperparameter optimization.”: 5
      • In “Visualization”:
        • “Generate embeddings plot?”: Yes
        • “Generate precision-recall curves plot?”: Yes
Question
  1. Compare the precision-recall curve plots of the Flexynesis and TabPFN predictions. Which one is more accurate?
  1. TabPFN predicted the subtypes better with the CNA data, whereas Flexynesis predicted them better with the GEX data. Integrating both CNA and GEX data in Flexynesis improved the prediction further.
Figure 9: Precision-Recall curve plot of TabPFN on CNA

Figure 10: Precision-Recall curve plot of Flexynesis on CNA

Figure 11: Precision-Recall curve plot of TabPFN on GEX

Figure 12: Precision-Recall curve plot of Flexynesis on GEX

Figure 13: Precision-Recall curve plot of Flexynesis
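To help read these plots, here is a minimal sketch of how the points of a precision-recall curve can be computed for one subtype treated as the positive class (one-vs-rest). The function name, the toy scores, and the labels are made up for illustration; the sketch assumes no tied scores and at least one positive sample, and it is not the plotting code used by either tool.

```python
def precision_recall_points(scores, labels, positive):
    """Return (precision, recall) after each sample, ranked by score.

    scores: predicted probability of the `positive` class per sample
    labels: true class per sample
    """
    # Rank samples from the most to the least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positive = sum(1 for label in labels if label == positive)
    points = []
    true_positive = 0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == positive:
            true_positive += 1
        # Precision: fraction of the top-k calls that are correct.
        # Recall: fraction of all positives recovered in the top k.
        points.append((true_positive / k, true_positive / total_positive))
    return points


# Toy predictions for a hypothetical "LumA" vs rest comparison.
scores = [0.9, 0.8, 0.4, 0.2]
labels = ["LumA", "Basal", "LumA", "Basal"]
points = precision_recall_points(scores, labels, positive="LumA")
for precision, recall in points:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

A curve that stays close to precision 1.0 as recall increases indicates a classifier that ranks the true members of that subtype ahead of the other samples.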
Comment: Try Flexynesis with more iterations

In this training we used only 5 iterations of hyperparameter optimization to train our model. You can rerun the tool with a higher number of iterations and see how the prediction improves.
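The intuition behind trying more iterations can be sketched with a generic random search over a single hyperparameter. This is only a toy stand-in: Flexynesis's own optimization strategy is not reproduced here, and the objective function, the learning-rate range, and all names below are hypothetical. The sketch shows that the best score found so far can only stay the same or improve as more configurations are tried.

```python
import math
import random


def random_search(objective, n_iter, seed=0):
    """Try n_iter random learning rates and track the best score so far."""
    rng = random.Random(seed)
    best_score, best_lr = float("-inf"), None
    history = []
    for _ in range(n_iter):
        # Sample a learning rate log-uniformly between 1e-5 and 1e-1.
        lr = 10 ** rng.uniform(-5, -1)
        score = objective(lr)
        if score > best_score:
            best_score, best_lr = score, lr
        history.append(best_score)
    return best_lr, history


def toy_objective(lr):
    """Made-up validation score that peaks at lr = 1e-3."""
    return -abs(math.log10(lr) + 3)


_, short_run = random_search(toy_objective, n_iter=5)
_, long_run = random_search(toy_objective, n_iter=50)
print(f"best after 5 iterations:  {short_run[-1]:.3f}")
print(f"best after 50 iterations: {long_run[-1]:.3f}")
```

Note that a better score during the search does not guarantee a better score on the test data, so the improvement is worth checking on the test predictions.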

Conclusion

In this tutorial, we demonstrated how Flexynesis can model breast cancer subtypes from multi-omics data using a deep learning framework. The tool handles preprocessing, feature selection, and latent space learning automatically, which reduces the amount of manual data preparation.

By training on the METABRIC breast cancer data, we:

  • Visualized meaningful subtype separation in the latent space

  • Identified subtype-specific genes

  • Showed how learned features relate to known clinical labels

Flexynesis provides a fast and accessible way to explore complex omics datasets and uncover biological structure without extensive manual tuning.

We also compared Flexynesis with TabPFN. Although TabPFN predicted the subtypes better from the CNA data, it did not perform as well as Flexynesis on the GEX data. TabPFN also provides fewer outputs than Flexynesis, which additionally reports, for example, the integrated latent space embeddings, feature importance, and detailed information about the predictions.