Modeling Breast Cancer Subtypes with Flexynesis

Author(s): Amirhossein Naghsh Nilchi, Björn Grüning
Overview
Questions:

How can we model breast cancer subtypes using transcriptomics and genomic data?

What are the key expression patterns that distinguish different BRCA subtypes?

How can we interpret the learned features from a deep neural network classifier?

Objectives:

Apply Flexynesis to model and visualize BRCA subtypes

Interpret the learned representations from the DirectPred model

Use UMAP and clustering to explore learned features

Time estimation: 2 hours

Published: Aug 13, 2025

Last modification: Sep 2, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00554

Revision: 3

Flexynesis represents a state-of-the-art deep learning framework specifically designed for multi-modal data integration in biological research (Uyar et al. 2024). What sets Flexynesis apart is its comprehensive suite of deep learning architectures that can handle various data integration scenarios while providing robust feature selection and hyperparameter optimization.

Here, we use the Flexynesis tool suite on a multi-omics dataset of breast cancer samples from the METABRIC consortium (metabric community), one of the landmark breast cancer genomics studies available through cBioPortal (cbioPortal community). This dataset contains comprehensive molecular and clinical data from over 2,000 breast cancer patients, including gene expression profiles, copy number alterations, mutation data, and clinical outcomes. The data was downloaded from cBioPortal and randomly split into train (70% of the samples) and test (30% of the samples) data folders. The data files were processed to follow the same nomenclature.
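
The random 70/30 split described above can be sketched in plain Python as follows. This is only an illustration and not the script used to prepare the tutorial files: the function name `split_samples`, the seed, and the sample IDs are hypothetical, and the sketch assumes a simple shuffled split without stratification by subtype.

```python
import random


def split_samples(sample_ids, train_fraction=0.7, seed=42):
    """Randomly split sample IDs into a train list and a test list.

    Assumption: a plain shuffled split; the tutorial does not state
    whether the original split was stratified by subtype.
    """
    ids = sorted(sample_ids)      # sort first so the result is reproducible
    rng = random.Random(seed)     # local RNG, leaves the global state untouched
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical sample IDs, for illustration only
samples = [f"S{i:04d}" for i in range(1, 11)]
train_ids, test_ids = split_samples(samples)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the split repeatable, which matters when train and test files for several omics layers must contain exactly the same samples.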

This training is inspired by the original Flexynesis analysis notebook: brca_subtypes.ipynb.

Warning: LICENSE

Flexynesis is only available for NON-COMMERCIAL use. Permission is only granted for academic, research, and educational purposes. Before using, be sure to review, agree, and comply with the license. For commercial use, please review the Flexynesis license on GitHub and contact the copyright holders.

Agenda

In this tutorial, we will cover:

Get data

Classification task with Flexynesis (1 iteration)

Generate UMAP plot of the embeddings

Longer training

Generate UMAP plot of the embeddings

Conclusion

Prepare data for TabPFN

CNA data

GEX data

TabPFN

Conclusion

Get data

In this training we will use a multi-omics dataset from the METABRIC database, downloaded through cBioPortal and split into 70% train and 30% test samples.

The multi-omics data consists of three data types:

Gene Expression (GEX): Measures the activity levels of genes by quantifying mRNA abundance. This data captures which genes are being actively transcribed in each sample.
Copy Number Alterations (CNA): Detects genomic regions that have been duplicated or deleted compared to the normal genome. Cancer cells often have chromosomal instabilities leading to gene amplifications (gains) or deletions (losses), which can drive tumor development and progression.
Clinical data (CLIN): Contains patient metadata including demographics, treatment history, tumor characteristics, and, importantly for this tutorial, the breast cancer molecular subtypes (such as Luminal A, Luminal B, HER2-enriched, and Basal-like). These subtypes are based on the expression patterns of key biomarkers, including claudin proteins, which are important for cell adhesion and tumor behavior.

These three data modalities provide complementary views of the same tumor samples - the clinical data provides the biological context and classification labels, while the molecular data (GEX and CNA) capture the underlying genomic alterations that drive these clinical phenotypes.

Flexynesis tries to learn the patterns in these molecular alterations and connect them to the phenotype to predict cancer subtypes based on the genomic and transcriptomic profiles.

Hands On: Data Upload
Create a new history for this tutorial
Import the files from Zenodo or from the shared data library (GTN - Material -> statistics -> Modeling Breast Cancer Subtypes with Flexynesis):
https://zenodo.org/records/16287482/files/train_cna_brca.tabular
https://zenodo.org/records/16287482/files/train_gex_brca.tabular
https://zenodo.org/records/16287482/files/train_clin_brca.tabular
https://zenodo.org/records/16287482/files/test_cna_brca.tabular
https://zenodo.org/records/16287482/files/test_gex_brca.tabular
https://zenodo.org/records/16287482/files/test_clin_brca.tabular
Copy the link location

Click Upload at the top of the activity panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

Check that the datatype is tabular

Click on the pencil icon for the dataset to edit its attributes

In the central panel, click the Datatypes tab on the top

In the Assign Datatype section, select tabular from the “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Add to each dataset a tag corresponding to the modality (#gex, #cna, #clin)

Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

To tag a dataset:

Click on the dataset to expand it

Click on Add Tags

Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).

Press Enter

Check that the tag appears below the dataset name

Tags beginning with # are special!

They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;

dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);

datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;

datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

More information is in a dedicated #nametag tutorial.

Classification task with Flexynesis (1 iteration)

Flexynesis automatically handles key preprocessing steps such as:

Data processing:
- removing uninformative features (e.g. features with near-zero variance)
- removing samples with too many NA values
- removing features with too many NA values and imputing the remaining NA values with the median of each feature
Feature selection, performed only on the training data and separately for each omics layer:
- Features are sorted by Laplacian score
- Features that fall within the top_percentile are kept
- Highly redundant features are then removed (for a pair of highly correlated features, the one with the higher Laplacian score is kept)
Harmonizing the training data with the test data:
- Subset the test data features to those that are kept for the training data
Normalizing the datasets:
- Normalize the training data (standard scaling) and apply the same scaling factors to the test data
- Log transform the final matrices (optional)
Distinguishing numerical and categorical variables in the clinical data file: for categorical variables, create a numerical encoding of the labels from the training data, and use the same encoders to map the test samples to the same numerical encodings.
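
The key idea in several of these steps is that statistics are learned from the training data only and then reused for the test data. The following is a minimal sketch of that idea for median imputation, standard scaling, and label encoding. It is not the Flexynesis implementation: all function names and toy values are hypothetical, and `None` stands in for an NA value.

```python
import statistics


def fit_imputer_and_scaler(train_values):
    """Learn median, mean and standard deviation from one training feature."""
    observed = [v for v in train_values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in train_values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled) or 1.0   # guard against division by zero
    return {"median": median, "mean": mean, "std": std}


def transform(values, params):
    """Impute with the training median, then scale with training mean/std."""
    filled = [params["median"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]


def fit_label_encoder(train_labels):
    """Map each category seen in the training labels to an integer code."""
    return {label: code for code, label in enumerate(sorted(set(train_labels)))}


# Toy values for a single feature (hypothetical numbers)
train_feature = [1.0, None, 3.0, 5.0]
test_feature = [None, 7.0]

params = fit_imputer_and_scaler(train_feature)
train_scaled = transform(train_feature, params)
test_scaled = transform(test_feature, params)   # reuses the training statistics

encoder = fit_label_encoder(["LumA", "Basal", "LumA", "Her2"])
test_codes = [encoder[label] for label in ["Her2", "LumA"]]
print(train_scaled, test_scaled, test_codes)
```

Reusing the training parameters keeps information from the test samples out of the preprocessing, so the test set remains a fair estimate of performance on unseen data.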

Comment: No manual preprocessing is required

Flexynesis performs all data cleaning internally as part of its pipeline.

Now it’s time to train a model using Flexynesis:

We choose whether to concatenate the data matrices before running them through the neural networks (early integration) or not (intermediate integration).
We apply feature selection and keep only the top 10% of the features, while retaining at least 100 features per omics layer (the minimum set in the hands-on step below).
We apply a variance threshold (for simplicity of demonstration, we keep only the most variable features). Setting this to 80 removes the 80% of features with the lowest variance from each modality.
We choose which model architecture to use:
- DirectPred: a fully connected network (standard multilayer perceptron) with supervisor heads (one MLP for each target variable)
target_variables: Column headers in clinical data.
hpo_iter: How many hyperparameter search steps to implement:
- This example runs 1 hyperparameter search step using DirectPred architecture and a hyperparameter configuration space defined for “DirectPred” with a supervisor head for “CLAUDIN_SUBTYPE” variable. At the end of the parameter optimization, the best model will be selected and returned.
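
To make the interplay between the variance threshold, the top percentile, and the minimum number of features more concrete, here is a small sketch in plain Python. It is an illustration under stated assumptions, not the Flexynesis code: the function names and gene names are hypothetical, and the rule that the minimum overrides a smaller percentile count is an assumption based on the parameter descriptions.

```python
import math
import statistics


def drop_low_variance(features, percent_to_drop):
    """Drop the given percentage of features with the lowest variance.

    `features` maps a feature name to its values across samples.
    """
    ranked = sorted(features, key=lambda name: statistics.pvariance(features[name]))
    n_drop = math.floor(len(ranked) * percent_to_drop / 100)
    return {name: features[name] for name in ranked[n_drop:]}


def n_features_to_keep(n_remaining, top_percentile, min_features):
    """Number of features retained after feature selection.

    Assumption: the percentile-based count is raised to `min_features`
    when it is smaller, and never exceeds what is available.
    """
    by_percentile = math.ceil(n_remaining * top_percentile / 100)
    return min(n_remaining, max(by_percentile, min_features))


# Toy matrix with five hypothetical genes measured in three samples
toy = {
    "GENE_A": [1.0, 1.0, 1.0],
    "GENE_B": [1.0, 2.0, 3.0],
    "GENE_C": [0.0, 5.0, 10.0],
    "GENE_D": [2.0, 2.1, 2.2],
    "GENE_E": [0.0, 1.0, 0.0],
}
kept = drop_low_variance(toy, percent_to_drop=80)
print(sorted(kept))
print(n_features_to_keep(n_remaining=4000, top_percentile=10, min_features=100))
print(n_features_to_keep(n_remaining=500, top_percentile=10, min_features=100))
```

With 4000 remaining features the 10% rule decides the count, whereas with 500 remaining features the minimum of 100 takes over.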

Training a model for longer than needed causes it to overfit and yields worse validation performance. It also costs time, which adds up when we run a long hyperparameter optimization routine of, say, more than 100 steps rather than just 1.

Flexynesis supports early stopping, which is a simple callback handled by PyTorch Lightning. It is controlled by the early_stop_patience parameter. When set to e.g. 10, training stops if the validation loss has not improved in the last 10 epochs.
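
The patience rule can be illustrated with a few lines of plain Python. This is a simplified sketch of the general idea, not the PyTorch Lightning callback itself: the function name `epochs_until_stop` and the loss values are made up for the example.

```python
def epochs_until_stop(val_losses, patience):
    """Return how many epochs run before early stopping triggers.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. If that never happens, all epochs run.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses, one per epoch
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.59, 0.60, 0.61, 0.63]
print(epochs_until_stop(losses, patience=2))
print(epochs_until_stop(losses, patience=3))
```

A smaller patience stops training sooner but risks missing a later improvement, such as the one at epoch 6 in this toy series.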

Hands On: Flexynesis

Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“Type of Analysis”: Supervised training

param-file “Training clinical data”: train_clin_brca.tabular

param-file “Test clinical data”: test_clin_brca.tabular

param-file “Training omics data”: train_gex_brca.tabular

param-file “Test omics data”: test_gex_brca.tabular

“What type of assay is your input?”: gex

In “Multiple omics layers?”:

param-repeat “Insert Multiple omics layers?”

param-file “Training omics data”: train_cna_brca.tabular

param-file “Test omics data”: test_cna_brca.tabular

“What type of assay is your input?”: cna

“Model class”: DirectPred

“Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)

In “Advanced Options”:

“Variance threshold (as percentile) to drop low variance features.”: 0.8

“Minimum number of features to retain after feature selection.”: 100

“Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0

“Number of iterations for hyperparameter optimization.”: 1

In “Visualization”:

“Generate embeddings plot?”: Yes

“Generate precision-recall curves plot?”: Yes

Question

What are the main outputs of Flexynesis?

What do the generated plots show?

The results collection contains the following data:

job.embeddings_test (latent space of test data)

job.embeddings_train (latent space of train data)

job.feature_importance.GradientShap (feature importance calculated by GradientShap method)

job.feature_importance.IntegratedGradients (feature importance calculated by IntegratedGradients method)

job.feature_logs.cna

job.feature_logs.gex

job.predicted_labels (Prediction of the created model)

job.stats

There are three figures in plot collection:

A PCA plot of the Embeddings colored by known, true subtypes

Figure 1: PCA plot - colored by known subtypes

A PCA plot of the Embeddings colored by predicted subtypes

Figure 2: PCA plot - colored by predicted subtypes

We can see that the subtypes were predicted by the model with a good approximation.

A precision-recall curves plot, showing the accuracy of the prediction

Figure 3: PR-curve plot - showing prediction accuracy

This plot also confirms that most of the subtypes were classified with good precision. The exception is Class 4 (LumB): in the PCA plot, for example, some Her2 samples were classified as LumB.
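
As a reminder of what precision and recall measure, the sketch below computes both per class from a list of known and predicted labels. Note that a precision-recall curve is obtained by varying the probability threshold, whereas this sketch evaluates only the final predicted labels. The function name and the toy labels are hypothetical and do not come from the tutorial results.

```python
from collections import Counter


def precision_recall(known, predicted):
    """Compute per-class (precision, recall) from two label lists."""
    true_pos = Counter(k for k, p in zip(known, predicted) if k == p)
    pred_count = Counter(predicted)
    known_count = Counter(known)
    result = {}
    for label in sorted(set(known) | set(predicted)):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / known_count[label] if known_count[label] else 0.0
        result[label] = (precision, recall)
    return result


# Toy labels in which some samples are wrongly predicted as LumB
known = ["LumA", "LumA", "LumB", "Her2", "Her2", "Basal"]
predicted = ["LumA", "LumB", "LumB", "LumB", "Her2", "Basal"]
for label, (p, r) in precision_recall(known, predicted).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In this toy example LumB has perfect recall but low precision, because samples from other subtypes were assigned to it.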

Generate UMAP plot of the embeddings

To generate a UMAP plot, we need to extract the embeddings and the predicted labels table.

Hands On: Extract embeddings and prediction

Extract dataset with the following parameters:

param-file “Input List”: results (output of Flexynesis tool)

“How should a dataset be selected?”: Select by element identifier

“Element identifier:”: job.embeddings_test

Extract dataset with the following parameters:

param-file “Input List”: results (output of Flexynesis tool)

“How should a dataset be selected?”: Select by element identifier

“Element identifier:”: job.predicted_labels

Question

What information do we have in predicted labels?

It is a tabular file which contains the probability of each subtype for each sample and a final predicted label based on the probability.

As you have seen in predicted labels, each sample has a prediction value for each subtype. For UMAP plot we only need the final predicted label.

Hands On: Remove probabilities (duplicates)

Sort ( Galaxy version 9.5+galaxy2) with the following parameters:

“Sort Query”: job.predicted_labels (output of Extract dataset tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: Column: 1

“Flavor”: Alphabetical sort

“Output unique values”: Yes
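
For readers who want to see what this step does to the table, the sketch below reduces a long-format table to one row per sample in plain Python. It is an approximation of the Sort step, not the Galaxy tool itself, and the embedded table is a made-up excerpt: apart from known_label, the column names are assumptions.

```python
import csv
import io

# Hypothetical excerpt of a predicted-labels table. Column names other than
# known_label are assumptions made for this sketch.
TABLE = """sample_id\tvariable\tclass_label\tprobability\tknown_label\tpredicted_label
S1\tCLAUDIN_SUBTYPE\tLumA\t0.80\tLumA\tLumA
S1\tCLAUDIN_SUBTYPE\tLumB\t0.20\tLumA\tLumA
S2\tCLAUDIN_SUBTYPE\tLumA\t0.30\tLumB\tLumB
S2\tCLAUDIN_SUBTYPE\tLumB\t0.70\tLumB\tLumB
"""


def one_row_per_sample(text, key="sample_id"):
    """Keep the first row for each sample and return rows sorted by sample."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    first_rows = {}
    for row in reader:
        first_rows.setdefault(row[key], row)   # later duplicates are ignored
    return [first_rows[k] for k in sorted(first_rows)]


rows = one_row_per_sample(TABLE)
for row in rows:
    print(row["sample_id"], row["known_label"], row["predicted_label"])
```

The per-class probability in the surviving row is no longer meaningful on its own, which is fine here because only the label columns are used to color the plot.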

And finally the UMAP plot.

Hands On: UMAP plot

Flexynesis plot ( Galaxy version 0.2.20+galaxy3) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“Flexynesis plot”: Dimensionality reduction

param-file “Predicted labels”: table (output of Sort tool)

param-file “Embeddings”: job.embeddings_test (output of Extract dataset tool)

“Column in the labels file to use for coloring the points in the plot”: Column: 5 (known_label)

“Transformation method”: UMAP

Question

What does the generated plot show?

Figure 4: UMAP plot - colored by known subtypes

This is a UMAP representation of the embeddings, which shows a better separation between subtypes.

Longer training

In reality, hyperparameter optimization should run for multiple steps so that the parameter search space is large enough to find a good set. However, for demonstration purposes, we only run it for 5 steps here:

“Number of iterations for hyperparameter optimization.”: 5

Hands On: Flexynesis

Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“Type of Analysis”: Supervised training

param-file “Training clinical data”: train_clin_brca.tabular

param-file “Test clinical data”: test_clin_brca.tabular

param-file “Training omics data”: train_gex_brca.tabular

param-file “Test omics data”: test_gex_brca.tabular

“What type of assay is your input?”: gex

In “Multiple omics layers?”:

param-repeat “Insert Multiple omics layers?”

param-file “Training omics data”: train_cna_brca.tabular

param-file “Test omics data”: test_cna_brca.tabular

“What type of assay is your input?”: cna

“Model class”: DirectPred

“Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)

In “Advanced Options”:

“Variance threshold (as percentile) to drop low variance features.”: 0.8

“Minimum number of features to retain after feature selection.”: 100

“Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0

“Number of iterations for hyperparameter optimization.”: 5

In “Visualization”:

“Generate embeddings plot?”: Yes

“Generate precision-recall curves plot?”: Yes

Question

What do the generated plots show?

There are three figures in plot collection:

A PCA plot of the Embeddings colored by known, true subtypes

Figure 5: PCA plot - colored by known subtypes

A PCA plot of the Embeddings colored by predicted subtypes

Figure 6: PCA plot - colored by predicted subtypes

We can see that the subtypes were predicted by the model with a good approximation.

A precision-recall curves plot, showing the accuracy of the prediction

Figure 7: PR-curve plot - showing prediction accuracy

This plot also confirms that most of the subtypes were classified with good precision. The precision has also increased (compare Class 4 = 0.23 with 1 iteration to Class 4 = 0.41 with 5 iterations).

Generate UMAP plot of the embeddings

To generate a UMAP plot, we need to extract the embeddings and the predicted labels table.

Hands On: Extract embeddings and prediction

Extract dataset with the following parameters:

param-file “Input List”: results (output of Flexynesis tool)

“How should a dataset be selected?”: Select by element identifier

“Element identifier:”: job.embeddings_test

Extract dataset with the following parameters:

param-file “Input List”: results (output of Flexynesis tool)

“How should a dataset be selected?”: Select by element identifier

“Element identifier:”: job.predicted_labels

Hands On: Remove probabilities (duplicates)

Sort ( Galaxy version 9.5+galaxy2) with the following parameters:

“Sort Query”: job.predicted_labels (output of Extract dataset tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: Column: 1

“Flavor”: Alphabetical sort

“Output unique values”: Yes

And finally the UMAP plot.

Hands On: UMAP plot

Flexynesis plot ( Galaxy version 0.2.20+galaxy3) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“Flexynesis plot”: Dimensionality reduction

param-file “Predicted labels”: table (output of Sort tool)

param-file “Embeddings”: job.embeddings_test (output of Extract dataset tool)

“Column in the labels file to use for coloring the points in the plot”: Column: 5 (known_label)

“Transformation method”: UMAP

Question

What does the generated plot show?

Figure 8: UMAP plot - colored by known subtypes

This is a UMAP representation of the embeddings, which shows a better separation between subtypes.

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" (CYOT) section (also known as "Choose Your Own Analysis" (CYOA)), where you can select between two paths.

There is another tool for the classification task, TabPFN, which is a foundation model for tabular data. We can repeat the classification task using TabPFN and compare it with Flexynesis. Please note that TabPFN is slow on CPU and will take a long time to finish on this dataset. If you prefer to stop here, read the Conclusion below. Otherwise, continue with the TabPFN sections that follow it.

Conclusion

In this tutorial, we demonstrated how Flexynesis can model breast cancer subtypes from multi-omics data using a deep learning framework. The tool handles preprocessing, feature selection, and latent space learning automatically, which reduces the amount of manual work.

By training on METABRIC breast cancer data, we:

Visualized meaningful subtype separation in the latent space
Identified subtype-specific genes
Showed how learned features relate to known clinical labels

Flexynesis provides an accessible way to explore complex omics datasets and uncover biological structure without extensive manual tuning.

Alright, let’s see how TabPFN can predict the subtypes.

Prepare data for TabPFN

Currently, TabPFN supports up to 10,000 samples and 500 features (genes) in a tabular dataset. We will therefore filter our data by gene variance and use the 500 most variable genes as input for TabPFN.

The train and test tables for TabPFN must be transposed so that samples are in rows and genes are in columns. The last column must contain the labels, and the train and test data must have the same set of features in the same order.

Since TabPFN does not support data integration, we should try gene expression and copy number alteration data separately.
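
The target layout can be sketched in a few lines of Python. This is only an illustration of the table shape described above, not a replacement for the Galaxy steps that follow: the function name `to_tabpfn_table`, the gene values, and the sample IDs are hypothetical.

```python
def to_tabpfn_table(matrix, sample_ids, labels):
    """Turn a genes-by-samples matrix into samples-by-genes rows with a label.

    matrix: dict mapping gene name -> list of values, one per sample
    sample_ids: sample order matching the value lists
    labels: dict mapping sample ID -> subtype label
    """
    genes = sorted(matrix)                     # fixed, reproducible gene order
    header = genes + ["CLAUDIN_SUBTYPE"]       # label goes in the last column
    rows = []
    for i, sample in enumerate(sample_ids):
        rows.append([matrix[g][i] for g in genes] + [labels[sample]])
    return header, rows


# Hypothetical toy data with two genes and two samples
gex = {"ESR1": [9.1, 2.3], "ERBB2": [4.0, 8.7]}
samples = ["S1", "S2"]
subtypes = {"S1": "LumA", "S2": "Her2"}

header, rows = to_tabpfn_table(gex, samples, subtypes)
print(header)
print(rows)
```

Sorting the gene names is one simple way to guarantee that the train and test tables end up with their columns in the same order.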

CNA data

First, let’s filter the CNA data by variance.

Hands On: Prepare CNA data

Comment: Short explanation of steps:

Here we will:

Calculate the variance of each gene in train matrix

Add the variance back to the matrix

Sort the matrix by variance in descending order

Filter the matrix by top 500 genes

Sort the matrix by gene name

Table Compute ( Galaxy version 1.2.4+galaxy2) with the following parameters:

“Input Single or Multiple Tables”: Single Table

param-file “Table”: train_cna_brca.tabular

“Type of table operation”: Compute expression across rows or columns

“Calculate”: Variance

“For each”: Row

Join two Datasets with the following parameters:

param-file “Join”: table (output of Table Compute tool)

“using column”: Column: 1

param-file “with”: train_cna_brca.tabular

“and column”: Column: 1

“Fill empty columns”: No

“Keep the header lines”: Yes

Sort ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “Sort Query”: table (output of Join two Datasets tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: Column: 2

“in”: Descending order

“Flavor”: Fast numeric sort (-n)

Select first with the following parameters:

“Select first”: 500

param-file “from”: table (output of Sort tool)

“Dataset has a header”: Yes

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: table (output of Select first tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1, Column: 2

Sort ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “Sort Query”: table (output of Advanced Cut tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: Column: 1

“Flavor”: Alphabetical sort

Rename the output file to train_cna_brca_500gene.tabular
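
The logic of the steps above (compute the variance per gene, keep the most variable genes, then sort by gene name) can be summarized in a short Python sketch. It assumes the sample variance is used for ranking, and the function name and toy copy number values are hypothetical.

```python
import statistics


def top_variable_genes(matrix, k):
    """Return the names of the k highest-variance genes, sorted by name.

    matrix: dict mapping gene name -> list of values across samples
    """
    ranked = sorted(matrix, key=lambda g: statistics.variance(matrix[g]), reverse=True)
    return sorted(ranked[:k])


# Hypothetical copy number values for four genes in four samples
cna = {
    "TP53": [0, -1, 0, -1],
    "MYC": [2, 0, 1, 2],
    "ERBB2": [2, 2, -2, 0],
    "GATA3": [0, 0, 0, 0],
}
print(top_variable_genes(cna, k=2))
```

In the tutorial k is 500, which matches the feature limit stated for TabPFN.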

Comment: Short explanation of steps:

Here we will:

Extract the list of genes from the train matrix

Filter the test data by extracted genes

Sort the matrix by gene name

Transpose both train and test data

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: train_cna_brca_500gene.tabular (output of Sort tool)

“Operation”: Keep

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1

Join ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “1st file”: table (output of Advanced Cut tool)

“Column to use from 1st file”: Column: 1

param-file “2nd File”: test_cna_brca.tabular

“Column to use from 2nd file”: Column: 1

“First line is a header line”: Yes

Sort ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “Sort Query”: output (output of Join tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: Column: 1

“Flavor”: Alphabetical sort

Rename the output file to test_cna_brca_500gene.tabular

Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:

param-file “Input tabular dataset”: train_cna_brca_500gene.tabular (output of Sort tool)

Rename the output file to train_cna_brca_500gene_transposed.tabular

Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:

param-file “Input tabular dataset”: test_cna_brca_500gene.tabular (output of Sort tool)

Rename the output file to test_cna_brca_500gene_transposed.tabular
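
The purpose of filtering the test data by the training genes is that both tables end up with identical features in identical order. A minimal Python sketch of this alignment is shown below. Unlike the Join step, which silently drops genes that do not match, this hypothetical helper raises an error when a training gene is missing from the test data.

```python
def align_to_training_genes(train_genes, test_matrix):
    """Keep only the training genes in the test matrix, sorted by gene name.

    Raises KeyError if a training gene is absent from the test data.
    """
    return {gene: test_matrix[gene] for gene in sorted(train_genes)}


# Hypothetical example: two selected training genes, three test genes
selected = ["MYC", "ERBB2"]
test_cna = {"TP53": [0, -1], "ERBB2": [2, 0], "MYC": [1, 1]}

aligned = align_to_training_genes(selected, test_cna)
print(list(aligned))
```

Failing loudly on a missing gene is a design choice for the sketch, since a silently shorter test table would no longer match the training features.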

Comment: Short explanation of steps:

Here we will:

Extract sample_id and CLAUDIN_SUBTYPE from the train and test clinical data

Add the subtype to the train and test matrix

And finally remove the sample_id from the matrices.

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: train_clin_brca.tabular (Input dataset)

“Operation”: Keep

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1, Column: 16

Rename the output to Train annotation

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: test_clin_brca.tabular (Input dataset)

“Operation”: Keep

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1, Column: 16

Rename the output to Test annotation

Join ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “1st file”: train_cna_brca_500gene_transposed.tabular (output of Transpose tool)

“Column to use from 1st file”: Column: 1

param-file “2nd File”: Train annotation (output of Advanced Cut tool)

“Column to use from 2nd file”: Column: 1

“First line is a header line”: Yes

Rename the output to Annotated train matrix cna

Join ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “1st file”: test_cna_brca_500gene_transposed.tabular (output of Transpose tool)

“Column to use from 1st file”: Column: 1

param-file “2nd File”: Test annotation (output of Advanced Cut tool)

“Column to use from 2nd file”: Column: 1

“First line is a header line”: Yes

Rename the output to Annotated test matrix cna

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: Annotated train matrix cna (output of Join tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1

Rename the output to TabPFN ready train data - CNA

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: Annotated test matrix cna (output of Join tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1

Rename the output to TabPFN ready test data - CNA

Now the CNA data is ready for TabPFN. Let’s do the same for GEX!

GEX data

Hands On: Prepare GEX data

Comment: Short explanation of steps:

Here we will:

Calculate the variance of each gene in train matrix

Add the variance back to the matrix

Sort the matrix by variance in descending order

Filter the matrix by top 500 genes

Sort the matrix by gene name

Table Compute ( Galaxy version 1.2.4+galaxy2) with the following parameters:

“Input Single or Multiple Tables”: Single Table

param-file “Table”: train_gex_brca.tabular

“Type of table operation”: Compute expression across rows or columns

“Calculate”: Variance

“For each”: Row

Join two Datasets with the following parameters:

param-file “Join”: table (output of Table Compute tool)

“using column”: Column: 1

param-file “with”: train_gex_brca.tabular

“and column”: Column: 1

“Fill empty columns”: No

“Keep the header lines”: Yes

Sort ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “Sort Query”: table (output of Join two Datasets tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: Column: 2

“in”: Descending order

“Flavor”: Fast numeric sort (-n)

Select first with the following parameters:

“Select first”: 500

param-file “from”: table (output of Sort tool)

“Dataset has a header”: Yes

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: table (output of Select first tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1, Column: 2

Sort ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “Sort Query”: table (output of Advanced Cut tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: Column: 1

“Flavor”: Alphabetical sort

Rename the output file to train_gex_brca_500gene.tabular

Comment: Short explanation of steps:

Here we will:

Extract the list of genes from the train matrix

Filter the test data by extracted genes

Sort the matrix by gene name

Transpose both train and test data

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: train_gex_brca_500gene.tabular (output of Sort tool)

“Operation”: Keep

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1

Join ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “1st file”: table (output of Advanced Cut tool)

“Column to use from 1st file”: Column: 1

param-file “2nd File”: test_gex_brca.tabular

“Column to use from 2nd file”: Column: 1

“First line is a header line”: Yes

Sort ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “Sort Query”: output (output of Join tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: Column: 1

“Flavor”: Alphabetical sort

Rename the output file to test_gex_brca_500gene.tabular

Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:

param-file “Input tabular dataset”: train_gex_brca_500gene.tabular (output of Sort tool)

Rename the output file to train_gex_brca_500gene_transposed.tabular

Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:

param-file “Input tabular dataset”: test_gex_brca_500gene.tabular (output of Sort tool)

Rename the output file to test_gex_brca_500gene_transposed.tabular

Comment: Short explanation of steps:

Here we will:

Extract sample_id and CLAUDIN_SUBTYPE from the train and test clinical data

Add the subtype to the train and test matrix

And finally remove the sample_id from the matrices.

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: train_clin_brca.tabular (Input dataset)

“Operation”: Keep

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1, Column: 16

Rename the output to Train annotation

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: test_clin_brca.tabular (Input dataset)

“Operation”: Keep

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1, Column: 16

Rename the output to Test annotation

Join ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “1st file”: train_gex_brca_500gene_transposed.tabular (output of Transpose tool)

“Column to use from 1st file”: Column: 1

param-file “2nd File”: Train annotation (output of Advanced Cut tool)

“Column to use from 2nd file”: Column: 1

“First line is a header line”: Yes

Rename the output to Annotated train matrix - gex

Join ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “1st file”: test_gex_brca_500gene_transposed.tabular (output of Transpose tool)

“Column to use from 1st file”: Column: 1

param-file “2nd File”: Test annotation (output of Advanced Cut tool)

“Column to use from 2nd file”: Column: 1

“First line is a header line”: Yes

Rename the output to Annotated test matrix - gex

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: Annotated train matrix - gex (output of Join tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1

Rename the output to TabPFN ready train data - GEX

Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: Annotated test matrix - gex (output of Join tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: Column: 1

Rename the output to TabPFN ready test data - GEX
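The annotation steps above can be sketched in plain Python as follows. This is a minimal illustration, assuming the Join keeps only samples present in both tables and that the label ends up as the last column once the sample ID is discarded. The function name `annotate_and_drop_id` and the toy values are hypothetical.

```python
def annotate_and_drop_id(matrix_rows, annotation_rows):
    """Append the label column by sample ID, then discard the ID column."""
    # Map sample ID -> subtype label from the two-column annotation table
    labels = {row[0]: row[1] for row in annotation_rows[1:]}
    label_name = annotation_rows[0][1]
    # New header: all feature columns followed by the label column
    out = [matrix_rows[0][1:] + [label_name]]
    for row in matrix_rows[1:]:
        sample_id = row[0]
        # Samples without an annotation are skipped (assumed join behaviour)
        if sample_id in labels:
            out.append(row[1:] + [labels[sample_id]])
    return out

# Toy transposed matrix (samples in rows) and annotation table, made up here
matrix = [
    ["sample_id", "BRCA1", "ESR1"],
    ["S1", "1.1", "1.8"],
    ["S2", "1.4", "0.2"],
    ["S3", "0.9", "0.4"],
]
annotation = [
    ["sample_id", "CLAUDIN_SUBTYPE"],
    ["S1", "LumA"],
    ["S2", "Basal"],
]

ready = annotate_and_drop_id(matrix, annotation)
for row in ready:
    print("\t".join(row))
```

Sample S3 has no entry in the toy annotation table, so it does not appear in the result. It is worth checking in your own history that the annotated matrices still contain the number of samples you expect.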

Now it is time to run TabPFN for CNA and GEX data.

TabPFN

Warning: High computation time

TabPFN takes a long time to run this classification task on CPU (about 7 hours with our data). You can instead import the outputs (datasets 55, 56, 59, and 60) from this archived history.

Hands On: TabPFN on CNA

Tabular data prediction using TabPFN ( Galaxy version 2.0.9+galaxy0) with the following parameters:

“Select a machine learning task”: Classification

param-file “Train data”: TabPFN ready train data - CNA (output of Advanced Cut tool)

param-file “test data”: TabPFN ready test data - CNA (output of Advanced Cut tool)

“Does test data contain labels?”: Yes

Hands On: TabPFN on GEX

Tabular data prediction using TabPFN ( Galaxy version 2.0.9+galaxy0) with the following parameters:

“Select a machine learning task”: Classification

param-file “Train data”: TabPFN ready train data - GEX (output of Advanced Cut tool)

param-file “test data”: TabPFN ready test data - GEX (output of Advanced Cut tool)

“Does test data contain labels?”: Yes

To make the comparison between TabPFN and Flexynesis fair, we apply Flexynesis separately to the GEX and CNA data, each reduced to the same 500 features.

Hands On: Flexynesis on CNA

Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“Type of Analysis”: Supervised training

param-file “Training clinical data”: train_clin_brca.tabular

param-file “Test clinical data”: test_clin_brca.tabular

param-file “Training omics data”: train_cna_brca_500gene.tabular (output of Advanced Cut tool)

param-file “Test omics data”: test_cna_brca_500gene.tabular (output of Advanced Cut tool)

“What type of assay is your input?”: cna

“Model class”: DirectPred

“Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)

In “Advanced Options”:

“Variance threshold (as percentile) to drop low variance features.”: 0.8

“Minimum number of features to retain after feature selection.”: 100

“Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0

“Number of iterations for hyperparameter optimization.”: 5

In “Visualization”:

“Generate embeddings plot?”: Yes

“Generate precision-recall curves plot?”: Yes

Hands On: Flexynesis on GEX

Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“Type of Analysis”: Supervised training

param-file “Training clinical data”: train_clin_brca.tabular

param-file “Test clinical data”: test_clin_brca.tabular

param-file “Training omics data”: train_gex_brca_500gene.tabular (output of Advanced Cut tool)

param-file “Test omics data”: test_gex_brca_500gene.tabular (output of Advanced Cut tool)

“What type of assay is your input?”: gex

“Model class”: DirectPred

“Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)

In “Advanced Options”:

“Variance threshold (as percentile) to drop low variance features.”: 0.8

“Minimum number of features to retain after feature selection.”: 100

“Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0

“Number of iterations for hyperparameter optimization.”: 5

In “Visualization”:

“Generate embeddings plot?”: Yes

“Generate precision-recall curves plot?”: Yes

Question

Compare the PR-curve plots of the Flexynesis and TabPFN predictions. Which one is more accurate?

TabPFN gives the better prediction on the CNA data, whereas Flexynesis gives the better prediction on the GEX data. Integrating both CNA and GEX in Flexynesis improves the prediction further.

Figure 9: Precision-Recall curve plot of TabPFN on CNA

Figure 10: Precision-Recall curve plot of Flexynesis on CNA

Figure 11: Precision-Recall curve plot of TabPFN on GEX

Figure 12: Precision-Recall curve plot of Flexynesis on GEX

Figure 13: Precision-Recall curve plot of Flexynesis
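
To make clear what a precision-recall curve measures when comparing the two tools, here is a minimal Python sketch that computes the curve for one subtype against all others (one-vs-rest) from predicted scores. The labels and scores are made up, and the function names are hypothetical. The plots produced by the Galaxy tools most likely rely on a library routine, so treat this only as an illustration of the concept. The sketch assumes at least one sample of the positive class and ignores tied scores.

```python
def precision_recall_points(y_true, scores, positive):
    """Return (recall, precision) pairs for one class, one-vs-rest."""
    # Rank samples from the most to the least confident prediction
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    total_pos = sum(1 for label in y_true if label == positive)
    points, true_pos = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == positive:
            true_pos += 1
        # recall = TP / actual positives, precision = TP / predicted positives
        points.append((true_pos / total_pos, true_pos / rank))
    return points

def average_precision(points):
    """Area under the PR curve as a step-wise sum over recall increments."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

# Toy example: true subtypes and the predicted probability of "LumA"
y_true = ["LumA", "Basal", "LumA", "Her2"]
luma_scores = [0.9, 0.8, 0.6, 0.1]

points = precision_recall_points(y_true, luma_scores, positive="LumA")
ap = average_precision(points)
print(points)
print(round(ap, 3))
```

A curve that stays closer to the top-right corner, and therefore has a larger area, indicates that the classifier ranks true members of the subtype ahead of the other samples more consistently.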

Comment: Try Flexynesis with more iterations

In this tutorial we used only 5 iterations of hyperparameter optimization to train our model. You can rerun the tool with a higher number of iterations and check how much the prediction improves.

Conclusion

By training on the METABRIC breast cancer data, we:

Visualized meaningful subtype separation in the latent space
Identified subtype-specific genes
Showed how learned features relate to known clinical labels

Flexynesis provides a fast and accessible way to explore complex omics datasets and uncover biological structure without extensive manual tuning.

We also compared Flexynesis with TabPFN. TabPFN predicted the subtypes better from the CNA data, but it did not match Flexynesis on the GEX data. TabPFN also provides fewer outputs than Flexynesis: it does not produce, for example, the integrated latent space embeddings, feature importance scores, or detailed information about the predictions.

You've Finished the Tutorial

Key points

Flexynesis learns biologically meaningful latent spaces

BRCA subtypes can be distinguished using learned features

References

Uyar, B., T. Savchyn, R. Wurmus, A. Sarigun, M. M. Shaik et al., 2024 Flexynesis: A deep learning framework for bulk multi-omics data integration for precision oncology and beyond. 10.1101/2024.07.16.603606
cbioPortal community, cBioPortal for cancer genomics. https://www.cbioportal.org/
metabric community, METABRIC. https://ega-archive.org/studies/EGAS00000000083

Citing this Tutorial

Amirhossein Naghsh Nilchi, Björn Grüning, Modeling Breast Cancer Subtypes with Flexynesis (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/statistics/tutorials/flexynesis_classification/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{statistics-flexynesis_classification,
author = "Amirhossein Naghsh Nilchi and Björn Grüning",
	title = "Modeling Breast Cancer Subtypes with Flexynesis (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/statistics/tutorials/flexynesis_classification/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/statistics/tutorials/flexynesis_classification/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: pick_value
  owner: iuc
  revisions: b19e21af9c52
  tool_panel_section_label: Collection Operations
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
