Flexynesis represents a state-of-the-art deep learning framework specifically designed for multi-modal data integration in biological research (Uyar et al. 2024). What sets Flexynesis apart is its comprehensive suite of deep learning architectures that can handle various data integration scenarios while providing robust feature selection and hyperparameter optimization.
Here, we use the Flexynesis tool suite on a multi-omics dataset of breast cancer samples from the METABRIC consortium (metabric community), one of the landmark breast cancer genomics studies available through cBioPortal (cbioPortal community). This dataset contains comprehensive molecular and clinical data from over 2,000 breast cancer patients, including gene expression profiles, copy number alterations, mutation data, and clinical outcomes. The data was downloaded from cBioPortal and randomly split into train (70% of the samples) and test (30% of the samples) data folders. The data files were processed to follow a consistent nomenclature.
This training is inspired by the original Flexynesis analysis notebook: brca_subtypes.ipynb.
Warning: LICENSE
Flexynesis is only available for NON-COMMERCIAL use. Permission is only granted for academic, research, and educational purposes. Before using, be sure to review, agree, and comply with the license.
For commercial use, please review the Flexynesis license on GitHub and contact the copyright holders.
In this training we will use a multi-omics dataset from the METABRIC database, downloaded through cBioPortal and split into 70% train and 30% test samples.
The multi-omics data consists of three data types:
Gene Expression (GEX): Measures the activity levels of genes by quantifying mRNA abundance. This data captures which genes are being actively transcribed in each sample.
Copy Number Alterations (CNA): Detects genomic regions that have been duplicated or deleted compared to the normal genome. Cancer cells often have chromosomal instabilities leading to gene amplifications (gains) or deletions (losses), which can drive tumor development and progression.
Clinical data (CLIN): Contains patient metadata including demographics, treatment history, tumor characteristics, and, importantly for this tutorial, the breast cancer molecular subtypes (such as Luminal A, Luminal B, HER2-enriched, and Basal-like), which are assigned based on the expression patterns of key biomarkers including claudin proteins, important for cell adhesion and tumor behavior.
These three data modalities provide complementary views of the same tumor samples - the clinical data provides the biological context and classification labels, while the molecular data (GEX and CNA) capture the underlying genomic alterations that drive these clinical phenotypes.
Flexynesis tries to learn the patterns in these molecular alterations and connect them to the phenotype, predicting cancer subtypes from the genomic and transcriptomic profiles.
Hands On: Data Upload
Create a new history for this tutorial
Import the files from Zenodo or from
the shared data library (GTN - Material -> statistics
-> Modeling Breast Cancer Subtypes with Flexynesis):
Click galaxy-uploadUpload Data at the top of the tool panel
Select galaxy-wf-editPaste/Fetch Data
Paste the link(s) into the text field
Press Start
Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
Go into Libraries (left panel)
Navigate to the correct folder as indicated by your instructor.
On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
Select the desired files
Click on Add to Historygalaxy-dropdown near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
“Select history”: the history you want to import the data to (or create a new one)
Click on Import
Rename the datasets
Check that the datatype is tabular
Click on the galaxy-pencilpencil icon for the dataset to edit its attributes
In the central panel, click galaxy-chart-select-dataDatatypes tab on the top
In the galaxy-chart-select-dataAssign Datatype, select tabular from “New Type” dropdown
Tip: you can start typing the datatype into the field to filter the dropdown menu
Click the Save button
Add to each dataset a tag corresponding to the modality (#gex, #cna, #clin)
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
Click on the dataset to expand it
Click on Add Tagsgalaxy-tags
Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
Press Enter
Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):
a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
datasets 4 and 5 are used as inputs to Macs2 broadCall, generating datasets 6 and 8;
datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.
Flexynesis automatically handles key preprocessing steps such as:
Data processing:
removing uninformative features (e.g. features with near-zero-variation)
removing samples with too many NA values
removing features with too many NA values and imputing remaining NA values using median imputation (replacing missing values with the median of each feature)
Feature selection only on training data for each omics layer separately:
Features are sorted by Laplacian score
Features that fall within the top_percentile are kept
Highly redundant features are further removed (for a pair of highly correlated features, keep the one with the higher Laplacian score).
Harmonize the training data with the test data (subset the test data features to those that are kept for the training data)
Normalize the training data (standard scaling) and apply the same scaling factors to the test data.
Log transform the final matrices (Optional)
Distinguish numerical and categorical variables in the clinical data file. For categorical variables, create a numerical encoding of the labels for training data. Use the same encoders to map the test samples to the same numerical encodings.
Comment: No manual preprocessing is required
Flexynesis performs all data cleaning internally as part of its pipeline.
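As an illustration only, the cleaning and harmonization steps above can be sketched in a few lines of pandas. The matrices, gene names, and values below are hypothetical toy data, and this sketch is not Flexynesis's actual implementation:

```python
import pandas as pd

# Hypothetical toy matrices: features in rows, samples in columns,
# matching the layout of the tutorial's omics files
train = pd.DataFrame(
    {"s1": [1.0, 5.0, None], "s2": [1.0, 7.0, 2.0], "s3": [1.0, 9.0, 4.0]},
    index=["geneA", "geneB", "geneC"],
)
test = pd.DataFrame({"s4": [1.0, 6.0, 3.0]}, index=["geneA", "geneB", "geneC"])

# 1) drop near-zero-variance features (geneA is constant)
train = train[train.var(axis=1) > 1e-8]

# 2) median-impute remaining NA values per feature
train = train.apply(lambda row: row.fillna(row.median()), axis=1)

# 3) harmonize: subset the test features to those kept for the training data
test = test.loc[train.index]

# 4) standard-scale the training data, then apply the SAME factors to the test data
mu, sd = train.mean(axis=1), train.std(axis=1, ddof=0)
train_scaled = train.sub(mu, axis=0).div(sd, axis=0)
test_scaled = test.sub(mu, axis=0).div(sd, axis=0)

# 5) encode a categorical clinical label consistently for train and test
codes = {c: i for i, c in enumerate(sorted(["Her2", "LumA", "LumB"]))}
```

Note that the scaling factors and the label encoding are fitted on the training data only and reused for the test data, exactly as Flexynesis does internally.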
Now it’s time to train a model using Flexynesis:
We choose whether we want to concatenate the data matrices (early integration) or not (intermediate integration) before running them through the neural networks.
We want to apply feature selection and keep only the top 10% of the features. In the end, we want to retain at least 1000 features per omics layer.
We apply a variance threshold (for simplicity of demonstration, we keep only a few of the most variable features). Setting this to 80 will remove the 80% of features with the lowest variation from each modality.
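The variance filter can be sketched with pandas on a hypothetical toy matrix (5 features, 3 samples); a threshold of 80 keeps only features above the 80th variance percentile:

```python
import pandas as pd

# Hypothetical toy matrix: 5 features in rows, 3 samples in columns
df = pd.DataFrame(
    {"s1": [0, 0, 0, 0, 0], "s2": [0, 1, 2, 3, 4], "s3": [0, 2, 4, 6, 8]},
    index=["g1", "g2", "g3", "g4", "g5"],
)

# keep features whose variance is at or above the 80th percentile,
# i.e. drop the 80% of features with the lowest variation
variances = df.var(axis=1)
kept = df[variances >= variances.quantile(0.80)]
```

Here only the single most variable feature out of five (20%) survives the filter.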
We choose which model architecture to use:
DirectPred: a fully connected network (standard multilayer perceptron) with supervisor heads (one MLP for each target variable)
target_variables: Column headers in clinical data.
hpo_iter: How many hyperparameter search steps to perform:
This example runs 1 hyperparameter search step using DirectPred architecture and a hyperparameter configuration space defined for “DirectPred” with a supervisor head for “CLAUDIN_SUBTYPE” variable. At the end of the parameter optimization, the best model will be selected and returned.
Training a model for longer than needed causes it to overfit and yields worse validation performance. It also takes more time to train, which adds up if we run a long hyperparameter optimization routine of not just 1 step but, say, more than 100 steps.
It is possible to set an early stopping criterion in Flexynesis, which is a simple callback handled by PyTorch Lightning. This is regulated using early_stop_patience. When set to e.g. 10, the training will stop if the validation loss has not improved in the last 10 epochs.
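The early-stopping logic can be illustrated with a small standalone function (a sketch of the general mechanism, not Flexynesis's or Lightning's internal code):

```python
def stop_epoch(val_losses, patience):
    """Return the (1-based) epoch at which training halts under early stopping:
    stop once the validation loss has not improved for `patience` consecutive
    epochs; otherwise train for all epochs."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)
```

For example, with validation losses [1.0, 0.9, 0.95, 0.94, 0.93] and a patience of 2, training stops at epoch 4: after the improvement at epoch 2, two consecutive epochs pass without beating the best loss of 0.9.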
Hands On: Flexynesis
Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
“I certify that I am not using this tool for commercial purposes.”: Yes
This plot also confirms that most of the subtypes were classified with good precision (except Class 4 (LumB); in the PCA plot you can see that some Her2 samples were classified as LumB, for example)
Generate UMAP plot of the embeddings
To generate a UMAP plot, we need to extract the embeddings and predicted table.
Hands On: Extract embeddings and prediction
Extract dataset with the following parameters:
param-file“Input List”: results (output of Flexynesistool)
“How should a dataset be selected?”: Select by element identifier
“Element identifier:”: job.embeddings_test
Extract dataset with the following parameters:
param-file“Input List”: results (output of Flexynesistool)
“How should a dataset be selected?”: Select by element identifier
“Element identifier:”: job.predicted_labels
Question
What information do we have in predicted labels?
It is a tabular file which contains the probability of each subtype for each sample and a final predicted label based on these probabilities.
As you have seen in predicted labels, each sample has a prediction value for each subtype. For UMAP plot we only need the final predicted label.
Hands On: Remove probabilities (duplicates)
Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
“Sort Query”: job.predicted_labels (output of Extract datasettool)
“Number of header lines”: 1
In “Column selections”:
param-repeat“Insert Column selections”
“on column”: Column: 1
“Flavor”: Alphabetical sort
“Output unique values”: Yes
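Outside Galaxy, the same deduplication could be done in pandas. The column names below are hypothetical stand-ins for the predicted-labels table, which has one row per (sample, class) pair:

```python
import pandas as pd

# Hypothetical layout of job.predicted_labels: one row per (sample, class),
# with per-class probabilities and the final predicted label repeated per row
pred = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s2"],
    "class_label": ["LumA", "LumB", "LumA", "LumB"],
    "probability": [0.7, 0.3, 0.2, 0.8],
    "predicted_label": ["LumA", "LumA", "LumB", "LumB"],
})

# Equivalent of the Sort step (sort on the sample column, keep unique values):
# one row per sample, retaining the final predicted label
unique_pred = pred.sort_values("sample_id").drop_duplicates("sample_id")
```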
And finally the UMAP plot.
Hands On: UMAP plot
Flexynesis plot ( Galaxy version 0.2.20+galaxy3) with the following parameters:
“I certify that I am not using this tool for commercial purposes.”: Yes
“Flexynesis plot”: Dimensionality reduction
param-file“Predicted labels”: table (output of Sorttool)
param-file“Embeddings”: job.embeddings_test (output of Extract datasettool)
“Column in the labels file to use for coloring the points in the plot”: Column: 5 (known_label)
This is a UMAP representation of the embeddings, which shows a better separation between subtypes
Longer training
In reality, hyperparameter optimization should run for multiple steps so that the parameter search space is explored well enough to find a good configuration. However, for demonstration purposes, we only run it for 5 steps here:
“Number of iterations for hyperparameter optimization.”: 5
Hands On: Flexynesis
Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
“I certify that I am not using this tool for commercial purposes.”: Yes
This plot also confirms that most of the subtypes were classified with good precision. The precision has also increased (compare Class 4 = 0.23 with 1 iteration to Class 4 = 0.41 with 5 iterations)
Generate UMAP plot of the embeddings
To generate a UMAP plot, we need to extract the embeddings and predicted table.
Hands On: Extract embeddings and prediction
Extract dataset with the following parameters:
param-file“Input List”: results (output of Flexynesistool)
“How should a dataset be selected?”: Select by element identifier
“Element identifier:”: job.embeddings_test
Extract dataset with the following parameters:
param-file“Input List”: results (output of Flexynesistool)
“How should a dataset be selected?”: Select by element identifier
“Element identifier:”: job.predicted_labels
Hands On: Remove probabilities (duplicates)
Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
“Sort Query”: job.predicted_labels (output of Extract datasettool)
“Number of header lines”: 1
In “Column selections”:
param-repeat“Insert Column selections”
“on column”: Column: 1
“Flavor”: Alphabetical sort
“Output unique values”: Yes
And finally the UMAP plot.
Hands On: UMAP plot
Flexynesis plot ( Galaxy version 0.2.20+galaxy3) with the following parameters:
“I certify that I am not using this tool for commercial purposes.”: Yes
“Flexynesis plot”: Dimensionality reduction
param-file“Predicted labels”: table (output of Sorttool)
param-file“Embeddings”: job.embeddings_test (output of Extract datasettool)
“Column in the labels file to use for coloring the points in the plot”: Column: 5 (known_label)
This is a UMAP representation of the embeddings, which shows a better separation between subtypes
Hands-on: Choose Your Own Tutorial
This is a "Choose Your Own Tutorial" (CYOT) section (also known as "Choose Your Own Analysis" (CYOA)), where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial.
We have another tool for the classification task called TabPFN, which is a foundation model for tabular data. We can repeat the classification task using TabPFN and compare it with Flexynesis. Please note that TabPFN is slow on CPU and will take a long time to finish on this dataset. What do you say? Should we continue with TabPFN?
Conclusion
In this tutorial, we demonstrated how Flexynesis can model breast cancer subtypes from multi-omics data using a deep learning framework. The tool handles preprocessing, feature selection, and latent space learning automatically, making it efficient and robust.
By training on METABRIC breast cancer data, we:
Visualized meaningful subtype separation in the latent space
Identified subtype-specific genes
Showed how learned features relate to known clinical labels
Flexynesis provides an accessible way to explore complex omics datasets and uncover biological structure without extensive manual tuning.
Alright, let’s see how TabPFN can predict the subtypes.
Prepare data for TabPFN
Currently, TabPFN supports up to 10,000 samples and 500 features (genes) in tabular data.
We will filter our data by gene variance and will use top 500 variable genes as input for TabPFN.
The train and test tables for TabPFN should be transposed so that samples are in rows and genes in columns; the last column should contain the labels; and the train and test data must have the same set of features in the same order.
Since TabPFN does not support data integration, we will try the gene expression and copy number alteration data separately.
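The required shaping can be sketched in pandas on hypothetical toy matrices (the sample names, gene names, and labels below are made up for illustration):

```python
import pandas as pd

# Hypothetical toy matrices: genes in rows, samples in columns
train = pd.DataFrame({"s1": [1, 2], "s2": [3, 4]}, index=["geneB", "geneA"])
test = pd.DataFrame({"s3": [5, 6]}, index=["geneB", "geneA"])
train_labels = pd.Series({"s1": "LumA", "s2": "LumB"})
test_labels = pd.Series({"s3": "LumA"})

# Same feature set in the same order, samples in rows, label as the LAST column
genes = sorted(train.index)
train_tab = train.loc[genes].T.assign(label=train_labels)
test_tab = test.loc[genes].T.assign(label=test_labels)
```

Reindexing both matrices with the same sorted gene list before transposing is what guarantees identical feature order between train and test.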
CNA data
First, let’s filter the cna data by variance.
Hands On: Prepare CNA data
Comment: Short explanation of steps:
Here we will:
Calculate the variance of each gene in train matrix
Add the variance back to the matrix
Sort the matrix by variance in descending order
Filter the matrix by top 500 genes
Sort the matrix by gene name
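The steps above amount to the following pandas sketch (hypothetical toy data, keeping the top 2 genes instead of 500):

```python
import pandas as pd

# Hypothetical toy train matrix: genes in rows, samples in columns
df = pd.DataFrame(
    {"s1": [0, 0, 0], "s2": [1, 5, 0], "s3": [2, 10, 0]},
    index=["g2", "g1", "g3"],
)

TOP_N = 2  # the tutorial uses 500

# variance per gene -> sort descending -> take the top N -> sort by gene name
variances = df.var(axis=1)
top = df.loc[variances.sort_values(ascending=False).head(TOP_N).index].sort_index()
```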
Table Compute ( Galaxy version 1.2.4+galaxy2) with the following parameters:
“Input Single or Multiple Tables”: Single Table
param-file“Table”: train_cna_brca.tabular
“Type of table operation”: Compute expression across rows or columns
“Calculate”: Variance
“For each”: Row
Join two Datasets with the following parameters:
param-file“Join”: table (output of Table Computetool)
“using column”: Column: 1
param-file“with”: train_cna_brca.tabular
“and column”: Column: 1
“Fill empty columns”: No
“Keep the header lines”: Yes
Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“Sort Query”: table (output of Join two Datasetstool)
“Number of header lines”: 1
In “Column selections”:
param-repeat“Insert Column selections”
“on column”: Column: 2
“in”: Descending order
“Flavor”: Fast numeric sort (-n)
Select first with the following parameters:
“Select first”: 500
param-file“from”: table (output of Sorttool)
“Dataset has a header”: Yes
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: table (output of Select firsttool)
“Operation”: Discard
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1, Column: 2
Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“Sort Query”: table (output of Advanced Cuttool)
“Number of header lines”: 1
In “Column selections”:
param-repeat“Insert Column selections”
“on column”: Column: 1
“Flavor”: Alphabetical sort
Rename the output file to train_cna_brca_500gene.tabular
Comment: Short explanation of steps:
Here we will:
Extract the list of genes from the train matrix
Filter the test data by extracted genes
Sort the matrix by gene name
Transpose both train and test data
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: train_cna_brca_500gene.tabular (output of Advanced Cuttool)
“Operation”: Keep
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1
Join ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“1st file”: table (output of Advanced Cuttool)
“Column to use from 1st file”: Column: 1
param-file“2nd File”: test_cna_brca.tabular
“Column to use from 2nd file”: Column: 1
“First line is a header line”: Yes
Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“Sort Query”: output (output of Jointool)
“Number of header lines”: 1
In “Column selections”:
param-repeat“Insert Column selections”
“on column”: Column: 1
“Flavor”: Alphabetical sort
Rename the output file to test_cna_brca_500gene.tabular
Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:
param-file“Input tabular dataset”: train_cna_brca_500gene.tabular (output of Sorttool)
Rename the output file to train_cna_brca_500gene_transposed.tabular
Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:
param-file“Input tabular dataset”: test_cna_brca_500gene.tabular (output of Sorttool)
Rename the output file to test_cna_brca_500gene_transposed.tabular
Comment: Short explanation of steps:
Here we will:
Extract sample_id and CLAUDIN_SUBTYPE from the train and clinical data
Add the subtype to the train and test matrix
And finally remove the sample_id from the matrices.
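These three steps correspond to a simple join-and-drop in pandas (toy data with hypothetical sample and gene names):

```python
import pandas as pd

# Hypothetical transposed matrix: samples in rows after the Transpose step
mat = pd.DataFrame({"sample_id": ["s1", "s2"], "geneA": [1, 2], "geneB": [3, 4]})
clin = pd.DataFrame({"sample_id": ["s1", "s2"],
                     "CLAUDIN_SUBTYPE": ["claudin-low", "LumA"]})

# join the subtype onto the matrix, then drop the sample_id column
annotated = mat.merge(clin, on="sample_id").drop(columns="sample_id")
```

The result has the genes followed by the label as the last column, which is the layout TabPFN expects.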
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: train_clin_brca.tabular (Input dataset)
“Operation”: Keep
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1, Column: 16
Rename the output to Train annotation
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: test_clin_brca.tabular (Input dataset)
“Operation”: Keep
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1, Column: 16
Rename the output to Test annotation
Join ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“1st file”: train_cna_brca_500gene_transposed.tabular (output of Transposetool)
“Column to use from 1st file”: Column: 1
param-file“2nd File”: Train annotation (output of Advanced Cuttool)
“Column to use from 2nd file”: Column: 1
“First line is a header line”: Yes
Rename the output to Annotated train matrix cna
Join ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“1st file”: test_cna_brca_500gene_transposed.tabular (output of Transposetool)
“Column to use from 1st file”: Column: 1
param-file“2nd File”: Test annotation (output of Advanced Cuttool)
“Column to use from 2nd file”: Column: 1
“First line is a header line”: Yes
Rename the output to Annotated test matrix
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: Annotated train matrix cna (output of Jointool)
“Operation”: Discard
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1
Rename the output to TabPFN ready train data - CNA
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: Annotated test matrix (output of Jointool)
“Operation”: Discard
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1
Rename the output to TabPFN ready test data - CNA
Now the CNA data is ready for TabPFN. Let’s do the same for GEX!
GEX data
Hands On: Prepare GEX data
Comment: Short explanation of steps:
Here we will:
Calculate the variance of each gene in train matrix
Add the variance back to the matrix
Sort the matrix by variance in descending order
Filter the matrix by top 500 genes
Sort the matrix by gene name
Table Compute ( Galaxy version 1.2.4+galaxy2) with the following parameters:
“Input Single or Multiple Tables”: Single Table
param-file“Table”: train_gex_brca.tabular
“Type of table operation”: Compute expression across rows or columns
“Calculate”: Variance
“For each”: Row
Join two Datasets with the following parameters:
param-file“Join”: table (output of Table Computetool)
“using column”: Column: 1
param-file“with”: train_gex_brca.tabular
“and column”: Column: 1
“Fill empty columns”: No
“Keep the header lines”: Yes
Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“Sort Query”: table (output of Join two Datasetstool)
“Number of header lines”: 1
In “Column selections”:
param-repeat“Insert Column selections”
“on column”: Column: 2
“in”: Descending order
“Flavor”: Fast numeric sort (-n)
Select first with the following parameters:
“Select first”: 500
param-file“from”: table (output of Sorttool)
“Dataset has a header”: Yes
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: table (output of Select firsttool)
“Operation”: Discard
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1, Column: 2
Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“Sort Query”: table (output of Advanced Cuttool)
“Number of header lines”: 1
In “Column selections”:
param-repeat“Insert Column selections”
“on column”: Column: 1
“Flavor”: Alphabetical sort
Rename the output file to train_gex_brca_500gene.tabular
Comment: Short explanation of steps:
Here we will:
Extract the list of genes from the train matrix
Filter the test data by extracted genes
Sort the matrix by gene name
Transpose both train and test data
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: train_gex_brca_500gene.tabular (output of Advanced Cuttool)
“Operation”: Keep
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1
Join ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“1st file”: table (output of Advanced Cuttool)
“Column to use from 1st file”: Column: 1
param-file“2nd File”: test_gex_brca.tabular
“Column to use from 2nd file”: Column: 1
“First line is a header line”: Yes
Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“Sort Query”: output (output of Jointool)
“Number of header lines”: 1
In “Column selections”:
param-repeat“Insert Column selections”
“on column”: Column: 1
“Flavor”: Alphabetical sort
Rename the output file to test_gex_brca_500gene.tabular
Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:
param-file“Input tabular dataset”: train_gex_brca_500gene.tabular (output of Sorttool)
Rename the output file to train_gex_brca_500gene_transposed.tabular
Transpose ( Galaxy version 1.9+galaxy0) with the following parameters:
param-file“Input tabular dataset”: test_gex_brca_500gene.tabular (output of Sorttool)
Rename the output file to test_gex_brca_500gene_transposed.tabular
Comment: Short explanation of steps:
Here we will:
Extract sample_id and CLAUDIN_SUBTYPE from the train and clinical data
Add the subtype to the train and test matrix
And finally remove the sample_id from the matrices.
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: train_clin_brca.tabular (Input dataset)
“Operation”: Keep
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1, Column: 16
Rename the output to Train annotation
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: test_clin_brca.tabular (Input dataset)
“Operation”: Keep
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1, Column: 16
Rename the output to Test annotation
Join ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“1st file”: train_gex_brca_500gene_transposed.tabular (output of Transposetool)
“Column to use from 1st file”: Column: 1
param-file“2nd File”: Train annotation (output of Advanced Cuttool)
“Column to use from 2nd file”: Column: 1
“First line is a header line”: Yes
Rename the output to Annotated train matrix - gex
Join ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“1st file”: test_gex_brca_500gene_transposed.tabular (output of Transposetool)
“Column to use from 1st file”: Column: 1
param-file“2nd File”: Test annotation (output of Advanced Cuttool)
“Column to use from 2nd file”: Column: 1
“First line is a header line”: Yes
Rename the output to Annotated test matrix - gex
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: Annotated train matrix - gex (output of Jointool)
“Operation”: Discard
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1
Rename the output to TabPFN ready train data - GEX
Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
param-file“File to cut”: Annotated test matrix - gex (output of Jointool)
“Operation”: Discard
“Cut by”: fields
“Is there a header for the data’s columns ?”: Yes
“List of Fields”: Column: 1
Rename the output to TabPFN ready test data - GEX
Now it is time to run TabPFN for CNA and GEX data.
TabPFN
Warning: High computation time
TabPFN takes a long time for the classification task on CPU (with our data, about 7 hours).
You can instead import the outputs (datasets 55, 56, 59, and 60) from this archived history.
Hands On: TabPFN on CNA
Tabular data prediction using TabPFN ( Galaxy version 2.0.9+galaxy0) with the following parameters:
“Select a machine learning task”: Classification
param-file“Train data”: TabPFN ready train data - CNA (output of Advanced Cuttool)
param-file“test data”: TabPFN ready test data - CNA (output of Advanced Cuttool)
“Does test data contain labels?”: Yes
Hands On: TabPFN on GEX
Tabular data prediction using TabPFN ( Galaxy version 2.0.9+galaxy0) with the following parameters:
“Select a machine learning task”: Classification
param-file“Train data”: TabPFN ready train data - GEX (output of Advanced Cuttool)
param-file“test data”: TabPFN ready test data - GEX (output of Advanced Cuttool)
“Does test data contain labels?”: Yes
To make the comparison of TabPFN and Flexynesis fair, we should apply Flexynesis to the GEX and CNA data with 500 features separately.
Hands On: Flexynesis on CNA
Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
“I certify that I am not using this tool for commercial purposes.”: Yes
param-file“Training omics data”: train_cna_brca_500gene.tabular (output of Advanced Cuttool)
param-file“Test omics data”: test_cna_brca_500gene.tabular (output of Advanced Cuttool)
“What type of assay is your input?”: cna
“Model class”: DirectPred
“Column name in the train clinical data to use for predictions, multiple targets are allowed”: Column: 16 (CLAUDIN_SUBTYPE)
In “Advanced Options”:
“Variance threshold (as percentile) to drop low variance features.”: 0.8
“Minimum number of features to retain after feature selection.”: 100
“Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0
“Number of iterations for hyperparameter optimization.”: 5
In “Visualization”:
“Generate embeddings plot?”: Yes
“Generate precision-recall curves plot?”: Yes
Question
Compare the PR-curve plots of both Flexynesis and TabPFN predictions, which one is more accurate?
You can see that the prediction of TabPFN was better with the CNA data; however, we got a better prediction with Flexynesis on GEX, and with the power of data integration using both CNA and GEX, the prediction is further improved!
Figure 13: Precision-Recall curve plot of Flexynesis
Comment: Try Flexynesis with more iteration
In this training we only used 5 iterations to train our model. You can rerun the tool with a higher number of iterations and see how the prediction improves.
Conclusion
In this tutorial, we demonstrated how Flexynesis can model breast cancer subtypes from multi-omics data using a deep learning framework. The tool handles preprocessing, feature selection, and latent space learning automatically, making it efficient and robust.
By training on METABRIC breast cancer data, we:
Visualized meaningful subtype separation in the latent space
Identified subtype-specific genes
Showed how learned features relate to known clinical labels
Flexynesis provides a fast and accessible way to explore complex omics datasets and uncover biological structure without extensive manual tuning.
We also compared Flexynesis with TabPFN and saw that although TabPFN predicted the subtypes better with the CNA data, it was not as good as Flexynesis on the GEX data. TabPFN also provides fewer outputs than Flexynesis, which additionally returns the integrated latent space embeddings, feature importances, and full information about the predictions.
You've Finished the Tutorial
Please also consider filling out the Feedback Form!
Uyar, B., T. Savchyn, R. Wurmus, A. Sarigun, M. M. Shaik et al., 2024 Flexynesis: A deep learning framework for bulk multi-omics data integration for precision oncology and beyond. 10.1101/2024.07.16.603606
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.