Identifying Survival Markers of Brain Tumors with Flexynesis

Overview
Questions:
  • How can multi-modal genomic data be integrated to identify survival markers in brain tumors?

  • How can deep learning approaches improve survival prediction in cancer patients?

  • Which genomic features are most predictive of patient survival in glioma subtypes?

Objectives:
  • Apply multi-modal data integration techniques to combine mutation, gene expression, and clinical data

  • Perform survival analysis to identify prognostic biomarkers in brain tumors

  • Implement deep learning models for survival prediction using genomic data

Time estimation: 2 hours
Supporting Materials:
Published: Aug 13, 2025
Last modification: Aug 13, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 2

Here, we use the Flexynesis tool suite on a multi-omics dataset of 506 Brain Lower Grade Glioma (LGG) and 288 Glioblastoma Multiforme (GBM) samples with matching mutation and copy number alteration data. These data were downloaded from cBioPortal (cBioPortal community) and split into train (70% of the samples) and test (30% of the samples) sets.

We will work on CNA (copy number alteration) and MUT (mutation) data; the mutation data is a binary gene × sample matrix.
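As a quick sanity check outside Galaxy, the matrices can be inspected with a few lines of Python. This is an illustrative sketch, not one of the tutorial's Galaxy tools; it assumes a tab-separated file laid out like train_mut_lgggbm.tabular, with genes as rows, samples as columns, and 0/1 mutation flags in the cells.

```python
import csv

# Peek at the first rows of a tab-separated omics matrix
# (genes as rows, sample ids as columns, binary 0/1 entries).
def peek_matrix(path, n_rows=3):
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)  # first cell labels the gene column, the rest are sample ids
        rows = [next(reader) for _ in range(n_rows)]
    return header, rows
```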

This training is inspired by the original flexynesis analysis notebook: survival_subtypes_LGG_GBM.ipynb.

Comment: Getting data from cBioPortal

If you want to download data directly from cBioPortal, you can go through the “Prepare data from CbioPortal for Flexynesis integration” training.

Agenda

In this tutorial, we will cover:

  1. Get data
  2. Train Flexynesis model with a survival task
  3. Survival-risk subtypes
    1. Extract predicted labels and calculate median
    2. Assign high and low risk groups
  4. Conclusion
Warning: LICENSE

Flexynesis is only available for NON-COMMERCIAL use. Permission is only granted for academic, research, and educational purposes. Before using, be sure to review, agree to, and comply with the license. For commercial use, please review the flexynesis license on GitHub and contact the copyright holders.

Get data

Hands On: Data Upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo:

    https://zenodo.org/records/16287482/files/train_mut_lgggbm.tabular
    https://zenodo.org/records/16287482/files/train_clin_lgggbm.tabular
    https://zenodo.org/records/16287482/files/test_cna_lgggbm.tabular
    https://zenodo.org/records/16287482/files/test_mut_lgggbm.tabular
    https://zenodo.org/records/16287482/files/test_clin_lgggbm.tabular
    https://zenodo.org/records/16287482/files/train_cna_lgggbm.tabular
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Check that the datatype is tabular

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select tabular from “New Type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  4. Add to each dataset a tag corresponding to its modality (#mut, #cna, and #clinical)

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

Train Flexynesis model with a survival task

Now it is time to import the train and test datasets. We rank genes by Laplacian score and pick the top 10% of the genes, while removing highly redundant genes with a correlation threshold of 0.8 and a variance threshold of 50%. We will also use intermediate fusion of the omic layers. For hyperparameter optimization, we use the following parameters.

  • Model class: We pick DirectPred (a fully connected network)

  • Column name in the train clinical data to use as survival event and Column name in the train clinical data to use as survival time: the survival status (consisting of 0s and 1s) and the time since last follow-up, respectively. It is important that the clinical data contains both variables as numerical values.
  • Column name in the train clinical data to use for predictions, multiple targets are allowed: We can concurrently train the same network to be able to predict other variables such as histological diagnosis, however, here we just focus on the survival endpoints, so we pass an empty list.
  • Number of iterations for hyperparameter optimization: We do 5 iterations of hyperparameter optimization. This is a reasonable number for demonstration purposes, but it could be beneficial to increase this value in order to discover even better models.
  • How many epochs to wait when no improvements in validation loss are observed: If training does not show any signs of improved performance on the validation part of the train dataset for at least 10 epochs, we stop the training. This not only significantly decreases the time spent on training by avoiding unnecessary continuation of unpromising runs, but also helps avoid over-fitting the network to the training data.
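The variance and correlation filtering described above can be sketched in plain Python. This is an illustrative re-implementation of the filtering logic, not Flexynesis' actual code: features below the 50th variance percentile are dropped, then for each pair of remaining features whose absolute Pearson correlation exceeds 0.8, the lower-variance member is removed.

```python
from statistics import pvariance, median
from itertools import combinations

def pearson(x, y):
    # Plain Pearson correlation between two equal-length value lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_features(features, corr_threshold=0.8):
    # features: dict mapping feature name -> list of values across samples.
    variances = {name: pvariance(vals) for name, vals in features.items()}
    cutoff = median(variances.values())  # 50th variance percentile
    kept = [n for n, v in variances.items() if v >= cutoff]
    dropped = set()
    for a, b in combinations(kept, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) > corr_threshold:
            # Drop the lower-variance member of the redundant pair.
            dropped.add(a if variances[a] < variances[b] else b)
    return [n for n in kept if n not in dropped]
```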
Hands On: Survival task
  1. Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Type of Analysis”: Supervised training
      • param-file “Training clinical data”: train_clin_lgggbm.tabular
      • param-file “Test clinical data”: test_clin_lgggbm.tabular
      • param-file “Training omics data”: train_mut_lgggbm.tabular
      • param-file “Test omics data”: test_mut_lgggbm.tabular
      • “What type of assay is your input?”: mut
      • In “Multiple omics layers?”:
        • param-repeat “Insert Multiple omics layers?”
          • param-file “Training omics data”: train_cna_lgggbm.tabular
          • param-file “Test omics data”: test_cna_lgggbm.tabular
          • “What type of assay is your input?”: cna
      • “Model class”: DirectPred
      • “Column name in the train clinical data to use as survival event”: c8
      • “Column name in the train clinical data to use as survival time”: c7
      • In “Advanced Options”:
        • “Fusion method”: intermediate
        • “Variance threshold (as percentile) to drop low variance features.”: 0.5
        • “Correlation threshold to drop highly redundant features.”: 0.8
        • “Minimum number of features to retain after feature selection.”: 1000
        • “Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0
        • “Number of iterations for hyperparameter optimization.”: 5
      • In “Visualization”:
        • “Generate embeddings plot?”: No
        • “Generate kaplan meier curves plot?”: Yes
        • “Generate hazard ratio plot?”: Yes
          • “Omics layer to use for cox input”: mut
          • “Number of top important features to include in Cox model”: 5
          • “Performs K-fold cross-validation?”: No
        • “Generate scatter plot?”: No
        • “Generate concordance heatmap plot?”: No
        • “Generate precision-recall curves plot?”: No
        • “Generate ROC curves plot?”: No
        • “Generate boxplot?”: No
Question
  1. What are the main outputs of Flexynesis?
  2. What do the generated plots show?
  3. What information is in the job.stats output?
  4. What are the top 10 features in job.feature_importance.GradientShap?
  1. The results collection contains the following data:
    • job.embeddings_test (latent space of test data)
    • job.embeddings_train (latent space of train data)
    • job.feature_importance.GradientShap (feature importance calculated by GradientShap method)
    • job.feature_importance.IntegratedGradients (feature importance calculated by IntegratedGradients method)
    • job.feature_logs.cna
    • job.feature_logs.mut
    • job.predicted_labels (Prediction of the created model)
    • job.stats
  2. There are two figures in plot collection:
    • The first plot shows the Kaplan-Meier curves of the risk subtypes.
    • The second plot is a forest plot of the Cox-PH model fitted on the top 5 markers.
  3. It shows that we achieved a reasonable Harrell’s Concordance Index (C-index) on the test data.

  4. IDH2, IDH1, HIVEP3, PIK3CA, EGFR, RELN, ATRX, INSRR, TP53, COL6A3. IDH1, TP53, and ATRX are extensively studied and have been shown to be relevant in gliomas (Koschmann et al. 2016).
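The C-index reported in job.stats can be understood from a minimal sketch of its definition (an illustrative re-implementation, not the metric code Flexynesis uses): among usable patient pairs, it is the fraction where the patient predicted to be at higher risk actually experiences the event earlier.

```python
def c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: observed times; events: 1 = event observed, 0 = censored;
    risk_scores: higher score = higher predicted risk.
    """
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable if patient i has an observed event
            # strictly before patient j's time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties count as half-concordant
    return concordant / usable if usable else float("nan")
```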

We can also visualize the high and low risk groups with the embeddings.

Survival-risk subtypes

Extract predicted labels and calculate median

Let’s group the samples by predicted survival risk scores into 2 groups and visualize the sample embeddings colored by risk subtypes.

Hands On: Calculate median of predicted survival score

First we should extract the job.predicted_labels and then calculate the median.

  1. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: job.predicted_labels
  2. Datamash ( Galaxy version 1.9+galaxy0) with the following parameters:
    • param-file “Input tabular dataset”: job.predicted_labels (output of Extract dataset tool)
    • “Input file has a header line”: Yes
    • In “Operation to perform on each group”:
      • param-repeat “Insert Operation to perform on each group”
        • “Type”: Median
        • “On column”: Column: 6 (“predicted_label”)
Question
  1. What is the median?
  1. -0.66496020555496

Now we add a column to our job.predicted_labels data indicating the high- and low-risk groups (samples with predicted_label ≥ the median form the high-risk group).

Assign high and low risk groups

Hands On: Assign high/low groups
  1. Compute ( Galaxy version 2.1) with the following parameters:
    • param-file “Input file”: job.predicted_labels (output of Extract dataset tool)
    • “Input has a header line with column names?”: Yes
      • In “Expressions”:
        • param-repeat “Insert Expressions”
          • “Add expression”: float(c6)>=-0.66496020555496
          • “The new column name”: risk_groups
    • In “Error handling”:
      • “Autodetect column types”: No
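The Datamash median and the Compute expression above amount to the following grouping logic, sketched in plain Python (`statistics.median` stands in for Datamash, and the comparison mirrors the `float(c6) >= median` expression):

```python
from statistics import median

def assign_risk_groups(scores):
    # Split samples at the median predicted survival risk score:
    # scores at or above the median are labelled high risk.
    m = median(scores)
    return ["High_risk" if s >= m else "Low_risk" for s in scores]
```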

For better visualization, let’s change True to High_risk and False to Low_risk.

Hands On: Rename labels
  1. Replace Text ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to process”: table (output of Compute tool)
    • In “Replacement”:
      • param-repeat “Insert Replacement”
        • “in column”: Column: 9
        • “Find pattern”: True
        • “Replace with”: High_risk
      • param-repeat “Insert Replacement”
        • “in column”: Column: 9
        • “Find pattern”: False
        • “Replace with”: Low_risk

Now we visualize the embedding.

Hands On: Dimension reduction plot
  1. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: job.embeddings_test
  2. Flexynesis plot ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis plot”: Dimensionality reduction
      • param-file “Predicted labels”: table (output of Replace Text tool)
      • param-file “Embeddings”: job.embeddings_test (output of Extract dataset tool)
      • “Column in the labels file to use for coloring the points in the plot”: Column: 9
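Conceptually, the dimensionality reduction plot projects the high-dimensional test-set embeddings into two dimensions and colors each point by its risk group. A minimal PCA sketch of that projection, assuming NumPy is available (this is not the Flexynesis plot tool's actual implementation):

```python
import numpy as np

def pca_2d(embeddings):
    # Project sample embeddings onto their first two principal components.
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                    # center each latent dimension
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T                       # (n_samples, 2) coordinates

# Toy example: four samples in a 2-D latent space.
coords = pca_2d([[1.0, 0.1], [0.9, 0.0], [-1.0, -0.1], [-0.9, 0.2]])
```

Each row of `coords` could then be scattered and colored by the risk_groups column produced above.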

Conclusion

In this tutorial, we applied the Flexynesis framework to analyze survival markers in brain tumors using multi-modal genomic data from 506 Lower Grade Glioma (LGG) and 288 Glioblastoma Multiforme (GBM) samples. Through hands-on analysis of mutation and copy number alteration data, we separated the samples into high- and low-risk groups and ranked the genes by their importance, some of which are already known to affect survival in glioma.