Identifying Survival Markers of Brain Tumors with Flexynesis

Overview
Questions:
  • How can multi-modal genomic data be integrated to identify survival markers in brain tumors?

  • How can deep learning approaches improve survival prediction in cancer patients?

  • Which genomic features are most predictive of patient survival in glioma subtypes?

Objectives:
  • Apply multi-modal data integration techniques to combine mutation, gene expression, and clinical data

  • Perform survival analysis to identify prognostic biomarkers in brain tumors

  • Implement deep learning models for survival prediction using genomic data

Time estimation: 2 hours
Supporting Materials:
Published: Aug 13, 2025
Last modification: Aug 13, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 2

Here, we use the Flexynesis tool suite on a multi-omics dataset of 506 Brain Lower Grade Glioma (LGG) and 288 Glioblastoma Multiforme (GBM) samples with matching mutation and copy number alteration data. These data were downloaded from cBioPortal (cBioPortal community) and split into train (70% of the samples) and test (30% of the samples) sets.

We will work on CNA (copy number alteration) and MUT (mutation) data; the mutation data is a binary gene × sample matrix.
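As a quick sanity check outside Galaxy, the matrices can be inspected with a few lines of Python. This is an illustrative sketch, not one of the tutorial's Galaxy tools; it assumes a tab-separated file laid out like train_mut_lgggbm.tabular, with genes as rows, samples as columns, and 0/1 mutation flags in the cells.

```python
import csv

# Peek at the first rows of a tab-separated omics matrix
# (genes as rows, sample ids as columns, binary 0/1 entries).
def peek_matrix(path, n_rows=3):
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)  # first cell labels the gene column, the rest are sample ids
        rows = [next(reader) for _ in range(n_rows)]
    return header, rows
```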

This training is inspired by the original flexynesis analysis notebook: survival_subtypes_LGG_GBM.ipynb.

Comment: Getting data from cBioPortal

If you want to download data directly from cBioPortal, you can go through the “Prepare data from CbioPortal for Flexynesis integration” training.

Agenda

In this tutorial, we will cover:

  1. Get data
  2. Train Flexynesis model with a survival task
  3. Survival-risk subtypes
    1. Extract predicted labels and calculate median
    2. Assign high and low risk groups
  4. Conclusion
Warning: LICENSE

Flexynesis is only available for NON-COMMERCIAL use. Permission is only granted for academic, research, and educational purposes. Before using, be sure to review, agree to, and comply with the license. For commercial use, please review the flexynesis license on GitHub and contact the copyright holders.

Get data

Hands On: Data Upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo:

    https://zenodo.org/records/16287482/files/train_mut_lgggbm.tabular
    https://zenodo.org/records/16287482/files/train_clin_lgggbm.tabular
    https://zenodo.org/records/16287482/files/test_cna_lgggbm.tabular
    https://zenodo.org/records/16287482/files/test_mut_lgggbm.tabular
    https://zenodo.org/records/16287482/files/test_clin_lgggbm.tabular
    https://zenodo.org/records/16287482/files/train_cna_lgggbm.tabular
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Check that the datatype is tabular

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select tabular from “New Type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  4. Add to each dataset a tag corresponding to its modality (#mut, #cna, and #clinical)

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

Train Flexynesis model with a survival task

Now it is time to import the train and test datasets. We rank genes by Laplacian score and pick the top 10% of the genes, while removing highly redundant genes with a correlation threshold of 0.8 and a variance threshold of 50%. We will also use intermediate fusion of the omic layers. For hyperparameter optimization, we use the following parameters.

  • Model class: We pick DirectPred (a fully connected network)

  • Column name in the train clinical data to use as survival event and Column name in the train clinical data to use as survival time: the survival status (consisting of 0s and 1s) and the time since last follow-up, respectively. It is important that the clinical data contains both variables as numerical values.
  • Column name in the train clinical data to use for predictions, multiple targets are allowed: We can concurrently train the same network to be able to predict other variables such as histological diagnosis, however, here we just focus on the survival endpoints, so we pass an empty list.
  • Number of iterations for hyperparameter optimization: We do 5 iterations of hyperparameter optimization. This is a reasonable number for demonstration purposes, but it could be beneficial to increase this value in order to discover even better models.
  • How many epochs to wait when no improvements in validation loss are observed: If training does not show any signs of improved performance on the validation part of the train dataset for at least 10 epochs, we stop the training. This not only significantly decreases the time spent on training by avoiding unnecessary continuation of unpromising runs, but also helps avoid over-fitting the network to the training data.
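The variance and correlation filtering described above can be sketched in plain Python. This is an illustrative re-implementation of the filtering logic, not Flexynesis' actual code: features below the 50th variance percentile are dropped, then for each pair of remaining features whose absolute Pearson correlation exceeds 0.8, the lower-variance member is removed.

```python
from statistics import pvariance, median
from itertools import combinations

def pearson(x, y):
    # Plain Pearson correlation between two equal-length value lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_features(features, corr_threshold=0.8):
    # features: dict mapping feature name -> list of values across samples.
    variances = {name: pvariance(vals) for name, vals in features.items()}
    cutoff = median(variances.values())  # 50th variance percentile
    kept = [n for n, v in variances.items() if v >= cutoff]
    dropped = set()
    for a, b in combinations(kept, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) > corr_threshold:
            # Drop the lower-variance member of the redundant pair.
            dropped.add(a if variances[a] < variances[b] else b)
    return [n for n in kept if n not in dropped]
```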
Hands On: Survival task
  1. Flexynesis ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Type of Analysis”: Supervised training
      • param-file “Training clinical data”: train_clin_lgggbm.tabular
      • param-file “Test clinical data”: test_clin_lgggbm.tabular
      • param-file “Training omics data”: train_mut_lgggbm.tabular
      • param-file “Test omics data”: test_mut_lgggbm.tabular
      • “What type of assay is your input?”: mut
      • In “Multiple omics layers?”:
        • param-repeat “Insert Multiple omics layers?”
          • param-file “Training omics data”: train_cna_lgggbm.tabular
          • param-file “Test omics data”: test_cna_lgggbm.tabular
          • “What type of assay is your input?”: cna
      • “Model class”: DirectPred
      • “Column name in the train clinical data to use as survival event”: c8
      • “Column name in the train clinical data to use as survival time”: c7
      • In “Advanced Options”:
        • “Fusion method”: intermediate
        • “Variance threshold (as percentile) to drop low variance features.”: 0.5
        • “Correlation threshold to drop highly redundant features.”: 0.8
        • “Minimum number of features to retain after feature selection.”: 1000
        • “Top percentile features (among the features remaining after variance filtering and data cleanup) to retain after feature selection.”: 10.0
        • “Number of iterations for hyperparameter optimization.”: 5
      • In “Visualization”:
        • “Generate embeddings plot?”: No
        • “Generate kaplan meier curves plot?”: Yes
        • “Generate hazard ratio plot?”: Yes
          • “Omics layer to use for cox input”: mut
          • “Number of top important features to include in Cox model”: 5
          • “Performs K-fold cross-validation?”: No
        • “Generate scatter plot?”: No
        • “Generate concordance heatmap plot?”: No
        • “Generate precision-recall curves plot?”: No
        • “Generate ROC curves plot?”: No
        • “Generate boxplot?”: No
Question
  1. What are the main outputs of Flexynesis?
  2. What do the generated plots show?
  3. What information is in the job.stats output?
  4. What are the top 10 features in job.feature_importance.GradientShap?
  1. The results collection contains the following data:
    • job.embeddings_test (latent space of test data)
    • job.embeddings_train (latent space of train data)
    • job.feature_importance.GradientShap (feature importance calculated by GradientShap method)
    • job.feature_importance.IntegratedGradients (feature importance calculated by IntegratedGradients method)
    • job.feature_logs.cna
    • job.feature_logs.mut
    • job.predicted_labels (Prediction of the created model)
    • job.stats
  2. There are two figures in plot collection:
    • The first plot shows the Kaplan-Meier curves of the risk subtypes.
    • The second plot is a forest plot of the Cox-PH model fitted on the top 5 markers.
  3. It shows that we achieved a reasonable Harrell’s Concordance Index (C-index) on the test data.

  4. IDH2, IDH1, HIVEP3, PIK3CA, EGFR, RELN, ATRX, INSRR, TP53, COL6A3. IDH1, TP53, and ATRX are extensively studied and have been shown to be relevant in gliomas (Koschmann et al. 2016).
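The C-index reported in job.stats can be understood from a minimal sketch of its definition (an illustrative re-implementation, not the metric code Flexynesis uses): among usable patient pairs, it is the fraction where the patient predicted to be at higher risk actually experiences the event earlier.

```python
def c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: observed times; events: 1 = event observed, 0 = censored;
    risk_scores: higher score = higher predicted risk.
    """
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable if patient i has an observed event
            # strictly before patient j's time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties count as half-concordant
    return concordant / usable if usable else float("nan")
```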

We can also visualize the high and low risk groups with the embeddings.

Survival-risk subtypes

Extract predicted labels and calculate median

Let’s group the samples by predicted survival risk scores into 2 groups and visualize the sample embeddings colored by risk subtypes.

Hands On: Calculate median of predicted survival score

First we should extract the job.predicted_labels and then calculate the median.

  1. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: job.predicted_labels
  2. Datamash ( Galaxy version 1.9+galaxy0) with the following parameters:
    • param-file “Input tabular dataset”: job.predicted_labels (output of Extract dataset tool)
    • “Input file has a header line”: Yes
    • In “Operation to perform on each group”:
      • param-repeat “Insert Operation to perform on each group”
        • “Type”: Median
        • “On column”: Column: 6 (“predicted_label”)
Question
  1. What is the median?
  1. -0.66496020555496

Now we add a column to our job.predicted_labels data indicating the high- and low-risk groups (samples with predicted_label ≥ the median form the high-risk group).

Assign high and low risk groups

Hands On: Assign high/low groups
  1. Compute ( Galaxy version 2.1) with the following parameters:
    • param-file “Input file”: job.predicted_labels (output of Extract dataset tool)
    • “Input has a header line with column names?”: Yes
      • In “Expressions”:
        • param-repeat “Insert Expressions”
          • “Add expression”: float(c6)>=-0.66496020555496
          • “The new column name”: risk_groups
    • In “Error handling”:
      • “Autodetect column types”: No
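The Datamash median and the Compute expression above amount to the following grouping logic, sketched in plain Python (`statistics.median` stands in for Datamash, and the comparison mirrors the `float(c6) >= median` expression):

```python
from statistics import median

def assign_risk_groups(scores):
    # Split samples at the median predicted survival risk score:
    # scores at or above the median are labelled high risk.
    m = median(scores)
    return ["High_risk" if s >= m else "Low_risk" for s in scores]
```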

For better visualization, let’s change True to High_risk and False to Low_risk.

Hands On: Rename labels
  1. Replace Text ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to process”: table (output of Compute tool)
    • In “Replacement”:
      • param-repeat “Insert Replacement”
        • “in column”: Column: 9
        • “Find pattern”: True
        • “Replace with”: High_risk
      • param-repeat “Insert Replacement”
        • “in column”: Column: 9
        • “Find pattern”: False
        • “Replace with”: Low_risk

Now we visualize the embedding.

Hands On: Dimension reduction plot
  1. Extract dataset with the following parameters:
    • param-file “Input List”: results (output of Flexynesis tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: job.embeddings_test
  2. Flexynesis plot ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis plot”: Dimensionality reduction
      • param-file “Predicted labels”: table (output of Replace Text tool)
      • param-file “Embeddings”: job.embeddings_test (output of Extract dataset tool)
      • “Column in the labels file to use for coloring the points in the plot”: Column: 9
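Conceptually, the dimensionality reduction plot projects the high-dimensional test-set embeddings into two dimensions and colors each point by its risk group. A minimal PCA sketch of that projection, assuming NumPy is available (this is not the Flexynesis plot tool's actual implementation):

```python
import numpy as np

def pca_2d(embeddings):
    # Project sample embeddings onto their first two principal components.
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                    # center each latent dimension
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T                       # (n_samples, 2) coordinates

# Toy example: four samples in a 2-D latent space.
coords = pca_2d([[1.0, 0.1], [0.9, 0.0], [-1.0, -0.1], [-0.9, 0.2]])
```

Each row of `coords` could then be scattered and colored by the risk_groups column produced above.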

Conclusion

In this tutorial, we applied the Flexynesis framework to analyze survival markers in brain tumors using multi-modal genomic data from 506 Lower Grade Glioma (LGG) and 288 Glioblastoma Multiforme (GBM) samples. Through hands-on analysis of mutation and copy number alteration data, we separated the samples into high- and low-risk groups and ranked the genes by their importance, some of which are already known to affect survival in glioma.