EWAS data analysis of 450k data


question Questions
  • What is a epigenome-wide association studies?

  • Why is a EWAS analysis useful?

  • Why 450k data?

objectives Objectives
  • Learn how to perform reproducible EWAS analysis

requirements Requirements

time Time estimation: 1 hour

Supporting Materials

last_modification Last modification: Sep 11, 2019


The field of cancer genomics has demonstrated the power of massively parallel sequencing techniques to inform on genes and specific alterations that drive tumor onset and progression. Although large comprehensive sequence data sets continue to be made increasingly available, data analysis remains an ongoing challenge, particularly for laboratories lacking dedicated resources and bioinformatics expertise. To address this, we have provide training based on Galaxy tools EWAS suite that represent many popular algorithms for detecting somatic genetic alterations from cancer genome and exome data.

Figure 1: How epigenetics mechanism can effect health (adapted from https://commonfund.nih.gov/epigenomics/figure)

This exercise uses datasets from the Cell publication by Hugo, Willy, et al., 2015 Hugo et al. 2015. with the goal being to identify differentially methylated regions and positions associated with treatment resistant melanomas. Datasets include the Illumina 450K Methylation array performed in melanoma tumors in a sample of patients pre and post MAPKi and BRAFi treatment with different outcomes (sensitive and resistant). For each sample there is raw green and red colour arrays containing the summarized bead information generated by the Illumina 450k scanner, i.e. there are 8 files in total:

The Infinium Methylation Assay 450k uses two different bead types to detect changes in DNA methylation levels. In the figure we can see M - methylated and U - unmethylated bead types. In our study unmethylated and methylated bead signals are reported as green and red colors respectively.

Figure 2: Infinium Methylation Assay Overview (adapted from https://emea.illumina.com/science/technology/beadarray-technology/infinium-methylation-assay.html?langsel=/gb/)
Accession Sensitivity Treatment
GSM1588704 baseline pre-treatment
GSM1588705 baseline pre-treatment
GSM1588706 resistant BRAFi
GSM1588707 resistant BRAFi


  1. 450K Array Data Upload
  2. Preprocessing and Quality Assessment
  3. Removing probes affected by genetic variation
  4. DMPs and DMRs Identification
  5. Annotation and Visualization

450K Array Data Upload

The first step of EWAS data analysis is the raw methylation data loading (intensity information files for each two color micro array)

hands_on Hands-on: Data Loading

  1. Create a new history for this tutorial and give it a proper name

    tip Tip: Creating a new history

    Click the new-history icon at the top of the history panel

    If the new-history is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
  2. Import the following IDAT files from Zenodo or from the data library (ask your instructor)
    • GSM1588704_8795207135_R01C02_Red.idat
    • GSM1588705_8795207119_R05C02_Red.idat
    • GSM1588706_8795207135_R02C02_Red.idat
    • GSM1588707_8795207119_R06C02_Red.idat
    • GSM1588704_8795207135_R01C02_Grn.idat
    • GSM1588705_8795207119_R05C02_Grn.idat
    • GSM1588706_8795207135_R02C02_Grn.idat
    • GSM1588707_8795207119_R06C02_Grn.idat

    details List of Zenodo URLs

    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

    tip Tip: Importing data from a data library

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries

    • Find the correct folder (ask your instructor)

    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import
  3. Run minfi read450k tool with the following parameters:
    • param-files “red channel files”: all files ending in _Red
    • param-files “green channel files”: all files ending in Grn
  4. Inspect the generated set of data

question Questions

How are the Green and Red signals stored?

solution Solution

Green and Red micro arrays are built up into RGChannelSet

Preprocessing and Quality Assessment

Preprocessing and data quality assurance is an important step in Infinium Methylation Assay analysis.

RGChannelSet represents two colour data with a green and a red channel and can be converted into methylated and unmethylated signals assigned to MethylSet or into Beta values build in RatioSet. Users can convert from RGChannelSet into MethylSet using the minfi mset tool or compute Beta values using minfi set tool. The minfi qc tool tool extracts and plots the quality control data frame with two columns mMed and uMed which are the medians of MethylSet signals (Meth and Unmeth). Comparing them against one another allows users to detect and remove low-quality samples.

hands_on Hands-on: Preprocessing

  1. Run minfi mset tool on the output from minfi_read450K to create MethylSet object
  2. Run minfi qc tool on the ‘MethylSet’ object created in the previous step to estimate sample-specific quality control. Once it is done click on the galaxy-eye (eye) icon next to the QC filename in your history to view QC image.
  3. Once you have inspected the quality control run minfi rset tool on the ‘MethylSet’output from minfi_read450K to convert methylation data from the MethylSet to a ratio of the methylated probe intensity and the overall intensity (sum of methylated and unmethylated probe intensities)
  4. Run minfi maptogenome tool on the output from the previous step to map ratio data to the genomic positions

comment Preprocess and Normalize data

If your files require normalisation, you might prefer to use one of the other preprocessing tools provided in EWAS suite i.e. minfi ppfun tool or minfi ppquantile tool look for recommendation at (ref).

Quality Control
Figure 3: Quality control plot

Removing probes affected by genetic variation

Incomplete annotation of genetic variations such as single nucleotide polymorphism (SNP) may affect DNA measurements and interfere results from downstream analysis.

Hansen, J. P. 2014 highly recommended to remove the probes that contain either a SNP at the methylated loci interrogation or at the single nucleotide extension

hands_on Hands-on: Removing probes affected by genetic variation

  1. Run minfi dropsnp tool on the output of minfi_maptogenome to remove the probes that contain either a SNP at the methylated loci interrogation or at the single nucleotide extension

DMPs and DMRs Identification

The main goal of the EWAS suite is to simplify the way differentially methylated loci sites are detected. The EWAS suite contains minfi dmp tool tool detecting differentially methylated positions (DMPs) with respect to a phenotype covariate, and more complex minfi dmr tool solution for finding differentially methylated regions (DMRs). Genomic regions that are differentially methylated between two conditions can be tracked using a bumphunting algorithm. The algorithm first implements a t-statistic at each methylated loci location, with optional smoothing, then groups probe into clusters with a maximum location gap and a cutoff size to refer the lowest possible value of genomic profile hunted by our tool.

From the next step we will need an additional file that annotates illumina450K with genomic features, which we can obtain from UCSC.

hands_on Hands-on: Data upload from UCSC

  1. Search for UCSC Main in the tool search bar (top left)

  2. Click on UCSC Main tool. You will be taken to the UCSC table browser

  3. Set the following options:
    • “clade”: Mammal
    • “genome”: Human
    • “assembly”: Feb. 2009 (GRCh37/hg19)
    • “group”: Regulation
    • “track”: HAIB Methyl450
    • “table”: GM12878 (wgEncodeHaibMethyl450Gm12878SitesRep1)
    • “region”: genome
    • “output format”: GTF - gene transfer (limited)
    • “Send output to”: Galaxy (only)
  4. Click on the get output button at the bottom of the screen
  5. On the next page, click on the Send Query to Galaxy button
  6. Wait for the upload to finish

hands_on Hands-on: DMPs and DMRs Identification

  1. Import phenotypeTable.txt from Zenodo or data library:
  2. Run minfi dmp tool on the output of the minfi dropsnp tool adjusting the following parameters
    • param-file “Input set”:GenomicRatioSet
    • param-file “Phenotype Table”:phenotypeTable.txt
    • “Phenotype Type”:categorical
    • “qCutoff Size”:0.5 (DMPs with an FDR q-value greater than this will not be returned)
    • “Variance Shrinkage”: TRUE (This is recommended when sample sizes are small <10}
    • “Genome Table”: wgEncodeHaibMethyl450 ...
  3. Run minfi dmr tool on the output of minfi_dmp adjusting the following parameters
    • param-file “Input set”:GenomicRatioSet
    • param-file “Phenotype Table”:phenotypeTable.txt
    • “factor1”: sensitive
    • “factor2”: resistant
    • “maxGap Size”:250
    • “coef Size”:2
    • “Cutoff Size”:0.1
    • “nullMethod”:permutation
    • “verbose”:TRUE
  4. Click on the minfi dmr tool output in your history to expand it
  5. Set the database build of your dataset to Human Feb. 2009 (GRCh37/hg19) (hg19)

    tip Tip: Changing Database/Build (dbkey)

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Database/Build field
    • Select your desired database key from the dropdown list: hg19
    • Click the Save button
  6. Click on display at UCSC towards the bottom of the history item. This will launch UCSC Genome Browser with your Custom Track
Display at UCSC
Figure 4: UCSC genome track showing differentialy methylated regions located on chromosome 6

comment Phenotype table

Phenotype table can be in different sizes with different arguments, however the second column is required to contain phenotype covariate information for each sample.

question Questions

How do we define phenotype covariate?

solution Solution

Phenotype covariate is the set of observable characteristics of an individual resulting from the gene-environment interactions

Annotation and Visualization

In addition to downstream analysis users can annotate the differentially methylated loci at the promoter regions of genes with gene function description, and relationships between these concepts.

hands_on Hands-on: Annotate Differentially Methylated Position

  1. Run chipeakanno annopeaks toolon the output of minfi_dmp with the following parameters
    • param-file “Differentialy methylated data”: output of minfi dmp tool
    • “bindingType”: StartSite
    • “bindingRegionStart”:-5000
    • “bindingRegionEnd”:3000
    • “Additional Column of Score”:8

      Position of column of score optional value if it is required

  2. Cut tool on the previous output adjusting the following parameters to cut “gene_name” column from table of annotated peaks and then get a list of geenes
    • “Cut columns”: c16
    • “Delimited by”: Tab
    • param-file “From”: output of chipeakanno annopeaks tool
  3. Remove beginning tool of Gene List with the following parameters
    • “Remove first”: 1
    • param-file “from”: output of Cut tool
  4. Run clusterProfiler bitr tool on the previous output adjusting the following parameters to convert the list of genes to list of entrez ID
    • “Input Type Gene ID”: SYMBOL
    • “Output Type Gene ID”: ENTREZID
  5. Use the output of the clusterProfiler bitr tool to run a GO Enrichment Analysis using clusterProfiler go
Functional annotations
Figure 5: Results of GO enrichments analysis for DMPs
ID Description pvalue qvalue geneID Count
GO:0048732 gland development 1.38E-58 4.23E-55 PTGS2 / KCNC1 / FZD1 /SLC22A18 /SLC22A3 (…) 372
GO:1901652 response to peptide 3.99E-57 8.13E-54 SULF1/ LAMA5/ MED1 /CFLAR/ MSX2 (…) 359
GO:0048545 response to steroid hormone 1.38EE-54 2.11E-51 HDAC9/ RAB10/ CFLAR/ WDTC1 (…) 394


Epigenetic aberrations which involve DNA modifications give researchers an interest in identifying novel non-genetic factors responsible for complex human phenotypes such as height, weight, and disease. To identify methylation changes researchers need to perform complicated and time consuming computational analysis. Here, the EWAS suite becomes a solution for this inconvenience and provides a simplified downstream analysis available as a ready to run pipline in supplementary materials.

keypoints Key points

  • EWAS analyse changes in DNA methylation in cohorts of different individuals effected by diseases

  • DNA methylation changes can be triggered by environment (i.e. sun exposure, diet) and effect the way how our cells works and in long term cause cancer


  1. Hugo, W., H. Shi, L. Sun, M. Piva, C. Song et al., 2015 Non-genomic and Immune Evolution of Melanoma Acquiring MAPKi Resistance. Cell 162: 1271–1285. 10.1016/j.cell.2015.07.061

Citing this Tutorial

  1. Katarzyna Murat, Krzysztof Poterlowicz, 2019 EWAS data analysis of 450k data (Galaxy Training Materials). /training-material/topics/epigenetics/tutorials/ewas-suite/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

details BibTeX

    author = "Katarzyna Murat and Krzysztof Poterlowicz",
    title = "EWAS data analysis of 450k data (Galaxy Training Materials)",
    year = "2019",
    month = "09",
    day = "11"
    url = "\url{/training-material/topics/epigenetics/tutorials/ewas-suite/tutorial.html}",
    note = "[Online; accessed TODAY]"
        doi = {10.1016/j.cels.2018.05.012},
        url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
        year = 2018,
        month = {jun},
        publisher = {Elsevier {BV}},
        volume = {6},
        number = {6},
        pages = {752--758.e1},
        author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
        title = {Community-Driven Data Analysis Training for Biology},
        journal = {Cell Systems}

congratulations Congratulations on successfully completing this tutorial!

Did you use this material as an instructor? Feel free to give us feedback on how it went.