MaxQuant and MSstats for the analysis of TMT data

Overview

Questions:
  • How to analyze tandem-mass-tag (TMT) labelled shotgun (DDA) data in MaxQuant and MSstats?

  • Which proteins are differentially abundant between KMT9α knock-down and control cells?

Objectives:
  • Learn how to use MaxQuant and MSstats for the analysis of TMT labelled shotgun (DDA) data

  • Learn how to create an experimental design template for fractionated data in MaxQuant

  • Learn how to create an annotation file for MSstatsTMT

Requirements:
Time estimation: 1 hour
Supporting Materials:
Last modification: Aug 13, 2021
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

In this training we will cover the full analysis workflow for tandem mass tag (TMT) labeled samples using MaxQuant in conjunction with MSstatsTMT.

The training dataset comes from a study that investigates the effects of a recently discovered histone methyltransferase, lysine methyltransferase 9 (KMT9α), in A549 cells. High KMT9α expression in lung cancer correlates with poor patient survival. KMT9α is therefore an interesting research target, particularly because lung cancer survival rates could not be improved over the last 15 years. The example dataset consists of 4 cell culture samples from KMT9α knock-down cells and 4 cell culture samples from control (Ctrl) cells. To increase proteome coverage, the TMT experiment was pre-fractionated using offline high pH reversed-phase chromatography, resulting in 12 MS runs.

Agenda

In this tutorial, we will cover:

  1. Get data
  2. MaxQuant analysis
  3. MSstatsTMT analysis
  4. Follow up on MSstatsTMT results

Get data

The raw data is available via the PRIDE repository under the ID PXD014145. The MaxQuant experimental design template file, MSstatsTMT annotation file and FASTA file for this training are deposited at Zenodo (https://zenodo.org/record/5195800). It is of course possible to use another FASTA file with human proteome sequences, but to ensure that the results are compatible, we recommend using the provided FASTA file. MaxQuant not only adds known contaminants to the FASTA file but also generates the “decoy” hits for false discovery rate estimation itself; therefore, the FASTA file must not contain decoy entries. To learn more about FASTA files, have a look at the Protein FASTA Database Handling tutorial.
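
Since the FASTA file must not contain decoy entries, it can be useful to scan a custom database before uploading it. The following is only a minimal sketch: the decoy prefixes (`REV_`, `DECOY_`, `XXX_`) are assumptions about common naming conventions and are not defined by this tutorial, and the example headers are made up for illustration.

```python
from typing import Iterable, List

# Assumed decoy markers; adjust them to the convention used by your database tool.
DECOY_PREFIXES = ("REV_", "DECOY_", "XXX_")


def find_decoy_headers(lines: Iterable[str], prefixes=DECOY_PREFIXES) -> List[str]:
    """Return FASTA header lines whose identifier starts with a decoy prefix."""
    hits = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        identifier = line[1:].strip().upper()
        if identifier.startswith(prefixes):
            hits.append(line.rstrip("\n"))
    return hits


# Hypothetical example records (not taken from the tutorial database).
example = [
    ">sp|P12345|EXAMPLE_HUMAN Example protein\n",
    "MKTAYIAKQR\n",
    ">REV_sp|P12345|EXAMPLE_HUMAN\n",
    "RQKAIYATKM\n",
]
print(find_decoy_headers(example))
```

An empty result suggests that no entry matches the assumed prefixes; it does not prove that the file is free of decoys named differently.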

Hands-on: Data upload

  1. Create a new history for this tutorial and give it a meaningful name

    Tip: Creating a new history

    Click the new history icon at the top of the history panel.

    If the new history icon is missing:

    1. Click on the gear icon (History options) at the top of the history panel
    2. Select the option Create New from the menu
  2. Import the FASTA database, experimental design template, and annotation file from Zenodo

    https://zenodo.org/record/5195800/files/protein_database.fasta
    https://zenodo.org/record/5195800/files/MaxQuant%20experimental%20design%20template.tabular
    https://zenodo.org/record/5195800/files/MSstatsTMT%20annotation%20file.tabular
    
    • Copy the link location
    • Open the Galaxy Upload Manager (upload icon at the top right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    • By default, Galaxy uses the URL as the name, so rename the files with a more useful name.
  3. Once the files are green, rename the FASTA file to ‘protein database’, the experimental design template to ‘experimental design template’, and the annotation file to ‘annotation file’.

    Tip: Renaming a dataset

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button
  4. To run MaxQuant, import the raw data from PRIDE.

     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA380.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA381.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA382.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA383.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA384.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA385.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA386.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA387.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA388.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA389.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA390.raw
     ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145/MFA391.raw
    
  5. Rename the raw datasets to ‘MFA380.raw’, ‘MFA381.raw’, etc. The raw files have to be named exactly this way so that they later match the file names provided in the MSstats annotation file.

    Tip: Renaming a dataset

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button
  6. Check that the datatype of the raw files is ‘thermo.raw’; otherwise, change the datatype to ‘thermo.raw’

    Tip: Changing the datatype

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click on the Datatypes tab at the top
    • Select thermo.raw
    • Click the Save button
  7. Generate a collection containing all raw files, name it ‘raw_files’, and hide the individual raw files

    Tip: Creating a dataset collection

    • Click on Operations on multiple datasets (check box icon) at the top of the history panel
    • Check all the datasets in your history you would like to include
    • Click For all selected... and choose Build dataset list

    • Enter a name for your collection
    • Click Create List to build your collection
    • Click on the checkmark icon at the top of your history again
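
The twelve raw file links in step 4 follow a regular pattern, so the list to paste into the Galaxy upload field can also be generated. This is a small convenience sketch; the base URL and the file numbers 380 to 391 are taken from the list above, while the function names are made up for illustration.

```python
# Base location of the raw files on the PRIDE FTP server (from step 4).
BASE_URL = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014145"


def raw_file_names(first: int = 380, last: int = 391) -> list:
    """Return the raw file names MFA380.raw ... MFA391.raw."""
    return [f"MFA{number}.raw" for number in range(first, last + 1)]


def raw_file_urls(base_url: str = BASE_URL) -> list:
    """Return one download link per raw file."""
    return [f"{base_url}/{name}" for name in raw_file_names()]


# Print the links, one per line, ready to paste into Paste/Fetch Data.
for url in raw_file_urls():
    print(url)
```

The same `raw_file_names()` list can be reused to check that the datasets in the history carry exactly the names expected by the annotation file.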

MaxQuant analysis

We start the MaxQuant run with TMT parameters for MS2-based reporter quantitation. A quality control report will be generated with the PTXQC functionality that is directly implemented in the MaxQuant Galaxy tool. For more information about MaxQuant and its parameters, please have a look at the Label-free data analysis using MaxQuant tutorial. To continue with the statistical analysis in MSstatsTMT, the Protein Groups and Evidence files from MaxQuant are needed. The run time of MaxQuant depends on the number and size of the input files and on the chosen parameters. Running the training dataset will take a while, but the training can be continued directly with the MaxQuant result files from Zenodo.

Hands-on: MaxQuant analysis

  1. MaxQuant Tool: toolshed.g2.bx.psu.edu/repos/galaxyp/maxquant/maxquant/1.6.17.0+galaxy2 with the following parameters:
    • In “Input Options”:
      • “FASTA files”: protein database
    • In “Search Options”:
      • “Specify an experimental design template”: experimental design template
    • In “Parameter Group”:
      • “Quantitation Methods”: reporter ion MS2
        • “isobaric labeling”: TMT11plex
        • “Filter by PIF”: Yes
    • “Generate PTXQC (proteomics quality control pipeline) report? (experimental setting)”: Yes
    • In “Output Options”:
      • “Select the desired outputs.”: Protein Groups, mqpar.xml, Evidence, MaxQuant and PTXQC log

    Tip: Continue with results from Zenodo

    Because the MaxQuant run takes quite a while, we recommend downloading the MaxQuant results from Zenodo and continuing with the tutorial.

    1. Import the files from Zenodo
    https://zenodo.org/record/5195800/files/MaxQuant%20Result%20evidence.tabular
    https://zenodo.org/record/5195800/files/MaxQuant%20Result%20protein%20groups.tabular
    

Questions

  1. How many protein groups were found with MaxQuant?

Solution

  1. 6911 (numbers might slightly vary with different MaxQuant versions); number of lines minus header line in the MaxQuant protein groups output
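
The counting rule from the solution (number of lines minus the header line) can be sketched as follows. The example uses a tiny in-memory table with made-up content; for a real analysis you would open the downloaded protein groups file instead.

```python
import io


def count_data_rows(handle) -> int:
    """Count the non-empty lines that follow the header line."""
    next(handle, None)  # skip the header line
    return sum(1 for line in handle if line.strip())


# Toy stand-in for the protein groups table (two protein groups).
example = io.StringIO("Protein IDs\tScore\nP1\t10\nP2;P3\t7\n")
print(count_data_rows(example))  # 2
```

Note that each line is one protein group, which may list several protein IDs.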

Tip: PIF parameter for TMT data

Co-isolation of multiple peptides can lead to inaccurate TMT signals: when several different peptides are fragmented together, their TMT reporter signals can no longer be distinguished. MaxQuant can check the MS1 scans to estimate whether multiple different peptides were co-fragmented. Filtering by precursor intensity fraction (PIF) lets you set a cutoff for the purity of the isolated precursor (1 = no co-isolation, 0.75 = default).
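
To make the cutoff more concrete, here is a rough sketch of the idea behind such a purity filter. It assumes that the PIF is the intensity of the targeted precursor divided by the total intensity in the isolation window, and that values at or above the cutoff are kept; the exact calculation inside MaxQuant may differ, and the intensities are invented for illustration.

```python
def precursor_intensity_fraction(target_intensity, coisolated_intensities):
    """Fraction of the isolation-window intensity that belongs to the target precursor."""
    total = target_intensity + sum(coisolated_intensities)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return target_intensity / total


def passes_pif_filter(pif, cutoff=0.75):
    """Keep a spectrum only if its purity reaches the cutoff (0.75 = default)."""
    return pif >= cutoff


# Hypothetical spectra: (target intensity, intensities of co-isolated peptides).
spectra = [(9e5, [1e5]), (6e5, [4e5]), (5e5, [])]
for target, others in spectra:
    pif = precursor_intensity_fraction(target, others)
    print(f"PIF = {pif:.2f}, kept = {passes_pif_filter(pif)}")
```

In this toy example the first and third spectrum pass the default cutoff, while the second (60 % purity) is discarded.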

Tip: Preparing a MaxQuant experimental design template for your own analysis

For this tutorial all files needed for the analysis are provided. Nevertheless, here are some guidelines for generating the MaxQuant experimental design template for your own data analyses. For thermo.raw files, “.raw” is considered part of the file name. Therefore, file names should be ‘Filename1.raw’, ‘Filename2.raw’, etc. Please pay special attention to your experimental design, since MaxQuant will summarize peptide and protein intensities differently depending on this input. In brief, MaxQuant will summarize the protein intensities within each specified experiment.

Often, TMT experiments are not measured in one MS run but in multiple runs (either as so-called fractions or as replicates). Over the last years, high pH prefractionation has become the standard for fractionating TMT samples. Alternatively, SCX or HILIC fractionation can be employed. Although you fractionated your samples prior to measurement, you still want MaxQuant to group all measurements that belong to one TMT experiment. Since the fractions originate from one sample, the experimental design template allows you to map multiple measurements to one experiment (= sample). It is also possible to describe a more complex experimental setup with multiple TMT experiments, each possibly containing multiple fractions. All MS files that are labeled with the same “Experiment” name are summarized together in the proteinGroups and peptides outputs of MaxQuant.
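
As an illustration of mapping several fractions to one experiment, the sketch below writes a tab-separated template for the 12 raw files of this training. The column names (`Name`, `Fraction`, `Experiment`, `PTM`), the `False` value in the PTM column, the experiment name `KMT9_TMT`, and the use of the file order as fraction order are all assumptions; compare the result with the template provided on Zenodo before using it.

```python
import csv
import io

# Assumed column layout of the template; compare with the file from Zenodo.
COLUMNS = ["Name", "Fraction", "Experiment", "PTM"]


def build_design_rows(raw_names, experiment):
    """Map every raw file to one fraction of a single TMT experiment."""
    return [
        {"Name": name, "Fraction": str(index), "Experiment": experiment, "PTM": "False"}
        for index, name in enumerate(raw_names, start=1)
    ]


def to_tabular(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# File names include ".raw", as required for thermo.raw files.
raw_names = [f"MFA{number}.raw" for number in range(380, 392)]
# "KMT9_TMT" is a hypothetical experiment name shared by all fractions.
print(to_tabular(build_design_rows(raw_names, "KMT9_TMT")))
```

Because all rows share the same experiment name, the 12 fractions would be summarized as one TMT experiment. A setup with several TMT experiments would use one experiment name per mixture.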

MSstatsTMT analysis

The protein groups and evidence files from MaxQuant can be used directly as input for MSstatsTMT. MSstatsTMT automatically removes all proteins that are labelled as contaminants (‘+’ sign in the column ‘potential contaminant’ of both MaxQuant outputs). If you want to prevent this step, either run MaxQuant with “Include Contaminants: No” or remove the ‘+’ sign from the MaxQuant files as described in the label-free MSstats training. Additionally, MSstatsTMT combines all quantitations according to the annotation file, which you have to provide (e.g. all fractions of the same mixture are combined). For this dataset, the standard parameters of MSstatsTMT are appropriate, as this is a straightforward two-group comparison. If you have more than two groups, consider using the comparison matrix option of MSstatsTMT (for more information see the label-free MSstats training).

Hands-on: MSstatsTMT analysis

  1. MSstatsTMT Tool: toolshed.g2.bx.psu.edu/repos/galaxyp/msstatstmt/msstatstmt/2.0.0+galaxy0 with the following parameters:
    • “Input Source”: MaxQuant
      • “evidence.txt - feature-level data”: evidence (output of MaxQuant tool)
      • “proteinGroups.txt”: proteinGroups (output of MaxQuant tool)
      • “annotation.txt”: annotation file
    • In “Plot Output Options”:
      • “Select protein IDs to draw plots”: generate all plots for each protein
    • “Compare Groups”: Yes
      • In “Comparison Plot Options”:
        • “Display protein names in Volcano Plot.”: No
    • “Select Outputs”: QC Plot

Tip: Preparing an MSstatsTMT annotation file for your own analysis

For this tutorial all files needed for the analysis are provided. Nevertheless, here are some guidelines for generating an annotation file for MSstatsTMT for your own data analyses. As MSstatsTMT has to perform statistical analyses, you have to provide the following basic information: which quantitation channel in every sample corresponds to which original biological sample, and which biological condition (e.g. WT or knock-down) does this sample belong to? To fully automate the choice of an appropriate statistical model, MSstatsTMT needs an annotation matrix, which is powerful because it can describe a wide range of experimental setups.

The annotation file contains the following columns:

  • Run corresponds to your MS file names. Please note that MaxQuant in Galaxy will add a “.thermo” to your raw file name as it appears in your history. Since we first analyse the data with MaxQuant, MSstatsTMT needs this additional extension in the annotation file. The correct run names can be found in the column ‘Raw file’ of the evidence file once MaxQuant has finished.
  • Fraction corresponds to the aforementioned optional pre-fractionation fraction of your samples (e.g. high pH RP or SCX).
  • TechRepMixture corresponds to the n-th time this fraction was measured. If you measured the first fraction of your first sample twice, Fraction would stay the same, but TechRepMixture would have to differ.
  • Channel corresponds to the TMT channel that was used in your experiments. We recommend always including all channels in your searches and annotations, even when these channels are empty. So if you only used 9 of 11 channels in your TMT experiments, add the last 2 channels anyway and label them as “Empty” in the Condition column. In Galaxy the channels need to be named ‘channel.X’, with X being the number of the channel.
  • Condition represents the biological condition to which the observation belongs (e.g. Ctrl). Normalization channels are labeled as “Norm”, empty channels as “Empty”.
  • Mixture corresponds to the “TMT experiment” mentioned above.
  • BioReplicate corresponds to the biological specimen this observation derives from.

At first glance this might sound overwhelming, but used correctly, this annotation matrix can describe very complex experimental setups unambiguously.
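
The column rules listed above can be checked automatically before starting MSstatsTMT. The following sketch assumes a tab-separated annotation file with a header line; the function name, the messages, and the example rows are made up for illustration. Run names are only checked when you pass the set of names found in the ‘Raw file’ column of the evidence file.

```python
import csv
import io
import re

REQUIRED_COLUMNS = [
    "Run", "Fraction", "TechRepMixture", "Channel",
    "Condition", "Mixture", "BioReplicate",
]
# In Galaxy the channels need to be named channel.X (X = channel number).
CHANNEL_PATTERN = re.compile(r"channel\.\d+")


def validate_annotation(handle, known_runs=None):
    """Return a list of human-readable problems found in the annotation table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        channel = row["Channel"] or ""
        if not CHANNEL_PATTERN.fullmatch(channel):
            problems.append(f"line {line_no}: channel '{channel}' is not of the form channel.X")
        if known_runs is not None and row["Run"] not in known_runs:
            problems.append(f"line {line_no}: run '{row['Run']}' not found in the evidence file")
    return problems


# Hypothetical annotation with one wrongly named channel (127N instead of channel.2).
example = io.StringIO(
    "Run\tFraction\tTechRepMixture\tChannel\tCondition\tMixture\tBioReplicate\n"
    "File1.thermo.raw\t1\t1\tchannel.1\tprimary\tmixture1\t1\n"
    "File1.thermo.raw\t1\t1\t127N\tmetastasis\tmixture1\t1\n"
)
for problem in validate_annotation(example, known_runs={"File1.thermo.raw"}):
    print(problem)
```

An empty list means that none of the checked rules was violated; it does not guarantee that the design itself is described correctly.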

An example:

Suppose you perform a study on the difference between primary liver tumors and liver tumor metastases. It makes a big difference whether all your samples derive from different patients or whether you were able to obtain some primary tumor and metastasis samples from the same patient. Samples obtained from the same patient are expected to be related to each other (they are not independent observations, and this must be considered during statistical testing). A short illustration: you measured 2 different samples of one patient (a metastasis and a primary tumor sample). This patient would be considered patient 1. You would label the primary tumor with the first channel of a TMT 11-plex kit (126) and the metastasis with the second channel (127N). After labeling, you combine both samples in one Mixture and measure them in one MS run.

Your annotation file for this case would look like this:

Run               Fraction  TechRepMixture  Channel     Condition   Mixture   BioReplicate
File1.thermo.raw  1         1               channel.1   primary     mixture1  1
File1.thermo.raw  1         1               channel.2   metastasis  mixture1  1

Let us assume you now have 10 primary tumors and 10 metastases (from 20 different patients). Since you do not have 20-plex capabilities, you have to divide your samples into two experiments / mixtures. For simplicity, after labeling and combining all channels belonging to the same experiment, the mixtures are measured directly (1 fraction per Mixture). To avoid batch effects due to allocation, you use randomization. To be able to normalize all quantitations, you use a reference sample, which you add to each experiment (this is referred to as the “Normalization Channel”). The reference sample is ideally a mixture of all your samples.

The annotation matrix would now look like this:

Run               Fraction  TechRepMixture  Channel     Condition   Mixture   BioReplicate
File1.thermo.raw  1         1               channel.1   primary     mixture1  1
File1.thermo.raw  1         1               channel.2   metastasis  mixture1  2
File1.thermo.raw  1         1               channel.3   primary     mixture1  3
File1.thermo.raw  1         1               channel.4   metastasis  mixture1  4
File1.thermo.raw  1         1               channel.5   primary     mixture1  5
File1.thermo.raw  1         1               channel.6   metastasis  mixture1  6
File1.thermo.raw  1         1               channel.7   primary     mixture1  7
File1.thermo.raw  1         1               channel.8   metastasis  mixture1  8
File1.thermo.raw  1         1               channel.9   primary     mixture1  9
File1.thermo.raw  1         1               channel.10  metastasis  mixture1  10
File1.thermo.raw  1         1               channel.11  Norm        mixture1  100
File2.thermo.raw  1         1               channel.1   primary     mixture2  11
File2.thermo.raw  1         1               channel.2   metastasis  mixture2  12
File2.thermo.raw  1         1               channel.3   primary     mixture2  13
File2.thermo.raw  1         1               channel.4   metastasis  mixture2  14
File2.thermo.raw  1         1               channel.5   primary     mixture2  15
File2.thermo.raw  1         1               channel.6   metastasis  mixture2  16
File2.thermo.raw  1         1               channel.7   primary     mixture2  17
File2.thermo.raw  1         1               channel.8   metastasis  mixture2  18
File2.thermo.raw  1         1               channel.9   primary     mixture2  19
File2.thermo.raw  1         1               channel.10  metastasis  mixture2  20
File2.thermo.raw  1         1               channel.11  Norm        mixture2  100

Normalization Channels are always labeled as “Norm” in Condition and get an arbitrary BioReplicate number. Since we want the Normalization channel to be a reference, it always has to be the same sample and therefore also should have the same BioReplicate number. If you have empty channels, label them as “Empty” in the Condition column and also give them an arbitrary BioReplicate number.
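
For larger designs it is less error-prone to generate such a matrix than to type it. The sketch below builds the rows for the two-mixture example with 20 patients. It assumes that every patient receives a distinct BioReplicate number (1 to 20), that the reference channel always gets the arbitrary number 100, and that conditions simply alternate between channels (no randomization is performed here). The function name is hypothetical.

```python
HEADER = ["Run", "Fraction", "TechRepMixture", "Channel", "Condition", "Mixture", "BioReplicate"]
CONDITIONS = ["primary", "metastasis"]


def build_annotation(n_mixtures=2, samples_per_mixture=10, norm_bioreplicate=100):
    """Build annotation rows: one row per channel, plus one Norm channel per mixture."""
    rows = []
    bioreplicate = 0
    for mixture in range(1, n_mixtures + 1):
        run = f"File{mixture}.thermo.raw"
        for channel in range(1, samples_per_mixture + 1):
            bioreplicate += 1  # a new patient for every sample (assumption)
            condition = CONDITIONS[(channel - 1) % 2]
            rows.append([run, 1, 1, f"channel.{channel}", condition, f"mixture{mixture}", bioreplicate])
        # The reference sample is the same in every mixture, so it keeps the same number.
        norm_channel = f"channel.{samples_per_mixture + 1}"
        rows.append([run, 1, 1, norm_channel, "Norm", f"mixture{mixture}", norm_bioreplicate])
    return rows


print("\t".join(HEADER))
for row in build_annotation():
    print("\t".join(str(value) for value in row))
```

For a real study, the assignment of samples to channels and mixtures should follow your randomization scheme rather than the fixed alternation used here.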

Exercise:

What would the annotation file look like for the following experiment: 27 samples (12 primary, 12 metastases samples from different patients + 3 reference samples) over 3 TMT experiments / mixtures. You use Channels 1-8 in Exp1, 2-9 in Exp2 and 3-11 in Exp3. Your first used channel is always taken as the reference channel (Exp1: Channel1 = Norm, Exp2: Channel2 = Norm, etc.). You fractionate all 3 experiments / mixtures into 3 fractions each and measure every MS sample twice (resulting in 18 measurements / raw files in total).

The solution can be found on Zenodo.

For more information on MSstats parameters and ideas on how to follow up with your MSstatsTMT results visit the label-free MSstats training.

Follow up on MSstatsTMT results

We obtain several output files from MSstatsTMT. The MSstats log file contains the MSstats report with warnings and information about the analysis steps. The QC report allows you to visualize the protein abundances of the conditions for all proteins (first page) or for each individual protein (following pages).

Figure 1: QC report for all samples, each boxplot summarizes the protein abundances of one sample

The MSstatsTMT Group Comparison Plot is a convenient PDF that you can search for your proteins of interest to check whether they are dysregulated between the conditions. The red error bar shows a confidence interval of the fold change.

Figure 2: Group comparison plot for the protein A5YKK6 (CCR4-NOT transcription complex subunit 1)

The volcano plot shows the negative log10 of the adjusted p-value derived from statistical testing against the log2 fold change between the conditions you compared. A value of 3 on the y axis corresponds to an adjusted p-value of 0.001, whereas a value of 1 on the x axis corresponds to a ratio of 2, meaning that the protein is twice as abundant in condition 1 as in condition 2. Please bear in mind that TMT quantitation suffers from ratio compression. There are ways to calculate more accurate fold changes from measured data (or alternatively you can use MS3 quantitation on some instruments). However, even with ratio compression, MS2 TMT is an excellent quantitation strategy for detecting differentially regulated proteins. Just keep in mind that the reported fold changes of proteins are almost always smaller than the real fold changes between your conditions.
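
The axis transformations described above can be reproduced with a few lines of code. This is a small arithmetic sketch with hypothetical function names; it only illustrates how the plotted coordinates relate to adjusted p-values and ratios and does not reproduce the statistical testing of MSstatsTMT.

```python
import math


def volcano_coordinates(log2_fold_change, adjusted_p_value):
    """Return the (x, y) position of a protein in the volcano plot."""
    return log2_fold_change, -math.log10(adjusted_p_value)


def ratio_from_log2(log2_fold_change):
    """Convert a log2 fold change back into an abundance ratio."""
    return 2.0 ** log2_fold_change


x, y = volcano_coordinates(log2_fold_change=1.0, adjusted_p_value=0.001)
print(f"x = {x:.1f}, y = {y:.1f}")  # y is approximately 3
print(f"ratio = {ratio_from_log2(x):.1f}")  # a log2 fold change of 1 is a ratio of 2
print(f"ratio = {ratio_from_log2(-1.0):.1f}")  # negative values mean lower abundance (0.5)
```

Because of ratio compression, a ratio computed this way from MS2 TMT data should be read as a lower bound on the magnitude of the real change rather than as an exact value.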

Figure 3: Volcano plot

Key points

  • MaxQuant in combination with MSstatsTMT enables quantitative analysis of TMT data.

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Proteomics topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

Citing this Tutorial

  1. Klemens Fröhlich, Melanie Föll, Matthias Fahrner, 2021 MaxQuant and MSstats for the analysis of TMT data (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/maxquant-msstats-tmt/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

BibTeX

@misc{proteomics-maxquant-msstats-tmt,
author = "Klemens Fröhlich and Melanie Föll and Matthias Fahrner",
title = "MaxQuant and MSstats for the analysis of TMT data (Galaxy Training Materials)",
year = "2021",
month = "08",
day = "13",
url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/maxquant-msstats-tmt/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                

Congratulations on successfully completing this tutorial!