Label-free data analysis using MaxQuant

Overview

question Questions
  • How to perform label-free analysis in MaxQuant?

  • Which are the most abundant proteins in serum?

  • How successful was the depletion of those in our experiment?

objectives Objectives
  • Analysis of human serum proteome samples with label-free quantification in MaxQuant

requirements Requirements

time Time estimation: 1 hour

level Level: Introductory

Supporting Materials

last_modification Last modification: Apr 29, 2020

Introduction

The proteome refers to the entirety of proteins in a biological system (e.g. cell, tissue, organism). Proteomics is the large-scale experimental analysis of proteins and proteomes, most often performed by mass spectrometry, which enables great sensitivity and throughput. Especially for complex protein mixtures, bottom-up mass spectrometry is the standard approach. In bottom-up proteomics, proteins are digested with a specific protease into peptides, and the measured peptides are reassembled in silico into the corresponding proteins. Inside the mass spectrometer, not only are the intact peptides measured (MS1 level), but the peptides are also fragmented into smaller fragment ions, which are measured again (MS2 level). This is referred to as tandem-mass spectrometry (MS/MS). Peptides are identified by peptide spectrum matching, in which theoretical spectra generated from the input protein database (fasta file) are compared with the measured spectra. Peptide quantification is most often performed by measuring the area under the curve of the MS1 level peptide peaks, but special techniques such as TMT allow peptides to be quantified on MS2 level. Nowadays, bottom-up tandem-mass spectrometry approaches allow for the identification and quantification of several thousand proteins.

MQ_lcmsms
Figure 1: Proteomics using liquid chromatography tandem-mass spectrometry (LC-MS/MS). Adapted from Wikipedia.
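The area-under-the-curve idea behind MS1 quantification can be illustrated in a few lines of Python. This is only a minimal sketch using trapezoidal integration; it is not how MaxQuant integrates peaks internally, and the function name `peak_area` as well as the retention times and intensities are invented for illustration.

```python
def peak_area(retention_times, intensities):
    """Trapezoidal area under an MS1 elution profile (toy illustration)."""
    if len(retention_times) != len(intensities):
        raise ValueError("need exactly one intensity per retention time")
    points = list(zip(retention_times, intensities))
    area = 0.0
    # Sum the trapezoid between each pair of neighbouring data points.
    for (t0, i0), (t1, i1) in zip(points, points[1:]):
        area += (t1 - t0) * (i0 + i1) / 2.0
    return area


# Invented elution profile: retention time in minutes, arbitrary intensity units.
rt = [10.0, 10.1, 10.2, 10.3, 10.4]
intensity = [0.0, 4.0, 10.0, 4.0, 0.0]
area = peak_area(rt, intensity)
print(f"peak area: {area:.2f}")
```

A larger area corresponds to more peptide ions reaching the detector, which is why the integrated MS1 signal can serve as a measure of peptide quantity.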

A plethora of software solutions have been developed for the analysis of proteomics data. MaxQuant is one of the most popular proteomics software packages because it is easy to use, free, and offers functionalities for nearly all kinds of proteomics data analysis challenges (Cox and Mann 2008). Mass spectrometry raw data is normally obtained in a vendor-specific, proprietary file format. MaxQuant can directly take those raw files as input. For peptide identification MaxQuant uses a search engine called “Andromeda”. MaxQuant offers highly accurate functionalities for many different proteomics quantification strategies, e.g. label-free, SILAC, TMT.

Blood is a commonly used biofluid for diagnostic procedures. The cell-free liquid portion of blood is called plasma; the liquid that remains after coagulation is called serum. Plasma and serum proteomics are frequently performed to find new biomarkers, e.g. for diagnostic purposes and personalized medicine (Geyer et al. 2017). Serum and plasma proteomics are particularly challenging because protein concentrations span about ten orders of magnitude. Therefore, most sample preparation protocols include a depletion step in which the most abundant proteins are (partially) depleted from the sample, e.g. via columns with immobilized antibodies.

This stand-alone tutorial introduces the data analysis from raw data files to protein identification and quantification of two label-free human serum samples with the MaxQuant software. One sample is a pure serum sample, while the other sample has been depleted of several abundant blood proteins. One of the questions in this tutorial is to find out which sample was depleted and which was not.

For more advanced proteomics workflows, please consult the OpenMS identification, quantification as well as SearchGUI/PeptideShaker tutorials.

Agenda

In this tutorial, we will cover:

  1. Get data
  2. MaxQuant Analysis
    1. More details on MaxQuant Parameters
  3. Quality control results
  4. Serum composition

Get data

The serum proteomic samples and the fasta file for this training were deposited at Zenodo. It is of course possible to use other fasta files that contain human proteome sequences, but to ensure that the results are compatible we recommend using the provided fasta file. MaxQuant not only adds known contaminants to the fasta file, but also generates the “decoy” hits for false discovery rate estimation itself. Therefore, the fasta file must not contain decoy entries. To learn more about fasta files, have a look at Protein FASTA Database Handling.
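If you bring your own fasta file, a quick check for decoy entries can save a failed run. The following is a minimal sketch: the decoy prefixes listed in `DECOY_PREFIXES` are assumptions (different database tools use different prefixes), the function names are hypothetical, and the two example entries are toy data.

```python
# Hypothetical decoy prefixes; adjust them to the tool that created your database.
DECOY_PREFIXES = ("REV_", "rev_", "DECOY_", "decoy_")


def read_fasta_headers(lines):
    """Return all header lines of a fasta file without the leading '>'."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def find_decoy_headers(headers, prefixes=DECOY_PREFIXES):
    """Return the headers that start with one of the assumed decoy prefixes."""
    return [h for h in headers if h.startswith(prefixes)]


# Toy fasta content: one target entry and one decoy entry.
example = """>sp|P02768|ALBU_HUMAN Albumin OS=Homo sapiens
MKWVTFISLLFLFSSAYS
>DECOY_sp|P02768|ALBU_HUMAN Albumin OS=Homo sapiens
SYASSFLFLLSIFTVWKM
""".splitlines()

headers = read_fasta_headers(example)
decoys = find_decoy_headers(headers)
print(f"{len(headers)} entries, {len(decoys)} look like decoys")
```

For a real file, pass the open file handle instead of `example`; an empty `decoys` list suggests the database can be used with MaxQuant as it is, provided the assumed prefixes cover your case.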

hands_on Hands-on: Data upload

  1. Create a new history for this tutorial and give it a meaningful name

    tip Tip: Creating a new history

    Click the new-history icon at the top of the history panel

    If the new-history is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
  2. Import the fasta and raw files from Zenodo

    https://zenodo.org/record/3774452/files/Protein_database.fasta
    https://zenodo.org/record/3774452/files/Sample1.raw
    https://zenodo.org/record/3774452/files/Sample2.raw
    
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

  3. Once the files are green, rename the raw datasets into ‘sample1’ and ‘sample2’ and the fasta file into ‘protein database’

    tip Tip: Renaming a dataset

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button
  4. Set the data type to thermo.raw for ‘sample1’ and ‘sample2’

    tip Tip: Changing the datatype

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
    • Select thermo.raw
    • Click the Change datatype button

MaxQuant Analysis

The MaxQuant Galaxy implementation contains the most important MaxQuant parameters. As an alternative, the MaxQuant (using mqpar.xml) tool can be used with a preconfigured mqpar.xml file. We will explain the parameters after starting the MaxQuant run, which takes some time to finish.

hands_on Hands-on: MaxQuant analysis

  1. MaxQuant tool with the following parameters:
    • In “Input Options”:
      • param-file “FASTA files”: protein database
    • In “Search Options”:
      • “minimum unique peptides”: 1
    • In “Parameter Group”:
      • param-files “Infiles”: sample1, sample2
      • “missed cleavages”: 1
      • “variable modifications”: none (deselect all entries so that the field is empty)
      • “Quantitation Methods”: label free quantification
    • “Generate PTXQC (proteomics quality control pipeline) report?”: Yes
    • In “Output Options”:
      • “Select the desired outputs.”: Protein Groups, Peptides, mqpar.xml

    comment Comment: Protein Groups

    Proteins that share all their peptides with other proteins cannot be unambiguously identified. Therefore, MaxQuant groups such proteins into one protein group and only one common quantification is calculated. Within a protein group, the properties of the individual proteins are separated by semicolons.
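The grouping idea can be sketched in Python. This is a deliberately simplified illustration that only merges proteins with identical peptide sets; MaxQuant's actual protein inference is more involved. The function name and the protein and peptide names are invented.

```python
from collections import defaultdict


def group_proteins(protein_to_peptides):
    """Merge proteins with identical peptide sets into one semicolon-separated group."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        # Proteins with exactly the same peptides end up under the same key.
        groups[frozenset(peptides)].append(protein)
    return sorted(";".join(sorted(members)) for members in groups.values())


# Toy input: protA and protB cannot be distinguished by their peptides.
toy = {
    "protA": {"PEPTIDEK", "LSSPATLNSR"},
    "protB": {"PEPTIDEK", "LSSPATLNSR"},
    "protC": {"AVLGK"},
}
protein_groups = group_proteins(toy)
print(protein_groups)
```

In the output, `protA;protB` forms a single group with one shared quantification, which mirrors the semicolon-separated entries in the protein groups table.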

More details on MaxQuant Parameters

The “minimum peptide length” defines the minimum number of amino acids a peptide should have to be included for protein identification and quantification. Below 7 amino acids a peptide cannot be unique and is therefore not informative, thus typical values are in the range 7-9.

Even some longer peptides are not unique, meaning that they are shared by several proteins, e.g. when they are part of a common protein domain. During protein inference the peptides are statistically assembled into the corresponding proteins, and the decision should be mainly based on the unique peptides. Therefore, we set “min unique peptides” to 1, so that only proteins that have at least one unique peptide are reported in the output table.

In most bottom-up proteomics experiments Trypsin is used as a protease because it has many advantages, such as its accurate cleavage specificity. Trypsin cleaves peptides C-terminal of Arginine (R) and Lysine (K), except when those are followed by Proline (P). In MaxQuant the default “enzyme” is set to Trypsin/P; note that in MaxQuant's enzyme naming the “/P” variant also permits cleavage before Proline, whereas the plain Trypsin setting applies the Proline exception. The selected cleavage rule is used by MaxQuant to perform an in silico digestion of the protein database that was provided in the fasta file.

Protease digestion is not always complete. We therefore set the “number of missed cleavages” to 1, meaning that the in silico digestion also includes peptides that contain one additional internal Arginine or Lysine.

MQ_param_missed
Figure 2: Trypsin specificity and missed cleavages. With one missed cleavage all three peptides will be part of the in silico peptide database.
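The in silico digestion with a minimum peptide length and missed cleavages can be sketched as follows. This is an illustrative re-implementation of the rules described above, not MaxQuant's code; the function names, the `proline_rule` switch and the toy sequence are assumptions made for this example.

```python
import re


def cleavage_sites(sequence, proline_rule=True):
    """Positions after K or R; with proline_rule, sites followed by P are skipped."""
    pattern = r"(?<=[KR])(?!P)" if proline_rule else r"(?<=[KR])"
    return [m.start() for m in re.finditer(pattern, sequence)]


def digest(sequence, missed_cleavages=1, min_length=7, proline_rule=True):
    """Return all peptides with up to `missed_cleavages` internal cleavage sites."""
    inner = [p for p in cleavage_sites(sequence, proline_rule) if 0 < p < len(sequence)]
    cuts = [0] + inner + [len(sequence)]
    peptides = []
    for i in range(len(cuts) - 1):
        # j = i + 1 is the fully cleaved peptide; larger j adds missed cleavages.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(cuts))):
            peptide = sequence[cuts[i]:cuts[j]]
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


# Toy protein: the R in the second block is followed by P.
SEQ = "GGGGGGGK" + "AAAAAAAR" + "PLLLLLLLK"

print(digest(SEQ, missed_cleavages=0))
print(digest(SEQ, missed_cleavages=1))
print(digest(SEQ, missed_cleavages=1, proline_rule=False))
```

With the Proline exception the toy protein yields two fully cleaved peptides, and allowing one missed cleavage adds the undigested sequence. Switching the exception off creates an extra cleavage site, so the peptide list grows.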

From the in silico generated peptide database, the masses of the peptides are calculated and matched to the measured masses in order to identify them. A peptide’s mass changes when the peptide is modified, for example by chemical labelling as applied in different quantitation strategies, or by biological post-translational modifications. Therefore, it is important to also include possible peptide modifications in the in silico generated peptide mass list. “Fixed modifications” are modifications that occur on every occurrence of the specific amino acid. Those are often artificially introduced modifications such as Carbamidomethylation of cysteines (C) to prevent re-formation of the disulfide bridges. This is a common procedure in proteomics sample preparation and therefore also the default option in MaxQuant: Carbamidomethyl (C). “Variable modifications” are modifications that do not occur on every instance of an amino acid: for example, Oxidation of Methionine might occur on some Methionines but not all, and only a few peptide N-termini are acetylated. Variable modifications enlarge the in silico peptide database because each peptide’s mass is calculated once with and once without the additional modification. To keep computation times as short as possible, we did not use any variable modifications in this training, although the MaxQuant defaults, Oxidation of Methionine and Acetylation of N-termini, would have been completely valid to use.
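To see why variable modifications enlarge the search space, the following sketch computes monoisotopic peptide masses with Carbamidomethyl (C) as a fixed modification and Oxidation (M) as a variable one. It is an illustration only, not MaxQuant's implementation; the function names and the toy peptide are invented, and the masses are standard monoisotopic values rounded to six decimals.

```python
from itertools import combinations

# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER = 18.010565
CARBAMIDOMETHYL = 57.021464  # fixed: added to every cysteine
OXIDATION = 15.994915        # variable: added to some methionines


def peptide_mass(peptide, n_oxidized=0):
    """Neutral monoisotopic mass with the fixed modification always applied."""
    mass = WATER + sum(MONO[aa] for aa in peptide)
    mass += CARBAMIDOMETHYL * peptide.count("C")
    return mass + OXIDATION * n_oxidized


def variable_forms(peptide):
    """All site-specific oxidation patterns of a peptide with their masses."""
    sites = [i for i, aa in enumerate(peptide) if aa == "M"]
    forms = []
    for k in range(len(sites) + 1):
        for combo in combinations(sites, k):
            forms.append((combo, peptide_mass(peptide, k)))
    return forms


forms = variable_forms("MCMK")  # toy peptide with two methionines
for sites, mass in forms:
    print(sites, round(mass, 4))
```

The fixed modification changes the mass but not the number of entries, whereas the variable modification turns one peptide with two Methionines into four candidate forms. This growth is the reason the tutorial run omits variable modifications.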

MaxQuant supports different “Quantitation Methods”. The three main categories are label-free quantification, label-based quantification and reporter ion MS2 quantification. In this tutorial we have chosen label-free because we did not apply any specific labeling/quantitation strategy to the samples.

MQ_quant_methods
Figure 3: Overview of MaxQuant quantification methods

The PTXQC software (Bielow et al. 2015) was built to enable direct proteomics quality control from MaxQuant result files. This quality control can be directly used in the Galaxy MaxQuant wrapper by setting “Generate PTXQC” to yes. This will generate a pdf file with multiple quality control plots. Be aware that the cutoffs set in PTXQC might not be applicable to your experiment and mass spectrometer type and therefore “under performing” and “fail” do not necessarily mean that the quality is poor.

PTXQC_overview
Figure 4: Overview of PTXQC quality measures for sample1 and sample2

MaxQuant automatically generates several output files. In the “Output Options” all or some of the output files can be selected. The protein information can be obtained by selecting Protein Groups, while the peptide information is obtained by choosing Peptides. The applied MaxQuant parameters are stored in the mqpar.xml file. This file can be re-used as an input file in the MaxQuant (using mqpar.xml) tool.

details More MaxQuant parameters

  • The “parse rules” in the input section are applied to the fasta sequence headers. The default automatically extracts the UniProt ID from fasta files that were downloaded from UniProt. The regular expressions can be adjusted to keep other information from the fasta file.

  • For pre-fractionated data an “experimental design template” has to be used. This has to be a tab-separated text file with a column for the fractions (e.g. 1-10) and a column for the experiment (sample1, sample2, sample3) and a column for post translational modifications (PTM). Examples are given in the help section of the MaxQuant tool.

  • “Match between runs” allows identifications (peptide sequences) to be transferred from one run to another. If the MS1 (full length peptide) signal is present in both runs, but was only selected for fragmentation in one of them, MaxQuant can transfer the resulting peptide sequence to the run where the MS1 peptide was not fragmented. Whether a peptide was identified via fragmentation (MS/MS) or via match between runs (matching) is recorded in the evidence output.

  • MaxQuant can process different raw files with different parameters. In this tutorial we have loaded both files into the same “parameter group” in order to process them with the same parameters. To apply different parameters, new parameter groups can be added by clicking on the param-repeat “insert parameter group” button. In each “parameter group” one or several raw files can be specified, and only the parameters specified within this parameter group section are applied to them.
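The principle of transferring identifications between runs, as described in the list above, can be sketched in Python. This is a strongly simplified illustration: MaxQuant additionally aligns retention times between runs, which is not shown here. The tolerance values, the function name, the dictionary keys and all feature values are assumptions chosen for this example.

```python
def match_between_runs(identified, unidentified, mz_tol_ppm=10.0, rt_tol_min=0.7):
    """Copy sequences from identified MS1 features to matching unidentified ones."""
    transferred = []
    for feature in unidentified:
        for ref in identified:
            ppm_error = abs(feature["mz"] - ref["mz"]) / ref["mz"] * 1e6
            same_charge = feature["charge"] == ref["charge"]
            close_in_time = abs(feature["rt"] - ref["rt"]) <= rt_tol_min
            if same_charge and close_in_time and ppm_error <= mz_tol_ppm:
                # Mark the identification as transferred rather than sequenced.
                transferred.append({**feature, "sequence": ref["sequence"], "type": "matching"})
                break
    return transferred


# Invented features: run A was fragmented and identified, run B was not.
run_a = [{"mz": 500.2500, "rt": 30.0, "charge": 2, "sequence": "PEPTIDEK"}]
run_b = [
    {"mz": 500.2502, "rt": 30.3, "charge": 2},  # close in m/z and time
    {"mz": 600.3000, "rt": 45.0, "charge": 2},  # no counterpart in run A
]
transferred = match_between_runs(run_a, run_b)
print(transferred)
```

Only the first feature of run B receives a sequence, because it agrees with the identified feature in mass, charge and retention time within the assumed tolerances.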

tip Tip: Continue with results from Zenodo

In case the MaxQuant run is not yet finished, the results can be downloaded from Zenodo to be able to continue the tutorial.

  1. Import the files from Zenodo

    https://zenodo.org/record/3774452/files/PTXQC_report.pdf
    https://zenodo.org/record/3774452/files/MaxQuant_Protein_Groups.tabular
    https://zenodo.org/record/3774452/files/MaxQuant_Peptides.tabular
    https://zenodo.org/record/3774452/files/MaxQuant_mqpar.xml
    

question Questions

  1. How many proteins were found in total?
  2. How many peptides were found in total?
  3. How many proteins were identified and quantified in each sample? (Tip: There is a tool called “filter data on any column”; apply it to the LFQ column of each sample and remove rows containing “0”)

solution Solution

  1. 271 protein (groups) were found in total.
  2. 2387 peptides were found in total.
  3. Sample1: 237, Sample2: 123 (run the filter data on any column tool on the protein groups file once per sample, “with following condition” c32!=0 for sample1 and c33!=0 for sample2, and “Number of header lines” 1)
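Outside of Galaxy, the same filtering logic can be reproduced with a few lines of Python. This is a minimal sketch on an invented toy table; the column names follow the “LFQ intensity” naming used for the LFQ columns but are assumptions here, as are the protein names and intensities.

```python
import csv
import io

# Invented tab-separated excerpt of a protein groups table.
TOY_PROTEIN_GROUPS = (
    "Fasta headers\tLFQ intensity sample1\tLFQ intensity sample2\n"
    "protein A\t1200000\t0\n"
    "protein B\t0\t530000\n"
    "protein C\t880000\t910000\n"
    "protein D\t45000\t0\n"
)


def count_quantified(table_text, column):
    """Count rows whose value in `column` is not zero (header line is skipped)."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return sum(1 for row in reader if float(row[column]) != 0)


n_sample1 = count_quantified(TOY_PROTEIN_GROUPS, "LFQ intensity sample1")
n_sample2 = count_quantified(TOY_PROTEIN_GROUPS, "LFQ intensity sample2")
print(n_sample1, n_sample2)
```

Each call corresponds to one run of the filter tool with a single condition, which is why the counts are obtained per sample.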

Quality control results

To get a first overview of the MaxQuant results, the PTXQC report is helpful. Click on the galaxy-eye eye of the PTXQC pdf file to open it in Galaxy. Screening through the different plots might already give you a hint which of the samples was pure and which was depleted of abundant proteins. Both samples failed in some categories (see Figure 4 above), especially due to low peptide and protein numbers, which is expected in serum samples and therefore not a quality problem.

question Questions

  1. How good was the tryptic digestion (percentage of zero missed cleavages)?
  2. Which sample yielded more protein identifications?
  3. In which sample were more contaminants quantified?
  4. Do you already have a guess on which sample was depleted?

solution Solution

  1. The digestion was not ideal but good enough to work with. The proportion of zero missed cleavages was 75% for sample1 and around 85% for sample2. PTXQC_missed
  2. Sample 1 yielded more protein identifications PTXQC_ids
  3. Sample 2 has more contaminants; serum albumin in particular is highly abundant. PTXQC_contaminants
  4. Sample1; more information can be found in the next section.
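The percentage of peptides without missed cleavages, used above to judge the digestion, can be approximated from peptide sequences alone. The sketch below is a simplification that counts every internal K or R as a missed cleavage and ignores the Proline exception; PTXQC derives its numbers from the MaxQuant output instead. Function names and peptides are invented.

```python
def count_missed_cleavages(peptide):
    """Count K and R residues that are not the last residue of the peptide."""
    return sum(1 for aa in peptide[:-1] if aa in "KR")


def percent_zero_missed(peptides):
    """Percentage of peptides that contain no internal K or R."""
    zero = sum(1 for p in peptides if count_missed_cleavages(p) == 0)
    return 100.0 * zero / len(peptides)


# Toy peptide list: the third peptide contains one internal K.
toy_peptides = ["GGGGGGGK", "AAAAAAAR", "LLLLKLLLLR", "VVVVVVVK"]
percentage = percent_zero_missed(toy_peptides)
print(f"{percentage:.0f}% of peptides have zero missed cleavages")
```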

Serum composition

To explore the proteomic composition of the two serum samples, some postprocessing steps are necessary. The protein groups file has many different columns; therefore, the first step is to extract only the columns that are of interest for this task. These are the column with the fasta headers (which includes the protein name as it was written in the fasta file) and the two columns with the LFQ intensities of both samples. To find the most abundant proteins, the LFQ intensities can be sorted. On this sorted dataset we will explore the composition of the serum proteins within both samples using an interactive pie chart.

hands_on Hands-on: Exploring serum composition

  1. Cut columns from a table tool with the following parameters:
    • “Cut columns”: c8,c32,c33
    • param-file “From”: proteinGroups (output of MaxQuant tool)
  2. Sort data in ascending or descending order tool with the following parameters:
    • param-file “Sort Query”: cut_file (output of Cut tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • “on column”: c2
      • “in”: Descending order
      • “Flavor”: General numeric sort ( scientific notation -g)
  3. Sort data in ascending or descending order tool with the following parameters:
    • param-file “Sort Query”: cut_file (output of Cut tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • “on column”: c3
      • “in”: Descending order
      • “Flavor”: General numeric sort ( scientific notation -g)
  4. Click galaxy-barchart “Visualize this data” on the last Sort tool result.
    • Select Pie chart (NVD3)
    • “Provide a title”: Serum compositions
    • Click Select data galaxy-chart-select-data
    • “Provide a label”: sample1
    • “Labels”: Column: 1
    • “Values”: Column: 2
    • Click Insert data series
    • “Provide a label”: sample2
    • “Labels”: Column: 1
    • “Values”: Column: 3
    • Save galaxy-save (file is saved under “User” –> “Visualizations”)
serum_composition
Figure 5: Quantitative serum composition. In Galaxy one can hover over the graph to see the protein names
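The cut, sort and pie chart steps boil down to ranking proteins by LFQ intensity and expressing each intensity as a share of the sample total. The following minimal sketch shows that arithmetic; all protein rows and intensities are invented toy values and do not reflect the tutorial results.

```python
def composition(rows, value_index):
    """Rank (name, lfq_sample1, lfq_sample2) rows and return percentage shares."""
    ranked = sorted(rows, key=lambda row: row[value_index], reverse=True)
    total = sum(row[value_index] for row in rows)
    return [(row[0], 100.0 * row[value_index] / total) for row in ranked]


# Invented intensities: (protein name, LFQ sample1, LFQ sample2).
toy_rows = [
    ("Serum albumin", 10.0, 70.0),
    ("protein X", 30.0, 20.0),
    ("protein Y", 60.0, 10.0),
]

shares_sample1 = composition(toy_rows, 1)
shares_sample2 = composition(toy_rows, 2)

# Difference of the albumin shares, expressed in percentage points.
albumin_change = dict(shares_sample2)["Serum albumin"] - dict(shares_sample1)["Serum albumin"]
print(shares_sample1)
print(shares_sample2)
print(f"albumin share differs by {albumin_change:.0f} percentage points")
```

The first entry of each list is the largest slice of the corresponding pie chart, and comparing the share of one protein between the two lists is the calculation needed for the questions below.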

question Questions

  1. What are the top 5 most abundant proteins in both files? Do they reflect typical serum proteins?
  2. Which sample was depleted for the top serum proteins?
  3. How much did the serum albumin abundance percentage decrease? Was the depletion overall successful?

solution Solution

  1. Sample1: Complement C4-A, Ceruloplasmin, Hemopexin, Serum albumin, Complement factor B. Sample2: Serum albumin, Immunoglobulin heavy constant gamma 1, Serotransferrin, Immunoglobulin kappa constant, Haptoglobin. All of those proteins are typical (highly abundant) serum and plasma proteins found by MS.
  2. Sample1 was depleted, sample2 was pure serum.
  3. In the depleted sample1, some of the most abundant proteins are depleted, especially Albumin, whose proportion of the total sample intensity decreased by 58 percentage points. Compared to the pure serum, the depleted sample yielded about twice as many identified and quantified proteins, so the depletion can be considered quite successful. However, there is still room for improvement, as some of the most abundant proteins that should have been depleted did not change in abundance relative to the overall protein abundance.

keypoints Key points

  • MaxQuant offers a single tool solution for protein identification and quantification.

  • Label-free quantitation reveals the most abundant proteins in serum samples.

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

  1. Cox, J., and M. Mann, 2008 MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology 26: 1367–1372. 10.1038/nbt.1511
  2. Bielow, C., G. Mastrobuoni, and S. Kempa, 2015 Proteomics Quality Control: Quality Control Software for MaxQuant Results. Journal of Proteome Research 15: 777–787. 10.1021/acs.jproteome.5b00780
  3. Geyer, P. E., L. M. Holdt, D. Teupser, and M. Mann, 2017 Revisiting biomarker discovery by plasma proteomics. Molecular Systems Biology 13: 942. 10.15252/msb.20156297

Citing this Tutorial

  1. Melanie Föll, Matthias Fahrner, 2020 Label-free data analysis using MaxQuant (Galaxy Training Materials). /training-material/topics/proteomics/tutorials/maxquant-label-free/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

details BibTeX

@misc{proteomics-maxquant-label-free,
    author = "Melanie Föll and Matthias Fahrner",
    title = "Label-free data analysis using MaxQuant (Galaxy Training Materials)",
    year = "2020",
    month = "04",
    day = "29",
    url = "\url{/training-material/topics/proteomics/tutorials/maxquant-label-free/tutorial.html}",
    note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
        doi = {10.1016/j.cels.2018.05.012},
        url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
        year = 2018,
        month = {jun},
        publisher = {Elsevier {BV}},
        volume = {6},
        number = {6},
        pages = {752--758.e1},
        author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
        title = {Community-Driven Data Analysis Training for Biology},
        journal = {Cell Systems}
}
                    

congratulations Congratulations on successfully completing this tutorial!


