Peptide and Protein Quantification via Stable Isotope Labelling (SIL)

Overview

question Questions
  • What are MS1 features?

  • How to quantify based on MS1 features?

  • How to map MS1 features to MS2 identifications?

  • How to evaluate and optimize the results?

objectives Objectives
  • MS1 feature quantitation and mapping of quantitations to peptide and protein IDs.

requirements Requirements

time Time estimation: 1 hour

level Level: Intermediate level level level

Supporting Materials

Introduction

To compare protein amounts in different samples from MS/MS data, two different experiment setups exist. Firstly, unmodified proteins can be measured in separate runs at one sample per MS-run. Secondly, proteins of samples to compare can be labelled with small chemical tags, mixed, and measured side-by-side in a single MS-run. There are two types of chemical tags:

  1. Isobaric tags display the same mass on first hand, but fragment during the generation of the MS/MS spectra to yield reporter ions of different mass. The intensity of those reporter ions can be compared in MS/MS spectra. There are two types of isobaric tags commercially available: tandem mass tags (TMT) and isobaric tags for relative and absolute quantitation (iTRAQ).
  2. Isotopic tags are chemically identical, but differ in their mass due to incorporated stable isotopes. Examples of different isotopic tags for stable isotope labelling (SIL) are ICAT, SILAC, dimethylation, or heavy oxygen (18O).

This tutorial deals with protein quantitation via stable isotope labelling (SIL). For isotopic tags, quantitation can be achieved by comparing the intensity of MS1 peptide mass traces. The whole MS1 profile of a peptide, i.e. the intensities of all its isotopic peaks over time, is called a peptide feature (Figure 1a). Incorporation of stable isotopes results in different peptide masses on MS1 level, which give rise to coeluting ion traces in the TIC with a mass difference typical for each different chemical tag (Figure 1b). Figure originally published in Nilse et al, 2015.

In this tutorial, we will use tools of the OpenMS suite to identify and quantify peptides and proteins.

ms1 feature
Figure 1: MS1 mass traces. A) Two peptide features of co-eluting SIL peptides. B) MS1 spectra at a given RT. C) XIC monoisotopic peak light peptide. D) XIC monoisotopic peak heavy peptide.

Prerequisites

If you are in the planning phase of your quantitative proteomics experiment, you may want to consider our tutorial on different quantitation methods first.

To learn about protein identification in Galaxy, please consider our OpenMS-based peptide ID tutorial.

hands_on Hands-on: Introduction

In the hands-on section of this tutorial, we will use a quantitative comparison of HEK cell lysate as a test dataset. In this experiment, HEK cells were once labelled with light, once with heavy SILAC. Both cultures were lysed simultaneously and the cell lysates were mixed in a certain ratio. A detailed description of the full dataset is available in the PRIDE archive.

Your objective in this hands-on-tutorial is to find out the correct mixing ratio of the test sample.

To speed up analysis, the input dataset was filtered to include only those data acquired in second 2000-3000 of the original LC gradient.

Agenda

In this tutorial, we will deal with:

  1. Peptide and Protein Identification
  2. MS1 Feature Detection
  3. Mapping Identifications to Features
  4. Descriptive Statistics and Plotting of Protein Quantitations
  5. Evaluation and Optimization of Quantitation Results
    1. Examples
    2. Typical Problems
    3. Optimization of Quantitation Results

Peptide and Protein Identification

In this tutorial, peptide identification will be performed using the workflow of the previous Peptide ID Tutorial. Alternatively one can perform the protein identification step by step in the Peptide ID Tutorial using the SILAC dataset from zenodo but beware to specify the labels in the param_variable_modifications of XTandemAdapter tool.

A common problem in mass spectrometry are misassigned mono-isotopic precursor peaks. Most search engines allow for some adaptation of the monoisotopic peak and we will use this by leaving By default, misassignment to the first and second isotopic 13C peak are also considered at No.

hands_on Hands-on: Data upload

  1. Create a new history for this SILAC Proteome exercise

    tip Tip: Creating a new history

    Click the new-history icon at the top of the history panel

    If the new-history is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
  2. Import the mzml file, containing the measured mass spectra from Zenodo or a data library:

    https://zenodo.org/record/1051552/files/HEK_SILAC-K6R6_ST905_part.mzml
    
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

    tip Tip: Importing data from a data library

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries

    • Find the correct folder (ask your instructor)

    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import

    comment Comment

    The data have been preprocessed during the conversion from the machine raw file. We used background removal on MS1 and MS2 level, and MS2 deisotoping.

  3. Import the human protein database (including cRAP contaminants and decoys) from zenodo or a data library:

    https://zenodo.org/record/892005/files/Human_database_including_decoys_%28cRAP_added%29.fasta
    
  4. Import the workflow

    https://galaxyproject.github.io///training-material/topics/proteomics/tutorials/protein-id-oms/workflows/workflow.ga
    
  5. Modify the workflow

  6. Connect the mzML input directly to the XTandemAdapter tool node.
  7. Change the XTandemAdapter tool parameters: Add the param_variable_modifications Label:13C(6) (K) and Label:13C(6) (R)

  8. Run the workflow with
    • the mzML dataset 1: mzML dataset
    • the human FASTA database 2: protein FASTA database

tip Tip: Using Galaxy Workflows

If you want to learn more about Galaxy workflows, please consult the Galaxy Introduction

question Questions

  1. How many peptides and proteins were successfully identified?

solution Solution

  1. 2276 non-redundant peptides and 780 proteins were identified.

MS1 Feature Detection

MS1 feature detection is a critical step in quantitative workflows. In principle, there are two different ways to define features:

  1. Feature detection solely based upon MS1 data (mzML/raw files) without prior knowledge of peptide identifications (IDs).
    • Advantage: Feature results can be used to assist in peptide identification.
    • Drawback: Not all peptide identifications can be mapped to features, thus not every identified peptide can be quantified.
  2. Feature detection based upon peptide IDs.
    • Advantage: Most peptide IDs trigger a feature detection.
    • Drawback: Feature results cannot be used to improve peptide ID.

The OpenMS suite provides several tools (FeatureFinders) for MS1 feature detection. For SIL we have to use FeatureFinderMultiplex, which does not take peptide IDs as an input.

hands_on Hands-on: MS1 Feature Detection

  1. Run FeatureFinderMultiplex tool with
    • param-file “LC-MS dataset in centroid or profile mode”: mzML file
    • “Labels used for labelling the samples”: [ ][Arg6,Lys6],
    • “m/z tolerance for search of peak patterns”: 10
    • “Maximum number of missed cleavages due to incomplete digestion”: 1

comment Comment: Multiple labels per peptide

When using SILAC-KR or dimethyl-labelling and trypsin digestion, exactly one labelled amino acid per peptide is expected. The only labelled amino acids are lysine (K) and arginine (R) and trypsin cuts after each of them. However, a small percentage of missed cleavage normally occur also in those datasets. Setting “Maximum number of missed cleavages due to incomplete digestion” to 1 will be sufficient to deal with most missed cleavages.

When using other enzymes (e.g. Lys-C) or other labels (e.g. ), several labelled amino acids per peptide are expected. You can search for those features by increasing the parameter “Maximum number of missed cleavages due to incomplete digestion”.

Mapping Identifications to Features

We now have feature quantifications for MS1 elution peaks, peptide identifications for the MS2 spectra and protein identifications. The next step is to map the MS2-based peptide identifications to the quantified MS1 precursor peaks (“peptide features”). This will enable the quantification of identified peptides. For labelled data, it is necessary to map peptide identifications to consensus features (i.e. a pair of one light peptide feature with one matching heavy feature in the correct m/z distance). For consensusXML, IDMapper uses the consensus centroids, not the feature boundaries for mapping. Therefore, the RT tolerance has to be set higher than for mapping to featureXML. A good starting value is half the expected typical retention time.

Sometimes several peptide identifications are mapped to a feature. The tool IDConflictResolver filters the mapping so that only the identification with the best score is associated to each feature. Another refinement of the quantitative result is obtained by removing falsely mapped identifications e.g. light identification mapped to heavy feature. This step is performed by the MultiplexResolver tool that returns a first file with the correctly mapped peptides and as a second output the falsly mapped peptides.

Finally, the correctly mapped peptides will be combined into protein quantifications with the ProteinQuantifier tool.

hands_on Hands-on: Quant to ID matching

  1. Run IDMapper tool with
    • param-file “Protein/peptide identifications file”: output of IDFilter
    • param-file “Feature map/consensus map file”: the consensusXML output of FeatureFinderMultiplex
    • “RT tolerance (in seconds) for the matching of peptide identifications and (consensus) features”: 20
    • “m/z tolerance (in ppm or Da) for matching of peptide identifications and (consensus) features”: 10
    • “Match using RT and m/z of sub-features instead of consensus RT and m/z”: Yes
    • “Advanced Options”: Show advanced options
    • “Store the map index of the sub-feature in the peptide ID”: Yes
  2. Run FileFilter tool with
    • param-file “Input file”: output of IDMapper
    • “Remove unassigned peptide identifications”: Yes
  3. Run IDConflictResolver tool
    • param-file “Input file”: output of FileFilter
  4. Run MultiplexResolver tool with
    • param-file “Peptide multiplets with assigned sequence information”: output of IDConflictResolver
    • “Labels used for labelling the samples”: [ ][Arg6,Lys6]
    • “Maximum number of missed cleavages due to incomplete digestion”: 1
  5. Run ProteinQuantifier tool with
    • param-file “Input file”: first output of MultiplexResolver
    • param-file “Protein inference results”: output of IDFilter
    • “Calculate protein abundance from this number of proteotypic peptides (most abundant first; ‘0’ for all)”: 0
    • “Averaging method used to compute protein abundances from peptide abundances”: sum
    • “Add the log2 ratios of the abundance values to the output”: Yes

comment Comment: ProteinQuantifier parameters

Peptide quantitation algorithms are more precise for high abundant peptides. Therefore, it is recommended to base protein quantitations on those peptides. In ProteinQuantifier, you may restrict the calculation of protein abundances to the most abundant peptides by using the option “Calculate protein abundance from this number of proteotypic peptides”. However, we recommend to use the averaging method sum instead. By using this option, protein ratios are based on the sum of all peptide abundances. Thus, highly abundant peptides thus have more influence on protein abundance calculation than low abundant peptides. A simple sum-of-intensities algorithm provided the best estimates of true protein ratios in a comparison of several protein quantitation algorithms (Carrillo et al., Bioinformatics, 2009).

Descriptive Statistics and Plotting of Protein Quantitations

ProteinQuantifier produces two output tables: the first one gives information about the quantified proteins, the second one gives information about the quantified peptides. For proteins, we added a log-transformed ratio to the output, which is saved in column 8 of the protein table. The ratio is calculated as log2 (abundance2/abundance1), which is sometimes called the fold change (FC) ratio.

To get a quick overview of the results, you can calculate basic descriptive statistics and plot the data as a histogram. Comment lines in the beginning of a tabular file may sometimes cause errors, therefore we will remove them with the tool Select last lines from a dataset (tail).

hands_on Hands-on: Descriptive Statistics

  1. Run Summary Statistics tool with
    • param-file “Summary statistics on”: protein table output (first file) of ProteinQuantifier
    • “Column or expression: c8
  2. Run Select last (tail) tool with
    • param-file “Text file”: protein table output (first file) of ProteinQuantifier
    • “Operation”: Keep everything from this line on
    • “Number of lines”: 4
  3. Run Histogram tool with
    • param-file “Dataset”: output of Select last
    • “Numerical column for x axis”: Column: 8
    • “Number of breaks (bars)”: 20
    • “Plot title” and “Label for x axis” to something meaningful.

comment Calculating descriptive statistics for peptides

The peptide table output of ProteinQuantifier does not give the log-transformed ratio for each peptide. Nonetheless, you may calculate basic statistics of the FC values by running Summary Statistics with “Column or expression set to log(c6/c5,2).

question Questions

  1. How many peptides and proteins were successfully quantified?
  2. What might have been the mixing ratio of the input dataset?
  3. In the histogram, there is a second local maximum at about FC 0. What might that mean?

solution Solution

  1. With the above parameters, you should have quantified 792 peptides and 408 proteins.
  2. In the histogram, you see that the peak of the density curve is between -1.1 and -1.2. In the summary statistics, you can see that the mean protein ratio was -0.98. An FC of -1 indicates that the unlabelled proteins were twice as abundant as their heavy-labelled counterparts. Indeed, the mixing ratio of the dataset was 2 parts light labelled HEK cell lysate and 1 part heavy labelled HEK cell lysate.
  3. Some proteins were quantified with an FC close to 0. These may stem from incomplete SILAC labelling. Even after two weeks of cell culture in SILAC medium, some proteins with a very low turnover rate may remain unlabelled.

Evaluation and Optimization of Quantitation Results

Protein quantitation is a multi-step procedure. Many parameters of different steps influence the final results. Therefore, it is recommended to optimize the tool parameters for each dataset and to carefully evaluate quantitation results. While the total number of quantified proteins is a first important parameter for optimization, it is also necessary to visualize the results and check for correct feature finding and ID mapping.

Galaxy does not provide a tool for proteomics visualization, we recommend to use the OpenMS Viewer TOPPView. Basic TOPPView tutorials are available as videos and a more comprehensive tutorial as HTML or PDF.

For the optimization of tool parameters, it is recommended not to work with a complete LC-MS/MS run. Instead, we will use FileFilter to extract a small RT-slice of our input dataset, i.e. a fraction of the original dataset that was measured during a short period of time. Reducing the test data reduces the time needed for analysis and facilitates visual examination of the data.

Using Galaxy Workflows enables us to quickly re-run a full analysis with changed parameters. To learn about Galaxy Workflows, consult this tutorial.

Cave: Be aware that only very small parts of your dataset can be checked by visual examination. To minimize biases, try to look at the same areas / features of each result file.

hands_on Hands-on: Data reduction and visual evaluation with TOPPView

  1. Run FileFilter tool with
    • param-file “Input file”: mzML file
    • “Retention time range to extract”: 2000:2200
  2. Extract a workflow out of your history or import the premade workflow

  3. Run the whole workflow again with default settings on the reduced mzML file.

  4. Run FileFilter tool with
    • param-file “Input file”: IDConflictResolver output
    • “Remove features without annotations”: Yes
  5. Rename the FileFilter output to Annotated features

    tip Tip: Renaming a dataset

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button
  6. Run FileFilter tool with
    • param-file “Input file”: IDConflictResolver output
    • “Remove features without annotations”: Yes
  7. Rename the FileFilter output to UNannotated features

  8. Download locally the following files:
    • Spectra: mzML file
    • peptide IDs: IDScoreSwitcher idXML output file
    • features: FeatureFinderMultiplex featureXML output file
    • consensus features: FileFilter consensusXML output files (“Annotated” and “UNannotated” features)
  9. Open TOPPView
  10. Open the mzML file in TOPPView with
    • Open as set to new window
    • Map view set to 2D
    • Low intensity cutoff set to off
  11. Open all other downloaded files in TOPPView with
    • Open as set to new layer
  12. Activate the mzML layer and click on Show projections
showProjections
Figure 2: The projections display intensities plotted against RT (top panel) and plotted against m/z (right panel).
  1. Activate the consensusXML layers and click on Show consensus feature element positions
showConsensus
Figure 3: Arrows between linked light and heavy features centroids are displayed.
  1. Evaluate your data analysis, by
    • zooming into a specific region (hold Ctrl and use the mouse to zoom)
    • measuring m/z and RT distances (select the mzML layer, hold Shift and use the mouse to measure)
    • displaying an area in 3D view (right-click into the 2D View and select Switch to 3D view)
    • switching on and off the display of single layers (left-click at the tick-boxes in the window “Layers”)

Examples

  1. Displaying annotated vs. UNannotated features: visualize annotated (= mapped) and unannotated (= unmapped) features by switching between activating only the “Annotated_features.consensusxml” or only the “UNannotated_features.consensusxml” layer

    displaying feature annotations
    Figure 4: The feature below was mapped to and annotated with a peptide identification, the feature above was not mapped. A) Annotated layer is active. B) UNannotated layer is active.
  2. Correct mapping: a feature was detected, a peptide was identified and the two were mapped.

    Correct mapping of a high-abundant peptide
    Figure 5: A high-abundant peptide. A) 2D View. B) 3D View.
    Correct mapping of a low-abundant peptide
    Figure 6: A low-abundant peptide. A) 2D View. B) 3D View.
  3. No feature detected for a contaminant. Contaminants are often not labelled, but occur only in their unlabelled isoform. Therefore, they do not give rise to a consensus feature in FeatureFinderMultiplex.

    contaminant
    Figure 7: A contaminant. There are no elution peaks for the heavy labelled peptide isoform. A) 2D View. B) 3D View.

hands_on Hands-on: Check a possible contaminant

  1. Run TextExporter tool with
    • param-file “Input file”: IDFilter output file
  2. Run Search in textfiles (grep) tool with
    • param-file“Select lines from”: output of TextExporter
    • “Regular Expression: the peptide sequence (e.g. MFLSFPTTK)
  3. Check if the peptide was mapped to a protein marked with “CONTAMINANT”.

Typical Problems

Three problems typically hamper correct peptide mapping:

  1. A feature is detected, but no peptide identification is nearby.

    • Possible cause: This may be caused by imperfect peptide identification. However, it is never expected that every single MS2-spectrum leads to an identification. The protein might be missing in the database, or the peptide may carry a modification that was not included in the search.
    • Possible solution: Improve your search engine settings.
    noID
    Figure 8: Several features without peptide IDs. Several MS2 spectra were generated (fragment scan precursors marked with red circles), but did not lead to a peptide identification.
  2. A peptide was identified, but no feature is nearby.

    • Possible cause:
      1. The elution peaks of the peptide may be distorted. This is typical for low intensity peptides. If a lot of peptides have distorted elution peaks this may be a sign of spray instability.
      2. The peptide is a contaminant.
    • Possible solution:
      1. Lower the FeatureFinderMultiplex parameters Two peptides in a multiplet are expected to have the same isotopic pattern and/or The isotopic pattern of a peptide should resemble the averagine model at this m/z position or broaden the Range of isotopes per peptide in the sample (in Advanced Options).
      2. No optimization of parameters is needed (see example above)
    noID
    Figure 9: A peptide was identified in its unlabelled and in its labelled isoform, but no feature was detected. A) 2D View. No third isotopic peak was detected for the labelled peptide. B) 3D View.
  3. A peptide was identified and a feature was detected nearby, but the two are not mapped to each other.

    • Possible cause:
      1. The MS2 event and the feature are too far apart to be mapped.
      2. The precursor of the MS2 was not correctly assigned to the mono-isotopic peak.
      3. The detected feature is too small in RT dimension and covers only a part of the peptide peaks.
    • Possible solution:
      1. Increase the IDMapper parameter RT tolerance (in seconds) for the matching of peptide identifications and (consensus) features.
      2. Feature size in RT dimension cannot be directly corrected, use solution 1 instead.
    noMapping
    Figure 10: A feature was detected, but the RT dimension is shorter than the peptides's elution peak. Also, the 2nd isotopic peak was fragmented and was not corrected due to the short feature. A) 2D View. B) 3D View.

Two problems typically disturb correct peptide quantitation:

  1. A peptide is mapped to the wrong feature.
    • Possible cause: Co-eluting peptides of a similar mass may be falsely mapped to a nearby feature, if the correct peptide did not lead to an identification or was identified only with a low score. In high-resolution data, this problem of is very limited. Co-eluting peptides can normally be distinguished by slightly different m/z values. In low-resolution data, wrong assignment may occur more often.
    • Possible solution: If a high value is used for the precursor mass tolerance, try to keep the RT tolerance low to avoid false mapping.
  2. Background noise (1) or co-eluting peptides (2) are incorporated in a feature.
    • Possible solution:
      1. Use noise-filtering either during pre-processing or by increasing the FeatureFinderMultiplex parameter Lower bound for the intensity of isotopic peaks
      2. Reduce the FeatureFinderMultiplex parameter m/z tolerance for search of peak patterns.

question Question

  1. How many peptides could not be mapped to MS1 features? (Click on the IDMapper output and look at the tool’s infobox.)
  2. How many features could not be mapped to a peptide identification? (Click on the ProteinQuantifier output and look at the tool’s infobox.)
  3. Which problems are most prominent in the test dataset?

solution Solution

  1. 1,395 peptide IDs could not be mapped to a feature.
  2. 1,898 features, corresponding to 949 consensus features could not be mapped to a peptide identification.
  3. The mapping of peptide IDs to features seems to have worked mostly fine. The main problems seem to be (1) missing peptide identifications, (2) missing features where a peptide was identified and (3) features that span a shorter RT range than the corresponding peptide’s elution peak.

Optimization of Quantitation Results

For optimization, it is critical to modify only one parameter at a time. Also, it is recommended to optimize the tools in the order of their position in the workflow.

In the test dataset, several peptides were identified, but not quantified. Some of the peptides were even identified in the unlabelled, as well as in the labelled form. To optimize the feature detection, we will relax the parameters of FeatureFinderMultiplex.

hands_on Hands-on: Optimize Feature Detection

  1. Run the whole workflow again:
    • Change the FeatureFinderMultiplex parameter “Range of isotopes per peptide in the sample” from 3:6 to 2:6.
  2. Integrate the HighResPrecursorMassCorrector tool into the workflow.
    • Use the mzML file and the featureXML from the FeatureFinderMultiplex tool as input
    • Set Additional retention time tolerance added to feature boundaries between 0.0 and 10.0
    • Connect the output with the search engine.
  3. Compare the number of identified proteins, unmatched features and unmapped peptides for each parameter setting.
  4. Visualize the results with TOPPView to check for correct feature detection and feature-to-peptide mapping.

tip Tip: Sending workflow results to new history

When running a workflow, you may send the results to a new history. This helps keeping track of different parameter settings.

question Question

Which parameter improved the number of quantified proteins?

solution Solution

Both changes led to more quantified proteins. Increasing the isotope range led to 26 \% more protein quantitations, increasing the RT tolerance led to 7 \% more protein quantitations.

keypoints Key points

  • Peptides labelled with stable isotopes result in multiple parallel MS1 ion traces.

  • MS1 features can be used for relative protein quantitation.

  • Quantitations have to be mapped to PSMs.

  • PSM quantitations can be used to calculate protein quantitations.

  • Proper quantitation and mapping needs careful evaluation and optimization.

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

congratulations Congratulations on successfully completing this tutorial!