Peptide and Protein ID using SearchGUI and PeptideShaker

Author(s): Florian Christoph Sigloch, Björn Grüning

Overview
Questions:

How to convert LC-MS/MS raw files?

How to identify peptides?

How to identify proteins?

How to evaluate the results?

Objectives:

Protein identification from LC-MS/MS raw files.

Requirements:

Introduction to Galaxy Analyses

Time estimation: 45 minutes

Level: Introductory

Supporting Materials:

Datasets

Workflows

FAQs

Published: Jun 12, 2017

Last modification: Apr 8, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00229

Rating: 4.5 (0 recent ratings, 2 all time)

Revision: 28

Identifying the proteins contained in a sample is an important step in any proteomic experiment. However, in most experimental setups, proteins are digested to peptides before the LC-MS/MS analysis. In this so-called “bottom-up” procedure, only peptide masses are measured. Therefore, protein identification cannot be performed directly from raw data, but is a multi-step process:

1. Raw data preparation
2. Peptide-to-Spectrum matching
3. Peptide inference
4. Protein inference

A plethora of software solutions exists for each step. In this tutorial, we will show how to use the ProteoWizard tool msconvert and the OpenMS tool PeakPickerHiRes for step 1, and the Compomics tools SearchGUI and PeptideShaker for steps 2-4.

For an alternative identification pipeline using only tools provided by the OpenMS software suite, please consult this tutorial.

Input data

As an example dataset, we will use an LC-MS/MS analysis of HeLa cell lysate published in Vaudel et al., 2014, Proteomics. Detailed information about the dataset can be found on PRIDE. For step 2, we will use a validated human UniProt FASTA database without appended decoy sequences. If you already completed the tutorial on Database Handling, you can use the database constructed prior to the DecoyDatabase tool step. You can find a prepared database, as well as the input proteomics data in different file formats, on Zenodo.

Agenda

In this tutorial, we will deal with:

Input data

Preparing Raw Data

Peptide and Protein Identification

Analysis of Contaminants

Evaluation of Peptide and Protein IDs

Premade Workflow

Further Reading

Preparing Raw Data

Raw data conversion is the first step of any proteomic data analysis. The most common converter is msconvert from the ProteoWizard software suite, and the usual target format is mzML. SearchGUI needs MGF format as input, but because we need the mzML format for several other tasks, we will convert to mzML first. For licensing reasons, msconvert can convert vendor raw files only on Windows systems and will not work on most Galaxy servers.

Depending on your machine settings, raw data will be generated either in profile mode or centroid mode. For most peptide search engines, the tandem mass spectrometry (MS2) data have to be converted to centroid mode, a process called “peak picking” or “centroiding”. Machine vendors offer algorithms to extract peaks from profile raw data. These are implemented in the msconvert tool and can be run in parallel to the mzML conversion. However, the OpenMS tool PeakPickerHiRes is reported to generate slightly better results (Lange et al., 2006, Pac Symp Biocomput) and is therefore recommended for quantitative studies (Vaudel et al., 2010, Proteomics). If your data were generated on a low-resolution mass spectrometer, use the PeakPickerWavelet tool instead.
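To make the idea of centroiding concrete, the following Python sketch reduces a small profile-mode signal to one peak per local intensity maximum. It is a toy illustration only and is not the algorithm used by PeakPickerHiRes or msconvert; the function name `centroid` and all example values are invented for this purpose.

```python
from typing import List, Tuple


def centroid(mz: List[float], intensity: List[float],
             min_intensity: float = 0.0) -> List[Tuple[float, float]]:
    """Toy centroiding: one (m/z, summed intensity) pair per local maximum."""
    peaks = []
    for i in range(1, len(mz) - 1):
        is_apex = intensity[i - 1] < intensity[i] >= intensity[i + 1]
        if not is_apex or intensity[i] <= min_intensity:
            continue
        # Intensity-weighted mean m/z over the apex and its two neighbours.
        window = range(i - 1, i + 2)
        total = sum(intensity[j] for j in window)
        centre = sum(mz[j] * intensity[j] for j in window) / total
        peaks.append((centre, total))
    return peaks


# Hypothetical profile data: two peak shapes sampled at 0.01 m/z steps.
profile_mz = [500.00, 500.01, 500.02, 500.03, 500.04, 500.50, 500.51, 500.52]
profile_int = [0.0, 40.0, 100.0, 40.0, 0.0, 0.0, 30.0, 0.0]

centroids = centroid(profile_mz, profile_int)
for centre, total in centroids:
    print(f"{centre:.4f}\t{total:.1f}")
```

Eight profile data points collapse into two centroided peaks here. Real peak pickers fit the peak shape more carefully, which is one reason their results differ between tools.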

Hands On: File Conversion and Peak Picking

We provide the input data in the original raw format and also already converted to MGF and mzML file formats. If the msconvert tool does not run on your Galaxy instance, please download the preconverted mzML file as input.

Create a new history for this Peptide and Protein ID exercise.

To create a new history, simply click the new-history icon at the top of the history panel.

Load the example dataset into your history from Zenodo, either as the raw file or as the preconverted mzML file.

Rename the dataset to something meaningful.

(Optional) Run the msconvert tool on the raw test data to convert it to the mzML format.

Run the PeakPickerHiRes tool on the resulting file. Click + Insert param.algorithm_ms_levels and change the entry to “2”. Thus, peak picking will only be performed on the MS2 level.

Run the FileConverter tool on the picked mzML. In the Advanced Options, set the Output file type to MGF.

Comment: Local Use of MSConvert

The vendor libraries used by msconvert are only licensed for Windows systems and are therefore rarely available on Galaxy instances. If the msconvert tool is not available in your Galaxy instance, please install the software on a Windows computer and run the conversion locally. You can find a detailed description of the necessary steps (“Peak List Generation”). Afterwards, upload the resulting mzML file to your Galaxy history.
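If you want to check what the MGF file produced by FileConverter contains, it helps to know that MGF is a simple text format: each spectrum sits between `BEGIN IONS` and `END IONS`, with `KEY=value` header lines followed by one "m/z intensity" pair per line. The Python sketch below reads such a file into dictionaries. The example spectra and the function name `read_mgf` are invented, and the parser only covers this basic layout.

```python
from typing import Dict, List

# Hypothetical two-spectrum MGF content, made up for this illustration.
EXAMPLE_MGF = """BEGIN IONS
TITLE=hypothetical_scan_1
PEPMASS=400.7 12345.0
CHARGE=2+
98.06 150.0
148.06 300.0
END IONS
BEGIN IONS
TITLE=hypothetical_scan_2
PEPMASS=512.3
CHARGE=3+
175.12 80.0
END IONS
"""


def read_mgf(text: str) -> List[Dict]:
    """Parse basic MGF text into a list of {'params': ..., 'peaks': ...} dicts."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z and intensity separated by whitespace.
                mz, inten = line.split()[:2]
                current["peaks"].append((float(mz), float(inten)))
    return spectra


spectra = read_mgf(EXAMPLE_MGF)
for spectrum in spectra:
    print(spectrum["params"]["TITLE"], len(spectrum["peaks"]), "peaks")
```

Counting the spectra and peaks this way is a quick sanity check that the conversion kept the MS2 spectra you expect.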

Peptide and Protein Identification

Mass spectrometry experiments identify peptides by ionizing them, isolating them, and subsequently colliding them with a gas for fragmentation. This method generates a spectrum of peptide fragment masses for each isolated peptide - an MS2 spectrum. To find out the peptide sequences, the MS2 spectrum is compared to theoretical spectra generated from a protein database. This step is called peptide-to-spectrum (also: spectrum-to-sequence) matching. Accordingly, a spectrum that is successfully matched to a peptide sequence is termed a PSM (Peptide-Spectrum-Match). There can be multiple PSMs per peptide if the peptide was fragmented several times. Different peptide search engines have been developed to perform the matching procedure.
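As a rough illustration of what "comparing to a theoretical spectrum" means, the Python sketch below computes singly charged b- and y-ion m/z values for a peptide from monoisotopic residue masses and counts how many of them are found in a list of observed peaks. This is a deliberately naive shared-peak count, not the scoring used by X!Tandem, MS-GF+ or any other search engine; the function names, the tolerance and the observed peak list are assumptions made for the example.

```python
from typing import Dict, List

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mz(peptide: str) -> Dict[str, List[float]]:
    """Singly charged b ions (b1..) and y ions (longest first, y1 last)."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return {"b": b_ions, "y": y_ions}


def count_matches(peptide: str, observed_mz: List[float], tol: float = 0.02) -> int:
    """Number of theoretical fragments with an observed peak within +/- tol."""
    theoretical = [m for series in fragment_mz(peptide).values() for m in series]
    return sum(any(abs(t - o) <= tol for o in observed_mz) for t in theoretical)


# Hypothetical observed peak list (m/z values only).
observed = [98.06, 148.06, 300.0]
fragments = fragment_mz("PEPTIDE")
n_matched = count_matches("PEPTIDE", observed)
print(f"b1 = {fragments['b'][0]:.4f}, y1 = {fragments['y'][-1]:.4f}")
print("matched fragments:", n_matched)
```

A search engine repeats this kind of comparison for every candidate peptide in the database whose mass fits the precursor, and reports the best-scoring candidate as the PSM.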

It is generally recommended to use more than one peptide search engine and use the combined results for the final peptide inference (Shteynberg et al., 2013, Mol. Cell. Proteomics). Again, there are several software solutions for this, e.g. iProphet (TPP) or ConsensusID (OpenMS). In this tutorial we will use the Search GUI tool, as it can automatically search the data using several search engines. Its partner tool, Peptide Shaker, is then used to combine and evaluate the search engine results.

In bottom-up proteomics, it is necessary to combine the identified peptides into proteins. This is not a trivial task, as protein sequences are redundant in most eukaryotic organisms. Thus, not every peptide can be assigned to only one protein. Luckily, the Peptide Shaker tool already takes care of protein inference and even gives us some information on the validity of the protein identifications. We will discuss validation in a later step of this tutorial.
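The ambiguity behind protein inference can be shown with a small Python sketch: for each identified peptide, list every database protein whose sequence contains it. The sequences, accessions and the function name `proteins_per_peptide` are invented for this illustration, and Peptide Shaker's actual inference is considerably more elaborate.

```python
from typing import Dict, List

# Toy sequences and accessions, invented for this illustration.
PROTEINS = {
    "PROT_A": "MKLVNELTEFAKTCVADESHAGCEK",
    "PROT_B": "MRLVNELTEFAKSSLLDESHK",
}
PEPTIDES = ["LVNELTEFAK", "TCVADESHAGCEK", "SSLLDESHK"]


def proteins_per_peptide(peptides: List[str],
                         proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each peptide to the sorted accessions of proteins containing it."""
    return {
        pep: sorted(acc for acc, seq in proteins.items() if pep in seq)
        for pep in peptides
    }


mapping = proteins_per_peptide(PEPTIDES, PROTEINS)
unique = [pep for pep, accs in mapping.items() if len(accs) == 1]
shared = [pep for pep, accs in mapping.items() if len(accs) > 1]

print("unique peptides:", unique)
print("shared peptides:", shared)
```

Only the unique peptides prove the presence of one specific protein; the shared peptide is compatible with either, which is why inference tools group or rank such proteins rather than simply counting them.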

Hands On: Peptide and Protein Identification

Copy the prepared protein database from the tutorial Database Handling into your current history by using the multiple history view or upload the ready-made database from Zenodo.

Run the Search GUI tool to search the MGF file against the protein database. In the Search Engine Options, select X!Tandem and MS-GF+. In the Protein Modification Options, add the Fixed Modifications: Carbamidomethylation of C and the Variable Modifications: Oxidation of M.

Run the Peptide Shaker tool on the Search GUI output. Enable the following outputs: Zip File for import to Desktop App, mzidentML File, PSM Report, Peptide Report, Protein Report.

Comment: Search GUI Parameters

We ran the Search GUI tool with default settings. When you are processing files from a different experiment, you may need to adjust some of the parameters. Search GUI bundles numerous peptide search engines for matching MS/MS spectra to peptide sequences within a database. In practice, using 2-3 different search engines offers high confidence while keeping analysis time reasonable. In our hands, the X!Tandem, MS-GF+, OMSSA and Comet search algorithms offer good results. The Precursor Options have to be adjusted to the mass spectrometer that was used to generate the files. The default settings fit a high-resolution Orbitrap instrument. In the Advanced Options, you can configure each of the selected search engines in much more detail. When using X!Tandem, we recommend switching off the advanced X!Tandem options Noise suppression, Quick Pyrolidone and Quick Acetyl. When using MS-GF+, we recommend selecting the correct Instrument type.

Comment: PeptideShaker Outputs

Peptide Shaker offers a variety of outputs. The Zip File for import to Desktop App can be downloaded to view and evaluate the search results in the Peptide Shaker viewer. The various Reports contain tabular, human-readable information. Also, an mzidentML (= mzid) file can be created that contains all peptide sequence matching information and can be used by compatible downstream software. The Certificate of Analysis provides details on all parameter settings of both Search GUI and Peptide Shaker used for the analysis.

Question

How many peptides were identified? How many proteins?

How many peptides with oxidized methionine were identified?

You should have identified 3,325 peptides and 1,170 proteins.

328 peptides contain an oxidized methionine (MeO). To get to this number, you can use Select tool on the Peptide Report and search for either “Oxidation of M” or “M<ox>”.

Analysis of Contaminants

The FASTA database used for the peptide-to-spectrum matching contained some entries that were not expected to stem from the HeLa cell lysate, but are common contaminants in LC-MS/MS samples. The main reason to add those is to avoid misassigning their spectra to other proteins. However, it also enables you to check for contamination in your samples. Caution: in human samples, many proteins that are common contaminants may also stem from the real sample. Determining the real source of such human proteins might require further investigation.

Hands On: Analysis of Contaminants

Run the Select tool on the Peptide Shaker Protein Report to select all lines that match the pattern “CONTAMINANT”.

Remove all contaminants from your protein list by running the Select tool on the Peptide Shaker Protein Report. Select only those lines that do not match the pattern “CONTAMINANT”.
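Outside of Galaxy, the two Select steps amount to keeping the lines that do or do not match a pattern. The Python sketch below mimics this on a few made-up report rows; the helper name `select`, the row layout and the accessions are hypothetical and much shorter than a real Protein Report.

```python
import re
from typing import List


def select(lines: List[str], pattern: str, matching: bool = True) -> List[str]:
    """Keep lines that match the regular expression (or, if matching is False, do not)."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) == matching]


# Hypothetical, heavily shortened report rows (real reports have many more columns).
protein_report = [
    "1\tHYPO1_HUMAN\tConfident",
    "2\tCONTAMINANT_TRY_BOVIN\tConfident",
    "3\tHYPO2_HUMAN\tDoubtful",
    "4\tCONTAMINANT_ALBU_BOVIN\tConfident",
]

contaminants = select(protein_report, "CONTAMINANT")
cleaned = select(protein_report, "CONTAMINANT", matching=False)

print(len(contaminants), "contaminant rows,", len(cleaned), "rows kept")
```

The same helper would work with other patterns. Note that a header line that does not contain the pattern is dropped by a matching filter and kept by a non-matching one.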

Question

Which contaminants did you identify? Where do these contaminations come from?

What other sources of contaminants exist?

How many mycoplasma proteins did you identify? Does this mean that the analyzed HeLa cells were infected with mycoplasma?

How many false positives do we expect in our list? How many of these are expected to match mycoplasma proteins?

TRY_BOVIN is bovine trypsin. It was used to digest the proteins to peptides. ALBU_BOVIN is bovine serum albumin. It is added to cell culture medium in high amounts.

Contaminants often stem from the experimenter. These are typically keratins or other highly abundant human proteins. Basically any protein present in the room of the mass spectrometer might get into the ion source if it is airborne. As an example, sheep keratins are sometimes found in proteomic samples, stemming from clothing made of sheep wool.

There should be five Mycoplasma proteins in your protein list. However, all of them stem from different Mycoplasma species. Also, every protein was identified by one peptide only. You can see this in columns 17-19 of your output. These observations make it quite likely that we have identified false positives here.

As we allowed a false discovery rate of 1 %, we would expect about 12 false positive proteins in our list of 1,170 identified proteins. False positives are expected to match randomly to entries in the FASTA database. Our database consists of about 20,000 human proteins and 4,000 mycoplasma proteins. Therefore, we would expect about 17 % of all false positives (= 2 proteins) to match mycoplasma proteins.
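The arithmetic behind this estimate can be written out in a few lines of Python. All numbers are the ones given in this tutorial (1,170 identified proteins, 1 % FDR, roughly 20,000 human and 4,000 mycoplasma database entries); the variable names are chosen for this sketch, and the calculation assumes that false positives hit database entries uniformly at random.

```python
n_identified = 1170      # proteins identified in this tutorial
fdr = 0.01               # allowed false discovery rate (1 %)
n_human = 20000          # approximate number of human entries in the database
n_mycoplasma = 4000      # approximate number of mycoplasma entries

expected_fp = n_identified * fdr
mycoplasma_fraction = n_mycoplasma / (n_human + n_mycoplasma)
expected_mycoplasma_fp = expected_fp * mycoplasma_fraction

print(f"expected false positives: {expected_fp:.1f}")
print(f"mycoplasma share of database: {mycoplasma_fraction:.1%}")
print(f"expected false mycoplasma hits: {expected_mycoplasma_fp:.2f}")
```

Rounded, this gives about 12 false positives, of which about 2 would be expected to fall on mycoplasma entries. Finding five is more than this rough estimate, but the single-peptide evidence still argues against a real infection.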

Evaluation of Peptide and Protein IDs

The Peptide Shaker tool provides you with validation results for the identified PSMs, peptides and proteins. It classifies all these IDs into the categories “Confident” or “Doubtful”. On each level, the meaning of these terms differs to some extent:

PSMs are marked as “Doubtful” when the measured MS2 spectrum did not fit well to the theoretical spectrum.
Peptides have a combined score derived from their PSMs. They are marked as “Doubtful” when the score is below a set threshold. The threshold is defined by the false discovery rate (FDR).
Proteins are marked as “Doubtful” when they were identified by only a single peptide or when they were identified solely by “Doubtful” peptides.
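How an FDR translates into a score threshold can be sketched with the common target-decoy approach: rank all hits by score, and at each cutoff estimate the FDR as the number of decoy hits divided by the number of target hits above it. The Python example below is a generic illustration under that assumption and is not Peptide Shaker's actual scoring or validation procedure; the function name, scores and decoy flags are made up.

```python
from typing import List, Optional, Tuple


def score_threshold_at_fdr(hits: List[Tuple[float, bool]],
                           max_fdr: float = 0.01) -> Optional[float]:
    """Lowest score cutoff whose estimated FDR (decoys / targets) is <= max_fdr.

    hits: (score, is_decoy) pairs; a higher score means a better match.
    Returns None if no cutoff satisfies the requested FDR.
    """
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    targets = decoys = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_cutoff = score
    return best_cutoff


# Hypothetical hits: (score, is_decoy)
example_hits = [(90.0, False), (85.0, False), (80.0, False),
                (70.0, True), (60.0, False), (50.0, True)]

strict_cutoff = score_threshold_at_fdr(example_hits, max_fdr=0.10)
loose_cutoff = score_threshold_at_fdr(example_hits, max_fdr=0.25)
print("cutoff at 10 % FDR:", strict_cutoff)
print("cutoff at 25 % FDR:", loose_cutoff)
```

In this toy list the stricter FDR forces a higher score cutoff, so fewer hits pass. The same trade-off applies when you change the FDR settings in a real analysis.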

Hands On: Evaluation of Peptide and Protein IDs

Remove all “Doubtful” proteins from your protein list by running the Select tool on the Peptide Shaker Protein Report. Select only those lines that do not match the pattern “Doubtful”.

Question

How can you exclude mycoplasma proteins?

How many “Confident” non-contaminant proteins were identified?

Add another Select tool step that keeps only lines matching the pattern “HUMAN”.

You should have identified 582 human non-contaminant proteins that were validated to be “Confident”.

Premade Workflow

A premade workflow for this tutorial is available.

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Citing this Tutorial

Florian Christoph Sigloch, Björn Grüning, Peptide and Protein ID using SearchGUI and PeptideShaker (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/protein-id-sg-ps/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{proteomics-protein-id-sg-ps,
	author = "Florian Christoph Sigloch and Björn Grüning",
	title = "Peptide and Protein ID using SearchGUI and PeptideShaker (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/protein-id-sg-ps/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and others},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

ELIXIR Europe

de.NBI

UFR

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/proteomics/tutorials/protein-id-sg-ps/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: peptideshaker
  owner: galaxyp
  revisions: 78fad25eff17
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
