# Peptide and Protein ID using SearchGUI and PeptideShaker

### Overview

Questions
• How to convert LC-MS/MS raw files?

• How to identify peptides?

• How to identify proteins?

• How to evaluate the results?

Objectives
• Protein identification from LC-MS/MS raw files.

Requirements
Time estimation: 45 minutes
Level: Introductory
Supporting Materials
Last modification: May 31, 2021

# Introduction

Identifying the proteins contained in a sample is an important step in any proteomic experiment. However, in most experimental setups, proteins are digested to peptides before the LC-MS/MS analysis. In this so-called “bottom-up” procedure, only peptide masses are measured. Therefore, protein identification cannot be performed directly from raw data, but is a multi-step process:

1. Raw data preparation
2. Peptide-to-Spectrum matching
3. Peptide inference
4. Protein inference

A plethora of software solutions exists for each step. In this tutorial, we will show how to use the ProteoWizard tool msconvert and the OpenMS tool PeakPickerHiRes for step 1, and the CompOmics tools SearchGUI and PeptideShaker for steps 2-4.

For an alternative identification pipeline using only tools provided by the OpenMS software suite, please consult this tutorial.

# Input data

As an example dataset, we will use an LC-MS/MS analysis of HeLa cell lysate published in Vaudel et al., 2014, Proteomics. Detailed information about the dataset can be found on PRIDE. For step 2, we will use a validated human UniProt FASTA database without appended decoy sequences. If you have already completed the tutorial on Database Handling, you can use the constructed database prior to the DecoyDatabase step. You can find a prepared database, as well as the input proteomics data in different file formats, on Zenodo.

### Agenda

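In this tutorial, we will deal with:

• Preparing Raw Data
• Peptide and Protein Identification
• Analysis of Contaminants
• Evaluation of Peptide and Protein IDs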

# Preparing Raw Data

Raw data conversion is the first step of any proteomic data analysis. The most common converter is msconvert from the ProteoWizard software suite, and the usual target format is mzML. SearchGUI needs the MGF format as input, but as we need the mzML format for several other tasks, we will convert to mzML first. For licensing reasons, msconvert runs only on Windows systems and will not work on most Galaxy servers.

Depending on your machine settings, raw data will be generated either in profile mode or centroid mode. For most peptide search engines, the tandem mass spectrometry (MS2) data have to be converted to centroid mode, a process called “peak picking” or “centroiding”. Machine vendors offer algorithms to extract peaks from profile raw data. These are implemented in msconvert and can be run in parallel to the mzML conversion. However, the OpenMS tool PeakPickerHiRes is reported to generate slightly better results (Lange et al., 2006, Pac Symp Biocomput) and is therefore recommended for quantitative studies (Vaudel et al., 2010, Proteomics). If your data were generated on a low resolution mass spectrometer, use PeakPickerWavelet instead.
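To make the idea of centroiding concrete, the following minimal Python sketch collapses each contiguous run of non-zero profile points into a single peak. It is only an illustration under simplifying assumptions: the function names and the toy spectrum are hypothetical, and this is not the algorithm used by PeakPickerHiRes or by the vendor libraries.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def _summarise(run: List[Peak]) -> Peak:
    """Return the intensity-weighted mean m/z and the maximum intensity of a run."""
    total = sum(intensity for _, intensity in run)
    weighted_mz = sum(mz * intensity for mz, intensity in run) / total
    return (weighted_mz, max(intensity for _, intensity in run))


def centroid(profile: List[Peak], min_intensity: float = 0.0) -> List[Peak]:
    """Collapse each contiguous run of points above min_intensity into one peak."""
    centroids: List[Peak] = []
    run: List[Peak] = []
    for mz, intensity in profile:
        if intensity > min_intensity:
            run.append((mz, intensity))
        elif run:
            # The run ended: emit one centroid for it.
            centroids.append(_summarise(run))
            run = []
    if run:
        centroids.append(_summarise(run))
    return centroids


# Toy profile spectrum with two peaks (made-up values).
profile = [
    (500.00, 0.0), (500.01, 10.0), (500.02, 40.0), (500.03, 10.0), (500.04, 0.0),
    (600.00, 0.0), (600.01, 5.0), (600.02, 5.0), (600.03, 0.0),
]
peaks = centroid(profile)
for mz, intensity in peaks:
    print(f"{mz:.4f}\t{intensity:.1f}")
```

Each profile peak, originally described by several data points, is reduced to one m/z and intensity pair, which is the form most search engines expect for MS2 spectra.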

### Hands-On: File Conversion and Peak Picking

We provide the input data in the original raw format and also already converted to the MGF and mzML file formats. If msconvert does not run on your Galaxy instance, please download the preconverted mzML file as input.

1. Create a new history for this Peptide and Protein ID exercise.

### Tip: Creating a new history

Click the new-history icon at the top of the history panel.

If the new-history icon is missing:

1. Click on the galaxy-gear icon (History options) at the top of the history panel.
2. Select the option Create New from the menu.

2. Load the example dataset into your history from Zenodo: raw or mzML.
3. Rename the dataset to something meaningful.
4. (optional) Run msconvert tool on the test data to convert to the mzML format.
5. Run PeakPickerHiRes tool on the resulting file. Click + Insert param.algorithm_ms_levels and change the entry to “2”. Thus, peak picking will only be performed on MS2 level.
6. Run FileConverter tool on the picked mzML. In the Advanced Options set the Output file type to MGF.
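If you want to check the resulting MGF file outside of Galaxy, a small parser is enough, because MGF is a plain-text format in which each spectrum is enclosed by `BEGIN IONS` and `END IONS`. The sketch below is a minimal reader written for illustration; the function name `read_mgf` and the example spectra are hypothetical, and it does not cover every optional feature of the format.

```python
from typing import Dict, Iterable, List


def read_mgf(lines: Iterable[str]) -> List[Dict]:
    """Parse MGF text into a list of spectra with 'params' and 'peaks'."""
    spectra: List[Dict] = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by intensity.
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# Made-up example content; real files come from the FileConverter step.
example = """BEGIN IONS
TITLE=scan=1
PEPMASS=500.25 12345.0
CHARGE=2+
100.1 10.0
200.2 20.0
END IONS
BEGIN IONS
TITLE=scan=2
PEPMASS=610.30
CHARGE=3+
150.5 5.0
END IONS
"""
spectra = read_mgf(example.splitlines())
print(len(spectra), "spectra read")
```

To use it on a downloaded file, pass an open file handle instead of `example.splitlines()`.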

### Comment: Local Use of MSConvert

The vendor libraries used by msconvert are only licensed for Windows systems and are therefore rarely available in Galaxy instances. If msconvert is not available in your Galaxy instance, please install the software on a Windows computer and run the conversion locally. You can find a detailed description of the necessary steps here (“Peak List Generation”). Afterwards, upload the resulting mzML file to your Galaxy history.

# Peptide and Protein Identification

Mass spectrometry experiments identify peptides by isolating them, ionizing them, and subsequently colliding them with a gas for fragmentation. This method generates a spectrum of peptide fragment masses for each isolated peptide - an MS2 spectrum. To find out the peptide sequences, the MS2 spectrum is compared to theoretical spectra generated from a protein database. This step is called peptide-to-spectrum (also: spectrum-to-sequence) matching. Accordingly, a spectrum that is successfully matched to a peptide sequence is termed a PSM (Peptide-Spectrum-Match). There can be multiple PSMs per peptide if the peptide was fragmented several times. Different peptide search engines have been developed to carry out the matching procedure.
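The comparison between a measured and a theoretical spectrum can be sketched in a few lines of Python. The example below computes singly charged b- and y-ion m/z values from monoisotopic residue masses and counts how many of them are found in a list of observed peaks. It is a deliberately simplified illustration: the function names, the tolerance and the observed peak list are assumptions, and real search engines such as X!Tandem or MS-GF+ use far more elaborate scoring.

```python
from typing import Dict, List

# Monoisotopic residue masses in Da.
MONO: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_mz(peptide: str) -> Dict[str, List[float]]:
    """Return singly charged b- and y-ion m/z values (b1.., y1..) of a peptide."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return {"b": b, "y": y}


def count_matches(observed_mz: List[float], peptide: str, tol: float = 0.02) -> int:
    """Count theoretical fragments that have an observed peak within tol (Da)."""
    ions = fragment_mz(peptide)
    theoretical = ions["b"] + ions["y"]
    return sum(any(abs(o - t) <= tol for o in observed_mz) for t in theoretical)


# Hypothetical observed MS2 peaks for the toy peptide "GAK".
observed = [129.066, 147.113, 300.0]
ions = fragment_mz("GAK")
n_matched = count_matches(observed, "GAK")
print(ions, n_matched)
```

In this toy case, two of the four theoretical fragments (b2 and y1) are found among the observed peaks. A search engine performs this kind of comparison for every candidate peptide in the database and reports the best-scoring one as the PSM.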

It is generally recommended to use more than one peptide search engine and to use the combined results for the final peptide inference (Shteynberg et al., 2013, Mol. Cell. Proteomics). Again, there are several software solutions for this, e.g. iProphet (TPP) or ConsensusID (OpenMS). In this tutorial we will use Search GUI, as it can automatically search the data using several search engines. Its partner tool Peptide Shaker is then used to combine and evaluate the search engine results.

In bottom-up proteomics, it is necessary to combine the identified peptides to proteins. This is not a trivial task, as protein sequences are redundant in most eukaryotic organisms. Thus, not every peptide can be assigned to only one protein. Luckily, Peptide Shaker already takes care of protein inference and even gives us some information on the validity of the protein identifications. We will discuss validation in a later step of this tutorial.
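The ambiguity described above can be illustrated with a small Python sketch that separates peptides mapping to exactly one protein from peptides shared between several proteins. The peptide sequences and protein names are made up, and the function `classify_peptides` is a hypothetical helper; it does not reproduce the protein inference algorithm of Peptide Shaker.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def classify_peptides(
    peptide_to_proteins: Dict[str, Set[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split peptides into protein-unique ones and ones shared between proteins."""
    unique: Dict[str, List[str]] = defaultdict(list)
    shared: Dict[str, List[str]] = {}
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            # The peptide points to a single protein.
            unique[next(iter(proteins))].append(peptide)
        else:
            # The peptide could stem from any of several proteins.
            shared[peptide] = sorted(proteins)
    return dict(unique), shared


# Made-up mapping of identified peptides to database proteins.
peptide_to_proteins = {
    "PEPTIDEA": {"PROT_1"},
    "PEPTIDEB": {"PROT_1", "PROT_2"},
    "PEPTIDEC": {"PROT_2"},
    "PEPTIDED": {"PROT_2", "PROT_3"},
}
unique, shared = classify_peptides(peptide_to_proteins)
print(unique)
print(shared)
```

Here PROT_3 is supported only by a shared peptide, so its presence in the sample cannot be concluded from these data alone. This is the kind of case a protein inference step has to resolve.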

### Hands-On: Peptide and Protein Identification

1. Copy the prepared protein database from the tutorial Database Handling into your current history by using the multiple history view or upload the ready-made database from this link.
2. Open Search GUI to search the MGF file against the protein database. In the Search Engine Options select X!Tandem and MS-GF+. In the Protein Modification Options add the Fixed Modifications: Carbamidomethylation of C and the Variable Modifications: Oxidation of M.
3. Run Peptide Shaker tool on the Search GUI output. Enable the following outputs: Zip File for import to Desktop App, mzidentML File, PSM Report, Peptide Report, Protein Report.

### Comment: Search GUI Parameters

We ran Search GUI with default settings. When you are processing files from a different experiment, you may need to adjust some of the parameters. Search GUI bundles numerous peptide search engines for matching MS/MS spectra to peptide sequences within a database. In practice, using 2-3 different search engines offers high confidence while keeping the analysis time reasonable. In our hands, the X!Tandem, MS-GF+, OMSSA and Comet search algorithms offer good results. The Precursor Options have to be adjusted to the mass spectrometer that was used to generate the files. The default settings fit a high resolution Orbitrap instrument. In the Advanced Options you may set much more detailed parameters for each of the search engines used. When using X!Tandem, we recommend switching off the advanced X!Tandem options Noise suppression, Quick Pyrolidone and Quick Acetyl. When using MS-GF+, we recommend selecting the correct Instrument type.

### Comment: PeptideShaker Outputs

Peptide Shaker offers a variety of outputs. The Zip File for import to Desktop App can be downloaded to view and evaluate the search results in the Peptide Shaker viewer (Download). The various Reports contain tabular, human-readable information. Also, an mzidentML (= mzid) file can be created that contains all peptide sequence matching information and can be used by compatible downstream software. The Certificate of Analysis provides details on all parameter settings of both Search GUI and Peptide Shaker used for the analysis.

### Questions:

1. How many peptides were identified? How many proteins?
2. How many peptides with oxidized methionine were identified?

### Solution

1. You should have identified 3,325 peptides and 1,170 proteins.
2. 328 peptides contain an oxidized methionine (MeO). To get to this number, you can use Select tool on the Peptide Report and search for either “Oxidation of M” or “M<ox>”.

# Analysis of Contaminants

The FASTA database used for the peptide-to-spectrum matching contained some entries that were not expected to stem from the HeLa cell lysate, but are common contaminants in LC-MS/MS samples. The main reason to add those is to avoid misassignment of their spectra to other proteins. However, it also enables you to check for contaminations in your samples. Caution: in human samples, many proteins that are common contaminants may also stem from the real sample. Determining the real source of such human proteins might require advanced investigation.

### Hands-On: Analysis of Contaminants

1. Run Select tool on the Peptide Shaker Protein Report to select all lines that match the pattern “CONTAMINANT”.
2. Remove all contaminants from your protein list by running Select tool on the Peptide Shaker Protein Report. Select only those lines that do not match the pattern “CONTAMINANT”.
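For readers who want to reproduce the two Select steps outside of Galaxy, the following Python sketch keeps or drops lines depending on whether they match a pattern. The function `select_lines` is a hypothetical stand-in for the Select tool, and the report lines are made up; the real Protein Report has many more columns.

```python
import re
from typing import List


def select_lines(lines: List[str], pattern: str, keep_matching: bool = True) -> List[str]:
    """Keep the lines that match the pattern, or those that do not match it."""
    regex = re.compile(pattern)
    return [
        line for line in lines
        if (regex.search(line) is not None) == keep_matching
    ]


# Made-up, tab-separated stand-in for a Protein Report.
report = [
    "1\tsp|P00000|EXAMPLE1_HUMAN\tConfident",
    "2\tCONTAMINANT_TRY_BOVIN\tConfident",
    "3\tsp|P00001|EXAMPLE2_HUMAN\tDoubtful",
    "4\tCONTAMINANT_ALBU_BOVIN\tConfident",
]
contaminants = select_lines(report, "CONTAMINANT")
cleaned = select_lines(report, "CONTAMINANT", keep_matching=False)
print(len(contaminants), "contaminant lines,", len(cleaned), "remaining lines")
```

The same helper can be chained with other patterns, for example to drop every line containing “Doubtful”.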

### Questions

1. Which contaminants did you identify? Where do these contaminations come from?
2. What other sources of contaminants exist?
3. How many mycoplasma proteins did you identify? Does this mean that the analyzed HeLa cells were infected with mycoplasma?
4. How many false positives do we expect in our list? How many of these are expected to match mycoplasma proteins?

### Solution

1. TRY_BOVIN is bovine trypsin. It was used to digest the proteins to peptides. ALBU_BOVIN is bovine serum albumin. It is added to cell culture medium in high amounts.
2. Contaminants often stem from the experimenter; these are typically keratins or other highly abundant human proteins. Basically any protein present in the room of the mass spectrometer might get into the ion source if it is airborne. As an example, sheep keratins are sometimes found in proteomic samples, stemming from clothing made of sheep wool.
3. There should be five Mycoplasma proteins in your protein list. However, all of them stem from different Mycoplasma species. Also, every protein was identified by one peptide only. You can see this in columns 17-19 of your output. These observations make it quite likely that these identifications are false positives.
4. As we were allowing for a false discovery rate of 1 %, we would expect about 12 false positive proteins among the 1,170 proteins in our list. False positives are expected to be assigned randomly to entries in the FASTA database. Our database consists of about 20,000 human proteins and 4,000 mycoplasma proteins. Therefore, we would expect about 17 % of all false positives (= 2 proteins) to match mycoplasma proteins.
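The arithmetic behind the last answer can be written out explicitly. The short Python snippet below uses only the numbers given in this tutorial (1,170 identified proteins, 1 % FDR, about 20,000 human and 4,000 mycoplasma database entries); the variable names are chosen for illustration.

```python
# Numbers taken from the tutorial text.
identified_proteins = 1170
fdr = 0.01
human_entries = 20_000
mycoplasma_entries = 4_000

# Expected number of false positive proteins at 1 % FDR.
expected_false_positives = identified_proteins * fdr

# Share of database entries that are mycoplasma proteins.
mycoplasma_fraction = mycoplasma_entries / (human_entries + mycoplasma_entries)

# Expected number of false positives that hit a mycoplasma entry,
# assuming false positives are spread randomly over the database.
expected_mycoplasma_fp = expected_false_positives * mycoplasma_fraction

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Mycoplasma share of database: {mycoplasma_fraction:.1%}")
print(f"Expected mycoplasma false positives: {expected_mycoplasma_fp:.2f}")
```

Rounded, this gives about 12 false positives, of which about 2 would be expected to match mycoplasma proteins by chance.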

# Evaluation of Peptide and Protein IDs

Peptide Shaker provides you with validation results for the identified PSMs, peptides and proteins. It classifies all these IDs into the categories “Confident” or “Doubtful”. On each level, the meaning of these terms differs to some extent:

• PSMs are marked as “Doubtful” when the measured MS2 spectrum did not fit well to the theoretical spectrum.
• Peptides have a combined scoring of their PSMs. They are marked as “Doubtful”, when the score is below a set threshold. The threshold is defined by the false discovery rate (FDR).
• Proteins are marked as “Doubtful”, when they were identified by only a single peptide or when they were identified solely by “Doubtful” peptides.
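How an FDR can define a score threshold is easiest to see with a generic target-decoy calculation. In the sketch below, each match carries a score and a flag saying whether it hit a decoy sequence, and the FDR at a given cutoff is estimated as the number of decoy hits divided by the number of target hits above that cutoff. This is a simplified illustration with made-up scores and a hypothetical function name; it is not the exact procedure implemented in Peptide Shaker.

```python
from typing import List, Optional, Tuple


def score_threshold_at_fdr(
    matches: List[Tuple[float, bool]], max_fdr: float = 0.01
) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    matches: (score, is_decoy) pairs, where a higher score is better.
    The FDR is estimated as decoy hits / target hits at or above the cutoff.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = 0
    decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            threshold = score
    return threshold


# Made-up scores: (score, is_decoy).
matches = [(90.0, False), (80.0, False), (70.0, False), (60.0, True), (50.0, False)]
strict = score_threshold_at_fdr(matches, max_fdr=0.10)
relaxed = score_threshold_at_fdr(matches, max_fdr=0.25)
print(strict, relaxed)
```

With the stricter FDR, only matches scoring at least 70 pass. Relaxing the FDR to 25 % lowers the cutoff to 50, which also accepts the one decoy hit.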

### Hands-On: Evaluation of Peptide and Protein IDs

1. Remove all “Doubtful” proteins from your protein list by running Select tool on the Peptide Shaker Protein Report. Select only those lines that do not match the pattern “Doubtful”.

### Questions:

1. How can you exclude mycoplasma proteins?
2. How many “Confident” non-contaminant proteins were identified?

### Solution

1. Add another Select tool matching the pattern “HUMAN”.
2. You should have identified 582 human non-contaminant proteins that were validated to be “Confident”.

A premade workflow for this tutorial can be found here.

### Key points

• LC-MS/MS raw files have to be locally converted to MGF/mzML prior to further analysis on most Galaxy servers.

• SearchGUI can be used for running several peptide search engines at once.

• PeptideShaker can be used to combine and evaluate the results, and to perform protein inference.

# Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

# Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

# Citing this Tutorial

1. Florian Christoph Sigloch, Björn Grüning, 2021 Peptide and Protein ID using SearchGUI and PeptideShaker (Galaxy Training Materials). https://training.galaxyproject.org/archive/2021-09-01/topics/proteomics/tutorials/protein-id-sg-ps/tutorial.html Online; accessed TODAY
2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

### BibTeX

@misc{proteomics-protein-id-sg-ps,
author = "Florian Christoph Sigloch and Björn Grüning",
title = "Peptide and Protein ID using SearchGUI and PeptideShaker (Galaxy Training Materials)",
year = "2021",
month = "05",
day = "31",
url = "\url{https://training.galaxyproject.org/archive/2021-09-01/topics/proteomics/tutorials/protein-id-sg-ps/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
doi = {10.1016/j.cels.2018.05.012},
url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
year = 2018,
month = {jun},
publisher = {Elsevier {BV}},
volume = {6},
number = {6},
pages = {752--758.e1},
author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
title = {Community-Driven Data Analysis Training for Biology},
journal = {Cell Systems}
}