# Statistical analysis of DIA data

Overview

Questions:

- How to perform statistical analysis on DIA mass spectrometry data?
- How to detect and quantify differentially abundant proteins in a HEK-Ecoli Benchmark DIA dataset?

Objectives:

- Statistical analysis of a HEK-Ecoli Benchmark DIA dataset
- Understanding statistical approaches in proteomic analysis
- Using MSstats to find significantly dysregulated proteins in a Spike-in dataset

Requirements:

- Introduction to Galaxy Analyses
- Proteomics
  - Library Generation for DIA Analysis: tutorial hands-on
  - DIA Analysis using OpenSwathWorkflow: tutorial hands-on

Time estimation: 1 hour

Level: Intermediate

Last modification: Oct 18, 2022

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00210

# Introduction

This training covers the statistical analysis of data independent acquisition (DIA) mass spectrometry (MS) data, after successful identification and quantification of peptides and proteins. We therefore recommend first going through the DIA library generation tutorial as well as the DIA analysis tutorial, which teach the principles and characteristics of DIA data analysis.

Modern mass spectrometry approaches enable the identification and quantification of thousands of proteins and tens of thousands of peptides in single measurements. This provides immense potential for in-depth explorative analysis of a variety of biological samples. However, the number of available samples is often limited, leading to large proteomic datasets with only a few replicates or samples per condition. Thus, the statistical analysis remains challenging in such in-depth proteomic studies.

Here we will use **MSstats**, which enables the statistical analysis and processing of proteomic data (Choi *et al.* 2014).


## Get data

Hands-on: Data upload

1. Create a new history for this tutorial and give it a meaningful name
   - Click the new-history icon at the top of the history panel.
   - If the new-history icon is missing:
     - Click on the galaxy-gear icon (History options) on the top of the history panel
     - Select the option Create New from the menu
2. Import the DIA analysis results, the sample annotation and the comparison matrix from Zenodo

   `https://zenodo.org/record/4307758/files/PyProphet_export.tabular`
   `https://zenodo.org/record/4307758/files/Sample_annot_MSstats.txt`
   `https://zenodo.org/record/4307758/files/Comp_matrix_HEK_Ecoli.txt`
   `https://zenodo.org/record/4307758/files/PyProphet_msstats_input.tabular`

   - Copy the link location
   - Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
   - Select Paste/Fetch Data
   - Paste the link(s) into the text field
   - Press Start
   - Close the window
3. Once the files are green, rename the sample annotation file to ‘Sample_annot_MSstats’, the comparison matrix file to ‘Comp_matrix_HEK_Ecoli’ and the two DIA analysis results files to ‘PyProphet_export’ and ‘PyProphet_msstats_input’
   - Click on the galaxy-pencil (pencil) icon for the dataset to edit its attributes
   - In the central panel, change the Name field
   - Click the Save button
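For readers who want to inspect the same files outside Galaxy, the following is a minimal sketch using only the Python standard library. The URLs are the ones listed above. The helper names `dataset_name` and `download_all` and the target directory `dia_data` are illustrative assumptions and not part of the tutorial. The download only runs when `download_all` is called explicitly.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Zenodo links as given in the hands-on box above.
ZENODO_URLS = [
    "https://zenodo.org/record/4307758/files/PyProphet_export.tabular",
    "https://zenodo.org/record/4307758/files/Sample_annot_MSstats.txt",
    "https://zenodo.org/record/4307758/files/Comp_matrix_HEK_Ecoli.txt",
    "https://zenodo.org/record/4307758/files/PyProphet_msstats_input.tabular",
]


def dataset_name(url):
    """Return the file name without extension, matching the dataset names used in this tutorial."""
    return Path(urlparse(url).path).stem


def download_all(urls, target_dir="dia_data"):
    """Download every URL into target_dir (hypothetical location) and return the local paths."""
    target = Path(target_dir)
    target.mkdir(exist_ok=True)
    local_paths = []
    for url in urls:
        destination = target / Path(urlparse(url).path).name
        urlretrieve(url, destination)
        local_paths.append(destination)
    return local_paths


# Names the datasets should carry after renaming in the history.
DATASET_NAMES = [dataset_name(url) for url in ZENODO_URLS]
```

Calling `download_all(ZENODO_URLS)` would fetch the four files, while `DATASET_NAMES` can be used to double-check the renaming step.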

# Statistical analysis with **MSstats**

Hands-on: Performing statistical analysis using MSstats and the tabular output from PyProphet export

1. MSstats (Tool: toolshed.g2.bx.psu.edu/repos/galaxyp/msstats/msstats/3.20.1.0) with the following parameters:
   - “input source”: `OpenSWATH`
     - “OpenSWATH_input”: `PyProphet_export`
     - “OpenSWATH_annotation”: `Sample_annot_MSstats`
   - “Compare Groups”: `Yes`
     - “Comparison Matrix”: `Comp_matrix_HEK_Ecoli`

Comment: Data processing and group comparison

During the data processing step, the peptide intensities are normalized and protein inference is performed. Using a predefined comparison matrix, multiple comparisons can be performed.
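To illustrate what one row of a comparison matrix expresses, here is a minimal sketch. It assumes two conditions named `Spike_in_1` and `Spike_in_2` and uses made-up log2 group means for a single protein. MSstats fits a statistical model per protein, so this only shows the contrast arithmetic and is not the tool's implementation.

```python
# Hypothetical comparison matrix: one row per comparison, one weight per condition.
CONDITIONS = ["Spike_in_1", "Spike_in_2"]
COMPARISONS = {"Spike_in_2/Spike_in_1": [-1, 1]}


def contrast_log2fc(group_means, weights):
    """Apply contrast weights to log2 group means to obtain a log2 fold change."""
    return sum(weight * group_means[condition]
               for condition, weight in zip(CONDITIONS, weights))


# Made-up log2 mean abundances of one protein in the two conditions.
example_means = {"Spike_in_1": 20.0, "Spike_in_2": 23.0}
log2fc = contrast_log2fc(example_means, COMPARISONS["Spike_in_2/Spike_in_1"])
```

A weight of `-1` marks the denominator condition and `1` the numerator, so further rows in the matrix simply define further comparisons over the same conditions.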

Question

1. How many lines does the PyProphet_export.tabular file have? How many lines does the ProcessedData have and do you notice any differences in their structure or format?
2. How many proteins were used for the group comparison? (see ComparisonResult)

Solution

1. The PyProphet_export.tabular file has approx. 230,000 lines, whereas the ProcessedData has over 1 million lines and is in the so-called long format. Here every individual transition (single m/z value) is reported in its own row.
2. The ComparisonResult has 5022 lines, meaning over 5000 proteins were compared between the two different Spike-in conditions.
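The difference between the two layouts can be sketched as follows. The example assumes that the export stores all fragment annotations and peak areas of one peak group in semicolon-separated fields, named `aggr_Fragment_Annotation` and `aggr_Peak_Area` here. These column names, the other input columns and the single input row are assumptions for illustration and should be checked against the actual file.

```python
def to_long_format(rows):
    """Expand each peak-group row into one row per transition."""
    long_rows = []
    for row in rows:
        fragments = row["aggr_Fragment_Annotation"].split(";")
        areas = row["aggr_Peak_Area"].split(";")
        for fragment, area in zip(fragments, areas):
            long_rows.append({
                "ProteinName": row["ProteinName"],
                "PeptideSequence": row["FullPeptideName"],
                "PrecursorCharge": row["Charge"],
                "FragmentIon": fragment,
                "Run": row["filename"],
                "Intensity": float(area),
            })
    return long_rows


# One made-up peak group with three transitions.
wide_rows = [{
    "ProteinName": "PROT1_ECOLI",
    "FullPeptideName": "PEPTIDEK",
    "Charge": 2,
    "filename": "run_01.mzML",
    "aggr_Fragment_Annotation": "y4;y5;b3",
    "aggr_Peak_Area": "1200.0;850.5;430.0",
}]

long_rows = to_long_format(wide_rows)
```

Because every peak group expands into several transition rows, a long-format table has several times as many lines as the export it was derived from.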

In case the MSstats run is not yet finished, the results can be downloaded from Zenodo to be able to continue the tutorial:

- Import the files from Zenodo
  `https://zenodo.org/record/4307758/files/MSstats_ComparisonResult_export_tabular.tsv`

## Detailed investigation of Ecoli identifications and quantifications

Hands-on: Investigating Ecoli proteins in the MSstats comparison results

1. Select (Tool: Grep1) with the following parameters:
   - “Select lines from”: `MSstats_ComparisonResult_export_tabular` (output of MSstats tool)
   - “the pattern”: `(ECOLI)|(log2FC)`
2. Filter (Tool: Filter1) with the following parameters:
   - “Filter”: `Select_Ecoli` (output of Select tool)
   - “With following condition”: `c7!='NA'`
   - “Number of header lines to skip”: `1`
3. Histogram (Tool: toolshed.g2.bx.psu.edu/repos/devteam/histogram/histogram_rpy/1.0.4) with the following parameters:
   - “Dataset”: `Filter_Ecoli` (output of the previous Filter tool)
   - “Numerical column for x axis”: `Column: 3`
   - “Number of breaks (bars)”: `25`
   - “Plot title”: `Distribution of Ecoli Protein log FC values`
   - “Label for x axis”: `log2 Fold Change`

Comment: Extracting Ecoli information

First, we select only the rows containing specific terms such as “ECOLI” from the complete ComparisonResult file. The pattern also matches “log2FC” so that the header line is kept. Afterwards, the table is filtered to contain only proteins with valid statistical information (e.g. a p-value). Using the log2 fold change values of all remaining Ecoli proteins, we can observe the distribution of log2FC values from the comparison of the two Spike-in conditions.
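The Select and Filter steps can be mimicked in a few lines of Python. This is a minimal sketch on a made-up table, assuming a tab-separated ComparisonResult in which column 3 holds the log2FC and column 7 the p-value (as implied by the tool parameters above). The protein identifiers and values are invented for illustration.

```python
import re

# Same pattern as in the Select tool: keep Ecoli rows and the header (which contains "log2FC").
PATTERN = re.compile(r"(ECOLI)|(log2FC)")


def select_lines(lines, pattern=PATTERN):
    """Keep only lines matching the pattern, like the Select (grep) step."""
    return [line for line in lines if pattern.search(line)]


def filter_valid_pvalue(lines, column=7, header_lines=1):
    """Drop rows whose given column (1-based) equals 'NA', like the condition c7!='NA'."""
    header = lines[:header_lines]
    body = lines[header_lines:]
    kept = [line for line in body if line.split("\t")[column - 1] != "NA"]
    return header + kept


# Made-up comparison result with a header and three proteins.
table = [
    "\t".join(["Protein", "Label", "log2FC", "SE", "Tvalue", "DF", "pvalue", "adj.pvalue"]),
    "\t".join(["PROT1_ECOLI", "Spike_in_2/Spike_in_1", "3.1", "0.2", "15.5", "4", "0.0001", "0.001"]),
    "\t".join(["PROT2_HUMAN", "Spike_in_2/Spike_in_1", "0.05", "0.1", "0.5", "4", "0.64", "0.8"]),
    "\t".join(["PROT3_ECOLI", "Spike_in_2/Spike_in_1", "NA", "NA", "NA", "NA", "NA", "NA"]),
]

selected = select_lines(table)
filtered = filter_valid_pvalue(selected)
```

In this toy table the selection keeps the header and both Ecoli rows, and the filter then removes the Ecoli protein without a p-value.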

Question

1. How many Ecoli proteins were identified and for how many was the p-value for the comparison of the two Spike-in conditions computed?
2. What does the distribution of the log2FC values look like? Which Spike-in contained higher amounts of Ecoli and is it possible to see how much more Ecoli was spiked in?

Solution

1. In total, over 800 Ecoli proteins were identified, of which 500 have a p-value for the comparison of the two Spike-in conditions.
2. We can see a Gaussian-like distribution of the log2FC values around a positive value of 3. Since we compared Spike_in_2 / Spike_in_1, we can directly see that Spike_in_2 contained higher amounts of Ecoli. Furthermore, since the apex of the distribution is around 3 and the fold changes are on a log2 scale, we can estimate that Spike_in_2 contained approx. 8 times (2^3) more Ecoli than Spike_in_1.
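The estimate can be reproduced with a small sketch: bin the log2FC values, take the centre of the fullest bin as the apex, and convert it back to a linear fold change. The function names and the example values are assumptions for illustration, and the binning is a simple equal-width scheme that may differ in detail from the Histogram tool.

```python
def histogram_apex(values, n_bins=25):
    """Return the centre of the most populated equal-width bin."""
    low, high = min(values), max(values)
    if high == low:
        return low  # all values identical, no binning needed
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value falls into the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    fullest = counts.index(max(counts))
    return low + (fullest + 0.5) * width


def fold_change(log2fc):
    """Convert a log2 fold change into a linear fold change."""
    return 2 ** log2fc


# Made-up log2FC values clustering around 3.
example_log2fc = [0.0, 2.9, 3.0, 3.0, 3.1, 3.1, 3.05, 5.0]
apex = histogram_apex(example_log2fc, n_bins=5)
ratio_at_three = fold_change(3)
```

A log2FC of 3 corresponds to a ratio of 8, which is where the estimate of roughly 8 times more Ecoli in Spike_in_2 comes from.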

# Statistical analysis with **MSstats**

Hands-on: Performing statistical analysis using MSstats and the msstats_input from PyProphet export

1. MSstats (Tool: toolshed.g2.bx.psu.edu/repos/galaxyp/msstats/msstats/3.20.1.0) with the following parameters:
   - “input source”: `MStats 10 column format`
     - “MSstats 10-column input”: `PyProphet_msstats_input`
   - “Compare Groups”: `Yes`

Comment: MSstats input format

For the statistical analysis using MSstats, the input must be in the long format, containing all relevant information in 10 predefined columns. The conversion of the PyProphet export output can either be done using MSstats (as we did above), or during the PyProphet export step by using another R package called swath2stats (Blattmann *et al.* 2016). Prior to the conversion, the data can be processed and filtered using the swath2stats functionalities.
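A quick way to check whether a table has the expected layout is to compare its header against the ten column names. The sketch below assumes the column names commonly documented for the MSstats 10-column format and a tab-separated file. Verify both against the MSstats documentation and the actual `PyProphet_msstats_input` file, since the helper `missing_columns` is only an illustration.

```python
# Column names assumed for the MSstats 10-column long format.
MSSTATS_COLUMNS = [
    "ProteinName",
    "PeptideSequence",
    "PrecursorCharge",
    "FragmentIon",
    "ProductCharge",
    "IsotopeLabelType",
    "Condition",
    "BioReplicate",
    "Run",
    "Intensity",
]


def missing_columns(header_line, sep="\t"):
    """Return the expected columns that are absent from a header line."""
    present = set(header_line.rstrip("\n").split(sep))
    return [column for column in MSSTATS_COLUMNS if column not in present]


# Made-up header that lacks the ProductCharge column.
example_header = "\t".join(c for c in MSSTATS_COLUMNS if c != "ProductCharge")
example_missing = missing_columns(example_header)
```

An empty list means all ten columns are present. The check ignores column order on purpose.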

Question

1. How many lines does the `PyProphet_msstats_input.tabular` file have? How many lines does the ProcessedData have and do you notice any differences in their structure or format?
2. How many proteins were used for the group comparison? And do you already see a difference to the first MSstats step?

Solution

1. The `PyProphet_msstats_input.tabular` file has over 870,000 lines and the ProcessedData has over 1 million lines. Here both files are in the long format, in which every transition is reported in its own row.
2. Here the ComparisonResult has only 3871 lines, meaning that almost 1200 fewer proteins were used in the comparison of the two Spike-in conditions.
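With the line counts reported above (5022 versus 3871), the difference is 1151 proteins. To see which proteins dropped out, the first columns of the two ComparisonResult tables can be compared as sets. The sketch below uses two made-up tables and assumes the protein identifier is in the first tab-separated column, with the first line being a header.

```python
def proteins(lines):
    """Collect protein identifiers from the first column, skipping the header line."""
    return {line.split("\t")[0] for line in lines[1:] if line.strip()}


# Made-up results of the first (OpenSWATH input) and second (10-column input) MSstats run.
first_result = [
    "Protein\tLabel\tlog2FC",
    "PROT1_ECOLI\tSpike_in_2/Spike_in_1\t3.1",
    "PROT2_HUMAN\tSpike_in_2/Spike_in_1\t0.05",
    "PROT3_ECOLI\tSpike_in_2/Spike_in_1\t2.8",
]
second_result = [
    "Protein\tLabel\tlog2FC",
    "PROT1_ECOLI\tSpike_in_2/Spike_in_1\t2.9",
    "PROT2_HUMAN\tSpike_in_2/Spike_in_1\t0.02",
]

only_in_first = proteins(first_result) - proteins(second_result)
shared = proteins(first_result) & proteins(second_result)
```

Listing `only_in_first` for the real tables would show which proteins were removed by the processing applied before the second MSstats run.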

In case the MSstats run is not yet finished, the results can be downloaded from Zenodo to be able to continue the tutorial:

- Import the files from Zenodo
  `https://zenodo.org/record/4307758/files/MSstats_ComparisonResult_msstats_input.tsv`

## Detailed investigation of Ecoli identifications and quantifications

Hands-on: Investigating Ecoli proteins in the MSstats comparison results

1. Select (Tool: Grep1) with the following parameters:
   - “Select lines from”: `MSstats_ComparisonResult_msstats_input` (output of the second MSstats tool)
   - “the pattern”: `(ECOLI)|(log2FC)`
2. Filter (Tool: Filter1) with the following parameters:
   - “Filter”: `Select_Ecoli` (output of the second Select tool)
   - “With following condition”: `c7!='NA'`
   - “Number of header lines to skip”: `1`
3. Histogram (Tool: toolshed.g2.bx.psu.edu/repos/devteam/histogram/histogram_rpy/1.0.4) with the following parameters:
   - “Dataset”: `Filter_Ecoli` (output of the previous Filter tool)
   - “Numerical column for x axis”: `Column: 3`
   - “Number of breaks (bars)”: `25`
   - “Plot title”: `Distribution of Ecoli Protein log FC values`
   - “Label for x axis”: `log2 Fold Change`

Question

1. How many Ecoli proteins were identified and for how many was the p-value for the comparison of the two Spike-in conditions computed? Are there any differences compared to the selected and filtered results from the previous MSstats step?
2. What does the distribution of the log2FC values look like? Are there any differences compared to the selected and filtered results from the previous MSstats step?

Solution

1. In total, over 600 Ecoli proteins were identified, of which 500 have a p-value for the comparison of the two Spike-in conditions. Here we identify about 200 fewer Ecoli proteins than before; however, the number of proteins for which a p-value was calculated differs only slightly.
2. Generally, the two log2FC distributions look very similar, showing a Gaussian-like distribution of the log2FC values around a positive value of 3. There seems to be a slight difference in the apex of the distribution: in the first MSstats analysis it seems to be higher than 3, whereas in the second MSstats analysis it seems to be lower than 3.

# Conclusion

Using **MSstats** we were able to identify and quantify differentially regulated proteins between two Spike-in conditions in a HEK/Ecoli Benchmark DIA dataset. Furthermore, we saw that the preprocessing of the proteomic data prior to the statistical analysis can directly impact the results: the two input routes led to different numbers of compared proteins and slightly different log2FC distributions. Thus, it might be beneficial to try various ways of intermediate data processing and statistical analysis to increase the sensitivity and specificity of the investigation.