Clinical-MP-1-Database-Generation

Author(s)	Subina Mehta, Katherine Do, Dechen Bhuming
Editor(s)	Pratik Jagtap, Timothy J. Griffin

Overview
Questions:

Why do we need to generate a customized database for metaproteomics research?

How do we reduce the size of the database?

Objectives:

Downloading databases related to 16S rRNA data

For better identification results, combine host and microbial proteins.

A reduced database provides better false discovery rate (FDR) statistics.

Requirements:

Introduction to Galaxy Analyses

Proteomics

Time estimation: 3 hours

Supporting Materials:

Datasets

FAQs

Available on these Galaxies

Possibly Working

UseGalaxy.eu

UseGalaxy.org

UseGalaxy.org.au

UseGalaxy.fr

Containers

Docker image

Published: Feb 6, 2024

Last modification: Feb 12, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00413

Revision: 2

Metaproteomics is the large-scale characterization of the entire complement of proteins expressed by microbiota. However, metaproteomics analysis of clinical samples is challenged by the presence of abundant human (host) proteins, which hampers the confident detection of lower-abundance microbial proteins (Batut et al. 2018; Jagtap et al. 2015).

To address this, we used tandem mass spectrometry (MS/MS) and bioinformatics tools on the Galaxy platform to develop a metaproteomics workflow to characterize the metaproteomes of clinical samples. This clinical metaproteomics workflow holds potential for general clinical applications, such as studying secondary infections during COVID-19 and microbiome changes during cystic fibrosis, as well as for broad research questions regarding host-microbe interactions.

The first workflow for clinical metaproteomics data analysis is the database generation workflow. The Galaxy-P team has developed a workflow in which a large comprehensive database is generated by downloading protein sequences of known disease-causing microorganisms, and a compact database is then generated from it using the MetaNovo tool.

Agenda

In this tutorial, we will cover:

Data Upload

Get data

Import Workflow

Step-by-step analysis

Download Protein Sequences using taxon names

Download Species Protein Sequences using UniProt XML downloader with UniProt

Merging databases to obtain a large comprehensive database for MetaNovo

Reducing Database size

Metanovo tool generates a compact database from your comprehensive database with MetaNovo

Merging databases to obtain reduced MetaNovo database for peptide discovery with FASTA Merge Files and Filter Unique Sequences

Conclusion

Data Upload

Get data

Hands-on: Data Upload
Create a new history for this tutorial
Import the files from Zenodo or from the shared data library (GTN - Material -> proteomics -> Clinical-MP-1-Database-Generation):
https://zenodo.org/records/10105821/files/HUMAN_SwissProt_Protein_Database.fasta
https://zenodo.org/records/10105821/files/Species_UniProt_FASTA.fasta
https://zenodo.org/records/10105821/files/Contaminants_(cRAP)_Protein_Database.fasta
https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F10_9Aug19_Rage_Rep-19-06-08.mgf
https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F11_9Aug19_Rage_Rep-19-06-08.mgf
https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F13_9Aug19_Rage_Rep-19-06-08.mgf
https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F15_9Aug19_Rage_Rep-19-06-08.mgf
Copy the link location

Click Upload Data at the top of the tool panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Shared data (top panel) then Data libraries

Navigate to the correct folder as indicated by your instructor.

On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

Check that the datatype of each dataset is correct (fasta for the protein databases, mgf for the spectra files)

Click on the pencil icon for the dataset to edit its attributes

In the central panel, click the Datatypes tab on the top

In Assign Datatype, select the datatype from the “New type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Optional: add to each database a tag corresponding to the file name.

Click on the dataset to expand it

Click on Add Tags

Add a tag starting with #

Tags starting with # will be automatically propagated to the outputs of tools using this dataset.

Press Enter

Check that the tag appears below the dataset name

Create a dataset collection of the 4 MGF datasets.
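The download links listed in this hands-on all belong to one Zenodo record, so the list to paste into Paste/Fetch Data can also be assembled programmatically. The following is a minimal sketch, not part of the original tutorial; the helper names `zenodo_urls` and `split_by_type` are hypothetical, and only the record ID and file names are taken from the links above.

```python
from urllib.parse import quote

# Record ID and file names copied from the Zenodo links in this tutorial.
ZENODO_RECORD = "10105821"
FILENAMES = [
    "HUMAN_SwissProt_Protein_Database.fasta",
    "Species_UniProt_FASTA.fasta",
    "Contaminants_(cRAP)_Protein_Database.fasta",
    "PTRC_Skubitz_Plex2_F10_9Aug19_Rage_Rep-19-06-08.mgf",
    "PTRC_Skubitz_Plex2_F11_9Aug19_Rage_Rep-19-06-08.mgf",
    "PTRC_Skubitz_Plex2_F13_9Aug19_Rage_Rep-19-06-08.mgf",
    "PTRC_Skubitz_Plex2_F15_9Aug19_Rage_Rep-19-06-08.mgf",
]


def zenodo_urls(record_id, filenames):
    """Build one download URL per file name (hypothetical helper)."""
    base = f"https://zenodo.org/records/{record_id}/files/"
    # Parentheses are left unescaped so the URLs match those listed above.
    return [base + quote(name, safe="()") for name in filenames]


def split_by_type(urls):
    """Separate the FASTA databases from the MGF spectra files."""
    fasta = [url for url in urls if url.endswith(".fasta")]
    mgf = [url for url in urls if url.endswith(".mgf")]
    return fasta, mgf


urls = zenodo_urls(ZENODO_RECORD, FILENAMES)
fasta_urls, mgf_urls = split_by_type(urls)
# One URL per line is the form expected by the Paste/Fetch Data text field.
print("\n".join(urls))
```

Splitting the list by extension mirrors the tutorial's structure: the three FASTA files are used as databases, while the four MGF files go into the dataset collection.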

Import Workflow

Hands-on: Running the Workflow

Import the workflow into Galaxy:

Hands-on: Importing and launching a GTN workflow

Launch Pretreatments (View on GitHub, Download workflow) workflow.

Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.

Click on Import at the top-right of the screen

Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/clinical-mp-database-generation/workflows/main_workflow.ga

Click the Import workflow button


Run the workflow using the following parameters:

“Send results to a new history”: No

“Input Dataset collection”: MGF dataset collection

“Species_tabular”: Species_tabular.tabular

Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.

Click on the Run workflow button next to your workflow

Configure the workflow as needed

Click the Run Workflow button at the top-right of the screen

You may have to refresh your history to see the queued jobs
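Before running an imported workflow, it can be useful to see which tools it will call. Galaxy `.ga` workflow files are JSON documents with a `steps` mapping, so a short script can list them. This is a minimal sketch under that assumption; the function names are hypothetical, and the example dictionary is made up for illustration rather than copied from the tutorial's workflow file.

```python
import json


def load_workflow(path):
    """Read a downloaded .ga workflow file (JSON) into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def list_workflow_tools(workflow):
    """Return (step_index, step_name, tool_id) for every tool step.

    Input steps have no tool_id and are skipped.
    """
    rows = []
    for index, step in workflow.get("steps", {}).items():
        tool_id = step.get("tool_id")
        if tool_id:
            rows.append((int(index), step.get("name", ""), tool_id))
    return sorted(rows)


# Made-up miniature workflow used only to demonstrate the function.
example_workflow = {
    "name": "Illustrative workflow",
    "steps": {
        "0": {"name": "Input dataset collection", "tool_id": None},
        "1": {"name": "MetaNovo", "tool_id": "hypothetical/metanovo"},
    },
}

for index, name, tool_id in list_workflow_tools(example_workflow):
    print(index, name, tool_id)
```

For a real file, `list_workflow_tools(load_workflow("main_workflow.ga"))` would apply the same listing to the workflow downloaded from the URL given in the import step.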


Step-by-step analysis

Download Protein Sequences using taxon names

First, we want to generate a large comprehensive protein sequence database using the UniProt XML Downloader to extract sequences for species of interest. To do so, you will need a tabular file that contains a list of species.

For this tutorial, a literature survey was conducted to obtain 118 taxonomic species of organisms that are commonly associated with the female reproductive tract (Afiuni-Zadeh et al. 2018). This species list was used to generate a protein sequence FASTA database with the UniProt XML Downloader tool within the Galaxy framework. In this tutorial, the Species FASTA database (~3.38 million sequences) has already been provided as input. However, if you have your own list of species of interest as a tabular file (Your_Species_tabular.tabular), the steps to generate a FASTA file from it are included below:

Download Species Protein Sequences using UniProt XML downloader with UniProt

Hands-on: UniProt XML downloader

UniProt (Galaxy version 2.3.0) with the following parameters:

“Select”: Your_Species_tabular.tabular

“Dataset (tab separated) with Taxon ID/Name column”: output (Input dataset)

“Column with Taxon ID/name”: c1

“UniProt output format”: fasta

Rename the output as Species_UniProt_FASTA.fasta

Comment: UniProt description

This tool downloads protein sequences in FASTA format from UniProt for the taxon names (or IDs) listed in the input file.
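The tool above reads taxon names from column 1 (c1) of a tab-separated file. If your species list currently lives in a script or spreadsheet export, the sketch below shows one way to write such a file. It is an illustration only: `write_species_tabular` is a hypothetical helper, and the two species names are placeholders, not entries quoted from the 118-species list used in this tutorial.

```python
import csv


def write_species_tabular(names, path):
    """Write one taxon name per line into column 1 of a tab-separated file.

    Whitespace is normalised and case-insensitive duplicates are dropped.
    Returns the list of names actually written.
    """
    seen = set()
    unique = []
    for name in names:
        cleaned = " ".join(name.split())
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for name in unique:
            writer.writerow([name])
    return unique


if __name__ == "__main__":
    # Placeholder names for illustration; replace with your own species list.
    example_names = ["Lactobacillus crispatus", "Gardnerella vaginalis"]
    write_species_tabular(example_names, "Your_Species_tabular.tabular")
```

The resulting file can be uploaded to Galaxy with the tabular datatype and selected in the “Dataset (tab separated) with Taxon ID/Name column” parameter.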


Question

1. Can we use a higher taxonomy clade than species for the UniProt XML downloader?

2. Why are we using the tools separately? Can we run it all together?

3. Can we select multiple files together?

4. How many FASTA files can be merged at once, i.e. is there a limit on the number/size of files?

1. Yes, the UniProt XML downloader can also be used for generating a database from Genus, Family, Order, or any other higher taxonomy clade.

2. The tools are run separately to reduce the load on the server and tool. If you have a limited number of taxon names, then you can run it all together.

3. Yes, that certainly can be done. We used one input file at a time to maintain the order of sequences in the database.

4. There is no limit on the number or size of files that can be merged.

Merging databases to obtain a large comprehensive database for MetaNovo

Once generated, the Species UniProt database (~3.38 million sequences) will be merged with the Human SwissProt database (reviewed only; ~20.4K sequences) and the contaminant (cRAP) sequences database (116 sequences), then filtered to generate the large comprehensive database (~2.59 million sequences). The large comprehensive database will then be used as input to MetaNovo, which generates a much more manageable compact database.

Hands-on: Download contaminants with Protein Database Downloader

Protein Database Downloader (Galaxy version 0.3.4) with the following parameters:

“Download from?”: cRAP (contaminants)

Rename as “Protein Database Contaminants (cRAP)”


Hands-on: Human SwissProt (reviewed) database

Protein Database Downloader (Galaxy version 0.3.4) with the following parameters:

“Download from?”: UniProtKB(reviewed only)

In “Taxonomy”: Homo sapiens (Human)

In “reviewed”: UniProtKB/Swiss-Prot (reviewed only)

In “Proteome Set”: Reference Proteome Set

In “Include isoform data”: False

Rename as “Protein Database Human SwissProt”.


Question

How often is the Protein Database Downloader updated?

It is updated every 3 months.

Hands-on: FASTA Merge Files and Filter Unique Sequences

FASTA Merge Files and Filter Unique Sequences (Galaxy version 1.2.0) with the following parameters:

“Run in batch mode?”: Merge individual FASTAs (output collection if input is collection)

In “Input FASTA File(s)”:

“Insert Input FASTA File(s)”

“FASTA File”: Species_UniProt_FASTA (output of UniProt XML downloader tool)

“FASTA File”: Protein Database Human SwissProt (output of Protein Database Downloader tool)

“FASTA File”: Protein Database Contaminants (cRAP) (output of Protein Database Downloader tool)

Rename the output as “Human UniProt Microbial Proteins cRAP for MetaNovo”.
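To make the merge-and-filter step concrete, here is a minimal sketch of the general idea: concatenate several FASTA inputs in order and keep only the first record for each distinct sequence. It is not the Galaxy tool's implementation. Judging uniqueness by sequence alone is an assumption made for this example, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*fasta_sources):
    """Merge FASTA inputs in the given order, keeping the first copy of each sequence."""
    seen = set()
    merged = []
    for source in fasta_sources:
        for header, sequence in read_fasta(source):
            key = sequence.upper()  # assumption: uniqueness is judged on sequence only
            if key not in seen:
                seen.add(key)
                merged.append((header, sequence))
    return merged


# Toy inputs standing in for the species, human and contaminant databases.
species = [">sp1 example", "MKT", "AAA", ">sp2 example", "MKV"]
human = [">hs1 example", "MKTAAA", ">hs2 example", "MPL"]
for header, sequence in merge_unique(species, human):
    print(f">{header}\n{sequence}")
```

Because the first occurrence wins, the order in which the inputs are supplied determines which header survives for a shared sequence. This is one reason to keep the input order fixed, as the tutorial does.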


Reducing Database size

Metanovo tool generates a compact database from your comprehensive database with MetaNovo

Next, the large comprehensive database of ~2.59 million sequences can be reduced using the MetaNovo tool to generate a more manageable database that contains identified proteins.

The compact MetaNovo-generated database (~1.9K sequences) will be merged with Human SwissProt (reviewed only) and contaminants (cRAP) databases to generate the reduced database (~21.2k protein sequences) that will be used for peptide identification (see Discovery Module tutorial).
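The sequence counts quoted in this tutorial can be put side by side with a little arithmetic. The sketch below uses only the approximate figures given in the text, so the results are rough estimates and not exact counts.

```python
# Approximate sequence counts as quoted in this tutorial.
SPECIES = 3_380_000        # Species UniProt database
HUMAN = 20_400             # Human SwissProt (reviewed only)
CRAP = 116                 # contaminants (cRAP)
COMPREHENSIVE = 2_590_000  # large comprehensive database after filtering
COMPACT = 1_900            # MetaNovo compact database
REDUCED = 21_200           # final reduced database

# First merge: total input versus what remains after filtering.
first_merge_input = SPECIES + HUMAN + CRAP
removed_by_first_filter = first_merge_input - COMPREHENSIVE

# Second merge: compact database plus human and contaminant sequences.
second_merge_input = COMPACT + HUMAN + CRAP
removed_by_second_filter = second_merge_input - REDUCED

# Overall shrinkage from the comprehensive to the reduced database.
fold_reduction = COMPREHENSIVE / REDUCED

print(f"first merge input: {first_merge_input:,}")
print(f"removed by first filter: {removed_by_first_filter:,}")
print(f"second merge input: {second_merge_input:,}")
print(f"removed by second filter: {removed_by_second_filter:,}")
print(f"fold reduction: {fold_reduction:.0f}x")
```

With these rounded figures, the search database shrinks by roughly two orders of magnitude. The second merge input is slightly larger than the reported reduced database, which would be consistent with some sequences being shared between the inputs and removed by the uniqueness filter.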

Hands-on: MetaNovo

MetaNovo (Galaxy version 1.9.4+galaxy4) with the following parameters:

“MGF Input Type”: Collection

“MGF Collection”: output (Input dataset collection)

“FASTA File”: output (output of FASTA Merge Files and Filter Unique Sequences tool)

In “Spectrum Matching Parameters”:

“Fragment ion mass tolerance”: 0.01

“Enzyme”: Trypsin (no P rule)

“Fixed modifications as comma separated list”: Carbamidomethylation of C, TMT 10-plex of K, TMT 10-plex of peptide N-term

“Variable modifications as comma separated list”: Oxidation of M

“Maximal charge to search for”: 5

In “Import Filters”:

“The maximal peptide length to consider when importing identification files”: 50

Rename as “MetaNovo Compact Database”.
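The modification parameters above are described as comma separated lists, and the fixed modifications consist of three separate entries. As a small illustration, the sketch below keeps the settings from this hands-on in a dictionary and renders the list-valued ones as single strings. The dictionary keys and the `as_comma_list` helper are hypothetical, and the exact separator the tool expects (with or without a space after the comma) is an assumption.

```python
# Settings copied from the MetaNovo hands-on; key names are illustrative only.
METANOVO_SETTINGS = {
    "fragment_ion_mass_tolerance": 0.01,
    "enzyme": "Trypsin (no P rule)",
    "fixed_modifications": [
        "Carbamidomethylation of C",
        "TMT 10-plex of K",
        "TMT 10-plex of peptide N-term",
    ],
    "variable_modifications": ["Oxidation of M"],
    "max_charge": 5,
    "max_peptide_length": 50,
}


def as_comma_list(modifications):
    """Join modification names into one comma separated string."""
    cleaned = [mod.strip() for mod in modifications if mod.strip()]
    return ", ".join(cleaned)


fixed = as_comma_list(METANOVO_SETTINGS["fixed_modifications"])
variable = as_comma_list(METANOVO_SETTINGS["variable_modifications"])
print("Fixed modifications:", fixed)
print("Variable modifications:", variable)
```

Keeping the modifications as a list makes it harder to lose a separator by accident, which matters because a merged entry would no longer name a single modification.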


Question

1. Why are we reducing the size of the database?

2. Why is this running TMT10 plex modification when the data is 11-plex?

3. Regarding MetaNovo Spectrum Matching parameters, what are the most “important” parameters? That is, if a user wants to reduce or increase the sensitivity/number of output sequences, what should they change?

1. Reducing the size of the database improves search speed, FDR, and sensitivity.

2. There is no option for 11-plex modifications in MetaNovo, hence we use the TMT 10-plex.

3. The most important parameters are the tolerance (MS1 and MS2) and any modifications introduced during the processing of the data.
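The tutorial does not describe how the FDR is estimated. One widely used approach in proteomics is target-decoy searching, assumed here purely for illustration. In that setting, the FDR at a score threshold is commonly approximated as the number of decoy matches divided by the number of target matches passing the threshold. The sketch below uses made-up scores, and the function name is hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys / targets among matches scoring at or above the threshold.

    `psms` is an iterable of (score, is_decoy) pairs.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        return 0.0
    return decoys / targets


# Made-up peptide-spectrum matches: (score, is_decoy).
example_psms = [
    (0.9, False),
    (0.8, False),
    (0.7, True),
    (0.6, False),
    (0.5, True),
]

for threshold in (0.8, 0.6):
    print(threshold, round(fdr_at_threshold(example_psms, threshold), 3))
```

Under this model, a very large database tends to produce more high-scoring random matches, so a stricter threshold is needed to reach the same FDR and fewer true matches pass it. This is one way to read the statement that a smaller database helps both FDR and sensitivity.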

Merging databases to obtain reduced MetaNovo database for peptide discovery with FASTA Merge Files and Filter Unique Sequences

Hands-on: FASTA Merge Files and Filter Unique Sequences

FASTA Merge Files and Filter Unique Sequences (Galaxy version 1.2.0) with the following parameters:

“Run in batch mode?”: Merge individual FASTAs (output collection if input is collection)

In “Input FASTA File(s)”:

“Insert Input FASTA File(s)”

“FASTA File”: MetaNovo Compact Database (output of MetaNovo tool)

“FASTA File”: Protein Database Human SwissProt (output of Protein Database Downloader tool)

“FASTA File”: Protein Database Contaminants (cRAP) (output of Protein Database Downloader tool)
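After this merge, a quick sanity check is to count the records in the downloaded FASTA file and compare the result with the ~21.2K sequences expected for the reduced database. The sketch below counts header lines. The function names are hypothetical, and the file name in the usage note is a placeholder.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path):
    """Count the records in a FASTA file on disk."""
    with open(path, encoding="utf-8") as handle:
        return count_fasta_records(handle)


# Tiny in-memory example with two records.
example_lines = [">protein_a", "MKTAAA", ">protein_b", "MPLGG", ""]
print(count_fasta_records(example_lines))
```

For a file downloaded from the Galaxy history, `count_fasta_file("reduced_database.fasta")` would give the number to compare against the figure quoted above. A large discrepancy would suggest that an input was missed or selected twice.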


Conclusion

The first step of a clinical metaproteomics study is database generation. As we did not have a reference database or information from 16S rRNA sequencing data, we generated a FASTA database from a literature survey. However, if 16S rRNA data is available, the taxa identified can be used to generate a customized database. As the comprehensive database is generally too large, we used the MetaNovo tool to reduce its size. This reduced database will then be used in the clinical metaproteomics discovery workflow.

You've Finished the Tutorial

Key points

Create a customized proteomics database from 16S rRNA results.

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Proteomics topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Jagtap, P. D., A. Blakely, K. Murray, S. Stewart, J. Kooren et al., 2015 Metaproteomic analysis using the Galaxy framework. PROTEOMICS 15: 3553–3565. 10.1002/pmic.201500074
Afiuni-Zadeh, S., K. L. M. Boylan, P. D. Jagtap, T. J. Griffin, J. D. Rudney et al., 2018 Evaluating the potential of residual Pap test fluid as a resource for the metaproteomic analysis of the cervical-vaginal microbiome. Scientific Reports 8: 10.1038/s41598-018-29092-4
Batut, B., S. Hiltemann, A. Bagnacani, D. Baker, V. Bhardwaj et al., 2018 Community-Driven Data Analysis Training for Biology. Cell Systems 6: 752–758.e1. 10.1016/j.cels.2018.05.012

Citing this Tutorial

Subina Mehta, Katherine Do, Dechen Bhuming, Clinical-MP-1-Database-Generation (Galaxy Training Materials). https://training.galaxyproject.org/archive/2024-07-01/topics/proteomics/tutorials/clinical-mp-1-database-generation/tutorial.html Online; accessed Tue Apr 01 2025
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{proteomics-clinical-mp-1-database-generation,
author = "Subina Mehta and Katherine Do and Dechen Bhuming",
	title = "Clinical-MP-1-Database-Generation (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/archive/2024-07-01/topics/proteomics/tutorials/clinical-mp-1-database-generation/tutorial.html}",
	note = "[Online; accessed Tue Apr 01 2025]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and others},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Computational Biology}
}

Congratulations on successfully completing this tutorial!

Go Further
Do you want to extend your knowledge? Follow one of our recommended follow-up trainings:

tutorial Hands-on: Clinical-MP-2-Discovery

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.
shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/archive/2024-07-01/api/topics/proteomics/tutorials/clinical-mp-1-database-generation/tutorial.json | jq .admin_install_yaml -r)
Alternatively you can copy and paste the following YAML
--- {}