Metaproteomics is the large-scale characterization of the entire complement of proteins expressed by microbiota. However, metaproteomic analysis of clinical samples is challenged by the presence of abundant human (host) proteins, which hampers the confident detection of lower-abundance microbial proteins (Batut et al. 2018; Jagtap et al. 2015).
To address this, we used tandem mass spectrometry (MS/MS) and bioinformatics tools on the Galaxy platform to develop a metaproteomics workflow that characterizes the metaproteomes of clinical samples. This clinical metaproteomics workflow could be applied to clinical questions such as secondary infections during COVID-19 and microbiome changes in cystic fibrosis, as well as to broader research questions about host-microbe interactions.
The first workflow in the clinical metaproteomics data analysis is the database generation workflow. The Galaxy-P team has developed a workflow in which a large database is generated by downloading protein sequences of known disease-causing microorganisms, and a compact database is then derived from this comprehensive database using the MetaNovo tool.
Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
Click on Import at the top-right of the screen
Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/clinical-mp-database-generation/workflows/main_workflow.ga
Click the Import workflow button
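Before importing, it can be useful to see which tools a workflow file contains. The following is a minimal sketch, assuming the downloaded .ga file is JSON with a top-level "steps" object whose entries carry "name" and "tool_id" fields. The helper name list_workflow_steps and the example content are hypothetical and are not taken from the tutorial's workflow.

```python
import json


def list_workflow_steps(workflow):
    """Return (step number, step name, tool id) tuples in step order."""
    steps = workflow.get("steps", {})
    rows = []
    # Step keys are strings such as "0", "1", ...; sort them numerically.
    for key in sorted(steps, key=int):
        step = steps[key]
        # Input steps carry no tool id, so fall back to a placeholder.
        tool_id = step.get("tool_id") or "(no tool)"
        rows.append((int(key), step.get("name", ""), tool_id))
    return rows


# Hypothetical, heavily trimmed stand-in for a downloaded .ga file.
example_text = """
{
  "a_galaxy_workflow": "true",
  "name": "Example workflow",
  "steps": {
    "1": {"name": "Example tool", "tool_id": "example_tool_id"},
    "0": {"name": "Input dataset", "tool_id": null}
  }
}
"""

example = json.loads(example_text)
for number, name, tool_id in list_workflow_steps(example):
    print(number, name, tool_id)
```

For a real file, the same function could be applied to the result of json.load on the opened .ga file.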
First, we want to generate a large comprehensive protein sequence database using the UniProt XML Downloader to extract sequences for species of interest. To do so, you will need a tabular file that contains a list of species.
For this tutorial, a literature survey was conducted to obtain 118 taxonomic species of organisms that are commonly associated with the female reproductive tract (Afiuni-Zadeh et al. 2018). This species list was used to generate a protein sequence FASTA database with the UniProt XML Downloader tool within the Galaxy framework. In this tutorial, the Species FASTA database (~3.38 million sequences) has already been provided as input. However, if you have your own list of species of interest as a tabular file (Your_Species_tabular.tabular), the steps to generate a FASTA file from it are included below:
Download Species Protein Sequences using UniProt XML downloader with UniProt
Hands-on: UniProt XML downloader
UniProt (Galaxy version 2.3.0) with the following parameters:
“Select”: Your_Species_tabular.tabular
“Dataset (tab separated) with Taxon ID/Name column”: output (Input dataset)
“Column with Taxon ID/name”: c1
“UniProt output format”: fasta
Rename the output as Species_UniProt_FASTA.fasta
Comment: UniProt description
This tool downloads protein FASTA sequences from UniProt for the taxon names supplied as input.
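If you are preparing your own species list, the tabular input can be written with a few lines of Python. This is a minimal sketch, assuming one taxon name per line in the first column (matching the c1 setting above). The helper name write_species_tabular is hypothetical, and the species shown are illustrative placeholders that are not necessarily part of the tutorial's 118-species list.

```python
import csv
import os
import tempfile


def write_species_tabular(species, path):
    """Write one taxon name per row (column 1), skipping blanks and repeats."""
    seen = set()
    cleaned = []
    for name in species:
        name = name.strip()
        # Compare case-insensitively so "lactobacillus ..." is not listed twice.
        if name and name.lower() not in seen:
            seen.add(name.lower())
            cleaned.append(name)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for name in cleaned:
            writer.writerow([name])
    return cleaned


# Illustrative input with a stray blank, extra spaces, and a duplicate.
species = [
    "Lactobacillus crispatus",
    " Gardnerella vaginalis ",
    "",
    "lactobacillus crispatus",
]
out_path = os.path.join(tempfile.mkdtemp(), "Your_Species_tabular.tabular")
written = write_species_tabular(species, out_path)
print(written)
```

The resulting file can then be uploaded to Galaxy and selected as the tab-separated dataset for the downloader.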
Can we use a higher taxonomy clade than species for the UniProt XML downloader?
Why are we using the tools separately? Can we run it all together?
Can we select multiple files together?
How many FASTA files can be merged at once, i.e. is there a limit on the number/size of files?
Yes, the UniProt XML downloader can also be used for generating a database from Genus, Family, Order, or any other higher taxonomy clade.
The tools are run separately to reduce the load on the server and tool. If you have a limited number of taxon names, then you can run it all together.
Yes, that certainly can be done. We used one input file at a time to maintain the order of sequences in the database.
There is no limit.
Merging databases to obtain a large comprehensive database for MetaNovo
Once generated, the Species UniProt database (~3.38 million sequences) will be merged with the Human SwissProt database (reviewed only; ~20.4K sequences) and contaminant (cRAP) sequences database (116 sequences) and filtered to generate the large comprehensive database (~2.59 million sequences). The large comprehensive database will be used to generate a compact database using MetaNovo, which is much more manageable.
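The merge-and-filter step described above can be illustrated in plain Python. This is a minimal sketch rather than the Galaxy tool's implementation: it assumes that uniqueness is judged on the sequence alone and that the first header seen for a sequence is kept, which may differ from the tool's settings. The function names and toy records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_unique(*databases):
    """Concatenate databases in order, keeping the first record per sequence."""
    seen = set()
    merged = []
    for records in databases:
        for header, sequence in records:
            if sequence not in seen:
                seen.add(sequence)
                merged.append((header, sequence))
    return merged


# Toy records standing in for the species and human databases.
species_db = parse_fasta(">A1 protein one\nMKT\nAAL\n>A2 protein two\nGGS\n")
human_db = parse_fasta(">H1 human\nMKTAAL\n>H2 human\nPPQ\n")
merged = merge_unique(species_db, human_db)
for header, sequence in merged:
    print(f">{header}\n{sequence}")
```

Because duplicates are dropped, the merged database can be smaller than the sum of its inputs, which is consistent with the comprehensive database (~2.59 million sequences) being smaller than the Species UniProt database alone.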
Hands-on: Download contaminants with Protein Database Downloader
Protein Database Downloader (Galaxy version 0.3.4) with the following parameters:
Generate a compact database from your comprehensive database with MetaNovo
Next, the large comprehensive database of ~2.59 million sequences can be reduced using the MetaNovo tool to generate a more manageable database that contains identified proteins.
The compact MetaNovo-generated database (~1.9K sequences) will be merged with Human SwissProt (reviewed only) and contaminants (cRAP) databases to generate the reduced database (~21.2k protein sequences) that will be used for peptide identification (see Discovery Module tutorial).
Hands-on: MetaNovo
MetaNovo (Galaxy version 1.9.4+galaxy4) with the following parameters:
Why is this running TMT10 plex modification when the data is 11-plex?
Regarding the MetaNovo Spectrum Matching parameters, which are the most “important”? That is, if a user wants to reduce or increase the sensitivity or the number of output sequences, what should they change?
Reducing the size of the database improves search speed, FDR, and sensitivity.
There is no option for 11-plex modifications in MetaNovo, hence we use the TMT 10-plex modification.
The most important parameters are the tolerance (MS1 and MS2) and any modifications introduced during the processing of the data.
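To make the effect of a mass tolerance concrete, the sketch below computes the window implied by a tolerance in parts per million (ppm). The tutorial does not state which tolerance values or units are used in MetaNovo, so the 10 ppm figure and the m/z values are illustrative assumptions, and the function names are hypothetical.

```python
def ppm_window(mz, tol_ppm):
    """Return the (low, high) m/z window for a tolerance given in ppm."""
    delta = mz * tol_ppm / 1_000_000
    return mz - delta, mz + delta


def within_tolerance(observed, theoretical, tol_ppm):
    """Check whether an observed m/z lies within tol_ppm of the theoretical one."""
    error_ppm = abs(observed - theoretical) / theoretical * 1_000_000
    return error_ppm <= tol_ppm


# Illustrative values only: a precursor at m/z 500 with a 10 ppm tolerance.
low, high = ppm_window(500.0, 10)
print(f"window: {low:.4f} to {high:.4f}")
print(within_tolerance(500.004, 500.0, 10))  # about 8 ppm off
print(within_tolerance(500.006, 500.0, 10))  # about 12 ppm off
```

A wider tolerance admits more candidate matches per spectrum, while a narrower one admits fewer, which is why these settings affect the number of sequences retained.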
Merging databases to obtain reduced MetaNovo database for peptide discovery with FASTA Merge Files and Filter Unique Sequences
Hands-on: FASTA Merge Files and Filter Unique Sequences
FASTA Merge Files and Filter Unique Sequences (Galaxy version 1.2.0) with the following parameters:
“Run in batch mode?”: Merge individual FASTAs (output collection if input is collection)
In “Input FASTA File(s)”:
“Insert Input FASTA File(s)”
“FASTA File”: MetaNovo Compact Database (output of MetaNovo tool)
“FASTA File”: Protein Database Human SwissProt (output of Protein Database Downloader tool)
“FASTA File”: Protein Database Contaminants (cRAP) (output of Protein Database Downloader tool)
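After the merge, a quick sanity check is to count the entries in the resulting FASTA file and compare the number with the expected size (~21.2k protein sequences for the reduced database). The sketch below streams the file line by line so it also works for large databases. The helper name count_fasta_entries and the toy file are hypothetical.

```python
import os
import tempfile


def count_fasta_entries(path):
    """Count FASTA records by counting header lines that start with '>'."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


# Toy file standing in for a downloaded database.
toy_path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(toy_path, "w") as handle:
    handle.write(">P1\nMKT\nAAL\n>P2\nGGS\n>P3\nPPQ\n")

n_entries = count_fasta_entries(toy_path)
print(n_entries)
```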
The first step of the clinical metaproteomics study is database generation. Because we did not have a reference database or information from 16S rRNA sequencing data, we generated a FASTA database from a literature survey. If 16S rRNA data are available, the identified taxa can be used to generate a customized database instead. Because the comprehensive database is generally too large, we used the MetaNovo tool to reduce its size. This reduced database will then be used in the clinical metaproteomics discovery workflow.
Key points
Create a customized proteomics database from 16S rRNA results.
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
References
Jagtap, P. D., A. Blakely, K. Murray, S. Stewart, J. Kooren et al., 2015 Metaproteomic analysis using the Galaxy framework. PROTEOMICS 15: 3553–3565. 10.1002/pmic.201500074
Afiuni-Zadeh, S., K. L. M. Boylan, P. D. Jagtap, T. J. Griffin, J. D. Rudney et al., 2018 Evaluating the potential of residual Pap test fluid as a resource for the metaproteomic analysis of the cervical-vaginal microbiome. Scientific Reports 8: 10.1038/s41598-018-29092-4
Batut, B., S. Hiltemann, A. Bagnacani, D. Baker, V. Bhardwaj et al., 2018 Community-Driven Data Analysis Training for Biology. Cell Systems 6: 752–758.e1. 10.1016/j.cels.2018.05.012
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{proteomics-clinical-mp-1-database-generation,
author = "Subina Mehta and Katherine Do and Dechen Bhuming",
title = "Clinical-MP-1-Database-Generation (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/archive/2024-07-01/topics/proteomics/tutorials/clinical-mp-1-database-generation/tutorial.html}",
note = "[Online; accessed Tue Apr 01 2025]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Computational Biology}
}