Protein FASTA Database Handling

Author(s): Florian Christoph Sigloch, Björn Grüning
Reviewers

Overview
Questions:

How to download protein FASTA databases of a certain organism?

How to download a contaminant database?

How to create a decoy database?

How to combine databases?

Objectives:

Creation of a protein FASTA database ready for use with database search algorithms.

Requirements:

Introduction to Galaxy Analyses

Time estimation: 30 minutes

Level: Introductory

Supporting Materials:

Workflows

FAQs

Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

Possibly Working

UseGalaxy.cz

UseGalaxy.no

UseGalaxy.org (Main)

Published: Feb 14, 2017

Last modification: Apr 8, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00214

Rating: 4.8 (0 recent ratings, 5 all time)

Revision: 30

In mass spectrometry-based proteomics experiments, peptides are assigned to experimentally acquired tandem mass spectra (MS2) by a method called peptide-spectral matching. Peptide-spectral matching is commonly achieved by using search algorithms to match the acquired MS2 spectra to theoretical spectra. The theoretical spectra are generated by in silico digestion and fragmentation of the proteins in a FASTA database. Ideally, the protein FASTA database contains all proteins of the organism under investigation.
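The in silico digestion step can be illustrated with a minimal Python sketch. It assumes a simplified trypsin rule (cleavage after K or R, but not before P). The function name `tryptic_digest`, its parameters, and the example sequence are hypothetical; real search engines apply more elaborate rules and also compute fragment masses.

```python
import re


def tryptic_digest(sequence, missed_cleavages=0, min_length=1):
    """Return peptides from a simplified in silico tryptic digest.

    Assumption: trypsin cleaves after K or R unless the next residue is P.
    """
    # Split at every position preceded by K/R and not followed by P.
    # Requires Python 3.7+ (re.split on zero-width matches).
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]

    peptides = []
    for start in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            peptide = "".join(fragments[start:end + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


# Hypothetical toy sequence, not a real protein.
print(tryptic_digest("MKRPAKGGR"))  # ['MK', 'RPAK', 'GGR']
```

Allowing missed cleavages (for example `missed_cleavages=1`) additionally yields the concatenations of neighbouring fragments, which is how search engines account for incomplete digestion.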

Agenda

In this tutorial, we will deal with:

Uploading a protein search Database

Contaminant databases

Merging databases

Creating a Decoy Database

Concluding remarks

Uploading a protein search Database

There are many ways to upload your protein search database (a FASTA file with protein sequences). Three of these are:

Using the Protein Database Downloader tool.
Using a direct weblink to the database.
Uploading a database from the data library.

In this tutorial, we will use the Protein Database Downloader tool to generate a protein search database. For this, we will download the proteome of an organism of interest, here the human proteome.

Hands On: Uploading a protein search database

Create a new history for this Database Handling exercise.

To create a new history simply click the new-history icon at the top of the history panel:

Open Protein Database Downloader (Galaxy version 0.3.1)

In the drop-down menus, select “Taxonomy”: Homo sapiens (Human) and “reviewed”: UniprotKB/Swiss-Prot (reviewed only).

Click on Run Tool. A new dataset named Protein database will now appear in your history.

Rename the Protein database to Main database.

Comment: Types of UniProt databases

UniProt offers several types of databases. You may choose to download only reviewed (UniProtKB/Swiss-Prot) databases, only unreviewed (UniProtKB/TrEMBL) databases, or both (UniProtKB). For well-researched model organisms, e.g. Homo sapiens or D. melanogaster, reviewed (Swiss-Prot) databases contain curated proteins and may lead to smaller databases and cleaner search results. However, if you are interested in identifying unreviewed proteins, it might be wiser to include the TrEMBL database.

You may also include protein isoforms by setting the tick box Include isoform data to Yes.
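To check which kind of entries a downloaded database contains, the FASTA headers can be inspected. The sketch below relies on the UniProt header convention that reviewed entries start with `>sp|` and unreviewed ones with `>tr|`. The function name `count_uniprot_sections` and the example entries are hypothetical placeholders, not real UniProt records.

```python
from collections import Counter


def count_uniprot_sections(fasta_text):
    """Count reviewed (sp) and unreviewed (tr) entries in UniProt-style FASTA text."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        prefix = line[1:].split("|", 1)[0]
        if prefix == "sp":
            counts["reviewed"] += 1
        elif prefix == "tr":
            counts["unreviewed"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical entries for illustration only.
example = """>sp|P00001|EXA1_HUMAN Example protein 1
MKTAYIAK
>tr|A0A000|EXA2_HUMAN Example protein 2
MRLLPK
"""
print(dict(count_uniprot_sections(example)))  # {'reviewed': 1, 'unreviewed': 1}
```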

Question

What is the difference between a “reference proteome set” and a “complete proteome set”?

A UniProt complete proteome consists of the set of proteins thought to be expressed by an organism whose genome has been completely sequenced. A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical and biotechnological research. Reference proteomes constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest. Link to source

Contaminant databases

In proteomic samples, some protein contaminants are commonly present. These contaminants are introduced during sample preparation, either from contaminated samples (e.g. mycoplasma in cell culture), from chemicals, or from the experimenter. To avoid misidentifying spectra derived from contaminants, protein sequences of common laboratory contaminants are added to the database. This has two benefits:

The degree of contamination can be observed, and heavily contaminated samples can be excluded from analysis.
Contaminant peptides cannot be misassigned to similar peptides in the database, which reduces the risk of false-positive identifications.
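As a rough sketch of the first benefit, the degree of contamination could be summarised as the fraction of identified proteins whose title carries a contaminant tag. The function name `contaminant_fraction`, the example titles, and the idea of counting proteins (rather than spectra or intensities) are assumptions made for illustration. The tag string “CONTAMINANT” is the one used in the hands-on section of this tutorial.

```python
def contaminant_fraction(identified_titles, tag="CONTAMINANT"):
    """Return the fraction of identified protein titles that contain `tag`."""
    if not identified_titles:
        return 0.0
    hits = sum(1 for title in identified_titles if tag in title)
    return hits / len(identified_titles)


# Hypothetical identification list for illustration only.
titles = [
    "sp|P00001|EXA1_HUMAN Example protein 1",
    "sp|P00002|EXA2_HUMAN Example protein 2",
    "sp|P00003|EXA3_BOVIN Example protein 3 CONTAMINANT",
    "sp|P00004|EXA4_HUMAN Example protein 4",
]
print(contaminant_fraction(titles))  # 0.25
```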

A widely used database for common contaminants is the common Repository of Adventitious Proteins (cRAP). When using samples generated in cell cultures, it is furthermore recommended to include Mycoplasma proteomes in the search database. Mycoplasma infections are very common in cell culture and often go unnoticed (Drexler and Uphoff, Cytotechnology, 2002).

Hands On: Contaminant databases

Open Protein Database Downloader (Galaxy version 0.3.1)

Select “Download from”: cRAP (contaminants) and execute.

Rename the new database to cRAP database.

To be able to distinguish contaminants from proteins of interest, you should add a tag to each contaminant protein.

Run FASTA-to-Tabular (Galaxy version 1.1.1) on your cRAP database.

Run Add column on the new output. In the field Add this value enter “CONTAMINANT” and execute.

Run Tabular-to-FASTA (Galaxy version 1.1.1).

“Title column”: Column 1 and Column 3

“Sequence column”: Column 2

Rename the new fasta file to Tagged cRAP database.
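The three tagging steps above (FASTA to tabular, add a column, tabular back to FASTA) can be mirrored in a short Python sketch, for example to check the result outside Galaxy. All function names are hypothetical, and the separator used to join the title columns is an assumption: the Galaxy tool may join them differently.

```python
def fasta_to_rows(fasta_text):
    """Step 1: convert FASTA text to rows of [title, sequence]."""
    rows, title, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                rows.append([title, "".join(chunks)])
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        rows.append([title, "".join(chunks)])
    return rows


def add_column(rows, value):
    """Step 2: append a constant value as a new last column."""
    return [row + [value] for row in rows]


def rows_to_fasta(rows, title_columns=(0, 2), sequence_column=1, separator=" "):
    """Step 3: rebuild FASTA; title = columns 1 and 3, sequence = column 2.

    Assumption: title columns are joined with `separator` (hypothetical default).
    """
    lines = []
    for row in rows:
        lines.append(">" + separator.join(row[i] for i in title_columns))
        lines.append(row[sequence_column])
    return "\n".join(lines) + "\n" if lines else ""


def tag_fasta(fasta_text, tag="CONTAMINANT"):
    """Run all three steps in sequence."""
    return rows_to_fasta(add_column(fasta_to_rows(fasta_text), tag))


# Hypothetical entry for illustration only.
print(tag_fasta(">sp|P00001|EXA1_BOVIN Example protein\nMKTA\nYIAK\n"), end="")
```

Each title in the output ends with the tag, so contaminant hits remain recognisable after the databases are merged.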

Question

The cRAP database contains some human proteins. What does it mean if you identify those typical contaminants in a human sample?

What does it mean in a non-human sample?

In samples stemming from a human source, identified human contaminants do not necessarily mean a contaminated sample. The proteins may as well originate from the research study sample. Users are advised to use discretion when interpreting the data.

In samples from non-human sources, identified human contaminants do mean contamination by the experimenter.

Hands On: Optional: Mycoplasma databases

90-95% of mycoplasma infections in cell culture originate from the following species: M. orale, M. hyorhinis, M. arginini, M. fermentans, M. hominis and A. laidlawii (Drexler and Uphoff, Cytotechnology, 2002).

Run Protein Database Downloader (Galaxy version 0.3.1) five times to download all reviewed mycoplasma databases. We will merge them with the main database in the next part of the tutorial.

Run FASTA Merge Files and Filter Unique Sequences (Galaxy version 1.2.0) to combine all mycoplasma databases into a single one.

Tag each entry in the combined database with the string “MYCOPLASMA_CONTAMINANT” by using FASTA-to-Tabular (Galaxy version 1.1.1), Add column and Tabular-to-FASTA (Galaxy version 1.1.1), as explained above.

Rename the Tabular-to-FASTA tool output to Tagged Mycoplasma database.

Comment

The reviewed mycoplasma databases do not contain all known proteins. It is better to also include the TrEMBL database. Mycoplasma proteomes are relatively small, so even downloading TrEMBL sequences will not increase the size of your main database by much.

Merging databases

Depending on the search algorithm in use, you might need to merge all FASTA entries (i.e. proteins of interest and contaminants) into a single database. Make sure to merge the tagged versions of your contaminant databases.

Hands On: Merging databases

Run FASTA Merge Files and Filter Unique Sequences (Galaxy version 1.2.0) on the main database and the tagged cRAP database.

In “Input FASTA File(s)”:

Click on “Insert Input FASTA File(s)”

In “Input FASTA File(s)”

“FASTA file”: tagged cRAP database

Click on “Insert Input FASTA File(s)”

In “Input FASTA File(s)”

“FASTA file”: main database

Set “How are sequences judged to be unique?” to Accession Only.

Optional: Merging mycoplasma databases

At this step you may also merge the mycoplasma protein databases that you downloaded earlier. Simply enter them as additional inputs in the FASTA Merge Files and Filter Unique Sequences tool. You can enter any number of protein databases by clicking on Insert Input FASTA File(s).
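The idea behind merging with “Accession Only” uniqueness can be sketched as follows. This is not the Galaxy tool's implementation: the function names, the accession parsing (second `|`-separated field of a UniProt-style title, otherwise the first word), and the choice to keep the first occurrence of each accession are assumptions made for illustration.

```python
def accession(title):
    """Extract an accession from a FASTA title (assumed UniProt-style)."""
    words = title.split()
    first = words[0] if words else ""
    parts = first.split("|")
    # UniProt-style: db|ACCESSION|ENTRY_NAME; otherwise fall back to the first word.
    return parts[1] if len(parts) >= 3 else first


def merge_unique(databases):
    """Merge lists of (title, sequence) entries, keeping one entry per accession.

    Assumption: the first occurrence wins, so input order matters.
    """
    seen, merged = set(), []
    for entries in databases:
        for title, sequence in entries:
            acc = accession(title)
            if acc not in seen:
                seen.add(acc)
                merged.append((title, sequence))
    return merged


# Hypothetical entries: P00001 occurs in both databases.
tagged_crap = [("sp|P00001|EXA1_HUMAN Example CONTAMINANT", "MKTA")]
main = [("sp|P00001|EXA1_HUMAN Example", "MKTA"),
        ("sp|P00002|EXA2_HUMAN Example", "MRLL")]
for title, _ in merge_unique([tagged_crap, main]):
    print(title)
```

Under the first-occurrence assumption, listing the tagged cRAP database first means a protein present in both inputs keeps its contaminant tag in the merged result.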

Creating a Decoy Database

The most common way to calculate the peptide and protein False Discovery Rate (FDR) is to add protein sequences that are not expected to be present in the sample, so-called decoy protein sequences. Decoys can be generated by reversing the sequences of the target protein entries and appending the reversed entries to the protein database. Some search algorithms use premade target-decoy databases, while others can generate a target-decoy database from a target protein sequence database before using it for peptide-spectral matching.
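How decoy hits translate into an FDR estimate can be shown with a small sketch. It assumes a search against a concatenated target-decoy database and uses a common simplified estimator, the number of decoy matches divided by the number of target matches above a score threshold. The function name `estimate_fdr`, the score scale, and the example values are hypothetical; individual search engines may use refined estimators.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among matches scoring at least `threshold`.

    `psms` is a list of (score, is_decoy) tuples.
    Simplified estimator (assumption): FDR ~ decoy hits / target hits.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical peptide-spectrum matches: (score, matched a decoy entry?)
psms = [(90, False), (85, False), (80, True), (70, False), (60, True)]
print(estimate_fdr(psms, threshold=75))  # 0.5
print(estimate_fdr(psms, threshold=85))  # 0.0
```

Raising the score threshold removes low-scoring matches, which are enriched in decoys, and therefore lowers the estimated FDR at the cost of fewer identifications.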

Hands On: Creating a Decoy Database

Run DecoyDatabase (Galaxy version 2.6+galaxy0) on the merged database.

Rename the final database to human reviewed cRAP decoy database, or human reviewed cRAP mycoplasma decoy database.

Comment: Decoy tags

The string you enter as a decoy tag will be added as a prefix or suffix (your choice) to the description of each decoy protein entry. Thus you can see from which entry in the target database the decoy was computed.
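The effect of the decoy tag can be illustrated with a minimal sketch of reversed-sequence decoy generation. This is not the DecoyDatabase implementation: the function name `add_reversed_decoys`, the tag string `DECOY_`, and attaching the tag to the whole title line are assumptions made for illustration.

```python
def add_reversed_decoys(entries, decoy_tag="DECOY_", position="prefix"):
    """Return the target entries followed by one reversed decoy per target.

    `entries` is a list of (title, sequence) tuples.
    `position` is "prefix" or "suffix" and controls where the tag is attached.
    """
    if position not in ("prefix", "suffix"):
        raise ValueError("position must be 'prefix' or 'suffix'")
    decoys = []
    for title, sequence in entries:
        if position == "prefix":
            decoy_title = decoy_tag + title
        else:
            decoy_title = title + decoy_tag
        decoys.append((decoy_title, sequence[::-1]))  # reversed target sequence
    return list(entries) + decoys


# Hypothetical target entry for illustration only.
targets = [("sp|P00001|EXA1_HUMAN Example protein", "MKTAYIAK")]
for title, sequence in add_reversed_decoys(targets):
    print(">" + title)
    print(sequence)
```

Because each decoy title still contains the title of its target, every decoy hit can be traced back to the entry it was derived from.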

Comment

The DecoyDatabase tool may also take several databases as input, which are then automatically merged into one database.

Concluding remarks

To keep your protein databases up to date, it is recommended to create a workflow out of the hands-on sections (to learn about workflows see this tutorial). You might also want to combine the mycoplasma databases into a single file, which you can then easily add to each of your main databases.

Often you may not want to use the most recent database for reasons of reproducibility. If so, you can transfer the final database of this tutorial into other histories to work with it.

Further reading about construction of the optimal database: (Kumar et al., Methods in molecular biology, 2017).

This tutorial is based upon parts of the GalaxyP-101 tutorial (https://usegalaxyp.readthedocs.io/en/latest/sections/galaxyp_101.html).

You've Finished the Tutorial

Key points

There are several types of UniProt databases.

Search databases should always include possible contaminants.

For analyzing cell culture or organic samples, search databases should include mycoplasma databases.

Some peptide search engines depend on decoys to calculate the FDR.

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.


Citing this Tutorial

Florian Christoph Sigloch, Björn Grüning, Protein FASTA Database Handling (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/database-handling/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{proteomics-database-handling,
author = "Florian Christoph Sigloch and Björn Grüning",
	title = "Protein FASTA Database Handling (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/database-handling/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

ELIXIR Europe

de.NBI

UFR

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/proteomics/tutorials/database-handling/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: fasta_to_tabular
  owner: devteam
  revisions: e7ed3c310b74
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: tabular_to_fasta
  owner: devteam
  revisions: 0a7799698fe5
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: dbbuilder
  owner: galaxyp
  revisions: c1b437242fee
  tool_panel_section_label: Get Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: fasta_merge_files_and_filter_unique_sequences
  owner: galaxyp
  revisions: f546e7278f04
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openms_decoydatabase
  owner: galaxyp
  revisions: 370141bc0da3
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

