Neoantigen 3: Database merge and FragPipe discovery

Overview
Questions:
  • What are the key features and unique sequences in protein datasets that contribute to neoantigen discovery?

  • How can we identify neoantigens from proteomic data?

Objectives:
  • Understand the process of merging neoantigen databases with human protein sequences.

  • Learn to use FragPipe for proteomics data analysis.

  • Gain hands-on experience with bioinformatics tools such as FASTA file processing, database validation, and peptide identification.

Time estimation: 3 hours
Published: Jan 14, 2025
Last modification: Jan 14, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 1

In this tutorial, we will guide you through a bioinformatics workflow aimed at merging neoantigen databases with known human protein sequences, preparing the data for proteomics analysis using FragPipe. This workflow involves processing FASTA files, filtering for unique sequences, and validating the FASTA databases before using FragPipe to perform peptide identification and validation of neoantigens.

Throughout the tutorial, you will learn how to integrate multiple datasets, ensuring that you can perform analyses such as the identification of potential neoantigens, which are critical for cancer immunotherapy and vaccine development. The tools and steps covered here are important for any bioinformatics pipeline dealing with proteomics and neoantigen discovery.

Discovery Overview workflow.

Agenda

In this tutorial, we will cover:

  1. Neoantigen FragPipe Discovery
  2. A: Merging Databases
    1. Get data
    2. Merging FASTA Files and Filtering for Unique Sequences
    3. Sequence database parsing with Validate FASTA Database
  3. B: Discovery search
    1. Running FragPipe for Neoantigen Discovery
    2. Collapsing Datasets and Selecting Relevant Data
    3. Querying Tabular Results for Further Analysis
  4. Conclusion
  5. Rerunning on your own data
  6. Disclaimer

Neoantigen FragPipe Discovery

This tutorial guides users through database searching for neoantigen protein/peptide discovery. It covers the essential bioinformatics steps needed to identify variant-specific peptides for immunological studies. The workflow is divided into two major stages: (A) merging all the variant databases and validating the sequences, and (B) performing database searching with FragPipe to discover neoantigen peptides. Below is an overview of each major stage:

  • A: Merging Databases
    1. Get Data: The first step involves gathering and uploading the necessary proteomics data files into the analysis environment. These files typically contain protein sequences or raw spectrum data that will be processed throughout the tutorial. Proper data organization and tagging are essential to ensure smooth workflow execution.
    2. Merging FASTA Files and Filtering for Unique Sequences: In this step, multiple FASTA files containing protein sequences are merged into a single file. After merging, sequences are filtered to retain only the unique ones, so that redundancy is removed and only relevant protein data is used for downstream analysis.
    3. Validating FASTA Databases: Once the FASTA files are merged and filtered, it is important to validate the database to ensure that the protein sequences are correctly formatted and usable for analysis. This step helps identify and correct any issues in the dataset before performing more complex analysis tasks.
  • B: Discovery search
    1. Running FragPipe for Neoantigen Discovery: FragPipe, a proteomics analysis tool, is then employed to process the data further. This involves peptide identification, protein quantification, and running specialized workflows such as nonspecific-HLA searches to identify potential neoantigens that may be targeted for immunotherapy.
    2. Collapsing Datasets and Selecting Relevant Data: After the analysis, the resulting datasets are collapsed into a single dataset to simplify further processing. This makes it easier to select the sequences that are relevant to the biological question being addressed.
    3. Querying Tabular Results for Further Analysis: In the final step, tabular results from the analysis are queried using SQL-like commands to filter and extract the most relevant data. This allows for focused analysis of the specific protein sequences or neoantigens that were identified, and prepares them for downstream analysis and interpretation.

A: Merging Databases

Database Merging.

Get data

Hands-on: Data Upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo or from the shared data library (GTN - Material -> proteomics -> Neoantigen 3: Database merge and FragPipe discovery):

    https://zenodo.org/records/14374118/files/Experimental-Design-Fragpipe.tabular
    https://zenodo.org/records/14374118/files/Arriba-Fusion-Database.fasta
    https://zenodo.org/records/14374118/files/Human_cRAP_Non_normal_transcripts_dB.fasta
    https://zenodo.org/records/14374118/files/STS_26T_2_Eclipse_02102024.raw
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  3. Rename the datasets
  4. Check that the datatype of each dataset is correct

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select datatypes from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  5. Add to each database a tag corresponding to …

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall, generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

Merging FASTA Files and Filtering for Unique Sequences

Next, we will merge the FASTA files and remove any redundant sequences, so that the downstream search works with unique sequences only. In this step, we combine the fusion database generated by the Arriba pipeline (first neoantigen workflow) with the non-normal database created by the HISAT, FreeBayes, CustomProDB, and StringTie pipeline (second neoantigen workflow). Once merging is done, we validate the database to ensure that the sequences are in the right format.

Hands-on: FASTA Merge Files and Filter Unique Sequences
  1. FASTA Merge Files and Filter Unique Sequences ( Galaxy version 1.2.0) with the following parameters:
    • “Run in batch mode?”: Merge individual FASTAs (output collection if input is collection)
      • In “Input FASTA File(s)”:
        • param-repeat “Insert Input FASTA File(s)”
          • param-file “FASTA File”: Human_cRAP_Non_normal_transcripts_dB.fasta (Input dataset)
      • In “Input FASTA File(s)”:
        • param-repeat “Insert Input FASTA File(s)”
          • param-file “FASTA File”: Arriba-Fusion-Database.fasta (Input dataset)
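The idea behind this merge step can be illustrated with a minimal Python sketch. It is not the implementation of the Galaxy tool: it assumes that uniqueness is judged on the sequence and that the first header seen for a sequence is kept, and the headers and sequences are made-up examples. The function names `parse_fasta`, `merge_unique`, and `to_fasta` are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*fasta_texts):
    """Merge FASTA texts, keeping the first entry seen for each distinct sequence.

    Assumption: duplicates are detected by sequence, not by accession.
    """
    first_header_for = {}
    for text in fasta_texts:
        for header, seq in parse_fasta(text):
            first_header_for.setdefault(seq, header)
    return [(header, seq) for seq, header in first_header_for.items()]


def to_fasta(entries):
    """Serialize (header, sequence) pairs back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in entries)


# Made-up stand-ins for the non-normal and fusion databases.
non_normal = ">sp|P00001|EXAMPLE_HUMAN\nMKTAYIAK\n>variant_1\nMKTAY\nIVK\n"
fusion = ">fusion_1\nMKTAYIVK\n>fusion_2\nGGSLLQ\n"

merged = merge_unique(non_normal, fusion)
print(to_fasta(merged))
```

In this toy input, `fusion_1` carries the same sequence as `variant_1`, so only one of the two entries survives the merge.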

Sequence database parsing with Validate FASTA Database

Hands-on: Validate FASTA Database
  1. Validate FASTA Database ( Galaxy version 0.1.5) with the following parameters:
    • param-file “Select input FASTA dataset”: output (output of FASTA Merge Files and Filter Unique Sequences tool)
Question
  1. Why is it important to validate a FASTA database before using it for further analysis in a proteomics pipeline?
  2. What potential issues might be identified by the Validate FASTA Database tool?
Solution
  1. Validating a FASTA database ensures that the sequences are correctly formatted and that no errors are present. It helps ensure the accuracy of the data used in downstream analysis, which is critical for correct protein identification and neoantigen discovery.
  2. The Validate FASTA Database tool might identify issues such as incorrect sequence formatting, missing information (e.g., headers or sequence data), or invalid characters in the sequences, which could cause errors during further analysis.
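The kinds of checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of the Validate FASTA Database tool: the accepted residue alphabet, the reason strings, and the function name `validate_entries` are all hypothetical.

```python
import re

# Assumption: the standard amino acid letters plus a few common ambiguity
# codes and '*' are accepted; the real tool may use a different alphabet.
VALID_RESIDUES = re.compile(r"[ACDEFGHIKLMNPQRSTVWYBXZUO*]+")


def validate_entries(entries):
    """Split (header, sequence) pairs into passing and failing entries."""
    good, bad = [], []
    for header, seq in entries:
        seq = seq.upper()
        if not header.strip():
            bad.append((header, seq, "missing header"))
        elif not seq:
            bad.append((header, seq, "empty sequence"))
        elif not VALID_RESIDUES.fullmatch(seq):
            bad.append((header, seq, "invalid characters"))
        else:
            good.append((header, seq))
    return good, bad


# Made-up entries covering each failure mode.
entries = [
    ("p1", "MKTAYIAK"),
    ("", "MKTAYIAK"),
    ("p3", ""),
    ("p4", "MKT1YIAK"),
]
good, bad = validate_entries(entries)
for header, seq, reason in bad:
    print(f"rejected {header!r}: {reason}")
```

Only the entries in `good` would be passed on to the search, in the same spirit as the goodFastaOut output used in the next step.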

B: Discovery search

Running FragPipe for Neoantigen Discovery

FragPipe is a suite for processing and analyzing mass spectrometry-based proteomics data. It integrates components such as MSFragger, IonQuant, and Percolator to perform peptide-spectrum match (PSM) identification, protein quantification, and post-translational modification (PTM) analysis. In this step, FragPipe searches the raw mass spectrometry (MS) data against the validated FASTA sequence database to identify peptides, proteins, and potential neoantigens.

Users must provide the MS spectrum files, a manifest file, and the validated FASTA database (created in the previous step using the Validate FASTA Database tool). The tool then runs a series of analyses, including a database search with nonspecific enzyme digestion, validation of PSMs with Percolator, protein inference with ProteinProphet, and quantification of protein abundance. The output includes identified peptides, proteins, and quantification results, which are essential for downstream neoantigen discovery.

In this workflow, FragPipe runs after FASTA database validation, which ensures that the sequence database is correctly formatted and ready for the search. FragPipe combines several tools in a single workflow to process the raw MS data, identify peptides and proteins, and validate the results. It also supports isobaric and label-free quantification of proteins and peptides, which is important for understanding relative protein abundance and identifying potential neoantigens.

Fragpipe-discovery.

Hands-on: FragPipe
  1. FragPipe - Academic Research and Education User License (Non-Commercial) ( Galaxy version 20.0+galaxy2) with the following parameters:
    • “I understand that these tools, including MSFragger, IonQuant, Bruker, and Thermo Raw File Reader, are available freely for academic research and educational purposes only, and agree to the following terms.”: Yes
    • param-file “Proteomics Spectrum files”: STS_26T_2_Eclipse_02102024.raw (Input dataset)
    • param-file “Manifest file”: Experimental-Design-Fragpipe.tabular (Input dataset)
    • param-file “Proteomics Search Database in FASTA format”: goodFastaOut (output of Validate FASTA Database tool)
    • “Split database”: 200
    • “Workflow”: Nonspecific-HLA
      • In “MSFragger Options”:
        • In “Search Tolerances”:
          • “Set Precursor Mass tolerances”: Use default
          • “Set Precursor True tolerance”: Use default
          • “Set Fragment Mass tolerances”: ppm
        • In “In-silico Digestion Parameters”:
          • “Protein Digestion Enzyme”: nonspecific
          • “Second Enzyme Digest”: No
          • “Maximum length of peptides to be generated during in-silico digestion”: 20
        • In “Variable Modifications”:
          • “Select Variable Modifications”: Oxidation of M (15.99491461956) modaa
        • In “Spectrum Processing”:
          • “Precursor mass mode”: corrected
          • “Precursor Charge Override”: Use default
      • In “Validation”:
        • “Run Validation”: Yes
          • “PSM Validation”: Run Percolator
          • “Run Protein Prophet”: Yes
          • “Generate Philosopher Reports”: Yes
      • In “Quant (MS1)”:
        • “Perform Label-Free Quantification”: Use workflow default
      • In “PTMs”:
        • “Run PTM Shepherd”: no
      • In “Quant (Isobaric)”:
        • “Perform Isobaric Quantification”: no
Question
  1. Why do we need to use a validated FASTA database with FragPipe?
  2. What is the significance of running Percolator and ProteinProphet in the FragPipe workflow?
Solution
  1. A validated FASTA database ensures that the protein sequences used in the analysis are correctly formatted and free from errors. This improves the accuracy of peptide identification and ensures reliable downstream analysis for neoantigen discovery.
  2. Percolator validates peptide-spectrum matches (PSMs), improving the accuracy of identification through machine learning-based rescoring. ProteinProphet validates protein identifications, providing confidence levels for protein inference. Both tools help ensure high-quality, reliable results in proteomics analysis.
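To get a feel for what the "nonspecific" digestion setting means, the sketch below enumerates every contiguous peptide of a protein within a length window. It is a conceptual illustration only, not how MSFragger is implemented. The maximum length of 20 comes from the parameters above; the minimum length of 7 is an assumption, and the protein string is made up.

```python
def nonspecific_peptides(protein, min_len=7, max_len=20):
    """Return every contiguous substring whose length is in [min_len, max_len].

    min_len=7 is an assumed value; max_len=20 matches the hands-on setting.
    """
    peptides = set()
    n = len(protein)
    for start in range(n):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > n:
                break
            peptides.add(protein[start:end])
    return peptides


def count_windows(protein_length, min_len=7, max_len=20):
    """Count candidate windows (before deduplication) for one protein."""
    return sum(
        max(protein_length - length + 1, 0)
        for length in range(min_len, max_len + 1)
    )


toy_protein = "ACDEFGHIKL"  # made-up 10-residue sequence
peptides = nonspecific_peptides(toy_protein)
print(len(peptides), "candidate peptides from a 10-residue protein")
print(count_windows(500), "candidate windows from a 500-residue protein")
```

Because every position can start a peptide of every allowed length, the search space grows much faster than with an enzyme-specific digest, which is presumably why the database is split into many parts in the settings above.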

Collapsing Datasets and Selecting Relevant Data

The Collapse Collection tool combines multiple input files into a single dataset, which is useful in workflows where several outputs, such as peptide or protein tables, need to be merged for further analysis. Working with one dataset instead of many files simplifies data handling, reduces the risk of errors, and makes downstream steps such as querying or visualization easier.

In the current version of FragPipe, the peptide output is generated as a collection, so we collapse this collection to create a single tabular output. The next step is to select only the relevant variant sequences, so we remove all contaminants and known human sequences from the output.

Hands-on: Collapse Collection
  1. Collapse Collection ( Galaxy version 5.1.0) with the following parameters:
    • param-file “Collection of files to collapse into single dataset”: output_peptide (output of FragPipe - Academic Research and Education User License (Non-Commercial) tool)
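Conceptually, collapsing a collection of tables amounts to concatenating them. The sketch below is a hedged illustration, not the Collapse Collection tool itself: it assumes every table starts with the same header line and that only the first header should be kept, and the column names and rows are made up.

```python
def collapse_tables(tables, keep_one_header=True):
    """Concatenate tabular texts into one, optionally keeping only the first header."""
    lines_out = []
    for index, text in enumerate(tables):
        lines = text.splitlines()
        if keep_one_header and index > 0:
            lines = lines[1:]  # drop the repeated header line
        lines_out.extend(lines)
    return "\n".join(lines_out) + "\n"


# Made-up peptide tables standing in for the elements of the collection.
table_a = "Peptide\tProtein\nAAAK\tprot_1\n"
table_b = "Peptide\tProtein\nGGSV\tfusion_1\n"

collapsed = collapse_tables([table_a, table_b])
print(collapsed, end="")
```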
Hands-on: Removing contaminants and known proteins with Select
  1. Select with the following parameters:
    • param-file “Select lines from”: output (output of Collapse Collection tool)
    • “that”: NOT Matching
    • “the pattern”: (HUMAN)|(contam_)|(con_)|(ENSP)
    • “Keep header line”: Yes
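The effect of this "NOT Matching" filter can be reproduced with Python's `re` module. The pattern is the one used above; the table rows and the function name `select_not_matching` are made up for illustration.

```python
import re

# Same pattern as in the Select step above.
PATTERN = re.compile(r"(HUMAN)|(contam_)|(con_)|(ENSP)")


def select_not_matching(lines, pattern=PATTERN, keep_header=True):
    """Keep lines that do NOT match the pattern; optionally always keep line 1."""
    kept = []
    for index, line in enumerate(lines):
        if keep_header and index == 0:
            kept.append(line)
        elif not pattern.search(line):
            kept.append(line)
    return kept


# Made-up rows: a reference human protein, a contaminant, a fusion, an Ensembl protein.
rows = [
    "Peptide\tProtein",
    "AAAK\tsp|P00001|EXAMPLE_HUMAN",
    "LLQV\tcontam_TRYP",
    "GGSV\tfusion_1",
    "MKTV\tENSP00000000001",
]
filtered = select_not_matching(rows)
print("\n".join(filtered))
```

Note that the match is case-sensitive and applies to the whole line, so any row that mentions one of these tokens in any column is removed.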

Querying Tabular Results for Further Analysis

The Query Tabular tool queries and filters tabular data using SQL commands, which lets users extract specific data from a larger dataset based on defined criteria. In this step, Query Tabular is applied to the output of the Select tool from the previous step. The input is the tabular dataset out_file1, loaded as a table named fp, and the following query is used: SELECT c1 FROM fp WHERE (c16 IS NULL OR c16 = '') AND (c18 IS NULL OR c18 = ''). The query returns the first column (c1) of only those rows in which columns c16 and c18 are both empty or null.

In this workflow, this step filters the tabular data to isolate the rows that meet the specified conditions. By using this query, unnecessary or irrelevant data is excluded from the dataset, and only the relevant results are retained for further analysis. This step ensures that the subsequent analysis is based on clean and focused data.

Hands-on: Query Tabular
  1. Query Tabular ( Galaxy version 3.3.2) with the following parameters:
    • In “Database Table”:
      • param-repeat “Insert Database Table”
        • param-file “Tabular Dataset for Table”: out_file1 (output of Select tool)
        • In “Table Options”:
          • “Specify Name for Table”: fp
    • “SQL Query to generate tabular output”:
      SELECT c1 FROM fp WHERE (c16 IS NULL OR c16 = '') AND (c18 IS NULL OR c18 = '')
      
    • “include query result column headers”: No
Question
  1. What is the purpose of using an SQL query in the Query Tabular tool in this workflow?
  2. How does the SQL query provided (SELECT c1 FROM fp WHERE (c16 IS NULL OR c16 = '') AND (c18 IS NULL OR c18 = '')) filter the dataset, and what does it aim to accomplish?
Solution
  1. The purpose of using an SQL query in the Query Tabular tool is to filter and retrieve specific rows from a tabular dataset based on defined conditions, allowing the user to focus on relevant data for further analysis.
  2. The SQL query keeps only the rows in which columns c16 and c18 are both null or empty, and returns the value of column c1 for those rows. Rows in which either column contains data are excluded, so that only relevant and clean data is retained for further processing.
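The same query can be tried outside Galaxy with Python's built-in `sqlite3` module. This is a minimal sketch with made-up rows: the table is named fp and the columns c1 to c18 as in the hands-on, while the helper names `make_row` and `run_query` and the cell values are hypothetical.

```python
import sqlite3

QUERY = (
    "SELECT c1 FROM fp "
    "WHERE (c16 IS NULL OR c16 = '') AND (c18 IS NULL OR c18 = '')"
)


def make_row(c1, c16=None, c18=None, n_cols=18):
    """Build an 18-column row where only c1, c16, and c18 are populated."""
    row = [None] * n_cols
    row[0], row[15], row[17] = c1, c16, c18
    return row


def run_query(rows, n_cols=18):
    """Load rows into an in-memory table named fp and run the tutorial query."""
    columns = ", ".join(f"c{i} TEXT" for i in range(1, n_cols + 1))
    placeholders = ", ".join("?" for _ in range(n_cols))
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE TABLE fp ({columns})")
        conn.executemany(f"INSERT INTO fp VALUES ({placeholders})", rows)
        return [record[0] for record in conn.execute(QUERY)]
    finally:
        conn.close()


# Made-up rows: only PEPA and PEPC have both c16 and c18 empty or null.
rows = [
    make_row("PEPA"),
    make_row("PEPB", c16="x"),
    make_row("PEPC", c16="", c18=""),
    make_row("PEPD", c18="y"),
]
result = run_query(rows)
print(result)
```

Rows with a value in either c16 or c18 are dropped, and only the c1 value of the remaining rows is returned.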

Conclusion

In this workflow, the fusion and non-normal protein databases are merged, filtered for unique sequences, and validated before FragPipe searches the raw MS data against them. The FragPipe peptide output is consolidated into a single dataset with Collapse Collection, contaminants and known human sequences are removed with Select, and Query Tabular then filters the remaining rows based on specific conditions to refine the data for downstream analysis. This step-by-step process shows how bioinformatics tools can be combined to handle large-scale proteomics data and keep the analysis focused on a specific research question.

Rerunning on your own data

To rerun this entire analysis at once, you can use our workflow. Below we show how to do this:

Hands-on: Running the Workflow
  1. Import the workflow into Galaxy:

    Hands-on: Importing and launching a GTN workflow
    Launch the Fragpipe Discovery workflow (View on GitHub, Download workflow).
    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on galaxy-upload Import at the top-right of the screen
    • Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/neoantigen-3-fragpipe-discovery/workflows/main_workflow.ga
    • Click the Import workflow button

  2. Run the workflow using the following parameters:

    • “Send results to a new history”: No
    • param-file “Non-Normal protein database”: Human_cRAP_Non_normal_transcripts_dB.fasta
    • param-file “Fusion protein database”: Arriba-Fusion-Database.fasta
    • param-file “Input raw file”: STS_26T_2_Eclipse_02102024.raw
    • param-file “Experimental design file for Fragpipe”: Experimental-Design-Fragpipe.tabular
    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on the workflow-run (Run workflow) button next to your workflow
    • Configure the workflow as needed
    • Click the Run Workflow button at the top-right of the screen
    • You may have to refresh your history to see the queued jobs

Disclaimer

Please note that all the software tools used in this workflow are subject to version updates and changes. As a result, the parameters, functionalities, and outcomes may differ with each new version. Additionally, if the protein sequences are downloaded at different times, the number of sequences may also vary due to updates in the reference databases or tool modifications. We recommend that users verify the specific versions of the software tools used to ensure the reproducibility and accuracy of results.