A neoantigen is a novel peptide (protein fragment) that is produced by cancer cells due to mutations, including gene fusions, that alter the DNA sequence in a way that generates unique proteins not found in normal cells. Because these mutated proteins are unique to the tumor, they are recognized as “foreign” by the immune system. Neoantigens are valuable in immunotherapy because they can serve as specific targets for the immune system, allowing treatments to selectively attack cancer cells while sparing normal tissue. By stimulating an immune response specifically against these neoantigens, therapies like cancer vaccines or T-cell-based treatments can be developed to enhance the body’s natural defense mechanisms, making neoantigens a promising avenue for personalized cancer treatment.
Creating a fusion database is essential in cancer genomics and personalized medicine, as it identifies crucial biomarkers, enhances diagnostic accuracy, and supports therapeutic development. Gene fusions, where parts of two previously separate genes merge, can produce abnormal proteins that drive cancer. Cataloging these fusion events in a database helps researchers identify specific biomarkers linked to cancer types and design more targeted treatments. Additionally, fusion events may lead to unique peptide sequences, known as neoantigens, which are found only in cancer cells. These neoantigens can be targeted by the immune system, making fusion databases valuable in designing personalized immunotherapies like cancer vaccines or T-cell therapies. Some gene fusions also create oncogenic proteins that promote tumor growth, such as the BCR-ABL fusion in chronic myeloid leukemia. Including such information in a database aids in identifying potential therapeutic targets and predicting treatment efficacy. On the diagnostic side, known gene fusions serve as reliable markers, helping clinicians better classify cancer types and choose the most effective treatments. Finally, fusion databases provide a critical reference for researchers studying fusion mechanisms, their impact on disease progression, and their prevalence across cancers, ultimately fueling the discovery of novel treatments and therapies.
To generate the fusion database, this workflow uses the RNA STAR and Arriba tools.
The workflow in this tutorial guides users through the generation of a fusion neoantigen database, covering key steps in bioinformatics to identify, filter, and prepare fusion-specific peptides for further immunological study. Below is an overview of each major stage:
Get Data
The process begins with the upload and quality assessment of raw sequencing data, which is then uncompressed. This stage sets the groundwork for all subsequent analyses.
Fusion Detection and Alignment
RNA sequencing data is aligned to a reference genome using RNA STAR, followed by Arriba to detect fusion events. These tools identify gene fusions and help characterize the gene segments that combine to form new fusion genes.
Filtering and Refinement
After fusions are identified, filters based on blacklist data and other criteria remove non-specific or common fusion events. This step ensures that only relevant, unique fusion events are retained for neoantigen prediction.
Peptide Sequence Extraction and Formatting
Potential neoantigen peptides are extracted from the fusion gene sequences. Using tools such as Text Reformatting and Tabular-to-FASTA, the data is transformed into formats suitable for further immunological analysis.
Final Database Formatting
The workflow concludes by applying regex adjustments and formatting functions to standardize the output, creating a database of potential fusion neoantigens.
In summary, this workflow provides a structured approach to preparing fusion neoantigen data for downstream applications, such as immunotherapy research, by making fusion-derived peptides accessible in a database for experimental or clinical exploration.
Get data
Hands On: Data Upload
Create a new history for this tutorial
Import the files from Zenodo or from the shared data library (GTN - Material -> proteomics -> Neoantigen 1: Fusion-Database-Generation):
Click Upload Data at the top of the tool panel
Select Paste/Fetch Data
Paste the link(s) into the text field
Press Start
Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
Go into Libraries (left panel)
Navigate to the correct folder as indicated by your instructor.
On most Galaxy servers, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
Select the desired files
Click on Add to History near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
“Select history”: the history you want to import the data to (or create a new one)
Click on Import
Rename the datasets
Check that the datatype
Click on the pencil icon for the dataset to edit its attributes
In the central panel, click the Datatypes tab on the top
In the Assign Datatype section, select the desired datatype from the “New type” dropdown
Tip: you can start typing the datatype into the field to filter the dropdown menu
Click the Save button
Add to each database a tag corresponding to …
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
Click on the dataset to expand it
Click on Add Tags
Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
Press Enter
Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):
a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.
Uncompressing data is a crucial first step in many bioinformatics workflows because raw sequencing data files, especially from high-throughput sequencing, are often stored in compressed formats (such as .gz or .zip) to save storage space and facilitate faster data transfer. Compressed files need to be uncompressed to make the data readable and accessible for analysis tools, which generally require the data to be in plain text or other compatible formats. By uncompressing these files, we ensure that downstream applications can efficiently process and analyze the raw sequencing data without compatibility issues related to compression. In this workflow, we do that for both forward and reverse files.
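Outside Galaxy, the same step can be sketched in a few lines of Python. This is only an illustration of what uncompressing a gzip file involves, not the code the Galaxy tool runs; the function name `gunzip` and any file names passed to it are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: Path, dst: Path) -> None:
    """Stream-decompress a gzip file to dst without loading it all into memory."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # copyfileobj reads and writes in chunks, which matters for large FASTQ files.
        shutil.copyfileobj(compressed, plain)
```

For paired-end data the function would be called twice, once for the forward reads and once for the reverse reads.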
Hands On: Converting compressed to uncompressed
Convert compressed file to uncompressed. with the following parameters:
RNA STAR (Spliced Transcripts Alignment to a Reference) is a high-performance tool used to align RNA sequencing (RNA-seq) reads to a reference genome. It identifies the best matches between RNA reads and genome sequences by detecting exon-exon junctions, which are critical for accurately mapping reads from spliced transcripts. RNA STAR uses a “two-pass” mapping approach that first identifies splice junctions across all reads and then uses these junctions to guide a more accurate alignment on the second pass. This capability is especially valuable for studying gene expression, discovering novel splice variants, and identifying fusion genes in cancer and other disease research. The output includes aligned sequences that can be used in subsequent steps of bioinformatics pipelines, such as fusion detection and differential expression analysis.
Hands On: Spliced transcripts Alignment to a human reference
RNA STAR (Galaxy version 2.7.10b+galaxy4) with the following parameters:
“Single-end or paired-end reads”: Paired-end (as individual datasets)
“RNA-Seq FASTQ/FASTA file, forward reads”: RNA-Seq_Reads_1.fastqsanger (output of the Convert compressed file to uncompressed. tool)
“RNA-Seq FASTQ/FASTA file, reverse reads”: RNA-Seq_Reads_2.fastqsanger (output of the Convert compressed file to uncompressed. tool)
“Custom or built-in reference genome”: Use a built-in index
“Reference genome with or without an annotation”: use genome reference without builtin gene-model but provide a gtf
“Select reference genome”: Human Dec. 2013 (GRCh38/hg38) (hg38)
“Gene model (gff3,gtf) file for splice junctions”: human_reference_genome_annotation.gtf (Input dataset)
“Per gene/transcript output”: No per gene or transcript output
“Use 2-pass mapping for more sensitive novel splice junction discovery”: Yes, perform single-sample 2-pass mapping of all reads
“Report chimeric alignments?”: Within the BAM output (together with regular alignments; WithinBAM SoftClip) soft-clipping in the CIGAR for supplemental chimeric alignments
In “Output filter criteria”:
“Would you like to set additional output filters?”: No
In “Algorithmic settings”:
“Configure seed, alignment and limits options”: Use parameters suggested for STAR-Fusion
“Compute coverage”: No coverage
Question
What is RNA STAR, and what does it do?
How do I interpret the alignment statistics in STAR’s output?
STAR is a tool for aligning RNA-Seq reads to a reference genome, helping researchers understand gene expression and identify splice junctions. STAR requires RNA-Seq reads, usually in FASTQ format. It also needs a reference genome file in FASTA format and annotation files in GTF/GFF format to build an index. STAR outputs alignments in BAM/SAM format, as well as splice junction files. It can also provide additional alignment stats in log files.
STAR provides logs with mapping statistics, such as the percentage of uniquely mapped reads, which can be useful for quality control. Aligned BAM files from STAR can be visualized in genome browsers like IGV (Integrative Genomics Viewer) to examine coverage and splicing.
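As a rough illustration of how the mapping statistics can be read programmatically, the sketch below parses text laid out like STAR's Log.final.out, where each statistic is written as a label and a value separated by a `|` character. The sample text and its numbers are invented for this example and are not results from this tutorial; `parse_star_log` and `percent` are hypothetical helper names.

```python
# Invented sample in the "label | value" layout of STAR's Log.final.out.
SAMPLE_LOG = """\
                          Number of input reads |\t1000
                        Uniquely mapped reads % |\t90.00%
             % of reads mapped to multiple loci |\t5.00%
"""


def parse_star_log(text: str) -> dict:
    """Return a {label: value} mapping for every 'label | value' line."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings and blank lines carry no value
        label, _, value = line.partition("|")
        stats[label.strip()] = value.strip()
    return stats


def percent(value: str) -> float:
    """Convert a string such as '90.00%' to the float 90.0."""
    return float(value.rstrip("%"))


stats = parse_star_log(SAMPLE_LOG)
unique_pct = percent(stats["Uniquely mapped reads %"])
```

A low percentage of uniquely mapped reads is usually the first thing to check when an alignment looks suspicious.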
Fusion detection with Arriba
Arriba is a specialized tool used for detecting gene fusions from RNA sequencing (RNA-seq) data. It is particularly focused on identifying fusion events in cancer, where gene fusions can drive oncogenic processes. Arriba uses the output from RNA STAR alignments, specifically looking at chimeric alignments that result from fusion transcripts, and applies a series of filtering steps to reduce false positives.
Arriba’s pipeline includes features for:
Filtering out common artifacts and false-positive fusions based on blacklisted regions.
Annotating fusion breakpoints.
Generating a visualization of detected fusion events.
The output includes a list of fusion candidates with key information like fusion partners, breakpoint locations, reading frames, and peptide sequences. Arriba’s results can provide insight into potential neoantigens, helping guide research into therapeutic targets or immune-based therapies for cancer.
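To make the shape of this output concrete, here is a minimal sketch that reads a fusion table with Python's `csv` module and keeps the rows for which a peptide sequence was predicted. The sample table contains only a subset of the columns in a real Arriba `fusions.tsv`, and the gene names, coordinates, and peptides are invented placeholders. The sketch assumes the first header is `#gene1` and that a dot marks a missing peptide, which is also what the AWK program later in this tutorial relies on.

```python
import csv
import io

# Invented rows; a real fusions.tsv has many more columns.
SAMPLE_TSV = (
    "#gene1\tgene2\tbreakpoint1\tbreakpoint2\treading_frame\tpeptide_sequence\n"
    "GENEA\tGENEB\t1:1000\t2:2000\tin-frame\tMKTAYIAKQR|QISFVKSHFS\n"
    "GENEC\tGENED\t3:3000\t4:4000\t.\t.\n"
)


def fusions_with_peptides(handle):
    """Return the rows whose peptide_sequence is not the '.' placeholder."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row["peptide_sequence"] != "."]


rows = fusions_with_peptides(io.StringIO(SAMPLE_TSV))
```

Each returned row is a dictionary keyed by column name, so `rows[0]["#gene1"]` gives the 5' fusion partner.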
Hands On: Fusion detection
Arriba (Galaxy version 2.4.0+galaxy1) with the following parameters:
“STAR Aligned.out.sam”: mapped_reads (output of the RNA STAR tool)
“Genome assembly fasta (that was used for STAR alignment)”: From your history
“Gene annotation in GTF format”: human_reference_genome_annotation.gtf (Input dataset)
“File containing blacklisted ranges.”: blacklist (output of the Arriba Get Filters tool)
“File containing protein domains”: protein_domains (output of the Arriba Get Filters tool)
“File containing known fusions”: known_fusions (output of the Arriba Get Filters tool)
“Use whole-genome sequencing data”: no
“Generate visualization”: Yes
“Cytobands”: cytobands (output of the Arriba Get Filters tool)
Question
What is Arriba, and what does it do?
How can I ensure Arriba finds specific known fusions?
Arriba is a tool for detecting gene fusions in RNA-Seq data, especially helpful for identifying cancer-associated fusions and other structural variations. Arriba needs: a BAM file with RNA-Seq reads aligned by STAR; STAR's chimeric alignments, which in this workflow are reported within that same BAM file (see the “Report chimeric alignments?” setting of RNA STAR); and reference files, such as the genome assembly FASTA, a gene annotation GTF file, and a blacklist file to filter false positives.
Ensure that the STAR alignment and Arriba parameters are optimized for sensitivity. Adjusting settings for segment length and alignment quality in STAR can improve detection of specific known fusions. Supplying a known fusions file to Arriba, as in the step above, is also intended to help with fusions that are already catalogued.
Postprocessing
Clean up data using Text reformatting
Text Reformatting is a step used in bioinformatics workflows to manipulate and clean up data for easier downstream processing. In fusion detection workflows, text reformatting is often used to parse and restructure output files, making the data consistent and accessible for subsequent analysis steps.
In this workflow, text reformatting involves:
Extracting specific columns or fields from tabular outputs, such as gene names, breakpoint coordinates, or fusion peptide sequences.
Formatting peptide sequences and related information into specific columns or concatenating fields for unique identifiers.
Converting the data into a consistent format that downstream tools can interpret, such as converting tab-separated values into a structured layout for database input or analysis.
The reformatting step ensures that the processed data adheres to the requirements of other tools, enabling seamless integration across the workflow and supporting reliable, interpretable final results.
Hands On: Formatting Arriba output
Text reformatting (Galaxy version 1.1.2) with the following parameters:
“File to process”: fusions_tsv (output of the Arriba tool)
“AWK Program”:
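# Header row: record the column index of each field we need.
(NR==1){
    for (i=1;i<=NF;i++) {
        # Regex match, because the first header in the Arriba output is "#gene1".
        if ($i ~ "gene1") {
            gene1 = i;
        }
        if ($i == "gene2") {
            gene2 = i;
        }
        if ($i == "breakpoint1") {
            breakpoint1 = i;
        }
        if ($i == "breakpoint2") {
            breakpoint2 = i;
        }
        if ($i == "reading_frame") {
            reading_frame = i;
        }
        if ($i == "peptide_sequence") {
            pscol = i;
        }
    }
}
# Data rows: build one output line per fusion with a predicted peptide.
(NR>1){
    pseq = $pscol
    # "." means no peptide sequence was predicted for this fusion.
    if (pseq != ".") {
        # "|" marks the fusion breakpoint within the peptide.
        bp = index(pseq,"|");
        # Start 8 residues upstream of the breakpoint.
        pos = bp - 8;
        n=split(pseq,array,"|");
        pep = toupper(array[1] array[2])
        # Remove the stop codon marker.
        sub("[*]","",pep)
        g1 = $gene1;
        g2 = $gene2;
        # Keep only the first gene name, dropping "(distance)" and ",other" parts.
        sub("[(,].*","",g1);
        sub("[(,].*","",g2);
        id = g1 "_" g2
        brkpnts = $breakpoint1 "_" $breakpoint2
        # Out-of-frame: keep everything from pos to the end of the peptide.
        neopep = substr(pep,pos)
        # In-frame: keep a 16-residue window around the breakpoint.
        if ($reading_frame == "in-frame") {
            neopep = substr(pep,pos,16)
        }
        print(id "\t" (NR-1) "\t" brkpnts "\t" neopep);
    }
}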
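To make the logic of the AWK program easier to follow, the sketch below restates it in Python for a single row. It is an illustration only, not part of the workflow: the function names and the example gene names, coordinates, and peptide are invented, and `awk_substr` is a small helper written here to imitate AWK's 1-based `substr`, including the way it clips start positions below 1.

```python
import re


def awk_substr(s, start, length=None):
    """Imitate AWK substr(s, start[, length]) with 1-based, clipped positions."""
    end = len(s) if length is None else start + length - 1  # inclusive, 1-based
    start = max(start, 1)
    return s[start - 1:max(end, start - 1)]


def neopeptide_record(row_number, gene1, gene2, bp1, bp2, reading_frame, pseq):
    """Return the tab-separated output line for one fusion, or None if no peptide."""
    if pseq == ".":
        return None
    bp = pseq.find("|") + 1          # 1-based position of the breakpoint marker
    pos = bp - 8                     # start 8 residues upstream of the breakpoint
    pep = "".join(pseq.split("|")[:2]).upper().replace("*", "", 1)
    g1 = re.sub(r"[(,].*", "", gene1)  # keep only the first gene name
    g2 = re.sub(r"[(,].*", "", gene2)
    # In-frame fusions keep a 16-residue window; others run to the end.
    length = 16 if reading_frame == "in-frame" else None
    neopep = awk_substr(pep, pos, length)
    return "\t".join([f"{g1}_{g2}", str(row_number), f"{bp1}_{bp2}", neopep])


line = neopeptide_record(
    1, "GENEA", "GENEB(1234),GENEC(5678)", "1:1000", "2:2000",
    "in-frame", "MKTAYIAKQR|QISFVKSHFS",
)
```

For this invented in-frame peptide, the window covers the 8 residues before the breakpoint and the 8 residues after it, so `line` ends with `TAYIAKQRQISFVKSH`.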
Data refinement with Query Tabular
Query Tabular is a bioinformatics tool used to extract and manipulate specific data from tabular datasets in workflows. This tool allows users to perform SQL-like queries on tabular data, enabling them to filter, aggregate, and transform datasets based on user-defined criteria.
In this workflow, the Query Tabular tool is employed for several purposes:
Data Filtering: Users can select specific rows based on certain conditions (e.g., filtering fusions that meet particular criteria).
Column Manipulation: Users can specify which columns to retain or create new columns by combining or transforming existing data.
Aggregation: The tool allows for summarizing data, such as counting occurrences of specific fusion events or summarizing results based on particular categories.
Output Customization: Users can format the output to suit downstream processing needs, making it easier to pass data to subsequent analysis tools.
By leveraging Query Tabular, researchers can efficiently refine and structure their data, ensuring that only relevant information is carried forward in the workflow, ultimately aiding in the identification and analysis of significant biological insights.
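The idea of querying a tabular dataset with SQL can be sketched with Python's built-in `sqlite3` module. The table name `t1` and column names `c1` to `c4` follow what we assume to be Query Tabular's default naming, and the rows are invented. The query itself is purely hypothetical and is not the query used in this tutorial's workflow; it simply shows filtering, column manipulation, and ordering in one statement.

```python
import sqlite3

# Invented rows in the four-column layout: id, row number, breakpoints, peptide.
ROWS = [
    ("GENEA_GENEB", 1, "1:1000_2:2000", "TAYIAKQRQISFVKSH"),
    ("GENEC_GENED", 2, "3:3000_4:4000", "AC"),
]


def query_rows(rows, min_length):
    """Load rows into an in-memory table and return (title, sequence) pairs."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t1 (c1 TEXT, c2 INTEGER, c3 TEXT, c4 TEXT)")
    con.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", rows)
    # Hypothetical query: build one title column and drop very short peptides.
    result = con.execute(
        "SELECT c1 || '_' || c2 || '_' || c3 AS title, c4 AS sequence "
        "FROM t1 WHERE length(c4) >= ? ORDER BY c2",
        (min_length,),
    ).fetchall()
    con.close()
    return result


selected = query_rows(ROWS, 8)
```

The result has two columns, a title and a sequence, which is the shape a tabular-to-FASTA conversion expects.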
Hands On: Manipulating the data to extract fusions
Query Tabular (Galaxy version 3.3.1) with the following parameters:
In “Database Table”:
“Insert Database Table”
“Tabular Dataset for Table”: outfile (output of the Text reformatting tool)
Tabular to FASTA conversion is a common task in bioinformatics that transforms data structured in a tabular format (such as CSV or TSV) into FASTA format, widely used for representing nucleotide or protein sequences. This conversion is essential when sequence data needs to be input into various bioinformatics tools or databases that require FASTA-formatted files.
Hands On: Converting tabular to fasta
Tabular-to-FASTA (Galaxy version 1.1.1) with the following parameters:
“Tab-delimited file”: output (output of the Query Tabular tool)
“Title column(s)”: c1
“Sequence column”: c2
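The conversion itself is simple enough to sketch in Python. This is an illustration of the idea rather than the Galaxy tool's implementation; `tabular_to_fasta` is a hypothetical name, and the zero-based column defaults correspond to the first and second columns of the input.

```python
def tabular_to_fasta(lines, title_col=0, seq_col=1):
    """Turn tab-separated lines into FASTA text, one record per input line."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # A FASTA record is a ">" header line followed by the sequence line.
        records.append(f">{fields[title_col]}\n{fields[seq_col]}\n")
    return "".join(records)


fasta = tabular_to_fasta(["GENEA_GENEB_1\tTAYIAKQRQISFVKSH\n"])
```

Each input row becomes a two-line record, so the number of headers in the output equals the number of rows in the table.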
Using Regex Find And Replace
Regular expressions (regex) are a powerful technique for text manipulation: they let you search for patterns and replace them with the desired text. In this step, a regex find and replace adds a “fusion” tag to each FASTA header in the database. A short primer on regular expressions is included in the hands-on box.
Hands On: Adding fusion tag in the fasta header
Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
“Select lines from”: output (output of the Tabular-to-FASTA tool)
In “Check”:
“Insert Check”
“Find Regex”: >(\b\w+\S+)(.*$)
“Replacement”: >generic|fusion_\1|\2
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression    Matches
abc                   an occurrence of abc within your data
(abc|def)             abc or def
[abc]                 a single character which is either a, b, or c
[^abc]                a character that is NOT a, b, nor c
[a-z]                 any lowercase letter
[a-zA-Z]              any letter (upper or lower case)
[0-9]                 numbers 0-9
\d                    any digit (same as [0-9])
\D                    any non-digit character
\w                    any alphanumeric character (or underscore)
\W                    any non-alphanumeric character
\s                    any whitespace
\S                    any non-whitespace character
.                     any character
\.                    a literal period
{x,y}                 between x and y repetitions
^                     the beginning of the line
$                     the end of the line
Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.
Examples
Regular expression    Matches
\d{4}                 4 digits (e.g. a year)
chr\d{1,2}            chr followed by 1 or 2 digits
.*abc$                anything with abc at the end of the line
^$                    empty line
^>.*                  line starting with > (e.g. FASTA header)
^[^>].*               line not starting with > (e.g. FASTA sequence)
Replacing
Sometimes you need to capture the exact value you matched on in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2 etc. for the first and second captured values. If you want to refer to the whole match, use &.
Regular expression      Input           Captures
chr(\d{1,2})            chr14           \1 = 14
(\d{2}) July (\d{4})    24 July 1984    \1 = 24, \2 = 1984
An expression like s/find/replacement/g indicates a replacement expression, this will search (s) for any occurrence of find, and replace it with replacement. It will do this globally (g) which means it doesn’t stop after the first match.
Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.
You can also use replacement modifier such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions; it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
Rename the output FASTA as Arriba-Fusion-Database.fasta
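The effect of the find and replace expressions used in this step can be checked with Python's `re` module. This sketch assumes that Python's regex syntax behaves the same as the Galaxy tool's for this particular pattern; the headers passed in are invented examples, and `tag_header` is a hypothetical name.

```python
import re

# The "Find Regex" and "Replacement" values from the hands-on step.
FIND = r">(\b\w+\S+)(.*$)"
REPLACEMENT = r">generic|fusion_\1|\2"


def tag_header(line):
    """Apply the find/replace to one line; lines without '>' are returned unchanged."""
    return re.sub(FIND, REPLACEMENT, line)


tagged = tag_header(">GENEA_GENEB_1_1:1000_2:2000")
```

The first capture group takes the header up to the first whitespace, and the second takes whatever follows, so an identifier without a description ends up between `fusion_` and a trailing `|`. Sequence lines do not contain `>` and pass through untouched.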
Conclusion
The workflow outlined above demonstrates a systematic approach to building a fusion neoantigen database from RNA-Seq data, with each step contributing to accurate and reliable results. RNA STAR aligns the reads and reports chimeric alignments, and Arriba uses those alignments to detect gene fusions. Text reformatting and Query Tabular then extract the fusion peptides, Tabular-to-FASTA converts them to FASTA format, and a regex find-and-replace standardizes the headers. The resulting database supports the identification of fusion-derived neoantigens and contributes to the broader goals of personalized medicine and targeted therapies.
Rerunning on your own data
To rerun this entire analysis at once, you can use our workflow. Below we show how to do this:
Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
Click on Import at the top-right of the screen
Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/neoantigen-1-fusion-database-generation/workflows/main_workflow.ga
Click the Import workflow button
Run Workflow using the following parameters:
Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
Click on the Run workflow button next to your workflow
Configure the workflow as needed
Click the Run Workflow button at the top-right of the screen
You may have to refresh your history to see the queued jobs
Disclaimer
Please note that all the software tools used in this workflow are subject to version updates and changes. As a result, the parameters, functionalities, and outcomes may differ with each new version. Additionally, if the protein sequences are downloaded at different times, the number of sequences may also vary due to updates in the reference databases or tool modifications. We recommend that users verify the specific versions of the software tools used, to ensure the reproducibility and accuracy of results.
Key points
Create a customized fusion proteomics database from RNA-Seq data.
@misc{proteomics-neoantigen-1-fusion-database-generation,
author = "Subina Mehta and Katherine Do and James Johnson",
title = "Neoantigen 1: Fusion-Database-Generation (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/neoantigen-1-fusion-database-generation/tutorial.html}",
note = "[Online; accessed TODAY]"
}
Congratulations on successfully completing this tutorial!