Predicting EI+ mass spectra with QCxMS

Author(s): Julia Jakiela, Helge Hecht
Reviewers: Saskia Hiltemann, Melanie Föll, Julia Jakiela
Overview
Questions:
  • Can I predict QC-based mass spectra starting from SMILES?

  • How can I run those computationally heavy predictions?

  • Can I take into account different conformers?

Objectives:
  • To prepare structure-data format (SDF) files for further analysis, starting from chemical structure descriptors in simplified molecular-input line-entry system (SMILES) format.

  • To generate 3D conformers and optimise them using semi-empirical quantum mechanical (SQM) methods.

  • To produce simulated mass spectra for a given molecule in MSP (text based) format.

Requirements:
Time estimation: 1 hour
Level: Introductory
Supporting Materials:
Published: Oct 1, 2024
Last modification: Oct 1, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00458
Revision: 1

Mass spectrometry (MS) is a powerful analytical technique used in many fields, including proteomics, metabolomics, drug discovery and other areas that rely on compound identification. Even though MS is now a standard and popular method, many compounds still lack experimental spectra. In those cases, predicting mass spectra from the chemical structure can reveal useful information, help in compound identification and expand spectral databases, improving the accuracy and efficiency of database searches [Zhu and Jonas 2023, Allen et al. 2016]. Several methods have been developed to predict mass spectra, which can be classified as either first-principles physics-based simulations or data-driven statistical methods [Zhu and Jonas 2023]. The first category includes purely statistical theories (quasi-equilibrium theory (QET) or Rice–Ramsperger–Kassel–Marcus (RRKM) theory) [Vetter 1994], as well as QCxMS [Koopman and Grimme 2021] and semiempirical GFNn-xTB [Koopman and Grimme 2019], which use Born–Oppenheimer molecular dynamics (MD) combined with fragmentation pathways. Data-driven statistical methods, which form the second category, reach back to the 1960s, when the DENDRAL project (using rule-based heuristic programming) was started by early artificial intelligence (AI) scientists [Lindsay et al. 1980]. More recently, CFM-ID has been introduced [Allen et al. 2014, Allen et al. 2014, Allen et al. 2016, Djoumbou-Feunang et al. 2019, Wang et al. 2020], which uses rule-based fragmentation and employs machine learning methods. Recent advances in machine learning have led to work using deep neural networks that predict spectra from molecular graphs or fingerprints [Wei et al. 2019].

You will be able to see how QCxMS works in practice, since we are going to use the Galaxy tool suite based on this method [Grimme 2013, Bauer and Grimme 2014, Bauer and Grimme 2016, Koopman and Grimme 2021]. Beforehand, we will generate conformers of the query molecule with RDKit and use xTB for molecular optimisation [Bannwarth et al. 2020]. But first things first, let’s get some toy data to play with and crack on!

Question

What does QCEIMS stand for?

With a little knowledge of chemistry, you’ll be able to work it out yourself!

Look at the acronym in the following way: QC-EI-MS.

QC = quantum chemical

EI = electron ionisation

MS = mass spectrometry

Hence QCEIMS = quantum-chemical electron ionisation mass spectrometry

QCxMS is a successor of QCEIMS, where the EI part is replaced by x to take into account other ionisation methods and improve the applicability of the program. In QCEIMS, EI stands for electron ionisation, while in QCxMS, x refers to EI or CID (collision-induced dissociation) [Koopman and Grimme 2021]. Currently, only EI simulations are supported - using CID with the Galaxy tool wrappers is still under development.

Importing data and pre-processing

In this tutorial, you can choose whether you want to predict the mass spectrum for one molecule only, or if you want to do it for multiple molecules at once. The pre-processing steps will slightly differ depending on your choice. If you are completing this tutorial just to see how the QCxMS tools work, feel free to follow the instructions for one molecule to skip some pre-processing steps.

In both cases, we start from the molecule’s SMILES and convert it to SDF, so if you already have SDF files to work with, simply jump to the relevant place in the workflow and carry on from there.

SMILES (.smi) - the simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of a chemical species using short ASCII strings. It is a linear text format which can describe the connectivity and chirality of a molecule [Weininger 1988]. Even though many different forms of SMILES exist, the differences are not relevant for us in this application.

Depiction of a molecular structure and the corresponding SMILES string. Starts of rings are denoted with a number after the corresponding element (e.g. C1). Non-organic elements need to be written in brackets, and isotopic forms need to be denoted before the atom, also inside brackets. Atoms with charges need to be in brackets as well, with the charge denoted after the atom or the functional group. Hydrogens inside brackets need to be stated explicitly. Double bonds are denoted with a `=` and triple bonds with a `#`.

Figure 1: A quick explanation of the SMILES system.

Image credit: Helge Hecht, License: MIT
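
To make the notation rules above more concrete, here is a minimal Python sketch that splits a SMILES string into tokens. It is only an illustration: the pattern covers just the features listed in the figure description (bracket atoms, organic-subset atoms, aromatic atoms, double and triple bonds, branches and single-digit ring closures), not the full SMILES grammar, and the function names are our own.

```python
import re

# Simplified token pattern (an assumption of this sketch, not the full SMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [NH4+] or [13CH4]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # one-letter organic-subset atoms
    r"|[bcnops]"       # aromatic atoms
    r"|[=#]"           # double and triple bonds
    r"|[()]"           # branch open / close
    r"|\d"             # single-digit ring-closure label
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if anything is left unparsed."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")
    return tokens


def count_atoms(smiles: str) -> int:
    """Count the atoms written in the string (implicit hydrogens are not counted)."""
    return sum(1 for t in tokenize_smiles(smiles) if t.startswith("[") or t.isalpha())


print(tokenize_smiles("CCO"))   # ethanol
print(tokenize_smiles("C=C"))   # ethylene
print(count_atoms("C1CCCCC1"))  # cyclohexane ring
```

For real work, a cheminformatics toolkit should be used to parse SMILES; the sketch only shows how the characters map onto atoms, bonds and ring labels.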

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial.

Choose below whether you want to follow the pipeline for predicting the spectrum of a single molecule or of multiple molecules at once!

Upload data onto Galaxy

Working on a single molecule means that we will work on a dataset and not on a collection of datasets. To simulate the spectrum, we will use ethanol (C2H5OH) as an example. You can choose any other molecule, but be aware that the more complex the structure, the longer the analysis will take, since the workflow involves generating conformers, semiempirical methods and molecular optimisation. We will simply start with the molecule’s SMILES. You have three options for uploading the data. The first two, importing via a shared history or a Zenodo link, will give you a file specific to this tutorial, while the last one, “Paste data uploader”, gives you more flexibility in terms of the compounds you would like to test with this workflow.

Hands-on: Option 1: Data upload - Import history
  1. You can simply import this history with the input file.

    1. Open the link to the shared history
    2. Click on the Import this history button on the top left
    3. Enter a title for the new history
    4. Click on Copy History

  2. Rename the history to a name of your choice.

Hands-on: Option 2: Data upload - Add to history via Zenodo
  1. Create a new history for this tutorial
  2. Import the input table from Zenodo

    https://zenodo.org/records/13327051/files/ethanol_SMILES.smi
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

Hands-on: Option 3: Data Upload - paste data
  1. Create a new history for this tutorial
    • Click galaxy-upload Upload Data at the top of the tool panel
    • Select galaxy-wf-edit Paste/Fetch Data at the bottom
    • Paste the SMILES into the text field:
      CCO
      
    • Change Type from “Auto-detect” to smi
    • Press Start and Close the window
  2. You can then rename the dataset as you wish (here we use ethanol_SMILES)
  3. Check that the datatype is smi.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select smi from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

SMILES to SDF

Hands-on: Convert SMILES to SDF

Compound conversion ( Galaxy version 3.1.1+galaxy0) with the following parameters:

  • param-file “Molecular input file”: ethanol_SMILES (your SMILES dataset from your history)
  • “Output format”: MDL MOL format (sdf, mol)
  • “Append the specified text after each molecule title”: ethanol

We now have an SDF file, containing the atoms’ coordinates and the investigated molecule’s name.
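
If you are curious about what is inside such a file, the sketch below reads the title and the atom block from a V2000-style SDF record using only the Python standard library. The example record is hand-written for illustration: the coordinates are made up, only heavy atoms are listed, and the function names are ours, so the file you get from Galaxy will look different in detail.

```python
# Hand-written example record (illustrative values, heavy atoms only).
SDF_TEXT = """\
ethanol
  hypothetical-program
  made-up coordinates for illustration
  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.0000    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
$$$$
"""


def split_sdf(text):
    """Split SDF text into records; each record is terminated by a '$$$$' line."""
    return [record for record in text.split("$$$$\n") if record.strip()]


def read_sdf_record(record):
    """Return the title and a list of (symbol, x, y, z) tuples from one record."""
    lines = record.splitlines()
    title = lines[0].strip()          # line 1: molecule title
    n_atoms = int(lines[3][0:3])      # line 4 (counts line): first 3 columns = atom count
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return title, atoms


title, atoms = read_sdf_record(split_sdf(SDF_TEXT)[0])
print(title, atoms)
```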

3D Conformer generation & optimization

Generate conformers

The next step involves generating three-dimensional (3D) conformers for our molecule. It creates the actual 3D topology of the molecule based on electromagnetic forces. This process might seem trivial for very small and simple molecules (those with no complex structure), but it can be more challenging for larger molecules with a more flexible geometry. This concerns, for example, phosphorus-containing biomolecules, where the phosphorus atom often forms a rotational center of the molecule. The number of conformers to generate can be specified as an input parameter, with a default value of 1 if not provided. This process is crucial for exploring the possible shapes and energies that a molecule can adopt. The output of this step is a file containing the generated 3D conformers.

Conformers are different spatial arrangements of a molecule that result from rotations around single bonds. They have different potential energies and hence some are more favourable (local minima on the potential energy surface) than others.

Newman projections of butane conformations and their relative energy differences (not total energies). Conformations form when butane rotates about one of its single covalent bonds. The torsional/dihedral angle is shown on the x-axis.

Figure 2: Conformers of butane and their relative energy differences.

Image credit: Keministi, License: Creative Commons CC0 1.0.
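
To see why energy differences between conformers matter, the following sketch estimates how a population of molecules distributes over conformers at a given temperature using Boltzmann weights. The relative energies are illustrative values assumed for this example (roughly in the range expected for the anti and gauche forms of butane), not results from this workflow.

```python
import math

R_KJ_PER_MOL_K = 0.008314462618  # gas constant in kJ/(mol*K)


def boltzmann_populations(rel_energies_kj_mol, temperature_k=298.15):
    """Return the fractional population of each conformer at the given temperature."""
    rt = R_KJ_PER_MOL_K * temperature_k
    weights = [math.exp(-energy / rt) for energy in rel_energies_kj_mol]
    total = sum(weights)
    return [weight / total for weight in weights]


# Illustrative relative energies in kJ/mol (assumed values for this sketch).
conformers = {"anti": 0.0, "gauche+": 3.8, "gauche-": 3.8}
populations = boltzmann_populations(list(conformers.values()))
for name, population in zip(conformers, populations):
    print(f"{name}: {population:.2f}")
```

The lowest-energy conformer receives the largest share, but higher-energy conformers are still populated at room temperature, which is one reason why taking several conformers into account can be worthwhile.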

Hands-on: Generate conformers

Generate conformers ( Galaxy version 1.1.4+galaxy0) with the following parameters:

  • param-file “Input file”: output of Convert to sdf tool
  • “Number of conformers to generate”: 1

Now - once again format conversion! This time we will convert the generated conformers from the SDF format to Cartesian coordinate (XYZ) format. The XYZ format lists the atoms in a molecule and their respective 3D coordinates, which is a common format used in computational chemistry for further processing and analysis.

Hands-on: Molecular Format Conversion
  1. Compound conversion ( Galaxy version 3.1.1+galaxy0) with the following parameters:
    • param-file “Molecular input file”: output (sdf) of Generate conformers tool
    • “Output format”: XYZ cartesian coordinates format
    • “Add hydrogens appropriate for pH”: 7.0
  2. Check that the datatype is xyz. If it’s not, just change it using the tip below.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select xyz from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button
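
The XYZ layout is simple enough to write and read by hand: the first line holds the number of atoms, the second line is a free-text comment, and every following line lists an element symbol with its x, y and z coordinates. The sketch below shows this with a few made-up coordinates; the function names are ours and the column widths are just one common choice.

```python
def to_xyz(atoms, comment=""):
    """Format (symbol, x, y, z) tuples as XYZ text: atom count, comment, one atom per line."""
    lines = [str(len(atoms)), comment]
    lines += [f"{symbol:<2} {x:12.6f} {y:12.6f} {z:12.6f}" for symbol, x, y, z in atoms]
    return "\n".join(lines) + "\n"


def parse_xyz(text):
    """Parse single-molecule XYZ text back into a comment and a list of atoms."""
    lines = text.splitlines()
    n_atoms = int(lines[0])
    comment = lines[1]
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return comment, atoms


# Made-up coordinates for illustration (heavy atoms only).
example_atoms = [("C", 0.0, 0.0, 0.0), ("C", 1.5, 0.0, 0.0), ("O", 2.0, 1.3, 0.0)]
xyz_text = to_xyz(example_atoms, comment="ethanol")
print(xyz_text)
```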

Molecular optimization

As shown in the figure above, different conformers have different energies. Therefore, our next step is to optimize the geometry of the molecules to find the lowest-energy conformation. This is crucial to achieve convergence in the next steps. If the input geometry for the QCxMS method is too crude, the ground state neutral run will not converge and we won’t be able to sample the geometry to calculate individual trajectories. We will perform semi-empirical optimization on the molecules using the Extended Tight-Binding (xTB) method. The level of optimization accuracy can be specified with the input parameter “Optimization Levels”. The default quantum chemical method is GFN2-xTB.

Hands-on: Molecular optimisation with xTB

xtb molecular optimization ( Galaxy version 6.6.1+galaxy1) with the following parameters:

  • param-file “Atomic coordinates file”: output of Convert to xyz tool
  • “Optimization Levels”: tight
  • “Keep molecule name”: galaxy-toggle Yes
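
To get a feeling for what an optimization level controls, here is a toy sketch that minimises the energy of a single harmonic bond by steepest descent and stops once the gradient falls below a threshold. This is not how xTB works internally, and the force constant, equilibrium distance and thresholds are hypothetical values chosen for illustration; the point is only that a tighter threshold costs more iterations and ends closer to the minimum.

```python
def optimise_bond(r_start, r_eq=1.43, k=5.0, grad_tol=1e-3, step=0.1, max_iter=10_000):
    """Steepest descent on E(r) = 0.5 * k * (r - r_eq)**2.

    Returns the final bond length and the number of iterations used.
    All parameter values are hypothetical and chosen for illustration.
    """
    r = r_start
    for iteration in range(1, max_iter + 1):
        gradient = k * (r - r_eq)      # dE/dr
        if abs(gradient) < grad_tol:   # converged: gradient below threshold
            return r, iteration
        r -= step * gradient           # move downhill
    raise RuntimeError("optimisation did not converge")


for label, tolerance in [("loose (hypothetical)", 1e-3), ("tight (hypothetical)", 1e-6)]:
    r_final, n_iter = optimise_bond(2.0, grad_tol=tolerance)
    print(f"{label}: r = {r_final:.7f} after {n_iter} iterations")
```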

QCxMS Spectra Prediction

Neutral and production runs

Finally, let’s predict the spectra for our molecule. As mentioned, we will use QCxMS for this purpose. First, we need to prepare the input files required for the QCxMS production runs, which will predict the mass spectrum of the molecule. This step performs the ground state calculations and formats the optimized molecular data for the production simulations. The resulting geometry trajectory is then sampled, and one sampled geometry is used for each production trajectory.

Hands-on: QCxMS neutral run

QCxMS neutral run ( Galaxy version 5.2.1+galaxy3) with the following parameters:

  • param-file “Molecule 3D structure [.xyz]”: output of xtb molecular optimization tool
  • “QC Method”: GFN2-xTB

The outputs of the above step are as follows:

  • .in output: Input file for the QCxMS production run
  • .start output: Start file for the QCxMS production run
  • .xyz output: Cartesian coordinate file for the QCxMS production run

We can now use those files as input for the next tool which calculates the mass spectra using QCxMS. This simulation generates .res files, which contain the raw results of the mass spectra calculations. These results are essential for predicting how the molecules will appear in mass spectrometry experiments.

Hands-on: QCxMS production run
  1. QCxMS production run ( Galaxy version 5.2.1+galaxy3) with the following parameters:
    • param-collection Dataset collection “in files [.in]”: input in files generated by QCxMS neutral run tool
    • param-collection Dataset collection “start files [.start]”: input start files generated by QCxMS neutral run tool
    • param-collection Dataset collection “xyz files [.xyz]”: input xyz files generated by QCxMS neutral run tool

Filter failed datasets

Some runs may have failed, so it is crucial to filter them out to ensure that only successful results are processed further. This step is important to maintain the integrity and quality of the data analyzed in subsequent steps. The output is a collection containing only the successful mass spectra results.

Hands-on: Filter failed datasets
  1. Filter failed datasets with the following parameters:
    • param-collection “Input Collection”: res files generated by QCxMS (output of QCxMS production run tool )
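
Conceptually, this filtering step keeps only those collection elements that finished successfully. The sketch below mimics that logic on a hypothetical list of records; the element names and the dictionary layout are assumptions made for illustration and do not reflect how Galaxy stores collections internally.

```python
def filter_failed(collection):
    """Keep only the elements whose state is 'ok'."""
    return [element for element in collection if element["state"] == "ok"]


# Hypothetical collection of production-run outputs.
runs = [
    {"name": "trajectory_1.res", "state": "ok"},
    {"name": "trajectory_2.res", "state": "error"},
    {"name": "trajectory_3.res", "state": "ok"},
]
successful = filter_failed(runs)
print([element["name"] for element in successful])
```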

Get MSP spectra

The filtered collection contains .res files from the QCxMS production run. This final step converts the .res files into simulated mass spectra in MSP (Mass Spectrum Peak) file format. The MSP format is widely used for storing and sharing mass spectrometry data, enabling easy comparison and analysis of the results, for example by comparing the spectra using the matchms package.

Hands-on: QCxMS get MSP results
  1. QCxMS get results ( Galaxy version 5.2.1+galaxy2) with the following parameters:
    • param-file “Molecule 3D structure [.xyz]”: Convert to xyz (output of Compound conversion tool)
    • “res files [.res]”: output of Filter failed datasets tool (if the collection doesn’t appear in the drop-down list, simply drag and drop it from the history panel to the input box)

MSP (Mass Spectrum Peak) file is a text file structured according to the NIST MSSearch spectra format. MSP is one of the generally accepted formats for mass spectral libraries (or collections of unidentified spectra, so called spectral archives), and it is compatible with lots of spectra processing programmes (MS-DIAL, NIST MS Search, AMDIS, matchms, etc.). It can contain one or more mass spectra, these are split by an empty line. The individual spectra essentially consist of two sections: metadata (such as name, spectrum type, ion mode, retention time, and the number of m/z peaks) and peaks, consisting of m/z and intensity tuples.
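
Because the format is plain text, an MSP file can be read with a few lines of Python. The sketch below follows the structure described above: spectra separated by an empty line, metadata as `key: value` lines, and peaks as m/z and intensity pairs. The example content is made up for illustration, and real MSP files may contain variations that this minimal parser does not handle.

```python
import re


def parse_msp(text):
    """Parse MSP text into a list of {'metadata': dict, 'peaks': [(mz, intensity), ...]}."""
    spectra = []
    for block in re.split(r"\n\s*\n", text.strip()):
        metadata, peaks = {}, []
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue
            if ":" in line:                      # metadata line, e.g. "Num Peaks: 3"
                key, value = line.split(":", 1)
                metadata[key.strip().lower()] = value.strip()
            else:                                # peak line: m/z and intensity
                mz, intensity = line.split()[:2]
                peaks.append((float(mz), float(intensity)))
        spectra.append({"metadata": metadata, "peaks": peaks})
    return spectra


# Made-up example with two spectra (intensities are illustrative only).
EXAMPLE_MSP = "\n".join([
    "NAME: ethanol",
    "Num Peaks: 3",
    "31.0 100.0",
    "45.0 50.0",
    "46.0 20.0",
    "",
    "NAME: ethylene",
    "Num Peaks: 2",
    "27.0 60.0",
    "28.0 100.0",
])

for spectrum in parse_msp(EXAMPLE_MSP):
    print(spectrum["metadata"]["name"], spectrum["peaks"])
```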

You can now download the MSP file and open it in your spectra processing software for further investigation!

To give you some insight into how well QCxMS can perform, below is the mass spectrum of ethanol resulting from our workflow compared with an experimental spectrum. Both spectra were compiled using an online mass spectrum generator which requires only m/z values and intensities – so the values that you can get from our MSP file! As you can see, the predicted peaks nicely correspond to experimental ones. But be careful - there might be slight deviations for molecules with more structural complexity!

The upper panel shows the experimental spectrum of ethanol, while the lower panel shows the corresponding spectrum predicted with the current workflow. The predicted peaks correspond well to the experimental ones. The intensities of the simulated peaks have not been predicted perfectly, but the most important trends are preserved.

Figure 3: Comparison between experimental (upper panel) and predicted (lower panel) mass spectra of ethanol.
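
A visual comparison like the one above can be complemented with a similarity score. The sketch below computes a simple cosine similarity between two peak lists after rounding m/z values to the nearest integer. It is a simplified stand-in written for illustration: dedicated packages such as matchms provide more elaborate scores, and the peak lists used here are made-up values rather than the spectra shown in the figure.

```python
import math


def cosine_similarity(peaks_a, peaks_b, decimals=0):
    """Cosine score between two peak lists after rounding m/z to `decimals` places."""
    def binned(peaks):
        vector = {}
        for mz, intensity in peaks:
            key = round(mz, decimals)
            vector[key] = vector.get(key, 0.0) + intensity
        return vector

    a, b = binned(peaks_a), binned(peaks_b)
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Made-up peak lists (m/z, intensity) for illustration.
experimental = [(31.0, 100.0), (45.0, 50.0), (46.0, 20.0)]
predicted = [(31.0, 100.0), (45.0, 30.0), (46.0, 35.0)]
print(f"cosine similarity: {cosine_similarity(experimental, predicted):.3f}")
```

A score of 1 means the two binned spectra have identical relative intensities, while 0 means they share no peaks.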

Conclusion

Well done, you’ve simulated the mass spectrum! You might want to compare your results with the key history. If you would like to process multiple molecules at once, you can simply use the workflow or move to the “Predict MS for multiple molecules at once” tab of this tutorial to learn how the pipeline differs from the one that we’ve just covered.

Upload data onto Galaxy

We will work on two simple molecules: ethanol (C2H5OH) and ethylene (C2H4). Of course, you can add more or choose any other molecules, but be aware that the more complex the structure, the longer the analysis will take, since the workflow involves generating conformers, semiempirical methods and molecular optimisation. We will start with a table in which the first column contains the molecule names and the second one the corresponding SMILES. You have three options for uploading the data. The first two, importing via a shared history or a Zenodo link, will give you a file specific to this tutorial, while the last one, “Paste data uploader”, gives you more flexibility in terms of the compounds you would like to test with this workflow.

Hands-on: Option 1: Data upload - Import history
  1. You can simply import this history with the input table.

    1. Open the link to the shared history
    2. Click on the Import this history button on the top left
    3. Enter a title for the new history
    4. Click on Copy History

  2. Rename the history to a name of your choice.

Hands-on: Option 2: Data upload - Add to history via Zenodo
  1. Create a new history for this tutorial
  2. Import the input table from Zenodo

    https://zenodo.org/records/13327051/files/qcxms_prediction_input.tabular
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

Hands-on: Option 3: Data Upload - paste data
  1. Create a new history for this tutorial
    • Click galaxy-upload Upload Data at the top of the tool panel
    • Select galaxy-wf-edit Paste/Fetch Data at the bottom
    • Paste the contents into the text field, separated by space. First, enter the name of the molecule, then its SMILES. Please note that we are not using headers here. For this tutorial, we’ll use the example of ethanol and ethylene, but feel free to use your own examples.
      ethanol CCO
      ethylene C=C
      
    • Change Type from “Auto-detect” to tabular
    • Find the gear symbol (galaxy-gear), deselect any ticked options and select only (galaxy-gear) Convert spaces to tabs
    • Press Start and Close the window
  2. You can then rename the dataset as you wish.
  3. Check that the datatype is tabular.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select tabular from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button
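
If you prefer to prepare the input table outside Galaxy, the conversion from space-separated to tab-separated text can be sketched in a few lines of Python. This is only an illustration of what the “Convert spaces to tabs” option achieves for this particular two-column input; the function name is ours, and it assumes that molecule names contain no spaces.

```python
def spaces_to_tabs(text):
    """Turn 'name SMILES' lines into tab-separated 'name<TAB>SMILES' lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue                             # skip empty lines
        name, smiles = line.split(None, 1)       # split on the first run of whitespace
        rows.append(f"{name}\t{smiles.strip()}")
    return "\n".join(rows) + "\n"


pasted = "ethanol CCO\nethylene C=C\n"
print(spaces_to_tabs(pasted))
```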

Input pre-processing

Once your dataset is uploaded, we can do some simple pre-processing to prepare the file for downstream analysis. Let’s start with ‘cutting’ the table into two columns – one with SMILES, the other with the molecule name – and then separating each entry to create a dataset collection, followed by parsing out the text information.

Hands-on: Cutting out name column
  1. Advanced Cut ( Galaxy version 9.3+galaxy1) with the following parameters:
    • param-file “File to cut”: the tabular file in your history with the compound name and SMILES
    • “Operation”: Keep
    • “Cut by”: fields
    • “Delimited by”: Tab
    • “Is there a header for the data’s columns”: No
    • “List of Fields”: Column: 1
  2. You can now rename the resulting dataset or just add a tag in order not to confuse it with subsequent outputs:
    • galaxy-tags Add tag: #names

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

Hands-on: Creating dataset collection (molecule name)

Split file to dataset collection ( Galaxy version 0.5.2) with the following parameters:

  • param-file “Select the file type to split”: Tabular
  • “Tabular file to split”: output of Advanced Cut tool (with #names tag)
  • “Number of header lines to transfer to new files”: 0
  • “Split by row or by a column?”: By row
  • “Specify number of output files or number of records per file?”: Number of records per file (‘chunk mode’)
  • “Chunk size”: 1
  • “Base name for new files in collection”: split_file
  • “Method to allocate records to new files”: Maintain record order
Hands-on: Parsing out name info

Parse parameter value with the following parameters:

  • “Input file containing parameter to parse out of”: * click on param-collection Dataset collection and select output of Split file tool
  • “Select type of parameter to parse”: Text
  • “Remove newlines ?”: galaxy-toggle Yes

If you have any problems with accessing Parse parameter value tool, you can open the tool directly using a given link.
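
The three steps above (cutting a column, splitting it into one-record files and parsing each record as text) can be pictured with the following Python sketch. It is a simplified imitation written for illustration: the function names and the element naming scheme are assumptions, not the behaviour of the actual Galaxy tools.

```python
def cut_column(tabular_text, column):
    """Return the values of one column (1-based) from tab-separated text without a header."""
    return [line.split("\t")[column - 1] for line in tabular_text.splitlines() if line.strip()]


def split_by_row(values, base_name="split_file"):
    """Create a 'collection' with one record per element (hypothetical naming scheme)."""
    return {f"{base_name}_{index:06d}": value + "\n" for index, value in enumerate(values)}


def parse_text_parameter(content):
    """Read a one-record file as a text parameter, removing newlines."""
    return content.replace("\n", "")


table = "ethanol\tCCO\nethylene\tC=C\n"
names = split_by_row(cut_column(table, 1))
print({element: parse_text_parameter(content) for element, content in names.items()})
```

Calling `cut_column(table, 2)` instead would give the SMILES column, which is what the next steps of the tutorial do.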

We will repeat the first two steps, but processing SMILES this time.

Hands-on: Cutting out SMILES column
  1. Advanced Cut ( Galaxy version 9.3+galaxy1) with the following parameters:
    • param-file “File to cut”: the tabular file in your history with the compound name and SMILES
    • “Operation”: Keep
    • “Cut by”: fields
    • “Delimited by”: Tab
    • “Is there a header for the data’s columns”: No
    • “List of Fields”: Column: 2
  2. You can now rename the resulting dataset or just add a tag in order not to confuse it with subsequent outputs:
    • galaxy-tags Add tag: #SMILES

Hands-on: Creating dataset collection (SMILES)
  1. Split file to dataset collection ( Galaxy version 0.5.2) with the following parameters:
    • param-file “Select the file type to split”: Tabular
    • “Tabular file to split”: output of the latest Advanced Cut tool (higher number in your history, with #SMILES tag)
    • “Number of header lines to transfer to new files”: 0
    • “Split by row or by a column?”: By row
    • “Specify number of output files or number of records per file?”: Number of records per file (‘chunk mode’)
    • “Chunk size”: 1
    • “Base name for new files in collection”: split_file
    • “Method to allocate records to new files”: Maintain record order
  2. Check that the datatype is smi. If it’s not, just change it!

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select smi from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Now, onto format conversion. Let’s convert our SMILES to SDF (Structure Data File) and append the molecule’s name that we have already extracted.

Hands-on: Convert SMILES to SDF

Compound conversion ( Galaxy version 3.1.1+galaxy0) with the following parameters:

  • “Molecular input file”: click on param-collection Dataset collection and select output of Split file tool on SMILES
  • “Output format”: MDL MOL format (sdf, mol)

Another parameter of this tool, “Append the specified text after each molecule title”, allows you to add the name of the molecule to the title of the generated file. If you are working on a single-molecule file, you can simply type the name of that compound into the parameter box. However, if your input is a collection of SMILES (as in this case), you have to add the names (output of Parse parameter value tool) at the level of the workflow editor.

We now have two SDF files, each containing the coordinates of the atoms and the name of the investigated molecule. Let’s combine them to make life easier and work on just one file.

Hands-on: Concatenating the files

Concatenate datasets tail-to-head (cat) ( Galaxy version 9.3+galaxy1) with the following parameters:

  • param-file “Datasets to concatenate”:
  • if nothing pops up when you click on param-collection Dataset collection, then click on “switch to column select”
  • from the datasets list, find the output of Compound conversion tool (should have the highest number in your history) and select it
  • workflow-run Run tool!

3D Conformer generation & optimization

Generate conformers

The next step involves generating three-dimensional (3D) conformers for each molecule from the generated SDF. It creates the actual 3D topology of the molecules based on electromagnetic forces. This process might seem trivial for very small and simple molecules (those with no complex structure), but it can be more challenging for larger molecules with a more flexible geometry. This concerns, for example, phosphorus-containing biomolecules, where the phosphorus atom often forms a rotational center of the molecule. The number of conformers to generate can be specified as an input parameter, with a default value of 1 if not provided. This process is crucial for exploring the possible shapes and energies that a molecule can adopt. The output of this step is a file containing the generated 3D conformers.

Conformers are different spatial arrangements of a molecule that result from rotations around single bonds. They have different potential energies and hence some are more favourable (local minima on the potential energy surface) than others.

Newman projections of butane conformations and their relative energy differences (not total energies). Conformations form when butane rotates about one of its single covalent bonds. The torsional (dihedral) angle is shown on the x-axis.

Figure 4: Conformers of butane and their relative energy differences.

Image credit: Keministi, License: Creative Commons CC0 1.0.
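To make "more favourable" a little more concrete: at thermal equilibrium, lower-energy conformers are more populated, following Boltzmann weights. The sketch below assumes relative energies given in kJ/mol. The function name and the example energies are illustrative placeholders, not values read from the figure or produced by the workflow.

```python
import math

R_KJ_PER_MOL_K = 0.0083144626  # gas constant in kJ/(mol*K)


def boltzmann_populations(relative_energies_kj_mol, temperature_k=298.15):
    """Return the equilibrium fraction of each conformer.

    Each conformer gets the weight exp(-E / (R * T)); the weights are then
    normalised so that the fractions sum to 1.
    """
    rt = R_KJ_PER_MOL_K * temperature_k
    weights = [math.exp(-energy / rt) for energy in relative_energies_kj_mol]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical relative energies for two conformers (kJ/mol).
populations = boltzmann_populations([0.0, 3.8])
for label, fraction in zip(["conformer 1", "conformer 2"], populations):
    print(f"{label}: {fraction:.2f}")
```

The lower-energy conformer always receives the larger fraction, and conformers of equal energy are equally populated.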

Hands-on: Generate conformers

Generate conformers ( Galaxy version 1.1.4+galaxy0) with the following parameters:

  • param-file “Input file”: output of Concatenate datasets tool
  • “Number of conformers to generate”: 1

Now - once again format conversion! This time we will convert the generated conformers from the SDF format to Cartesian coordinate (XYZ) format. The XYZ format lists the atoms in a molecule and their respective 3D coordinates, which is a common format used in computational chemistry for further processing and analysis.
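If you are curious what an XYZ file looks like, here is a minimal reading sketch. It assumes the common layout: the first line holds the atom count, the second is a free-text comment (often the molecule name), and each following line holds an element symbol and three coordinates. The function name is hypothetical and the water coordinates are approximate, for illustration only.

```python
def parse_xyz(text):
    """Parse a single-molecule XYZ string into (comment, atoms)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])          # line 1: number of atoms
    comment = lines[1].strip() if len(lines) > 1 else ""  # line 2: comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


# Approximate, illustrative geometry of water (coordinates in angstrom).
example_xyz = """3
water
O   0.000   0.000   0.117
H   0.000   0.757  -0.469
H   0.000  -0.757  -0.469
"""

comment, atoms = parse_xyz(example_xyz)
print(comment, len(atoms))
```

Note that the format stores no bond information: only the elements and their positions survive the conversion.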

Hands-on: Molecular Format Conversion
  1. Compound conversion ( Galaxy version 3.1.1+galaxy0) with the following parameters:
    • param-file “Molecular input file”: output of Generate conformers tool
    • “Output format”: XYZ cartesian coordinates format
    • “Split multi-molecule files into a collection”: galaxy-toggle Yes
    • “Add hydrogens appropriate for pH”: 7.0
  2. Check that the datatype is xyz. If it’s not, just change it!

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select xyz from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Molecular optimization

As shown in the image in the Details box above, different conformers have different energies. Therefore, our next step will optimize the geometry of the molecules to find the lowest energy conformation. This is crucial to achieve convergence in the next steps. If the input geometry for the QCxMS method is too crude, the ground state neutral run will not converge and we won’t be able to sample the geometry to calculate individual trajectories. We will perform semi-empirical optimization on the molecules using the Extended Tight-Binding (xTB) method. The level of optimization accuracy can be specified with the input parameter “Optimization Levels”. The default quantum chemical method is GFN2-xTB.

Hands-on: Molecular optimisation with xTB

xtb molecular optimization ( Galaxy version 6.6.1+galaxy1) with the following parameters:

  • “Atomic coordinates file”: click on param-collection Dataset collection and select Prepared ligands (output of Compound conversion tool)
  • “Optimization Levels”: tight
  • “Keep molecule name”: galaxy-toggle Yes
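Outside Galaxy, a comparable optimization can be started from the xtb command line. The sketch below only assembles such a command and does not run it. The tutorial does not show how the Galaxy wrapper calls xtb, so treat the helper name, the file name `molecule.xyz` and the mapping of the tool parameters to `--opt` and `--gfn` as assumptions.

```python
# Optimization levels accepted by xtb's --opt flag.
XTB_OPT_LEVELS = (
    "crude", "sloppy", "loose", "lax", "normal", "tight", "vtight", "extreme",
)


def build_xtb_command(xyz_path, opt_level="tight", gfn=2):
    """Assemble an xtb geometry-optimization command as an argument list."""
    if opt_level not in XTB_OPT_LEVELS:
        raise ValueError(f"unknown optimization level: {opt_level}")
    # --opt selects the convergence thresholds, --gfn the GFNn-xTB method.
    return ["xtb", str(xyz_path), "--opt", opt_level, "--gfn", str(gfn)]


command = build_xtb_command("molecule.xyz")
print(" ".join(command))  # prints: xtb molecule.xyz --opt tight --gfn 2
```

With xtb installed, the list could be passed to `subprocess.run(command, check=True)`; xtb then writes the optimized geometry to `xtbopt.xyz` in the working directory.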

QCxMS Spectra Prediction

Neutral and production runs

Finally, let’s predict the spectra for our molecules. As mentioned, we will use QCxMS for this purpose. First, we need to prepare the input files for the QCxMS production runs, the simulations that will predict the mass spectra of the molecules. This neutral run takes the optimized molecular data and performs the ground state calculations. The resulting geometry trajectory is then sampled, and one sampled geometry is used as the starting point for each production trajectory.

Hands-on: QCxMS neutral run

QCxMS neutral run ( Galaxy version 5.2.1+galaxy3) with the following parameters:

  • “Molecule 3D structure [.xyz]”: click on param-collection Dataset collection and select output of xtb molecular optimization tool
  • “QC Method”: GFN2-xTB

The outputs of the above step are as follows:

  • .in output: Input file for the QCxMS production run
  • .start output: Start file for the QCxMS production run
  • .xyz output: Cartesian coordinate file for the QCxMS production run

We can now use those files as input for the next tool which calculates the mass spectra for each molecule using QCxMS (Quantum Chemistry and Mass Spectrometry). This simulation generates .res files, which contain the raw results of the mass spectra calculations. These results are essential for predicting how the molecules will appear in mass spectrometry experiments.

Hands-on: QCxMS production run
  1. QCxMS production run ( Galaxy version 5.2.1+galaxy3) with the following parameters:
    • param-collection Dataset collection “in files [.in]”: input in files generated by QCxMS neutral run tool
    • param-collection Dataset collection “start files [.start]”: input start files generated by QCxMS neutral run tool
    • param-collection Dataset collection “xyz files [.xyz]”: input xyz files generated by QCxMS neutral run tool

Filter failed datasets

Some of the production runs may have failed, so it is important to filter them out to ensure that only successful results are processed further. This maintains the integrity and quality of the data analyzed in subsequent steps. The output is a collection containing only the successful mass spectra results.

Hands-on: Filter failed datasets
  1. Filter failed datasets with the following parameters:
    • param-collection “Input Collection”: output of QCxMS production run tool
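Conceptually, this step keeps only those collection elements whose job finished without error. A minimal sketch of that idea follows. The dictionary layout and element names are hypothetical, and it assumes that successful datasets carry the state `ok`, as Galaxy shows them in the history.

```python
def filter_failed(collection):
    """Keep only the elements whose job finished successfully."""
    return [element for element in collection if element["state"] == "ok"]


# Hypothetical collection of production-run outputs.
production_runs = [
    {"name": "trajectory_1.res", "state": "ok"},
    {"name": "trajectory_2.res", "state": "error"},
    {"name": "trajectory_3.res", "state": "ok"},
]

successful = filter_failed(production_runs)
print([element["name"] for element in successful])
```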

Get MSP spectra

The filtered collection contains .res files from the QCxMS production run. This final step converts the .res files into simulated mass spectra in MSP (Mass Spectrum Peak) file format. The MSP format is widely used for storing and sharing mass spectrometry data, enabling easy comparison and analysis of the results, for example by comparing the spectra using the matchms package.

Hands-on: QCxMS get MSP results
  1. QCxMS get results ( Galaxy version 5.2.1+galaxy2) with the following parameters:
    • “Molecule 3D structure [.xyz]”: click on param-collection Dataset collection and select Prepared ligands (output of Compound conversion tool)
    • param-file “res files [.res]”: output of Filter failed datasets tool (if the collection doesn’t appear in the drop-down list, simply drag and drop it from the history panel to the input box)

The output of this step is a collection of two MSP files, one per molecule. If you want, you can combine them into a single MSP file.

Hands-on: Combine MSP files into one
  1. Concatenate datasets tail-to-head (cat) ( Galaxy version 9.3+galaxy1) with the following parameters:
    • param-file “Datasets to concatenate”:
    • if nothing pops up when you click on param-collection Dataset collection, then click on “switch to column select”
    • from the datasets list, find the output of QCxMS get results tool (should be the latest output dataset) and select it
    • workflow-run Run tool!

An MSP (Mass Spectrum Peak) file is a text file structured according to the NIST MSSearch spectra format. MSP is one of the generally accepted formats for mass spectral libraries (or collections of unidentified spectra, so-called spectral archives), and it is compatible with many spectra processing programmes (MS-DIAL, NIST MS Search, AMDIS, matchms, etc.). It can contain one or more mass spectra, separated by an empty line. Each spectrum essentially consists of two sections: metadata (such as name, spectrum type, ion mode, retention time, and the number of m/z peaks) and peaks, given as m/z and intensity tuples.
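The structure described above can be read with a few lines of Python. This is a simplified sketch, not a full MSP reader: it assumes one `m/z intensity` pair per line and `Key: value` metadata lines, while real files may use other peak layouts. The function name and the example peaks are hypothetical.

```python
import re


def parse_msp(text):
    """Split MSP text into spectra with 'metadata' and 'peaks' entries."""
    spectra = []
    # Spectra are separated by one or more empty lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        metadata, peaks = {}, []
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue
            if ":" in line:
                # Metadata line, e.g. "Name: molecule_A".
                key, value = line.split(":", 1)
                metadata[key.strip()] = value.strip()
            else:
                # Peak line: m/z followed by intensity.
                mz, intensity = line.split()[:2]
                peaks.append((float(mz), float(intensity)))
        if metadata or peaks:
            spectra.append({"metadata": metadata, "peaks": peaks})
    return spectra


# Hypothetical two-spectrum file; the peak values are placeholders.
EXAMPLE_MSP = """Name: molecule_A
Num Peaks: 2
15.0 20.0
31.0 100.0

Name: molecule_B
Num Peaks: 1
28.0 100.0
"""

for spectrum in parse_msp(EXAMPLE_MSP):
    print(spectrum["metadata"]["Name"], len(spectrum["peaks"]))
```

For real analyses, a dedicated library such as matchms is a safer choice, because it handles the format variants that this sketch ignores.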

You can now download the MSP file and open it in your spectra processing software for further investigation!

To give you some insight into how well QCxMS can perform, below is the mass spectrum of ethanol resulting from our workflow compared with the experimental spectrum. Both spectra were plotted using an online mass spectrum generator which requires only m/z values and intensities – the very values that you can get from our MSP file! As you can see, the predicted peaks correspond well to the experimental ones. But be careful - there might be slight deviations for molecules with more structural complexity!

The upper panel shows the experimental spectrum of ethanol, while the lower panel shows the analogous spectrum predicted with the current workflow. The predicted peaks correspond well to the experimental ones. The intensities of the simulated peaks have not been predicted perfectly, but the most important trends are preserved.

Figure 5: Comparison between experimental (upper panel) and predicted (lower panel) mass spectra of ethanol.
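A visual comparison like this can also be expressed as a single number. The sketch below computes a plain cosine similarity between two peak lists after binning the peaks to nominal (integer) m/z. This is a deliberately simple stand-in and not the tolerance-based matching offered by matchms. The function name and the peak lists are hypothetical placeholders, not the ethanol data from the figure.

```python
import math


def cosine_similarity(peaks_a, peaks_b, decimals=0):
    """Cosine similarity of two spectra after binning m/z by rounding."""

    def binned(peaks):
        bins = {}
        for mz, intensity in peaks:
            key = round(mz, decimals)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins

    a, b = binned(peaks_a), binned(peaks_b)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical (m/z, intensity) lists for a measured and a predicted spectrum.
measured = [(15.0, 10.0), (29.0, 30.0), (31.0, 100.0)]
predicted = [(15.0, 20.0), (31.0, 100.0), (45.0, 40.0)]

print(f"{cosine_similarity(measured, predicted):.2f}")
```

A score of 1 means the binned spectra are proportional to each other, while 0 means they share no peaks.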

Conclusion

Well done, you’ve simulated mass spectra! You might want to compare your results with the key history or use the workflow associated with this tutorial.

The prediction of mass spectra might be very useful, particularly for compounds that lack experimental data. Simulating the spectra can also save time and resources. This field has been developing quite rapidly, and recent advancements in new algorithms and packages have led to more and more accurate results. However, one cannot forget that this kind of software should be used to deepen our chemical understanding of the structures of studied compounds and not as a replacement for practical experiments.