Submitting raw sequencing reads to ENA

Overview

Questions
  • How do you submit raw viral sequence reads to the European Nucleotide Archive?

Objectives
  • Add ENA Webin credentials to your Galaxy user information

  • Use Galaxy’s ‘ENA upload tool’ to interactively generate metadata

  • Use a metadata template to upload bulk metadata

  • Submit raw sequencing reads and metadata to ENA’s test server

Requirements
Time estimation: 1 hour
Level: Intermediate
Supporting Materials
Last modification: Aug 10, 2021
License: Tutorial content is licensed under the Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

Raw reads contain valuable information, such as coverage depth and quality scores, that is lost in a consensus sequence. Submitting raw SARS-CoV-2 reads to public repositories allows the data to be reused and analyses to be reproduced, and enables the discovery of minor allelic variants and intrahost variation (Maier et al. 2021).

The European Nucleotide Archive (ENA) is an open and FAIR repository of nucleotide data. As part of the International Nucleotide Sequence Database Collaboration (INSDC), ENA also indexes data from the NCBI and DDBJ (Arita et al. 2020). Data submitted to ENA must be accompanied by sufficient metadata. You can learn more from this introductory slide deck or directly from ENA.

In this tutorial, we will show you how to use Galaxy’s ‘ENA Upload tool’ to submit raw SARS-CoV-2 sequencing reads and their associated metadata to ENA (Roncoroni et al. 2021). You will learn to add your ENA Webin credentials to Galaxy, enter metadata interactively or via a metadata template, and submit the reads to ENA’s test server. Specifically, we will use one Oxford Nanopore (ONT) sequencing file to demonstrate interactive metadata input and two sets of paired-end (PE) Illumina reads to demonstrate the metadata template. Data will be submitted to ENA’s test server and will not be public.

Comment: Nature of the input data

Our input data are derived from sequencing of bronchoalveolar lavage fluid (BALF) samples obtained from early COVID-19 patients in China. Human traces have been removed in Galaxy.

Agenda

In this tutorial, we will cover:

  1. Adding ENA Webin credentials to your Galaxy user information
  2. Option 1: submitting to ENA using interactive metadata generator
  3. Option 2: submitting to ENA using the metadata template

Adding ENA Webin credentials to your Galaxy user information

To submit data to ENA, you need a valid Webin account. If you don’t have one already, you can register for one here. Your Webin credentials need to be added to your Galaxy user information before you can use the ENA Upload tool (toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload/0.3.2).

Hands-on: Add Webin credentials to your Galaxy user information

  1. If you have not already done so, log in to usegalaxy.eu
  2. Navigate to “User” > “Preferences” on the top menu
    • Click on Manage Information
    • Scroll down to “Your ENA Webin account details” and fill in your ENA Webin ID and Password
Figure 1: ENA Webin Account details

Option 1: submitting to ENA using interactive metadata generator

In this first example, you will submit one ONT sequence file using the interactive metadata forms of the ENA Upload tool (toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload/0.3.2). This method is only convenient for small submissions. For bulk submissions, we recommend the metadata template described below in Option 2.

Hands-on: Data upload

  1. Upload the ONT data from Zenodo via URLs

    • Copy the link location
    • Open the Galaxy Upload Manager (the upload icon at the top right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    • By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

    The URL for our example data is this:

    https://zenodo.org/record/5176347/files/SRR10902284_ONT.fq.gz
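
Because Galaxy names a fetched dataset after its URL, it can help to decide on a shorter name before renaming. The sketch below is only an illustration and not part of Galaxy: `dataset_name_from_url` is a hypothetical helper that keeps the last component of the URL path.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def dataset_name_from_url(url: str) -> str:
    """Return the file name at the end of a URL path (hypothetical helper)."""
    return PurePosixPath(urlparse(url).path).name


print(dataset_name_from_url(
    "https://zenodo.org/record/5176347/files/SRR10902284_ONT.fq.gz"
))
# prints: SRR10902284_ONT.fq.gz
```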
    

Once the data is uploaded, we fill in the metadata using the ENA Upload tool (toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload/0.3.2). The interactive metadata forms are nested to reflect ENA’s metadata model. Briefly, you add Samples to a Study, Experiments to Samples, and Runs to Experiments.
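
The nesting can be pictured as a small tree of objects. The following is a minimal sketch of that hierarchy, assuming simplified classes; the class and field names are illustrative and are not the tool's actual form fields or ENA's schema, and the study title is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    files: List[str]


@dataclass
class Experiment:
    platform: str
    library_layout: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Sample:
    species: str
    tax_id: int
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class Study:
    title: str
    samples: List[Sample] = field(default_factory=list)


def count_runs(study: Study) -> int:
    """Walk Study -> Samples -> Experiments and count the Runs."""
    return sum(
        len(experiment.runs)
        for sample in study.samples
        for experiment in sample.experiments
    )


# Build the tree from the bottom up: Run -> Experiment -> Sample -> Study.
experiment = Experiment(platform="Oxford Nanopore", library_layout="SINGLE")
experiment.runs.append(Run(files=["SRR10902284_ONT.fq.gz"]))

sample = Sample(
    species="Severe acute respiratory syndrome coronavirus 2",
    tax_id=2697049,
)
sample.experiments.append(experiment)

study = Study(title="Placeholder study title")
study.samples.append(sample)

print(count_runs(study))  # prints: 1
```

Each level only holds a list of the level below it, which mirrors how the forms ask you to insert Samples, then Experiments, then Runs.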

We recommend always submitting to the test server before submitting to the public one. Once you have confirmed that all the data and metadata look correct, you can go ahead and submit to the public ENA server.

Hands-on: Add metadata interactively and submit a single sequence to ENA

  1. ENA Upload tool (toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload/0.3.2):
    • “Submit to test ENA server?”: yes
    • “Would you like to submit pregenerated table files or interactively define the input structures?”: Interactive generation of the study structure
    • “Does your submission contains viral samples?”: yes
  2. Fill all metadata boxes and make sure that:
    • “Enter the species of the sample”: Severe acute respiratory syndrome coronavirus 2
    • “Enter the taxonomic ID corresponding to the sample species”: 2697049
    • “Host subject id”: avoid using an ID that can be used to trace samples back to patients
    • “Host scientific name”: Homo sapiens
    • “Library layout”: SINGLE
    • “Select the sequencing platform used”: Oxford Nanopore
    • “Instrument model”: MinION
    • “Runs executed within this experiment”
      • File(s) associated with this run: select the uploaded ONT fastq.gz file
    • “Affiliation center”: your institution

Warning: Submit to the test server first

Make sure “Submit to test ENA server?” is set to yes. Otherwise, your data will be submitted to the public server.
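
Conceptually, this switch decides which ENA service receives the submission. The sketch below illustrates the idea of defaulting to the test service. The two URLs are given as ENA's programmatic submission endpoints to the best of our knowledge and should be checked against ENA's documentation. How the Galaxy tool selects them internally is an assumption, and `submission_url` is a hypothetical helper.

```python
# ENA programmatic submission endpoints; treat them as assumptions and
# confirm them against ENA's own documentation before relying on them.
ENA_TEST_URL = "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"
ENA_PUBLIC_URL = "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"


def submission_url(submit_to_test: bool = True) -> str:
    """Pick the target service; defaults to the test server (hypothetical helper)."""
    return ENA_TEST_URL if submit_to_test else ENA_PUBLIC_URL


print(submission_url())       # test server
print(submission_url(False))  # public server
```

Defaulting to the test service means that forgetting the argument cannot lead to an accidental public submission.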

Four metadata tables (Study, Sample, Experiment and Run) and a metadata ticket with submission information are generated. You can confirm a successful submission on the ENA test server (or on the public server, if you chose it).

Option 2: submitting to ENA using the metadata template

For larger submissions, interactive metadata input can be tedious and impractical. In this second example, you will submit two sets of Illumina PE sequence files and enter the metadata using a template spreadsheet. For this exercise, we provide a pre-filled template and encourage you to explore it.
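
One way to explore a template outside Galaxy is to list its sheets. The sketch below uses only the Python standard library and relies on an .xlsx file being a zip archive whose sheet index conventionally lives in `xl/workbook.xml`; that path is an assumption that holds for typical files. `list_sheet_names` is a hypothetical helper, and the demo workbook with its placeholder sheet names is a stand-in so the example runs without downloading anything.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main workbook part of an .xlsx (Office Open XML) file.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def list_sheet_names(xlsx_file):
    """Return the sheet names of an .xlsx given a path or a binary file object."""
    with zipfile.ZipFile(xlsx_file) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(f"{{{MAIN_NS}}}sheet")]


# Stand-in workbook with two placeholder sheets; it contains only the sheet
# index and is not a complete spreadsheet.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "xl/workbook.xml",
        f'<workbook xmlns="{MAIN_NS}"><sheets>'
        '<sheet name="placeholder_a" sheetId="1"/>'
        '<sheet name="placeholder_b" sheetId="2"/>'
        "</sheets></workbook>",
    )
demo.seek(0)
demo_names = list_sheet_names(demo)
print(demo_names)  # prints: ['placeholder_a', 'placeholder_b']
```

For the real template, pass the path of the downloaded file instead, for example `list_sheet_names("GTN_tutorial_mock_metadata_template.xlsx")`.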

Hands-on: Upload and inspect data

  1. Upload the metadata template and the Illumina data from Zenodo via URLs:

    https://zenodo.org/record/5176347/files/GTN_tutorial_mock_metadata_template.xlsx
    https://zenodo.org/record/5176347/files/SRR10903401_1.fastq.gz
    https://zenodo.org/record/5176347/files/SRR10903401_2.fastq.gz
    https://zenodo.org/record/5176347/files/SRR10903402_1.fastq.gz
    https://zenodo.org/record/5176347/files/SRR10903402_2.fastq.gz
    
  2. Arrange the data into a paired dataset collection

    Tip: Creating a paired collection

    • Click on Operations on multiple datasets (check box icon) at the top of the history panel
    • Check all the datasets in your history you would like to include
    • Click For all selected.. and choose Build List of Dataset Pairs

    • Change the text of unpaired forward to a common selector for the forward reads
    • Change the text of unpaired reverse to a common selector for the reverse reads
    • Click Pair these datasets for each valid forward and reverse pair.
    • Enter a name for your collection
    • Click Create List to build your collection
    • Click on the checkmark icon at the top of your history again

    For the example datasets this means:

    • You need to tell Galaxy about the suffix for your forward and reverse reads, respectively:
      • set the text of unpaired forward to: _1.fastq.gz
      • set the text of unpaired reverse to: _2.fastq.gz
      • click: Auto-pair

      All datasets should now be moved to the paired section of the dialog, and the middle column there should show that only the sample accession numbers, i.e. SRR10903401 and SRR10903402, will be used as the pair names.

    • Make sure Hide original elements is checked to obtain a cleaned-up history after building the collection.
    • Click Create Collection
  3. Inspect the GTN_tutorial_mock_metadata_template.xlsx (filled-in template) file by clicking on the eye icon

    Questions

    1. How many metadata sheets are there?
    2. Which metadata section is different from the corresponding section in the interactive metadata input?

    Solution

    1. There are four metadata sheets, one per metadata object (Study, Sample, Experiment, Run)
    2. The Sample section is more extensive in the template spreadsheet, because it contains Mandatory, Recommended and Optional fields, whereas the interactive metadata Sample form contains only Mandatory ones.
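
The suffix-based pairing from step 2 of this hands-on can be sketched in a few lines. This is a minimal illustration of the idea, not Galaxy's implementation: `pair_reads` is a hypothetical helper that strips the forward and reverse suffixes and keeps only identifiers for which both mates are present.

```python
from typing import Dict, Iterable, Tuple

FORWARD_SUFFIX = "_1.fastq.gz"
REVERSE_SUFFIX = "_2.fastq.gz"


def pair_reads(names: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group file names into (forward, reverse) pairs keyed by the shared prefix."""
    forward = {}
    reverse = {}
    for name in names:
        if name.endswith(FORWARD_SUFFIX):
            forward[name[: -len(FORWARD_SUFFIX)]] = name
        elif name.endswith(REVERSE_SUFFIX):
            reverse[name[: -len(REVERSE_SUFFIX)]] = name
    # Keep only identifiers that have both a forward and a reverse file.
    return {
        key: (forward[key], reverse[key])
        for key in sorted(forward)
        if key in reverse
    }


pairs = pair_reads([
    "SRR10903401_1.fastq.gz",
    "SRR10903401_2.fastq.gz",
    "SRR10903402_1.fastq.gz",
    "SRR10903402_2.fastq.gz",
])
print(list(pairs))  # prints: ['SRR10903401', 'SRR10903402']
```

The resulting keys correspond to the pair names shown in the middle column of the pairing dialog.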

As before, submit to the test server first and only then to the public one.

Hands-on: Submit paired-end reads to ENA using the metadata template

  1. ENA Upload tool (toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload/0.3.2):
    • “Submit to test ENA server?”: yes
    • “Would you like to submit pregenerated table files or interactively define the input structures?”: User generated metadata tables based on Excel template
    • “Does your submission contains viral samples?”: yes
    • “Select Excel (xlsx) file based on templates”: select the uploaded .xlsx template file
    • “Select runs input format”: Input from a paired collection
      • “List of paired-end runs files”: select the PE collection containing the PE sequencing reads
    • “Affiliation center”: your institution

Warning: Submit to the test server first

Make sure “Submit to test ENA server?” is set to yes. Otherwise, your data will be submitted to the public server.

Four metadata tables (Study, Sample, Experiment and Run) and a metadata ticket with submission information are generated. You can confirm a successful submission on the ENA test server (or on the public server, if you chose it).

Key points

  • Use Galaxy’s ‘ENA Upload tool’ to submit raw SARS-CoV-2 reads to ENA

  • You need to include your ENA Webin credentials in Galaxy

  • For small submissions, use the interactive metadata forms of the ‘ENA Upload tool’

  • For bulk submissions, use a spreadsheet metadata template

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Using Galaxy and Managing your Data topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.

References

  1. Arita, M., I. Karsch-Mizrachi, and G. Cochrane, 2020 The international nucleotide sequence database collaboration. Nucleic Acids Research 49: D121–D124. 10.1093/nar/gkaa967
  2. Maier, W., S. Bray, M. van den Beek, D. Bouvier, N. Coraor et al., 2021 Freely accessible ready to use global infrastructure for SARS-CoV-2 monitoring. 10.1101/2021.03.25.437046
  3. Roncoroni, M., B. Droesbeke, I. Eguinoa, K. D. Ruyck, F. D’Anna et al., 2021 A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive (Z. Lu, Ed.). Bioinformatics. 10.1093/bioinformatics/btab421

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.


Citing this Tutorial

  1. Miguel Roncoroni, 2021 Submitting raw sequencing reads to ENA (Galaxy Training Materials). https://training.galaxyproject.org/archive/2021-09-01/topics/galaxy-interface/tutorials/upload-data-to-ena/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

BibTeX

@misc{galaxy-interface-upload-data-to-ena,
author = "Miguel Roncoroni",
title = "Submitting raw sequencing reads to ENA (Galaxy Training Materials)",
year = "2021",
month = "08",
day = "10",
url = "\url{https://training.galaxyproject.org/archive/2021-09-01/topics/galaxy-interface/tutorials/upload-data-to-ena/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                

Congratulations on successfully completing this tutorial!