From NCBI's Sequence Read Archive (SRA) to Galaxy: SARS-CoV-2 variant analysis

Overview

question Questions
  • Learn how to get and use data from the Sequence Read Archive in Galaxy.

objectives Objectives
  • Understand how Galaxy and the Sequence Read Archive interact.

  • Be able to go from Galaxy to the Short Reach Archive, query SRA, use the SRA Run Selector to send selected metadata to Galaxy, and then import sequence data from SRA into Galaxy.

requirements Requirements

time Time estimation: 45 minutes

Supporting Materials

last_modification Last modification: Dec 1, 2020

Introduction

The aim of this tutorial is to learn how to Galaxy and NCBI’s Sequence Read Archive interact (SRA) with each other. This tutorial uses a COVID-19 variant calling example, but it isn’t about variant calling per se.

SRA, like many data sources is Galaxy aware. It has support for sending information directly to Galaxy, and it tracks which Galaxy instance it was invoked from. Getting sequence data from SRA is a multi-step process. This tutorial explains each step in the process and then demonstrates a particular example of how to use SRA data in Galaxy.

At the completion of this tutorial you know

  • how to go from Galaxy to SRA.
  • a basic understanding of how to select data in SRA
  • how to send SRA metadata (such as accession numbers) to Galaxy
  • how to use that SRA metadata to import sequence data from SRA into Galaxy.
  • how to run a simple variant analysis in Galaxy using that data

Agenda

In this tutorial, we will cover:

  1. The Sequence Read Archive
    1. Accessing SRA
    2. SRA Run Selector
    3. Back in Galaxy
    4. Download sequencing data with Faster Download and Extract Reads in FASTQ
  2. Now what?
  3. Variation Analysis of SARS-Cov-2 sequencing data
  4. Get the reference genome data
    1. Adapter trimming with fastp
    2. Alignment with Map with BWA-MEM
    3. Remove duplicates with MarkDuplicates
    4. Generate alignment statistics with Samtools stats
    5. Realign reads with lofreq viterbi
    6. Add indel qualities with lofreq Insert indel qualities
    7. Call Variants using lofreq Call variants
    8. Annotate variant effects with SnpEff eff:
    9. Create table of variants using SnpSift Extract Fields
    10. Summarize data with MultiQC

The Sequence Read Archive

The Sequence Read Archive (SRA) is the primary archive of unassembled reads for the US National Institutes of Health (NIH). SRA is a great place to get the sequencing data that underlie publications and studies.

This tutorial covers how to get sequence data from SRA into Galaxy using a direct connection between the two.

comment Comment

You will also hear SRA referred to as the Short Read Archive, its original name.

Accessing SRA

SRA can be reached either directly through it’s website, or through the tool panel on Galaxy.

comment Comment

Initially the tool panel option exists only on the usegalaxy.org server. Support for the direct connection to SRA will be included in the 20.05 release of Galaxy

hands_on Hands-on: Explore SRA Entrez

  1. Go to usegalaxy.org
  2. If your history is not already empty, than start a new history (see here for more on Galaxy histories)
  3. Click Get Data at the top of the tool panel.
  4. Click SRA Server in the list of tools shown under Get Data. This takes you the Sequence Read Archive home page – you can also start directly from the SRA. A search box is shown at the top of the page. Try searching for something you are interested in, such as dolphin or kidney or dolphin kidney and then click the Search button.

    This returns a list of SRA Experiments that match your search string. SRA Experiments, also know as SRX entries, contain sequence data from a particular experiment, as well as an explanation of the experiment itself and any other related data. You can explore the returned experiments by clicking on their name. See Understanding SRA Search Results in the SRA Knowledge Base for more.

    When you enter text in the SRA search box, you are using SRA’s Entrez search interface. Entrez supports both simple text searches, and very precise searches that check specific metadata and use arbitrarily complex logical expressions. Entrez allows you to scale up your searches from basic to advanced as you narrow your searches. The syntax of advanced searches can seem daunting, but SRA provides a graphical Advanced Search Builder to generate the specific syntax. And, as we shall see below, the SRA Run Selector provides an even friendlier user interface for narrowing our selected data.

    Play around with the SRA Entrez interface, including the advanced query builder, to see if you can identify a set of SRA experiments that are relevant to one of your research areas.

hands_on Hands-on: Generate list of matching experiments using Entrez

Now that you have a basic familiarity with SRA Entrez, let’s find the sequences used in this tutorial.

  1. If you aren’t already there, navigate back to the Sequence Read Archive search page
  2. Clear any search text from the search box.
  3. Type sars-cov-2 in the search box and click Search. This returns a longish list of SRA experiments that match our search, and that list is far too long to use in a tutorial exercise. At this point we could use the advanced Entrez query builder we learned about above. But we won’t. Instead lets send the too long for a tutorial list results we have to the SRA Run Selector, and use its friendlier interface to narrow our results.

sra entrez

hands_on Hands-on: Go from Entrez to SRA Run Selector

View results as an expanded interactive table using the RunSelector.

  1. Click Send results to Run selector, which appears in a box at the top of the search results.

sra entrez result

You may have noticed this text earlier when you were exploring Entrez search. This text only appears some of the time, when the number of search results falls within a fairly broad window. You won’t see it if you only have a few results, and you won’t see it if you have more results than the Run Selector can accept.

You need to get to Run Selector to send your results to Galaxy. What if you don’t have enough results to trigger this link being shown? In that case you call get to the Run Selector by clicking on the Send to pulldown menu at the top right of the results panel. To get to Run Selector, select Run Selector and then click the Go button. sra entrez send to

  1. Click Send results to Run selector at the top of the search results panel. (If you don’t see this link, then see the comment directly above.)

SRA Run Selector

We learned earlier how to narrow our search results by using Entrez’s advanced syntax. However, we didn’t take advantage of that power when we were in Entrez. Instead we used a simple search and then sent all the results to the Run Selector. We don’t yet have the (short) list of results we want to run analysis on. What are we doing?

We are using Entrez and the Run Selector how they are designed to be used:

  • Use the Entrez interface to narrow your results down to a size that the Run Selector can consume.
  • Send those Entrez results to the SRA Run Selector
  • Use the Run Selector’s much friendlier interface to
    1. More easily understand the data we have
    2. Narrow those results using that knowledge.

comment Run Selector is both more and less than Entrez

Run Selector can do most, but not all of what Entrez search syntax can do. Run selector uses faceted search technology which is easy to use, and powerful, but which has inherent limits. Specifically, Entrez will work better when searching on attributes that have tens, hundreds, or thousands of different values. Run Selector will work better searching attributes with fewer than 20 different values. Fortunately, that describes most searches.

The Run Selector window is divided into several panels:

  • Filters List: In the upper left hand corner. This is where we will refine our search.
  • Select: A summary of what was initially passed to Run Selector, and how much of that we have selected so far. (And so far, we haven’t selected any of it.) Also note the tantalizing, but still grayed out, Galaxy button.
  • Found x Items Initially, this is the list of items sent to Run Selector from Entrez. This list will shrink as we apply filters to it. sra run selector

comment Why did the number of found items go up?

Recall that the Entrez interface lists SRA experiments (SRX entries). Run Selector lists runs — sequencing datasets — and there are one or more runs per experiment. We have the same data as before, we are now just seeing it in finer detail.

The Filters List in the upper left shows columns in our results that have either continuous numerical values, or 10 or less (you can change this number) distinct values in them. Scroll down through the list select a few of the filters. When a filter is selected, a values box appears below, listing options for this filter, and the number of runs with each option. These values / options are pulled from the dataset metadata. Try selecting a few interesting sounding filters and then select one or more options for each filter. Try unselecting options and filters. As you do this, the number of found results will decrease or increase.

tip Tip: Use Filters to better understand the data

Filters are how you narrow the datasets under consideration for sending to Galaxy, but they are also an excellent way to understand your data: First, selecting a filter is an easy way to see the range of values in a column. You may not be able to find documentation on what the sirs_outcome column means, but you can possibly figure it out by seeing what values are in it. Second, you can explore how different columns relate to each other. Is there a relationship between sirs_outcome values and disease_stage values?

hands_on Hands-on: Narrow your results using Run Selector

  1. If you have any filters turned on, unselect them. Once you have done this, there won’t be any values boxes appearing below the Filters List.
  2. Copy and paste this search string into the Found Items search box.
      SRR11772204 OR SRR11597145 OR SRR11667145
    

    This hand-picked set of runs limits our results to 3 runs from different geographic distribution.

This reduces your Found Items list from tens of thousands of runs to 3 runs (a manageable number for a tutorial!). But we aren’t quite done with Run Selector yet. Note that the Galaxy button is still grayed out. We have narrowed our options, but we haven’t actually selected anything to send to Galaxy yet.

It’s possible to select every remaining run by clicking the checkmark at the top of the first column. You can unselect everything by clicking the X.

hands_on Hands-on: Select runs and send to Galaxy

  1. Select all runs by clicking the X. And now, the Galaxy button is live.
  2. Click the Galaxy button in the Select section at the top of the page.

Back in Galaxy

When we click Galaxy in Run Selector several things happen. First, it launches a new browser tab or window which opens in Galaxy. You will see the big green box indicating that the handshake between SRA and Galaxy was successful and you will then see a new SRA job in your history panel. This box may start out as gray / pending, indicating that the transfer has not yet started, or it may go straight to yellow / running or to green / done.

hands_on Hands-on: Examine the new SRA Dataset

  1. Once the SRA transfer is complete, click on the dataset’s galaxy-eye (eye) icon.

    This displays the dataset in Galaxy’s center panel.

The SRA dataset is not sequence data, but rather metadata that we will use to get sequence data from SRA. This metadata mirrors the information we saw in the Run Selector’s Found Items section. The metadata is not the end data that we are seeking from SRA, but having all that metadata is often useful in subsequent analysis steps.

Lets now use that metadata to fetch the sequence data from SRA. SRA provides tools for extracting all sorts of information, including the sequence data itself. The Galaxy Tool Faster Download and Extract Reads in FASTQ is based on the SRA fasterq-dump utility, and does just that.

Download sequencing data with Faster Download and Extract Reads in FASTQ

hands_on Hands-on: Task description

  1. Faster Download and Extract Reads in FASTQ tool with the following parameters:
    • “select input type”: List of SRA accession, one per line
      • param-file “sra accession list”: SRA (The SRA dataset that we retrieved in the previous step
    • Click the Execute button. This will run the tool, which retrieves the sequence read datasets for the runs that were listed in the SRA dataset.

    tip Tip: Finding tools

    Galaxy may have an overwhelming amount of tools installed. To find a specific tool type the tool name, in this case Faster Download and Extract Reads in FASTQ, in the tool panel search box to find the tool.

  2. Several entries are created in your history panel when you submit this job:
    • Pair-end data (fasterq-dump): Contains Paired-end datasets (if available)
    • Single-end data (fasterq-dump) Contains Single-end datasets (if available)
    • Other data (fasterq-dump) Contains Unpaired datasets (if available)
    • fasterq-dump log Contains Information about the tool execution

The first three items are actually collections of datasets. Collections in Galaxy are logical groupings of datasets that reflect the semantic relationships between them in the experiment / analysis. In this case the tool creates a separate collection each for paired-end reads, single reads, and other. See the Collections tutorials for more.

Explore the collections by first clicking on the collection name in the history panel. This takes you inside the collection and shows you the datasets in it. You can then navigate back to the outer level of your history.

Once fasterq finishes transferring data (all boxes are green / done), we are ready to analyze it.

Now what?

You can now analyze the retrieved data using any sequence analysis tools and workflows in Galaxy. SRA holds backing data for every imaginable type of *-seq experiment.

If you ran this tutorial, but retrieved datasets that you were interested in, then see the rest of the GTN library for ideas on how to analyze in Galaxy.

However, if you retrieved the datasets used in this tutorial’s examples above, then you are ready to run the SARS-CoV-2 variant analysis below.

Variation Analysis of SARS-Cov-2 sequencing data

In this part of the tutorial we will perform variant calling and basic analysis of the datasets downloaded above. We will start by downloading the Wuhan-Hu-1 SARS-CoV-2 reference sequence, then run adapter trimming, alignment and variant calling and finally look at the geographic distribution of some of the found variants.

comment The usegalaxy.* COVID-19 analysis project

This tutorial uses a subset of the data and runs through the Variation Analysis section of covid19.galaxyproject.org. The data for covid19.galaxyproject.org is being updated continuously as new datasets are made public.

Get the reference genome data

The reference genome data for today is for SARS-CoV-2, “Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome”, having the accession ID of NC_045512.2.

This data is available from Zenodo using the following link.

hands_on Hands-on: Get the reference genome data

  1. Import the following file into your history:

    https://zenodo.org/record/3906454/files/NC_045512.2.fasta
    
  • Copy the link location
  • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

  • Select Paste/Fetch Data
  • Paste the link into the text field

  • Press Start

  • Close the window

By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

Adapter trimming with fastp

Removing sequencing adapters improves alignments and variant calling. fastp tool can automatically detect widely used sequencing adapters.

hands_on Hands-on: Task description

  1. fastp tool with the following parameters:
    • “Single-end or paired reads: Paired Collection
      • param-file “Select paired collection(s)”: list_paired (output of Faster Download and Extract Reads in FASTQ tool)
    • In “Output Options”:
      • “Output JSON report”: Yes

Alignment with Map with BWA-MEM

BWA-MEM tool is a widely used sequence aligner for short-read sequencing datasets such as those we are analysing in this tutorial.

hands_on Hands-on: Align sequencing reads to reference genome

  1. Map with BWA-MEM tool with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: output (Input dataset)
    • “Single or Paired-end reads: Paired Collection
      • param-file “Select a paired collection”: output_paired_coll (output of fastp tool)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode

Remove duplicates with MarkDuplicates

MarkDuplicates tool removes duplicate sequences originating from library preparation artifacts and sequencing artifacts. It is important to remove these artefactual sequences to avoid artificial overrepresentation of single molecule.

hands_on Hands-on: Remove PCR duplicates

  1. MarkDuplicates tool with the following parameters:
    • param-file “Select SAM/BAM dataset or dataset collection”: bam_output (output of Map with BWA-MEM tool)
    • “If true do not write duplicates to the output file instead of writing them with appropriate flags set”: Yes

Generate alignment statistics with Samtools stats

After the duplicate marking step above we can generate statistic about the alignment we have generated.

hands_on Hands-on: Generate alignment statistics

  1. Samtools stats tool with the following parameters:
    • param-file “BAM file”: outFile (output of MarkDuplicates tool)
    • “Set coverage distribution”: No
    • “Output”: One single summary file
    • “Filter by SAM flags”: Do not filter
    • “Use a reference sequence”: No
    • “Filter by regions”: No

Realign reads with lofreq viterbi

Realign reads tool corrects misalignments around insertions and deletions. This is required in order to accurately detect variants.

hands_on Hands-on: Realign reads around indels

  1. Realign reads with lofreq tool with the following parameters:
    • param-file Reads to realign”: outFile (output of MarkDuplicates tool)
    • “Choose the source for the reference genome”: History
      • param-file “Reference”: output (Input dataset)
    • In “Advanced options”:
      • “How to handle base qualities of 2?”: Keep unchanged

Add indel qualities with lofreq Insert indel qualities

This step adds indel qualities into our alignment file. This is necessary in order to call variants using Call variants with lofreq tool

hands_on Hands-on: Add indel qualities

  1. Insert indel qualities with lofreq tool with the following parameters:
    • param-file Reads: realigned (output of Realign reads tool)
    • “Indel calculation approach”: Dindel
      • “Choose the source for the reference genome”: History
        • param-file “Reference”: output (Input dataset)

Call Variants using lofreq Call variants

We are now ready to call variants.

hands_on Hands-on: Call variants

  1. Call variants with lofreq tool with the following parameters:
    • param-file “Input reads in BAM format”: output (output of Insert indel qualities tool)
    • “Choose the source for the reference genome”: History
      • param-file “Reference”: output (Input dataset)
    • “Call variants across”: Whole reference
    • “Types of variants to call”: SNVs and indels
    • “Variant calling parameters”: Configure settings
      • In “Coverage”:
        • “Minimal coverage”: 50
      • In “Base-calling quality”:
        • “Minimum baseQ”: 30
        • “Minimum baseQ for alternate bases”: 30
      • In Mapping quality”:
        • “Minimum mapping quality”: 20
    • “Variant filter parameters”: Preset filtering on QUAL score + coverage + strand bias (lofreq call default)

The output of this step is a collection of VCF files that can be visualized in a genome browser.

Annotate variant effects with SnpEff eff:

We will now annotate the variants we called in the previous step with the effect they have on the SARS-CoV-2 genome.

hands_on Hands-on: Annotate variant effects

  1. SnpEff eff: tool with the following parameters:
    • param-file “Sequence changes (SNPs, MNPs, InDels)”: variants (output of Call variants tool)
    • “Output format”: VCF (only if input is VCF)
    • “Create CSV report, useful for downstream analysis (-csvStats)”: Yes
    • “Annotation options”: ``
    • “Filter output”: ``
    • “Filter out specific Effects”: No

The output of this step is a VCF file with added variant effects.

Create table of variants using SnpSift Extract Fields

We will now select various effects from the VCF and create a tabular file that is easier to understand for humans.

hands_on Hands-on: Create table of variants

  1. SnpSift Extract Fields tool with the following parameters:
    • param-file “Variant input file in VCF format”: snpeff_output (output of SnpEff eff: tool)
    • “Fields to extract”: CHROM POS REF ALT QUAL DP AF SB DP4 EFF[*].IMPACT EFF[*].FUNCLASS EFF[*].EFFECT EFF[*].GENE EFF[*].CODON
    • “multiple field separator”: ,
    • “empty field text”: .

We can inspect the output files and see check if Variants in this file are also described in an observable notebook that shows the geographic distribution of SARS-CoV-2 variant sequences

Interesting variants include the C to T variant at position 14408 (14408C/T) in SRR11772204, 28144T/C in SRR11597145 and 25563G/T in SRR11667145.

Summarize data with MultiQC

We will now summarize our analysis with MultiQC, which generates a beautiful report for our data.

hands_on Hands-on: Summarize data

  1. MultiQC tool with the following parameters:
    • In “Results”:
      • param-repeat “Insert Results”
        • “Which tool was used generate logs?”: fastp
          • param-file “Output of fastp”: report_json (output of fastp tool)
      • param-repeat “Insert Results”
        • “Which tool was used generate logs?”: Samtools
          • In “Samtools output”:
            • param-repeat “Insert Samtools output”
              • “Type of Samtools output?”: stats
                • param-file “Samtools stats output”: output (output of Samtools stats tool)
      • param-repeat “Insert Results”
        • “Which tool was used generate logs?”: Picard
          • In “Picard output”:
            • param-repeat “Insert Picard output”
              • “Type of Picard output?”: Markdups
              • param-file “Picard output”: metrics_file (output of MarkDuplicates tool)
      • param-repeat “Insert Results”
        • “Which tool was used generate logs?”: SnpEff
          • param-file “Output of SnpEff”: csvFile (output of SnpEff eff: tool)

Conclusion

Congratulations, you now know how to import sequence data from the SRA and how to run an example analysis on these datasets.

keypoints Key points

  • Sequence data in the SRA can be directly imported into Galaxy

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

Click here to load Google feedback frame

Citing this Tutorial

  1. Marius van den Beek, Dave Clements, Daniel Blankenberg, 2020 From NCBI's Sequence Read Archive (SRA) to Galaxy: SARS-CoV-2 variant analysis (Galaxy Training Materials). /archive/2020-12-01/topics/variant-analysis/tutorials/sars-cov-2/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

details BibTeX

@misc{variant-analysis-sars-cov-2,
    author = "Marius van den Beek and Dave Clements and Daniel Blankenberg",
    title = "From NCBI's Sequence Read Archive (SRA) to Galaxy: SARS-CoV-2 variant analysis (Galaxy Training Materials)",
    year = "2020",
    month = "12",
    day = "01"
    url = "\url{/archive/2020-12-01/topics/variant-analysis/tutorials/sars-cov-2/tutorial.html}",
    note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
        doi = {10.1016/j.cels.2018.05.012},
        url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
        year = 2018,
        month = {jun},
        publisher = {Elsevier {BV}},
        volume = {6},
        number = {6},
        pages = {752--758.e1},
        author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
        title = {Community-Driven Data Analysis Training for Biology},
        journal = {Cell Systems}
}
                    

congratulations Congratulations on successfully completing this tutorial!