Genome Assembly of MRSA from Oxford Nanopore MinION data (and optionally Illumina data)

Overview
Creative Commons License: CC-BY Questions:
  • How to check the quality of the MinION data (together with Illumina data)?

  • How to perform an assembly of a bacterial genome with MinION data?

  • How to check the quality of an assembly?

Objectives:
  • Run tools to evaluate sequencing data on quality and quantity

  • Process the output of quality control tools

  • Improve the quality of sequencing data

  • Run a tool to assemble a bacterial genome using short reads

  • Run tools to assess the quality of an assembly

  • Understand the outputs of tools to assess the quality of an assembly

Requirements:
Time estimation: 2 hours
Level: Introductory Introductory
Supporting Materials:
Published: Mar 24, 2021
Last modification: Mar 13, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00037
rating Rating: 4.0 (0 recent ratings, 2 all time)
version Revision: 14

Sequencing (determining of DNA/RNA nucleotide sequence) is used all over the world for all kinds of analysis. The product of these sequencers are reads, which are sequences of detected nucleotides. Depending on the technique these have specific lengths (30-500bp) or using Oxford Nanopore Technologies sequencing have much longer variable lengths.

Comment: Nanopore sequencing

Nanopore sequencing has several properties that make it well-suited for our purposes

  1. Long-read sequencing technology offers simplified and less ambiguous genome assembly
  2. Long-read sequencing gives the ability to span repetitive genomic regions
  3. Long-read sequencing makes it possible to identify large structural variations

How nanopore sequencing works

When using Oxford Nanopore Technologies (ONT) sequencing, the change in electrical current is measured over the membrane of a flow cell. When nucleotides pass the pores in the flow cell the current change is translated (basecalled) to nucleotides by a basecaller. A schematic overview is given in the picture above.

When sequencing using a MinIT or MinION Mk1C, the basecalling software is present on the devices. With basecalling the electrical signals are translated to bases (A,T,G,C) with a quality score per base. The sequenced DNA strand will be basecalled and this will form one read. Multiple reads will be stored in a fastq file.

In this training we will build an assembly of a bacterial genome, from data produced in “Complete Genome Sequences of Eight Methicillin-Resistant Staphylococcus aureus Strains Isolated from Patients in Japan” Hikichi et al. 2019:

Methicillin-resistant Staphylococcus aureus (MRSA) is a major pathogen causing nosocomial infections, and the clinical manifestations of MRSA range from asymptomatic colonization of the nasal mucosa to soft tissue infection to fulminant invasive disease. Here, we report the complete genome sequences of eight MRSA strains isolated from patients in Japan.

Agenda

In this tutorial, we will cover:

  1. Galaxy and data preparation
  2. Quality Control
  3. Assembly
    1. Assembly Evaluation
    2. Assembly Polishing
  4. Conclusion

Galaxy and data preparation

Any analysis should get its own Galaxy history. So let’s start by creating a new one and get the data into it.

Hands-on: History creation
  1. Create a new history for this analysis

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Rename the history

    1. Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
    2. Type the new name
    3. Click on Save
    4. To cancel renaming, click the galaxy-undo “Cancel” button

    If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:

    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter

Now, we need to import the data: 1 FASTQ file containing the reads from the sequencer.

Hands-on: Data upload
  1. Import the files from Zenodo or from the shared data library

    https://zenodo.org/record/10669812/files/DRR187567.fastq.bz2
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  2. Rename the dataset to keep only the sequence run ID (DRR187567)

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field to DRR187567
    • Click the Save button

  3. Tag the dataset #unfiltered

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

  4. View galaxy-eye the renamed file

The dataset is a FASTQ file.

Question
  1. What are the 4 main features of each read in a fastq file.
  2. What is the name of your first read?
  1. The following:

    • A @ followed by a name and sometimes information of the read
    • A nucleotide sequence
    • A + (optional followed by the name)
    • The quality score per base of nucleotide sequence (Each symbol represents a quality score, which will be explained later)
  2. DRR187567.1

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

Do you have associated Illumina MiSeq data?

Hands-on: Illumina Data upload
  1. Import the files from Zenodo or from the shared data library

    https://zenodo.org/record/10669812/files/DRR187559_1.fastqsanger.bz2
    https://zenodo.org/record/10669812/files/DRR187559_2.fastqsanger.bz2
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  2. Rename the datasets to remove .fastqsanger.bz2 and keep only the sequence run ID (DRR187559_1 and DRR187559_2)

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field to DRR187567
    • Click the Save button

Quality Control

During sequencing, errors are introduced, such as incorrect nucleotides being called. These are due to the technical limitations of each sequencing platform. Sequencing errors might bias the analysis and can lead to a misinterpretation of the data. Sequence quality control is therefore an essential first step in any analysis.

When assessing the fastq files all bases had their own quality (or Phred score) represented by symbols. You can read more in our dedicated Quality Control Tutorial.

To assess the quality by hand would be too much work. That’s why tools like NanoPlot or FastQC are made, which will generate a summary and plots of the data statistics. NanoPlot is mainly used for long-read data, like ONT and PACBIO and FastQC for any read.

Hands-on: Quality Control
  1. FastQC ( Galaxy version 0.74+galaxy0) with the following parameters:
    • param-files “Short read data from your current history”: DRR187567
  2. Inspect the webpage output

FastQC combines quality statistics from all separate reads and combines them in plots. An important plot is the Per base sequence quality.

FastQC plot showing reads that mostly stay in the read.

Here, we are going to trim the Illumina data using fastp (Chen et al. 2018):

  • Trim the start and end of the reads if those fall below a quality score of 20
  • Filter for reads to keep only reads with at least 30 bases: Anything shorter will complicate the assembly
Hands-on: Quality improvement of the Illumina reads
  1. fastp ( Galaxy version 0.23.2+galaxy0) with the following parameters:
    • “Single-end or paired reads”: Paired
      • param-file “Input 1”: DRR187559_1
      • param-file “Input 2”: DRR187559_2
    • In “Filter Options”:
      • In “Length filtering Options”:
        • Length required: 30
    • In “Read Modification Options”:
      • In “Per read cuitting by quality options”:
        • Cut by quality in front (5’): Yes
        • Cut by quality in front (3’): Yes
        • Cutting window size: 4
        • Cutting mean quality: 20
    • In “Output Options”:
      • “Output JSON report”: Yes

Depending on the analysis it could be possible that a certain quality or length is needed. The reads can be filtered using the Filtlong tool. In this training all reads below 1000bp will be filtered.

When Illumina reads are available, we can use them if they are good Illumina reads (high depth and complete coverage) as external reference. In this case, Filtlong ignores the Phred quality scores and instead judges read quality using k-mer matches to the reference (a more accurate gauge of quality).

Hands-on: Filtering
  1. filtlong ( Galaxy version 0.2.1+galaxy0) with the following parameters:
    • param-file “Input FASTQ”: DRR187567
    • In “Output thresholds”:
      • “Min. length”: 1000
    • In “External references”:
      • param-file “Reference Illumina read”: fastp Read 1 output
      • param-file “Reference Illumina read”: fastp Read 2 output
  2. Rename the dataset to DRR187567-filtered

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field to DRR187567-filtered
    • Click the Save button

The output can be evaluated using NanoPlot for plotting long read sequencing data and alignments

Hands-on: Filtering
  1. Convert the datatype of DRR187567 to uncompress it

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click on the galaxy-gear Convert tab on the top
    • In the upper part galaxy-gear Convert, select Convert compressed to uncompressed
    • Click the Create dataset button to start the conversion.

  2. NanoPlot ( Galaxy version 1.41.0+galaxy0) with the following parameters:

    • “Select multifile mode”: batch
      • “Type of the file(s) to work on”: fastq
        • param-files “files”: both DRR187567 uncompressed and DRR187567-filtered
    • In “Options for customizing the plots created”:
      • “Show the N50 mark in the read length histogram.”: Yes

We ran the NanoPlot two times: one for the raw reads (DRR187567) and one for the reads after filtering (DRR187567-filtered).

For each run, NanoPlot generates 5 outputs:

  • 2 plots:
  • 2 tabular files with statistics: one general and one after filtering.

    The second one is empty because we did not used NanoPlot filtering options.

  • A HTML report summarizing above information

    We can compare the two generated reports. Galaxy allows to view several datasets side-by-side using the Window Manager function

    Hands-on: Inspect NanoPlot reports
    1. Enable Window Manager

      If you would like to view two or more datasets at once, you can use the Window Manager feature in Galaxy:

      1. Click on the Window Manager icon galaxy-scratchbook on the top menu bar.
        • You should see a little checkmark on the icon now
      2. View galaxy-eye a dataset by clicking on the eye icon galaxy-eye to view the output
        • You should see the output in a window overlayed over Galaxy
        • You can resize this window by dragging the bottom-right corner
      3. Click outside the file to exit the Window Manager
      4. View galaxy-eye a second dataset from your history
        • You should now see a second window with the new dataset
        • This makes it easier to compare the two outputs
      5. Repeat this for as many files as you would like to compare
      6. You can turn off the Window Manager galaxy-scratchbook by clicking on the icon again

    2. Open both NanoPlot HTML Reports
    3. Check the Summary statistics section of each to compare the results
    Summary statistics Not Filtered Filtered (Filtlong) Change (%)
    Number of reads 91,288 69,906 -23.4%
    Number of bases 621,945,741.0 609,657,642.0 -2.0%
    Median read length 3,400.0 5,451.0 60.3%
    Mean read length 6,813.0 8,721.1 28.0%
    Read length N50 14,810.0 15,102.0 2.0%
    Mean read quality 9 9 0.0%
    Median read quality 8.9 9.0 1.1%
    Question
    1. What is the increase of your median read length?
    2. What is the decrease in the number of bases?
    3. What is coverage?
    4. What would be the coverage before and after trimming, based on a genome size of 2.9 Mbp?
    1. 3,400 bp to 5,451 bp, a 60.3% increase
    2. A -2.0% decrease is not a very significant decrease. Our data was quite good to start with and didn’t have many short reads which were removed (23.4%)
    3. The coverage is a measure of how many reads ‘cover’ on average a single base in the genome. If you divide the total reads by the genome size, you will get a number how many times your genomes could theoretically be “covered” by reads.
    4. Before \(\frac{621,945,741}{2,900,000} = 214.4\) and after \(\frac{609,657,642}{2,900,000} = 210.2\). This is not a very big decrease in coverage, so no cause for concern. Generally in sequencing experiments you have some minimum coverage you expect to see based on how much of your sample you sequenced. If it falls below that threshold it might be cause for concern.

While there is currently no community consensus over the best trimming or filtering practices with long read data, there are still some steps that can be beneficial to do for the assembly process. Porechop ( Galaxy version 0.2.3) is a commonly used tool for removing adapter sequences, and we used filtlong ( Galaxy version 0.2.0) for removing shorter reads which might make the assembly process more difficult.

Many people do not do any trimming of their NanoPore data based on the quality as it is expected that the quality is low, and often the focus is on assembling large Structural Variations (SVs) rather than having high quality reads and base-level variation analyses.

Assembly

When the quality of the reads is determined and the data is filtered (like we did with filtlong) and/or trimmed (like is more often done with short read data) an assembly can be made.

There are many tools that create assembly for long-read data, e.g. Canu (Koren et al. 2017), Raven (Vaser and Šikić 2021), Miniasm (Li 2016). In this tutorial, we use Flye (Lin et al. 2016). Flye is a de novo assembler for single molecule sequencing reads. It can be used from bacteria to human assemblies. The Flye assembly is based on finding overlapping reads with variable length with high error tolerance.

Comment: Results may vary

Your results may be slightly different from the ones presented in this tutorial due to differing versions of tools, reference data, external databases, or because of stochastic processes in the algorithms.

Hands-on: Assembly using Flye
  1. Flye assembly ( Galaxy version 2.9.1+galaxy0) with the following parameters:
    • param-file “Input reads”: DRR187567-filtered (output of filtlong tool)
    • “Mode”: Nanopore corrected (--nano-corr)
    • “Reduced contig assembly coverage”: Disable reduced coverage for initial disjointing assembly

Flye generates 4 outputs:

  • A FASTA file called consensus with contigs, i.e. the contiguous sequences made by combining separate reads in the assembly, and possibly scaffolds built by Flye

    Question

    How many contigs are there?

    There are 2 sequences in the consensus dataset

  • 2 assembly graph files: assembly graph and graphical fragment assembly

    We can visualize the assembly graph using the graphical fragment assembly with Bandage (Wick et al. 2015), a package for exploring assembly graphs through summary reports and visualizations of their contents.

    Hands-on: Assembly inspection
    1. Bandage Image ( Galaxy version 2022.09+galaxy4) with the following parameters:
      • param-file “Graphical Fragment Assembly”: graphical fragment assembly (output of Flye)

    Bandage output showing 2 contigs.

  • A table assembly info with assembly information

    It should be something similar but probably sligtly different because Flye can differ a bit per assembly:

    #seq_name length cov. circ. repeat mult. alt_group graph_path
    contig_1 2927029 183 Y N 2 * 1
    contig_2 60115 379 Y Y 2 * 2
    Question
    1. What is the coverage of your longest contig?
    2. What is the length of your longest contig?
    3. Could this contig potentially be a MRSA genome?

    While results may vary due to randomness in the assembly process, in our case we had:

    1. 183
    2. 2.9Mb
    3. 2.9Mb is approximately the size of a MRSA genome. So contig 1 could be the genome. Contig 2 could be one small potential plasmid genome.

Assembly Evaluation

To evaluate the assembly, we use also Quast (Gurevich et al. 2013) (QUality ASsessment Tool), a tool providing quality metrics for assemblies. This tool can be used to compare multiple assemblies, can take an optional reference file as input to provide complementary metrics, etc

Hands-on: Quality Control of assembly using Quast
  1. Quast ( Galaxy version 5.2.0+galaxy1) with the following parameters:
    • Assembly mode?: Co-assembly
      • “Use customized names for the input files?”: No, use dataset names
        • param-file “Contigs/scaffolds file”: consensus output of Flye

QUAST outputs assembly metrics as an HTML file with metrics and graphs.

Question
  1. How many contigs is there?
  2. How long is the largest contig?
  3. What is the total length of all contigs?
  4. What is you GC content?
  5. How does it compare to results for KUN1163 in Table 1 in Hikichi et al. 2019?
  1. 2 contigs
  2. The largest contig is 2,907,099 bp.
  3. 2,967,214 (Total length (>= 0 bp)).
  4. The GC content for our assembly is 32.91%.
  5. KUN1163 has a genome size of 2,914,567 bp (not far from the length of the largest contig), with a GC content of 32.91% (as in our assembly)

Assembly Polishing

We can now polish the assembly using both the short reads and/or long reads. This process aligns the reads to the assembly contigs, and makes corrections to the contigs where warranted.

Several tools exist for polishing, e.g. Racon (Vaser et al. 2017). Here we will use Polypolish (Wick and Holt 2022), a tool for polishing genome assemblies with short reads.

Polypolish needs as input the assembly but also SAM/BAM files where each read has been aligned to all possible locations (not just a single best location). Errors in repeats will be covered by short-read alignments, and Polypolish can therefore fix those errors.

To get the SAM/BAM files, we need to map the short reads on the assembly. We will use BWA-MEM (Li and Durbin 2009, Li and Durbin 2010, Li 2013).

We need to set up BWA-MEM so it aligns each read to all possible locations, not just the best location. This option does not work with paired-end alignment. We will then need to align forward and reverse read files separately, instead of aligning both read files with a single BWA-MEM run as usually recommended.

Hands-on: Align short-reads on assembly
  1. Change the datatype of both FASTQ outputs of fastp to fastqsanger.gz

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select fastqsanger.gz from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  2. BWA-MEM2 ( Galaxy version 2.2.1+galaxy1) with the following parameters:

    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: consensus output of Flye
    • “Single or Paired-end reads”: Single
      • param-files “Select fastq dataset”: both outputs of fastp
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 5.Full list of options
      • “Set algorithmic options?”: Do not set
      • “Set scoring options?”: Do not set
      • “Set input/output options”: Set
        • “Output all alignments for single-ends or unpaired paired-ends”: Yes
    • “BAM sorting mode”: Not sorted (sorted as input)

We can now run Polypolish.

Hands-on: Polish assembly
  1. Polypolish ( Galaxy version 0.5.0+galaxy2) with the following parameters:
    • In “Input sequences”:
      • param-file “Select a draft genome for polishing”: consensus output of Flye
      • “Select aligned data to polish”: Paired SAM/BAM files
        • param-file “Select forward SAM/BAM file”: output of BWA-MEM2 on the Read 1 output of fastp
        • param-file “Select reverse SAM/BAM file”: output of BWA-MEM2 on the Read 2 output of fastp

To check the impact of the polishing, let’s run QUAST on both Flye and Polypolish outputs.

Hands-on: Quality Control of polished assembly
  1. Quast ( Galaxy version 5.2.0+galaxy1) with the following parameters:
    • Assembly mode?: Co-assembly
      • “Use customized names for the input files?”: No, use dataset names
        • param-files “Contigs/scaffolds file”: consensus output of Flye and output of Polypolish

The HTML report generated by QUAST gives metrics for both assembly side-by-side

Statistics Flye output Polypolish output
Number of contigs 2 2
Largest contig 2,907,099 2,915,230
Total length (>= 0 bp) 2,967,214 2,975,666
GC (%) 32.91 32.84
Question

Is the assembly after polishing better than before given the results for KUN1163 in Table 1 in Hikichi et al. 2019?

The largest contig after polishing has a length of 2,915,230 bp, which closr to the expected 2,914,567 bp. But the GC content (32.84% after polishing) is slightly worst given the expected 32.91% also found in the assembly before polishing.

Conclusion

In this tutorial, we prepared long reads (using short reads if we had some) assembled them, inspect the produced assembly for its quality, and polished it (if short reads where provided). The assembly, even if uncomplete, is reasonable good to be used in downstream analysis, like AMR gene detection