De Bruijn Graph Assembly

Overview
Questions:
  • What are the factors that affect genome assembly?

  • How does genome assembly work?

Objectives:
  • Perform an optimised Velvet assembly with the Velvet Optimiser

  • Compare this assembly with those we did in the basic tutorial

  • Perform an assembly using the SPAdes assembler.

Requirements:
Time estimation: 2 hours
Level: Introductory
Supporting Materials:
Published: May 24, 2017
Last modification: Jan 7, 2026
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00031
Rating: 2.9 (0 recent ratings, 9 all time)
Revision: 26

Optimised de Bruijn Graph assemblies using the Velvet Optimiser and SPAdes

In this activity, we will perform de novo assemblies of a short read set using the Velvet Optimiser and the SPAdes assembler. We use the Velvet Optimiser for illustrative purposes; for real assembly work, a more suitable assembler, such as SPAdes, should be chosen.

The Velvet Optimiser is a script written by Simon Gladman to optimise the k-mer size and coverage cutoff parameters for Velvet. More information can be found in its repository.

SPAdes is a de novo genome assembler written by Pavel Pevzner’s group in St. Petersburg. More details can be found on the SPAdes website.

Agenda

In this tutorial, we will deal with:

  1. Get the data
  2. Assemble with the Velvet Optimiser
  3. Assemble with SPAdes

Get the data

We will be using the same data that we used in the introductory tutorial, so if you have already completed that and have the data, skip this section.

Hands On: Getting the data
  1. Create and name a new history for this tutorial.

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Import the sequence read raw data (*.fastq) from Zenodo

    https://zenodo.org/record/582600/files/mutant_R1.fastq
    https://zenodo.org/record/582600/files/mutant_R2.fastq
    
    • Copy the link location
    • Click Upload at the top of the activity panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Rename the files
    • The names of the files are the full URLs; let’s make them a little clearer
    • Change the names to just the last part: mutant_R1.fastq and mutant_R2.fastq, respectively
    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

    Question
    1. What are four key features of a FASTQ file?
    2. What is the main difference between a FASTQ and a FASTA file?
  4. Create a paired collection named Paired Reads

    • Click on Select Items at the top of the history panel
    • Check all the datasets in your history you would like to include
    • Click n of N selected and choose Advanced Build List

      build paired collection menu item

    • You are now in the collection building wizard. Choose List of Paired Datasets and click the ‘Next’ button at the bottom right corner.

      collection building wizard paired list

    • Check and configure auto-pairing. Paired read files commonly have the suffixes _1 and _2 or _R1 and _R2. Click ‘Next’ at the bottom.

      edit and build a paired list collection

    • Edit the List Identifier as required.
    • Enter a name for your collection
    • Click Build to build your collection
    • Click on the checkmark icon at the top of your history again

    • We will need to use both the individual datasets (mutant_R1.fastq and mutant_R2.fastq) and the paired-end collection (Paired Reads), so toggle off the Hide original elements option when creating the collection.
    • Alternatively, you can un-hide the datasets afterwards by clicking the un-hide icon on each hidden dataset in your history.

    If your history contains hidden datasets, you will see an “Include hidden” button directly above the dataset display.

    To un-hide datasets:

    • Type visible:hidden in the search box
    • Select datasets you want to un-hide
    • Click the dropdown that appears at the top of the history
    • Select the “Unhide” option

    An animated gif showing how to unhide datasets

    Alternatively, you can:

    • Click the “Include hidden” button directly above the dataset display. This makes hidden datasets appear in the history alongside the normal (un-hidden) datasets.
    • Hidden datasets are marked by an icon within the dataset box. Clicking this icon un-hides that dataset.
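
The auto-pairing step above relies on file-name suffixes. As a rough illustration of the idea (not Galaxy's actual implementation), the following sketch groups file names into forward/reverse pairs. The regular expression and the `pair_reads` helper are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <base>_1/<base>_2 or <base>_R1/<base>_R2, then an extension.
PAIR_SUFFIX = re.compile(r"^(?P<base>.+)_R?(?P<mate>[12])(?P<ext>\.\w+)$")


def pair_reads(filenames):
    """Group file names into {base: (forward, reverse)} using the mate suffix."""
    mates = defaultdict(dict)
    for name in filenames:
        match = PAIR_SUFFIX.match(name)
        if match:
            mates[match["base"]][match["mate"]] = name
    # Keep only the bases for which both mates were found.
    return {
        base: (found["1"], found["2"])
        for base, found in mates.items()
        if "1" in found and "2" in found
    }


pairs = pair_reads(["mutant_R1.fastq", "mutant_R2.fastq"])
print(pairs)
```

Files that lack a partner are simply left out of the result, which mirrors the need to check the pairing before building the collection.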

Assemble with the Velvet Optimiser

We will perform an assembly with the Velvet Optimiser, which automatically runs and optimises the output of the Velvet assembler (Zerbino and Birney 2008). It first chooses a suitable value for the k-mer size (k). It then optimises the coverage cutoff (cov_cutoff), which removes low-coverage nodes that are likely to result from read errors. It uses the “n50” metric to optimise the k-mer size and the “total number of bases in contigs” to optimise the coverage cutoff.
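
To make the “n50” criterion concrete, here is a minimal sketch of how an N50 can be computed and used to pick a k-mer size. The contig lengths are invented for illustration, and the real Velvet Optimiser search is more involved than taking a maximum over a dictionary.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths (bp) for three trial k-mer sizes; not real results.
trial_assemblies = {
    45: [400, 300, 200, 100],
    55: [700, 200, 100],
    65: [500, 300, 150, 50],
}

# Pick the k-mer size whose trial assembly has the highest N50.
best_k = max(trial_assemblies, key=lambda k: n50(trial_assemblies[k]))
print(best_k, n50(trial_assemblies[best_k]))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is a convenient single number to optimise, although it says nothing about correctness.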

Hands On: Assemble with the Velvet Optimiser
  1. Velvet Optimiser (Galaxy version 2.2.6): Optimise your assembly with the following parameters:
    • “Start k-mer size”: 45
    • “End k-mer size”: 73
    • “Input Files”:
      • 1: Input Files
        • “Input file type”: Fastq
        • “Single or paired end reads”: Paired
        • “Select first set of reads”: mutant_R1.fastq
        • “Select second set of reads”: mutant_R2.fastq

Your history will now contain a number of new files:

  • Velvet optimiser contigs
    • A FASTA file of the final assembled contigs
  • Velvet optimiser contig stats
    • A table of the lengths (counted in k-mers) and the k-mer coverages of the final contigs.

Have a look at each file.
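
Because the stats table counts lengths and coverages in k-mers, the numbers differ slightly from base-pair values. The sketch below applies the conversions described in the Velvet manual, as far as we understand them: a length in nucleotides is the k-mer length plus k - 1, and nucleotide coverage C relates to k-mer coverage Ck by Ck = C * (L - k + 1) / L for reads of length L. The input values, including the 150 bp read length, are assumptions for illustration.

```python
def kmer_length_to_bp(length_in_kmers, k):
    """Convert a length counted in k-mers to nucleotides: L_bp = L_kmers + k - 1."""
    return length_in_kmers + k - 1


def nucleotide_coverage(kmer_coverage, read_length, k):
    """Invert Ck = C * (L - k + 1) / L to recover nucleotide coverage C."""
    return kmer_coverage * read_length / (read_length - k + 1)


# Hypothetical contig: 1000 k-mers long, k-mer coverage 20, assembled with k = 55
# from reads assumed to be 150 bp long.
print(kmer_length_to_bp(1000, 55))
print(nucleotide_coverage(20, 150, 55))
```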

Hands On: Get contig statistics for Velvet Optimiser contigs
  1. Fasta Statistics (Galaxy version 1.0.1): Produce a summary of the velvet optimiser contigs:
    • “fasta or multifasta file”: Select your velvet optimiser contigs file
  2. View the output

    Question

    Compare the output we got here with the output of the simple assemblies obtained in the introductory tutorial.

    1. What are the main differences between them?
    2. Which has a higher “n50”? What does this mean?

Tables of results from (a) Simple assembly and (b) optimised assembly.

(a) The results of the contigs from Simple assembly.

(b) The results of the contigs from the optimised assembly. In contrast to the simple assembly, the optimised assembly produced a much higher n_50 and a lower num_seq.

  • Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler (Zerbino et al. 2009)

Visualisation of the Assembly

Now that we’ve assembled the genome, let’s visualise the assembly using Bandage (Wick et al. 2015). This tool lets us see what the assembly graph really looks like and gives us a feeling for whether the genome was well assembled.

Currently VelvetOptimiser does not include the LastGraph output, so we will manually run velveth and velvetg with the optimised parameters.

Hands On: Find the optimised k-mer value
  1. Locate the output called “VelvetOptimiser: Contigs” in your history

  2. Click the information icon

  3. Check the tool stderr in the information page for the optimised k-mer value

Question

What was the optimal k-mer value? (referred to as “hash” in the stderr log)

55

With this information in hand, let’s run velvet:

Hands On: Manually running velvetg/h
  1. velveth (Galaxy version 1.2.10.3): Prepare a dataset for the Velvet velvetg Assembler
    • “Hash length”: 55
    • “Input Files”:
      • + Insert Input Files
      • 1: Input Files
        • “Choose the input type”: separate paired reads
        • “read type”: shortPaired reads
        • “Dataset”: mutant_R1.fastq (forward reads)
        • “Dataset”: mutant_R2.fastq (reverse reads)
  2. velvetg (Galaxy version 1.2.10.2): Velvet sequence assembler for very short reads
    • “Velvet dataset”: output from velveth tool
    • “Coverage cutoff”: Specify Cutoff Value
      • “Remove nodes with coverage below”: 1.44
    • “Additional outputs”: Generate velvet LastGraph file
    • “Using Paired Reads”: Yes

The LastGraph file contains a detailed representation of the de Bruijn graph, which can give us an idea of how Velvet assembled the genome and potentially resolved any conflicts.
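
For readers who have not met de Bruijn graphs before, the following is a minimal textbook-style sketch of how one can be built from reads. It uses (k-1)-mers as nodes and ignores reverse complements, sequencing errors and coverage, so it is not how Velvet itself stores the graph. The reads are invented toy data.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some k-mer of the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy overlapping reads, invented for illustration.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = de_bruijn_edges(reads, k=4)
for node in sorted(graph):
    print(node, "->", ", ".join(sorted(graph[node])))
```

In this toy case every node has exactly one successor, so the graph is a single unbranched path that spells out one contig.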

Hands On: Bandage
  1. Bandage Image (Galaxy version 2022.09+galaxy4): visualize de novo assembly graphs
    • “Graphical Fragment Assembly”: The “LastGraph” output of velvetg tool
    • “Produce jpg, png or svg file?”: .png
  2. Execute
  3. View the output file

And now you should be able to see the graph that velvet produced:

velvet graph.

Interpreting Bandage Graphs

The k-mer size has a significant effect on the assembly. You can try various k-mer sizes to see this effect in practice.

Table: Bandage graph images for k-mer sizes 21, 33, 53 and 77.

The next thing to be aware of is that a graph can have multiple interpretations, all equally valid in the absence of other data. The following is taken verbatim from Bandage’s wiki:

For a simple case, imagine a bacterial genome that contains a single repeated element in two separate places in the chromosome:

Simple example 1.

A researcher (who does not yet know the structure of the genome) sequences it, and the resulting 100 bp reads are assembled with a de novo assembler:

Simple example 2.

Because the repeated element is longer than the sequencing reads, the assembler was not able to reproduce the original genome as a single contig. Rather, three contigs are produced: one for the repeated sequence (even though it occurs twice) and one for each sequence between the repeated elements.

Given only the contigs, the relationship between these sequences is not clear. However, the assembly graph contains additional information which is made apparent in Bandage:

Simple example 3.

There are two principal underlying sequences compatible with this graph: two separate circular sequences that share a region in common, or a single larger circular sequence with an element that occurs twice:

Simple example 4.

Additional knowledge, such as information on the approximate size of the bacterial chromosome, can help the researcher to rule out the first alternative. In this way, Bandage has assisted in turning a fragmented assembly of three contigs into a completed genome of one sequence.
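
The quoted example frames the problem in terms of read length. In a de Bruijn graph the analogous limit is the k-mer size: a repeat collapses into a shared path whenever it is at least as long as the (k-1)-mer nodes. The sketch below, using an invented linear toy sequence and the same simplified graph model as a textbook de Bruijn construction, counts the nodes that have more than one successor for several values of k.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Count (k-1)-mers that are followed by more than one distinct (k-1)-mer."""
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


repeat = "GATTACA"  # 7 bp element that occurs twice
genome = "CCGT" + repeat + "TTGGC" + repeat + "AACGG"  # invented toy sequence

for k in (5, 8, 9):
    print(k, branching_nodes(genome, k))
```

For k = 5 and k = 8 the graph contains one branching node at the end of the repeat, while for k = 9 the nodes are longer than the repeat and the branch disappears. This is one reason why larger k-mer sizes tend to give less tangled graphs, provided coverage is high enough to support them.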

Assemble with SPAdes

We will now perform an assembly with the much more modern SPAdes assembler (Bankevich et al. 2012). Like Velvet, it builds and simplifies de Bruijn graphs, but it uses multiple k-mer sizes and combines the resulting graphs. This combination produces very good assemblies. When using SPAdes, it is typical to choose at least three k-mer sizes: one low, one medium and one high. We will use 33, 55 and 91.

Hands On: Assemble with SPAdes
  1. SPAdes (Galaxy version 4.2.0+galaxy0): Assemble the reads:
    • “Operation mode”: Only assembler (--only_assembler)
    • “Single-end or paired-end short-reads”: Paired-end: list of dataset pairs
      • “FASTA/FASTQ file(s): collection”: Paired Reads
    • “Set coverage cutoff option”: auto
    • “Select k-mer detection option”: User specific
      • “K-mer size values”: 33,55,91 [note: no spaces!]
    • “Select optional output file(s)”: Assembly graph, Assembly graph with scaffold, Contigs, Scaffolds, Log
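
The k-mer list is easy to mistype. As a small sanity check, the sketch below validates a list of k-mer sizes and formats it as a comma-separated string without spaces. The rules it enforces (odd values, below 128, ascending order) reflect our reading of the SPAdes manual for the `-k` option; the `format_spades_kmers` helper itself is hypothetical and not part of SPAdes or Galaxy.

```python
def format_spades_kmers(kmers, max_k=127):
    """Validate k-mer sizes and join them with commas and no spaces."""
    if list(kmers) != sorted(set(kmers)):
        raise ValueError("k-mer sizes must be unique and in ascending order")
    for k in kmers:
        if k % 2 == 0 or not 0 < k <= max_k:
            raise ValueError(f"invalid k-mer size: {k}")
    return ",".join(str(k) for k in kmers)


print(format_spades_kmers([33, 55, 91]))
```

Each k must also be shorter than the reads themselves, which this sketch does not check because it has no access to the read lengths.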

You will now have 5 new files in your history:

  • two Fasta files, one for contigs and one for scaffolds
  • two statistics files, one for contigs and one for scaffolds
  • the SPAdes log file.

Examine each file, especially the stats files.

Contig stats file: NODE_5 is the shortest contig and has the highest coverage, while NODE_1 is the opposite.

Question
  1. Why would one of the contigs have much higher coverage than the others?
  2. What could this represent?
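
To spot contigs whose coverage stands out, the length and coverage can be read directly from SPAdes-style contig names of the form NODE_<id>_length_<n>_cov_<x>. The sketch below parses such names and reports each contig's coverage relative to the median. The headers are invented for illustration, so substitute the names from your own contigs file.

```python
import re

# SPAdes-style contig name, with or without the leading '>' of a FASTA header.
HEADER = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_header(header):
    """Return (node_id, length, coverage) from a SPAdes-style contig name."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"unrecognised contig name: {header}")
    return int(match[1]), int(match[2]), float(match[3])


# Invented headers; these are not the results of this tutorial's assembly.
headers = [
    ">NODE_1_length_100000_cov_20.5",
    ">NODE_2_length_50000_cov_19.8",
    ">NODE_3_length_2500_cov_210.0",
]

contigs = [parse_header(h) for h in headers]
coverages = sorted(cov for _, _, cov in contigs)
median_cov = coverages[len(coverages) // 2]

for node_id, length, cov in contigs:
    # A ratio well above 1 marks a contig sequenced far more deeply than the rest.
    print(node_id, length, round(cov / median_cov, 1))
```
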
Hands On: Visualize assembly with Bandage
  1. Bandage Image (Galaxy version 2022.09+galaxy4) with the following parameters:
    • “Graphical Fragment Assembly”: assembly graph with scaffolds output from SPAdes tool
    • “Produce jpg, png or svg file?”: .png
  2. Examine the output image

The visualized assembly should look something like this:

bandage spades.

Question

Which assembly looks better to you? Why?

Hands On: Get contig statistics for SPAdes contigs
  1. Fasta Statistics (Galaxy version 1.0.1): Produce a summary of the SPAdes contigs:
    • “fasta or multifasta file”: Select your SPAdes contigs file
  2. Look at the output file.

    Question

    Compare the output we got here with the output of the simple assemblies obtained in the introductory tutorial.

    1. What are the main differences between them?
    2. Did SPAdes produce a better assembly than the Velvet Optimiser?