Genome assembly using PacBio data

Overview
Questions:
  • How do you perform a genome assembly with PacBio data?

  • How do you check assembly quality?

Objectives:
  • Assemble a genome with PacBio data

  • Assess assembly quality

Time estimation: 6 hours
Level: Intermediate
Published: Nov 29, 2021
Last modification: Nov 3, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00033
Rating: 4.0 (0 recent ratings, 2 all time)
Revision: 16

In this tutorial, we will assemble the genome of Mucor mucedo, a fungal species in the family Mucoraceae, from PacBio sequencing data. These data were obtained from NCBI (SRR8534473, SRR8534474 and SRR8534475). We will then assess the quality of the resulting assembly, in particular by comparing it to a reference assembly that was produced with the Falcon assembler and is available on the JGI website.

Agenda

In this tutorial, we will cover:

  1. Get data
    1. Get data from Zenodo
    2. Get data from JGI website
  2. Genome Assembly with Flye
  3. Quality assessment
    1. Genome assembly metrics with Fasta Statistics
    2. Genome assemblies comparison with Quast
    3. Genome assembly assessment with BUSCO
  4. Conclusion

Get data

We will use long-read sequencing data: CLR (continuous long reads) from PacBio sequencing of the Mucor mucedo genome. This data is a subset of the data available from NCBI. Later, we will also use a reference genome assembly downloaded from the JGI website. This reference genome was assembled from the same PacBio data, so we will use it as a point of comparison for our own assembly.

Get data from Zenodo

Hands-on: Data upload from Zenodo
  1. Create a new history for this tutorial
  2. Import the files from Zenodo

    https://zenodo.org/records/5702408/files/SRR8534473_subreads.fastq.gz
    https://zenodo.org/records/5702408/files/SRR8534474_subreads.fastq.gz
    https://zenodo.org/records/5702408/files/SRR8534475_subreads.fastq.gz
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel
    • Select Paste/Fetch Data
    • Paste the link(s) into the text field
    • Press Start
    • Close the window

  3. Rename the datasets
  4. Check that the datatype is fastqsanger.gz

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab at the top
    • Under Assign Datatype, select fastqsanger.gz from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button
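
Outside Galaxy, a downloaded read file can be sanity-checked before assembly. The following is a minimal Python sketch, assuming plain four-line FASTQ records (header, sequence, separator, quality); the function names are hypothetical and not part of Galaxy or any PacBio tool.

```python
import gzip
import io


def summarize_fastq(handle):
    """Count reads and bases in a FASTQ text stream (assumes 4 lines per record)."""
    n_reads = 0
    n_bases = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records or at the end
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near read {n_reads + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in read {n_reads + 1}")
        n_reads += 1
        n_bases += len(sequence)
    return n_reads, n_bases


def summarize_fastq_gz(path):
    """Open a gzip-compressed FASTQ file in text mode and summarize it."""
    with gzip.open(path, "rt") as handle:
        return summarize_fastq(handle)


# Tiny in-memory example with two made-up reads.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
print(summarize_fastq(demo))  # prints (2, 6)
```

Calling `summarize_fastq_gz("SRR8534473_subreads.fastq.gz")` on a downloaded file would return its read and base counts, provided the file follows the four-line layout.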

Get data from JGI website

Hands-on: Data upload from JGI website
  1. Create a JGI account on the JGI registration page: JGI registration
  2. Sign in to the JGI Genome Portal: JGI Genome Portal
  3. The genome assembly is available here: JGI Mucor mucedo
  4. Download the fasta assembly file Mucmuc1_AssemblyScaffolds.fasta to your computer
  5. Upload this file to Galaxy
  6. Check that the datatype is fasta

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab at the top
    • Under Assign Datatype, select fasta from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Genome Assembly with Flye

We will use Flye, a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly. More information about the Flye assembler is available here: Flye.

Hands-on: Assembly
  1. Flye (Galaxy version 2.9+galaxy0) with the following parameters:
    • “Input reads”: the three sequencing datasets
    • “Mode”: PacBio raw
    • “Number of polishing iterations”: 1
    • “Reduced contig assembly coverage”: Disable reduced coverage for initial disjointing assembly

    The tool produces four datasets: consensus, assembly graph, graphical fragment assembly and assembly info
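
For readers who want to reproduce this step outside Galaxy, the parameters above correspond roughly to a Flye command line. The Python sketch below only builds and prints that command; it assumes Flye is installed separately, and the output directory name and thread count are hypothetical choices that do not come from the tutorial.

```python
import shlex


def build_flye_command(read_files, out_dir, iterations=1, threads=4):
    """Return the argument list for a Flye run on raw PacBio reads."""
    if not read_files:
        raise ValueError("at least one read file is required")
    return [
        "flye",
        "--pacbio-raw", *read_files,      # "Mode": PacBio raw
        "--iterations", str(iterations),  # "Number of polishing iterations"
        "--out-dir", out_dir,             # hypothetical output directory
        "--threads", str(threads),        # hypothetical thread count
    ]


reads = [
    "SRR8534473_subreads.fastq.gz",
    "SRR8534474_subreads.fastq.gz",
    "SRR8534475_subreads.fastq.gz",
]
command = build_flye_command(reads, out_dir="flye_out")
print(shlex.join(command))
# To actually run the assembly (requires Flye on the PATH):
#   import subprocess; subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a file name contains spaces.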

Question

What are the different output datasets?

  • The first dataset (consensus) is a fasta file containing the final assembly (1461 contigs). You may notice that the number of contigs you obtained is slightly different from the one presented here. This is because the Flye assembly algorithm does not always give exactly the same results.
  • The second and third datasets are assembly graph files. These graphs represent the final assembly of a genome and are based on the reads and their overlap information. Tools such as Bandage can be used to visualize the assembly graph.
  • The fourth dataset is a tabular file (assembly_info) containing extra information about contigs/scaffolds.
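
The assembly info table can also be inspected programmatically. The sketch below assumes a tab-separated layout whose header line starts with `#` and includes `seq_name` and `length` columns, as in Flye's assembly_info.txt; the rows are made up for illustration, and the function name is hypothetical.

```python
import csv
import io


def read_assembly_info(handle):
    """Parse a tab-separated table whose header line starts with '#'."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, row)) for row in reader if row]


# Made-up rows, used only to illustrate the assumed layout.
demo = io.StringIO(
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t897000\t35\tN\tN\t1\t*\t1\n"
    "contig_2\t222000\t33\tN\tN\t1\t*\t2\n"
)
contigs = read_assembly_info(demo)
longest = max(contigs, key=lambda c: int(c["length"]))
print(longest["seq_name"], longest["length"])  # prints contig_1 897000
```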

Quality assessment

Genome assembly metrics with Fasta Statistics

Fasta Statistics displays summary statistics for a fasta file. In the case of a genome assembly, we need to calculate different metrics such as the assembly size, the number of scaffolds or the N50 value. These metrics will allow us to evaluate the quality of this assembly.
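
To make the N50 metric concrete: it is the length of the contig at which the running total of contig lengths, sorted from largest to smallest, first reaches half of the assembly size. The sketch below is a minimal Python illustration with hypothetical function names; it is not the code used by the Fasta Statistics tool.

```python
def read_fasta_lengths(handle):
    """Return the length of every sequence in a FASTA text stream."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the sorted running total reaches half the assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy example: total size 290, half is 145, reached by the second contig (80 + 70).
print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

On a real file, `n50(read_fasta_lengths(open("consensus.fasta")))` would give the value to compare with the tool output, where `consensus.fasta` is a hypothetical local copy of the Flye consensus dataset.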

Hands-on: Fasta statistics on Flye assembly
  1. Fasta Statistics (Galaxy version 2.0) with the following parameters:
    • “fasta or multifasta file”: consensus (output of Flye tool)

Hands-on: Fasta statistics on the reference assembly
  1. Fasta Statistics (Galaxy version 2.0) with the following parameters:
    • “fasta or multifasta file”: Mucmuc1_AssemblyScaffolds.fasta

Question
  1. Compare the different metrics obtained for the Flye assembly and the reference genome.
  2. What can you conclude about the quality of this new assembly?

  1. We compare the metrics of the two genome assemblies:
    • The Flye assembly: 1461 contigs/scaffolds, N50 = 222 kb, length max = 897 kb, size = 48.6 Mb, 36.6% GC
    • The reference genome: 456 contigs/scaffolds, N50 = 202 kb, length max = 776 kb, size = 46.1 Mb, 36.7% GC
  2. N50, maximum length, total size and GC content are very similar, so Flye generated an assembly of a quality similar to that of the reference genome, although it is split into more contigs (1461 versus 456).

Genome assemblies comparison with Quast

Another way to calculate assembly metrics is to use QUAST (QUality ASsessment Tool). Quast evaluates genome assemblies by computing various metrics and can compare an assembly with a reference genome. The manual of Quast is here: Quast

Hands-on: Assembly comparison with Quast
  1. Quast (Galaxy version 5.0.2+galaxy3) with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • “Contigs/scaffolds file”: consensus (output of Flye tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: Yes
      • “Reference genome”: Mucmuc1_AssemblyScaffolds.fasta
      • “Type of organism”: Fungus: use of GeneMark-ES for gene finding, ...
Question

What additional information is generated by Quast, compared to the Fasta Statistics outputs?

Quast allows us to compare the Flye assembly to the reference genome:

  1. Genome fraction (90.192 %) is the percentage of aligned bases in the reference genome.
  2. Duplication ratio (1.094) is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome.
  3. Largest alignment (698452 bp) is the length of the largest continuous alignment in the assembly.
  4. Total aligned length (45.2 Mb) is the total number of aligned bases in the assembly.
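
The first two definitions above reduce to simple ratios. The sketch below expresses them in Python with made-up numbers; the function and argument names are hypothetical, and Quast derives the aligned base counts from its own alignments, which this example does not reproduce.

```python
def genome_fraction(aligned_reference_bases, reference_length):
    """Percentage of reference bases covered by at least one alignment."""
    return 100 * aligned_reference_bases / reference_length


def duplication_ratio(aligned_assembly_bases, aligned_reference_bases):
    """Aligned bases in the assembly divided by aligned bases in the reference."""
    return aligned_assembly_bases / aligned_reference_bases


# Made-up example: a 1000-base reference, 900 of its bases covered by alignments,
# and 990 assembly bases taking part in those alignments.
print(genome_fraction(900, 1000))   # 90.0
print(duplication_ratio(990, 900))  # 1.1
```

A duplication ratio above 1 means that some reference regions are covered by more than one stretch of the assembly.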

Quast also generates some plots:

  1. Cumulative length plot shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest to the smallest. The y-axis gives the total size of the x largest contigs in the assembly.
  2. GC content plot shows the distribution of GC content in the contigs.
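
The values behind both plots are straightforward to compute. The sketch below is a minimal Python illustration with hypothetical function names: it returns the y-values of a cumulative length plot and the GC percentage of a single sequence, without drawing anything.

```python
from itertools import accumulate


def cumulative_lengths(lengths):
    """y-values of a cumulative length plot: total size of the x largest contigs."""
    return list(accumulate(sorted(lengths, reverse=True)))


def gc_percent(sequence):
    """Percentage of G and C among the A, C, G and T bases of a sequence."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    acgt = gc + sequence.count("A") + sequence.count("T")
    return 100 * gc / acgt if acgt else 0.0


print(cumulative_lengths([30, 80, 50]))  # prints [80, 130, 160]
print(gc_percent("GGCCAATT"))            # prints 50.0
```

Ambiguous bases such as N are left out of the GC denominator here; that is a design choice of this sketch, not a statement about how Quast counts them.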

Genome assembly assessment with BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a quantitative assessment of genome assembly completeness, based on evolutionarily informed expectations of gene content. Details for this tool are here: Busco website

Hands-on: BUSCO on Flye assembly

First on the Flye assembly:

  1. Busco (Galaxy version 5.2.2+galaxy0) with the following parameters:
    • “Sequences to analyse”: consensus (output of Flye tool)
    • “Auto-detect or select lineage”: Select lineage
      • “Lineage”: Mucorales

Then, on the reference assembly:

  1. Busco (Galaxy version 5.2.2+galaxy0) with the following parameters:
    • “Sequences to analyse”: Mucmuc1_AssemblyScaffolds.fasta
    • “Auto-detect or select lineage”: Select lineage
      • “Lineage”: Mucorales
Question

Compare the number of BUSCO genes identified in the Flye assembly and the reference genome. What do you observe?

The short summary generated by BUSCO indicates that the reference genome contains:

  1. 2327 complete BUSCOs (of which 2302 are single-copy and 25 are duplicated),
  2. 13 fragmented BUSCOs,
  3. 109 missing BUSCOs.

The short summary generated by BUSCO indicates that the Flye assembly contains:

  1. 2348 complete BUSCOs (of which 2310 are single-copy and 38 are duplicated),
  2. 8 fragmented BUSCOs,
  3. 93 missing BUSCOs.

BUSCO analysis confirms that these two assemblies are of similar quality, with similar numbers of complete, fragmented and missing BUSCO genes.
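
Expressing the counts as percentages makes the comparison easier to read. In both summaries the categories add up to 2449 BUSCOs, so the counts can be converted directly. The sketch below is a small Python illustration using the numbers reported above; the function name is hypothetical.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the searched BUSCOs."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated
    return {
        "total": total,
        "complete": round(100 * complete / total, 1),
        "fragmented": round(100 * fragmented / total, 1),
        "missing": round(100 * missing / total, 1),
    }


# Counts taken from the two short summaries above.
reference = busco_percentages(single=2302, duplicated=25, fragmented=13, missing=109)
flye = busco_percentages(single=2310, duplicated=38, fragmented=8, missing=93)
print("reference:", reference)
print("flye:", flye)
```

With these counts, about 95.0% of the BUSCOs are complete in the reference genome and about 95.9% in the Flye assembly.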

Conclusion

This pipeline shows how to generate and evaluate a genome assembly from PacBio long-read data. Once you are satisfied with your genome sequence, you might want to annotate it: have a look at the RepeatMasker and Funannotate tutorials to learn how to do it!