VGP assembly pipeline

Overview

Questions:
  • what combination of tools can produce the highest quality assembly of vertebrate genomes?

  • How can we evaluate how good it is?

Objectives:
  • Learn the tools necessary to perform a de novo assembly of a vertebrate genome

  • Evaluate the quality of the assembly

Requirements:
Time estimation: 2 hours
Supporting Materials:
Last modification: Nov 30, 2021
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Introduction

Advances in sequencing technologies over the last few decades have revolutionised the field of genomics, allowing for a reduction in both the time and resources required to de novo genome assembly. Until recently, second-generation sequencing technologies (also known as Next Generation Sequencing or NGS) allowed to produce highly accurate but short (up to 800bp), whose extension was not long enough to cope with the difficulties associated with repetitive regions. Today, so-called third-generation sequencing (TGS) technologies, usually known as single-molecule real-time (SMRT) sequencing, have become dominant in de novo assembly of large genomes. TGS can use native DNA without amplication, reducing sequencing error and bias (Hon et al. 2020, Giani et al. 2020). Very recently, Pacific Biosciences introducedHigh-Fidelity (HiFi) sequencing, which produces reads 10-20 kpb in length with a minimum accuracy of 99% (Q20). In this tutorial you will use HiFi reads in combination with data from additional sequencing technologies to generate a high-quality reference genome assembly.

Deciphering the structural organisation of complex vertebrate genomes is currently one of the most challenges in genomics (Frenkel et al. 2012). Despite the significant progress made in recent years, a key question remains: what combination of data and tools can produce the highest quality assembly? In order to adequately answer it, it is necessary to analyse two of the main factors that determine the difficulty of genome assembly processes: repetitive content and heterozigosity.

Repetitive elements can be grouped into two categories: interspersed repeats, such as transposable elements (TE) that occur at multiple loci throughout the genome, and tandem repeats (TR), that occur at a single locus (Tørresen et al. 2019). Repetitive elements are an important component of eukariotyc genomes, constituting over a third of the genome in the case of mammals (Sotero-Caio et al. 2017, Chalopin et al. 2015). In the case of tamdem repeats, various estimates suggest that they are present in at least one third of human protein sequences (Marcotte et al. 1999). TE content is among the main factors contributing to the lack of continuity in the reconstruction of genomes, especially in the case of large ones, as its content is highly correlated with genome size (Sotero-Caio et al. 2017). On the other hand, TR usually lead to local genome assembly collapse, especially when their length is close to that of the reads (Tørresen et al. 2019).

Heterozygosity is also an important factor in genome assembly. Haplotype phasing, that is, the identification of alleles that are co-located on the same chromosome, has become a fundamental problem in heterozygous and polyploid genome assemblies (Zhang et al. 2020). A common strategy to overcome these difficulties is to remap genomes to a single haplotype, which represents the whole genome. This approach is useful for highly inbred samples that are nearly homozygous, but when applied to highly heterozygous genomes, such as aquatic organism, it missses potential differences in sequence, structure, and gene presence, usually leading to ambiguties and redundancies in the initial contig-level assemblies (Angel et al. 2018, Zhang et al. 2020).

To address these problems, the G10K consortium launched the Vertebrate Genomes Project (VGP), whose goal is generating high-quality, near-error-free, gap-free, chromosome-level, haplotype-phased, annotated reference genome assembly for each of the vertebrate species currently present on planet Earth (Rhie et al. 2021). The protocol proposed in this tutorial, the VGP assembly pipeline, is the result of years of study and analysis of the available tools and data sources.

Agenda

In this tutorial, we will cover:

  1. VGP assembly pipeline overview
    1. Background on datasets
  2. Get data
  3. Data quality assessment
  4. Genome profile analysis
    1. Generation of k-mer spectra with Meryl
    2. Genome profiling with GenomeScope2
  5. HiFi long read phased assembly with hifiasm
    1. Genome assembly with hifiasm
    2. Convert GFA format to FASTA with GFA to FASTA
    3. Initial assembly evaluation
  6. Post-assembly processing
    1. Remove haplotypic duplication with purge_dups
    2. Second assembly evaluation assembly evaluation
    3. Sub-step with Concatenate datasets
  7. Hybrid scaffolding based on phased assembly and Bionano data
    1. Sub-step with Bionano Hybrid Scaffold
    2. Sub-step with Concatenate datasets
    3. Sub-step with Parse parameter value
    4. Sub-step with Quast
  8. Hybrid scaffolding based on a phased assembly and HiC mapping data
    1. 1. Map the HiC reads to the assembly
    2. 2. View a contact map of the mapped HiC reads
    3. 3. Salsa scaffolding
    4. 4. Evaluate the Salsa scaffolding results
    5. 5. Generate a post-scaffolding contact map

VGP assembly pipeline overview

The figure 1 represents the VGP assembly pipeline.

fig1:VGP pipeline.
Figure 1: VPG Pipeline 2.0

In order to facilitate the development of the workflow, we will structure it in four main sections:

  • Genome profile analysis
  • HiFi long read phased assembly with Hifiasm
  • Hybrid scaffolding based on phased assembly and Bionano data
  • Hybrid scaffolding based on a phased assembly and Hi-C mapping data

TODO: suggest including here something about how the Galaxy workflow has additional steps (e.g. parse parameter value), so that it can run automatically, but if run step by step in this tutorial, then some of those steps are just manually entered value (e.g. genome size parameter)

Background on datasets

In order to reduce processing times, we will use samples from Saccharomyces cerevisiae, one of the most intensively studied eukaryotic model organisms in molecular and cell biology. This organisms can be haploid or diploid, depending the stage of its life cycle. Both cell types are stable and can reproduce asexually by mitosis.

The VGP assembly pipeline requires datasets generated by three different technologies: PacBio HiFi reads, Bionano optical maps, and Hi-C chromatin interaction maps.

PacBio HiFi reads rely on the Single Molecule Real-Time (SMRT) sequencing technology. It is based on real-time imaging of fluorescently tagged nucleotides as they are synthesized along individual DNA template molecules, combining multiple subreads of the same circular template using a statistical model to produce one highly accurate consensus sequence, along with base quality values (figure 2). This technology allows to generate long-read sequencing dataseets with read lengths averaging 10-25 kb and accuracies greater than 99.5%.

fig2:PacBio sequencing technolgoy.
Figure 2: PacBio HiFi sequencing

Optical genome mapping is a method for detecting structural variants. The generation of Bionano optical maps starts with high molecular weight DNA, which is labeled at specific sequences motif with a fluorescent dye, resulting in a unique fluorescence pattern for each individual genome. The comparison of the labelled fragments among different samples enables the detection of structural variants. Optical maps are integrated with the primary assemby sequence in order to identify and correct potential chimeric joints, and estimate the gap sizes.

The high-throughput chromosome conformation capture (Hi-C) technology is based on the capture of the chromatin conformation, enabling the identification of topological domains. Hi-C chromatin interaction maps methods first crosslink the chromatin in its 3D conformation. The crosslinked DNA is digested using restriction enzymes, and the digested ends are filled with biotinylated nucleotides. Next, the blunt ends of spatially proximal digested end are ligated, preserving the chromosome interaction regions. Finally, the DNA is purified to assure that only fragments originating from ligation events are sequenced.

Get data

hands_on Hands-on: Data upload

  1. Create a new history for this tutorial
  2. Import the files from Zenodo

    • Open the file galaxy-upload upload menu
    • Click on Rule-based tab
    • “Upload data as”: Datasets
    • Copy the tabular data, paste it into the textbox and press Build

      Hi-C_dataset_F   https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz?download=1   fastqsanger.gz    Hi-C
      Hi-C_dataset_R   https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz?download=1   fastqsanger.gz    Hi-C
      Bionano_dataset    https://zenodo.org/record/5550653/files/bionano.cmap?download=1   cmap    Bionano
      
    • From Rules menu select Add / Modify Column Definitions
      • Click Add Definition button and select Name: column A
      • Click Add Definition button and select URL: column B
      • Click Add Definition button and select Type: column C
      • Clich Add Definition button and select Name Tag: column D
    • Click Apply and press Upload
  3. Import the remaining datasets from Zenodo

    • Open the file galaxy-upload upload menu
    • Click on Rule-based tab
    • “Upload data as”: Collections
    • Copy the tabular data, paste it into the textbox and press Build

      SRR13577846_1    https://zenodo.org/record/5550653/files/SRR13577846_1.30x.wgaps.fastq.gz?download=1  fastqsanger.gz    HiFi  HiFi_collection
      SRR13577846_2    https://zenodo.org/record/5550653/files/SRR13577846_2.30x.wgaps.fastq.gz?download=1  fastqsanger.gz    HiFi  HiFi_collection
      SRR13577846_3    https://zenodo.org/record/5550653/files/SRR13577846_3.30x.wgaps.fastq.gz?download=1  fastqsanger.gz    HiFi  HiFi_collection
      
    • From Rules menu select Add / Modify Column Definitions
      • Click Add Definition button and select List Identifier(s): column A
      • Click Add Definition button and select URL: column B
      • Click Add Definition button and select Type: column C
      • Clich Add Definition button and select Group Tag: column D
      • Clich Add Definition button and select Collection Name: column E
    • Click Apply and press Upload

Data quality assessment

To begin our analysis we will carry out the evaluation and pre-processing of our data. In order to identify potential anomalies in the data, we are going to use FastQC, an open-source tool that provides a simple way to quality control raw sequence data.

hands_on Hands-on: Quality check

  1. Run FastQC tool with the following parameters
    • param-collection “Raw read data from your current history”: HiFi_collection
  2. MultiQC Tool: toolshed.g2.bx.psu.edu/repos/iuc/multiqc/multiqc/1.8+galaxy1 with the following parameters:
    • In “Results”:
      • “Which tool was used generate logs?”: FastQC
      • param-collection “Dataset collection”: select the FastQC on collection:Raw Data dataset.
    • In “Report title”: HiFi quality report
  3. Click on the galaxy-eye (eye) icon and inspect the generated HTML file
fig3:HiFi Quality report.
Figure 3: PacBio HiFi qualiry report

As we can see, the mean Phred score is over 80 in all the samples, which means that the base call accuracy is around 99.999999%!

comment Comments

For more information on the topic of quality control, please see our training materials here.

According the quality report, less that 0.1% of the reads include adaptor sequences. Despide of this, we will trim the residual adaptors sequences by using Cutadapt, in order to avoid potential reads which could interfer in the subsequent steps.

hands_on Hands-on: Optional step: primer removal

  1. Cutadapt Tool: toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/3.4 with the following parameters:
    • “Single-end or Paired-end reads?”: Single-end
      • param-collection “FASTQ/A file”: HiFi_collection
      • In “Read 1 Options”:
        • In “5’ or 3’ (Anywhere) Adapters”:
          • param-repeat “Insert 5’ or 3’ (Anywhere) Adapters”
            • “Source”: Enter custom sequence
              • “Enter custom 5’ or 3’ adapter name”: First adapter
              • “Enter custom 5’ or 3’ adapter sequence”: ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT
          • param-repeat “Insert 5’ or 3’ (Anywhere) Adapters”
            • “Source”: Enter custom sequence
              • “Enter custom 5’ or 3’ adapter name”: Second adapter
              • “Enter custom 5’ or 3’ adapter sequence”: ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT
    • In “Adapter Options”:
      • “Match times”: 3
      • “Maximum error rate”: 0.1
      • “Minimum overlap length”: 35
      • “Look for adapters in the reverse complement”: True
    • In “Filter Options”:
      • “Discard Trimmed Reads: Yes
  2. Rename the output file as HiFi_collection (trim). To rename an output file, click on the result, and then click again on the title to change it. After closing, you may need to refresh the history to see the name change.

Genome profile analysis

An important step before starting a de novo genome assembly project is to proceed with the analysis of the genome profile. Determining these characteristics in advance has the potential to reveal whether an analysis is not reflecting the full complexity of the genome, for example, if the number of variants is underestimated or a significant fraction of the genome is not assembled (Vurture et al. 2017).

Traditionally DNA flow citometry was considered the golden standart for estimating the genome size, one of the most important factors to determine the required coverage level. However, nowadays experimental methods have been replaced by computational approaches Wang et al. 2020. One of the most widely used procedures for undertaking genomic profiling is the analyis of k-mer frequencies. It allows to provide information not only about the genomic complexity, such as the genome size, levels of heterozygosity and repeat content, but also about the data quality. In addition, k-mer spectra analysis can be used in a reference-free manner for assessing genome assembly quality metrics (Rhie et al. 2020).

details K-mer size, sequencing coverage and genome size

K-mers are unique substrings of length k contained within a DNA sequence. For example, the DNA sequence TCGATCACA can be decomposed into six unique k-mers that have five bases long: TCGAT, CGATC, GATCA, ATCAC and TCACA. A sequence of length L will have L-k+1 k-mers. On the other hand, the number of possible k-mers can be calculated as nk, where n is number of possible monomers and k is the k-mer size.

Bases K-mer size Total possible k-mers
4 1 4
4 2 16
4 3 64
4 4 256
4
4 10 1.048.576

Thus, the k-mer size is a key parameter, which must be large enough to map uniquely to the genome, but not too large, since it can lead to wasting computational resources. In the case of the human genome, k-mers of 31 bases in length lead to 96.96% of unique k-mers.

Each unique k-mer can be assigned a value for coverage based on the number of times it occurs in a sequence, whose distribution will approximate a Poisson distribution, with the peak corresponding to the average genome sequencing depth. From the genome coverage, the genome size can be easily computed.

In section we will use two basic tools to computationally estimate the genome features: Meryl and GenomeScope.

Generation of k-mer spectra with Meryl

Meryl will allow us to perform the k-mer profiling by decomposing the sequencing data into k-lenght substrings and determining its frequency. The original version was developed for use in the Celera Assembler, and it comprises three modules: one for generating k-mer databases, one for filtering and combining databases, and one for searching databases. The k-mer database is stored in sorted order, similar to words in a dictionary (Rhie et al. 2020).

comment K-mer size estimation

Given an estimated genome size (G) and a tolerable collision rate (p), an appropriate k can be computed as k = log4 (G(1 − p)/p).

hands_on Hands-on: Generate k-mers count distribution

  1. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy2 with the following parameters:
    • “Operation type selector”: Count operations
      • “Count operations”: Count: count the ocurrences of canonical k-mers
      • param-collection “Input sequences”: HiFi_collection (trim)
      • “K-mer size selector”: Set a k-mer size
        • K-mer size”: 21

    comment Election of k-mer size

    We used 21 as k-mer size, as this length has demonstrated to be sufficiently long that most k-mers are not repetitive and is short enough that the analysis will be more robust to sequencing errors. For extremely large (haploid size over 10 Gb) and/or very repetitive genomes, it is recommended to use larger k-mer lengths to increase the number of unique k-mers.

  2. Rename it Collection meryldb

  3. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy1 with the following parameters:
    • “Operation type selector”: Operations on sets of k-mers
      • “Operations on sets of k-mers”: Union-sum: return k-mers that occur in any input, set the count to the sum of the counts
      • param-file “Input meryldb”: Collection meryldb
  4. Rename it as Merged meryldb

  5. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy0 with the following parameters:
    • “Operation type selector”: Generate histogram dataset
      • param-file “Input meryldb”: Merged meryldb
  6. Finally, rename it as Meryldb histogram.

Genome profiling with GenomeScope2

The next step is to computationally infer the genome properties from the k-mer count distribution generated by Meryl, for which we’ll use GenomeScope2. It relies in a nonlinear least-squares optimization to fit a mixture of negative binomial distributions, generating estimated values for genome size, repetitiveness, and heterozygosity rates (Ranallo-Benavidez et al. 2020).

hands_on Hands-on: Estimate genome properties

  1. GenomeScope Tool: toolshed.g2.bx.psu.edu/repos/iuc/genomescope/genomescope/2.0 with the following parameters:
    • param-file “Input histogram file”: Meryldb histogram
    • “K-mer length used to calculate k-mer spectra”: 21
  • In “Output options”: mark Summary of the analysis
  • In “Advanced options”:
    • “Create testing.tsv file with model parameters”: true

Genomescope will generate six outputs:

  • Plots
    • Linear plot: K-mer spectra and fitted models: frequency (y-axis) versus coverage.
    • Log plot: logarithmic transformation of the previous plot.
    • Transformed linear plot: K-mer spectra and fitted models: frequency times coverage (y-axis) versus coverage (x-axis). It allows to increases the heights of higher-order peaks, overcoming the effect of high heterozygosity.
    • Transformed log plot: logarithmic transformation of the previous plot.
  • Model: this file includes a detailed report about the model fitting.
  • Summary: it includes the properties infered from the model, such as genome haploid length and the percentage of heterozygosity.

Now, let’s analyze the k-mer profiles, fitted models and estimated parameters:

fig3:Genomescope plot.
Figure 4: Genomescope2 plot

As we can see, there is an unique peak centered around 28, which is the coverage with the highest number of different 21-mers. According the normal-like k-mer spectra, we can infer that it is a haploid genome. The large number of unique k-mers on the left side with frequence around one is due to error during the sequencing process.

Before jumping to the next section, we need to carry out some operation on the output generated by Genomescope2. The goal is to generate some parameters which at a later stage will be used by purge_dups (Guan et al. 2019). Lets start with the estimated genome size.

hands_on Hands-on: Get estimated genome size

  1. Replace Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3 with the following parameters:
    • param-file “File to process”: summary (output of GenomeScope tool)
    • “Find pattern”: bp
    • “Replace all occurences of the pattern”: Yes
    • “Find and Replace text in”: entire line
  2. Replace Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3 with the following parameters:
    • param-file “File to process”: output file of Replace tool)
    • “Find pattern”: ,
    • “Replace all occurences of the pattern”: Yes
    • “Find and Replace text in”: entire line
  3. Search in textfiles (grep) Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1 with the following parameters:
    • param-file “Select lines from”: output file of the previous step.
    • “Type of regex”: Basic
    • “Regular Expression: Haploid
  4. Convert delimiters to TAB Tool: Convert characters1 with the following parameters:
    • param-file “in Dataset”: output of Search in textfiles tool
  5. Advanced Cut Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-file “File to cut”: output of Convert delimiters to TAB tool
    • “Cut by”: fields
      • “List of Fields”: Column: 5
  6. Parse parameter value Tool: param_value_from_file (param_value_from_file) with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output of Advanced Cut tool
    • “Select type of parameter to parse”: Integer
  7. Rename the output as Estimated genome size.

question Questions

What is the estimated genome size?

solution Solution

The estimated genome size is 12664060 bp.

Now let’s parse the upper bound for the read depth estimation parameter.

hands_on Hands-on: Get maximum read depth

  1. Compute an expression on every row Tool: toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6 with the following parameters:
    • “Add expression: 1.5*c3
    • param-file “as a new column to”: model_params (output of GenomeScope tool)
    • “Round result?”: Yes
    • “Input has a header line with column names?”: No
  2. Compute an expression on every row Tool: toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6 with the following parameters:
    • “Add expression: 3*c7
    • param-file “as a new column to”: output of Compute tool)
    • “Round result?”: Yes
    • “Input has a header line with column names?”: No
  3. Rename it as Parsing temporal output

  4. Advanced Cut Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-file “File to cut”: Parsing temporal output (output of Compute tool)
    • “Cut by”: fields
      • “List of Fields”: Column 8
  5. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output of Advanced Cut tool
    • “Select type of parameter to parse”: Integer
  6. Rename it as Maximum depth

question Questions

What is the estimated maximum depth?

solution Solution

The estimated maximum depth is 63 reads.

Finally, let’s parse the transition between haploid and diploid coverage depths parameter.

hands_on Hands-on: Get transition parameter

  1. Advanced Cut Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-file “File to cut”: Parsing temporal output (output of Compute tool)
    • “Cut by”: fields
      • “List of Fields”: Column 7
  2. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output of Advanced Cut tool
    • “Select type of parameter to parse”: Integer
  3. Rename it as Transition parameter

question Questions

What is the estimated transition parameter?

solution Solution

The estimated transition parameter is 21 reads.

HiFi long read phased assembly with hifiasm

Once we have finished the genome profiling stage, we can start the genome assembly with hifiasm, a fast open-source de novo assembler specifically developed for PacBio HiFi reads.

Genome assembly with hifiasm

One of the key focus of hifiams is to different copies of a segmental duplication involving a single segregating site, allowing to resolve near-identical, but not exactly identical, repeats and segmental duplications (Cheng et al. 2021).

comment Hifiasm algorithm details

By default hifiasm performs three rounds of haplotype-aware error correction to correct sequence errors but keeping heterozygous alleles. A position on the target read to be corrected is considered informative if there are tow different nucleotides at the position of the alignment, and each type is supported by at least tree reads.

fig4:Hifiasm algorithm overview.
Figure 5: Hifiasm algorithm overview. Orange and blue bars represent the reads with heterozygous alleles carrying local phasing information, while green bars come from the homozygous regions without any heterozygous alleles.

Then, hifiasm builds a phased assembly string graph with local phasing information from the corrected reads. Only the reads coming from the same haplotype are connected in the phased assembly graph. After transitive reduction, a pair of heterozygous alleles will be represented by a bubble in the string graph. If there are no additional data, hifiasm arbitrarily selects one side of each bubble and outputs a primary assembly. For a heterozygous genome, the primary assembly generated at this step may still contain haplotigs from more than one homologous haplotype.

hands_on Hands-on: Phased assembly with hifiasm

  1. Hifiasm Tool: toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.14+galaxy0 with the following parameters:
    • “Assembly mode”: Standard
      • param-file “Input reads: HiFi_collection (trim) (output of Cutadapt tool)
    • “Options for purging duplicates”: Specify
      • “Coverage upper bound”: 63 (maximum depth previously obtained)
    • “Options for Hi-C partition”: Specify
      • “Hi-C R1 reads: Hi-C_dataset_F
      • “Hi-C R2 reads: Hi-C_dataset_R
  2. Rename the Hi-C hap1 contig graph as Primary contig graph and add a #primary tag
  3. Rename the Hi-C hap2 contig graph as Alternate contig graph and add a #alternate tag

Hifiasm generates four outputs in GFA format; this format is specially designed to capture sequence graphs as the product of an assembly, a representation of variation in genomes, splice graphs in genes, or even overlap between reads from long-read sequencing technology.

Convert GFA format to FASTA with GFA to FASTA

We have obtained the fully phased contig graphs of the primary and alternate haplotypes, but the output format of hifiasm is not adequate for the subsequent steps, so we will convert them into fasta format.

hands_on Hands-on: convert GFA to FASTA

  1. GFA to FASTA Tool: toolshed.g2.bx.psu.edu/repos/iuc/gfa_to_fa/gfa_to_fa/0.1.2 with the following parameters:
    • param-files “Input GFA file”: select Primary contig graph and the Alternate contig graph datasets
  2. Rename the outputs as Primary contig FASTA and Alternate contig FASTA

Initial assembly evaluation

Once generated the draft assembly, it is a good idea to evaluate its quality.

hands_on Hands-on: assembly evaluation with Quast

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: Yes, specify custom names
    • In “1. Contigs/scaffolds”:
      • param-file “Contigs/scaffolds file”: Primary contig FASTA
      • “Name”: Primary assembly
    • Click in “Insert Contigs/scaffolds”
    • In “2. Contigs/scaffolds”:
      • param-file “Contigs/scaffolds file”: Alternate contig FASTA
      • “Name”: Alternate assembly
    • Reads options”: Pacbio SMRT reads
      • param-collection “FASTQ file”: HiFi collection (trim)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: 12664060 (previously estimated)
      • “Type of organism”: Eukaryote: use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding (--eukaryote)
    • “Is genome large (>100Mpb)?”: No

    comment Comment

    Remember that for this training we are using S. cerevisiae, a reduced genome. In the case of assembling a vertebrate genome, you must select yes in the previous option.

  2. Rename the HTML report as QUAST initial report

Let’s have a look at the HTML report.

fig5:QUAST plot.
Figure 6: Quast initial report.

question Questions

  1. What is the longest contig in the primary assembly? And in the alternate one?
  2. What is the N50 of the primary assembly?
  3. Which percentage of reads mapped to each assembly?

solution Solution

  1. The longest contig in the primary assembly is 914.549 bp, and 15.845 bp in the alternate assembly.
  2. The N50 of the primary assembly is 425.706 bp.
  3. According the report, 100% of reads mapped to the primary assembly, but only around 57% mapped to the alternate assembly.

hands_on Hands-on: assessing assembly completness with BUSCO

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-files “Sequences to analyse”: Primary contig FASTA and Alternate contig FASTA
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Auto-detect or select lineage?”: Select lineage
      • “Lineage”: Saccharomycetes
    • “Which outputs should be generated”: short summary text

    comment Comment

    Remember to modify the lineage option if you are working with vertebrate genomes.

  2. Rename the summary as BUSCO initial report

question Questions

  1. Which percentage of Benchmarking Universal Single-Copy Orthologs (BUSCO) genes have been identified?
  2. How many BUSCOs gene are absent?

solution Solution

  1. According the report, our assembly contains the complete sequence of 99.3% of BUSCO genes.
  2. 8 BUSCO genes are missing.

hands_on Hands-on: K-mer based evaluation with Merqury

  1. Merqury Tool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3 with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: Merged meryldb
      • “Number of assemblies”: `Two assemblies
        • param-file “First genome assembly”: Primary contig FASTA
        • param-file “Second genome assembly”: Alternate contig FASTA

Post-assembly processing

An ideal haploid representation would consist of one allelic copy of all heterozygous regions in the two haplomes (haplotype contigs), as well as all hemizygous regions from both haplomes. However, the allelic relationship between haplotypes still present a problem for de novo genome assembly, specially in high heterozygous genomes; sequence divergence between pair of allelic sequences can lead to assemble there regions as separate contigs, rather than the expected single haplotype-fused contig. It can result in assemblies signicantly larger than the haploid genome size, which can lead to interferences in downstream stages, such as scaffolding and gene annotation (Guan et al. 2019, Roach et al. 2018).

Usually, allelic relationships are inferred at the post-assembly stage. Despite the haplotig processing requites multiple steps, the approach used in this tutorial can be summaryzed in two steps: firstly we will identify the syntenic contigs by using the mapped read coverage and minimap2 (Li 2018) alignments. Then, we will resolve the haplotigs and overlaps in the primary assembly by using purge_dups.

Remove haplotypic duplication with purge_dups

This step includes 11 steps, summarized in the following scheme:

fig4:Post-processing step.
Figure 7: Purge_dups pipeline

hands_on Hands-on: purge_dups pipeline

  1. Collapse Collection Tool: toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.2 with the following parameters:
    • param-collection “Collection of files to collapse into single dataset”:HiFi_collection (trim)
  2. Rename de output as HiFi reads collapsed

  3. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: Primary contig FASTA
    • “Single or Paired-end reads: Single
      • param-collection “Select fastq dataset”: HiFi_collection (trim) (output of Cutadapt tool)
      • “Select a profile of preset options”: Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)
    • In “Set advanced output options”:
      • “Select an output format”: paf
  4. Rename the output as Reads mapped to contigs

  5. purge_dups Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy3 with the following parameters:
    • “Function mode”: Calculate coverage cutoff, base-level read depth and create read depth histogram for PacBio data (calcuts+pbcstats)
      • param-file “PAF input file”: Reads mapped to contigs
      • In “Calcuts options”:
        • “Upper bound for read depth”: 63 (the previously estimated maximum depth)
        • “Ploidity”: Haploid

    comment Comment

    In the case you are working with a diploid orgasm, you should select diploid in the ploidity option. It will generate three outputs: the base-level coverage file (PBCSTAT base coverage), the cutoff file (calcuts cutoff) and a histogram plot.

  6. purge_dups Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Function mode”: split assembly FASTA file by 'N's (split_fa)
      • param-file “Assembly FASTA file”: Primary contig FASTA
  7. Rename the output as Split FASTA

  8. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: Split FASTA
    • “Single or Paired-end reads: Single
      • param-file “Select fastq dataset”: Split FASTA
      • “Select a profile of preset options”: Construct a self-homology map - use the same genome as query and reference (-DP -k19 -w 19 -m200) (self-homology)
    • In “Set advanced output options”:
      • “Select an output format”: PAF
  9. Rename the output as Self-homology map

  10. purge_dups Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy5 with the following parameters:
    • “Select the purge_dups function”: Purge haplotigs and overlaps for an assembly (purge_dups)
    • param-file “PAF input file”: Self-homology map
    • param-file “Base-level coverage file”: PBCSTAT base coverage (output of the fifth step)
    • param-file “Cutoffs file”: calcuts cutoff (output of the fifth step)
  11. purge_dups Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: Obtain sequences after purging (get_seqs)
    • param-file “Assembly FASTA file”: Primary contig FASTA
    • param-file BED input file”: purge_dups BED (output of the previous step)

Second assembly evaluation assembly evaluation

Once we have purged the duplications, let’s evaluate the assembly again.

hands_on Hands-on: assembly evaluation with Quast

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: Yes, specify custom names
    • In “1. Contigs/scaffolds”:
      • param-file “Contigs/scaffolds file”: Primary contig FASTA
      • “Name”: Primary assembly
    • Click in “Insert Contigs/scaffolds”
    • In “2. Contigs/scaffolds”:
      • param-file “Contigs/scaffolds file”: Alternate contig FASTA
      • “Name”: Alternate assembly
    • Reads options”: Pacbio SMRT reads
      • param-collection “FASTQ file”: HiFi collection (trim)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: 12664060 (previously estimated)
      • “Type of organism”: Eukaryote: use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding (--eukaryote)
    • “Is genome large (>100Mpb)?”: No
  2. Rename the HTML report as QUAST second report

hands_on Hands-on: assessing assembly completness with BUSCO

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-files “Sequences to analyse”: Primary contig FASTA and Alternate contig FASTA
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Auto-detect or select lineage?”: Select lineage
      • “Lineage”: Saccharomycetes
    • In “Advanced Options”:
      • “Which outputs should be generated”: short summary text

    comment Comment

    Remember to modify the lineage option if you are working with vertebrate genomes.

  2. Rename the summary as BUSCO initial report

Sub-step with Concatenate datasets

hands_on Hands-on: Task description

  1. Concatenate datasets Tool: cat1 with the following parameters:
    • param-file “Concatenate Dataset”: get_seqs_hap (output of Purge overlaps tool)
    • In “Dataset”:
      • param-repeat “Insert Dataset”
        • param-file “Select”: out_fa (output of GFA to FASTA tool)

    TODO: Check parameter descriptions

    TODO: Consider adding a comment or tip box

    comment Comment

    A comment about the tool or something else. This box can also be in the main text

Hybrid scaffolding based on phased assembly and Bionano data

Sub-step with Bionano Hybrid Scaffold

hands_on Hands-on: Task description

  1. Bionano Hybrid Scaffold Tool: toolshed.g2.bx.psu.edu/repos/bgruening/bionano_scaffold/bionano_scaffold/3.6.1+galaxy2 with the following parameters:
    • param-file “NGS FASTA”: output (Input dataset)
    • param-file “BioNano CMAP”: output (Input dataset)
    • “Configuration mode”: VGP mode
    • param-file “Conflict resolution file”: output (Input dataset)

    TODO: Check parameter descriptions

    TODO: Consider adding a comment or tip box

    comment Comment

    A comment about the tool or something else. This box can also be in the main text

TODO: Consider adding a question to test the learners understanding of the previous exercise

question Questions

  1. Question1?
  2. Question2?

solution Solution

  1. Answer for question1
  2. Answer for question2

Sub-step with Concatenate datasets

hands_on Hands-on: Task description

  1. Concatenate datasets Tool: cat1 with the following parameters:
    • param-file “Concatenate Dataset”: ngs_contigs_scaffold_trimmed (output of Bionano Hybrid Scaffold tool)
    • In “Dataset”:
      • param-repeat “Insert Dataset”
        • param-file “Select”: ngs_contigs_not_scaffolded_trimmed (output of Bionano Hybrid Scaffold tool)

    TODO: Check parameter descriptions

    TODO: Consider adding a comment or tip box

    comment Comment

    A comment about the tool or something else. This box can also be in the main text

TODO: Consider adding a question to test the learners understanding of the previous exercise

question Questions

  1. Question1?
  2. Question2?

solution Solution

  1. Answer for question1
  2. Answer for question2

Sub-step with Parse parameter value

hands_on Hands-on: Task description

  1. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (Input dataset)
    • “Select type of parameter to parse”: Integer

    TODO: Check parameter descriptions

    TODO: Consider adding a comment or tip box

    comment Comment

    A comment about the tool or something else. This box can also be in the main text

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: out_file1 (output of Concatenate datasets tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: {'id': 7, 'output_name': 'integer_param'}
      • “Type of organism”: Eukaryote (--eukaryote): use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes

    TODO: Check parameter descriptions

    TODO: Consider adding a comment or tip box

    comment Comment

    A comment about the tool or something else. This box can also be in the main text

Hybrid scaffolding based on a phased assembly and HiC mapping data

TODO: need to re-name a lot of the inputs and outputs here. They have been auto-generated from the workflow but I think we want people to be able to run this step by step. I’ve taken out some of the steps that are “parse parameter value” etc.

In this section we map HiC reads to scaffold the genome assembly. In HiC sequencing, parts of the genome that are close together are artificially joined. A DNA fragment is then sequenced from each end of this artificial junction, giving a read pair. If reads from this read pair map to two contigs, it indicates that those two contigs are close together in the genome. A good short video showing the HiC process is here: https://youtu.be/-MxEw3IXUWU

TODO: add image here of HiC

Inputs required for this section:

  • An assembly FASTA file. This can be the output of the phased assembly section, and/or the output of the Bionano scaffolding section.
  • HiC reads, one set of forward reads and one set of reverse reads. If there is more than one set of Hi-C pair-read datasets, concatenate all the forward reads into one file, and the reverse reads into another file, in the same order.
  • Genome size estimate: we can get this from an earlier step using GenomeScope. This is the haploid length.

A summary of the 5 steps in this section:

  • Map the HiC reads to the assembly
  • View a contact map of HiC reads against the assembly, before scaffolding
  • Scaffold the assembly with HiC reads: using the assembly file, and the mapped HiC reads
  • Evaluate the scaffolding results: use Busco and Quast
  • View a contact map of the HiC reads after scaffolding

Outputs from this section:

  • A scaffolded assembly FASTA file
  • contact maps of HiC reads pre- and post scaffolding
  • post-scaffolding reports from Busco and Quast

A simplified image of the workflow for this section (not showing Quast and Busco):

Hic-wf-summary.
Figure 8: HiC workflow

1. Map the HiC reads to the assembly

We will do this separately for the forward and reverse set of HiC reads. We have to do this separately because these are not standard paired-end reads.

Map the forward HiC reads:

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: output (Input dataset)
    • “Single or Paired-end reads: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode
    • “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Map the reverse HiC reads:

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: output (Input dataset)
    • “Single or Paired-end reads: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode
    • “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Merge the mapped reads:

Now we will merge these two BAM files:

hands_on Hands-on: Task description

  1. Filter and merge Tool: toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy0 with the following parameters:
    • param-file “First set of reads: bam_output (output of Map with BWA-MEM tool)
    • param-file “Second set of reads: bam_output (output of Map with BWA-MEM tool)

Convert the mapped BAM file to a BED file:

hands_on Hands-on: Task description

  1. bedtools BAM to BED Tool: toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.30.0+galaxy1 with the following parameters:
    • param-file “Convert the following BAM file to BED: outfile (output of Filter and merge tool)
    • “What type of BED output would you like”: Create a full, 12-column "blocked" BED file

Sort the BED file:

hands_on Hands-on: Task description

  1. Sort Tool: sort1 with the following parameters:
    • param-file “Sort Dataset”: output (output of bedtools BAM to BED tool)
    • “on column”: c4
    • “with flavor”: Alphabetical sort
    • “everything in”: Ascending order

2. View a contact map of the mapped HiC reads

Most of the paired reads from HiC will map to the same (or nearby) contigs. On a graph, with ordered contigs on each axis, a lot of the contacts will be along the diagonal (mapping to self), or nearby (around that diagonal line). But some may be in odd places - for example, showing a lot of reads mapped to both contig 4 and contig 19. We will now generate a contact map of the assembly before it is scaffolded, to compare to the contact map after scaffolding.

Generate a contact map:

hands_on Hands-on: Task description

  1. PretextMap Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.6+galaxy0 with the following parameters:
    • param-file “Input dataset in SAM or BAM format”: outfile (output of Filter and merge tool)
    • “Sort by”: Don't sort

Convert the map to an image:

hands_on Hands-on: Task description

  1. Pretext Snapshot Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy0 with the following parameters:
    • param-file “Input Pretext map file”: pretext_map_out (output of PretextMap tool)
    • “Output image format”: png
    • “Show grid?”: Yes

TODO: explain the output here. What does it mean. What does this show about our data/assembly so far (e.g. do the contigs look fairly well ordered, or not).

3. Salsa scaffolding

Files required: The assembly file (optional: and the assembly graph), the sorted BED file, and the restriction enzyme sequence from the HiC sequencing. If you are using VGP GenomeArk data, you can get this information from the same file as the HiC reads, in a file called re_bases.txt.

Prepare the assembly file:

hands_on Hands-on: Task description

  1. Replace Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3 with the following parameters:
    • param-file “File to process”: output (Input dataset)
    • “Find pattern”: :
    • “Replace all occurences of the pattern”: Yes
    • “Find and Replace text in”: entire line

SALSA scaffolding:

hands_on Hands-on: Task description

  1. SALSA Tool: toolshed.g2.bx.psu.edu/repos/iuc/salsa/salsa/2.3+galaxy0 with the following parameters:
    • param-file “Initial assembly file”: outfile (output of Replace tool)
    • param-file Bed alignment”: out_file1 (output of Sort tool)
    • param-file “Sequence graphs”: output (Input dataset)
    • “Restriction enzyme sequence(s)”: add the enzyme sequence(s) here

4. Evaluate the Salsa scaffolding results

The scaffolded assembly fasta file can then be analysed in Busco and Quast.

Busco:

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.2.2+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: scaffolds_fasta (output of SALSA tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: ``
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``

There are four outputs: short summary, summary as an image, and two tables (full results and missing buscos).

TODO: explain what these outputs mean; are the results “good” ?

Quast:

Inputs required for Quast: scaffolded assembly file from Salsa, and estimated genome size. The estimated genome size is obtained from an earlier step with GenomeScope.

Run Quast:

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: scaffolds_fasta (output of SALSA tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: enter estimated genome size
      • “Type of organism”: Eukaryote (--eukaryote): use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding
    • “Is genome large (> 100 Mbp)?”: Yes
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes

There are four outputs: the Quast report in three formats, and a log file.

TODO: explain what these outputs mean; are the results “good” ?

5. Generate a post-scaffolding contact map

There are five steps:

  • Map the forward HiC reads to the scaffolded assembly
  • Map the reverse HiC reads to the scaffolded assembly
  • Combine these bam files into a single file
  • Generate a contact map
  • Conver the map to an image

Map the forward HiC reads:

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: scaffolds_fasta (output of SALSA tool)
    • “Single or Paired-end reads: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode

Map the reverse HiC reads:

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: scaffolds_fasta (output of SALSA tool)
    • “Single or Paired-end reads: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode

Merge the mapped reads:

hands_on Hands-on: Task description

  1. Filter and merge Tool: toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy0 with the following parameters:
    • param-file “First set of reads: bam_output (output of Map with BWA-MEM tool)
    • param-file “Second set of reads: bam_output (output of Map with BWA-MEM tool)

Generate a contact map:

hands_on Hands-on: Task description

  1. PretextMap Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.6+galaxy0 with the following parameters:
    • param-file “Input dataset in SAM or BAM format”: outfile (output of Filter and merge tool)
    • “Sort by”: Don't sort

Convert the map to an image:

hands_on Hands-on: Task description

  1. Pretext Snapshot Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy0 with the following parameters:
    • param-file “Input Pretext map file”: pretext_map_out (output of PretextMap tool)
    • “Output image format”: png
    • “Show grid?”: Yes

TODO: explain the output here. What does the pretext map show. How does it compare to the pre-scaffolding map.

TODO: overall, explain what the scaffolding section results mean. What are the next possible steps.

TODO: Consider adding a question to test the learners understanding of the previous exercise

question Questions

  1. Question1?
  2. Question2?

solution Solution

  1. Answer for question1
  2. Answer for question2

TODO: Consider adding a question to test the learners understanding of the previous exercise

question Questions

  1. Question1?
  2. Question2?

solution Solution

  1. Answer for question1
  2. Answer for question2

Conclusion

Sum up the tutorial and the key takeaways here. We encourage adding an overview image of the pipeline used.

Key points

  • The take-home messages

  • They will appear at the end of the tutorial

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Assembly topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

References

  1. Marcotte, E. M., M. Pellegrini, T. O. Yeates, and D. Eisenberg, 1999 A census of protein repeats. Journal of Molecular Biology 293: 151–160. 10.1006/jmbi.1999.3136
  2. Frenkel, S., V. Kirzhner, and A. Korol, 2012 Organizational Heterogeneity of Vertebrate Genomes (V. Laudet, Ed.). PLoS ONE 7: e32076. 10.1371/journal.pone.0032076
  3. Chalopin, D., M. Naville, F. Plard, D. Galiana, and J.-N. Volff, 2015 Comparative Analysis of Transposable Elements Highlights Mobilome Diversity and Evolution in Vertebrates. Genome Biology and Evolution 7: 567–580. 10.1093/gbe/evv005
  4. Sotero-Caio, C. G., R. N. Platt, A. Suh, and D. A. Ray, 2017 Evolution and Diversity of Transposable Elements in Vertebrate Genomes. Genome Biology and Evolution 9: 161–177. 10.1093/gbe/evw264
  5. Vurture, G. W., F. J. Sedlazeck, M. Nattestad, C. J. Underwood, H. Fang et al., 2017 GenomeScope: fast reference-free genome profiling from short reads (B. Berger, Ed.). Bioinformatics 33: 2202–2204. 10.1093/bioinformatics/btx153
  6. Angel, V. D. D., E. Hjerde, L. Sterck, S. Capella-Gutierrez, C. Notredame et al., 2018 Ten steps to get started in Genome Assembly and Annotation. F1000Research 7: 148. 10.12688/f1000research.13598.1
  7. Li, H., 2018 Minimap2: pairwise alignment for nucleotide sequences (I. Birol, Ed.). Bioinformatics 34: 3094–3100. 10.1093/bioinformatics/bty191
  8. Roach, M. J., S. A. Schmidt, and A. R. Borneman, 2018 Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19: 10.1186/s12859-018-2485-7
  9. Guan, D., S. A. McCarthy, J. Wood, K. Howe, Y. Wang et al., 2019 Identifying and removing haplotypic duplication in primary genome assemblies. 10.1101/729962
  10. Tørresen, O. K., B. Star, P. Mier, M. A. Andrade-Navarro, A. Bateman et al., 2019 Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Research 47: 10994–11006. 10.1093/nar/gkz841
  11. Giani, A. M., G. R. Gallo, L. Gianfranceschi, and G. Formenti, 2020 Long walk to genomics: History and current approaches to genome sequencing and assembly. Computational and Structural Biotechnology Journal 18: 9–19. 10.1016/j.csbj.2019.11.002
  12. Hon, T., K. Mars, G. Young, Y.-C. Tsai, J. W. Karalius et al., 2020 Highly accurate long-read HiFi sequencing data for five complex genomes. Scientific Data 7: 10.1038/s41597-020-00743-4
  13. Ranallo-Benavidez, T. R., K. S. Jaron, and M. C. Schatz, 2020 GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11: 10.1038/s41467-020-14998-3
  14. Rhie, A., B. P. Walenz, S. Koren, and A. M. Phillippy, 2020 Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21: 10.1186/s13059-020-02134-9
  15. Wang, H., B. Liu, Y. Zhang, F. Jiang, Y. Ren et al., 2020 Estimation of genome size using k-mer frequencies from corrected long reads. arXiv preprint arXiv:2003.11817.
  16. Zhang, X., R. Wu, Y. Wang, J. Yu, and H. Tang, 2020 Unzipping haplotypes in diploid and polyploid genomes. Computational and Structural Biotechnology Journal 18: 66–72. 10.1016/j.csbj.2019.11.011
  17. Cheng, H., G. T. Concepcion, X. Feng, H. Zhang, and H. Li, 2021 Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18: 170–175. 10.1038/s41592-020-01056-5
  18. Rhie, A., S. A. McCarthy, O. Fedrigo, J. Damas, G. Formenti et al., 2021 Towards complete and error-free genome assemblies of all vertebrate species. Nature 592: 737–746. 10.1038/s41586-021-03451-0

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Click here to load Google feedback frame

Citing this Tutorial

  1. Delphine Lariviere, Alex Ostrovsky, Cristóbal Gallardo, 2021 VGP assembly pipeline (Galaxy Training Materials). https://training.galaxyproject.org/archive/2022-01-01/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

details BibTeX

@misc{assembly-vgp_genome_assembly,
author = "Delphine Lariviere and Alex Ostrovsky and Cristóbal Gallardo",
title = "VGP assembly pipeline (Galaxy Training Materials)",
year = "2021",
month = "11",
day = "30"
url = "\url{https://training.galaxyproject.org/archive/2022-01-01/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                

Congratulations on successfully completing this tutorial!