VGP assembly pipeline

Overview

Questions:
  • What combination of tools can produce the highest quality assembly of vertebrate genomes?

  • How can we evaluate how good it is?

Objectives:
  • Learn the tools necessary to perform a de novo assembly of a vertebrate genome

  • Evaluate the quality of the assembly

Requirements:
Time estimation: 2 hours
Supporting Materials:
Last modification: Sep 12, 2021
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

An assembly can be defined as a hierarchical data structure that maps the sequence data to a putative reconstruction of the target (Miller et al. 2010). Advances in sequencing technologies over the last few decades have revolutionised the field of genomics, reducing both the time and the resources required to carry out de novo genome assembly. Until recently, sequencing technologies produced either short, highly accurate reads or long, error-prone reads. In recent years, however, third-generation sequencing technologies, usually known as real-time single-molecule sequencing, have become dominant in de novo assembly of large genomes. These technologies sequence native DNA fragments instead of amplified templates, avoiding copying errors, sequence-dependent biases and information losses (Hon et al. 2020). One example is PacBio single-molecule high-fidelity (HiFi) sequencing, which yields average read lengths of 10-20 kb with average sequence identity greater than 99%; it is one of the technologies used to generate the data for this tutorial.

Deciphering the structural organisation of complex vertebrate genomes is currently one of the most important problems in genomics (Frenkel et al. 2012). However, despite the great progress made in recent years, a key question remains to be answered: what combination of data and tools can produce the highest quality assembly? In order to adequately answer this question, it is necessary to analyse two of the main factors that determine the difficulty of genome assembly processes: repetitive sequences and heterozygosity.

Repetitive sequences can be grouped into two categories: interspersed repeats, such as transposable elements (TE), which occur at multiple loci throughout the genome, and tandem repeats (TR), which occur at a single locus (Tørresen et al. 2019). Repetitive sequences, and TE in particular, are an important component of eukaryotic genomes, constituting more than a third of the genome in the case of mammals (Sotero-Caio et al. 2017, Chalopin et al. 2015). In the case of tandem repeats, various estimates suggest that they are present in at least one third of human protein sequences (Marcotte et al. 1999). TE content is probably the main factor contributing to fragmented genomes, especially in the case of large genomes, as it is highly correlated with genome size (Sotero-Caio et al. 2017). On the other hand, TR usually lead to local genome assembly collapse and partial or complete loss of genes, especially when the read length of the sequencing method is shorter than the TR (Tørresen et al. 2019).

In the case of heterozygosity, haplotype phasing, that is, the identification of alleles that are co-located on the same chromosome, has become a fundamental problem in heterozygous and polyploid genome assemblies (Zhang et al. 2020). A common strategy to overcome these difficulties is to remap genomes to a single haplotype, which represents the whole genome. This approach is useful for highly inbred samples that are nearly homozygous, but when applied to highly heterozygous genomes, such as those of aquatic organisms, it misses potential differences in sequence, structure, and gene presence, usually leading to ambiguities and redundancies in the initial contig-level assemblies (Angel et al. 2018, Zhang et al. 2020).

To address these problems, the G10K consortium launched the Vertebrate Genomes Project (VGP), whose goal is to generate high-quality, near-error-free, gap-free, chromosome-level, haplotype-phased, annotated reference genome assemblies for each of the vertebrate species currently present on planet Earth (Rhie et al. 2021). The protocol proposed in this tutorial, the VGP assembly pipeline, is the result of years of study and analysis of the available tools and data sources.

Agenda

In this tutorial, we will cover:

  1. VGP assembly pipeline overview
    1. Background on datasets
  2. Get data
  3. Data quality assessment
  4. Genome profile analysis
    1. Generation of k-mer spectra with Meryl
    2. Genome profiling with GenomeScope2
  5. HiFi long read phased assembly with Hifiasm
    1. Sub-step with Hifiasm
    2. Sub-step with GFA to FASTA: primary assembly
    3. Sub-step with Map with minimap2
    4. Sub-step with Purge overlaps
    5. Sub-step with Purge overlaps
    6. Sub-step with Map with minimap2
    7. Sub-step with Purge overlaps
    8. Sub-step with Purge overlaps
    9. Sub-step with Purge overlaps
    10. Sub-step with Quast
    11. Sub-step with Busco
    12. Sub-step with Busco
    13. Sub-step with GFA to FASTA: alternate assembly
    14. Sub-step with Merqury
    15. Sub-step with Concatenate datasets
    16. Sub-step with Purge overlaps
    17. Sub-step with Map with minimap2
    18. Sub-step with Purge overlaps
    19. Sub-step with Purge overlaps
    20. Sub-step with Purge overlaps
    21. Sub-step with Quast
    22. Sub-step with Busco
    23. Sub-step with Merqury
  6. Hybrid scaffolding based on phased assembly and Bionano data
    1. Sub-step with Bionano Hybrid Scaffold
    2. Sub-step with Concatenate datasets
    3. Sub-step with Parse parameter value
    4. Sub-step with Quast
  7. Hybrid scaffolding based on a phased assembly and HiC mapping data
    1. Sub-step with Map with BWA-MEM
    2. Sub-step with Map with BWA-MEM
    3. Sub-step with Filter and merge
    4. Sub-step with PretextMap
    5. Sub-step with Pretext Snapshot
    6. Sub-step with Parse parameter value
    7. Sub-step with bedtools BAM to BED
    8. Sub-step with Sort
    9. Sub-step with Replace
    10. Sub-step with Parse parameter value
    11. Sub-step with SALSA
    12. Sub-step with Parse parameter value
    13. Sub-step with Quast
    14. Sub-step with Busco
    15. Sub-step with Map with BWA-MEM
    16. Sub-step with Map with BWA-MEM
    17. Sub-step with Filter and merge
    18. Sub-step with PretextMap
    19. Sub-step with Pretext Snapshot

VGP assembly pipeline overview

Figure 1 shows the VGP assembly pipeline.

fig1:VGP pipeline
Figure 1: VGP Pipeline 2.0

In order to facilitate the development of the workflow, we will structure it in four main sections:

  • Genome profile analysis
  • HiFi long read phased assembly with Hifiasm
  • Hybrid scaffolding based on phased assembly and Bionano data
  • Hybrid scaffolding based on a phased assembly and Hi-C mapping data

Background on datasets

In order to reduce processing times, we will use samples from the fungus Saccharomyces cerevisiae for this training. The VGP assembly pipeline requires datasets generated by three different technologies: PacBio HiFi reads, Bionano optical maps, and Hi-C chromatin interaction maps.

PacBio HiFi reads rely on Single Molecule Real-Time (SMRT) sequencing technology. It is based on real-time imaging of fluorescently tagged nucleotides as they are incorporated along individual DNA template molecules; multiple subreads of the same circular template are then combined using a statistical model to produce one highly accurate consensus sequence, along with base quality values (figure 2). This technology generates long-read sequencing datasets with read lengths averaging 10-25 kb and accuracies greater than 99.5%.

fig2:PacBio sequencing technology
Figure 2: PacBio HiFi sequencing

Optical genome mapping is a method for detecting structural variants. The generation of Bionano optical maps starts with high molecular weight DNA, which is labeled at specific sequence motifs with a fluorescent dye, resulting in a unique fluorescence pattern for each individual genome. Comparing the labelled fragments among different samples enables the detection of structural variants. Optical maps are integrated with the primary assembly sequence in order to identify and correct potential chimeric joins and to estimate gap sizes.

The high-throughput chromosome conformation capture (Hi-C) technology is based on the capture of the chromatin conformation, enabling the identification of topological domains. Hi-C methods first crosslink the chromatin in its 3D conformation. The crosslinked DNA is digested using restriction enzymes, and the digested ends are filled with biotinylated nucleotides. Next, the blunt ends of spatially proximal digested ends are ligated, preserving the chromosome interaction regions. Finally, the DNA is purified to ensure that only fragments originating from ligation events are sequenced.

Get data

hands_on Hands-on: Data upload

  1. Create a new history for this tutorial
  2. Import the files from Zenodo

    • Open the file galaxy-upload upload menu
    • “Upload data as”: Datasets
    • Copy the tabular data, paste it into the textbox and press Build
      SRR7126301_1.fastq.gz   https://zenodo.org/record/5383832/files/SRR7126301_1.fastq.gz   fastqsanger.gz
      SRR7126301_2.fastq.gz   https://zenodo.org/record/5383832/files/SRR7126301_2.fastq.gz   fastqsanger.gz
      SRR13577846.fastq.gz    https://zenodo.org/record/5383832/files/SRR13577846.30x.wgaps.fastq.gz  fastqsanger.gz
      bionano.cmap    https://zenodo.org/record/5383832/files/bionano.cmap    cmap
    
    • From Rules menu select Add / Modify Column Definitions
      • Click Add Definition button and select Name: column A
      • Click Add Definition button and select URL: column B
      • Click Add Definition button and select Type: column C
    • Click Apply and press Upload
  3. Add a tag to each dataset

    Tip: Adding a tag

    • Click on the dataset
    • Click on galaxy-tags Edit dataset tags
    • Add a tag starting with #

      Tags starting with # will be automatically propagated to the outputs of tools using this dataset.

    • Check that the tag is appearing below the dataset name

Data quality assessment

To begin our analysis, we will evaluate and pre-process our data in order to identify potential inconsistencies and other anomalies and, if any are found, correct them. To obtain a general overview of our datasets, we are going to use FastQC, an open-source tool that provides a simple way to perform quality control on raw sequence data.

hands_on Hands-on: Quality check

  1. Run FastQC tool with the following parameters
    • param-files “Raw read data from your current history”: SRR13577846.fastq.gz
  2. Inspect the generated HTML files

Next, we will remove the adapters using Cutadapt.

hands_on Hands-on: Adapter removal

  1. Cutadapt Tool: toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/3.4 with the following parameters:
    • “Single-end or Paired-end reads?”: Single-end
      • param-collection “FASTQ/A file”: output (Input dataset collection)
      • In “Read 1 Options”:
        • In “5’ or 3’ (Anywhere) Adapters”:
          • param-repeat “Insert 5’ or 3’ (Anywhere) Adapters”
            • “Source”: Enter custom sequence
              • “Enter custom 5’ or 3’ adapter sequence”: ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT
          • param-repeat “Insert 5’ or 3’ (Anywhere) Adapters”
            • “Source”: Enter custom sequence
              • “Enter custom 5’ or 3’ adapter sequence”: ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT
    • In “Adapter Options”:
      • “Match times”: 3
      • “Maximum error rate”: 0.1
      • “Minimum overlap length”: 35
      • “Look for adapters in the reverse complement”: True
    • In “Filter Options”:
      • “Discard Trimmed Reads”: Yes
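
To make the effect of these settings more concrete, the following sketch checks that the two custom adapter sequences are reverse complements of each other, and mimics the “Discard Trimmed Reads” behaviour on a list of reads. It is a simplified illustration, not Cutadapt's algorithm: exact substring matching stands in for Cutadapt's error-tolerant alignment, and the function names are hypothetical.

```python
# Illustrative sketch only: exact matching stands in for Cutadapt's
# error-tolerant alignment, and all function names are hypothetical.
ADAPTER_1 = "ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT"
ADAPTER_2 = "ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def contains_adapter(read: str, adapters=(ADAPTER_1, ADAPTER_2),
                     min_overlap: int = 35) -> bool:
    """True if any min_overlap-long window of an adapter occurs exactly in the read."""
    for adapter in adapters:
        for start in range(len(adapter) - min_overlap + 1):
            if adapter[start:start + min_overlap] in read:
                return True
    return False


def discard_adapter_reads(reads):
    """Keep only the reads in which no adapter was found."""
    return [read for read in reads if not contains_adapter(read)]


# The second adapter is the reverse complement of the first one.
print(reverse_complement(ADAPTER_1) == ADAPTER_2)
```

Because HiFi reads that still contain adapter sequence are removed entirely rather than trimmed, the output of this step contains fewer reads than the input.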

Genome profile analysis

An important step before starting a de novo genome assembly project is to proceed with the analysis of the genome profile. Determining these characteristics in advance has the potential to reveal whether an analysis is not reflecting the full complexity of the genome, for example, if the number of variants is underestimated or a significant fraction of the genome is not assembled (Vurture et al. 2017).

Traditionally, DNA flow cytometry was considered the gold standard for estimating genome size, one of the most important factors in determining the required coverage level. Nowadays, however, experimental methods have been replaced by computational approaches (Wang et al. 2020). One of the most widely used procedures for genome profiling is the analysis of k-mer frequencies. It provides information not only about genomic complexity, such as genome size, levels of heterozygosity and repeat content, but also about data quality. In addition, k-mer spectra analysis can be used in a reference-free manner for assessing genome assembly quality metrics (Rhie et al. 2020).

details K-mer size, sequencing coverage and genome size

K-mers are substrings of length k contained within a DNA sequence. For example, the DNA sequence TCGATCACA can be decomposed into five unique k-mers that are five bases long: TCGAT, CGATC, GATCA, ATCAC and TCACA. A sequence of length L will have L-k+1 k-mers. On the other hand, the number of possible k-mers can be calculated as n^k, where n is the number of possible monomers and k is the k-mer size.

Bases   K-mer size   Total possible k-mers
4       1            4
4       2            16
4       3            64
4       4            256
4       …            …
4       10           1,048,576
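
As a quick check of both rules, the short sketch below enumerates the 5-mers of the example sequence and computes the number of possible k-mers for a four-letter alphabet. The function name is ours and only serves the illustration.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "TCGATCACA"
five_mers = kmers(seq, 5)
print(five_mers)

# A sequence of length L contains L - k + 1 k-mers.
assert len(five_mers) == len(seq) - 5 + 1

# With n = 4 possible bases there are 4**k possible k-mers.
possible = {k: 4 ** k for k in (1, 2, 3, 4, 10)}
print(possible)
```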

Thus, the k-mer size is a key parameter: it must be chosen to ensure as many unique k-mers as possible while avoiding unnecessary waste of computational resources. In the case of the human genome, k-mers of 31 bases in length lead to 96.96% of unique k-mers.

Each unique k-mer can be assigned a value for coverage based on the number of times it occurs in a sequence, whose distribution will approximate a Poisson distribution, with the peak corresponding to the average genome sequencing depth. From the genome coverage, the genome size can be easily computed.
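
The relationship between coverage and genome size can be sketched as follows: the total number of k-mers observed in the reads, divided by the depth at the main peak, approximates the genome size. This is a deliberately naive estimate, not what GenomeScope2 does; the `min_depth` cutoff used to ignore error k-mers and the toy histogram are assumptions made for the example.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> float:
    """Naive genome size estimate from a k-mer histogram.

    histogram maps a multiplicity (depth) to the number of distinct k-mers
    seen that many times. Depths below min_depth are treated as sequencing
    errors and ignored (an assumption of this sketch).
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    peak_depth = max(solid, key=solid.get)           # depth of the main peak
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram: many low-depth error k-mers and a peak at depth 10.
toy_histogram = {1: 5000, 2: 300, 9: 50, 10: 100, 11: 50}
print(estimate_genome_size(toy_histogram))
```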

In this section we will use two basic tools to computationally estimate the genome features: Meryl and GenomeScope.

Generation of k-mer spectra with Meryl

Meryl will allow us to perform the k-mer profiling by decomposing the sequencing data into substrings of length k and determining their frequencies. The original version was developed for use in the Celera Assembler, and it comprises three modules: one for generating k-mer databases, one for filtering and combining databases, and one for searching databases. The k-mer database is stored in sorted order, similar to words in a dictionary (Rhie et al. 2020).

comment K-mer size estimation

One of the important aspects is the size of the k-mer, which must be large enough to map uniquely to the genome, but not too large, since it can lead to wasting computational resources. Given an estimated genome size (G) and a tolerable collision rate (p), an appropriate k can be computed as k = log4 (G(1 − p)/p).
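
The formula can be evaluated directly. In the sketch below, the function name and the default collision rate of 0.001 are our own choices, and the genome sizes (about 3.1 Gb for human, about 12 Mb for S. cerevisiae) are approximate figures used only as examples.

```python
import math


def best_k(genome_size: float, collision_rate: float = 0.001) -> float:
    """Evaluate k = log4(G * (1 - p) / p) for genome size G and collision rate p."""
    return math.log(genome_size * (1 - collision_rate) / collision_rate, 4)


for label, size in (("human, ~3.1 Gb", 3.1e9), ("yeast, ~12 Mb", 12e6)):
    k = best_k(size)
    print(f"{label}: k = {k:.2f}, rounded up to {math.ceil(k)}")
```

Under these assumptions the human genome gives a value just under 21, while the much smaller yeast genome gives a value just under 17.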

hands_on Hands-on: Generate k-mers count distribution

  1. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy2 with the following parameters:
    • “Operation type selector”: Count operations
      • “Count operations”: Count: count the ocurrences of canonical k-mers
      • param-collection “Input sequences”: SRR13577846.fastq.gz
      • “K-mer size selector”: Set a k-mer size
        • “K-mer size”: 21

    comment Election of k-mer size

    We used 21 as the k-mer size, as this length has been shown to be sufficiently long that most k-mers are not repetitive, and short enough that the analysis is more robust to sequencing errors. For extremely large (haploid size over 10 Gb) and/or very repetitive genomes, it is recommended to use larger k-mer lengths to increase the number of unique k-mers.

  2. Rename it Collection meryldb

  3. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy1 with the following parameters:
    • “Operation type selector”: Operations on sets of k-mers
      • “Operations on sets of k-mers”: Union-sum: return k-mers that occur in any input, set the count to the sum of the counts
      • param-file “Input meryldb”: Collection meryldb
  4. Rename it as Merged meryldb

  5. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy0 with the following parameters:
    • “Operation type selector”: Generate histogram dataset
      • param-file “Input meryldb”: Merged meryldb
  6. Finally, rename it as Meryldb histogram.
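
Conceptually, the three Meryl operations above (count, union-sum and histogram) can be sketched in plain Python. This is only an illustration of the logic on tiny inputs: the function names are ours, and Meryl's actual implementation and on-disk database format are very different.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_kmers(reads, k: int) -> Counter:
    """Count: occurrences of canonical k-mers in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def union_sum(databases) -> Counter:
    """Union-sum: k-mers present in any input, with their counts added up."""
    merged = Counter()
    for database in databases:
        merged.update(database)
    return merged


def histogram(database: Counter):
    """Histogram: (multiplicity, number of distinct k-mers) pairs, sorted."""
    return sorted(Counter(database.values()).items())


merged = union_sum([count_kmers(["ACGTA"], 3), count_kmers(["TAC"], 3)])
print(histogram(merged))
```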

Genome profiling with GenomeScope2

The next step is to computationally infer the genome properties from the k-mer count distribution generated by Meryl, for which we’ll use GenomeScope2. It relies on a nonlinear least-squares optimization to fit a mixture of negative binomial distributions, generating estimated values for genome size, repetitiveness, and heterozygosity rates (Ranallo-Benavidez et al. 2020).

hands_on Hands-on: Estimate genome properties

  1. GenomeScope Tool: toolshed.g2.bx.psu.edu/repos/iuc/genomescope/genomescope/2.0 with the following parameters:
    • param-file “Input histogram file”: Meryldb histogram
    • “K-mer length used to calculate k-mer spectra”: 21
    • In “Output options”: mark Generate a file with the model parameters and Summary of the analysis

GenomeScope will generate six outputs:

  • Plots
    • Linear plot: K-mer spectra and fitted models: frequency (y-axis) versus coverage (x-axis).
    • Log plot: logarithmic transformation of the previous plot.
    • Transformed linear plot: K-mer spectra and fitted models: frequency times coverage (y-axis) versus coverage (x-axis). It increases the heights of higher-order peaks, overcoming the effect of high heterozygosity.
    • Transformed log plot: logarithmic transformation of the previous plot.
  • Model: this file includes a detailed report about the model fitting.
  • Summary: it includes the properties inferred from the model, such as genome haploid length and the percentage of heterozygosity.

Now, let’s analyze the k-mer profiles, fitted models and estimated parameters:

fig3:Genomescope plot
Figure 3: Genomescope2 plot

As we can see, there is a single peak centered around 28, which is the coverage with the highest number of different 21-mers. According to the normal-like k-mer spectrum, we can infer that it is a haploid genome. The large number of unique k-mers on the left side, with frequency around one, is due to errors during the sequencing process.
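
The position of the main peak can also be located programmatically. The sketch below uses a simple heuristic (walk down the error slope to the first valley, then take the highest point after it); it is not the model fitting performed by GenomeScope2, and the toy histogram values are invented for the example.

```python
def main_peak(histogram: list[tuple[int, int]]) -> int:
    """Return the multiplicity of the main peak of a k-mer histogram.

    histogram holds (multiplicity, number of distinct k-mers) pairs sorted
    by multiplicity. Low-multiplicity error k-mers are skipped by walking
    down the initial slope until the counts stop decreasing.
    """
    counts = [n for _, n in histogram]
    valley = 0
    while valley + 1 < len(counts) and counts[valley + 1] < counts[valley]:
        valley += 1
    depth, _ = max(histogram[valley:], key=lambda pair: pair[1])
    return depth


# Invented values: an error slope at low multiplicity and a peak at 28.
toy = [(1, 9000), (2, 900), (3, 100), (27, 4000), (28, 5000), (29, 4100)]
print(main_peak(toy))
```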

Before jumping to the next section, we need to carry out some operations on the output generated by GenomeScope2.

hands_on Hands-on: Get estimated genome size

  1. Replace Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3 with the following parameters:
    • param-file “File to process”: summary (output of GenomeScope tool)
    • “Find pattern”: ` bp`
    • “Replace all occurences of the pattern”: Yes
    • “Find and Replace text in”: entire line

  2. Replace Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3 with the following parameters:
    • param-file “File to process”: outfile (output of Replace tool)
    • “Find pattern”: ,
    • “Replace all occurences of the pattern”: Yes
    • “Find and Replace text in”: entire line

  3. Search in textfiles Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1 with the following parameters:
    • param-file “Select lines from”: outfile (output of Replace tool)
    • “Type of regex”: Basic
    • “Regular Expression”: Haploid

  4. Convert Tool: Convert characters1 with the following parameters:
    • param-file “in Dataset”: output (output of Search in textfiles tool)

  5. Advanced Cut Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-file “File to cut”: out_file1 (output of Compute tool)
    • “Cut by”: fields
      • “List of Fields”: cc7

  6. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (output of Advanced Cut tool)
    • “Select type of parameter to parse”: Integer
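
The text-processing steps above can be summarised in a few lines of Python. This is only a sketch: the layout of `example_summary` is an assumption about what a GenomeScope summary line looks like (the numbers are invented), the two Replace steps are assumed to replace the pattern with nothing, and taking the last field of the line stands in for the column selected by the Advanced Cut step.

```python
def haploid_length(summary_text: str) -> int:
    """Extract an integer genome size from a GenomeScope-like summary."""
    # Steps 1 and 2: strip the " bp" unit and the thousands separators.
    text = summary_text.replace(" bp", "").replace(",", "")
    for line in text.splitlines():
        # Step 3: keep the line that mentions the haploid length.
        if "Haploid" in line:
            # Steps 4 to 6: split into fields, pick one, parse it as an integer.
            return int(line.split()[-1])
    raise ValueError("no line containing 'Haploid' found")


# Hypothetical summary excerpt, used only to exercise the function.
example_summary = (
    "property                      min               max\n"
    "Genome Haploid Length         11,739,321 bp     11,748,263 bp\n"
)
print(haploid_length(example_summary))
```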

hands_on Hands-on: Task description

  1. Compute Tool: toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6 with the following parameters:
    • “Add expression”: 1.5*c3
    • param-file “as a new column to”: model_params (output of GenomeScope tool)
    • “Round result?”: Yes
    • “Input has a header line with column names?”: No

  2. Compute Tool: toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6 with the following parameters:
    • “Add expression”: 3*c7
    • param-file “as a new column to”: out_file1 (output of Compute tool)
    • “Round result?”: Yes
    • “Input has a header line with column names?”: No

  3. Advanced Cut Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-file “File to cut”: out_file1 (output of Compute tool)
    • “Cut by”: fields
      • “List of Fields”: cc7

  4. Advanced Cut Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-file “File to cut”: out_file1 (output of Compute tool)
    • “Cut by”: fields
      • “List of Fields”: cc7

  5. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (output of Advanced Cut tool)
    • “Select type of parameter to parse”: Integer

  6. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (output of Advanced Cut tool)
    • “Select type of parameter to parse”: Integer

HiFi long read phased assembly with Hifiasm

Sub-step with Hifiasm

hands_on Hands-on: Task description

  1. Hifiasm Tool: toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.14+galaxy0 with the following parameters:
    • “Assembly mode”: Standard
      • param-file “Input reads”: out1 (output of Cutadapt tool)
    • “Advanced options”: Leave default
    • “Assembly options”: Leave default
    • “Options for purging duplicates”: Specify

Sub-step with GFA to FASTA: primary assembly

hands_on Hands-on: Task description

  1. GFA to FASTA Tool: toolshed.g2.bx.psu.edu/repos/iuc/gfa_to_fa/gfa_to_fa/0.1.2 with the following parameters:
    • param-file “Input GFA file”: primary_contig_graph (output of Hifiasm tool)
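
For intuition about this conversion: in a GFA file, each segment is stored on a tab-separated line starting with `S`, followed by the segment name and its sequence. A minimal sketch of the conversion, with a hypothetical function name and made-up contig names, could look like this; it is not the code of the Galaxy tool.

```python
def gfa_to_fasta(gfa_text: str) -> str:
    """Turn the segment (S) lines of a GFA document into FASTA records."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # Segment lines look like: S <name> <sequence> [optional tags...]
        if fields[0] == "S" and len(fields) >= 3:
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")


# Made-up example with a header, two segments and one link line.
example_gfa = (
    "H\tVN:Z:1.0\n"
    "S\tptg000001l\tACGT\n"
    "L\tptg000001l\t+\tptg000002l\t-\t0M\n"
    "S\tptg000002l\tGGCC\n"
)
print(gfa_to_fasta(example_gfa), end="")
```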

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: out_fa (output of GFA to FASTA tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: integer_param (output of Parse parameter value tool)
      • “Type of organism”: Eukaryote (--eukaryote): use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes
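
The estimated genome size matters because NGx statistics are computed against it rather than against the assembly length. The sketch below shows the usual definition of N50 and NG50 on a made-up list of contig lengths; the function name is ours and this is not Quast's code.

```python
def nx(lengths, fraction: float = 0.5, reference_size=None):
    """Return the Nx (or NGx, if reference_size is given) contig length.

    Contigs are visited from longest to shortest; the result is the length
    of the contig at which the running total first reaches the requested
    fraction of the assembly size (Nx) or of the reference size (NGx).
    """
    total = reference_size if reference_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return None  # the assembly is too short to reach the threshold


contigs = [80, 70, 50, 40, 30, 20]           # made-up contig lengths
print("N50:", nx(contigs))
print("NG50:", nx(contigs, reference_size=400))
```

When the assembly is shorter than the estimated genome size, NG50 is smaller than N50, which makes it a stricter measure of contiguity.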

Sub-step with Map with minimap2

hands_on Hands-on: Task description

  1. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: out_fa (output of GFA to FASTA tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: out1 (output of Cutadapt tool)
      • “Select a profile of preset options”: Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)
    • In “Alignment options”:
      • “Customize spliced alignment mode?”: No, use profile setting or leave turned off
    • In “Set advanced output options”:
      • “Select an output format”: paf
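
The PAF output is a tab-separated text format whose first twelve columns are mandatory. As an aid to inspecting the alignments, the following sketch parses those columns into a named record; the class, the function names and the example line are hypothetical.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory PAF columns."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int        # number of matching bases in the alignment
    block_length: int   # total alignment block length
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse one PAF line, ignoring any optional tags after column 12."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def identity(record: PafRecord) -> float:
    """Fraction of matching bases over the alignment block."""
    return record.matches / record.block_length


# Invented alignment line with one optional tag at the end.
example = "read1\t1000\t0\t1000\t+\tcontig1\t50000\t200\t1200\t990\t1000\t60\ttp:A:P"
record = parse_paf_line(example)
print(record.target_name, identity(record))
```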

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: split FASTA file by 'N's
      • param-file “Base-level coverage file”: out_fa (output of GFA to FASTA tool)
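
Splitting by 'N's means cutting every sequence at its runs of unknown bases, so that each gap-free stretch becomes a separate record. A minimal sketch of this idea follows; the function name and the `name:start-end` naming of the pieces (1-based coordinates) are assumptions made for the illustration, not a description of the purge_dups output.

```python
import re


def split_by_n(name: str, sequence: str):
    """Yield (piece_name, piece_sequence) for each stretch without N bases."""
    for match in re.finditer(r"[^Nn]+", sequence):
        start, end = match.start(), match.end()
        # Name each piece after its 1-based coordinates in the original sequence.
        yield f"{name}:{start + 1}-{end}", match.group()


for piece_name, piece in split_by_n("scaf1", "ACGTNNNNGGCCNAT"):
    print(f">{piece_name}\n{piece}")
```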

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: create read depth histogram and base-level read depth for pacbio data
      • param-file “PAF input file”: alignment_output (output of Map with minimap2 tool)

Sub-step with Map with minimap2

hands_on Hands-on: Task description

  1. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.20+galaxy1 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: out_fa (output of GFA to FASTA tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: out1 (output of Cutadapt tool)
      • “Select a profile of preset options”: Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)
    • In “Alignment options”:
      • “Customize spliced alignment mode?”: No, use profile setting or leave turned off
    • In “Set advanced output options”:
      • “Select an output format”: PAF

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy0 with the following parameters:
    • “Select the purge_dups function”: purge haplotigs and overlaps for an assembly
      • param-file “PAF input file”: alignment_output (output of Map with minimap2 tool)
      • param-file “Base-level coverage file”: pbcstat_cov (output of Purge overlaps tool)
      • param-file “Cutoffs file”: calcuts_tab (output of Purge overlaps tool)
      • “Rounds of chaining”: 1 round

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: obtain seqeuences after purging
      • param-file “Fasta input file”: out_fa (output of GFA to FASTA tool)
      • param-file “Bed input file”: purge_dups_bed (output of Purge haplotigs tool)

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: obtain seqeuences after purging
      • param-file “Fasta input file”: out_fa (output of GFA to FASTA tool)
      • param-file “Bed input file”: purge_dups_bed (output of Purge haplotigs tool)

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: get_seqs_purged (output of Purge overlaps tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: integer_param (output of Parse parameter value tool)
      • “Type of organism”: Eukaryote (--eukaryote): use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes
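
Quast reports contiguity statistics such as N50, and the estimated genome size set in this step is what lets it report the NGx variants as well. As a minimal sketch of how these values are defined (this is not Quast's own code; the function name `nx` and the toy contig lengths are illustrative assumptions):

```python
def nx(lengths, fraction=0.5, total=None):
    """Return the Nx value for a list of contig lengths.

    Nx is the length L such that contigs of length >= L together cover at
    least `fraction` of `total`. With total=None the assembly size is used
    (N50 for fraction=0.5); passing an estimated genome size gives NGx.
    """
    if total is None:
        total = sum(lengths)
    threshold = total * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None  # the assembly is too small to reach the threshold


contig_lengths = [80, 70, 50, 40, 30, 20]  # toy values, not real data
n50 = nx(contig_lengths)               # relative to the assembly size (290)
ng50 = nx(contig_lengths, total=400)   # relative to a hypothetical genome size
```

NG50 can be lower than N50, or undefined, when the assembly is smaller than the estimated genome size, which is why the two are worth reading side by side.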

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: out_fa (output of GFA to FASTA tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: ``
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``
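
Busco condenses its result into a one-line summary of complete (C), single-copy (S), duplicated (D), fragmented (F) and missing (M) orthologs. When comparing assemblies before and after purging, the duplicated fraction is a useful number to watch, since retained haplotigs tend to inflate it. The sketch below pulls these numbers out of a summary line, assuming the usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout; the function name and the example values are made up for illustration.

```python
import re

# One-line BUSCO summary, e.g. "C:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:3354"
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text):
    """Return the percentages (floats) and the number of BUSCOs searched (int)."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    scores = {key: float(match.group(key)) for key in ("C", "S", "D", "F", "M")}
    scores["n"] = int(match.group("n"))
    return scores


example = "C:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:3354"  # invented values
scores = parse_busco_summary(example)
```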

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: out_fa (output of GFA to FASTA tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: ``
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``

Sub-step with GFA to FASTA: alternate assembly

hands_on Hands-on: Task description

  1. GFA to FASTA Tool: toolshed.g2.bx.psu.edu/repos/iuc/gfa_to_fa/gfa_to_fa/0.1.2 with the following parameters:
    • param-file “Input GFA file”: primary_contig_graph (output of Hifiasm tool)

Sub-step with Merqury

hands_on Hands-on: Task description

  1. Merqury Tool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3 with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: read_db (output of Meryl tool)
      • “Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)
        • param-file “Genome assembly”: out_fa (output of GFA to FASTA tool)
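
Merqury compares the k-mers of the assembly with the k-mer database built from the reads. One of its outputs is a consensus quality value (QV), derived from the k-mers that occur in the assembly but not in the reads (Rhie et al. 2020). A minimal sketch of that calculation, with hypothetical counts and an illustrative function name:

```python
import math


def merqury_qv(asm_only_kmers, total_asm_kmers, k):
    """Estimate the consensus QV from k-mer counts.

    asm_only_kmers:  k-mers found in the assembly but not in the reads
    total_asm_kmers: all k-mers found in the assembly
    k:               k-mer size used to build the database
    """
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers were observed
    # Probability that a single base is correct, given that a k-mer is
    # supported by the reads only if all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)


# Hypothetical counts: 1,000 unsupported k-mers out of 1,000,000, with k = 21
qv = merqury_qv(1_000, 1_000_000, 21)
```

A QV of 40 corresponds to roughly one error per 10,000 bases, so higher values indicate a more accurate consensus.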

Sub-step with Concatenate datasets

hands_on Hands-on: Task description

  1. Concatenate datasets Tool: cat1 with the following parameters:
    • param-file “Concatenate Dataset”: get_seqs_hap (output of Purge overlaps tool)
    • In “Dataset”:
      • param-repeat “Insert Dataset”
        • param-file “Select”: out_fa (output of GFA to FASTA tool)

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy3 with the following parameters:
    • “Select the purge_dups function”: Split FASTA file by 'N's (split_fa)
      • param-file “Base-level coverage file”: out_fa (output of GFA to FASTA tool)
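
The split_fa function cuts each sequence wherever it contains runs of 'N's, so that the following steps work on gap-free pieces. The idea can be sketched as follows; the `name:start-end` naming of the pieces and the function name are assumptions made for this example, not a description of the tool's exact output.

```python
import re


def split_on_n(name, sequence):
    """Split a sequence at runs of N and return (piece_name, piece_sequence) pairs.

    Piece names carry 1-based inclusive coordinates on the original sequence.
    """
    pieces = []
    for match in re.finditer(r"[^Nn]+", sequence):
        start, end = match.start(), match.end()  # 0-based, end exclusive
        pieces.append((f"{name}:{start + 1}-{end}", match.group()))
    return pieces


pieces = split_on_n("scaffold_1", "ACGTNNNNGGCCNTT")
```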

Sub-step with Map with minimap2

hands_on Hands-on: Task description

  1. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.20+galaxy1 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: out_fa (output of GFA to FASTA tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: out1 (output of Cutadapt tool)
      • “Select a profile of preset options”: Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)
    • In “Alignment options”:
      • “Customize spliced alignment mode?”: No, use profile setting or leave turned off
    • In “Set advanced output options”:
      • “Select an output format”: PAF
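
The PAF output selected here is a tab-separated text format with twelve mandatory columns per alignment, optionally followed by tags. A small sketch of reading one line, useful for inspecting the alignments by hand; the class and function names and the example line are illustrative assumptions.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory PAF columns, in order."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int        # number of matching bases in the alignment
    block_length: int   # alignment block length, including gaps
    mapq: int


def parse_paf_line(line):
    """Parse the mandatory columns of a PAF line and ignore optional tags."""
    fields = line.rstrip("\n").split("\t")[:12]
    if len(fields) < 12:
        raise ValueError("a PAF line needs 12 mandatory columns")
    converters = (str, int, int, int, str, str, int, int, int, int, int, int)
    return PafRecord(*(convert(value) for convert, value in zip(converters, fields)))


# Invented example line with one optional tag at the end
line = "read1\t1000\t0\t1000\t+\tcontig1\t50000\t200\t1200\t950\t1000\t60\ttp:A:P"
record = parse_paf_line(line)
identity = record.matches / record.block_length
```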

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy3 with the following parameters:
    • “Select the purge_dups function”: Calculate coverage cutoff and create read depth histogram and base-level read depth for PacBio data (calcuts+pbcstats)
      • param-file “PAF input file”: alignment_output (output of Map with minimap2 tool)
      • In “Calcuts options”:
        • “Upper bound for read depth”: integer_param (output of Parse parameter value tool)

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy3 with the following parameters:
    • “Select the purge_dups function”: Purge haplotigs and overlaps for an assembly (purge_dups)
      • param-file “PAF input file”: alignment_output (output of Map with minimap2 tool)
      • param-file “Base-level coverage file”: pbcstat_cov (output of Purge overlaps tool)
      • param-file “Cutoffs file”: calcuts_tab (output of Purge overlaps tool)
      • “Rounds of chaining”: 1 round

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy3 with the following parameters:
    • “Select the purge_dups function”: Obtain sequences after purging (get_seqs)
      • param-file “Fasta input file”: out_fa (output of GFA to FASTA tool)
      • param-file “Bed input file”: purge_dups_bed (output of Purge overlaps tool)
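
Conceptually, this step uses the intervals in the BED file to separate each contig into the part that is kept and the part that is set aside. The sketch below shows that interval subtraction in a deliberately simplified form; it is not the tool's implementation, and the function names, the naming of the output pieces and the toy sequences are all assumptions. BED coordinates are taken as 0-based and end-exclusive.

```python
def subtract_intervals(length, intervals):
    """Return the [start, end) segments of a contig not covered by `intervals`."""
    kept = []
    cursor = 0
    for start, end in sorted(intervals):
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        kept.append((cursor, length))
    return kept


def apply_purge_bed(sequences, bed_rows):
    """Split sequences into retained and removed pieces.

    sequences: dict of contig name -> sequence
    bed_rows:  iterable of (contig, start, end) tuples
    """
    by_contig = {}
    for contig, start, end in bed_rows:
        by_contig.setdefault(contig, []).append((start, end))

    purged, removed = {}, {}
    for name, seq in sequences.items():
        intervals = by_contig.get(name, [])
        for start, end in intervals:
            removed[f"{name}:{start}-{end}"] = seq[start:end]
        for start, end in subtract_intervals(len(seq), intervals):
            # Untouched contigs keep their name; trimmed ones get coordinates.
            whole = (start, end) == (0, len(seq))
            key = name if whole else f"{name}:{start}-{end}"
            purged[key] = seq[start:end]
    return purged, removed


toy_sequences = {"c1": "AAAACCCCGGGG", "c2": "TTTT", "c3": "ACGT"}
toy_bed = [("c1", 8, 12), ("c2", 0, 4)]
purged, removed = apply_purge_bed(toy_sequences, toy_bed)
```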

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: get_seqs_purged (output of Purge overlaps tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: integer_param (output of Parse parameter value tool)
      • “Type of organism”: Eukaryote (--eukaryote): use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: out_fa (output of GFA to FASTA tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: ``
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``

Sub-step with Merqury

hands_on Hands-on: Task description

  1. Merqury Tool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3 with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: output (Input dataset)
      • “Number of assemblies”: Two assemblies (diploid)
        • param-file “Second genome assembly”: out_fa (output of GFA to FASTA tool)

Hybrid scaffolding based on phased assembly and Bionano data

Sub-step with Bionano Hybrid Scaffold

hands_on Hands-on: Task description

  1. Bionano Hybrid Scaffold Tool: toolshed.g2.bx.psu.edu/repos/bgruening/bionano_scaffold/bionano_scaffold/3.6.1+galaxy2 with the following parameters:
    • param-file “NGS FASTA”: output (Input dataset)
    • param-file “BioNano CMAP”: output (Input dataset)
    • “Configuration mode”: VGP mode
    • param-file “Conflict resolution file”: output (Input dataset)

Sub-step with Concatenate datasets

hands_on Hands-on: Task description

  1. Concatenate datasets Tool: cat1 with the following parameters:
    • param-file “Concatenate Dataset”: ngs_contigs_scaffold_trimmed (output of Bionano Hybrid Scaffold tool)
    • In “Dataset”:
      • param-repeat “Insert Dataset”
        • param-file “Select”: ngs_contigs_not_scaffolded_trimmed (output of Bionano Hybrid Scaffold tool)

Sub-step with Parse parameter value

hands_on Hands-on: Task description

  1. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (Input dataset)
    • “Select type of parameter to parse”: Integer
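
This step reads a single value from a text dataset so that it can be passed to later tools as a parameter, for example as the estimated genome size used by Quast. What it does can be sketched in a few lines; the function name, the file name and the value are hypothetical.

```python
from pathlib import Path
import tempfile


def parse_integer_parameter(path):
    """Read a one-value text file and return its content as an int."""
    text = Path(path).read_text().strip()
    return int(text)  # raises ValueError if the file does not hold an integer


# Demonstration with a temporary file holding a hypothetical genome size in bp
with tempfile.TemporaryDirectory() as tmp:
    size_file = Path(tmp) / "genome_size.txt"
    size_file.write_text("1200000000\n")
    genome_size = parse_integer_parameter(size_file)
```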

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: out_file1 (output of Concatenate datasets tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: integer_param (output of Parse parameter value tool)
      • “Type of organism”: Eukaryote (--eukaryote): use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes

Hybrid scaffolding based on a phased assembly and HiC mapping data

Sub-step with Map with BWA-MEM

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: output (Input dataset)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode
    • “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Sub-step with Map with BWA-MEM

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: output (Input dataset)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode
    • “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Sub-step with Filter and merge

hands_on Hands-on: Task description

  1. Filter and merge Tool: toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy0 with the following parameters:
    • param-file “First set of reads”: bam_output (output of Map with BWA-MEM tool)
    • param-file “Second set of reads”: bam_output (output of Map with BWA-MEM tool)

Sub-step with PretextMap

hands_on Hands-on: Task description

  1. PretextMap Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.6+galaxy0 with the following parameters:
    • param-file “Input dataset in SAM or BAM format”: outfile (output of Filter and merge tool)
    • “Sort by”: Don't sort
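
PretextMap turns the merged Hi-C alignments into a contact map. The underlying idea can be sketched as binning the two mapped positions of every read pair into a symmetric matrix. This is not Pretext's implementation; the function name, the number of bins and the coordinates below are illustrative assumptions.

```python
def contact_matrix(pairs, genome_length, bins):
    """Count Hi-C contacts between equally sized bins along the assembly.

    pairs:         iterable of (position_a, position_b), 0-based coordinates on
                   the concatenated assembly, each smaller than genome_length
    genome_length: total assembly length in bp
    bins:          number of bins along each axis
    """
    bin_size = -(-genome_length // bins)  # ceiling division
    matrix = [[0] * bins for _ in range(bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


toy_pairs = [(5, 15), (95, 98), (10, 90)]
matrix = contact_matrix(toy_pairs, genome_length=100, bins=4)
```

In a well-scaffolded assembly most contacts fall close to the diagonal, so strong off-diagonal signal in the snapshot is a hint to look more closely at the regions involved.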

Sub-step with Pretext Snapshot

hands_on Hands-on: Task description

  1. Pretext Snapshot Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy0 with the following parameters:
    • param-file “Input Pretext map file”: pretext_map_out (output of PretextMap tool)
    • “Output image format”: png
    • “Show grid?”: Yes

Sub-step with Parse parameter value

hands_on Hands-on: Task description

  1. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (Input dataset)
    • “Select type of parameter to parse”: Integer

Sub-step with bedtools BAM to BED

hands_on Hands-on: Task description

  1. bedtools BAM to BED Tool: toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.30.0+galaxy1 with the following parameters:
    • param-file “Convert the following BAM file to BED”: outfile (output of Filter and merge tool)
    • “What type of BED output would you like”: Create a full, 12-column "blocked" BED file

Sub-step with Sort

hands_on Hands-on: Task description

  1. Sort Tool: sort1 with the following parameters:
    • param-file “Sort Dataset”: output (output of bedtools BAM to BED tool)
    • “on column”: c4
    • “with flavor”: Alphabetical sort
    • “everything in”: Ascending order
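
Column 4 of a BED file produced from a BAM file holds the read name, so this step groups the two mates of each Hi-C pair next to each other before scaffolding. A minimal sketch of the same sort, using abbreviated six-column lines with invented read names (the workflow itself produces 12-column BED, and the ordering of unusual characters may differ from the Galaxy tool):

```python
def sort_bed_by_read_name(lines):
    """Sort BED lines alphabetically (ascending) on column 4, the read name."""
    return sorted(lines, key=lambda line: line.rstrip("\n").split("\t")[3])


bed_lines = [
    "contig2\t100\t250\tread_b/1\t60\t+",
    "contig1\t500\t650\tread_a/2\t60\t-",
    "contig1\t10\t160\tread_a/1\t60\t+",
    "contig3\t40\t190\tread_b/2\t60\t-",
]
sorted_lines = sort_bed_by_read_name(bed_lines)
read_names = [line.split("\t")[3] for line in sorted_lines]
```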

Sub-step with Replace

hands_on Hands-on: Task description

  1. Replace Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3 with the following parameters:
    • param-file “File to process”: output (Input dataset)
    • “Find pattern”: :
    • “Replace all occurences of the pattern”: Yes
    • “Find and Replace text in”: entire line

Sub-step with Parse parameter value

hands_on Hands-on: Task description

  1. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (Input dataset)

Sub-step with SALSA

hands_on Hands-on: Task description

  1. SALSA Tool: toolshed.g2.bx.psu.edu/repos/iuc/salsa/salsa/2.3+galaxy0 with the following parameters:
    • param-file “Initial assembly file”: outfile (output of Replace tool)
    • param-file “Bed alignment”: out_file1 (output of Sort tool)
    • param-file “Sequence graphs”: output (Input dataset)
    • “Restriction enzyme sequence(s)”: text_param (output of Parse parameter value tool)
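
The restriction enzyme parameter describes the recognition site or sites used in the Hi-C library preparation. To get a feel for what this value refers to, the sketch below counts how often each site occurs in a sequence. It is only an illustration and not part of SALSA; the comma-separated input format, the function name and the example sites are assumptions.

```python
def count_restriction_sites(sequence, enzymes):
    """Count (possibly overlapping) occurrences of each recognition site.

    enzymes: comma-separated recognition sequences, e.g. "GATC,AAGCTT"
    """
    seq = sequence.upper()
    counts = {}
    for site in (s.strip().upper() for s in enzymes.split(",") if s.strip()):
        counts[site] = sum(
            1 for i in range(len(seq) - len(site) + 1) if seq.startswith(site, i)
        )
    return counts


counts = count_restriction_sites("GATCGATCAAGCTT", "GATC,AAGCTT")
```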

Sub-step with Parse parameter value

hands_on Hands-on: Task description

  1. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (Input dataset)
    • “Select type of parameter to parse”: Integer

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: scaffolds_fasta (output of SALSA tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
        • “Estimated reference genome size (in bp) for computing NGx statistics”: integer_param (output of Parse parameter value tool)
      • “Type of organism”: Eukaryote (--eukaryote): use of GeneMark-ES for gene finding, Barrnap for ribosomal RNA genes prediction, BUSCO for conserved orthologs finding
    • “Is genome large (> 100 Mbp)?”: Yes
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.2.2+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: scaffolds_fasta (output of SALSA tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: ``
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``

Sub-step with Map with BWA-MEM

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: scaffolds_fasta (output of SALSA tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode

Sub-step with Map with BWA-MEM

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: scaffolds_fasta (output of SALSA tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode

Sub-step with Filter and merge

hands_on Hands-on: Task description

  1. Filter and merge Tool: toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy0 with the following parameters:
    • param-file “First set of reads”: bam_output (output of Map with BWA-MEM tool)
    • param-file “Second set of reads”: bam_output (output of Map with BWA-MEM tool)

Sub-step with PretextMap

hands_on Hands-on: Task description

  1. PretextMap Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.6+galaxy0 with the following parameters:
    • param-file “Input dataset in SAM or BAM format”: outfile (output of Filter and merge tool)
    • “Sort by”: Don't sort

Sub-step with Pretext Snapshot

hands_on Hands-on: Task description

  1. Pretext Snapshot Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy0 with the following parameters:
    • param-file “Input Pretext map file”: pretext_map_out (output of PretextMap tool)
    • “Output image format”: png
    • “Show grid?”: Yes

Conclusion

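In this tutorial, the contigs produced by Hifiasm were purged of haplotigs and overlaps with Purge overlaps, scaffolded with Bionano data and with HiC data using SALSA, and evaluated along the way with Quast, Busco, Merqury and Pretext contact maps.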

Key points

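  • Purge overlaps removes haplotigs and overlaps from the Hifiasm contigs

  • Bionano and HiC data are used to scaffold the assembly

  • Quast, Busco, Merqury and Pretext contact maps help to evaluate the assembly at each stage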

Frequently Asked Questions

Have questions about this tutorial? Check the FAQ page for the Assembly topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.

References

  1. Marcotte, E. M., M. Pellegrini, T. O. Yeates, and D. Eisenberg, 1999 A census of protein repeats. Journal of Molecular Biology 293: 151–160. 10.1006/jmbi.1999.3136
  2. Miller, J. R., S. Koren, and G. Sutton, 2010 Assembly algorithms for next-generation sequencing data. Genomics 95: 315–327. 10.1016/j.ygeno.2010.03.001
  3. Frenkel, S., V. Kirzhner, and A. Korol, 2012 Organizational Heterogeneity of Vertebrate Genomes (V. Laudet, Ed.). PLoS ONE 7: e32076. 10.1371/journal.pone.0032076
  4. Chalopin, D., M. Naville, F. Plard, D. Galiana, and J.-N. Volff, 2015 Comparative Analysis of Transposable Elements Highlights Mobilome Diversity and Evolution in Vertebrates. Genome Biology and Evolution 7: 567–580. 10.1093/gbe/evv005
  5. Sotero-Caio, C. G., R. N. Platt, A. Suh, and D. A. Ray, 2017 Evolution and Diversity of Transposable Elements in Vertebrate Genomes. Genome Biology and Evolution 9: 161–177. 10.1093/gbe/evw264
  6. Vurture, G. W., F. J. Sedlazeck, M. Nattestad, C. J. Underwood, H. Fang et al., 2017 GenomeScope: fast reference-free genome profiling from short reads (B. Berger, Ed.). Bioinformatics 33: 2202–2204. 10.1093/bioinformatics/btx153
  7. Angel, V. D. D., E. Hjerde, L. Sterck, S. Capella-Gutierrez, C. Notredame et al., 2018 Ten steps to get started in Genome Assembly and Annotation. F1000Research 7: 148. 10.12688/f1000research.13598.1
  8. Tørresen, O. K., B. Star, P. Mier, M. A. Andrade-Navarro, A. Bateman et al., 2019 Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Research 47: 10994–11006. 10.1093/nar/gkz841
  9. Wang, H., B. Liu, Y. Zhang, F. Jiang, Y. Ren et al., 2020 Estimation of genome size using k-mer frequencies from corrected long reads. arXiv preprint arXiv:2003.11817.
  10. Zhang, X., R. Wu, Y. Wang, J. Yu, and H. Tang, 2020 Unzipping haplotypes in diploid and polyploid genomes. Computational and Structural Biotechnology Journal 18: 66–72. 10.1016/j.csbj.2019.11.011
  11. Ranallo-Benavidez, T. R., K. S. Jaron, and M. C. Schatz, 2020 GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11: 10.1038/s41467-020-14998-3
  12. Rhie, A., B. P. Walenz, S. Koren, and A. M. Phillippy, 2020 Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21: 10.1186/s13059-020-02134-9
  13. Hon, T., K. Mars, G. Young, Y.-C. Tsai, J. W. Karalius et al., 2020 Highly accurate long-read HiFi sequencing data for five complex genomes. Scientific Data 7: 10.1038/s41597-020-00743-4
  14. Rhie, A., S. A. McCarthy, O. Fedrigo, J. Damas, G. Formenti et al., 2021 Towards complete and error-free genome assemblies of all vertebrate species. Nature 592: 737–746. 10.1038/s41586-021-03451-0

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

Citing this Tutorial

  1. Delphine Lariviere, Alex Ostrovsky, 2021 VGP assembly pipeline (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

details BibTeX

@misc{assembly-vgp_genome_assembly,
author = "Delphine Lariviere and Alex Ostrovsky",
title = "VGP assembly pipeline (Galaxy Training Materials)",
year = "2021",
month = "09",
day = "12",
url = "\url{https://training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                

Congratulations on successfully completing this tutorial!