VGP assembly pipeline

Overview

Questions
  • What combination of tools can produce the highest-quality assembly of vertebrate genomes?

  • How can we evaluate how good it is?

Objectives
  • Learn the tools necessary to perform a de novo assembly of a vertebrate genome

  • Evaluate the quality of the assembly

Requirements
Time estimation: 2 hours
Supporting Materials
Last modification: Jul 22, 2021
License: Tutorial content is licensed under the Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

An assembly can be defined as a hierarchical data structure that maps the sequence data to a putative reconstruction of the target (Miller et al. 2010). Advances in sequencing technologies over the last few decades have revolutionised the field of genomics, reducing both the time and the resources required to carry out de novo genome assembly. Until recently, DNA sequencing technologies produced either short, highly accurate reads or long, error-prone reads. In recent years, however, third-generation sequencing technologies, usually known as real-time single-molecule sequencing, have become dominant in the de novo assembly of large genomes. These technologies sequence native DNA fragments rather than amplified templates, which avoids copying errors, sequence-dependent biases and information loss (Hon et al. 2020). One example is PacBio single-molecule high-fidelity (HiFi) sequencing, which yields average read lengths of 10-20 kb with an average sequence identity greater than 99%; it is one of the technologies used to generate the data for this tutorial.

Deciphering the structural organisation of complex vertebrate genomes is currently one of the most important problems in genomics (Frenkel et al. 2012). However, despite the great progress made in recent years, a key question remains to be answered: what combination of data and tools can produce the highest-quality assembly? To answer this question adequately, it is necessary to analyse two of the main factors that determine the difficulty of genome assembly: repeated sequences and heterozygosity.

Repetitive sequences can be grouped into two categories: interspersed repeats, such as transposable elements (TE), which occur at multiple loci throughout the genome, and tandem repeats (TR), which occur at a single locus (Tørresen et al. 2019). Repetitive sequences, and TE in particular, are an important component of eukaryotic genomes, constituting more than a third of the genome in the case of mammals (Sotero-Caio et al. 2017, Chalopin et al. 2015). In the case of tandem repeats, various estimates suggest that they are present in at least one third of human protein sequences (Marcotte et al. 1999). TE content is probably the main factor contributing to fragmented genome assemblies, especially in large genomes, as it is highly correlated with genome size (Sotero-Caio et al. 2017). TR, on the other hand, usually lead to local assembly collapse and to partial or complete loss of genes, especially when the reads produced by the sequencing method are shorter than the TR (Tørresen et al. 2019).

In the case of heterozygosity, haplotype phasing, that is, the identification of alleles that are co-located on the same chromosome, has become a fundamental problem in heterozygous and polyploid genome assemblies (Zhang et al. 2020). A common strategy to overcome these difficulties is to remap genomes to a single haplotype that represents the whole genome. This approach is useful for highly inbred samples that are nearly homozygous, but when applied to highly heterozygous genomes, such as those of aquatic organisms, it misses potential differences in sequence, structure, and gene presence, and usually leads to ambiguities and redundancies in the initial contig-level assemblies (Angel et al. 2018, Zhang et al. 2020).

To address these problems, the G10K consortium launched the Vertebrate Genomes Project (VGP), whose goal is to generate a high-quality, near-error-free, gap-free, chromosome-level, haplotype-phased, annotated reference genome assembly for every vertebrate species currently living on Earth (Rhie et al. 2021). The protocol proposed in this tutorial, the VGP assembly pipeline, is the result of years of study and analysis of the available tools and data sources.

Agenda

In this tutorial, we will cover:

  1. VGP assembly pipeline overview
    1. Title for a subsection
  2. Hands-on Sections
    1. Get data
  3. Genome profile analysis
    1. Generation of k-mer spectra with Meryl
    2. Sub-step with Meryl
    3. Sub-step with GenomeScope
    4. Re-arrange
    5. Sub-step with Parse parameter value
    6. Sub-step with Cutadapt
    7. Sub-step with Collapse Collection
    8. Sub-step with Hifiasm
    9. Sub-step with GFA to FASTA
    10. Sub-step with GFA to FASTA
    11. Sub-step with Meryl
    12. Sub-step with Quast
    13. Sub-step with Purge overlaps
    14. Sub-step with Busco
    15. Sub-step with Merqury
    16. Sub-step with Map with minimap2
    17. Sub-step with Map with minimap2
    18. Sub-step with Purge overlaps
    19. Sub-step with Compute
    20. Sub-step with Compute
    21. Sub-step with Advanced Cut
    22. Sub-step with Advanced Cut
    23. Sub-step with Parse parameter value
    24. Sub-step with Parse parameter value
    25. Sub-step with Purge overlaps
    26. Sub-step with Purge haplotigs
    27. Sub-step with Purge overlaps
    28. Sub-step with Merqury
    29. Sub-step with Bionano Hybrid Scaffold
    30. Sub-step with Quast
    31. Sub-step with Busco
    32. Sub-step with Concatenate datasets
    33. Sub-step with Concatenate datasets
    34. Sub-step with Map with minimap2
    35. Sub-step with Purge overlaps
    36. Sub-step with Merqury
    37. Sub-step with Quast
    38. Sub-step with Busco
    39. Sub-step with Map with BWA-MEM
    40. Sub-step with Map with BWA-MEM
    41. Sub-step with Purge overlaps
    42. Sub-step with Map with minimap2
    43. Sub-step with bellerophon
    44. Sub-step with Purge overlaps
    45. Sub-step with bedtools BAM to BED
    46. Sub-step with Purge haplotigs
    47. Sub-step with Sort
    48. Sub-step with Purge overlaps
    49. Sub-step with SALSA
    50. Sub-step with Quast
    51. Sub-step with Merqury
    52. Sub-step with Busco
    53. Sub-step with Merqury
    54. Sub-step with Busco
    55. Sub-step with Quast
    56. Sub-step with Map with BWA-MEM
    57. Sub-step with PretextMap
    58. Sub-step with Pretext Snapshot
    59. Re-arrange

VGP assembly pipeline overview

Figure 1: VGP Pipeline 2.0

Title for a subsection

Hands-on Sections

Get data

hands_on Hands-on: Data upload

  1. Create a new history for this tutorial
  2. Import the files from Zenodo or from the shared data library (GTN - Material -> assembly -> VGP assembly pipeline):

    
    

    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    • By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

    Tip: Importing data from a data library

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries

    • Find the correct folder (ask your instructor)

    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import
  3. Rename the datasets
  4. Check that the datatype of each dataset is correct

    Tip: Changing the datatype

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
    • Select datatypes
    • Click the Change datatype button
  5. Add to each database a tag corresponding to …

    Tip: Adding a tag

    • Click on the dataset
    • Click on galaxy-tags Edit dataset tags
    • Add a tag starting with #

      Tags starting with # will be automatically propagated to the outputs of tools using this dataset.

    • Check that the tag is appearing below the dataset name

Genome profile analysis

An important step before starting a de novo genome assembly project is to analyse the genome profile. Determining the characteristics of the genome in advance can reveal whether an analysis fails to reflect the full complexity of the genome, for example if the number of variants is underestimated or a significant fraction of the genome is not assembled (Vurture et al. 2017).

Traditionally, DNA flow cytometry was considered the gold standard for estimating genome size, one of the most important factors in determining the required coverage level. Nowadays, however, experimental methods have been replaced by computational approaches (Wang et al. 2020). One of the most widely used procedures for genome profiling is the analysis of k-mer frequencies. It provides information not only about genomic complexity, such as genome size, level of heterozygosity and repeat content, but also about data quality. In addition, k-mer spectra analysis can be used in a reference-free manner to assess genome assembly quality (Rhie et al. 2020).

In this tutorial we will use two basic tools to computationally estimate the genome features: Meryl and GenomeScope.

Generation of k-mer spectra with Meryl

Meryl will allow us to perform the k-mer profiling by decomposing the sequencing data into substrings of length k and counting how often each one occurs. The original version was developed for use in the Celera Assembler, and it comprises three modules: one for generating k-mer databases, one for filtering and combining databases, and one for searching databases. The k-mer database is stored in sorted order, similar to words in a dictionary (Rhie et al. 2020).
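
The counting idea can be illustrated with a few lines of Python. This is only a toy sketch and not how Meryl is implemented: the function names are hypothetical, and treating a k-mer and its reverse complement as the same key is an assumption of this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def spectrum(counts):
    """Turn k-mer counts into a histogram: multiplicity -> number of distinct k-mers."""
    return dict(sorted(Counter(counts.values()).items()))
```

The histogram returned by `spectrum` plays the role of the k-mer spectrum analysed in the following steps.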

comment K-mer size estimation

One important parameter is the k-mer size, which must be large enough for k-mers to map uniquely to the genome, but not so large that computational resources are wasted. Given an estimated genome size (G) and a tolerable collision rate (p), an appropriate k can be computed as k = log4(G(1 − p)/p).
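
As a worked example, the formula can be evaluated directly. The function name, the default collision rate and the rounding up to the next integer are assumptions of this sketch; the genome sizes used in the usage note are arbitrary illustrative values.

```python
import math


def best_kmer_size(genome_size, collision_rate=0.001):
    """Evaluate k = log4(G(1 - p)/p) and round up to the next integer."""
    if genome_size <= 0 or not 0 < collision_rate < 1:
        raise ValueError("genome_size must be positive and 0 < collision_rate < 1")
    k = math.log(genome_size * (1 - collision_rate) / collision_rate, 4)
    return math.ceil(k)
```

For instance, `best_kmer_size(3e9, 0.001)` evaluates the formula for a 3 Gb genome with a collision rate of 0.1%; a smaller tolerated collision rate or a larger genome increases k.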

hands_on Hands-on: Task description

  1. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy2 with the following parameters:
    • “Operation type selector”: Count operations
      • param-collection “Input sequences”: output (Input dataset collection)
      • “K-mer size selector”: Estimate the best k-mer size

Sub-step with Meryl

hands_on Hands-on: Task description

  1. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy1 with the following parameters:
    • “Operation type selector”: Operations on sets of k-mers
      • “Operations on sets of k-mers”: Union-sum: return k-mers that occur in any input, set the count to the sum of the counts
      • param-file “Input meryldb”: read_db (output of Meryl tool)
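
The union-sum operation described above can be pictured with ordinary Python counters. This is a conceptual sketch only: the function name is hypothetical and real Meryl databases are not Python objects.

```python
from collections import Counter


def union_sum(*databases):
    """Merge k-mer count tables: keep every k-mer seen in any input, summing its counts."""
    total = Counter()
    for database in databases:
        total.update(database)  # Counter.update adds counts instead of replacing them
    return total
```

Merging the per-file databases this way yields one table whose counts describe the whole read set, which is what the histogram step needs.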

Sub-step with GenomeScope

hands_on Hands-on: Task description

  1. GenomeScope Tool: toolshed.g2.bx.psu.edu/repos/iuc/genomescope/genomescope/2.0 with the following parameters:
    • param-file “Input histogram file”: read_db_hist (output of Meryl tool)
    • “Add the model parameters to your history”: Yes
    • “Output a summary of the analysis”: Yes
    • “K-mer length used to calculate k-mer spectra”: 31
    • “Create testing.tsv file with model parameters”: Yes
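
GenomeScope fits a statistical model to the k-mer histogram. The intuition behind a genome size estimate can nevertheless be shown with a much cruder back-of-the-envelope calculation: divide the total number of k-mers by the coverage at the main peak. The sketch below is not GenomeScope's method; the function name and the error cutoff are hypothetical, and it assumes that the tallest peak above the cutoff is the homozygous peak.

```python
def estimate_genome_size(histogram, min_multiplicity=5):
    """Rough haploid genome size from a k-mer histogram (multiplicity -> count).

    k-mers below min_multiplicity are treated as sequencing errors and ignored.
    Returns (estimated_size, peak_multiplicity).
    """
    solid = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers at or above the error cutoff")
    peak = max(solid, key=solid.get)              # multiplicity with most distinct k-mers
    total_kmers = sum(m * n for m, n in solid.items())
    return total_kmers / peak, peak
```

On real data the error cutoff and the choice of peak matter a great deal, which is one reason to rely on a fitted model rather than on this shortcut.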

Re-arrange

To create the template, each step of the workflow had its own subsection.

TODO: Re-arrange the generated subsections into sections or other subsections. Consider merging some hands-on boxes to have a meaningful flow of the analyses

Conclusion

Sum up the tutorial and the key takeaways here. We encourage adding an overview image of the pipeline used.

Sub-step with Parse parameter value

hands_on Hands-on: Task description

  1. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (Input dataset)

Sub-step with Cutadapt

hands_on Hands-on: Task description

  1. Cutadapt Tool: toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/3.4 with the following parameters:
    • “Single-end or Paired-end reads?”: Single-end
      • param-collection “FASTQ/A file”: output (Input dataset collection)
      • In “Read 1 Options”:
        • In “5’ or 3’ (Anywhere) Adapters”:
          • param-repeat “Insert 5’ or 3’ (Anywhere) Adapters”
            • “Source”: Enter custom sequence
              • “Enter custom 5’ or 3’ adapter sequence”: ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT
          • param-repeat “Insert 5’ or 3’ (Anywhere) Adapters”
            • “Source”: Enter custom sequence
              • “Enter custom 5’ or 3’ adapter sequence”: ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT
    • In “Adapter Options”:
      • “Match times”: 3
      • “Maximum error rate”: 0.1
      • “Minimum overlap length”: 35
      • “Look for adapters in the reverse complement”: True
    • In “Filter Options”:
      • “Discard Trimmed Reads”: Yes
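
With these settings, any read in which an adapter is found is discarded rather than trimmed. The sketch below mimics that filtering logic in plain Python. It is a simplification and not Cutadapt's algorithm: it looks for exact matches of at least `min_overlap` adapter bases, whereas Cutadapt tolerates mismatches up to the configured error rate. All function names are hypothetical.

```python
ADAPTERS = (
    "ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT",
    "ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT",
)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def contains_adapter(read, adapters=ADAPTERS, min_overlap=35):
    """True if the read holds an exact stretch of >= min_overlap adapter bases,
    in either orientation."""
    read = read.upper()
    for adapter in adapters:
        for pattern in (adapter, reverse_complement(adapter)):
            for start in range(len(pattern) - min_overlap + 1):
                if pattern[start:start + min_overlap] in read:
                    return True
    return False


def discard_adapter_reads(reads):
    """Keep only reads without adapter contamination."""
    return [read for read in reads if not contains_adapter(read)]
```

Note that the two adapter sequences listed in the hands-on box are reverse complements of each other, so searching both orientations of each is redundant here but harmless.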

Sub-step with Collapse Collection

hands_on Hands-on: Task description

  1. Collapse Collection Tool: toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.2 with the following parameters:
    • param-file “Collection of files to collapse into single dataset”: out1 (output of Cutadapt tool)
    • “Prepend File name”: Yes

Sub-step with Hifiasm

hands_on Hands-on: Task description

  1. Hifiasm Tool: toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.14+galaxy0 with the following parameters:
    • “Assembly mode”: Standard
      • param-file “Input reads”: out1 (output of Cutadapt tool)
    • “Advanced options”: Leave default
    • “Assembly options”: Leave default
    • “Options for purging duplicates”: Specify

Sub-step with GFA to FASTA

hands_on Hands-on: Task description

  1. GFA to FASTA Tool: toolshed.g2.bx.psu.edu/repos/iuc/gfa_to_fa/gfa_to_fa/0.1.2 with the following parameters:
    • param-file “Input GFA file”: primary_contig_graph (output of Hifiasm tool)
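
The conversion itself is conceptually simple: in a GFA file, each segment line starts with `S`, followed by the segment name and its sequence. A minimal sketch of the idea, with a hypothetical function name and line width, could look as follows; it is not the code of the Galaxy tool.

```python
def gfa_to_fasta(gfa_lines, width=60):
    """Convert the segment (S) lines of a GFA file into FASTA text."""
    out = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # S lines carry: record type, segment name, sequence, optional tags.
        # A sequence of "*" means the sequence is not stored in the file.
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            name, sequence = fields[1], fields[2]
            out.append(f">{name}")
            out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(out) + ("\n" if out else "")
```

Link and other record types are ignored, so the graph topology is lost and only the contig sequences remain.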

Sub-step with GFA to FASTA

hands_on Hands-on: Task description

  1. GFA to FASTA Tool: toolshed.g2.bx.psu.edu/repos/iuc/gfa_to_fa/gfa_to_fa/0.1.2 with the following parameters:
    • param-file “Input GFA file”: alternate_contig_graph (output of Hifiasm tool)

Sub-step with Meryl

hands_on Hands-on: Task description

  1. Meryl Tool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy0 with the following parameters:
    • “Operation type selector”: Generate histogram dataset
      • param-file “Input meryldb”: read_db (output of Meryl tool)

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: out_fa (output of GFA to FASTA tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes
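
Among the statistics reported by Quast, N50 is one of the most commonly quoted contiguity measures: it is the length of the contig at which the running total, counted from the longest contig downwards, first reaches half of the assembly size. A small sketch with a hypothetical function name shows the calculation:

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why it is combined with other evaluations in this pipeline.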

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: split FASTA file by 'N's
      • param-file “Base-level coverage file”: out_fa (output of GFA to FASTA tool)
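
Splitting a FASTA record at runs of `N` turns each gap-free stretch into its own sequence. The sketch below illustrates the idea; the function name and the `name:start-end` naming of the pieces are assumptions made for this example and may differ from what the tool writes.

```python
import re


def split_on_gaps(name, sequence):
    """Split a sequence at runs of N, returning (piece_name, piece_sequence) pairs.

    Piece names use 1-based inclusive coordinates on the original sequence.
    """
    pieces = []
    for match in re.finditer(r"[^Nn]+", sequence):
        start, end = match.span()
        pieces.append((f"{name}:{start + 1}-{end}", match.group()))
    return pieces
```

Keeping the original coordinates in the piece names makes it possible to relate results on the split sequences back to the unsplit assembly.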

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: out_fa (output of GFA to FASTA tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: Vertebrata
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``

Sub-step with Merqury

hands_on Hands-on: Task description

  1. Merqury Tool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3 with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: read_db (output of Meryl tool)
      • “Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)
        • param-file “Genome assembly”: out_fa (output of GFA to FASTA tool)

Sub-step with Map with minimap2

hands_on Hands-on: Task description

  1. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: out_fa (output of GFA to FASTA tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: out1 (output of Cutadapt tool)
      • “Select a profile of preset options”: Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)
    • In “Alignment options”:
      • “Customize spliced alignment mode?”: No, use profile setting or leave turned off
    • In “Set advanced output options”:
      • “Select an output format”: paf

Sub-step with Map with minimap2

hands_on Hands-on: Task description

  1. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: split_fasta (output of Purge overlaps tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: split_fasta (output of Purge overlaps tool)
      • “Select a profile of preset options”: Construct a self-homology map - use same genome as query and reference (-DP -k19 -w19 -m200) (self-homology)
    • In “Mapping options”:
      • “force minimap2 to always use k-mers occuring this many times or fewer”: 100
      • “minimal chaining score (matching bases minus log gap penalty)”: 40
    • In “Alignment options”:
      • “Customize spliced alignment mode?”: No, use profile setting or leave turned off
      • “Score for a sequence match”: 1
      • “Penalty for a mismatch”: 19
      • “Gap open penalties for deletions”: 39
      • “Gap open penalties for insertions”: 81
      • “Gap extension penalties; a gap of size k cost ‘-O + -Ek’. If two numbers are specified, the first is the penalty of extending a deletion and the second for extending an insertion”: 3
      • “Gap extension penalty for extending an insertion; if left empty uses the value specified for Gap extension penalties above”: 1
      • “Z-drop threshold for truncating an alignment”: 200
    • In “Set advanced output options”:
      • “Select an output format”: paf

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: create read depth histogram and base-level read depth for pacbio data
      • param-file “PAF input file”: alignment_output (output of Map with minimap2 tool)
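
To make the two outputs of this step more concrete, the sketch below derives a per-base read depth and a depth histogram from PAF records. It is a simplified illustration with hypothetical function names, not the purge_dups implementation; it relies only on the standard PAF columns for target name, target length, target start and target end (0-based start, exclusive end).

```python
from collections import Counter


def base_depth(paf_lines):
    """Return {target_name: [read depth at each base]} from PAF alignment lines."""
    deltas = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        target, target_len = fields[5], int(fields[6])
        start, end = int(fields[7]), int(fields[8])
        # Difference array: +1 where an alignment begins, -1 just past its end.
        delta = deltas.setdefault(target, [0] * (target_len + 1))
        delta[start] += 1
        delta[end] -= 1
    depth = {}
    for target, delta in deltas.items():
        running, per_base = 0, []
        for change in delta[:-1]:
            running += change
            per_base.append(running)
        depth[target] = per_base
    return depth


def depth_histogram(depth):
    """Return {read depth: number of bases at that depth}, sorted by depth."""
    histogram = Counter()
    for per_base in depth.values():
        histogram.update(per_base)
    return dict(sorted(histogram.items()))
```

In this sketch, targets without any aligned read do not appear in the result. The shape of the histogram is what the coverage cutoffs in the following steps are chosen from.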

Sub-step with Compute

hands_on Hands-on: Task description

  1. Compute Tool: toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6 with the following parameters:
    • “Add expression”: 1.5*c3
    • param-file “as a new column to”: model_params (output of GenomeScope tool)
    • “Round result?”: Yes
    • “Input has a header line with column names?”: No

Sub-step with Compute

hands_on Hands-on: Task description

  1. Compute Tool: toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6 with the following parameters:
    • “Add expression”: 3*c7
    • param-file “as a new column to”: out_file1 (output of Compute tool)
    • “Round result?”: Yes
    • “Input has a header line with column names?”: No
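
The two Compute steps chain together: the first appends `1.5*c3` as a new column, and the second multiplies column 7 by three and appends the result. The sketch below reproduces that arithmetic under the assumption that the GenomeScope parameter row has six columns, so that the first new value becomes column 7 and the second becomes column 8. The function name and the example row are hypothetical.

```python
def add_cutoff_columns(row):
    """Append round(1.5 * c3) and round(3 * c7) to a row, as the two Compute steps do.

    Columns are 1-based in Galaxy, so c3 is row[2]. The row is assumed to have
    six columns, which makes the first appended value column 7.
    """
    values = list(row)
    if len(values) != 6:
        raise ValueError("this sketch assumes a six-column input row")
    c7 = round(1.5 * float(values[2]))   # first Compute step
    values.append(c7)
    c8 = round(3 * c7)                   # second Compute step, uses the new column 7
    values.append(c8)
    return values
```

With a value of 20 in column 3, for example, the appended columns are 30 and 90. These are the two values that the later Advanced Cut and Parse parameter value steps extract as integers.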

Sub-step with Advanced Cut

hands_on Hands-on: Task description

  1. Advanced Cut Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-file “File to cut”: out_file1 (output of Compute tool)
    • “Cut by”: fields
      • “List of Fields”: c8

Sub-step with Advanced Cut

hands_on Hands-on: Task description

  1. Advanced Cut Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0 with the following parameters:
    • param-file “File to cut”: out_file1 (output of Compute tool)
    • “Cut by”: fields
      • “List of Fields”: c7

Sub-step with Parse parameter value

hands_on Hands-on: Task description

  1. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (output of Advanced Cut tool)
    • “Select type of parameter to parse”: Integer

Sub-step with Parse parameter value

hands_on Hands-on: Task description

  1. Parse parameter value Tool: param_value_from_file with the following parameters:
    • param-file “Input file containing parameter to parse out of”: output (output of Advanced Cut tool)
    • “Select type of parameter to parse”: Integer

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: calculate coverage cutoffs
      • param-file “STAT input file”: pbcstat_stat (output of Purge overlaps tool)
      • “Transition between haploid and diploid”: integer_param (output of Parse parameter value tool; workflow step 26)
      • “Upper bound for read depth”: integer_param (output of Parse parameter value tool; workflow step 25)

Sub-step with Purge haplotigs

hands_on Hands-on: Task description

  1. Purge haplotigs Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy0 with the following parameters:
    • “Select the purge_dups function”: purge haplotigs and overlaps for an assembly
      • param-file “PAF input file”: alignment_output (output of Map with minimap2 tool)
      • param-file “Base-level coverage file”: pbcstat_cov (output of Purge overlaps tool)
      • param-file “Cutoffs file”: calcuts_tab (output of Purge overlaps tool)
      • “Rounds of chaining”: 1 round

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: obtain sequences after purging
      • param-file “Fasta input file”: out_fa (output of GFA to FASTA tool)
      • param-file “Bed input file”: purge_dups_bed (output of Purge haplotigs tool)

Sub-step with Merqury

hands_on Hands-on: Task description

  1. Merqury Tool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3 with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: read_db (output of Meryl tool)
      • “Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)
        • param-file “Genome assembly”: get_seqs_purged (output of Purge overlaps tool)
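
Merqury compares the k-mers of the assembly with those of the reads. The sketch below shows how a Phred-scaled consensus quality (QV) can be derived from two k-mer counts, following the k-mer survival formula described for Merqury as we understand it. The function name and the example counts are assumptions, and the code is an approximation of the idea rather than the tool's own implementation.

```python
import math


def kmer_qv(assembly_only_kmers: int, total_assembly_kmers: int, k: int) -> float:
    """Phred-scaled consensus quality estimated from k-mer counts.

    A k-mer present in the assembly but absent from the reads is treated
    as evidence of at least one base error.
    """
    if total_assembly_kmers <= 0:
        raise ValueError("total_assembly_kmers must be positive")
    if assembly_only_kmers == 0:
        return math.inf  # no unsupported k-mers were observed
    # Probability that a single base is correct.
    p_correct = (1 - assembly_only_kmers / total_assembly_kmers) ** (1 / k)
    return -10 * math.log10(1 - p_correct)


# Hypothetical counts: 1,000 unsupported k-mers out of 1,000,000, with k = 21.
print(round(kmer_qv(1_000, 1_000_000, 21), 2))
```

A higher QV means fewer assembly k-mers lack support in the reads.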

Sub-step with Bionano Hybrid Scaffold

hands_on Hands-on: Task description

  1. Bionano Hybrid Scaffold Tool: toolshed.g2.bx.psu.edu/repos/bgruening/bionano_scaffold/bionano_scaffold/3.6.1+galaxy2 with the following parameters:
    • param-file “NGS FASTA”: get_seqs_purged (output of Purge overlaps tool)
    • param-file “BioNano CMAP”: output (Input dataset)
    • “Configuration mode”: VGP mode

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: get_seqs_purged (output of Purge overlaps tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes
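
One of the contiguity statistics commonly read from a Quast report is N50. As a reminder of what that number means, here is a small self-contained Python sketch; the function name and the example lengths are illustrative and the code is not taken from Quast.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Largest length L such that sequences of length >= L hold at least
    half of the total assembly length."""
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        raise ValueError("no sequences given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-negative lengths")


# Hypothetical contig lengths.
print(n50([2, 3, 4, 5, 6]))
```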

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: get_seqs_purged (output of Purge overlaps tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: Vertebrata
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``
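
BUSCO reports its result in a compact one-line notation (complete, single-copy, duplicated, fragmented, missing, and the number of genes searched). The following Python sketch parses such a line so the values can be compared between assembly stages. It assumes the usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout; the example line contains made-up numbers.

```python
import re

SUMMARY_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> dict:
    """Extract the percentages and gene count from a BUSCO summary line."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(scores["n"])  # the gene count is a whole number
    return scores


# Made-up summary line, for illustration only.
print(parse_busco_summary("C:95.0%[S:93.5%,D:1.5%],F:2.0%,M:3.0%,n:1000"))
```

A drop in the duplicated fraction (D) after purging is one sign that retained haplotigs were removed.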

Sub-step with Concatenate datasets

hands_on Hands-on: Task description

  1. Concatenate datasets Tool: cat1 with the following parameters:
    • param-file “Concatenate Dataset”: get_seqs_hap (output of Purge overlaps tool)
    • In “Dataset”:
      • param-repeat “Insert Dataset”
        • param-file “Select”: out_fa (output of GFA to FASTA tool)

Sub-step with Concatenate datasets

hands_on Hands-on: Task description

  1. Concatenate datasets Tool: cat1 with the following parameters:
    • param-file “Concatenate Dataset”: ngs_contigs_scaffold_trimmed (output of Bionano Hybrid Scaffold tool)
    • In “Dataset”:
      • param-repeat “Insert Dataset”
        • param-file “Select”: ngs_contigs_not_scaffolded_trimmed (output of Bionano Hybrid Scaffold tool)

Sub-step with Map with minimap2

hands_on Hands-on: Task description

  1. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: out_file1 (output of Concatenate datasets tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: out1 (output of Cutadapt tool)
      • “Select a profile of preset options”: Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)
    • In “Alignment options”:
      • “Customize spliced alignment mode?”: No, use profile setting or leave turned off
    • In “Set advanced output options”:
      • “Select an output format”: paf
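
The PAF output chosen here is a tab-separated text format with twelve mandatory columns. The Python sketch below parses one PAF line and computes a simple identity over the alignment block. The class and function names are our own, and the example record is invented.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse the 12 mandatory PAF columns; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def block_identity(record: PafRecord) -> float:
    """Matching bases divided by the alignment block length."""
    return record.matches / record.block_length


# Invented record: a 1 kb read aligned to a contig with 990 matching bases.
record = parse_paf_line("read1\t1000\t0\t1000\t+\tcontig1\t50000\t100\t1100\t990\t1000\t60")
print(record.target, block_identity(record))
```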

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: split FASTA file by 'N's
      • param-file “Base-level coverage file”: out_file1 (output of Concatenate datasets tool)
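
Splitting by 'N's cuts each scaffold wherever a run of gap characters occurs, so that the pieces can be aligned against each other. A minimal Python sketch of the idea follows. The `name:start-end` naming scheme is an assumption chosen for readability, and the code is not the purge_dups implementation.

```python
import re


def split_at_gaps(name: str, sequence: str) -> list:
    """Split a scaffold into (piece_name, piece_sequence) tuples at runs of N."""
    pieces = []
    for match in re.finditer(r"[^Nn]+", sequence):
        start = match.start() + 1  # 1-based start within the scaffold
        pieces.append((f"{name}:{start}-{match.end()}", match.group()))
    return pieces


# Hypothetical scaffold with a four-base gap.
print(split_at_gaps("scaffold_1", "ACGTNNNNGGCC"))
```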

Sub-step with Merqury

hands_on Hands-on: Task description

  1. Merqury Tool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3 with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: read_db (output of Meryl tool)
      • “Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)
        • param-file “Genome assembly”: out_file1 (output of Concatenate datasets tool)

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: out_file1 (output of Concatenate datasets tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: out_file1 (output of Concatenate datasets tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: Vertebrata
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``

Sub-step with Map with BWA-MEM

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: out_file1 (output of Concatenate datasets tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode
    • “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Sub-step with Map with BWA-MEM

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: out_file1 (output of Concatenate datasets tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode
    • “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: create read depth histogram and base-level read depth for pacbio data
      • param-file “PAF input file”: alignment_output (output of Map with minimap2 tool)
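
This function turns read alignments into a read depth histogram, which the later cutoff calculation reads. Conceptually, a depth histogram simply counts how many bases were covered at each depth, as in the following Python sketch. The input list is invented, and the real tool derives depths from the PAF alignments rather than from a ready-made list.

```python
from collections import Counter


def depth_histogram(per_base_depth):
    """Count how many bases were observed at each read depth."""
    return dict(sorted(Counter(per_base_depth).items()))


# Invented per-base depths for a seven-base region.
print(depth_histogram([0, 30, 30, 31, 60, 60, 60]))
```

In a diploid sample such a histogram typically shows two peaks, one near the haploid depth and one near the diploid depth.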

Sub-step with Map with minimap2

hands_on Hands-on: Task description

  1. Map with minimap2 Tool: toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: split_fasta (output of Purge overlaps tool)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: split_fasta (output of Purge overlaps tool)
      • “Select a profile of preset options”: Construct a self-homology map - use same genome as query and reference (-DP -k19 -w19 -m200) (self-homology)
    • In “Mapping options”:
      • “force minimap2 to always use k-mers occurring this many times or fewer”: 100
      • “minimal chaining score (matching bases minus log gap penalty)”: 40
    • In “Alignment options”:
      • “Customize spliced alignment mode?”: No, use profile setting or leave turned off
      • “Score for a sequence match”: 1
      • “Penalty for a mismatch”: 19
      • “Gap open penalties for deletions”: 39
      • “Gap open penalties for insertions”: 81
      • “Gap extension penalties; a gap of size k cost ‘-O + -Ek’. If two numbers are specified, the first is the penalty of extending a deletion and the second for extending an insertion”: 3
      • “Gap extension penalty for extending an insertion; if left empty uses the value specified for Gap extension penalties above”: 1
      • “Z-drop threshold for truncating an alignment”: 200
    • In “Set advanced output options”:
      • “Select an output format”: paf

Sub-step with bellerophon

hands_on Hands-on: Task description

  1. bellerophon Tool: toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy0 with the following parameters:

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: calculate coverage cutoffs
      • param-file “STAT input file”: pbcstat_stat (output of Purge overlaps tool)
      • “Transition between haploid and diploid”: 31
      • “Upper bound for read depth”: 94
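
The two values set here act as boundaries on read depth. The Python sketch below illustrates the idea by labelling a contig from its mean depth. Only the values 31 and 94 come from this step; the lower bound, the labels, and the use of a single mean depth are simplifying assumptions, since purge_dups works on base-level coverage and derives its own set of cutoffs.

```python
def classify_by_depth(mean_depth: float, transition: int = 31,
                      upper: int = 94, lower: int = 1) -> str:
    """Assign a coarse label to a contig from its mean read depth.

    `transition` and `upper` are the values used in this step; `lower`
    and the labels are assumptions made for illustration.
    """
    if mean_depth < lower:
        return "low-coverage"
    if mean_depth < transition:
        return "haploid-level"
    if mean_depth <= upper:
        return "diploid-level"
    return "high-coverage"


# Hypothetical mean depths for four contigs.
for depth in (0.5, 20, 60, 200):
    print(depth, classify_by_depth(depth))
```

Contigs at haploid-level depth are the candidates for being retained haplotigs, whereas very high depth often points to collapsed repeats.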

Sub-step with bedtools BAM to BED

hands_on Hands-on: Task description

  1. bedtools BAM to BED Tool: toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.30.0+galaxy1 with the following parameters:
    • param-file “Convert the following BAM file to BED”: outfile (output of bellerophon tool)
    • “What type of BED output would you like”: Create a full, 12-column "blocked" BED file

Sub-step with Purge haplotigs

hands_on Hands-on: Task description

  1. Purge haplotigs Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy0 with the following parameters:
    • “Select the purge_dups function”: purge haplotigs and overlaps for an assembly
      • param-file “PAF input file”: alignment_output (output of Map with minimap2 tool)
      • param-file “Base-level coverage file”: pbcstat_cov (output of Purge overlaps tool)
      • param-file “Cutoffs file”: calcuts_tab (output of Purge overlaps tool)
      • “Rounds of chaining”: 1 round

Sub-step with Sort

hands_on Hands-on: Task description

  1. Sort Tool: sort1 with the following parameters:
    • param-file “Sort Dataset”: output (output of bedtools BAM to BED tool)
    • “on column”: c4
    • “with flavor”: Alphabetical sort
    • “everything in”: Ascending order
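
Column 4 of a BED file produced from a BAM file holds the read name, so sorting on it places the two mates of each Hi-C pair next to each other. A minimal Python sketch of this sort is shown below; the function name and the three BED records are invented for illustration.

```python
def sort_bed_by_name(bed_lines):
    """Sort BED records alphabetically on column 4 (the read name)."""
    return sorted(bed_lines, key=lambda line: line.split("\t")[3])


# Invented BED records; only the first six columns are shown.
records = [
    "contig2\t500\t650\treadB/1\t60\t+",
    "contig1\t100\t250\treadA/2\t60\t-",
    "contig3\t900\t1050\treadA/1\t60\t+",
]
for line in sort_bed_by_name(records):
    print(line)
```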

Sub-step with Purge overlaps

hands_on Hands-on: Task description

  1. Purge overlaps Tool: toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2 with the following parameters:
    • “Select the purge_dups function”: obtain sequences after purging
      • param-file “Fasta input file”: out_file1 (output of Concatenate datasets tool)
      • param-file “Bed input file”: purge_dups_bed (output of Purge haplotigs tool)

Sub-step with SALSA

hands_on Hands-on: Task description

  1. SALSA Tool: toolshed.g2.bx.psu.edu/repos/iuc/salsa/salsa/2.2+galaxy0 with the following parameters:
    • param-file “Initial assembly file”: out_file1 (output of Concatenate datasets tool)
    • param-file “Bed alignment”: out_file1 (output of Sort tool)
    • “Restriction enzyme sequence(s)”: text_param (output of workflow step 5)
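
The restriction enzyme parameter is given as the recognition sequence of the enzyme used to prepare the Hi-C library. To illustrate what such a sequence refers to, the Python sketch below locates every occurrence of a motif in a contig. The motif `GATC` and the contig are hypothetical examples, and how SALSA uses the sites internally is not covered here.

```python
def find_sites(sequence: str, motif: str) -> list:
    """Return 0-based start positions of every motif occurrence,
    including overlapping ones."""
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


# Hypothetical contig and recognition sequence.
print(find_sites("AAGATCTTGATCGATC", "GATC"))
```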

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: get_seqs_purged (output of Purge overlaps tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes

Sub-step with Merqury

hands_on Hands-on: Task description

  1. Merqury Tool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3 with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: read_db (output of Meryl tool)
      • “Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)
        • param-file “Genome assembly”: get_seqs_purged (output of Purge overlaps tool)

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: get_seqs_purged (output of Purge overlaps tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: Vertebrata
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``

Sub-step with Merqury

hands_on Hands-on: Task description

  1. Merqury Tool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3 with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: read_db (output of Meryl tool)
      • “Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)
        • param-file “Genome assembly”: scaffolds_fasta (output of SALSA tool)

Sub-step with Busco

hands_on Hands-on: Task description

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: scaffolds_fasta (output of SALSA tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Lineage”: Vertebrata
    • In “Advanced Options”:
      • “Which outputs should be generated”: ``

Sub-step with Quast

hands_on Hands-on: Task description

  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy1 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: scaffolds_fasta (output of SALSA tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
    • In “Genes”:
      • “Tool for gene prediction”: Don't predict genes

Sub-step with Map with BWA-MEM

hands_on Hands-on: Task description

  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: scaffolds_fasta (output of SALSA tool)
    • “Single or Paired-end reads”: Paired
      • param-file “Select first set of reads”: output (Input dataset)
      • param-file “Select second set of reads”: output (Input dataset)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode

Sub-step with PretextMap

hands_on Hands-on: Task description

  1. PretextMap Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.6+galaxy0 with the following parameters:
    • param-file “Input dataset in SAM or BAM format”: bam_output (output of Map with BWA-MEM tool)
    • “Sort by”: Don't sort
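
A Hi-C contact map is, at its core, a matrix that counts how many read pairs link each pair of genomic bins. The Python sketch below builds such a matrix for a single sequence. It is a conceptual illustration with invented positions, not the PretextMap algorithm or file format.

```python
def contact_matrix(pairs, genome_length: int, bin_size: int):
    """Count Hi-C read pairs per pair of genomic bins (symmetric matrix).

    `pairs` holds (position_a, position_b) tuples with 0-based positions
    smaller than `genome_length`.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Invented read pairs on a 30 bp sequence, binned at 10 bp.
for row in contact_matrix([(5, 8), (5, 25), (12, 27)], genome_length=30, bin_size=10):
    print(row)
```

In a well-scaffolded assembly most counts fall close to the diagonal, because loci that are near each other on a chromosome interact more often.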

Sub-step with Pretext Snapshot

hands_on Hands-on: Task description

  1. Pretext Snapshot Tool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy0 with the following parameters:
    • param-file “Input Pretext map file”: pretext_map_out (output of PretextMap tool)
    • “Output image format”: png
    • “Show grid?”: Yes

Conclusion

Sum up the tutorial and the key takeaways here. We encourage adding an overview image of the pipeline used.

Key points

  • The take-home messages

  • They will appear at the end of the tutorial

Frequently Asked Questions

Have questions about this tutorial? Check the FAQ page for the Assembly topic to see whether your question is listed there. If not, please ask on the GTN Gitter Channel or the Galaxy Help Forum.

References

  1. Marcotte, E. M., M. Pellegrini, T. O. Yeates, and D. Eisenberg, 1999 A census of protein repeats. Journal of Molecular Biology 293: 151–160. 10.1006/jmbi.1999.3136
  2. Miller, J. R., S. Koren, and G. Sutton, 2010 Assembly algorithms for next-generation sequencing data. Genomics 95: 315–327. 10.1016/j.ygeno.2010.03.001
  3. Frenkel, S., V. Kirzhner, and A. Korol, 2012 Organizational Heterogeneity of Vertebrate Genomes (V. Laudet, Ed.). PLoS ONE 7: e32076. 10.1371/journal.pone.0032076
  4. Chalopin, D., M. Naville, F. Plard, D. Galiana, and J.-N. Volff, 2015 Comparative Analysis of Transposable Elements Highlights Mobilome Diversity and Evolution in Vertebrates. Genome Biology and Evolution 7: 567–580. 10.1093/gbe/evv005
  5. Sotero-Caio, C. G., R. N. Platt, A. Suh, and D. A. Ray, 2017 Evolution and Diversity of Transposable Elements in Vertebrate Genomes. Genome Biology and Evolution 9: 161–177. 10.1093/gbe/evw264
  6. Vurture, G. W., F. J. Sedlazeck, M. Nattestad, C. J. Underwood, H. Fang et al., 2017 GenomeScope: fast reference-free genome profiling from short reads (B. Berger, Ed.). Bioinformatics 33: 2202–2204. 10.1093/bioinformatics/btx153
  7. Angel, V. D. D., E. Hjerde, L. Sterck, S. Capella-Gutierrez, C. Notredame et al., 2018 Ten steps to get started in Genome Assembly and Annotation. F1000Research 7: 148. 10.12688/f1000research.13598.1
  8. Tørresen, O. K., B. Star, P. Mier, M. A. Andrade-Navarro, A. Bateman et al., 2019 Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Research 47: 10994–11006. 10.1093/nar/gkz841
  9. Wang, H., B. Liu, Y. Zhang, F. Jiang, Y. Ren et al., 2020 Estimation of genome size using k-mer frequencies from corrected long reads. arXiv preprint arXiv:2003.11817.
  10. Zhang, X., R. Wu, Y. Wang, J. Yu, and H. Tang, 2020 Unzipping haplotypes in diploid and polyploid genomes. Computational and Structural Biotechnology Journal 18: 66–72. 10.1016/j.csbj.2019.11.011
  11. Rhie, A., B. P. Walenz, S. Koren, and A. M. Phillippy, 2020 Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21: 10.1186/s13059-020-02134-9
  12. Hon, T., K. Mars, G. Young, Y.-C. Tsai, J. W. Karalius et al., 2020 Highly accurate long-read HiFi sequencing data for five complex genomes. Scientific Data 7: 10.1038/s41597-020-00743-4
  13. Rhie, A., S. A. McCarthy, O. Fedrigo, J. Damas, G. Formenti et al., 2021 Towards complete and error-free genome assemblies of all vertebrate species. Nature 592: 737–746. 10.1038/s41586-021-03451-0

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

Citing this Tutorial

  1. Delphine Lariviere, Alex Ostrovsky, 2021 VGP assembly pipeline (Galaxy Training Materials). https://training.galaxyproject.org/archive/2021-08-01/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology. Cell Systems 6: 752–758.e1. 10.1016/j.cels.2018.05.012

details BibTeX

@misc{assembly-vgp_genome_assembly,
author = "Delphine Lariviere and Alex Ostrovsky",
title = "VGP assembly pipeline (Galaxy Training Materials)",
year = "2021",
month = "07",
day = "22",
url = "\url{https://training.galaxyproject.org/archive/2021-08-01/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                

Congratulations on successfully completing this tutorial!