ERGA post-assembly QC

Overview
Creative Commons License: CC-BY Questions:
  • what combination of tools can assess the quality of an post-assembly?

  • what metrics can help to analyse the quality?

  • how to evaluate the outputs?

Objectives:
  • apply the post-assembly-QC-workflow using the necessary tools

  • analyse and evaluate the results of the workflow

Requirements:
Time estimation: 3 hours
Supporting Materials:
Published: Jun 23, 2023
Last modification: Nov 3, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00355
rating Rating: 5.0 (1 recent ratings, 1 all time)
version Revision: 12

The European Reference Genome Atlas (ERGA) is a large-scale project aimed at generating and integrating high-quality reference genomes for a wide range of European organisms. The project will use state-of-the-art sequencing technologies and advanced bioinformatics tools to produce high-quality genome assemblies.

Reference genomes provide a baseline for understanding genetic diversity within and among populations, and can be used to identify populations at risk of genetic erosion. This information is crucial for developing effective conservation strategies and management plans for threatened and endangered species (Shafer et al. 2015). Additionally, by better understanding the genetic basis of important traits, such as disease resistance and adaptation to changing environments, researchers can develop targeted interventions to mitigate the effects of environmental change and prevent the loss of genetic diversity (FRANKHAM et al. 2011). The ERGA project has the potential to greatly benefit biodiversity conservation efforts and advance our understanding of the genetic basis of biodiversity.

Genome post-assembly quality control (GPAQC) is a crucial step for evaluating the accuracy and completeness of newly assembled genomes. This involves assessing the contiguity, completeness, accuracy, and consistency of the genome assembly using various bioinformatic tools and methods (Koren et al. 2017, Hunt et al. 2015, Mikheenko et al. 2015, Vaser et al. 2017, Zimin et al. 2017). GPAQC aims to ensure that genomic data is reliable and useful for downstream analyses such as annotation, comparative genomics, and functional studies.

In this tutorial you will learn how to implement the ERGA GPAQC pipeline, and how to interpretate the potential outcomes.

Agenda

In this tutorial, we will cover:

  1. Study cases
  2. Get data
  3. Genome assembly overview with BlobToolKit
    1. Generate read coverage data with HISAT2
    2. Generate sequence similarity data with DIAMOND
    3. Generate BUSCO report
    4. Generate interactive plots with BlobToolKit
  4. K-mer based genome profiling and evaluation
    1. Generating k-mer profile with Meryl
    2. K-mer profile analysis with GenomeScope
    3. K-mer based assembly evaluation with Merqury
    4. Providing assembly statistics with gfastats
  5. Hi-C scaffolding
    1. Pre-processing Hi-C data
    2. Generate Hi-C contact map
  6. Assembly graph
  7. Conclusion

Study cases

In this tutorial we will evaluate three genome assemblies, belonging to three different taxonomic groups, in order to illustrate the different scenarios that can be identified. The characteristics of each of them are described briefly below.

Case 1: Chondrosia reniformis

Chondrosia reniformis is a slow-growing marine sponge cosmopolitan species which can be found in the Mediterranean Sea and the eastern Atlantic Ocean in shallow waters; it is considered to playing an important ecological role in the marine ecosystem by filtering large volumes of water and providing habitat for other species (Voultsiadou 2005). The members of this species are gonochoristic and oviparous, whose physiology and behavious seems to be highly influence for the presence of endosymbiosis heterotrophic bacteria (Sara Michele and Carlo 1998).

Figure 1: Distribution of data. Open image in new tab

Figure 1: Histories corresponding to the three cases of study.

The assembly is based on 70x PacBio HiFi and Arima2 Hi-C data generated by the Aquatic Symbiosis Genomics Project. The assembly process included the following sequence of steps: initial PacBio assembly generation with Hifiasm, retained haplotig separation with purge_dups, and Hi-C based scaffolding with YaHS. The mitochondrial genome was assembled using MitoHiFi. Finally, the primary assembly was analysed and manually improved using gEVAL.

Case 2: Erythrolamprus reginae

Erythrolamprus reginae is a species of colubrid snake found in South America. This species has been reported to include triploid individuals with parthenogenic reproduction; this type of seems to be associated with higher mutation rate tandem dupliation (Bogart 1980).

Figure 2: Distribution of data. Open image in new tab

Figure 2: Histories corresponding to the three cases of study.

The assembly used in this tutorial correspond to the curated primary assembly generated by the VGP project, based on 34x Pacbio HiFi and Arima2 Hi-C data, by using the VGP assembly pipeline.

Case 3: Eschrichtius robustus

Eschrichtius robustus, commonly known as the gray whale, is a species of whale found primarily in the North Pacific Ocean. Adult gray whales can reach lengths of up to 14.9 meters and weights of up to 36,000 kilograms. It is a diploid specie, genetically characterized by high homozygosity as a result of inbreeding (Brüniche-Olsen et al. 2018).

Figure 3: Distribution of data. Open image in new tab

Figure 3: Histories corresponding to the three cases of study.

The assembly used in this tutorial correspond to the curated primary assembly generated by the VGP project, based on 29x Pacbio HiFi data, by using the VGP assembly pipeline.

Get data

As a first step we will get the data from Zenodo.

Hands-on: Upload data
  1. Create a new history for allocating temporally all the datasets. This history can be named as Datasets history

  2. Import the Pacbio HiFi files from Zenodo:

    • Open the file galaxy-upload upload menu
    • Click on Rule-based tab
    • “Upload data as”: Collections
    • Copy the tabular data, paste it into the textbox and press Build

      dataset_01   https://zenodo.org/record/7786773/files/hifi.fastq.gz   fastq.gz    HiFi    CReniformis_Pacbio
      dataset_01   https://zenodo.org/record/7784764/files/m64055_210602_211608.hifi_reads.fastq.gz   fastq.gz    HiFi    EReginae_PacBio
      dataset_02   https://zenodo.org/record/7784764/files/m54306Ue_211020_191957.hifi_reads.fastq.gz   fastq.gz    HiFi    EReginae_PacBio
      dataset_03   https://zenodo.org/record/7784764/files/m54306Ue_211016_070813.hifi_reads.fastq.gz   fastq.gz    HiFi    EReginae_PacBio
      dataset_04   https://zenodo.org/record/7786773/files/m54306U_210529_225553.hifi_reads.fastq.gz   fastq.gz    HiFi    EReginae_PacBio
      dataset_01   https://zenodo.org/record/7781236/files/m54306Ue_220411_220734.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz   fastq.gz    HiFi    ERobustus_PacBio
      dataset_02   https://zenodo.org/record/7781236/files/m54306Ue_220520_051743.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz   fastq.gz    HiFi    ERobustus_PacBio
      dataset_03   https://zenodo.org/record/7781236/files/m64055e_220615_033108.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz   fastq.gz    HiFi    ERobustus_PacBio
      dataset_04   https://zenodo.org/record/7786773/files/m64055e_220603_182128.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz   fastq.gz    HiFi    ERobustus_PacBio
      
    • From Rules menu select Add / Modify Column Definitions
      • Click Add Definition button and select List Identifier(s): column A
      • Click Add Definition button and select URL: column B
      • Click Add Definition button and select Type: column C
      • Click Add Definition button and select Group Tag: column D
      • Click Add Definition button and select Collection Name: column E
    • Click Apply and press Upload
  3. Import the remainig datasets from Zenodo:

    • Open the file galaxy-upload upload menu
    • Click on Rule-based tab
    • “Upload data as”: Datasets
    • Copy the tabular data, paste it into the textbox and press Build

      CReformitis_assembly    https://zenodo.org/record/7831298/files/assembly_sponge.fasta.gz    fasta.gz    assembly
      CReformitis_metadata    https://zenodo.org/record/7781236/files/metadata_chon.yaml  yaml    metadata
      CReniformis_Hi-C_F   https://zenodo.org/record/7786773/files/hiC_1.fastq.gz   fastq.gz    Hi-C
      CReniformis_Hi-C_R   https://zenodo.org/record/7786773/files/hiC_2.fastq.gz   fastq.gz    Hi-C
      EReginae_Hi-C_F_01   https://zenodo.org/record/7831762/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.001   fastq.gz    Hi-C
      EReginae_Hi-C_F_02   https://zenodo.org/record/7831762/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.002   fastq.gz    Hi-C
      EReginae_Hi-C_F_03   https://zenodo.org/record/7831763/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.003   fastq.gz    Hi-C
      EReginae_Hi-C_R_01   https://zenodo.org/record/7833514/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.001   fastq.gz    Hi-C
      EReginae_Hi-C_R_02   https://zenodo.org/record/7833514/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.002   fastq.gz    Hi-C
      EReginae_Hi-C_R_03   https://zenodo.org/record/7831763/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.003   fastq.gz    Hi-C
      EReginata_assembly    https://zenodo.org/record/7788734/files/rEryReg1.pri.cur.20230105.fasta.gz    fasta.gz    assembly
      ERegina_metadata    https://zenodo.org/record/7781236/files/metadata_eryth.yaml  fasta.gz    metadata
      ERobustus_assembly  https://zenodo.org/record/7781236/files/mEscRob2.pri.cur.20221201.fasta.gz  fasta.gz    assembly
      ERobustus_metadata  https://zenodo.org/record/7781236/files/metadata_esch.yaml  yaml    metadata
      Taxonomy_data   https://zenodo.org/record/7781236/files/new_taxdump.tar.gz  gz  taxonomy
      Diamond_db   https://zenodo.org/record/7788734/files/Diamond_makedb.dmnd  dmnd  database
      
    • From Rules menu select Add / Modify Column Definitions
      • Click Add Definition button and select Name: column A
      • Click Add Definition button and select URL: column B
      • Click Add Definition button and select Type: column C
      • Click Add Definition button and select Name Tag: column D
    • Click Apply and press Upload

Once we have imported all the datasets, we will move each one to its correspondent history.

Hands-on: Upload data
  1. Create three new empty histories, one for each specie.
  2. Rename the histories as Case 1: *Chondrosia reniformis*, Case 2: *Erythrolamprus reginae* and Case 3: *Eschrichtius robustus*.
  3. Click in History options and select Show Histories Side-by-Side
  4. Click in Select histories, and include the histories corresponding to the three species.
  5. Move the datasets to its correspondent history.

    Comment: Non-unique datasets

    Both the Taxonomic_data and the Diamond_database should be included in all of them.

  1. Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
  2. Type the new name
  3. Click on Save

If you do not have the galaxy-pencil (Edit) next to the history name:

  1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
  2. Type the new name
  3. Press Enter

Once all the datasets have been copied to their correspondent history, we should obtain something similar to this:

Figure 4: Distribution of data. Open image in new tab

Figure 4: Histories corresponding to the three cases of study.

Genome assembly overview with BlobToolKit

BlobToolKit is a tool designed to assist researchers in analyzing and visualizing genome assembly data. The tool uses information from multiple data sources such as read coverage, gene expression, and taxonomic annotations to generate a comprehensive overview of genome assembly data (Challis et al. 2020). One of the key characteristics of BlobToolKit is its ability to provide with a user-friendly interactive interface for analyzing complex genome assembly data.

In this tutorial, we will use BlobToolKit in order to integrate the following data:

  • Read coverage data: BlobToolKit can use read coverage information to identify potential errors or gaps in the genome assembly. By comparing the depth of coverage across the genome, it can highlight regions that may be overrepresented or underrepresented, which can indicate potential issues with the assembly.
  • Taxonomic annotations: BlobToolKit can use taxonomic information to identify potential contaminants or foreign DNA in the genome assembly. It does this by comparing the taxonomic profile of the genome assembly to a reference database of known organisms.
  • Sequence similarity data: Sequence similarity data can be used to identify potential misassemblies or contaminants in the genome assembly. BlobToolKit can use BLAST/DIAMOND searches to compare the genome assembly to reference databases and identify regions that may be problematic.
  • BUSCO reports: BlobToolKit can use BUSCO data to provide additional information about the quality of a genome assembly. It can generate plots of the number of complete and partial BUSCO genes in the genome assembly, as well as the number of missing and fragmented genes.
Comment: Why should we evaluate contaminants?

A significant proportion of the genome sequences in both GenBank and RefSeq (0.54% and 0.34% of entries, respectively) include sequences from contaminants; the contamination primarily exists in the form of short contigs, flanking regions on longer contigs, or areas of larger scaffolds that are flanked by Ns, although a few longer sequences with contamination were also detected (Steinegger and Salzberg 2020).

In the next steps, we will generate the data required for generating the visualization plots with BlobToolKit.

Generate read coverage data with HISAT2

Read coverage is an essential metric for evaluating the quality of genome assemblies, and it provides valuable information for identifying regions of high and low quality, detecting misassemblies, and identifying potential contaminants. Thus, for example, unexpected regions of low coverage suggests potential errors, such as misassemblies, gaps, or low complexity regions (Koren et al. 2017).

In this tutorial we will use HISAT2 for generation the coverage data. This tool uses a indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, which enables efficient and accurate alignment (Zhang et al. 2021). It then provides the alignment output in BAM file format which we will then use as an input for BlobToolKit.

Comment: How is coverage information encoded in the BAM file?

Coverage information is encoded in the BAM file format through the number of reads that align to each position in the reference genome. Specifically, the BAM file stores each read’s start and end position in the genome, as well as the length and orientation of the read, among other information. To calculate coverage, the number of reads that overlap each position in the reference genome is counted.

Hands-on: Generate BAM file with HISAT2
  1. Collapse Collection ( Galaxy version 5.1.0) with the following parameters:
    • param-collection “Collection of files to collapse into single dataset”: CReniformis_Pacbio
  2. HISAT2 ( Galaxy version 2.2.1+galaxy1) with the following parameters:
    • “Source for the reference genome”: Use a genome from history
      • param-file “Select the reference genome”: CReformitis_assembly
    • “Is this a single or paired library”: Single-end
      • param-file “FASTA/Q file”: output of Collapse Collection tool
  3. Repeat the prevous steps with the datasets from the two remanining species.

Generate sequence similarity data with DIAMOND

DIAMOND is a sequence alignment tool that utilizes a more efficient algorithm compared to BLAST, allowing for much faster searches of large sequence databases. Specifically, DIAMOND uses a sensitive seed-extension approach that compares a set of small segments (seeds) from the query sequence to a database, and then extends the alignments based on the highest-scoring hits. This approach allows DIAMOND to perform up to 20,000 times faster than BLAST, with comparable or improved sensitivity and accuracy (Buchfink et al. 2014).

Hands-on: Task description
  1. Diamond ( Galaxy version 2.0.15+galaxy0) with the following parameters:
    • “Alignment mode”: DNA query sequences (blastx)
      • “Allow for frameshifts?”: yes
        • “restrict hit culling to overlapping query ranges”: Yes
        • “frame shift penalty”: 15
    • param-file “Input query file in FASTA or FASTQ format”: CReformitis_assembly
    • “Will you select a reference database from your history or use a built-in index?”: Use one from the history
      • param-file “Select the reference database”: Diamond_db
    • “Restrict search taxonomically?”: No
    • “Sensitivity Mode”: Fast (--fast)
      • “Block size in billions of sequence letters to be processed at a time”: 10.0
    • “Method to filter?”: Maximum e-value to report alignments
    • “Method to restrict the number of hits?”: Maximum number of target sequences
    • In “Output options”:
      • “Format of output file”: BLAST tabular
        • “Tabular fields”: Query Seq-id,Subject Seq -id, Start of alignment in query, End of alignment in query, Expected value, Bit score and Unique Subject Taxonomy ID(s), separated by a ',' (in numerical order)
  2. Repeat the prevous steps with the datasets from the two remanining species.

Generate BUSCO report

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool used for quantitative assessment of genome assembly, gene set, and transcriptome completeness. It is based on evolutionarily informed expectations of gene content derived from near-universal single-copy orthologs (Simão et al. 2015)

Comment: Orthologs

Orthologs are genes in different species which have usually the same function and have evolved from a common ancestral gene. They are important for new genome assemblies in order to predict gene functions and help with gene annotation (Gennarelli and Cattaneo 2010).

Hands-on: Estimate single copy gene representation completeness
  1. Busco ( Galaxy version 5.4.4+galaxy0) with the following parameters:
    • param-file “Sequences to analyse”: outfile (output of Replace tool)
    • “Mode”: Genome assemblies (DNA)
      • “Use Augustus instead of Metaeuk”: Use Metaeuk
    • “Auto-detect or select lineage?”: Auto-detect
    • “Which outputs should be generated”: Short summary text

    BUSCO sets represent 3023 genes for vertebrates, 2675 for arthropods, 843 for metazoans, 1438 for fungi and 429 for eukaryotes. An intuitive metric is provided in BUSCO notation - C:complete[D:dublicated], F:fragmented, M:missing, n:number of genes used.

  2. Repeat the prevous steps with the datasets from the two remanining species.
Question

What are the values of complete BUSCO genes for each species?

  1. Eschrichtius robustus: 96.0%
  2. Erythrolamprus reginae: 91.3%
  3. Chondrosia reniformis: 82.8%

Generate interactive plots with BlobToolKit

BlobToolKit is a tool designed to assist researchers in analyzing and visualizing genome assembly data. The tool uses information from multiple data sources such as read coverage, gene expression, and taxonomic annotations to generate a comprehensive overview of genome assembly data (Challis et al. 2020). One of the key characteristics of BlobToolKit is its ability to provide with a user-friendly interactive interface for analyzing complex genome assembly data.

To work with BlobToolKit we need to create a new dataset structure called BlobDir. Therefore the minimum requirement is a fasta file which contains the sequence of our assembly. A list of sequence identifiers and some statistics like length, GC proportion and undefined bases will then be generated.

To get a more meaningful analysis and therefore more useful information about our assembly, it is better to provide as much data as we can get. In our case we will also provide a metadata file if possible, NCBI taxonomy ID and the NCBI taxdump directory (Challis et al. 2020).

Hands-on: Creating the BlobDir dataset
  1. BlobToolKit ( Galaxy version 3.4.0+galaxy0) with the following parameters:
    • “Select mode”: Create a BlobToolKit dataset
      • param-file “Genome assembly file”: CReformitis_assembly
      • param-file “Metadata file”: CReformitis_metadata
      • “NCBI taxonomy ID”: 121349
      • param-file “NCBI taxdump directory”: Taxonomy_Data
  2. BlobToolKit ( Galaxy version 3.4.0+galaxy0) with the following parameters:
    • “Select mode”: Add data to a BlobToolKit dataset
      • param-file “Blobdir.tgz file”: output of BlobToolKit tool
      • param-file “BUSCO full table file”: Full table (output of Busco tool)
      • “BLAST/Diamond hits”: Enabled
        • “BLAST/Diamond hits dataset”: output of Diamond tool
      • param-file “BAM/SAM/CRAM read alignment file”: output of HISAT2 tool
  3. Interactive BlobToolKit with the following parameters:
    • param-file “Blobdir file”: output of BlobToolKit tool
  4. Click on the galaxy-eye (eye) icon and inspect the output of Interactive BlobToolKit.

  5. Repeat the prevous steps with the datasets from the two remanining species.

Now, we will evaluate three of plots that we can find in the Blobtoolkit interactve interface. First, we will start with the snailplot, which provides a holistic view of the assembly.

Figure 5: BlobToolKit snail plot. Open image in new tab

Figure 5: Chondrosia reniformis snail plot summary. The grey line represents the assembly scaffolds, where the distance to the center of the circle indicates the length of the scaffolds. A scale line in the center of the circle helps to gauge the length. Longest contig: The red line in the plot indicates the longest contig in the assembly. N50 and N90: The dark orange and light orange lines represent the scaffolds contained in the N50 and N90 metrics, respectively. GC/AT content: The outer light and dark blue lines represent the GC and AT content, respectively. In an ideal assembly, the line between the two colors should be consistent, without much fluctuation. Gaps: Gaps in the assembly, if present, are denoted by the percent N in the bottom right of the image.

The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 117,390,217 bp assembly. The distribution of sequence lengths is shown in dark grey with the plot radius scaled to the longest sequence present in the assembly (10,413,042 bp, shown in red).

Orange and pale-orange arcs show the N50 and N90 sequence lengths (8,459,200 and 6,903,244 bp), respectively. The pale grey spiral shows the cumulative sequence count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the eukaryota_odb10 set is shown in the top right.

Question: Snailplot question

Which differences do you appreciate between the snailplots corresponding to Erythrolamprus reginae and Eschrichtius robustus?

Figure 6: BlobToolKit snailplots comparison. Open image in new tab

Figure 6: Comparison between snailplots from E. reginae (A) and E. robustus (B).
  • (A) The snake (Erythrolamprus reginae) does have a relatively large proportion of the genome assembled into some few and long scaffolds (reflected in the high “longest scaffold” value and the high “Log10 scaffold count” value).
  • (B) The whale (Eschrichtius robustus) does have a relatively large proportion of the genome assembled into fragmented scaffolds (reflected in the low N90 value compared to the N50 value and the high “Log10 scaffold count” value)

Now, we are going to analyze the blob plot corresponding to Erythrolamprus reginae, which is specially useful for gaining insights into the composition of your genomic data and identify potential contaminants or endosymbionts.

Figure 7: BlobToolKit circle plot. Open image in new tab

Figure 7: Blob plot of base coverage in input against GC proportion for sequences in *E. reginate assembly*. Sequences are coloured by phylum. Circles are sized in proportion to sequence length . Histograms show the distribution of sequence length sum along each axis

The blob plot is a two-dimensional scatter plot that helps in visualizing and analyzing genomic data for quality control, contaminant detection, and filtering. In the circle plot, each sequence is represented by a circle, with its diameter proportional to the sequence length. Circles are colored based on their taxonomic affiliation, and their positions on the X and Y axes are determined by their GC content and coverage, respectively. GC content is the proportion of G and C bases in the sequence, which can differ substantially between genomes. Coverage, on the other hand, is a measure of the number of times a particular sequence has been read during the sequencing process. The plot also includes coverage and GC histograms for each taxonomic group, weighted by the total span (cumulative length) of sequences occupying each bin.

Comment: Pre-curated assembly evaluation

Usually, the evaluation of the pre-curated assembly yield quite different results when compared with the post-curation evaluation. A frequent difference is the abudance of contigs that belong to different organisms, as a result of contamination in the reads. It can be clearly appreciated in the next figure, which correspond to the E. reginae pre-curated assembly. A striking element is the presence of 38 contigs corresponding to the phylum Artverviricota (dark green), a group of viruses that encode a reverse transcriptase.

Figure 8: BlobToolKit circle plot. Open image in new tab

Figure 8: Blob plot of base coverage in input against GC proportion for sequences in pre-curated E. reginate assembly. Sequences are coloured by phylum. Circles are sized in proportion to sequence length . Histograms show the distribution of sequence length sum along each axis

In addition, as we can appreciate, the samples included contamination from fungi, arthropoda and bacteria.

Question: Blob plot question

Does exist significant contamination in the Eschrichtius robustus and Chondrosia reniformis assemblies?

Figure 9: BlobToolKit snailplots comparison. Open image in new tab

Figure 9: Comparison between blob plots from E. robustus (A) and C. reniformis (B).
  • (A) The whale (Eschrichtius robustus) does have significant contamination. Approximately half of the assembly of the whale does contain contaminants or cosymbionts (total count: 703, Chordata count: 316).
  • (B) The sponge (Condrosia reniformis) doesn’t contain any contamination (total count: 15, Chordata count: 15).

Finally, let’s have a look at the accumulative plot, which shows the curves for subsets of scaffolds assigned to each phylum relative to the overall assembly. It is useful for evaluating the contribution of the contigs from contaminated reads to the final assembly. In that case, first we will evaluate all sequences together, and then we will remove those correspoding to chordata or not classified (not-hit) in order to be able to evaluate in detail each of the contaminants. So, let’s start with the accumulative plot corresponding to all sequences (fig. 10).

In this kind of plot, the x-axis represent the number of contigs (sorted by species and length), and the y-axis correspond to the cumulative length in nucleotides. The gray line shows the cumulative length of all sequences. As we can appreciate, most sequences corespond to the taxa chordata (1.9Gb, distributed along 151 contigs). In addition, we can see that the final assembly includes 174 contigs corresponding to different taxa, or not classified at all. In order to be able to analyze the contribution of those sequences, we will filter the contigs corresponding to the chordata.

Figure 10: BlobToolKit cumulativ plot. Open image in new tab

Figure 10: Cumulative sequence length for assembly Blobdir. The grey line shows cumulative length for all sequences. Coloured lines show cumulative lengths of sequences assigned to each phylum using the bestsum taxrule. Plot axes are scaled to the filtered assembly.
Comment: How do I apply filters to a dataset?

Datasets can be filtered based on any category or variable using the “Filters” menu. To filter based on a category, click the tab above any bar of the preview histogram to hide/show records assigned to this category. To filter variables, use the sliders at either end of the preview histogram to change the maximum and minimum values or enter the numbers directly into the text boxes above the histogram.

By hidding the contings corresponding to chordata, we can have a detailed view of the contributions of each contaminant (fig. 11).

Figure 11: BlobToolKit cumulativ plot. Open image in new tab

Figure 11: The grey line shows cumulative length for all sequences. Coloured lines show cumulative lengths of sequences assigned to each phylum using the bestsum taxrule and are stacked by cumulative value on the x-axis to show the proportion of each phylum in the overall assembly. The assembly has been filtered to exclude sequences with phylum matches Chordata. Plot axes are scaled to the filtered assembly.

In figure 11 we can appreciate clearly the relative contribution of each contaminant with respect to the total contamination.

K-mer based genome profiling and evaluation

k-mers can be utilized to infer genome characteristics such as length and heterozygosity by analyzing their frequency distribution in raw sequencing reads (Manekar and Sathe 2018). In this tutorial we will make use of three different tools in order to perform the k-mer based genome profiling and assembly evaluation.

Comment: Genome profiling details

You can find more details about genome profiling in the VGP training.

Generating k-mer profile with Meryl

DNA is double stranded and normally only one strand is sequenced. For our assembly we want to consider the other strand as well. Therefore canonical k-mers are used in most counting tools, exactly like in Meryl. A full k-mer pair is a sequence and the reverse complement of the sequence (e.g. ATG/CAT). The canonical sequence of a k-mer pair is the lexicographically smaller of the two reverse complementary sequences. So if CAT appears it will be counted as ATG. (Clavijo)

Meryl is a powerful tool for counting k-mers in large-scale genomic datasets. It uses a sorting-based approach that sorts the k-mers in lexicographical order.

Hands-on: Generating k-mer profile
  1. Meryl ( Galaxy version 1.3+galaxy6) with the following parameters:
    • “Operation type selector”: Count operations
      • param-collection “Input sequences”: output (Input dataset collection)
      • “K-mer size selector”: Set a k-mer size
        • “K-mer size”: 21
    Comment: compute k

    In general k can be computed as k=log4 (G(1-p)/p), with G as genome size and p as tolerable collision rate. In our case we set the k-mer size to 21.

  2. Run Meryl ( Galaxy version 1.3+galaxy6) again with the following parameters:
    • “Operation type selector”: Operations on sets of k-mers
      • “Operations on sets of k-mers”: Union-sum: return k-mers that occur in any input, set the count to the sum of the counts
      • param-file “Input meryldb”: previous output of Meryl tool
  3. Run Meryl ( Galaxy version 1.3+galaxy6) a third time with the following parameters:
    • “Operation type selector”: Generate histogram dataset
      • param-file “Input meryldb”: previous output of Meryl tool
  4. Rename the output as Histogram dataset

K-mer profile analysis with GenomeScope

Genomescope is used for analysing genomes with the help of k-mer profile analysis. It estimates the overall genome characteristics and the overall read characteristics. The tool will use a given k-mer profile which is calculated only from raw reads sequencing data. It then generates a plot with the calculated data giving us information about the completeness and quality of the to be assembled data. (Vurture et al. 2017)

Hands-on: Generate plots for analysis
  1. GenomeScope ( Galaxy version 2.0+galaxy2) with the following parameters:
    • param-file “Input histogram file”: Histogram dataset
    • “Ploidy for model to use”: 2 for C. reniformis and E. robustus, 3 for E. reginae.
    Comment: Plots

    The generated plots will have the coverage on the x-axis and the frequency on the y-axis. It shows the fitted model in the observed k-mer data as well as the k-mer peaks and other information like some estimated parameters of the assembly.

Figure 12: Genomescope plot. Open image in new tab

Figure 12: Genomescope 21-mer profile (k:21) of Chondrosia reniformis, diploid (p:2). The plot includes estimations about the genome length (len:124,167,659bp), unique sequences (uniq:61.5%), homozygous portion (aa: 98.7%), heterozygous portions (ab:1.32%), mean k-mer coverage for heterozygous bases (kcov:35.4), read error rate (err:0.752%) and average rate of read duplications (dup:0.378).

Figure 12 corresponds to the k-mer profile of the sponge (Chondrosia reniformis). Presumably, the large gene number in the sponge genome is due to regional gene duplication; so far evidence for a transposition in sponges has been presented. Data indicate that only 0.25 % of the total sponge genome comprises CA/TG microsatellites, and until now also no SINEs/transposable elements have been identified. The estimated genome size around 124,Mbp is relatively close to Chondrosia reniformis genome size (117,39Mbp).

Question: Genome profile question

Which differences can you appreciate between the genome profile of Erythrolamprus reginae and Eschrichtius robustus?

Figure 13: Genomescope plot . Open image in new tab

Figure 13: Genomescope genome profile. Erythrolamprus reginae (A). Eschrichtius robustus (B).
  • (A) The genome profile of the snake (Erythrolamprus reginae) suggests a high level of heterozygosity (reflected by the first peak at 11.1x coverage). 51.6% unique sequences have been estimated. In this case the low value is attributed to the parthenogenesis and the polyploidy of this snake which causes tandem duplications and higher DNA replication as a result of the triploid-diploid cycles.
  • (B) The genome profile of the whale (Eschrichtius robustus) suggests a high level of homozygosity (reflected by the second peak at 29x coverage). The homozygous k-mer distribution can be expected from the high grade of inbreeding.

K-mer based assembly evaluation with Merqury

Merqury is designed for evaluating the completeness and accuracy of long-read genome assemblies using short-read sequencing data. Thus the quality of assemblies which are generated by using third-generation sequencing technologies can be reviewed and assessed by the tool. Merqury works by comparing k-mers of an assembly to those from unassembled high-accuracy reads of the raw sequencing data. K-mer-based methods are also used to identify errors and missing sequences. (Rhie et al. 2020)

Hands-on: Generating stats and plots
  1. Merqury ( Galaxy version 1.3+galaxy2) with the following parameters:
    • “Evaluation mode”: Default mode
      • param-file “K-mer counts database”: read_db (output of Meryl tool)
      • “Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)
        • param-file “Genome assembly”: output (Input dataset)
    Comment: Output

    Merqury will now generate following outputs:

    1. stats with completeness statistics
    2. QV stats with quality value statistics
    3. plots

Let’s have a closer look at the copy number plots for each of the three species. Merqury will generate k-mers from the raw sequencing data (in the following called the ‘read set’) and will compare them to the assembly, in contrast to genomescope. Despite being diploid or triploid the species only got plotted with the primary assembly. This is why only one-copy k-mers get evaluated.

Figure 14: Merqury plot. Open image in new tab

Figure 14: Merqury copy numbers (CN) plot of the whale (*Eschrichtius robustus*). The red area displays the k-mers of the assembly. The black area displays the k-mers only found in the read set. The black area can be indicative for sequencing error in the read set or missing sequences in the assembly.

The small black area indicates that most of the k-mers found in the read set are also found in the assembly (but not all since only the primary assembly got plotted). It reflects a highly homozygous assembly. This plot indicates a sequencing coverage at ~30x as seen before with genomescope.

Question: Genome profile question

Which differences can you appreciate between the genome profile of Chondrosia reniformis and Erythrolamprus reginae?

Figure 15: Merqury plot. Open image in new tab

Figure 15: Merqury copy numbers (CN) plot of Chondrosia reniformis (A) and Erythrolamprus reginae (B). The red area displays the k-mers of the assembly. The black area displays the k-mers only found in the read set. The black area can be indicative for sequencing error in the read set or missing sequences in the assembly.
  • (A) The sponge (Chondrosia reniformis) does have two peaks as seen before with genomescope which is typical for a diploid species. The large black area indicates that there is a high amount of k-mers in the read set which is not used in the assembly since only the primary assembly got plotted. The secondary assembly is missing.
  • (B) The snake (Erythrolamprus reginae) does have three peaks as seen before with genomescope which is typical for a triploid species. The black area is even larger compared to the sponge since only the primary assembly got plotted. The secondary and third assembly are missing.

Providing assembly statistics with gfastats

gfastats is a tool for providing summary statistics and genome file manipulation. In our case it will generate genome assembly statistics in a tabular-format output. Metrics like N50/L50, GC-content and lengths of contigs, scaffolds and gaps as well as other statistical information are provided for assessing the contiguity of the assembly.

Hands-on: Generate summary statistics
  1. gfastats ( Galaxy version 1.2.0+galaxy0) with the following parameters:
    • param-file “Input file”: output (Input dataset)
    • “Specify target sequences”: Disabled
    • “Tool mode”: Summary statistics generation
      • “Report mode”: Genome assembly statistics (--nstar-report)
    Comment: N50/NG50

    Consider taking all contigs and sorting them by size. Starting with the largest and ending with the smallest. Now add up the length of each contig beginning with the largest, then the second largest and so on. When reaching 50% of the total length of all contigs it’s done. The length of the contig you stopped is the N50 value. (Videvall)

    NG50 or more general NG(X) is based on the same idea as N50. The difference is that in this case the whole genome size or estimated genome size is taken into account. Through this comparisons can be made over different assemblies and genome sizes.

    Comment: L50

    Remember adding up the length of each contig until reaching the 50%. The L50 value is the number of the contig you have stopped.

    Example: The sum of all contigs together is 2000 kbp. The contig at 50% has length 300 kbp and is the third one and thus the third largest. Then N50 = 300 kbp and L50 = 3.

In the best case a high quality assembly should consist of just a few and large contigs to represent the genome as a whole. Therefore a good assembly should lead to a high N50 value and in contrast a low quality assembly with tiny, fragmented contigs would lead to a low N50 value. (Videvall)

The metric doesn’t only rely on measuring the 50% mark. The general case is N(X) where X ranges from 0 - 100 mostly in ten steps. However the NX metric is not suitable for comparing different species with different genome lengths.

The GC-content or guanine-cytosine ratio tells one about the occurrence of guanine and cytosine in a genome. It is stated in percent. The two nucleobases are held together by three hydrogen bonds. A high GC-content makes DNA more stable than a low GC-content. Because the ratio of most species and organisms has been found out by now, it is also a good metric to gauge completeness.(Wikipedia)

Another good metric for gauging contiguity is the CC ratio. The value is calculated by dividing contig counts by the chromosome pair number. A perfect score would be 1. Therefore the lower the value the better the contiguity of the assembly.

Hi-C scaffolding

To understand how chromosomes are arranged in their three-dimensional structure in the nucleus, high resolution and high throughput imaging techniques have been developed. Hi-C is a high throughput method to measure pairwise contacts between pairs of genomic loci. It is based on cross-linking the DNA in the nucleus and its histones(proteins). However the DNA gets cut and marked and afterwards the cutted parts get ligated together. Then the DNA gets fragmented and sequenced forwards and backwards generating paired reads. The resulting information can be used to assemble the reads to the corresponding chromosome. (Pal et al. 2018)

Pre-processing Hi-C data

Hands-on: Mapping Hi-C reads against a reference genome
  1. BWA-MEM2 ( Galaxy version 2.2.1+galaxy0) with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: output (Input dataset)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: output (output of Collapse Collection tool)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode
    • “BAM sorting mode”: Sort by read names (i.e., the QNAME field)
  2. Run BWA-MEM2 ( Galaxy version 2.2.1+galaxy0) again with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
      • param-file “Use the following dataset as the reference sequence”: output (Input dataset)
    • “Single or Paired-end reads”: Single
      • param-file “Select fastq dataset”: output (output of Collapse Collection tool)
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: 1.Simple Illumina mode
    • “BAM sorting mode”: Sort by read names (i.e., the QNAME field)
  3. Filter and merge ( Galaxy version 1.0+galaxy0) with the following parameters:
    • param-file “First set of reads”: bam_output (output of BWA-MEM2 tool)
    • param-file “Second set of reads”: bam_output (output of BWA-MEM2 tool)

Generate Hi-C contact map

PretextMap converts BAM/SAM files into genome contact maps. With those contact maps it is possible to analyse the 3D organisation of chromosomes. The resulting map will be used for visualisation and editing contigs/scaffolds. (Kumar et al. 2017)

Hands-on: Generate a contact map with PretextMap and PretextSnapshot
  1. PretextMap ( Galaxy version 0.1.9+galaxy0) with the following parameters:
    • param-file “Input dataset in SAM or BAM format”: outfile (output of Filter and merge tool)
    • “Sort by”: Don't sort
  2. Pretext Snapshot ( Galaxy version 0.0.3+galaxy1) with the following parameters:
    • param-file “Input Pretext map file”: pretext_map_out (output of PretextMap tool)
    • “Output image format”: PNG
    • “Show grid?”: Yes
Figure 16: Contact Map Snake. Open image in new tab

Figure 16: Hi-C contact map of Erythrolamprus reginae generated by Pretext.

The contact map of the snake (Erythrolamprus reginae) does have a clear diagonal pattern. These interactions are commonly called “cis” interactions and occur between genomic loci that are spatially close to each other. Interactions beyond the diagonal pattern are called “trans” and indicate long-range interactions.

Question: Hi-C contact map question

What can you say about the Hi-C map corresponding to Chondrosia reniformis?

Figure 17: Contact Map Sponge. Open image in new tab

Figure 17: Hi-C contact map of Chondrosia reniformis generated by Pretext.

The contact map of the sponge (Chondrosia reniformis) does contain widespread interactions all over the genome. Additionally a diagonal pattern can be observed. The widespread interactions all over the genome can be the result of experimental and technical issues or it can have a biological and funcitonal significance.

Assembly graph

Bandage is a tool to visualise de novo assembly graphs with connections. (Wick et al. 2015)

Hands-on: Generate assembly graph
  1. gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:
    • param-file “Input file”: output (Input dataset)
    • “Specify target sequences”: Disabled
    • “Tool mode”: Genome assembly manipulation
      • “Output format”: GFA
        • “Terminal overlaps selection”: No (default)
  2. Bandage Image ( Galaxy version 2022.09+galaxy4) with the following parameters:
    • param-file “Graphical Fragment Assembly”: output (output of gfastats tool)
Figure 18: Assembly Graph Sponge. Open image in new tab

Figure 18: Assembly graph of the sponge (*Chrondrosia reniformis*). Contigs (nodes) are displayed in different colors and their connections (edges) are displayed as red dotted lines.

The Assembly graph of Chondrosia reniformis is relatively simple and doesn’t have a complex structure. The connectivity of the nodes is clear but uncomplete and lacking some indicative regions of overlapping contigs. Repetitive regions can be investigated by searching for multiple paths or connections in the graph.

Question: Assembly graph question

Which differences can you appreciate between the assembly graph corresponding to Erythrolamprus reginae and Eschrichtius robustus?

Figure 19: Assembly Graph Snake and Whale. Open image in new tab

Figure 19: Assembly graph of Erythrolamprus reginae (A) and Eschrichtius robustus (B). Contigs (nodes) are displayed in different colors and their connections (edges) are displayed as red dotted lines.
  • (A) The assembly graph of the snake (Erythrolamprus reginae) does have a confusing and complex structure. This can be an indicator for errors or misassemblies. The nodes are well-connected but relatively unclear in their path which is reflected by many small contigs and only a few large contigs. This assembly graph does contain a lot of repetitive sequences that result in multiple paths and connections between the nodes.
  • (B) The assembly graph of the whale (Eschrichtius robustus) is overall relatively simple and doesn’t have a too complex structure. The nodes are well-connected and most of the paths are clear which is reflected by many large contigs and only some small contigs. This assembly does also contain repetitive regions but clearly less than in the assembly of the snake.

Conclusion

In conclusion, it’s worth to run the the post-assembly workflow to assess the quality of genome assemblies. The following table contains metrics to sum up the quality control.

Figure 20:  Final stats. Open image in new tab

Figure 20: This table contains 9 indicators for quality evaluation of three different organisms: Chondrosia reniformis, Eschrichtius robustus and Erythrolamprus reginae.