ERGA post-assembly QC
Author(s) | Fabian Recktenwald Cristóbal Gallardo Tom Brown Delphine Lariviere |
Reviewers |
OverviewQuestions:Objectives:
What combination of tools can assess the quality of a genome assembly?
What metrics can help to analyse the quality of an assembly?
How do we evaluate the outputs?
Requirements:
Apply the post-assembly-QC-workflow using the necessary tools
Analyse and evaluate the results of the workflow
Time estimation: 3 hoursSupporting Materials:Published: Jun 23, 2023Last modification: Oct 8, 2024License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00355rating Rating: 5.0 (0 recent ratings, 1 all time)version Revision: 6
The European Reference Genome Atlas (ERGA, Mazzoni et al. 2023) is a large-scale network of researchers aiming to generate high-quality reference genomes for all eukaryotic life in Europe and build capacity to allow researchers anywhere to generate reference genomes and use them to answer questions regarding species conservation and biodiversity. ERGA uses state-of-the-art sequencing technologies and advanced bioinformatics tools to produce high-quality genome assemblies.
Reference genomes provide a baseline for understanding genetic diversity within and among populations, and can be used to identify populations at risk of genetic erosion. This information is crucial for developing effective conservation strategies and management plans for threatened and endangered species (Shafer et al. 2015). Additionally, by better understanding the genetic basis of important traits, such as disease resistance and adaptation to changing environments, researchers can develop targeted interventions to mitigate the effects of environmental change and prevent the loss of genetic diversity (FRANKHAM et al. 2011). The ERGA project has the potential to greatly benefit biodiversity conservation efforts and advance our understanding of the genetic basis of biodiversity.
Genome post-assembly quality control is a crucial step for evaluating the accuracy and completeness of newly assembled genomes. This involves assessing the contiguity, completeness, accuracy, and consistency of the genome assembly using various bioinformatic tools and methods (Koren et al. 2017, Hunt et al. 2015, Mikheenko et al. 2015, Vaser et al. 2017, Zimin et al. 2017). The QC process aims to ensure that genomic data is reliable and useful for downstream analyses such as annotation, comparative genomics, and functional studies.
In this tutorial you will learn how to implement the ERGA Genome Assembly QC pipeline, and how to interpretate the potential outcomes.
AgendaIn this tutorial, we will cover:
Study cases
In this tutorial we will evaluate three genome assemblies, belonging to three different taxonomic groups, in order to illustrate the different scenarios that can be identified. The characteristics of each of them are described briefly below.
Case 1: Chondrosia reniformis
Chondrosia reniformis is a slow-growing marine sponge cosmopolitan species which can be found in the Mediterranean Sea and the eastern Atlantic Ocean in shallow waters; it is considered to playing an important ecological role in the marine ecosystem by filtering large volumes of water and providing habitat for other species (Voultsiadou 2005). The members of this species are gonochoristic and oviparous, whose physiology and behavious seems to highly influence the presence of endosymbiosis heterotrophic bacteria (Sara Michele and Carlo 1998).
The assembly is based on 70x PacBio HiFi and Arima2 Hi-C data generated by the Aquatic Symbiosis Genomics Project. The assembly process included the following sequence of steps: initial PacBio assembly generation with Hifiasm, retained haplotig separation with purge_dups, and Hi-C based scaffolding with YaHS. The mitochondrial genome was assembled using MitoHiFi. Finally, the primary assembly was analysed and manually improved using gEVAL.
Case 2: Erythrolamprus reginae
Erythrolamprus reginae is a species of colubrid snake found in South America. This species has been reported to include triploid individuals with parthenogenic reproduction; this type of seems to be associated with higher mutation rates and more tandem duplications in the genome (Bogart 1980).
The assembly used in this tutorial correspond to the curated primary assembly generated by the VGP project, based on 34x Pacbio HiFi and Arima2 Hi-C data, by using the VGP assembly pipeline.
Case 3: Eschrichtius robustus
Eschrichtius robustus, commonly known as the gray whale, is a species of whale found primarily in the North Pacific Ocean. Adult gray whales can reach lengths of up to 14.9 meters and weights of up to 36,000 kilograms. It is a diploid species, genetically characterized by high homozygosity (Brüniche-Olsen et al. 2018).
The assembly used in this tutorial correspond to the curated primary assembly generated by the VGP project, based on 29x Pacbio HiFi data, by using the VGP assembly pipeline.
Get data
As a first step we will get the data from Zenodo.
Hands-on: Upload data
Create a new history for allocating temporally all the datasets. This history can be named as
Datasets history
Import the Pacbio HiFi files from Zenodo:
- Open the file galaxy-upload upload menu
- Click on Rule-based tab
- “Upload type”:
Collections
Copy the tabular data, paste it into the textbox and press Build
dataset_01 https://zenodo.org/record/7786773/files/hifi.fastq.gz fastqsanger.gz HiFi CReniformis_Pacbio dataset_01 https://zenodo.org/record/7784764/files/m64055_210602_211608.hifi_reads.fastq.gz fastqsanger.gz HiFi EReginae_PacBio dataset_02 https://zenodo.org/record/7784764/files/m54306Ue_211020_191957.hifi_reads.fastq.gz fastqsanger.gz HiFi EReginae_PacBio dataset_03 https://zenodo.org/record/7784764/files/m54306Ue_211016_070813.hifi_reads.fastq.gz fastqsanger.gz HiFi EReginae_PacBio dataset_04 https://zenodo.org/record/7786773/files/m54306U_210529_225553.hifi_reads.fastq.gz fastqsanger.gz HiFi EReginae_PacBio dataset_01 https://zenodo.org/record/7781236/files/m54306Ue_220411_220734.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz fastqsanger.gz HiFi ERobustus_PacBio dataset_02 https://zenodo.org/record/7781236/files/m54306Ue_220520_051743.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz fastqsanger.gz HiFi ERobustus_PacBio dataset_03 https://zenodo.org/record/7781236/files/m64055e_220615_033108.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz fastqsanger.gz HiFi ERobustus_PacBio dataset_04 https://zenodo.org/record/7786773/files/m64055e_220603_182128.demultiplex.bc1010--bc1010.hifi_reads.fastq.gz fastqsanger.gz HiFi ERobustus_PacBio
- Click on
Build
- From Rules menu select
Add / Modify Column Definitions
- Click
Add Definition
button and selectList Identifier(s)
: columnA
- Click
Add Definition
button and selectURL
: columnB
- Click
Add Definition
button and selectType
: columnC
- Click
Add Definition
button and selectGroup Tag
: columnD
- Click
Add Definition
button and selectCollection Name
: columnE
- Click
Apply
and press UploadImport the remainig datasets from Zenodo:
- Open the file galaxy-upload upload menu
- Click on Rule-based tab
- “Upload type”:
Datasets
Copy the tabular data, paste it into the textbox and press Build
CReformitis_assembly https://zenodo.org/record/7831298/files/assembly_sponge.fasta.gz fasta.gz assembly CReformitis_metadata https://zenodo.org/record/7781236/files/metadata_chon.yaml yaml metadata CReniformis_Hi-C_F https://zenodo.org/record/7786773/files/hiC_1.fastq.gz fastqsanger.gz Hi-C CReniformis_Hi-C_R https://zenodo.org/record/7786773/files/hiC_2.fastq.gz fastqsanger.gz Hi-C EReginae_Hi-C_F_01 https://zenodo.org/record/7831762/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.001 fastqsanger.gz Hi-C EReginae_Hi-C_F_02 https://zenodo.org/record/7831762/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.002 fastqsanger.gz Hi-C EReginae_Hi-C_F_03 https://zenodo.org/record/7831763/files/rEryReg1_Royal_Ground_Snake_R1.fastq.gz.gz.003 fastqsanger.gz Hi-C EReginae_Hi-C_R_01 https://zenodo.org/record/7833514/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.001 fastqsanger.gz Hi-C EReginae_Hi-C_R_02 https://zenodo.org/record/7833514/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.002 fastqsanger.gz Hi-C EReginae_Hi-C_R_03 https://zenodo.org/record/7831763/files/rEryReg1_Royal_Ground_Snake_R2.fastq.gz.gz.003 fastqsanger.gz Hi-C EReginae_assembly https://zenodo.org/record/7788734/files/rEryReg1.pri.cur.20230105.fasta.gz fasta.gz assembly EReginae_metadata https://zenodo.org/record/7781236/files/metadata_eryth.yaml yaml metadata ERobustus_assembly https://zenodo.org/record/7781236/files/mEscRob2.pri.cur.20221201.fasta.gz fasta.gz assembly ERobustus_metadata https://zenodo.org/record/7781236/files/metadata_esch.yaml yaml metadata Taxonomy_data https://zenodo.org/record/7781236/files/new_taxdump.tar.gz gz taxonomy Diamond_db https://zenodo.org/record/7788734/files/Diamond_makedb.dmnd dmnd database
- From Rules menu select
Add / Modify Column Definitions
- Click
Add Definition
button and selectName
: columnA
- Click
Add Definition
button and selectURL
: columnB
- Click
Add Definition
button and selectType
: columnC
- Click
Add Definition
button and selectName Tag
: columnD
- Click
Apply
and press Upload
What data have we imported?
For each species, we have uploaded the fasta file corresponding to the assembly, the raw sequencing data used to generate each assembly - in these cases PacBio HiFi longreads and Illumina Hi-C reads - and databases necessary to run the QC tools included in this tutorial. We will use the taxonomy database new_taxdump.tar.gz
and the diamond database of protein sequences from the ncbi nt database to identify any sequence from contaminants or symbiont which are in our assembly using BlobToolKit
.
Once we have imported all the datasets, we will move each one to its correspondent history.
Hands-on: Upload data
- Create three new empty histories, one for each specie.
- Rename the histories as
Case 1: *Chondrosia reniformis*
,Case 2: *Erythrolamprus reginae*
andCase 3: *Eschrichtius robustus*
.- Click in History options and select
Show Histories Side-by-Side
- Click in
Select histories
, and include the histories corresponding to the three species.Drag and drop to move the datasets to their correspondent histories.
Comment: Non-unique datasetsBoth the Taxonomic_data and the Diamond_database should be included in all of them.
- Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
- Type the new name
- Click on Save
- To cancel renaming, click the galaxy-undo “Cancel” button
If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:
- Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
- Type the new name
- Press Enter
Once all the datasets have been copied to their correspondent history, we should obtain something similar to this:
Genome assembly overview with BlobToolKit
BlobToolKit is a tool designed to assist researchers in analyzing and visualizing genome assembly data. The tool uses information from multiple data sources such as read coverage, sequence length, and taxonomic annotations to generate a comprehensive overview of genome assembly data (Challis et al. 2020). One of the key characteristics of BlobToolKit is its ability to provide with a user-friendly interactive interface for analyzing complex genome assembly data.
In this tutorial, we will use BlobToolKit in order to integrate the following data:
- Read coverage data: BlobToolKit can use read coverage information to identify potential errors or gaps in the genome assembly. By comparing the depth of coverage across the genome, it can highlight regions that may be overrepresented or underrepresented, which can indicate potential issues with the assembly.
- Taxonomic annotations: BlobToolKit can use taxonomic information to identify potential contaminants or foreign DNA in the genome assembly. It does this by comparing the taxonomic profile of the genome assembly to a reference database of known organisms.
- Sequence similarity data: Sequence similarity data can be used to identify potential misassemblies or contaminants in the genome assembly. BlobToolKit can use BLAST/DIAMOND searches to compare the genome assembly to reference databases and identify regions that may be problematic.
- BUSCO reports: BlobToolKit can use BUSCO data to provide additional information about the quality of a genome assembly. It can generate plots of the number of complete and partial BUSCO genes in the genome assembly, as well as the number of missing and fragmented genes.
Comment: Why should we evaluate contaminants?A significant proportion of the genome sequences in both GenBank and RefSeq (0.54% and 0.34% of entries, respectively) include sequences from contaminants; the contamination primarily exists in the form of short contigs, flanking regions on longer contigs, or areas of larger scaffolds that are flanked by Ns, although a few longer sequences with contamination were also detected (Steinegger and Salzberg 2020).
In the next steps, we will generate the data required for generating the visualization plots with BlobToolKit.
Generate read coverage data with Minimap2
Read coverage is an essential metric for evaluating the quality of genome assemblies, and it provides valuable information for identifying regions of high and low quality, detecting misassemblies, and identifying potential contaminants. Thus, for example, unexpected regions of low coverage suggests potential errors, such as misassemblies, gaps, or low complexity regions (Koren et al. 2017).
In this tutorial we will use Minimap2 for generation the coverage data. Minimap2 is a multi-purpose aligner which is particularly efficient at aligning long-reads produced by sequencing machines from PacBio or Oxford Nanopore (Li 2018).
Comment: How is coverage information encoded in the BAM file?Coverage information is encoded in the BAM file format through the number of reads that align to each position in the reference genome. Specifically, the BAM file stores each read’s start and end position in the genome, as well as the length and orientation of the read, among other information. To calculate coverage, the number of reads that overlap each position in the reference genome is counted.
Hands-on: Generate BAM file with Minimap2
- Collapse Collection ( Galaxy version 5.1.0) with the following parameters:
- param-collection “Collection of files to collapse into single dataset”:
CReniformis_Pacbio
- Map with minimap2 ( Galaxy version 2.28+galaxy0) with the following parameters:
- “Will you select a reference genome from your history or use a built-in index?”:
Use a genome from history and build index
- param-file “Select the reference genome”:
CReformitis_assembly
- “Single or Paired-end reads”:
Single
- param-file “FASTA/Q file”: output of Collapse Collection tool
- “Select a profile of preset options”:
PacBio HiFi reads vs reference mapping (-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200 ) (map-hifi)
- Repeat the prevous steps with the datasets from the two remanining species.
Generate sequence similarity data with DIAMOND
DIAMOND is a sequence alignment tool that utilizes a more efficient algorithm compared to BLAST, allowing for much faster searches of large sequence databases. Specifically, DIAMOND uses a sensitive seed-extension approach that compares a set of small segments (seeds) from the query sequence to a database, and then extends the alignments based on the highest-scoring hits. This approach allows DIAMOND to perform up to 20,000 times faster than BLAST, with comparable or improved sensitivity and accuracy (Buchfink et al. 2014).
Hands-on: Task description
- Diamond ( Galaxy version 2.0.15+galaxy0) with the following parameters:
- “Alignment mode”:
DNA query sequences (blastx)
- “Allow for frameshifts?”:
yes
- “restrict hit culling to overlapping query ranges”:
Yes
- “frame shift penalty”:
15
- param-file “Input query file in FASTA or FASTQ format”:
CReformitis_assembly
- “Will you select a reference database from your history or use a built-in index?”:
Use one from the history
- param-file “Select the reference database”:
Diamond_db
- “Restrict search taxonomically?”:
No
- “Sensitivity Mode”:
Fast (--fast)
- “Block size in billions of sequence letters to be processed at a time”:
10.0
- “Method to filter?”:
Maximum e-value to report alignments
- “Method to restrict the number of hits?”:
Maximum number of target sequences
- In “Output options”:
- “Format of output file”:
BLAST tabular
- “Tabular fields”:
Query Seq-id
,Subject Seq -id
,Start of alignment in query
,End of alignment in query
,Expected value
,Bit score
andUnique Subject Taxonomy ID(s), separated by a ',' (in numerical order)
- Repeat the prevous steps with the datasets from the two remanining species.
Generate BUSCO report
BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool used for quantitative assessment of genome assembly, gene set, and transcriptome completeness. It is based on evolutionarily informed expectations of gene content derived from near-universal single-copy orthologs (Simão et al. 2015)
Comment: OrthologsOrthologs are genes in different species which have usually the same function and have evolved from a common ancestral gene. They are important for new genome assemblies in order to predict gene functions and help with gene annotation (Gennarelli and Cattaneo 2010).
Hands-on: Estimate single copy gene representation completeness
- Busco ( Galaxy version 5.7.1+galaxy0) with the following parameters:
- param-file “Sequences to analyse”:
CReformitis_assembly
- “Lineage data source”:
Use cached lineage Data
- “Cached database with lineage”:
Busco v5 Lineage Datasets
- “Mode”:
Genome assemblies (DNA)
- “Select a gene predictor”:
Metaeuk
- “Auto-detect or select lineage?”:
Auto-detect
(For Erythrolamprus reginae, selectSelect lineage
thenVertebrata
in “Lineage”)- “auto-lineage group”:
All taxonomic groups (--auto-lineage)
- “Which outputs should be generated”:
Short summary text
BUSCO sets represent 3,023 genes for vertebrates, 2,675 for arthropods, 843 for metazoans, 1,438 for fungi and 429 for eukaryotes. An intuitive metric is provided in BUSCO notation - C:complete[S:single, D:duplicated], F:fragmented, M:missing, n:number of genes used.
- Repeat the prevous steps with the datasets from the two remanining species.
QuestionWhat are the values of complete BUSCO genes for each species?
- Eschrichtius robustus: 96.0%
- Erythrolamprus reginae: 91.3%
- Chondrosia reniformis: 82.8%
Generate interactive plots with BlobToolKit
BlobToolKit is a tool designed to assist researchers in analyzing and visualizing genome assembly data. The tool uses information from multiple data sources such as read coverage, gene expression, and taxonomic annotations to generate a comprehensive overview of genome assembly data (Challis et al. 2020). One of the key characteristics of BlobToolKit is its ability to provide with a user-friendly interactive interface for analyzing complex genome assembly data.
To work with BlobToolKit we need to create a new dataset structure called BlobDir. Therefore the minimum requirement is a fasta file which contains the sequence of our assembly. A list of sequence identifiers and some statistics like length, GC proportion and undefined bases will then be generated.
To get a more meaningful analysis and therefore more useful information about our assembly, it is better to provide as much data as we can get. In our case we will also provide a metadata file if possible, NCBI taxonomy ID and the NCBI taxdump directory (Challis et al. 2020).
Hands-on: Creating the BlobDir dataset
- BlobToolKit ( Galaxy version 3.4.0+galaxy0) with the following parameters:
- “Select mode”:
Create a BlobToolKit dataset
- param-file “Genome assembly file”:
CReformitis_assembly
- param-file “Metadata file”:
CReformitis_metadata
- “NCBI taxonomy ID”:
121349
- param-file “NCBI taxdump directory”:
Taxonomy_Data
- BlobToolKit ( Galaxy version 3.4.0+galaxy0) with the following parameters:
- “Select mode”:
Add data to a BlobToolKit dataset
- param-file “Blobdir.tgz file”: output of BlobToolKit tool
- param-file “BUSCO full table file”:
Full table
(output of Busco tool)- “BLAST/Diamond hits”:
Enabled
- “BLAST/Diamond hits dataset”: output of Diamond tool
- “BLAST/Diamond file column order”:
1=qseqid,7=staxids,6=bitscore,2=sseqid,3=qstart,4=qend,5=evalue
- param-file “BAM/SAM/CRAM read alignment file”: output of Minimap2 tool
- Interactive BlobToolKit with the following parameters:
- param-file “Blobdir file”: output of BlobToolKit tool
Click on the galaxy-eye (eye) icon and inspect the output of Interactive BlobToolKit.
- Repeat the prevous steps with the datasets from the two remanining species.
Now, we will evaluate three of plots that we can find in the Blobtoolkit interactve interface. First, we will start with the snailplot, which provides a holistic view of the assembly.
The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 117,390,217 bp assembly. The distribution of sequence lengths is shown in dark grey with the plot radius scaled to the longest sequence present in the assembly (10,413,042 bp, shown in red).
Orange and pale-orange arcs show the N50 and N90 sequence lengths (8,459,200 and 6,903,244 bp), respectively. The pale grey spiral shows the cumulative sequence count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the eukaryota_odb10 set is shown in the top right.
Question: Snailplot questionWhich differences do you appreciate between the snailplots corresponding to Erythrolamprus reginae and Eschrichtius robustus?
- (A) The snake (Erythrolamprus reginae) does have a relatively large proportion of the genome assembled into some few and long scaffolds (reflected in the high “longest scaffold” value and the high “Log10 scaffold count” value).
- (B) The whale (Eschrichtius robustus) does have a relatively large proportion of the genome assembled into fragmented scaffolds (reflected in the low N90 value compared to the N50 value and the high “Log10 scaffold count” value)
Now, we are going to analyze the blob plot corresponding to Erythrolamprus reginae, which is especially useful for gaining insights into the composition of your genomic data and identify potential contaminants or endosymbionts.
The blob plot is a two-dimensional scatter plot that helps in visualizing and analyzing genomic data for quality control, contaminant detection, and filtering. In the circle plot, each sequence is represented by a circle, with its diameter proportional to the sequence length. Circles are colored based on their taxonomic affiliation, and their positions on the X and Y axes are determined by their GC content and coverage, respectively. GC content is the proportion of G and C bases in the sequence, which can differ substantially between genomes. Coverage, on the other hand, is a measure of the number of times a particular sequence has been read during the sequencing process. The plot also includes coverage and GC histograms for each taxonomic group, weighted by the total span (cumulative length) of sequences occupying each bin.
Comment: Pre-curated assembly evaluationUsually, the evaluation of the pre-curated assembly yield quite different results when compared with the post-curation evaluation. A frequent difference is the abudance of contigs that belong to different organisms, as a result of contamination in the reads. It can be clearly appreciated in the next figure, which correspond to the E. reginae pre-curated assembly. A striking element is the presence of 38 contigs corresponding to the phylum Artverviricota (dark green), a group of viruses that encode a reverse transcriptase.
In addition, as we can appreciate, the samples included contamination from fungi, arthropoda and bacteria.
Question: Blob plot questionIs there significant contamination in the Eschrichtius robustus and Chondrosia reniformis assemblies?
- (A) The whale (Eschrichtius robustus) does have significant contamination. Approximately half of the assembled sequences of the whale does contain contaminants or cosymbionts (total count: 703, Chordata count: 316), however as the foreign sequences are smaller, they make up a small percentage of the total assembly.
- (B) The sponge (Condrosia reniformis) doesn’t contain any foreign sequences or contaminatio (total count: 15, Chordata count: 15).
Finally, let’s have a look at the cumulative sequence length plot, which shows the curves for subsets of scaffolds assigned to each phylum relative to the overall assembly. It is useful for evaluating the contribution of the contigs from contaminated reads to the final assembly. In that case, first we will evaluate all sequences together, and then we will remove those correspoding to chordata or not classified (not-hit) in order to be able to evaluate in detail each of the contaminants. So, let’s start with the accumulative plot corresponding to all sequences (fig. 10).
In this kind of plot, the x-axis represent the number of contigs (sorted by species and length), and the y-axis correspond to the cumulative length in nucleotides. The gray line shows the cumulative length of all sequences. As we can appreciate, most sequences corespond to the taxa chordata (1.9Gb, distributed along 151 contigs). In addition, we can see that the final assembly includes 174 contigs corresponding to different taxa, or not classified at all. In order to be able to analyze the contribution of those sequences, we will filter the contigs corresponding to the chordata.
Comment: How do I apply filters to a dataset?Datasets can be filtered based on any category or variable using the “Filters” menu. To filter based on a category, click the tab above any bar of the preview histogram to hide/show records assigned to this category. To filter variables, use the sliders at either end of the preview histogram to change the maximum and minimum values or enter the numbers directly into the text boxes above the histogram.
By hidding the contings corresponding to chordata, we can have a detailed view of the contributions of each contaminant (fig. 11).
In figure 11 we can appreciate clearly the relative contribution of each contaminant with respect to the total contamination.
K-mer based genome profiling and evaluation
k-mers can be utilized to infer genome characteristics such as length and heterozygosity by analyzing their frequency distribution in raw sequencing reads (Manekar and Sathe 2018). In this tutorial we will make use of three different tools in order to perform the k-mer based genome profiling and assembly evaluation.
Comment: Genome profiling detailsYou can find more details about genome profiling in the VGP training.
Generating k-mer profile with Meryl
DNA is double stranded and normally only one strand is sequenced. For our assembly we want to consider the other strand as well. Therefore canonical k-mers are used in most counting tools, exactly like in Meryl. A full k-mer pair is a sequence and the reverse complement of the sequence (e.g. ATG/CAT). The canonical sequence of a k-mer pair is the lexicographically smaller of the two reverse complementary sequences. So if CAT appears it will be counted as ATG. (Clavijo)
Meryl is a powerful tool for counting k-mers in large-scale genomic datasets. It uses a sorting-based approach that sorts the k-mers in lexicographical order.
Hands-on: Generating k-mer profile
- Meryl ( Galaxy version 1.3+galaxy6) with the following parameters:
- “Operation type selector”:
Count operations
- param-collection “Input sequences”:
CReniformis_Pacbio
- “K-mer size selector”:
Set a k-mer size
- “K-mer size”:
21
Comment: compute kIn general k can be computed as k=log4 (G(1-p)/p), with G as genome size and p as tolerable collision rate. In our case we set the k-mer size to 21.
- Run Meryl ( Galaxy version 1.3+galaxy6) again with the following parameters:
- “Operation type selector”:
Operations on sets of k-mers
- “Operations on sets of k-mers”:
Union-sum: return k-mers that occur in any input, set the count to the sum of the counts
- param-file “Input meryldb”: previous output of Meryl tool
- Run Meryl ( Galaxy version 1.3+galaxy6) a third time with the following parameters:
- “Operation type selector”:
Generate histogram dataset
- param-file “Input meryldb”: previous output of Meryl tool
- Rename the output as
Histogram dataset
K-mer profile analysis with GenomeScope
Genomescope is used for analysing genomes with the help of k-mer profile analysis. It estimates the overall genome characteristics and the overall read characteristics. The tool will use a given k-mer profile which is calculated only from raw reads sequencing data. It then generates a plot with the calculated data giving us information about the completeness and quality of the to be assembled data. (Vurture et al. 2017)
Hands-on: Generate plots for analysis
- GenomeScope ( Galaxy version 2.0+galaxy2) with the following parameters:
- param-file “Input histogram file”:
Histogram dataset
- “Ploidy for model to use”:
2
for C. reniformis and E. robustus,3
for E. reginae.Comment: PlotsThe generated plots will have the coverage on the x-axis and the frequency on the y-axis. It shows the fitted model in the observed k-mer data as well as the k-mer peaks and other information like some estimated parameters of the assembly.
Figure 12 corresponds to the k-mer profile of the sponge (Chondrosia reniformis). Presumably, the large gene number in the sponge genome is due to regional gene duplication; so far evidence for a transposition in sponges has been presented. Data indicate that only 0.25 % of the total sponge genome comprises CA/TG microsatellites, and until now also no SINEs/transposable elements have been identified. The estimated genome size around 124,Mbp is relatively close to Chondrosia reniformis genome size (117,39Mbp).
Question: Genome profile questionWhich differences can you appreciate between the genome profile of Erythrolamprus reginae and Eschrichtius robustus?
- (A) The genome profile of the snake (Erythrolamprus reginae) suggests a high level of heterozygosity (reflected by the first peak at 11.1x coverage). 51.6% unique sequences have been estimated. In this case the low value is attributed to the parthenogenesis and the polyploidy of this snake which causes tandem duplications and higher DNA replication as a result of the triploid-diploid cycles.
- (B) The genome profile of the whale (Eschrichtius robustus) suggests a high level of homozygosity (reflected by the second peak at 29x coverage). The homozygous k-mer distribution can be expected from the high grade of inbreeding.
K-mer based assembly evaluation with Merqury
Merqury is designed for evaluating the completeness and accuracy of long-read genome assemblies using short-read sequencing data. Thus the quality of assemblies which are generated by using third-generation sequencing technologies can be reviewed and assessed by the tool. Merqury works by comparing k-mers of an assembly to those from unassembled high-accuracy reads of the raw sequencing data. K-mer-based methods are also used to identify errors and missing sequences. (Rhie et al. 2020)
Hands-on: Generating stats and plots
- Merqury ( Galaxy version 1.3+galaxy2) with the following parameters:
- “Evaluation mode”:
Default mode
- param-file “K-mer counts database”:
read_db
(output of Meryl tool)- “Number of assemblies”:
One assembly (pseudo-haplotype or mixed-haplotype)
- param-file “Genome assembly”:
CReformitis_assembly
Comment: OutputMerqury will now generate following outputs:
- stats with completeness statistics
- QV stats with quality value statistics
- plots
Let’s have a closer look at the copy number plots for each of the three species. Merqury will generate k-mers from the raw sequencing data (in the following called the ‘read set’) and will compare them to the assembly, in contrast to genomescope. Despite being diploid or triploid the species only got plotted with the primary assembly. This is why only one-copy k-mers get evaluated.
The small black area indicates that most of the k-mers found in the read set are also found in the assembly (but not all since only the primary assembly got plotted). It reflects a highly homozygous assembly. This plot indicates a sequencing coverage at ~30x as seen before with genomescope.
Question: Genome profile questionWhich differences can you appreciate between the genome profile of Chondrosia reniformis and Erythrolamprus reginae?
- (A) The sponge (Chondrosia reniformis) does have two peaks as seen before with genomescope which is typical for a diploid species. The large black area indicates that there is a high amount of k-mers in the read set which is not used in the assembly since only the primary assembly got plotted. The secondary assembly is missing.
- (B) The snake (Erythrolamprus reginae) does have three peaks as seen before with genomescope which is typical for a triploid species. The black area is even larger compared to the sponge since only the primary assembly got plotted. The secondary and third assembly are missing.
Providing assembly statistics with gfastats
gfastats is a tool for providing summary statistics and genome file manipulation. In our case it will generate genome assembly statistics in a tabular-format output. Metrics like N50/L50, GC-content and lengths of contigs, scaffolds and gaps as well as other statistical information are provided for assessing the contiguity of the assembly.
Hands-on: Generate summary statistics
- gfastats ( Galaxy version 1.2.0+galaxy0) with the following parameters:
- param-file “Input file”:
CReformitis_assembly
- “Specify target sequences”:
Disabled
- “Tool mode”:
Summary statistics generation
- “Report mode”:
Genome assembly statistics (--nstar-report)
Comment: N50/NG50Consider taking all contigs and sorting them by size. Starting with the largest and ending with the smallest. Now add up the length of each contig beginning with the largest, then the second largest and so on. When reaching 50% of the total length of all contigs it’s done. The length of the contig you stopped is the N50 value. (Videvall)
NG50 or more general NG(X) is based on the same idea as N50. The difference is that in this case the whole genome size or estimated genome size is taken into account. Through this comparisons can be made over different assemblies and genome sizes.
Comment: L50Remember adding up the length of each contig until reaching the 50%. The L50 value is the number of the contig you have stopped.
Example: The sum of all contigs together is 2000 kbp. The contig at 50% has length 300 kbp and is the third one and thus the third largest. Then N50 = 300 kbp and L50 = 3.
In the best case a high quality assembly should consist of just a few and large contigs to represent the genome as a whole. Therefore a good assembly should lead to a high N50 value and in contrast a low quality assembly with tiny, fragmented contigs would lead to a low N50 value. (Videvall)
The metric doesn’t only rely on measuring the 50% mark. The general case is N(X) where X ranges from 0 - 100 mostly in ten steps. However the NX metric is not suitable for comparing different species with different genome lengths.
The GC-content or guanine-cytosine ratio tells one about the occurrence of guanine and cytosine in a genome. It is stated in percent. The two nucleobases are held together by three hydrogen bonds. A high GC-content makes DNA more stable than a low GC-content. Because the ratio of most species and organisms has been found out by now, it is also a good metric to gauge completeness.(Wikipedia)
Another good metric for gauging contiguity is the CC ratio. The value is calculated by dividing contig counts by the chromosome pair number. A perfect score would be 1. Therefore the lower the value the better the contiguity of the assembly.
Hi-C scaffolding
To understand how chromosomes are arranged in their three-dimensional structure in the nucleus, high resolution and high throughput imaging techniques have been developed. Hi-C is a high throughput method to measure pairwise contacts between pairs of genomic loci. It is based on cross-linking the DNA in the nucleus and its histones (proteins around which the DNA is wrapped). However the DNA gets cut and marked and afterwards the cut parts get ligated together. Then the DNA gets fragmented and sequenced forwards and backwards generating paired reads. The resulting information can be used to assemble the reads to the corresponding chromosome. (Pal et al. 2018)
Pre-processing Hi-C data
The following analysis use Hi-C reads in single datasets: one for the forward reads and one for the reverse reads. If you have more than one fastq datasets, as we do for Erythrolamprus reginae you will need to merge them into single datasets. If you have only one pair of datasets, skip this step.
Hands-on: Merge Erythrolamprus reginae Hi-C data into single datasets
Create a collection with the Hi-C forward reads name
EReginae Hi-C forward reads
.Do the same with the reverse reads (R) and name it
EReginae Hi-C reverse reads
.For species with more than one set of Hi-C reads, verify that the datasets are ordered the same way in both collections. If not, sort both collections with the tool Sort collection
- Collapse Collection ( Galaxy version 5.1.0) with the following parameters:
- param-collection “Collection of files to collapse into single dataset”:
EReginae Hi-C forward Reads
Rename the dataset
EReginae Hi-C forward Reads
- Collapse Collection ( Galaxy version 5.1.0) with the following parameters:
- param-collection “Collection of files to collapse into single dataset”:
EReginae Hi-C reverse reads
- Rename the dataset
EReginae Hi-C reverse Reads
- Click on galaxy-selector Select Items at the top of the history panel
- Check the fastq.gz containing the Hi-C forward reads (F)
Click 1 of N selected and choose Build Dataset List
- Enter a name for your collection to EReginae Hi-C forward Reads
- Click Create collection to build your collection
- Click on the checkmark icon at the top of your history again
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field to
Hi-C forward Reads
- Click the Save button
Hands-on: Mapping Hi-C reads against a reference genome
- BWA-MEM2 ( Galaxy version 2.2.1+galaxy0) with the following parameters:
- “Will you select a reference genome from your history or use a built-in index?”:
Use a genome from history and build index
- param-file “Use the following dataset as the reference sequence”:
CReformitis_assembly
- “Single or Paired-end reads”:
Single
- param-file “Select fastq dataset”:
Hi-C forward Reads
(output of Collapse Collection tool)- “Set read groups information?”:
Do not set
- “Select analysis mode”:
1.Simple Illumina mode
- “BAM sorting mode”:
Sort by read names (i.e., the QNAME field)
Rename the bam dataset
Mapped Hi-C forward Reads
- BWA-MEM2 ( Galaxy version 2.2.1+galaxy0) with the following parameters:
- “Will you select a reference genome from your history or use a built-in index?”:
Use a genome from history and build index
- param-file “Use the following dataset as the reference sequence”:
CReformitis_assembly
- “Single or Paired-end reads”:
Single
- param-file “Select fastq dataset”:
Hi-C reverse Reads
(output of Collapse Collection tool)- “Set read groups information?”:
Do not set
- “Select analysis mode”:
1.Simple Illumina mode
- “BAM sorting mode”:
Sort by read names (i.e., the QNAME field)
Rename the bam dataset
Mapped Hi-C reverse Reads
- Filter and merge ( Galaxy version 1.0+galaxy0) with the following parameters:
- param-file “First set of reads”:
Mapped Hi-C forward Reads
(output of BWA-MEM2 tool)- param-file “Second set of reads”:
Mapped Hi-C reverse Reads
(output of BWA-MEM2 tool)
Generate Hi-C contact map
PretextMap converts BAM/SAM files into genome contact maps. With those contact maps it is possible to analyse the 3D organisation of chromosomes. The resulting map will be used for visualisation and editing contigs/scaffolds. (Kumar et al. 2017)
Hands-on: Generate a contact map with PretextMap and PretextSnapshot
- PretextMap ( Galaxy version 0.1.9+galaxy1) with the following parameters:
- param-file “Input dataset in SAM or BAM format”:
outfile
(output of Filter and merge tool)- “Sort by”:
Ascending
- Pretext Snapshot ( Galaxy version 0.0.3+galaxy1) with the following parameters:
- param-file “Input Pretext map file”:
pretext_map_out
(output of PretextMap tool)- “Output image format”:
PNG
- “Show grid?”:
Yes
The contact map of the snake (Erythrolamprus reginae) does have a clear diagonal pattern. These interactions are commonly called “cis” interactions and occur between genomic loci that are spatially close to each other. Interactions beyond the diagonal pattern are called “trans” and indicate long-range interactions.
Question: Hi-C contact map questionWhat can you say about the Hi-C map corresponding to Chondrosia reniformis?
The contact map of the sponge (Chondrosia reniformis) does contain widespread interactions all over the genome. Additionally a diagonal pattern can be observed. The widespread interactions all over the genome can be the result of experimental and technical issues or it can have a biological and functional significance - such as interactions between chromosomes in the nucleus.
Assembly graph
Bandage is a tool to visualise de novo assembly graphs with connections. (Wick et al. 2015)
Hands-on: Generate assembly graph
- gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:
- param-file “Input file”:
CReformitis_assembly
(Input dataset)- “Specify target sequences”:
Disabled
- “Tool mode”:
Genome assembly manipulation
- “Output format”:
GFA
- “Terminal overlaps selection”:
No (default)
- Bandage Image ( Galaxy version 2022.09+galaxy4) with the following parameters:
- param-file “Graphical Fragment Assembly”:
output
(output of gfastats tool)
The Assembly graph of Chondrosia reniformis is relatively simple and doesn’t have a complex structure. The connectivity of the nodes is clear but incomplete and lacking some indicative regions of overlapping contigs. Repetitive regions can be investigated by searching for multiple paths or connections in the graph.
Question: Assembly graph questionWhich differences can you appreciate between the assembly graph corresponding to Erythrolamprus reginae and Eschrichtius robustus?
- (A) The assembly graph of the snake (Erythrolamprus reginae) does have a confusing and complex structure. This can be an indicator for errors or misassemblies. The nodes are well-connected but relatively unclear in their path which is reflected by many small contigs and only a few large contigs. This assembly graph does contain a lot of repetitive sequences that result in multiple paths and connections between the nodes.
- (B) The assembly graph of the whale (Eschrichtius robustus) is overall relatively simple and doesn’t have a too complex structure. The nodes are well-connected and most of the paths are clear which is reflected by many large contigs and only some small contigs. This assembly does also contain repetitive regions but clearly less than in the assembly of the snake.
Conclusion
In conclusion, it’s worth to run the a post-assembly QC workflow to assess the quality of genome assemblies. The following table contains metrics to sum up the quality control.