The European Reference Genome Atlas (ERGA) is a large-scale project aimed at generating and integrating high-quality reference genomes for a wide range of European organisms. The project will use state-of-the-art sequencing technologies and advanced bioinformatics tools to produce high-quality genome assemblies.
Reference genomes provide a baseline for understanding genetic diversity within and among populations, and can be used to identify populations at risk of genetic erosion. This information is crucial for developing effective conservation strategies and management plans for threatened and endangered species (Shafer et al. 2015). Additionally, by better understanding the genetic basis of important traits, such as disease resistance and adaptation to changing environments, researchers can develop targeted interventions to mitigate the effects of environmental change and prevent the loss of genetic diversity (FRANKHAM et al. 2011). The ERGA project has the potential to greatly benefit biodiversity conservation efforts and advance our understanding of the genetic basis of biodiversity.
Genome post-assembly quality control (GPAQC) is a crucial step for evaluating the accuracy and completeness of newly assembled genomes. This involves assessing the contiguity, completeness, accuracy, and consistency of the genome assembly using various bioinformatic tools and methods (Koren et al. 2017, Hunt et al. 2015, Mikheenko et al. 2015, Vaser et al. 2017, Zimin et al. 2017). GPAQC aims to ensure that genomic data is reliable and useful for downstream analyses such as annotation, comparative genomics, and functional studies.
In this tutorial you will learn how to implement the ERGA GPAQC pipeline, and how to interpretate the potential outcomes.
In this tutorial we will evaluate three genome assemblies, belonging to three different taxonomic groups, in order to illustrate the different scenarios that can be identified. The characteristics of each of them are described briefly below.
Case 1: Chondrosia reniformis
Chondrosia reniformis is a slow-growing marine sponge cosmopolitan species which can be found in the Mediterranean Sea and the eastern Atlantic Ocean in shallow waters; it is considered to playing an important ecological role in the marine ecosystem by filtering large volumes of water and providing habitat for other species (Voultsiadou 2005). The members of this species are gonochoristic and oviparous, whose physiology and behavious seems to be highly influence for the presence of endosymbiosis heterotrophic bacteria (Sara Michele and Carlo 1998).
The assembly is based on 70x PacBio HiFi and Arima2 Hi-C data generated by the Aquatic Symbiosis Genomics Project. The assembly process included the following sequence of steps: initial PacBio assembly generation with Hifiasm, retained haplotig separation with purge_dups, and Hi-C based scaffolding with YaHS. The mitochondrial genome was assembled using MitoHiFi. Finally, the primary assembly was analysed and manually improved using gEVAL.
Case 2: Erythrolamprus reginae
Erythrolamprus reginae is a species of colubrid snake found in South America. This species has been reported to include triploid individuals with parthenogenic reproduction; this type of seems to be associated with higher mutation rate tandem dupliation (Bogart 1980).
The assembly used in this tutorial correspond to the curated primary assembly generated by the VGP project, based on 34x Pacbio HiFi and Arima2 Hi-C data, by using the VGP assembly pipeline.
Case 3: Eschrichtius robustus
Eschrichtius robustus, commonly known as the gray whale, is a species of whale found primarily in the North Pacific Ocean. Adult gray whales can reach lengths of up to 14.9 meters and weights of up to 36,000 kilograms. It is a diploid specie, genetically characterized by high homozygosity as a result of inbreeding (Brüniche-Olsen et al. 2018).
The assembly used in this tutorial correspond to the curated primary assembly generated by the VGP project, based on 29x Pacbio HiFi data, by using the VGP assembly pipeline.
Get data
As a first step we will get the data from Zenodo.
Hands-on: Upload data
Create a new history for allocating temporally all the datasets. This history can be named as Datasets history
Import the Pacbio HiFi files from Zenodo:
Open the file galaxy-uploadupload menu
Click on Rule-based tab
“Upload data as”: Collections
Copy the tabular data, paste it into the textbox and press Build
BlobToolKit is a tool designed to assist researchers in analyzing and visualizing genome assembly data. The tool uses information from multiple data sources such as read coverage, gene expression, and taxonomic annotations to generate a comprehensive overview of genome assembly data (Challis et al. 2020). One of the key characteristics of BlobToolKit is its ability to provide with a user-friendly interactive interface for analyzing complex genome assembly data.
In this tutorial, we will use BlobToolKit in order to integrate the following data:
Read coverage data: BlobToolKit can use read coverage information to identify potential errors or gaps in the genome assembly. By comparing the depth of coverage across the genome, it can highlight regions that may be overrepresented or underrepresented, which can indicate potential issues with the assembly.
Taxonomic annotations: BlobToolKit can use taxonomic information to identify potential contaminants or foreign DNA in the genome assembly. It does this by comparing the taxonomic profile of the genome assembly to a reference database of known organisms.
Sequence similarity data: Sequence similarity data can be used to identify potential misassemblies or contaminants in the genome assembly. BlobToolKit can use BLAST/DIAMOND searches to compare the genome assembly to reference databases and identify regions that may be problematic.
BUSCO reports: BlobToolKit can use BUSCO data to provide additional information about the quality of a genome assembly. It can generate plots of the number of complete and partial BUSCO genes in the genome assembly, as well as the number of missing and fragmented genes.
Comment: Why should we evaluate contaminants?
A significant proportion of the genome sequences in both GenBank and RefSeq (0.54% and 0.34% of entries, respectively) include sequences from contaminants; the contamination primarily exists in the form of short contigs, flanking regions on longer contigs, or areas of larger scaffolds that are flanked by Ns, although a few longer sequences with contamination were also detected (Steinegger and Salzberg 2020).
In the next steps, we will generate the data required for generating the visualization plots with BlobToolKit.
Generate read coverage data with HISAT2
Read coverage is an essential metric for evaluating the quality of genome assemblies, and it provides valuable information for identifying regions of high and low quality, detecting misassemblies, and identifying potential contaminants. Thus, for example, unexpected regions of low coverage suggests potential errors, such as misassemblies, gaps, or low complexity regions (Koren et al. 2017).
In this tutorial we will use HISAT2 for generation the coverage data. This tool uses a indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, which enables efficient and accurate alignment (Zhang et al. 2021). It then provides the alignment output in BAM file format which we will then use as an input for BlobToolKit.
Comment: How is coverage information encoded in the BAM file?
Coverage information is encoded in the BAM file format through the number of reads that align to each position in the reference genome. Specifically, the BAM file stores each read’s start and end position in the genome, as well as the length and orientation of the read, among other information. To calculate coverage, the number of reads that overlap each position in the reference genome is counted.
Hands-on: Generate BAM file with HISAT2
Collapse CollectionTool: toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/5.1.0 with the following parameters:
param-collection“Collection of files to collapse into single dataset”: CReniformis_Pacbio
HISAT2Tool: toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1 with the following parameters:
“Source for the reference genome”: Use a genome from history
param-file“Select the reference genome”: CReformitis_assembly
“Is this a single or paired library”: Single-end
param-file“FASTA/Q file”: output of Collapse Collectiontool
Repeat the prevous steps with the datasets from the two remanining species.
Generate sequence similarity data with DIAMOND
DIAMOND is a sequence alignment tool that utilizes a more efficient algorithm compared to BLAST, allowing for much faster searches of large sequence databases. Specifically, DIAMOND uses a sensitive seed-extension approach that compares a set of small segments (seeds) from the query sequence to a database, and then extends the alignments based on the highest-scoring hits. This approach allows DIAMOND to perform up to 20,000 times faster than BLAST, with comparable or improved sensitivity and accuracy (Buchfink et al. 2014).
Hands-on: Task description
DiamondTool: toolshed.g2.bx.psu.edu/repos/bgruening/diamond/bg_diamond/2.0.15+galaxy0 with the following parameters:
“Alignment mode”: DNA query sequences (blastx)
“Allow for frameshifts?”: yes
“restrict hit culling to overlapping query ranges”: Yes
“frame shift penalty”: 15
param-file“Input query file in FASTA or FASTQ format”: CReformitis_assembly
“Will you select a reference database from your history or use a built-in index?”: Use one from the history
param-file“Select the reference database”: Diamond_db
“Restrict search taxonomically?”: No
“Sensitivity Mode”: Fast (--fast)
“Block size in billions of sequence letters to be processed at a time”: 10.0
“Method to filter?”: Maximum e-value to report alignments
“Method to restrict the number of hits?”: Maximum number of target sequences
In “Output options”:
“Format of output file”: BLAST tabular
“Tabular fields”: Query Seq-id,Subject Seq -id, Start of alignment in query, End of alignment in query, Expected value, Bit score and Unique Subject Taxonomy ID(s), separated by a ',' (in numerical order)
Repeat the prevous steps with the datasets from the two remanining species.
Generate BUSCO report
BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool used for quantitative assessment of genome assembly, gene set, and transcriptome completeness. It is based on evolutionarily informed expectations of gene content derived from near-universal single-copy orthologs (Simão et al. 2015)
Comment: Orthologs
Orthologs are genes in different species which have usually the same function and have evolved from a common ancestral gene. They are important for new genome assemblies in order to predict gene functions and help with gene annotation (Gennarelli and Cattaneo 2010).
Hands-on: Estimate single copy gene representation completeness
BuscoTool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.4.4+galaxy0 with the following parameters:
param-file“Sequences to analyse”: outfile (output of Replacetool)
“Mode”: Genome assemblies (DNA)
“Use Augustus instead of Metaeuk”: Use Metaeuk
“Auto-detect or select lineage?”: Auto-detect
“Which outputs should be generated”: Short summary text
BUSCO sets represent 3023 genes for vertebrates, 2675 for arthropods, 843 for metazoans, 1438 for fungi and 429 for eukaryotes. An intuitive metric is provided in BUSCO notation - C:complete[D:dublicated], F:fragmented, M:missing, n:number of genes used.
Repeat the prevous steps with the datasets from the two remanining species.
Question
What are the values of complete BUSCO genes for each species?
Eschrichtius robustus: 96.0%
Erythrolamprus reginae: 91.3%
Chondrosia reniformis: 82.8%
Generate interactive plots with BlobToolKit
BlobToolKit is a tool designed to assist researchers in analyzing and visualizing genome assembly data. The tool uses information from multiple data sources such as read coverage, gene expression, and taxonomic annotations to generate a comprehensive overview of genome assembly data (Challis et al. 2020). One of the key characteristics of BlobToolKit is its ability to provide with a user-friendly interactive interface for analyzing complex genome assembly data.
To work with BlobToolKit we need to create a new dataset structure called BlobDir. Therefore the minimum requirement is a fasta file which contains the sequence of our assembly. A list of sequence identifiers and some statistics like length, GC proportion and undefined bases will then be generated.
To get a more meaningful analysis and therefore more useful information about our assembly, it is better to provide as much data as we can get. In our case we will also provide a metadata file if possible, NCBI taxonomy ID and the NCBI taxdump directory (Challis et al. 2020).
Hands-on: Creating the BlobDir dataset
BlobToolKitTool: toolshed.g2.bx.psu.edu/repos/bgruening/blobtoolkit/blobtoolkit/3.4.0+galaxy0 with the following parameters:
BlobToolKitTool: toolshed.g2.bx.psu.edu/repos/bgruening/blobtoolkit/blobtoolkit/3.4.0+galaxy0 with the following parameters:
“Select mode”: Add data to a BlobToolKit dataset
param-file“Blobdir.tgz file”: output of BlobToolKittool
param-file“BUSCO full table file”: Full table (output of Buscotool)
“BLAST/Diamond hits”: Enabled
“BLAST/Diamond hits dataset”: output of Diamondtool
param-file“BAM/SAM/CRAM read alignment file”: output of HISAT2tool
Interactive BlobToolKitTool: interactive_tool_blobtoolkit with the following parameters:
param-file“Blobdir file”: output of BlobToolKittool
Click on the galaxy-eye (eye) icon and inspect the output of Interactive BlobToolKit.
Repeat the prevous steps with the datasets from the two remanining species.
Now, we will evaluate three of plots that we can find in the Blobtoolkit interactve interface. First, we will start with the snailplot, which provides a holistic view of the assembly.
The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 117,390,217 bp assembly. The distribution of sequence lengths is shown in dark grey with the plot radius scaled to the longest sequence present in the assembly (10,413,042 bp, shown in red).
Orange and pale-orange arcs show the N50 and N90 sequence lengths (8,459,200 and 6,903,244 bp), respectively. The pale grey spiral shows the cumulative sequence count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the eukaryota_odb10 set is shown in the top right.
Question: Snailplot question
Which differences do you appreciate between the snailplots corresponding to Erythrolamprus reginae and Eschrichtius robustus?
(A) The snake (Erythrolamprus reginae) does have a relatively large proportion of the genome assembled into some few and long scaffolds (reflected in the high “longest scaffold” value and the high “Log10 scaffold count” value).
(B) The whale (Eschrichtius robustus) does have a relatively large proportion of the genome assembled into fragmented scaffolds (reflected in the low N90 value compared to the N50 value and the high “Log10 scaffold count” value)
Now, we are going to analyze the blob plot corresponding to Erythrolamprus reginae, which is specially useful for gaining insights into the composition of your genomic data and identify potential contaminants or endosymbionts.
The blob plot is a two-dimensional scatter plot that helps in visualizing and analyzing genomic data for quality control, contaminant detection, and filtering. In the circle plot, each sequence is represented by a circle, with its diameter proportional to the sequence length. Circles are colored based on their taxonomic affiliation, and their positions on the X and Y axes are determined by their GC content and coverage, respectively. GC content is the proportion of G and C bases in the sequence, which can differ substantially between genomes. Coverage, on the other hand, is a measure of the number of times a particular sequence has been read during the sequencing process. The plot also includes coverage and GC histograms for each taxonomic group, weighted by the total span (cumulative length) of sequences occupying each bin.
Comment: Pre-curated assembly evaluation
Usually, the evaluation of the pre-curated assembly yield quite different results when compared with the post-curation evaluation. A frequent difference is the abudance of contigs that belong to different organisms, as a result of contamination in the reads. It can be clearly appreciated in the next figure, which correspond to the E. reginae pre-curated assembly. A striking element is the presence of 38 contigs corresponding to the phylum Artverviricota (dark green), a group of viruses that encode a reverse transcriptase.
(A) The whale (Eschrichtius robustus) does have significant contamination. Approximately half of the assembly of the whale does contain contaminants or cosymbionts (total count: 703, Chordata count: 316).
(B) The sponge (Condrosia reniformis) doesn’t contain any contamination (total count: 15, Chordata count: 15).
Finally, let’s have a look at the accumulative plot, which shows the curves for subsets of scaffolds assigned to each phylum relative to the overall assembly. It is useful for evaluating the contribution of the contigs from contaminated reads to the final assembly. In that case, first we will evaluate all sequences together, and then we will remove those correspoding to chordata or not classified (not-hit) in order to be able to evaluate in detail each of the contaminants. So, let’s start with the accumulative plot corresponding to all sequences (fig. 10).
In this kind of plot, the x-axis represent the number of contigs (sorted by species and length), and the y-axis correspond to the cumulative length in nucleotides. The gray line shows the cumulative length of all sequences. As we can appreciate, most sequences corespond to the taxa chordata (1.9Gb, distributed along 151 contigs). In addition, we can see that the final assembly includes 174 contigs corresponding to different taxa, or not classified at all. In order to be able to analyze the contribution of those sequences, we will filter the contigs corresponding to the chordata.
Datasets can be filtered based on any category or variable using the “Filters” menu. To filter based on a category, click the tab above any bar of the preview histogram to hide/show records assigned to this category. To filter variables, use the sliders at either end of the preview histogram to change the maximum and minimum values or enter the numbers directly into the text boxes above the histogram.
By hidding the contings corresponding to chordata, we can have a detailed view of the contributions of each contaminant (fig. 11).
In figure 11 we can appreciate clearly the relative contribution of each contaminant with respect to the total contamination.
K-mer based genome profiling and evaluation
k-mers can be utilized to infer genome characteristics such as length and heterozygosity by analyzing their frequency distribution in raw sequencing reads (Manekar and Sathe 2018). In this tutorial we will make use of three different tools in order to perform the k-mer based genome profiling and assembly evaluation.
Comment: Genome profiling details
You can find more details about genome profiling in the VGP training.
Generating k-mer profile with Meryl
DNA is double stranded and normally only one strand is sequenced. For our assembly we want to consider the other strand as well. Therefore canonical k-mers are used in most counting tools, exactly like in Meryl. A full k-mer pair is a sequence and the reverse complement of the sequence (e.g. ATG/CAT). The canonical sequence of a k-mer pair is the lexicographically smaller of the two reverse complementary sequences. So if CAT appears it will be counted as ATG. (Clavijo)
Meryl is a powerful tool for counting k-mers in large-scale genomic datasets. It uses a sorting-based approach that sorts the k-mers in lexicographical order.
Hands-on: Generating k-mer profile
MerylTool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6 with the following parameters:
In general k can be computed as k=log4 (G(1-p)/p), with G as genome size and p as tolerable collision rate.
In our case we set the k-mer size to 21.
Run MerylTool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6 again with the following parameters:
“Operation type selector”: Operations on sets of k-mers
“Operations on sets of k-mers”: Union-sum: return k-mers that occur in any input, set the count to the sum of the counts
param-file“Input meryldb”: previous output of Meryltool
Run MerylTool: toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6 a third time with the following parameters:
“Operation type selector”: Generate histogram dataset
param-file“Input meryldb”: previous output of Meryltool
Rename the output as Histogram dataset
K-mer profile analysis with GenomeScope
Genomescope is used for analysing genomes with the help of k-mer profile analysis. It estimates the overall genome characteristics and the overall read characteristics. The tool will use a given k-mer profile which is calculated only from raw reads sequencing data. It then generates a plot with the calculated data giving us information about the completeness and quality of the to be assembled data. (Vurture et al. 2017)
Hands-on: Generate plots for analysis
GenomeScopeTool: toolshed.g2.bx.psu.edu/repos/iuc/genomescope/genomescope/2.0+galaxy2 with the following parameters:
“Ploidy for model to use”: 2 for C. reniformis and E. robustus, 3 for E. reginae.
Comment: Plots
The generated plots will have the coverage on the x-axis and the frequency on the y-axis. It shows the fitted model in the observed k-mer data as well as the k-mer peaks and other information like some estimated parameters of the assembly.
Figure 12 corresponds to the k-mer profile of the sponge (Chondrosia reniformis). Presumably, the large gene number in the sponge genome is due to regional gene duplication; so far evidence for a transposition in sponges has been presented. Data indicate that only 0.25 % of the total sponge genome comprises CA/TG microsatellites, and until now also no SINEs/transposable elements have been identified. The estimated genome size around 124,Mbp is relatively close to Chondrosia reniformis genome size (117,39Mbp).
Question: Genome profile question
Which differences can you appreciate between the genome profile of Erythrolamprus reginae and Eschrichtius robustus?
(A) The genome profile of the snake (Erythrolamprus reginae) suggests a high level of heterozygosity (reflected by the first peak at 11.1x coverage). 51.6% unique sequences have been estimated. In this case the low value is attributed to the parthenogenesis and the polyploidy of this snake which causes tandem duplications and higher DNA replication as a result of the triploid-diploid cycles.
(B) The genome profile of the whale (Eschrichtius robustus) suggests a high level of homozygosity (reflected by the second peak at 29x coverage). The homozygous k-mer distribution can be expected from the high grade of inbreeding.
K-mer based assembly evaluation with Merqury
Merqury is designed for evaluating the completeness and accuracy of long-read genome assemblies using short-read sequencing data. Thus the quality of assemblies which are generated by using third-generation sequencing technologies can be reviewed and assessed by the tool.
Merqury works by comparing k-mers of an assembly to those from unassembled high-accuracy reads of the raw sequencing data. K-mer-based methods are also used to identify errors and missing sequences. (Rhie et al. 2020)
Hands-on: Generating stats and plots
MerquryTool: toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3+galaxy2 with the following parameters:
“Evaluation mode”: Default mode
param-file“K-mer counts database”: read_db (output of Meryltool)
“Number of assemblies”: One assembly (pseudo-haplotype or mixed-haplotype)
Let’s have a closer look at the copy number plots for each of the three species. Merqury will generate k-mers from the raw sequencing data (in the following called the ‘read set’) and will compare them to the assembly, in contrast to genomescope. Despite being diploid or triploid the species only got plotted with the primary assembly. This is why only one-copy k-mers get evaluated.
The small black area indicates that most of the k-mers found in the read set are also found in the assembly (but not all since only the primary assembly got plotted). It reflects a highly homozygous assembly. This plot indicates a sequencing coverage at ~30x as seen before with genomescope.
Question: Genome profile question
Which differences can you appreciate between the genome profile of Chondrosia reniformis and Erythrolamprus reginae?
(A) The sponge (Chondrosia reniformis) does have two peaks as seen before with genomescope which is typical for a diploid species. The large black area indicates that there is a high amount of k-mers in the read set which is not used in the assembly since only the primary assembly got plotted. The secondary assembly is missing.
(B) The snake (Erythrolamprus reginae) does have three peaks as seen before with genomescope which is typical for a triploid species. The black area is even larger compared to the sponge since only the primary assembly got plotted. The secondary and third assembly are missing.
Providing assembly statistics with gfastats
gfastats is a tool for providing summary statistics and genome file manipulation. In our case it will generate genome assembly statistics in a tabular-format output. Metrics like N50/L50, GC-content and lengths of contigs, scaffolds and gaps as well as other statistical information are provided for assessing the contiguity of the assembly.
Hands-on: Generate summary statistics
gfastatsTool: toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.2.0+galaxy0 with the following parameters:
Consider taking all contigs and sorting them by size. Starting with the largest and ending with the smallest. Now add up the length of each contig beginning with the largest, then the second largest and so on. When reaching 50% of the total length of all contigs it’s done. The length of the contig you stopped is the N50 value. (Videvall)
NG50 or more general NG(X) is based on the same idea as N50. The difference is that in this case the whole genome size or estimated genome size is taken into account. Through this comparisons can be made over different assemblies and genome sizes.
Comment: L50
Remember adding up the length of each contig until reaching the 50%. The L50 value is the number of the contig you have stopped.
Example: The sum of all contigs together is 2000 kbp. The contig at 50% has length 300 kbp and is the third one and thus the third largest.
Then N50 = 300 kbp and L50 = 3.
In the best case a high quality assembly should consist of just a few and large contigs to represent the genome as a whole. Therefore a good assembly should lead to a high N50 value and in contrast a low quality assembly with tiny, fragmented contigs would lead to a low N50 value. (Videvall)
The metric doesn’t only rely on measuring the 50% mark. The general case is N(X) where X ranges from 0 - 100 mostly in ten steps. However the NX metric is not suitable for comparing different species with different genome lengths.
The GC-content or guanine-cytosine ratio tells one about the occurrence of guanine and cytosine in a genome. It is stated in percent. The two nucleobases are held together by three hydrogen bonds. A high GC-content makes DNA more stable than a low GC-content. Because the ratio of most species and organisms has been found out by now, it is also a good metric to gauge completeness.(Wikipedia)
Another good metric for gauging contiguity is the CC ratio. The value is calculated by dividing contig counts by the chromosome pair number. A perfect score would be 1. Therefore the lower the value the better the contiguity of the assembly.
Hi-C scaffolding
To understand how chromosomes are arranged in their three-dimensional structure in the nucleus, high resolution and high throughput imaging techniques have been developed. Hi-C is a high throughput method to measure pairwise contacts between pairs of genomic loci.
It is based on cross-linking the DNA in the nucleus and its histones(proteins). However the DNA gets cut and marked and afterwards the cutted parts get ligated together. Then the DNA gets fragmented and sequenced forwards and backwards generating paired reads. The resulting information can be used to assemble the reads to the corresponding chromosome.
(Pal et al. 2018)
Pre-processing Hi-C data
Hands-on: Mapping Hi-C reads against a reference genome
BWA-MEM2Tool: toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy0 with the following parameters:
“Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
param-file“Use the following dataset as the reference sequence”: output (Input dataset)
“Single or Paired-end reads”: Single
param-file“Select fastq dataset”: output (output of Collapse Collectiontool)
“Set read groups information?”: Do not set
“Select analysis mode”: 1.Simple Illumina mode
“BAM sorting mode”: Sort by read names (i.e., the QNAME field)
Run BWA-MEM2Tool: toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy0 again with the following parameters:
“Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
param-file“Use the following dataset as the reference sequence”: output (Input dataset)
“Single or Paired-end reads”: Single
param-file“Select fastq dataset”: output (output of Collapse Collectiontool)
“Set read groups information?”: Do not set
“Select analysis mode”: 1.Simple Illumina mode
“BAM sorting mode”: Sort by read names (i.e., the QNAME field)
Filter and mergeTool: toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy0 with the following parameters:
param-file“First set of reads”: bam_output (output of BWA-MEM2tool)
param-file“Second set of reads”: bam_output (output of BWA-MEM2tool)
Generate Hi-C contact map
PretextMap converts BAM/SAM files into genome contact maps. With those contact maps it is possible to analyse the 3D organisation of chromosomes. The resulting map will be used for visualisation and editing contigs/scaffolds. (Kumar et al. 2017)
Hands-on: Generate a contact map with PretextMap and PretextSnapshot
PretextMapTool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.9+galaxy0 with the following parameters:
param-file“Input dataset in SAM or BAM format”: outfile (output of Filter and mergetool)
“Sort by”: Don't sort
Pretext SnapshotTool: toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy1 with the following parameters:
param-file“Input Pretext map file”: pretext_map_out (output of PretextMaptool)
The contact map of the snake (Erythrolamprus reginae) does have a clear diagonal pattern. These interactions are commonly called “cis” interactions and occur between genomic loci that are spatially close to each other. Interactions beyond the diagonal pattern are called “trans” and indicate long-range interactions.
Question: Hi-C contact map question
What can you say about the Hi-C map corresponding to Chondrosia reniformis?
The contact map of the sponge (Chondrosia reniformis) does contain widespread interactions all over the genome. Additionally a diagonal pattern can be observed. The widespread interactions all over the genome can be the result of experimental and technical issues or it can have a biological and funcitonal significance.
Assembly graph
Bandage is a tool to visualise de novo assembly graphs with connections. (Wick et al. 2015)
Hands-on: Generate assembly graph
gfastatsTool: toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0 with the following parameters:
param-file“Input file”: output (Input dataset)
“Specify target sequences”: Disabled
“Tool mode”: Genome assembly manipulation
“Output format”: GFA
“Terminal overlaps selection”: No (default)
Bandage ImageTool: toolshed.g2.bx.psu.edu/repos/iuc/bandage/bandage_image/2022.09+galaxy4 with the following parameters:
param-file“Graphical Fragment Assembly”: output (output of gfastatstool)
The Assembly graph of Chondrosia reniformis is relatively simple and doesn’t have a complex structure. The connectivity of the nodes is clear but uncomplete and lacking some indicative regions of overlapping contigs. Repetitive regions can be investigated by searching for multiple paths or connections in the graph.
Question: Assembly graph question
Which differences can you appreciate between the assembly graph corresponding to Erythrolamprus reginae and Eschrichtius robustus?
(A) The assembly graph of the snake (Erythrolamprus reginae) does have a confusing and complex structure. This can be an indicator for errors or misassemblies. The nodes are well-connected but relatively unclear in their path which is reflected by many small contigs and only a few large contigs. This assembly graph does contain a lot of repetitive sequences that result in multiple paths and connections between the nodes.
(B) The assembly graph of the whale (Eschrichtius robustus) is overall relatively simple and doesn’t have a too complex structure. The nodes are well-connected and most of the paths are clear which is reflected by many large contigs and only some small contigs. This assembly does also contain repetitive regions but clearly less than in the assembly of the snake.
Conclusion
In conclusion, it’s worth to run the the post-assembly workflow to assess the quality of genome assemblies. The following table contains metrics to sum up the quality control.
The ERGA post-assembly pipeline allows to assess and improve the quality of genome assemblies
The ERGA post-assembly pipeline contains of three main steps: Genome assembly decontamination and overview with BlobToolKit, providing analysis information and statistics and Hi-C scaffolding.
Bogart, J. P., 1980 Evolutionary Implications of Polyploidy in Amphibians and Reptiles, pp. 341–378 inPolyploidy, Springer US. 10.1007/978-1-4613-3069-1_18
Voultsiadou, E., 2005 Sponge diversity in the Aegean Sea: Check list and new information. Italian Journal of Zoology 72: 53–64. 10.1080/11250000509356653
Gennarelli, M., and A. Cattaneo, 2010 Genetic Variations and Association, pp. 129–151 inInternational Review of Neurobiology, Elsevier. 10.1016/b978-0-12-384976-2.00006-x
FRANKHAM, R. I. C. H. A. R. D., J. O. N. A. T. H. A. N. D. BALLOU, M. A. R. K. D. B. ELDRIDGE, R. O. B. E. R. T. C. LACY, K. A. T. H. E. R. I. N. E. RALLS et al., 2011 Predicting the Probability of Outbreeding Depression. Conservation Biology 25: 465–475. 10.1111/j.1523-1739.2011.01662.x
Buchfink, B., C. Xie, and D. H. Huson, 2014 Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59–60. 10.1038/nmeth.3176
Hunt, M., N. D. Silva, T. D. Otto, J. Parkhill, J. A. Keane et al., 2015 Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biology 16: 10.1186/s13059-015-0849-0
Mikheenko, A., V. Saveliev, and A. Gurevich, 2015 MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32: 1088–1090. 10.1093/bioinformatics/btv697
Shafer, A. B. A., J. B. W. Wolf, P. C. Alves, L. Bergström, M. W. Bruford et al., 2015 Genomics and the challenging translation into conservation practice. Trends in Ecology &\mathsemicolon Evolution 30: 78–87. 10.1016/j.tree.2014.11.009
Simão, F. A., R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva, and E. M. Zdobnov, 2015 BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31: 3210–3212. 10.1093/bioinformatics/btv351
Wick, R. R., M. B. Schultz, J. Zobel, and K. E. Holt, 2015 Bandage: interactive visualization of \lessi\greaterde novo\less/i\greater genome assemblies. Bioinformatics 31: 3350–3352. 10.1093/bioinformatics/btv383
Koren, S., B. P. Walenz, K. Berlin, J. R. Miller, N. H. Bergman et al., 2017 Canu: scalable and accurate long-read assembly via adaptive \lessi\greaterk\less/i\greater-mer weighting and repeat separation. Genome Research 27: 722–736. 10.1101/gr.215087.116
Kumar, R., H. Sobhy, P. Stenberg, and L. Lizana, 2017 Genome contact map explorer: a platform for the comparison, interactive visualization and analysis of genome contact maps. Nucleic Acids Research 45: e152–e152. 10.1093/nar/gkx644
Vaser, R., I. Sović, N. Nagarajan, and M. Šikić, 2017 Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27: 737–746. 10.1101/gr.214270.116
Vurture, G. W., F. J. Sedlazeck, M. Nattestad, C. J. Underwood, H. Fang et al., 2017 GenomeScope: fast reference-free genome profiling from short reads (B. Berger, Ed.). Bioinformatics 33: 2202–2204. 10.1093/bioinformatics/btx153
Zimin, A. V., D. Puiu, R. Hall, S. Kingan, B. J. Clavijo et al., 2017 The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum. GigaScience 6: 10.1093/gigascience/gix097
Brüniche-Olsen, A., R. Westerman, Z. Kazmierczyk, V. V. Vertyankin, C. Godard-Codding et al., 2018 The inference of gray whale (Eschrichtius robustus) historical population attributes from whole-genome sequences. BMC Evolutionary Biology 18: 10.1186/s12862-018-1204-3
Manekar, S. C., and S. R. Sathe, 2018 A benchmark study of k-mer counting methods for high-throughput sequencing. GigaScience. 10.1093/gigascience/giy125
Pal, K., M. Forcato, and F. Ferrari, 2018 Hi-C analysis: from data generation to integration. Biophysical Reviews 11: 67–78. 10.1007/s12551-018-0489-1
Challis, R., E. Richards, J. Rajan, G. Cochrane, and M. Blaxter, 2020 BlobToolKit – Interactive Quality Assessment of Genome Assemblies. G3 Genes\vertGenomes\vertGenetics 10: 1361–1374. 10.1534/g3.119.400908
Rhie, A., B. P. Walenz, S. Koren, and A. M. Phillippy, 2020 Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21: 10.1186/s13059-020-02134-9
Steinegger, M., and S. L. Salzberg, 2020 Terminating contamination: large-scale search identifies more than 2, 000, 000 contaminated entries in GenBank. Genome Biology 21: 10.1186/s13059-020-02023-1
Zhang, Y., C. Park, C. Bennett, M. Thornton, and D. Kim, 2021 Rapid and accurate alignment of nucleotide conversion sequencing reads with HISAT-3N. Genome Research 31: 1290–1295. 10.1101/gr.275193.120
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{assembly-ERGA-post-assembly-QC,
author = "Fabian Recktenwald and Cristóbal Gallardo",
title = "ERGA post-assembly QC (Galaxy Training Materials)",
year = "",
month = "",
day = ""
url = "\url{https://training.galaxyproject.org/training-material/topics/assembly/tutorials/ERGA-post-assembly-QC/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol} Computational Biology}
}
Congratulations on successfully completing this tutorial!
Galaxy Administrators: Install the missing tools
You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.