
An introduction to get started in genome assembly and annotation



Last modification: May 17, 2022

Steps before starting a genome project


Build a wide community for the project (if possible)

.left[ The aim of a genome project is to sequence the entire target genome for a wide range of genomics applications. ]

.left[ Analyses, reanalyses and integration of genomic and other phenotype information require: ]

.left[ Warning: Data storage, maintenance, transfer, and analysis costs will also likely remain substantial and represent an increasing proportion of overall sequencing costs in the future. ]

Genome information: Genome size

.pull-left[ How to collect information?

.pull-right[ .image-100[ variation in estimated genome sizes in base pairs. ]]


Genome information: GC content

.pull-left[ Why?

.left[ Sequencing is not random! GC-rich and AT-rich regions are under-represented. ]

How to solve?

.pull-right[ .image-100[ Sequencing coverage by GC content. ]]

.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
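As a minimal sketch of how GC content can be checked for a reference or a related species, the snippet below computes the GC fraction of a whole sequence and of consecutive windows. The function names `gc_fraction` and `windowed_gc` and the default window size are illustrative assumptions, not part of the original slides.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences made only of ambiguous bases (e.g. N) yield 0.0 by convention here.
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int = 100) -> List[float]:
    """GC fraction of consecutive, non-overlapping windows (hypothetical helper)."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]
```

Windows with an extreme GC fraction are the ones most likely to be under-represented in the reads.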

Genome information: Ploidy level

.pull-right[ .image-55[ Heterozygous genotype. ]]


Ploidy (N):

Number of sets of chromosomes in a cell

| Organism | Ploidy |
|---|---|
| Bacteria | 1N |
| Human, mouse, rat | 2N |
| Amphibians (Xenopus) | 2N to 12N |
| Plants (wheat) | 6N |
| Autopolyploid | . |
| Hybrids | . |


Higher ploidy -> harder to assemble => increase the sequencing depth

.footnote[Daniel Hartl. Essential Genetics: A Genomics Perspective. Jones & Bartlett Learning. p. 177. ISBN 978-0-7637-7364-9. (2011).]

Genome information: Heterozygosity level

.pull-left[ .left[ Heterozygous: locus-specific; a diploid (2N) organism has two different alleles of a particular gene at the same locus ]]

.pull-right[ .image-100[ Heterozygous genotype. ]]

.left[ Heterozygosity is a metric for the probability that an individual will be heterozygous at a particular locus ]

Higher heterozygosity -> harder to assemble => increase the sequencing depth
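To make the probability definition concrete, here is a minimal sketch that uses the standard population-genetics formula for expected heterozygosity at one locus, H = 1 - sum(p_i^2), where p_i are the allele frequencies. It assumes random mating, and the function name is hypothetical. Assembly tools often report heterozygosity differently, as the fraction of heterozygous sites in a genome, so treat this only as an illustration of the concept.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity at one locus: 1 minus the sum of squared allele frequencies."""
    total = sum(allele_freqs)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # sum(p * p) is the probability of drawing two identical alleles (homozygous).
    return 1.0 - sum(p * p for p in allele_freqs)
```

For example, two alleles at equal frequency give the maximum for a biallelic locus, `expected_heterozygosity([0.5, 0.5])`, which is 0.5.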


Genome information: Heterozygosity level

.image-125[ Concepts in phased assemblies. ]

.footnote[Heng Li’s blog:]

Genome information: Complexity, a.k.a. repeat elements

.left[ It is impossible to resolve repeats of length L unless you have reads longer than L ]

Most common source of assembly errors:

.pull-left[ .image-65[ Collapsed consensus from repeat copies. ]]

.pull-right[ .image-65[ Collapsed, excision and rearrangement consensus. ]]
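The read-length rule above can be sketched as a simple check: given a list of read lengths and a repeat of length L, count the reads long enough to cover the repeat plus some unique sequence on both sides. The function name and the `anchor` parameter are assumptions made for illustration. Sufficient length is necessary but not sufficient, since a read must also happen to fall across the repeat.

```python
def count_spanning_candidates(read_lengths, repeat_length, anchor=1):
    """Count reads long enough to cover a repeat plus `anchor` unique bases on each side."""
    needed = repeat_length + 2 * anchor
    return sum(1 for length in read_lengths if length >= needed)
```

With 150 bp short reads and a 6 kb repeat the count is zero, which is the situation that leads to collapsed repeat copies.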

Genome information: Others

Genome information: Tips


.pull-right[ .image-20[ GenomeScope Profile. ]]
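As a rough illustration of what a k-mer profile can tell you, the sketch below estimates genome size as the total number of k-mers divided by the depth of the main peak. This is a crude approximation and not GenomeScope's model-fitting method. The input format (a dict mapping k-mer multiplicity to the number of distinct k-mers), the `min_depth` error cutoff, and the function name are all assumptions, and the estimate is only meaningful when the histogram has a single dominant peak (haploid or low-heterozygosity genomes).

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer histogram {multiplicity: distinct k-mer count}."""
    # Drop low-multiplicity k-mers, which are mostly sequencing errors.
    kept = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers left after applying min_depth")
    # The most populated multiplicity approximates the k-mer sequencing depth.
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(depth * count for depth, count in kept.items())
    return total_kmers / peak_depth
```

A histogram in this shape can be produced by a k-mer counter; the exact export step depends on the tool and is not shown here.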

The best possible DNA

.left[ Select the best possible DNA source and extraction method. The extraction of high-quality DNA is the most important aspect of a successful genome project.

A lack of good starting material will limit the choice of sequencing technology and will affect the quality of the data obtained. ]

The best possible DNA: Chemical purity of DNA

.left[ Sample-related contaminants:

All these contaminants can impair the efficacy of library preparation with any technology; this is especially true for PCR-free libraries (both PacBio and ONT). ]

The best possible DNA: Quantity of DNA

.left[ Different technologies require different amounts of DNA:

The best possible DNA: Structural integrity of DNA

.left[ High Molecular Weight (HMW) for Nanopore/PacBio (obtained mainly from fresh material) ]

The best possible DNA: Tips


Appropriate sequencing technology

.left[ Mostly dependent on the quantity and quality of DNA and on the cost, but many parameters need to be considered prior to running an NGS experiment:

Appropriate sequencing technology: Primary assembly


Appropriate sequencing technology: Scaffolding


Appropriate sequencing technology: Short vs long reads

.pull-left[ Read length and sequencing depth differ depending on the sequencing technology.

.pull-rigth[ .image-40[ Gigabase per run per read length. ]]


Appropriate sequencing technology: Short vs long reads

.pull-left[ Read accuracy differs depending on the sequencing technology:

.pull-rigth[ .image-40[ Read accuracy distribution. ]]

Appropriate sequencing technology

.image-100[ Several sequencing technologies. ]

.footnote[Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genetics 20, 631–656 (2019).]

Appropriate sequencing technology: Coverage versus depth

.left[ Coverage: fraction of the genome sequenced by at least one read

Depth: average number of reads that cover any given region

Intuitively, more reads should increase coverage and depth. ]

.image-40[ Sequencing coverage by GC content. ]

.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
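The two definitions above can be turned into a small planning calculation. This is a sketch under the assumption that reads are sampled uniformly at random (the Lander-Waterman model), which ignores the GC bias discussed earlier, so real coverage will be lower in extreme-GC regions. The function names are hypothetical.

```python
import math


def mean_depth(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def expected_coverage(depth):
    """Expected fraction of the genome hit by at least one read under uniform sampling."""
    return 1.0 - math.exp(-depth)


def reads_needed(target_depth, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_depth * genome_size / read_length)
```

For instance, one million 150 bp reads on a 5 Mb genome give `mean_depth(1_000_000, 150, 5_000_000)`, an average depth of 30. At an average depth of 1, the model predicts that only about 63% of the genome is covered, which is why depth well above 1 is needed for near-complete coverage.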

Computational resources and requirements

.left[ To succeed, you need sufficient compute resources (CPUs, RAM, walltime and storage).

Typical sequencing strategies: Bacterial genomes


Typical sequencing strategies: Larger genomes


Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! This material is licensed under the Creative Commons Attribution 4.0 International License.


These individuals or organisations provided funding support for the development of this resource

This project (2020-1-NL01-KA203-064717) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics.