
An introduction to get started in genome assembly and annotation


Anthony Bretaudeau, Alexandre Cormier, Stéphanie Robin, Erwan Corre, Laura Leroi, Erasmus+ Programme


Last modification: Dec 14, 2021

Steps before starting a genome project


Build a wide community for the project (if it’s possible)

.left[ The aim of a genome project is to sequence the entire target genome for a wide range of genomics applications. ]

.left[ Analyses, reanalyses and integration of genomic and other phenotype information require: ]

.left[ Warning: data storage, maintenance, transfer, and analysis costs will also likely remain substantial and represent an increasing proportion of overall sequencing costs in the future. ]

Genome information: Genome size

.pull-left[ How to collect information?

.pull-right[ .image-100[ variation in estimated genome sizes in base pairs ]]


Genome information: GC content

.pull-left[ Why?

.left[ Sequencing is not random! GC-rich and AT-rich regions are under-represented. ]

How to solve?

.pull-right[ .image-100[ Sequencing coverage by GC content ]]

.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
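
A minimal sketch of how GC content can be measured, overall and per window, to spot GC-rich or AT-rich regions that may end up under-represented. The function names, the window size and the example sequence are illustrative assumptions, not part of the original material.

```python
from typing import List


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases such as N are ignored
    return gc / total if total else 0.0


def gc_by_window(sequence: str, window: int = 100) -> List[float]:
    """Return the GC fraction of consecutive, non-overlapping windows."""
    return [
        gc_content(sequence[i:i + window])
        for i in range(0, len(sequence), window)
    ]


example = "ATGCGCGCATATATATGGCC"  # made-up sequence, for illustration only
print(gc_content(example))               # 0.5
print(gc_by_window(example, window=10))  # [0.6, 0.4]
```

Windows with extreme values are the ones where lower sequencing depth could be expected.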

Genome information: Ploidy level

.pull-right[ .image-55[ Heterozygous genotype ]]


Ploidy (N):

Number of sets of chromosomes in a cell

| Organism | Ploidy |
|---|---|
| Bacteria | 1N |
| Human, mouse, rat | 2N |
| Amphibians (Xenopus) | 2N to 12N |
| Plants (wheat) | 6N |
| Autopolyploid | . |
| Hybrids | . |


Higher ploidy -> harder to assemble => Increase of sequencing depth

.footnote[Daniel Hartl. Essential Genetics: A Genomics Perspective. Jones & Bartlett Learning. p. 177. ISBN 978-0-7637-7364-9. (2011).]

Genome information: Heterozygosity level

.pull-left[ .left[ Heterozygous: a locus-specific property. A diploid (2N) organism is heterozygous when it carries two different alleles of a gene at the same locus ]]

.pull-right[ .image-100[ Heterozygous genotype ]]

.left[ Heterozygosity is a metric used to denote the probability that an individual will be heterozygous at a particular locus ]

Higher heterozygosity -> harder to assemble => Increase of sequencing depth
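
As a worked illustration of heterozygosity as a probability, the sketch below computes the expected heterozygosity at one locus from allele frequencies, H = 1 - sum(p_i^2). This is the usual population-genetics formula under random mating, brought in here as an assumption; the function name and the frequencies are made up for the example.

```python
from typing import Sequence


def expected_heterozygosity(allele_frequencies: Sequence[float]) -> float:
    """Probability that two alleles drawn at random at one locus differ.

    Assumes random mating; frequencies must sum to 1.
    """
    if abs(sum(allele_frequencies) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # 1 minus the probability of drawing the same allele twice
    return 1.0 - sum(p * p for p in allele_frequencies)


print(expected_heterozygosity([0.5, 0.5]))            # 0.5
print(round(expected_heterozygosity([0.9, 0.1]), 2))  # 0.18
print(expected_heterozygosity([1.0]))                 # 0.0
```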


Genome information: Heterozygosity level

.image-125[ Concepts in phased assemblies ]

.footnote[Heng Li’s blog:]

Genome information: Complexity, aka repeat elements

.left[ It is impossible to resolve repeats of length L unless you have reads longer than L ]

Most common source of assembly errors:

.pull-left[ .image-65[ Collapsed consensus from repeat copies ]]

.pull-right[ .image-65[ Collapsed, excision and rearrangement consensus ]]
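
The rule that a repeat of length L can only be resolved by reads longer than L can be checked against a set of read lengths. The sketch below is only illustrative: the `flank` parameter (unique bases required on each side of the repeat to anchor the read) and the read lengths are hypothetical.

```python
from typing import List, Sequence


def reads_able_to_span(
    read_lengths: Sequence[int], repeat_length: int, flank: int = 0
) -> List[int]:
    """Return the read lengths long enough to cross a repeat.

    A read must be longer than the repeat plus `flank` unique bases
    on each side (flank is an illustrative assumption).
    """
    needed = repeat_length + 2 * flank
    return [length for length in read_lengths if length > needed]


read_lengths = [150, 5000, 12000, 30000]  # made-up short and long reads
spanning = reads_able_to_span(read_lengths, repeat_length=10000, flank=500)
print(spanning)  # [12000, 30000]
```

With short reads only (150 bp here), no read can span the 10 kb repeat, which is the situation that leads to collapsed repeat copies.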

Genome information: Others

Genome information: Tips


.pull-right[ .image-20[ GenomeScope Profile ]]
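
A rough idea of how a k-mer profile relates to genome size: the total number of k-mers divided by the depth of the main peak gives a first estimate. This is a deliberately simplified sketch, not GenomeScope's model (which fits heterozygosity and repeats as well); the histogram values and the `min_depth` error cutoff are made-up assumptions.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size from a k-mer histogram.

    histogram maps k-mer multiplicity to the number of distinct k-mers
    seen that many times. K-mers below min_depth are treated as
    sequencing errors and ignored (illustrative threshold).
    """
    solid = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)  # multiplicity with the most k-mers
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth


# Toy histogram: low-multiplicity error k-mers plus a peak around 20x.
toy_histogram = {1: 50000, 2: 8000, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
print(estimate_genome_size(toy_histogram))  # 1400.0
```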

The best possible DNA

.left[ Select the best possible DNA source and extraction method. The extraction of high-quality DNA is the most important aspect of a successful genome project.

The lack of good starting material will limit the choice of sequencing technology and will affect the quality of the data obtained. ]

The best possible DNA: Chemical purity of DNA

.left[ Sample-related contaminants:

All these contaminants can impair the efficiency of library preparation with any technology. This is especially true for PCR-free libraries (both PacBio and ONT). ]

The best possible DNA: Quantity of DNA

.left[ Different technologies require different amounts of DNA:

The best possible DNA: Structural integrity of DNA

.left[ High Molecular Weight (HMW) for Nanopore/PacBio (obtained mainly from fresh material) ]

The best possible DNA: Tips


Appropriate sequencing technology

.left[ Mostly dependent on the quantity and quality of DNA and on cost, but many parameters need to be considered prior to running an NGS experiment:

Appropriate sequencing technology: Primary assembly


Appropriate sequencing technology: Scaffolding


Appropriate sequencing technology: Short vs long reads

.pull-left[ Read length and sequencing depth differ depending on the sequencing technology.

.pull-rigth[ .image-40[ Gigabase per run per read length ]]


Appropriate sequencing technology: Short vs long reads

.pull-left[ Read accuracy differs depending on the sequencing technology:

.pull-rigth[ .image-40[ Reads accuracy distribution ]]

Appropriate sequencing technology

.image-100[ Several sequencing technologies ]

.footnote[Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genetics 20, 631–656 (2019).]

Appropriate sequencing technology: Coverage versus depth

.left[ Coverage: fraction of the genome sequenced by at least one read

Depth: average number of reads that cover any given region

Intuitively, more reads should increase coverage and depth. ]

.image-40[ Sequencing coverage by GC content ]

.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
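
The two definitions can be made concrete with a small sketch that counts, for each base of a toy genome, how many reads overlap it. The read positions and the function name are made up for illustration; real pipelines would derive these values from alignments.

```python
from typing import List, Tuple


def coverage_and_depth(
    genome_length: int, reads: List[Tuple[int, int]]
) -> Tuple[float, float]:
    """Return (coverage, depth) for reads given as 0-based half-open intervals.

    coverage: fraction of positions covered by at least one read
    depth: average number of reads per position
    """
    per_base = [0] * genome_length
    for start, end in reads:
        # clip reads to the genome boundaries before counting
        for pos in range(max(start, 0), min(end, genome_length)):
            per_base[pos] += 1
    covered = sum(1 for count in per_base if count > 0)
    return covered / genome_length, sum(per_base) / genome_length


toy_reads = [(0, 50), (25, 75), (30, 60)]  # made-up read placements
coverage, depth = coverage_and_depth(100, toy_reads)
print(coverage)  # 0.75
print(depth)     # 1.3
```

Adding a fourth read at `(40, 70)` would raise depth but leave coverage unchanged, which is why the two values are reported separately.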

Computational resources and requirements

.left[ To succeed you need to have sufficient compute resources (CPUs, RAM, walltime and storage).

Typical sequencing strategies: Bacterial genomes


Typical sequencing strategies: Larger genomes


Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! This material is licensed under the Creative Commons Attribution 4.0 International License.