
An introduction to get started in genome assembly and annotation

Contributors

Anthony Bretaudeau, Alexandre Cormier, Stéphanie Robin, Erwan Corre, Laura Leroi, Erasmus+ Programme

Questions

Last modification: Dec 14, 2021

Steps before starting a genome project

.left[


Build a wide community for the project (if it’s possible)

.left[ The aim of a genome project is to sequence the entire target genome for a wide range of genomics applications. ]

.left[ Analyses, reanalyses and integration of genomic and other phenotype information require: ]

.left[ Warning: data storage, maintenance, transfer, and analysis costs will likely remain substantial and represent an increasing proportion of overall sequencing costs in the future. ]


Genome information: Genome size

.pull-left[ How to collect information?

.pull-right[ .image-100[ variation in estimated genome sizes in base pairs. ]]

.footnote[https://commons.wikimedia.org/w/index.php?curid=19537795]


Genome information: GC content

.pull-left[ Why?

.left[ Sequencing is not random! GC-rich and AT-rich regions are under-represented. ]

How to solve?

.pull-right[ .image-100[ Sequencing coverage by GC content. ]]

.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
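GC content can be measured directly on any available sequence (reads, contigs or a related reference). The sketch below is a minimal illustration, not a replacement for dedicated QC tools; the function names `gc_content` and `windowed_gc` are hypothetical, and ambiguous bases such as N are simply ignored.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / (gc + at)


def windowed_gc(sequence: str, window: int):
    """Yield (start, GC fraction) for consecutive non-overlapping windows."""
    for start in range(0, len(sequence) - window + 1, window):
        yield start, gc_content(sequence[start:start + window])
```

Computing GC per window rather than per genome helps to spot local GC-rich or AT-rich regions, which are the ones most likely to receive fewer reads.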


Genome information: Ploidy level

.pull-right[ .image-55[ Heterozygous genotype. ]]

.pull-left[

Ploidy (N):

Number of sets of chromosomes in a cell

| Organism | Ploidy |
|---|---|
| Bacteria | 1N |
| Human, mouse, rat | 2N |
| Amphibians (Xenopus) | 2N to 12N |
| Plants (wheat) | 6N |
| Autopolyploid | . |
| Hybrids | . |

]

Higher ploidy -> harder to assemble => Increase of sequencing depth

.footnote[Daniel Hartl. Essential Genetics: A Genomics Perspective. Jones & Bartlett Learning. p. 177. ISBN 978-0-7637-7364-9. (2011).]


Genome information: Heterozygosity level

.pull-left[ .left[ Heterozygous: locus-specific; a diploid (2N) organism has two different alleles of a particular gene at the same locus ]]

.pull-right[ .image-100[ Heterozygous genotype. ]]

.left[ Heterozygosity is a metric denoting the probability that an individual is heterozygous at a particular locus ]

Higher heterozygosity -> harder to assemble => Increase of sequencing depth

.footnote[https://www.genome.gov/genetics-glossary/heterozygous]
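As an illustration of heterozygosity as a probability, the sketch below computes the expected heterozygosity at one locus from allele frequencies, H = 1 - Σ pᵢ². This assumes random mating (Hardy-Weinberg proportions), which is an assumption added here and not stated in the slides; the function name `expected_heterozygosity` is hypothetical.

```python
import math


def expected_heterozygosity(allele_frequencies):
    """Probability that two alleles drawn at random at one locus differ.

    Assumes random mating; frequencies must sum to 1.
    """
    if not math.isclose(sum(allele_frequencies), 1.0, abs_tol=1e-9):
        raise ValueError("allele frequencies must sum to 1")
    # 1 minus the probability of drawing the same allele twice
    return 1.0 - sum(p * p for p in allele_frequencies)
```

For example, two equally frequent alleles give 0.5, while a locus fixed for a single allele gives 0.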


Genome information: Heterozygosity level

.image-125[ Concepts in phased assemblies. ]

.footnote[Heng Li’s blog: lh3.github.io/2021/04/17/concepts-in-phased-assemblies]


Genome information: Complexity, aka repeat elements

.left[ It is impossible to resolve repeats of length L unless you have reads longer than L ]

Most common source of assembly errors:

.pull-left[ .image-65[ Collapsed consensus from repeat copies. ]]

.pull-right[ .image-65[ Collapsed, excision and rearrangement consensus. ]]
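The read-length rule above can be illustrated on a toy sequence. In this sketch, error-free k-mers stand in for reads of length k, and the genome, the 10 bp repeat and the function name `has_repeated_kmer` are all made up for illustration. When k is no longer than the repeat, some k-mers occur at more than one position and cannot be placed unambiguously; once k exceeds the repeat length, every k-mer includes flanking sequence and becomes unique.

```python
from collections import Counter

# Toy genome (hypothetical): one 10 bp repeat present in two copies
REPEAT = "GATTACAGGC"
GENOME = "ACGT" + REPEAT + "TTGA" + REPEAT + "CCAG"


def has_repeated_kmer(genome: str, k: int) -> bool:
    """Return True if at least one k-mer occurs more than once in the genome."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return any(n > 1 for n in counts.values())


# k equal to the repeat length: the repeat itself is seen twice
ambiguous_at_10 = has_repeated_kmer(GENOME, 10)
# k one base longer than the repeat: flanks make every k-mer unique here
ambiguous_at_11 = has_repeated_kmer(GENOME, 11)
```

Real genomes are harder than this toy case: repeat copies are rarely identical, reads contain errors, and a read must extend well into unique sequence on both sides to anchor a repeat reliably.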


Genome information: Others


Genome information: Tips

.pull-left[

.pull-right[ .image-20[ GenomeScope Profile. ]]
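A rough idea of what a k-mer profile tells you about genome size can be sketched as follows. This is a deliberately simplified estimate (total k-mers divided by the depth of the main peak), not what GenomeScope does, since GenomeScope fits a statistical model that also accounts for heterozygosity and repeats. The histogram values, the `min_multiplicity` cutoff used to discard likely error k-mers, and the function name are assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_multiplicity=2):
    """Estimate genome size from a k-mer histogram.

    histogram maps multiplicity -> number of distinct k-mers seen that often.
    K-mers below min_multiplicity are treated as sequencing errors and ignored.
    """
    solid = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers left after filtering")
    # Multiplicity with the most distinct k-mers approximates the k-mer depth
    peak = max(solid, key=solid.get)
    total_kmers = sum(m * n for m, n in solid.items())
    return total_kmers / peak


# Toy histogram (made-up numbers): error k-mers at 1x, main peak at 20x
toy_histogram = {1: 500000, 18: 1000, 19: 4000, 20: 10000, 21: 4000, 22: 1000}
toy_size = estimate_genome_size(toy_histogram)
```

On a heterozygous genome the profile shows two peaks, and taking the wrong one as the depth would roughly double or halve the estimate, which is one reason to inspect the full profile rather than rely on a single number.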


The best possible DNA

.left[ Select the best possible DNA source and extraction. The extraction of high-quality DNA is the most important aspect of a successful genome project.

A lack of good starting material will limit the choice of sequencing technology and affect the quality of the data obtained. ]


The best possible DNA: Chemical purity of DNA

.left[ Sample-related contaminants:

All these contaminants can impair the efficiency of library preparation with any technology; this is especially true for PCR-free libraries (both PacBio and ONT). ]


The best possible DNA: Quantity of DNA

.left[ Different technologies require different amounts of DNA:


The best possible DNA: Structural integrity of DNA

.left[ High Molecular Weight (HMW) for Nanopore/PacBio (obtained mainly from fresh material) ]


The best possible DNA: Tips

.left[


Appropriate sequencing technology

.left[ Mostly dependent on the quantity and quality of DNA and on the cost, but many parameters need to be considered prior to running an NGS experiment:


Appropriate sequencing technology: Primary assembly

.left[


Appropriate sequencing technology: Scaffolding

.left[


Appropriate sequencing technology: Short vs long reads

.pull-left[ Read length and sequencing depth differ depending on the sequencing technology.

.pull-rigth[ .image-40[ Gigabase per run per read length. ]]

.footnote[https://github.com/alexcorm/developments-in-next-generation-sequencing]


Appropriate sequencing technology: Short vs long reads

.pull-left[ Read accuracy differs depending on the sequencing technology:

.pull-rigth[ .image-40[ Read accuracy distribution. ]]


Appropriate sequencing technology

.image-100[ Several sequencing technologies. ]

.footnote[Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genetics 20, 631–656 (2019).]


Appropriate sequencing technology: Coverage versus depth

.left[ Coverage: fraction of the genome sequenced by at least one read

Depth: average number of reads that cover any given region

Intuitively, more reads should increase coverage and depth. ]

.image-40[ Sequencing coverage by GC content. ]

.footnote[Chaisson et al. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015).]
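The relation between number of reads, depth and coverage can be sketched numerically. Mean depth is total sequenced bases divided by genome size. For coverage, the sketch uses the classical Lander-Waterman approximation (coverage ≈ 1 - e^(-depth)), which assumes reads are placed uniformly at random; as noted earlier, real sequencing is not random, so treat this as an optimistic upper bound. All function names are hypothetical.

```python
import math


def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering a position: total bases / genome size."""
    return n_reads * read_length / genome_size


def expected_coverage(depth: float) -> float:
    """Expected fraction of the genome hit by at least one read.

    Lander-Waterman approximation, assuming uniformly random read placement.
    """
    return 1.0 - math.exp(-depth)


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target mean depth."""
    return math.ceil(target_depth * genome_size / read_length)
```

For example, one million 150 bp reads on a 5 Mb genome give a mean depth of 30x. Under the random model, coverage saturates quickly while depth keeps growing linearly, which is why the two terms should not be used interchangeably.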


Computational resources and requirements

.left[ To succeed you need to have sufficient compute resources (CPUs, RAM, walltime and storage).


Typical sequencing strategies: Bacterial genomes

.left[


Typical sequencing strategies: Larger genomes

.left[


Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! This material is licensed under the Creative Commons Attribution 4.0 International License.