An Introduction to Genome Assembly

Contributors

Simon Gladman

Questions

Objectives

Last modification: Jul 26, 2021

.enlarge120[

De novo Genome Assembly

]

With thanks to T Seemann, D Bulach, I Cooke and Simon Gladman


.enlarge120[

De novo assembly

]

.pull-left[

The process of reconstructing the original DNA sequence from the fragment reads alone.

]

.pull-right[ Graphic of a shattered, human-like egg sitting on a wall, dressed in a suit. Several men stand around him attempting to piece back together shattered fragments of the egg. ]


Another View

A stack of newspapers is labelled genomic DNA. An arrow points to an image of a room full of shredded paper and people, labelled reads. The arrow then continues to a pile of newspaper clippings labelled draft genome sequence, and finally to a cartoon of a newspaper labelled closed genome sequence.


Assembly: An Example


A small “genome”

<img src="../../images/shakespear1.png" alt="The text 'Friends, romans, countrymen, lend me your ears;'. A small drawing of Shakespeare has a speech bubble reading 'I’ll return them tomorrow'" loading="lazy">


Shakespearomics

<img src="../../images/shakespear2.png" alt="A set of reads is shown, with various subsets of the sentence like 'ds, romans, count' and 'friends, rom' and 'crymen, lend me'. The c in crymen is highlighted yellow, as it should be trymen (from countrymen). The drawing of Shakespeare now says 'Oops! I dropped them.'" loading="lazy">


Shakespearomics

<img src="../../images/shakespear3.png" alt="The reads are shown again, now with overlaps below, reconstructing the sentence from the fragments. Shakespeare says 'I’m good with words.' 'crymen' and 'send me your ears' have their first letters highlighted in yellow due to their typos." loading="lazy">


Shakespearomics

<img src="../../images/shakespear4.png" alt="Finally a 'majority consensus' is shown below the overlaps. Two other reads show 'count' and 'countrymen', in addition to our 'crymen', so 2/3 of the reads have the correct text and we go with the majority. The same is done for the other typo. Shakespeare says 'We have a consensus!'" loading="lazy">


So far, so good!
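The majority-consensus step from the Shakespeare example can be sketched in a few lines. This is only an illustration: it assumes the overlaps have already been found, so each read comes with a known offset (the offsets and the helper name `majority_consensus` are made up for this sketch), and it simply takes the most common character in each column.

```python
from collections import Counter


def majority_consensus(placed_reads):
    """Column-wise majority vote over reads placed at known offsets.

    placed_reads: list of (offset, read) pairs. The offsets are assumed
    to come from an earlier overlap step; they are hypothetical here.
    """
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, char in enumerate(read):
            columns[offset + i][char] += 1
    # Uncovered columns are reported as "?" instead of being guessed.
    return "".join(
        col.most_common(1)[0][0] if col else "?" for col in columns
    )


# The last read carries the "crymen" typo from the slides.
reads = [
    (0, "friends, rom"),
    (5, "ds, romans, count"),
    (13, "ns, countrymen"),
    (21, "crymen"),
]
consensus = majority_consensus(reads)
print(consensus)
```

At the column holding the typo, two reads vote for `t` and one for `c`, which is the 2/3 majority described on the slide.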


The Awful Truth

A meme image showing Boromir from The Lord of the Rings. The text reads: one does not simply assemble a genome.

“Genome assembly is impossible.” - A/Prof. Mihai Pop


.enlarge120[

Why is it so hard?

]

.pull-left[

.pull-right[ <img src="../../images/worlds_hardest.png" alt="A picture of a jigsaw puzzle titled 'The world’s most difficult' and showing a field of small round candies. It boasts the same artwork on both sides." loading="lazy"> ]


Assembly recipe


Reads, shown in the colours of the rainbow, are provided to the algorithm. Next, overlaps are identified and the rainbow resolves itself. A highlighted subset points to reads connected by overlaps, with many arrows running between the highlighted blue-green fragments. This leads to the Hamiltonian path, with a re-run arrow in between indicating that some amount of backtracking is needed to find the best path. Finally, the Hamiltonian path produces a consensus sequence with the correct final ordering.
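The recipe above (find overlaps, search for a Hamiltonian path with backtracking, then merge along the path) can be sketched for a toy case. This is a minimal illustration, not how production assemblers work: it assumes error-free, distinct reads, and the function names and the `min_len` threshold are choices made for this sketch.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_graph(reads, min_len=3):
    """Overlap graph: graph[a][b] is the overlap length from a to b."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n:
                    graph[a][b] = n
    return graph


def hamiltonian_path(graph, path):
    """Extend path until every read is visited once, backtracking on dead ends."""
    if len(path) == len(graph):
        return path
    for nxt in graph[path[-1]]:
        if nxt not in path:
            found = hamiltonian_path(graph, path + [nxt])
            if found:
                return found
    return None


def assemble(reads, min_len=3):
    graph = build_graph(reads, min_len)
    for start in reads:
        path = hamiltonian_path(graph, [start])
        if path:
            sequence = path[0]
            for prev, nxt in zip(path, path[1:]):
                # Append only the part of nxt that prev does not already cover.
                sequence += nxt[graph[prev][nxt]:]
            return sequence
    return None


reads = ["ns, countrymen", "friends, rom", "ds, romans, count"]
assembled = assemble(reads)
print(assembled)
```

The exhaustive path search is exponential in the worst case, which is one reason the toy approach does not scale to the realistic graph on the next slide.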


A more realistic graph

A graph of roughly 500 nodes connected by messy lines. It is intentionally impossible to read, to highlight the scope of the problem.


.image-15[fun with a strike through it.] What ruins the graph?


Repeats


.enlarge120[

What is a repeat?

]

.pull-left[

A segment of DNA which occurs more than once in the genome sequence

]

.pull-right[

Three human children wearing similar shirts. One reads I was planned, one I was not, and the third Me neither.

]


Effect on Assembly

A genome with a repeat in two distinct locations is shown. Arrows point to the repeats being collapsed, and then the in-between bits being cut out of the sequence completely.


.enlarge120[

The law of repeats .image-15[A picture of the ocean with text reading repeat after me.]

]

It is impossible to resolve repeats of length S unless you have reads longer than S
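A small experiment can make this law concrete. The sketch below builds two different toy genomes that contain the same repeat three times, with the unique segments between the copies swapped, and compares the complete set of error-free reads each genome would produce. All sequences and names here are invented for illustration.

```python
from collections import Counter


def read_multiset(genome, read_len):
    """Every error-free read of length read_len, with multiplicity."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


REPEAT = "GATTACA"  # hypothetical repeat of length S = 7
A, B, C, D = "CCGT", "TTAGC", "AAGTC", "GGCA"  # hypothetical unique segments

# Same pieces, but the segments between the repeat copies are swapped.
genome_1 = A + REPEAT + B + REPEAT + C + REPEAT + D
genome_2 = A + REPEAT + C + REPEAT + B + REPEAT + D

S = len(REPEAT)
for read_len in (S, S + 2):
    same = read_multiset(genome_1, read_len) == read_multiset(genome_2, read_len)
    print(read_len, "indistinguishable" if same else "distinguishable")
```

With reads no longer than the repeat, both genomes yield exactly the same reads, so no algorithm could tell them apart. Once a read spans the repeat plus at least one flanking base on each side, the two read sets differ.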


Scaffolding


.enlarge120[

Beyond contigs

]

.pull-left[

Contig sizes are limited by:

]


.enlarge120[

Types of reads

]

.pull-left[.enlarge120[Example fragment]]

.remark-code[.enlarge120[atcgtatgatcttgagattctctcttcccttatagctgctata]]

.pull-left[.enlarge120[“Single-end” read]]

.remark-code[.enlarge120[atcgtatgatcttgagattctctcttcccttatagctgctata]]

sequence one end of the fragment

.pull-left[.enlarge120[“Paired-end” read]]

.remark-code[.enlarge120[atcgtatgatcttgagattctctcttcccttatagctgctata]]

sequence both ends of the same fragment

We can exploit this information!
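The difference between the two read types can be sketched using the example fragment from this slide. The read length of 10 is an arbitrary choice for illustration, and the function names are made up. The sketch assumes the common convention that the second read of a pair is reported as the reverse complement of the far end of the fragment.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def single_end(fragment, read_len):
    """Sequence one end of the fragment."""
    return fragment[:read_len]


def paired_end(fragment, read_len):
    """Sequence both ends of the same fragment."""
    forward = fragment[:read_len]
    # The second read comes from the opposite strand at the other end.
    reverse = reverse_complement(fragment[-read_len:])
    return forward, reverse


fragment = "atcgtatgatcttgagattctctcttcccttatagctgctata"
read_1, read_2 = paired_end(fragment, 10)
print(single_end(fragment, 10))
print(read_1, read_2)
```

Because both reads come from one fragment, their approximate distance and relative orientation are known even though the bases between them were never sequenced.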

.enlarge120[# Scaffolding]

.enlarge120[# Contigs to Scaffolds]

A scaffold with gaps as yellow boxes is shown. Above is a set of contigs and paired-end reads shown bridging the gaps.
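How a read pair that bridges two contigs yields a gap estimate can be sketched as simple arithmetic. This is an illustration under strong assumptions: the fragment (insert) size is known exactly, a single pair is used, and all parameter names are hypothetical. Real scaffolders combine many pairs and account for the spread of insert sizes.

```python
def estimate_gap(insert_size, left_contig_len, read_1_start, read_2_end):
    """Estimate the gap between two contigs from one bridging read pair.

    read_1_start: position on the left contig where the fragment begins.
    read_2_end: position on the right contig where the fragment ends.
    """
    on_left = left_contig_len - read_1_start  # fragment bases on the left contig
    on_right = read_2_end                     # fragment bases on the right contig
    return insert_size - on_left - on_right


def join_contigs(left, right, gap):
    """Join two contigs into a scaffold, writing the unknown gap as N."""
    return left + "N" * max(gap, 0) + right


gap = estimate_gap(insert_size=500, left_contig_len=1000,
                   read_1_start=900, read_2_end=150)
print(gap)
print(join_contigs("ACGT", "TTGA", 3))
```

The run of `N` characters records that the contigs are ordered and roughly spaced, while the bases in between remain unknown.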


.enlarge120[# Assessing assemblies]


.enlarge120[# The “N50”]

.enlarge120[The length of the contig at which, with contigs sorted from longest to shortest, the running total of bases first reaches 50% of the assembly]
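One common way to compute the N50 is to sort the contig lengths from longest to shortest and walk down the list until the running total reaches half of all assembled bases. The sketch below follows that convention. The contig lengths are invented for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches 50% of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 needs at least one contig")


lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
print(n50(lengths))
```

Here the total is 290 bases, and the two longest contigs (80 + 70 = 150) already cover more than half of it, so the N50 is 70.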


.enlarge120[# 2 levels of assembly]


.enlarge120[# How do I do it?]

.enlarge120[

Example

]


.enlarge120[# Assembly tools

And many, many others…

]


.enlarge120[

Assembly Exercise #1

]


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! This material is licensed under the Creative Commons Attribution 4.0 International License.