An Introduction to Genome Assembly

name: inverse
layout: true
class: center, middle, inverse

</span></div>

</span></div>

---

# An Introduction to Genome Assembly

<div class="contributors-line">
		Authors: <a href="/training-material/hall-of-fame/slugger70/" class="contributor-badge contributor-slugger70"><img src="/training-material/assets/images/orcid.png" alt="orcid logo" width="36" height="36"/><img src="https://avatars.githubusercontent.com/slugger70?s=36" alt="Simon Gladman avatar" width="36" class="avatar" />
    Simon Gladman</a>
	</div>

</div>

<div class="footnote" style="bottom: 8em;">
  <i class="far fa-calendar" aria-hidden="true"></i><span class="visually-hidden">last_modification</span> Updated:   
  <i class="fas fa-fingerprint" aria-hidden="true"></i><span class="visually-hidden">purl</span><abbr title="Persistent URL">PURL</abbr>: <a href="https://gxy.io/GTN:S00033">gxy.io/GTN:S00033</a>
</div>

<div class="footnote" style="bottom: 6em;">

<i class="fas fa-file-alt" aria-hidden="true"></i><span class="visually-hidden">text-document</span><a href="slides-plain.html"> Plain-text slides</a>
</div>

<div class="footnote" style="bottom: 2em;">
    <strong>Tip: </strong>press <kbd>P</kbd> to view the presenter notes
    | <i class="fa fa-arrows" aria-hidden="true"></i><span class="visually-hidden">arrow-keys</span> Use arrow keys to move between slides

</div>

???
Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off

Press `C` to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.

Useful when presenting.

---

## Requirements

Before diving into this slide deck, we recommend you to have a look at:

- [Introduction to Galaxy Analyses](/training-material/topics/introduction)

- [Sequence analysis](/training-material/topics/sequence-analysis)
    - Quality Control: [<i class="fab fa-slideshare" aria-hidden="true"></i><span class="visually-hidden">slides</span> slides](/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html) - [<i class="fas fa-laptop" aria-hidden="true"></i><span class="visually-hidden">tutorial</span> hands-on](/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html)

---

### <i class="far fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions

- How do we perform a very basic genome assembly from short read data?

---

### <i class="fas fa-bullseye" aria-hidden="true"></i><span class="visually-hidden">objectives</span> Objectives

- assemble some paired end reads using Velvet

- examine the output of the assembly.

---

.enlarge120[

# ***De novo* Genome Assembly**

]

#### With thanks to T Seemann, D Bulach, I Cooke and Simon Gladman
---
.enlarge120[

# ***De novo* assembly**

]

.pull-left[

**The process of reconstructing the original DNA sequence from the fragment reads alone.**

* Instinctively like a jigsaw puzzle

* Find reads which "fit together" (overlap)
  * Could be missing pieces (sequencing bias)
  * Some pieces will be dirty (sequencing errors)

]

.pull-right[ ![Graphic of a shattered, human-like egg sitting on a wall, dressed in a suit. Several men stand around him attempting to piece back together shattered fragments of the egg.](../../images/Humpty.jpg) ]

---

# **Another View**

![A stack of newspapers is labelled genomic DNA. A line labelled points an image of a room full of shredded paper and people inside, labelled reads. Then the line continues to a pile of newspaper clippings reading draft genome sequence, and finally to closed genome sequence with a cartoon of a newspaper.](../../images/newspaper.png)

---

# **Assembly: An Example**

---

# **A small "genome"**

![The text "Friends, romans, countrymen, lend me your ears;". A small drawing of Shakespeare has a speech bubble reading "I'll return them tomorrow"](../../images/shakespear1.png)

---

# **Shakespearomics**

![A set of reads is shown, with various subsets of the sentence like "ds, romans, count" and "friends, rom" and "crymen, lend me". The c in crymen is highlighted yellow, as it should be trymen (from countrymen.) The drawing of Shakespeare now says "Oops! I dropped them."](../../images/shakespear2.png)

---

# **Shakespearomics**

![The reads are shown again, now with overlaps below, reconstructing the sentence from the fragments. Shakespeare says I'm good with words. Crymen and "send me your ears" have their first letters highlighted in yellow due to their typos.](../../images/shakespear3.png)

---

# **Shakespearomics**

![Finally a "majority consensus" is shown below the overlaps, in two other reads we saw count and countrymen, in addition to our crymen. So that makes 2/3 that have the correct text, and we go with the majority. The same is done for the other typo. Shakespeare says We have a consensus!](../../images/shakespear4.png)

---

# **So far, so good!**

---

# **The Awful Truth**

![A meme image showing boromir from lord of the rings. The text reads: one does not simply assemble a genome.](../../images/notsimply.png)

## "Genome assembly is impossible." - A/Prof. Mihai Pop

---
.enlarge120[

# **Why is it so hard?**

]

.pull-left[
* Millions of pieces
  * Much, much shorter than the genome
  * Lots of them look similar
* Missing pieces
  * Some parts can't be sequenced easily
* Dirty Pieces
  * Lots of errors in reads
]

.pull-right[ ![A picture of a jigsaw puzzle titled "The world's most difficult" and showing a field of small round candies. It boasts the same artwork on both sides.](../../images/worlds_hardest.png) ]

---

# **Assembly recipe**

* Find all overlaps between reads
  * Hmm, sounds like a lot of work..
* Build a graph
  * A picture of the read connections
* Simplify the graph
  * Sequencing errors will mess it up a lot
* Traverse the graph
  * Trace a sensible path to produce a consensus

---

![Reads are provided to the algorithm, they are in the colours of the rainbow. Next overlaps are identified and the rainbow resolves itself. A subset of that is highlighted and points to reads connected by overlaps with many arrows going between the bluegreen fragments that are highlighted. This goes to the hamiltonian path identified with a re-run arrow between, indicating some mount of backtracking needed to find the best path. Finally the hamiltonian produces a consensus sequence with the correct final ordering.](../../images/olc_pic.png)

---

# **A more realistic graph**

![A graph showing maybe 500 nodes connected with messy lines, it is intentionally impossible to read and a mess to highlight the scope of the problem.](../../images/real_graph.png)

---

# .image-15[![fun with a strike through it.](../../images/nofun.png)] **What ruins the graph?**

* Read errors
  * Introduces false edges and nodes

* Non haploid organisms
  * Heterozygosity causes lots of detours

* Repeats
  * If they are longer than the read length
  * Causes nodes to be shared, locality confusion.

---

# **Repeats**

---
.enlarge120[
# **What is a repeat?**
]

.pull-left[

#### ***A segment of DNA which occurs more than once in the genome sequence***

* Very common
  * Transposons (self replicating genes)
  * Satellites (repetitive adjacent patterns)
  * Gene duplications (paralogs)

]

.pull-right[

![Three human children wearing similar shirts. One reads I was planned, one I was not, and the third Me neither.](../../images/triplets.png)

]

---

# **Effect on Assembly**

![A genome with a repeat in two distinct locations is shown. Arrows point to the repeats being collapsed, and then the in-between bits being cut out of the sequence completely.](../../images/repeat_effect.png)

---
.enlarge120[
# **The law of repeats** .image-15[![A picture of the ocean with text reading repeat after me.](../../images/repeatafterme.png)]
]

## **It is impossible to resolve repeats of length S unless you have reads longer than S**

---

# **Scaffolding**

---
.enlarge120[
# **Beyond contigs**
]

.pull-left[

Contig sizes are limited by:

* the length of the repeats in your genome
  * Can't change this

* the length (or "span") of the reads
  * Use long read technology
  * Use tricks with other technology

]

---
.enlarge120[
# **Types of reads**
]

.pull-left[.enlarge120[**Example fragment**]]

.remark-code[.enlarge120[atcgtatgatcttgagattctctcttcccttatagctgctata]]

.pull-left[.enlarge120[**"Single-end" read**]]

.remark-code[.enlarge120[**atcgtatg**atcttgagattctctcttcccttatagctgctata]]

sequence *one* end of the fragment

.pull-left[.enlarge120[**"Paired-end" read**]]

.remark-code[.enlarge120[**atcgtatg**atcttgagattctctcttcccttatag**ctgctata**]]

sequence both ends of the same fragment

**We can exploit this information!**
---

.enlarge120[# **Scaffolding**]

* **Paired end reads**
  * Known sequences at each end of fragment
  * Roughly known fragment length

* **Most ends will occur in same contig**

* **Some will occur in different contigs**
  * ***evidence that these contigs are linked***
---

.enlarge120[# **Contigs to Scaffolds**]

![A scaffold with gaps as yellow boxes is shown. Above is a set of contigs and paired-end reads shown bridging the gaps.](../../images/scaffolding.png)

---

.enlarge120[# **Assessing assemblies**]

* We desire
  * Total length similar to genome size
  * Fewer, larger contigs
  * Correct contigs

* Metrics
  * No generally useful measure. (No real prior information)
  * Longest contigs, total base pairs in contigs, **N50**, ...

---

.enlarge120[# **The "N50"**]

.enlarge120[***The length of that contig from which 50% of the bases are in it and shorter contigs***]

* Imagine we have 7 contigs with lengths:
  * 1, 1, 3, 5, 8, 12, 20

* Total
  * 1+1+3+5+8+12+20 = 50

* N50 is the "halfway sum" = 25
  * 1+1+3+5+8+**12** = 30 (>25) so **N50 is 12**

---

.enlarge120[# **2 levels of assembly**]

* Draft assembly
  * Will contain a number of non-linked scaffolds with gaps of unknown sequence
  * Fairly easy to get to

* Closed (finished) assembly
  * One sequence for each chromosome
  * Takes a **lot** more work
  * Small genomes are becoming easier with long read tech
  * Large genomes are the province of big consortia (e.g. Human Genome Consortium)

---
.enlarge120[# **How do I do it?**]
---
.enlarge120[
# **Example**

* Culture your bacterium

* Extract your genomic DNA

* Send it to your sequencing centre for Illumina sequencing
  * 250bp paired end

* Get back 2 files
  * .remark-code[MRSA_R1.fastq.gz]
  * .remark-code[MRSA_R2.fastq.gz]

* ***Now what?***
]

---
.enlarge120[# **Assembly tools**

* **Genome**
  * **Velvet, Velvet Optimizer, Spades,** Abyss, MIRA, Newbler, SGA, AllPaths, Ray, SOAPdenovo, ...

* Meta-genome
  * Meta Velvet, SGA, custom scripts + above

* Transcriptome
  * Trinity, Oases, Trans-abyss

***And many, many others...***

]

---
.enlarge120[
# **Assembly Exercise #1**

* We will do a simple assembly using **Velvet** in **Galaxy**
* We can do a number of different assemblies and compare some assembly metrics.

]

---
### <i class="fas fa-key" aria-hidden="true"></i><span class="visually-hidden">keypoints</span> Key points

- We assembled some Illumina fastq reads into contigs using a short read assembler called Velvet

- We showed what effect one of the key assembly parameters, the k-mer size, has on the assembly

- It looks as though there are some exploitable patterns in the metric data vs the k-mer size.

---

## Thank You!

This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!

</div>

</div>

Tutorial Content is licensed under <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/>