View markdown source on GitHub

Deeper look into Genome Assembly algorithms

Contributors

AvatarAnthony Bretaudeau AvatarSimon Gladman AvatarAnton Nekrutenko AvatarDelphine Lariviere

Questions

last_modification Last modification: Dec 28, 2020

Different types of input data

Genome Assembly can be done with:

Specific algorithms for each


Genome Assembly algorithms

Detect overlaps between reads to build the longest possible sequences

Algorithms use graphs to represent overlapping reads/words

Two steps:

Two main types of algorithms:


OLC (Overlap Layout Consensus)

The older, first used for Sanger sequencing

Compare all reads, look for read overlaps

If a suffix of one read is similar to a prefix of another read…

TCTATATCTCGGCTCTAGG    <- read 1
    ||||||| |||||||
    TATCTCGACTCTAGGCC  <- read 2

…then they may overlap within the genome.


Directed graphs

We build a directed graph where directed edges connect nodes representing overlapping reads: Image showing example reads represented as nodes, and an overlap between two reads as an edge

Directed graph representing overlapping reads. (Image from Ben Langmead).


Directed graphs

For example, consider these 15 three letter reads:

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

We’re lucky, we know the corresponding genome sequence is TAATGCCATGGGATGTT

So the 15 reads can be represented as the following directed graph:

A graph representing the 15 overlapping example reads


However, in real life we do not know the actual genome! All we have is a collection of reads. Let’s first build an overlap graph by connecting two 3-mers if suffix of one is equal to the prefix of the other:

All possible overlap connections for our 3-mer collection

Overlap graph. All possible overlap connections for our 3-mer collection. (Fig. 4.7 from CP)

So to determine the sequence of the underlying genome we are looking for a path in this graph that visits every node (3-mer) once. Such path is called Hamiltonian path and it may not be unique.


For example for our 3-mer collection there are two possible Hamiltonian paths:

First possible Hamiltonian path Second possible Hamiltonian path

Two Hamiltonian paths for the 15 3-mers. Edges spelling “genomes” TAATGCCATGGGATGTT and TAATGGGATGCCATGTT are highlighted in black. (Fig. 4.9. from [CP](http://bioinformaticsalgorithms.com/)).


The reason for this “duality” is the fact that we have a repeat: 3-mer ATG is present three times in our data (green, red, and blue). Repeats cause a lot of trouble in genome assembly.


Limits of OLC

=> deBruijn Graphs


de Bruijn Graphs

Based on a counter-intuitive idea


What are K-mers?

K-mers are subwords of length k of a string


K-mers de Bruijn graph

Example with 4 english words


K-mers de Bruijn graph

Example with 4 english words


K-mers de Bruijn graph

Example with 4 english words


K-mers de Bruijn graph

Example with 4 english words


The problem of repeats

Example with a word containing repeats: Mississippi


The problem of repeats

Example with a word containing repeats: Mississippi


The problem of repeats

Example with a word containing repeats: Mississippi


The problem of repeats

Example with a word containing repeats: Mississippi


Different k

Effect of the size of k

Different k

Effect of the size of k

Different k

Effect of the size of k

Different k

Effect of the size of k


Choose k wisely

Optimum value for k will balance these effects.


Read errors

.image-75[Example of english words with errors] Example of english words with errors

Read errors

.image-75[Example of english words with errors] Example of english words with errors

Read errors

.image-75[Example of english words with errors] Example of english words with errors

Read errors

.image-75[Example of english words with errors] Example of english words with errors


Coverage (or sequencing depth)

The number of unique reads that contain a given nucleotide in the reconstructed sequence.

Aligned reads showing the notion of coverage

Source: Wikimedia

Estimated coverage

(number of reads x read length) / estimated size of the genome

1,000,000 reads x 150bp / 10Mb = 15X


More coverage

More coverage depth will help overcome errors!

Read errors revisited

Graph annotated with coverage


de Bruijn graph assembly process

  1. Select a value for k
  2. “Hash” the reads (make the kmers)
  3. Count the kmers
  4. Make the de Bruijn graph
  5. Perform graph simplification steps
  6. Read off contigs from simplified graph

Make contigs


Assembly in real life

In this topic we’ve learned about two ways of representing the relationship between reads derived from a genome we are trying to assemble:

  1. Overlap graphs - nodes are reads, edges are overlaps between reads.
  2. De Bruijn graphs - nodes are overlaps, edges are reads.

Overlap graph

An example overlap graph


de Bruijn Graph - same data

An example de Bruijn graph


Whatever the representation will be it will be messy:

An example of a real-life, noisy graph

A fragment of a very large De Bruijn graph (Image from BL).


There are multiple reasons for such messiness:

Sequence errors

Sequencing machines do not give perfect data. This results in spurious deviations on the graph. Some sequencing technologies such as Oxford Nanopore have very high error rate of ~15%.

Zoom on graph region containing sequencing errors

Graph components resulting from sequencing errors (Image from BL).


Ploidy

As we discussed earlier humans are diploid and there are multiple differences between maternal and paternal genomes. This creates “bubbles” on assembly graphs:

Impact of diploid genome sequencing on graphs

Bubbles due to a heterozygous site (Image from BL).


Repeats

As we’ve seen the third law of assembly is unbeatable. As a result some regions of the genome simply cannot be resolved and are reported in segments called contigs:

Impact of repeats on graphs

The following “genomic” segment will be reported in three pieces corresponding to regions flanking the repeat and repeat itself (Image from BL).


What to do with my reads?

Hybrid assembly: long reads to resolve repeats, short reads to correct errors

Other data and tools for polishing (scaffolding, gap filling, …)


Other data: Optical maps

Optical mapping workflow (Source: Wikimedia)

Use for scaffolding by comparing the restriction map with the predicted restriction sites in the contig sequences


Other data: Hi-C

Hi-C sequencing workflow (Source: Dovetail genomics)

Sequence chunks of genome colocalized


Other data: Linked-Reads

Linked-Reads workflow (Source: 10X)

Reads grouped by physical regions on the genome thanks to barcodes

10X Genomics (discontinued in 2020)


Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! Galaxy Training Network This material is licensed under the Creative Commons Attribution 4.0 International License.