
Unicycler assembly of SARS-CoV-2 genome with preprocessing to remove human genome reads



PURL: gxy.io/GTN:S00031

1 / 23

Requirements

Before diving into this slide deck, we recommend you to have a look at:

2 / 23

Questions

  • How can a genome of interest be assembled against a background of contaminating reads from other genomes?

  • How can sequencing data be used to obtain an assembled genome?

3 / 23

Objectives

  • Know basic characteristics of SARS-CoV-2

  • Understand Nanopore and Illumina technologies

  • Detect and remove human reads from SARS-CoV-2 sequencing reads

  • Know main de novo genome assembly algorithms

  • Perform a hybrid de novo genome assembly

4 / 23

SARS-CoV-2

Severe acute respiratory syndrome coronavirus 2, known as SARS-CoV-2, is a betacoronavirus that belongs to the subfamily Orthocoronavirinae, family Coronaviridae.

  • Genome characteristics:

    • Positive-sense single-stranded RNA (+ssRNA) virus with a genome of about 30 kb.
    • Encodes 9,860 amino acids.
    • Includes 14 functional ORFs.
    • Encodes 4 structural proteins and 23 non-structural proteins.
5 / 23

SARS-CoV-2 genome structure

Structure of the SARS-CoV-2 genome: a 5' UTR, the polyproteins pp1a/pp1ab, and several structural and accessory proteins before the 3' UTR. The pp1ab polyprotein is shown expanded into a series of non-structural proteins labelled nsp1 to nsp16.

  • The non-structural proteins NSP1-NSP16 form the replicase-transcriptase complex.
  • The genome encodes four structural proteins: Spike (S), Envelope (E), Membrane (M) and Nucleocapsid (N).
6 / 23

SARS-CoV-2 structure

A graphic of the virus as a sphere with spikes (S1 and S2) coming out from the membrane, envelope proteins embedded within the membrane, and a nucleocapsid inside. The 3D structure of the spike protein is shown in two styles, with the receptor binding domain highlighted.

7 / 23

Hybrid assembly

Hybrid assembly uses a combination of long and short reads to produce a genome sequence.

Long reads are used to resolve ambiguities that remain in genomes assembled from short reads alone. In turn, short reads, which have a low error rate, are used to correct the errors in the error-prone long reads.

Cartoon of hybrid assembly. Step 1 shows short reads from Illumina and long reads from Nanopore. In step 2 these are assembled separately and there are ambiguities in sequence assembly. In step 3, hybrid assembly shows the assembly done with both sets of data and it helps resolve ambiguities with higher coverage.

8 / 23

Data sources

Illumina reads

  • Second generation sequencing
  • Short size: 200 bp
  • Low error rate (~1%)

A picture of an Illumina sequencer. It is a large white and grey box with a computer screen.

Oxford Nanopore reads

  • Third generation sequencing
  • Long size: >40,000 bp
  • High error rate (~10%)

A picture of someone holding a nanopore device, approximately the size of an oversize usb stick with a chip visible.
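
The error rates quoted above can be put on the Phred quality scale used in FASTQ files. The following is a minimal sketch, assuming the standard definition Q = -10 * log10(p); the function name is illustrative and not part of any tool mentioned in these slides.

```python
import math


def phred_from_error_rate(p: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


# Approximate error rates taken from the slide above.
illumina_q = phred_from_error_rate(0.01)  # ~1% error rate
nanopore_q = phred_from_error_rate(0.10)  # ~10% error rate
```

Under this definition, a 1% error rate corresponds to roughly Q20 and a 10% error rate to roughly Q10.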

9 / 23

Data sources: Illumina sequencing

Cartoon of Illumina sequencing. DNA is fragmented and adapters are added. The fragments bind to oligonucleotides in the nanowells. The DNA then bends over and attaches to a neighbouring oligonucleotide. A primer attaches to the adapter and a polymerase adds fluorescently tagged dNTPs; imaging happens as these are added. The double strand is then denatured, leaving two single strands bound to different adapters in the well. This process is repeated many times.

10 / 23

Data sources: Nanopore sequencing

Single nucleic acid molecules pass through a nanopore while changes in the electrical current across the pore are measured. The magnitude of the current depends on which nucleotides occupy the nanopore. This produces a signal trace which is then read into individual bases.

11 / 23

Quality control

Quality control, read trimming and filtering are essential preprocessing steps that help ensure accurate results from RNA-seq datasets. Because of their very different nature, Illumina and Nanopore reads should be processed with different tools.

Schematic of a workflow. The input is an RNA-seq dataset consisting of Illumina and Nanopore reads. These go through quality assessment with fastp and NanoPlot, then trimming and filtering, producing processed reads.
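
The filtering idea can be illustrated with a small, self-contained sketch. It is not how fastp or NanoPlot work internally; it only shows the principle of discarding reads that are too short or whose mean Phred quality is too low. The thresholds, read names and function names are assumptions made for the example, and Phred+33 quality encoding is assumed.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def parse_fastq(text: str) -> Iterator[Read]:
    """Yield (name, sequence, quality) from simple 4-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        name, seq, _plus, qual = lines[i:i + 4]
        yield name[1:], seq, qual


def mean_quality(qual: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores (Phred+33 encoding assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(reads: Iterable[Read], min_len: int, min_mean_q: float) -> List[Read]:
    """Keep reads that are long enough and of sufficient mean quality."""
    return [
        r for r in reads
        if len(r[1]) >= min_len and mean_quality(r[2]) >= min_mean_q
    ]


# Toy data: one good read, one low-quality read, one short read.
example = """@good
ACGTACGT
+
IIIIIIII
@low_quality
ACGTACGT
+
!!!!!!!!
@too_short
ACG
+
III
"""

kept = filter_reads(parse_fastq(example), min_len=5, min_mean_q=20)
kept_names = [name for name, _seq, _qual in kept]
```

In practice the thresholds differ between technologies, which is one reason why Illumina and Nanopore reads are handled by separate tools.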

12 / 23

Subtraction of reads mapping to the human reference genome

Since the SARS-CoV-2 samples were obtained from human tissues, it is necessary to retain only the reads that do not map to the human genome, i.e. those of potential viral origin.

13 / 23

Subtraction of reads mapping to the human reference genome

A set of reads and a human genome are put together, mapping done with bowtie2 or minimap2 to identify reads which map to the human genome. (Then these are removed.)

As with quality control, differential characteristics of Illumina and Nanopore reads require different tools for mapping the reads to the human genome:

  • Bowtie2: It is optimized for the read lengths and error modes yielded by typical Illumina sequencers.

  • Minimap2: It is particularly efficient for mapping Nanopore long reads.
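
Once reads have been mapped, subtraction amounts to keeping the reads that did not align. Both mappers can write SAM output, in which bit 0x4 of the FLAG field marks an unmapped read and bit 0x8 marks an unmapped mate. The sketch below parses SAM text in plain Python to show the selection logic; the function names and the toy records are assumptions for illustration, and the choice to keep a paired read only when both mates are unmapped is one possible policy rather than the one prescribed by these slides.

```python
from typing import List

PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read did not align
MATE_UNMAPPED = 0x8  # the mate did not align


def is_non_human(flag: int) -> bool:
    """True if the read (and its mate, when paired) failed to align."""
    if not flag & UNMAPPED:
        return False
    if flag & PAIRED:
        return bool(flag & MATE_UNMAPPED)
    return True


def non_human_read_names(sam_text: str) -> List[str]:
    """Return names of reads to keep, in order of first appearance."""
    names: List[str] = []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.split("\t")
        name, flag = fields[0], int(fields[1])
        if is_non_human(flag) and name not in names:
            names.append(name)
    return names


# Toy SAM records (hypothetical reads aligned against a human reference).
sam = "\n".join([
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",  # mapped
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",         # unmapped
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",        # pair, both unmapped
    "read3\t141\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read4\t73\tchr2\t50\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",  # mapped, mate unmapped
    "read4\t133\tchr2\t50\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",   # unmapped, mate mapped
])

viral_candidates = non_human_read_names(sam)
```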

14 / 23

Genome assembly

Now everything is ready to perform genome assembly!

A picture of a jigsaw puzzle with a DNA image, and several missing pieces scattered around.

Genome assembly is a complex computational process whose objective is to reconstruct a genome from the reads obtained by sequencing technologies.

15 / 23

De novo genome assembly

De novo assembly is a method for constructing genomes from a large number of DNA fragments, with no a priori knowledge of the correct sequence or order of those fragments.

Two common types of de novo assemblers are greedy algorithm assemblers and graph method assemblers.

16 / 23

De novo genome assembly algorithms

Greedy algorithm assemblers

They find overlaps between reads, then build a consensus sequence from the aligned overlapping reads.

  • Relatively efficient
  • Do not work well for large read sets because they only take local information into account
  • Do not perform well with repeat regions

Rainbow coloured reads are aligned locally to make small high quality overlaps. These are then built up into a larger consensus with the entire rainbow.
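
The greedy strategy can be sketched in a few lines: repeatedly merge the two sequences with the longest suffix-prefix overlap until no overlap of sufficient length remains. This is a toy illustration on exact, error-free reads from one strand; the function names, the minimum overlap and the example reads are assumptions, and real assemblers handle sequencing errors and reverse complements.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Merge the best-overlapping pair until no pair overlaps enough."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # only local information is used: no further merges possible
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping toy reads from the sequence AGCTTAGGCATCG.
contigs = greedy_assemble(["AGCTTAG", "TTAGGCA", "GGCATCG"])
```

Because each merge is chosen from the current best overlap only, a repeated region can lead the algorithm to join reads that are not adjacent in the genome, which is the weakness noted in the list above.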

Graph method assemblers

They represent reads as a set of nodes, and overlaps between these reads as directed edges that connect the nodes to form a graph.

  • Computationally more expensive
  • Aim for global optima
  • Perform well on large read sets, especially when they contain repeat regions.

A graph with many nodes connected by lines in a large tangle.

17 / 23

Graph method assemblers: de Bruijn graphs

De Bruijn graphs are the graph model used by most genome assemblers.

During the assembly process, reads are broken into smaller fragments of a specified size, the k-mers, which are then used as nodes in the assembly graph. Nodes that overlap are connected by a directed edge, reflecting their adjacency in the reads. In this representation, an ideal genome assembly corresponds to a path that visits every node exactly once (a Hamiltonian path).

Reads are provided to the algorithm; they are in the colours of the rainbow. Next, overlaps are identified and the rainbow resolves itself. A subset of that is highlighted and points to reads connected by overlaps, with many arrows going between the blue-green fragments that are highlighted. This leads to the identified Hamiltonian path, with a re-run arrow in between indicating that some amount of backtracking is needed to find the best path. Finally, the Hamiltonian path produces a consensus sequence with the correct final ordering.
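
The description above can be turned into a toy example: split reads into k-mers, connect k-mers that overlap by k-1 bases, and search with backtracking for a path that visits every node once. This is a sketch under simplifying assumptions (error-free reads, a single strand, no repeated (k-1)-mers), and all function names and reads are illustrative. Backtracking search for a Hamiltonian path becomes very slow as the graph grows, so practical assemblers typically rely on more efficient formulations, such as Eulerian paths in which k-mers are edges.

```python
from typing import Dict, List, Optional, Set


def distinct_kmers(reads: List[str], k: int) -> List[str]:
    """Distinct k-mers from all reads, in order of first appearance."""
    seen: Dict[str, None] = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            seen.setdefault(read[i:i + k], None)
    return list(seen)


def build_graph(nodes: List[str]) -> Dict[str, List[str]]:
    """Edge u -> v when the last k-1 bases of u equal the first k-1 of v."""
    return {u: [v for v in nodes if v != u and u[1:] == v[:-1]] for u in nodes}


def hamiltonian_path(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Backtracking search for a path that visits every node exactly once."""
    total = len(graph)

    def extend(path: List[str], visited: Set[str]) -> Optional[List[str]]:
        if len(path) == total:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None  # dead end: backtrack

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path: List[str]) -> str:
    """Reconstruct the sequence: first k-mer plus the last base of each next one."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy reads sampled from the sequence ATGCGTACC, with k = 3.
reads = ["ATGCGT", "GCGTAC", "GTACC"]
nodes = distinct_kmers(reads, k=3)
path = hamiltonian_path(build_graph(nodes))
assembly = spell(path) if path else ""
```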

18 / 23

Graph method assemblers: de Bruijn graphs

The de Bruijn graph assembly tutorial provides a detailed explanation of this topic.

19 / 23

Genome assembly with Unicycler

unicycler logo

Unicycler is a software tool designed specifically for hybrid assembly of small genomes.

20 / 23

Genome assembly with Unicycler

It employs a multi-step process that combines several software tools.

Schematic of the Unicycler pipeline. In one branch, Illumina short reads are assembled with SPAdes into an assembly graph. In another branch, Nanopore reads go through miniasm and Racon to be assembled and polished into long-read contigs. Bridge application and contig merging combine the assembly graph and the long reads, and the result is sent to Bowtie2 and Pilon for polishing, producing the final assembly.
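
Outside Galaxy, a hybrid run is started by giving Unicycler the paired short reads, the long reads and an output directory. The sketch below only builds the command line and does not execute it. It assumes a local Unicycler installation and paired-end Illumina data; the file names and the helper function are hypothetical, while `-1`, `-2`, `-l` and `-o` are Unicycler's options for the two short-read files, the long reads and the output directory.

```python
import shlex
from typing import List


def unicycler_command(short_1: str, short_2: str, long_reads: str, out_dir: str) -> List[str]:
    """Build (but do not run) a Unicycler hybrid assembly command."""
    return [
        "unicycler",
        "-1", short_1,     # first file of the paired-end short reads
        "-2", short_2,     # second file of the paired-end short reads
        "-l", long_reads,  # long reads
        "-o", out_dir,     # output directory
    ]


# Hypothetical file names for the preprocessed, non-human reads.
cmd = unicycler_command(
    "illumina_R1.fastq.gz",
    "illumina_R2.fastq.gz",
    "nanopore.fastq.gz",
    "unicycler_out",
)
command_line = shlex.join(cmd)
```

Where Unicycler is installed, the list could be passed to `subprocess.run(cmd, check=True)` to launch the assembly.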

21 / 23

Key points

  • Certain types of NGS samples can be heavily contaminated with sequences from other genomes.

  • Reads from known/expected contaminating sources can be identified by mapping to the respective genomes.

  • The different characteristics of Illumina and Nanopore sequencing technologies require processing by different tools.

  • Hybrid genome assembly makes it possible to obtain high-quality genome sequences.

22 / 23

Thank You!

This material is the result of collaborative work. Thanks to the Galaxy Training Network and all the contributors!

Author(s): Cristóbal Gallardo
Reviewers: Saskia Hiltemann, Björn Grüning, Helena Rasche, Cristóbal Gallardo
Galaxy Training Network

Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.

23 / 23
