name: inverse layout: true class: center, middle, inverse
# Unicycler assembly of SARS-CoV-2 genome with preprocessing to remove human genome reads
--- ## Requirements Before diving into this slide deck, we recommend you to have a look at: - [Introduction to Galaxy Analyses](/archive/2021-05-01/topics/introduction) - [Sequence analysis](/archive/2021-05-01/topics/sequence-analysis) - Quality Control: [
slides](/archive/2021-05-01/topics/sequence-analysis/tutorials/quality-control/slides.html) - [
hands-on](/archive/2021-05-01/topics/sequence-analysis/tutorials/quality-control/tutorial.html) .footnote[Tip: press `P` to view the presenter notes] ??? Presenter notes contain extra information which might be useful if you intend to use these slides for teaching. Press `P` again to switch presenter notes off Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other. Useful when presenting. --- ### <i class="far fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions - How can a genome of interest be assembled against a background of contaminating reads from other genomes? - How can sequencing data used to obtain an assembled genome? --- ### <i class="fas fa-bullseye" aria-hidden="true"></i><span class="visually-hidden">objectives</span> Objectives - Know basic characteristics of SARS-CoV-2 - Understand Nanopore and Illumina technologies - Detect and remove human reads from SARS-CoV-2 sequencing reads - Know main _de novo_ genome assembly algorithms - Perform a hybrid _de novo_ genome assembly --- .enlarge120[ # **SARS-CoV-2** ] The severe acute respiratory syndrome coronavirus, known as SARS-CoV-2, is a betacoronavirus which belongs to the subfamily Coronavidinae, family Coronavidae. * Genome characteristics: * Positive-sense single-stranded RNA (+ssRNA) virus of 30 kb. * Encode 9860 aminoacids. * Includes 14 functional ORFs. * Codify 4 structural proteins and 23 non-structural proteins. --- # **SARS-CoV-2 genome structure** !(../../images/SARS-CoV-2_Genome.png) - NSP1-NSP16 encodes the replicase-transcriptase complex - It includes four structural proteins: Spike (S), Envelope (E), Membrane (M) and Nucleocapsid (N). --- # **SARS-CoV-2 structure** !(../../images/structure_SARS-CoV_2.png) --- .enlarge200[ # **Hybrid assembly** ] .reduce70[ Hybrid assembly consists in using a combination of long and short reads to produce genome sequence. Long reads are used to resolve ambiguities that exist in genomes previously assembled using the short reads. In addition, low rate-error short reads are used to correct errors that exist in the error-prone long reads. ] !(../../images/hybrid_assembly.png) --- .enlarge200[ # **Data sources** ] .pull-left[ **Illumina reads** - Second generation sequencing - Short size: 200 bp - Low error rate (~1%) !(../../images/illumina_sequencing.png) ] .pull-right[ **Oxford Nanopore reads** - Third generation sequencing - Long size: >40,000 bp - High error rate (~10%) !(../../images/nanopore_sequencing.png) ] --- .enlarge120[ # **Data sources: Illumina sequencing** ] !(../../images/illumina_sequencing_cluster.png) --- .enlarge200[ # **Data sources: Nanopore sequencing** ] !(../../images/nanopore_sequencing_process.png) --- .enlarge200[ # **Quality control** ] Quality control, read trimming and filtering are essential preprocessing steps required to garantee accurate results from RNA-seq datasets. Due to their very different nature, Illumina and Nanopore reads should be processed by using different tools. !(../../images/rna-seq_processing.png) .reduce70[ ] --- .enlarge200[ # **Subtraction of reads mapping to the human reference genome** ] Since the SARS-CoV-2 samples were obtained from human tissues, it is necessary to retain only the reads that don't map to the human genome, i.e those of potential viral origin. --- .enlarge200[ # **Subtraction of reads mapping to the human reference genome** ] .image-75[!(../../images/mapping_human_genome.png)] .reduce70[ As with quality control, differential characteristics of Illumina and Nanopore reads require different tools for mapping the reads to the human genome: - __Bowtie2__: It is optimized for the read lengths and error modes yielded by typical Illumina sequencers. - __Minimap2__: It is particularly efficient for mapping Nanopore long reads. ] --- .enlarge200[ # **Genome assembly** ] Now everything is ready to perform genome assembly! !(../../images/jigsaw_genome.png) Genome assembly is a complex computational process whose objetive is to reconstruct a genome from the reads obtained by sequencing technologies. --- .enlarge200[ # **_De novo_ genome assembly** ] _De novo_ assembly is a method for constructing genomes from a large number of DNA fragments, with no a priori knowledge of the correct sequence or order of those fragments. Two common types of _de novo_ assemblers are greedy algorithm assemblers and graph method assemblers. --- .enlarge200[ # **_De novo_ genome assembly algorithms** ] .pull-left[ __Greedy algorithm assemblers__ It finds overlaps between reads, then builds a consensus sequence from the aligned overlapping reads. </br> .reduce70[ - Relative efficiency - Do not work well for large read sets because only takes into account local information - Do not perform well with repeat regions ] .image-75[!(../../images/greedy_algoritm.png)] ] .pull-right[ __Graph method assemblers__ Basically it represent reads as a set of nodes, and overlaps between these reads as directed edges which connect these nodes to form a complete graph. .reduce70[ - Computationally more expensive - Aim for global optima - Perform well on large read sets, specially when they contain repeat regions. ] .image-75[!(../../images/graph_example.png)] ] --- .enlarge200[ # **Graph methods assemblers: de Brujin graphs** ] .reduce70[ De Bruijn graphs is the graph model used by most genome assemblers. During the assembly process reads are broken into smaller fragments of a specified size, the k-mers, whichs are then used as nodes in the graph assembly. Nodes that overlap are then connected by an edge, which represents the reads. An ideal genome assembly corresponds to the path that visits every node exactly once. ] .image-75[!(../../images/olc_pic.png)] --- .enlarge200[ # **Graph methods assemblers: de Brujin graphs** ] The [de Brujin graph assembly tutorial](https://training.galaxyproject.org/training-material/topics/assembly/tutorials/debruijn-graph-assembly/slides.html#1) provides a detailed explanation about this topic. --- .enlarge200[ # **Assembly genome with Unicycler** ] .image-50[!(../../images/unicycler_logo.png)] Unicycler is a software tool designed specifically for hybrid assembly of small genomes. --- .enlarge200[ # **Assembly genome with Unicycler** ] It employs a multi-step process that utilizes a set of software tools. !(../../images/unicycler_pipeline.png) --- ### <i class="fas fa-key" aria-hidden="true"></i><span class="visually-hidden">keypoints</span> Key points - Certain types of NGS samples can be heavily contaminated with sequences from other genomes. - Reads from known/expected contaminating sources can be identified by mapping to the respective genomes. - The different characteristics of Illumina and Nanopore sequencing technologies require processing by different tools. - Hybrid genome assembly allows to obtain high quality genome sequences. --- ## Thank you! This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://wiki.galaxyproject.org/Teach/GTN) and all the contributors!
This material is licensed under the
Creative Commons Attribution 4.0 International License