name: inverse layout: true class: center, middle, inverse
---
# Introduction to metatranscriptomics
Subina Mehta
Pratik Jagtap
Saskia Hiltemann
last_modification
Updated:
purl
PURL
:
gxy.io/GTN:S00115
text-document
Plain-text slides
|
Tip:
press
P
to view the presenter notes |
arrow-keys
Use arrow keys to move between slides
??? Presenter notes contain extra information which might be useful if you intend to use these slides for teaching. Press `P` again to switch presenter notes off Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other. Useful when presenting. --- ## Requirements Before diving into this slide deck, we recommend you to have a look at: - [Introduction to Galaxy Analyses](/training-material/topics/introduction) --- ### <i class="far fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions - How to analyze metatranscriptomics data? - What information can be extracted of metatranscriptomics data? - How to assign taxa and function to the identified sequences? --- ## Why study the microbiome? .pull-left[ - Health care research - Humans are full of microorganisms - Skin, gut, oral cavity, nasal cavity, eyes, .. - Affects health, drug efficacy, etc ] .pull-right[ .image-100[  ] ] - Sometimes referred to as your **second genome** - ~10 times more cells than you - ~100 times more genes than you - ~1000s different species --- ## Why study the microbiome? - Environmental studies - Microbes in the soil affect plants and animals - Improve agriculture .image-75[  ] --- ## Meta- Omics  --- ## This Tutorial: ASaiM pipeline .pull-left[ - Quality Control - Assess Quality - Trim and Filter raw reads - Filter ribosomal RNA (rRNA) - Community profiling (Who?) - Determine composition of sample - Visualisation - Functional Analysis (What?) ] .pull-right[ .image-90[] ] .footnote[Batut et al Gigascience. 2018 7(6) doi: 10.1093/gigascience/giy057] ??? For this short tutorial, while the workflow is running, these slides can be useful to explain the tools that are being run in that section. After explaining the tools, the workflows should be far enough along to start showing the results --- ## Input: Cellulose Degradation in a Biogas Reactor  ??? A 100 µl aliquot of an enriched community from a biogas reactor was transferred to 27 anaerobic bottles containing a rich medium and 10g/L of cellulose as sole carbon source and incubated at 65 °C. Three bottles were collected at 9 different time points (0, 8, 13, 18, 23, 28, 33, 38 and 43 h) and processed in triplicates. Metatranscriptomic analysis was performed on all time points. Metaproteomics analysis on 4 data points. --- ## Input Format: FastQ Files - Four lines per read  ??? - Four lines per read - `@` + identifier on first line, just like fasta - sequence - `+` - quality score characters Segue: so what do the quality chars mean? --- ## FastQ: Quality score - Each character denotes a different Phred score  - Phred scores are logarithmic .small[ Phred Quality Score | Probability of incorrect base call | Base call accuracy --- | --- | --- 10 | 1 in 10 | 90% 20 | 1 in 100 | 99% 30 | 1 in 1000 | 99.9% 40 | 1 in 10,000 | 99.99% ... ] ??? - Logarithmic scale - Different flavours of encoding exist --- ## Preprocessing In this tutorial we start with some preprocessing steps  --- ## Preprocessing: Tools In this tutorial we start with some preprocessing steps  ??? | Step | Tools | |:------|-------| |Quality Control reports | FastQC <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> and MultiQC <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> | |Trimming and Filtering | Cutadapt <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> | |Filter ribosomal RNA | SortMeRNA <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> | |Interlace FastQ files | FastQ interlacer <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> | --- ## Quality Reports: FastQC - Generate a web report with quality metrics of your FastQ file  .footnote[see also our [dedicated QC tutorial](/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html) ] --- ## Quality Reports: FastQC - Many different QC plots - Example: Per-base sequence Quality plot  .footnote[explanation of different plots: [dedicated QC tutorial](/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html) ] ??? - in the per-base sequence quality plot, a boxplot of the base quality (y-axis) per position in the read (x-axis) is drawn - often you might observe a drop in quality towards the end of the reads, and may consider trimming ends - this example is very good --- ## Quality Reports: FastQC - Many more plots - See [QC tutorial](/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html) for more information  --- ## Quality Reports: MultiQC .pull-left[ - Combine multiple FastQC reports into one report - Also for outputs of other tools - Great when sequencing large numbers of samples ] .pull-right[  ] --- ## Read Trimming and Filtering: Cutadapt - Trim low-quality bases from reads - Filter reads based on length, mean quality score, .. - Remove adapters/primers .image-75[] - Many tools: CutAdapt <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span>, TrimGalore <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span>, Trimmomatic <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> .. ??? These are some examples of ways to trim and filter data, but many more are possible and depend on your experiment what is necessary --- ## SortMeRNA .pull-left[ - Most RNA sequences will be ribosomal RNA (rRNA) - Great for taxonomical assignment (who is there?) - Not informative for functional analysis (What are they doing?) - Filter out rRNA before doing functional analysis ] .pull-right[ .image-90[] ] --- ## FastQ interlacer - Paired-end data often comes in two separate FastQ files - One file with *forward* reads, one with *reverse* reads <br>  --- ## FastQ interlacer - Some tools require a single **interlaced** FastQ file - Galaxy has tools for **interlacing** and **deinterlacing** FastQ files <br>  ??? forward and reverse files are 'zipped' together into a single file --- ## Community Profile - We want to identify which organisms are present in our sample, and their relative abundances  - **MetaPhlan2** <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> for identification - **Krona** <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> and **Graphlan** <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> for visualisation --- ## MetaPhlan2 <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> - Estimates the presence and relative abundance of microbial cells - Maps reads against a set of marker sequences - Caveat: this tool is designed for DNA-seq - Be careful interpreting abundances when using this tool with transcriptomics data .footnote[Nat Methods. 2012 Jun 10;9(8):811-4. doi: 10.1038/nmeth.2066.] ??? About the caveat: The theoretical problem is that we quantify species abundance by averaging the coverage of marker genes. Marker genes are supposed to be at the same coverage as they are single copy genes from the same genome, but this is not true for their transcripts. So MetaPhlAn2 on metatranscriptomics gives an idea about the average transcriptional rate of a given species. So it _can_ be used with caution... --- ## Krona <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> - Visualization of community composition, interactive plot <iframe id="krona" src="krona.html" frameBorder="0" width="80%" height="600px">  </iframe> --- ## Graphlan <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> - *Cladogram* visualisation  --- ## Genus Abundance - Tutorial: one timepoint - Over multiple timepoints:  --- ## Functional Analysis - Pathways - Gene Ontology - Biological process - Molecular function - Cellular component - Gene Family --- ## Workflow  ??? HUMAnN2 - next generation - HMP Unified Metabolic Analysis Network - developed by Huttenhower lab - itself a workflow/pipeline - basically answering the question about what the microbial community is capable of? --- ## HUMAnN2 <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span> - Profiles presence/absence and abundance of microbial community - Efficiently characterizes microbial metabolic pathways - **Input** - Interlaced non-rRNA reads - Taxonomic profile (MetaPhlAn2 <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span>) - **Output** - Gene families and their abundance - Pathways and their coverage - Pathways and their abundance ??? - contains 5 parts -> non rRNA reads, MetaPhlAn taxonomy, NCBI nucleotide db, Uniref 50/90 protein db, MetaCyc/Unipathway. - Show the Galaxy wrapper --- ## HUMAnN2 Tiered Search --- class: top .left-column70[ <br> - Meta-omic sequences (DNA/RNA) ] .right-column30[ .image-60[] ] -- .left-column70[ <br> - Initial screen through **MetaPhlAn2** <i class="fas fa-wrench" aria-hidden="true"></i><span class="visually-hidden">tool</span>: known microbial species - Database: merging pangenomes of identified species ] .right-column30[ .image-60[] ] -- .left-column70[ <br> - Nucleotide-level mapping against database ] .right-column30[ .image-60[] ] -- .left-column70[ <br> - Unaligned reads searched against proteinDB (Uniref) through accelerated translated search ] .right-column30[ .image-75[] ] ??? Takes non rRNA reads + MetaPhlAn2 gives list of abundant organism, then it does Nucleotide level pangenome mapping with Bowtie and uses CHocophlAN db giving unmapped and organims specific gene hits, the unmapped reads are further searched against accelerated translated protein database the protein hits are tehn combined with gene hits and metacyc to give the output. --- **Result:** Gene family and pathway abundances .image-40[] --- ## Gene Families Abundances  RPK (reads per kilobase) = sum of alignment scores ??? Gene families: groups of evolutionary related protein that perform similar function Pathway: sum over genes catalyzing the reaction Pathway coverage: presence/absence RPK relative gene copy number : is computed as the sum of all alignments scores over a particular gene family UNMAPPED: total number of reads that remained unmapped even after both alignment steps UNINTEGRATED: no pathway detected. --- ## Gene Families to Functional Annotation  ??? Gene familes are too large depending on the complexity thus to simplify users can regroup gene families using grouping tool, can download mapping files. HUMAnN2 regroups Uniref 50/90 values to Go terms to get a broad overview. --- ## Group Abundances  ??? Group abundances converts GO terms to Go slim (subset of GO terms) into Mol function, biological process and cellular components. --- ## Gene Families to Functional Annotation    --- class: top ## Output .left-column30[ <br> Molecular Function ] .right-column70[ .image-90[] ] -- .left-column70[ .image-90[] ] .right-column30[ <br> Biological Process ] -- .right-column70[ .image-90[] ] .left-column30[ <br><br> Cellular Component ] ??? g is genus s is species level --- ## Unpack pathway abundances to show genes included - Renormalize the gene and pathway abundances in copies per million or relative abundance - This tool unpacks the pathways abundance by including gene families  --- ## Function: Cellulose Degradation - Quantitative analysis of gene family outputs from HUMAnN2 shows upregulation of cellulase .image-75[] ??? explain about datasets first cellulose 1,4 beta-cellulobiosidase responsible for hydrolysis of cellulose Gene encoding for the cellulose-binding domain protein shows an initial decrease and subsequent increase during cellulose degradation. --- ## Functions associated with a selected taxon .image-75[] ??? In gene abundance, Coprothermobacter and Clostridium were observed to be the most abundant. In this figure we are looking at Coprothermobacter only->Glycolysis is observed to be the most abundant functional pathway across time points in Coprothermobacter --- ## Taxa associated with a selected function .image-75[] ??? This figure shows the contribution of genera to adenosine ribonucleotides denovo biosynthesis across time points. it shows during ATP synthesis, we see clostridium and coprothermobacter in abundance. --- # Tabular Outputs from ASaIM Workflow - Taxonomy (Who?) - Kingdom, phylum, class, order, family, genus, species, strain - Function (What?) - Pathways - Gene Ontology - Biological Process - Molecular Function - Cellular Component - Gene Family --- ## Thank You! This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!
Author(s)
Subina Mehta
Pratik Jagtap
Saskia Hiltemann
Reviewers
Tutorial Content is licensed under
Creative Commons Attribution 4.0 International License
.