
Introduction to metatranscriptomics

Contributors

Subina Mehta, Pratik Jagtap, Saskia Hiltemann

Questions

Last modification: Jul 9, 2021

Why study the microbiome?

.pull-left[

] .pull-right[

.image-100[ Image of a human with various pie charts pointing to various regions of the body where microbe populations live ]

]


Why study the microbiome?

.image-75[ Rhizodeposition: image of a tree converting sunlight and CO2 into fixed carbon used as food for soil microbes. ]


Meta- Omics

meta-omics diagram


This Tutorial: ASaiM pipeline

.pull-left[

]

.pull-right[ .image-90[ASaiM diagram] ]

.footnote[Batut et al. GigaScience. 2018;7(6). doi: 10.1093/gigascience/giy057]

Speaker Notes

In this short tutorial, these slides can be used to explain the tools in each section while the corresponding workflow is running. By the time the tools have been explained, the workflow should be far enough along to start showing the results.


Input: Cellulose Degradation in a Biogas Reactor

Workflow graph showing biogas reactor extract being transferred to cellulose and incubated. Time series samples are taken and run through a mass spectrometer and genomic sequencer.

Speaker Notes

A 100 µl aliquot of an enriched community from a biogas reactor was transferred to 27 anaerobic bottles containing a rich medium and 10 g/L of cellulose as the sole carbon source, and incubated at 65 °C.

Three bottles were collected at each of 9 time points (0, 8, 13, 18, 23, 28, 33, 38 and 43 h) and processed in triplicate. Metatranscriptomic analysis was performed on all time points; metaproteomic analysis was performed on 4 of them.


Input Format: FastQ Files

Image of a FastQ file with the label on the first line, the sequence on the second, + on the third, and quality scores on the fourth as ASCII characters. A callout shows that Base=T, quality=colon, and that means a score of 25.

Speaker Notes

Segue: so what do the quality chars mean?


FastQ: Quality score

.small[
Phred Quality Score | Probability of incorrect base call | Base call accuracy
--- | --- | ---
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
… | |
]
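The relationship between a quality character, its Phred score and the error probability can be sketched in a few lines of Python. This is a minimal illustration, assuming the common Phred+33 encoding (the offset of 33 is an assumption about the input files, and the function names are hypothetical); it is consistent with the callout where a colon corresponds to a score of 25.

```python
def phred_from_char(char: str, offset: int = 33) -> int:
    """Decode one FastQ quality character into a Phred score.

    The default offset of 33 assumes Phred+33 encoded files.
    """
    return ord(char) - offset


def error_probability(phred: int) -> float:
    """Probability that a base call with this Phred score is incorrect."""
    return 10 ** (-phred / 10)


if __name__ == "__main__":
    # Print the character, its Phred score, and the error probability.
    for char in (":", "5", "I"):
        score = phred_from_char(char)
        print(char, score, f"{error_probability(score):.4f}")
```

A score of 20 gives an error probability of 1 in 100, matching the second row of the table.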

Speaker Notes


Preprocessing

In this tutorial we start with some preprocessing steps

preprocessing workflow


Preprocessing: Tools

In this tutorial we start with some preprocessing steps

preprocessing workflow

Speaker Notes

| Step | Tools |
|:-----|-------|
| Quality control reports | FastQC tool and MultiQC tool |
| Trimming and filtering | Cutadapt tool |
| Filter ribosomal RNA | SortMeRNA tool |
| Interlace FastQ files | FastQ interlacer tool |


Quality Reports: FastQC

Screenshot of FastQC report, showing the table of contents with green checks on nearly every result, and the base statistics and per-base sequence quality graphs shown.

.footnote[see also our dedicated QC tutorial ]


Quality Reports: FastQC

FastQC quality score plot. Most results are in the green region, but the box portion of the box-and-whisker plot starts to dip into the yellow, medium-quality (less than 30) region near base position 34 and beyond. The whiskers begin extending into the red region (less than 20) by base 31 and get progressively worse.

.footnote[explanation of different plots: dedicated QC tutorial ]

Speaker Notes


Quality Reports: FastQC

Montage of several different FastQC reports showing sequence quality graphs and a number of other line graphs.


Quality Reports: MultiQC

.pull-left[

.pull-right[ MultiQC report showing an aggregation of multiple samples. An overview at the top provides context for the 4 samples, and a sequence quality histogram shows 4 samples with similarly behaving quality scores ]


Read Trimming and Filtering: Cutadapt

Speaker Notes

These are some examples of ways to trim and filter data. Many more are possible; which ones are necessary depends on your experiment.


SortMeRNA

.pull-left[

.pull-right[ .image-90[SortMeRNA] ]


FastQ interlacer


paired end deinterlaced file


FastQ interlacer


paired end interlaced file

Speaker Notes

Forward and reverse files are 'zipped' together into a single file.
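The 'zipping' of forward and reverse reads can be illustrated with a short Python sketch. It is not the FastQ interlacer tool itself: the function names are hypothetical, records are assumed to be exactly four lines each, and the sketch neither checks that read names match nor keeps unpaired reads.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FastQ records as lists of four lines."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FastQ record")
        yield record


def interlace(forward: Iterable[str], reverse: Iterable[str]) -> Iterator[str]:
    """Alternate forward and reverse records: F1, R1, F2, R2, ..."""
    for fwd_record, rev_record in zip(fastq_records(forward), fastq_records(reverse)):
        yield from fwd_record
        yield from rev_record


if __name__ == "__main__":
    # Two tiny, made-up inputs with one record each.
    forward = ["@read1/1", "ACGT", "+", "IIII"]
    reverse = ["@read1/2", "TGCA", "+", "IIII"]
    for line in interlace(forward, reverse):
        print(line)
```

Because the function works on any iterable of lines, it could be fed open file handles (with trailing newlines stripped) instead of the in-memory lists used here.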


Community Profile

Cartoon of several differently coloured and shaped microbes in a circle.


MetaPhlAn2 tool

.footnote[Nat Methods. 2012 Jun 10;9(8):811-4. doi: 10.1038/nmeth.2066.]

Speaker Notes

About the caveat: MetaPhlAn2 quantifies species abundance by averaging the coverage of marker genes. At the DNA level, marker genes are expected to have the same coverage because they are single-copy genes from the same genome, but this does not hold for their transcripts. Applied to metatranscriptomic data, MetaPhlAn2 therefore gives an idea of the average transcriptional rate of a given species, so it can be used, but with caution.
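The idea of averaging marker gene coverage can be sketched as follows. This is a deliberately simplified illustration, not MetaPhlAn2's actual implementation: the function name, species names and coverage values are all hypothetical, and a plain mean stands in for whatever averaging the tool really applies.

```python
from statistics import mean
from typing import Dict, List


def relative_abundance(marker_coverage: Dict[str, List[float]]) -> Dict[str, float]:
    """Average marker coverage per species, then scale to percentages."""
    per_species = {name: mean(values) for name, values in marker_coverage.items()}
    total = sum(per_species.values())
    if total == 0:
        return {name: 0.0 for name in per_species}
    return {name: 100.0 * value / total for name, value in per_species.items()}


if __name__ == "__main__":
    # Made-up coverage values for three marker genes per species.
    coverage = {
        "species_a": [10.0, 12.0, 8.0],
        "species_b": [30.0, 30.0, 30.0],
    }
    print(relative_abundance(coverage))
```

With transcript data the per-marker values within one species can differ widely, which is why the resulting percentages reflect transcriptional activity as much as cell numbers.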


Krona tool


Graphlan tool

Colourful cladogram which begins from the center and expands outward with the lineage of the samples. Each sector of the chart is coloured differently for each group of genus and species. E.g. Streptococcus (Streptococcaceae) has three different leaves of the cladogram tree.


Genus Abundance

stacked bar chart with timepoints along the x axis and genus abundance as a percentage along the y axis. Each of the 7 time samples consists mostly of Coprothermobacter and Clostridium.


Functional Analysis


Workflow

functional analysis workflow schematic

Speaker Notes HUMAnN2


HUMAnN2 tool

Speaker Notes



class: top

.left-column70[


]

.right-column30[ .image-60[Cartoon of several reads coloured into four groups, Species 1, Species 2, Unclassified, Novel.] ]

.left-column70[


]

.right-column30[ .image-60[Four bins labelled 1 (red), 2 (blue), 3, 4 with reads from the top cartoon show piles of 1 and 2 with NO signs over 3 and 4.] ]

.left-column70[


]

.right-column30[ .image-60[Regions x 1 y in red and x 2 y in blue are shown, the pangenomes of each of the red and blue species are shown. Reads map to most segments of the pangenome.] ]

.left-column70[

]

.right-column30[ .image-75[Reads are shown matching against portions of protein sequences of X, Y, Z] ]

Speaker Notes

HUMAnN2 takes the non-rRNA reads together with the list of abundant organisms given by MetaPhlAn2. It first performs nucleotide-level pangenome mapping with Bowtie2 against the ChocoPhlAn database, which yields organism-specific gene hits and unmapped reads. The unmapped reads are then searched against a protein database using an accelerated translated search. The protein hits are combined with the gene hits and MetaCyc to give the output.


Result: Gene family and pathway abundances

.image-40[![A table with two columns, Feature on left and RPK on right. GeneX has an RPK of 8. GeneX Species1 has an RPK of 2, species2 and unclassified are listed with an RPK of 3.](../../images/metatranscriptomics/humann2_tiered5.png)]

Gene Families Abundances

Screenshot of a table in Galaxy with Gene Family on left and humann2 abundance RPKs on right.

RPK (reads per kilobase) = sum of alignment scores

Speaker Notes

- Gene families: groups of evolutionarily related proteins that perform a similar function
- Pathway abundance: sum over the genes catalyzing the reactions of the pathway
- Pathway coverage: presence/absence
- RPK (relative gene copy number): computed as the sum of all alignment scores over a particular gene family
- UNMAPPED: total number of reads that remained unmapped even after both alignment steps
- UNINTEGRATED: no pathway detected
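The structure of the gene family table can be illustrated with a small Python sketch. It assumes the stratified layout where per-species rows carry the feature name followed by `|`; the GeneX values mirror the example table on the slide, while GeneY and both function names are invented for illustration. The copies-per-million step is a simplified sketch of renormalization that ignores special rows such as UNMAPPED, not a reproduction of HUMAnN2's own tool.

```python
from typing import Dict

# Hypothetical RPK table. GeneX mirrors the slide; GeneY is made up.
RPK: Dict[str, float] = {
    "GeneX": 8.0,
    "GeneX|Species1": 2.0,
    "GeneX|Species2": 3.0,
    "GeneX|unclassified": 3.0,
    "GeneY": 2.0,
    "GeneY|Species1": 2.0,
}


def stratified_sum(table: Dict[str, float], feature: str) -> float:
    """Sum the per-species rows that belong to one feature."""
    prefix = feature + "|"
    return sum(value for name, value in table.items() if name.startswith(prefix))


def copies_per_million(table: Dict[str, float]) -> Dict[str, float]:
    """Scale all rows so that the unstratified totals sum to one million."""
    total = sum(value for name, value in table.items() if "|" not in name)
    if total == 0:
        raise ValueError("table has no unstratified abundance")
    return {name: value * 1_000_000 / total for name, value in table.items()}


if __name__ == "__main__":
    # The per-species rows add up to the unstratified total for GeneX.
    print(stratified_sum(RPK, "GeneX"), RPK["GeneX"])
    print(copies_per_million(RPK)["GeneX"])
```

Scaling the stratified rows by the same factor keeps each species' contribution proportional to the feature total after normalization.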


Gene Families to Functional Annotation

Humann2 regroup table is the left node in a flow chart with UniRef50. Multiple lines are drawn to an unlabelled right node that lists metacyc, kegg, pfam, EC, GO, informative GO, slim GO.

Speaker Notes

Depending on the complexity of the sample, the gene family table can be too large to interpret. To simplify it, users can regroup gene families with the regrouping tool; mapping files can be downloaded. HUMAnN2 regroups UniRef50/90 values to GO terms to give a broad overview.
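Regrouping amounts to summing gene family abundances into broader categories through a mapping file. The sketch below shows that idea under stated assumptions: the function name, the UniRef50 and GO identifiers, and the `UNGROUPED` bucket for unmapped families are all illustrative choices, and a family that maps to several groups contributes its full abundance to each.

```python
from collections import defaultdict
from typing import Dict, List


def regroup(abundance: Dict[str, float], mapping: Dict[str, List[str]]) -> Dict[str, float]:
    """Sum gene family abundances into the groups they map to."""
    grouped: Dict[str, float] = defaultdict(float)
    for family, value in abundance.items():
        groups = mapping.get(family)
        if not groups:
            # Families without a mapping are collected in one bucket.
            grouped["UNGROUPED"] += value
            continue
        for group in groups:
            grouped[group] += value
    return dict(grouped)


if __name__ == "__main__":
    # Made-up identifiers and abundances for illustration only.
    families = {"UniRef50_A": 5.0, "UniRef50_B": 3.0, "UniRef50_C": 2.0}
    family_to_go = {
        "UniRef50_A": ["GO:0000001"],
        "UniRef50_B": ["GO:0000001", "GO:0000002"],
    }
    print(regroup(families, family_to_go))
```

The same function could be applied a second time with a GO to GO slim mapping to obtain the coarser overview.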


Group Abundances

humann2 regroup table, lines from uniref50 to GO. group humann2 to GO slim terms shows a similar graphic, lines from uniref50 to slim GO.

Speaker Notes

Group abundances converts GO terms to GO slim terms (a subset of GO terms), split into molecular function, biological process and cellular component.


Gene Families to Functional Annotation

Table from Galaxy shown with gene family and RPK

Group HUMAnN2 to GO slim terms, with lines from UniRef50 to slim GO and boxes for molecular function, biological process, and cellular component below slim GO

Another galaxy table screenshot with GO id, GO name, and abundance.


class: top

Output

.left-column30[
Molecular Function

]

.right-column70[ .image-90[Table in Galaxy with GO ID, name, abundance.] ]

.left-column70[ .image-90[Basically the same table as above.] ]

.right-column30[
Biological Process

]

.right-column70[ .image-90[Again the same columns in a table. None of the specific data is legible or important.] ]

.left-column30[



Cellular Component

]

Speaker Notes

g is genus level, s is species level

Unpack pathway abundances to show genes included

output file unpack pathway tool


Function: Cellulose Degradation

.image-75[Line chart showing cellulase abundance decreasing from 80 copies per million to 40 as time goes from 13 to 43 h. Cellulose 1,4-beta-cellobiosidase starts at 140 cpm and dips to 120 at hour 23 before increasing to 200 by the end of the graph]

Speaker Notes

Explain the datasets first. Cellulose 1,4-beta-cellobiosidase is responsible for the hydrolysis of cellulose. The gene encoding the cellulose-binding domain protein shows an initial decrease and a subsequent increase during cellulose degradation.

Functions associated with a selected taxon

.image-75[Stacked bar chart with a lot of organisms as left axis (abundance, copies per million) and time on the bottom. It is labelled Coprothermobacter: Functional Pathways]

Speaker Notes

In the genus abundance results, Coprothermobacter and Clostridium were observed to be the most abundant. This figure looks at Coprothermobacter only: glycolysis is observed to be its most abundant functional pathway across time points.


Taxa associated with a selected function

.image-75[Bar chart titled Adenosine ribonucleotides de novo biosynthesis with time in hours as x axis, and Genus abundance (copies per million). Coprothermobacter and Clostridium decrease from ~2000 combined copies per million to ~800, in approximately equal amounts.]

Speaker Notes

This figure shows the contribution of genera to adenosine ribonucleotides de novo biosynthesis across time points. Clostridium and Coprothermobacter are the abundant contributors to this pathway.

Tabular Outputs from ASaiM Workflow


Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

This material is licensed under the Creative Commons Attribution 4.0 International License.