16S Microbial Analysis with mothur (extended)

Author(s)	Saskia Hiltemann Bérénice Batut Dave Clements
Reviewers

Overview
Questions:

What is the effect of normal variation in the gut microbiome on host health?

Objectives:

Analyze of 16S rRNA sequencing data using the mothur toolsuite in Galaxy

Using a mock community to assess the error rate of your sequencing experiment

Visualize sample diversity using Krona and Phinch

Requirements:

Introduction to Galaxy Analyses

Time estimation: 6 hours

Supporting Materials:

Datasets

Workflows

FAQs

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.be ✅

Possibly Working

UseGalaxy.cz

UseGalaxy.fr

UseGalaxy.no

UseGalaxy.org (Main)

UseGalaxy.org.au

Published: Feb 12, 2017

Last modification: Mar 14, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00390

rating Rating: 5.0 (2 recent ratings, 47 all time)

version Revision: 4

Overview

In this tutorial we will perform an analysis based on the Standard Operating Procedure (SOP) for MiSeq data, developed by the Schloss lab, the creators of the mothur software package Schloss et al. 2009.

Comment: Note: Two versions of this tutorial

Because this tutorial consists of many steps, we have made two versions of it, one long and one short.

This is the extended version. We will run every tool manually and discuss the results in detail. If you would like to run through the tutorial a bit quicker and use workflows to run groups of analysis steps (e.g. data cleaning) at once, please see the shorter version of this tutorial

You can also switch between the long and short version at the start of any section.

Agenda

In this tutorial, we will cover:

Overview

Obtaining and preparing data

Understanding our input data

Importing the data into Galaxy

Quality Control

Create contigs from paired-end reads

Data Cleaning

Optimize files for computation

Sequence Alignment

More Data Cleaning

Chimera Removal

Taxonomic Classification

Removal of non-bacterial sequences

Optional: Calculate error rates based on our mock community

Cluster mock sequences into OTUs

OTU Clustering

Remove Mock Sample

Cluster sequences into OTUs

Diversity Analysis

Alpha diversity

Beta diversity

Visualisations

Krona

Conclusion

Comment: Results may vary

Your results may be slightly different from the ones presented in this tutorial due to differing versions of tools, reference data, external databases, or because of stochastic processes in the algorithms.

Obtaining and preparing data

In this tutorial we use 16S rRNA data, but similar pipelines can be used for WGS data.

Comment: Background: The 16S ribosomal RNA gene

The 16S rRNA gene has several properties that make it ideally suited for our purposes

Present in all prokaryotes

Highly conserved + highly variable regions

Huge reference databases

The highly conserved regions make it easy to target the gene across different organisms, while the highly variable regions allow us to distinguish between different species.

(slide credit https://www.slideshare.net/beiko/ccbc-tutorial-beiko)

Understanding our input data

In this tutorial we use the dataset generated by the Schloss lab to illustrate their MiSeq SOP.

They describe the experiment as follows:

“The Schloss lab is interested in understanding the effect of normal variation in the gut microbiome on host health. To that end, we collected fresh feces from mice on a daily basis for 365 days post weaning. During the first 150 days post weaning (dpw), nothing was done to our mice except allow them to eat, get fat, and be merry. We were curious whether the rapid change in weight observed during the first 10 dpw affected the stability microbiome compared to the microbiome observed between days 140 and 150.”

To speed up analysis for this tutorial, we will use only a subset of this data. We will look at a single mouse at 20 different time points (10 early, 10 late). In order to assess the error rate of the analysis pipeline and experimental setup, the Schloss lab additionally sequenced a mock community with a known composition (genomic DNA from 21 bacterial strains). The sequences used for this mock sample are contained in the file HMP_MOCK.v35.fasta

Comment: Dataset naming scheme

For this tutorial, you are given 20 pairs of files. For example, the following pair of files:
F3D0_S188_L001_R1_001.fastq
F3D0_S188_L001_R2_001.fastq

The first part of the file name indicates the sample; F3D0 here signifies that this sample was obtained from Female 3 on Day 0. The rest of the file name is identical, except for _R1 and _R2, this is used to indicate the forward and reverse reads respectively.

Importing the data into Galaxy

Now that we know what our input data is, let’s get it into our Galaxy history:

All data required for this tutorial has been made available from Zenodo

Hands On: Obtaining our data
Make sure you have an empty analysis history. Give it a name.

To create a new history simply click the new-history icon at the top of the history panel:
Import Sample Data.
Import the sample FASTQ files to your history, either from a shared data library (if available), or from Zenodo using the URLs listed in the box below (click param-repeat to expand):
https://zenodo.org/record/800651/files/F3D0_R1.fastq
https://zenodo.org/record/800651/files/F3D0_R2.fastq
https://zenodo.org/record/800651/files/F3D141_R1.fastq
https://zenodo.org/record/800651/files/F3D141_R2.fastq
https://zenodo.org/record/800651/files/F3D142_R1.fastq
https://zenodo.org/record/800651/files/F3D142_R2.fastq
https://zenodo.org/record/800651/files/F3D143_R1.fastq
https://zenodo.org/record/800651/files/F3D143_R2.fastq
https://zenodo.org/record/800651/files/F3D144_R1.fastq
https://zenodo.org/record/800651/files/F3D144_R2.fastq
https://zenodo.org/record/800651/files/F3D145_R1.fastq
https://zenodo.org/record/800651/files/F3D145_R2.fastq
https://zenodo.org/record/800651/files/F3D146_R1.fastq
https://zenodo.org/record/800651/files/F3D146_R2.fastq
https://zenodo.org/record/800651/files/F3D147_R1.fastq
https://zenodo.org/record/800651/files/F3D147_R2.fastq
https://zenodo.org/record/800651/files/F3D148_R1.fastq
https://zenodo.org/record/800651/files/F3D148_R2.fastq
https://zenodo.org/record/800651/files/F3D149_R1.fastq
https://zenodo.org/record/800651/files/F3D149_R2.fastq
https://zenodo.org/record/800651/files/F3D150_R1.fastq
https://zenodo.org/record/800651/files/F3D150_R2.fastq
https://zenodo.org/record/800651/files/F3D1_R1.fastq
https://zenodo.org/record/800651/files/F3D1_R2.fastq
https://zenodo.org/record/800651/files/F3D2_R1.fastq
https://zenodo.org/record/800651/files/F3D2_R2.fastq
https://zenodo.org/record/800651/files/F3D3_R1.fastq
https://zenodo.org/record/800651/files/F3D3_R2.fastq
https://zenodo.org/record/800651/files/F3D5_R1.fastq
https://zenodo.org/record/800651/files/F3D5_R2.fastq
https://zenodo.org/record/800651/files/F3D6_R1.fastq
https://zenodo.org/record/800651/files/F3D6_R2.fastq
https://zenodo.org/record/800651/files/F3D7_R1.fastq
https://zenodo.org/record/800651/files/F3D7_R2.fastq
https://zenodo.org/record/800651/files/F3D8_R1.fastq
https://zenodo.org/record/800651/files/F3D8_R2.fastq
https://zenodo.org/record/800651/files/F3D9_R1.fastq
https://zenodo.org/record/800651/files/F3D9_R2.fastq
https://zenodo.org/record/800651/files/Mock_R1.fastq
https://zenodo.org/record/800651/files/Mock_R2.fastq
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Import Reference Data

Import the following reference datasets

silva.v4.fasta

HMP_MOCK.v35.fasta

trainset9_032012.pds.fasta

trainset9_032012.pds.tax
https://zenodo.org/record/800651/files/HMP_MOCK.v35.fasta
https://zenodo.org/record/800651/files/silva.v4.fasta
https://zenodo.org/record/800651/files/trainset9_032012.pds.fasta
https://zenodo.org/record/800651/files/trainset9_032012.pds.tax
https://zenodo.org/record/800651/files/mouse.dpw.metadata

Now that’s a lot of files to manage. Luckily Galaxy can make life a bit easier by allowing us to create dataset collections. This enables us to easily run tools on multiple datasets at once.

Since we have paired-end data, each sample consist of two separate fastq files, one containing the forward reads, and one containing the reverse reads. We can recognize the pairing from the file names, which will differ only by _R1 or _R2 in the filename. We can tell Galaxy about this paired naming convention, so that our tools will know which files belong together. We do this by building a List of Dataset Pairs

Hands On: Organizing our data into a paired collection

Click on the checkmark icon param-check at top of your history.

Select all the FASTQ files (40 in total)

Tip: type fastq in the search bar at the top of your history to filter only the FASTQ files; you can now use the All button at the top instead of having to individually select all 40 input files.

Click on All 40 selected

Select Build List of Dataset Pairs from the dropdown menu

In the next dialog window you can create the list of pairs. By default Galaxy will look for pairs of files that differ only by a _1 and _2 part in their names. In our case however, these should be _R1 and _R2.

Click on “Choose Filters” and select Forward: _R1, Reverse: _R2 (note that you can also enter Filters manually in the text fields on the top)

You should now see a list of pairs suggested by Galaxy:

Click on Auto-pair to create the suggested pairs.

Or click on “Pair these datasets” manually for every pair that looks correct.

Name the pairs

The middle segment is the name for each pair.

These names will be used as sample names in the downstream analysis, so always make sure they are informative!

Make sure that param-check Remove file extensions is checked

Check that the pairs are named F3D0-F3D9, F3D141-F3D150 and Mock.

Note: The names should not have the .fastq extension

If needed, the names can be edited manually by clicking on them

Name your collection at the bottom right of the screen

You can pick whatever name makes sense to you

Click the Create Collection button.

A new dataset collection item will now appear in your history

Quality Control

exchange Switch to short tutorial

For more information on the topic of quality control, please see our dedicated QC training materials.

Before starting any analysis, it is always a good idea to assess the quality of your input data and improve it where possible by trimming and filtering reads. The mothur toolsuite contains several tools to assist with this task. We will begin by merging our reads into contigs, followed by filtering and trimming of reads based on quality score and several other metrics.

Create contigs from paired-end reads

In this experiment, paired-end sequencing of the ~253 bp V4 region of the 16S rRNA gene was performed. The sequencing was done from either end of each fragment. Because the reads are about 250 bp in length, this results in a significant overlap between the forward and reverse reads in each pair. We will combine these pairs of reads into contigs.

The Make.contigs tool creates the contigs, and uses the paired collection as input. Make.contigs will look at each pair, take the reverse complement reverse read, and then determine the overlap between the two sequences. Where an overlapping base call differs between the two reads, the quality score is used to determine the consensus base call. A new quality score is derived by combining the two original quality scores in both of the reads for all the overlapping positions.

Hands On: Combine forward and reverse reads into contigs

Make.contigs ( Galaxy version 1.39.5.1) with the following parameters

param-select “Way to provide files”: Multiple pairs - Combo mode

param-collection “Fastq pairs”: the collection you just created

Leave all other parameters to the default settings

This step combined the forward and reverse reads for each sample, and also combined the resulting contigs from all samples into a single file. So we have gone from a paired collection of 20x2 FASTQ files, to a single FASTA file. In order to retain information about which reads originated from which samples, the tool also output a group file. View that file now, it should look something like this:

M00967_43_000000000-A3JHG_1_1101_10011_3881     F3D0
M00967_43_000000000-A3JHG_1_1101_10050_15564    F3D0
M00967_43_000000000-A3JHG_1_1101_10051_26098    F3D0
[..]

Here the first column contains the read name, and the second column contains the sample name.

Data Cleaning

As the next step, we want to improve the quality of our data. But first, let’s get a feel of our dataset:

Hands On: Summarize data

Summary.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: the trim.contigs.fasta file created by Make.contigs tool

“Output logfile?”: yes

The summary output files give information per read. The logfile outputs also contain some summary statistics:

             Start    End        NBases     Ambigs   Polymer  NumSeqs
Minimum:     1        248        248        0        3        1
2.5%-tile:   1        252        252        0        3        3810
25%-tile:    1        252        252        0        4        38091
Median:      1        252        252        0        4        76181
75%-tile:    1        253        253        0        5        114271
97.5%-tile:  1        253        253        6        6        148552
Maximum:     1        502        502        249      243      152360
Mean:        1        252.811    252.811    0.70063  4.44854
 # of Seqs:   152360

In this dataset:

Almost all of the reads are between 248 and 253 bases long.
2,5% or more of our reads had ambiguous base calls (Ambigs column).
The longest read in the dataset is 502 bases.
There are 152,360 sequences.

Our region of interest, the V4 region of the 16S gene, is only around 250 bases long. Any reads significantly longer than this expected value likely did not assemble well in the Make.contigs step. Furthermore, we see that 2,5% of our reads had between 6 and 249 ambiguous base calls (Ambigs column). In the next steps we will clean up our data by removing these problematic reads.

We do this data cleaning using the Screen.seqs tool, which removes

sequences with ambiguous bases (maxambig) and
contigs longer than a given threshold (maxlength).

Hands On: Filter reads based on quality and length

Screen.seqs ( Galaxy version 1.39.5.1) with the following parameters

param-file “fasta”: the trim.contigs.fasta file created by Make.contigs tool

param-file “group”: the group file created in the Make.contigs tool step

“maxlength”: 275

“maxambig”: 0

Question

How many reads were removed in this screening step? (Hint: run the summary.seqs tool again)

23,488.

This can be determined by looking at the number of lines in bad.accnos output of screen.seqs or by comparing the total number of seqs between of the summary log before and after this screening step

Optimize files for computation

Microbiome samples typically contain a large numbers of the same organism, and therefore we expect to find many identical sequences in our data. In order to speed up computation, we first determine the unique reads, and then record how many times each of these different reads was observed in the original dataset. We do this by using the Unique.seqs tool.

Hands On: Remove duplicate sequences

Unique.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: the good.fasta output from Screen.seqs tool

“output format”: Name File

Question

How many sequences were unique? How many duplicates were removed?

16,426 unique sequences and 112,446 duplicates.

This can be determined from the number of lines in the fasta (or names) output, compared to the number of lines in the fasta file before this step.

Here we see that this step has greatly reduced the size of our sequence file; not only will this speed up further computational steps, it will also greatly reduce the amount of disk space (and your Galaxy quota) needed to store all the intermediate files generated during this analysis. This Unique.seqs tool created two files, one is a FASTA file containing only the unique sequences, and the second is a so-called names file. This names file consists of two columns, the first contains the sequence names for each of the unique sequences, and the second column contains all other sequence names that are identical to the representative sequence in the first column.

name          representatives
read_name1    read_name2,read_name,read_name5,read_name11
read_name4    read_name6,read_name,read_name10
read_name7    read_name8
...

To recap, we now have the following files:

a FASTA file containing every distinct sequence in our dataset (the representative sequences)
a names file containing the list of duplicate sequences
a group file containing information about the samples each read originated from

To further reduce file sizes and streamline analysis, we can use the Count.seqs tool to combine the group file and the names file into a single count table.

Hands On: Generate count table

Count.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “name”: the names output from Unique.seqs tool

“Use a Group file”: yes

param-file “group”: the group file we created using the Screen.seqs tool

Have a look at the count_table output from the Count.seqs tool, it summarizes the number of times each unique sequence was observed across each of the samples. It will look something like this:

Representative_Sequence                      total   F3D0   F3D1  F3D141  F3D142  ...
M00967_43_000000000-A3JHG_1_1101_14069_1827  4402    370    29    257     142
M00967_43_000000000-A3JHG_1_1101_18044_1900  28      1      0     1       0
M00967_43_000000000-A3JHG_1_1101_13234_1983  10522   425    281   340     205
...

The first column contains the read names of the representative sequences, and the subsequent columns contain the number of duplicates of this sequence observed in each sample.

Comment: Representative sequences vs Total sequences

From now on, we will only work with the set of unique sequences, but it’s important to remember that these represent a larger number of total sequences, which we keep track of in the count table.

In the following we will use the unique sequences together with the count table as input to tools instead of the complete set of sequences. If this is done for the Summary.seqs tool tool it will report both the number of unique representative sequences as well as the total sequences they represent.

Sequence Alignment

exchange Switch to short tutorial

For more information on the topic of alignment, please see our dedicated alignment training materials

We are now ready to align our sequences to the reference alignment. This is an important step to improve the clustering of your OTUs Schloss 2012.

In mothur this is done by determining for each unique sequence the entry of the reference database that has the most k-mers in common (i.e. the most substring of fixed length k). For the reference sequence with the most common k-mers and the unique sequence a standard global sequence alignment is computed (using the Needleman-Wunsch algorithm).

Hands On: Align sequences
Align.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: the fasta output from Unique.seqs tool

param-file “reference”: silva.v4.fasta reference file from your history
Question

Have a look at the alignment output, what do you see?
At first glance, it might look like there is not much information there. We see our read names, but only period . characters below it.
>M00967_43_000000000-A3JHG_1_1101_14069_1827
............................................................................
>M00967_43_000000000-A3JHG_1_1101_18044_1900
............................................................................
This is because the V4 region is located further down our reference database and nothing aligns to the start of it. If you scroll to right you will start seeing some more informative bits:
.....T-------AC---GG-AG-GAT------------
Here we start seeing how our sequences align to the reference database. There are different alignment characters in this output:

.: terminal gap character (before the first or after the last base in our query sequence)

-: gap character within the query sequence

We will cut out only the V4 region in a later step (Filter.seqs tool)
Summary.seqs ( Galaxy version 1.39.5.0) with the following parameters:

param-file “fasta”: the align output from Align.seqs tool

param-file “count”: count_table output from Count.seqs tool

“Output logfile?”: yes

Have a look at the summary output (log file):

            Start    End      NBases  Ambigs   Polymer  NumSeqs
Minimum:    1250     10693    250     0        3        1
2.5%-tile:  1968     11550    252     0        3        3222
25%-tile:   1968     11550    252     0        4        32219
Median:     1968     11550    252     0        4        64437
75%-tile:   1968     11550    253     0        5        96655
97.5%-tile: 1968     11550    253     0        6        125651
Maximum:    1982     13400    270     0        12       128872
Mean:       1967.99  11550    252.462 0        4.36693
# of unique seqs:   16426
total # of seqs:    128872

The Start and End columns tell us that the majority of reads aligned between positions 1968 and 11550, which is what we expect to find given the reference file we used. However, some reads align to very different positions, which could indicate insertions or deletions at the terminal ends of the alignments or other complicating factors.

Also notice the Polymer column in the output table. This indicates the average homopolymer length. Since we know that our reference database does not contain any homopolymer stretches longer than 8 reads, any reads containing such long stretches are likely the result of PCR errors and we would be wise to remove them.

Next we will clean our data further by removing poorly aligned sequences and any sequences with long homopolymer stretches.

More Data Cleaning

To ensure that all our reads overlap our region of interest, we will:

Remove any reads not overlapping the region V4 region (position 1968 to 11550) using Screen.seqs tool.
Remove any overhang on either end of the V4 region to ensure our sequences overlap only the V4 region, using Filter.seqs tool.
Clean our alignment file by removing any columns that have a gap character (-, or . for terminal gaps) at that position in every sequence (also using Filter.seqs tool).
Remove redundancy in the aligned sequences that might have been introduced by filtering columns by running Unique.seqs once more.
Group near-identical sequences together with Pre.cluster tool. Sequences that only differ by one or two bases at this point are likely to represent sequencing errors rather than true biological variation, so we will cluster such sequences together.

Hands On: Remove poorly aligned sequences

Screen.seqs ( Galaxy version 1.39.5.1) with the following parameters

param-file “fasta”: the aligned fasta file from Align.seqs tool

“start”: 1968

“end”: 11550

“maxhomop”: 8

param-file “count”: the count table file from Count.seqs tool

Note: we supply the count table so that it can be updated for the sequences we’re removing.

Question

How many sequences were removed in this step?

128 sequences were removed. This is the number of lines in the bad.accnos output.

Next, we will remove any overhang on either side of the V4 region, and

Filter.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: good.fasta output from the latest Screen.seqs tool

“vertical”: yes

“trump”: .

“Output logfile”: yes

Your resulting alignment (filtered fasta output) should look something like this:

>M00967_43_000000000-A3JHG_1_1101_14069_1827
TAC--GG-AG-GAT--GCG-A-G-C-G-T-T--AT-C-CGG-AT--TT-A-T-T--GG-GT--TT-A-AA-GG-GT-GC-G-TA-GGC-G-G-C-CT-G-C-C-AA-G-T-C-A-G-C-G-G--TA-A-AA-TT-G-C-GG-GG--CT-C-AA-C-C-C-C-G-T-A--CA-G-C-CGTT-GAAAC-TG-C-CGGGC-TCGA-GT-GG-GC-GA-G-A---AG-T-A-TGCGGAATGCGTGGTGT-AGCGGT-GAAATGCATAG-AT-A-TC-AC-GC-AG-AACCCCGAT-TGCGAAGGCA------GC-ATA-CCG-G-CG-CC-C-T-ACTGACG-CTGA-GGCA-CGAAA-GTG-CGGGG-ATC-AAACAGG
>M00967_43_000000000-A3JHG_1_1101_18044_1900
TAC--GG-AG-GAT--GCG-A-G-C-G-T-T--GT-C-CGG-AA--TC-A-C-T--GG-GC--GT-A-AA-GG-GC-GC-G-TA-GGC-G-G-T-TT-A-A-T-AA-G-T-C-A-G-T-G-G--TG-A-AA-AC-T-G-AG-GG--CT-C-AA-C-C-C-T-C-A-G-CCT-G-C-CACT-GATAC-TG-T-TAGAC-TTGA-GT-AT-GG-AA-G-A---GG-A-G-AATGGAATTCCTAGTGT-AGCGGT-GAAATGCGTAG-AT-A-TT-AG-GA-GG-AACACCAGT-GGCGAAGGCG------AT-TCT-CTG-G-GC-CA-A-G-ACTGACG-CTGA-GGCG-CGAAA-GCG-TGGGG-AGC-AAACAGG
>M00967_43_000000000-A3JHG_1_1101_13234_1983
TAC--GG-AG-GAT--GCG-A-G-C-G-T-T--AT-C-CGG-AT--TT-A-T-T--GG-GT--TT-A-AA-GG-GT-GC-G-CA-GGC-G-G-A-AG-A-T-C-AA-G-T-C-A-G-C-G-G--TA-A-AA-TT-G-A-GA-GG--CT-C-AA-C-C-T-C-T-T-C--GA-G-C-CGTT-GAAAC-TG-G-TTTTC-TTGA-GT-GA-GC-GA-G-A---AG-T-A-TGCGGAATGCGTGGTGT-AGCGGT-GAAATGCATAG-AT-A-TC-AC-GC-AG-AACTCCGAT-TGCGAAGGCA------GC-ATA-CCG-G-CG-CT-C-A-ACTGACG-CTCA-TGCA-CGAAA-GTG-TGGGT-ATC-GAACAGG

These are all our representative reads again, now with additional alignment information.

In the log file of the Filter.seqs step we see the following additional information:

Length of filtered alignment: 376
Number of columns removed: 13049
Length of the original alignment: 13425
Number of sequences used to construct filter: 16298

From this log file we see that while our initial alignment was 13425 positions wide, after filtering the overhangs (trump parameter) and removing positions that had a gap in every aligned read (vertical parameter), we have trimmed our alignment down to a length of 376.

Because any filtering step we perform might lead to sequences no longer being unique, we deduplicate our data by re-running the Unique.seqs tool:

Hands On: Re-obtain unique sequences

Unique.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: the filtered fasta output from Filter.seqs tool

param-file “name file or count table”: the count table from the last Screen.seqs tool

Question

How many duplicate sequences did our filter step produce?

3: The number of unique sequences was reduced from 16298 to 16295

Pre-clustering

The next step in cleaning our data, is to merge near-identical sequences together. Sequences that only differ by around 1 in every 100 bases are likely to represent sequencing errors, not true biological variation. Because our contigs are ~250 bp long, we will set the threshold to 2 mismatches.

Hands On: Perform preliminary clustering of sequences

Pre.cluster ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: the fasta output from the last Unique.seqs tool run

param-file “name file or count table”: the count table from the last Unique.seqs tool

“diffs”: 2

Question

How many unique sequences are we left with after this clustering of highly similar sequences?

5720: This is the number of lines in the fasta output

Chimera Removal

We have now thoroughly cleaned our data and removed as much sequencing error as we can. Next, we will look at a class of sequencing artefacts known as chimeras.

During PCR amplification, it is possible that two unrelated templates are combined to form a sort of hybrid sequence, also called a chimera. Needless to say, we do not want such sequencing artefacts confounding our results. We’ll do this chimera removal using the VSEARCH algorithm Rognes et al. 2016 that is called within mothur, using the Chimera.vsearch tool tool.

Comment: Background: Chimeras

(slide credit: http://slideplayer.com/slide/4559004/ )

This command will split the data by sample and check for chimeras. The recommended way of doing this is to use the abundant sequences as our reference.

Hands On: Remove chimeric sequences

Chimera.vsearch ( Galaxy version 1.39.5.1) with the following parameters

param-file “fasta”: the fasta output from Pre.cluster tool

param-select “Select Reference Template from”: Self

param-file “count”: the count table from the last Pre.cluster tool

param-check “dereplicate” to Yes

Running Chimera.vsearch with the count file will remove the chimeric sequences from the count table, but we still need to remove those sequences from the fasta file as well. We do this using Remove.seqs:

Remove.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “accnos”: the vsearch.accnos file from Chimera.vsearch tool

param-file “fasta”: the fasta output from Pre.cluster tool

param-file “count”: the count table from Chimera.vsearch tool

Question

How many sequences were flagged as chimeric? what is the percentage? (Hint: summary.seqs)

Looking at the chimera.vsearch accnos output, we see that 3,439 representative sequences were flagged as chimeric. If we run summary.seqs on the resulting fasta file and count table, we see that we went from 128,655 sequences down to 118,091 total sequences in this step, for a reduction of 10,564 total sequences, or 8.2%. This is a reasonable number of sequences to be flagged as chimeric.

Taxonomic Classification

exchange Switch to short tutorial

Now that we have thoroughly cleaned our data, we are finally ready to assign a taxonomy to our sequences.

We will do this using a Bayesian classifier (via the Classify.seqs tool tool) and a mothur-formatted training set provided by the Schloss lab based on the RDP (Ribosomal Database Project, Cole et al. 2013) reference taxonomy.

Comment: Background: Taxonomic assignment

In this tutorial we will use the RDP classifier and reference taxonomy for classification, but there are several different taxonomic assignment algorithms and reference databases available for this purpose.

An overview of different methods is given by Liu et al. 2008 and shown below:

The most commonly used reference taxonomies are:

SILVA (Quast et al. 2012)

GreenGenes (DeSantis et al. 2006)

RDP (Cole et al. 2013)

NCBI Taxonomy Database (Federhen 2011)

The choice of taxonomic classifier and reference taxonomy can impact downstream results. The figure from Liu et al. 2008 given below shows the taxonomic composition determined when using different classifiers and reference taxonomies, for different primer sets (16S regions).

$Compositions at the phylum level for each of the three datasets: (a) Guerrero Negro mat, (b) Human gut and (c) Mouse gut, using a range of different methods (separate subpanels within each group). The x-axis of each graph shows region sequenced. The y-axis shows abundance as a fraction of the total number of sequences in the community. The legend shows colors for phyla (consistent across graphs).$

Figure: Compositions at the phylum level for each of the three datasets: (a) Guerrero Negro mat, (b) Human gut and (c) Mouse gut, using a range of different methods (separate subpanels within each group). The x-axis of each graph shows region sequenced. The y-axis shows abundance as a fraction of the total number of sequences in the community. The legend shows colors for phyla (consistent across graphs).

Which reference taxonomy is best for your experiments depends on a number of factors such as the type of sample and variable region sequenced.

Another discussion about how these different databases compare was described by Balvočiūtė and Huson 2017.

Removal of non-bacterial sequences

Despite all we have done to improve data quality, there may still be more to do: there may be 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria that have survived all the cleaning steps up to this point. We are generally not interested in these sequences and want to remove them from our dataset.

Hands On: Taxonomic Classification and Removal of undesired sequences

Classify.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: the fasta output from Remove.seqs tool

param-file “reference”: trainset9032012.pds.fasta from your history

param-file “taxonomy”: trainset9032012.pds.tax from your history

param-file “count”: the count table file from Remove.seqs tool

Have a look at the taxonomy output. You will see that every read now has a classification.

Now that everything is classified we want to remove our undesirables. We do this with the Remove.lineage tool:

Remove.lineage ( Galaxy version 1.39.5.0) with the following parameters

param-file “taxonomy”: the taxonomy output from Classify.seqs tool

param-text “taxon - Manually select taxons for filtering”: Chloroplast-Mitochondria-unknown-Archaea-Eukaryota

param-file “fasta”: the fasta output from Remove.seqs tool

param-file “count”: the count table from Remove.seqs tool

Question

How many unique (representative) sequences were removed in this step?

How many sequences in total?

20 representative sequences were removed. The fasta file output from Remove.seqs had 2281 sequences while the fasta output from Remove.lineages contained 2261 sequences.

162 total sequences were removed. If you run summary.seqs with the count table, you will see that we now have 2261 unique sequences representing a total of 117,929 total sequences (down from 118,091 before). This means 162 of our sequences were in represented by these 20 representative sequences.

The data is now as clean as we can get it. In the next section we will use the Mock sample to assess how accurate our sequencing and bioinformatics pipeline is.

Optional: Calculate error rates based on our mock community

exchange Switch to short tutorial

Comment: Skipping the mock community analysis

The mock community analysis is optional. If you are low on time or want to skip ahead, you can jump straight to the next section where we will cluster our sequences into OTUs, classify them and perform some visualisations.

If you wish to skip the mock community analysis, you can go directly to the next section and continue with the analysis.

The following step is only possible if you have co-sequenced a mock community with your samples. A mock community is a sample of which you know the exact composition and is something we recommend to do, because it will give you an idea of how accurate your sequencing and analysis protocol is.

Comment: Background: Mock communities

What is a mock community?

A mock community is an artificially constructed sample; a defined mixture of microbial cells and/or viruses or nucleic acid molecules created in vitro to simulate the composition of a microbiome sample or the nucleic acid isolated therefrom.

Why sequence a mock community?

In a mock community, we know exactly which sequences/organisms we expect to find, and at which proportions. Therefore, we can use such an artificial sample to assess the error rates of our sequencing and analysis pipeline.

Did we miss any of the sequences we know to be present in the sample (false negatives)?

Do we find any sequences that were not present in the sample (false positives)?

Were we able to accurately detect their relative abundances?

If our workflow performed well on the mock sample, we have more confidence in the accuracy of the results on the rest of our samples.

Example

As an example, consider the following image from Fouhy et al Fouhy et al. 2016. A mock community sample was sequenced for different combinations of sequencer and primer sets (V-regions). Since we know the expected outcome, we can assess the accuracy of each pipeline. A similar approach can be used to assess different parameter settings of the in-silico analysis pipeline.

Open image in new tab

Figure 1: Example of usage of a mock community to assess accuracy. On the left is the expected result given that we know the exact composition of the mock sample. This was then used to assess the accuracy of different combinations of sequencing platform and primer set (choice of V-region)

Further reading

Next generation sequencing data of a defined microbial mock community Singer et al. 2016

16S rRNA gene sequencing of mock microbial populations- impact of DNA extraction method, primer choice and sequencing platform Fouhy et al. 2016

The mock community in this experiment was composed of genomic DNA from 21 bacterial strains. So in a perfect world, this is exactly what we would expect the analysis to produce as a result.

First, let’s extract the sequences belonging to our mock samples from our data:

Hands On: extract mock sample from our dataset

Get.groups ( Galaxy version 1.39.5.0) with the following parameters

param-file “group file or count table”: the count table from Remove.lineage tool

param-select “groups”: Mock

param-file “fasta”: fasta output from Remove.lineage tool

param-check “output logfile?”: yes

In the log file we see the following:

Selected 58 sequences from your fasta file.
Selected 4046 sequences from your count file

The Mock sample has 58 unique sequences, representing a total of 4,046 total sequences.

The Seq.error tool measures the error rates using our mock reference. Here we align the reads from our mock sample back to their known sequences, to see how many fail to match.

Hands On: Assess error rates based on a mock community

Seq.error ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: the fasta output from Get.groups tool

param-file “reference”: HMP_MOCK.v35.fasta file from your history

param-file “count”: the count table from Get.groups tool

param-check “output log?”: yes

In the log file we see something like this:

Overall error rate:    6.5108e-05
Errors    Sequences
0    3998
1    3
2    0
3    2
4    1
[..]

That is pretty good! The error rate is only 0.0065%! This gives us confidence that the rest of the samples are also of high quality, and we can continue with our analysis.

Cluster mock sequences into OTUs

We will now estimate the accuracy of our sequencing and analysis pipeline by clustering the Mock sequences into OTUs, and comparing the results with the expected outcome.

For this a distance matrix is calculated (i.e. the distances between all pairs of sequences). From this distance matrix a clustering is derived using the OptiClust algorithm:

OptiClust starts with a random OTU clustering
Then iteratively sequences are moved to all other OTUs or new clusters and the option is chosen that improved the mathews correlation coefficient (MCC)
Step 2 is repeated until the MCC converges

Comment: Background: What are Operational Taxonomic Units (OTUs)?

In 16S metagenomics approaches, OTUs are clusters of similar sequence variants of the 16S rDNA marker gene sequence. Each of these clusters is intended to represent a taxonomic unit of a bacteria species or genus depending on the sequence similarity threshold. Typically, OTU cluster are defined by a 97% identity threshold of the 16S gene sequence variants at species level. 98% or 99% identity is suggested for strain separation.

(Image credit: Danzeisen et al. 2013, 10.7717/peerj.237)

Hands On: Cluster mock sequences into OTUs

First we calculate the pairwise distances between our sequences

Dist.seqs ( Galaxy version 1.39.5.0) with the following parameters

param-file “fasta”: the fasta from Get.groups tool

“cutoff”: 0.20

Next we group sequences into OTUs

Cluster - Assign sequences to OTUs ( Galaxy version 1.39.5.0) with the following parameters

param-file “column”: the dist output from Dist.seqs tool

param-file “count”: the count table from Get.groups tool

Now we make a shared file that summarizes all our data into one handy table

Make.shared ( Galaxy version 1.39.5.0) with the following parameters

param-file “list”: the OTU list from Cluster tool

param-file “count”: the count table from Get.groups tool

“label”: 0.03 (this indicates we are interested in the clustering at a 97% identity threshold)

And now we generate intra-sample rarefaction curves

Rarefaction.single ( Galaxy version 1.39.5.0) with the following parameters

param-file “shared”: the shared file from Make.shared tool

Question

How many OTUs were identified in our mock community?

Answer: 34

This can be determined by opening the shared file or OTU list and looking at the header line. You will see a column for each OTU

Open the rarefaction output (dataset named sobs inside the rarefaction curves output collection), it should look something like this:

numsampled	0.03-	lci-	hci-
1.0000	1.0000	1.0000
18.0240	16.0000	20.0000
19.2160	17.0000	22.0000
19.8800	18.0000	22.0000
20.3600	19.0000	22.0000

[..]

30.4320	28.0000	33.0000
30.8800	28.0000	34.0000
31.3200	29.0000	34.0000
31.6320	29.0000	34.0000
31.9920	30.0000	34.0000
32.3440	30.0000	34.0000
32.6560	31.0000	34.0000
32.9920	31.0000	34.0000
33.2880	32.0000	34.0000
33.5920	32.0000	34.0000
33.8560	33.0000	34.0000
34.0000	34.0000	34.0000

When we use the full set of 4060 sequences, we find 34 OTUs from the Mock community; and with 3000 sequences, we find about 31 OTUs. In an ideal world, we would find exactly 21 OTUs. Despite our best efforts, some chimeras or other contaminations may have slipped through our filtering steps.

Comment: Background: Rarefaction

To estimate the fraction of species sequenced, rarefaction curves are typically used. A rarefaction curve plots the number of species as a function of the number of individuals sampled. The curve usually begins with a steep slope, which at some point begins to flatten as fewer species are being discovered per sample: the gentler the slope, the less contribution of the sampling to the total number of operational taxonomic units or OTUs.

Green, most or all species have been sampled; blue, this habitat has not been exhaustively sampled; red, species rich habitat, only a small fraction has been sampled.

(A Primer on Metagenomics Wooley et al. 2010 )

Now that we have assessed our error rates we are ready for some real analysis.

OTU Clustering

exchange Switch to short tutorial

In this tutorial we will continue with an OTU-based approach, for the phylotype and phylogenic approaches, please refer to the mothur wiki page.

Comment: Background: What are Operational Taxonomic Units (OTUs)?

In 16S metagenomics approaches, OTUs are clusters of similar sequence variants of the 16S rDNA marker gene sequence. Each of these clusters is intended to represent a taxonomic unit of a bacteria species or genus depending on the sequence similarity threshold. Typically, OTU cluster are defined by a 97% identity threshold of the 16S gene sequence variants at species level. 98% or 99% identity is suggested for strain separation.

(Image credit: Danzeisen et al. 2013, 10.7717/peerj.237)

Remove Mock Sample

Now that we have cleaned up our data set as best we can, and assured ourselves of the quality of our sequencing pipeline by considering a mock sample, we are almost ready to cluster and classify our real data. But before we start, we should first remove the Mock dataset from our data, as we no longer need it. We do this using the Remove.groups tool:

Hands On: Remove Mock community from our dataset

Remove.groups ( Galaxy version 1.39.5.0) with the following parameters

param-select “Select input type”: fasta , name, taxonomy, or list with a group file or count table

param-file “group or count table”: the pick.count_table output from Remove.lineage tool

param-select “groups”: Mock

param-file “fasta”: the pick.fasta output from Remove.lineage tool

param-file “taxonomy”: the pick.taxonomy output from Remove.lineage tool

Cluster sequences into OTUs

There are several ways we can perform clustering. For the Mock community, we used the traditional approach of using the Dist.seqs and Cluster tools. Alternatively, we can also use the Cluster.split tool. With this approach, the sequences are split into bins, and then clustered with each bin. Taxonomic information is used to guide this process. The Schloss lab have published results showing that if you split at the level of Order or Family, and cluster to a 0.03 cutoff, you’ll get just as good of clustering as you would with the “traditional” approach. In addition, this approach is less computationally expensive and can be parallelized, which is especially advantageous when you have large datasets.

We’ll now use the Cluster tool, with taxlevel set to 4, requesting that clustering be done at the Order level.

Hands On: Cluster our data into OTUs

Cluster.split ( Galaxy version 1.39.5.0) with the following parameters

“Split by”: Classification using fasta

param-file “fasta”: the fasta output from Remove.groups tool

param-file “taxonomy”: the taxonomy output from Remove.groups tool

param-file “name file or count table”: the count table output from Remove.groups tool

“taxlevel”: 4

“cutoff”: 0.03

Next we want to know how many sequences are in each OTU from each group and we can do this using the Make.shared tool. Here we tell mothur that we’re really only interested in the 0.03 cutoff level:

Make.shared ( Galaxy version 1.39.5.0) with the following parameters

param-file “list”: the list output from Cluster.split tool

param-file “count”: the count table from Remove.groups tool

“label”: 0.03

We probably also want to know the taxonomy for each of our OTUs. We can get the consensus taxonomy for each OTU using the Classify.otu tool:

Classify.otu ( Galaxy version 1.39.5.0) with the following parameters

param-file “list”: the list output from Cluster.split tool

param-file “count”: the count table from Remove.groups tool

param-file “taxonomy”: the taxonomy output from Remove.groups tool

“label”: 0.03

Examine galaxy-eye the taxonomy output of Classify.otu tool. This is a collection, and the different levels of taxonomy are shown in the names of the collection elements. In this example we only calculated one level, 0.03. This means we used a 97% similarity threshold. This threshold is commonly used to differentiate at species level.

Opening the taxonomy output for level 0.03 (meaning 97% similarity, or species level) shows a file structured like the following:

OTU       Size    Taxonomy
..
Otu0008	5260	Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(100);"Bacteroidales"(100);"Rikenellaceae"(100);Alistipes(100);
Otu0009	3613	Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(100);"Bacteroidales"(100);"Porphyromonadaceae"(100);"Porphyromonadaceae"_unclassified(100);
Otu0010	3058	Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Lactobacillaceae(100);Lactobacillus(100);
Otu0011	2958	Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(100);"Bacteroidales"(100);"Porphyromonadaceae"(100);"Porphyromonadaceae"_unclassified(100);
Otu0012	2134	Bacteria(100);"Bacteroidetes"(100);"Bacteroidia"(100);"Bacteroidales"(100);"Porphyromonadaceae"(100);"Porphyromonadaceae"_unclassified(100);
Otu0013	1856	Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Lactobacillaceae(100);Lactobacillus(100);
..

The first line shown in the snippet above indicates that Otu008 occurred 5260 times, and that all of the sequences (100%) were binned in the genus Alistipes.

Question

Which samples contained sequences belonging to an OTU classified as Staphylococcus?

Examine the tax.summary file output by Classify.otu tool.

Samples F3D141, F3D142, F3D144, F3D145, F3D2. This answer can be found by examining the tax.summary output and finding the columns with nonzero values for the line of Staphylococcus

Before we continue, let’s remind ourselves what we set out to do. Our original question was about the stability of the microbiome and whether we could observe any change in community structure between the early and late samples.

Because some of our sample may contain more sequences than others, it is generally a good idea to normalize the dataset by subsampling.

Hands On: Subsampling

First we want to see how many sequences we have in each sample. We’ll do this with the Count.groups tool:

Count.groups ( Galaxy version 1.39.5.0) with the following parameters

param-file “shared”: the shared file from Make.shared tool

Question

How many sequences did the smallest sample consist of?

The smallest sample is F3D143, and consists of 2389 sequences. This is a reasonable number, so we will now subsample all the other samples down to this level.

Sub.sample ( Galaxy version 1.39.5.0) with the following parameters

“Select type of data to subsample”: OTU Shared

param-file “shared”: the shared file from Make.shared tool

“size”: 2389

Question

What would you expect the result of count.groups on this new shared output collection to be? Check if you are correct.

all groups (samples) should now have 2389 sequences. Run count.groups again on the shared output collection by the sub.sample tool to confirm that this is indeed what happened.

Note: since subsampling is a stochastic process, your results from any tools using this subsampled data will deviate from the ones presented here.

Diversity Analysis

exchange Switch to short tutorial

Species diversity is a valuable tool for describing the ecological complexity of a single sample (alpha diversity) or between samples (beta diversity). However, diversity is not a physical quantity that can be measured directly, and many different metrics have been proposed to quantify diversity by Finotello et al. 2016.

Comment: Background: Species Diversity

Species diversity consists of three components: species richness, taxonomic or phylogenetic diversity and species evenness.

Species richness = the number of different species in a community.

Species evenness = how even in numbers each species in a community is.

Phylogenetic diversity = how closely related the species in a community are.

Each of these factors play a role in diversity, but how to combine them into a single measure of diversity is nontrivial. Many different metrics have been proposed for this, for example: shannon, chao, pd, ace, simpson, sobs, jack, npshannon, smithwilson, heip bergerparker, boney, efron, shen, solow, bootstrap, qstat, coverage, anderberg, hamming, jclass, jest, ochiai, canberra, thetayc, invsimpson, just to name a few ;). A comparison of several different diversity metrics is discussed in Bonilla-Rosso et al. 2012

Question

To understand the difference between richness and evenness, consider the following example:

Which of these communities has the highest richness?

Which of these communities has the highest evenness?

Both communities have 4 different species, so they have same richness.

Community B is more even, because each species has the same abundance.

Even when two samples have identical richness and evenness, we still may conclude that one is more diverse than the other if the species are very dissimilar in one of the samples (have high phylogenetic distance), but very closely related to each other in the second sample.

Now, you do not need to know what all these different metrics are, but just remember that there is not a single definition of diversity and as always, the metric you choose to use may influence your results.

Alpha diversity

In order to estimate alpha diversity of the samples, we first generate the rarefaction curves. Recall that rarefaction measures the number of observed OTUs as a function of the subsampling size.

Comment: Background: Rarefaction

To estimate the fraction of species sequenced, rarefaction curves are typically used. A rarefaction curve plots the number of species as a function of the number of individuals sampled. The curve usually begins with a steep slope, which at some point begins to flatten as fewer species are being discovered per sample: the gentler the slope, the less contribution of the sampling to the total number of operational taxonomic units or OTUs.

Green, most or all species have been sampled; blue, this habitat has not been exhaustively sampled; red, species rich habitat, only a small fraction has been sampled.

(A Primer on Metagenomics Wooley et al. 2010 )

We calculate rarefaction curves with the Rarefaction.single tool tool:

Hands On: Calculate Rarefaction

Rarefaction.single ( Galaxy version 1.39.5.0) with the following parameters

param-file “shared”: the shared file from Make.shared tool

Note that we used the default diversity measure here (sobs; observed species richness), but there are many more options available under the calc parameter. Descriptions of some of these calculators can be found on the mothur wiki.

Examine the rarefaction curve output.

numsampled    0.03-F3D0    lci-F3D0    hci-F3D0    0.03-F3D1   ...
            1.0000       1.0000      1.0000      1.0000
         41.6560      35.0000     48.0000     45.0560
         59.0330      51.0000     67.0000     61.5740
         70.5640      62.0000     79.0000     71.4700
         78.8320      71.0000     87.0000     78.4730
         85.3650      77.0000     94.0000     83.9990
...

This file displays the number of OTUs identified per amount of sequences used (numsampled). What we would like to see is the number of additional OTUs identified when adding more sequences reaching a plateau. Then we know we have covered our full diversity. This information would be easier to interpret in the form of a graph. Let’s plot the rarefaction curve for a couple of our sequences:

Hands On: Plot Rarefaction

Plotting tool - for multiple series and graph types ( Galaxy version 1.0.2) with the following parameters

“Plot Title”: Rarefaction

“Label for x axis”: Number of Sequences

“Label for y axis”: Number of OTUs

“Output File Type”: PNG

param-repeat Click on Insert Series,

param-collection “Dataset”: rarefaction curve collection

“Header in first line?”: Yes

“Column for x axis”: Column 1

“Column for y-axis”: Column 2 and Column 5 and every third column until the end (we are skipping the low confidence and high confidence interval columns)

View the rarefaction plot output. From this image can see that the rarefaction curves for all samples have started to level off so we are confident we cover a large part of our sample diversity:

Finally, let’s use the Summary.single tool to generate a summary report. The following step will randomly subsample down to 2389 sequences, repeat this process 1000 times, and report several metrics:

Hands On: Summary.single

Summary.single ( Galaxy version 1.39.5.2) with the following parameters

param-file “share”: the shared file from Make.shared tool

“calc”: nseqs,coverage,sobs,invsimpson

“size”: 2389

View the summary output from Summary.single tool. This shows several alpha diversity metrics:

sobs: observed richness (number of OTUs)
coverage: Good’s coverage index (1 - (number of OTUs containing a single sequence / total number of sequences ))
invsimpson: Inverse Simpson Index (1 / probability that two random individuals represent the OTU)
nseqs: number of sequences

label   group   sobs          coverage    invsimpson   invsimpson_lci   invsimpson_hci  nseqs
03    F3D0    167.000000    0.994697    25.686387    24.648040        26.816067       6223.000000
03    F3D1    145.000000    0.994030    34.598470    33.062155        36.284520       4690.000000
03    F3D141  154.000000    0.991060    19.571632    18.839994        20.362390       4698.000000
03    F3D142  141.000000    0.978367    17.029921    16.196090        17.954269       2450.000000
03    F3D143  135.000000    0.980738    18.643635    17.593785        19.826728       2440.000000
03    F3D144  161.000000    0.980841    15.296728    14.669208        15.980336       3497.000000
03    F3D145  169.000000    0.991222    14.927279    14.494740        15.386427       5582.000000
03    F3D146  161.000000    0.989167    22.266620    21.201364        23.444586       3877.000000
03    F3D147  210.000000    0.995645    15.894802    15.535594        16.271013       12628.000000
03    F3D148  176.000000    0.995725    17.788205    17.303206        18.301177       9590.000000
03    F3D149  194.000000    0.994957    21.841083    21.280343        22.432174       10114.000000
03    F3D150  164.000000    0.989446    23.553161    22.462533        24.755101       4169.000000
03    F3D2    179.000000    0.998162    15.186238    14.703161        15.702137       15774.000000
03    F3D3    127.000000    0.994167    14.730640    14.180453        15.325243       5315.000000
03    F3D5    138.000000    0.990523    29.415378    28.004777        30.975621       3482.000000
03    F3D6    155.000000    0.995339    17.732145    17.056822        18.463148       6437.000000
03    F3D7    126.000000    0.991916    13.343631    12.831289        13.898588       4082.000000
03    F3D8    158.000000    0.992536    23.063894    21.843396        24.428855       4287.000000
03    F3D9    162.000000    0.994803    24.120541    23.105499        25.228865       5773.000000

There are a couple of things to note here:

The differences in diversity and richness between early and late time points is small.
All sample coverage is above 97%.

There are many more diversity metrics, and for more information about the different calculators available in mothur, see the mothur wiki page

We could perform additional statistical tests (e.g. ANOVA) to confirm our feeling that there is no significant difference based on sex or early vs. late, but this is beyond the scope of this tutorial.

Beta diversity

Beta diversity is a measure of the similarity of the membership and structure found between different samples. The default calculator in the following section is thetaYC, which is the Yue & Clayton theta similarity coefficient. We will also calculate the Jaccard index (termed jclass in mothur).

We calculate this with the Dist.shared tool, which will rarefy our data.

Hands On: Beta diversity

Dist.shared ( Galaxy version 1.39.5.0) with the following parameters

param-file “shared”: to the shared file from Make.shared tool

“calc”: thetayc,jclass

“subsample”: 2389

Let’s visualize our data in a Heatmap:

Heatmap.sim ( Galaxy version 1.39.5.0) with the following parameters

“Generate Heatmap for”: phylip

param-collection “phylip”: the output of Dist.shared tool (this is a collection input)

Look at some of the resulting heatmaps (you may have to download the SVG images first). In all of these heatmaps the red colors indicate communities that are more similar than those with black colors.

For example this is the heatmap for the thetayc calculator (output thetayc.0.03.lt.ave):

and the jclass calulator (output jclass.0.03.lt.ave):

When generating Venn diagrams we are limited by the number of samples that we can analyze simultaneously. Let’s take a look at the Venn diagrams for the first 4 time points of female 3 using the Venn tool:

Hands On: Venn diagram

Venn ( Galaxy version 1.39.5.0) with the following parameters

param-collection “OTU Shared”: output from Sub.sample tool (collection)

“groups”: F3D0,F3D1,F3D2,F3D3

This generates a 4-way Venn diagram and a table listing the shared OTUs.

Examine the Venn diagram:

This shows that there were a total of 180 OTUs observed between the 4 time points. Only 76 of those OTUs were shared by all four time points. We could look deeper at the shared file to see whether those OTUs were numerically rare or just had a low incidence.

Next, let’s generate a dendrogram to describe the similarity of the samples to each other. We will generate a dendrogram using the jclass and thetayc calculators within the Tree.shared tool:

Hands On: Tree

Tree.shared ( Galaxy version 1.39.5.0) with the following parameters

“Select input format”: Phylip Distance Matrix

param-collection “phylip”: the distance files output from Dist.shared tool

Newick display ( Galaxy version 1.6+galaxy1) with the following parameters

param-collection “Newick file”: output from Tree.shared tool

Inspection of the the tree shows that the early and late communities cluster with themselves to the exclusion of the others.

thetayc.0.03.lt.ave:

jclass.0.03.lt.ave:

Visualisations

Krona

A tool we can use to visualize the composition of our community, is Krona

Hands On: Krona

First we convert our mothur taxonomy file to a format compatible with Krona

Taxonomy-to-Krona ( Galaxy version 1.0) with the following parameters

param-collection “Taxonomy file”: the taxonomy output from Classify.otu

Krona pie chart ( Galaxy version 2.7.1+galaxy0) with the following parameters

“Type of input”: Tabular

param-collection “Input file”: the taxonomy output from Taxonomy-to-Krona tool

The resulting file is an HTML file containing an interactive visualization. For instance try double-clicking the innermost ring labeled “Bacteroidetes” below:

Question

What percentage of your sample was labelled Lactobacillus?

Explore the Krona plot, double click on Firmicutes, here you should see Lactobacillus clearly (16% in our case), click on this segment and the right-hand side will show you the percentages at any point in the hierarchy (here 5% of all)

Exercise: generating per-sample Krona plots (Optional)

You may have noticed that this plot shows the results for all samples together. In many cases however, you would like to be able to compare results for different samples.

In order to save computation time, mothur pools all reads into a single file, and uses the count table file to keep track of which samples the reads came from. However, Krona does not understand the mothur count table format, so we cannot use that to supply information about the groups. But luckily we can get Classify.otu tool to output per-sample taxonomy files. In the following exercise, we will create a Krona plot with per-sample subplots.

Question: Exercise: per-sample plots

Try to create per-sample Krona plots. An few hints are given below, and the full answer is given in the solution box.

Re-run galaxy-refresh the Classify.otu tool tool we ran earlier

See if you can find a parameter to output a taxonomy file per sample (group)

Run Taxonomy-to-Krona tool again on the per-sample taxonomy files (collection)

Run Krona tool

Find the previous run of Classify.otu tool in your history

Hit the rerun button galaxy-refresh to load the parameters you used before:

param-file “list”: the list output from Cluster.split tool

param-file “count”: the count table from Remove.groups tool

param-file “taxonomy”: the taxonomy output from Remove.groups tool

“label”: 0.03

Add new parameter setting:

“persample - allows you to find a consensus taxonomy for each group”: Yes

You should now have a collection with per-sample files

Taxonomy-to-Krona ( Galaxy version 1.0) with the following parameters

param-collection “Taxonomy file”: the taxonomy collection from Classify.otu tool

Krona pie chart ( Galaxy version 2.7.1+galaxy0) with the following parameters

“Type of input”: Tabular

param-collection “Input file”: the collection from Taxonomy-to-Krona tool

“Combine data from multiple datasets?”: No

The final result should look something like this (switch between samples via the list on the left):

Conclusion

Well done! trophy You have completed the basics of the Schloss lab’s Standard Operating Procedure for Illumina MiSeq data. You have worked your way through the following pipeline:

You've Finished the Tutorial

Key points

16S rRNA gene sequencing analysis results depend on the many algorithms used and their settings

Quality control and cleaning of your data is a crucial step in order to obtain optimal results

Adding a mock community to serve as a control sample can help you asses the error rate of your experimental setup

We can explore alpha and beta diversities using Krona and Phinch for dynamic visualizations

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

DeSantis, T. Z., P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie et al., 2006 Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Applied and Environmental Microbiology 72: 5069–5072. 10.1128/aem.03006-05
Liu, Z., T. Z. DeSantis, G. L. Andersen, and R. Knight, 2008 Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36: e120–e120. 10.1093/nar/gkn491
Schloss, P. D., S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann et al., 2009 Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75: 7537–7541.
Wooley, J. C., A. Godzik, and I. Friedberg, 2010 A Primer on Metagenomics (P. E. Bourne, Ed.). PLoS Computational Biology 6: e1000667. 10.1371/journal.pcbi.1000667
Federhen, S., 2011 The NCBI Taxonomy database. Nucleic Acids Research 40: D136–D143. 10.1093/nar/gkr1178
Bonilla-Rosso, G., L. E. Eguiarte, D. Romero, M. Travisano, and V. Souza, 2012 Understanding microbial community diversity metrics derived from metagenomes: performance evaluation using simulated data sets. FEMS Microbiology Ecology 82: 37–49. 10.1111/j.1574-6941.2012.01405.x
Quast, C., E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer et al., 2012 The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research 41: D590–D596. 10.1093/nar/gks1219
Schloss, P. D., 2012 Secondary structure improves OTU assignments of 16S rRNA gene sequences. The ISME Journal 7: 457–460. 10.1038/ismej.2012.102
Cole, J. R., Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell et al., 2013 Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research 42: D633–D642. 10.1093/nar/gkt1244
Finotello, F., E. Mastrorilli, and B. D. Camillo, 2016 Measuring the diversity of the human microbiota with targeted next-generation sequencing. Briefings in Bioinformatics bbw119. 10.1093/bib/bbw119
Fouhy, F., A. G. Clooney, C. Stanton, M. J. Claesson, and P. D. Cotter, 2016 16S rRNA gene sequencing of mock microbial populations- impact of DNA extraction method, primer choice and sequencing platform. BMC Microbiology 16: 10.1186/s12866-016-0738-z
Rognes, T., T. Flouri, B. Nichols, C. Quince, and F. Mahé, 2016 VSEARCH: a versatile open source tool for metagenomics. PeerJ 4: e2584. 10.7717/peerj.2584
Singer, E., B. Andreopoulos, R. M. Bowers, J. Lee, S. Deshpande et al., 2016 Next generation sequencing data of a defined microbial mock community. Scientific Data 3: 160081. 10.1038/sdata.2016.81
Balvočiūtė, M., and D. H. Huson, 2017 SILVA, RDP, Greengenes, NCBI and OTT — how do these taxonomies compare? BMC Genomics 18: 10.1186/s12864-017-3501-4

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Saskia Hiltemann, Bérénice Batut, Dave Clements, 16S Microbial Analysis with mothur (extended) (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/mothur-miseq-sop/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{microbiome-mothur-miseq-sop,
author = "Saskia Hiltemann and Bérénice Batut and Dave Clements",
	title = "16S Microbial Analysis with mothur (extended) (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/mothur-miseq-sop/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/microbiome/tutorials/mothur-miseq-sop/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: taxonomy_krona_chart
  owner: crs4
  revisions: c9f8fef1df74
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: xy_plot
  owner: devteam
  revisions: ecb437f1d298
  tool_panel_section_label: Graph/Display Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_align_seqs
  owner: iuc
  revisions: abf7c15ecfef
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_chimera_vsearch
  owner: iuc
  revisions: 12e856482d26
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_classify_otu
  owner: iuc
  revisions: ff1bc0299372
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_classify_seqs
  owner: iuc
  revisions: '09bf9e496c64'
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_cluster
  owner: iuc
  revisions: 6366f4483b35
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_cluster_split
  owner: iuc
  revisions: 19f45bdbffd9
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_count_groups
  owner: iuc
  revisions: dc3826bfc1ed
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_count_seqs
  owner: iuc
  revisions: ffee46220ddd
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_dist_seqs
  owner: iuc
  revisions: 4d76fba83f24
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_dist_shared
  owner: iuc
  revisions: 418100660ba3
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_filter_seqs
  owner: iuc
  revisions: 3524ca06478e
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_get_groups
  owner: iuc
  revisions: 03af14e245fb
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_heatmap_sim
  owner: iuc
  revisions: 7f5707dd6073
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_make_biom
  owner: iuc
  revisions: 0b5397a74165
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_make_contigs
  owner: iuc
  revisions: ab12df91b47f
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_make_shared
  owner: iuc
  revisions: d0967ae326be
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_pre_cluster
  owner: iuc
  revisions: 6532c3424bfe
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_rarefaction_single
  owner: iuc
  revisions: 47264675a4ea
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_remove_groups
  owner: iuc
  revisions: 498dc0671d9e
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_remove_lineage
  owner: iuc
  revisions: 947faa8b3205
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_remove_seqs
  owner: iuc
  revisions: '008f62670efc'
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_screen_seqs
  owner: iuc
  revisions: f4d7d5755f01
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_seq_error
  owner: iuc
  revisions: 5a7674ac2c59
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_sub_sample
  owner: iuc
  revisions: 67cb43432196
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_summary_seqs
  owner: iuc
  revisions: 600c14c6dcc5
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_summary_single
  owner: iuc
  revisions: 72be09dfd803
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_taxonomy_to_krona
  owner: iuc
  revisions: db30eb96ea23
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_tree_shared
  owner: iuc
  revisions: 8bca555fbcaf
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_unique_seqs
  owner: iuc
  revisions: a033f325ad8d
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_venn
  owner: iuc
  revisions: 4db54b1c1d83
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: newick_utils
  owner: iuc
  revisions: fe4235285ee8
  tool_panel_section_label: Graph/Display Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

t{ hist[0] | to_stars }} 36

t{ hist[0] | to_stars }} 7

t{ hist[0] | to_stars }} 3

t{ hist[0] | to_stars }} 1

October 2024

5 stars: Liked: Very detailed and easy to follow steps. The aditional information is very useful for beginners! Disliked: Maybe add some tips on how to adapt the tutorial to our own data, I didn't find it

August 2024

5 stars: Liked: Well structured, easy to follow

October 2023

5 stars: Liked: easy to follow, step-by-step explanation Disliked: alternative workflows for different situations (for instance workflow w/o mock community removal)

August 2022

5 stars: Liked: It was explained in really easy to understand language for a novice like me with great diagrams. I've looked at a number of explanations about alpha and beta diversity and this was the most useful so far.

June 2022

5 stars: Liked: The clarity and flow of topics guidelines Disliked: so far so good

5 stars: Liked: Well-detailed Disliked: Alphadiversity

April 2022

5 stars: Liked: Each step of the analysis is very well explained and I could use this tutorial to help student understand 16S microbiome analysis. Disliked: The trainsets need updating. I tried to do this myself but it did not work. Also, could you do a version for ITS analysis?

March 2022

3 stars: Liked: nothing Disliked: nothing

5 stars: Liked: nothing Disliked: everything

July 2021

5 stars: Liked: Awesome! Thanks a lot

February 2021

5 stars: Liked: Very informative yet concise. Did an excellent job of walking through the parameters needed for each step. Also appreciated the walking through of visualization and statistical significance rather than ending at the OTU clustering.

January 2021

5 stars: Liked: The explaination about each step

August 2020

5 stars: Liked: Absolutamente todo. Cada sección perfectamente bien ejemplificado con teoría

June 2020

5 stars: Liked: Very interactive and gave detailed instructions about what to do and why. Very beginner friendly Disliked: Give brief introduction about the files extensions (for example what is a FASTQ/FASTA file?)

May 2020

4 stars: Liked: it is hands on course Disliked: I did not understand meaning of many options.

April 2020

5 stars: Liked: The fact that there was an exlpanation on why we are doing all these steps Disliked: Maybe adding PCA

February 2020

5 stars: Liked: It was quite easy to follow and understand Disliked: I saw 0.31 instead of 0.03 in the cut off box several times, would be better if it is corrected to 0.03 which is equal to 97%

December 2019

3 stars: Liked: I liked the step-by-step how to instructions Disliked: I got lost at "make.contigs" - could not find out how to use that tool in Galaxy :(

October 2019

5 stars: Liked: Perfectly explained Disliked: The use of the Venn software was not available

September 2019

5 stars: Liked: