Table of Contents
Introduction to taxonomic analysis of amplicon and shotgun data using QIIME
General information
Resources used
Tutorial objectives
Short introduction to Linux
De novo OTU picking and diversity analysis using 454 data
Prepare files
To denoise or not to denoise?
Picking Operational Taxonomic Units (OTUs)
View OTU statistics
Visualize taxonomic composition
Alpha diversity within samples and rarefaction curves
Beta diversity and beta diversity plots
Closed reference OTU picking of 16S ribosomal rRNA fragments selected from a shotgun data set
Extraction of 16S rRNA sequence-containing reads with rRNASelector
Closed-reference OTU picking workflow and visualization of results in Megan 5
Extreme challenge
Finally
General information
The following standard icons are used in the hands-on exercises to help you
locate:
Important information
General information / notes
Steps to follow
Questions to be answered
Warning: PLEASE take care and read carefully
Optional bonus exercise
Resources used
QIIME (http://qiime.org/index.html)
Sutton et al. (2013). Impact of Long-Term Diesel Contamination on Soil Microbial
Community Structure. Appl. Environ. Microbiol. 79(2):619-630.
Sutcliffe et al. (2013). Draft Genome Sequence of Thermotoga maritima A7A
Reconstructed from Metagenomic Sequencing Analysis of a Hydrocarbon Reservoir
in the Bass Strait, Australia. Genome Announc. 1(5): e00688-13.
Li et al. (2013). Draft Genome Sequence of Thermoanaerobacter sp. Strain A7A,
Reconstructed from a Metagenome Obtained from a High-Temperature Hydrocarbon
Reservoir in the Bass Strait, Australia. Genome Announc. 1(5): e00701-13.
Lee et al. (2011). rRNASelector: a computer program for selecting ribosomal RNA
encoding sequences from metagenomic and metatranscriptomic shotgun libraries. J.
Microbiol. 49(4):689-691.
Tutorial objectives
In this tutorial we will look at the open source software package QIIME (pronounced
"chime"). QIIME stands for Quantitative Insights Into Microbial Ecology. The
package contains many tools that enable users to analyse and compare microbial
communities. QIIME was originally developed to analyse Roche 454 amplicon
sequencing data. In the latest versions, workflows have been added to analyse data
from different sequencing platforms, such as Illumina, and different types of data,
such as shotgun data. In this course we will use QIIME 1.8, which is the latest version
at the time of writing.
We will (re-)introduce you to the Linux operating system to a basic level that is
sufficient to run bioinformatics software from preconfigured Linux installations such
as the one we will be using today.
After completion of this tutorial, you should be able to perform a taxonomic analysis
on a Roche 454 16S rRNA amplicon dataset. In addition you should be able to do 16S
taxonomic analysis on shotgun data using the tool rRNASelector in combination with
QIIME.
Finally you should be able to work out solutions for datasets from other platforms
such as Illumina from the information you find on the QIIME web site
(http://qiime.org/).
Short introduction to Linux
We assume you have successfully booted your computer into Linux and you have the
Linux desktop on your screen.
We'll do the first steps together to get you going as quickly as possible. One
piece of advice: try to type the commands into your terminal rather than copying and
pasting them, as it will help you understand the commands better. Also, Microsoft Word has
replaced certain characters, e.g. straight quotes with smart quotes; these and a number
of other characters cause trouble when pasted into a terminal.
Most of what we will do will be run from the command line, and before we can issue
any commands we need a terminal window. Click on Applications at the top left
of your desktop, then go to Accessories and click on the first Terminal menu
item. A terminal window should appear on your desktop.
At the prompt (which ends with $) you can type commands. During this tutorial we
will represent the prompt as $ for brevity. Do not type a dollar character at the
beginning of any of the commands; only type what follows it. To execute a command,
press Return/Enter.
Type the following command, followed by Enter, to list the files and directories
(folders):
$ ls
You will be presented with a list of the contents of your home directory. Note that
Linux does not have a concept of disks like Windows (e.g. C:\). Instead it has so-called
mount points, all attached below a single root directory (/). /home is where by default
the home directories of all users are located. /usr is where a lot of the operating
system and programs reside. The directory structure of a Linux system outside the /home
area is not important for this tutorial. Also note that where Windows uses backslashes to
separate directories, Linux uses forward slashes. The desktop is in the folder
/home/trainee/Desktop.
To move into the desktop folder and list its contents, type:
$ cd Desktop
$ ls -l
Note that file and directory names are case-sensitive: cd desktop does not work. The
-l option after the ls command tells ls to show a long, more detailed listing of the
directory contents showing file permissions, owners and date stamps. On your
desktop is a folder called Taxonomy, which contains the necessary files for this
tutorial. You can probably work out how to enter this directory now. There are a few
more tips for moving around:
To go to your home directory, type one of the following (note ~ is short for your
home directory):
$ cd
$ cd ~
The command t is an alias that we have set up to make life easier. You can create
your own aliases for commands that you use frequently using alias, e.g.
$ alias d='cd ~/Desktop'
From that moment on, you only need to type d followed by Enter to go to your
desktop folder.
To move up one directory level (e.g. when you are in Taxonomy and want to go back
to Desktop), type:
$ cd ..
To view long text files one screen at a time, use less. Exit less by typing q at the
colon.
$ less filename
To copy a file:
$ cp file newfile
To move a file (here, one directory level up):
$ mv file ..
To remove/delete a file:
$ rm file
For now this is all we need to know to start the tutorial proper.
De novo OTU picking and diversity analysis using 454 data
Prepare files
Note that the > character redirects the output from the screen to a new file. Please
inspect the files with less, or with less -S to truncate long lines. You will notice that
there is no obvious association of the reads with a particular sample.
Alternatively, just look at the fasta headers and ignore the DNA sequences by using the
command grep. You need to extract all lines that start with > and send the output
to less to be able to view it one screen at a time. The command to give is:
$ grep '^>' sutton.fna | less
'^>' is a so-called regular expression. The ^ character means "starts with", so the
grep command looks for all lines that start with >. The quotes stop the shell from
interpreting > as a redirect. The pipe character, |, is used to pipe or stream the
output from the first command (grep) into a second command (less).
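As a small self-contained illustration of the anchor, the pipe and the redirect described above, the following sketch builds a two-read fasta file and counts its header lines. The file names demo.fna and headers.txt are made up for this example and are not part of the tutorial data.

```shell
# Hypothetical demo file; the tutorial's real input is sutton.fna.
tmp=$(mktemp -d)
printf '>read1 sample\nACGT\n>read2 sample\nGGCC\n' > "$tmp/demo.fna"

# Keep only lines that start with '>' and redirect them to a new file.
grep '^>' "$tmp/demo.fna" > "$tmp/headers.txt"

# Pipe the headers into a second command to count them.
n_headers=$(grep '^>' "$tmp/demo.fna" | wc -l | tr -d ' ')
echo "$n_headers"
```

Because every read has exactly one header line, this count is also the number of reads in the file.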
Now view the file containing the quality scores:
$ less -S sutton.qual
#SampleID  BarcodeSequence  LinkerPrimerSequence  Description
D5         TACAGATCGT       CCTAYGGGRBGCASCAG     D5_Sand_Polluted
E1         TACGCTGTCT       CCTAYGGGRBGCASCAG     E1_Fill_Clean
E2         TAGTGTAGAT       CCTAYGGGRBGCASCAG     E2_Fill_Polluted
F1         ACAGTATATA       CCTAYGGGRBGCASCAG     F1_Sand_Clean
F2         ACGCGATCGA       CCTAYGGGRBGCASCAG     F2_Sand_Polluted
G1         TCTAGCGACT       CCTAYGGGRBGCASCAG     G1_Fill_Clean
G2         TCTATACTAT       CCTAYGGGRBGCASCAG     G2_Fill_Clean
G3         TGACGTATGT       CCTAYGGGRBGCASCAG     G3_Fill_clean
H1         TACTCTCGTG       CCTAYGGGRBGCASCAG     H1_Peat_Clean
H2         TAGAGACGAG       CCTAYGGGRBGCASCAG     H2_Peat_Clean
H3         TCGTCGCTCG       CCTAYGGGRBGCASCAG     H3_Sand_Clean
I1         TCGATCACGT       CCTAYGGGRBGCASCAG     I1_Sand_Clean
I2         TCGCACTAGT       CCTAYGGGRBGCASCAG     I2_Sand_Clean
There shouldn't be any errors. If there are errors, a corrected mapping file will be
written to the directory mapping_output.
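The validation itself is done by QIIME. As a rough illustration of one thing such a check looks for, the sketch below tests a mapping file for duplicated barcodes. The file demo_map.txt and its layout (tab-separated, a header line starting with #, barcode in column 2) are assumptions made for this example.

```shell
tmp=$(mktemp -d)
# Assumed layout: tab-separated, header line starting with '#', barcode in column 2.
printf '#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n' > "$tmp/demo_map.txt"
printf 'D5\tTACAGATCGT\tCCTAYGGGRBGCASCAG\tD5_Sand_Polluted\n' >> "$tmp/demo_map.txt"
printf 'E1\tTACGCTGTCT\tCCTAYGGGRBGCASCAG\tE1_Fill_Clean\n' >> "$tmp/demo_map.txt"

# List any barcode that occurs more than once (empty output means all are unique).
dup_barcodes=$(grep -v '^#' "$tmp/demo_map.txt" | cut -f 2 | sort | uniq -d)

# Count the sample rows, skipping the header.
n_samples=$(grep -vc '^#' "$tmp/demo_map.txt")
echo "samples: $n_samples, duplicated barcodes: ${dup_barcodes:-none}"
```

A duplicated barcode would make it impossible to tell two samples apart, which is why it is worth checking before assigning reads.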
Assign samples to the reads
Using the mapping file and the Sutton fasta and quality files, we are now going to
assign samples to the reads.
Type the following command on a single line:
$ split_libraries.py -m mapping.txt -f sutton.fna -q sutton.qual -o split_library_output -b 10 -L 500
$ less -S split_library_output/seqs.fna
You will see the reads are now batched to their sample.
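To see how many reads ended up in each sample, the headers can be tallied. The sketch below assumes that each header starts with the sample ID followed by an underscore and a running number (for example >D5_1); the file demo_seqs.fna and its contents are made up for illustration.

```shell
tmp=$(mktemp -d)
# Made-up demultiplexed reads; assumed header form: >SampleID_N followed by other fields.
printf '>D5_1 read1 orig_bc=TACAGATCGT\nACGT\n' > "$tmp/demo_seqs.fna"
printf '>E1_2 read2 orig_bc=TACGCTGTCT\nGGCC\n' >> "$tmp/demo_seqs.fna"
printf '>D5_3 read3 orig_bc=TACAGATCGT\nTTAA\n' >> "$tmp/demo_seqs.fna"

# Take the first field of each header, strip '>' and the trailing _N, then count per sample.
counts=$(awk '/^>/ { id = substr($1, 2); sub(/_[0-9]+$/, "", id); n[id]++ }
              END { for (s in n) print s, n[s] }' "$tmp/demo_seqs.fna" | sort)
echo "$counts"
```

Samples with very few reads are worth noting at this point, because they limit the depth available for the diversity analyses later on.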
To denoise or not to denoise?
The pyrosequencing technology employed by 454 sequencing machines produces
characteristic sequencing errors, mostly imprecise signals for longer homopolymer
runs. Most of the sequences contain no or only a few errors, but a few sequences
contain enough errors to be classified as an additional rare OTU. The goal of the
denoising procedure is to reduce the number of erroneous OTUs and thus increase
the accuracy of the whole QIIME pipeline. This is a computationally intensive
procedure, so we will not denoise our data today. There is a QIIME tutorial that outlines
the steps (http://qiime.org/tutorials/denoising_454_data.html) and also includes a
warning about new 454 flow patterns introduced in 2012. Note that only amplicon
data sets can be denoised with the described procedure. A denoised version of the full
dataset is available in the sutton_full_denoised folder.
Picking Operational Taxonomic Units (OTUs)
We will now use a workflow for de novo OTU picking, taxonomy assignment,
phylogenetic tree construction, and OTU table construction. QIIME has several
workflows to pick OTUs; we will be using the one described in the general overview
tutorial (http://qiime.org/tutorials/tutorial.html). It has 7 steps, which are described in
some detail in that tutorial.
Run the command below from the Taxonomy directory. This
step takes about 12 minutes to run. While it runs, please read through the different steps
(http://qiime.org/tutorials/tutorial.html) and try to understand the procedure.
Remember that an OTU is not the same as a species, but a bag/cluster of highly
similar sequences (at least 97% identity is common for bacteria/archaea), or a single
sequence in the case of rare OTUs.
$ pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus
Please do spend some time looking at the output of this pipeline, in particular the file
seqs_rep_set_tax_assignments.txt in the uclust_assigned_taxonomy directory. By
default QIIME uses the Greengenes 16S reference database to assign taxonomy. It has
the following levels: kingdom, phylum, class, order, family, genus, species. It will be
immediately clear that most reads cannot be classified down to species level.
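One way to put a number on this is to count how many assignments carry a species name. The sketch below assumes tab-separated lines with the taxonomy string in the second column and Greengenes-style rank prefixes (k__, p__, ..., s__), where an empty s__ means no species was assigned; the file demo_tax.txt and its three entries are made up.

```shell
tmp=$(mktemp -d)
# Made-up assignments; assumed columns: OTU id, taxonomy string, confidence.
printf 'OTU1\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Bacillus; s__cereus\t1.00\n' > "$tmp/demo_tax.txt"
printf 'OTU2\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__; f__; g__; s__\t0.67\n' >> "$tmp/demo_tax.txt"
printf 'OTU3\tk__Bacteria; p__Acidobacteria\t1.00\n' >> "$tmp/demo_tax.txt"

# Total number of OTUs, and the number whose s__ prefix is followed by an actual name.
n_total=$(wc -l < "$tmp/demo_tax.txt" | tr -d ' ')
n_species=$(grep -c 's__[A-Za-z]' "$tmp/demo_tax.txt")
echo "$n_species of $n_total OTUs assigned to species level"
```

The same pattern with g__ instead of s__ would count genus-level assignments.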
As described in step 6 of the QIIME overview tutorial, the pipeline creates a
Newick-formatted phylogenetic tree (rep_set.tre) in the otus directory. You can run the
program figtree either from the command line or by selecting FigTree from the menu on
your desktop (Applications -> Other -> FigTree), and view the tree by opening the file
rep_set.tre in the otus folder (Desktop -> Taxonomy -> otus). The tree that is
produced is too complex to be of much use. We will look at a different tool, Megan 5,
which produces a far more useful tree.
This removes OTUs with fewer than 2 sequences. If you use the k option instead of the
n option, OTUs with more than the specified number of sequences will be removed.
Megan can be opened from the menu under Applications -> Other. From the File
menu select Import -> BIOM format. Find your biom file and import it. Megan will
generate a tree that is far more informative than the one produced with FigTree. You
can change the way Megan displays the data by clicking on the various icons and
menu items. Please spend some time exploring your data. The Word Cloud
visualization is interesting, too, if you want to find out which samples are similar and
which samples stand out.
View OTU statistics
You can generate some statistics, e.g. the number of reads assigned, distribution
among samples. Some of the statistics are useful for further downstream analysis, e.g.
beta-diversity analysis. Run the following now, again from within the Taxonomy
directory, and look at the results. Write down the minimum value under
Counts/sample summary. We need it for beta-diversity analysis.
$ cd ../
$ biom summarize-table -i otus/otu_table.biom -o otus/otu_table_summary.txt
$ less otus/otu_table_summary.txt
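If you prefer not to read the minimum off the screen, it can be pulled out of the summary with awk. The sketch below assumes the summary contains a "Counts/sample summary:" line followed by an indented "Min:" line; demo_summary.txt is a made-up stand-in, so check the layout of your own otu_table_summary.txt before relying on it.

```shell
tmp=$(mktemp -d)
# Made-up summary; the layout is an assumption, not copied from a real run.
printf 'Num samples: 3\nCounts/sample summary:\n Min: 146.0\n Max: 150.0\n' > "$tmp/demo_summary.txt"
printf 'Counts/sample detail:\n D5: 146.0\n E1: 148.0\n F1: 150.0\n' >> "$tmp/demo_summary.txt"

# Print the first Min: value that appears after the summary header.
min_depth=$(awk '/^Counts\/sample summary/ { in_summary = 1 }
                 in_summary && $1 == "Min:" { print $2; exit }' "$tmp/demo_summary.txt")
echo "$min_depth"
```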
Visualize taxonomic composition
$ summarize_taxa_through_plots.py -i otus/otu_table.biom -o wf_taxa_summary -m mapping.txt
To view the output, open a web browser from the Applications -> Internet menu. You
can use Google Chrome, Firefox or Chromium.
In Google Chrome or Chromium, type CTRL-O, or in Firefox use the File menu, to
select Desktop -> Taxonomy -> wf_taxa_summary -> taxa_summary_plots and open
either area_charts.html or bar_charts.html. I prefer the bar charts myself. The top chart
visualizes taxonomic composition at phylum level for each of the samples. The next
chart goes down to class level, and each following chart goes one taxonomic level deeper.
The charts (particularly those near the top) are very useful for discovering how the
communities in your samples differ from each other. There is a similar plot in the
paper; if you have time, see how our analysis compares with the one described there.
Alpha diversity within samples and rarefaction curves
Alpha diversity is the microbial diversity within a sample. QIIME can calculate a lot
of metrics, but for our tutorial we generate 3 metrics from the alpha rarefaction
workflow: chao1 (an estimate of species richness), observed species (the count of
unique OTUs) and phylogenetic diversity. The following workflow generates rarefaction
plots to visualize alpha diversity.
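To give a feel for what a richness estimate such as chao1 does, the sketch below computes the classic Chao1 estimator, S_obs + F1^2 / (2 * F2), where S_obs is the number of observed OTUs, F1 the number of OTUs seen once and F2 the number seen twice. The abundances are made up, and QIIME's own implementation may differ in detail (for example by applying a bias correction), so treat this as an illustration only.

```shell
# Made-up OTU abundances for one sample (one count per OTU).
counts="5 1 1 2 2 3"

chao1=$(printf '%s\n' $counts | awk '
  { s_obs++; if ($1 == 1) f1++; if ($1 == 2) f2++ }
  END {
    if (f2 > 0) est = s_obs + (f1 * f1) / (2 * f2)   # classic Chao1
    else        est = s_obs + f1 * (f1 - 1) / 2      # fallback when no OTU was seen twice
    print est
  }')
echo "$chao1"
```

Here 6 OTUs are observed, 2 of them once and 2 of them twice, so the estimate is 6 + 4/4 = 7: the many rare OTUs suggest that at least one more OTU went unobserved.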
Run the following command from within your Taxonomy directory. This should take a
few minutes:
$ alpha_rarefaction.py -i otus/otu_table.biom -m mapping.txt -o wf_arare -t otus/rep_set.tre
First we are going to view the rarefaction curves in a web browser by opening
/home/trainee/Desktop/Taxonomy/wf_arare/alpha_rarefaction_plots/rarefaction_plots.html.
To start, select chao1 as the metric and Description as the category. It is clear that
the microbial diversity in some samples is much higher than in other samples. Click
around in the legend, as this will help you work out which line corresponds to which
sample. If you have time you could try to correlate species richness with
environmental data from the paper and establish whether our analysis confirms the
findings of the authors.
Next, view the precomputed rarefaction curves, which were generated at a greater
sequencing depth.
In general, the more reads you have, the more OTUs you will observe. If a rarefaction
curve starts to flatten, it means that you have probably sequenced at sufficient depth; in
other words, producing more reads will not add significantly more OTUs. If, on the
other hand, the curve hasn't flattened, you have not sampled enough to capture enough of
the microbial diversity, and by extrapolating the curve you may be able to estimate how
many more reads you will need. Consult the QIIME overview tutorial for further
information.
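The flattening can be illustrated with a toy example that counts distinct OTUs in the first 2, 5 and 10 reads of a sample. The read-to-OTU labels are made up, and the reads are simply taken in the order given; a real rarefaction analysis subsamples at random and averages over several repeats.

```shell
# Made-up OTU label for each of 10 reads, assumed to be in random order already.
reads="A A B A C A B D A A"

curve=""
for depth in 2 5 10; do
  # Number of distinct OTUs among the first $depth reads.
  n_otus=$(printf '%s\n' $reads | head -n "$depth" | sort -u | awk 'END { print NR }')
  curve="$curve$depth:$n_otus "
done
echo "$curve"
```

Going from 2 to 5 reads adds two OTUs, while doubling the depth again adds only one more, which is the pattern of a curve that is beginning to level off.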
Beta diversity and beta diversity plots
Read through the section on computing beta diversity in the QIIME overview tutorial and
try to understand this workflow. We will look at visualization of beta diversity
analysis results in more detail in the next tutorial, which focuses on visualization.
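Closed reference OTU picking of 16S ribosomal rRNA fragments selected from a shotgun data set
Extraction of 16S rRNA sequence-containing reads with rRNASelector
You will find a file called A7A-paired.fasta containing the sequence reads.
Start rRNASelector from the command line:
$ rRNASelector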
A graphical interface should appear. Load the sequence file by clicking on File
Choose at the top and navigating to the file A7A-paired.fasta. Select the file and click
Open. The tool will automatically fill in file names for the result files. Change the
Number of CPUs to 2, select Prokaryote 16S (to include both bacterial and archaeal
16S sequences) and specify the location of the hmmsearch file by clicking the second
File Choose button. You can also type the location manually: /usr/bin/hmmsearch. Next,
click Process. The run should take a few minutes to complete.
If all went well, you can close rRNASelector by clicking on Exit. You will have 3
new files in your directory: one containing untrimmed 16S reads, one containing
trimmed 16S reads (A7A-paired.prok.16s.trim.fasta; that's the one we want) and one
containing reads that do not contain (sufficient) 16S sequence.
Closed-reference OTU picking workflow and visualization of results in Megan 5
We are now ready to pick our OTUs. We do that by running the following command
(all on one line and no space after gg_otus-12-10):
$ pick_closed_reference_otus.py -i A7A-paired.prok.16s.trim.fasta -o ./cr_uc -r /mnt/workshop/tools/qiime_software/gg_otus-12_10release/rep_set/97_otus.fasta -t /mnt/workshop/tools/qiime_software/gg_otus-12_10release/taxonomy/97_otu_taxonomy.txt
Extreme challenge
Many of you will send off samples to be sequenced, and quite often providers
preprocess the raw data before handing it back to the client. If your reads were
demultiplexed and primers were removed, you have a problem, as the amplicon
workflows from QIIME rely on the presence of primers and barcodes. The first
message is that you should insist on getting your data as unprocessed reads. Not all is
lost if you end up with a dataset with primers and barcodes stripped, but it requires
more work.
This is an exercise in understanding the format QIIME expects and how you can
reformat data to allow analysis in QIIME.
There is a directory in your Taxonomy folder called Baltic, with data from the
Baltic Sea. It has a number of samples that were sequenced using 454 technology and
are demultiplexed, but for the purpose of this exercise these could have been Illumina
files, as the format is fastq.
The challenge we give you is to write down how we need to reformat this data (even
if you do not know how to do it) to be able to perform an analysis similar to the one we
have done with the Red Sea data. This is something you can do in a small group if you
feel you're not quite up to the challenge.
Hints:
Consider using the QIIME script convert_fastaqual_fastq.py
What does the mapping file you need to run split_libraries.py look like?
Finally
If there is time left you could go back to the polluted railway site study. The aim of
this study was to understand the interrelationships among microbial community
composition, pollution level, and soil geochemical and physical properties. With
additional information from the paper, could you come up with some conclusions?
The QIIME 454 overview tutorial at http://qiime.org/tutorials/tutorial.html has a
number of additional steps that you may find interesting, so feel free to try some of
them out. Note that we have not installed Cytoscape, so we cannot visualize OTU
networks.
We will end this tutorial with a 15-minute summary of what we have done and how
well our analysis compares with the one in the paper.
Hopefully you will have acquired new skills that allow you to tackle your own
taxonomic analyses. There are many more tutorials on the QIIME website that can
help you pick the best strategy for your project (http://qiime.org/tutorials/). We picked
QIIME for this tutorial as it is widely used and supported, but there are alternatives
that might suit your needs better (e.g. VAMPS at http://vamps.mbl.edu, mothur at
http://www.mothur.org, and others).