DOI 10.1016/j.copbio.2008.04.003
A list of pathogens likely to be found in aquatic environments was built (primarily based on the WHO list
Table 1
For each taxon, the number of entries (separate submissions), of protein-coding sequences (CDS), and of genome projects was analyzed.

Taxon                          Entries       CDS   Genomes
[taxon label missing]             3644      5356        44
[taxon label missing]               69       198         5
[taxon label missing]             1072      1100         6
[taxon label missing]             6371      7410        16
[taxon label missing]             2767      2611         1
[taxon label missing]              809       834         3
[taxon label missing]            21296     17975        40
[taxon label missing]             8232      8015       352
[taxon label missing]            13057     10994        16
[taxon label missing]             3605      3014         2
[taxon label missing]             3010      2723         1
[taxon label missing]             6578      5425         1
[taxon label missing]              762       639         1
[taxon label missing]              161       125         1
[taxon label missing]              792       813         1
[taxon label missing]             5488      5254        33
[taxon label missing]              553       608         4
Burkholderia pseudomallei          568     27058        24
Campylobacter coli                 346       367         1
Campylobacter jejuni              2153     11676        11
Escherichia coli                 39004     75822        35
Legionella pneumophila            2451     15145         4
Legionella                        3386     15545         4
Pseudomonas aeruginosa           35034     22340         7
Salmonella typhi                   191       474         0
Salmonella                        7953     41953        24
Shigella                          4149     31292         8
Vibrio cholerae                   2532     10833        17
Vibrio parahaemolyticus            882      5775         3
Vibrio vulnificus                  821     10549         4
Vibrio                           10358     40967        34
Yersinia enterocolitica            445      5093         1
Acanthamoeba                     33450       180         1
Cryptosporidium parvum           11148      1262         2
Cryptosporidium                  39678      1796         4
Cyclospora cayetanensis            164         2         0
Dracunculus medinensis               2         0         0
Entamoeba histolytica           101006       706         0
Entamoeba                       205767       843         6
Giardia intestinalis             24507      1364         0
Naegleria fowleri                   67        38         0
Viruses were also queried at a higher taxonomic rank because the names used to describe them can differ considerably between entries. A table in the additional materials also lists the most sequenced genes for a number of waterborne pathogens. Note: complete lists or summary information on genome projects can be obtained manually from URL: http://www.ncbi.nlm.nih.gov/Genomes/ or URL: http://www.genomesonline.org/. Genome numbers cover projects ranging from finished to in progress.
www.sciencedirect.com
URL: http://www.who.int/water_sanitation_health/gdwqrevision/watpathogens.pdf) and used to query GenBank release 163 (updates up to February 5, 2008) with ACNUC [21]. The questions asked were: how many entries and protein-coding genes are available for each organism (an entry is a separate submission identified by its accession number; it may contain the sequences of many genes), and how many complete genomes are available? These data (Table 1) reveal contrasting situations for the different pathogens. Some organisms have been widely sequenced, both in terms of entries and of complete genomes. Other pathogens, especially among Eukarya, have attracted little interest, for example the pathogens of the phylum Apicomplexa, the nematode Dracunculus, and Naegleria fowleri. Finally, for Bacteria and Eukarya, querying by species name is rather easy, but for viruses the name of the host is often included in the name of the organism, which makes queries more difficult. To retrieve all of the sequences for a given virus, it is often necessary to query using a higher-order keyword (Table 1).
GenBank contained about 1 188 211 bacterial entries (pathogenic or not), the 16S rRNA gene alone contributing 647 899 sequences. Notably, many of these sequences were short to very short (50-500 nt, 2530331 entries); only 186 310 entries had a length of 1 200 nt or more. Only 32 900 of these long sequences belonged to cultured strains, annotated as about 8 000 different species.
Considering all bacterial entries, the nifH gene (involved in nitrogen fixation) was the most sequenced (9 421 entries), followed by gyrB (encoding a type II topoisomerase, 6 845 sequences) and rpoB (coding for the β-subunit of RNA polymerase, 6 231 sequences). For waterborne bacterial pathogens, surprisingly, the mdh gene (which encodes an enzyme that catalyzes the interconversion of malate and oxaloacetate) was the most sequenced, followed by three housekeeping genes: gyrB, rpoB, and recA. The genes gyrB (sequences available for 337 genera and 1 483 species; most sequenced genera: Pseudomonas and Vibrio), rpoB (sequences available for 238 genera and 1 565 species; most sequenced genus: Mycobacterium), and recA (sequences available for 232 genera and 999 species; most sequenced genus: Vibrio) have been widely used as taxonomic markers.
Domains sequenced
The level of sequence divergence and the length of the available sequences together determine the phylogenetic resolution. Such an evaluation cannot easily be provided for bacteria as a whole, because sequences of the different genes are available for wide but differing sets of taxa. We therefore restricted the analysis to the Vibrio sequences of gyrB, recA, and rpoB. To simplify the analyses and results, protein sequences were downloaded
Current Opinion in Biotechnology 2008, 19:266-273
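[Table or figure residue for the genes rpoB, gyrB, and recA, with column labels lb, rb, and Number. Row alignment is not recoverable; values are listed per gene in their extracted order.]
rpoB: 1400, 1370, 400, 400, 300, 300, 200, 100, 100, 0, 30, 700, 400, 700, 400, 1100, 1100, 600, 1400, 1400, 1100, 800, 1000, 700, 1300, 1200, 700, 2, 3, 15, 1, 2, 65, 1, 9, 1
gyrB: 800, 410, 400, 380, 370, 340, 310, 300, 200, 200, 100, 0, 90, 100, 20, 30, 60, 90, 100, 200, 100, 200, 800, 500, 500, 400, 400, 400, 400, 400, 400, 300, 300, 5, 47, 4, 2, 1, 16, 1, 69, 212, 162, 1
recA: 400, 360, 350, 340, 330, 240, 230, 220, 210, 200, 130, 120, 100, 0, 40, 50, 60, 70, 60, 70, 80, 90, 100, 70, 80, 100, 400, 400, 400, 400, 400, 300, 300, 300, 300, 300, 200, 200, 200, 1, 1, 7, 6, 8, 1, 94, 141, 47, 300, 63, 4, 22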
Figure 1
Heatmap analysis of oligomers used in references [43,44] to identify the presence of the mip gene. Tms were calculated using the nearest-neighbor algorithm and were then transformed into colors (the correspondence between Tm and color is shown in the figure). Each column of the heatmap (on the right) corresponds to an oligomer, as indicated in the box 'Primers identifiers'. A gray square indicates a Tm below 40 °C, a white square a sequence
Table 3
Evaluation of primers and probes recently used for the identification of the mip genes in Legionella
Bioinformatic tools
Aside from the multipurpose tools available at NCBI, EBI, or elsewhere, a number of web servers or programs may help with analyses:
GreenGenes. The greengenes web application provides access to a 16S rRNA gene sequence alignment for browsing, blasting, probing, and downloading: URL: http://greengenes.lbl.gov.
PubMLST. This site hosts publicly accessible MLST databases and software: URL: http://pubmlst.org, see also reference [46].
Legionella mip gene Sequence Database. This database allows the comparison of new mip gene DNA sequences with reference sequences from all described species of Legionella: URL: http://www.hpa.org.uk/cfi/bioinformatics/ewgli/legionellamips.htm.
leBIBI. Blast on databases of SSU-rDNA, gyrB, recA, sodA, rpoB, tmRNA, tuf and groel2-hsp65 gene sequences, and tools for bacterial identification: URL: http://umr5558-sud-str1.univ-lyon1.fr/lebibi/lebibi.cgi.
ICB. Identification and classification of bacteria database using gyrB: URL: http://seasquirt.mbio.co.jp/icb/.
GPMS. Genotyping of pathogenic bacterial strains, essentially for epidemiological purposes, based on polymorphic tandem repeat typing: URL: http://minisatellites.u-psud.fr.
VNTR. Molecular typing of bacteria using variable number tandem repeats: URL: http://vntr.csie.ntu.edu.tw.
OHM. A tool that produces heatmaps visually representing the Tm of primers on a set of sequences (can be combined with TreeDyn [47]): URL: http://bioinfo.unice.fr/ohm.
(Figure 1 Legend Continued) too short to contain the oligomer. Upper panel (A): excerpt of the L. pneumophila clade (possible cases of lateral transfer in red). Lower panel (B): excerpt of the non-L. pneumophila clade. Primer #3 shows the highest predicted Tm but will fail on some sequences; primer #1 also shows a wide heterogeneity of predicted Tms. The full figure is available as supplementary material.
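As a rough illustration of the Tm-to-color step described in the legend, the sketch below computes a perfect-match nearest-neighbor Tm and assigns the two classes the legend names explicitly (gray below 40 °C, white when the sequence is too short to contain the oligomer). It rests on assumptions that are not in the article: the unified parameter set of SantaLucia (1998), 1 M NaCl with no salt correction, a total strand concentration of 0.5 µM, and non-self-complementary oligomers. The names nn_tm and heatmap_class and the label "colored" are hypothetical, and mismatched primer-target pairs, which a tool such as OHM has to evaluate, are not handled here.

```python
import math

R = 1.987  # gas constant in cal/(mol*K)

# Assumed parameter set: unified nearest-neighbour values (SantaLucia 1998),
# as (delta H in kcal/mol, delta S in cal/(mol*K)) per 5'->3' dinucleotide.
NN = {
    "AA": (-7.9, -22.2), "TT": (-7.9, -22.2),
    "AT": (-7.2, -20.4),
    "TA": (-7.2, -21.3),
    "CA": (-8.5, -22.7), "TG": (-8.5, -22.7),
    "GT": (-8.4, -22.4), "AC": (-8.4, -22.4),
    "CT": (-7.8, -21.0), "AG": (-7.8, -21.0),
    "GA": (-8.2, -22.2), "TC": (-8.2, -22.2),
    "CG": (-10.6, -27.2),
    "GC": (-9.8, -24.4),
    "GG": (-8.0, -19.9), "CC": (-8.0, -19.9),
}
INIT_GC = (0.1, -2.8)  # initiation term for a terminal G or C
INIT_AT = (2.3, 4.1)   # initiation term for a terminal A or T


def nn_tm(oligo, total_strand_molar=0.5e-6):
    """Perfect-match Tm in degrees Celsius (assumes 1 M NaCl, no salt correction)."""
    seq = oligo.upper()
    if len(seq) < 2 or set(seq) - set("ACGT"):
        raise ValueError("expected an unambiguous DNA oligomer of length >= 2")
    dh = ds = 0.0
    # One initiation term per duplex end.
    for end in (seq[0], seq[-1]):
        h, s = INIT_GC if end in "GC" else INIT_AT
        dh += h
        ds += s
    # Sum the stacking contributions of all overlapping dinucleotides.
    for i in range(len(seq) - 1):
        h, s = NN[seq[i:i + 2]]
        dh += h
        ds += s
    # Two-state model for a non-self-complementary duplex.
    return 1000.0 * dh / (ds + R * math.log(total_strand_molar / 4.0)) - 273.15


def heatmap_class(tm_celsius):
    """Map a Tm to a cell class; None stands for a sequence too short for the oligomer."""
    if tm_celsius is None:
        return "white"
    if tm_celsius < 40.0:
        return "gray"
    return "colored"  # hypothetical label; the legend maps these Tms to a color scale
```

In such a sketch, each heatmap cell would be obtained by calling heatmap_class on the Tm predicted for one oligomer against one sequence.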
Conclusions
If none of the above servers can be used (this is not an exhaustive list), sequence retrieval, alignment, phylogeny, and primer design can be quite time consuming and tedious for scientists who cannot write computer programs. Sequence retrieval using keywords is often more efficient than a Blast search. SRS (Advanced Search form) or, even better, ACNUC or specific tools [48] should be preferred to Entrez because they are more powerful for sequence retrieval. Combining keywords for the gene or gene products with a species name or taxon ID and a filter on sequence length (very short sequences are useless) is often very efficient. Since annotations are not standardized, building a list of gene product names is often necessary (see additional materials). If there are many sequences, they can be clustered at a given similarity level (using blastclust or Cd-hit [49]) and one representative sequence per cluster aligned. Visual inspection of the alignment reveals sequences that do not align well; they often result from a wrong annotation or need to be reverse-complemented. The remaining sequences can then be added to this good alignment (using the Clustal profile option, for example). For protein-coding genes, a program such as Transalign [50] may be a good choice. When retrieving primers from publications, older papers are often useless because the primers were designed using very few sequences (primers can be analyzed using the web server cited above to produce figures similar to Figure 1).
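The retrieval advice above (a list of gene product names, a taxon, and a length filter, followed by re-orienting sequences that align poorly) can be sketched on records already held in memory. This is a minimal sketch and not the interface of ACNUC, SRS, or Entrez: the Record fields, the function names, and the k-mer test used to decide orientation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical minimal view of a database entry.
    accession: str
    organism: str
    product: str
    sequence: str


def select_records(records, taxon, product_names, min_length):
    """Keep records matching the taxon, any product synonym, and a length floor."""
    taxon = taxon.lower()
    names = [name.lower() for name in product_names]
    return [
        r for r in records
        if taxon in r.organism.lower()
        and any(name in r.product.lower() for name in names)
        and len(r.sequence) >= min_length
    ]


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all overlapping words of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_like(reference, seq, k=8):
    """Return seq or its reverse complement, whichever shares more k-mers with reference."""
    ref_words = kmers(reference.upper(), k)
    flipped = reverse_complement(seq)
    forward_hits = len(ref_words & kmers(seq.upper(), k))
    flipped_hits = len(ref_words & kmers(flipped.upper(), k))
    return flipped if flipped_hits > forward_hits else seq
```

Substring matching on free-text annotation is deliberately permissive here, which is why the list of product names matters more than the matching rule; clustering and alignment are left to the dedicated programs cited in the text.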
Finally, there is a large difference between amplification using DNA extracted from a pure culture and DNA extracted from an environmental sample. A primer (P) binds to its target DNA (T) according to the classical equation [P][T]/[PT] = Km. The presence of one or two differences between the P sequence and the T sequence may strongly influence the value of Km. With DNA extracted from a pure culture, [T] may be sufficiently high that [PT] is large enough for the PCR to succeed. With environmental DNA, and in the presence of mismatch(es), the primer may bind to many other domains (at low affinity but in many places) so that [PT] is not large enough for the PCR to succeed.
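The equilibrium [P][T]/[PT] = Km can be made concrete by solving it for [PT] given total primer and total target concentrations. The sketch below is an illustration under stated assumptions: a single binding site with no competing non-target sites (so it does not model the environmental case itself), mass conservation P_total = [P] + [PT] and T_total = [T] + [PT], and a hypothetical function name bound_complex. Substituting the conservation relations into the equation gives the quadratic x^2 - (P_total + T_total + Km) x + P_total T_total = 0 for x = [PT], of which the smaller root is the physical one.

```python
import math


def bound_complex(p_total, t_total, km):
    """Equilibrium concentration [PT] (same units as the inputs).

    Solves x**2 - (p_total + t_total + km) * x + p_total * t_total = 0 and
    returns the smaller root, written in a form that avoids subtracting two
    nearly equal numbers.
    """
    if p_total < 0 or t_total < 0 or km <= 0:
        raise ValueError("concentrations must be >= 0 and km must be > 0")
    b = p_total + t_total + km
    discriminant = b * b - 4.0 * p_total * t_total
    return 2.0 * p_total * t_total / (b + math.sqrt(discriminant))
```

With illustrative (not measured) values, raising Km, as a mismatch would, or lowering the total target concentration both reduce the value returned for [PT], which is the qualitative behavior the paragraph describes.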
Acknowledgements
This work was supported by funds from the European Commission for the
HEALTHY WATER project (FOOD-CT-2006-036306) and a CNRS PICS
to R Christen. The authors are solely responsible for the content of this
publication. It does not represent the opinion of the European Commission.
The European Commission is not responsible for any use that might be
made of data appearing therein.
Conflict of interest
None.
6. Best EL, Fox AJ, Frost JA, Bolton FJ: Real-time single-nucleotide
polymorphism profiling using Taqman technology
for rapid recognition of Campylobacter jejuni clonal
complexes. J Med Microbiol 2005, 54:919-925.
10. DeSantis TZ, Brodie EL, Moberg JP, Zubieta IX, Piceno YM,
Andersen GL: High-density universal 16S rRNA microarray
analysis reveals broader diversity than typical clone library
when sampling the environment. Microb Ecol 2007, 53:371-383.
Identification of pathogens in environmental samples often uses parallel,
multispecies detection systems in order to detect any pathogen. In this
analysis, a DNA array with 297 851 probes was compared with 16S
cloning and sequencing to evaluate biodiversity, with the conclusion
that the array was more efficient. However, pyrosequencing technologies
are likely to replace both of the approaches compared in this work.
12. Hansen RR, Sikes HD, Bowman CN: Visual detection of labeled
oligonucleotides using visible-light-polymerization-based
amplification. Biomacromolecules 2008, 9:355-362.
30. Janda JM, Abbott SL: 16S rRNA gene sequencing for bacterial
identification in the diagnostic laboratory: pluses, perils, and
pitfalls. J Clin Microbiol 2007, 45:2761-2764.
13. Lin YC, Sheng WH, Chang SC, Wang JT, Chen YC, Wu RJ,
Hsia KC, Li SY: Application of a microsphere-based array for
rapid identification of Acinetobacter spp. with distinct
antimicrobial susceptibilities. J Clin Microbiol 2008, 46:612-617.
14. Yang ZJ, Tu MZ, Liu J, Wang XL, Jin HZ: Comparison of
amplicon-sequencing, pyrosequencing and real-time PCR for
detection of YMDD mutants in patients with chronic hepatitis
B. World J Gastroenterol 2006, 12:7192-7196.
15. Kobayashi N, Bauer TW, Tuohy MJ, Lieberman IH, Krebs V,
Togawa D, Fujishiro T, Procop GW: The comparison of
pyrosequencing molecular Gram stain, culture, and
conventional Gram stain for diagnosing orthopaedic
infections. J Orthop Res 2006, 24:1641-1649.
Sequencing more efficient than staining to differentiate Gram-positive
from Gram-negative bacteria. Who would have bet on it in 2005?
16. Luna RA, Fasciano LR, Jones SC, Boyanton BL Jr, Ton TT,
Versalovic J: DNA pyrosequencing-based bacterial pathogen
identification in a pediatric hospital setting. J Clin Microbiol
2007, 45:2985-2992.
17. Dowd SE, Sun Y, Secor PR, Rhoads DD, Wolcott BM, James GA,
Wolcott RD: Survey of bacterial diversity in chronic wounds
using Pyrosequencing, DGGE, and full ribosome shotgun
sequencing. BMC Microbiol 2008, 8:43 doi: 10.1186/1471-2180-8-43.
18. Jackson GW, McNichols RJ, Fox GE, Willson RC: Bacterial
genotyping by 16S rRNA mass cataloging. BMC Bioinformatics
2006, 7:321 doi: 10.1186/1471-2105-7-321.
19. Grun J, Manka CK, Nikitin S, Zabetakis D, Comanescu G, Gillis D,
Bowles J: Identification of bacteria from two-dimensional
resonant-Raman spectra. Anal Chem 2007, 79:5489-5493.
20. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res 1997,
25:3389-3402.
21. Gouy M, Delmotte S: Remote access to ACNUC nucleotide and
protein sequence databases at PBIL. Biochimie 2008, 90:555-562.
22. Etzold T, Ulyanov A, Argos P: SRS: information retrieval system
for molecular biology data banks. Methods Enzymol 1996,
266:114-128.
23. Schuler GD, Epstein JA, Ohkawa H, Kans JA: Entrez: molecular
biology database and retrieval system. Methods Enzymol 1996,
266:141-162.
32. Smith DL, Wareing BM, Fogg PC, Riley LM, Spencer M, Cox MJ,
Saunders JR, McCarthy AJ, Allison HE: Multilocus
characterization scheme for shiga toxin-encoding
bacteriophages. Appl Environ Microbiol 2007, 73:8032-8040.
33. Ogura Y, Ooka T, Asadulghani, Terajima J, Nougayrede JP,
Kurokawa K, Tashiro K, Tobe T, Nakayama K, Kuhara S et al.:
Extensive genomic diversity and selective conservation of
virulence-determinants in enterohemorrhagic Escherichia coli
strains of O157 and non-O157 serotypes. Genome Biol 2007,
8:R138 doi: 10.1186/gb-2007-8-7-r138.
A systematic whole-genome comparison between O157 and non-O157
EHEC strains using microarray and whole-genome PCR scanning analyses. An example of modern analyses and comparisons of whole genomes to understand phenotypes and their evolution over time.
34. Zhang Y, Laing C, Steele M, Ziebell K, Johnson R, Benson AK,
Taboada E, Gannon VP: Genome evolution in major Escherichia
coli O157:H7 lineages. BMC Genomics 2007, 8:121 doi: 10.1186/
1471-2164-8-121.
Same as reference [33], but using 6167 50-mer oligonucleotide whole-genome-based microarrays for E. coli.
35. Hsiao A, Liu Z, Joelsson A, Zhu J: Vibrio cholerae virulence
regulator-coordinated evasion of host immunity. Proc Natl
Acad Sci USA 2006, 103:14542-14547.
36. Pang B, Yan M, Cui Z, Ye X, Diao B, Ren Y, Gao S, Zhang L, Kan B:
Genetic diversity of toxigenic and nontoxigenic Vibrio
cholerae serogroups O1 and O139 revealed by array-based
comparative genomic hybridization. J Bacteriol 2007, 189:4837-4849.
37. Fox AJ, Taha MK, Vogel U: Standardized nonculture techniques
recommended for European reference laboratories. FEMS
Microbiol Rev 2007, 31:84-88.
38. Turner KM, Feil EJ: The secret life of the multilocus sequence
type. Int J Antimicrob Agents 2007, 29:129-135.
39. Chang CH, Chang YC, Underwood A, Chiou CS, Kao CY:
VNTRDB: a bacterial variable number tandem repeat locus
database. Nucleic Acids Res 2007, 35:D416-421.
40. Martens M, Dawyndt P, Coopman R, Gillis M, De Vos P, Willems A:
Advantages of multilocus sequence analysis for taxonomic
studies: a case study using 10 housekeeping genes in the
genus Ensifer (including former Sinorhizobium). Int J Syst Evol
Microbiol 2008, 58:200-214.
46. Jolley KA, Chan MS, Maiden MC: mlstdbNet - distributed multi-locus sequence typing (MLST) databases. BMC Bioinformatics
2004, 5:86 doi: 10.1186/1471-2105-5-86.