Académique Documents
Professionnel Documents
Culture Documents
Gene expression
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org 2715
each step of the sampling. Compared with existing methods, info- The IC can be extended to a more general formulation when the positions
gibbs shows good performances in terms of computation time and are not independent. This can be written as follows:
prediction quality on both simulated and real datasets. P(u|M)
IC(M,B) = P(u|M)log (5)
l
P(u|B)
u∈A
2 METHODS where A is the alphabet, l the length of the motif, P(u|M) the probability
to generate the fragment u given the matrix M and P(u|B) the probability to
Problem statement: given a set of sequences = {φ1 ,...,φz } and a motif
generate the same fragment given the background model B. When a Bernoulli
length l, the problem can be defined as follows: find a set of sequence
background model is used, Equation (5) can be simplified to Equation (4).
fragments (sites) W = {w1 ,w2 ,...,wn } that has maximal IC. When considering
For Markov models of higher order, this formula can be rewritten (see
the search space as a set of potential sites S = {s1 ,...,sz } (e.g. all allowed
Supplementary Material) and computed in acceptable time for the Markov
positions in sequences), the problem can also be viewed as finding a
orders typically used in practice (m ≤ 5).
subset W = {w1 ,...,wn } of S that has maximal IC. The search space can
optionally be restricted by marking in sequences some positions as forbidden
2.1.2 Log likelihood ratio The motif conservation can also be estimated
and discarding associated sites from S (e.g. repetitive elements, iterative
by computing the LLR, defined as the log of the ratio between the probability
masking).
to generate the set of sites W given the matrix M and given the background
model B (that can be a higher order Markov model), respectively. For a set
Background model: an important issue when trying to discover short motifs of n sites W = {w1 ,...,wn }, the LLR can be written as follows:
in biological sequences is the choice of the background model, which can
n
P(wk |M)
where Mi,j is the probability of the letter j at position i in M and Bj the The Gibbs sampling algorithm first initializes the motif W (0) by
probability of the same letter in the background model. selecting a random set of n sites of length l. The algorithm (Table 1)
2716
Table 1. Gibbs site sampling algorithm IC contribution corresponding to the column i if the letter j is found
at position i of the site w. Using this table, IC(w∪Wwk ,B) can be
INPUT: a set of sites S = {s1 ,...,sm }, a motif length l efficiently computed as follows:
OUTPUT: a motif W
IC(w∪Wwk ,B) = αi,j=wi (10)
1. (init) Choose at random a set of sites W = {w1 ,...,wn } in S i
2. Choose a site wk in W and remove it from W
3. FOR EACH w in S DO
Starting from the current matrix M, the table α can be constructed
P(w|Wwk ) in O(|A|l).
4. Choose a new site wi proportionally to P(wi |Wwk )
5. REPEAT 2–4. UNTIL convergence Efficient LLR computation—Markov background The previous
technique can be extended to compute efficiently the LLR when
the background is modeled by a Markov chain. The LLR defined
(i)
then iterates over the following steps: (i) the sampling step where Equation (6) can be rewritten as: LLR= i k logP(wk |M)−
in
the probability P(w|W ) of each possible site given the current motif k logP(wk |B) since the positions i are assumed to be independent
is computed; (ii) the update step where a new motif is constructed in the matrix M. The first term of this expression can be precomputed
by replacing a given site from the motif by a new site sampled using as previously an α table. The second term which is not position
proportionally to P(w|W ). independent can also be precomputed only once for a complete run
by computing P(wk |B) for all the potential sites (S) of the input
2717
3.5 Multiple runs For our first test, we generated motifs of length 8 where the
Due to its stochastic nature, different runs of the algorithm on the probabilities were set to 0.91 for the dominant nucleotide and 0.03
same sequences are expected to return distinct results. As similar to for the others, according to Wei and Jensen (2006).
other softwares, info-gibbs can automatically manage multiple runs
using different starting points (i.e. starting from different random 4.1.3 Implanting motif instances Random sites were generated
motifs). In this case the sampling algorithm is started from k different and implanted in sequences using the program implant-sites.
points and the motif achieving the highest IC is kept. For each possible position (u = L −l +1) of each sequence, a
motif can be implanted according to a probability p. This model
3.6 Finding multiple motifs assumes no dependencies between site positions. The probability
to implant k motif occurrences in a sequence is given by the
The site sampling algorithm searches for only one motif at a time. As
binomial formula: P(occ = k) = uk pk (1−p)u−k . In our conditions,
in other implementations, info-gibbs can discover multiple motifs by
the average expectation was set to 1 occurrence per sequence
doing several runs and iteratively mask (i.e. remove from searched
p = 1/u.
sites) sites already included in previously returned motifs.
4.1.4 Motif discovery We tested the capability of info-gibbs and
3.7 Implementation and an availability six other motif discovery softwares to detect the implanted motifs
info-gibbs is implemented in ANSI C++ and the command-line (AlignACE, BioProspector, GAME, Gibbs, MEME, MotifSampler).
tool is available upon request. A web interface to info-gibbs is also For each tool, the motif length was set to 8 and the expected number
2718
PC
0.50 0.50
0.50
0.25 0.25
0.25 0 0
0.1 0.5 1 2 4 0.1 0.5 1 2 4
pesudo count pesudo count
0 IC LLR LRw
AlignAce BioProspector GAME Gibbs info-gibbs MEME MotifSampler c = 0.85 c = 0.85
Sn PPV PC 1.00 1.00
Bernoulli Markov 1
0.75 0.75
Fig. 1. Software performances on synthetic data. For each software the
PC
0.50 0.50
sensitivity Sn (black bars), the PPV (gray bars) and the PC (line) are shown.
0.25 0.25
1.00 0 0
0.1 0.5 1 2 4 0.1 0.5 1 2 4
pesudo count pesudo count
IC LLR LRw
c = 0.75 c = 0.75
0.75 1.00 1.00
PC
0.25 0.25
0.25 0 0
0.1 0.5 1 2 4 0.1 0.5 1 2 4
pesudo count pesudo count
IC LLR LRw
0
0.95 0.91 0.9 0.85 0.8 0.7
Fig. 3. Comparison between scoring methods on synthetic data. All motifs
Conservation
were discovered with the same program (info-gibbs) the difference being the
AlignAce BioProspector updating strategy: motif-wise IC, motif-wise LLR or site-wise LR (LRw ),
GAME Gibbs
info-gibbs MEME respectively. Synthetic motifs with various degree of conservation (c = 0.91,
MotifSampler 0.85 and 0.75) were used. The PC is reported for various values of the
pseudo-count parameter. The left column shows the results obtained when
using a Bernoulli background model and the right column shows the results
Fig. 2. Software performances on synthetic data. For each software and a
when using a Markovian background model of order 1 (order 2 gives similar
level of motif conservation ranging from 0.95 to 0.70, the PC is reported.
results).
Table S1). The ranking of the software does not depend on the
degree of conservation of the implanted motif. info-gibbs and Equation (3)] was also evaluated by comparing the predictions made
MEME perform better than the other tools over the whole range using five levels of pseudo-count (0.1, 0.5, 1.0, 2.0 and 4.0).
of conservation tested. In each case, predicted motifs were assessed by comparing the
The performances on synthetic data are, however, strongly positions of the implanted sites with those of the sites included in
influenced by somewhat arbitrary choices (background model, site the discovered motif and by computing as previously the PC. The
density, etc.). In order to evaluate the software tools in realistic results are show in Figure 3.
conditions, we also tested their performances on biological datasets In this evaluation the differences between the scoring methods do
(Sections 4.2 and 4.3). not seem to be influenced by the background model (first and second
column in Fig. 3) even if all the strategies tend to give slightly better
4.1.5 Comparison between scoring methods We evaluated on results when higher order background models are used. In contrast,
synthetic data the impact of three scoring methods used for motif the pseudo-count parameter influences deeply the results. Around
updating (as defined in Section 3.1): (i) classical site-wise likelihood or lower than the commonly used value (pseudo-count = 1.0), the
ratio LRw (ii) motif-wise LLR; and (iii) motif-wise IC. performances of the motif-wise IC and LLR strategies are similar
Synthetic motifs with various degree of conservation (0.91, 0.85 and always better than the site-wise likelihood ratio. At unusual
and 0.75) were generated and implanted (1 expected site per levels of pseudo-count (pseudo-count= 4.0) only the motif-wise
sequence) in random sequences (100 sets of 10 sequences of 200 bp). LLR strategy remains efficient.
The sequences were generated according to a Markov model of order
2 trained on yeast upstream non-coding sequences.
In order to evaluate the influence of the background model in 4.2 Evaluation on RegulonDB data
the motif discovery process both Bernoulli background models The second evaluation focused on transcriptional regulation of
directly estimated by info-gibbs from the input sequences and pre- Escherichia coli K12. Regulons (i.e. sets of genes regulated by
computed higher order Markovian models were used. The influence the same TF) and associated TFBSs where retrieved from the
of the pseudo-count used to compute the motif score [as defined in RegulonDB 6.0 database (Gama-Castro et al., 2008), which provides
2719
18
a comprehensive set of curated data for the E.coli K12 transcriptional
regulatory network. 16
Number of regulons
Starting from the binding sites table, we selected all the TFs for 14
which RegulonDB contains an annotated PSSM. This provides a 12
reference set of 32 regulons (Supplementary Table S3) with good 10
annotations about target genes, binding sites and regulatory motifs 8
(PSSMs). For each regulon, we retrieved sequences upstream of the
6
start codon up to the next gene, with a maximal length of 500 bp.
4
Motif discovery softwares were run on each regulon with motif
length fixed to 16 and expected number of occurrences set to 2 sites 2
per sequences. Like the previous analysis we set the number of runs 0
0.6 0.7 0.8 0.9
to at least 10 (when this option was available) and considered the
best motifs predicted. For all the softwares the parameters were Asymptotic covariance threshold
set to search on both strands. All other parameters were left to AlignAce BioProspector
GAME Gibbs
their defaults and no background model were provided as input
info-gibbs MEME
to the software excepting the G+C frequency for programs that MotifSampler
require it (AlignACE). Under these conditions most of the software
compute a Bernoulli background using the input sequences (all
2720
1.00 10
PurR
Number of regulons
0.75
PuR 8
P(AS ≥ x)
0.50
0.25 6
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 4
Asymptotic covariance (AS)
AlignAce BioProspector
GAME Gibbs
info-gibbs MEME 2
MotifSampler random
1.00
0
0.75 DnaA 0.6 0.7 0.8 0.9
DnaA
Asymptotic covariance threshold
P(AS ≥ x)
0.50
AlignAce BioProspector
0.25
GAME Gibbs
info-gibbs MEME
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 MotifSampler
Asymptotic covariance (AS)
AlignAce BioProspector
0.50 sequence and we considered the best motifs predicted among at least
10 runs. Since the differences between Bernoulli background models
0.25
and higher order models are significant with yeast genomes, we have
0 chosen for this dataset the background model that gives the best
0 0.1 0.2 0.3 0.4 0.5 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
Asymptotic covariance (AS) results for all the softwares that can use higher order background
AlignAce BioProspector
GAME Gibbs models. We provided to those software a Markov model of order
info-gibbs MEME
MotifSampler random 3 trained on all yeast promoters and a Bernoulli model trained
on the same set to the others. For all the software the parameters
Fig. 5. Complementary cumulative distributions of the asymptotic were set to search on both strands. All other parameters were left
covariance between annotated and predicted regulatory motifs for E.coli K12 to their defaults. Discovered motifs were evaluated by computing
PurR, DnaA and H-NS regulons for each software. The random behaviors their averaged asymptotic covariance (over 10 trials) with the motifs
are represented by the dashed lines. annotated in SCPD (Supplementary Table S5).
All programs show contrasted performances (Fig. 6): they
et al., 2004; Schneider and Stephens, 1990) of the annotated motif correctly detect between 0 and 8 motifs among the 29 datasets.
(Fig. 5), only one position in the middle of motif is well conserved. On these datasets, info-gibbs and MotifSampler supersede the other
Given the poor IC of this annotated motif, it is not surprising that tools and all programs supporting higher order Markov models
info-gibbs fails to identify it. However, all the other software also show higher performances (BioProspector, info-gibbs, MEME and
fail to return relevant results. Actually, the distributions are close to MotifSampler).
the random motif selection for all the softwares.
2721
synthetic sequences, with an evaluation based on site comparisons; Hertz,G.Z. et al. (1990) Identification of consensus patterns in unaligned dna sequences
(ii) biological sequences with their native sites, with a motif- known to be functionally related. Comput. Appl. Biosci., 6, 81–92.
Hughes,J.D. et al. (2000) Computational identification of cis-regulatory elements
based rather than site-based evaluation. Another originality of our
associated with groups of functionally related genes in Saccharomyces cerevisiae.
evaluation is to directly address the intrinsic lack of reproducibility J. Mol. Biol., 296, 1205–1214.
of stochastic approaches, by analyzing the distribution of matching Jensen,S. et al. (2004) Computational discovery of gene regulatory binding motifs: a
statistics rather than choosing the average performance. bayesian perspective. Stat. Sci., 19, 188–204.
The most important limitation of the proposed approach, which Jensen,S.T. and Liu,J.S. (2004) Biooptimizer: a Bayesian scoring function approach to
motif discovery. Bioinformatics, 20, 1557–1564.
is shared with some of the other methods, is that the user has Lawrence,C.E. et al. (1993) Detecting subtle sequence signals: a gibbs sampling strategy
to specify input parameters which have a strong impact on the for multiple alignment. Science, 262, 208–214.
result, in particular the motif length and the expected number of Liu,J. et al. (1995) Bayesian models for multiple local sequence alignment and Gibbs
occurrences. One strategy (supported by MEME) is to test various sampling strategies. J. Am. Stat. Assoc., 90, 1156–1170.
Liu,X. et al. (2001) BioProspector: discovering conserved DNA motifs in upstream
lengths and select the motif that optimizes some a posteriori
regulatory regions of co-expressed genes. In Pacific Symposium on Biocomputing, 6,
statistics (e.g. information per column). The expected number of pp. 127–138.
occurrences is harder to estimate when maximizing IC: under- Neuwald,A.F. et al. (1995) Gibbs motif sampling: detection of bacterial outer membrane
estimating the number of occurrence could force the algorithm protein repeats. Protein Sci., 4, 1618–1632.
to converge toward a too ‘sharp’ motif, while overestimating it Pape,U.J. et al. (2008) Natural similarity measures between position frequency matrices
with an application to clustering. Bioinformatics, 24, 350–357.
can result in a useless fuzzy motif. Another important question Pavesi,G. et al. (2001) An algorithm for finding signals of unknown length in DNA
is the choice of the temperature parameter. An adequate usage of
2722