Vous êtes sur la page 1sur 8

Vol. 25 no.

20 2009, pages 2715–2722


BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btp490

Gene expression

info-gibbs: a motif discovery algorithm that directly optimizes


information content during sampling
Matthieu Defrance∗ and Jacques van Helden∗
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles CP 263,
Campus Plaine, Boulevard du Triomphe, B-1050 Bruxelles, Belgium
Received on October 2, 2008; revised on August 6, 2009; accepted on August 11, 2009
Advance Access publication August 18, 2009
Associate Editor: David Rocke

ABSTRACT on a Gibbs sampling strategy: Gibbs (Lawrence et al., 1993; Liu


Motivation: Discovering cis-regulatory elements in genome et al., 1995; Neuwald et al., 1995), AlignACE (Hughes et al.,

Downloaded from bioinformatics.oxfordjournals.org by guest on April 6, 2011


sequence remains a challenging issue. Several methods rely on 2000; Roth et al., 1998), MotifSampler (Thijs et al., 2002) or
the optimization of some target scoring function. The information BioProspector (Liu et al., 2001). The latter two support higher order
content (IC) or relative entropy of the motif has proven to be a good Markovian background models. More recently Shida (2006a, b)
estimator of transcription factor DNA binding affinity. However, these proposed a Gibbs sampling method that allows a variable stochastic
information-based metrics are usually used as a posteriori statistics factor (temperature) that enhances Gibbs sampling convergence
rather than during the motif search process itself. speed.
Results: We introduce here info-gibbs, a Gibbs sampling algorithm Most of the methods that target PSSM motif discovery sort the
that efficiently optimizes the IC or the log-likelihood ratio (LLR) of the predicted motifs by computing a posteriori some score such as
motif while keeping computation time low. The method compares information content (IC) (consensus, MotifSampler), log-likelihood
well with existing methods like MEME, BioProspector, Gibbs or ratio (LLR) (MotifSampler, Gibbs), E-value of the log-likelihood
GAME on both synthetic and biological datasets. Our study shows (MEME) or E-value of the IC (consensus).
that motif discovery techniques can be enhanced by directly focusing The IC (Hertz and Stormo, 1999), also called relative entropy,
the search on the motif IC or the motif LLR. presents the advantage of measuring both the specificity of a motif
Availability: http://rsat.ulb.ac.be/rsat/info-gibbs (low variability within each column) and its contrast relative to the
Contact: defrance@bigre.ulb.ac.be background model. IC has been claimed to be a good measure of
Supplementary information: Supplementary data are available at DNA binding affinity (Stormo, 1998).
Bioinformatics online. The program consensus (Hertz et al., 1990) optimizes the IC, but
is sensitive to the order of incorporation of the sequences. Genetic
algorithms like GAME that try to optimize directly this score have
1 INTRODUCTION recently emerged (Chan et al., 2008; Wei and Jensen, 2006), but
Gene expression is regulated at the transcriptional level by the time and memory complexity of genetic algorithms are higher
transcription factors (TFs) that bind to DNA at specific locations. than more specific algorithms like Gibbs sampling. Furthermore,
Several algorithmic approaches have been developed for de novo they require to specify a set of parameters (probabilities of mutation
identification of regulatory signals from a set of sequences. Motif and crossing over, population size, selection operator, etc.), which
discovery methods can be used to construct motifs that represent the are difficult to relate to properties of the input sequences and output
specificity of the binding between a TF and binding sites (TFBSs). motifs (size, number of sites, conservation, etc.).
Depending on the motif representation used, motif discovery The scoring function used to sample motifs during the
methods can be divided into two broad categories: enumerative discovery process strongly affects the resulting motifs. Jensen and
methods that select overrepresented words (exact or degenerated), co-workers (2004) emphasized the impact of the input parameters
and heuristics that target the discovery of more complex motifs and the scoring functions on the quality of discovered motifs. They
like position-specific scoring matrices (PSSMs). Among the first implemented the software BioOptimizer (Jensen and Liu, 2004),
category of methods, motifs can be represented by words (van which takes as input a motif returned by some pattern discovery
Helden et al., 1998), spaced words (van Helden et al., 2000) or algorithm (BioProspector, Consensus, AlignACE, MEME), and
words with multiple gaps and errors (i.e. with degenerated positions) improves it by local optimization of a scoring function based on
(Pavesi et al., 2001; Sinha and Tompa, 2002, 2003). Considering the the log-posterior distribution.
second category of methods, aiming at discovering PSSMs, many In this article, we present a motif finding algorithm called info-
algorithms have been proposed. This includes the greedy algorithm gibbs, that combines the qualities of Gibbs sampling (time and
consensus (Hertz et al., 1990), expectation maximization algorithms memory efficiency, interpretability of parameters) and uses as a
like MEME (Bailey and Elkan, 1994) and several algorithms based scoring a scoring scheme either the IC or the LLR of the motif.
The strategy is to directly compute the IC or LLR of the motif at
∗ To whom correspondence should be addressed.

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org 2715

[15:39 29/9/2009 Bioinformatics-btp490.tex] Page: 2715 2715–2722


M.Defrance and J.van Helden

each step of the sampling. Compared with existing methods, info- The IC can be extended to a more general formulation when the positions
gibbs shows good performances in terms of computation time and are not independent. This can be written as follows:
prediction quality on both simulated and real datasets.  P(u|M)
IC(M,B) = P(u|M)log (5)
l
P(u|B)
u∈A

2 METHODS where A is the alphabet, l the length of the motif, P(u|M) the probability
to generate the fragment u given the matrix M and P(u|B) the probability to
Problem statement: given a set of sequences  = {φ1 ,...,φz } and a motif
generate the same fragment given the background model B. When a Bernoulli
length l, the problem can be defined as follows: find a set of sequence
background model is used, Equation (5) can be simplified to Equation (4).
fragments (sites) W = {w1 ,w2 ,...,wn } that has maximal IC. When considering
For Markov models of higher order, this formula can be rewritten (see
the search space as a set of potential sites S = {s1 ,...,sz } (e.g. all allowed
Supplementary Material) and computed in acceptable time for the Markov
positions in sequences), the problem can also be viewed as finding a
orders typically used in practice (m ≤ 5).
subset W = {w1 ,...,wn } of S that has maximal IC. The search space can
optionally be restricted by marking in sequences some positions as forbidden
2.1.2 Log likelihood ratio The motif conservation can also be estimated
and discarding associated sites from S (e.g. repetitive elements, iterative
by computing the LLR, defined as the log of the ratio between the probability
masking).
to generate the set of sites W given the matrix M and given the background
model B (that can be a higher order Markov model), respectively. For a set
Background model: an important issue when trying to discover short motifs of n sites W = {w1 ,...,wn }, the LLR can be written as follows:
in biological sequences is the choice of the background model, which can

n
P(wk |M)

Downloaded from bioinformatics.oxfordjournals.org by guest on April 6, 2011


greatly influence the quality of the results. Higher order Markov sequence LLR(M,W ,B) = log (6)
models can improve the detection of regulatory elements (Thijs et al., 2001). P(wk |B)
k=1
The algorithm described here supports Markov models of any order as
where P(wk |M) is the probability to generate the site wk given the matrix M
background models. By default, background frequencies are estimated from
and P(wk |B) the probability to generate the same site given the background
the input sequences, but they can also be based on an external sequence
model B.
set considered as reference (e.g. whole set of promoters of the considered
organism).
2.1.3 Relation between IC and LLR When considering a Bernoulli
background model and a matrix M constructed from W using a low (or
Matrix representation: the count matrix N associated with a given motif is a null) pseudo-count (i.e. no matrix smoothing), the LLR can be computed
constructed from the aligned set of sequence fragments W , by computing the by multiplying the IC by the number of fragments that contribute to the motif
number of occurrences Ni,j of each letter j of the alphabet A = {A,C,G,T } (LLR  n× IC). In this case, exchanging IC or LLR as a scoring function do
at each position i (from 1 to l). The count matrix N can be expressed as not lead to significant differences. However, when using higher order Markov
follows: background models and/or when the pseudo-count is significant the relation
n  
(i) between IC and LLR is no longer linear and can lead to quite significant
Ni,j (W ) = δ wk ,Aj (1)
k=1 differences. This difference highlights the nature of these two evaluators.
(i) While IC depends on the matrix alone, the LLR directly depends on the sites
where wk is letter at position i in word wk and used to build this matrix.

(i)
(i) 1 if wk = A j
δ(wk ,Aj ) = (2) 2.2 Sampling strategy
0 otherwise
Given a set of potential sites S and a background model B, the objective
The PSSM M is derived from the count matrix N by computing, for is to find the motif that maximizes some motif scoring function Wmax =
each position of the alignment, the relative frequency of each letter. To argmaxW (W |S,B), where (W |S,B) can be either n· IC [Equation (5)] or
circumvent the variance due to the small number of sites, the count matrix LLR [Equation (6)].
can be corrected by applying a simple and widely used additive smoothing In typical situations, this objective is unreachable because the number of
technique: adding a pseudo-count. The PSSM M can be expressed as follows: possible motifs increases exponentially with the number of potential sites.
The ideal solution can thus be approached by sampling the distribution that
Ni,j +Bj × we define as follows:
Mi,j = (3)
n+ 1 (W |S,B)
P(W |S,B) ∝ e T (7)
where Bj is the background frequency of residue j and  is the pseudo-count Z
which takes a Real positive value, usually chosen reasonably small with where Z is a normalization factor, and T a temperature parameter which is
respect to the number of sites. detailed in Section 3.3.

2.1 Evaluating motif conservation 3 ALGORITHM


2.1.1 Information content A well-established measure of motif
The Gibbs sampling strategy consists in trying to sample P(W |S,B)
conservation is the IC also known as relative entropy (Hertz and Stormo,
1999; Schneider et al., 1986). Given a PSSM M and a Bernoulli background
iteratively using the conditional probability P(wk |Wwk ), where Wwk
model B, the IC of M can written as follows: represents the motif constructed without the fragment wk . For a
(t+1)
given iteration t, the new site wk is drawn according to:
l 
 Mi,j  
ICBernoulli (M,B) = Mi,j log (4) (t+1) (t)
Bj wk ∝ P wk |Ww (8)
i=1 j∈A k

where Mi,j is the probability of the letter j at position i in M and Bj the The Gibbs sampling algorithm first initializes the motif W (0) by
probability of the same letter in the background model. selecting a random set of n sites of length l. The algorithm (Table 1)

2716

[15:39 29/9/2009 Bioinformatics-btp490.tex] Page: 2716 2715–2722


info-gibbs

Table 1. Gibbs site sampling algorithm IC contribution corresponding to the column i if the letter j is found
at position i of the site w. Using this table, IC(w∪Wwk ,B) can be
INPUT: a set of sites S = {s1 ,...,sm }, a motif length l efficiently computed as follows:
OUTPUT: a motif W 
IC(w∪Wwk ,B) = αi,j=wi (10)
1. (init) Choose at random a set of sites W = {w1 ,...,wn } in S i
2. Choose a site wk in W and remove it from W
3. FOR EACH w in S DO
Starting from the current matrix M, the table α can be constructed
P(w|Wwk ) in O(|A|l).
4. Choose a new site wi proportionally to P(wi |Wwk )
5. REPEAT 2–4. UNTIL convergence Efficient LLR computation—Markov background The previous
technique can be extended to compute efficiently the LLR when
the background is modeled by a Markov chain. The LLR defined
 (i)
then iterates over the following steps: (i) the sampling step where  Equation (6) can be rewritten as: LLR= i k logP(wk |M)−
in
the probability P(w|W ) of each possible site given the current motif k logP(wk |B) since the positions i are assumed to be independent
is computed; (ii) the update step where a new motif is constructed in the matrix M. The first term of this expression can be precomputed
by replacing a given site from the motif by a new site sampled using as previously an α table. The second term which is not position
proportionally to P(w|W ). independent can also be precomputed only once for a complete run
by computing P(wk |B) for all the potential sites (S) of the input

Downloaded from bioinformatics.oxfordjournals.org by guest on April 6, 2011


3.1 Sampling sequences.
In previous implementations of the Gibbs sampler, the probability
to draw a site was proportional to the likelihood ratio of the site 3.2 Shifting
(site-wise likelihood ratio, LRw ). The sampling algorithm can converge toward a motif that overlaps
P(w|M) some actual motif, but shifted left-wise or right-wise by a few
P(w|W ) ∝ LRw = (9) positions. To prevent staying blocked in this kind of near local
P(w|B)
optima, sites should be shifted altogether left or right by few
Since IC and the LLR are usually considered as relevant for positions. This is achieved by selecting after each iteration the motif
estimating the quality of the predicted motifs, our strategy is to resulting from the best shift (one position left, one position right or
directly use these statistics during the sampling phase itself rather no shift).
than as an a posteriori criterion. Given a motif constructed from
Wwk , choosing a new site w requires to compute IC(w∪Wwk ,B) or 3.3 Temperature
LLR(w∪Wwk ,B) for each site in S. Compared with the classical
site sampler, evaluating IC or LLR at each step of the algorithm The temperature parameter T defined in Equation (7) influences
can be time consuming, especially if the computation relies on the the explored space and the convergence speed of the algorithm.
raw formula [Equation (4–6)], which needs to be computed for each A low temperature (T < 1.0) allows a fast convergence but reduces
possible site at each iteration. the number of explored solutions. A high temperature (T > 1.0) on
However, the overall time performances can be greatly enhanced the other hand allows a wider exploration, but unfavorably affects the
by rewriting IC and LLR to avoid recomputations. The principle is convergence speed. The convergence behaviors have been evaluated
to construct for each iteration, a table that will store precomputed by measuring the evolution over iterations of the best motif found
partial contributions to IC or LLR. This technique can be applied (i.e. maximal IC) for different values of the temperature parameter
to compute the IC and LLR when the background is modeled by on synthetic datasets (see Section 4.1 for details on these datasets).
a Bernoulli sequence. This technique can also be extended to the The best results were obtained with the temperature set between 0.8
computation of LLR when the background is modeled by Markov and 1.0, whereas at higher or lower temperatures, the IC remains at
chain but cannot be used to compute the IC in the Markovian lower levels. Supplementary Figure S1 shows the averaged temporal
case (note that info-gibbs also supports direct update of IC under evolution of the best IC reached for a set of fixed temperature values.
Markov models, but at a higher computation cost). We first show in
the following sections how to efficiently compute the IC when the 3.4 Minimal distance between sites
background is a Bernoulli sequence. We then show how to extend In order to avoid an attractive effect of self-overlapping motifs
this technique to compute efficiently the LLR when the background (e.g. GAGAGAGAGAGAGAGA), info-gibbs supports a minimal
is Markovian. threshold dmin on the distance d(si ,si+1 ) between two successive
sites si and si+1 from the same sequence. This strategy is inspired
3.1.1 Efficient IC computation—Bernoulli background During from the strategies found in other softwares such as consensus (Hertz
the sampling phase, the IC of the matrix constructed from w and et al., 1990).
already selected sites IC(w∪Wwk ,B) needs to be computed for each Beyond the avoidance of self-overlapping motifs, this parameter
site of the input set. Assuming a Bernoulli background model, IC can be used in a particular way by setting the minimal distance to
can be computed independently for each position of the matrix using a value larger than sequence lengths. This setting allows no more
Equation (4). Since Wwk is fixed for a given iteration, a table α of size than one occurrence per sequence, as in the first version of the site
A×l that stores the IC for each column i of the matrix and each letter sampling algorithm (Lawrence et al., 1993), or by the option -zoops
j can be computed before sampling. The table cell αi,j will store the of MEME (Bailey and Elkan, 1994).

2717

[15:39 29/9/2009 Bioinformatics-btp490.tex] Page: 2717 2715–2722


M.Defrance and J.van Helden

3.5 Multiple runs For our first test, we generated motifs of length 8 where the
Due to its stochastic nature, different runs of the algorithm on the probabilities were set to 0.91 for the dominant nucleotide and 0.03
same sequences are expected to return distinct results. As similar to for the others, according to Wei and Jensen (2006).
other softwares, info-gibbs can automatically manage multiple runs
using different starting points (i.e. starting from different random 4.1.3 Implanting motif instances Random sites were generated
motifs). In this case the sampling algorithm is started from k different and implanted in sequences using the program implant-sites.
points and the motif achieving the highest IC is kept. For each possible position (u = L −l +1) of each sequence, a
motif can be implanted according to a probability p. This model
3.6 Finding multiple motifs assumes no dependencies between site positions. The probability
to implant k motif occurrences in a sequence is given by the
The site sampling algorithm searches for only one motif at a time. As
binomial formula: P(occ = k) = uk pk (1−p)u−k . In our conditions,
in other implementations, info-gibbs can discover multiple motifs by
the average expectation was set to 1 occurrence per sequence
doing several runs and iteratively mask (i.e. remove from searched
p = 1/u.
sites) sites already included in previously returned motifs.
4.1.4 Motif discovery We tested the capability of info-gibbs and
3.7 Implementation and an availability six other motif discovery softwares to detect the implanted motifs
info-gibbs is implemented in ANSI C++ and the command-line (AlignACE, BioProspector, GAME, Gibbs, MEME, MotifSampler).
tool is available upon request. A web interface to info-gibbs is also For each tool, the motif length was set to 8 and the expected number

Downloaded from bioinformatics.oxfordjournals.org by guest on April 6, 2011


provided via the RSAT web server (http://rsat.ulb.ac.be/rsat/). of occurrences to 10 (corresponding to one expected occurrence per
sequence). When a higher order background model was allowed by
the softwares (BioProspector, info-gibbs, MEME, MotifSampler),
4 RESULTS
we provided the background model used to generate the sequences
4.1 Evaluation on synthetic data as input. In other cases only nucleotide frequencies were provided.
The first evaluation was carried out on a well-defined problem: Since most of the compared software use multiple starting points
finding synthetic sites implanted in synthetic sequences. One of (initial alignments) by default or have an option to allow multiple
the main advantages of this approach is that it provides a well- starting points, we set this parameter to at least 10 runs (when this
controlled environment, where all the motif occurrences (sites) are option was available) and considered only the best motifs predicted
known. Consequently, usual evaluation statistics such as sensitivity among those runs. As the motifs were implanted on the same strand,
and positive predictive value (PPV) can be accurately computed to the parameter to search only on one strand was set when available
estimate algorithm performances. This contrasts with real datasets, (all the softwares except AlignACE). The other parameters were
where several effective binding sites are likely to have escaped left to their defaults. For info-gibbs, the sampling strategy relies on
experimental detection, and thus be missing in the annotations. the optimization of the LLR which is for time efficiency reason the
Another advantage of that evaluation is its repeatability and default strategy.
scalability. Since the sequences and the motif itself are randomly All the softwares used in this study predict a motif by aligning a set
generated, the evaluation can be repeated sufficiently to provides of sites extracted from the input sequences. In order to evaluate the
accurate measurements. predictions, we compared the sites discovered with those implanted
The datasets are generated in three steps: (i) random sequence in the sequences. Implanted sites are qualified as true positives
generation (random-seq); (ii) random motif generation (random- (TP) when they are also the predicted sites and false negative (FN)
motif ); and (iii) implantation of the generated motifs in the otherwise. Predicted sites that do not correspond to implanted sites
sequences (implant-sites). The tools used to generate those synthetic are qualified as false positives (FP).
datasets have been integrated in the Regulatory Sequences Analysis Three classical evaluation statistics were computed: (i) the
Tools (Thomas-Chollier et al., 2008; van Helden, 2003). PPV = TP/(TP + FP) which quantifies the proportion of correct sites
among the predicted ones; (ii) the sensitivity Sn =TP/(TP + FN)
4.1.1 Generating random sequences In order to generate random which quantifies the proportion of implanted sites that were
sequences that maintain some characteristics of real sequences, we covered by the predicted ones; (iii) the performance coefficient
chose to use a Markovian background model of order 4 trained on PC = TP/(TP + FN + FP) which captures both the PPV and the
all upstream sequences of the yeast Saccharomyces cerevisiae. In sensitivity (Pevzner and Sze, 2000; Tompa et al., 2005).
the case of the yeast genome, Markovian background models of The comparison of the motif discovery software performances
orders 3 and 4 seem to give the best results for motif discovery (Fig. 1) shows that, under our conditions, info-gibbs and MEME
software (Thijs et al., 2001). The background model was chosen to supersede all the other tools. The third best result is obtained
maximize the performances of the software that take advantage of by BioProspector, which achieves a Sn similar to info-gibbs and
higher order background models. This Markov model was used with MEME, but has a lower PPV and PC.
random-seq to generate 100 sets of 10 sequences of 500 bp. In order to assess the capability of the motif discovery software
to detect motifs with a higher variablity, we generated as previously
4.1.2 Generating random motifs The program random-motif the synthetic datasets (100 sets of 10 sequences of 500 bp) with
generates random motifs of a given length, with a controllable level implanted motifs whose degree of conservation c varies from 0.95 to
of conservation (c). For each position of the motif a residue is chosen 0.7. Figure 2 shows that the performances of all programs decrease
randomly, and its probability is set to c. The remaining probabilities rapidly when motifs are less conserved, and fail to detect motifs
(1−c) are shared in an equiprobable way between the other residues. below c = 0.7 (Numerical values are reported in Supplementary

2718

[15:39 29/9/2009 Bioinformatics-btp490.tex] Page: 2718 2715–2722


info-gibbs

1.00 c = 0.91 c = 0.91


1.00 1.00
Bernoulli Markov 1
0.75 0.75 0.75
Sn, PPV & PC

PC
0.50 0.50
0.50
0.25 0.25

0.25 0 0
0.1 0.5 1 2 4 0.1 0.5 1 2 4
pesudo count pesudo count
0 IC LLR LRw
AlignAce BioProspector GAME Gibbs info-gibbs MEME MotifSampler c = 0.85 c = 0.85
Sn PPV PC 1.00 1.00
Bernoulli Markov 1
0.75 0.75
Fig. 1. Software performances on synthetic data. For each software the

PC
0.50 0.50
sensitivity Sn (black bars), the PPV (gray bars) and the PC (line) are shown.
0.25 0.25

1.00 0 0
0.1 0.5 1 2 4 0.1 0.5 1 2 4
pesudo count pesudo count
IC LLR LRw
c = 0.75 c = 0.75
0.75 1.00 1.00

Downloaded from bioinformatics.oxfordjournals.org by guest on April 6, 2011


Bernoulli Markov 1
0.75 0.75
PC

PC

0.50 0.50 0.50

0.25 0.25

0.25 0 0
0.1 0.5 1 2 4 0.1 0.5 1 2 4
pesudo count pesudo count
IC LLR LRw
0
0.95 0.91 0.9 0.85 0.8 0.7
Fig. 3. Comparison between scoring methods on synthetic data. All motifs
Conservation
were discovered with the same program (info-gibbs) the difference being the
AlignAce BioProspector updating strategy: motif-wise IC, motif-wise LLR or site-wise LR (LRw ),
GAME Gibbs
info-gibbs MEME respectively. Synthetic motifs with various degree of conservation (c = 0.91,
MotifSampler 0.85 and 0.75) were used. The PC is reported for various values of the
pseudo-count parameter. The left column shows the results obtained when
using a Bernoulli background model and the right column shows the results
Fig. 2. Software performances on synthetic data. For each software and a
when using a Markovian background model of order 1 (order 2 gives similar
level of motif conservation ranging from 0.95 to 0.70, the PC is reported.
results).

Table S1). The ranking of the software does not depend on the
degree of conservation of the implanted motif. info-gibbs and Equation (3)] was also evaluated by comparing the predictions made
MEME perform better than the other tools over the whole range using five levels of pseudo-count (0.1, 0.5, 1.0, 2.0 and 4.0).
of conservation tested. In each case, predicted motifs were assessed by comparing the
The performances on synthetic data are, however, strongly positions of the implanted sites with those of the sites included in
influenced by somewhat arbitrary choices (background model, site the discovered motif and by computing as previously the PC. The
density, etc.). In order to evaluate the software tools in realistic results are show in Figure 3.
conditions, we also tested their performances on biological datasets In this evaluation the differences between the scoring methods do
(Sections 4.2 and 4.3). not seem to be influenced by the background model (first and second
column in Fig. 3) even if all the strategies tend to give slightly better
4.1.5 Comparison between scoring methods We evaluated on results when higher order background models are used. In contrast,
synthetic data the impact of three scoring methods used for motif the pseudo-count parameter influences deeply the results. Around
updating (as defined in Section 3.1): (i) classical site-wise likelihood or lower than the commonly used value (pseudo-count = 1.0), the
ratio LRw (ii) motif-wise LLR; and (iii) motif-wise IC. performances of the motif-wise IC and LLR strategies are similar
Synthetic motifs with various degree of conservation (0.91, 0.85 and always better than the site-wise likelihood ratio. At unusual
and 0.75) were generated and implanted (1 expected site per levels of pseudo-count (pseudo-count= 4.0) only the motif-wise
sequence) in random sequences (100 sets of 10 sequences of 200 bp). LLR strategy remains efficient.
The sequences were generated according to a Markov model of order
2 trained on yeast upstream non-coding sequences.
In order to evaluate the influence of the background model in 4.2 Evaluation on RegulonDB data
the motif discovery process both Bernoulli background models The second evaluation focused on transcriptional regulation of
directly estimated by info-gibbs from the input sequences and pre- Escherichia coli K12. Regulons (i.e. sets of genes regulated by
computed higher order Markovian models were used. The influence the same TF) and associated TFBSs where retrieved from the
of the pseudo-count used to compute the motif score [as defined in RegulonDB 6.0 database (Gama-Castro et al., 2008), which provides

2719

[15:39 29/9/2009 Bioinformatics-btp490.tex] Page: 2719 2715–2722


M.Defrance and J.van Helden

18
a comprehensive set of curated data for the E.coli K12 transcriptional
regulatory network. 16

Number of regulons
Starting from the binding sites table, we selected all the TFs for 14
which RegulonDB contains an annotated PSSM. This provides a 12
reference set of 32 regulons (Supplementary Table S3) with good 10
annotations about target genes, binding sites and regulatory motifs 8
(PSSMs). For each regulon, we retrieved sequences upstream of the
6
start codon up to the next gene, with a maximal length of 500 bp.
4
Motif discovery softwares were run on each regulon with motif
length fixed to 16 and expected number of occurrences set to 2 sites 2

per sequences. Like the previous analysis we set the number of runs 0
0.6 0.7 0.8 0.9
to at least 10 (when this option was available) and considered the
best motifs predicted. For all the softwares the parameters were Asymptotic covariance threshold
set to search on both strands. All other parameters were left to AlignAce BioProspector
GAME Gibbs
their defaults and no background model were provided as input
info-gibbs MEME
to the software excepting the G+C frequency for programs that MotifSampler
require it (AlignACE). Under these conditions most of the software
compute a Bernoulli background using the input sequences (all

Downloaded from bioinformatics.oxfordjournals.org by guest on April 6, 2011


Fig. 4. Software performances on RegulonDB datasets. The number of
except BioProspector). predicted motifs with a normalized asymptotic covariance greater than or
In the case of the E.coli genome the influence of the background equal to the given threshold is reported for each software.
model is less important than with eukaryotic genomes. We have thus
chosen for this dataset to compare the software with the simplest
background model (Bernoulli). This allows to compare all the Hereafter, we discuss three selected regulons which illustrate
software under the same conditions, even for those not supporting the diversity among software predictions, and permit to highlight
higher order Markov models. some strengths and weaknesses of info-gibbs. The complementary
The evaluation was based on a comparison between annotated cumulative distributions of the asymptotic covariance between
and predicted motifs rather than between annotated and predicted annotated and predicted regulatory motifs for the PurR, DnaA and H-
sites, as done for the synthetic data. Indeed, since annotations are NS regulons are shown in Figure 5. The histograms of the asymptotic
likely to be incomplete, using annotated sites as references would covariance distribution are also shown in Supplementary Figure S2,
lead to consider as FPs some predictions that might correspond to while the mean, the first, second and third quartile (i.e. 0.25 quantile,
unannotated but nevertheless effective binding sites. The predicted median and 0.75 quantile) are reported in Supplementary Table S2.
motifs were compared with annotated motifs using the asymptotic Due to the stochastic nature of the algorithms, we found essential
covariance defined by Pape et al. (2008). The asymptotic covariance to base our evaluation on the analysis of the whole distribution
gives a good measure of the similarity between two motifs, by of asymptotic covariance values, rather than reducing the results
estimating the overlap between the sites detected by two PSSMs to some summary statistics as the mean or the median. Indeed,
above a given score threshold. the complementary cumulative distributions and the histograms
The asymptotic covariance was computed using the software sstat indicate not only the central tendency, but also the dispersion of
(Pape et al., 2008) with an upper threshold set to 0.001 for the the performances across various runs of the programs.
α-value (corresponding to a random expectation of 1 prediction per The first example, corresponding to 19 genes regulated by
1000 bp). We normalized the asymptotic covariance by dividing it by the PurR TF, is typical of a well-conserved motif that is easily
the maximal possible value, obtained by comparing the annotated identified by most of the software. The full distribution, however,
motif with itself. Under these conditions, a predicted motif very shows interesting differences between the respective performances.
close to the annotated one will give a score close to 1.0, while a Methods based on IC (GAME, info-gibbs) return in most of
prediction very distant from the annotation will result in a score the cases motifs that are very close to the annotated ones (first
close to 0. quartile >0.9 indicating that 75% of the predictions reach an
Since most of the evaluated software are based on stochastic asymptotic covariance of at least 0.9). Three other programs
algorithms (all programs except MEME), we averaged the (MEME, BioProspector and MotifSampler) perform similarly well.
asymptotic covariance over 100 independent trials over each dataset. Alignace gives lower results, yet significantly better than random
As a negative control, we also compared the annotated motifs with predictions. For Gibbs and AlignACE, the histograms show a
random predictions (referred as random), obtained by running info- wide dispersion, suggesting a poor capability to converge toward
gibbs with 0 iterations (See Supplementary Table S4 for the full reproducible results.
results). The second example, related to the DnaA TF, shows very
To summarize the results we computed the number of predicted contrasted results. Most of the software except info-gibbs predict
motifs that were close to the annotated ones by counting the number motifs that are very far from the annotated one, and their
of regulons for which the asymptotic covariance were greater than performances are close to random predictions. Obviously, the
or equal to thresholds ranging from 0.6 to 0.9 (Fig. 4). In this study, dispersion is also very high for all the stochastic software (i.e. not
info-gibbs shows better performance than other tools and was the MEME).
only software to predict motifs that were very close to the annotated The last example highlights the difficulty to find the poorly
one (asymptotic covariance ≥ 0.9). conserved H-NS motif. As shown in the logo representation (Crooks

2720

[15:39 29/9/2009 Bioinformatics-btp490.tex] Page: 2720 2715–2722


info-gibbs

1.00 10

PurR

Number of regulons
0.75
PuR 8
P(AS ≥ x)

0.50

0.25 6

0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 4
Asymptotic covariance (AS)
AlignAce BioProspector
GAME Gibbs
info-gibbs MEME 2
MotifSampler random

1.00
0
0.75 DnaA 0.6 0.7 0.8 0.9
DnaA
Asymptotic covariance threshold
P(AS ≥ x)

0.50

AlignAce BioProspector
0.25
GAME Gibbs
info-gibbs MEME
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 MotifSampler
Asymptotic covariance (AS)
AlignAce BioProspector

Downloaded from bioinformatics.oxfordjournals.org by guest on April 6, 2011


GAME Gibbs
info-gibbs MEME
MotifSampler random Fig. 6. Software performances on yeast ChIP-chip datasets. The number of
1.00
predicted motifs with a normalized asymptotic covariance greater than the
given threshold is reported for each software.
0.75 H-NS
H-NS
P(AS ≥ x)

0.50 sequence and we considered the best motifs predicted among at least
10 runs. Since the differences between Bernoulli background models
0.25
and higher order models are significant with yeast genomes, we have
0 chosen for this dataset the background model that gives the best
0 0.1 0.2 0.3 0.4 0.5 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
Asymptotic covariance (AS) results for all the softwares that can use higher order background
AlignAce BioProspector
GAME Gibbs models. We provided to those software a Markov model of order
info-gibbs MEME
MotifSampler random 3 trained on all yeast promoters and a Bernoulli model trained
on the same set to the others. For all the software the parameters
Fig. 5. Complementary cumulative distributions of the asymptotic were set to search on both strands. All other parameters were left
covariance between annotated and predicted regulatory motifs for E.coli K12 to their defaults. Discovered motifs were evaluated by computing
PurR, DnaA and H-NS regulons for each software. The random behaviors their averaged asymptotic covariance (over 10 trials) with the motifs
are represented by the dashed lines. annotated in SCPD (Supplementary Table S5).
All programs show contrasted performances (Fig. 6): they
et al., 2004; Schneider and Stephens, 1990) of the annotated motif correctly detect between 0 and 8 motifs among the 29 datasets.
(Fig. 5), only one position in the middle of motif is well conserved. On these datasets, info-gibbs and MotifSampler supersede the other
Given the poor IC of this annotated motif, it is not surprising that tools and all programs supporting higher order Markov models
info-gibbs fails to identify it. However, all the other software also show higher performances (BioProspector, info-gibbs, MEME and
fail to return relevant results. Actually, the distributions are close to MotifSampler).
the random motif selection for all the softwares.

4.3 Analysis of yeast ChIP-chip datasets 5 DISCUSSION AND CONCLUSION


The third evaluation focused on Harbison et al.’s (2004) ChIP-chip We introduced here a new stochastic motif finding algorithm that
data. We discovered motifs in promoters of 29 sets of target genes directly optimizes IC or LLR rather than computing it a posteriori.
detected by ChIP-chip experiments for 17 yeast TFs (some factors This approach showed interesting results on both simulated and
were tested in various culture media). Among the clusters found in biological data: info-gibbs performs equal to or better than all the
the Harbison experiments, we selected all the TFs for which a matrix other methods tested here on the synthetic datasets, on bacterial
was annotated in the SCPD database (Zhu and Zhang, 1999) and for regulons and on yeast ChIP-chip data.
which Harbison et al. (2004) reported at least five target genes. Assessing the quality of motif discovery is a difficult task for
This test corresponds to realistic conditions, since those datasets are several reasons. In contrast with protein structure, we do not possess
known to be more noisy, each set of genes likely contains a mixture a dataset that could be considered as the target to which predictions
of effective target genes and of FPs. In addition, some factors were can be compared. One problem faced by Tompa et al. (2005) in
tested in various culture media, which may affect their activity, and their community-based assessment of 13 motif discovery programs
thereby the preferential binding to their target promoters. was the definition of suitable testing sets. Their dataset contained
For each set of target genes, we retrieved sequences upstream of a mixture of biological and synthetic sequences, with implanted or
the start codon up to the next gene, with a maximal length of 800 bp. native TFBs. We took inspiration from this experience, but adopted
Motif discovery was performed on each regulon with motif length a different strategy to face the lack of annotations, by separating the
fixed to 16 and expected number of occurrences set to 2 sites per evaluation in two independent tests: (i) synthetic motifs implanted in

2721

[15:39 29/9/2009 Bioinformatics-btp490.tex] Page: 2721 2715–2722


M.Defrance and J.van Helden

synthetic sequences, with an evaluation based on site comparisons; Hertz,G.Z. et al. (1990) Identification of consensus patterns in unaligned dna sequences
(ii) biological sequences with their native sites, with a motif- known to be functionally related. Comput. Appl. Biosci., 6, 81–92.
Hughes,J.D. et al. (2000) Computational identification of cis-regulatory elements
based rather than site-based evaluation. Another originality of our
associated with groups of functionally related genes in Saccharomyces cerevisiae.
evaluation is to directly address the intrinsic lack of reproducibility J. Mol. Biol., 296, 1205–1214.
of stochastic approaches, by analyzing the distribution of matching Jensen,S. et al. (2004) Computational discovery of gene regulatory binding motifs: a
statistics rather than choosing the average performance. bayesian perspective. Stat. Sci., 19, 188–204.
The most important limitation of the proposed approach, which Jensen,S.T. and Liu,J.S. (2004) Biooptimizer: a Bayesian scoring function approach to
motif discovery. Bioinformatics, 20, 1557–1564.
is shared with some of the other methods, is that the user has Lawrence,C.E. et al. (1993) Detecting subtle sequence signals: a gibbs sampling strategy
to specify input parameters which have a strong impact on the for multiple alignment. Science, 262, 208–214.
result, in particular the motif length and the expected number of Liu,J. et al. (1995) Bayesian models for multiple local sequence alignment and Gibbs
occurrences. One strategy (supported by MEME) is to test various sampling strategies. J. Am. Stat. Assoc., 90, 1156–1170.
Liu,X. et al. (2001) BioProspector: discovering conserved DNA motifs in upstream
lengths and select the motif that optimizes some a posteriori
regulatory regions of co-expressed genes. In Pacific Symposium on Biocomputing, 6,
statistics (e.g. information per column). The expected number of pp. 127–138.
occurrences is harder to estimate when maximizing IC: under- Neuwald,A.F. et al. (1995) Gibbs motif sampling: detection of bacterial outer membrane
estimating the number of occurrence could force the algorithm protein repeats. Protein Sci., 4, 1618–1632.
to converge toward a too ‘sharp’ motif, while overestimating it Pape,U.J. et al. (2008) Natural similarity measures between position frequency matrices
with an application to clustering. Bioinformatics, 24, 350–357.
can result in a useless fuzzy motif. Another important question Pavesi,G. et al. (2001) An algorithm for finding signals of unknown length in DNA
is the choice of the temperature parameter. An adequate usage of

Downloaded from bioinformatics.oxfordjournals.org by guest on April 6, 2011


sequences. Bioinformatics, 17 (Suppl. 1, S207–S214.
the temperature can positively influence both running time and Pevzner,P.A. and Sze,S.H. (2000) Combinatorial approaches to finding subtle signals
prediction quality. In our test with synthetic data, a temperature in DNA sequences. Proc. Inter. Conf. Intell. Syst. Mol. Biol., 8, 269–278.
Roth,F.P. et al. (1998) Finding DNA regulatory motifs within unaligned noncoding
set to a value slightly inferior to 1.0 seems appropriate. Using a
sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16,
variable or bistable temperature can greatly reduce computation 939–945.
time while maintaining the prediction quality (data not shown). Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display
However, allowing a more complex temperature scheme implies consensus sequences. Nucleic Acids Res., 18, 6097–6100.
to fix parameters that are problem dependent and also need to Schneider,T.D. et al. (1986) Information content of binding sites on nucleotide
sequences. J. Mol. Biol., 188, 415–431.
be optimized (Shida, 2006a). Despite these limitations, the overall Shida,K. (2006a) Gibbsst: a Gibbs sampling method for motif discovery with enhanced
performances of info-gibbs are promising. Furthermore, the tool is resistance to local optima. BMC Bioinformatics, 7, 486.
integrated in the extensive software suite RSAT, which provides a Shida,K. (2006b) Hybrid Gibbs-sampling algorithm for challenging motif discovery:
flexible and powerful environment for deciphering the regulatory Gibbsdst. Genome Inform., 17, 3–13.
Sinha,S. and Tompa,M. (2002) Discovery of novel transcription factor binding sites by
elements from genome sequences.
statistical overrepresentation. Nucleic Acids Res., 30, 5549–5560.
Funding: This was supported by the Belgian Program on Sinha,S. and Tompa,M. (2003) YMF: A program for discovery of novel transcription
factor binding sites by statistical overrepresentation. Nucleic Acids Res., 31,
Interuniversity Attraction Poles, initiated by the Belgian Federal 3586–3588.
Science Policy Office, project P6/25 (BioMaGNet), who funded Stormo,G.D. (1998) Information content and free energy in DNA–protein interactions.
the postdoc grant of M.D. The BiGRe laboratory is member of J. Theor. Biol., 195, 135–137.
the BioSapiens Network of Excellence funded under the sixth Thijs,G. et al. (2001) A higher-order background model improves the detection of
promoter regulatory elements by Gibbs sampling. Bioinformatics, 17, 1113–1122.
Framework program of the European Communities (LSHG-CT-
Thijs,G. et al. (2002) A Gibbs sampling method to detect overrepresented motifs in the
2003-503265). upstream regions of coexpressed genes. J. Comput. Biol., 9, 447–464.
Thomas-Chollier,M. et al. (2008) RSAT: regulatory sequence analysis tools. Nucleic
Conflict of Interest: none declared. Acids Res., 36, W119–W127.
Tompa,M. et al. (2005) Assessing computational tools for the discovery of transcription
factor binding sites. Nat. Biotechnol., 23, 137–144.
REFERENCES van Helden,J. (2003) Regulatory sequence analysis tools. Nucleic Acids Res., 31,
3593–3596.
Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to van Helden,J. et al. (1998) Extracting regulatory sites from the upstream region of yeast
discover motifs in biopolymers. Proc. Inter. Conf. Intell. Syst. Mol. Biol., 2, 28–36. genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281,
Chan,T.-M. et al. (2008) TFBS identification based on genetic algorithm with combined 827–842.
representations and adaptive post-processing. Bioinformatics, 24, 341–349. van Helden,J. et al. (2000) Discovering regulatory elements in non-coding sequences
Crooks,G.E. et al. (2004) WebLogo: a sequence logo generator. Genome Res., 14, by analysis of spaced dyads. Nucleic Acids Res., 28, 1808–1818.
1188–1190. Wei,Z. and Jensen,S.T. (2006) Game: detecting cis-regulatory elements using a genetic
Gama-Castro,S. et al. (2008) RegulonDB (version 6.0): gene regulation model algorithm. Bioinformatics, 22, 1577–1584.
of Escherichia coli k-12 beyond transcription, active (experimental) annotated Zhu,J. and Zhang,M.Q. (1999) SCPD: a promoter database of the yeast Saccharomyces
promoters and Textpresso navigation. Nucleic Acids Res., 36, D120–D124. cerevisiae. Bioinformatics, 15, 607–611.
Harbison,C.T. et al. (2004) Transcriptional regulatory code of a eukaryotic genome.
Nature, 431, 99–104.
Hertz,G.Z. and Stormo,G.D. (1999) Identifying dna and protein patterns with statistically
significant alignments of multiple sequences. Bioinformatics, 15, 563–577.

2722

[15:39 29/9/2009 Bioinformatics-btp490.tex] Page: 2722 2715–2722

Vous aimerez peut-être aussi