Correspondence
sinhas@illinois.edu
In Brief
A systematically constructed ensemble
of sequence-to-expression models
captures many distinct, biologically
plausible hypotheses. Systematic
analysis and filtering of the models,
followed by in vivo experiments, improve
our understanding of combinatorial
regulation of the Drosophila ind gene.
Highlights
• Conventional modeling of enhancer readouts may lead to incorrect conclusions
Article
http://dx.doi.org/10.1016/j.cels.2015.12.002
Cell Systems 1, 396–407, December 23, 2015 ©2015 Elsevier Inc.

SUMMARY

To understand the relationship between an enhancer DNA sequence and quantitative gene expression, thermodynamics-driven mathematical models of transcription are often employed. These “sequence-to-expression” models can describe an incomplete or even incorrect set of regulatory relationships if the parameter space is not searched systematically. Here, we focus on an enhancer of the Drosophila gene ind and demonstrate how a systematic search of parameter space can reveal a more comprehensive picture of a gene’s regulatory mechanisms, resolve outstanding ambiguities, and suggest testable hypotheses. We describe an approach that generates an ensemble of ind models; all of these models are technically acceptable solutions to the sequence-to-expression problem in light of wild-type data, and some represent mechanistically distinct hypotheses about the regulation of ind. This ensemble can be restricted to biologically plausible models using requirements gleaned from in vivo perturbation experiments. Biologically plausible models make unique predictions about how specific ind enhancer sequences affect ind expression; we validate these predictions in vivo through site mutagenesis in transgenic Drosophila embryos.

INTRODUCTION

Transcription factors (TFs) work in concert with other DNA-binding molecules to regulate gene expression. These molecules act as inputs at enhancers, distinct genomic regions that contain binding sites for TFs and can regulate the transcription of target genes (Shlyueva et al., 2014). Maintaining a quantitative relationship between input and transcriptional output is key to the precise patterning of gene expression. Accordingly, as the levels of inputs vary across different cell types, the enhancer-controlled levels of gene expression (also termed as the “readout” of the enhancer) also vary (Yáñez-Cuna et al., 2013). These relationships are a direct function of the enhancer’s DNA sequence. However, a detailed understanding of how enhancer sequence affects a gene’s expression level remains elusive (Yáñez-Cuna et al., 2013). Such understanding may be achieved by interrogating a mathematical model that explains the available experimental results about the gene both qualitatively and quantitatively, suggests experiments to improve upon the current model, and is capable of predicting the gene’s expression pattern upon cis or trans perturbations. Here, we refer to such models as “sequence-to-expression” models, and we show how they can form the basis of a systematic, unbiased enquiry into gene regulation by multiple TFs.

A common paradigm of sequence-to-expression modeling is based on equilibrium thermodynamics (Shea and Ackers, 1985). This approach models the rate of transcription initiation based on quantitative descriptions of variable site affinities (“motifs”) (Stormo, 2000) and expression levels of TFs. Because they can incorporate the DNA-sequence-dependent characteristics of TF binding, sequence-to-expression thermodynamic models of this genre are arguably more realistic than thermodynamic models where all TF-binding sites are assumed to have the same affinity (Cohen et al., 2014; Fakhouri et al., 2010; Papatsenko and Levine, 2008; Zinzen and Papatsenko, 2007) or only classified as “strong” versus “weak” (Bintu et al., 2005; Gertz et al., 2009; Parker et al., 2011; White et al., 2012). We previously reported one such sequence-to-expression model called GEMSTAT (Gene Expression Modeling based on Statistical Thermodynamics) and used it
for modeling 40 enhancers involved in anterior-posterior patterning during early embryonic development of Drosophila (He et al., 2010).

Sequence-to-expression models like GEMSTAT face a significant challenge. They are formulated using one to three free parameters that are specific to each TF and other parameters that represent TF-TF interactions and basal activity at the promoter. Because a typical enhancer is controlled by a handful of TFs, even simple models may have a considerable number of free parameters. Most of these parameters have not been determined experimentally. Instead, they are estimated by numerical algorithms that operate within a defined parameter space to identify parameter values whose predictions are optimal fits to the data. Importantly, these fits are not necessarily unique. Since the models are typically over-parameterized relative to the available experimental data, a given model may make different predictions that are consistent with available data at distinct parameter settings.

It is entirely possible that there are multiple optima within the parameter space for a given enhancer. For example, several papers on multi-parameter models for systems biology and climatology (Gutenkunst et al., 2007; Tebaldi and Knutti, 2007) have demonstrated that different parameter sets can fit the same data. In such cases, the different optima may represent distinct, mutually exclusive hypotheses about underlying mechanisms. A systematic interrogation of a gene’s sequence-to-expression model must consider all such hypotheses and distinguish between them. Some related studies have indeed found widely different parameter assignments to fit a dataset (Dresch et al., 2010; Granek and Clarke, 2005; Zinzen and Papatsenko, 2007), yet the standard practice is to report one or at most a few best models that result after iterative improvements upon initial random guesses (He et al., 2010; Janssens et al., 2006; Kim et al., 2013; Parker et al., 2011; Segal et al., 2008).

This suggests that if one is to understand the relationship between enhancer sequence and quantitative gene expression in an unbiased way, one should go beyond the common practice of seeking the single best-fitting model. Instead, one could explore the entire parameter space for models that agree with data.

Here, we take this approach and perform a systematic exploration of the parameter space of the GEMSTAT model for a specific developmental gene, intermediate neuroblasts defective (ind), in D. melanogaster. Beginning with a qualitative understanding of its likely regulators, we adopt an ensemble modeling approach (Brown et al., 2004; Kuepfer et al., 2007; Swigon, 2013; Toni et al., 2009; Villaverde et al., 2014; von Dassow et al., 2000) to learn all quantitative models consistent with wild-type expression data, then use a visualization tool that we introduce here to recognize the distinct mechanistic hypotheses they represent. Next, we ask how additional perturbation experiments reported in the literature refine the ensemble of models, and we eliminate mechanistic explanations that are inconsistent with these additional data. The surviving ensemble of models, despite representing distinct hypotheses about regulation, can make unambiguous predictions about ind’s response to specific perturbations. We verify these predictions experimentally. Using this approach iteratively, we can narrow down the ensemble of consistent models and refine our understanding of the gene’s cis-regulatory logic. In total, we outline a “strong inference” approach (Platt, 1964) that systematically eliminates various mechanistic explanations of the data, within the context of a predetermined qualitative model, and also suggests the additional experiments necessary to further refine our mechanistic understanding.

RESULTS

A Model of Transcriptional Regulation by Sequence-Specific Transcription Factors and Their Interplay with Signaling Molecules

We modified GEMSTAT (He et al., 2010), a sequence-to-expression model, to study how TFs bound to the ind enhancer may regulate the gene’s expression. We outline the main assumptions and components of the model here; see Experimental Procedures and Supporting Online Information for details of model formulation. GEMSTAT is founded on a theory of combinatorial gene regulation first proposed by Shea and Ackers (Shea and Ackers, 1985). The model considers the system of TF molecules and their cognate sites in the enhancer, as well as the basal transcriptional machinery (BTM) and its binding to the promoter, and uses a minimal set of parameters to model the interactions among TFs, BTM, and DNA (Figure 1A). All interactions are assumed to happen in thermodynamic equilibrium, which is assumed to be reached much more rapidly than the timescale at which the transcription machinery is activated and begins producing mRNA. Under these assumptions, the transcription initiation rate, and hence the equilibrium level of mRNA transcription, are proportional to the fractional occupancy of the BTM at the promoter. GEMSTAT computes this fractional occupancy by considering all possible configurations of DNA-bound TFs and BTM (Figure 1B) and summing the probability of configurations where the BTM is promoter-bound (Figure 1C). The equilibrium probability of each configuration is computed based on the Boltzmann distribution. As TF concentrations change across cell types, the probability of bound BTM configurations also changes, reflecting the variation of expression levels due to the change in regulator concentration (Figure 1D).

An important distinction of the GEMSTAT model from several other thermodynamics-based models, as we mentioned in the Introduction, is its ability to account for varying affinities of a TF’s binding sites, by relating mutations from the optimal or “consensus” site to corresponding changes in binding energy. For this, GEMSTAT implements Berg and von Hippel’s theory of protein/DNA interaction energetics (Berg and von Hippel, 1987), using the TF’s position weight matrix (Stormo, 2000) to predict the “mismatch energy” relative to the consensus site (Supporting Online Information).

Our modification of GEMSTAT in this work allows for modulation of a TF’s DNA-binding affinity depending on the concentration of some other molecular species, which in our case (next section) was the dual phosphorylated extracellular signal-regulated kinase (dpERK). We modeled a recently proposed “de-repression” mechanism whereby the kinase attenuates a repressor TF’s DNA-binding affinity, resulting in higher expression levels of the regulated gene at higher levels of the kinase (Figure 1E). This allows us to model how a non DNA-binding
Figure 1. Overview of the GEMSTAT Model
(A–C) GEMSTAT models the expression readout of
an enhancer from the strength of TF-DNA and TF-
BTM interactions in thermodynamic equilibrium (A)
and considers all possible configurations of bound
TFs and the BTM (B) to compute the equilibrium
probability of BTM occupancy at promoter (C).
Shown is a hypothetical probability distribution for
the configurations shown in (B); probability of BTM
occupancy is computed from configurations c1–c4.
(D) GEMSTAT’s predictions change as TF concentrations change across different conditions. Shown
is the profile of mRNA levels (magenta) resulting
from a uniformly expressed activator (green) and a
graded repressor (red); horizontal axis shows
different spatial locations. Also shown is how the
equilibrium probability distributions change with
changes in TF concentrations.
(E) A molecular species (gray) that can attenuate the
DNA-binding affinity of a repressor may increase the
mRNA level of the gene shown in (D).
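The computation summarized in panels (A) to (D) can be illustrated with a minimal Python sketch. It is not the GEMSTAT implementation: it assumes non-overlapping sites, ignores TF-TF interaction terms, and uses made-up weights. The function name `btm_occupancy` and its arguments are hypothetical.

```python
from itertools import product


def btm_occupancy(site_weights, site_alphas, q_btm):
    """Return the equilibrium probability that the BTM is promoter-bound.

    site_weights[i] -- statistical weight q of the TF bound at site i
    site_alphas[i]  -- TF-BTM interaction term (>1 activates, <1 represses)
    q_btm           -- statistical weight of the BTM bound to the promoter

    Simplifying assumptions (not from the paper): sites never overlap and
    bound TFs do not interact with each other.
    """
    z_unbound = 0.0  # summed weight of configurations without the BTM
    z_bound = 0.0    # summed weight of configurations with the BTM
    # Enumerate every occupied/unoccupied assignment of the sites.
    for config in product((0, 1), repeat=len(site_weights)):
        weight = 1.0
        interaction = 1.0
        for occupied, q, alpha in zip(config, site_weights, site_alphas):
            if occupied:
                weight *= q
                interaction *= alpha
        z_unbound += weight
        z_bound += weight * interaction * q_btm
    return z_bound / (z_bound + z_unbound)


# Uniform activator and graded repressor, as in panel (D); numbers are made up.
repressor_levels = [0.0, 0.5, 1.0, 2.0, 4.0]
profile = [
    btm_occupancy([1.0, 5.0 * r], [4.0, 0.05], q_btm=0.5)
    for r in repressor_levels
]
```

In this toy setting the predicted occupancy falls as the repressor level rises, which is the qualitative behavior the panel describes.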
by relieving Cic-driven repression of ind. In particular, ERK
phosphorylates Cic (Astigarraga et al., 2007), and this has
been proposed to influence Cic activity by impeding its
DNA binding (Dissanayake et al., 2011; Lim et al., 2013),
leading to ind de-repression in a specific domain along
the D/V axis. This is the mechanism we chose to implement
here, though other mechanisms have also been proposed
(Grimm et al., 2012). We obtained a D/V profile of dual-
phosphorylated ERK (dpERK) from (Lim et al., 2013) and
used it as a proxy for ERK activity. To model this effect,
we modified GEMSTAT so that the energy of Cic-DNA
binding is increased (binding affinity is reduced) to an
extent proportional to dpERK concentration (Experimental
Procedures).
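As a rough illustration of this modification, the sketch below raises the binding energy of a repressor site in proportion to the dpERK level, which scales the site's statistical weight by an exponential factor. The proportionality constant `gamma`, the function names, and all numbers are assumptions made for the example; the actual parameterization is given in the authors' Experimental Procedures.

```python
import math


def attenuated_weight(q_repressor, dperk, gamma):
    """Statistical weight of a repressor site after dpERK attenuation.

    The binding energy is raised by gamma * dperk (in units that absorb the
    Boltzmann factor), so the weight is scaled by exp(-gamma * dperk).
    gamma is a hypothetical free parameter.
    """
    return q_repressor * math.exp(-gamma * dperk)


def site_occupancy(q):
    """Occupancy of a single isolated site with statistical weight q."""
    return q / (1.0 + q)


# Illustrative dpERK levels along the D/V axis (made-up values).
dperk_profile = [0.0, 0.25, 0.5, 1.0]
occupancies = [
    site_occupancy(attenuated_weight(3.0, d, gamma=2.0))
    for d in dperk_profile
]
```

Repressor occupancy drops where dpERK is high, which corresponds to de-repression of the target gene in that domain.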
parameter assignments that are as good or nearly as good in terms of their agreement with data, and examining the one optimal assignment reported by GEMSTAT may provide a skewed view of plausible models (Kirk et al., 2013). We therefore modified the GEMSTAT program to perform a comprehensive exploration of the multi-dimensional parameter space, with the goal of constructing a complete map of plausible quantitative models. To this end, we first generated a large number of 13-dimensional vectors (parameter assignments) as follows (Figure 3A; Supporting Online Information). We partitioned each parameter’s range into halves, which gave us $2^{13}$ compartments of the parameter space. From each compartment, we sampled 1,000 parameter vectors and scored them for their goodness of fit to data. Next, we sorted these $1{,}000 \times 2^{13}$ parameter vectors based on their scores, and for each parameter vector with a score among the top 2% of unique scores in this sorted list, we optimized the GEMSTAT model using that vector as the initial estimate of model parameters. The resulting collection of optimized models can predict ind expression accurately in the wild-type condition (Figure S1B), with little dispersion in their predictions (maximum difference with the mean is <0.07 at any position along the D/V axis). We call this collection of models (21,000 in total) the “wild-type ensemble.” We note that an alternate strategy to compute such an ensemble would be to sample from the parameter space using Monte Carlo-based techniques (Toni et al., 2009). However, there are major challenges associated with these methods (e.g., slow convergence to the stationary distribution and consequently less control over the coverage of the parameter space, the need for pre-computing an ensemble of models that covers the parameter space; Toni et al., 2009), which motivated us to choose the exhaustive strategy described above.

The 21,000 models of the wild-type ensemble spanned widely different compartments of the parameter space (652 out of $2^{13} \approx 8000$), and the marginal densities of individual parameters showed high variance and multimodality (Figure 3B). This suggested the existence of many distinct parameterizations that explain the wild-type data equally well, and we asked if this meant that many distinct mechanistic hypotheses explain the data on ind regulation. To this end, we summarized the models as motifs that visually depict how a particular model utilizes each TF to regulate ind in five key spatial domains along the D/V axis (Figure 3C, legend). Columns in a motif correspond to domains, symbols denote the regulator TFs, and the height of a symbol in a column represents the contribution of that TF in the corresponding region.

Hierarchical k-medoids clustering of the motifs for the wild-type ensemble revealed at least 36 distinct sets of mechanistic hypotheses (motifs) that are supported by the wild-type data (Figure 3D). As an example of how these hypotheses differ, we marked three motifs in the bottom row with asterisks, which suggest three distinct mechanisms of activating ind in its peak domain of expression (third column in the motif): (1) Zld is the dominant activator of ind, while Dl does not have an important role to that end (marked with a red asterisk); (2) Dl is the dominant activator of ind, while Zld does not play a strong role in activating ind (marked with a green asterisk); and (3) neither Dl nor Zld alone, but only their synergistic interaction, is the dominant input toward activating ind (marked with a blue asterisk). Clearly, a number of different quantitative models are consistent with wild-type data, raising the concern that some of these may yield incorrect predictions under perturbation conditions and prompting us to ask how additional experiments can refine the wild-type ensemble.

Data from Perturbation Experiments Narrow Down the Range of Plausible Models

Wild-type data may not have sufficient information to constrain a multi-parameter model such that it captures the precise extent of each TF’s effect on the target gene. To further constrain the values of model parameters, we examined how well models in the ensemble predict the effects of the following genetic perturbations for which we have data from the literature.

• Mutation of sna: the ind expression domain remains essentially unaltered in sna mutants. vnd expression is de-repressed in these embryos and expands ventrally such that ind stays repressed in the endogenous domain of sna expression (Figure S2A).
• Mutation of vnd: the ind expression domain expands ventrally, yet does not encroach into the mesoderm region, in vnd mutants (McDonald et al., 1998; Weiss et al., 1998).
• Mutation of Cic-binding sites in the ind enhancer: the readout of the ind enhancer expands dorsally, to an extent that matches the spatial domain of the Dl protein, upon mutating two particular Cic sites in the enhancer (Lim et al., 2013).

These are the only perturbations reported in the literature that manifest direct effects on ind expression. We used each model to predict ind expression upon knocking down a TF, by setting the DNA-binding parameter of the respective TF to zero. Additionally, when simulating sna knockdown, we replaced the spatial patterns of vnd and egfr with their altered patterns in sna mutants (Figures S2B and S2C). To predict the effect of mutating a site, we discarded the site from our set of annotated TF-binding sites in the ind enhancer (Experimental Procedures and Supplemental Experimental Procedures). In evaluating models on perturbation data, we focused on carefully selected domains along the D/V axis that, we reasoned, should provide adequate information about the accuracy of model predictions (Experimental Procedures).

For each of the 21,000 models in the wild-type ensemble, we evaluated its predictions on perturbation data and discarded every model that failed to correctly predict the known effects. We found 2,100 models whose predictions are accurate in both wild-type and the three perturbation conditions (Figure 4A). We call these models the “filtered ensemble.” Parameters of these models were found to be far more constrained than those of the initial ensemble, falling into 42 of the $2^{13}$ compartments in the parameter space, whereas the wild-type ensemble models span 652 compartments. (Also compare the solid to the dotted curves in Figure 3B, which shows marginal distributions of parameters.) Considering perturbation data in this manner greatly narrowed down the possible explanations of ind regulation, with the only surviving mechanistic hypotheses being those shown as motifs in the first row of Figure 3D. Whereas models in the wild-type ensemble could explain ind regulation without one of our three assumed repressors (the three motifs marked
Figure 3. Construction and Visualization of Model Ensembles
(A) Left to right: sampling of parameter vectors, scoring, model optimization initiated from each parameter vector that scored above the threshold (the resulting set
of optimized models is the ‘‘wild-type ensemble’’), and filtering of wild-type models according to their accuracy in predicting the effects of various perturbations.
The remaining models (not crossed out) constitute the ‘‘filtered ensemble.’’
(B) Marginal densities of parameters of the wild-type and the filtered ensemble models (dashed and solid lines, respectively).
(C) The motif for a model shows how it utilizes each TF to regulate ind in five domains along the D/V axis. Each domain corresponds either to the peak expression
domain of ind (domain 3) or a TF (domains 1 and 2: peak expression domains of Sna and Vnd, respectively) or to a domain where the effect on ind expression is
known for a specific site-mutagenesis experiment (domains 4 and 5: results known for Cic site mutagenesis). Columns in a motif correspond to domains, symbols
denote the regulator TFs, and the height of a symbol in a column represents the contribution of that TF in the corresponding region; if a TF f is an activator, then its
height in a column represents the root-mean-square error (RMSE) between model-predicted ind expression profiles in the corresponding domain when there is
no activator and when f is the only activator in the model. Similarly, the height of a repressor f is computed from the conditions when there is no repressor and
when f is the only repressor in the model.
(D) Representative motifs of the 36 clusters computed from the wild-type ensemble models.
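The sampling and thresholding steps of panel (A) can be sketched in Python as follows. This is an illustration only: it assumes uniform sampling within each half-range (the source does not state the sampling distribution here), uses a toy three-parameter space with a stand-in score, and all function names are hypothetical. In the setting described in the text, 13 parameters give 2**13 = 8,192 compartments, with 1,000 vectors drawn from each.

```python
import itertools
import random


def sample_compartments(bounds, per_compartment, rng):
    """Yield (compartment, vector) pairs covering every half-range combination.

    bounds          -- list of (low, high) ranges, one per parameter
    per_compartment -- number of vectors drawn from each compartment
    A compartment is a tuple of 0/1 flags: lower or upper half per parameter.
    Uniform sampling within each half is an assumption of this sketch.
    """
    for comp in itertools.product((0, 1), repeat=len(bounds)):
        for _ in range(per_compartment):
            vec = []
            for (lo, hi), half in zip(bounds, comp):
                mid = (lo + hi) / 2.0
                a, b = (lo, mid) if half == 0 else (mid, hi)
                vec.append(rng.uniform(a, b))
            yield comp, tuple(vec)


def top_fraction_by_unique_score(scored, fraction):
    """Keep items whose score is among the best `fraction` of unique scores.

    scored -- list of (score, item) pairs; lower scores are better.
    """
    unique = sorted({s for s, _ in scored})
    keep = max(1, int(len(unique) * fraction))
    cutoff = unique[keep - 1]
    return [item for s, item in scored if s <= cutoff]


# Toy run: 3 parameters, 5 vectors per compartment, made-up target and score.
rng = random.Random(0)
bounds = [(0.0, 1.0), (-2.0, 2.0), (10.0, 20.0)]
target = (0.2, 1.0, 12.0)
samples = list(sample_compartments(bounds, 5, rng))
scored = [
    (sum((v - t) ** 2 for v, t in zip(vec, target)), (comp, vec))
    for comp, vec in samples
]
seeds = top_fraction_by_unique_score(scored, 0.02)
```

Each retained vector in `seeds` would then serve as a starting point for local optimization, and the optimized models together would form the ensemble.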
Figure 4. Predictions of the Filtered Ensemble and Their Experimental Validations
(A) Predictions of filtered ensemble under perturbed conditions; left to right: mutation of sna, vnd, and two Cic sites in the ind enhancer. Shown are the mean (red)
and the range (shaded red area around the curve) of ensemble predictions. Plots in (C)–(E) follow the same semantics.
(B) Computationally identified sites for Dl (green) and Zld (cyan) in ind enhancer. Yellow and gray sequences show mutations to disrupt Dl and Zld sites,
respectively.
(C and D) Filtered ensemble models do not predict any significant change in ind expression upon the mutations reported in (Garcia and Stathopoulos, 2011) (C)
but predict that ind expression nearly abolishes upon removing the additional sites for Dl (D).
(E) Filtered ensemble models predict 50% reduction in ind expression upon mutating Zld sites in ind enhancer.
(F) The ind enhancer was used to drive expression that recapitulates the endogenous ind expression (ind1.4WT-lacZ) and Dl sites in the enhancer were mutated
(ind1.4Dlmut-lacZ). Embryos were co-stained with ind (red) and lacZ (yellow).
(G) Histograms of mean intensity values computed from bootstrapped profiles; asterisks mark bootstrapped mean values.
(H) Smoothed histograms from wild-type and mutant lacZ profiles (each created from 20 profiles [one per embryo] on 256 bins [one per intensity value]).
(I) Zld sites in the enhancer were mutated (ind1.4zldmut-lacZ). Embryos were co-stained with ind (red) and lacZ (white).
(J and K) Plots analogous to (G) and (H).
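The bootstrapped mean intensities mentioned in panels (G) and (J) could be computed along the following lines. The caption does not give the resampling procedure, so this sketch assumes the simplest scheme: embryo profiles are resampled with replacement and the mean intensity of each resample is recorded. The function name and the example data are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_means(profiles, n_resamples, rng):
    """Return mean intensities of bootstrap resamples of the profiles.

    profiles -- list of intensity profiles (one list of floats per embryo)
    Each resample draws len(profiles) profiles with replacement; its mean
    is the average of the per-profile mean intensities.
    """
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(profiles) for _ in profiles]
        means.append(fmean([fmean(p) for p in resample]))
    return means


# Made-up profiles for two groups of embryos.
rng = random.Random(1)
wild_type = [[0.8, 0.9, 1.0], [0.7, 0.9, 0.95], [0.85, 0.95, 1.0]]
mutant = [[0.1, 0.2, 0.15], [0.05, 0.1, 0.2], [0.1, 0.1, 0.1]]
wt_means = bootstrap_means(wild_type, 200, rng)
mut_means = bootstrap_means(mutant, 200, rng)
```

Histograms of `wt_means` and `mut_means` would then be compared, in the spirit of the panels.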
Figure 5. Predictions of the Filtered Ensemble Models for the Orthologs of the D. melanogaster ind Enhancer in Ten Other Drosophilids
See the legend of Figure S1 for the details of how the orthologs were extracted. Semantics of the plots are the same as that of Figure 4A. We also show in
parentheses the number of TF-binding sites in each ortholog. Additionally, the dotted red curve for each ortholog shows the best prediction of a filtered ensemble
model for the corresponding ortholog. The goodness of a model prediction for a given ortholog is defined as the sum of the squared error between the model
prediction and ind expression data in D. melanogaster.
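The quantities plotted for each ortholog (ensemble mean, range, and the best prediction by sum of squared error) can be sketched as below. The function names and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
def sse(prediction, data):
    """Sum of squared errors between a predicted and an observed profile."""
    return sum((p - d) ** 2 for p, d in zip(prediction, data))


def summarize_ensemble(predictions, data):
    """Return (mean, lower, upper, best) for a list of predicted profiles.

    mean, lower and upper are position-wise summaries across the ensemble;
    best is the single prediction with the smallest SSE against `data`.
    """
    n_positions = len(data)
    n_models = len(predictions)
    mean = [sum(p[i] for p in predictions) / n_models for i in range(n_positions)]
    lower = [min(p[i] for p in predictions) for i in range(n_positions)]
    upper = [max(p[i] for p in predictions) for i in range(n_positions)]
    best = min(predictions, key=lambda p: sse(p, data))
    return mean, lower, upper, best


# Toy ensemble of three predicted profiles over three positions.
reference = [0.0, 1.0, 0.0]
ensemble = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.0], [1.0, 1.0, 1.0]]
mean, lower, upper, best = summarize_ensemble(ensemble, reference)
```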
initiation or on activator recruitment are also plausible. Notably, Zld-mediated gene activation is still unclear (Supplemental
these alternative hypotheses rely not on attenuation of Cic’s Information), simply rejecting a direct activating input from
DNA binding but on the potency of Cic’s interaction with tran- Zld, as one would if one were taking a more conventional
scription initiation machinery or with other TFs. One can also approach, would produce a biased working hypothesis, and
imagine a scenario where ERK activates some yet-unknown opportunities for follow-up experiments could be missed. A
non-repressive TF that competes with Cic to bind DNA. Ascer- systematic ensemble approach can thus improve sequence
taining any such mechanism is a subject for future studies. to expression models by identifying mechanistic hypotheses
One might ask if modeling a larger dataset (i.e., multiple en- that are consistent with the entire dataset and also those
hancers) could make the conventional approach of computing that conform to different subsets of the data, thereby speci-
one or a few optimal models as effective as our ensemble fying the experiments for follow-up.
approach in terms of eliminating incorrect mechanistic expla-
nations. It is important to note that the conventional approach
EXPERIMENTAL PROCEDURES
attempts to discover models consistent with the entire data-
set. However, when relevant TFs and mechanisms of their The GEMSTAT Model
functions are not well understood, a systematic ensemble To estimate the probability that TFs bound to an enhancer regulates the
approach for individual enhancers could provide insights that expression of a target gene, GEMSTAT considers the ‘‘statistical weight’’ Zc
the one or few optimal models for the entire dataset would of each configuration c in the ensemble of all possible configurations of occu-
have missed. For example, when trying to fit the enhancers pied and unoccupied TF-binding sites and the promoter. Detailed formulation
of Zc and the algorithm to compute probability of mRNA expression from Zc
of ind and sog simultaneously (data not shown), we consis-
are given in Supplemental Information; here, we mention the parameters
tently discovered models that do not use a direct activating
that define Zc . For a given site S, the binding of a TF f at S contributes a sta-
input from Zld, presumably because the assumed inputs do tistical weight of qf;s = Kf;s ½f to Zc . Here Kf;s is the equilibrium constant of the
not include a dorsal repressor for sog. Given that our under- DNA-binding reaction between f and s, and ½f is the concentration of f Let
standing about the regulators of sog and the mechanism of Sfmax denote the strongest binding site of f and KðSfmax Þ denote the association
Cell Systems 1, 396–407, December 23, 2015 ª2015 Elsevier Inc. 405
constant of TF-DNA binding between f and Sfmax . We rewrite Kf;s as Received: August 10, 2015
KðSfmax ÞexpðbDEf;s Þ, where b is the Boltzmann constant and DEf;s denotes Revised: October 19, 2015
the ‘‘mismatch energy’’ of the site S relative to Sfmax for f. The concentration Accepted: December 2, 2015
½f is in arbitrary units and can be rewritten as v½frel , where ½frel is the concen- Published: December 23, 2015
tration of f relative to some unknown reference value v. Therefore,
REFERENCES
qf;S = K Sfmax v½frel expðbDEf;s Þ;
where KðSfmax Þ and v are unknown quantities. We take their product KðSfmax Þ Ajuria, L., Nieva, C., Winkler, C., Kuo, D., Samper, N., Andreu, M.J., Helman,
as a free parameter and refer to it as the ‘‘DNA-binding parameter’’ for f. In A., González-Crespo, S., Paroush, Z., Courey, A.J., and Jiménez, G. (2011).
addition to the qf;S terms, Zc includes the following multiplicative terms: (1) Capicua DNA-binding sites are general response elements for RTK signaling
uf1;f2 for each instance of two interacting TFs f1 and f2 bound in configuration in Drosophila. Development 138, 915–924.
c, (2) af for each instance of a TF f bound to one of its cognate sites and when c Astigarraga, S., Grossman, R., Dı́az-Delfı́n, J., Caelles, C., Paroush, Z., and
is a BTM-bound configuration, and (3) qBTM whenever c is a BTM-bound Jiménez, G. (2007). A MAPK docking site is critical for downregulation of
configuration. Capicua by Torso and EGFR RTK signaling. EMBO J. 26, 668–677.
Berg, O.G., and von Hippel, P.H. (1987). Selection of DNA binding sites by reg-
Filtering Models Based on Perturbation Data ulatory proteins. Statistical-mechanical theory and application to operators
Qualitative effects on the ind expression pattern upon various perturbations and promoters. J. Mol. Biol. 193, 723–750.
were mentioned in Results; we discard every model that fails to meet the
Bintu, L., Buchler, N.E., Garcia, H.G., Gerland, U., Hwa, T., Kondev, J., and
following quantitative criteria in predicting those effects.
Phillips, R. (2005). Transcriptional regulation by the numbers: models. Curr.
(1) Upon Sna knockout, a model should predict the mean probability of ind
Opin. Genet. Dev. 15, 116–124.
expression in the ventral-most 20% of the D/V axis to be low (%0.05 times of
the probability of ind expression at its peak domain). (2) Upon Vnd knockout, a Brown, K.S., Hill, C.C., Calero, G.A., Myers, C.R., Lee, K.H., Sethna, J.P., and
model should predict the mean probability of ind expression at locations of Cerione, R.A. (2004). The statistical mechanics of complex signaling networks:
peak Vnd expression (at least 80% of its maximum concentration level) to nerve growth factor signaling. Phys. Biol. 1, 184–195.
be high (mean probability of expression R0.80). (3) Upon mutation of two Cheng, Q., Kazemian, M., Pham, H., Blatti, C., Celniker, S.E., Wolfe, S.A.,
(out of four) Cic-binding sites, a model should predict ind expression to expand Brodsky, M.H., and Sinha, S. (2013). Computational identification of diverse
dorsally (mean probability of expression R0.80 at the location where wild-type mechanisms underlying transcription factor-DNA occupancy. PLoS Genet.
ind expression is half-maximal at its dorsal boundary) and to remain low ((% 9, e1003571.
0.05 times the probability of ind expression at its peak domain) in the dorsal-most 20% of the D/V axis.

Cohen, M., Page, K.M., Perez-Carrasco, R., Barnes, C.P., and Briscoe, J. (2014). A theoretical framework for the regulation of Shh morphogen-controlled gene expression. Development 141, 3868–3878.
Cowden, J., and Levine, M. (2003). Ventral dominance governs sequential patterns of gene expression across the dorsal-ventral axis of the neuroectoderm in the Drosophila embryo. Dev. Biol. 262, 335–349.
Dissanayake, K., Toth, R., Blakey, J., Olsson, O., Campbell, D.G., Prescott, A.R., and MacKintosh, C. (2011). ERK/p90(RSK)/14-3-3 signalling has an impact on expression of PEA3 Ets transcription factors via the transcriptional repressor capicúa. Biochem. J. 433, 515–525.
Dresch, J.M., Liu, X., Arnosti, D.N., and Ay, A. (2010). Thermodynamic modeling of transcription: sensitivity analysis differentiates biological mechanism from mathematical model-induced effects. BMC Syst. Biol. 4, 142.
Fakhouri, W.D., Ay, A., Sayal, R., Dresch, J., Dayringer, E., and Arnosti, D.N. (2010). Deciphering a transcriptional regulatory code: modeling short-range repression in the Drosophila embryo. Mol. Syst. Biol. 6, 341.
Foo, S.M., Sun, Y., Lim, B., Ziukaite, R., O’Brien, K., Nien, C.Y., Kirov, N., Shvartsman, S.Y., and Rushlow, C.A. (2014). Zelda potentiates morphogen activity by increasing chromatin accessibility. Curr. Biol. 24, 1341–1346.

Model Optimization
During parameter estimation, GEMSTAT minimizes the root-mean-square error (RMSE). At location i along the D/V axis, let $D_i$ and $M_i$ denote the ind expression level and the model-predicted readout of the ind enhancer, respectively. Assuming there are n data points, the RMSE for the predicted expression pattern is $\sqrt{(1/n)\sum_i (D_i - \beta M_i)^2}$, where $\beta$ is a free scaling parameter, as used in standard least-squares estimation. We constrain $\beta \le 2$, since allowing $\beta$ to assume arbitrary values may scale up very low predicted expression levels and assign them good RMSE scores, even levels as low as one may expect from randomly generated sequences, which is a problematic issue for sequence-to-expression models (He et al., 2012). Several other sequence-to-expression models also constrained the value of $\beta$ when optimizing RMSE (Kazemian et al., 2010; Kim et al., 2013; Segal et al., 2008). GEMSTAT uses the Nelder-Mead simplex and the quasi-Newton gradient descent algorithms to optimize RMSE.
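As an illustration of the objective described under Model Optimization, the following Python sketch evaluates the scaled RMSE for a single expression profile. It is not GEMSTAT code. The function name `scaled_rmse`, the closed-form least-squares choice of the scaling parameter β followed by capping at 2, and the example profiles are all assumptions made for illustration; only the RMSE expression and the constraint β ≤ 2 come from the text.

```python
import math


def scaled_rmse(observed, predicted, beta_max=2.0):
    """Return (rmse, beta) for sqrt((1/n) * sum_i (D_i - beta * M_i)^2).

    observed  -- measured expression levels D_i along the D/V axis
    predicted -- model-predicted enhancer readouts M_i at the same positions
    beta_max  -- upper bound on the scaling parameter (2 in the text)

    Assumption: beta is chosen as the least-squares optimum and then capped
    at beta_max. Because the squared error is a convex quadratic in beta,
    capping the unconstrained optimum gives the optimum under beta <= beta_max.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")

    sum_dm = sum(d * m for d, m in zip(observed, predicted))
    sum_mm = sum(m * m for m in predicted)

    # If every prediction is zero, beta has no effect on the error.
    beta = sum_dm / sum_mm if sum_mm > 0 else 0.0
    beta = min(beta, beta_max)

    mse = sum((d - beta * m) ** 2 for d, m in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse), beta


if __name__ == "__main__":
    # Hypothetical profiles: a stripe-like pattern and a tenfold weaker prediction.
    data = [0.0, 0.5, 1.0, 0.5]
    weak_prediction = [d / 10 for d in data]
    rmse, beta = scaled_rmse(data, weak_prediction)
    print(f"beta = {beta:.2f}, RMSE = {rmse:.4f}")
```

With the cap in place, a prediction that has the right shape but a tenfold lower amplitude cannot be rescaled into a perfect fit, which is the behavior the constraint is meant to enforce.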
SUPPLEMENTAL INFORMATION
Supplemental Information includes Supplemental Experimental Procedures and four figures and can be found with this article online at http://dx.doi.org/10.1016/j.cels.2015.12.002.

AUTHOR CONTRIBUTIONS
Conceptualization, M.A.H.S., S.S., S.Y.S.; Methodology, M.A.H.S.; Software, M.A.H.S.; Formal Analysis, M.A.H.S.; Investigation, M.A.H.S., B.L., N.S.; Resources, M.A.H.S., B.L., N.S.; Data Curation, M.A.H.S., B.L., N.S.; Writing – Original Draft, M.A.H.S., S.S.; Writing – Review and Editing, All authors; Visualization, M.A.H.S.; Supervision, S.S., S.Y.S., G.J.; Funding Acquisition, S.S., S.Y.S., C.A.R., H.L., G.J.

ACKNOWLEDGMENTS
This work was supported by NSF grant EFRI 1136913, MICINN grant BFU2011-23611, NIH grant R01GM086537, and NIH grant R01 5R01GM114341.

Garcia, M., and Stathopoulos, A. (2011). Lateral gene expression in Drosophila early embryos is supported by Grainyhead-mediated activation and tiers of dorsally-localized repression. PLoS ONE 6, e29172.
Gertz, J., Siggia, E.D., and Cohen, B.A. (2009). Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 457, 215–218.
Granek, J.A., and Clarke, N.D. (2005). Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol. 6, R87.
Grimm, O., Sanchez Zini, V., Kim, Y., Casanova, J., Shvartsman, S.Y., and Wieschaus, E. (2012). Torso RTK controls Capicua degradation by changing its subcellular localization. Development 139, 3962–3968.
Gutenkunst, R.N., Waterfall, J.J., Casey, F.P., Brown, K.S., Myers, C.R., and Sethna, J.P. (2007). Universally sloppy parameter sensitivities in systems biology models. PLoS Comput. Biol. 3, 1871–1878.
Hamm, D.C., Bondra, E.R., and Harrison, M.M. (2015). Transcriptional activation is a conserved feature of the early embryonic factor Zelda that requires a cluster of four zinc fingers for DNA binding and a low-complexity activation domain. J. Biol. Chem. 290, 3508–3518.
Cell Systems 1, 396–407, December 23, 2015 ©2015 Elsevier Inc.
He, X., Samee, M.A., Blatti, C., and Sinha, S. (2010). Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. PLoS Comput. Biol. 6, 6.
He, X., Duque, T.S., and Sinha, S. (2012). Evolutionary origins of transcription factor binding site clusters. Mol. Biol. Evol. 29, 1059–1070.
Hong, J.W., Hendrix, D.A., Papatsenko, D., and Levine, M.S. (2008). How the Dorsal gradient works: insights from postgenome technologies. Proc. Natl. Acad. Sci. USA 105, 20072–20076.
Janssens, H., Hou, S., Jaeger, J., Kim, A.R., Myasnikova, E., Sharp, D., and Reinitz, J. (2006). Quantitative and predictive model of transcriptional control of the Drosophila melanogaster even skipped gene. Nat. Genet. 38, 1159–1165.
Kanodia, J.S., Liang, H.L., Kim, Y., Lim, B., Zhan, M., Lu, H., Rushlow, C.A., and Shvartsman, S.Y. (2012). Pattern formation by graded and uniform signals in the early Drosophila embryo. Biophys. J. 102, 427–433.
Kazemian, M., Blatti, C., Richards, A., McCutchan, M., Wakabayashi-Ito, N., Hammonds, A.S., Celniker, S.E., Kumar, S., Wolfe, S.A., Brodsky, M.H., and Sinha, S. (2010). Quantitative analysis of the Drosophila segmentation regulatory network using pattern generating potentials. PLoS Biol. 8, e1000456.
Kim, A.R., Martinez, C., Ionides, J., Ramos, A.F., Ludwig, M.Z., Ogawa, N., Sharp, D.H., and Reinitz, J. (2013). Rearrangements of 2.5 kilobases of noncoding DNA from the Drosophila even-skipped locus define predictive rules of genomic cis-regulatory logic. PLoS Genet. 9, e1003243.
Kirk, P., Thorne, T., and Stumpf, M.P. (2013). Model selection in systems and synthetic biology. Curr. Opin. Biotechnol. 24, 767–774.
Kuepfer, L., Peter, M., Sauer, U., and Stelling, J. (2007). Ensemble modeling for analysis of cell signaling dynamics. Nat. Biotechnol. 25, 1001–1006.
Li, X.Y., Harrison, M.M., Villalta, J.E., Kaplan, T., and Eisen, M.B. (2014). Establishment of regions of genomic activity during the Drosophila maternal to zygotic transition. eLife 3, 3.
Liberman, L.M., and Stathopoulos, A. (2009). Design flexibility in cis-regulatory control of gene expression: synthetic and comparative evidence. Dev. Biol. 327, 578–589.
Lim, B., Samper, N., Lu, H., Rushlow, C., Jiménez, G., and Shvartsman, S.Y. (2013). Kinetics of gene derepression by ERK signaling. Proc. Natl. Acad. Sci. USA 110, 10330–10335.
Lim, B., Dsilva, C.J., Levario, T.J., Lu, H., Schupbach, T., Kevrekidis, I.G., and Shvartsman, S.Y. (2015). Dynamics of inductive ERK signaling in the Drosophila embryo. Curr. Biol. 25, 1784–1790.
McDonald, J.A., Holbrook, S., Isshiki, T., Weiss, J., Doe, C.Q., and Mellerick, D.M. (1998). Dorsoventral patterning in the Drosophila central nervous system: the vnd homeobox gene specifies ventral column identity. Genes Dev. 12, 3603–3612.
Nien, C.Y., Liang, H.L., Butcher, S., Sun, Y., Fu, S., Gocha, T., Kirov, N., Manak, J.R., and Rushlow, C. (2011). Temporal coordination of gene networks by Zelda in the early Drosophila embryo. PLoS Genet. 7, e1002339.
Papatsenko, D., and Levine, M.S. (2008). Dual regulation by the Hunchback gradient in the Drosophila embryo. Proc. Natl. Acad. Sci. USA 105, 2901–2906.
Parker, D.S., White, M.A., Ramos, A.I., Cohen, B.A., and Barolo, S. (2011). The cis-regulatory logic of Hedgehog gradient responses: key roles for gli binding affinity, competition, and cooperativity. Sci. Signal. 4, ra38.
Platt, J.R. (1964). Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science 146, 347–353.
Segal, E., Raveh-Sadka, T., Schroeder, M., Unnerstall, U., and Gaul, U. (2008). Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451, 535–540.
Shea, M.A., and Ackers, G.K. (1985). The OR control system of bacteriophage lambda. A physical-chemical model for gene regulation. J. Mol. Biol. 181, 211–230.
Shlyueva, D., Stampfel, G., and Stark, A. (2014). Transcriptional enhancers: from properties to genome-wide predictions. Nat. Rev. Genet. 15, 272–286.
Stathopoulos, A., and Levine, M. (2005). Localized repressors delineate the neurogenic ectoderm in the early Drosophila embryo. Dev. Biol. 280, 482–493.
Stormo, G.D. (2000). DNA binding sites: representation and discovery. Bioinformatics 16, 16–23.
Sun, Y., Nien, C.Y., Chen, K., Liu, H.Y., Johnston, J., Zeitlinger, J., and Rushlow, C. (2015). Zelda overcomes the high intrinsic nucleosome barrier at enhancers during Drosophila zygotic genome activation. Genome Res. 25, 1703–1714.
Swigon, D. (2013). Ensemble modeling of biological systems. In De Gruyter Series in Mathematics and Life Sciences 1, A. Antoniouk and R. Melnik, eds. (De Gruyter), pp. 19–42.
Tebaldi, C., and Knutti, R. (2007). The use of the multi-model ensemble in probabilistic climate projections. Philos. Trans. A Math Phys. Eng. Sci. 365, 2053–2075.
ten Bosch, J.R., Benavides, J.A., and Cline, T.W. (2006). The TAGteam DNA motif controls the timing of Drosophila pre-blastoderm transcription. Development 133, 1967–1977.
Toni, T., Welch, D., Strelkowa, N., Ipsen, A., and Stumpf, M.P. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface 6, 187–202.
Villaverde, A., Bongard, S., Mauch, K., Müller, D., Balsa-Canto, E., Schmid, J., and Banga, J. (2014). High-confidence predictions in systems biology dynamic models. In 8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014), J. Saez-Rodriguez, M.P. Rocha, F. Fdez-Riverola, and J.F. De Paz Santana, eds. (Springer International Publishing), pp. 161–171.
von Dassow, G., Meir, E., Munro, E.M., and Odell, G.M. (2000). The segment polarity network is a robust developmental module. Nature 406, 188–192.
von Ohlen, T., and Doe, C.Q. (2000). Convergence of dorsal, dpp, and egfr signaling pathways subdivides the drosophila neuroectoderm into three dorsal-ventral columns. Dev. Biol. 224, 362–372.
Weiss, J.B., Von Ohlen, T., Mellerick, D.M., Dressler, G., Doe, C.Q., and Scott, M.P. (1998). Dorsoventral patterning in the Drosophila central nervous system: the intermediate neuroblasts defective homeobox gene specifies intermediate column identity. Genes Dev. 12, 3591–3602.
White, M.A., Parker, D.S., Barolo, S., and Cohen, B.A. (2012). A model of spatially restricted transcription in opposing gradients of activators and repressors. Mol. Syst. Biol. 8, 614.
Yáñez-Cuna, J.O., Kvon, E.Z., and Stark, A. (2013). Deciphering the transcriptional cis-regulatory code. Trends Genet. 29, 11–22.
Zinzen, R.P., and Papatsenko, D. (2007). Enhancer responses to similarly distributed antagonistic gradients in development. PLoS Comput. Biol. 3, e84.