
A Framework for Modeling Bounded Rationality: Mis-specified

Bayesian-Markov Decision Processes

arXiv:1502.06901v1 [q-fin.EC] 24 Feb 2015

Ignacio Esponda
(WUSTL)

Demian Pouzo
(UC Berkeley)

February 25, 2015

PRELIMINARY AND INCOMPLETE


Abstract
We provide a framework to study dynamic optimization problems where the agent is uncertain about her environment but has a (possibly) incorrectly specified model, in the sense that the support of her prior does not include the true model. The agent's actions affect both her payoff and also what she observes about the environment; she then uses these observations to update her prior according to Bayes' rule. We show that if optimal behavior stabilizes in this environment, then it is characterized by what we call an equilibrium. An equilibrium strategy σ is a mapping from payoff relevant states to actions such that: (i) given the strategy σ, the agent's model that is closest (according to the Kullback-Leibler divergence) to the true model is θ(σ), and (ii) σ is a solution to the dynamic optimization problem where the agent is certain that the correct model is θ(σ). The framework is applicable to several aspects of bounded rationality, where the reason why a decision maker has incorrect beliefs can be traced to her use of an incorrectly-specified model.

Esponda: Olin Business School, Campus Box 1133, Washington University, 1 Brookings Drive, Saint Louis, MO 63130, iesponda@wustl.edu; Pouzo: 530-1 Evans Hall #3880, Berkeley, CA 94720, dpouzo@econ.berkeley.edu.

1 Introduction

We study a single-agent recursive dynamic optimization problem where the agent is uncertain about
the primitives of the environment. The non-standard aspect of this decision problem is that the
agent has a mis-specified model of the world, in the sense that the support of her prior does not
include the true environment. Our objective is to characterize the limiting behavior of the agent.
Our main motivation for studying this problem is to provide a common framework that incorporates
several aspects of bounded rationality that have previously been studied in specific contexts.
A standard assumption in economics is that agents have correct beliefs in equilibrium. This
assumption is often justified by a learning story. One of the main insights from the literature on
learning in decision problems is that agents may not have correct beliefs if they do not have enough
incentives to experiment (e.g., Rothschild [1974], McLennan [1984], Easley and Kiefer [1988]). A
similar insight emerges in a game theoretic context via the notion of a self-confirming equilibrium
(Battigalli [1987], Rubinstein and Wolinsky [1994], Fudenberg and Levine [1993a], Dekel et al.
[2004]) which requires players to have beliefs that are consistent with observed past play, though
not necessarily correct when feedback is coarse. Thus, it is well-known that learning need not lead
to correct beliefs.
In this paper, we want to emphasize another reason why learning need not lead to correct beliefs: if agents have mis-specified models of the world, then they will end up having incorrect beliefs even if they receive unlimited feedback. A simple example is a player who learns by running a linear regression when the correct specification is actually non-linear.
There is a growing literature that proposes new equilibrium concepts to capture the behavior of
players who are boundedly rational and learn from past interactions: sampling equilibrium (Osborne
and Rubinstein, 1998), the inability to recognize patterns (Piccione and Rubinstein [2003], Eyster
and Piccione [2011]), valuation equilibrium (Jehiel and Samet, 2007), analogy-based expectation
equilibrium (Jehiel, 2005), cursed equilibrium (Eyster and Rabin, 2005), and behavioral equilibrium
(Esponda, 2008). Many of these solution concepts can be viewed as modeling the behavior of agents
who end up having incorrect beliefs due to incorrectly specified models. Our framework attempts
to integrate some of these concepts and promote further applications. For example, we illustrate
in Section 7 how our framework includes the decision-theoretic analogs of the last three solution
concepts as particular cases.
Some explanations for why agents may have mis-specified models include complexity (Aragones
et al., 2005), the desire to avoid over-fitting the data (Al-Najjar [2009], Al-Najjar and Pai [2009]),
and costly attention (Schwartzstein, 2009). We do not attempt to provide micro-foundations for
why agents possibly have mis-specified models. In this paper, we take the mis-specification as given
and characterize the resulting behavior.
We study a standard dynamic environment where the agent's current decision affects both her future payoffs and what she learns about her uncertain environment. To focus on the idea of incorrect beliefs due to mis-specified priors, we consider noisy environments where the agent observes full feedback, thus ruling out incorrect beliefs due to lack of experimentation. We establish two main results for a large class of dynamic environments. First, if behavior stabilizes in the dynamic environment, then it is characterized by what we call an equilibrium. An equilibrium is a strategy σ mapping the payoff relevant states to actions such that: (i) given the strategy σ, the agent's model that is closest (according to the Kullback-Leibler divergence) to the true model is θ(σ), and (ii) σ is a solution to the dynamic optimization problem where the agent is certain that the correct model is θ(σ). Second, we show that if σ is an equilibrium, then there is an agent who asymptotically optimizes and whose behavior converges to the equilibrium [this result is not yet written in this version]. The framework is applicable to several aspects of bounded rationality, where the reason why a decision maker has incorrect beliefs can be traced to her use of an incorrectly-specified model.
There is a large literature that studies learning foundations for rational expectations equilibrium
both with rational and boundedly-rational agents (Bray and Kreps [1981], Bray [1982], Blume
and Easley [1982], Blume and Easley [1984]). There is also a large game-theoretic literature that
studies explicit learning models in order to justify Nash equilibrium and self-confirming equilibrium
(Fudenberg and Kreps [1988], Fudenberg and Kreps [1993], Fudenberg and Kreps [1995], Fudenberg
and Levine [1993b], Kalai and Lehrer [1993]). Unlike our paper, this literature generally studies
repeated, not dynamic, environments. Also, agents in most of these papers can be viewed as having
correctly specified models, at least on the equilibrium path.
There is also a literature on decision problems with uncertain parameters (Easley and Kiefer
[1988], Aghion et al. [1991]). In this literature, agents have correctly specified models and the
Martingale Convergence Theorem implies that their beliefs converge. Moreover, the problem faced
by the agent becomes static once beliefs converge. The main focus of this literature is on incentives
to experiment and whether beliefs converge to the truth.
There is also a closely related statistics literature on the consistency of Bayesian updating under
correctly-specified (e.g., Freedman [1963], Diaconis and Freedman [1986]) and mis-specified models
(e.g., Berk [1966], Bunke and Milhaud [1998]). The statistics literature has focused on the passive
learning case. We extend some of this literature by allowing the agent to take actions that might
also affect what she learns. Thus, learning will be endogenous in our setting.
The work that exists on the topic of Bayesian learning under mis-specified models seems
to be limited to examples or particular applications. Nyarko [1991] presents an example where
beliefs and actions fail to converge under mis-specified models; we use this example in the next
section to illustrate some of our main points. Barberis et al. [1998] show that a particular type of
mis-specification about the process governing earnings can explain the over- and under-reaction of investors to information. Rabin and Vayanos [2010] show that agents who incorrectly believe in the gambler's fallacy can exaggerate the magnitude of changes in an underlying state but underestimate their duration. There are also some papers that either implicitly or explicitly study the learning problem of an agent who has a mis-specified model. For example, Sobel [1984] studies the non-linear
pricing problem of a monopolist when consumers naively act as if prices are linear; as in our paper,
the beliefs of the consumer are endogenously determined by her consumption choice. Spiegler [2012]
studies the behavior of policy-makers or politicians when the public naively attributes observed
outcomes to the most recent actions.
In macroeconomics (Evans and Honkapohja [2001], Chapter 13; Sargent [2001], Chapter 6), there are several papers studying particular settings where agents make forecasts using statistical models that are mis-specified. While the motivation is similar, we focus on agents who follow Bayes' rule and attempt to provide a fairly general decision-theoretic model that can be applied to a wide range of circumstances.
We hope to provide a general framework for modeling agents with mis-specified models and to convey the idea that the consequences of mis-specification can be precisely characterized.
In the next section, we present an example and discuss further the main points of our paper. In Section 3, we present the Markov decision process (MDP) faced by the agent. In Section 4, we present the Bayesian-Markov decision process (BMDP), which captures the fact that the agent is uncertain about the true MDP that she faces. We also provide a definition of equilibrium (i.e., steady-state behavior) for a BMDP. In Sections 5 and 6, we provide a foundation for our notion of equilibrium. We conclude in Section 7 with additional examples.
NOTE: THIS DRAFT IS PRELIMINARY AND INCOMPLETE.
More references will be added in the next draft.

2 Illustrative example: monopolist with unknown demand

We illustrate some of our main points by discussing Nyarko's (1991) example of a monopolist with unknown demand. The monopolist chooses at every period t = 0, 1, ... a price x_t ∈ X = {2, 10} and then sells s_{t+1}, determined by the demand function

s_{t+1} = a − b·x_t + ε_t,

where (ε_t)_t is an i.i.d. normally distributed process with mean zero and unit variance.¹ The monopolist observes sales s_{t+1} but she does not observe the random shocks ε_t; she does know, however, the distribution of (ε_t)_t. The monopolist has no costs of production and, therefore, her profits in period t are

π(x_t, s_{t+1}) = x_t · s_{t+1}.

The monopolist wishes to maximize discounted expected profits, where her discount factor is δ ∈ [0, 1).

¹ As mentioned by Nyarko, sales can be negative with positive probability, but the normal distribution is nevertheless chosen for simplicity.

Notice that we are not using the most natural notation for price (here denoted by x) and for sales at period t (which would more naturally be indexed by t, not t + 1). We maintain this notation, however, because it is in line with the notation of the more general setup of the paper, which allows for dynamic decision problems where the new state is also affected by the previous state.
The interesting feature of the problem is that the monopolist does not know the true demand intercept and slope. Let Θ ⊂ R² represent the set of models of the world entertained by the monopolist, where θ = (a, b) ∈ Θ denotes a demand intercept and slope. The monopolist starts with a prior μ_0 with full support over Θ and updates her prior using Bayes' rule. Thus, it is well known from dynamic programming that the monopolist's problem can be represented recursively by a value function defined over a state space, where a state represents the monopolist's belief over Θ.

Suppose that the true demand parameter is θ* = (28.5, 5.25) and that the set of models that the monopolist considers possible is given by the rectangle Θ with vertices at the points θ′ = (a′, b′), θ″ = (a″, b″), (a′, b″) and (a″, b′), where a′ = 20, a″ = 16, b′ = 1, and b″ = 4. In this case θ* ∉ Θ and, therefore, we say that the monopolist has a mis-specified model. This situation is depicted in Figure 1, which is basically reproduced from Nyarko (1991, page 422).
Nyarko [1991] shows formally that the monopolist's actions do not converge. To see the intuition behind this result, suppose that the monopolist were to always choose price 2. Then, on average, she would observe sales a* − 2b*. She would then believe that any (a, b) that also gives rise to such average sales can explain the data. The set of all such (a, b)'s is given by the line with slope 2 passing through the true parameter θ*. Moreover, θ′ = (a′, b′) is the only point on that line that also belongs to her set of models of the world Θ. But, under parameter θ′,

E_{θ′}[π(2, s′)] = 2(a′ − 2b′) < 10(a′ − 10b′) = E_{θ′}[π(10, s′)],

and, therefore, the monopolist would actually strictly prefer to charge price 10. Thus, if the monopolist were to always charge a price of 2, she would eventually become very confident that the true parameter is θ′, but then she would prefer to deviate and charge 10.

A similar argument establishes that if the monopolist were to always charge a price of 10, then she would eventually become very confident that the true parameter is θ″, but then, since

E_{θ″}[π(2, s′)] = 2(a″ − 2b″) > 10(a″ − 10b″) = E_{θ″}[π(10, s′)],

she would prefer to deviate and charge 2. Thus, the monopolist's behavior forever cycles between prices 2 and 10.
While this example might give the impression that studying the behavior of agents with mis-specified models can give rise to strange phenomena that are hard to analyze, the main objective of the paper is to convince the reader that there is indeed a lot of regularity of behavior even in mis-specified settings.

Notice that the idea that actions might not converge is not exclusive to mis-specified models. For example, even if a model is correctly specified and beliefs converge, actions might not converge if they are not continuous in beliefs. This lack of continuity commonly arises when there is a finite number of actions. It is indeed one of the reasons why we allow for mixed strategies; for example, Nash equilibrium would also not always exist without considering mixed strategies.

Suppose that we allow the monopolist to choose a mixed strategy. Figure 1 depicts the mixed strategy σ, where σ is the probability of choosing price 2. If the monopolist chooses σ, then it is not too difficult to show that she will eventually become very confident that the true model is given by the point (a′, b_σ) in Figure 1. This is the point on the set Θ that is closest to θ* when distance is measured along the line with slope σ·2 + (1 − σ)·10 passing through θ*. If the optimal strategy of a monopolist who is convinced that the true parameter is (a′, b_σ) is σ (i.e., if such a monopolist is indifferent between prices 2 and 10), then we say that σ is an equilibrium. Notice that, despite working with a single-agent decision problem, the solution of the problem is a fixed point because the strategy of the agent affects her beliefs. In the example, different strategies correspond to different slopes of lines passing through θ* and, therefore, to different parameters on the set Θ.²

² In this example, we can also get convergence by allowing for a continuum of prices. In many settings, however, it might be natural to restrict attention to a finite set of actions.
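To make the fixed-point logic concrete, the following is a numerical sketch of ours, not from the paper. It uses the fact that, with unit-variance Gaussian noise, minimizing the weighted Kullback-Leibler divergence over (a, b) reduces to weighted least squares on the mean sales at each price; the function names, solver choices, and starting point are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize, brentq

a_star, b_star = 28.5, 5.25            # true demand parameters
bounds = [(16.0, 20.0), (1.0, 4.0)]    # the rectangle Theta
prices = np.array([2.0, 10.0])

def closest_theta(sigma):
    """KL-closest (a, b) in Theta when price 2 is charged with probability sigma."""
    weights = np.array([sigma, 1.0 - sigma])
    true_mean = a_star - b_star * prices           # true mean sales at each price
    obj = lambda th: weights @ (th[0] - th[1] * prices - true_mean) ** 2
    return minimize(obj, x0=[18.0, 2.5], bounds=bounds).x

def profit_gap(sigma):
    """Expected profit of price 2 minus price 10 under the closest model."""
    a, b = closest_theta(sigma)
    return 2 * (a - 2 * b) - 10 * (a - 10 * b)

# Equilibrium: the monopolist must be indifferent between the two prices.
sigma_eq = brentq(profit_gap, 1e-3, 1 - 1e-3)
print("equilibrium prob. of price 2:", sigma_eq)
print("equilibrium belief (a, b):", closest_theta(sigma_eq))

The sign change that brentq exploits is visible by hand: a monopolist who always charges 2 comes to believe θ′ and prefers 10, while one who always charges 10 comes to believe θ″ and prefers 2.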
More generally, we consider the problem of an agent facing a Markov Decision Process (MDP), which is a dynamic optimization problem where the state variable follows a Markov process. The monopolist's problem in this section is a particular case where, under the assumption that the true parameter is known, the problem is static in the sense that future states (sales in this example) do not depend on previous states. Next, for any MDP we consider a Bayesian-Markov Decision Process (BMDP), which is the problem of an agent who does not know the true transition probability function. The agent starts with a prior over a set of models, where each model represents a transition probability function, and she updates her prior while making decisions that maximize her discounted expected payoffs. The BMDP is said to be mis-specified if the true model is not one of the models considered by the agent. The problem of the monopolist with unknown demand discussed above is a special case.
Our objective is to characterize the limiting or steady-state behavior of the agent in a BMDP. For this purpose, we define the notion of equilibrium for general BMDPs. Equilibrium of a BMDP is conveniently defined in terms of the simpler MDP. In the above case of the monopolist with unknown demand, to verify whether a strategy is an equilibrium we only need to verify whether it is optimal in the static problem where the agent knows the parameter (i.e., she does not have to learn it) but might believe it is different from the true parameter. Equilibrium of a BMDP is defined as a fixed point where the agent chooses a strategy that is optimal in the MDP under a parameter that itself depends on the strategy chosen by the agent (as well as on the true parameter and the set of models entertained by the agent, of course). We show that under fairly general assumptions an equilibrium exists. In the above monopoly example, there is in fact a unique equilibrium and it is, as expected, strictly mixed. Section 7.1 generalizes the monopoly example and provides the formal results.
We then turn to justifying our definition of equilibrium for a BMDP. We can apply standard dynamic programming techniques to show that a BMDP can be cast recursively by using a value function that solves a standard Bellman equation and depends on the state of the MDP and the belief over the set of models Θ. We say that behavior stabilizes in the BMDP if the agent's behavior as a function of the state of the MDP converges. One of our main results is that, if behavior stabilizes in the BMDP, then it must stabilize to what we call an equilibrium of the BMDP. Thus, behavior that does not arise in equilibrium cannot be the limiting behavior of the BMDP.

To establish the previous result that equilibrium captures the steady-state of the BMDP, we must first be able to justify and interpret mixed strategies in our setting. We follow the standard approach introduced in game theory by Harsanyi [1973], where the agent receives (small) payoff perturbations that, to an outside observer, make her look as if she is mixing. This approach was used by Fudenberg and Kreps [1993] to provide a learning foundation for mixed-strategy Nash equilibrium in the context of normal-form games.³ We follow a similar approach and perturb the payoffs in the BMDP. We then provide a learning foundation for equilibrium in this perturbed version of the game. The result relies on extending the statistical literature on learning under mis-specified models (e.g., Berk, 1966) to the case where the decision maker concurrently takes actions. Finally, we show that the limit of equilibria of the perturbed BMDP as the perturbation vanishes corresponds to an equilibrium of the (unperturbed) BMDP.

³ The class of models they considered is known in the literature as stochastic fictitious play.

3 A Markov Decision Process (MDP)

We begin by describing the environment faced by the agent.


Definition 1. A Markov Decision Process (MDP) is a tuple ⟨S, X, Γ, q_0, Q, π, δ⟩ where

S is a finite set of states

X is a finite set of actions

Γ : S → 2^X is a non-empty constraint correspondence

q_0 ∈ Δ(S) is a probability distribution on the initial state

Q : S × X → Δ(S) is a transition probability function

π : S × X × S → R is a per-period payoff function

δ ∈ [0, 1) is a discount factor
Throughout the paper, it will be useful to stress the dependence of an MDP on a particular transition probability function Q; thus, we use MDP(Q) to denote an MDP with transition probability function Q.

The timing is as follows. At the beginning of every period t = 0, 1, 2, ..., the agent observes state s_t ∈ S and chooses an action x_t ∈ Γ(s_t) ⊆ X. Then a new state s_{t+1} is drawn according to the probability distribution Q(· | s_t, x_t) and the agent receives payoff π(s_t, x_t, s_{t+1}) in period t.⁴ The initial state s_0 is drawn according to the probability distribution q_0.

As a benchmark, we begin by considering the standard case of an agent who faces an MDP(Q). The agent chooses a policy rule that specifies at each point in time a (possibly random) action as a function of the history of states and actions observed up to that point. As usual, the objective of the agent is to choose a feasible policy rule to maximize expected discounted utility, E[Σ_{t=0}^∞ δ^t π(s_t, x_t, s_{t+1})].

⁴ Depending on the application, we can think of s_{t+1} as being drawn at the end of period t and affecting period t's payoff, or at the beginning of period t + 1.
ASSUMPTION A1. sup_{(s,x,s′) ∈ Gr(Γ)×S} |π(s, x, s′)| < ∞.
Assumption A1 implies that we can apply the Principle of Optimality. Thus, the agent's problem can be cast recursively as

V_Q(s) = max_{x ∈ Γ(s)} ∫_S { π(s, x, s′) + δ V_Q(s′) } Q(ds′ | s, x),   (1)

where V_Q : S → R is the unique solution to the Bellman equation (1); moreover, V_Q is bounded.
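As an illustration, here is a minimal value-iteration sketch for solving equation (1) in a finite MDP; the array layout and function name are our own illustrative choices, not part of the paper.

import numpy as np

def value_iteration(Q, pi, delta, feasible, tol=1e-10):
    """Solve the Bellman equation (1) by successive approximation.
    Q[s, x, s2]: transition probabilities; pi[s, x, s2]: payoffs;
    feasible[s]: the list of actions in Gamma(s)."""
    nS = Q.shape[0]
    V = np.zeros(nS)
    while True:
        V_new = np.array([
            max(Q[s, x] @ (pi[s, x] + delta * V) for x in feasible[s])
            for s in range(nS)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

Because δ < 1, the Bellman operator is a contraction, so the loop converges to the unique bounded solution V_Q.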
Definition 2. A strategy is a distribution over actions given states, σ : S → Δ(X).

Let Σ denote the space of all strategies and let σ(x | s) denote the probability that the agent chooses x when the state is s.
Definition 3. A strategy σ is optimal for the MDP(Q) if for all s ∈ S and all x ∈ X such that σ(x | s) > 0,

x ∈ arg max_{x̄ ∈ Γ(s)} ∫_S { π(s, x̄, s′) + δ V_Q(s′) } Q(ds′ | s, x̄).

An optimal strategy always exists because the space S × X is finite and V_Q is bounded. It is also easy to see that there is always a deterministic optimal strategy where the agent does not randomize. Nevertheless, random strategies will play an important role in the sequel, when the agent does not know the transition probability function.

4 A Bayesian-Markov Decision Process (BMDP)

We now consider an agent who faces an MDP but who is uncertain about the transition probability function. The agent has a prior over a set of possible transition functions and updates her beliefs using Bayes' rule. We refer to the problem with uncertainty as the Bayesian-Markov decision process (BMDP).
Definition 4. A Bayesian-Markov Decision Process (BMDP) is an MDP, ⟨S, X, Γ, q_0, Q, π, δ⟩, and a tuple ⟨Q_Θ, μ_0, B⟩ where

Q_Θ = {Q_θ : θ ∈ Θ} is a family of transition probability functions, where each transition probability function Q_θ is indexed by a parameter θ ∈ Θ

μ_0 ∈ Δ(Θ) is a prior

B : S² × X × Δ(Θ) → Δ(Θ) is the Bayesian operator: for all A ⊆ Θ Borel measurable and all (s′, s, x, μ) ∈ S² × X × Δ(Θ),

B(s′, s, x, μ)(A) = ∫_A Q_θ(s′ | s, x) μ(dθ) / ∫_Θ Q_θ(s′ | s, x) μ(dθ).
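For concreteness, here is a sketch of the Bayesian operator B on a finite grid approximating Θ; discretizing the compact set Θ is our simplification, and the variable names are illustrative.

import numpy as np

def bayes_update(mu, Q_theta, s_new, s, x):
    """Posterior B(s', s, x, mu) on a finite grid of models.
    mu[k]: prior weight on grid point k; Q_theta[k, s2, s, x]: transition
    probability of model k."""
    posterior = mu * Q_theta[:, s_new, s, x]      # prior times likelihood
    total = posterior.sum()
    if total == 0:                                # ruled out by Assumption A3(i) below
        raise ValueError("observation has zero probability under every model")
    return posterior / total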
The timing is the same as the timing specified under the MDP in Section 3. The difference is that the agent now has a belief over the set of possible transition probability functions and updates this belief according to Bayes' rule. We interpret the set Q_Θ as the different transition probability functions (i.e., models of the world) that the agent considers possible.
Definition 5. A BMDP is mis-specified if Q ∉ Q_Θ; otherwise, it is correctly specified.
We restrict attention to a certain class of possibly mis-specified BMDPs, as captured by the following assumptions.

ASSUMPTION A2. (i) Θ is a compact subset of a Euclidean space R^k; (ii) the prior μ_0 has full support: supp(μ_0) = Θ.

ASSUMPTION A3. (i) For μ_0-almost every θ ∈ Θ, if (s′, s, x) ∈ S² × X and Q_θ(s′ | s, x) = 0, then Q(s′ | s, x) = 0; (ii) for all (s′, s, x) ∈ S² × X, Q_θ(s′ | s, x) is continuous as a function of θ for all θ ∈ Θ.

ASSUMPTION A4. For all s, s′ ∈ S, there exist finite sequences (s_1, ..., s_n) and (x_0, x_1, ..., x_n), with s_0 = s′, such that x_i ∈ Γ(s_i) for all i = 0, 1, ..., n and

Q(s | s_n, x_n) Q(s_n | s_{n−1}, x_{n−1}) ··· Q(s_1 | s_0, x_0) > 0.
Assumption A2 places certain regularity conditions on Θ, such as the requirement that it lives in a finite-dimensional space, that are known to be important to obtain consistency of Bayesian updating in the standard (i.e., correctly specified) setting (Freedman [1963]). Assumption A3(i) is necessary to make sure that Bayesian updating is well defined. We view this assumption as ruling out a particularly stark form of mis-specification under which the agent's model of the world cannot explain an observation. If such were the case, updating would fail and we would expect the agent to re-consider her models of the world. Assumption A3(ii) is a technical condition that plays an important role in several proofs. Finally, Assumption A4 guarantees that we can always get from one state to another by some sequence of states and actions. When we study, in Section 5, the perturbed version of the problem where the agent chooses all actions with positive probability, Assumption A4 will guarantee that all states in S × X are recurrent.
The next assumption requires additional definitions.
Definition 6. The weighted Kullback-Leibler divergence (wKLD) is a mapping K_Q : Δ(S × X) × Θ → R_+ such that for any m ∈ Δ(S × X) and θ ∈ Θ,

K_Q(m, θ) = Σ_{(s,x) ∈ S×X} E_{Q(·|s,x)}[ ln ( Q(S′ | s, x) / Q_θ(S′ | s, x) ) ] m(s, x).

The set of closest models given m ∈ Δ(S × X) is the set

Θ_Q(m) ≡ arg min_{θ ∈ Θ} K_Q(m, θ).

Lemma 1. For every m ∈ Δ(S × X), K_Q(m, ·) is continuous, greater than or equal to zero, and finite; moreover, Θ_Q(m) is non-empty, compact valued, and upper hemi-continuous as a function of m.
Proof. See the Appendix.
Definition 6 extends the standard definition of Kullback-Leibler divergence to the case where the sample from which the agent learns is drawn from a distribution m. The set Θ_Q(m) can be interpreted as the set of models that are closest to the true model Q when the agent has access to an infinite number of exogenous observations of (s, x), drawn independently according to m, and for each observation observes the corresponding draw of the new state s′. In particular, if the model is correctly specified, then the true model is always in the set of closest models, i.e., for all m ∈ Δ(S × X), there exists θ ∈ Θ_Q(m) such that Q_θ = Q.
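The following sketch evaluates the wKLD of Definition 6 on a finite grid of candidate models and recovers an approximate set of closest models; the grid discretization and tolerance are our own devices.

import numpy as np

def wKLD(m, Q, Q_theta):
    """Weighted KL divergence of Definition 6.
    m[s, x]: weighting distribution; Q[s2, s, x]: true model;
    Q_theta[s2, s, x]: candidate model. Uses the 0 * log(0) = 0 convention."""
    with np.errstate(divide="ignore", invalid="ignore"):
        logratio = np.where(Q > 0, np.log(Q / Q_theta), 0.0)
    return np.sum(m * np.sum(Q * logratio, axis=0))

def closest_models(m, Q, grid, tol=1e-9):
    """Indices of (approximate) minimizers over a list of candidate models."""
    vals = np.array([wKLD(m, Q, Qt) for Qt in grid])
    return np.flatnonzero(vals <= vals.min() + tol)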
Our final assumption on the primitives plays the role of an identification assumption.
ASSUMPTION A5. For every m ∈ Δ(S × X), if θ, θ′ ∈ Θ_Q(m), then Q_θ(· | s, x) = Q_{θ′}(· | s, x) for all (s, x) ∈ S × X such that m(s, x) > 0.
Assumption A5 requires that the models that are closest to the true model be indistinguishable given the data available to the agent. To motivate Assumption A5, notice that there are two reasons why the agent might not be able to distinguish between two models. The first is that she does not take a particular action and therefore fails to learn in some dimension. Then the agent can entertain different models of the world, as long as these models cannot be distinguished given her data; this situation is permitted by Assumption A5. The second reason why the agent might not be able to distinguish between two models is that these models have different transition probability functions but one model better explains one feature of the data and the other model better explains another feature in such a way that these models are equidistant to the truth (in terms of the wKLD). This type of mis-specification is ruled out by Assumption A5. An informal argument for ruling out this type of mis-specification is that the agent might decide to break the tie between these two models by adding yet another model to her set of initial models. A more formal argument shows that we cannot expect behavior to settle down when Assumption A5 fails. The next example illustrates this point.
Example 1. Consider a BMDP where the state s ∈ S = {0, 1} represents whether a coin lands heads or tails and Q(1 | x, s) = 1/2 for all (x, s), i.e., the coin tosses are i.i.d. and do not depend on the previous action or state. Suppose that Q_θ(1 | x, s) = θ for all (x, s), so that the agent understands that the coin tosses are i.i.d., and that Θ = {1/4, 3/4}. In this case, the wKLD becomes, for all m,

K_Q(m, θ) = (1/2) ln ( (1/2) / θ ) + (1/2) ln ( (1/2) / (1 − θ) ).

Thus, K_Q(m, 1/4) = K_Q(m, 3/4) and, therefore, Θ_Q(m) = {1/4, 3/4} for all m. Since Q_{1/4} ≠ Q_{3/4}, Assumption A5 is not satisfied. Intuitively, when a coin is truly unbiased, θ = 1/4 and θ = 3/4 are equidistant to the truth and are equally likely to explain the data coming from an unbiased coin. In terms of the model where the agent updates her beliefs using Bayesian updating, it is not too hard to establish that, for a non-dogmatic prior μ_0(1/4) ∈ (0, 1), the agent's beliefs over {1/4, 3/4} will never converge (for a proof, see Berk [1966], p. 57). Therefore, it is easy to embed an action space into this model in such a way that the actions of the agent will not converge, even if we were to make the actions continuous in beliefs, as we do later in the paper. Finally, Assumption A5 will be satisfied as long as the agent incorporates an additional element θ ∈ (1/4, 3/4) into her set of models of the world. □
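A quick simulation (a sketch of ours, with an illustrative seed and horizon) makes the non-convergence visible: under a fair coin, the log posterior odds between the two models follow a driftless random walk, so the posterior keeps oscillating.

import numpy as np

rng = np.random.default_rng(0)
log_odds = 0.0                     # log of mu_t(3/4) / mu_t(1/4)
for _ in range(100_000):
    s = rng.integers(2)            # fair coin: Q(1 | x, s) = 1/2
    # Bayes step: add the log-likelihood ratio of the two models
    log_odds += np.log(3.0) if s == 1 else -np.log(3.0)
print("posterior on theta = 3/4:", 1 / (1 + np.exp(-log_odds)))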
As the examples throughout the paper illustrate, there are many contexts where it is straightforward to check that Assumption A5 is verified. The following is a sufficient condition for Assumption A5 that is satisfied in many of our examples.
Proposition 1. Suppose that the following three conditions are satisfied: (i) Θ is convex, (ii) Q_θ is linear in θ (i.e., if α ∈ [0, 1] and θ, θ′, θ″ ∈ Θ where θ″ = αθ + (1 − α)θ′, then Q_{θ″}(· | s, x) = αQ_θ(· | s, x) + (1 − α)Q_{θ′}(· | s, x) for all (s, x) ∈ S × X), and (iii) for all (s′, s, x) ∈ S² × X, if Q(s′ | s, x) = 0 then Q_θ(s′ | s, x) = 0 for all θ ∈ Θ. Then Assumption A5 is satisfied.
Proof. See the Appendix.
To illustrate the previous definitions, we now introduce another example and verify that it satisfies Assumptions A1-A5.

Example 2. Every period t = 0, 1, ..., an agent chooses an action x ∈ X = {0, 1}. Given the action x, a state s′ ∈ S = {0, 1} is drawn according to the transition probability function Q_θ(s′ | x), where Q_θ(1 | 0) = θ_0 and Q_θ(1 | 1) = θ_1. Let the true model Q be represented by Q_{θ*} for some θ* ∈ (0, 1)². The agent receives payoff

π(x, s′) = (x + 1)s′ − (1/2)x

each period, and her objective is to maximize discounted expected utility with discount factor δ ∈ [0, 1). The above primitives describe an MDP where each new state is drawn as a function of the action but not of the previous state; thus, the problem is inherently static. Moreover, Assumption A1 is satisfied because π is bounded.

Let's consider two different versions of BMDPs associated with the above MDP, where in each case Q_Θ = {Q_θ : θ ∈ Θ} is the set of models of the world entertained by the agent and μ_0 is her full-support prior. In the first case, Θ = [0, 1]² and the BMDP is correctly specified. In the second case, Θ = {θ ∈ [0, 1]² : θ_0 = θ_1} and the BMDP is mis-specified if and only if θ*_0 ≠ θ*_1. In the mis-specified case, the agent incorrectly believes that her action does not affect the probability of drawing the state.

We now verify Assumptions A2-A5 for each case. Assumption A2 is satisfied in each case because Θ is compact and μ_0 has full support. Also, Q_θ(s′ | x) ∈ (0, 1) for all θ ∈ Θ\{(0, 0), (1, 1)}. Thus, if in each case we choose a prior μ_0 that puts no mass at either (0, 0) or (1, 1), then Assumption A3(i) is satisfied. Assumption A3(ii) is also satisfied because Q_θ is continuous in θ. Assumption A4 is satisfied because the process governing the state is i.i.d. and each realization of S has positive probability (irrespective of the action taken by the agent). Finally, we can apply Proposition 1 to check for Assumption A5. In each case, Θ is convex and Q_θ is linear in θ. Moreover, the assumption that θ* ∈ (0, 1)² implies that Q(s′ | s, x) > 0 for all (s′, s, x) ∈ S² × X, so condition (iii) in Proposition 1 does not apply. Hence, Assumption A5 is also satisfied.
Finally, we write down the wKLD and derive the correspondence Θ_Q(·) of models that are closest to the true model Q in each of the two cases. We begin with the case of the mis-specified BMDP. Fix any m ∈ Δ(S × X) and (θ, θ) ∈ Θ. Notice that only the marginal distribution m_X over X is relevant for characterizing Θ_Q(·) because the current state does not affect the future state. Then

K_Q(m, (θ, θ)) = m_X(0) [ ln ( (1 − θ*_0) / (1 − θ) ) (1 − θ*_0) + ln ( θ*_0 / θ ) θ*_0 ]
  + m_X(1) [ ln ( (1 − θ*_1) / (1 − θ) ) (1 − θ*_1) + ln ( θ*_1 / θ ) θ*_1 ]
  = −(1 − θ̄_m) ln(1 − θ) − θ̄_m ln θ + C,

where C contains terms that do not depend on θ and

θ̄_m = m_X(0)θ*_0 + m_X(1)θ*_1.   (2)

It is easy to check that, for every m, K_Q(m, (θ, θ)) is strictly convex in θ and has a unique minimizer given by θ̄_m. Thus, Θ_Q(m) = {(θ̄_m, θ̄_m)} is a singleton for all m ∈ Δ(S × X); in particular, Assumption A5 is satisfied, a fact that we had already verified indirectly using Proposition 1. The intuition for this result is straightforward. If we fix m, then the state s = 1 is drawn i.i.d. with probability m_X(0)θ*_0 + m_X(1)θ*_1. The closest model θ̄_m is then exactly the one that attaches this probability to s = 1 being drawn.
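A short numerical check of equation (2), with illustrative parameter values of our choosing: minimizing the wKLD expression above over a fine grid of θ recovers θ̄_m.

import numpy as np

theta0, theta1 = 0.75, 0.25            # illustrative true parameters
mX1 = 0.3                              # probability of action x = 1 under m
theta_bar = (1 - mX1) * theta0 + mX1 * theta1

grid = np.linspace(0.001, 0.999, 9999)
K = -(1 - theta_bar) * np.log(1 - grid) - theta_bar * np.log(grid)  # wKLD up to C
print(grid[K.argmin()], theta_bar)     # both approximately 0.60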
Next, we consider the case where Θ = [0, 1]² and, therefore, the BMDP is correctly specified. It is easy to check that Θ_Q(m) = {θ*} is a singleton set containing the true model as long as m_X(1) ∈ (0, 1); the idea is that an agent who chooses each action with positive probability will obtain information about each of the two dimensions of θ. But Θ_Q(m) is no longer a singleton if m_X(1) ∈ {0, 1}. The reason is that the agent never observes the consequence of choosing either x = 1 or x = 0 and, therefore, can have any belief about one dimension of θ. Formally,

Θ_Q(m) = {θ ∈ Θ : θ_i = θ*_i, θ_j ∈ [0, 1]},   (3)

where m_X(i) = 1, i, j ∈ {0, 1} and i ≠ j. Thus, even though the BMDP is correctly specified, the agent can have incorrect beliefs about those parameters for which she observes no information.

We now turn to the analysis of BMDPs. As in the case of MDPs, it is well known that the problem of maximizing discounted expected utility in BMDPs can be cast recursively, where the difference with the MDP is that the state space now includes both the state variable s and the belief μ.

Our main objective is to characterize the agent's behavior in a BMDP when the time period is sufficiently large, so that we might expect beliefs to have settled down and behavior to converge. For this purpose, we conclude this section by defining the notion of equilibrium for a BMDP. An equilibrium represents steady-state behavior in a BMDP. In Section 6 we make the argument formal and show that the notion of equilibrium that we present here corresponds to steady-state behavior.
Definition 7. The transition kernel given a strategy σ is a transition probability function M_σ : S × X → Δ(S × X) such that

M_σ(s′, x′ | s, x) = σ(x′ | s′) Q(s′ | s, x).   (4)

An invariant distribution of the transition kernel M_σ is a distribution m_σ ∈ Δ(S × X) that satisfies

m_σ(s′, x′) = Σ_{(s,x) ∈ S×X} M_σ(s′, x′ | s, x) m_σ(s, x)

for all (s′, x′) ∈ S × X.

It is a standard result that, for every σ ∈ Σ, an invariant distribution of M_σ exists.⁵

⁵ The proof follows by noticing that M_σ is a linear (hence continuous) self-map on a convex and compact subset of a Euclidean space (the set of probability distributions over the finite set S × X); hence, Brouwer's fixed point theorem implies existence of an invariant distribution.
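A sketch for computing an invariant distribution of Definition 7 by power iteration on the finite chain over S × X; the array conventions are ours.

import numpy as np

def invariant_distribution(Q, sigma, iters=10_000):
    """Q[s2, s, x]: true transition probabilities; sigma[x2, s2]: strategy.
    Returns m[s, x] (approximately) satisfying the invariance condition."""
    # Transition kernel of equation (4): M[s2, x2, s, x] = sigma(x2|s2) Q(s2|s,x)
    M = np.einsum("ab,bsx->basx", sigma, Q)
    nS, nX = Q.shape[0], sigma.shape[0]
    m = np.full((nS, nX), 1.0 / (nS * nX))      # start from the uniform distribution
    for _ in range(iters):
        m = np.einsum("basx,sx->ba", M, m)      # one step of the chain
    return m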


We now provide an informal motivation for the definition of equilibrium that follows. Suppose that, after sufficient time, beliefs have converged in the BMDP and the agent's behavior can be represented by a strategy σ that maps each state in S to some probability distribution over X. Then the strategy σ induces a transition kernel M_σ over the space S × X. Let m_σ be an invariant distribution of M_σ and suppose that the time average of observations over S × X converges to the invariant distribution m_σ. Then we might expect that the agent's beliefs over the models of the world have support in the set of closest models given m_σ, Θ_Q(m_σ). Finally, if σ is to be a candidate for steady-state behavior, it must be the case that there is a belief over Θ with support over Θ_Q(m_σ) that makes σ an optimal strategy in the BMDP with transition probability induced by such a belief.
Definition 8. (Equilibrium of a BMDP) A strategy and probability distribution (σ, m) ∈ Σ × Δ(S × X) is an equilibrium of the BMDP with true model Q if there exists μ ∈ Δ(Θ) such that

(i) σ is an optimal strategy for the MDP(Q̄_μ), where Q̄_μ = ∫_Θ Q_θ μ(dθ), and

(ii) μ ∈ Δ(Θ_Q(m)), where m is an invariant distribution of the transition kernel M_σ.
We make several remarks regarding the definition of equilibrium of a BMDP.

Remark 1. Provided that Definition 8 captures steady-state behavior in BMDPs (investigated in the next sections), one of the main benefits of the above definition of equilibrium is that the modeler who is interested in the limiting behavior of the agent in a BMDP can restrict attention to the much simpler class of MDPs. We illustrate throughout the paper how the definition makes concrete, endogenous predictions for mis-specified agents and that these predictions do not obviously follow from the exogenous primitives.

Remark 2. Definition 8 places two restrictions on equilibrium behavior: (i) optimization given beliefs and (ii) endogenous restrictions on beliefs. This dichotomy is standard in game theory and is made explicit by Fudenberg and Levine [1993a] in their definition of self-confirming equilibrium. Optimization is a standard assumption. The innovation behind our definition is to specify how beliefs are to be endogenously restricted as a function of the true model Q, the agent's models of the world Q_Θ, and the agent's strategy σ.

Remark 3. We define an equilibrium to be a strategy-distribution pair. The interpretation is that the agent's behavior is given by her strategy and that the corresponding distribution over outcomes in S × X is given by the corresponding invariant distribution m. The reason we need to specify the invariant distribution as part of the definition of equilibrium is that, in dynamic settings, a strategy might have several invariant distributions associated with it and, therefore, a strategy might not be sufficient to determine the equilibrium outcome of the decision process.

Remark 4. In the special case where the model is correctly specified, the optimal strategy for the true model is always an equilibrium of the BMDP. This result is analogous to the result that a Nash equilibrium is always a self-confirming equilibrium (e.g., Fudenberg and Levine, 1993a). As in that case, the converse result does not necessarily hold because the agent might not learn the correct model if she does not sufficiently experiment.
Proposition 2. If the BMDP with true transition probability function Q is correctly specified, then the optimal strategy for the MDP(Q) is an equilibrium of the BMDP.

Proof. Let σ be an optimal strategy for the MDP(Q). Let m be an invariant distribution of the transition kernel M_σ; see footnote 5 for the argument guaranteeing existence of an invariant distribution. Since the BMDP is correctly specified, there exists θ* ∈ Θ such that Q = Q_{θ*}. By Lemma 1, the wKLD is greater than or equal to zero. In addition, K_Q(m, θ*) = 0. Then, θ* ∈ Θ_Q(m). Since σ is optimal for the MDP(Q_{θ*}), σ is an equilibrium of the BMDP.
Remark 5. Existence of equilibrium. Proposition 2 and the fact that an optimal strategy for the MDP always exists imply that an equilibrium always exists whenever the BMDP is correctly specified. Unfortunately, existence of equilibrium for mis-specified BMDPs cannot be established using standard methods. The standard approach to proving existence is to show that the corresponding best response correspondence has a fixed point. In our setting, let BR(σ) = {σ′ ∈ Σ : there exist m invariant for M_σ and μ ∈ Δ(Θ_Q(m)) such that σ′ is optimal for MDP(Q̄_μ)}. The problem is that this correspondence is not necessarily convex-valued. For example, fix σ and suppose that m′ and m″ are two different invariant distributions of M_σ and that Θ_Q(m′) = {θ′} and Θ_Q(m″) = {θ″}. Then different beliefs can justify different elements of BR(σ). For example, it is possible that σ′ is optimal for MDP(Q_{θ′}) and σ″ is optimal for MDP(Q_{θ″}) but that a convex combination of σ′ and σ″ is not optimal for either belief θ′ or θ″.

Our solution to this problem will be to study a perturbed version of the BMDP, show that equilibrium exists in the perturbed version, and then show that the limit of a sequence of equilibria as the perturbation vanishes exists and is an equilibrium of the original BMDP. We postpone the proof to the next section but state the existence result now.
Theorem 1. An equilibrium of a BMDP always exists.
Proof. Follows from Lemma 7, Theorem 2, and Theorem 3 in Section 5; see the discussion right
after Theorem 3.
Example 2, continued. We now find all the equilibria for each version of the BMDP in Example 2. Let the strategy σ represent the probability that x = 1 (again, we can ignore the state s because the agent believes, correctly, that the current state does not affect the draw of the next state).
First, we consider the case where θ*_1 < 1/2 < θ*_0 and study the mis-specified BMDP where Θ = {θ ∈ [0, 1]² : θ_0 = θ_1}. For each σ, there is a unique invariant distribution m and its marginal over X satisfies m_X(1) = σ. Thus, for each σ with invariant distribution m, Δ(Θ_Q(m)) is the set of possible beliefs. By (2) above, this belief is degenerate and puts probability 1 on

θ_σ = (1 − σ)θ*_0 + σθ*_1.   (5)

Thus, by Definition 8, σ is an equilibrium strategy if and only if σ is an optimal strategy when the parameter governing the (static) problem is given by (5). We show that there is a unique equilibrium strategy and that the agent strictly mixes in equilibrium. Suppose that σ = 1. Then, by (5), the agent believes that the true parameter is θ*_1. But then

E_{θ*_1}[π(1, s′)] = 2θ*_1 − 1/2 < θ*_1 = E_{θ*_1}[π(0, s′)]

and the agent prefers to deviate and choose σ = 0. Thus, σ = 1 is not an equilibrium strategy. A similar reasoning yields that σ = 0 is not an equilibrium strategy either. Thus, an equilibrium strategy σ ∈ (0, 1) must be such that, under the corresponding belief in equation (5), the agent is indifferent between her actions, i.e.,

E_{θ_σ}[π(1, s′)] = 2θ_σ − 1/2 = θ_σ = E_{θ_σ}[π(0, s′)].   (6)

The unique solution (hence, the unique equilibrium strategy) of (6) is

σ = (θ*_0 − 1/2) / (θ*_0 − θ*_1) ∈ (0, 1).

For example, if θ*_0 = 3/4 and θ*_1 = 1/4, then σ = 1/2 is the unique equilibrium strategy for this mis-specified BMDP.
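A numerical counterpart to this derivation (a sketch of ours, with the same illustrative parameters): solve the indifference condition (6) by bisection on the payoff gap.

from scipy.optimize import brentq

theta0, theta1 = 0.75, 0.25            # true parameters, theta1 < 1/2 < theta0

def gap(sigma):
    theta_sigma = (1 - sigma) * theta0 + sigma * theta1   # belief from equation (5)
    return (2 * theta_sigma - 0.5) - theta_sigma          # E[pi(1, s')] - E[pi(0, s')]

sigma_eq = brentq(gap, 0.0, 1.0)
print(sigma_eq)                        # 0.5 = (theta0 - 1/2) / (theta0 - theta1)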
Next, we consider the case where Θ = [0, 1]² and, therefore, the BMDP is correctly specified. For concreteness, suppose that θ*_0 = 3/4 and θ*_1 = 1/4. Then,

E_{θ*}[π(1, s′)] = 2θ*_1 − 1/2 < θ*_0 = E_{θ*}[π(0, s′)],

and, therefore, σ = 0 is the unique optimal strategy given θ*. Thus, by Proposition 2, σ = 0 is an equilibrium strategy of the BMDP. Next, consider any strategy σ ∈ (0, 1). In this case, the corresponding invariant distribution m has a marginal m_X(1) ∈ (0, 1) and, by the previous discussion, beliefs must be degenerate at the true parameter θ*. Then σ ∈ (0, 1) cannot be optimal given these beliefs and, therefore, σ ∈ (0, 1) is not an equilibrium strategy. Finally, suppose that σ = 1. The corresponding invariant distribution m has a marginal m_X(1) = 1 and, by (3), the agent must believe that θ_1 = θ*_1 but can hold any belief about θ_0 ∈ [0, 1]. In particular, she can believe that θ_0 = 0, in which case it is (weakly) optimal to choose σ = 1. Thus, σ = 1 is the only other equilibrium strategy of the BMDP. □
A feature of Example 2 (and also of the monopolist example in Section 2) is that the environment faced by the agent who knows the transition probability function is static, meaning that the current state does not influence the future state.

Definition 9. A transition probability function is static if the distribution over the new state does not depend on the previous state, i.e., Q(· | s, x) = Q(· | s̄, x) for all x ∈ X and all s, s̄ ∈ S. A BMDP is static if all transition probability functions in Q_Θ are static.⁶

In static BMDPs, it is without loss of generality to restrict attention to strategies such that σ(· | s) = σ(· | s̄) for all s, s̄ ∈ S. Thus, in static BMDPs we denote a strategy by σ ∈ Δ(X). Moreover, for a given strategy σ there is a unique invariant distribution m ∈ Δ(S × X) and its marginal over X, denoted by m_X, coincides with σ. In addition, the wKLD depends only on m_X rather than on m. Thus, in the remainder of the paper we abuse notation and denote the wKLD and the correspondence Θ_Q(·) as a function of σ rather than m if the BMDP is static. Notice also that in this case it is not necessary to specify the invariant distribution as part of the equilibrium.

⁶ Of course, the problem is dynamic for the agent, who also needs to learn the transition probability function, even if these transition probability functions are static.

Remark 6. More generally, there are several environments where, for each σ, there is a unique invariant distribution m_σ and the set Θ_Q(m_σ) is a singleton. In these environments, by letting θ_σ be the unique element of Θ_Q(m_σ), Definition 8 becomes: a strategy σ is an equilibrium of the BMDP if and only if σ is optimal for the MDP(Q_{θ_σ}).

The next two sections provide a justification for the notion of equilibrium proposed above. The reader who is interested in applications of the equilibrium concept can jump to Section 7.

5 Perturbed Decision Processes

In this section, we perturb the payoffs of the MDP and BMDP introduced in the previous section
and establish that equilibrium exists under a class of perturbations. These perturbations were
introduced by Harsanyi [1973] in the context of normal-form games and have been incorporated
by Doraszelski and Escobar [2010] in the type of dynamic settings studied here. We then consider
a sequence of perturbed environments where the perturbation goes to zero and establish that the
limit (which exists) is an equilibrium of the unperturbed environment. This result implies existence
of equilibrium in the unperturbed environment. In the next section, we show that equilibria in these
perturbed games can be viewed as the steady-states of the perturbed BMDP.
Definition 10. A perturbed MDP is an MDP ⟨S, X, Γ, q_0, Q, π, δ⟩ and a tuple ξ = ⟨V, P_V⟩ where

V ⊆ R^{|X|} is a set of payoff perturbations for each action

P_V : S → Δ(V) is a distribution over payoff perturbations conditional on each state s ∈ S

The timing of a perturbed MDP coincides with the (unperturbed) MDP defined in Section 3 except for two modifications. First, before taking an action in period t, the agent not only observes s_t but now also observes a vector of payoff perturbations ε ∈ V, where ε(x) denotes the perturbation corresponding to action x. This vector of payoff perturbations is drawn i.i.d. every period conditional on the state s_t, according to the probability distribution P_V(· | s_t). Second, after the agent chooses x_t and the new state s_{t+1} is realized, her payoff for period t is now

π(s_t, x_t, s_{t+1}) + ε(x_t).
ASSUMPTION A6. For all s ∈ S, P_V(· | s) is absolutely continuous with respect to the Lebesgue measure on R^{|X|} and ∫_V ||ε|| P_V(dε | s) < ∞.
In a perturbed MDP, the agent's optimization problem can be cast recursively as

V_Q(s) = ∫_V max_{x ∈ Γ(s)} { ∫_S [ π(s, x, s′) + ε(x) + δ V_Q(s′) ] Q(ds′ | s, x) } P_V(dε | s),   (7)

where V_Q : S → R.

Lemma 2. There exists a unique solution V_Q to the Bellman equation (7); moreover, V_Q is bounded and continuous as a function of Q.
Proof. See the Appendix.

Definition 11. A strategy σ : S → Δ(X) is optimal for the perturbed MDP if for all s ∈ S and all x ∈ X,

σ(x | s) = P_V ( ε : x ∈ arg max_{x̄ ∈ Γ(s)} ∫_S { π(s, x̄, s′) + ε(x̄) + δ V_Q(s′) } Q(ds′ | s, x̄) | s ).   (8)

We can think of an optimal strategy for the perturbed MDP in two stages. First, for each (s, ε) the agent determines the optimal action. Second, we integrate over ε. In other words, if σ is an optimal strategy, then σ(x | s) is the probability that x is optimal when the state is s, taken over all possible realizations of the perturbation ε. The next result is a consequence of the absolute continuity of P_V.
Lemma 3. An optimal strategy σ for the perturbed MDP exists (and is, therefore, unique). Moreover, σ is continuous as a function of the transition probability function Q.

Proof. That an object like σ in (8) is well defined is standard (e.g., Harsanyi [1973]) and follows from the facts that there are a finite number of actions and that the set of ε's such that the agent is indifferent between two actions lies in a lower-dimensional hyperplane. By absolute continuity, the set of ε's where the agent is indifferent has measure zero. Thus, the RHS of (8) must add up to 1 when added over all actions and, therefore, σ in (8) is well defined. The fact that σ is continuous as a function of Q follows from the fact that V_Q is continuous in Q (Lemma 2) and the Theorem of the Maximum.
An important role will be played by environments where the perturbation induces the agent to choose all possible actions.

Definition 12. A fully perturbed MDP is a perturbed MDP where, for all s ∈ S, the support of P_V(· | s) is R^{|X|}.⁷

Lemma 4. If σ is the optimal strategy for a fully perturbed MDP, then there exists c > 0 such that σ(x | s) ≥ c for all (s, x) ∈ S × X.

Proof. See the Appendix.

⁷ The next version will extend this definition to bounded supports.

If σ is such that all actions are chosen with positive probability for each state, then Assumption A4 implies that all (s, x) ∈ S × X are visited infinitely often; thus, there is a unique invariant distribution associated with σ.

Lemma 5. If σ satisfies σ(x | s) > 0 for all (s, x) ∈ S × X, then the transition kernel M_σ has a unique invariant distribution m, and m satisfies m(s, x) > 0 for all (s, x) ∈ S × X.
Proof. See the Appendix.

Lemma 6. Suppose that m(s, x) > 0 for all (s, x) ∈ S × X. Then ∫_Θ Q_θ μ(dθ) = ∫_Θ Q_θ μ′(dθ) for any μ, μ′ ∈ Δ(Θ_Q(m)).

Proof. By Assumption A5, for all θ′, θ″ ∈ Θ_Q(m), Q_{θ′}(· | s, x) = Q_{θ″}(· | s, x) for all (s, x) ∈ S × X; hence, the result follows.
Along similar lines, we can now define a perturbed version of a BMDP.
Definition 13. A perturbed BMDP is a perturbed MDP ⟨S, X, Γ, q_0, Q, π, δ⟩, ξ = ⟨V, P_V⟩, together with a tuple ⟨Q_Θ, μ_0, B⟩ as defined in Definition 4. A fully perturbed BMDP is a perturbed BMDP where the corresponding MDP is fully perturbed.

The timing of a perturbed BMDP is the same as the timing of a BMDP specified in Section 4, where now the timing of the payoff perturbations coincides with the timing described above for the perturbed MDP.
Finally, we extend the definition of equilibrium of a BMDP in Section 4 to the case of a perturbed BMDP. The definition coincides with Definition 8, except that of course we require optimality with respect to the perturbed version of the MDP.

Definition 14. (Equilibrium of a perturbed BMDP) A strategy and probability distribution (σ, m) ∈ Σ × Δ(S × X) is an equilibrium of the perturbed BMDP with true model Q if there exists μ ∈ Δ(Θ) such that

(i) σ is an optimal strategy for the perturbed MDP(Q̄_μ), where Q̄_μ = ∫_Θ Q_θ μ(dθ), and

(ii) μ ∈ Δ(Θ_Q(m)), where m is an invariant distribution of the transition kernel M_σ.

Theorem 2. An equilibrium of a fully perturbed BMDP always exists.

Proof. Let Σ* = {σ ∈ Σ : σ(x | s) ∈ [c*, 1 − (|X| − 1)c*] for all (s, x) ∈ S × X}, where c* ∈ (0, 1) is defined in Lemma 4. For all σ ∈ Σ*, Lemma 5 implies that there is a unique invariant distribution, which we denote m_σ, of the transition kernel M_σ. Also, for all σ ∈ Σ*, Lemmas 5 and 6 imply that ∫_Θ Q_θ μ(dθ) = ∫_Θ Q_θ μ′(dθ) for all μ, μ′ ∈ Δ(Θ_Q(m_σ)). Thus, the following function Q̄ that maps elements in Σ* to transition probability functions is well defined: Q̄(σ) = ∫_Θ Q_θ μ(dθ), where μ is any belief that belongs to Δ(Θ_Q(m_σ)) and m_σ is the (unique) invariant distribution of the transition kernel M_σ. Next, we define the function f : Σ* → Σ* where f(σ) is the optimal strategy for the perturbed MDP(Q̄(σ)), which exists and is unique by Lemma 3. Notice that if σ is a fixed point of f, then (σ, m_σ) is an equilibrium of the perturbed BMDP. We now apply Brouwer's fixed point theorem to establish that a fixed point of f always exists. The space Σ* is a compact and convex subset of a Euclidean space. Thus, it remains to establish that f is continuous. We establish this last result in two steps.

First, we show that Q̄ defined above is continuous for all σ ∈ Σ*. Let σ ∈ Σ* and suppose that (σ_n)_n is a sequence of strategies in Σ* that converges to σ. Let (m_{σ_n})_n be the sequence of (unique) invariant distributions for the transition kernels (M_{σ_n})_n. By compactness of Δ(X × S), pick a subsequence of (m_{σ_{n_k}})_k (denoted in the same way to simplify notation) that converges to m*. By Lemma 12, m* is an invariant distribution of M_σ. For each element in the subsequence, the fact that Θ_Q(m_{σ_{n_k}}) is non-empty (by Lemma 1) and that Θ is compact implies that we can pick a subsubsequence θ_{n_{k_l}} ∈ Θ_Q(m_{σ_{n_{k_l}}}) that converges to some θ* ∈ Θ. Then θ* ∈ Θ_Q(m*) by the upper hemicontinuity of Θ_Q(·) established in Lemma 1. Thus, the facts that θ_{n_{k_l}} ∈ Θ_Q(m_{σ_{n_{k_l}}}) and θ* ∈ Θ_Q(m*) imply that Q̄(σ_{n_{k_l}}) = Q_{θ_{n_{k_l}}} and Q̄(σ) = Q_{θ*}. Continuity of Q̄ then follows because Q_θ is continuous as a function of θ (by Assumption A3). Second, by Lemma 3, the function mapping a transition probability function Q to the optimal strategy of a perturbed MDP is continuous. Since f is the composition of this function with the function Q̄, f is continuous.
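To illustrate the fixed-point construction in the proof, here is a sketch specialized to Example 2's mis-specified case with i.i.d. N(0, scale²) payoff perturbations, under which the optimal perturbed strategy is a probit best response; the bisection routine and parameter values are our own choices. Consistent with Theorem 3 below, the perturbed equilibrium approaches the unperturbed equilibrium 1/2 as the perturbation vanishes.

from math import sqrt
from statistics import NormalDist

theta0, theta1 = 0.75, 0.25
Phi = NormalDist().cdf

def best_response(sigma, scale):
    theta_sigma = (1 - sigma) * theta0 + sigma * theta1   # closest model, equation (5)
    gap = theta_sigma - 0.5                               # value of x = 1 minus x = 0
    return Phi(gap / (scale * sqrt(2)))                   # P(perturbation makes 1 optimal)

def equilibrium(scale):
    lo, hi = 0.0, 1.0        # best_response is decreasing, so bisect on BR(s) - s
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if best_response(mid, scale) > mid else (lo, mid)
    return (lo + hi) / 2

for scale in (1.0, 0.1, 0.01):
    print(scale, equilibrium(scale))                      # tends to 1/2 as scale -> 0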
We now consider a sequence of BMDPs where the payoff perturbations go to zero.

Definition 15. A vanishing family of perturbed BMDPs is a sequence of BMDPs with the following properties:

each BMDP in the sequence shares the same primitives ⟨S, X, Γ, q_0, Q, π, δ⟩ and ⟨Q_Θ, μ_0, B⟩

each BMDP possibly differs in the perturbation structure ⟨V^ξ, P_V^ξ⟩, where ξ ∈ N now indexes the element of the sequence⁸

perturbations vanish as ξ → ∞: for all s ∈ S and any sequence of measurable sets (D^ξ)_ξ,

lim_{ξ→∞} ∫_{D^ξ} |ε(x)| P_V^ξ(dε | s) = 0.   (9)

A vanishing family of fully perturbed BMDPs is a family of perturbed BMDPs where each BMDP is fully perturbed.

Lemma 7. For a fixed set of primitives ⟨S, X, Γ, q_0, Q, π, δ⟩ and ⟨Q_Θ, μ_0, B⟩, a vanishing sequence of fully perturbed BMDPs always exists.

Proof. Note that the conditions on the perturbation structure in Definition 12 and Definition 15 are compatible (e.g., a Normal distribution with mean zero and vanishing variance).

⁸ In a slight abuse of notation, we now use ξ to index the sequence, whereas ξ was used in Definition 10 to denote a tuple.

Theorem 3. Fix a vanishing sequence of perturbed BMDPs and a corresponding sequence (σ^ξ, m^ξ)_ξ of equilibria such that lim_{ξ→∞} (σ^ξ, m^ξ) = (σ, m). Then (σ, m) is an equilibrium of the (unperturbed) BMDP.

Proof. By assumption, there is a sequence (σ^ξ, m^ξ, μ^ξ)_ξ such that, for all ξ, (i) m^ξ is invariant for the transition kernel M_{σ^ξ}, (ii) μ^ξ ∈ Δ(Θ_Q(m^ξ)), (iii) σ^ξ is optimal for the perturbed MDP(Q̄_{μ^ξ}), where Q̄_{μ^ξ} = ∫_Θ Q_θ μ^ξ(dθ), and (iv) lim_{ξ→∞} (σ^ξ, m^ξ) = (σ, m). By compactness of Δ(Θ), we can fix a subsequence where μ^ξ converges to some μ. Let Q̄_μ = ∫_Θ Q_θ μ(dθ). By Lemma 1, Θ_Q(·) is upper hemicontinuous, compact valued, and has compact domain and range; hence, the correspondence Δ(Θ_Q(·)) inherits the same properties. Therefore, Δ(Θ_Q(·)) satisfies the closed-graph property, which implies that μ ∈ Δ(Θ_Q(m)). Moreover, by Lemma 12, m is an invariant distribution of the transition kernel M_σ. Thus, to show that (σ, m) is an equilibrium of the (unperturbed) BMDP, it remains to show that σ is an optimal strategy for the (unperturbed) MDP with transition probability function Q̄_μ.

Fix any s ∈ S and any x* ∈ X such that σ(x* | s) > 0. Since lim_{ξ→∞} σ^ξ = σ, there exist k_{s,x*} > 0 and ξ_{s,x*} such that for all ξ ≥ ξ_{s,x*}, σ^ξ(x* | s) ≥ k_{s,x*}. From now on, consider only such sufficiently large ξ's for the fixed subsequence. For all such ξ's, let D^ξ(x*) be the set of ε ∈ V^ξ such that

ε(x*) − ε(x) ≥ ∫_S {π(s, x, s′) + δ V_{Q̄_{μ^ξ}}(s′)} Q̄_{μ^ξ}(ds′ | s, x) − ∫_S {π(s, x*, s′) + δ V_{Q̄_{μ^ξ}}(s′)} Q̄_{μ^ξ}(ds′ | s, x*)   (10)

for all x ≠ x*. Notice that P_V^ξ(D^ξ(x*)) = σ^ξ(x* | s) (see equation (8)). Since σ^ξ is an optimal strategy for the perturbed MDP(Q̄_{μ^ξ}), then P_V^ξ(D^ξ(x*)) ≥ k_{s,x*} > 0 for all ξ. Integrating expression (10) over all ε ∈ D^ξ(x*), we obtain

(1 / P_V^ξ(D^ξ(x*))) ∫_{D^ξ(x*)} {ε(x*) − ε(x)} P_V^ξ(dε) ≥ ∫_S {π(s, x, s′) + δ V_{Q̄_{μ^ξ}}(s′)} Q̄_{μ^ξ}(ds′ | s, x) − ∫_S {π(s, x*, s′) + δ V_{Q̄_{μ^ξ}}(s′)} Q̄_{μ^ξ}(ds′ | s, x*).

Taking limits over ξ and using assumption (9) and the fact that S is finite, it follows that

∫_S {π(s, x*, s′) + δ lim_{ξ→∞} V_{Q̄_{μ^ξ}}(s′)} Q̄_μ(ds′ | s, x*) − ∫_S {π(s, x, s′) + δ lim_{ξ→∞} V_{Q̄_{μ^ξ}}(s′)} Q̄_μ(ds′ | s, x) ≥ 0.   (11)

Finally, since Q_θ is continuous (Assumption A3) and bounded, lim_{ξ→∞} Q̄_{μ^ξ} = Q̄_μ. Thus, by Lemma 13, lim_{ξ→∞} V_{Q̄_{μ^ξ}}(·) = V_{Q̄_μ}(·) and, therefore, expression (11) implies that x* is optimal in state s in the MDP with transition probability function Q̄_μ. Because the above holds for all (s, x*) such that σ(x* | s) > 0, it follows that σ is optimal for MDP(Q̄_μ).
A corollary of the above results is that an equilibrium of the (unperturbed) BMDP always exists (as stated in Theorem 1 of Section 4). To prove this statement, fix a vanishing family of fully perturbed BMDPs, which we can do by Lemma 7. By Theorem 2, there exists a corresponding sequence of equilibria. Since equilibria live in a compact space, there exists a subsequence of equilibria that converges. Theorem 3 says that this limit point is an equilibrium of the (unperturbed) BMDP.

As a final remark, notice that Theorem 3 is of independent interest and holds even if the vanishing sequence of perturbed BMDPs is not fully perturbed.

6 Learning foundation for equilibrium

In this section, we provide a learning foundation for the notion of equilibrium of a fully perturbed BMDP introduced in Section 5. Throughout this section, we fix a fully perturbed BMDP with components ⟨S, X, Γ, q_0, Q, π, δ⟩, ⟨V, P_V⟩, and ⟨Q_Θ, μ_0, B⟩ satisfying Assumptions A1-A5. It is easy to see that the agent's dynamic optimization problem can be cast recursively as

W(s, ε, μ) = max_{x ∈ Γ(s)} ∫_S { π(s, x, s′) + ε(x) + δ ∫_V W(s′, ε′, μ′) P_V(dε′ | s′) } Q̄_μ(ds′ | s, x),   (12)

where Q̄_μ = ∫_Θ Q_θ μ(dθ) and where μ′ = B(s′, s, x, μ) is next period's belief, which is updated according to Bayes' rule. The main difference with respect to the case where the agent knows the transition probability function is that now the agent's beliefs about θ are part of the state space.⁹

⁹ We now also explicitly include the payoff perturbation as a state variable for convenience.

Lemma 8. There exists a unique solution W to the Bellman equation (12); moreover, W is bounded in S × Δ(Θ) for all ε ∈ V and continuous in S × V × Δ(Θ) (under the product topology).
Proof. See the Appendix.

Definition 16. A policy function is a function f : S V () X mapping the state of


the dynamic optimization problem to the set of actions. A policy function f is optimal (for the
perturbed BMDP) if



0
0 0 0
0 0
(ds0 |x, s)
f (s, , ) arg max
(s, x, s ) + (x) +
W (s , , )PV (d |s ) Q
x(s) S

for all (s, , ) S V ().


The restriction to deterministic policy functions is without loss of generality and simplifies some
of the following definitions.
Let h = (s0 , 0 , x0 , ..., st , t , xt , ...) represent the infinite history or outcome path of the dynamic optimization problem and let H = (S V X) represent the space of infinite histories.
The primitives of the BMDP, including the prior 0 (), and a policy function f induce a probability distribution over H that is defined in a standard way; let P 0 ,f denote this probability
distribution over H .
Definition 17. The sequence of intended strategies given policy function f is the sequence
(t )t of random variables t : H (X)S such that, for all t = 0, 1, ...,
t (h )(x | s) = PV ( : f (s, , t ) = x | s) ,
where t () is the posterior at time t defined recursively by t = B(st , st1 , xt1 , t1 ).
9

We now also explicitly include the payoff perturbation as a state variable for convenience.

21

(13)

By an argument similar to that used to prove Lemma 3, the assumption of absolute continuity
of PV ( | s) implies that (13) is well defined.
Notice that, given a policy function f , an intended strategy t describes the strategy that the
agent would like to play if she were to arrive at time t with the beliefs t . One reasonable criteria
to claim that the agents behavior stabilizes is that her intended behavior stabilizes.
Definition 18. A strategy : S X (X) is stable under policy function f if


P 0 ,f lim kt (h ) k = 0 > 0,
t

where (t )t is the sequence of intended strategies given f .

Lemma 9. Consider a fully perturbed BMDP and a strategy with invariant distribution m for
the transition kernel M . If is stable under policy function f , then there exists a set of histories
H H with P 0 ,f (H ) > 0 such that, for all h H ,
t1

1X
lim
1{s,x} (s , x )(h ) = m(s, x) > 0.
t t
=0

for all (s, x) SX and limt kt (h ) k = 0, where (t )t is the sequence of intended strategies
given f .
Lemma 9 says that, if the sequence of intended strategies converges in a fully perturbed BMDP,
then the frequency of outcomes also converges; moreover, this frequency converges to the invariant
distribution of the transition kernel M . By Lemma 5, this invariant distribution is unique. Thus,
for fully perturbed BMDPs, the notion of stability in Definition 18 captures the idea that behavior
and outcomes stabilize in the BMDP.
The next result characterizes the set of strategies that are stable when the agent follows an
optimal policy function in the fully perturbed BMDP.
Theorem 4. Consider a fully perturbed BMDP and a strategy with invariant distribution m for
the transition kernel M . If is stable under a policy function f that is optimal, then (, m) is an
equilibrium of the fully perturbed BMDP.

The proof of Theorem 4 follows from lemma 9 and the following lemmas, which are proven in
the Appendix.
Lemma 10. Given a policy function f , suppose that there exists a H H such that P0 ,f (H ) >
0 and, for all (s, x) S X,
t1

1X
lim
1{s,x} (s , x )(h ) = m(s, x)
t t
=0

22

for all h H . Then, for any open set U Q (m),


lim t (U ) = 1

a.s. P0 ,f over H , where (t )t is defined recursively as t+1 = B(st+1 , st , f (st , t , t ), t ) for all
t.

Lemma 11. Let f be an optimal


for a fully perturbed BMDP and let be a
policy function

0
0

set with the property that Q


Q
(d)
=
Q

(d) for all , (). Suppose that there
exists a H H with P0 ,f (H ) > 0 such that, for any open set U ,
lim t (U ) = 1

a.s. P0 ,f over H . Then limt (h ) = a.s. P0 ,f over H , where is the optimal

strategy for a fully perturbed MDP(Q).


Proof of Theorem 4. Let be stable under a policy function f that is optimal and let m be
the invariant distribution of M . By Lemma 9, the time average of outcomes converges to m. By
Lemma 10, beliefs concentrate over Q (m). If we can shows that Q (m) has the property stated
in Lemma 11, then that lemma implies the desired result. Notice that, by Lemma 9, m(s, x) > 0
for all (s, x) S X; thus, Lemma 6 implies that the set Q (m) has the desired property.

Examples

7.1

Monopolist with unknown demand

We now study a generalized version of Nyarko [1991] example discussed in Section 2. The monopolist
chooses at each period t = 0, 1, ... a price xt X = {2, 10}. After choosing price xt , the monopoly
observes quantity sold according to the demand function
st+1 = a bxt + t ,
where (t )t are i.i.d. random variables. Nyarko [1991] considered the special case where (t )t are
normally distributed with mean zero and unit variance. Our results allow us to easily generalize
this assumption to the case where the error term is distributed according to an absolutely continuous probability distribution with density f and support in R and where the following regularity
assumption holds: for every z 6= 0, the set
{ : f (z + ) = f ()}

(14)

has probability strictly less than 1 according to f . Assumption 14 rules out, for example, density
functions that are periodic.10 The corresponding transition probability function is, for all Borel
10

As mentioned by Nyarko [1991], the assumption that the support is R is chosen for simplicity despite the fact
that realized quantity can be negative with positive probability. It is possible to also consider the case where has
positive support but the agent does not know the distribution of ; in this case, the agent cannot identify the demand
intercept from the mean of , but this lack of identification is irrelevant for optimal behavior.

23

sets S0 ,

Q (S0 | x) =

{abx+S0 }

f (s (a bx))d,

= (a, b). Let = (a , b ) denote the true demand parameter; i.e., the true transition probability
function is Q = Q .
For simplicity, the monopolist incurs no costs of production. The profits received in period t
are then
(xt , st+1 ) = xt st+1 .
Notice that the problem of the monopolist who knows the demand function is static and that it
would be more natural to index price and quantity with the same time period t. Nevertheless, we
maintain our notation from the paper, which is more general and allows us to capture problems
that are intrinsically dynamic even without a learning problem.
In this example, S = R, which does not satisfy the assumption that S is a finite set. Nevertheless,
it is not too difficult to show that the results of the paper go through. Thus, the previous setup is
an example of an MDP.
Next, we assume that the monopolist does not know the parameter = (a, b); moreover, she
never observes the shocks t , but knows their distribution. In particular, let R2 be a compact
set and suppose that the monopolist starts with prior 0 with support . This problem is now a
BMDP and it is straightforward to check that Assumptions A1-A4 are satisfied (extended to the
case where S R).
Finally, to check Assumption A5 we begin by writing the WKLD. Since this MDP is inherently
static (i.e., the current state does not affect the new state), as remarked at the end of Section 4 we
replace the marginal of the invariant distribution given a strategy directly by the strategy. In this
example, let the strategy [0, 1] denote the probability that the agent chooses x = 2.







r + )
f (r2 r2 + )
f (r10
10
KQ (, ) = ln
f ()d + (1 ) ln
f ()d ,
f ()
f ()
R
R
where rx = a bx. We consider two cases.
Case 1. Correctly specified BMDP: Suppose that , so that the BMDP is correctly
specified. By Lemma 1, KQ is always greater than or equal to zero. Suppose that (0, 1). If

is such that rx rx = 0 for all x, then KQ (, ) = 0; thus, such a minimizes the WKLD. Notice
also that the unique solution to the two previous equations is a = a and b = b . Thus, the above
solution minimizes the WKLD. Moreover, the fact that ln() is strictly concave and that assumption

(14) is satisfied (where we set z = rx rx ) implies that the above is the unique minimizer of the
WKLD.
Next, consider the case where = 1 or = 0. By applying the same arguments, we conclude

r = 0,
that the set of minimizers of the WKLD is the set of that solves solves r2 r2 = 0 or r10
10
respectively. Thus, we have established that

if = 1
{ : a b2 = a b 2}
Q () = { }
(15)
if (0, 1)

{ : a b10 = a b 10} if = 0

24

Figure 1 provides intuition for this result. The two lines through with slope 2 and 10
correspond to the sets Q () when = 1 and = 0, respectively. The intuition is that if the
agent only plays x = 2 (or 10), then she cannot distinguish between any of the parameters along
the line with slope 2 (or 10). However, if she were to play both x = 2 and x = 10 with positive
probability, then the agents belief would be given by the intersection of these two lines, which is
{ }. In particular, notice that Assumption A5 is satisfied because, for each line, all the points
along the line lead to the same transition probability function when the agent chooses the action
corresponding to the line with probability 1.
Let denote optimal strategy in the MDP (i.e., is known). Suppose, for concreteness,
that is such that = 1 is the unique optimal strategy for the MDP (this is true if and only
if a < 12b ). Then Proposition 2 says that is an equilibrium strategy for the BMDP. The
question that we would like to answer is whether there exist other equilibrium strategies. By (15),
if the agent mixes then it must be the case that she puts probability 1 on the true model and,
therefore, chooses . So if another equilibrium exists it must involve the agent choosing 10 with
probability 1. For simplicity, suppose that the agent has degenerate beliefs (which is without loss
of generality if is convex). For the agent to prefer price 10, she must believe that 10 is (weakly)
better than 2, i.e., a 12b, as depicted by the dashed line in Figure 1. Moreover, (a, b) must be
on the line with slope 10 in Figure 1. Thus, if includes points along the line with slope 10 and
above the dashed line, then = 0 is also an equilibrium strategy for the BMDP.
Finally, suppose that we consider a fully perturbed version of this BMDP. Then the fact that the
agent must mix implies that she has correct beliefs about and, therefore, the unique equilibrium
strategy is = 1. If we now take a sequence of equilibrium strategies where the perturbation
vanishes, then the sequence must trivially converge to = 1. Thus, the fully perturbed BMDP
can be used to refine the se of equilibria of the unperturbed BMDP.
Case 2. Mis-specified BMDP: Suppose that is a rectangle in R2 with vertices at the
points 0 , 00 , (a0 , b00 ), and (a00 , b0 ), where a00 < a0 < a and b0 < b00 < b . Notice that the BMDP is
mis-specified because . By minimizing the WKLD we obtain
(
(a0 , b ) if

Q () =
(16)
00
(a , b ) if

where a = a (2 + 10(1 ))(b b00 ), b = b (a a0 )/(2 + 10(1 )), and


=
5/4 (a a0 )/(8(b b00 )). In particular, the closest model to the true model is always unique
and Assumption A5 is verified for the mis-specified case.
For concreteness, consider the special case studied by Nyarko [1991] and depicted in Figure 1
where (a , b ) = (28.5, 5.25), a0 = 20, a00 = 16, b0 = 1, and b00 = 4. In this case, the two lines with
slope 2 and 10 happen to pass through two vertices of the rectangle , though the same results
hold if, for example, a00 < 16 and b0 < 1. If = 1, then all the points on the line with slope
2 minimize the WKLD, but now the point that belongs to that is closest to along the axis
created by the line with slope 2 is 0 ; in fact, 0 is the unique point that is both on the line and
belongs to , but this fact is not important and the result extends to the case where a00 < 16 and
b0 < 1. Using equation16, we can verify that b1 = 1 and, therefore, (a0 , b0 ) is indeed the unique
minimizer. Next, consider the case where < 1. Figure 1 depicts the line through point with
slope 2 + 10(1 ). If
, where
is defined above, it is easy to see that the point on that
25

is closest to when measured along the line with slope 2 + 10(1 ) is a unique point lying on
the top of the rectangle that define ; this point is (a0 , b ) in equation 16.
We now go back to the general case where a00 < a0 < a and b0 < b00 < b . Suppose that = 1,
so that the agent believes, according to (x), that the true parameter is (a0 , b0 ). If 2(a0 2b0 ) <
10(a0 10b0 ), or equivalently, if a0 > 12b0 , then the unique optimal strategy is for the agent to
play = 0. Thus, = 1 is not an equilibrium of the mis-specified BMDP. Next, consider the case
where = 0, so that the agent believes, according to (16), that the true parameter is (a00 , b00 ). If
2(a00 2b00 ) > 10(a00 10b00 ), or equivalently, if a00 < 12b00 , then the unique optimal strategy is for the
agent to play = 1. Thus, = 0 is not an equilibrium of the mis-specified BMDP. Thus, if both
a0 > 12b0 and a00 < 12b00 , as is the case in the example considered by Nyarko [1991], the equilibrium
must be in mixed strategies. In such a case, the monopolist must choose that leads to a belief
that the parameter is (a , b ) such that she is indifferent between prices x = 2 and x = 10:
2(a 2b ) = 10(a 10b ),
or, equivalently, a = 12b . Using (16) and some algebra, the unique equilibrium strategy for the
case where a0 > 12b0 and a00 < 12b00 is
=

5 1 (a a0 )

.
4 8 (b a0 /12)

For the parameters specified by Nyarko, the unique equilibrium strategy is = .95.

7.2

Trading with adverse selection

We use the trading example from Esponda (2008, Section I) in order to illustrate that the current
framework includes as particular cases the decision-theoretic analogs of three game theoretic concepts that have been defined to capture the behavior of boundedly rational players: (fully) cursed
equilibrium Eyster and Rabin [2005], analogy-based expectation equilibrium (Jehiel [2005], Jehiel
and Koessler [2008]), and behavioral equilibrium (Esponda [2008]). Splieger [2011] and Esponda
[2008] discuss further relationships between these concepts that are not elaborated here.
At each period t = 0, 1, ..., a buyer and a seller simultaneously submit a (bid) price xt from a
finite set X and an ask price at from a finite set A, respectively. If xt at , then the buyer pays xt
to the seller and receives the sellers object, which she values at vt , drawn from a finite set V. If
xt < at , the no trade takes place and each player receives 0. At the time she makes an offer, the
buyer does not know her realized value vt . Suppose that the sellers ask price and the buyers value
are drawn each period from the same probability distribution q (A V), where qA (A) and
qV (V) denote the marginal distributions. Our objective is to analyze the optimal pricing strategy
of a risk neutral buyer. (The typical story is that there is a population of sellers each of whom
follows the weakly dominant strategy of asking for her valuation; thus, the ask price is a function
of the sellers valuation and, if buyer and seller valuations are correlated, as is the case in adverse
selection settings, then the ask price and buyer valuation are also correlated.)
7.2.1

Behavioral equilibrium

Suppose that the buyer observes her realized value vt at the end of each period t if and only if trade
takes place on that period. Suppose also that the buyer always observes the ask price at submitted
26

by the seller at the end of each period.11 Esponda [2008] allows for more general types of feedback
and captures different types of feedback by using an information feedback function, an approach
that is common in the self-confirming equilibrium literature. Here, in order to avoid having to
introduce a new element to the setup in Section 3, we model limited feedback by a reformulation
of the state space. In particular, the state space is S = A V {}, where (a, v) represents the
state drawn at the end of period t and a value of v =  indicates that there was no trade and,
therefore, the buyer did not observe the realized value of the object.
Let Q denote the true transition probability function. Then, for all x X,
Q(a, v | x) = q(a, v)1{xa} (x)

(17)

Q(a,  | x) = qA (a)1{x<a} (x)

(18)

for all (a, v) A V and


for all a A.
The type of mis-specification captured by a behavioral equilibrium is one where the agent
believes that (a, v) are independent random variables (a, v). Let = A V , where A and V
parameterize the set of all probability distributions qA over A and qV over V, respectively. Then
Q = {Q : }, where, for all x X,
Q (a, v | x) = qA (a)qV (v)1{xa} (x)
for all (a, v) A V and
Q (a,  | x) = qA (a)1{x<a} (x)
for all a A. It is straightforward to check that, under mild assumptions, the setup is a BMDP
that satisfies Assumptions A1-A5.
The buyers expected profit from choosing price x is

EQ (x, (a, v)) =


1{xa} (x) (v x) Q (d(a, v) | x)
AV

when she believes the true transition is Q . In particular, the correct expected profit is
EQ (x, (a, v)) = Pq (x a) (Eq (v | x a) x)

(19)

while the mis-specified expected profit is


EQ (x, (a, v)) = PA (x a) (EB v x) .

(20)

Thus, the mis-specified buyer does not account for the fact that the price she offers affects the
quality of the objects she trades.
11
If the buyer did not observe ask prices, then she could also have incorrect beliefs about her equilibrium probability
of trading, but we could then refine the set of equilibria by considering the fully perturbed BMDP.

27

For any strategy (X) and parameter , the wKLD can be written as
K(, ) =

(x)EQ(|x) ln

xX

qA (a)
qA (a)
(a,v):x<a

q(a, v)
q(a, v) ln
.
qA (a)qV (v)

(x)

xX

(a,v):x<a

Q(a, v | x)
Q (a, v | x)
q(a, v) ln

Some algebra then yields that any that minimizes the wKLD given satisfies
PA (x a) = Pq (x a)
and
EB v =

(x )Pq (x a)
E (v | x a).
0 )P (x0 a) q
(x
0
q
x X

P
x X

(21)

(22)

In particular, the mis-specified agents beliefs about the probability of trade are correct and her
beliefs about the value of the object is given by the expected value of the objects that she actually
does trade given her strategy . Replacing (21) and (22) into (20), we obtain that a buyer who
chooses strategy believes that her expected profit from choosing price x is given by
!
X
(x )Pq (x a)
P
Pq (x a)
E (v | x a) x .
(23)
0 )P (x0 a) q
(x
0
q
x X

x X

In words, the buyer has correct beliefs about the probability of trade if she chooses price x and she
clearly understand that, if she trades, she must pay her offer x. She believes, however, that her
offer has no impact on the expected value of the object she will trade. This perceived expected
value is given by the weighted expected value of the object conditional on each price offer she makes
according to her strategy .12
We can now use this expression to define an equilibrium of the BMDP (i.e., a behavioral
equilibrium) to be a strategy such that every x in its support maximizes expression (23), which
itself depends on . Esponda (2008) showed that, when a and v are affiliated random variables, a
mis-specified buyer chooses prices that are lower than a correctly-specified buyer, thus exacerbating
the adverse selection problem. For further results and the extension to general classes of games,
the reader is referred to Esponda (2008).
7.2.2

Fully cursed equilibrium

Suppose now that the mis-specified buyer has the same type of mis-specification, i.e., she does not
perceive that a and v might be correlated, but a different information structure. In particular,
12

Esponda (2008) considered a setting were it was sufficient to restrict attention to pure strategies; in that case,
the expected value in (23) reduces to the simpler expression Eq (v | x a), where x is the equilibrium strategy of
the buyer. In the mixed strategy case, the Eq (v | x a)s are appropriately weighted by the likelihood of trading
(hence, observing the value).

28

suppose that the buyer always observes the value of the object, irrespective of whether she trades.
The true transition probability function is
Q(a, v) = q(a, v)

(24)

and the parameterized one is given by


Q (a, v) = qA (a)qV (v)
for all (a, v) A V and = A V . Notice that EQ (x, (a, v)) and EQ (x, (a, v)) are still
provided by (19) and (20), respectively.
For any strategy (X) and parameter , the wKLD is now given by
K(, ) =

(x)EQ(|x) ln

xX

q(a, v) ln

(a,v)AV

Q(a, v | x)
Q (a, v | x)
q(a, v)
,
qA (a)qV (v)

which does not depend on . Some algebra then yields that any that minimizes the wKLD given
satisfies (21) and
EB v = Eq (v).
Thus, an equilibrium of the BMDP is a strategy such that every x in its support maximizes
Pq (x a) (Eq (v) x) ,

(25)

which no longer depends on .13 This case corresponds to a fully cursed equilibrium (Eyster and
Rabin [2005]) and, in this trading context, it was originally discussed by Kagel and Levin [1986]
and Holt and Sherman [1994].
7.2.3

Analogy-based expectation equilibrium

While a behavioral equilibrium and a fully cursed equilibrium can be viewed as capturing the same
type of mis-specification (i.e., failure to understand correlation), an analogy-based expectation
equilibrium captures a richer class of mis-specifications by introducing the notion of an analogy
class. In the context of the trading example, suppose that we partition the set V into k analogy
classes (Vj )j=1,...,k , where j Vj = V and Vi Vj = 0 for all i 6= j. Jehiel [2005], Jehiel and Koessler
[2008] implicitly make the same feedback assumption as Eyster and Rabin (2005); the two concepts
were developed independently, though. Suppose, as in the analysis of cursed equilibrium in Section
7.2.2, that the buyer always observes the realization vt . The true transition probability function is
the same function Q from equation (24), but the mis-specified models are now represented by
Q (a, v) = q (a | v)q (v),
13

Of course, cursed equilibrium is defined as a fixed point in a game with multiple agents; the point is that it is
not a fixed point with a single agent because beliefs do not depend on the agents strategy. A similar remark holds
for the analogy-based expectation equilibrium analyzed in the next section.

29

for all (a, v) A V, where, for every analogy class i = 1, ..., j,


q (a | v) = q (a | v 0 )
for all v, v 0 Vi . Thus, the agent believes (possibly incorrectly) that the distribution of A conditional on v might depend on the analogy class to which v belongs but does not depend on which
element of an analogy class we take. In other words, the buyer believes that (a, v) are independent
conditional on v Vi , for each i = 1, ..., k.
The expected profit function when the model is is now also different and given by
EQ (x, (a, v)) = PA (x a) (E (v | x a) x) .

(26)

For any strategy (X) and parameter , the wKLD is now given by
K(, ) =

(x)EQ(|x) ln

xX

q(a, v) ln

(a,v)AV

Q(a, v | x)
Q (a, v | x)
q(a, v)
,
q (a | v)qV (v)

which does not depend on . Some algebra then yields that any that minimizes the wKLD given
satisfies (21), q (v) = qV (v) for all v V, and, for all i = 1, ..., j and all v Vi ,
P
)q(
v)
vVi q(a | v
P
q (a | v) = qVi (a)
.
v)
vVi q(
Then, (26) becomes
EQ (x, (a, v)) =

k
X

Pq (v Vi ) (Pq (x a | v Vi ) (Eq (v | v Vi ) x)) .

(27)

i=1

An equilibrium of the BMDP is a strategy such that every x in its support maximizes (27),
which does not depend on . This case corresponds to the analogy-based expectation equilibrium
of Jehiel and Koessler [2008]. Notice that in the particular case where the partition over V is trivial,
i.e., k = 1, then (27) reduces to expression (25) and the analogy-based expectation equilibrium is
equivalent to the fully cursed equilibrium. See also Splieger [2011] (Chapter 8) for a discussion of
analogy-based expectation equilibrium in this trading example.
7.2.4

Behavioral equilibrium with analogy classes

One of the benefits of having a unifying framework is that we can easily consider new types of
mis-specifications that are a combination of previous cases. For example, consider the trading
example where feedback coincides with the behavioral equilibrium case analyzed above (i.e., the
buyer only observes realized values when she trades) and where the buyer has the type of misspecification analyzed in an analogy-based expectation equilibrium (i.e., she learns using analogy
classes (Vj )j=1,...,k ).

30

As in the behavioral equilibrium case, let S = A V {} denote the state space. The true
transition probability function Q is once again given by (24) . Then
Q(a, v) = q(a, v)1{xa} (x)
for all (a, v) A V and
Q(a, ) = qA (a)1{x<a} (x)
for all a A.
The mis-specified models are now represented by
Q (a, v) = q (a, v)1{xa} (x)
and
Q (a, ) = q (a, v)1{x<a} (x)
for all (a, v) A V, where, for every analogy class i = 1, ..., j,
q (a | v) = q (a | v 0 )
for all v, v 0 Vi .
In this case, a buyer who chooses strategy might have several beliefs if there is a price x that
she never chooses. This situation was analyzed by Esponda (2008) for the particular case where
each element of V is in a different analogy class. This is the case for the agent who allows for any
correlation between a and v and is, therefore, correctly-specified for all primitives q (A S).14
Here, we consider an arbitrary partition over V but we offer a refinement that leads to unique
beliefs.
Fix a strategy . First, notice that beliefs are unique if all prices are chosen with positive
probability under , simply because the buyer has experience with all possible prices. Suppose
that is not the case. Then we consider any sequence of strategies ( ) with the property that
lim = and that, for each , every price is chosen with positive probability under .
It turns out that we obtain the same limit belief irrespective of the sequence that we take to
approximate . Equivalently, we can modify the information structure and assume that the buyer
always observes the element of the partition Vi that contains v, but she only observes v if she trades.
In either case, we can show that the perceived expected profit from choosing price x for a buyer
who is choosing strategy is given by

!
k
X

X
(x )Pq (x a)

P
Pq (x a | v Vi )
Pq (v Vi | x a)
E
(v
|
x

a,
v

V
)

x
.
q
i
0
0

x0 X (x )Pq (x a)

j=1

x X

(28)
An equilibrium of this BMDP is a strategy such that each x in the support of maximizes expression (28). Notice that expression (28) reduces to the corresponding expression for the behavioral
equilibrium (23) if we take the trivial partition.
14
Esponda (2008) referred to this case as the self-confirming equilibrium case or the case with a sophisticated
buyer.

31

7.3

Search with uncertainty about future job offers

We consider the problem of an infinitely-lived agent that faces the choice of accepting or rejecting
a wage offer. The agent is uncertain of her chances of receiving a job offer in the future and of her
chances of being fired if she accepts employment. The probability that the agent receives an offer
and the probability that she is fired from her job depend on economic fundamentals. We study the
behavior of a mis-specified agent who fails to realize that the chance of future wage offers or the
chance of being fired are related to economic fundamentals. We find that if the chance of receiving
an offer and the chance of being fired are negatively correlated, then a mis-specified agent will be
less selective when accepting wage offers compared to an agent with the correct model.
At the beginning of each period t = 0, 1, ... the agent decides whether to accept or reject a given
wage offer wt . If she accepts the offer, then she earns wt in that period; otherwise, she earns zero.
After she makes her employment decision, an economic fundamental zt+1 is drawn i.i.d. from the
finite set Z according to the probability distribution G. If the agent is employed, then she is fired
with probability (z), where is a vector in [0, 1]|Z| . If the agent is unemployed (either because
she was employed and then fired or because she did not accept employment at the beginning of
the period), then with probability (z), where [0, 1]|Z| , she draws a new wage wt+1 from the
interval [0, 1] according to the absolutely continuous distribution F . With probability 1 (z), the
unemployed agent receives no wage offer, which we conveniently represent by saying she receives
a negative wage offer, wt+1 = w < 0 , which she will of course never accept. The agent will have
to decide whether to accept or reject wt+1 at the beginning of next period. If the agent accepted
employment at wage wt at the beginning of time t and was not fired, then she starts next period
with wage offer wt+1 = wt and will again have to decide whether to quit or remain in her job at
that offer. As usual, the agent maximizes her discounted expected utility, where per period payoffs
are (wt , xt ) = xt wt and [0, 1) is her discount factor.
It is easy to see that this setting constitutes an MDP with state space S = W Z, where
W {w} [0, 1], constraint correspondence (s) = X for all s S, and, for [0, 1]|Z| (why we
index Q on this vector will become clear soon), a true transition probability function Q such that,
for all Borel sets A, all z 0 Z, and all (w, z, x) S X,
(w0 A | z 0 , w, x),
Q (w0 A, z 0 | w, z, x) = G(z 0 )Q
where

(w0 A | z 0 , w, 1) = (1 (z 0 ))1A (w) + (z 0 )(z 0 )


Q

F (dw)
A

+ (z 0 )(1 (z 0 ))1{w} (w)


and

(w0 A | z 0 , w, 0) = (z 0 )
Q

F (dw) + (1 (z 0 ))1{w} (w).


A

We consider the corresponding BMDP where the agent knows the primitives except the vectors
and that determine the probability of being fired and receiving an offer, respectively. The agent
believes, possibly incorrectly, that does not depend on the economic fundamental, i.e., (z) =
for all z Z , where [0, 1] is a parameter she wants to learn. Given the assumption that
is constant, then the agents behavior depends only on the average value of and, therefore, we
32

will simply assume that the agent knows (or its average value). The set of models in the BMDP
is then Q = {Q(,...,) : }, where = [0, 1] and supp(0 ) = . From now on, we denote
Q(,...,) by Q .
The assumption that the economic fundamental is i.i.d. and revealed only after the worker
decides whether to accept an offer is made for simplicity. It guarantees that, irrespective of whether
the agent is mis-specified or knows the true model, her optimal strategy depends only on the wage
offer, and not on the economic fundamental. Thus, we can better compare strategies in the misspecified and correct settings and isolate the effect of the mis-specification.
It is a straightforward exercise to verify that Assumptions A1-A5 are satisfied in this context.
[As usual, with the caveat that we need to extend the assumptions for the case where W R rather
than being a finite set]
The next result characterizes equilibrium for this BMDP.
Proposition 3. A strategy is an equilibrium strategy of the BMDP if and only if it is characterized
by an equilibrium threshold w such that lower offers are rejected and higher offers are accepted,
irrespective of the value of z, i.e., for all z Z, (0 | w, z) = 1 if w < w and (1 | w, z) = 1 if
w > w , where w solves the system of equations:



w (1 + EG []) = (1 EG [])
(w w ) F (dw)
(29)
w>w

and
mX (0)
mX (1) (EG [])
=
E
[]
+
G
mX (0) + mX (1) (EG [])
mX (0) + mX (1) (EG [])

where
mX (0) =



CovG (, )
EG [] +
,
EG []

EG [] (1 F (w ))EG []
(1 F (w )) {EG [] EG []} + EG []

(30)

(31)

and mX (1) = 1 mX (0). Moreover, if CovG (, ) > 0, then there is a unique equilibrium threshold
w .
Proof. See the Appendix.
Equation (29) is the standard equation in search models that characterizes the threshold w as
a function of the parameter (i.e., when the agent believes in the mis-specified model Q ). It
is easy to see that there is a unique threshold w that solves (29) for each ; denote this solution
by w(). Figure 2 plots two examples of w(); notice that w() is increasing because, if the agent
believes in a higher probability of receiving a wage offer, then accepting a current offer becomes
less attractive, and the optimal threshold increases.
As usual in equilibrium, the belief is determined endogenously by the strategy of the agent,
which is characterized by w in this example. Equations (30) and (31) describe how the belief
depends on w . Consider first equation (30). The agent only observes the realization of , i.e.,
whether she receives a wage offer, in cases where she is unemployed. There are two reasons why
the agent can be unemployed. The first reason is that the agent rejected the offer. Since this
decision happens before the new economic fundamental is realized, there is no correlation between
33

this decision and the chance of getting a wage offer. Thus, the mis-specified agent will believe
that the probability of getting a wage offer is given by the true average probability, EG []. This
situation is reflected by the first term in the RHS of (30).
The second reason for being unemployed is that the agent accepted an offer but was then fired.
In this case, the probability of being fired might be correlated with the probability of receiving an
offer, but the agent fails to account for this possibility. If, say, CovG (, ) < 0, so that the agent
is less likely to get an offer in periods in which she is fired, then she will have a more pessimistic
view about the probability of receiving a wage offer relative to the average probability EG [] . The
second term in the RHS of (30) captures this bias.
The weights on the RHS of (30) represent the probability of being unemployed by choice or
due to being fired conditional on the probability of being unemployed, respectively. As described
by (31), these weights depend on the invariant probability of accepting an offer, mX (1). Since this
probability is determined by the agents strategy, then the weights and, therefore, the belief also
endogenously depend on the threshold strategy w . For example, if the agent rejects more offers,
so that w increases, then the weight on being unemployed by choice increases and the bias in
decreases.
Let (w) denote the beliefs that correspond to following threshold strategy w, which is obtained
by replacing (31) into (30). Figure 1 plots this relationship for two cases. In the left panel,
CovG (, ) < 0 and, as explained above, the agent has a negative bias. As w increases, the agent
spends more time unemployed and the bias diminishes; thus, () is increasing. In the right panel,
CovG (, ) > 0 and, therefore, the bias is positive. As w increases, the bias diminishes, which
implies that () is decreasing.
As depicted in Figure 1, any intersection of the functions w() and () is an equilibrium threshold. In the case CovG (, ) < 0, there might be multiple equilibrium thresholds. But in the case
CovG (, ) > 0, the facts that w() is increasing and () is decreasing imply that there is a unique
equilibrium threshold.
Next, we characterize the optimal strategy for the agent who knows the true transition probability function Q.
Proposition 4. The optimal strategy : W Z (X) for the M DP (Q) is essentially unique
and is characterized by the unique threshold wo that solves
(
)
wo (1 + EG []) = (EG [] EG [])

w>wo

(w wo ) F (dw) .

(32)

Proof. See the Appendix.


By comparing equations (29) and (32), we observe that the only difference appears in the term
multiplying the RHS. In the mis-specified case, the term is (1 EG []); in the correct case, the
term is (EG []EG []) = EG [](1EG [])CovG (, ). Thus, the misspecification affects the
optimal threshold in two ways. First, the misspecified agent estimates the mean of incorrectly,
i.e., 6= EG []; second, the misspecified agents ignores the potential correlation between and ,
i.e., CovG (, ). Using these observations, we can now compare the equilibrium strategy of the
34

mis-specified BMDP with the optimal strategy for the MDP with the correct transition probability
function.
< (>)w , for any threshold w that characterizes
Proposition 5. If CovG (, ) < (>)0 then wm
o
m
an equilibrium strategy of the BMDP.

Proof. See the Appendix.


The intuition of the proof relies on the two differences highlighted above between the optimal
solution to the search problem under the true model and under a mis-specified model. For concreteness, suppose that CovG (, ) < 0 and consider an agent with a correct model and an agent
who is mis-specified.
Lets first isolate the effect of mis-perceiving the mean of while assuming that the covariance
term is zero for both agents. As discussed above, the mis-specified agent estimates the mean of
with a downward bias, < EG [] (see equation 30). Therefore, she (incorrectly) believes that, in
expectation, offers arrive with lower probability, thus making the option to reject and wait for for
the possibility of drawing a new wage offer next period less attractive.
Second, lets isolate the effect of mis-perceiving that CovG (, ) < 0, but assuming that both
agents believe in the same mean of . If an agent accepts a wage offer, then she can get fired,
in which case she would like to know the probability of receiving another wage offer upon being
fired. If an agent rejects an offer, then she would also like to know the probability of receiving
another wage offer. The mis-specified agent believes that the probability of receiving a wage offer
is independent of the event of being fired, but the agent with correct beliefs understands that,
because CovG (, ) < 0, then she is less likely to receive an offer if she is fired. This situation
makes accepting an offer relatively less attractive for the agent with correct beliefs. In other words,
the mis-specified agent will be more willing to accept a given wage offer.
Both the mean and the covariance effects discussed move in the same direction: the mis-specified
agent finds accepting an offer more attractive than the agent with correct beliefs, and, therefore, her
equilibrium threshold is lower than the optimal threshold of an agent with correct beliefs. When
CovG (, ) > 0, a similar argument shows that the situation reverses.

35

Appendix

8.1

Additional results used in the proofs

Lemma 12. Let (n , mn )n be a sequence with the property that, for each n, mn is an invariant probability distribution
of the transition kernel given strategy n , Mn . If limn (n , mn ) = (, m), then m is an invariant probability
distribution of the transition kernel M .
Proof. Notice that
km M mk km mn k + kmn Mn mn k + kMn mn M mk
km mn k + kMn mn M mn k + kM mn M mk
= km mn k + k(Mn M )mn k + kM (mn m)k ,
where, as n , the first and last terms of the last line go to zero by assumption and the middle term also goes
to zero because (mn )n is bounded and M is linear, hence, continuous in (see equation 4). Thus, m satisfies
m M m = 0 and is, therefore, an invariant distribution of M .

Lemma 13. If a sequence of transition probability functions (Q ) converges to Q , then lim VQ = VQ , where,
VQ and VQ are the value functions for the perturbed and unperturbed MDPs, respectively.
Proof. We divide the proof into four steps.
STEP 0. We introduce some notation. Let


,Q (s)
(s, f (s, ), s0 )Q (ds0 |s, f (s, )) PV (d|s)
V
S


X
=
(s, x, s0 )Q (ds0 |s, x) PV (d|s)
xX

xX

{:f (s,)=x}

X 

(s, x, s0 )Q (ds0 |s, x)


{:f (s,)=x}

X 
xX

PV (d|s)

(s, x, s )Q (ds |s, x) (x|s).

o
n
where f (s, ) = arg maxx(s) S (s, x, s0 ) + (x) + VQ (s0 ) Q (ds0 |s, x) (which is well-defined a.s-PV ). Also, let

P
e (s) xX {:f (s,)=x} (x)PV (d|s). Finally, let L ,Q : CB (S) CB (S) where CB is the space of bounded
and continuous functions and
L ,Q [H](s) EQ (|s, ) [H] .
Where
Q (|s, )

X
xX

{:f (s,)=x}

Q (|s, x)PV (d|s).

Q (|s, x) (x | s)

xX

STEP 1. Observe that L ,Q is an element of the Banach algebra endowed with norm ||L|| supH

||L[H]||L (S)
||H||L (S)

(see Lax [2002]). This implies that algebraic properties enjoyed by algebras, such as multiplication (understood as
composition) of elements, can be applied to L ,Q . Moreover, it is easy to see that ||L ,Q || 1.
Moreover, it turns out (invoke Presno-Pouzo Lemma F.5) that, for [0, 1)

t Lt ,Q = (I L ,Q )1

t=0

36

also belong the same Banach algebra. In fact, ||(I L ,Q )1 || 1. Moreover, for any g L (S) and any s S,
(, Q) 7 (I L,Q )1 [g](s) R is continuous (under the product topology) provided (, Q) 7 L,Q [g](s) is.
To show this last result. Observe that, since H CB , Q 7 EQ(|s,) [H] is continuous
Punder the weak topology.
Since |X| < , then 7 EQ(|s,) [H] is trivially continuous . This shows that L,Q [g] = xX EQ(|s,) [H] is in fact
continuous (under the product topology).
STEP 2. By definition

 n
o
VQ (s) =
(s, f (s, ), s0 ) + (f (s, )) + VQ (s0 ) Q (ds0 |s, f (s, )) PV (d|s)
V

= ,Q (s) + e (s) + L ,Q [VQ ](s).


Hence, by our observations in step 1,
VQ (s) = (I L ,Q )1 [ ,Q + e ](s).
Suppose that lim || ,Q + e ,Q ||L = 0 (where ,Q (s) is defined similar to ,Q (s)), then, since
||(I L ,Q )1 || 1,



lim |VQ (s) (I L,Q )1 [,Q ](s)| lim (I L ,Q )1 (I L,Q )1 [,Q ](s) .

The term in the RHS vanishes due to our observations in Step 1 and the fact that , Q converges to , Q. This
means that, for any s S, lim |VQ (s) (I L,Q )1 [,Q ](s)| = 0.
We now argue that lim || ,Q + e ,Q ||L = 0. This follows from by assumption over vanishing (PV )
and the fact that (, Q) 7 ,Q (s) is continuous.
STEP 3. Let W (I L,Q )1 [,Q ]. By definition of VQ it follows that, for all x X,

 n
o
(s, x, s0 ) + (x) + VQ (s0 ) Q (ds0 |s, x) PV (d|s).
VQ (s)
V

Taking limits, by the result in step 2, and the fact that Q converges to Q , it follows that



W (s)
(s, x, s0 ) + W (s0 ) Q (ds0 |s, x)
S

for all x X. Also, by definition of W , it follows that



X 

W (s) =
(s, x, s0 ) + W (s0 ) Q (ds0 | s, x) (x | s).
xX

Hence W is in fact equal to VQ (Note: this also proves that is optimal for Q ).

8.2

Proofs in Section 4

Proof of Proposition 1. Fix any m (S X). Suppose that , 0 Q (m), where 6= 0 , and that 00 =
+ (1 )0 for some (0, 1). Then
X


KQ (m, 00 ) =
EQ(|s,x) ln Q00 (S 0 |s, x) m(s, x) + C
(s,x)SX

EQ(|s,x) [ln (Q ( | s, x) + (1 )Q0 ( | s, x))] m(s, x) + C

(s,x)SX


KQ (m, ) + (1 )KQ (m, 0 ) ,

(33)

where C includes the terms that do not depend on , the second line follows because Q is linear, and the third line
follows because ln() is a concave function. In fact, because ln() is strictly concave, it follows that (33) holds with
equality if and only if Q ( | s, x) = Q0 ( | s, x) a.s.-Q( | s, x) for all (s, x) S X. Moreover, the assumption that
, 0 Q (m) implies that equation (33) must hold with equality (otherwise, 00 would be closer to the true than
and 0 ). By assumption (iii), it follows that Q ( | s, x) = Q0 ( | s, x) for all (s, x) S X, so that Assumption A5
is satisfied. 

37

8.3

Proofs in Section 5

Proof of Lemma 2. Follows from standard arguments invoking Contraction Mapping Theorem; see Stokey et al.
[1989] 
Proof of Lemma 4. It follows from the definition of , the fact that has unbounded support but finite first
moments. 
Proof of Lemma 4. Throughout this proof, we use e to denote the vector (s, x). By the results of Meyn and
Tweedie [2005] Chapter 4, it suffices to show that for any e, e0 (S X)2 , there exists an n such that Mn (e|e0 ) > 0
where Mn = M M . By assumption A4, there exist a path (e0 , e1 , ..., en ) such that xi (si ) for all i =
0, 1, ..., n and
Q(s | en )Q(sn | en1 )...Q(s1 | e0 ) > 0.
Since (x|s) > 0, the previous display implies that
X
X
M2 (e2 |e0 )
M (e2 | e1 )M (e1 | e0 ) =
(x2 | s2 )Q(s2 | e1 )(x1 | s1 )Q(s1 | e0 ) > 0.
e1

(s1 ,x1 )

By iterating, we show that M (e|e0 ) > 0 as desired. 

8.4

Proofs in Section 6

Proof of Lemma 8. Let W(S V ()) {W : S V ()) R : sups,S() V |W (s, , )|PV (d|s) <

}. It is not difficult to see that this is a Banach space under the norm || ||W sups,S() V | |PV (d|s) < .
15

For any H W(S V ()), let





T [H](s, , ) = max
(s, x, s0 ) + (x) + H(s0 , 0 , 0 )PV (d 0 |s0 ) Q(ds0 |x, s).
x(s)

Observe that, if ||H||W C < , then

||T [H]||W C + max


s

max
V

i=1,...,|X|

|i |PV (d|s) + C.

C + maxs V maxi=1,...,|X| |i |PV (d|s) is a valid bound. Since maxs V maxi=1,...,|X| |i |PV (d|s)

is bounded (because V |i |PV (di |s) < and


Thus, C

1
1

max
V i=1,...,|X|

|i |PV (d|s)

|X|
X
i=1

|i |PV (di |s)

). It follows that T maps W(S V ()) onto itself.


Since W(S V ()) is a Banach space, it is complete. We now show that TQ is a contraction. For this we
use a slight modification of Blackwell sufficient conditions. For any w, u W(S V ()) observe that

{w(s, , ) + ||u w||W } PV (d|s) = {w(s, , )} PV (d|s) + ||u w||W


V
V

{w(s, , )} PV (d|s) +
{u(s, , ) w(s, , )}PV (d|s)

V
V

= {u(s, , )} PV (d|s).
V
15

The fact that || ||W is a norm and linearity of W(S V ()) is trivial. Completeness is shown as follows.
First, observe that W(S V ()) is closed under || ||W . Second, it is easy to see that W(S V ())
L1 (S V ()) for, say, the measure generated by PV , the counting measure over S and 0 . Consider a sequence
(wn )n that is Cauchy under || ||W . Then it must be Cauchy under || ||L1 , and since L1 is complete, this sequence
has a limit in L1 . However, since W(S V ()) is closed, the limit ought to be in W(S V ()).

38

Also, note that, if

{w(s0 , 0 , 0 ) + a}PV (d 0 |s0 )

u(s0 , 0 , 0 )PV (d 0 |s0 ) (a a scalar), then

T [w](s, , ) + a = T [w + a](s, , ) T [u](s, , ).


Putting both results together, with a = ||u w||W , it follows
T [u](s, , ) T [w](s, , ) ||u w||W .
By interchanging u and w, taking absolute values and integrating w.r.t PV , it follows that
||T [u] T [w]||W ||u w||W .

Hence, by the contraction mapping theorem, there exists a unique WQ


W(S V ()) that is a fixed point of

0 0
TQ . Moreover, maxs V |WQ (s, , )|PV (d |s ).
Continuity: TBW

Proof of Lemma 10. By lemma 14,

lim

dm ()t (d) = 0

. This implies the desired result. If not, there exists an open


a.s. P0 ,f over H , where dm () = inf
|| ||

m
Q (m) such thatt ( \ U
) > c > 0 for all t. So there exists a set exists a set, C, which we can take to be
set U
closed, such that t (C) > c > 0. However, since dm () > d > 0 for any C, this violates the display above. 
Proof of Lemma 11. Fix an arbitrary h H . It suffices to show that any subsequence of (t (h ))t has a
convergent subsequent which limit is the optimal strategy for a fully perturbed MDP(Qm ) (and does not depend on
the particular h nor the particular subsequence).
Take (t(j) (h ))j as the subsequence. () is compact, thus for each subsequence of (t(j) )t(j) , there exists a
subsubsequence, (t(j) )j (abuse of notation) that converges (under the weak topology); we denote its limit as h
(this limit may depend on the subsequence and the underlying history h ).
By definition of t (h ), it follows that

1{f (s,,t(j) )=x} PV (d|s).


t(j) (h ) (t(j) )(x|s) =
V

By standard arguments it can be shown that f is continuous , and thus limt ||t(j) (h )(h )|| = 0. It remains
to show that (h ) is the optimal strategy for a fully perturbed MDP(Qm ), i.e.,

(h )(x | s) = PV : f(s, ) = x | s
where f is such that
V )
f(s, ) = arg max F (s, x, , Q,
Q
x(s)

where
(s, x, , Q, H) 7 F (s, x, Q, H)

(s, x, s0 ) + (x) +

and

VQ (s, )

TQ [VQ ](s, )


H(s0 , 0 )PV (d 0 |s0 ) Q(ds0 |x, s).

V ).
maxx(s) F (s, x, , Q,
Q

By virtue of the Dominated Convergence theorem and the fact that PV is absolutely continuous with respect to
Lebesgue, it suffices to show that for any  > 0, there exists a J() > 1 such that


f (s, , t(j) ) f(s, ) < 
for all j J(), where f (s, , t(j) ) = arg maxx(s) F (s, x, , Qt (j) , W (, , t(j) )) (where W is the fixed point of
the Bellman operator T defined below). By the Theorem of the Maximum, it suffices to show that for any  > 0,
there exists a J() > 1 such that



V ) < 
max F (s, x, , Qt(j) , W (, , t(j) )) F (s, x, , Q,
Q
s,x

39

for all j > J(). We now establish this fact.


Observe that


V )
F (s, x, , Qt(j) , W (, , t(j) )) F (s, x, , Q,
Q




V ) + F (s, x, , Q , W (, , t(j) )) F (s, x, , Q , V )
< F (s, x, , Qt(j) , VQ ) F (s, x, , Q,
t(j)
t(j)
Q
Q
At(j) (s, , x) + Bt(j) (s, , x).
It follows that
At(j) (s, , x)




0 |x, s) Q (ds0 |x, s) .
(s, x, s0 ) + (x) + V (s0 , 0 )P (d 0 |s0 ) Q(ds
t(j)


Q
V
S





Sincet(j) h , then Qt(j) Qh 0 because 7 Q (s0 |x, s) is continuous and bounded and s, x are
= 1. 16 This last statement may fail for the particular h since it only
finite. Moreover, it also follows that h ()

holds a.s. on H ; if this is the case subtract the set of measure zero from H . The previous statements imply
Hence, this, the fact that quantities are bounded and that s, x are discrete imply that At(j) vanishes.
Qh = Q.
Regarding Bt(j) , observe that
 n


o


Bt(j) (s, , x)
W (s0 , 0 , t(j) ) VQ (s0 , 0 ) PV (d 0 |s0 ) Qt(j) (ds0 |x, s) .
S

It is thus sufficient to bound ||W (, , t(j) ) VQ ||L (SV) . For this let

T [H](s, , ) = max
x(s)

By definition of W and

VQ




(s, x, s0 ) + (x) + H(s0 , 0 , B(s0 , x, s, ))PV (d 0 |s0 ) Q (ds0 |x, s).


S

and the fact that TQ is a contraction, it follows that

||W (, , t(j) ) VQ ||L (SV) = ||T [W ](, , t(j) ) TQ [W (, , t(j) )]||L (SV)
+||W (, , t(j) ) VQ ||L (SV)
and



T [W ](s, , t(j) ) TQ [W (, , t(j) )](s, )





W (, , t(j) )) .
max F (s, x, , Qt(j) , W (, , t(j) )) F (s, x, , Q,
x(s)

Similar calculation to those used to bound At(j) imply that the RHS vanishes, and thus imply that Bt(j) does too. 

8.4.1

Supplementary lemmas

The next lemma is an extension of Berk and Bunke-Milhaud to allow for actions affecting the conditional probability
of the process (st )t .
Lemma 14. Given aP
policy function f , suppose that
there exists a H H such that P0 ,f (H ) > 0 and

t1
1


limt maxx,sXS t =0 1{x,s} (x , s ) m(x, s) = 0 for all h H . Then,

lim
d ()t (d) = 0
t

0 ,f

a.s.P
for all t.

and (t )t is defined recursively, as t+1 = B(st+1 , st , f (st , t , t ), t )


over H , where dm () = inf
||||

Q (m)

16

Take a sequence (Un )n such that Un Un+1 Q (m), n Un = Q (m) and open. By continuity of probability
measures limn h (Un ) = h (Q (m)). However, the LHS is always equal to 1, because t(j) h .

40

Proof. Note that 7 dm is u.h.c. (see lemma 15). Therefore, for any  > 0, { : dm () } is compact. This

result and continuity of 7 KQ (m, ) (see lemma 16(i)), imply that inf : dm () KQ (m, ) = KQ (m, m, ) > Km

KQ (m, Q (m)). Let Km


+ KQ (m, m, ) > 0. Also, let > 0 be chosen such that KQ (m, ) Km
+ 0.25 for
all { : dm () < } (such always exists by continuity of 7 KQ (m, )).
We note that, for all  > 0,

d ()Zt ()0 (d)


At ()
{ : dm ()} m

dm ()t (d)  +
+
,
B
Z
()
(d)
t ()
t
0

{ : dm ()}
where Zt ()

Q (s |s 1 ,x 1 )
=1 Q(s |s 1 ,x 1 )

Qt


n P
o
Q(s |s
,x 1 )
= exp t =1 log Q (s |s1
. It suffices to show that
1 ,x 1 )

lim sup {exp {t (Km


+ 0.5)} At ()} = 0

(34)

and

lim inf {exp {t (Km


+ 0.5)} Bt ()} = Const > 0.

(35)

a.s.-P0 ,f .
Regarding equation 34, it suffices to show that
(
lim

sup

t { : d ()}
m

(Km

+ 0.5) t

t
X


log

=1

Q(s |x 1 , s 1 )
Q (s |x 1 , s 1 )

)
const < 0,

a.s.-P0 ,f . To show this, note that


t1

t
X


log

=1

Q(s |x 1 , s 1 )
Q (s |x 1 , s 1 )
(

(x 1 ,s 1 )=(x,s)XS




) Pt1
t1
=0 log QQ(s(s |x,s)
1{x,s} (x , s )
1X
|x,s)
.
1{x,s} (x , s )
Pt1

t =0
=0 1{x,s} (x , s )

Pt1
Pt1
1
Since t1
=0 1{x,s} (x , s ) m(h) and m(h) > 0, then
=0 1{x,s} (x , s ) . An adaptation of the ULLN
[TBW], shows that


Pt1
Q(s |x,s)

 
=0 log Q (s |x,s) 1{x,s} (x , s )
Q(S 0 |x, s)
lim
= EQ(|x,s) log
Pt1
t
Q (S 0 |x, s)
=0 1{x,s} (x , s )
uniformly over { : d () } and a.s.-P0 ,f . Moreover, by assumption
lim

t1
1X
1{x,s} (x , s ) = m(x, s)
t =0

a.s.-P0 ,f . Therefore
(
lim

sup

t { : d ()}
m

Km

+ 0.5 t

t
X


log

=1

sup

Q(s |x 1 , s 1 )
Q (s |x 1 , s 1 )

)

{Km
+ 0.5 KQ (m, )}

{ : dm ()}

= Km
+ 0.5

inf

{ : dm ()}

KQ (m, ).

By our choice of , the RHS is less than Km


+ 0.5 KQ (m, m, ) = 0.5 < 0. The desired result thus follows.
Regarding equation 35, note that by Fatous lemma and some algebra it suffices to show that

lim inf exp {t (Km


+ 0.5)} Zt () = > 0
t

41

(pointwise on { : dm () }). Equivalently,


lim inf
t

Km

+ 0.5 t

t
X


log

=1

Q(s |x 1 , s 1 )
Q (s |x 1 , s 1 )

!
> 0.

Similar arguments to those used to show the validity of equation 34, it follows that,

!
t
X
Q(s |x 1 , s 1 )

lim inf Km + 0.5 t


log
= Km
+ 0.5 KQ (m, )
t
Q (s |x 1 , s 1 )
=1
(pointwise on { : dm () }). By our choice of , the RHS is greater than 0.25 and our desired result
follows.

Lemma 15. Then 7 d () is continuous (d is defined in lemma 14).


Proof. (Proof of Lemma 15) For any m, Q (m) is compact (see Lemma 16). Also || || is continuous for all .
So, by the Theorem of the Maximum, 7 min
|| || is an compact-valued, u.h.c. correspondence.

Q (m)

Lemma 16. (i) for any m (H), 7 KQ (m, ) is continuous and finite. Moreover,
Q (m) arg min KQ (m, )

is non-empty, compact-valued and u.h.c. (as a mapping m 7 Q (m)).


Proof. (Proof of Lemma 16) (i) It follows that
X


KQ (m, ) =
EQ(|x,s) ln Q(S 0 |x, s) m(x, s)
(x,s)XS



EQ(|x,s) ln Q (S 0 |x, s) m(x, s)

(x,s)XS

so it suffices to show that 7 (x,s)XS EQ(|x,s) [ln (Q (S 0 |x, s))] m(x, s) is continuous. Since |H| < , it suffices to
show 7 EQ(|x,s) [ln (Q (S 0 |x, s))] for any (s, x). Since |S| < , it is sufficient to show that 7 ln (Q (s0 |x, s)) for
any (s0 , s, x). Under assumption A4(i), we actually only need to show this for any (s0 , s, x) such that Q (s0 |x, s) > 0;
hence 7 ln (Q (s0 |x, s)) R is well-defined. By assumption A4(ii), 7 ln (Q (s0 |x, s)) is continuous for any
(s0 , s, x) such that Q (s0 |x, s) > 0, and the desired result follows.
It is easy to see that, under assumption A4, (, m) 7 KQ (m, ) is continuous under the product topology .
This result, the fact that is compact and the theorem of the Maximum, ensure that m 7 Q (m) is non-empty,
compact-valued and u.h.c.
P

Lemma 17. Suppose, > 0 and let m the invariant probability corresponding to M . Then 7 m is a function
and is continuous.
Proof. The fact that the mapping 7 m is function follows from the fact that there is a unique invariant distribution
of M when > 0; see lemma 5. In order to establish continuity, consider a sequence (n )n such that ||n || 0,
then
||qn q || ||Mn qn M qn || + ||M (qn q )||.
Since M is a contraction , the second term in the RHS is bounded by ||qn q || with [0, 1). Thus, it suffices
to establish that ||Mn M || 0, but this follows trivially by construction of M .

42

8.5

Proofs in Section 7

Proof of Proposition 3. We prove the result in several steps.


Claim 1. For any m (S X) with marginal mX (X), Q (m) is a singleton and equal to


mX (0)
mX (1) (EG [])
CovG (, )
Q (m) =
EG [] +
EG [] +
mX (0) + mX (1) (EG [])
mX (0) + mX (1) (EG [])
EG []
CovG (, )mX (1)
,
= EG [] +
mX (0) + (EG []) mX (1)
P
where EG [] zZ (z)(z)G(z), etc.
Proof. Observe that, if x = 1, then, for any A W Borel and any z 0 ,
 
 X


1
Q({z 0 } A|s, 1)
0
0
0
0
EQ(|s,0) ln
=
(z
)
(z
)
log
()
F
(dw
)
+
(1

(z
))
log
(1

)
G(z 0 )
Q ({z 0 } A|s, 1)
0
0
z Z
X
+
(1 (z 0 )) {log (1)} G(z 0 )
z 0 Z

=EG [] log () + (EG [] EG []) log (1 ) .


Similarly, if x = 1,
 

Q({z 0 } A|s, 1)
EQ(|s,0) ln
= EG [] log () + (1 EG []) log (1 ) .
Q ({z 0 } A|s, 1)
It is easy to see that, over [0, 1], these are strictly convex functions of , so a convex combination also is. Thus
Q (m) is a singleton for any m. The first order conditions yield
1
1
{EG []mX (1) + EG []mX (0)} =
{(EG [] EG []) mX (1) + (1 EG [])mX (0)} .

1
Thus
Q (m) = Q (m) =

EG []mX (1) + EG []mX (0)


.
(EG []) mX (1) + mX (0)

The desired results follows from some algebra and from standard expression for the covariance.

Claim 2. For each , the optimal strategy : W Z (X) for the M DP (Q ) is essentially unique and is
characterized by a unique threshold w such that lower offers are rejected and higher offers are accepted, irrespective
of the value of z, i.e., for all z Z, (0 | w, z) = 1 if w < w and (1 | w, z) = 1 if w > w , where w is the unique
solution to
(
)
w : w (1 + EG []) = (1 EG [])

w>w

(w w ) F (dw) .

Proof. Let $L(z,w)$ be the value of an employed agent with wage $w$, before learning whether she has been fired. It follows that

$$L(z,w) = \lambda(z) U(z) + (1-\lambda(z))\left[w + \beta \int_Z L(z',w)\, G(dz')\right],$$

where $U(z)$ is the value of being unemployed, before receiving a wage offer:

$$U(z) = (1-\theta)\,\beta \int_Z U(z')G(dz') + \theta \int \max\left\{w + \beta \int_Z L(z',w)G(dz'),\ \beta \int_Z U(z')G(dz')\right\} F(dw).$$

Note that in this case $U(z) = U$ does not depend on $z$, since under $Q_\theta$ offers arrive with constant probability $\theta$. It is easy to see that

$$\int_Z L(z,w)\, G(dz) = \frac{E_G[\lambda]\, U + (1 - E_G[\lambda])\, w}{1 - \beta(1 - E_G[\lambda])}.$$

So

$$L(z,w) = \lambda(z) U + (1-\lambda(z)) w + \frac{(1-\lambda(z))\,\beta E_G[\lambda]\, U}{1-\beta(1-E_G[\lambda])} + \frac{(1-\lambda(z))\,\beta(1-E_G[\lambda])\, w}{1-\beta(1-E_G[\lambda])} = U\, \frac{\lambda(z)(1-\beta) + \beta E_G[\lambda]}{1-\beta(1-E_G[\lambda])} + w\, \frac{1-\lambda(z)}{1-\beta(1-E_G[\lambda])}.$$

In particular, $w + \beta\int_Z L(z',w)G(dz') = \frac{w}{1-\beta(1-E_G[\lambda])} + \frac{\beta E_G[\lambda]\, U}{1-\beta(1-E_G[\lambda])}$. This last equation implies

$$U = (1-\theta)\beta U + \theta \int \max\left\{\frac{w}{1-\beta(1-E_G[\lambda])} + \frac{\beta E_G[\lambda]\, U}{1-\beta(1-E_G[\lambda])},\ \beta U\right\} F(dw)$$
$$= (1-\theta)\beta U + \theta\beta U F(\bar w^*) + \frac{\theta\beta E_G[\lambda](1-F(\bar w^*))\, U}{1-\beta(1-E_G[\lambda])} + \frac{\theta \int_{w > \bar w^*} w\, F(dw)}{1-\beta(1-E_G[\lambda])},$$

where $\bar w^*$ denotes the optimal threshold derived below. Solving for $U$, after some algebra,

$$U = \frac{\theta \int_{w > \bar w^*} w\, F(dw)}{(1-\beta)\left[(1-\beta+\beta E_G[\lambda]) + \beta\theta(1-E_G[\lambda])(1-F(\bar w^*))\right]}.$$

From the previous calculations, it is easy to see that $L$ is increasing in $w$, so the agent accepts (rejects) offers with $w > (<)\ \bar w^*$, where $\bar w^*$ is given by

$$\bar w^*:\quad \bar w^* + \beta \int_Z L(z',\bar w^*)\, G(dz') = \beta U \iff \bar w^* = \beta U (1-E_G[\lambda])(1-\beta).$$

Replacing $U$ by the value we obtained above, it follows that

$$\bar w^* \left\{(1-\beta+\beta E_G[\lambda]) + \beta\theta(1-E_G[\lambda])(1-F(\bar w^*))\right\} = \beta\theta(1-E_G[\lambda]) \int_{w > \bar w^*} w\, F(dw),$$

which implies

$$\bar w^*\,(1-\beta+\beta E_G[\lambda]) = \beta\theta(1-E_G[\lambda]) \int_{w > \bar w^*} (w - \bar w^*)\, F(dw).\ ∎$$
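For a concrete instance of the threshold equation, take $F$ to be Uniform[0,1], so that $\int_{w>\bar w}(w-\bar w)F(dw) = (1-\bar w)^2/2$. The Python sketch below solves the equation by bisection; the values of beta, theta and E_G[lambda] are illustrative.

from scipy.optimize import brentq

beta, theta, El = 0.95, 0.5, 0.38     # illustrative values; El = E_G[lambda]

def gap(w):
    lhs = w * (1 - beta + beta * El)
    rhs = beta * theta * (1 - El) * (1 - w) ** 2 / 2
    return lhs - rhs

wbar = brentq(gap, 0.0, 1.0)          # gap(0) < 0 < gap(1): unique root
print("subjective reservation wage:", round(wbar, 4))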

Claim 3. Let $\sigma$ be a strategy characterized by a threshold $\bar w$. Then there is a unique invariant distribution $m^\sigma$ of the transition kernel $M^\sigma$, and it satisfies

$$m_X(0) = \frac{E_G[\lambda] - (1-F(\bar w))\, E_G[\alpha\lambda]}{(1-F(\bar w))\left\{E_G[\alpha] - E_G[\alpha\lambda]\right\} + E_G[\lambda]}$$

and $m_X(1) = 1 - m_X(0)$.


Proof. Observe that, for any $z'$ and Borel $A$,

$$m'(z',A,x') = \int_{S \times X} \sigma(x'|s')\, Q(z',A\,|\,s,x)\, m(ds,dx),$$

where $(z',A,x')$ is just notation for the set $\{z'\} \times A \times \{x'\}$, and

$$Q(z',A\,|\,s,x) = \Pr\{w' \in A\,|\,z',s,x\}\, G(z'),$$

with

$$\Pr\{w' \in A\,|\,z',s,0\} = \Pr\{w' \in A\,|\,z',0\} = \begin{cases} \int_A F(dw') & \text{w/ prob. } \alpha(z') \\ 1\{w \in A\} & \text{w/ prob. } 1-\alpha(z') \end{cases}$$

and

$$\Pr\{w' \in A\,|\,z',s,1\} = \Pr\{w' \in A\,|\,z',w,1\} = \begin{cases} \int_A F(dw') & \text{w/ prob. } \alpha(z')\lambda(z') \\ \text{no offer in hand} & \text{w/ prob. } (1-\alpha(z'))\lambda(z') \\ 1\{w \in A\} & \text{w/ prob. } 1-\lambda(z'), \end{cases}$$

where, in the middle branch, the agent is fired without receiving a new offer, so that she is unemployed next period with probability one. Also $\sigma(1|s) = \sigma(1|w) = 1\{w > \bar w\}$. Hence, for $x' = 1$,

$$m'(z',W,1) = m'(z', \{w' > \bar w\}, 1),$$

and similarly for $x' = 0$. Thus $m_X(1) = \int_Z m(dz', \{w' > \bar w\}, 1)$ and $m_X(0) = \int_Z m(dz', \{w' < \bar w\}, 0)$ ($w' = \bar w$ occurs with probability zero, so it can be ignored). It thus follows that

$$m_X(1) = \int_Z \int_{S \times X} \Pr\{w' > \bar w\,|\,z',s,x\}\, G(dz')\, m(ds,dx)$$
$$= \int_Z \alpha(z')\, G(dz')\, (1-F(\bar w))\, m_X(0) + \int_Z \alpha(z')\lambda(z')\, (1-F(\bar w))\, G(dz')\, m_X(1) + \int_Z (1-\lambda(z'))\, G(dz') \int 1\{w > \bar w\}\, m(dw,1),$$

where the last line uses the fact that, conditional on $x = 0$, the retained offer satisfies $1\{w > \bar w\} = 0$ $m$-almost surely (an unemployed agent holds an offer she has rejected). Observe that $\int 1\{w > \bar w\}\, m(dw,1) = m(\{w > \bar w\},1) = m_X(1)$ by our previous observation. Thus

$$m_X(1) = E_G[\alpha](1-F(\bar w))\, m_X(0) + \left\{E_G[\alpha\lambda](1-F(\bar w)) + (1 - E_G[\lambda])\right\} m_X(1).$$

Therefore, since $m_X(0) = 1 - m_X(1)$,

$$m_X(1) = \frac{E_G[\alpha](1-F(\bar w))}{\left\{E_G[\alpha] - E_G[\alpha\lambda]\right\}(1-F(\bar w)) + E_G[\lambda]}.\ ∎$$
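The invariant marginal in Claim 3 can be verified by simulating the employment chain directly. In the sketch below the primitives and the (exogenously fixed) threshold wbar are made-up, and F is Uniform[0,1]; the long-run employment frequency should match the closed form.

import numpy as np

rng = np.random.default_rng(1)
G, alpha, lam = np.array([0.4, 0.6]), np.array([0.2, 0.7]), np.array([0.5, 0.3])
E = lambda f: float(np.dot(f, G))
wbar, T = 0.45, 200_000

x, employed = 0, 0
for _ in range(T):
    z = rng.choice(2, p=G)                   # z' ~ G, i.i.d.
    offer = rng.random() < alpha[z]          # does a new offer arrive?
    accept = offer and rng.random() > wbar   # offer drawn from U[0,1]
    if x == 0:
        x = 1 if accept else 0
    elif rng.random() < lam[z]:              # employed and fired
        x = 1 if accept else 0
    employed += x

m1 = E(alpha) * (1 - wbar) / ((E(alpha) - E(alpha * lam)) * (1 - wbar) + E(lam))
print(employed / T, m1)                      # the two numbers should be close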

The proof of the first part of Proposition 3 follows immediately from the definition of equilibrium and Claims 1, 2, and 3. We conclude by showing that the equilibrium threshold is unique if $Cov_G(\alpha,\lambda) > 0$. By definition, it suffices to show that there is a unique pair $(\bar w^*, \theta^*)$ that solves equations 29 and 30. We prove this by showing that (i) $\bar w^*$ is increasing as a function of $\theta^*$ and (ii) $\theta^*$ is decreasing as a function of $\bar w^*$. Part (i) is trivial. Regarding part (ii), observe that

$$\theta^* = \gamma(\bar w^*)\, E_G[\alpha] + (1 - \gamma(\bar w^*)) \left(E_G[\alpha] + \frac{Cov_G(\alpha,\lambda)}{E_G[\lambda]}\right),$$

where

$$\gamma(\bar w^*) \equiv \frac{E_G[\lambda] - (1 - F(\bar w^*))\, E_G[\alpha\lambda]}{E_G[\lambda] - (1 - F(\bar w^*))\, E_G[\alpha\lambda] + E_G[\lambda](1 - F(\bar w^*))\, E_G[\alpha]}.$$

Since $Cov_G(\alpha,\lambda) > 0$, then $E_G[\alpha] + Cov_G(\alpha,\lambda)/E_G[\lambda] > E_G[\alpha]$, so it suffices to show that $\bar w^* \mapsto \gamma(\bar w^*)$ is increasing. After some algebra,

$$\gamma(\bar w^*) = \frac{E_G[\lambda] - E_G[\alpha\lambda] + E_G[\alpha\lambda]\, F(\bar w^*)}{E_G[\lambda] - E_G[\alpha\lambda] + F(\bar w^*)\left\{E_G[\alpha\lambda] - E_G[\lambda]E_G[\alpha]\right\} + E_G[\lambda]E_G[\alpha]} = \frac{E_G[\lambda] - E_G[\alpha\lambda] + E_G[\alpha\lambda]\, F(\bar w^*)}{E_G[\lambda] - E_G[\alpha\lambda] + F(\bar w^*)\, Cov_G(\alpha,\lambda) + E_G[\lambda]E_G[\alpha]}.$$

It is obvious that $F(\bar w^*)$ is increasing in $\bar w^*$; so it suffices to show that

$$t \mapsto \phi(t) \equiv \frac{E_G[\lambda] - E_G[\alpha\lambda] + E_G[\alpha\lambda]\, t}{E_G[\lambda] - E_G[\alpha\lambda] + t\, Cov_G(\alpha,\lambda) + E_G[\lambda]E_G[\alpha]}$$

is increasing. It follows that

$$\frac{d\phi(t)}{dt} = \frac{E_G[\alpha\lambda]\left(E_G[\lambda] - E_G[\alpha\lambda] + t\, Cov_G(\alpha,\lambda) + E_G[\lambda]E_G[\alpha]\right) - \left(E_G[\lambda] - E_G[\alpha\lambda] + E_G[\alpha\lambda]\, t\right) Cov_G(\alpha,\lambda)}{\left(E_G[\lambda] - E_G[\alpha\lambda] + t\, Cov_G(\alpha,\lambda) + E_G[\lambda]E_G[\alpha]\right)^2}.$$

Since we are only concerned with the sign, it suffices to study the numerator. This expression simplifies to

$$(E_G[\lambda] - E_G[\alpha\lambda])\left(E_G[\alpha\lambda] - Cov_G(\alpha,\lambda)\right) + E_G[\alpha\lambda]E_G[\lambda]E_G[\alpha] = (E_G[\lambda] - E_G[\alpha\lambda])\left(E_G[\alpha]E_G[\lambda]\right) + E_G[\alpha\lambda]E_G[\lambda]E_G[\alpha] = E_G[\alpha]E_G[\lambda]^2 > 0.\ ∎$$
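Putting Claims 1-3 together, an equilibrium can be computed as a fixed point in theta. The sketch below does this by damped iteration for illustrative primitives, with F Uniform[0,1]; it is a heuristic solver, not part of the paper's argument.

import numpy as np
from scipy.optimize import brentq

beta = 0.95
G, alpha, lam = np.array([0.4, 0.6]), np.array([0.2, 0.7]), np.array([0.5, 0.3])
E = lambda f: float(np.dot(f, G))
Ea, El, Eal = E(alpha), E(lam), E(alpha * lam)

def wbar_of(theta):
    # Claim 2: w(1-b+b*El) = b*theta*(1-El)*(1-w)^2/2 for F Uniform[0,1]
    return brentq(lambda w: w * (1 - beta + beta * El)
                  - beta * theta * (1 - El) * (1 - w) ** 2 / 2, 0.0, 1.0)

def theta_of(wbar):
    # Claim 3 invariant marginal, then Claim 1 KL-minimizer
    m1 = Ea * (1 - wbar) / ((Ea - Eal) * (1 - wbar) + El)
    return (Eal * m1 + Ea * (1 - m1)) / (El * m1 + (1 - m1))

theta = Ea                                    # initial guess
for _ in range(500):                          # damped fixed-point iteration
    theta = 0.5 * theta + 0.5 * theta_of(wbar_of(theta))
print("equilibrium threshold and belief:", wbar_of(theta), theta)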
Proof of Proposition 4. Similarly to the proof of Claim 2, let

$$L_o(z,w) = \lambda(z) U_o(z) + (1-\lambda(z))\left[w + \beta \int_Z L_o(z',w)\, G(dz')\right],$$

where $U_o(z)$ is the value of being unemployed, before receiving a wage offer:

$$U_o(z) = (1-\alpha(z))\,\beta \int_Z U_o(z')G(dz') + \alpha(z) \int \max\left\{l(w),\ \beta \int_Z U_o(z')G(dz')\right\} F(dw),$$

where $l(w) \equiv w + \beta \int_Z L_o(z',w)G(dz')$. Observe that $U_o$ does depend on $z$, because $\alpha$ does. It is easy to see that $w \mapsto L_o(z,w)$ is increasing. So $\sigma(0\,|\,w,z) = 1$ if $w < w_o$ and $\sigma(1\,|\,w,z) = 1$ if $w > w_o$, where $w_o$ is such that

$$w_o + \beta \int_Z L_o(z',w_o)\, G(dz') = \beta \int_Z U_o(z')\, G(dz'). \qquad (36)$$

Observe that the Bellman equation for $L_o$ implies

$$\int_Z L_o(z,w)\, G(dz) = \frac{\int_Z \lambda(z) U_o(z)\, G(dz) + (1 - E_G[\lambda])\, w}{1 - \beta(1 - E_G[\lambda])}$$

and hence, letting $A \equiv 1/(1-\beta(1-E_G[\lambda]))$, $L_o$ can be cast as

$$L_o(z,w) = \lambda(z) U_o(z) + w(1-\lambda(z))A + \beta A (1-\lambda(z)) \int_Z \lambda(z) U_o(z)\, G(dz),$$

so that

$$l(w) = J(w) \equiv wA + \beta A \int_Z \lambda(z') U_o(z')\, G(dz').$$

The Bellman equation for $U_o$ can then be written as

$$U_o(z) = \beta \int_Z U_o(z')G(dz') + \alpha(z)\left[\int \max\left\{J(w),\ \beta \int_Z U_o(z')G(dz')\right\} F(dw) - \beta \int_Z U_o(z')G(dz')\right].$$

We guess that $U_o(z) = C + \alpha(z) D$. Our guess must satisfy

$$C = \beta\left(C + E_G[\alpha]\, D\right) \implies C = \frac{\beta E_G[\alpha]\, D}{1-\beta}.$$

Observe that $E_G[U_o] = C + E_G[\alpha] D = \frac{E_G[\alpha] D}{1-\beta}$ and

$$\int_Z \lambda(z') U_o(z')\, G(dz') = C E_G[\lambda] + D E_G[\alpha\lambda] = D\left\{\frac{\beta E_G[\alpha] E_G[\lambda]}{1-\beta} + E_G[\alpha\lambda]\right\}.$$

Moreover, since $J(w_o) = \beta E_G[U_o]$,

$$D = \int \max\{J(w), \beta E_G[U_o]\}\, F(dw) - \beta E_G[U_o] = \beta E_G[U_o]\,(F(w_o)-1) + A \int_{w>w_o} w\, F(dw) + \beta A (1-F(w_o)) \int_Z \lambda(z') U_o(z')\, G(dz').$$

Collecting the terms in $D$ and letting

$$E \equiv \frac{\beta E_G[\alpha]}{1-\beta} - \beta A\left\{\frac{\beta E_G[\alpha] E_G[\lambda]}{1-\beta} + E_G[\alpha\lambda]\right\},$$

this yields

$$D = \frac{A \int_{w>w_o} w\, F(dw)}{1 + (1-F(w_o))\, E}.$$

Therefore, equation (36), which reads $J(w_o) = \beta E_G[U_o]$, becomes

$$w_o A = \frac{\beta E_G[\alpha]\, D}{1-\beta} - \beta A D\left\{\frac{\beta E_G[\alpha] E_G[\lambda]}{1-\beta} + E_G[\alpha\lambda]\right\} = D E.$$

So $w_o\,(1 + (1-F(w_o))E) = E \int_{w>w_o} w\, F(dw)$. Equivalently,

$$w_o = E \int_{w>w_o} \{w - w_o\}\, F(dw) = \frac{\beta}{1-\beta+\beta E_G[\lambda]}\left[E_G[\alpha] - E_G[\alpha\lambda]\right] \int_{w>w_o} \{w - w_o\}\, F(dw),$$

where the last equality follows from the fact that

$$E = \frac{\beta E_G[\alpha]}{1-\beta}\left(1 - \beta A E_G[\lambda]\right) - \beta A E_G[\alpha\lambda] = \beta A\left(E_G[\alpha] - E_G[\alpha\lambda]\right) = \frac{\beta}{1-\beta+\beta E_G[\lambda]}\left[E_G[\alpha] - E_G[\alpha\lambda]\right],$$

using $1 - \beta A E_G[\lambda] = (1-\beta)A$. ∎
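The objective threshold w_o is a single root-finding problem once E is computed. The sketch below again takes F Uniform[0,1] and illustrative primitives.

import numpy as np
from scipy.optimize import brentq

beta = 0.95
G, alpha, lam = np.array([0.4, 0.6]), np.array([0.2, 0.7]), np.array([0.5, 0.3])
E = lambda f: float(np.dot(f, G))
coef = beta * (E(alpha) - E(alpha * lam)) / (1 - beta + beta * E(lam))

# w_o solves w = coef * int_{w' > w} (w' - w) dF(w') = coef * (1 - w)^2 / 2
w_o = brentq(lambda w: w - coef * (1 - w) ** 2 / 2, 0.0, 1.0)
print("objective reservation wage:", round(w_o, 4))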

Proof of Proposition 5. Let

$$Z_o(w) \equiv w\,(1-\beta+\beta E_G[\lambda]) - \beta\left(E_G[\alpha] - E_G[\alpha\lambda]\right) \int_{w'>w} \{w'-w\}\, F(dw')$$

and, for any $\theta \in \Theta$,

$$Z(w,\theta) \equiv w\,(1-\beta+\beta E_G[\lambda]) - \beta\theta(1-E_G[\lambda]) \int_{w'>w} \{w'-w\}\, F(dw'),$$

so that $Z_o(w_o) = 0$ and, by Claim 2, $Z(\bar w^*_\theta, \theta) = 0$. It can be shown that $Z_o$ is increasing; $w \mapsto Z(w,\theta)$ is also increasing. Observe that

$$Z_o(w) - Z(w,\theta) = \beta \int_{w'>w} \{w'-w\}\, F(dw')\left(\theta(1-E_G[\lambda]) - (E_G[\alpha] - E_G[\alpha\lambda])\right)$$
$$= \beta \int_{w'>w} \{w'-w\}\, F(dw')\left(\{\theta - E_G[\alpha]\} + \{E_G[\alpha\lambda] - \theta E_G[\lambda]\}\right)$$
$$= \beta \int_{w'>w} \{w'-w\}\, F(dw')\left(\{\theta - E_G[\alpha]\} + \{Cov_G(\alpha,\lambda) + E_G[\lambda](E_G[\alpha] - \theta)\}\right).$$

Now, for $\theta = \theta_Q(m)$ with $m$ any invariant distribution, Claim 1 implies

$$\theta_Q(m) = \frac{E_G[\alpha\lambda]\, m_X(1) + E_G[\alpha]\, m_X(0)}{m_X(0) + E_G[\lambda]\, m_X(1)} = E_G[\alpha] + \frac{Cov_G(\alpha,\lambda)\, m_X(1)}{m_X(0) + E_G[\lambda]\, m_X(1)}.$$

So $\{\theta - E_G[\alpha]\} + \{Cov_G(\alpha,\lambda) + E_G[\lambda](E_G[\alpha] - \theta)\} = \left(\frac{(1-E_G[\lambda])\, m_X(1)}{m_X(0) + E_G[\lambda]\, m_X(1)} + 1\right) Cov_G(\alpha,\lambda)$ and

$$\text{sign}\left\{Z_o(w) - Z(w,\theta_Q(m))\right\} = \text{sign}\left\{\frac{m_X(1) + m_X(0)}{m_X(0) + E_G[\lambda]\, m_X(1)}\; Cov_G(\alpha,\lambda)\right\}.$$

It is clear that the term multiplying the covariance is positive. So $Cov_G(\alpha,\lambda) > (<)\, 0$ implies $Z_o(w) > (<)\, Z(w,\theta_Q(m))$ for any invariant $m$. Since $Z_o$ is increasing, $w_o$ is unique. Moreover, taking $m$ to be the invariant distribution corresponding to an equilibrium threshold $\bar w^*_m$, we have $Z_o(\bar w^*_m) > (<)\, Z(\bar w^*_m, \theta_Q(m)) = 0 = Z_o(w_o)$, and monotonicity of $Z_o$ then implies $w_o < (>)\, \bar w^*_m$ for any threshold $\bar w^*_m$ that characterizes an equilibrium strategy of the BMDP. ∎
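Finally, the comparison in Proposition 5 can be checked end-to-end. The sketch below computes w_o and the equilibrium threshold for two made-up specifications of lambda, one giving Cov_G(alpha, lambda) < 0 and one giving it > 0 (F is Uniform[0,1] throughout); the predicted ordering of the thresholds should flip with the sign of the covariance.

import numpy as np
from scipy.optimize import brentq

beta, G, alpha = 0.95, np.array([0.4, 0.6]), np.array([0.2, 0.7])

def thresholds(lam):
    E = lambda f: float(np.dot(f, G))
    Ea, El, Eal = E(alpha), E(lam), E(alpha * lam)
    cov = Eal - Ea * El
    coef_o = beta * (Ea - Eal) / (1 - beta + beta * El)
    w_o = brentq(lambda w: w - coef_o * (1 - w) ** 2 / 2, 0.0, 1.0)
    def wbar_of(t):                      # Claim 2 threshold given belief t
        return brentq(lambda w: w * (1 - beta + beta * El)
                      - beta * t * (1 - El) * (1 - w) ** 2 / 2, 0.0, 1.0)
    theta = Ea                           # damped iteration on Claims 1 and 3
    for _ in range(500):
        w = wbar_of(theta)
        m1 = Ea * (1 - w) / ((Ea - Eal) * (1 - w) + El)
        theta = 0.5 * theta + 0.5 * (Eal * m1 + Ea * (1 - m1)) / (El * m1 + (1 - m1))
    return cov, w_o, wbar_of(theta)

for lam in (np.array([0.5, 0.3]), np.array([0.3, 0.5])):
    cov, w_o, w_eq = thresholds(lam)
    print(f"cov={cov:+.4f}  w_o={w_o:.4f}  w_eq={w_eq:.4f}")
# Proposition 5 predicts w_o < w_eq when cov > 0 and w_o > w_eq when cov < 0.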

References

P. Aghion, P. Bolton, C. Harris, and B. Jullien. Optimal learning by experimentation. The Review of Economic Studies, 58(4):621–654, 1991.
N. Al-Najjar. Decision makers as statisticians: Diversity, ambiguity and learning. Econometrica, 77(5):1371–1401, 2009.
N. Al-Najjar and M. Pai. Coarse decision making. 2009.
E. Aragones, I. Gilboa, A. Postlewaite, and D. Schmeidler. Fact-free learning. American Economic Review, 95(5):1355–1368, 2005.
N. Barberis, A. Shleifer, and R. Vishny. A model of investor sentiment. Journal of Financial Economics, 49(3):307–343, 1998.
P. Battigalli. Comportamento razionale ed equilibrio nei giochi e nelle situazioni sociali, volume 100. Università Bocconi, Milano, 1987.
R.H. Berk. Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics, 37(1):51–58, 1966.
L.E. Blume and D. Easley. Learning to be rational. Journal of Economic Theory, 26(2):340–351, 1982.
L.E. Blume and D. Easley. Rational expectations equilibrium: An alternative approach. Journal of Economic Theory, 34(1):116–129, 1984.
M. Bray. Learning, estimation, and the stability of rational expectations. Journal of Economic Theory, 26(2):318–339, 1982.
M. Bray and D.M. Kreps. Rational learning and rational expectations. Graduate School of Business, Stanford University, 1981.
O. Bunke and X. Milhaud. Asymptotic behavior of Bayes estimates under possibly incorrect models. The Annals of Statistics, 26(2):617–644, 1998.
E. Dekel, D. Fudenberg, and D.K. Levine. Learning to play Bayesian games. Games and Economic Behavior, 46(2):282–303, 2004.
P. Diaconis and D. Freedman. On the consistency of Bayes estimates. The Annals of Statistics, pages 1–26, 1986.
U. Doraszelski and J.F. Escobar. A theory of regular Markov perfect equilibria in dynamic stochastic games: Genericity, stability, and purification. Theoretical Economics, 5(3):369–402, 2010.
D. Easley and N.M. Kiefer. Controlling a stochastic process with unknown parameters. Econometrica, pages 1045–1064, 1988.
I. Esponda. Behavioral equilibrium in economies with adverse selection. The American Economic Review, 98(4):1269–1291, 2008.
G.W. Evans and S. Honkapohja. Learning and Expectations in Macroeconomics. Princeton University Press, 2001.
E. Eyster and M. Piccione. An approach to asset-pricing under incomplete and diverse perceptions. Technical report, Mimeo, 2011.
E. Eyster and M. Rabin. Cursed equilibrium. Econometrica, 73(5):1623–1672, 2005.
D.A. Freedman. On the asymptotic behavior of Bayes estimates in the discrete case. The Annals of Mathematical Statistics, 34(4):1386–1403, 1963.
D. Fudenberg and D.M. Kreps. Learning mixed equilibria. Games and Economic Behavior, 5:320–367, 1993.
D. Fudenberg and D.M. Kreps. A theory of learning, experimentation, and equilibrium in games. Technical report, Mimeo, 1988.
D. Fudenberg and D.M. Kreps. Learning in extensive-form games I: Self-confirming equilibria. Games and Economic Behavior, 8(1):20–55, 1995.
D. Fudenberg and D.K. Levine. Self-confirming equilibrium. Econometrica, pages 523–545, 1993a.
D. Fudenberg and D.K. Levine. Steady state learning and Nash equilibrium. Econometrica, pages 547–573, 1993b.
J.C. Harsanyi. Games with randomly disturbed payoffs: A new rationale for mixed-strategy equilibrium points. International Journal of Game Theory, 2(1):1–23, 1973.
C. Holt and R. Sherman. The loser's curse. American Economic Review, pages 642–652, 1994.
P. Jehiel. Analogy-based expectation equilibrium. Journal of Economic Theory, 123(2):81–104, 2005.
P. Jehiel and F. Koessler. Revisiting games of incomplete information with analogy-based expectations. Games and Economic Behavior, 62(2):533–557, 2008.
P. Jehiel and D. Samet. Valuation equilibrium. Theoretical Economics, 2(2):163–185, 2007.
J.H. Kagel and D. Levin. The winner's curse and public information in common value auctions. The American Economic Review, pages 894–920, 1986.
E. Kalai and E. Lehrer. Rational learning leads to Nash equilibrium. Econometrica, pages 1019–1045, 1993.
P. Lax. Functional Analysis. Wiley, 2002.
A. McLennan. Price dispersion and incomplete learning in the long run. Journal of Economic Dynamics and Control, 7(3):331–347, 1984.
S.P. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, 2005.
Y. Nyarko. Learning in mis-specified models and the possibility of cycles. Journal of Economic Theory, 55(2):416–427, 1991.
M.J. Osborne and A. Rubinstein. Games with procedurally rational players. American Economic Review, 88:834–849, 1998.
M. Piccione and A. Rubinstein. Modeling the economic interaction of agents with diverse abilities to recognize equilibrium patterns. Journal of the European Economic Association, 1(1):212–223, 2003.
M. Rabin and D. Vayanos. The gambler's and hot-hand fallacies: Theory and applications. The Review of Economic Studies, 77(2):730–778, 2010.
M. Rothschild. A two-armed bandit theory of market pricing. Journal of Economic Theory, 9(2):185–202, 1974.
A. Rubinstein and A. Wolinsky. Rationalizable conjectural equilibrium: Between Nash and rationalizability. Games and Economic Behavior, 6(2):299–311, 1994.
T.J. Sargent. The Conquest of American Inflation. Princeton University Press, 2001.
J. Schwartzstein. Selective attention and learning. Unpublished manuscript, Harvard University, 2009.
J. Sobel. Non-linear prices and price-taking behavior. Journal of Economic Behavior & Organization, 5(3):387–396, 1984.
R. Spiegler. Placebo reforms. The American Economic Review, forthcoming, 2012.
R. Spiegler. Bounded Rationality and Industrial Organization. Oxford University Press, 2011.
N. Stokey, R. Lucas, and E. Prescott. Recursive Methods in Economic Dynamics. Harvard University Press, 1989.


Figure 1: Monopolist with unknown demand (Nyarko [1991])


Figure 2: Equilibrium of the search model

