Dynamic programming

Martin Ellison

1 Motivation
Dynamic programming is one of the most fundamental building blocks of
modern macroeconomics. It gives us the tools and techniques to analyse
(usually numerically but often analytically) a whole class of models in which
the problems faced by economic agents have a recursive nature. Recursive
problems pervade macroeconomics: any model in which agents face repeated
decision problems tends to have a recursive formulation. This lecture
introduces two key concepts: the value function and value function iterations.
To fully understand the intuition of dynamic programming, we begin with
simple models that are deterministic. Models which are stochastic and
nonlinear will be considered in future lectures.

2 Key reading
This lecture draws on the material in chapters 2 and 3 of “Dynamic
Economics: Quantitative Methods and Applications” by Jérôme Adda and
Russell Cooper, MIT Press, 2003. We will also use the book later in the
course. It is a very accessible introduction to techniques for dynamic
economics, covering topics including consumption, investment and
employment. As it is very reasonably priced, I recommend this text for
purchase.

3 Other reading
The same material is covered in several textbooks, most notably “Recursive
Macroeconomic Theory”, 2nd ed by Lars Ljungqvist and Tom Sargent, MIT
Press, 2000 and the original grandfather text “Recursive Methods in
Economic Dynamics” by Nancy Stokey and Robert Lucas, 1989. These cover
the material with greater mathematical rigor, whereas Adda and Cooper
place greater weight on intuitive understanding.

4 The value function


At the heart of dynamic programming is the value function, which shows the
value of a particular state of the world. For example, what is the value of
having income I when there are two goods in the economy, x1 and x2 , at prices
p1 and p2 and utility is logarithmic and separable? To answer this question,
we begin by asking how the consumer would allocate income I across the two
goods. The utility maximisation problem faced by the consumer is

$$\max_{x_1, x_2} \left[ \omega \ln x_1 + (1 - \omega) \ln x_2 \right]$$

s.t.

$$p_1 x_1 + p_2 x_2 = I$$

Taking first order conditions gives the familiar solution in which there are
constant expenditure shares:

$$p_1 x_1 = \omega I$$

$$p_2 x_2 = (1 - \omega) I$$

Solving these two equations for $x_1$ and $x_2$, we can then substitute into
the utility function to obtain

$$V(I, p_1, p_2) = \omega \ln\left(\frac{\omega I}{p_1}\right) + (1 - \omega) \ln\left(\frac{(1 - \omega) I}{p_2}\right)$$

Students of microeconomics will immediately recognise this as the indirect
utility function. It describes how utility depends on the state variables $I$,
$p_1$ and $p_2$. However, in the dynamic programming terminology, we refer to
it as the value function: the value associated with the state variables. Note
that it is intrinsic to the value function that the agent (in this case the
consumer) is optimising. More generally, we can write

$$V(I, p) = \max_{c \in C} u(c)$$

s.t.

$$p c = I$$

where $c$ is a vector of consumption goods chosen from the set $C$ of
possible consumption bundles and $p$ is a vector of their corresponding
prices. This formulation makes it explicit that the value function
incorporates optimisation. Whilst the example in this section is trivial, we
will see that recasting economic problems in terms of value functions turns
out to be extremely powerful.
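
As a quick numerical check of the indirect utility function in this section, the following Matlab sketch compares the closed-form expression with a brute-force search along the budget line. The parameter values (omega, I, p1, p2) and the grid size are illustrative assumptions, not taken from the lecture.

```matlab
% Illustrative parameter values (assumptions, not from the lecture)
omega = 0.3;   % weight on good 1 in utility
I  = 10;       % income
p1 = 2;        % price of good 1
p2 = 5;        % price of good 2

% Closed-form value function (indirect utility) from the text
V_closed = omega*log(omega*I/p1) + (1-omega)*log((1-omega)*I/p2);

% Brute-force check: search over x1, with x2 implied by the budget constraint
x1 = linspace(1e-4, I/p1 - 1e-4, 100001);
x2 = (I - p1*x1)/p2;
[V_grid, idx] = max(omega*log(x1) + (1-omega)*log(x2));

fprintf('closed form: %.6f, grid search: %.6f\n', V_closed, V_grid);
fprintf('expenditure share of good 1 at the optimum: %.4f\n', p1*x1(idx)/I);
```

Under these assumptions the two values should agree closely and the expenditure share of good 1 should be close to omega, which is the constant expenditure share result.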

5 Cake-eating example
To introduce dynamics to the problem, we now consider the problem of how
quickly one should eat a cake of given size. Imagine the cake is initially of
size $W_1$ and all cake should be eaten before time $T$ (by which time
presumably either the cake has become moldy or the consumer has died and
become moldy!). Instantaneous utility derived from eating cake is given by the
function $u(c_t)$ and the consumer discounts future utility by the factor
$\beta$. This is a finite-horizon dynamic problem with discounting.

$$\max_{\{c_t\}} \sum_{t=1}^{T} \beta^{t-1} u(c_t)$$

s.t.

$$W_{t+1} = W_t - c_t, \qquad W_1 \text{ given}$$

The problem is to find $\{c_t\}$. In standard undergraduate or master's
courses in macroeconomics, the preferred solution method is one of direct
attack. We solve the budget constraint forward to obtain

$$\sum_{t=1}^{T} c_t + W_{T+1} = W_1$$

and define the Lagrangian

$$L = \sum_{t=1}^{T} \beta^{t-1} u(c_t) + \lambda \left[ W_1 - W_{T+1} - \sum_{t=1}^{T} c_t \right]$$

The first order conditions for $t = 1, \dots, T$ are

$$\beta^{t-1} u'(c_t) = \lambda$$

Alternatively,

$$u'(c_t) = \beta u'(c_{t+1})$$

This is the familiar Euler equation, equating the net present value of
marginal utility of consumption across consecutive time periods. In itself, it
is not sufficient to uniquely determine how the cake should be eaten. For
that, we also require that $W_{T+1} = 0$, a terminal condition that states that
no cake should be left over after period $T$.
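
To see how the Euler equation and the terminal condition together pin down the consumption path, the Matlab sketch below assumes logarithmic utility, u(c) = ln(c). This functional form and the values of beta, T and W1 are assumptions made for illustration. With log utility the Euler equation reduces to c(t+1) = beta*c(t), and the terminal condition fixes the level of c(1).

```matlab
% Illustrative values (assumptions, not from the lecture)
beta = 0.9;    % discount factor
T    = 5;      % number of periods
W1   = 1;      % initial cake size

% With u(c) = ln(c), the Euler equation gives c(t+1) = beta*c(t).
% Imposing the terminal condition W(T+1) = 0 pins down c(1).
t = 1:T;
c = beta.^(t-1) * W1 * (1-beta)/(1-beta^T);

% Cake remaining at the start of periods 1..T+1
W = W1 - [0, cumsum(c)];

% Check the Euler equation u'(c_t) = beta*u'(c_{t+1}) with u'(c) = 1/c
euler_gap = 1./c(1:end-1) - beta./c(2:end);

disp(c);
fprintf('cake left after period T: %.2e\n', W(end));
fprintf('largest Euler equation error: %.2e\n', max(abs(euler_gap)));
```

Consumption falls geometrically over time, the cake is exhausted after period T, and the Euler equation holds up to rounding error.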
To solve the problem using the method of dynamic programming, we
define a value function $V_T(W_1)$ to be the solution derived above with the
method of direct attack, i.e.,

$$V_T(W_1) = \max_{\{c_t\}} \sum_{t=1}^{T} \beta^{t-1} u(c_t)$$

s.t.

$$W_{t+1} = W_t - c_t, \qquad W_1 \text{ given}$$

Although we do not know the function $V_T(W_1)$, we do know its derivative
$V_T'(W_1)$. An increment in the initial cake size $W_1$ allows consumption in
any period to increase, therefore $V_T'(W_1) = \beta^{t-1} u'(c_t)$ for any
$t = 1, \dots, T$. It does not matter in which period the extra cake is eaten
since, due to optimality, the return (in terms of the value function) of
eating extra cake is equalised across periods.
The power of dynamic programming becomes apparent when we add an
additional period 0 to our problem. The problem at time 0 is to solve

$$\max_{c_0} \left[ u(c_0) + \beta V_T(W_1) \right]$$

s.t.

$$W_1 = W_0 - c_0, \qquad W_0 \text{ given}$$

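This is a simple problem to solve because we only have to choose $c_0$ rather
than a whole time path for consumption $\{c_t\}$. The first order condition is
simply

$$u'(c_0) = \beta V_T'(W_1)$$

However, we know from before that $V_T'(W_1) = \beta^{t-1} u'(c_t)$, which at
$t = 1$ is $u'(c_1)$, so $u'(c_0) = \beta u'(c_1)$. This has the same form as
the condition linking any two consecutive periods, so we can conclude

$$u'(c_t) = \beta u'(c_{t+1})$$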

and we have derived the Euler equation using the dynamic programming
method. Notice how we did not need to worry about decisions from time
t = 1 onwards. This is an example of the Bellman optimality principle. It is
sufficient to optimise today conditional on future behaviour being optimal.
The ease with which we did this is of course illusory because we already
knew the form of $V_T'(W_1)$ from the direct attack approach. In general, this
will not be the case and we will not know the exact form of the value function
or its first derivative. Fortunately, this is not a completely insurmountable
problem. Our approach will be to make a first guess at the value function and
then have several value function iterations until our guesses converge on the
true value function. The next section is devoted to showing how these value
function iterations are carried out and under what conditions they converge
to the true value function.
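
To make the period-0 argument of this section concrete, the following worked derivation assumes logarithmic utility, $u(c) = \ln c$. The functional form and the shorthand $B_T$ for the sum of discount factors are assumptions introduced only for this sketch; the argument in the text holds for general $u$. In this special case the value function is known in closed form, so each step can be checked by hand.

```latex
\begin{align*}
B_T &\equiv \sum_{t=1}^{T} \beta^{t-1} = \frac{1-\beta^{T}}{1-\beta} \\
c_t &= \beta^{t-1}\,\frac{W_1}{B_T}
  && \text{(Euler equation with } W_{T+1}=0\text{)} \\
V_T(W_1) &= B_T \ln W_1 + \text{constant} \\
V_T'(W_1) &= \frac{B_T}{W_1} = \frac{1}{c_1} = u'(c_1) \\
\frac{1}{c_0} &= \beta\,\frac{B_T}{W_0 - c_0}
  && \text{(first order condition at time 0)} \\
c_0 &= \frac{W_0}{1+\beta B_T} = \frac{W_0}{B_{T+1}} \\
c_1 &= \frac{W_0 - c_0}{B_T} = \beta c_0
  && \Longrightarrow \quad u'(c_0) = \beta\,u'(c_1)
\end{align*}
```

Adding period 0 therefore only requires the derivative of $V_T$, and the resulting choice of $c_0$ is the same as the one obtained by solving the $(T+1)$-period problem directly.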

6 Value function iterations


We illustrate the convergence of value function iterations to the true value
function in a general formulation of the dynamic programming problem. The
key ingredients are a payoff function $\tilde{\sigma}(s_t, c_t)$ and a
transition equation $s_{t+1} = \tau(s_t, c_t)$. The payoff function describes
the instantaneous return from choosing a vector of controls $c_t$ at a given
vector of states $s_t$. In the cake-eating example, $\tilde{\sigma}(\cdot)$ is
simply the (direct) utility function. The transition equation describes the
evolution of the vector of state variables $s_t$. For the cake-eating example,
$\tau(\cdot)$ is the intertemporal budget constraint. Of crucial importance
for the remainder of this course is that $\tilde{\sigma}(\cdot)$ and
$\tau(\cdot)$ are not time-dependent so the problem is stationary. This
ensures that the problem has a recursive representation. In other words, for a
given state vector, the problem faced by the agent is always the same.
The value function is now defined as the value of having a particular state
$s$ (we remove the time index and use $s$ and $s'$ to denote the state vector
in adjacent time periods).

$$V(s) = \max_{c \in C(s)} \left[ \tilde{\sigma}(s, c) + \beta V(s') \right]$$

s.t.

$$s' = \tau(s, c)$$

$C(s)$ is the set of all possible choices for the controls $c$ for a given
state vector $s$. To make the notation more compact, we invert the transition
equation to define the control $c$ as a function of current and future states,
enabling the payoff function to be written in terms of $s$ and $s'$. Instead of
choosing $c$, the agent chooses the future state $s'$ from the set of feasible
states $\Gamma(s)$.

$$V(s) = \max_{s' \in \Gamma(s)} \left[ \sigma(s, s') + \beta V(s') \right]$$
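
As an illustration of this change of variables, the cake-eating problem can be mapped into the general notation as follows. The state is the cake size $W$ and the control is consumption $c$; the feasible sets written as intervals are an assumption of this sketch (the text does not state them explicitly).

```latex
\begin{align*}
\text{state and control:}\quad & s = W, \qquad c \in C(W) = [0, W] \\
\text{payoff and transition:}\quad & \tilde{\sigma}(W, c) = u(c), \qquad W' = \tau(W, c) = W - c \\
\text{inverted transition:}\quad & c = W - W', \qquad \sigma(W, W') = u(W - W'), \qquad \Gamma(W) = [0, W] \\
\text{Bellman equation:}\quad & V(W) = \max_{W' \in [0, W]} \left[ u(W - W') + \beta V(W') \right]
\end{align*}
```

Choosing how much cake to leave for tomorrow is equivalent to choosing how much to eat today, which is why the agent's choice can be written directly in terms of $W'$.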

The problem now is to find the value function $V(s)$. In some cases, it is
possible to make an intuitive guess at its form (e.g. quadratic in the state
variables) and then proceed via the method of undetermined coefficients to
show that the guess is consistent with optimality. The approach we take
through value function iterations is more general, although it leads to a
numerical rather than analytical solution. We begin by making an initial
guess of the value function $W(\cdot)$. The next guess of the value function
is obtained by applying an operator $T$, defined as follows:

$$T(W)(s) = \max_{s' \in \Gamma(s)} \left[ \sigma(s, s') + \beta W(s') \right]$$

To put the iteration in words, what we are doing in each iteration is
reoptimising the choice of the future states $s'$. In doing this, we need to
know how a change in the future states affects the payoff this period and in
all future periods (the latter is often known as the continuation value). For
the payoff this period, we can use the function $\sigma(s, s')$. For the payoff
in future periods, we use the previous guess of the value function $W(s')$. In
effect, we are reoptimising our choice of $s'$, assuming that $W(s')$ is a
correct representation of the true value function from the next period
onwards. Clearly it is not, but successive iterations will generally converge
so that $W(s')$ becomes the true value function $V(s')$ and the guess of the
value function does not change between successive value function iterations.
How do we know that value function iterations will converge? Even if they
do, how do we know that they converge to the unique value function? To
answer these questions we need a fixed point theorem, since we wish to show
that value function iterations converge to the unique fixed point defined by

$$V(s) = T(V)(s) = \max_{s' \in \Gamma(s)} \left[ \sigma(s, s') + \beta V(s') \right]$$

There are many fixed point theorems, some more useful than others. For
our purposes, the most useful fixed point theorem is known as Blackwell’s
sufficiency conditions. These conditions ensure convergence to a unique fixed
point. The conditions are 1) monotonicity and 2) discounting of the $T$
operator. The mathematics behind these conditions can be found in Stokey and
Lucas and many other places. Here, we will concentrate on gaining intuition
into why Blackwell’s sufficiency conditions guarantee convergence of value
function iterations to the unique true value function.
To illustrate the intuition, we describe a simple example in which the state
space collapses to a single point. In this example, there is no possibility to
change state so no control or optimisation decision. However, there is a value
associated with the (unique) state so we can still illustrate the operation of
value function iterations. In our simple example the value function will be a
constant. If the initial guess of the value function is $W$ then the next guess
of the value function is obtained by applying the operator

$$T(W) = \sigma + \beta W$$

Notice that the assumption of a collapsed state space removes the state
dependency of the payoff function $\sigma(\cdot)$. It is a simple matter to
plot the mapping graphically.

[Figure: the mapping T(W) = σ + βW plotted against W together with the 45° line; the two lines intersect at W = V.]

This mapping converges to $V$ for any starting value. It is an example of
a contraction mapping as the span of $W$ is contracted at each iteration by
the $T$ operator. The key condition for convergence in the simple model is
$|\beta| < 1$, which is guaranteed by discounting. Graphically, this ensures
that the $T$ mapping cuts the 45° line from above and with a gradient of
modulus less than 1.
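
The convergence of the collapsed-state example can be traced numerically. The Matlab sketch below iterates T(W) = sigma + beta*W from an arbitrary starting guess; the values of sigma and beta, the starting guess and the number of iterations are assumptions chosen for illustration. The fixed point solves V = sigma + beta*V, so V = sigma/(1-beta).

```matlab
% Illustrative values (assumptions, not from the lecture)
sigma = 1;                % constant per-period payoff
beta  = 0.75;             % discount factor, |beta| < 1
V     = sigma/(1-beta);   % fixed point of T(W) = sigma + beta*W

W = -10;                  % arbitrary starting guess
nIter = 50;
gap = zeros(1, nIter);    % distance to the fixed point after each iteration
for it = 1:nIter
    W = sigma + beta*W;   % apply the T operator once
    gap(it) = abs(W - V);
end

fprintf('fixed point: %.4f, value after %d iterations: %.4f\n', V, nIter, W);
% Each application of T shrinks the distance to V by the factor beta
disp(gap(2:5)./gap(1:4));
```

The ratio of successive distances equals beta, which is the sense in which the operator contracts: the guess moves a fixed proportion closer to V at every iteration, whatever the starting value.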
Once we move to problems with a fully specified state space, the operator
$T$ is applied to a function $W(s')$ rather than a constant $W$. In this
case, discounting is not a sufficient condition for unique convergence of value
function iterations to the true value. Intuitively, what we require is that
the $T$ mapping cuts the 45° surface from above in every direction and that
the dynamics are stable. The condition of monotonicity ensures this. The
original Blackwell paper from 1965 contains a formal proof, as do Stokey
and Lucas.
It is easy to see that Blackwell’s sufficiency conditions apply to the
dynamic programming problems we will study. Monotonicity requires that if
$W(s) \ge Q(s)$ for all $s \in S$ then $T(W)(s) \ge T(Q)(s)$ for all $s \in S$.
This is guaranteed because our problem is one of maximisation. We have

$$T(W)(s) = \max_{s' \in \Gamma(s)} \left[ \sigma(s, s') + \beta W(s') \right] \ge \sigma(s, s'') + \beta W(s'') \ge \sigma(s, s'') + \beta Q(s'') = T(Q)(s)$$

where $s''$ is the future state that is optimal when the continuation value
is $Q$. The first inequality holds because $s''$ is feasible but not
necessarily optimal under $W$; the second holds because $W(s'') \ge Q(s'')$.
Discounting requires that adding a constant $k \ge 0$ to the value function
gives $T(W + k)(s) \le T(W)(s) + \beta k$. This is satisfied trivially in our
model because

$$T(W + k)(s) = \max_{s' \in \Gamma(s)} \left[ \sigma(s, s') + \beta (W(s') + k) \right] = T(W)(s) + \beta k$$
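
Both conditions can be checked numerically on a small discretised problem. The Matlab sketch below builds the T operator for a cake-eating problem on a coarse grid and tests monotonicity and discounting for two arbitrary guesses. The grid, the guesses W and Q, the constant k and the utility function u(c) = sqrt(c) are all assumptions made for this illustration (square-root utility is used here, rather than the logarithmic utility of the next section, so that every state has a feasible choice with finite payoff).

```matlab
beta = 0.75;
K = (0:0.1:1)';                         % cake sizes (column vector)
n = numel(K);

% C(i,j) = consumption implied by moving from cake K(i) to cake K(j)
C = repmat(K, 1, n) - repmat(K', n, 1);
payoff = -Inf(n, n);                    % infeasible choices are never selected
payoff(C >= 0) = sqrt(C(C >= 0));       % assumed utility u(c) = sqrt(c)

% T operator: maximise payoff plus discounted continuation value over s'
T = @(W) max(payoff + beta*repmat(W', n, 1), [], 2);

Q = -K.^2;                              % an arbitrary guess
W = Q + 0.5*K;                          % a second guess with W >= Q everywhere
k = 3;                                  % constant used in the discounting check

monotone    = all(T(W) >= T(Q));
discounting = max(abs(T(W + k) - (T(W) + beta*k))) < 1e-12;
fprintf('monotonicity holds: %d, discounting holds: %d\n', monotone, discounting);
```

A check on two particular guesses is of course not a proof, but it shows what the two conditions mean for an operator that is actually computed on a grid.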

7 Numerical example
In this final section we show how to apply the principles of dynamic
programming to the cake-eating problem in practice. We discuss the Matlab
program available from the (very preliminary) Adda and Cooper book’s homepage
at http://www.eco.utexas.edu/~cooper/dynprog/dynprog1.html. This program
iterates the value function and derives an optimal policy function.
Initialise the program by clearing the working space and defining the
number of value function iterations and the discount factor.

clear;           % clear the workspace
dimIter=30;      % number of value function iterations
beta=0.75;       % discount factor

We discretise the possible cake sizes into a vector K. It contains 101
elements, starting from 0 and increasing in steps of 0.01 to 1. We store
the row and column size of K in rowK and colK. Since K is a row vector,
rowK is 1 and colK is 101.

K=0:0.01:1;
[rowK,colK]= size(K);

V is a matrix that stores the results of the value function iterations. The
rows of V correspond to the value of the possible cake sizes defined in K.
The columns of V contain successive value function iterations. The initial
guess of the value function is zero for all sizes of cake.

V=zeros(colK,dimIter);

Begin with the first value function iteration. Continue until the desired
number of iterations have been completed.

for iter=1:dimIter

Define aux as a square auxiliary matrix with as many rows and columns as
there are cake sizes in K. We will use this matrix to store the value of
choosing to leave K(ik2) cake for the next period when the current size of
the cake is K(ik). We will actually only use the lower left triangle of
the aux matrix since it is impossible to leave more cake in the future than
you have at present, i.e. K(ik2) ≤ K(ik).

aux=zeros(colK,colK)+NaN;

Beginning with the first possible current cake size, start looking through
all possible cake sizes.

for ik=1:colK;

For each current cake size K(ik), we examine the value of leaving all pos-
sible future cake sizes, K(ik2) < K(ik). We ignore the possibility K(ik2) =
K(ik) since that would imply no consumption and therefore starvation under
logarithmic utility.

for ik2=1:(ik-1)

The value of choosing K(ik2) when the current cake has size K(ik) is
stored in the (ik, ik2) element of aux. It consists of two parts, the
(logarithmic here) payoff function log(K(ik) − K(ik2)) and the discounted
continuation value beta*V(ik2, iter). Note that this uses the value function
V(ik2, iter) from the previous iteration.

aux(ik,ik2)=log(K(ik)-K(ik2))+beta*V(ik2,iter);
end
end

The newly iterated value function is derived by choosing the best future
cake K(ik2) for each current cake K(ik). Simply looking for the maximum
value of each row of aux (or alternatively the maximum value of each column
of aux') is sufficient to find the optimal cake choices.

V(:,iter+1)=max(aux')';

Loop to the next value function iteration until we reach the last iteration,
dimIter.

end

The value function iterations are now complete. The final value function
is stored in Val, with the corresponding indices of future cake choices in Ind.
optK converts these indices into actual cake sizes, with optC the necessary
consumption.

[Val,Ind]=max(aux');
optK=K(Ind);
optK=optK+Val*0;     % NaN wherever Val is NaN (no feasible choice)
optC=K'-optK';
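
The two-output form of max used here can be illustrated on a small matrix. In the sketch below the matrix A is made up for illustration and plays the role of aux: rows are current states, columns are choices, and NaN marks choices that are not feasible.

```matlab
% Small made-up example: rows are current states, columns are choices,
% NaN marks infeasible choices (as in the unused part of aux)
A = [NaN NaN NaN;
     -2  NaN NaN;
     -3   -1 NaN];

[val, ind] = max(A');   % max works down columns, so transpose first

disp(val);   % best value in each row of A (NaN where no choice is feasible)
disp(ind);   % column index of the best choice in each row of A
```

Because max skips NaN entries, only feasible choices compete. A row with no feasible choice returns NaN, and multiplying that NaN by zero still gives NaN, which is how the line optK=optK+Val*0 blanks out the corresponding entry of optK.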

Plot a graph of the successive value function iterations.

figure(1)
plot(K,V);
xlabel('Size of Cake');
ylabel('Value Function');

Plot the optimal policy function.

figure(2)
plot(K,optC,'LineWidth',2)
hold on
plot(K,K','--r',...
'LineWidth',2)
xlabel('Size of Cake');
ylabel('Optimal Consumption');
text(0.4,0.65,'45 degree line','FontSize',18)
text(0.4,0.13,'Optimal Consumption','FontSize',18)
legend('Optimal Consumption','45 degree line','Location','NorthWest')

The output of the code is shown below. In the first figure, the value
function clearly converges over successive iterations. The second figure shows
the optimal consumption as a function of current cake size.

[Figure: Iteration of the value function. Horizontal axis: Size of Cake (0 to 1); vertical axis: Value Function.]

[Figure: Optimal policy function. Horizontal axis: Size of Cake (0 to 1); vertical axis: Optimal Consumption, plotted together with the 45 degree line.]
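
As a cross-check on this kind of program, the sketch below is a compact, vectorised variant (not the Adda and Cooper code itself) that can be compared with a closed-form solution. It relies on the fact that, starting from a zero guess, n applications of the T operator give the value function of the n-period cake-eating problem of Section 5. With logarithmic utility, the Euler equation and the terminal condition imply first-period consumption of W(1-beta)/(1-beta^n). The number of periods nPer, the use of -Inf rather than NaN for infeasible choices, and the variable names are assumptions of this sketch.

```matlab
beta = 0.75;                 % discount factor, as in the program above
K    = (0:0.01:1)';          % cake grid, as a column vector
n    = numel(K);
nPer = 5;                    % number of iterations = number of periods (assumed)

% payoff(i,j) = log(K(i)-K(j)) if K(j) < K(i), otherwise -Inf (infeasible)
C = repmat(K, 1, n) - repmat(K', n, 1);
payoff = -Inf(n, n);
payoff(C > 0) = log(C(C > 0));

V = zeros(n, 1);             % initial guess: zero for all cake sizes
for it = 1:nPer
    % one application of the T operator; ind holds the chosen future cake
    [V, ind] = max(payoff + beta*repmat(V', n, 1), [], 2);
end

optC = K - K(ind);           % consumption in the first of nPer periods
c_exact = (1-beta)/(1-beta^nPer) * K(end);   % closed form for log utility

fprintf('grid: %.2f, closed form: %.4f\n', optC(end), c_exact);
```

For the largest cake size, the consumption chosen on the grid should lie within a few grid steps of the closed-form value; the remaining gap reflects the discretisation of the cake sizes.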