David Silver
Lecture 3: Planning by Dynamic Programming
Outline
1 Introduction
2 Policy Evaluation
3 Policy Iteration
4 Value Iteration
5 Extensions to Dynamic Programming
6 Contraction Mapping
Policy Evaluation
Iterative Policy Evaluation
[Backup diagram: one-step lookahead from v_{k+1}(s) at the root state s, over actions a and rewards r, to leaf values v_k(s')]

v_{k+1}(s) = Σ_{a∈A} π(a|s) ( R_s^a + γ Σ_{s'∈S} P_{ss'}^a v_k(s') )

v^{k+1} = R^π + γ P^π v^k
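A minimal sketch of this backup in code (not from the lecture), assuming the MDP is given as a transition array P of shape (A, S, S), a reward array R of shape (S, A), and a policy pi of shape (S, A); all names are illustrative.

import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    # Iterative policy evaluation: apply the synchronous Bellman
    # expectation backup v <- R^pi + gamma * P^pi v until convergence.
    R_pi = np.einsum('sa,sa->s', pi, R)        # R^pi(s)    = sum_a pi(a|s) R_s^a
    P_pi = np.einsum('sa,ast->st', pi, P)      # P^pi(s,s') = sum_a pi(a|s) P^a_{ss'}
    v = np.zeros(P.shape[1])
    while True:
        v_new = R_pi + gamma * P_pi @ v        # one full synchronous sweep
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new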
Example: Small Gridworld
Given a policy π (in the lecture, the uniform random policy on a small gridworld):
Evaluate the policy π, computing v_π
Improve the policy by acting greedily with respect to v_π: π' = greedy(v_π)
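A sketch of this evaluate-then-greedify loop (policy iteration), reusing the hypothetical P and R arrays and the policy_evaluation sketch above:

def policy_iteration(P, R, gamma=0.9):
    A, S, _ = P.shape
    pi = np.full((S, A), 1.0 / A)                      # start from the uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)         # evaluate the current policy
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q_pi(s, a) one-step lookahead
        greedy = q.argmax(axis=1)                      # pi'(s) = argmax_a q_pi(s, a)
        pi_new = np.eye(A)[greedy]                     # deterministic greedy policy
        if np.array_equal(pi_new, pi):                 # policy stable: stop
            return greedy, v
        pi = pi_new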
Policy Iteration
Policy Improvement
Consider a deterministic policy, a = π(s).
We can improve the policy by acting greedily:
π'(s) = argmax_{a∈A} q_π(s, a)
This improves the value from any state s over one step:
q_π(s, π'(s)) = max_{a∈A} q_π(s, a) ≥ q_π(s, π(s)) = v_π(s)
It therefore improves the value function over entire trajectories, v_{π'}(s) ≥ v_π(s), as the expansion below shows.
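Unrolling the one-step inequality along the trajectory gives the improvement over whole returns:

\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s] \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s] = v_{\pi'}(s)
\end{aligned}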
If improvements stop, max_{a∈A} q_π(s, a) = q_π(s, π(s)) = v_π(s), then the Bellman optimality equation is satisfied, v_π(s) = max_{a∈A} q_π(s, a); therefore v_π(s) = v_*(s) for all s ∈ S, and π is an optimal policy.
Value Iteration
Principle of Optimality
An optimal policy can be subdivided into an optimal first action, followed by an optimal policy from the successor state.
Example: Shortest Path. Value iteration on a 4×4 gridworld with goal g in the top-left corner and reward −1 per step:

Problem:
g . . .
. . . .
. . . .
. . . .

V1:
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0

V2:
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1

V3:
 0 -1 -2 -2
-1 -2 -2 -2
-2 -2 -2 -2
-2 -2 -2 -2

V4:
 0 -1 -2 -3
-1 -2 -3 -3
-2 -3 -3 -3
-3 -3 -3 -3

V5:
 0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -4
-3 -4 -4 -4

V6:
 0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -5
-3 -4 -5 -5

V7:
 0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -5
-3 -4 -5 -6
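A small script that reproduces these tables (a sketch, assuming deterministic moves that stay in place at the grid edge, reward -1 per step, no discounting, and a terminal goal fixed at 0; names are illustrative):

import numpy as np

def shortest_path_values(n=4, last=7):
    v = np.zeros((n, n))                       # V1: initialised to zero
    print("V1:\n", v)
    for k in range(2, last + 1):
        v_new = np.full((n, n), -np.inf)
        for i in range(n):
            for j in range(n):
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # N, S, W, E
                    si = min(max(i + di, 0), n - 1)   # off-grid moves stay put
                    sj = min(max(j + dj, 0), n - 1)
                    v_new[i, j] = max(v_new[i, j], -1.0 + v[si, sj])
        v_new[0, 0] = 0.0                      # goal state g is terminal
        v = v_new
        print(f"V{k}:\n", v)
    return v

shortest_path_values()                         # V7 matches the final table above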
Value Iteration in MDPs
[Backup diagram: one-step lookahead from v_{k+1}(s) at the root state s, over rewards r, to leaf values v_k(s')]

v_{k+1}(s) = max_{a∈A} ( R_s^a + γ Σ_{s'∈S} P_{ss'}^a v_k(s') )

v_{k+1} = max_{a∈A} ( R^a + γ P^a v_k )
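The same backup in vectorized form, reusing the hypothetical P (A, S, S) and R (S, A) arrays from the earlier sketches; unlike policy iteration there is no explicit policy and no inner evaluation loop:

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # R^a + gamma P^a v, per action
        v_new = q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # v* and a greedy policy from it
        v = v_new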
Demo: http://www.cs.ubc.ca/~poole/demos/mdp/vi.html
Summary of DP Algorithms
Prediction, via the Bellman expectation equation: iterative policy evaluation.
Control, via the Bellman expectation equation plus greedy policy improvement: policy iteration.
Control, via the Bellman optimality equation: value iteration.
Extensions to Dynamic Programming
Prioritised Sweeping: use the magnitude of the Bellman error to guide state selection, backing up the state with the largest remaining error first.
Full-Width Backups: standard DP considers every successor state and action, which is effective for medium-sized problems but succumbs to the curse of dimensionality.
Sample Backups: back up a single sampled transition ⟨S, A, R, S'⟩ instead; this is model-free and breaks the curse of dimensionality through sampling.
Approximate Dynamic Programming: apply DP to a function approximator v̂(s, w); at each iteration k, estimate targets ṽ_k(s) from the Bellman optimality backup, then train the next value function v̂(·, w_{k+1}) using targets {⟨s, ṽ_k(s)⟩}.
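One fitted-value-iteration step might look like the following sketch, assuming a linear approximator v̂(s, w) = φ(s)ᵀw with a hypothetical feature matrix phi, least squares standing in for "train", and P, R as in the earlier sketches:

def fitted_value_iteration_step(P, R, phi, w, states, gamma=0.9):
    v_hat = phi @ w                                    # current estimate v̂(·, w_k)
    q = R + gamma * np.einsum('ast,t->sa', P, v_hat)   # Bellman optimality lookahead
    targets = q.max(axis=1)[states]                    # ṽ_k(s) at the sampled states
    w_next, *_ = np.linalg.lstsq(phi[states], targets, rcond=None)
    return w_next                                      # w_{k+1} fitted to the targets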
Contraction Mapping
The Bellman expectation backup operator T^π:
T^π(v) = R^π + γ P^π v
The Bellman optimality backup operator T^*:
T^*(v) = max_{a∈A} ( R^a + γ P^a v )
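Both operators are γ-contractions in the ∞-norm, which is what guarantees the convergence of the algorithms above; the standard argument for T^π:

\begin{aligned}
\|T^\pi(u) - T^\pi(v)\|_\infty
  &= \|(R^\pi + \gamma P^\pi u) - (R^\pi + \gamma P^\pi v)\|_\infty \\
  &= \gamma \, \|P^\pi (u - v)\|_\infty \\
  &\le \gamma \, \|u - v\|_\infty
\end{aligned}

since each row of P^π is a probability distribution. By the contraction mapping (Banach fixed-point) theorem, T^π has a unique fixed point, v_π, so iterative policy evaluation converges to it; the identical argument for T^* gives convergence of value iteration to v_*.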