Mainly based on "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto
http://www.cs.ualberta.ca/~sutton/book/the-book.html
Learning from Experience Plays a Role in …
Artificial Intelligence
Neuroscience
Artificial Neural Networks
What is Reinforcement Learning?
Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal.
Supervised Learning
[Diagram: Inputs → Supervised Learning System → Outputs; training info gives the desired (target) outputs]
Reinforcement Learning
[Diagram: Inputs → RL System → Outputs ("actions"); training info comes as evaluations ("rewards" / "penalties")]
Key Features of RL
The learner is not told which actions to take: trial-and-error search
Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
The need both to explore and to exploit
Complete Agent
Temporally situated
Continual learning and planning
The objective is to affect the environment
Environment is stochastic and uncertain
[Diagram: Agent ↔ Environment loop; the agent takes an action, the environment returns the next state and a reward]
Elements of RL
[Diagram: policy, reward, value, model of environment]
Policy: what to do
Reward: what is good
Value: what is good because it predicts reward
Model: what follows what
An Extended Example: Tic-Tac-Toe
[Figure: tic-tac-toe game tree, branching on x's moves and o's moves in alternation]
Assume an imperfect opponent: he/she sometimes makes mistakes.
An RL Approach to Tic-Tac-Toe
[Figure: a table of state values, $V(s)$ = estimated probability of winning: 1 for a won position, 0 for a lost or drawn position]
To pick our moves, look ahead one step from the current state to the various possible next states, and usually move to the state with the highest estimated value, occasionally making an "exploratory" move instead; a sketch of the resulting value update follows.
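The tic-tac-toe player in the book learns these values by nudging the value of a state toward the value of the state that follows it. A minimal Python sketch of that update, assuming board states are hashable and a hypothetical possible_moves(state) helper returns the candidate next states:

```python
import random

ALPHA = 0.1      # step-size parameter
EPSILON = 0.1    # fraction of exploratory moves

values = {}      # state -> estimated probability of winning

def value(state):
    # Unseen states start at 0.5: neither a sure win nor a sure loss.
    return values.get(state, 0.5)

def choose_move(state, possible_moves):
    """Pick a next state: usually greedy, occasionally exploratory."""
    candidates = possible_moves(state)  # hypothetical helper
    if random.random() < EPSILON:
        return random.choice(candidates)       # exploratory move
    return max(candidates, key=value)          # greedy move

def td_update(state, next_state):
    """Move V(state) a fraction ALPHA toward V(next_state)."""
    values[state] = value(state) + ALPHA * (value(next_state) - value(state))
```

In the book's scheme, values are backed up after greedy moves but not after exploratory ones.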
e.g., Generalization
[Figure: a table of values $V$ over states $s_1, s_2, s_3, \ldots, s_N$ with a "train here" region, illustrating that training on some states must generalize to others]
How is Tic-Tac-Toe Too Easy?
Some Notable RL Applications
TD-Gammon (Tesauro): world's best backgammon program
Elevator Control (Crites & Barto): high-performance down-peak elevator controller
Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls
…
TD-Gammon
Tesauro, 1992–1995
[Figure: the network is trained from the TD error, $V_{t+1} - V_t$]
Effective branching factor: about 400
Elevator Dispatching
Crites and Barto, 1996
10 floors, 4 elevator cars
Conservatively about $10^{22}$ states
Performance Comparison
Evaluative Feedback
The n-Armed Bandit Problem
Choose repeatedly from one of $n$ actions; each play $a_t$ yields a reward $r_t$ with
$$E\{r_t \mid a_t\} = Q^{*}(a_t)$$
These are the unknown action values.
The distribution of $r_t$ depends only on $a_t$.
The objective is to maximize the reward in the long term, e.g., over 1000 plays.
If $a_t^{*}$ is the greedy action (the one with the highest current estimate):
$a_t = a_t^{*}$: exploitation
$a_t \ne a_t^{*}$: exploration
You can't exploit all the time; you can't explore all the time.
You can never stop exploring, but you should always reduce exploring.
Action-Value Methods
Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$. The "sample average" estimate is
$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$$
$$\lim_{k_a \to \infty} Q_t(a) = Q^{*}(a)$$
ε-Greedy Action Selection
$$a_t = \begin{cases} a_t^{*} & \text{with probability } 1 - \varepsilon \\ \text{a random action} & \text{with probability } \varepsilon \end{cases}$$
the simplest way to balance exploration and exploitation.
10-Armed Testbed
n = 10 possible actions
Each $Q^{*}(a)$ is chosen randomly from a normal distribution $N(0, 1)$
Each reward $r_t$ is also normal: $N(Q^{*}(a_t), 1)$
1000 plays
Repeat the whole thing 2000 times and average the results
Evaluative versus instructive feedback
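A minimal simulation of one testbed run with ε-greedy selection and sample-average estimates (in their incremental form) can make the setup concrete; the constants follow the slide, the function names are ours:

```python
import random

def run_bandit(n=10, plays=1000, epsilon=0.1):
    q_star = [random.gauss(0, 1) for _ in range(n)]  # true action values ~ N(0,1)
    Q = [0.0] * n   # sample-average estimates
    k = [0] * n     # number of times each action was played
    total = 0.0
    for _ in range(plays):
        if random.random() < epsilon:
            a = random.randrange(n)                # explore
        else:
            a = max(range(n), key=lambda i: Q[i])  # exploit: greedy action
        r = random.gauss(q_star[a], 1)             # reward ~ N(Q*(a), 1)
        k[a] += 1
        Q[a] += (r - Q[a]) / k[a]                  # incremental sample average
        total += r
    return total / plays

# Average over 2000 independent bandit problems, as on the testbed.
print(sum(run_bandit() for _ in range(2000)) / 2000)
```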
e-Greedy Methods on the 10-Armed Testbed
Softmax Action Selection
Softmax methods grade action probabilities by estimated values. The most common uses a Gibbs (Boltzmann) distribution: choose action $a$ on play $t$ with probability
$$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$$
where $\tau$ is a "temperature" parameter.
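A sketch of this selection rule, where Q is a list of current action-value estimates and tau is the temperature (both names are ours):

```python
import math
import random

def softmax_action(Q, tau=0.5):
    """Sample an action with probability proportional to exp(Q[a] / tau)."""
    prefs = [math.exp(q / tau) for q in Q]
    total = sum(prefs)
    r = random.random() * total
    cumulative = 0.0
    for a, p in enumerate(prefs):
        cumulative += p
        if r <= cumulative:
            return a
    return len(Q) - 1  # guard against floating-point rounding
```

High temperatures make all actions nearly equiprobable; as tau approaches 0 the rule approaches greedy selection.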
Binary Bandit Tasks
Suppose there are just two actions and two possible rewards: success or failure. Then you might infer a target or desired action:
$$d_t = \begin{cases} a_t & \text{if success} \\ \text{the other action} & \text{if failure} \end{cases}$$
and then always play the action that was most often the target.
Call this the supervised algorithm. It works fine on deterministic tasks but is suboptimal if the rewards are stochastic.
Contingency Space
Linear Learning Automata
Incremental Implementation
The sample average can be computed incrementally, without storing all the rewards:
$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[ r_{k+1} - Q_k \right]$$
Computation
$$Q_{k+1} = \frac{1}{k+1} \sum_{i=1}^{k+1} r_i = \frac{1}{k+1}\left( r_{k+1} + \sum_{i=1}^{k} r_i \right) = \frac{1}{k+1}\left( r_{k+1} + k Q_k + Q_k - Q_k \right) = Q_k + \frac{1}{k+1}\left[ r_{k+1} - Q_k \right]$$
The step size can be constant or change with time.
Tracking a Non-stationary Problem
Choosing $Q_k$ to be a sample average is appropriate for a stationary problem; for a non-stationary one, better is
$$Q_{k+1} = Q_k + \alpha \left[ r_{k+1} - Q_k \right] \quad \text{for constant } \alpha,\ 0 < \alpha \le 1$$
Expanding the recursion:
$$Q_k = \alpha r_k + (1-\alpha) Q_{k-1} = \alpha r_k + \alpha(1-\alpha) r_{k-1} + (1-\alpha)^2 Q_{k-2} = (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i$$
an exponential, recency-weighted average; note that the weights sum to 1: $(1-\alpha)^k + \sum_{i=1}^{k} \alpha(1-\alpha)^{k-i} = 1$.
In general, if $\alpha_k(a)$ is the step-size parameter after the $k$-th application of action $a$, convergence is guaranteed if
$$\sum_{k=1}^{\infty} \alpha_k(a) = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2(a) < \infty$$
satisfied for $\alpha_k = \frac{1}{k}$ but not for fixed $\alpha$.
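To see why a constant step size tracks a drifting target while the 1/k sample average does not, here is a small comparison on a single non-stationary arm whose true value performs a random walk (an illustrative setup, not from the slides):

```python
import random

def track(steps=10000, alpha=0.1):
    q_true = 0.0
    Q_avg, Q_const, k = 0.0, 0.0, 0
    for _ in range(steps):
        q_true += random.gauss(0, 0.01)   # the true value drifts
        r = random.gauss(q_true, 1)
        k += 1
        Q_avg += (r - Q_avg) / k          # sample average: step size 1/k
        Q_const += alpha * (r - Q_const)  # exponential recency-weighted average
    return abs(Q_avg - q_true), abs(Q_const - q_true)

print(track())  # the constant-alpha estimate typically stays much closer
```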
Optimistic Initial Values
The Agent-Environment Interface
Agent and environment interact at discrete time steps, producing the trajectory
$$s_t,\ a_t,\ r_{t+1},\ s_{t+1},\ a_{t+1},\ r_{t+2},\ s_{t+2},\ a_{t+2},\ r_{t+3},\ s_{t+3},\ a_{t+3},\ \ldots$$
The Agent Learns a Policy
Policy at step $t$, $\pi_t$: a mapping from states to action probabilities;
$$\pi_t(s, a) = \text{probability that } a_t = a \text{ when } s_t = s$$
Getting the Degree of Abstraction Right
Time steps need not refer to fixed intervals of real time.
Actions can be low level (e.g., voltages to motors), or
high level (e.g., accept a job offer), “mental” (e.g., shift in
focus of attention), etc.
States can be low-level "sensations", or they can be
abstract, symbolic, based on memory, or subjective (e.g.,
the state of being "surprised" or "lost").
An RL agent is not like a whole animal or robot, which
may consist of many RL agents as well as other components.
The environment is not necessarily unknown to the
agent, only incompletely controllable.
Reward computation is in the agent’s environment
because the agent cannot change it arbitrarily.
Goals and Rewards
Returns
Returns for Continuing Tasks
Discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate:
shortsighted $0 \leftarrow \gamma \rightarrow 1$ farsighted
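A small helper that computes a discounted return directly from the definition (truncating the infinite sum at the end of a finite reward list):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum over k of gamma^k * r_{t+k+1}, for a finite reward list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: rewards (1, 1, 1) with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.9))
```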
An Example
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
We can cover all cases by writing
$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
Markov Decision Processes
Given any state $s$ and action $a$:
transition probabilities: $P_{ss'}^{a} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$
expected rewards: $R_{ss'}^{a} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$ for all $s, s' \in S$, $a \in A(s)$.
An Example Finite MDP
Recycling Robot
Recycling Robot MDP
$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{search}, \text{wait}\}$
$A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$
$R^{\text{search}}$ = expected no. of cans while searching
$R^{\text{wait}}$ = expected no. of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$
Transition Table
Value Functions
The value of a state is the expected return starting from that state and following $\pi$:
$$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s \right\}$$
The value of taking an action in a state under $\pi$:
$$Q^{\pi}(s, a) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s, a_t = a \right\}$$
Bellman Equation for a Policy $\pi$
So: $V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}$
Or, without the expectation operator:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$$
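Because this equation is linear in $V^{\pi}$, a small finite MDP can be evaluated exactly by solving a linear system. A NumPy sketch, assuming tabular arrays P[s, a, s'] (transition probabilities), R[s, a, s'] (expected rewards), and a stochastic policy pi[s, a] (all names are ours):

```python
import numpy as np

def evaluate_policy_exactly(P, R, pi, gamma=0.9):
    """Solve (I - gamma * P_pi) V = R_pi for V^pi."""
    n_states = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)        # state-to-state matrix under pi
    R_pi = np.einsum('sa,sat,sat->s', pi, P, R)  # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
```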
Derivation
More on the Bellman Equation
This is a set of linear equations, one for each state. The value function for $\pi$ is its unique solution.
Backup diagrams:
[diagram for $V^{\pi}$] [diagram for $Q^{\pi}$]
Grid World
[Figure: state-value function for the equiprobable random policy; $\gamma = 0.9$]
Golf
State is ball location
Reward of -1 for each stroke until the ball is in the hole
Value of a state?
Actions:
putt (use putter)
driver (use driver)
putt succeeds anywhere on the green
Optimal Value Functions
For finite MDPs, policies can be partially ordered: $\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$
There is always at least one policy (possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^{*}$.
Optimal policies share the same optimal state-value function:
$$V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for all } s \in S$$
Optimal policies also share the same optimal action-value function:
$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) \quad \text{for all } s \in S \text{ and } a \in A(s)$$
Bellman Optimality Equations for $V^{*}$ and $Q^{*}$
$$V^{*}(s) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{*}(s') \right]$$
$$Q^{*}(s, a) = E\left\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \right\} = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma \max_{a'} Q^{*}(s', a') \right]$$
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to $V^{*}$ is an optimal policy.
Therefore, given $V^{*}$, one-step-ahead search produces the long-term optimal actions.
What About Optimal Action-Value Functions?
Given $Q^{*}$, the agent does not even have to do a one-step-ahead search:
$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$$
Solving the Bellman Optimality Equation
Finding an optimal policy by solving the Bellman Optimality Equation
requires the following:
accurate knowledge of environment dynamics;
enough space and time to do the computation;
the Markov Property.
How much space and time do we need?
polynomial in number of states (via dynamic programming
methods; see later),
BUT, the number of states is often huge (e.g., backgammon has about $10^{20}$ states).
We usually have to settle for approximations.
Many RL methods can be understood as approximately solving the
Bellman Optimality Equation.
A Summary
Dynamic Programming
Policy Evaluation
Policy Evaluation: for a given policy $\pi$, compute the state-value function $V^{\pi}$
Recall:
$$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s \right\}$$
Bellman equation for $V^{\pi}$:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$$
a system of $|S|$ simultaneous linear equations
Iterative Methods
$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^{\pi}$$
Each iteration applies a full backup to every state, a "sweep":
$$V_{k+1}(s) \leftarrow \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V_k(s') \right]$$
Iterative Policy Evaluation
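A minimal sketch of the iterative algorithm with in-place sweeps, using the same assumed P, R, pi arrays as in the linear-solve sketch above:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    n_states, n_actions = pi.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):  # one sweep over all states
            v = V[s]
            V[s] = sum(pi[s, a] * P[s, a, t] * (R[s, a, t] + gamma * V[t])
                       for a in range(n_actions) for t in range(n_states))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:          # stop when the largest update is tiny
            return V
```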
A Small Gridworld
$$V(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V(s') \right]$$
Policy Improvement
Proof sketch
Policy Improvement Cont.
Policy Improvement Cont.
What if $V^{\pi'} = V^{\pi}$?
i.e., for all $s \in S$, $V^{\pi'}(s) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$?
But this is the Bellman optimality equation, so $V^{\pi'} = V^{*}$ and both $\pi$ and $\pi'$ are optimal.
Policy Iteration
$$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^{*} \to V^{*} \to \pi^{*}$$
alternating policy evaluation with policy improvement ("greedification")
Policy Iteration
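The improvement ("greedification") step makes the policy greedy with respect to the current value function; a sketch under the same assumed arrays:

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    # Q[s, a] = sum over s' of P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
    Q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
    return Q.argmax(axis=1)  # pi'(s) = argmax_a Q(s, a)
```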
Value Iteration
The full value-iteration backup combines policy improvement with a truncated policy evaluation in a single step:
$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V_k(s') \right]$$
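A sketch of this backup iterated to convergence, again under the assumed tabular arrays:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
        V_new = Q.max(axis=1)               # the value iteration backup
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new
```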
Asynchronous DP
Generalized Policy Iteration
Generalized Policy Iteration (GPI):
any interaction of policy evaluation and policy improvement,
independent of their granularity.
Efficiency of DP
Summary
Monte Carlo Methods
Monte Carlo Policy Evaluation
Goal: learn $V^{\pi}(s)$ from complete episodes of experience under $\pi$, by averaging the returns observed after visits to each state.
Blackjack example
Blackjack value functions
Backup diagram for Monte Carlo
Monte Carlo Estimation of Action Values (Q)
Monte Carlo Control
Monte Carlo Exploring Starts
Blackjack Example Continued
Exploring starts
Initial policy as described before
On-policy Monte Carlo Control
ε-soft policies: the greedy action gets probability $1 - \varepsilon + \varepsilon/|A(s)|$ and each non-max action gets $\varepsilon/|A(s)|$.
[Backup diagrams: Monte Carlo backs up a complete episode to the terminal state T; the simplest TD method backs up a single step, using $r_{t+1}$ and $s_{t+1}$]
        bootstraps   samples
DP      +            -
MC      -            +
TD      +            +
Driving home example: predicted total travel time revised en route:
State            Elapsed time (min)   Predicted time to go   Predicted total time
exit highway     20                   15                     35
behind truck     30                   10                     40
home street      40                   3                      43
arrive home      43                   0                      43
Batch updating: after each new episode, all previous episodes were treated as a batch, and the algorithm was trained on the batch until convergence. All repeated 100 times.
Suppose you observe 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is V(B)? What is V(A)?
V(B) = 6/8 = 0.75 under either method; batch Monte Carlo gives V(A) = 0 (the only return from A was 0), while batch TD(0) gives V(A) = 0.75 (A always transitioned to B).
TD prediction: estimate $Q^{\pi}$ for the current behavior policy $\pi$.
Summary:
Introduced one-step, tabular, model-free TD methods
Extend prediction to control by employing some form of GPI
On-policy control: Sarsa
Off-policy control: Q-learning (and also R-learning)
These methods bootstrap and sample, combining aspects of DP and MC methods; see the update sketches below.
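The two control updates differ only in the target; a sketch of both one-step updates, assuming a tabular Q indexed as Q[s][a]:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses the action actually taken next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target uses the best action in the next state."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```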
n-step TD:
2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
Task: 19-state random walk
The λ-return averages the n-step returns:
$$R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$
Backup using the λ-return:
$$\Delta V_t(s_t) = \alpha \left[ R_t^{\lambda} - V_t(s_t) \right]$$
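A sketch computing n-step returns and the λ-return from one recorded episode, assuming rewards[i] holds $r_{t+1+i}$ and values[i] holds $V_t(s_{t+i})$ (our layout):

```python
def n_step_return(rewards, values, n, gamma=1.0):
    """R_t^(n): n discounted rewards, then bootstrap from V(s_{t+n})."""
    steps = min(n, len(rewards))
    R = sum(gamma ** i * rewards[i] for i in range(steps))
    if steps == n and n < len(values):
        R += gamma ** n * values[n]   # bootstrap term (absent past episode end)
    return R

def lambda_return(rewards, values, lam=0.9, gamma=1.0):
    """R_t^lambda: (1 - lam) weighted sum of n-step returns, truncated at T."""
    T = len(rewards)
    R = (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, values, n, gamma)
                        for n in range(1, T))
    return R + lam ** (T - 1) * n_step_return(rewards, values, T, gamma)
```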