
Q-Learning in RL

with OpenAI Gym


JOO SOON LEE
2018. 01. 16

Center for Healthcare Robotics


School of Integrated Technology
Internship Program of Intelligent Robot Technology
Gwangju Institute of Science and Technology (GIST)
Contents

- OpenAI Gym

- Q-Learning with Table

- Improvement
OpenAI Gym
- Library for Reinforcement Learning
: Toolkit for developing and comparing reinforcement learning
algorithms

- In an Anaconda environment:
(env)$ pip install gym
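
A minimal sketch of creating the environment in Python (the environment id "FrozenLake-v0" is an assumption based on the Gym releases available at the time):

    import gym

    env = gym.make("FrozenLake-v0")   # 4x4 frozen lake environment
    state = env.reset()               # returns the start state (0)
    env.render()                      # prints the current grid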
OpenAI Gym
- FrozenLake (4x4)

- The agent travels from S to G.
- When the agent falls into H (hole), the game is over with Reward : 0.
- Reaching G (goal) gives Reward : +1.

[Environment] Frozen Lake (4x4 grid, states numbered 0-15)

    S  F  F  F        0  1  2  3
    F  H  F  H        4  5  6  7
    F  F  F  H        8  9 10 11
    H  F  F  G       12 13 14 15

    (S : start, F : frozen, H : hole, G : goal)
[Agent] --( Action : Left / Down / Right / Up )--> [Environment]
[Environment] --( new State, Reward : +1 )--> [Agent]

=> Each step returns (new_state, reward, done, info)
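
A minimal sketch of this interaction loop with random actions (env.step() returning the four-value tuple is the Gym API of that era; the loop itself is only an illustration):

    import gym

    env = gym.make("FrozenLake-v0")
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()                 # random Left/Down/Right/Up
        new_state, reward, done, info = env.step(action)   # one interaction step
        state = new_state
    print("episode finished, final reward:", reward)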


Q-function
[Agent]
State s_t --( Action a_t )--> State s_t+1 --( Action a_t+1 )--> State s_t+2

However, a reward is not given at every step,
so the Q-function (state-action value function) is needed.

Q(state, action)
  (1) State  (input)
  (2) Action (input)
  (3) Quality (reward)  (output)
Q-function

Q(state, action) : (1) State, (2) Action → (3) Quality (reward)

Ex)
  Q(s, Left)  = 0
  Q(s, Right) = 0.5
  Q(s, Up)    = 0
  Q(s, Down)  = 0.3

  → Action : Right = argmax Q(s, a) = π*(s)
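
A small sketch of this greedy choice with NumPy (the action order and Q values are simply the ones from the example above):

    import numpy as np

    q_values = np.array([0.0, 0.5, 0.0, 0.3])   # Q(s, Left), Q(s, Right), Q(s, Up), Q(s, Down)
    actions = ["Left", "Right", "Up", "Down"]

    best = np.argmax(q_values)                   # greedy policy pi*(s) = argmax_a Q(s, a)
    print(actions[best])                         # -> "Right"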
Finding, Learning the Q-function
The Q-function tells us which action can receive the highest reward.

* Assumption

For Q(s, a):
1. The agent is in state s.
2. When action a is taken, it receives reward r and moves to state s'.
3. Q(s', a') is already known in state s'.
4. Then Q(s, a) = r + max Q(s', a').
State, Action and Reward

[FrozenLake grid: an episode ends when the agent reaches a terminal state (H or G)]

Trajectory :  s0, a0, r1, s1, a1, r2, ... , s(n-1), a(n-1), rn, sn    (sn : terminal state)

Total Reward

  R   = r1 + r2 + r3 + ... + rn
  Rt  = rt + r(t+1) + r(t+2) + ... + rn = rt + R(t+1)
  Rt* = rt + max( R(t+1) )

  →  Q(s, a) = r + max Q(s', a')
Learning Q-function
Updating Q(s, a)

  Q̂(s, a) ← r + max Q̂(s', a')

[FrozenLake grid → Q-Table]
Learning a Q-Table with 16 (state) x 4 (action) entries
Q-Table Learning
1. Initialize the table with zeros : 16 (state) x 4 (action)
2. Update : Q̂(s, a) ← r + max Q̂(s', a')
3. Take random actions until the agent reaches the Goal
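
A minimal sketch of step 1, assuming the 16-state, 4-action FrozenLake environment:

    import numpy as np
    import gym

    env = gym.make("FrozenLake-v0")
    # one row per state, one column per action, all entries start at zero
    Q = np.zeros([env.observation_space.n, env.action_space.n])   # shape (16, 4)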
Q-Table Learning
Ex) Updating the table while the agent moves toward the Goal:

  Q(0, R)  = reward + max Q(1, a)  = 0 + 0 = 0
  Q(1, R)  = reward + max Q(2, a)  = 0 + 0 = 0
  ...
  Q(14, R) = reward + max Q(15, a) = 1 + 0 = 1

Only the move into the Goal gives a reward, so at first only Q(14, Right) becomes 1
in the 16 (state) x 4 (action) table.
Q-Table Learning

[Figure: after more episodes the value 1 propagates backward from the Goal, filling the Q-Table entries along a successful path]
Q-Table Learning – Algorithm
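
This slide originally showed the algorithm as code. A minimal sketch of the table-update loop described above (the episode count and the tie-breaking helper rargmax are illustrative assumptions, not taken from the slides):

    import numpy as np
    import gym

    def rargmax(values):
        # argmax with random tie-breaking: while every entry is still zero this
        # amounts to a random move (step 3), later it follows the maximum value
        best = np.flatnonzero(values == values.max())
        return np.random.choice(best)

    env = gym.make("FrozenLake-v0")
    Q = np.zeros([env.observation_space.n, env.action_space.n])   # step 1: 16 x 4 zeros

    for episode in range(2000):
        state = env.reset()
        done = False
        while not done:
            action = rargmax(Q[state, :])
            new_state, reward, done, _ = env.step(action)
            Q[state, action] = reward + np.max(Q[new_state, :])   # step 2: Q(s,a) <- r + max Q(s',a')
            state = new_state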
Q-Table Learning Result
[Result figures]
Q-Table Learning
Problem
: The agent never tries a different way because it always follows the maximum value,
so the Q-Table may not be updated completely.

Solution

: Sometimes the agent selects a random action instead of following the maximum value.

Exploit & Exploration


Q-Table Learning - Exactly
[ Random Action ]
1. Random Move
1) E-greedy
2) Decaying E-greedy

2. Random Noise
1) Random Noise
2) Decaying Noise
Q-Table Learning - Exactly
[ Random Action ]
1. Random Move
1) E-greedy
    # (np: numpy, Q: the Q-Table, env: the Gym environment from the earlier sketches)
    e = 0.1
    if np.random.rand(1) < e:
        action = env.action_space.sample()      # explore: random action
    else:
        action = np.argmax(Q[state, :])         # exploit: best known action

2) Decaying E-greedy

    for i in range(1000):
        e = 0.1 / (i + 1)                       # exploration rate decays with episode i
        if np.random.rand(1) < e:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])
Q-Table Learning - Exactly
[ Random Action ]
2. Random Noise

   Ex)  Q(s, a) : [ 0.5  0.6  0.3 ]  +  noise : [ 0.1  0.2  0.14 ]

1) Random Noise

    # random noise (Gaussian here) added to each Q value before taking the argmax
    action = np.argmax(Q[state, :] + np.random.randn(env.action_space.n))

2) Decaying Noise

    # the added noise shrinks as the episode index i grows
    for i in range(1000):
        action = np.argmax(Q[state, :] + np.random.randn(env.action_space.n) / (i + 1))
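
A tiny check of the noisy argmax with the numbers from the example above:

    import numpy as np

    q = np.array([0.5, 0.6, 0.3])
    noise = np.array([0.1, 0.2, 0.14])
    print(np.argmax(q + noise))   # -> 1, the second action is still chosen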
Q-Table Learning - Exactly
Problem
: Too many paths can reach the Goal with the same value, so the agent cannot tell which path is better.

Solution

: Multiply a discount constant onto max Q(s', a') so that later rewards count less.

Discount constant

  Q̂(s, a) ← r + γ × max Q̂(s', a')

  γ : discount constant (< 1)
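
A sketch of the discounted update inside the training loop (γ = 0.99 matches one of the settings on the result slides; the decaying-noise action selection is the one sketched on the previous slide):

    import numpy as np
    import gym

    env = gym.make("FrozenLake-v0")
    Q = np.zeros([env.observation_space.n, env.action_space.n])
    gamma = 0.99                                      # discount constant < 1

    for i in range(2000):
        state = env.reset()
        done = False
        while not done:
            # decaying random noise added to the Q values before the argmax
            action = np.argmax(Q[state, :] + np.random.randn(env.action_space.n) / (i + 1))
            new_state, reward, done, _ = env.step(action)
            Q[state, action] = reward + gamma * np.max(Q[new_state, :])
            state = new_state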


Q-Table Learning – Exactly - Results
* Adding Noise
[Result figures]

* Adding Noise – various discount factors
[Result figures for Discount constant = 0.50, 0.75 and 0.99]


Q-Table Learning – Exactly - Results
* Random Move - Decaying E-greedy
[Result figures]

* Random Move - Decaying E-greedy – Larger 'e' : More random moves
[Result figures]
Thank you
for your attention.
