OpenAI Gym
Improvement
OpenAI Gym
- Library for Reinforcement Learning
  : A toolkit for developing and comparing reinforcement learning algorithms
- Install inside an Anaconda environment:
  (env)$ pip install gym
OpenAI Gym
- FrozenLake (4x4)

[Environment]        [State]
S F F F               0  1  2  3
F H F H               4  5  6  7
F F F H               8  9 10 11
H F F G              12 13 14 15

(S: start, F: frozen surface, H: hole, G: goal)
Q-function
[Environment]
(1) State
(2) Action
(3) Quality (reward)
Q(state, action)
Ex)
Q(s, Left)  = 0
Q(s, Right) = 0.5
Q(s, Up)    = 0
Q(s, Down)  = 0.3
Action: Right = argmax_a Q(s, a) = π*(s)
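The argmax step above can be sketched in a few lines; the Q-values here are the hypothetical ones from the example:

```python
# Hypothetical Q-values for one state s, matching the example above.
q = {"Left": 0.0, "Right": 0.5, "Up": 0.0, "Down": 0.3}

# pi*(s) = argmax_a Q(s, a): pick the action with the largest Q-value.
best_action = max(q, key=q.get)
print(best_action)  # Right
```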
Finding / Learning the Q-function
To judge which action can receive the highest reward.
* Assumption
For Q(s, a):
1. The agent is in state s.
2. When action a is taken, the agent receives reward r and moves to state s'.
3. Q(s', a') exists for state s'.
4. Q(s, a) = r + max_a' Q(s', a')
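Assumption 4 can be illustrated with a minimal sketch; the state names, reward, and Q-values below are hypothetical:

```python
# A minimal sketch of assumption 4: Q(s, a) = r + max_a' Q(s', a').
# States "s0"/"s1" and all values here are hypothetical examples.
Q = {
    ("s1", "Left"): 0.0,
    ("s1", "Right"): 0.5,
}

r = 0.0  # reward received for taking "Right" in state s0 (assumed)
# Update Q(s0, Right) from the best action available in s1.
Q[("s0", "Right")] = r + max(Q[("s1", a)] for a in ("Left", "Right"))
print(Q[("s0", "Right")])  # 0.5
```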
State, Action and Reward
[FrozenLake grid: S = start, H = holes, G = goal (terminal state)]
Episode: s0, a0, r1, s1, a1, r2, ..., s_{n-1}, a_{n-1}, r_n, s_n
Total Reward: r1 + r2 + ... + r_n
Q(s, a) ← r + max_a' Q(s', a')
Q-Table Learning
[FrozenLake grid alongside a Q-Table of 16 (states) x 4 (actions)]
Q-Table Learning
1. Initialize the table with zeros.
2. Q(s, a) ← r + max_a' Q(s', a')
3. Take random actions until the agent arrives at the Goal.
Q-Table: 16 (states) x 4 (actions)
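Steps 1 and 2 can be sketched directly with a numpy array; the transition values below are hypothetical (action index 2 is assumed to be Right, following gym's FrozenLake convention):

```python
import numpy as np

# Step 1: Q-Table for FrozenLake 4x4 - 16 states x 4 actions, all zeros.
Q = np.zeros((16, 4))

# Step 2: one update, for an assumed transition (s, a) -> (s', r).
s, a, r, s2 = 14, 2, 1.0, 15   # hypothetical: Right from state 14 reaches G
Q[s, a] = r + np.max(Q[s2])    # Q(s, a) <- r + max_a' Q(s', a')
print(Q[14, 2])  # 1.0
```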
Updating Q(s, a)
Q-Table Learning
Q(0, R)  = reward + max Q(1, a)  = 0 + 0 = 0
Q(1, R)  = reward + max Q(2, a)  = 0 + 0 = 0
...
Q(14, R) = reward + max Q(15, a) = 1 + 0 = 1
Q-Table Learning
[After enough episodes, the value 1 has propagated backward from the Goal along the successful path.]
Q-Table Learning – Algorithm
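The algorithm can be sketched as follows. Since the slide's environment is not reproduced here, a tiny deterministic 4x4 stand-in with the same layout is hand-coded (this is an assumption, not gym's actual FrozenLake-v0 implementation):

```python
import numpy as np

# Stand-in for FrozenLake 4x4: S at state 0, holes at 5, 7, 11, 12, G at 15.
HOLES, GOAL = {5, 7, 11, 12}, 15

def step(s, a):
    """Move Left/Down/Right/Up (0-3) on the 4x4 grid; reward 1 only at G."""
    row, col = divmod(s, 4)
    if a == 0: col = max(col - 1, 0)      # Left
    if a == 1: row = min(row + 1, 3)      # Down
    if a == 2: col = min(col + 1, 3)      # Right
    if a == 3: row = max(row - 1, 0)      # Up
    s2 = row * 4 + col
    done = s2 in HOLES or s2 == GOAL
    return s2, (1.0 if s2 == GOAL else 0.0), done

Q = np.zeros((16, 4))                      # 1. table initialized with zeros
rng = np.random.default_rng(0)

for episode in range(2000):
    s, done = 0, False
    while not done:
        a = int(rng.integers(4))           # 3. random action until the Goal
        s2, r, done = step(s, a)
        Q[s, a] = r + np.max(Q[s2])        # 2. Q(s,a) <- r + max_a' Q(s',a')
        s = s2

print(Q[14, 2])  # Q(14, Right) should reach 1.0 after enough episodes
```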
Q-Table Learning Result
Q-Table Learning
Problem
: The agent never tries different paths, because it always follows the current maximum value.
So the Q-Table may never be fully updated.
Solution
: Occasionally the agent selects a random action instead of following the maximum value.
[ Random Action ]
1. Random Move
   1) E-greedy
   2) Decaying E-greedy
2. Random Noise
   1) Random Noise
   2) Decaying Noise
Q-Table Learning - Exploit & Exploration
[ Random Action ]
1. Random Move
1) E-greedy
e = 0.1
if rand(1) < e :
    action = random
else :
    action = argmax[ Q(s, a) ]
2) Decaying E-greedy
for i in range(1000) :
    e = 0.1 / (i + 1)
    if rand(1) < e :
        action = random
    else :
        action = argmax[ Q(s, a) ]
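The E-greedy pseudocode above can be made runnable as a small sketch; the Q-Table contents are hypothetical, with Q(0, Right) assumed to be the current best:

```python
import numpy as np

rng = np.random.default_rng(0)

def e_greedy(Q, s, e, rng):
    """With probability e take a random action, else the greedy one."""
    if rng.random() < e:
        return int(rng.integers(4))
    return int(np.argmax(Q[s]))

# Hypothetical Q-Table; action 2 is currently best in state 0.
Q = np.zeros((16, 4))
Q[0, 2] = 0.5

# Decaying E-greedy: e = 0.1 / (i + 1) shrinks as episodes progress.
chosen = [e_greedy(Q, 0, 0.1 / (i + 1), rng) for i in range(1000)]

print(e_greedy(Q, 0, 0.0, rng))  # e = 0: always the greedy action, 2
```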
Q-Table Learning - Exploit & Exploration
[ Random Action ]
2. Random Noise
ex) [ 0.5 0.6 0.3 ] + [ 0.1 0.2 0.14 ]
       Q(s, a)           noise
1) Random Noise
action = argmax[ Q(s, a) + random_value ]
2) Decaying Noise
for i in range(1000) :
    action = argmax[ Q(s, a) + random_value / (i + 1) ]
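A runnable sketch of both noise variants, reusing the example numbers from the slide (the decaying-noise loop draws Gaussian noise, an assumed choice of random_value):

```python
import numpy as np

rng = np.random.default_rng(0)
q_row = np.array([0.5, 0.6, 0.3])  # Q(s, a) values from the example above

# 1) Random Noise: add noise to the Q-values, then take the argmax.
noise = np.array([0.1, 0.2, 0.14])
action = int(np.argmax(q_row + noise))  # [0.6, 0.8, 0.44] -> action 1

# 2) Decaying Noise: divide the noise by (i + 1) so exploration fades.
for i in range(1000):
    a = int(np.argmax(q_row + rng.standard_normal(3) / (i + 1)))

print(action)  # 1
```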
Q-Table Learning - Discounted Future Reward
Problem
: Many different paths can reach the Goal, and the update r + max Q(s', a') values them all equally.
Solution
: Discount constant γ — discounting future rewards makes the agent prefer shorter paths.
Q(s, a) ← r + γ × max_a' Q(s', a')
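The discounted update can be sketched by backing up one step from the Goal; γ = 0.9 and the transition values are assumed for illustration:

```python
import numpy as np

# Sketch of the discounted update: Q(s,a) <- r + gamma * max_a' Q(s',a').
gamma = 0.9  # assumed discount constant
Q = np.zeros((16, 4))

# Suppose Q(14, Right) has already converged to 1.0; back up to state 13.
Q[14, 2] = 1.0
s, a, r, s2 = 13, 2, 0.0, 14     # hypothetical transition (Right: a = 2)
Q[s, a] = r + gamma * np.max(Q[s2])
print(Q[13, 2])  # 0.9 -- states farther from the Goal get smaller values
```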