
Hierarchical Problem Solving using Reinforcement Learning:

Methodology and Methods

by
Yassine Faihe

DISSERTATION

Submitted to the Faculty of Science in fulfillment
of the requirements for the degree of
"Docteur ès Sciences"

University of Neuchâtel
Department of Computer Science
Emile Argand 11
CH-2007 Neuchâtel
Switzerland

1999

This thesis is dedicated to my parents,


to whom I owe everything.

Acknowledgements

I am indebted to my advisor, Professor Jean-Pierre Muller, for his support and encouragement. While giving me extensive freedom to conduct my research, he has always provided me with useful advice and original ideas. My introduction to the field of reinforcement learning, as well as the direction taken by my research, came from his guidance.

I am grateful to my thesis committee. I gratefully acknowledge Paul Bourgine, who helped me to develop the mathematical aspects of my thesis, and thank him for the useful discussions we had in Paris. I would also like to thank Tony Prescott for his explanations, which have been of great help in my understanding of the action selection mechanism, as well as for his excellent comments about the dissertation. Thanks must also go to Dario Floreano and Killian Stoffel; their questions and remarks allowed me to clarify some important issues.

The interactions I have had with the CASCAD team members, Antoine, Abdelaziz, Eric, Fabrice, Luba, Luc-Laurent and Matthieu, have always been fruitful and of great interest.

Finally I would like to thank Carolina Badii, who proofread the draft of this dissertation and helped to improve the style of the written English.

Contents

1 Introduction
  1.1 Context and Motivation
  1.2 Claims and Proposals
  1.3 Organization of the Dissertation

2 Background: Reinforcement Learning
  2.1 Formulation
      2.1.1 Framework
      2.1.2 Markov Decision Processes
      2.1.3 Returns and Optimality Criteria
  2.2 Temporal Credit Assignment
      2.2.1 Value Functions and Optimal Policies
      2.2.2 Dynamic Programming
      2.2.3 Temporal Difference Learning
  2.3 Structural Credit Assignment
      2.3.1 Prediction with Function Approximator
      2.3.2 Neural networks
      2.3.3 Connectionist Reinforcement Learning
  2.4 Summary

3 The Postman Robot Problem
  3.1 The Postman Robot Task
  3.2 The Robot
  3.3 The Environment
      3.3.1 Assumptions
      3.3.2 Dynamics
      3.3.3 Testbed
  3.4 Summary

4 The Methodology
  4.1 A Methodology for Reinforcement Learning
      4.1.1 Pfeifer's Design Principles
      4.1.2 The BAT Methodology
      4.1.3 Discussion
  4.2 Agent-Environment Interaction Model
  4.3 The HPS Methodology
      4.3.1 Specification
      4.3.2 Decomposition
      4.3.3 Sensory-motor Loop's Design
      4.3.4 Coordination
      4.3.5 Evaluation and validation
  4.4 Case Study
      4.4.1 Specification and Decomposition
      4.4.2 Sensory-motor Loop's Design
      4.4.3 Coordination
      4.4.4 Evaluation and Validation
  4.5 Experiments
      4.5.1 Learning to Navigate
      4.5.2 Learning the Coordination
  4.6 Summary

5 The Coordination Problem
  5.1 Statement
  5.2 Related Work
      5.2.1 Hierarchical Q-Learning
      5.2.2 Feudal Q-Learning
      5.2.3 Hierarchical Distance to Goal
      5.2.4 W-Learning
      5.2.5 Compositional Q-Learning
      5.2.6 Macro Q-Learning
  5.3 The Selection Device
  5.4 Indexed Policy
      5.4.1 The Restless Bandits
      5.4.2 Discussion
  5.5 Experiments
  5.6 Summary

6 Conclusion
  6.1 Summary of contributions
  6.2 Practical Issues
  6.3 Future work
  6.4 Epilogue

List of Tables

3.1 The letter arrival patterns for each office.
4.1 Outline of the evaluation forms.
4.2 Steps needed by the robot to move between different places in the environment.

List of Figures

2.1 Reinforcement learning framework.
2.2 The policy iteration method builds a sequence of policies that converges to π*. PE and PI are respectively the policy evaluation and the policy improvement operators.
2.3 The policy iteration algorithm.
2.4 The value iteration algorithm.
2.5 Evolution of traces according to the state visits.
2.6 Algorithms of Q(λ) and Sarsa(λ) with either replacing or accumulating traces. For λ = 0 we obtain the Sarsa and one-step Q-learning algorithms.
2.7 Multi-layer perceptron network.
2.8 A connection between units of consecutive layers. The index of the layers decreases from the output to the input.
2.9 Algorithm of Sarsa(λ) with a connectionist function approximator.
2.10 An Elman network as used by Lin (1992).
3.1 The Nomad 200 robot.
3.2 The Nomad 200 development host and the letters flow and batteries simulator.
4.1 A general engineering methodology.
4.2 Agent-environment interaction model in a reinforcement learning framework.
4.3 Overview of the HPS methodology.
4.4 A graphical representation of the function to optimize.
4.5 The proposed generic sensory-motor loop.
4.6 The decomposition process of the postman robot problem.
4.7 The hierarchy of sub-behaviors obtained for the postman robot problem.
4.8 The input real value x is coarse coded into four values in [0, 1] which are 0.05, 0.55, 0.95, 1.0 and constitute a suitable input for a neural network.
4.9 The security zone defined in front of the robot.
4.10 The robot moving from one room to another.
4.11 The optimal path found between office3 and the charger.
4.12 Generalization abilities.
4.13 Reaction to an unexpected obstacle.
4.14 Number of steps needed to reach the charger starting from office3 for each trial.
4.15 Average penalties received during each trial.
4.16 The flat architecture used for the comparison with the hierarchical one.
4.17 Tables summarizing the performance of the coordination methods for different letters flow configurations.
4.18 Average of the quality criterion as a function of decision steps. The top graph concerns the periodic letters flow and the bottom graph the Poisson distribution letters flow.
5.1 A hierarchy of sensory-motor loops. The path of active sensory-motor loops on a given time step is represented in bold.
5.2 The hysteresis loop representing the behavior switching between the active and passive phases. I is the index of the candidate behavior and w is the width of the hysteresis.
5.3 Stacks' reward densities for γ = 0.9. Notice that the stack to pop is not necessarily the one with the highest value at its top.
5.4 Algorithm of RBI-learning.
5.5 Tables summarizing the performances of the coordination methods for different letters flow configurations.
5.6 Average of the quality criterion as a function of decision steps. The top graph concerns the periodic letters flow and the bottom graph the Poisson distribution letters flow.

Chapter 1

Introduction

1.1 Context and Motivation

This thesis is about the use of autonomous agents to solve problems. A problem is defined by an environment and a task to achieve. For instance, the environment could be a building with an elevator group and the task could be to control the elevator cars so as to reduce the passengers' waiting time (Crites 1996). An autonomous agent is an entity that has the ability to interact, without human intervention, with dynamic and unpredictable environments through sensing and acting devices. It can sense some aspects of the environment's state and influence its dynamics. During this interaction the agent exhibits a behavior. When tightly coupled with the environment, the agent is said to be embedded (Kaelbling 1993b), that is, being a part of this environment and having quick reactions to stimuli.

The classical approach to building embedded autonomous agents has been to program them. The designer uses his own expertise and a priori knowledge to anticipate all possible patterns of interaction, or analyzes and models the problem with differential equations. In the latter case the agent's controller is derived using methods developed in the field of control theory. However the increasing complexity of the problems, coming from difficult tasks or from non-linear, stochastic and unstructured environments, limits the applicability of such methods, even though adaptive methods to tune certain parameters of the controller do exist.

One way of overcoming this difficulty is autonomous programming, that is, making the agent acquire the necessary skills to achieve the given task from its interaction with the environment. Such a process is called learning and refers to the ability to modify one's knowledge according to experience. Apart from freeing the designer from explicitly programming the agent, learning is useful to maintain the agent's capability to perform a task under changing circumstances. Thus learning agents are more flexible, robust and able to cope with uncertainty and changing environments.

Early research on learning focused on supervised learning, where a tutor trains a system using input-output example pairs. Because such training examples are not always available, applications of supervised learning methods are restricted to pattern recognition, classification and function approximation. Reinforcement learning (RL) is applicable in more general and difficult cases. In the reinforcement learning paradigm, an agent learns how to achieve a given task from its own interaction with the environment. To do so it modifies its decision process on the basis of a feedback signal which is a scalar evaluation of its current performance. Positive and negative (high and low) values of this scalar correspond to rewards and punishments respectively. Thus the agent solves the problem when it behaves in a way that maximizes rewards and minimizes punishments. RL methods have proven to perform well on simple problems but become impractical to use when the problem's complexity increases.

The main motivation of the work presented in this dissertation is to scale up reinforcement learning to complex problems.
1.2 Claims and Proposals

Two closely linked reasons can explain why reinforcement learning fails to solve complex problems. First, the appropriate reinforcement function, that is, the one that makes the agent solve the problem when rewards are maximized, is not easy to find. So far there has been no systematic way to design such a function. The second reason is that the number of situations that the agent may encounter during its interaction with the environment increases with the complexity of the problem, so the search process is slowed down and becomes complicated. This phenomenon is called the curse of dimensionality.

We claim that a good understanding of the difference between a behavior and the mechanism that produces it, as well as the underlying consequences, will provide useful insights to overcome the above difficulties. We argue that:
- a behavior is the description, from an external observer's point of view, at different levels of abstraction, of a sequence of actions produced by the agent via its coupling with the environment;

- complex behaviors may be produced by the coordination of several simple sensory-motor mechanisms interacting with the environment (Braitenberg 1984; Pfeifer and Scheier 1998);

- solving a problem using an embedded agent amounts to designing the corresponding behavior;

- the design process of a behavior consists in transposing the observer's point of view into the agent's point of view.
Having these arguments in mind, it is now possible to tackle the obstacles that limit the scalability of reinforcement learning.

Let's start with the curse of dimensionality. When solving a problem requires the agent to perform a long sequence of actions, it becomes very hard to discover such a sequence, especially when the reinforcements are sparse, because the exploration is not guided. One may introduce local reinforcements (given by a teacher) to guide the exploration, or come up with efficient exploration strategies. One may also argue that the agent does not have the adequate actions, otherwise it would have solved the problem in a few decision steps (Martin 1998). Thus, we propose to add the missing actions to the agent's repertoire by allowing it to learn them. Actually these new actions correspond to skills that solve parts of the problem. So it is necessary to perform a problem decomposition in order to identify the needed skills. If the skills found are still too difficult to learn, the corresponding sub-problems are decomposed once again. The resulting agent architecture is a hierarchically structured set of skills where each skill is learned using previously acquired ones.

The direct consequence of this approach is that we will have to design several simple reinforcement functions (one for each sub-problem) rather than a single global and complex one. However the necessity to have a means of describing behaviors still remains.
In order to systematize the approach mentioned above and manage the overall design process, a methodology is required. The issues that should be addressed by such a methodology concern:

- the analysis of the problem and the specification of the desired behavior;

- the decomposition of the problem into sub-problems and the learning of the corresponding skills;

- the coordination of these skills to solve the global problem.

A methodology that meets these requirements, as well as methods to address the above issues, are proposed in this thesis and constitute our main contribution.
1.3 Organization of the Dissertation

In this thesis we investigate the methodological aspect of hierarchical problem solving using agents that learn by reinforcement. The next chapter defines the reinforcement learning problem. It provides a mathematical formulation of the problem and reviews techniques to solve it. Chapter 3 presents the postman robot problem and describes the testbed used in this work. In chapter 4 a new agent design methodology is introduced with details of its components. One particular component of the methodology, the coordination, is addressed in depth in chapter 5. Both chapters 4 and 5 report and analyze the experimental results we have obtained. Finally, in chapter 6, we summarize the contribution of our work, discuss some practical issues, and suggest directions for future research.

Chapter 2

Background: Reinforcement Learning

In this chapter we introduce the reinforcement learning problem. We first set up the framework by defining how the agent interacts with the environment, and formalize the problem as the optimal control of a Markov decision process. The solutions are presented from the credit assignment point of view. Both the temporal and the structural credit assignment problems are described and state-of-the-art methods to solve them are reviewed.
2.1 Formulation

2.1.1 Framework

The agent, the environment it interacts with and the task it has to achieve are the components that define the reinforcement learning framework (figure 2.1). The interaction between the agent and the environment is continuous. On one hand the agent's decision process selects actions according to the perceived situations of the environment, and on the other hand these situations evolve under the influence of the actions. Each time the agent performs an action, it receives a reward. A reward is a scalar value that tells the agent how well it is fulfilling the given task. To be formal, let's denote x a representation of the environment's state as it is perceived by the agent, a the selected action, and r the received reward. The agent's decision process is called a policy and is a mapping from states to actions. A learning agent modifies its policy according to its experience and to its goal, which is to maximize the cumulated rewards over time. Such an amount is called the return and will be explained later.

Figure 2.1: Reinforcement learning framework: the agent receives perceptions and a reinforcement signal defined by the task, and acts on the environment.

Because of its flexibility and its abstraction, the reinforcement learning framework can be used to specify several kinds of problems. Actually, time steps at which an interaction occurs have to be seen as decision-making steps rather than fixed ticks of real time, and states and actions may range from low-level interaction devices to high-level descriptions and decisions.
2.1.2 Markov Decision Processes

A Markov decision process (MDP) consists of a set of states X and a set of actions A which allow movement from one state to another. In each state x only a subset of actions A(x) ⊆ A is available. The dynamics of the process is governed by a set of transition matrices. There is one matrix P(a) for each action a, where each element P_xy(a) denotes the probability of a transition to state y given x and a. If an action a is not available in state x then P_xy(a) = 0. At the end of each transition a reward r = R(x, a, y) is generated. The immediate evaluation of a transition is generally expressed by the expected reward:

R(x, a) = E[R(x, a, y)] = \sum_{y \in X} P_{xy}(a) R(x, a, y).   (2.1)

In this thesis we assume that the process is discrete and that both X and A are finite.
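As a concrete (and purely illustrative) sketch, not taken from the thesis, a finite MDP can be stored as one transition matrix per action plus a table of transition rewards; the expected reward of equation 2.1 is then a probability-weighted sum. All values below are hypothetical placeholders.

import numpy as np

# Hypothetical finite MDP with |X| = 3 states and |A| = 2 actions.
# P[a, x, y] holds the transition probability P_xy(a); each row sums to 1.
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]],
              [[0.1, 0.0, 0.9],
               [0.5, 0.5, 0.0],
               [0.0, 0.3, 0.7]]])
# R3[x, a, y] holds the transition reward R(x, a, y).
R3 = np.random.default_rng(0).normal(size=(3, 2, 3))

# Expected immediate reward R(x, a) = sum_y P_xy(a) R(x, a, y)  (equation 2.1).
R = np.einsum('axy,xay->xa', P, R3)
print(R.shape)   # (3, 2): one expected reward per state-action pair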

Policy

A policy is a mapping π_t : X → A which associates an action a with each state x. We notice that a policy depends not only on the state of the process but also on the time step at which the decision is made. Here we will focus on policies that specify actions as a function of the state only. Such policies are called stationary and are denoted π.
Markov Property

In general the outcome of a process, in terms of states and rewards, at a given time step depends on the prior sequence of states, or past history, H_t = {x_t, a_t, x_{t-1}, a_{t-1}, ..., x_0, a_0}. When it is possible to predict the next state and the next expected reward only on the basis of the current state, the process is said to have the Markov property, or to be Markovian. Formally the Markov property can be expressed by the following equality:

Pr(x_{t+1} = x, r_{t+1} = r \mid H_t) = Pr(x_{t+1} = x, r_{t+1} = r \mid x_t, a_t).   (2.2)

One can notice the importance of the Markov property in the sense that the decision is only a function of the current state. The case where an agent has to deal with non-Markov states, either because it interacts with a non-Markov environment or because of its incomplete perceptions, will be discussed later.
2.1.3 Returns and Optimality Criteria

An MDP controlled by a policy π generates a sequence of rewards R_π = {r_1, r_2, r_3, ..., r_n, ...}. To order different policies we can define an optimality criterion on this sequence of rewards. Roughly speaking, an optimal policy optimizes the total amount of reward generated over a long run period:

r_1 + r_2 + r_3 + ... + r_n + ...   (2.3)

Such a measure of long-term reward is called the return (Barto et al. 1990). Because of the stochasticity of the controlled process we will consider the expected value of the return. Moreover we introduce the following generic notation for the return:

E_\pi \left[ \sum_{t=0}^{N} \omega(t) r_t \right],   (2.4)

where E_π is the expectation operator when policy π is used, N is the horizon of the return and ω is a weighting factor. Several optimality criteria have been investigated in the literature (Mahadevan 1996), but all can be expressed in the above form. Here we will focus on the case where N → ∞ and ω(t) = γ^t, where 0 ≤ γ < 1, which represents the expected discounted total reward. The discount factor γ acts as an attenuator: one unit of reward received at time t + τ is equivalent to γ^τ units at time t. This optimality criterion is attractive because of its mathematical properties, which make the computation of the optimal policy more tractable: the return value is finite (because 0 ≤ γ < 1 and as long as the reward function is bounded) and the optimal infinite-horizon policy is always stationary.
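As a small numerical illustration (not from the thesis), the discounted return of a finite reward sequence can be computed directly from the definition; the rewards and the value γ = 0.9 below are arbitrary.

gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]      # hypothetical rewards r_0 ... r_4
ret = sum(gamma ** t * r for t, r in enumerate(rewards))
print(ret)   # 0.9**2 * 1 + 0.9**4 * 5 = 4.0905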
2.2 Temporal Credit Assignment

The temporal credit assignment problem (TCA) consists in attributing credit or blame to individual actions on the basis of the result of a whole plan of actions, and is a concern for most real decision problems. Indeed some actions may generate a low immediate payoff but can contribute to producing higher rewards in the future. Sometimes several actions have to be performed before getting a reward: the reward is said to be delayed. In this section we review dynamic programming (DP) and temporal difference (TD) learning, which are techniques that solve the TCA problem. Although DP algorithms can compute optimal policies for MDPs, they are not very useful for solving reinforcement learning problems because an accurate model of the environment is usually not available. However dynamic programming provides important theoretical foundations for understanding the function of temporal difference methods.

2.2.1 Value Functions and Optimal Policies

A widely used approach to deal with delayed rewards is to estimate the worth of a state or a decision in terms of future expected rewards. Given an optimality criterion we can define a value function for a policy π, V^π : X → IR, as a mapping from states to real values. We have:

V^\pi(x) = E_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x \right],   (2.5)

which expresses the expected return when the policy π is followed starting from state x. In the same way we can define a utility function for policy π, Q^π : X × A → IR, mapping state-action pairs to real values. Q^π(x, a) expresses the utility of performing action a in state x and following policy π thereafter:

Q^\pi(x, a) = E_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x, a_0 = a \right].   (2.6)

Given two policies π_1 and π_2, we say that π_1 is better than (or an improvement of) π_2 if the value function of the first policy is at least equal to that of the second policy in every state, and is strictly greater in at least one state. Hence the optimal policy π* is the one which cannot be improved any more. Its value function is V*. Many optimal policies may exist but they all have the same optimal value function V*. We will now see how such optimal policies can be induced.

2.2.2 Dynamic Programming

The starting point of dynamic programming comes from equation 2.5 written in a recursive form:

V^\pi(x) = R(x, \pi(x)) + \gamma \sum_{y \in X} P_{xy}(\pi(x)) V^\pi(y),   (2.7)

which, for the optimal policy π*, becomes:

V^*(x) = R(x, \pi^*(x)) + \gamma \sum_{y \in X} P_{xy}(\pi^*(x)) V^*(y).   (2.8)

As all optimal policies have the same optimal value function V*, and V*(x) ≥ V^{π_i}(x) for all x ∈ X and for all policies π_i, we obtain:

V^*(x) = \max_{a \in A(x)} \left[ R(x, a) + \gamma \sum_{y \in X} P_{xy}(a) V^*(y) \right].   (2.9)

This equation is known as the Bellman optimality equation (or Bellman's equation for π*). When V* is known, the optimal policy can be easily derived:

\pi^*(x) = \arg\max_{a \in A(x)} \left[ R(x, a) + \gamma \sum_{y \in X} P_{xy}(a) V^*(y) \right].   (2.10)

There are several computational techniques to solve the Bellman equation. Here we will limit ourselves to two of them: value iteration and policy iteration. But let's first see how the evaluation of a given policy can be computed.
Policy Evaluation

Let's define V_n^π(x) as the expected return if policy π is followed for n steps only, starting from state x. For n = 1, the expected return is simply the expected immediate reward when action a = π(x) is performed:

V_1^\pi(x) = R(x, a).   (2.11)

Assuming that V_1^π is known and that the next observed state when a is performed in x is y with probability P_xy(a), we have for all x ∈ X:

V_2^\pi(x) = R(x, a) + \gamma \sum_{y \in X} P_{xy}(a) V_1^\pi(y).   (2.12)

Similarly we can determine V_3^π from V_2^π, V_4^π from V_3^π, and in the general case V_{n+1}^π from V_n^π:

V_{n+1}^\pi(x) = R(x, \pi(x)) + \gamma \sum_{y \in X} P_{xy}(\pi(x)) V_n^\pi(y),   (2.13)

for all x ∈ X. After a large number of iterations N over all states, V_N^π(x) can be considered a good approximation of V^π(x), given an arbitrary initial V_0^π(x).
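The sketch below is a minimal, illustrative implementation of this iterative policy evaluation on a randomly generated toy MDP; the arrays P, R and the policy are hypothetical stand-ins for the quantities defined above.

import numpy as np

# Iterative policy evaluation (equation 2.13) on a hypothetical toy MDP.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, eps = 4, 2, 0.9, 1e-6
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, x, :] sums to 1
R = rng.normal(size=(n_states, n_actions))                         # expected rewards R(x, a)
pi = rng.integers(n_actions, size=n_states)                        # a stationary policy

V = np.zeros(n_states)                                             # arbitrary initial V_0
while True:
    # V_{n+1}(x) = R(x, pi(x)) + gamma * sum_y P_xy(pi(x)) V_n(y)
    V_new = R[np.arange(n_states), pi] + gamma * P[pi, np.arange(n_states)] @ V
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new
print(V)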


Policy Iteration

The policy iteration method consists of two procedures: policy evaluation and policy improvement. Starting from any initial policy π_0, we successively evaluate it, obtaining V^{π_0}, improve it, obtaining π_1, and so on until the optimal policy π* is reached (figure 2.2). Once a policy π_n is evaluated, the result V^{π_n} is used to make the improvement.

Figure 2.2: The policy iteration method builds a sequence π_0 –PE→ V^{π_0} –PI→ π_1 –PE→ V^{π_1} → ... → π* –PE→ V^{π*} of policies that converges to π*. PE and PI are respectively the policy evaluation and the policy improvement operators.

The following update is applied for all x ∈ X:

\pi_{n+1}(x) \leftarrow \arg\max_a \left[ R(x, a) + \gamma \sum_{y \in X} P_{xy}(a) V^{\pi_n}(y) \right].   (2.14)

Figure 2.3 shows the policy iteration algorithm.

    π ← arbitrary policy
    V ← arbitrary function
    repeat
        {Policy evaluation}
        repeat
            for each x ∈ X do
                V(x) ← R(x, π(x)) + γ Σ_{y∈X} P_xy(π(x)) V(y)
            end for
        until max_{x∈X} |V_n(x) − V_{n−1}(x)| < ε
        {Policy improvement}
        for each x ∈ X do
            π(x) ← argmax_a [ R(x, a) + γ Σ_{y∈X} P_xy(a) V(y) ]
        end for
    until π is stable

Figure 2.3: The policy iteration algorithm.
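For illustration only, the policy iteration loop of figure 2.3 can be sketched as follows on the same kind of hypothetical toy MDP; the evaluation step reuses the scheme shown above.

import numpy as np

# Policy iteration (figure 2.3) on a hypothetical toy MDP.
rng = np.random.default_rng(1)
n_states, n_actions, gamma, eps = 4, 2, 0.9, 1e-6
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))

def evaluate(pi):
    """Iterative policy evaluation, as in the previous sketch."""
    V = np.zeros(n_states)
    while True:
        V_new = R[np.arange(n_states), pi] + gamma * P[pi, np.arange(n_states)] @ V
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
while True:
    V = evaluate(pi)
    # Improvement: pi'(x) = argmax_a [R(x, a) + gamma * sum_y P_xy(a) V(y)]  (equation 2.14)
    pi_new = np.argmax(R + gamma * np.einsum('axy,y->xa', P, V), axis=1)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(pi, V)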


Value Iteration

The policy evaluation phase in the policy iteration algorithm needs a lot of computation and has to be performed after each improvement. Instead of making an improvement after each policy evaluation, it is possible to make it after only one backup of each state. This procedure amounts to directly computing the optimal value function using equation 2.9. The backup operation becomes:

V_{n+1}(x) = \max_a \left[ R(x, a) + \gamma \sum_{y \in X} P_{xy}(a) V_n(y) \right],   (2.15)

for all x ∈ X. The complete value iteration algorithm is given in figure 2.4.

    V_0 ← arbitrary function
    {Compute the optimal value function}
    repeat
        for each x ∈ X do
            V_{n+1}(x) ← max_a [ R(x, a) + γ Σ_{y∈X} P_xy(a) V_n(y) ]
        end for
    until max_{x∈X} |V_{n+1}(x) − V_n(x)| < ε
    {Compute the optimal policy}
    for each x ∈ X do
        π(x) ← argmax_a [ R(x, a) + γ Σ_{y∈X} P_xy(a) V_{n+1}(y) ]
    end for

Figure 2.4: The value iteration algorithm.
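A minimal, illustrative version of figure 2.4 on a hypothetical toy MDP could look as follows; the greedy policy is extracted once the backups have converged.

import numpy as np

# Value iteration (equation 2.15 and figure 2.4) on a hypothetical toy MDP.
rng = np.random.default_rng(2)
n_states, n_actions, gamma, eps = 4, 2, 0.9, 1e-6
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))

V = np.zeros(n_states)
while True:
    Q = R + gamma * np.einsum('axy,y->xa', P, V)   # one backup of every state
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new
pi_star = Q.argmax(axis=1)                         # greedy policy w.r.t. the final values
print(V, pi_star)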

Asynchronous Dynamic Programming

The algorithms presented in the previous section are called synchronous dynamic programming algorithms because at each iteration the value function is updated for the entire state space. In the case where the state space is very large, the solution of the MDP becomes computationally intractable. Asynchronous dynamic programming relaxes this rule and allows backups to be applied to only a subset of the state set, which may be a singleton (Gauss-Seidel DP) and may vary in each iteration. Let X_n ⊆ X be the set of states whose value functions will be backed up during iteration stage n = 0, 1, ... The backups are done as follows:

V_{n+1}(x) = \begin{cases} \max_a \left[ R(x, a) + \gamma \sum_{y \in X} P_{xy}(a) V_n(y) \right] & \text{if } x \in X_n, \\ V_n(x) & \text{otherwise.} \end{cases}   (2.16)

The choice of X_n is crucial for the convergence to V*. Ideally each state should be backed up infinitely often, which means that it should be contained in all the subsets X_n.

Adaptive Real-Time Dynamic Programming

The relaxation introduced by asynchronous DP is very useful when the computation of the optimal policy occurs while interacting with an unknown process. In this case the states are backed up as they are encountered. Adaptive real-time dynamic programming (ARTDP) (Barto et al. 1995) relies on this principle to perform on-line control of a process. It involves the estimation of the process model, the policy computation, and the control. Each time a transition is observed, the estimate of the transition probability matrices P̂(a) is updated:

\hat{P}_{xy}(a) = \frac{n_{xy}(a)}{n_x(a)},   (2.17)

where n_xy(a) is the number of transitions from x to y when a is performed, and n_x(a) = Σ_{y∈X} n_xy(a) is the number of times a was performed in x. The estimate of the immediate reward R̂(x, a) is simply updated with the average of the immediate rewards observed for this state-action pair. After an infinite number of updates the estimated model of the process converges to the true process. At each time step t the optimal value function is estimated using the current process model estimate and the previous optimal value function estimate V̂_{t−1}. With an accurate model only one backup would be necessary and V̂_t would be equal to V*. However, in the present case such a model is not available and there is little variation between two consecutive estimates of the model. For these reasons active exploration mechanisms have been investigated (Barto and Singh 1990) to speed up the identification phase.
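The model estimation step of ARTDP can be sketched with simple transition counters and running reward averages, as below; the function and variable names are illustrative, not the thesis's.

from collections import defaultdict

# Empirical model estimation (equation 2.17) from observed transitions.
n_xy = defaultdict(int)      # n_xy[(x, a, y)]: transitions from x to y under a
n_x = defaultdict(int)       # n_x[(x, a)]: number of times a was performed in x
r_sum = defaultdict(float)   # cumulative reward observed for (x, a)

def observe(x, a, r, y):
    """Update the estimated model after the transition <x, a, r, y>."""
    n_xy[(x, a, y)] += 1
    n_x[(x, a)] += 1
    r_sum[(x, a)] += r

def P_hat(x, a, y):
    return n_xy[(x, a, y)] / n_x[(x, a)] if n_x[(x, a)] else 0.0

def R_hat(x, a):
    return r_sum[(x, a)] / n_x[(x, a)] if n_x[(x, a)] else 0.0

observe(0, 1, 0.5, 2)
observe(0, 1, 0.0, 0)
print(P_hat(0, 1, 2), R_hat(0, 1))   # 0.5 and 0.25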


2.2.3 Temporal Difference Learning

Temporal difference learning (Sutton 1988) methods are concerned with solving a prediction problem and, unlike DP methods, do not need a model of the environment's dynamics. Such methods are referred to as direct or model-free methods, as opposed to indirect methods like ARTDP or model-based methods like DP. In this section we present the general principle behind the prediction of the value function of an MDP and then extend it to the control problem. Finally we will see how the efficiency of TD methods can be improved with eligibility traces, and review some popular TD algorithms.
Prediction

For a Markov decision process and a policy π, the prediction problem concerns the value function V^π. Let V̂^π(x) be an estimate of V^π(x). Given an experience ⟨x, a, r, y⟩ and the estimates V̂^π(x) and V̂^π(y) of each of these states, it appears, relying on equation 2.7, that r + γ V̂^π(y) is a better estimate of V^π(x) than V̂^π(x). The temporal difference error (TD-error)

\Delta \hat{V}^\pi = r + \gamma \hat{V}^\pi(y) - \hat{V}^\pi(x)   (2.18)

is simply the difference between these two estimates, and is used to update the previous estimate of V^π. The construction of an estimate of V^π directly from the observation of successive states and rewards is done using the following update rule:

\hat{V}^\pi(x) \leftarrow \hat{V}^\pi(x) + \alpha \Delta \hat{V}^\pi,   (2.19)

where 0 < α ≤ 1 is the learning rate. Equation 2.19 is known as the TD(0) update. Each time the state x is visited and the above update is applied, the estimate V̂^π(x) becomes closer to V^π(x).
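A minimal tabular TD(0) sketch, assuming a hypothetical episodic environment with a reset/step interface and a fixed policy, is given below; it is only an illustration of update 2.19.

from collections import defaultdict

# Tabular TD(0) prediction (equation 2.19); env and policy are hypothetical.
alpha, gamma = 0.1, 0.9
V = defaultdict(float)                    # arbitrary initial value estimates

def td0_episode(env, policy):
    """Run one episode and update V along the observed trajectory."""
    x = env.reset()
    done = False
    while not done:
        a = policy(x)
        y, r, done = env.step(a)
        target = r if done else r + gamma * V[y]
        V[x] += alpha * (target - V[x])   # V(x) <- V(x) + alpha * TD-error
        x = y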
Control

To use TD methods for the control problem, the prediction has to be made on the utility function Q^π(x, a) rather than on the value function V^π(x). On the other hand, we need to expand the experience mentioned above by adding b, which is the action chosen when y is observed. At the end of each state-action pair transition ⟨(x, a), r, (y, b)⟩, the same update rule as for V^π(x) is applied to estimate Q^π(x, a):

\hat{Q}^\pi(x, a) \leftarrow \hat{Q}^\pi(x, a) + \alpha \Delta \hat{Q}^\pi,   (2.20)

where ΔQ̂^π = r + γ Q̂^π(y, b) − Q̂^π(x, a). We notice that there is a mutual influence between the policy π and the utility function Q^π: a new update of Q^π changes π, which then modifies Q^π, and so on until both of them become optimal. Algorithms based on this update rule are called Sarsa (because of the tuple State, Action, Reward, State, Action); the rule was first investigated by Rummery and Niranjan (1994), who called it Modified Q-learning. Q-learning (Watkins 1989) is another algorithm based on TD-learning, which directly estimates the optimal utility function Q*. It uses the following update rule:

\hat{Q}^*(x, a) \leftarrow \hat{Q}^*(x, a) + \alpha \Delta \hat{Q}^*,   (2.21)

where

\Delta \hat{Q}^* = r + \gamma \max_b \hat{Q}^*(y, b) - \hat{Q}^*(x, a).   (2.22)

Unlike Sarsa, Q-learning does not need to know the actual action that will be executed during the next experience; it simply takes the greedy action with respect to y and the current estimate of Q*. Q-learning is qualified as an asynchronous or off-policy algorithm because it can learn the utility function of one policy (the optimal one) while following another (by observing the behavior of another agent, for instance). The convergence of these algorithms is guaranteed if all state-action pairs are visited an infinite number of times and the learning rate is decayed adequately. Moreover, the Sarsa algorithm requires that the control policy converge little by little towards a greedy policy.
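The sketch below illustrates the tabular Q-learning update (equations 2.21-2.22) with an ε-greedy behavior policy; the environment interface and the action set are hypothetical. Replacing the max over b with the action actually selected in y turns it into the Sarsa update (equation 2.20).

import random
from collections import defaultdict

# Tabular Q-learning (equations 2.21-2.22) with epsilon-greedy exploration.
alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = [0, 1, 2]                       # hypothetical action set
Q = defaultdict(float)                    # Q[(x, a)], arbitrary initial values

def choose(x):
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(x, a)])         # exploit

def q_learning_episode(env):
    x = env.reset()
    done = False
    while not done:
        a = choose(x)
        y, r, done = env.step(a)
        target = r if done else r + gamma * max(Q[(y, b)] for b in actions)
        Q[(x, a)] += alpha * (target - Q[(x, a)])        # off-policy backup
        x = y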
Eligibility Traces

One way of improving learning and dealing more efficiently with the temporal credit assignment is to update not only the value function of the state which is currently visited, but also those of the states that have led to it. To do so, we keep a record of the degree of recency of the visited states: their eligibility traces. Thus the estimate of the value function is updated for each state according to its eligibility. The update rule is

\hat{V}^\pi(x) \leftarrow \hat{V}^\pi(x) + \alpha \Delta \hat{V}^\pi e(x), \quad \text{for each } x \in X,   (2.23)

where e(x) is the eligibility of state x. It is updated on-line either by accumulating traces,

e(x) \leftarrow \begin{cases} \gamma \lambda e(x) + 1 & \text{if } x \text{ is the current state,} \\ \gamma \lambda e(x) & \text{otherwise,} \end{cases}   (2.24)

or by replacing traces,

e(x) \leftarrow \begin{cases} 1 & \text{if } x \text{ is the current state,} \\ \gamma \lambda e(x) & \text{otherwise,} \end{cases}   (2.25)

where 0 ≤ λ ≤ 1 is the trace-decay factor. The difference between these two eligibility trace mechanisms is emphasized in figure 2.5. Basically, accumulating traces take into account both the frequency and the recency of the state, whereas replacing traces only consider the recency. Both traces decay exponentially according to λ when the state is no longer visited. Recent work has reported the superiority of replacing traces (Singh and Sutton 1996).

Figure 2.5: Evolution of traces according to the state visits (accumulating trace vs. replacing trace).

Prediction algorithms based on update 2.23 are called TD(λ) and are a generalization of TD(0). The way we introduced the eligibility traces is called the backward view of TD(λ) (Sutton and Barto 1998). It is intuitive and can be directly implemented.
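As an illustration of the backward view, a tabular TD(λ) sketch with replacing traces (equations 2.23 and 2.25) might look as follows; the environment and policy interfaces are again hypothetical.

from collections import defaultdict

# Tabular TD(lambda), backward view, with replacing traces.
alpha, gamma, lam = 0.1, 0.9, 0.8
V = defaultdict(float)

def td_lambda_episode(env, policy):
    e = defaultdict(float)                      # eligibility traces, initially zero
    x = env.reset()
    done = False
    while not done:
        a = policy(x)
        y, r, done = env.step(a)
        delta = (r if done else r + gamma * V[y]) - V[x]   # TD-error
        e[x] = 1.0                              # replacing trace for the current state
        for s in list(e):
            V[s] += alpha * delta * e[s]        # update every eligible state
            e[s] *= gamma * lam                 # exponential trace decay
        x = y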


On the other hand, the forward view of TD(λ) is a more theoretical view and consists in making updates using predictions over several forthcoming steps.

Eligibility traces can also be used to enhance the performance of control algorithms such as Sarsa or Q-learning. However, it is then required to have traces for each state-action pair and not only for each state. The algorithms resulting from this combination are Sarsa(λ) (Rummery 1995) and Q(λ) (Peng and Williams 1996), and are presented in figure 2.6.

The counterpart of the efficiency gained by using eligibility traces is their computational cost, because the value function and the eligibility traces have to be updated for each state (or state-action pair in the control case). However, there are some promising results that overcome this drawback (Cichosz 1995; Wiering and Schmidhuber 1998). The principle is to update only the states whose traces are above a certain threshold and to ignore the remaining states.
    Q̂(x', a') ← 0 and e(x', a') ← 0 for each x' ∈ X and a' ∈ A
    Observe x
    Choose a according to Q̂(x, a) and some exploration policy
    loop
        Perform a, observe r and y
        Choose b according to Q̂(y, b) and some exploration policy
        For Q(λ):
            Δ'Q̂ ← r + γ max_c Q̂(y, c) − Q̂(x, a)
            ΔQ̂ ← r + γ max_c Q̂(y, c) − max_c Q̂(x, c)
        For Sarsa(λ):
            Δ'Q̂ ← r + γ Q̂(y, b) − Q̂(x, a)
            ΔQ̂ ← Δ'Q̂
        for each state-action pair (x', a') do
            e(x', a') ← γλ e(x', a')
            Q̂(x', a') ← Q̂(x', a') + α ΔQ̂ e(x', a')
        end for
        Q̂(x, a) ← Q̂(x, a) + α Δ'Q̂ e(x, a)
        For accumulating traces:
            e(x, a) ← e(x, a) + 1
        For replacing traces:
            e(x, a) ← 1
            for each a' ∈ A, a' ≠ a do
                e(x, a') ← 0
            end for
        x ← y and a ← b
    end loop

Figure 2.6: Algorithms of Q(λ) and Sarsa(λ) with either replacing or accumulating traces. For λ = 0 we obtain the Sarsa and one-step Q-learning algorithms.

Exploration

As was pointed out earlier, the convergence of TD control algorithms to an optimal policy is essentially subject to the requirement of visiting all state-action pairs an infinite number of times. This is obviously not possible in practice because it would take too long before starting the optimal control. The agent is therefore faced with an interesting trade-off between (i) performing actions that will increase its knowledge about the environment (i.e. visiting new states or consolidating its experience) and (ii) performing actions that are optimal relative to its current estimate of the optimal policy. In fact some actions are known to give good results in a particular situation, but some others are not known at all and might produce better results. This trade-off is called the exploration-exploitation dilemma. Methods to solve this dilemma can be classified into two categories: undirected methods and directed methods.

Undirected methods, also called ad hoc methods, do not use any knowledge about the learning process to direct the exploration: they explore at random. The simplest such technique is called the ε-greedy policy. It takes a greedy action by default and, with probability ε, a random action. The parameter ε is set to 1 in the beginning to encourage exploration and is slowly decreased thereafter to ensure exploitation. Another, more sophisticated, technique is based on the Boltzmann distribution

P(a \mid x) = \frac{e^{Q(x, a)/T}}{\sum_{b \in A} e^{Q(x, b)/T}},   (2.26)

where T is the temperature parameter which controls the exploration. With a high temperature the probabilities are uniform, and as T decreases the probability of choosing π*(x) becomes closer to one.
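Both undirected schemes can be sketched in a few lines; the Q-values below are hypothetical and the softmax implements equation 2.26.

import numpy as np

# Epsilon-greedy and Boltzmann (softmax) action selection over hypothetical Q-values.
rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, T):
    """Sample an action with probability proportional to exp(Q(x, a) / T)."""
    prefs = np.asarray(q_values, dtype=float) / T
    prefs -= prefs.max()                          # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.1, 0.5, 0.2]
print(epsilon_greedy(q, 0.1), boltzmann(q, T=1.0))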
Directed methods (see (Thrun 1992; Wyatt 1997; Wilson 1996) for more details) are based on an exploration bonus which is added to the utility function. It is worth mentioning that this bonus is simply a random value in the case of undirected methods. As for directed methods, the bonus is based on one or a combination of the following criteria:

- the counter criterion, which takes into account the number of times that a state-action pair has been visited;

- the error criterion, which uses the variation of the utility function: the higher the variation of the utility, the more its corresponding state-action pair is preferred;

- the recency criterion, which promotes state-action pairs that have been tried least recently.

Other techniques that seem powerful and promising are based on the Gittins indices and are currently being investigated by Meuleau and Bourgine (1998).
2.3 Structural Credit Assignment

The natural and simplest way of representing the estimates of the value and utility functions is to use a lookup table. Such a table has a single entry for each state or state-action pair. This kind of representation is well suited to simple tasks with small state and action spaces. However, when these spaces become huge, the problem faced goes beyond the prohibitive amount of memory needed to store the value of each entry. Specifically, the greater the number of situations with which the agent has to deal, the smaller the probability that the same situation will be faced more than once. Thus the learning process becomes difficult and the agent needs some generalization ability, which allows it to make a fair decision in a situation it has never faced before. This is known as the structural credit assignment problem and is concerned with attributing credit (or blame) to features of the faced situations in order to generalize across them.

To deal with this problem, value (or utility) functions are represented using function approximators. An ideal function approximator should use a fixed and limited amount of resources to represent a function, have good generalization abilities, and be parameterizable to allow on-line estimation of the function.

Several generalization methods and function approximators have been developed and used in reinforcement learning: methods based on Hamming distance and statistical clustering (Mahadevan and Connell 1992), the Cerebellar Model Articulation Controller (CMAC) (Tham 1995; Santamaria et al. 1997; Benbrahim and Franklin 1997) and neural networks (Rummery 1995; Millan 1996). Here we will focus on neural networks, and on the multi-layer perceptron (MLP) in particular, because they are well suited to implementing gradient-descent methods (a widely used approach to function approximation) through the error back-propagation algorithm, and finally because it is the approximator we used in our experiments.
2.3.1 Prediction with Function Approximator

In this section we present the general algorithm that combines temporal difference methods and function approximation techniques. It is based on the gradient-descent approach and can be used with any function approximator.

Let's assume that we have at our disposal the true values of V^π (the function we want to approximate) for each x ∈ X. Also let V̂_p(x) = V̂^π(p⃗, x) be the function which approximates V^π, where p⃗ is a parameter vector. It is those parameters that are tuned so that V̂_p(x) becomes closer to V^π(x) for each x ∈ X. Finding a good approximation of V^π using V̂_p consists in finding the configuration of p⃗ that minimizes the quadratic error over the state space:

E = \frac{1}{2} \sum_{x \in X} \left[ V^\pi(x) - \hat{V}_p(x) \right]^2.   (2.27)

To do so, gradient-descent methods progressively reduce the observed error at each step. The parameter vector is tuned in the direction opposite to the gradient of E with respect to p⃗:

\vec{p} \leftarrow \vec{p} - \alpha \nabla_{\vec{p}} E = \vec{p} + \alpha \left[ V^\pi(x) - \hat{V}_p(x) \right] \nabla_{\vec{p}} \hat{V}_p(x),   (2.28)

where α is the learning rate and ∇_p⃗ is the gradient operator with respect to p⃗. The learning rate weights the strength of the tuning so that only a small step is taken in the improving direction. If the learning rate were tuned to completely reduce the error on the observed example, the parameter vector would not converge because it would be destabilized after each new update.

In the case of TD learning, the value we want to approach with V̂_p(x) after an experience ⟨x, a, r, y⟩ is r + γ V̂_p(y). Hence the update rule for the parameter vector is

\vec{p} \leftarrow \vec{p} + \alpha \Delta \hat{V}_p \vec{e},   (2.29)

where ΔV̂_p is the TD-error r + γ V̂_p(y) − V̂_p(x), α is the learning rate and e⃗ is the eligibility trace vector. In the tabular case eligibility traces were assigned to each state; in the present case they are assigned to each component of the parameter vector. Their update is

\vec{e} \leftarrow \gamma \lambda \vec{e} + \nabla_{\vec{p}} \hat{V}_p(x),   (2.30)

where e⃗ has an initial value of zero.

The equations presented here can be extended to estimate the utility function Q^π(x, a) in the same way as in the tabular case. In the next section we briefly introduce neural networks, and then we show how they can be used with the above update rules.
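For a linear approximator, where ∇_p V̂_p(x) is simply the feature vector of x, updates 2.29 and 2.30 reduce to the sketch below; the feature map and environment interfaces are hypothetical.

import numpy as np

# TD(lambda) prediction with a linear function approximator (equations 2.29-2.30).
alpha, gamma, lam, n_features = 0.05, 0.9, 0.8, 16
p = np.zeros(n_features)                    # parameter vector

def v_hat(phi_x):
    return p @ phi_x                        # V_p(x) = p . phi(x)

def td_lambda_fa_episode(env, policy, phi):
    global p
    e = np.zeros(n_features)                # eligibility trace vector, initially zero
    x = env.reset()
    done = False
    while not done:
        a = policy(x)
        y, r, done = env.step(a)
        target = r if done else r + gamma * v_hat(phi(y))
        delta = target - v_hat(phi(x))      # TD-error
        e = gamma * lam * e + phi(x)        # equation 2.30: grad of V_p(x) is phi(x)
        p = p + alpha * delta * e           # equation 2.29
        x = y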

2.3.2 Neural networks

Artificial neural networks (ANN), also known as connectionist networks, are mathematical and computational models inspired by human nervous cells. Their basic components are simple processing units (also called neurons or perceptrons) interconnected by weighted synaptic links. Each unit receives signals from other units or from external sources and processes them. The result of the processing is either used as input to other units or as output of the network.
Architecture

Figure 2.7: Multi-layer perceptron network: an input layer, one or more hidden layers and an output layer; activation is propagated from input to output and errors are back-propagated from output to input.

As we said above, we will only consider multi-layer perceptron (MLP) networks. In such networks, units are organized in layers: units interacting with the outside are in the input or output layers, and all other units belong to the hidden layers (figure 2.7). When the units are connected in a forward way (from the input to the output layer) we have a feed-forward network. Sometimes certain units in the hidden or output layers are fed back to previous layers, which gives a recurrent network.
Activation

The activation in the network is computed by propagating the units' activation from the input to the output. The connection between two units is defined by a weight w_ij^q which determines the effect that the activation a_j^{q-1} of unit j has on unit i (figure 2.8). The activation of unit i (its output) is calculated in the following manner:

a_i^q = F(s_i^q),   (2.31)

where q indexes the layer, F is an activation function and s_i^q is the weighted sum of the unit's inputs plus a bias b_i^q:

s_i^q = \sum_j w_{ij}^q a_j^{q-1} + b_i^q.   (2.32)

Figure 2.8: A connection between units of consecutive layers q-1 and q. The index of the layers decreases from the output to the input.

The activation function F has to be non-linear and is usually either sigmoidal, semi-linear or tangential. However, a sigmoid activation function is very often used:

F(s) = \frac{1}{1 + e^{-s}}.   (2.33)
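A forward pass through a one-hidden-layer MLP with sigmoid units (equations 2.31-2.33) can be sketched as follows; the layer sizes and random weights are purely illustrative.

import numpy as np

# Forward pass of a small MLP with sigmoid activations (hypothetical sizes/weights).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 1
W1, b1 = rng.normal(scale=0.1, size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_out, n_hidden)), np.zeros(n_out)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))        # equation 2.33

def forward(x):
    a_hidden = sigmoid(W1 @ x + b1)        # a^q = F(s^q), s^q = W a^{q-1} + b
    a_out = sigmoid(W2 @ a_hidden + b2)
    return a_hidden, a_out

print(forward(np.ones(n_in))[1])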
Back-Propagation

The principle of the back-propagation method is to propagate the error, namely the difference between the desired output and the actual output, from the output to the input units so as to know the error of each unit. It consists in using a gradient-descent technique to minimize the quadratic error

E = \frac{1}{2} (\vec{d} - \vec{a})^2,   (2.34)

where d⃗ is the desired output vector and a⃗ is the actual output vector of the network. To do so, the gradient ∂E/∂w_ij^q is computed by decomposing it into two terms which are evaluated separately:

\frac{\partial E}{\partial w_{ij}^q} = \frac{\partial E}{\partial s_i^q} \frac{\partial s_i^q}{\partial w_{ij}^q}.   (2.35)

The second term can be directly calculated:

\frac{\partial s_i^q}{\partial w_{ij}^q} = \frac{\partial}{\partial w_{ij}^q} \left( \sum_k w_{ik}^q a_k^{q-1} + b_i^q \right) = a_j^{q-1},   (2.36)

and the first term, which defines the error ε_i^q on unit i of layer q, is decomposed once again to give:

\varepsilon_i^q = -\frac{\partial E}{\partial s_i^q} = -\frac{\partial E}{\partial a_i^q} \frac{\partial a_i^q}{\partial s_i^q}.   (2.37)

As a_i^q = F(s_i^q), we immediately deduce

\frac{\partial a_i^q}{\partial s_i^q} = F'(s_i^q).   (2.38)

For the calculation of ∂E/∂a_i^q we have to consider two distinct cases, depending on whether layer q is or is not the output layer. If it is, then

\frac{\partial E}{\partial a_i^q} = -(d_i - a_i^q),   (2.39)

and the error of an output unit is

\varepsilon_i^q = (d_i - a_i^q) F'(s_i^q).   (2.40)

When layer q is not the output layer, the gradient ∂E/∂a_i^q is derived from the errors of the forward layers:

\frac{\partial E}{\partial a_i^q} = \sum_k \frac{\partial E}{\partial s_k^{q+1}} \frac{\partial s_k^{q+1}}{\partial a_i^q} = -\sum_k \varepsilon_k^{q+1} w_{ki}^{q+1},   (2.41)

and the error of a non-output unit is

\varepsilon_i^q = \left( \sum_k \varepsilon_k^{q+1} w_{ki}^{q+1} \right) F'(s_i^q).   (2.42)

Finally, each weight of the synaptic links is corrected as follows:

w_{ij}^q \leftarrow w_{ij}^q + \eta \varepsilon_i^q a_j^{q-1},   (2.43)

where η is the learning rate and ε_i^q corresponds either to equation 2.40 or to equation 2.42. At this stage it is straightforward to notice how the gradient-descent method for value function prediction presented in section 2.3.1 can be implemented with neural networks. Figure 2.9 presents the connectionist version of Sarsa(λ).


    Initialize w⃗ with small random values and e⃗ to zero
    Observe x
    Choose a according to Q̂_w(x, a) and some exploration policy
    loop
        Perform a, observe r and y
        Choose b according to Q̂_w(y, b) and some exploration policy
        ΔQ̂_w ← r + γ Q̂_w(y, b) − Q̂_w(x, a)
        e⃗ ← γλ e⃗ + ∇_w⃗ Q̂_w(x, a)
        w⃗ ← w⃗ + α ΔQ̂_w e⃗
        x ← y and a ← b
    end loop

Figure 2.9: Algorithm of Sarsa(λ) with a connectionist function approximator.
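As an illustration of how the gradient term ∇_w Q̂_w(x, a) in figure 2.9 can be obtained by back-propagation, the sketch below differentiates a one-hidden-layer, sigmoid-output Q-net; biases are omitted and all names and sizes are hypothetical.

import numpy as np

# Gradients of a small Q-net's output w.r.t. its weights, via back-propagation.
rng = np.random.default_rng(0)
n_in, n_hidden = 6, 10
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # input-to-hidden weights
w2 = rng.normal(scale=0.1, size=n_hidden)           # hidden-to-output weights

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def q_and_grads(x):
    """Forward pass plus dQ/dW1 and dQ/dw2 for the eligibility trace update."""
    h = sigmoid(W1 @ x)                    # hidden activations
    q = sigmoid(w2 @ h)                    # network output Q_w(x, a)
    eps_out = q * (1.0 - q)                # F'(s) at the output unit
    eps_hidden = eps_out * w2 * h * (1.0 - h)   # errors back-propagated to the hidden layer
    grad_w2 = eps_out * h
    grad_W1 = np.outer(eps_hidden, x)
    return q, grad_W1, grad_w2

print(q_and_grads(np.ones(n_in))[0])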

2.3.3 Connectionist Reinforcement Learning

To represent the utility function with MLP networks (called in this case Q-nets), one has to carefully settle a certain number of issues.

Basically, Q-nets take as inputs a state x and an action a and produce their utility Q(x, a) as an output. The first issue therefore concerns the choice between a single network whose inputs encode both the state and the action, and a set of |A| distinct networks whose inputs encode only the state. The monolithic case may give fair results when the action space is continuous but is not efficient in domains with discrete actions. This limitation comes from the fact that the network has, in this case, to model a highly non-linear function, because for the same state different actions (usually having a similar representation) may have very different utility values. Moreover, this architecture does not support the use of eligibility traces. The distributed architecture, also called OAON (One Action One Network) (Lin 1992), associates one network with each action to reduce the interference between actions and is suitable for use with eligibility traces.

The second issue concerns non-Markov states. Recall that a Markov state is necessary and sufficient to make the right decision and to predict the next state for a given action in a given state. When the agent does not have a Markov state, it faces the hidden state problem. To cope with this problem the agent has to build an internal Markov state using history information. Recurrent neural networks construct such a history in a compact way: units in the hidden layer are fed back to a part of the input layer called the context, while the rest of the input layer is devoted to the state (figure 2.10). These networks are known as Elman networks and have been used by Lin (1992) to solve several non-Markov tasks.

Figure 2.10: An Elman network as used by Lin (1992) (input units, context units, hidden units and an output unit).

The last issue refers to the specification of each of the three layers.

The Input Pattern

The input vector of the neural network is a representation, in terms of feature coding, of the state to be evaluated. It is called the input pattern. The design of this vector is very important and has a great impact on the learning and generalization abilities of the network. The choice of the features requires a good knowledge of the task domain, and their coding depends on their nature.

As far as the features allow it, the simplest and most efficient way of representing them is a binary coding. If a feature has a finite and small number of possible values, such as a lift's location in a building, then one input unit is associated with each of them. The unit is 'on' when the feature has the corresponding value and 'off' otherwise.

When the feature is a real value, such as a robot's sensor reading, it can be either scaled into the range [0, 1] (to avoid units overshooting) and represented with a single unit, or spread over several units. The latter choice is a coarse coding technique and is useful when different responses are needed for different ranges of the value we want to code, or when we need more accuracy. The coarse coding technique is used in conjunction with binary features, radial basis functions (RBF) or sigmoid functions. For more details about these techniques see (Sutton and Barto 1998) for the first two methods and (Rummery 1995) for the third one. (A brief description of the coarse coding technique using a sigmoid function is given in section 4.4.2.)
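The two codings described above can be sketched as follows; the feature values and ranges are hypothetical.

import numpy as np

# One-hot ("binary") coding of a discrete feature and scaling of a real-valued one.
def one_hot(value, n_values):
    """One input unit per possible value; the matching unit is 'on'."""
    units = np.zeros(n_values)
    units[value] = 1.0
    return units

def scaled(value, low, high):
    """Scale a real-valued reading into [0, 1] for a single input unit."""
    return np.clip((value - low) / (high - low), 0.0, 1.0)

# Example: a location among 5 rooms and a sonar reading between 0 and 650 cm.
pattern = np.concatenate([one_hot(2, 5), [scaled(130.0, 0.0, 650.0)]])
print(pattern)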

The Hidden Layer(s)

The number of hidden layers, as well as the number of units in each layer, are the factors that define the degrees of freedom of a neural network. Hence, the more complicated the function, the more hidden layers and units are needed. In an MLP a single hidden layer is usually sufficient, but there is no systematic means of determining the exact number of hidden units. However, it has been reported by Rummery (1995) that, in reinforcement learning applications, the final performance of the system is no longer affected beyond a certain number of hidden units; only the convergence time and the computational cost increase. Therefore a possible strategy to find a suitable number of hidden units is to start with a small number and to increase it up to the point where no further improvement can be observed.
The Output Layer

The output of the network, when it is used to approximate a utility function, is a real value. It can be encoded either by several sigmoidal output units using the technique of overlapping Gaussian ranges (Pomerleau 1991) or by a single unit. In the latter case the activation function of this unit may be either linear or sigmoidal. With a linear function the output value is not bounded, so a high error may be back-propagated and thereby make the units overshoot. If a sigmoid function is used, the output value is within the range [0, 1], so the immediate reinforcement also has to be within this range. In practice, either we have an idea about the variation range of the reinforcement, so we can scale it, or we use a very small learning rate, which slows down the learning process. To overcome this handicap, Benbrahim and Franklin (1997) developed a method called Self-Scaling Reinforcement (SSR) which scales the reinforcement signal according to the observed minima and maxima.
2.4 Summary

This chapter has set up the foundations of reinforcement learning and has given an overview of the related existing methods and algorithms. Let's recall that reinforcement learning has to be viewed as a class of problems, or as an adaptive control paradigm, rather than as a particular learning technique. RL has become very popular in the field of intelligent autonomous agents and has attracted researchers from other disciplines like statistics, psychology and artificial intelligence. RL is becoming increasingly mature because, on one hand, its theoretical aspects (the link with dynamic programming, the choice of optimality criteria, the analysis of various algorithms' behavior, function approximators) are intensively investigated and, on the other hand, the number of practical applications is continuously growing. Examples of such applications are elevator control (Crites 1996), TD-Gammon (Tesauro 1995), dynamic channel allocation in cellular telephone systems (Singh and Bertsekas 1997) and job-shop scheduling (Zhang and Dietterich 1995). Efforts are currently focused on scaling up reinforcement learning to large, complex and partially observable problems. They involve issues such as continuous state and action spaces, representation, hierarchical control and task decomposition, and methodologies for the general application of RL. The last two issues constitute the central theme of this thesis.

Chapter 3
The Postman Robot Problem
In this thesis the postman robot problem is used as an application framework for the methodology we will introduce and as a testbed for our experiments. In this chapter we describe the postman robot task as well as the robot and the particular setups that we have used.
3.1 The Postman Robot Task

The postman robot is given a set of parallel and conflicting objectives and must satisfy them as well as it can. The robot acts in an office environment composed of offices, a battery charger and a mailbox. Its task is to collect letters from the offices and post them in the mailbox. While achieving its postman task as efficiently as possible, the robot has to avoid collisions with obstacles and recharge its batteries to prevent breakdowns.
3.2 The Robot

The physical robot is a Nomad 200 mobile platform (figure 3.1). It has 16 infrared sensors for ranges of less than 40 centimeters, 16 sonar sensors for ranges between 40 and 650 centimeters, and 20 tactile sensors to detect contact with objects. It is also equipped with wheel encoders and a compass to compute its current location and orientation relative to the initial ones. Finally, it has three wheels controlled together by two motors, which make it translate and rotate. A third motor controls the rotation of the turret.

Figure 3.1: The Nomad 200 robot.

3.3 The Environment

The postman robot's decisions are mainly driven by the letter flow, as well as by the battery level. In this section we describe their dynamics and the related assumptions.
3.3.1 Assumptions

We define an atomic action that the robot can perform as a steering of \theta degrees followed by a translation of d centimeters. Thus the set of available actions is constituted of several pairs a_i = (\theta_i, d_i). The interval between the end of the execution of two actions defines the duration of an interaction cycle and corresponds to one time step.

In addition, the following assumptions are made about the robot's capabilities:

- The robot can sense the number of letters it holds, its battery level and the number of letters in each office;

- The robot gets the letters once it is in an office, posts the letters once it is near the mailbox, and recharges its batteries once it is near the charger (because it does not have any grasping or recharging devices).

3.3.2 Dynamics

Let us denote by x_r(t) the number of letters that the robot holds, x_{l_i}(t) the number of letters in each office i, and x_b(t) the battery level, at a given time step t. The evolution of these parameters is governed by the following equations:

- Letters in an office i:

    x_{l_i}(t+1) = \begin{cases} 0 & \text{if the robot picks up the letters from office } i \\ x_{l_i}(t) + \delta_i(t) & \text{otherwise} \end{cases}

  where \delta_i(t) is the number of incoming letters in office i at time step t.

- Letters transported by the robot:

    x_r(t+1) = \begin{cases} x_r(t) + x_{l_i}(t) & \text{if the robot picks up the letters from office } i \\ 0 & \text{if the robot posts the letters it holds} \end{cases}

- Battery level:

    x_b(t+1) = \begin{cases} 100\% & \text{if the robot recharges its batteries} \\ x_b(t) - \Delta x_b & \text{otherwise} \end{cases}

  where \Delta x_b is the battery consumption for one time step.
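The following Python sketch simulates these dynamics for a single office; the arrival probability, consumption rate and the randomly drawn pick-up, posting and recharging events are purely illustrative, since in the thesis these events result from the robot's decisions.

    import random

    def simulate(steps=500, arrival_prob=0.05, consumption=0.2, threshold=60.0, seed=0):
        """Toy simulation of the letter and battery dynamics for one office."""
        random.seed(seed)
        x_l, x_r, x_b = 0, 0, 100.0       # office letters, carried letters, battery (%)
        for t in range(steps):
            x_l += 1 if random.random() < arrival_prob else 0   # letter arrivals
            x_b = max(0.0, x_b - consumption)                   # battery consumption
            if random.random() < 0.02:        # the robot picks up the letters
                x_r, x_l = x_r + x_l, 0
            if random.random() < 0.02:        # the robot posts the letters it holds
                x_r = 0
            if x_b < threshold and random.random() < 0.1:       # the robot recharges
                x_b = 100.0
        return x_l, x_r, x_b

    print(simulate())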

3.3.3 Testbed

The particular environment we used for our experiments is composed of three offices, one mailbox and one charger (figure 3.2). Its size is approximately 13 m x 13 m. Letter arrivals in each office are either periodic (i.e. n letters every p time steps) or follow a Poisson distribution. Table 3.1 shows the letter flow patterns that were used.

              Periodic            Poisson
              (letters/period)    (mean letters/time steps)
    Office 1  1/40                3/100
    Office 2  1/30                5/100
    Office 3  1/20                7/100

Table 3.1: The letter arrival patterns for each office.

To carry out the experiments we had at our disposal the Nomad 200 development host, which simulates the robot's sensors and kinematics, and we wrote a program which simulates the letter arrivals and the battery dynamics (figure 3.2). Although the robot simulator is realistic, it is time consuming. For example, it takes about 30 seconds to move from one office to another when the simulator is run on a Sun Ultra 1 station. To speed up the simulation process, we proceeded in the following manner. Once the navigation behaviors were learned (using the Nomad 200 simulator), we measured the number of time steps needed to move from one place to another. These measures are used to define a grid simulator, which is then coupled to the letter flow and battery simulator. Thus we can test and validate the coordination of these elementary behaviors much faster, while still being able to reuse the learned coordination with the robot simulator. As our navigation algorithms rely on odometry (see section 4.5.1), we were unable to reuse them on the real robot because of the drift. We are currently developing other navigation behaviors based on beacon detection.

Figure 3.2: The Nomad 200 development host and the letter flow and battery simulator (the environment comprises Office 1, Office 2, Office 3, the mailbox and the charger).

3.4 Summary

We have chosen the postman robot task because it provides an opportunity to apply reinforcement learning to build both the reactive (navigation, obstacle avoidance) and the planning (collecting and posting letters efficiently) skills of the robot. It is an instance of a more general task involving the coordination of concurrent and interfering behaviors, and is analogous to the optimal foraging problem usually faced by animals (Stephens and Krebs 1986). It is worth adding that a postman robot is currently running in a building of Carnegie Mellon University and that its design and implementation involved about 10 persons (Simmons et al. 1997).

Chapter 4
The Methodology
This chapter introduces a methodology to solve problems using reinforcement learning. We begin this chapter by justifying the need for a methodology in reinforcement learning. An interaction model between the agent and the environment is then presented, and some important notions like agent or behavior are clarified. Then we describe the Hierarchical Problem Solving (HPS) methodology as well as its associated methods, and apply it to the postman robot problem. Finally we report the experiments we carried out and the results we obtained.
4.1 A Methodology for Reinforcement Learning

Problem solving using embedded reinforcement learning agents has become very attractive because the level of abstraction at which the designer intervenes is raised (i.e. the agent is told what to do using the reinforcement function and not how to do it) and little programming effort is required (most of the work is done by autonomous training).

Nevertheless, and despite its mathematical foundations, reinforcement learning cannot be used as it is to make agents solve complex problems. Such a limitation is essentially due to the huge search space the agent has to deal with and to the difficulty of finding an adequate reinforcement function. One way to solve complex problems is to adopt a divide-and-conquer approach: (1) break down the initial problem into sub-problems with small state spaces and simple reinforcement functions, (2) solve each sub-problem, (3) combine the solutions of the sub-problems to solve the original problem.

The above procedure is recognized to give fair results and has been widely applied in reinforcement learning (see (Mahadevan and Connell 1992; Lin 1993; Kalmar et al. 1998; Dietterich 1997) for instance). However, only experienced designers can overcome the pitfalls that may appear during its use. In this chapter we introduce a methodology which integrates this procedure and helps the designer to build efficient control architectures for reinforcement learning agents.

The objective of a methodology, in any engineering field, is to provide helpful guidelines to engineers during the design process. Its role is of great importance because it not only ensures the quality of the final product but also optimizes the use of available resources, the allocation of tasks over several persons, and the management of the whole process. The different stages of a general engineering methodology are shown in figure 4.1. The next two sections review attempts to determine principles for the agent design process.
Figure 4.1: A general engineering methodology: define the problem, analyze the problem, make the design choices, then implement, test and validate the solution.

4.1.1 Pfeifer's Design Principles

By setting up the foundations for autonomous agents' design principles, Pfeifer (1996) wanted to provide new insights into the understanding of intelligence. His main argument is that the best way to understand intelligence is to build autonomous agents. Another major motivation is that the agent's design relies on the intuitions of experienced designers and that this know-how is often left implicit in most scientific publications. Thus the design principles aim at making this knowledge explicit and at providing guidance on how to build autonomous agents.

The design principles which were proposed are clustered into two classes. The first class is called task environment and concerns the definition of the ecological niche in which the agent will evolve, as well as the task it has to achieve and the behaviors it has to exhibit. The second class is devoted to the design of the agent itself and is constituted of seven principles, which include issues such as agent morphology and control architecture. We review these principles as they were summarized in (Pfeifer and Scheier 1998):
1. The complete agent principle. The kind of agents of interest are the complete agents, i.e. agents that are autonomous, self-sufficient, embodied and situated.

2. The principle of parallel, loosely coupled processes. Intelligence is emergent from an agent-environment interaction based on a large number of parallel, loosely coupled processes that run asynchronously and are connected to the sensory-motor apparatus.

3. The principle of sensory-motor coordination. All intelligent behavior (e.g. perception, categorization, memory) is to be conceived as a sensory-motor coordination which serves to structure the input.

4. The principle of cheap designs. Designs must be parsimonious, and exploit the physics and the constraints of the ecological niche.

5. The redundancy principle. Sensory systems must be designed based on different sensory channels with information overlap.

6. The principle of ecological balance. The "complexity" of the agent has to match the complexity of the task environment. In particular, given a certain task environment, there has to be a match in the complexity of sensors, motor system, and neural substrate.

7. The value principle. The agent has to be equipped with a value system and with mechanisms for self-supervised learning employing principles of self-organization.

These design principles were successfully applied to build "Sahabot", a mobile robot whose behavior is inspired by the desert ant's behavior.
4.1.2 The BAT Methodology

The need for a principled approach to developing learning autonomous agents also motivates the efforts of Dorigo and Colombetti (1998) to define a new technological discipline called Behavior Engineering. Behavior Engineering aims at providing a methodology, a repertoire of models and a set of tools supporting all the phases of the agent development process. The methodology they proposed, called Behavior Analysis and Training (BAT) (Colombetti et al. 1996), is based on the experience acquired during their past research, and covers several issues in the building process of autonomous robots, such as specification, design, training, and assessment. The BAT methodology comprises the following stages:

1. The informal (i.e. in natural language) description of the agent and its environment, as well as the requirements of the desired behavior.

2. The analysis of the behavior and its decomposition into simple behaviors. The interaction between these behaviors is then defined using some operators (independent sum, combination, suppression, sequence). The result of this stage is a structured behavior.

3. The specification of the robot components, including in particular the sensors and the effectors, the controller architecture, the reinforcement function for each elementary behavior, the training strategy, and sometimes the extensions that should be added to the environment. A set of generic control architectures based on Behavioral Modules (BM) is provided to implement the structured behavior.

4. The design, the implementation and the verification of the control architecture.

5. The robot's training until the desired behavior is learned.

6. The validation of the learning process and the observed behavior.

This methodology assumes that the robot's apparatus and the environment are predefined, and that the BMs are endowed with a well-chosen reinforcement learning mechanism, which was in most cases a Learning Classifier System (LCS). The feasibility of this methodology was demonstrated through three practical examples.
4.1.3 Discussion

The two approaches presented above constitute the main and, to the best of our knowledge, the only attempts to define a principled and systematic means of designing autonomous agents. Both of them were developed within and especially for the robotics field. However, some remarks can be made about them.

Pfeifer's design principles provide a set of recommendations and advice to respect, rather than guidelines to follow. Also, they do not deal with the testing and evaluation issues, and only timidly address the learning aspect. However, the difference between a behavior and the mechanism which produces it by interaction with the environment has been clearly stated and highlighted (this point will be detailed in the next section).

The BAT methodology explicitly guides the designer during all the stages and defines the expected result at the end of each of them. Learning is considered as an integrated part of the methodology and the role of the trainer in making the learning process efficient is stressed. However, we regret a certain lack of formalism in the specification phase and the fact that the decomposition process relies heavily on the designer's intuition and past experience.

In conclusion, we can point out that these approaches are (or may be) complementary, in the sense that the first one addresses the scientific part while the second one addresses the engineering part of the design of autonomous agents.

4.2 Agent-Environment Interaction Model

At this stage it is worth clarifying the notion of behavior, which is usually encountered in agent applications and in robotics in particular. A behavior is the description, from the observer's point of view and at different levels of abstraction, of a sequence of actions produced by the agent via its coupling with the environment. In simple words, an agent's behavior can also be defined as the result of the interaction between the agent's sensory-motor loops and the environment. In this section, we describe this interaction within the reinforcement learning framework, in more depth than in chapter 2.

Figure 4.2: Agent-environment interaction model in a reinforcement learning framework. The environment is influenced by the action u and perceived by the agent as y; within the agent's sensory-motor loop, perception is followed by the revision of the internal state x, the reinforcement r is computed, a decision selects the command a, and its execution acts on the environment. The observer's point of view encompasses the environment, while the agent's point of view is limited to its sensory-motor loop.

As shown in figure 4.2, the agent's behavior is modeled as a coupling of two dynamical systems: the agent, constituted here by a single sensory-motor loop, and the environment. We also distinguish between the different points of view:

- the agent's point of view, which takes into account the internal mechanism that generates commands according to perceptions;

- the external observer's point of view, which considers the environment, including the agent because it is embedded in it.
This distinction allows us to emphasize the following points:

- the difference between the environment's state s and the agent's perception y, as well as between the command a that the agent executes and the action u that actually influences the environment. In robotics, for example, the agent may have an obstacle in front of it and only gets sonar or infrared readings. The observer knows that this measure is correlated with the distance to the obstacle, but a priori the agent does not. In the same context, the agent may send the motors a command corresponding to a certain number of wheel turns, which makes it move in the environment, but this movement is not perceived as such by the agent. Moreover, the same number of wheel turns may result in different movements according to the distance to the obstacle and to possible slipping;

- the agent's decision is taken according to the internal state x, which has to be Markov. This state is made Markov by the revision (or reconstruction) process, which ranges from the identity function up to the most sophisticated knowledge revision process;

- the reinforcement signal, which previously came from the task (figure 2.1), is now a part of the agent. More precisely, it is a part of the agent's a priori knowledge given by the designer: the phylogenetic inheritance;

- complex behaviors may be produced by simple mechanisms through their interaction with the environment (Braitenberg 1984; Pfeifer and Scheier 1998). Hence the behavior's design process would be a projection from the problem's domain (observer's point of view) to the co-domain (robot's point of view).

From now on, we will use the term behavior to describe an agent solving a problem. Also, problem decomposition and sub-problem will be replaced by behavior decomposition and sub-behavior. Thus a behavior is constituted by a hierarchy of sub-behaviors, just as if we had a hierarchy of agents in which each agent is solving a sub-problem. In addition, these changes place the stress on the design of an interaction rather than on that of an isolated agent.
4.3 The HPS Methodology

The Hierarchical Problem Solving (HPS) methodology we propose aims at providing a systematic approach to the use of embedded reinforcement learning agents to solve problems. It focuses on the agent's design and more specifically on the hierarchical aspect of the control architecture. The methodology assumes that the environment, the agent and its interaction devices, as well as the problem to solve, are predefined.
The HPS methodology will guide the designer by telling him how to:

- formally specify the agent's behavior;

- decompose the global behavior into a hierarchy of sub-behaviors;

- produce the elementary behaviors of the hierarchy, i.e. the behaviors of the lowest level, using reinforcement learning sensory-motor loops;

- coordinate the sensory-motor loops at a given level of the hierarchy to obtain the behavior of the upper level;

- evaluate and validate the global behavior.

Figure 4.3 gives an overview of the different stages of the methodology. We notice that:

- the controller's design is iterative, that is, the results of the global behavior's evaluation can be used to correct the specifications. The cycle is repeated until the expected behavior is observed;

- the analysis process is top-down and from the observer's point of view, while the design process is bottom-up and from the robot's point of view;

- the distinction between the different points of view allows us to identify which parts have to be treated by the designer and which have to be learned by the robot. Hence we can easily combine engineering and evolution.
Figure 4.3: Overview of the HPS methodology (problem and agent definition, formal specification of the behavior, decomposition into a hierarchy of behaviors, production of the elementary behaviors of the hierarchy, coordination of the sensory-motor loops, and evaluation and validation of the behavior; the analysis is carried out from the observer's point of view and the design from the robot's point of view).

4.3.1 Specification

The specification stage has an important role in the HPS methodology. On the one hand, all the following stages rely on it; on the other hand, it provides the assessment stage with a useful reference for matching. The dynamics of the interaction between the agent and the environment was formalized as an MDP. Thus a behavior will be represented by a particular trajectory in the MDP's state space.

By associating a quality criterion with each possible trajectory, we then have a means of specifying the desired behavior. The quality criterion can be expressed as the combination of an objective function and some constraints on the trajectory. The objective function largely depends on the nature of the problem and represents a measure of the system's performance, such as the letters collected, the fuel consumption or, more generally, the squared deviation from an optimal value. It is expressed as an integral over the trajectory generated by a control policy \pi, for a horizon N:

    J(\pi) = \int_0^N f(x(t), t) \, dt.                                        (4.1)
The constraint set C = \{ x \in X \mid \varphi_1(x) = 0, \dots, \varphi_n(x) = 0 \} reflects the aspects of the trajectory which are undesirable. So the goal is to optimize the objective function while at the same time satisfying the constraints. The constraints are enforced by augmenting the objective function as follows:

    J'(\pi, \lambda) = J(\pi) + \int_0^N \sum_i \lambda_i(x(t), t) \, \varphi_i(x(t), t) \, dt
                     = \int_0^N \Big[ f(x(t), t) + \sum_i \lambda_i(x(t), t) \, \varphi_i(x(t), t) \Big] \, dt          (4.2)
                     = \int_0^N F(x(t), \lambda, t) \, dt

where the auxiliary function F(x, \lambda) is called the Hamiltonian function and the \lambda_i are the Lagrange multipliers. They are computed using the exterior penalties method (Minoux 1986): \lambda_i(x) = 0 if the constraint \varphi_i(x) = 0 is satisfied and \lambda_i(x) = p_i otherwise. The positive constant p_i weights the strength of the penalty.

At the end of this stage the desired behavior is specified.
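As an illustration, the following Python sketch evaluates the augmented criterion along a discretized trajectory with the exterior penalties method; the function names, the toy objective and the toy constraint are ours.

    def hamiltonian(x, t, f, constraints, penalties):
        """F(x, lambda, t) = f(x, t) + sum_i lambda_i(x, t) * phi_i(x, t), with
        lambda_i = 0 when the constraint is satisfied and lambda_i = p_i otherwise."""
        value = f(x, t)
        for phi, p in zip(constraints, penalties):
            if phi(x, t) != 0:                 # constraint violated
                value += p * phi(x, t)
        return value

    def augmented_objective(trajectory, f, constraints, penalties, dt=1.0):
        """Discrete approximation of J'(pi, lambda), the integral of F over the trajectory."""
        return sum(hamiltonian(x, t, f, constraints, penalties) * dt
                   for t, x in enumerate(trajectory))

    # Toy example: minimize x while keeping x above a threshold of 1.
    f = lambda x, t: x
    phi = lambda x, t: max(0.0, 1.0 - x)       # positive only when the constraint is violated
    print(augmented_objective([3.0, 2.0, 1.5, 0.5, 0.2], f, [phi], [10.0]))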
4.3.2 Decomposition

Human designers are usually skillful at decomposing a complex task. However, with a systematic approach they can perform better decompositions.

To decompose the main behavior into a hierarchy of sub-behaviors we propose a graphical approach. The first step in the decomposition procedure is to represent F graphically as a function of time steps or decision steps. The next step consists in identifying the positive contributions of the agent to the optimization of this function, as well as the associated decision making (or behavior selection). These contributions usually appear as falling edges in the case of a minimization. Of course, between two falling edges other decisions could have been made, except that they do not have a positive contribution or their contribution does not appear because of the nature of the function and the kind of representation.

The surface corresponding to the integral we have to minimize is decomposed into a series of rectangles whose sides are, respectively, the distance between two falling edges and the value of the function when the second falling edge occurs (figure 4.4). We notice that the sum of the rectangles' surfaces is not exactly equal to the integral of F but to the actual measure of the agent's contribution. This measure concerns the aspects of the environment that are controllable by the agent and allows us to compare the performances of two agents. For example, in the postman robot problem, the robot can choose which office to go to but cannot act on the letter flow. In effect, while the robot is moving towards a given place, the number of letters in standby in the offices as well as the battery level evolve independently of the robot's destination. They are actually affected when the destination is reached, that is, when the execution of the robot's decision is completed. The surface of each rectangle can be minimized by reducing one of its two sides. The processes consisting in minimizing each of these sides correspond to two concurrent behaviors.

The obtained behaviors are then formally specified and decomposed once again. The procedure is repeated until the behaviors cannot be decomposed anymore or can easily be produced. At that point we have a hierarchy of sub-behaviors.
Figure 4.4: A graphical representation of the function to optimize; the area under the curve is decomposed into rectangles delimited by falling edges.
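The graphical method can be sketched in a few lines of Python: given a sampled trajectory of F, the falling edges are detected and the sides and surface of each rectangle are computed (the toy values below are ours).

    def rectangles(F_values):
        """Detect falling edges in a sampled criterion F and return, for each
        rectangle, its horizontal side (steps since the previous falling edge),
        its vertical side (the value of F after the drop) and its surface."""
        rects, last_edge = [], 0
        for t in range(1, len(F_values)):
            if F_values[t] < F_values[t - 1]:       # falling edge
                width = t - last_edge               # horizontal side
                height = F_values[t]                # vertical side
                rects.append((width, height, width * height))
                last_edge = t
        return rects

    # Toy trajectory: F grows while letters accumulate and drops when the robot
    # collects or posts letters.
    for width, height, surface in rectangles([3, 4, 5, 2, 3, 4, 4, 1, 2]):
        print(width, height, surface)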

Mathematical Support

In this section we provide a mathematical support for the graphical decomposition method presented above. Let us first introduce the fundamental definition and theorem (taken from (Minoux 1986)) on which function decomposition methods rely.

Definition. We say that a function f is decomposable into f_1 and f_2 if f is separable (i.e. it can be put into the form f(x, y) = f_1(x, f_2(y))), and if moreover the function f_1 is monotone non-decreasing relative to its second argument. The following fundamental result can then be stated:

Theorem. Let f be a real function of x and of y = (y_1, ..., y_k). If f is decomposable with f(x, y) = f_1(x, f_2(y)), then we have

    \min_{(x, y)} f(x, y) = \min_x \{ f_1(x, \min_y \{ f_2(y) \}) \}.

The minimization of a rectangle surface S = l_1 \cdot l_2 can then be written

    \min_{(l_1, l_2)} l_1 \cdot l_2 = \min_{l_1} f_1(l_1, \min_{l_2} f_2(l_2)),

where f_1(u, v) = u \cdot v and f_2(x) = x, when l_1 and l_2 are both positive.

4.3.3 Sensory-motor Loop's Design

In this section we present a generic sensory-motor loop (figure 4.5) which allows us to generate a behavior given its specifications. This stage of the methodology essentially consists in making design choices and concerns the elementary behaviors as well as the other sub-behaviors of the hierarchy.

The core of the sensory-motor loop is the learning system, which computes the utility of each command. The nature of the representation of the utility function depends on the size of the state space. A simple lookup table is sufficient for small spaces, but a function approximator such as those presented in section 2.3 is needed for huge spaces.

From the perceptions we have to generate an internal state representation which must be, on the one hand, complete enough to allow the prediction of future states and rewards and, on the other hand, selective, i.e. containing only information which is relevant to the behavior associated with the sensory-motor loop. Such a representation can also be learned, as reported by McCallum (1996).

The reinforcement function is an important part of the sensory-motor loop and great care must be taken to ensure that it will lead to the desired behavior. It translates the agent's perceptions into a reward value.

The different exploration strategies were presented in section 2.2.3, so the designer can choose the most suitable among them.

Finally, as an output, the sensory-motor loop generates signals which activate or inhibit the commands. The command set may contain atomic commands, which directly interact with the environment, or sensory-motor loops in the case of a coordination.
Figure 4.5: The proposed generic sensory-motor loop. The perceptions feed the state representation and the reinforcement function; together with the utility function representation and the exploration policy, they drive the action selection mechanism, which sends activation/inhibition signals to the command set.

The Reinforcement Function

To avoid the generation of wrong behaviors, we propose to use the function that specifies the behavior to define the reinforcement function. As the specification function is defined from the observer's point of view, the expected behavior will be generated when this function is optimized. We then define the instantaneous reinforcement as the difference between the surfaces of two consecutive rectangles:

    r(T) = F(x(T-1), \lambda) \, \Delta_{T-1} - F(x(T), \lambda) \, \Delta_T                    (4.3)

where T is a decision step and \Delta_T is the difference, in terms of time steps, between the two decision steps T-1 and T. The reinforcement function has the form of a gradient and gives continuous information on the progress made by the agent. In addition, learning is speeded up and exploration is improved (Mataric 1994). Given that the reinforcement learning algorithms we use maximize the cumulated discounted reward over an infinite horizon, we have:
    \sum_{T=0}^{\infty} \gamma^T r(T+1)
        = \gamma^0 \big( F(x(0), \lambda) \Delta_0 - F(x(1), \lambda) \Delta_1 \big)
          + \gamma \big( F(x(1), \lambda) \Delta_1 - F(x(2), \lambda) \Delta_2 \big)
          + \dots
          + \gamma^n \big( F(x(n), \lambda) \Delta_n - F(x(n+1), \lambda) \Delta_{n+1} \big)
          + \dots
        = F(x(0), \lambda) \Delta_0
          + (\gamma - 1) F(x(1), \lambda) \Delta_1
          + (\gamma^2 - \gamma) F(x(2), \lambda) \Delta_2
          + \dots
          + (\gamma^n - \gamma^{n-1}) F(x(n), \lambda) \Delta_n
          + \dots
        = F(x(0), \lambda) \Delta_0 + (\gamma - 1) \sum_{T=1}^{\infty} \gamma^{T-1} F(x(T), \lambda) \Delta_T          (4.4)
        = (\gamma - 1) \sum_{T=0}^{\infty} \gamma^{T} F(x(T+1), \lambda) \Delta_{T+1} + F(x(0), \lambda) \Delta_0

We notice that maximizing equation 4.4 is equivalent to the initial objective, which is to minimize equation 4.2, because 0 < \gamma < 1 and as long as the value of \gamma is chosen so that \gamma^N becomes negligible.
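The following short Python check illustrates equations 4.3 and 4.4 on toy values of F and of the decision durations; it verifies numerically that the discounted sum of the gradient-shaped rewards matches the regrouped form, truncated at a finite horizon.

    GAMMA = 0.99

    def reinforcement(F, delta, T):
        """r(T) = F(x(T-1), lambda) * Delta_(T-1) - F(x(T), lambda) * Delta_T."""
        return F[T - 1] * delta[T - 1] - F[T] * delta[T]

    # Toy values of the criterion F and of the durations between decision steps.
    F = [5.0, 4.0, 6.0, 2.0, 1.0, 0.5]
    delta = [3, 2, 4, 1, 2, 3]
    N = len(F) - 1

    # Left-hand side of equation 4.4, truncated at the horizon N.
    lhs = sum(GAMMA ** T * reinforcement(F, delta, T + 1) for T in range(N))

    # Regrouped (telescoped) form: F(x(0))Delta_0 + (gamma - 1) * sum of
    # gamma^(T-1) F(x(T)) Delta_T for T = 1..N-1, minus the finite-horizon tail.
    rhs = (F[0] * delta[0]
           + (GAMMA - 1) * sum(GAMMA ** (T - 1) * F[T] * delta[T] for T in range(1, N))
           - GAMMA ** (N - 1) * F[N] * delta[N])

    print(round(lhs, 6), round(rhs, 6))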
The Internal State

To build an internal state that meets the completeness and selectiveness properties, the designer has to consider the following two guidelines. First, he has to identify the perceptions on which the specification function depends, that is, those which make the function change when they evolve. Then the designer has to check whether the instantaneous perceptions are sufficient to make efficient decisions. If not, some kind of context or short-term memory has to be added.
4.3.4 Coordination

The next chapter is entirely devoted to this stage of the methodology.


4.3.5 Evaluation and Validation

During this stage the designer has to answer the following questions:

- Is the observed behavior correct?

- If not, why?

- What are the agent's performances?

Wyatt et al. (1998) argue that a correct approach is to employ multiple forms of evaluation. Thus it is possible to disambiguate the source of errors and to provide explanations of why the agent failed or succeeded.

Here we make the distinction between the behavior assessment (Colombetti et al. 1996) and the evaluation of the agent's learning. The former is a qualitative criterion and the latter is a quantitative criterion. Moreover, we add two viewpoints: internal and external.

To assess a behavior the designer should validate its correctness and its robustness. This is done from the observer's point of view. A behavior is correct when the task assigned to the agent is fulfilled. For example, we will validate the postman robot if we see the robot collecting and posting the letters without running out of energy. On the other hand, a behavior is robust if it remains correct when structural changes of the environment occur. Robustness is strongly linked to the adaptiveness property. If the correct behavior is not generated, then the designer should verify the learning system qualitatively, that is, determine whether the agent is learning or not. A problem detected during this verification is usually due to a programming error in the software architecture.

               Qualitative                Quantitative
    Internal   Is the robot learning?     Convergence speed
                                          Average reward
    External   Correctness                Objective function
               Robustness                 Constraints violation
                                          Failure or success rates

Table 4.1: Outline of the evaluation forms.

If the agent is effectively learning, then it is necessary to check whether it is learning correctly with regard to the reinforcement program, i.e. maximizing rewards and minimizing punishments. This is done from the agent's point of view. The average of the rewards received over time steps is a good indicator to use during this check. This is a convenient way to find out why the behavior is incorrect. In effect, if the agent learns what it is taught (through the reinforcement program) and exhibits the wrong behavior, then it is surely because it is learning from an incorrect reinforcement function. Therefore the designer has to correct it.

Finally, it is useful to compare the performances of several agents, architectures or algorithms. It is possible to evaluate the asymptotic convergence to the optimal behavior (Kaelbling et al. 1996) with regard to two quantitative criteria. The first criterion is the convergence speed, that is, the time (number of interaction cycles) necessary to reach a plateau. The second criterion is the quality of the convergence, represented by the value of the reached plateau. The metrics that are usually used for such a comparison are the cumulated deviations from the optimal behavior (if it is known), the average reinforcements received over time, and success or failure rates. Table 4.1 outlines the different forms of evaluation.
4.4 Case Study

In this case study we describe the application of the HPS methodology to solve the postman robot problem.

4.4.1 Specification and Decomposition

To fulfill its task, the postman robot has to minimize the number of letters x_{l_i} in the offices as well as the number of letters x_r it holds, by respectively collecting and then posting them. The following objective function is derived:

    f_1(x, t) = \sum_i x_{l_i}(t) + \mu \, x_r(t),      0 < \mu < 1                    (4.5)

subject to the constraint on the battery level x_b

    \varphi_1(x, t) = x_{b_{th}} - x_b(t) \le 0,                                        (4.6)

where x_{b_{th}} is a safety threshold.

The letters carried by the robot may also be seen as a constraint, and \mu as a Lagrange multiplier, because the functions \sum_i x_{l_i}(t) and x_r(t) are antagonistic: when the former is minimized the latter is maximized. Hence minimizing \sum_i x_{l_i}(t) and x_r(t) amounts to minimizing \sum_i x_{l_i}(t) subject to x_r(t) = 0. The value of the Lagrange multiplier \mu is a constant between 0 and 1, so that any contribution to minimizing either \sum_i x_{l_i}(t) or x_r(t) will also minimize f_1(x, t). Moreover, it is not necessary to set \mu to zero when the constraint is satisfied (x_r(t) = 0).

The Hamiltonian function

    F_1(x, \lambda_1, t) = \sum_i x_{l_i}(t) + \mu \, x_r(t) + \lambda_1(x, t) \, \varphi_1(x, t)          (4.7)

is then deduced and represented graphically (figure 4.6).


A falling edge occurs when the robot:

- collects letters from an office;

- posts the letters it holds;

- recharges its batteries when their level is below the threshold (the penalty is removed).

Figure 4.6: The decomposition process of the postman robot problem (the criterion F_1, with its penalty term, and the derived sub-criteria F_{21} and F_{22}).

The two concurrent behaviors that are involved in the minimization process of a rectangle's surface are:

- move to the nearest place providing a positive contribution, for the horizontal side;

- move to the place providing the highest contribution, for the vertical side.

For the first behavior the robot has to minimize the traveled distance x_d between two decision steps T-1 and T. The corresponding objective function is

    f_{21}(x, T) = x_d(T),                                                        (4.8)

subject to providing a positive contribution (a falling edge in the graph). In effect, the robot may move to the nearest office, but that office may not contain any letters. This constraint is expressed by \varphi_{21}(x, T) = 0, where

    \varphi_{21}(x, T) = \begin{cases} 0 & \text{if } F_1(x, \lambda_1, t_{T-1}) - F_1(x, \lambda_1, t_T) > 0 \\ 1 & \text{otherwise} \end{cases}          (4.9)

and t_T is the time step corresponding to the decision step T. We obtain

    F_{21}(x, \lambda_{21}, T) = x_d(T) + \lambda_{21}(x, T) \, \varphi_{21}(x, T).                      (4.10)

The second behavior can be defined as the one maximizing

    F_{22}(x, T) = f_{22}(x, T) = F_1(x, \lambda_1, t_{T-1}) - F_1(x, \lambda_1, t_T)                    (4.11)

where t_T is the time step corresponding to the decision step T.


We notice that F_{21} and F_{22} only change at the end of a decision and remain constant the rest of the time. Thus they are represented as functions of T, because they depend on the decision step rather than on the time step. Moreover, they concern the sides of a single rectangle only. It is the role of the upper behavior to coordinate them in order to minimize the sum of the rectangles' surfaces. This is why a graphical representation of those behaviors does not provide additional information. However, it is obvious that the behaviors nearest and highest may correspond to one of the following five behaviors:

- move to an office (3 behaviors);

- move to the mailbox;

- move to the battery charger;

or, more generally, to a behavior consisting in moving to a specific place.


Recall from section 3.3.1 that the robot's atomic commands consist of a steering of \theta degrees followed by a translation of d centimeters. To reach a given goal the robot has to minimize its orientation relative to this goal while moving. This means that the objective function

    f_3(t) = x_\theta(t),                                                          (4.12)

where x_\theta is the robot's orientation with respect to the goal, has to be minimized subject to the obstacle avoidance constraint

    \varphi_{3i}(x) = (d_{s_i} - x_{s_i}) < 0,                                     (4.13)

where x_{s_i} is the robot's reading of sensor i, which indicates the distance to the nearest obstacle, and d_{s_i} is the nearest safe distance to an obstacle. The performance criterion we obtain is

    F_3(x, \lambda_3, t) = x_\theta(t) + \sum_i \lambda_{3i}(x, t) \, \varphi_{3i}(x, t).                (4.14)

The hierarchy of sub-behaviors obtained at the end of this stage is sketched in figure 4.7.

Figure 4.7: The hierarchy of sub-behaviors obtained for the postman robot problem: the postman behavior coordinates the nearest and highest behaviors, each of which coordinates the five navigation behaviors (move to office 1, move to office 2, move to office 3, move to the mailbox, move to the charger).

4.4.2 Sensory-motor Loop's Design

The design choices that were made for each sensory-motor loop are now described. Each sub-behavior of the hierarchy will be learned using connectionist reinforcement learning. A single MLP with a sigmoid activation function and a single output unit was used to represent the utility function of each command. Some components of the perception vector are represented using a sigmoidal coarse coding as in (Rummery 1995). Basically, such a coding works as follows. A number of sigmoid functions, one for each input neuron, are spread across the input space (figure 4.8). As the sigmoid functions overlap each other, each input value will be coded by several values in [0,1], corresponding to the value of each sigmoid function for that input. The input patterns for each network as well as the reinforcement functions are detailed in the experiments section.

Concerning the exploration policy, a simple \epsilon-greedy policy was used. A command is chosen according to the probability P(a = \arg\max_{a' \in A(x)} Q(x, a') \mid x) = 1 - \epsilon, where \epsilon is decreased from 1 to 0 in N_{exp} exploration steps.
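A minimal Python sketch of this exploration policy (the Q-values below are placeholders):

    import random

    def epsilon_greedy(q_values, step, n_exp=1500):
        """Choose the greedy command with probability 1 - epsilon and a random
        command otherwise; epsilon decreases linearly from 1 to 0 over n_exp steps."""
        epsilon = max(0.0, 1.0 - step / n_exp)
        commands = list(q_values)
        if random.random() < epsilon:
            return random.choice(commands)
        return max(commands, key=lambda a: q_values[a])

    # Hypothetical Q-values of three commands in the current state.
    q = {"turn-left": 0.2, "turn-right": 0.1, "move-forward": 0.4}
    print(epsilon_greedy(q, step=100))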

Figure 4.8: The real input value x is coarse coded into four values in [0,1], here 0.05, 0.55, 0.95 and 1.0, which constitute a suitable input for a neural network.
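A small Python sketch of sigmoidal coarse coding; the number of units, the centres and the slope are illustrative, so the activations will not be exactly those of figure 4.8.

    import math

    def sigmoid_coarse_code(x, centres, slope=1.0):
        """Code a real value x as one activation in [0, 1] per input unit, each
        unit being a sigmoid centred at a different point of the input range."""
        return [1.0 / (1.0 + math.exp(-slope * (x - c))) for c in centres]

    # Four overlapping sigmoids spread across the input range [0, 10].
    print([round(v, 2) for v in sigmoid_coarse_code(5.0, [2.0, 4.0, 6.0, 8.0])])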

4.4.3 Coordination

We used a coordination mechanism in which the sensory-motor loops of a given layer are treated as simple commands by the upper level. Once they are activated, they keep control of the agent until they have completed; control is then returned to the sensory-motor loop which activated them. This kind of coordination is called Hierarchical Q-learning (Lin 1993).
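The control flow of this coordination can be sketched as follows in Python; the toy sub-behaviors and the random upper-level choice stand in for the learned navigation behaviors and the Q-value-based selection.

    import random

    class CountdownBehavior:
        """Toy sub-behavior that completes after a fixed number of steps."""
        def __init__(self, duration):
            self.duration = duration
            self.remaining = duration
        def step(self):
            self.remaining -= 1
        def done(self):
            return self.remaining <= 0
        def reset(self):
            self.remaining = self.duration

    def run_hierarchy(subbehaviors, n_decisions=5, seed=0):
        """The upper level treats each sub-behavior as a single command and only
        takes a new decision once the activated sub-behavior has completed."""
        random.seed(seed)
        for _ in range(n_decisions):
            name = random.choice(list(subbehaviors))     # stands in for a Q-value based choice
            behavior = subbehaviors[name]
            while not behavior.done():                   # the sub-behavior keeps control
                behavior.step()
            print("completed:", name)
            behavior.reset()

    run_hierarchy({"nearest": CountdownBehavior(3), "highest": CountdownBehavior(4)})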
4.4.4 Evaluation and Validation

To judge the effectiveness of the overall behavior we defined the following metrics:

- the average number of letters in standby in the offices, the average number of letters carried by the robot, as well as the average battery level, for the external assessment. These values are updated at each interaction cycle (the lowest temporal resolution) to guarantee uniformity in the comparison with other architectures;

- the average of the global quality criterion, updated at each decision step, to evaluate the learning process.

4.5 Experiments

The postman robot behavior is learned incrementally. With this technique the robot is first trained to learn the elementary behaviors and then to learn the upper behaviors using the previously acquired skills. This process, called modular learning, is repeated for each level of the hierarchy. The navigation behaviors are learned separately and preserved using persistent neural networks. The coordination behaviors are then learned so as to achieve the global behavior.
4.5.1 Learning to Navigate

Mobile robot navigation towards a goal while avoiding obstacles has been studied in the reinforcement learning context by Rummery (1995) and Millan (1996). Their work is an extension of that of Prescott and Mayhew (1992) and Krose and Van Dam (1993), in which the robot avoids obstacles, not in order to get to a target location, but just to explore the environment. It may be seen as an adaptive construction of a potential field (i.e. the goal generates a potential which pulls the robot towards it, and the obstacles produce a potential which repels the robot away) where the potential vector in a given position is defined by the robot's action with the highest utility in this situation. In classical path planning (Khatib 1986; Barraquand and Latombe 1991) the potential field is computed using a priori knowledge about the environment's configuration.
In our experiments, recurrent neural networks with 2 hidden units were used to learn the navigation behaviors. A network's input pattern is a vector of 26 components which are real numbers in the interval [0,1]. The first 16 components correspond to the inverse exponential of the distance sensor readings, e^{-d/k}, where k is a weighting factor set to 50 during the experiments and d is a combination of infrared and sonar readings so as to provide measures between 0 and 650 centimeters. The next 8 components are a sigmoidal coarse coding of the robot's orientation relative to the goal. The orientation is computed using odometry. The remaining 2 components represent the input context and are linked to the output of the hidden units. The input context as well as the orientation allow the robot to differentiate between several situations corresponding to the same sensor configuration.
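A Python sketch of how such a 26-component input vector can be assembled; the sensor values, the coarse-coding centres and the context values are illustrative, with k = 50 as in the text.

    import math

    def navigation_input(distances_cm, orientation_deg, context, k=50.0):
        """16 inverse exponentials of the distance readings, 8 sigmoidally
        coarse-coded orientation units and 2 context units fed back from the
        hidden layer (26 components in total)."""
        sensors = [math.exp(-d / k) for d in distances_cm]               # 16 components
        centres = [-135, -90, -45, 0, 45, 90, 135, 180]                  # illustrative centres
        orientation = [1.0 / (1.0 + math.exp(-(orientation_deg - c) / 20.0))
                       for c in centres]                                 # 8 components
        return sensors + orientation + list(context)                     # + 2 context units

    vector = navigation_input([650.0] * 16, orientation_deg=30.0, context=[0.5, 0.5])
    print(len(vector))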

Three commands were available to the robot:

- turn-left (\theta = 22.5°, d = 25 cm);

- turn-right (\theta = -22.5°, d = 25 cm);

- move-forward (\theta = 0°, d = 25 cm).


The reinforcement function is calculated from the quality criterion F_3(x, \lambda_3, t) = x_\theta(t) + \sum_i \lambda_{3i} (d_{s_i} - x_{s_i}), without considering the interval between two decisions, because the actions all have the same duration. We have

    r_3(t) = F_3(x, \lambda_3, t-1) - F_3(x, \lambda_3, t).

Safety thresholds concern only the nine frontal sensors and define a security zone in front of the robot (figure 4.9). We notice that the safety thresholds are higher in front of the robot than on its sides. This is simply because the robot can still move even if its sides are near an obstacle, but cannot do so if its front is concerned. The values of the Lagrange multipliers when the constraints are violated are chosen to give a penalty which is proportional to the violated surface of the security zone, the overall zone being equivalent to the maximum heading deviation of the robot from the goal, which is 180 degrees.

The networks' weights were initialized with random values between -0.1 and 0.1, the discount factor \gamma was fixed to 0.99, the learning rate to 2.0, the eligibility trace factor \lambda to 0.5 and the exploration parameter N_{exp} to 1500 steps. As the networks' output is in the range [0,1], we scaled the reinforcement signal between -0.1 and 0.1 to prevent the units from overshooting.
Figure 4.9: The security zone defined in front of the robot (it covers the nine frontal sensors and extends further in front of the robot than on its sides).

The robot was trained to learn each of the five navigation behaviors in a series of trials, with each trial starting with the robot placed in a different room and ending when it reaches the target location. Figure 4.10 shows the robot's trajectories when it navigates from one room to another, once it has learned. To evaluate the robot's learning performance we considered the behavior move to the charger. The robot was trained to reach the charger starting from office 3 only. After learning, it was able to find the optimal path leading to the charger starting from office 3 (figure 4.11) and also starting from other rooms (figure 4.12), thus exhibiting generalization abilities. Moreover, it reacts efficiently to unexpected obstacles (figure 4.13). The learning curves of figure 4.14 show that the robot learns how to move to the charger after 4 trials, corresponding to 6258 steps. However, the path found is not yet optimal and sometimes not safe either. The reason is that during this trial there is a residual exploration of 38%. Thereafter, from the 22nd trial, the path is optimal (between 41 and 44 steps) and safe, as figure 4.15 shows that there are no more penalties after the 22nd trial. It is worth adding that during this trial the residual exploration was 16%.

Figure 4.10: The robot moving from one room to another.

Figure 4.11: The optimal path found between office 3 and the charger.

Figure 4.12: Generalization abilities.

Figure 4.13: Reaction to an unexpected obstacle.

Figure 4.14: Number of steps needed to reach the charger, starting from office 3, for each trial.

Figure 4.15: Average penalties received during each trial.
4.5.2 Learning the Coordination

In this section we report the experiments we carried out to coordinate the navigation behaviors. A grid simulator configured with the distances shown in table 4.2 was used for this purpose.

                Office 1   Office 2   Office 3   Mailbox
    Office 2      39
    Office 3      62          41
    Mailbox       34          43         65
    Charger       40          29         42         44

Table 4.2: Steps needed by the robot to move between the different places in the environment.

As shown in the hierarchy of figure 4.7, two intermediate behaviors, nearest and highest, as well as the global behavior postman, have to be learned. Once again the robot was first trained to learn the two intermediate behaviors, which were preserved thereafter, and then trained to learn the global behavior. We used feed-forward neural networks to store the Q-values. The same network architecture was used for the three above behaviors, as they share the same state space. It is composed of 40 input units, 3 hidden units and one output unit. All units have a sigmoid activation function. The input pattern is as follows (a small sketch of this encoding is given after the list):

- 35 units: each set of 7 units represents a sigmoidal coarse coding of either the number of letters in each office, the number of letters carried by the robot, or the battery level;

- 5 units: each of these units represents a possible location of the robot, i.e. in which place it is, so that exactly one of them is 'on' at any decision step.
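A Python sketch of this 40-unit input pattern; the coarse-coding centres, slopes and value ranges are illustrative.

    import math

    PLACES = ["office 1", "office 2", "office 3", "mailbox", "charger"]

    def coarse7(value, maximum):
        """Code a value in [0, maximum] with 7 overlapping sigmoid units."""
        centres = [maximum * i / 6.0 for i in range(7)]
        return [1.0 / (1.0 + math.exp(-(value - c) / (maximum / 10.0))) for c in centres]

    def coordination_input(letters_per_office, carried, battery, location):
        """35 coarse-coded units (letters in the 3 offices, carried letters and
        battery level) plus 5 one-hot units for the robot's current location."""
        pattern = []
        for letters in letters_per_office:               # 3 x 7 units
            pattern += coarse7(letters, 20.0)
        pattern += coarse7(carried, 20.0)                # 7 units
        pattern += coarse7(battery, 100.0)               # 7 units
        pattern += [1.0 if p == location else 0.0 for p in PLACES]   # 5 units
        return pattern

    print(len(coordination_input([2, 0, 5], carried=3, battery=75.0, location="mailbox")))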

However, the architectures differ in their number of networks and in their reinforcement functions. The intermediate behaviors needed five networks each: one for each navigation behavior. The global behavior needed only two networks: one for each intermediate behavior. The reinforcement function of each behavior is directly computed from the corresponding performance criterion, as explained in section 4.3.3.

The networks' weights, in each sensory-motor loop, were initialized with random values in the range [-0.1, 0.1], and the rest of the parameters were set as follows: \gamma = 0.99, \lambda = 0.5, a learning rate of 2.0 and N_{exp} = 1500. As for the navigation behaviors, the reinforcement function was scaled between -0.1 and 0.1.

Since we did not have an optimal policy for the postman robot problem, we decided to compare the performance of our hierarchical architecture with that of a flat architecture (figure 4.16) and of a hand-coded controller. In the flat architecture, the high-level behavior (postman) directly controls the navigation behaviors. It uses the same reinforcement function as the behavior highest does. We tried to use a rule-based reinforcement function, but the results were so bad that it would not be fair to compare them with the hierarchical architecture. So the comparison bears essentially on the architecture rather than on the behaviors' specification, because the behaviors were specified in the same way. The hand-coded controller uses a simple heuristic to choose between the navigation behaviors. This heuristic consists in moving to the office with the highest number of letters, posting the letters when the number of letters carried by the robot is higher than the number of letters in each office, and recharging the batteries when their level is below the threshold.

Each of these controllers was tested over 50000 decision steps, a decision step corresponding to an elementary behavior selection, and for each letter flow configuration of table 3.1. The battery level threshold was set to 60%.
Figure 4.16: The flat architecture used for the comparison with the hierarchical one: the postman behavior directly controls the five navigation behaviors (move to office 1, move to office 2, move to office 3, move to the mailbox, move to the charger).

The tables of figure 4.17 show the obtained results. Recall that a good postman robot is one which minimizes both the number of letters in standby in the offices and the carried letters, while keeping its battery level above the fixed threshold. We can see that both RL systems achieved good performances compared to those of the hand-coded one. The main reason is that the learning agents implicitly take into account some parameters like the distance between the rooms and the letter flows. Thus they can anticipate the effect of their decisions and move, for example, to the office from which the highest number of letters will actually be collected. The hand-coded agent decides to move to the office which contains the highest number of letters at the moment the decision is taken, but not necessarily when it is completed. On the other hand, we notice that the hierarchical architecture outperforms the flat architecture. With the former architecture there are, on average, 11.38 and 10.32 (respectively with the periodic and Poisson flows) fewer letters in standby in the offices than with the latter architecture, whereas the average number of letters carried rises by only 7.56 and 4.23 letters. Moreover, a better energy management is achieved by the hierarchical architecture. As can be observed in the curves of figure 4.18, the hierarchical architecture learns a better strategy than the flat one, and does so very quickly, i.e. it does not behave badly in the beginning. To explain this superiority we argue that the hierarchical architecture explores a smaller search space, in the sense that it coordinates only two sensory-motor loops which are pre-learned, whereas the flat architecture coordinates five sensory-motor loops. Another reason is that we were actually solving a Semi-Markov Decision Problem, that is, an MDP where the duration of the actions is not the same. The hierarchical architecture takes this feature into account and explicitly considers the elapsed time between two decisions, whereas the flat architecture does not.
  Periodic flow
    Parameters                      Hand-coded      Flat            Hierarchical
                                    architecture    architecture    architecture
    Average letters in Office 1     11.72           8.49            6.40
    Average letters in Office 2     9.87            9.70            4.76
    Average letters in Office 3     15.32           17.17           13.95
    Average letters carried         16.59           16.42           23.96
    Average battery level           68.95           68.92           72.62
    Average quality criterion       -46.61          -42.88          -38.18

  Poisson distribution flow
    Parameters                      Hand-coded      Flat            Hierarchical
                                    architecture    architecture    architecture
    Average letters in Office 1     15.52           9.43            8.18
    Average letters in Office 2     16.44           13.54           9.72
    Average letters in Office 3     20.00           21.97           16.32
    Average letters carried         21.09           25.59           29.82
    Average battery level           69.78           70.56           77.51
    Average quality criterion       -62.10          -55.86          -49.80

Figure 4.17: Tables summarizing the performance of the coordination methods for the different letter flow configurations.

Figure 4.18: Average of the quality criterion as a function of decision steps, for the flat architecture, the hierarchical architecture and the hand-coded controller. The top graph concerns the periodic letter flow and the bottom graph the Poisson distribution letter flow.

4.6 Summary

We have presented a methodology whose objective is to provide helpful guidelines to analyze and design agents capable of solving complex reinforcement learning problems. The methodology must be seen as a conceptual framework in which a number of methods are to be defined. The postman robot case study illustrated how the HPS methodology can be applied. The proposed specification and decomposition methods were successfully tested and have given good results. The methodology must now be applied to solve other problems in order to be generalized and completed, and some effort is needed to improve our methods or to propose new ones.

Chapter 5
The Coordination Problem
This chapter concerns the coordination problem, that is, how a complex behavior can be generated by the coordination of several sensory-motor loops. We first review the hierarchical methods that have been proposed so far to scale up reinforcement learning. Then we discuss the properties that a coordination mechanism should have and propose a new coordination method based on the restless bandits theory. Restless bandit allocation indexes are an extension of the Gittins indexes and are borrowed from the field of optimal scheduling. They concern problems involving the sharing of limited resources between several projects which are being pursued. The performance of the proposed method is illustrated on the postman robot problem and compared to that of Hierarchical Q-learning (Lin 1993).
5.1 Statement

Consider a collection of sensory-motor loops organized in a hierarchical structure (figure 5.1), in which the sensory-motor loops at a given level have a direct causal influence, in terms of activation or inhibition, on the sensory-motor loops of the level below. In such a hierarchy, decision making and learning occur at different levels, but the interaction with the environment can only take place at the lowest level. Finally, each sensory-motor loop has its own internal state, depending on the level at which it intervenes as well as on the task it has to solve. Generally stated, the coordination problem within a hierarchy of sensory-motor loops consists in activating, at each time step, one sensory-motor loop at each level in order to generate the global expected behavior. This is also known as the action selection problem and concerns the resolution of the conflicts which arise when several actions or behaviors compete for access to limited motor resources. It has been studied in ethology (McFarland 1981) as well as in adaptive behavior (Tyrell 1993).

Figure 5.1: A hierarchy of sensory-motor loops, from the primitive commands at level 0 up to the sensory-motor loops of level n; the path of the active sensory-motor loops on a given time step is represented in bold.
5.2 Related Work

It has been recognized that the use of hierarchies in reinforcement learning improves the learning performance. It allows a better exploration of the search space and the reuse of skills previously learned at a given level to acquire new skills at the level above, and it speeds up the overall learning process. Although we are especially interested in the selection device, that is, the mechanism that allows switching between sensory-motor loops, we take the opportunity to review most of the work done in hierarchical reinforcement learning. This work can be roughly grouped into four categories:

1. command and temporal abstraction;

2. state abstraction (coarsening or aggregation);

3. MDP decomposition (state space partitioning);

4. sub-goal decomposition (modular approaches).

Of course, some approaches may fall into multiple categories.
5.2.1 Hierar hi al Q-Learning

When we think about coordination methods, the first idea that comes to mind is to treat the sensory-motor loops as primitive commands. This approach was first introduced by Mahadevan and Connell (1992). In their work a global box-pushing behavior was decomposed into elementary sub-behaviors (finder, pusher, unwedger) which were learned independently using reinforcement learning. A hand-coded arbiter switches between the sub-behaviors according to their applicability conditions and their precedence.
Lin (1993) went further and proposed a system in which both the sub-behaviors and the arbiter were learned using Q-learning. The task to be achieved consisted of finding a battery charger in an office environment and connecting to it. As this task is too difficult for a monolithic agent to learn, it was decomposed into three sub-behaviors: following walls on the robot's left/right-hand side, passing a door, and docking on the charger. Each sub-behavior Sb_i was learned by a single sensory-motor loop with Q-learning using a local reinforcement function. These new skills are then used as actions by the arbiter, which learns Q(state, Sb_i) with a global reinforcement function and state space. A sub-behavior is selected according to its Q-value and some applicability conditions, and a new decision is made when an active sub-behavior ends or another one becomes applicable.
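A minimal sketch of such an arbiter is given below. It is only an illustration of the scheme just described, not Lin's implementation: the sub-behaviors are assumed to be already learned, the arbiter treats them as abstract actions, and it applies one-step Q-learning to the global state and reward when the active sub-behavior terminates.

    # Hedged sketch of an arbiter over pre-learned sub-behaviors.
    import random
    from collections import defaultdict

    class Arbiter:
        def __init__(self, sub_behaviors, alpha=0.1, gamma=0.99, epsilon=0.1):
            self.subs = sub_behaviors          # list of callable sub-behaviors
            self.Q = defaultdict(float)        # Q[(global_state, sub_index)]
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def select(self, state, applicable):
            # choose among applicable sub-behaviors, epsilon-greedily
            if random.random() < self.epsilon:
                return random.choice(applicable)
            return max(applicable, key=lambda i: self.Q[(state, i)])

        def update(self, state, sub, reward, next_state, next_applicable):
            # one-step Q-learning backup at sub-behavior termination
            best_next = max(self.Q[(next_state, i)] for i in next_applicable)
            td = reward + self.gamma * best_next - self.Q[(state, sub)]
            self.Q[(state, sub)] += self.alpha * td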
5.2.2 Feudal Q-Learning

The principle of this approach, proposed by Dayan and Hinton (1993), is to operate a coarsening at each level of the hierarchy, that is, each state at a given level represents an aggregation of states at the immediately lower level. The goal state is also abstracted, so that for each level i the goal state is the one to which the goal state of the lower level i-1 belongs. Given that each level has an assigned manager, the learning procedure works as follows. The manager of a level i, being in abstract state S^i, performs a command C^i which should lead him to a new abstract state at that level. This command becomes a goal for the manager of the lower level i-1, in the sense that commands have to be executed in order to enter a state of level i-1 belonging to the aggregation designated by C^i. This procedure continues down to the lowest level, where a primitive command is executed. An abstract action ends when a new state is observed at the same level. A manager is then rewarded if he achieves his goal and punished otherwise. If the goal is reached at a given level, its manager delegates the responsibility to his sub-manager to search within the zone defined by his abstract state. This approach was applied to a robot navigation task in an 8 × 8 grid without obstacles.
It has recently been extended by Dietterich (1997), who adds the possibility of hierarchical learning of the Q-values. The value function of an abstract command (i.e., the sum of rewards generated by the execution of this abstract command) is treated as an immediate reward by the level that selects it, just as the first level does with primitive commands. The direct consequence of this improvement is the polling execution of the hierarchy, that is, a decision is made at each level at each time step.
5.2.3 Hierarchical Distance to Goal

MDP decomposition methods consist in partitioning the state space into regions and computing an optimal policy for each of them. The resulting policies are then combined to solve the initial MDP.
In the HDG algorithm (Hierarchical Distance to Goal) proposed by Kaelbling (1993a), the state space is partitioned so that each region corresponds to a landmark. A landmark is actually a specific state, and a region is composed of the states that are closer to that landmark than to any other one. First, a high-level policy that leads to the goal region (i.e. the region containing the goal state) starting from any other region is learned. It gives the agent the next region to reach on the route from its current region (i.e. that of the agent's closest landmark) to the goal region. Then, for each region, a policy that allows the agent to move to the neighboring regions is learned. Once the agent is in the goal's region, it learns how to reach the goal state. The union of these policies defines the global solution.
The landmarks are given a priori by the designer. However, methods to find them autonomously are currently being investigated.
Similar approaches have been studied by Parr (1998), Dean and Lin (1995) and Hauskrecht et al. (1998).
5.2.4 W-Learning

In modular approaches, sensory-motor loops are used as gating mechanisms whose role is to control the flow of commands from the bottom to the top of the hierarchy. There is no temporal or state abstraction. The problem is solved at the lowest level of abstraction by using the suggestions of several experts. Humphrys (1997) and Whitehead et al. (1993) proposed a two-level architecture in which several modules (sensory-motor loops) compete to get the control of the agent. Each module learns how to achieve a sub-goal and maintains its own Q-value tables. In a given state x observed by the agent, each module M_i suggests a command c_i it wants to see executed. The module chooses the command according to its utility Q_i(x, c_i) and strengthens it with a weight W_i(x). The agent finds the module M_k with the highest weight,

    W_k(x) = max_i W_i(x),

and executes the suggested command c_k. The value of W_i(x) may be computed as follows:

- W_i(x) = Q_i(x, c_i), called maximize best happiness by Humphrys (1997) and nearest neighbor by Whitehead et al. (1993);
- W_i(x) = Σ_j Q_j(x, c_i), called maximize collective happiness by Humphrys (1997) and greatest mass by Whitehead et al. (1993).

A more interesting way to compute W_i(x) is to make it express the difference between the utility Q_i(x, c_i) that a module M_i has of being obeyed and the utility Q_i(x, c_k) of not being obeyed (actually following the suggestion of module M_k):

    W_i(x) = Q_i(x, c_i) - Q_i(x, c_k).

This approach is similar to ours, which we introduce in section 5.4.1. The idea is to minimize the worst unhappiness, that is, to perform the command c_k of the module M_k that will suffer most if it is not obeyed:

    W_k(x) = max_i max_j (Q_i(x, c_i) - Q_i(x, c_j)).

However, the result of the selection is greatly influenced by the order in which the modules' suggestions are examined, and the same suite of commands is needed for all modules. To overcome this drawback, Humphrys (1997) proposed the following update rule, which he called W-learning, to estimate W_i(x) online, even when the modules do not share the same set of commands:

    W_i(x) ← (1 - α) W_i(x) + α (Q_i(x, c_i) - (r_i + γ max_{b ∈ C_i} Q_i(y, b))),

for all i ≠ k, where M_k is the winning module. We notice that the transition is caused by the command c_k and that the error represents the loss of profit of module M_i. It is assumed in this rule that Q_i is already learned. Therefore, if Q_i and W_i(x) are to be estimated conjointly, it is necessary to delay the learning of W_i(x).
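The rule above can be sketched as follows. This is an illustration under assumed data structures (dicts of tables), not Humphrys' code: after the winner M_k has acted, every other module updates the weight of the state it was in by the profit it lost by not being obeyed.

    # Hedged sketch of one W-learning step.
    def w_learning_update(W, Q, command_sets, x, y, k, rewards, suggested,
                          alpha=0.1, gamma=0.99):
        """W[i][x]: weight table of module i; Q[i][(x, c)]: its Q-values;
        command_sets[i]: commands C_i available to module i; x, y: state
        before/after the winner's command; k: winning module; rewards[i]:
        reward seen by module i; suggested[i]: command c_i proposed in x."""
        for i in Q:
            if i == k:                 # the obeyed module does not update W_i
                continue
            backup = rewards[i] + gamma * max(Q[i][(y, b)] for b in command_sets[i])
            loss = Q[i][(x, suggested[i])] - backup   # lost profit of module i
            W[i][x] = (1 - alpha) * W[i][x] + alpha * loss
        return W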
5.2.5 Compositional Q-Learning

Singh (1992) developed an architecture to solve compositional tasks, that is, tasks which can be expressed as a sequence of sub-tasks. The originality of his approach is that sub-tasks are not assigned a priori to sensory-motor loops. During the learning phase a reward is generated only when a sub-task is achieved or when the whole composite task is completed. A gating function learns to select the sensory-motor loop that will actually perform its command. The winning sensory-motor loop is the one which has the best estimate of the Q-values (or the smallest expected error) for the sub-task that is currently executed. Because the sensory-motor loop that has produced the least error learns the most (in proportion to the error), the more a sensory-motor loop learns a given sub-task, the more it improves its Q-value estimates. Thus its probability of being selected for the same sub-task will increase, leading to the emergence of a sub-task assignment over the sensory-motor loops.


5.2.6 Macro Q-Learning

Finally, Sutton et al. (1998) studied the case where an MDP has to be solved using abstract actions (options or macro-actions, as they call them). To do so they used SMDP Q-learning (Bradtke and Duff 1995; Mahadevan et al. 1997) and introduced the notion of Termination Improvement. During the execution of a particular option o, launched at time t from state s_t and normally terminating at time t + k, it is possible to update the utility values of performing option o (as well as of other options whose trajectories are included in the one of option o) from each state s_{t+i} (1 ≤ i < k). Thus, information for making decisions is available in every state, and an ongoing option can be interrupted in any state in favor of a more promising option. The notion of macro-action interruption is discussed in the next section.
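As a point of reference, the plain SMDP Q-learning backup that the above methods build upon can be sketched as follows; the function and variable names are illustrative assumptions rather than the cited authors' code.

    # Hedged sketch of one SMDP Q-learning backup for an option o started
    # in state s, run for k primitive steps, accumulating the discounted
    # reward R = r_1 + gamma*r_2 + ... + gamma**(k-1)*r_k, ending in s_next.
    def smdp_q_update(Q, s, o, R, k, s_next, options, alpha=0.1, gamma=0.99):
        best_next = max(Q[(s_next, o2)] for o2 in options)
        target = R + (gamma ** k) * best_next
        Q[(s, o)] += alpha * (target - Q[(s, o)])
        return Q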

5.3 The Selection Device

To ensure efficient coordination of the sensory-motor loops, a number of useful characteristics are required of the selection device (as reported by Prescott et al. (1999)).
Providing clean switching, that is, directly selecting the sensory-motor loop with the highest activation, constitutes the first property. The second one states that there must be no interference from sensory-motor loops that are applicable but not selected; in other words, only the selected sensory-motor loop controls the agent. These two properties can be implemented by an indexed policy (see the next section). To define the third characteristic we first need to introduce the notion of preemption. The approaches reviewed in the previous section can be classified into two categories: those using a command selection scheme and those using a behavior selection scheme. These two schemes are respectively preemptive and non-preemptive.
In the behavior selection scheme, the learning process is uniform through the levels of the hierarchy and the problem is solved at different levels of abstraction. However, from the second level of the hierarchy on, action selection is replaced by behavior selection and the time scale for decision making rises from t to T^1 (figure 5.1). This means that once a sensory-motor loop is selected it will keep the control of the agent until its completion (a sensory-motor loop is completed when it reaches a state which is a goal or in which it is not applicable anymore). Hence learning will occur only at the termination of this sensory-motor loop. Because a selected sensory-motor loop cannot be interrupted, this scheme may have some drawbacks in problems involving the satisfaction of multiple and concurrent objectives. On the other hand, exploration is improved because the state space is covered using big steps (Dietterich 1997).
In the action selection scheme only one behavior remains: the one which is produced by the overall system. It may be analyzed at various levels as consisting of streams of actions ranging from reactive to planning operations. At each time step the system learns and makes decisions at each level of the hierarchy. Therefore sensory-motor loops at any level may interrupt each other. Such continual interruption leads to inefficient exploration because it reduces the probability of reaching sensible states.
Let us illustrate these two schemes with the traditional ethological example of an animal having to satisfy both hunger and thirst drives. We assume that food and water are in different locations and that there are several levels of thirst and hunger. Suppose that the animal is hungry and that this activates the behavior leading it towards the food. If the thirst level becomes higher than the hunger one and the animal cannot interrupt the selected behavior, it might die of dehydration en route towards the food. On the other hand, if it can interrupt its behaviors at any time and the levels of thirst and hunger become alternately higher and lower relative to each other, it may die of starvation or dehydration somewhere between the two locations.
These two approaches seem extreme but complementary. To find a compromise we can either introduce a model of fatigue, based on time sharing, into the action selection scheme, or allow the interruption of sensory-motor loops in the behavior selection scheme. The second method seems more natural than the first one but may exhibit unstable behavior. In effect, two sensory-motor loops with close activation degrees may interrupt each other (as explained in the above example), thus generating an oscillation. This phenomenon is called dithering (Prescott et al. 1999; Redgrave et al. 1999). A way of overcoming this problem is to add some kind of persistence to the active sensory-motor loop. It means that to interrupt an active sensory-motor loop, the candidate sensory-motor loop (i.e. the one with the highest activation degree among the applicable but inactive sensory-motor loops) must not only have a greater activation degree than the active sensory-motor loop but must also exceed it by a given constant w/2. The constant w is the width of the hysteresis loop representing the behavior switching between active and passive phases (figure 5.2).
It has been hypothesized that a selection mechanism of this form is implemented in the vertebrate brain by the basal ganglia (Prescott et al. 1999; Redgrave et al. 1999).

Figure 5.2: The hysteresis loop representing the behavior switching between the active and passive phases (active versus inactive as a function of the index difference I - I_c). I is the index of the candidate behavior and w is the width of the hysteresis.
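A minimal sketch of this persistence mechanism is given below; it is illustrative code with an arbitrary value of w, not an implementation taken from the thesis.

    # Hedged sketch of persistence via hysteresis: a candidate loop only
    # preempts the active one if its index exceeds the active loop's index
    # by more than w/2.
    def select_with_hysteresis(indexes, active, w=0.2):
        """indexes: dict loop -> activation index; active: currently active
        loop (or None); w: width of the hysteresis loop."""
        candidate = max(indexes, key=indexes.get)
        if active is None or candidate == active:
            return candidate
        if indexes[candidate] > indexes[active] + w / 2.0:
            return candidate          # clean switch: preempt the active loop
        return active                 # persistence: keep the active loop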

5.4 Indexed Policy

An indexed policy consists in allocating an index to each sensory-motor loop and activating the one with the highest index in a winner-take-all manner. Of course, indexes which are computed adaptively and online are highly desirable. In Hierarchical Q-learning (Lin 1993) the indexes simply correspond to the Q-values of selecting a sensory-motor loop in a certain global state. In W-learning the index is the value of the strength W(x). Here we introduce another method to compute such indexes, based on the restless bandit theory, which we call RBI-learning.
5.4.1 The Restless Bandits

The restless bandits are an extension of the multi-armed bandit problem and have been studied by Whittle (1988) and by Weber and Weiss (1989). The initial problem concerns n projects, the state of a project i at time t being denoted by x_i(t). At each time step t exactly one project has to be operated. If the operated project is i then it generates a reward r_i(t) and makes a transition x_i(t) → x_i(t+1) according to its transition probabilities P_i. The other n-1 projects remain frozen, i.e. they neither produce reward nor change state. A project is said to be in an active or passive phase depending upon whether it is selected or not. Gittins (1989) has shown that an index policy is optimal for this problem. Such an index is denoted I_i(x_i) and is a function of the project i as well as of its state x_i:

    I_i(x_i) = max_{τ>0} E[ Σ_{t=0}^{τ-1} γ^t r_i(t) ] / E[ Σ_{t=0}^{τ-1} γ^t ].    (5.1)

This index can be interpreted as the maximal value of the reward density relative to the stopping time τ. The optimal policy is simply to select the project with the greatest index. The nice property of such a strategy is that I_i only depends on information concerning project i; the dimensionality of the problem is considerably reduced.
To get a better and intuitive understanding of the Gittins indexes, we will examine the following didactic example provided by Duff (personal communication), where for the sake of simplicity the rewards are deterministic. Imagine several stacks containing numbers, which are rewards, and suppose that we can see the entire contents of each stack. Our goal is to pop the stacks in an order that maximizes the discounted sum of the resulting reward stream. We can convince ourselves that the optimal strategy involves popping the stack with the highest reward density:

    D_i = max_T [ Σ_{k=0}^{T-1} γ^k r_i(k) ] / [ Σ_{k=0}^{T-1} γ^k ],    (5.2)

where r_i(k) is the content of stack i at position k, starting from the top. Stacks with higher reward density contain high rewards near their top and have to be popped first because of the discount factor (figure 5.3).
Figure 5.3: Stack reward densities for γ = 0.9 (D_1 = 4.05, D_2 = 8.84, D_3 = 2.07). Notice that the stack to pop is not necessarily the one with the highest value at its top.
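The density computation of equation (5.2) can be sketched as follows; the stack contents below are made up for the illustration and are not those of figure 5.3.

    # Hedged illustration of the reward-density rule (equation 5.2).
    def reward_density(stack, gamma=0.9):
        """stack[0] is the top element; returns the maximum over truncation
        depths T of (sum_k gamma^k * stack[k]) / (sum_k gamma^k), k < T."""
        best, num, den = float("-inf"), 0.0, 0.0
        for k, r in enumerate(stack):
            num += (gamma ** k) * r
            den += gamma ** k
            best = max(best, num / den)
        return best

    stacks = [[3, 0, 0], [2, 9, 9], [1, 1, 1]]       # made-up contents
    densities = [reward_density(s) for s in stacks]   # about [3.0, 6.42, 1.0]
    best_stack = max(range(len(stacks)), key=lambda i: densities[i])
    # best_stack == 1: the stack to pop first is not the one with the highest top.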

Unfortunately this method cannot be directly applied to solve the coordination problem (with the projects replaced by sensory-motor loops), because the fundamental assumption, namely that the unselected projects remain frozen, is not valid anymore. This happens in many cases and especially in mobile robotics, because the states of the sensory-motor loops are built from the same agent's perceptions and these perceptions evolve whatever the selected sensory-motor loop is.
To treat the restless bandit problem we introduce the following notation (to present the theory we keep using the term project instead of sensory-motor loop):

- X_i is the set of states of project i;
- P_i(x, y, k) is the probability that project i moves from state x to state y when it is in phase k, where k = 1 or 2 for the active or the passive phase respectively;
- r_i^k(t) is the reward produced at time t by project i in phase k.


If we want to maximize the discounted sum of rewards over an infinite horizon, for a single project i we just have to solve the following optimality equation

    V_i(x) = max_{k=1,2} E[ r_i^k + γ Σ_{y ∈ X_i} P_i(x, y, k) V_i(y) ],    (5.3)

where V_i(x) is the value function of project i in state x. To do so we compute the Q-values

    Q_i(x, k) = E[ r_i^k + γ max_{b=1,2} Q_i(y, b) ]    (5.4)

and then decide to activate or freeze the project according to its Q-values.
Consider now the multi-project case. We are essentially interested in maximizing

    Σ_t Σ_i γ^t r_i^k(t)    (5.5)

subject to Σ_i l_i(t) = n - 1, where l_i(t) = 1 if project i is passive at time t and l_i(t) = 0 otherwise (it means that at each time step only one project has to be active). Such a maximization amounts to maximizing

    Σ_t ( γ^t Σ_i r_i^k(t) + λ(t) Σ_i l_i(t) ),    (5.6)

where λ is a Lagrangian multiplier. The new optimality equation to solve becomes

    V_i(x) = max_{k=1,2} { r_i^k + λ l_i + γ Σ_{y ∈ X_i} P_i(x, y, k) V_i(y) },    (5.7)

or more compactly

    V_i(x) = max{ L_i^1, λ + L_i^2 },    (5.8)

where

    L_i^k = r_i^k + γ Σ_{y ∈ X_i} P_i(x, y, k) V_i(y).    (5.9)

Whittle remarked that λ can be seen by an economist as a 'subsidy for passivity', tuned at a level which guarantees that only one project is active at a time. The index of a project i in state x_i is then defined as the value λ_i(x_i) of λ which makes L_i^1 = λ + L_i^2. It can be computed using the Q-values of the project.

Proposition. The index of a project i in state x_i is

    λ_i(x_i) = Q_i(x_i, 1) - Q_i(x_i, 2).

Proof. Let x = (x_1, x_2, ..., x_i, ..., x_n) be the composite state of the global problem, and let Q(x, k) be the utility of activating project k in state x:

    Q(x, k) = Q_k(x_k, 1) + Σ_{i≠k} Q_i(x_i, 2).

Let m be the project that maximizes this utility. We have

    Q(x, m) = max_k Q(x, k)  ⟹  Q(x, m) ≥ Q(x, k)  for all k ∈ [1, n].

This inequality can be written as follows:

    Q_m(x_m, 1) + Q_k(x_k, 2) + Σ_{i≠k,m} Q_i(x_i, 2) ≥ Q_m(x_m, 2) + Q_k(x_k, 1) + Σ_{i≠k,m} Q_i(x_i, 2)
    ⟹ Q_m(x_m, 1) + Q_k(x_k, 2) ≥ Q_m(x_m, 2) + Q_k(x_k, 1)
    ⟹ Q_m(x_m, 1) - Q_m(x_m, 2) ≥ Q_k(x_k, 1) - Q_k(x_k, 2). Q.E.D.
Figure 5.4 shows the algorithm of RBI-learning.


5.4.2 Discussion

Intuitively we can see that the index actually reflects the need for a project to be active with respect to the exploration and exploitation criteria. Indeed, the value of λ increases if

- Q_i(x_i, 1) increases, which means that the project needs to be active (exploitation phase), or
- Q_i(x_i, 2) decreases, which means that the project does not want to be passive (exploration of the effects of its activation). This condition holds as long as the project is deteriorating during its passive phase (i.e., receiving negative rewards).

    loop
        Observe state x_i for each project i
        for each project i do
            I_i(x_i) = Q_i(x_i, 1) - Q_i(x_i, 2)
        end for
        Activate project k such that I_k(x_k) = max_i I_i(x_i)
        Update Q_k(x_k, 1)
        for each project i ≠ k do
            Update Q_i(x_i, 2)
        end for
    end loop

Figure 5.4: Algorithm of RBI-learning.
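A runnable sketch of one iteration of this loop is given below. It is an illustration under assumed data structures (tabular Q-values and one reward per project per step), not the neural-network implementation used in the experiments.

    # Hedged sketch of one decision step of RBI-learning (figure 5.4).
    def rbi_step(Q, states, next_states, rewards, alpha=0.1, gamma=0.99):
        """Q[i][(x, k)]: Q-value of project i in state x and phase k (1=active,
        2=passive); states[i] / next_states[i]: state of project i before and
        after the step; rewards[i]: reward observed by project i on this step."""
        # 1. compute the indexes and activate the project with the largest one
        index = {i: Q[i][(states[i], 1)] - Q[i][(states[i], 2)] for i in Q}
        winner = max(index, key=index.get)
        # 2. Q-learning updates: active phase for the winner, passive for the others
        for i in Q:
            phase = 1 if i == winner else 2
            x, y = states[i], next_states[i]
            best_next = max(Q[i][(y, 1)], Q[i][(y, 2)])
            target = rewards[i] + gamma * best_next
            Q[i][(x, phase)] += alpha * (target - Q[i][(x, phase)])
        return winner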

On the other hand, the utilities to a project of being active or passive can be seen respectively as activation and inhibition signals. Thus, persistence may be implemented by simply removing the inhibition signal from the selected sensory-motor loop and keeping it for the others.
Our coordination method may be situated between Hierarchical Q-learning and W-learning. RBI-learning and W-learning are similar because they are both motivated by the same criterion, which is to reduce the loss of profit when a project (module) is not selected (obeyed). However, they differ in the sense that RBI-learning supports temporal abstraction (like Hierarchical Q-learning) whereas W-learning does not: W-learning needs to perform an update after each execution of a primitive command. In addition, RBI-learning is supported by a strong theory and does not require any pre-learned Q-values.
5.5 Experiments
The coordination method we presented above is now evaluated and its performance compared to that of Hierarchical Q-learning. We have made no comparison with W-learning: it is not applicable to the postman robot problem because it does not support temporal abstraction, and updating the Q-values of primitive commands (the robot's movements) would be inefficient since the state space is huge and the reinforcement is only given when the robot reaches one of the sub-goals (offices, mailbox, charger). To carry out the evaluation we followed the HPS methodology, therefore some results and settings from the previous chapter are reused. We decided to test these coordination methods on the flat architecture, using the behavior selection scheme. The network architecture used to implement the Hierarchical Q-learning method is the same as in the previous chapter; however, we recall it here. Each of the five neural networks of the architecture is composed of 40 input units, 3 hidden units and one output unit. All units have a sigmoid activation function. The input pattern is as follows (a code sketch of this encoding follows the list):
- 35 units: each set of 7 units represents a sigmoidal coarse coding of either the number of letters in each office, the number of letters carried by the robot, or the battery level;
- 5 units: each of these units represents a possible location of the robot, i.e. in which room it is, so exactly one unit is 'on' at each decision step.
Recall also that the function to be optimized is f(x, c, t) = Σ_i x_{l_i}(t) + x_r(t) + c(x, t)(x_{b_th} - x_b(t)) and that the instantaneous reinforcement function is r(t) = f(x, c, t) - f(x, c, t-1). For RBI-learning, the above function is linearly decomposed into five functions, one for each elementary sensory-motor loop (a sketch of the resulting reward decomposition follows the list). We obtained:
- f_1(x, t) = x_{l_1}(t) for the sensory-motor loop corresponding to the behavior move to office 1;
- f_2(x, t) = x_{l_2}(t) for the sensory-motor loop corresponding to the behavior move to office 2;
- f_3(x, t) = x_{l_3}(t) for the sensory-motor loop corresponding to the behavior move to office 3;
- f_4(x, t) = x_r(t) for the sensory-motor loop corresponding to the behavior move to the mailbox;
- f_5(x, c, t) = c(x, t)(x_{b_th} - x_b(t)) for the sensory-motor loop corresponding to the behavior move to the charger.
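The sketch below illustrates this decomposition as stated, computing the five local terms and the corresponding local reinforcements r_i(t) = f_i(t) - f_i(t-1); the state representation and the name of the constraint indicator are assumptions, and the sign and weighting conventions are those of the thesis' quality criterion.

    # Hedged sketch of the linear decomposition of the quality criterion.
    def local_terms(state, battery_threshold=60.0):
        """state: dict with 'letters' (counts for offices 1-3), 'carried',
        'battery' and 'constraint' (1 when the battery constraint is active)."""
        penalty = state['constraint'] * (battery_threshold - state['battery'])
        return [state['letters'][0],      # f1: move to office 1
                state['letters'][1],      # f2: move to office 2
                state['letters'][2],      # f3: move to office 3
                state['carried'],         # f4: move to the mailbox
                penalty]                  # f5: move to the charger

    def local_rewards(prev_state, curr_state):
        """r_i(t) = f_i(t) - f_i(t-1), one reinforcement per loop."""
        return [c - p for c, p in zip(local_terms(curr_state), local_terms(prev_state))]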

We used two different network architectures to implement the restless bandits method. In the first one, all the sensory-motor loops share the same state space, so it is similar to the one of Hierarchical Q-learning. In the second architecture, the state space is reduced for each sensory-motor loop in order to keep only the features relevant to the function to be optimized. Thus, for each sensory-motor loop, we kept the features representing the robot location (5 units) and the features representing the quantity to be optimized (7 units), corresponding for example to the number of letters carried by the robot for the behavior move to the mailbox. Hence we obtained networks with 12 input units, 2 hidden units and one output unit. For both architectures, however, we needed two networks per sensory-motor loop: one to approximate the Q-values of each phase (passive or active) of the sensory-motor loop. The network weights, in each sensory-motor loop, were initialized with random values in the range [-0.1, 0.1] and the reinforcement function was scaled between -0.1 and 0.1. The rest of the parameters were set as follows: γ = 0.99, λ = 0.5, a learning rate of 2.0 and, for Hierarchical Q-learning, N_exp = 1500. There was no exploration phase for RBI-learning.
Each of these controllers was tested over 50000 decision steps, a decision step corresponding to an elementary behavior selection, and for each letter-flow configuration of table 3.1. The battery level threshold was set to 60%.
The results reported in the tables of figure 5.5 show that RBI-learning outperforms Hierarchical Q-learning. For the Poisson distribution we can see that with the former method there are on average 8.75 fewer letters waiting in the offices than with the latter method, whereas the average number of carried letters increases by only 5.88 letters. For the periodic flow the carried letters are almost the same, whereas the letters waiting drop by 5.33 letters for RBI-learning. Moreover, better energy management is achieved by the restless bandits method. The performance of RBI-learning can be explained by the fact that the reinforcement function is decomposed and that there is a good balance between exploration and exploitation, which allows a good strategy to be found very quickly (figure 5.6).
Surprisingly, RBI-learning with the reduced state space did not give the expected results. We expected that, because of the smaller search spaces, a better strategy would have been found, or at least that the indexes would have been learned more quickly. It seems that in practice better performance is to be expected when the state space is kept the same (personal communication from John Tsitsiklis).
Periodic flow

    Parameters                    Hierarchical   Restless Bandits   Restless Bandits
                                  Q-learning     full space         reduced space
    Average letters in Office 1   8.49           8.88               9.82
    Average letters in Office 2   9.70           8.26               16.65
    Average letters in Office 3   17.17          12.89              18.47
    Average letters carried       16.42          15.31              16.18
    Average battery level         68.92          71.00              71.42
    Average quality criterion     -42.88         -37.42             -53.06

Poisson distribution flow

    Parameters                    Hierarchical   Restless Bandits   Restless Bandits
                                  Q-learning     full space         reduced space
    Average letters in Office 1   9.43           7.75               10.15
    Average letters in Office 2   13.54          11.21              17.99
    Average letters in Office 3   21.97          17.23              21.00
    Average letters carried       25.59          31.47              27.82
    Average battery level         70.56          72.12              68.17
    Average quality criterion     -55.86         -52.82             -63.26

Figure 5.5: Tables summarizing the performances of the coordination methods for different letter-flow configurations.

5.6 Summary
In order to solve the coordination problem we took inspiration from the functioning of the action selection device of natural control systems. We proposed a computational model based on restless bandit indexes that implements such a device and showed that its performance exceeds that of an existing method. However, we used the behavior selection scheme without interruption in our implementation because, so far, we do not have a clear idea of how interruption should work. We think that this issue is of great importance and we will investigate it in our future work.


Figure 5.6: Average of the quality criterion as a function of decision steps (0 to 50000), for Hierarchical Q-learning, Restless Bandits with full space and Restless Bandits with reduced space. The top graph concerns the periodic letter flow and the bottom graph the Poisson distribution letter flow.

Chapter 6

Conclusion

6.1 Summary of contributions

The work presented in this thesis was motivated by the need to solve complex problems using embedded agents learning by reinforcement. We identified and analyzed the reasons that make standard reinforcement learning methods impractical in complex domains and proposed some mechanisms to scale up these approaches. Our contributions are summarized as follows.
We set up a new design methodology whose aim is to systematize the agent's design process (Faihe and Muller 1997). It provides a conceptual framework for designing hierarchical control architectures for embedded autonomous agents. The objectives of each stage of the methodology were clearly defined and the distinction was made between what the agent has to learn and what has to be given a priori by the designer.
Assuming that the solution to the problem corresponds to a particular pattern of interaction between the agent and the environment, we established the relationship between solving a problem and generating a behavior. Then we proposed a way of formally specifying a behavior. To do so we used a quality criterion, composed of an objective function and a set of constraints. The desired behavior is the one generating a trajectory (in the interaction space) that optimizes the objective function without violating the constraints. In addition to being both a formal and natural means of defining a behavior, the proposed method allows us to derive the reinforcement function (as a progress estimator), to learn the behavior, and to have a good basis for the decomposition process.
A graphical approach was proposed to perform the problem's decomposition (or the behavior's decomposition, since to each problem corresponds a set of behaviors that solves it). Although this technique still relies partly on the designer's intuition and experience, it allows sub-behaviors to be discovered that would not be identified otherwise.
Concerning the coordination problem, we reviewed the features present in the behavior selection mechanisms of natural systems and highly desirable in artificial systems. We proposed a coordination method based on restless bandit indexes (Faihe and Muller 1998). It extends and generalizes W-learning, is completely distributed, and has been shown to be more powerful than Hierarchical Q-learning for the postman robot task.
The feasibility of the methodology as well as the performance of the methods were demonstrated through the postman robot problem, which is a non-trivial problem. In addition we developed and implemented a three-level architecture, which is rarely found in the reinforcement learning area.
6.2 Practical Issues
The implementation of the postman robot architecture was not straightforward and sometimes resulted in agents that failed to converge to a satisfactory solution. The main difficulty was finding a good tuning of the parameters, which are the learning rate, the eligibility trace factor λ, the discount factor γ and the number of exploration steps N_exp. Unfortunately there is no scientific method to tune such parameters, so they are chosen according to one's own experience and experiments as well as those reported by other researchers.
We noticed that the learning rate and λ are closely linked and that the evolution of one of them affects the value of the other. A bad setting of these parameters results either in slow convergence or in a complete failure of the learning process. We decided to measure the performance of the agent (i.e. the average value of the quality criterion after 50000 decision steps) for several values of these two parameters. The best results were obtained for λ = 0.5 and a learning rate of 2.0, which are the values used during our experiments for all the architectures.
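The tuning procedure just described amounts to a small grid search; a sketch is given below, where the candidate values and the evaluate function are illustrative assumptions rather than the actual sweep performed.

    # Hedged sketch of the parameter sweep described above: run the agent
    # for 50000 decision steps for each (learning rate, lambda) pair and
    # keep the pair with the best average quality criterion.
    def sweep(evaluate, learning_rates=(0.1, 0.5, 1.0, 2.0),
              lambdas=(0.0, 0.3, 0.5, 0.7, 0.9)):
        """evaluate(lr, lam) -> average quality criterion after 50000 steps."""
        results = {(lr, lam): evaluate(lr, lam)
                   for lr in learning_rates for lam in lambdas}
        best = max(results, key=results.get)   # higher (less negative) is better
        return best, results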
The number of exploration steps was easy to find. Starting with a small value of N_exp (200 steps), we increased it progressively up to 8000 steps and recorded the agent's performance. We noticed that the performance improves while N_exp increases, then stabilizes between 1500 and 4000 steps, and deteriorates thereafter. In effect, if the value of N_exp is too low then the agent will be unable to find a good policy (due to the lack of search), while a high value will prevent the agent from consolidating its knowledge because of the random perturbations.
For the discount factor γ one may wonder whether to discount (γ < 1) or not (γ = 1). Discounting is useful for any task that is learned in trials. Navigation tasks, for example, are suitable for learning with discounting because solutions that allow the agent to reach the goal in very few steps are preferred. The scheduling task (the coordination of the navigation behaviors) is a continuous task. Therefore a natural and logical optimality criterion would be the average reward received over time. General results for online learning using such a criterion are currently in progress (Mahadevan 1994). However, we obtained fair performance with γ = 0.99.
Another difficulty we had to face concerns the stability of the neural networks. It was impossible to get a stable network with a linear output unit, even with a very low learning rate (of the order of magnitude of 10^-3). For this reason we used networks with non-linear output units. Nevertheless we were constrained to scale the reinforcement value between -0.1 and 0.1 to avoid large updates, which may make the units blow up.

6.3 Future work
Further research that can be carried out in the direction of the work presented in this dissertation is twofold. It may concern the extension of the methodology or the improvement of the proposed methods.
One possible way of extending the methodology would be to automate the processes which require extensive human intervention. Such processes are the decomposition of a behavior into sub-behaviors and the design of the sensory-motor loops. We do believe that animals which learn by reinforcement (such as birds learning to fly) are born with all the structures necessary to achieve such learning. These structures are genetically transmitted and have evolved through several generations to fit their environment. In our framework we are interested in finding a hierarchy of sub-behaviors with, for each of them, the reinforcement function as well as the sets of relevant perceptions and commands. It is possible to do so using genetic algorithms, but we still need to find a means of describing the representation and the dynamics of the agent's internal structure.
The behavior specification method we proposed may have some drawbacks in case the agent does not have the ability to sense features that allow frequent updates of the quality criterion. This problem, which arises particularly in robotics, may make the learning system fail because of the lack of immediate reinforcements. A common way of adding immediate reinforcements is to provide the agent with advice. Advice comes from a teacher's visual evaluation of the agent's performance and may completely bias the learning procedure or make the agent exhibit unexpected behaviors, because it is difficult to put oneself in the agent's shoes. Therefore an interesting issue would be to find a way of carefully integrating such advice into the quality criterion.
A mathematical approach to performing the decomposition is highly desirable, in the sense that it would allow us to understand this process and to automate it.
The coordination problem has to be investigated within a theoretical framework. The most suitable one is proposed by Sutton et al. (1998). It consists in solving MDPs using macro-actions and involves temporal and behavior abstractions as well as macro-action interruption. Interesting directions for investigation concern state abstraction and the implementation of a behavior's persistence. In the latter direction, an important issue would be to identify the states in which it is worth interrupting a macro-action, in order to avoid updating and making a new decision in each state of a macro-action's trajectory.
Finally, an intensive application of the HPS methodology to different problems in different areas would help to find out its weaknesses and overcome them.
6.4 Epilogue
The work presented in this thesis takes place within the general context of learning and development in artificial creatures. The long-term objective is to find mechanisms that allow animats to incrementally develop their intelligence in a constructivist manner. It means that they have to discover and develop by themselves the building blocks that will be used to build more and more complex skills. The main rule is that they can only learn what is close to what they already know. We had the opportunity to verify this rule during the implementation of the hierarchical architecture (section 4.5.2). We were unable to obtain a stable strategy using what Dorigo and Colombetti (1998) call holistic learning, that is, learning from scratch all the behaviors of each level at the same time, even when learning was delayed between the levels (by increasing the number of exploration steps or reducing the learning rate of the upper behaviors). This is why we adopted a modular learning approach.

Bibliography
Barraquand, J. and J. C. Latombe (1991). Robot motion planning: A distributed representation approach. The International Journal of Robotics Research 10(6), 628-649.

Barto, A., R. Sutton, and C. Watkins (1990). Learning and sequential decision making. In M. Gabriel and J. W. Moore (Eds.), Learning and Sequential Decision Making. The MIT Press.

Barto, A. G., S. J. Bradtke, and S. P. Singh (1995). Learning to act using real-time dynamic programming. Artificial Intelligence 72(1-2), 81-138.

Barto, A. G. and S. P. Singh (1990). On the computational economics of reinforcement learning. In D. S. Touretzky (Ed.), Connectionist Models: Proceedings of the 1990 Summer School. Morgan Kaufmann.

Benbrahim, H. and J. A. Franklin (1997). Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems 22, 284-302.

Bradtke, S. J. and M. O. Duff (1995). Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems 7. MIT Press.

Braitenberg, V. (1984). Vehicles: Experiments in Synthetic Psychology. MIT Press.

Cichosz, P. (1995). Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research 2, 287-318.

Colombetti, M., M. Dorigo, and G. Borghi (1996). Behavior analysis and design: a methodology for behavior engineering. IEEE Transactions on Systems, Man and Cybernetics 26.

Crites, R. H. (1996). Large-Scale Dynamic Optimization using Teams of Reinforcement Learning Agents. Ph.D. thesis, University of Massachusetts.

Dayan, P. and G. E. Hinton (1993). Feudal reinforcement learning. In Advances in Neural Information Processing Systems 5.

Dean, T. and S.-H. Lin (1995). Decomposition techniques for planning in stochastic domains. Technical Report CS-95-10, Brown University.

Dietterich, T. G. (1997). Hierarchical reinforcement learning with the MAXQ value function decomposition. Technical report, Oregon State University.

Dorigo, M. and M. Colombetti (1998). Robot Shaping: An Experiment in Behavior Engineering. MIT Press/Bradford Books.

Faihe, Y. and J.-P. Muller (1997). Behavior analysis and design: Towards a methodology. In A. Birk and J. Demiris (Eds.), Proceedings of the Sixth European Workshop on Learning Robots (EWLR6), Lecture Notes in Artificial Intelligence. Springer-Verlag.

Faihe, Y. and J.-P. Muller (1998). Behaviors coordination using restless bandits allocation indexes. In Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior (SAB98).

Gittins, J. C. (1989). Multi-armed Bandit Allocation Indices. Wiley.

Hauskrecht, M., N. Meuleau, C. Boutilier, L. P. Kaelbling, and T. Dean (1998). Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI98).

Humphrys, M. (1997). Action Selection Methods using Reinforcement Learning. Ph.D. thesis, University of Cambridge.

Kaelbling, L. P. (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann.

Kaelbling, L. P. (1993b). Learning in Embedded Systems. MIT Press.

Kaelbling, L. P., M. L. Littman, and A. W. Moore (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research 4.

Kalmar, Z., C. Szepesvari, and A. Lorincz (1998). Module-based reinforcement learning: Experiments with a real robot. Machine Learning. In press.

Khatib, O. (1986). Real time obstacle avoidance for manipulators and mobile robots. The International Journal of Robotics Research 5(1), 90-98.

Krose, B. J. A. and J. W. M. Van Dam (1993). Learning to avoid collisions: a reinforcement learning paradigm for mobile robot navigation. In Proceedings of the International Symposium on Artificial Intelligence in Real-Time Control (IFAC).

Lin, L. J. (1992). Reinforcement learning with hidden state. In Proceedings of the Second International Conference on Simulation of Adaptive Behavior.

Lin, L. J. (1993). Hierarchical learning of robot skills by reinforcement. In Proceedings of the IEEE International Conference on Neural Networks.

Mahadevan, S. (1994). To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.

Mahadevan, S. (1996). Optimality criteria in reinforcement learning. Presented at the AAAI Fall Symposium on Learning Complex Behaviors in Adaptive Intelligent Systems.

Mahadevan, S. and J. Connell (1992). Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence 55, 311-365.

Mahadevan, S., N. Marchalleck, T. Das, and A. Gosavi (1997). Self-improving factory simulation using continuous-time reinforcement learning. In Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann.

Martin, M. M. (1998). Reinforcement Learning for Embedded Agents Facing Complex Tasks. Ph.D. thesis, Universitat Politecnica de Catalunya.

Mataric, M. J. (1994). Reward functions for accelerated learning. In Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.

McCallum, A. (1996). Reinforcement Learning with Selective Perception and Hidden State. Ph.D. thesis, University of Rochester.

McFarland, D. (1981). Animal Behaviour. Longman.

Meuleau, N. and P. Bourgine (1998). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning. In press.

Millan, J. d. R. (1996). Rapid, safe and incremental learning of navigation strategies. IEEE Transactions on Systems, Man and Cybernetics 26.

Minoux, M. (1986). Mathematical Programming. John Wiley and Sons.

Parr, R. (1998). Flexible decomposition algorithms for weakly coupled Markov decision problems. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI98).

Peng, J. and R. J. Williams (1996). Incremental multi-step Q-learning. Machine Learning 22, 283-290.

Pfeifer, R. (1996). Building "fungus eaters": Design principles of autonomous agents. In Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior (SAB96).

Pfeifer, R. and C. Scheier (1998). Introduction to "New Artificial Intelligence". MIT Press. Book manuscript under review.

Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation 3(1), 88-97.

Prescott, T. J. and J. E. Mayhew (1992). Obstacle avoidance through reinforcement learning. In Advances in Neural Information Processing Systems 4, pp. 523-530. Morgan Kaufmann.

Prescott, T. J., P. Redgrave, and K. Gurney (1999). Layered control architectures in robots and vertebrates. Adaptive Behavior. To appear.

Redgrave, P., T. J. Prescott, and K. Gurney (1999). The basal ganglia: A vertebrate solution to the selection problem? Neuroscience. To appear.

Rummery, G. A. (1995). Problem Solving With Reinforcement Learning. Ph.D. thesis, University of Cambridge.

Rummery, G. A. and M. Niranjan (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University.

Santamaria, J. C., R. S. Sutton, and A. Ram (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior 6(2), 163-217.

Simmons, R., R. Goodwin, K. Z. Haigh, S. Koenig, and J. O'Sullivan (1997). A modular architecture for office delivery robots. In Proceedings of the First International Conference on Autonomous Agents. ACM Press.

Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning 8(3/4), 323-339.

Singh, S. P. and D. Bertsekas (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems. MIT Press.

Singh, S. P. and R. S. Sutton (1996). Reinforcement learning with replacing eligibility traces. Machine Learning 22, 123-158.

Stephens, D. W. and J. R. Krebs (1986). Foraging Theory. Princeton University Press.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3, 9-44.

Sutton, R. S. and A. G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., D. Precup, S. Singh, and B. Ravindran (1998). Improved switching among temporally abstract actions. In Advances in Neural Information Processing Systems 11. MIT Press.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM 38, 58-68.

Tham, C. L. (1995). Reinforcement learning of multiple tasks using a hierarchical CMAC architecture. Robotics and Autonomous Systems 15(4), 247-274.

Thrun, S. (1992). Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University.

Tyrell, T. (1993). The use of hierarchies for action selection. In Proceedings of the Second International Conference on the Simulation of Adaptive Behaviour (SAB92).

Watkins, C. (1989). Learning from Delayed Rewards. Ph.D. thesis, University of Cambridge.

Weber, R. R. and G. Weiss (1989). On an index policy for restless bandits. Journal of Applied Probability 27.

Whitehead, S., J. Karlsson, and J. Tenenberg (1993). Learning Multiple Goal Behavior via Task Decomposition and Dynamic Policy Merging. Kluwer Academic Publishers.

Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability 25.

Wiering, M. and J. Schmidhuber (1998). Fast online Q(λ). Machine Learning. In press.

Wilson, S. W. (1996). Explore/exploit strategies in autonomous learning. In P. Maes, J.-A. Meyer, and S. W. Wilson (Eds.), From Animals to Animats 4: Proceedings of the Fourth International Conference on the Simulation of Adaptive Behaviour. MIT Press.

Wyatt, J. (1997). Exploration and Inference in Learning from Reinforcement. Ph.D. thesis, University of Edinburgh.

Wyatt, J., J. Hoar, and G. Hayes (1998). Design, analysis and comparison of robot learners. Special issue on Scientific Methods in Mobile Robotics: The New Wave, Robotics and Autonomous Systems 24(1-2).

Zhang, W. and T. G. Dietterich (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann.
