
2004 IEEE International Conference on Systems, Man and Cybernetics

Successful Cooperation between Heterogeneous Fuzzy Q-Learning Agents
Ali Akhavan Bitaghsir, Amir Moghimi, Mohsen Lesani, Mohammad Mehdi Keramati,
Majid Nili Ahmadabadi, Babak Nadjar Arabi
Electrical and Computer Engineering Dep., Faculty of Engineering, University of Tehran, Iran

Abstract - Cooperation in learning improves the speed of convergence and the quality of learning. Special treatment is needed when heterogeneous agents cooperate in learning. It has been discussed that cooperation in learning may cause the learning process not to converge if heterogeneity is not handled properly. In this paper, it is assumed that two (or several) heterogeneous Q-learning agents cooperate to learn. The two hunter agents independently pursue a prey agent on a two-dimensional lattice; however, the hunters' visual-field depths are different. Thus, in order to have successful cooperation, the agents should be able to interpret other agents' Q-tables. For this purpose, an algorithm has been proposed and implemented on the pursuit problem. Two case studies have been introduced and simulated to show the effectiveness of the proposed algorithm.

Keywords: Fuzzy Q-Learning, cooperative learning, heterogeneous agents, multi-agent systems.

1 Introduction and Related Work

Agents can communicate knowledge between each other and take advice or training commands from peer agents, expert agents, or even agents with less expertness [?]. Because of having more knowledge acquisition resources in multi-agent systems, cooperation in learning can result in higher performance compared to individual learning. Researchers have shown that improvements in learning occur when using cooperative learning [?] [?]. In the field of multi-agent cooperative learning, there are few works on heterogeneous agents. Lesser et al. [?] [?] [?] have used the negotiation method to resolve the conflicts arising between some heterogeneous agents in a steam condenser. They learn to change their organizational roles by negotiating about the problem-solving situation and relaxing some of their soft constraints. In [?], sharing meta-information is used to guide solving the steam condenser problem in a distributed search space. ILS [?] is a distributed system of heterogeneous agents that learns how to control a telecommunication network. The proposal made by each agent is given to TLC (The Learning Controller). Then, TLC chooses one of these proposals and performs actions based on it. Afterwards, the agents see the effects of the performed proposal and learn accordingly. [?] introduces an extension of Tan's paper [?], however, for heterogeneous agents; the difference in the learning rates of the Q-learning agents is the cause of heterogeneity between them. The authors showed that the complementarity property of their agents made the multi-agent learning process more efficient and robust. ANIMALS [?] is based on a number of independent but cooperating agents, each of whose task is managing a traditional machine learning algorithm (ID3 and LPE). Distributed learning occurs when agents communicate requests for theories or facts from other agents. Tan showed that sharing episodes with an expert agent could improve the group learning significantly [?]. In [?], the state, action, and value pairs are communicated among the agents. No measure is used to evaluate the received rules by the learners; however, in heterogeneous cooperative learning, the learner agents need a proper mechanism to interpret and evaluate other agents' experiences for their own use.

In this paper, we look at the heterogeneous cooperative-learning agents problem from another point of view. Here, our agents are fuzzy Q-learners and they have the same actions. They also use the same fuzzy sets as their Q-table states; however, heterogeneity is in their (actual) perceptual state space. The developed algorithm allows the agents to successfully cooperate in learning.

2 Cooperative Learning in Heterogeneous Fuzzy Q-Learning Agents

2.1 Fuzzy Discrete Action-Space Q-Learning

2.1.1 Fuzzy if-then Rule

In this study, each agent uses a one-step fuzzy Q-learning algorithm. Although we introduce a specific fuzzy Q-learning algorithm, the method can be applied to other forms of FQL as well. We introduce a fuzzy Q-learning algorithm like the one addressed in [?], but modified for a discrete action-space. Let us assume that the state space in the problem domain is described by an n-dimensional vector x = (x_1, x_2, ..., x_n). Also, suppose that there exist m different discrete actions in the action-space {a_1, a_2, ..., a_m}. We use fuzzy if-then rules of the following type:

    R_j : If x_1 is S_{j1} and ... and x_n is S_{jn}
          then Q_j = (Q_{j1}, Q_{j2}, ..., Q_{jm}),    j = 1, ..., N,

where S_{ji} for 1 ≤ i ≤ n is a fuzzy set for a state variable, Q_j is the consequent real vector of fuzzy if-then rule R_j, and N is the number of fuzzy if-then rules. Assume that Q_j is the jth row of the Q-table and Q_{ji} corresponds to the Q-value of action a_i in the state corresponding to rule R_j.

2.1.2 Action Selection

When the learning agent receives a state vector x, the overall weight of each discrete action (a_r) in the action-space is calculated through fuzzy inference as follows:

    W_{a_r}(x) = \sum_{j=1}^{N} \mu_j(x) Q_{jr} / \sum_{j=1}^{N} \mu_j(x)    (1)

where \mu_j(x) is the compatibility of a state vector x with the rule R_j. For selecting the agent's final output action, we use the Boltzmann selection scheme. Thus, the probability of selecting action a_r is:

    P(a_r | x) = exp(W_{a_r}(x) / T) / \sum_{k=1}^{m} exp(W_{a_k}(x) / T)    (2)

where T indicates the temperature.
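
To make the selection step concrete, the following Python sketch puts Eqs. (1) and (2) together. It is an illustration, not the authors' implementation: the triangular membership functions, their width, the product used as the and-operator for rule compatibility, and all numeric values are assumptions of the example.

    import numpy as np

    def triangular(x, center, width):
        # Triangular membership function centered at `center` (illustrative choice).
        return max(0.0, 1.0 - abs(x - center) / width)

    def rule_compatibility(state, centers, width=2.0):
        # mu_j(x): per-rule compatibility, here the product of per-dimension memberships.
        return np.array([np.prod([triangular(xi, ci, width)
                                  for xi, ci in zip(state, c)]) for c in centers])

    def action_weights(state, centers, q_table):
        # Eq. (1): compatibility-weighted average of the rules' Q-vectors.
        mu = rule_compatibility(state, centers)          # shape (N,)
        return mu @ q_table / (mu.sum() + 1e-12)         # shape (m,)

    def boltzmann_select(weights, temperature, rng):
        # Eq. (2): Boltzmann (soft-max) action selection with temperature T.
        p = np.exp(weights / temperature)
        p /= p.sum()
        return rng.choice(len(weights), p=p)

    # Example: 25 fuzzy states (5 x 5 label centers) and 5 actions, as in the simulations.
    rng = np.random.default_rng(0)
    label_values = [-4, -2, 0, 2, 4]
    centers = [(cx, cy) for cx in label_values for cy in label_values]
    q_table = np.zeros((len(centers), 5))                # rows: rules, columns: actions
    action = boltzmann_select(action_weights((1.0, -3.0), centers, q_table), 0.3, rng)
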
2.1.3 Updating Q-values

Assume that reward r is given to the learning agent after performing the selected action. The Q-table values corresponding to each fuzzy if-then rule are updated by:

    Q_{jp}^{new} = (1 - \alpha_j) Q_{jp}^{old} + \alpha_j (r + \gamma V(x'))    (3)

where p is the index of the performed action, \gamma is a discounting factor, and \alpha_j is an adaptive learning rate defined by Eq. (4) in terms of a positive constant n. Also, V(x') is the maximum value among the action weights W_{a_k}(x') (1 ≤ k ≤ m) in the new state vector x' resulting from performing the selected action, where the W_{a_k}(x') values are computed again from Eq. (1) in state vector x'.
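
A sketch of the update step, in the same style as above, follows. Because Eq. (4) is not reproduced here, the adaptive learning rate is assumed, for illustration only, to scale a base rate 1/n by each rule's compatibility \mu_j(x); gamma and n are free parameters of the sketch rather than values taken from the paper.

    import numpy as np

    def fuzzy_q_update(q_table, mu, action, reward, next_weights, gamma=0.9, n=10.0):
        """One-step fuzzy Q-learning update in the spirit of Eq. (3), applied rule by rule.

        q_table      : (N, m) array of rule-wise Q-vectors.
        mu           : (N,) compatibilities mu_j(x) of the visited state x.
        action       : index p of the action that was performed.
        reward       : scalar reward r.
        next_weights : action weights W_{a_k}(x') from Eq. (1) in the next state x'.
        The rate alpha_j = mu_j(x) / n is an assumption standing in for Eq. (4),
        of which only the positive constant n is recoverable from the text.
        """
        v_next = next_weights.max()            # V(x') = max_k W_{a_k}(x')
        alpha = mu / n                         # assumed adaptive learning rate per rule
        target = reward + gamma * v_next
        q_table[:, action] = (1.0 - alpha) * q_table[:, action] + alpha * target
        return q_table
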
2.2 Cooperative Learning Algorithm

The reinforcement learning agents act in two modes: an individual and a cooperative learning mode. At first, all of the agents are in the individual learning mode. After executing some trials, the learner agent(s) switches to the cooperative learning mode to properly acquire the other agents' Q-policies into its own Q-table. The learner agent then goes back to the individual learning mode, and at regular trial intervals it switches to the cooperative learning mode for one trial. Each learning trial starts from a random state and ends when the agent reaches the goal.

In the individual learning mode, the agent uses the fuzzy Q-learning method introduced in the previous subsection. However, in the cooperative learning mode, the learner agent uses a weighted average of the other agents' Q-table values. The learner agent assigns a relative utility weight (U) to each fuzzy state (Q-table row) of the other heterogeneous agents, with respect to that fuzzy state's utility for the teacher itself, and averages it with its own Q-table values. Thus, the ith row's values (rows indicate states and columns indicate actions) of the resulting Q-table for the learner agent A_l will be:

    Q_l^{new}(i, .) = \sum_{j=1}^{L} U(l, j, i) Q_j(i, .) / \sum_{j=1}^{L} U(l, j, i)    (5)

where L is the number of agents and U(l, j, i) is the utility weight of agent j in relation to agent l for fuzzy state i (the condition of rule R_i). As mentioned earlier, suppose that the agents have different perceptual state spaces (in our simulation, the hunters have different visual-field depths). Let us call the intersection of agent l's and agent j's perceptual spaces I_{lj}; I_{lj} can be a continuous or discrete state space. In order to assign an appropriate value to the utility function U, we define U(l, j, i) to be the maximum compatibility of I_{lj}'s members with rule R_i:

    U(l, j, i) = max_{x in I_{lj}} \mu_i(x)    (6)

In other words, the more the common perceptual space is compatible with state i, the more utility is assigned for these two agents.
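
The cooperative step of Eqs. (5) and (6) can be sketched as follows. This is one possible reading, not the authors' code: the utility of a teacher's fuzzy state is taken as the maximum compatibility of that state over the intersection of the two perceptual spaces, and each row of the learner's Q-table becomes the utility-weighted average of all agents' rows. The membership functions, the unit self-weight of the learner, and the example geometry (visual depths 5 and 2) follow the simulation setup below but are otherwise assumptions.

    import numpy as np

    def triangular(x, center, width=2.0):
        return max(0.0, 1.0 - abs(x - center) / width)

    def state_compatibility(point, center):
        # mu_i(x): compatibility of a perceptual point with fuzzy state i (product and-operator assumed).
        return np.prod([triangular(p, c) for p, c in zip(point, center)])

    def utility_weights(centers, intersection_points):
        # Eq. (6): U(l, j, i) = max over the common perceptual space I_lj of mu_i(x).
        return np.array([max(state_compatibility(pt, c) for pt in intersection_points)
                         for c in centers])

    def utility_weighted_average(q_tables, utilities):
        """Eq. (5): row-wise utility-weighted average of all agents' Q-tables.

        q_tables  : list of (N, m) arrays, learner first.
        utilities : list of (N,) arrays, U(l, j, .) for each agent j (learner included).
        """
        num = sum(u[:, None] * q for u, q in zip(utilities, q_tables))
        den = sum(utilities)[:, None] + 1e-12
        return num / den

    # Example: two hunters with visual depths 5 and 2; the common space is the block |x_r|, |y_r| <= 2.
    label_values = [-4, -2, 0, 2, 4]
    centers = [(cx, cy) for cx in label_values for cy in label_values]
    intersection = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
    U_teacher = utility_weights(centers, intersection)   # weights for the teacher's rows
    U_self = np.ones(len(centers))                       # learner keeps full weight on its own rows (assumption)
    q_learner, q_teacher = np.zeros((25, 5)), np.ones((25, 5))
    q_new = utility_weighted_average([q_learner, q_teacher], [U_self, U_teacher])
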
3 Simulation Results

Two case studies have been discussed for acquiring experimental results confirming the proposed method's effectiveness. The tasks considered in this study involve hunter agents seeking to capture randomly-moving prey agents in a 10 by 10 grid world. On each time step, each agent has five possible actions to choose from: moving up, down, left, right, or stopping. A prey is captured when it occupies the same cell as a hunter. Upon capturing, the hunter involved receives a +1 reward, whereas hunters receive a -0.1 reward for each unsuccessful movement. Each hunter has a limited visual field inside which it can locate the prey accurately. Each hunter's perception is represented by (x_r, y_r), where x_r (y_r) is the relative distance of the prey to the hunter along its x (y) axis. We use two hunters, one with a visual-field depth of 5 and the other with 2. So, for example, the first hunter's perceptual state space is {(x_r, y_r) : -5 ≤ x_r, y_r ≤ 5}, and their intersection of perceptual state spaces will be described as {(x_r, y_r) : -2 ≤ x_r, y_r ≤ 2}. Each of the x and y axes has five uniform triangular membership functions. The linguistic labels' (listed below) corresponding values are -4, -2, 0, 2, and 4, respectively, for both axes. Thus, the location of the prey may be considered in more than one state. Also, the hunter Q-table's fuzzy state representation will be of type (S_x, S_y), where S_x can be a fuzzy linguistic label from {LEFTMOST, LEFT, MIDDLE, RIGHT, RIGHTMOST} and S_y from {DOWNMOST, DOWN, MIDDLE, UP, UPMOST}.
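
For completeness, a compact sketch of the pursuit task described above: a 10 by 10 grid, five actions, a +1 capture reward, a -0.1 step penalty, and a visual field limited by the hunter's depth. The paper does not specify how the grid borders are handled or how an out-of-view prey is encoded, so the clipping at the walls and the None perception below are assumptions of this sketch.

    import numpy as np

    ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0), 4: (0, 0)}  # up, down, left, right, stop

    class PursuitWorld:
        """Minimal 10x10 pursuit task with one hunter and one randomly moving prey."""

        def __init__(self, visual_depth, size=10, seed=0):
            self.size, self.depth = size, visual_depth
            self.rng = np.random.default_rng(seed)
            self.reset()

        def reset(self):
            self.hunter = self.rng.integers(0, self.size, 2)
            self.prey = self.rng.integers(0, self.size, 2)
            return self.perceive()

        def perceive(self):
            # Relative prey position (x_r, y_r); only accurate inside the visual field.
            rel = self.prey - self.hunter
            if np.all(np.abs(rel) <= self.depth):
                return tuple(rel)
            return None   # prey not visible (its encoding is not specified in the paper)

        def step(self, action):
            self.hunter = np.clip(self.hunter + ACTIONS[action], 0, self.size - 1)
            self.prey = np.clip(self.prey + ACTIONS[int(self.rng.integers(0, 5))], 0, self.size - 1)
            captured = bool(np.all(self.hunter == self.prey))
            reward = 1.0 if captured else -0.1
            return self.perceive(), reward, captured
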

3.1 Case Study 1: Regular Frequent Cooperation between Peer Agents

In the first set of simulations, the two hunters learn individually at first; however, at various frequencies the hunter agent with visual depth equal to 5 performs a utility-weighted policy averaging between the other hunter's Q-table and its own one using the proposed method. The performance results when the learner hunter averaged the policies at every 10, 20, 50, and 100 trials show, firstly, that the learning process in all of these cases converged quicker than the independent learning hunter (a benefit of cooperation). The other important result was acquired by simulating the same scenario without the utility-weighted method but with an equal-weighted policy averaging. In other words, we set U(1, 2, i) and U(1, 1, i) equal to 1 for the two hunters (A_1, A_2) and for all 1 ≤ i ≤ N, where N is the number of all fuzzy states (5 * 5 = 25). The performance comparison for learning in independent mode, cooperative mode with utility-weighted averaging, and cooperative mode with equal-weighted averaging is shown in Fig. 1 and Fig. 2. We observed that the utility-weighted policy averaging outperforms the equal-weighted method. Furthermore, the equal-weighted policy averaging method could not converge to a final value, since the learner hunter used the other hunter's Q-table blindly, without respect to each fuzzy state's utility for the other hunter itself.
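
As a small illustration of the equal-weighted baseline: with U(1, 1, i) = U(1, 2, i) = 1 for all fuzzy states, the (assumed normalized) averaging rule of Eq. (5) collapses to the plain arithmetic mean of the two hunters' Q-tables. The snippet below only makes that reduction explicit; the random Q-tables are placeholders, not data from the experiments.

    import numpy as np

    # Equal-weighted baseline of Case Study 1: uniform utility weights for both hunters,
    # so the utility-weighted average of Eq. (5) reduces to the plain mean of the Q-tables.
    N_STATES, N_ACTIONS = 25, 5
    q_hunter1 = np.random.default_rng(1).normal(size=(N_STATES, N_ACTIONS))
    q_hunter2 = np.random.default_rng(2).normal(size=(N_STATES, N_ACTIONS))

    U = np.ones((2, N_STATES))                      # U(1, 1, i) = U(1, 2, i) = 1 for all i
    q_equal = (U[0, :, None] * q_hunter1 + U[1, :, None] * q_hunter2) / U.sum(axis=0)[:, None]

    assert np.allclose(q_equal, (q_hunter1 + q_hunter2) / 2.0)
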
3.2 Case Study 2: Unfrequent Learning from an Expert Agent

In the second case, an expert hunter agent with visual depth equal to 2 has been used for teaching. The other hunter starts in individual learning mode but performs utility/equal-weighted policy averaging at a specific trial (10, 20, 50, 75, 100, or 200) and then switches back into the individual learning mode again, and continues until convergence. It is observed that the utility-weighted policy averaging converges quicker than the individual learning policy. So, it outperforms the independent learning policy in terms of learning speed. The performance results are depicted in Figure 3 when the agents cooperate in learning at trial 75.

4 Conclusion and Future Work

It is shown that careful cooperation in learning can have a crucial effect on the learning process of a team of heterogeneous agents. When the agents are heterogeneous, cooperative learning can be misleading if the agents cannot handle this heterogeneity in some way. Heterogeneous agents with similar fuzzy state spaces and actions but different perceptual state spaces have been considered in this research. A utility function helps the agents interpret and evaluate other agents' fuzzy states in order to perform a successful weighted policy averaging. The results approve the effectiveness of the proposed cooperation algorithm and also show the provided opportunity for cooperative learning in heterogeneous multi-agent systems. Introducing expertness measures [?] in such multi-agent systems to increase the cooperative learning performance is the next step of this research.

References

[1] Majid Nili Ahmadabadi and Masoud Asadpour: Expertness Based Cooperative Q-Learning, IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, Vol. 32, No. 1, pp. 66-76, Feb. 2002.

[2] B. Silver, W. Frawley, G. Iba and J. Vittal: ILS: A System of Learning Distributed Heterogeneous Agents for Network Traffic Management, Proceedings of the International Conference on Communications, 1993.

[3] Ming Tan: Multi-agent Reinforcement Learning: Independent vs. Cooperative Agents, Proceedings of the Tenth International Conference on Machine Learning, 1993.

[4] S. E. Lander and V. R. Lesser: Understanding the Role of Negotiation in Distributed Search among Heterogeneous Agents, Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, August 1993.

[5] Susan E. Lander: Distributed Search and Conflict Management Among Reusable Heterogeneous Agents, PhD Thesis, University of Massachusetts, Amherst, Department of Computer Science, 1994.

[6] M. V. Nagendra Prasad: Learning Situation-Specific Control in Multi-Agent Systems, PhD Thesis, University of Massachusetts, Amherst, Department of Computer Science, 1997.

[7] S. E. Lander and V. R. Lesser: Sharing Meta-information to Guide Cooperative Search among Heterogeneous Reusable Agents, Computer Science Tech. Rep. 94-48, University of Massachusetts, 1994. To appear in IEEE Transactions on Knowledge and Data Engineering, 1996.

[8] I. Kawaishi and S. Yamada: Experimental Comparison of a Heterogeneous Learning Multi-agent System with a Homogeneous One, IEEE International Conference on Systems, Man and Cybernetics, 1996.

[9] Winton H. E. Davies: ANIMALS: A Distributed Heterogeneous Multi-agent Machine Learning System, MS Thesis, University of Aberdeen, Scotland, 1999.

[10] I. D. Kelly: The Development of Shared Experience Learning in a Group of Mobile Robots, PhD Dissertation, Univ. Reading, Dept. Cybern., Reading, U.K., 1997.

[11] Tomoharu Nakashima, Masayo Udo and Hisao Ishibuchi: Implementation of Fuzzy Q-Learning for a Soccer Agent, IEEE International Conference on Fuzzy Systems, 2003.

Figure 1: Performance comparison between Independent and Weighted Utility Cooperative Learning (x-axis: Trial).

Figure 2: Performance comparison between Weighted Utility and Equal Weighted Cooperative Learning, both with Freq. 20 (x-axis: Trial).

Figure 3: Performance comparison between Unfrequent Weighted Utility Cooperative Learning and Independent Learning (x-axis: Trial).

