[Figure: the meta-level reasoning process using the abstract representation. Legend entries include "use detailed scheduler on all tasks".]
An environment named AMM has a combination (A) of simple and complex tasks arriving at a medium frequency (M) and with a medium tightness of deadline (M).
We first tested the effectiveness of meta-level control using the two hand-generated, context-sensitive heuristic strategies, NHS and SHS. The myopic strategy, NHS, uses state-dependent heuristics based on high-level features to decide which deliberative (also called control) decision to make. These heuristics are myopic and do not reason explicitly about the arrival of tasks in the near future. A sample NHS heuristic used to make a decision when a new task arrives would be: "If the new task has high priority and the current schedule has low utility goodness, then the best action is to drop the current schedule and schedule the new task immediately, independent of the schedule's deadline."
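To make the flavor of such a rule concrete, the sketch below encodes this sample heuristic as a simple condition-action mapping over the high-level features; the function, value and action names are hypothetical illustrations, not the actual rule implementation from [5].

```python
# Illustrative sketch of a myopic NHS-style rule; the feature values and
# action names are assumed for illustration (the real rules are in [5], Ch. 4).

def nhs_decide(new_task_priority: str, schedule_utility_goodness: str) -> str:
    """Map high-level state features directly to a control decision."""
    if new_task_priority == "high" and schedule_utility_goodness == "low":
        # Drop the current schedule and schedule the new task immediately,
        # independent of the schedule's deadline.
        return "drop_schedule_and_schedule_new_task"
    # ... remaining state-dependent rules elided ...
    return "maintain_current_schedule"
```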
The non-myopic strategy, SHS, is a set of rules that uses knowledge about environment characteristics to make non-myopic decisions. Knowledge of the task arrival model enables SHS to make decisions that, in the long term, minimize the resources spent on redundant deliberative actions. An example SHS heuristic is: "If the new task has very low utility goodness and a loose deadline, and there is a low probability of a high-priority task arriving in the near future, then the best action is to schedule the new task using simple scheduling."
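A corresponding sketch of a non-myopic SHS-style rule, which additionally consults the task arrival model, follows; again, the names and the probability threshold are assumptions for illustration rather than values from [5].

```python
# Illustrative sketch of a non-myopic SHS-style rule; the arrival-model
# probability and the 0.2 threshold are assumed, not taken from [5].

def shs_decide(utility_goodness: str, deadline_tightness: str,
               p_high_priority_arrival: float) -> str:
    """Use knowledge of the task arrival model to avoid redundant deliberation."""
    if (utility_goodness == "very_low"
            and deadline_tightness == "loose"
            and p_high_priority_arrival < 0.2):
        # No high-priority task is expected soon, so cheap scheduling suffices.
        return "schedule_with_simple_scheduler"
    # ... remaining rules elided ...
    return "schedule_with_detailed_scheduler"
```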
The complete list of NHS and SHS heuristic rules and their descriptions can be found in [5] (Chapter 4). The features presented in Section 2 were selected based on what intuitively made sense for meta-level control. We use the following experiments, based on the heuristic rules, to verify the usefulness of these features. NHS and SHS were compared to baseline approaches that used a random and a deterministic strategy, respectively.
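For contrast, the two baselines can be viewed as degenerate decision functions; a minimal sketch, reusing the action names assumed above:

```python
import random

# The two baseline strategies, sketched with the action names assumed above.
CONTROL_ACTIONS = [
    "drop_schedule_and_schedule_new_task",
    "schedule_with_simple_scheduler",
    "schedule_with_detailed_scheduler",
]

def deterministic_decide(*_features) -> str:
    # Always the same control choice, the expensive detailed-scheduler call,
    # independent of context (no explicit meta-level control).
    return "schedule_with_detailed_scheduler"

def random_decide(*_features) -> str:
    # Cheap to compute, but ignores the agent's state entirely.
    return random.choice(CONTROL_ACTIONS)
```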
A performance comparison of the various strategies in environment AMM, over a number of dimensions and averaged over 300 test episodes, is described in Table 1. Column 1 is the row number; Column 2 describes the various comparison criteria. Row 3 shows the percentage of the total 500 units spent on deliberative/control actions (CT); Row 4 is the percentage of tasks rescheduled (RES); Row 5 is the percent of total tasks completed (PTC); Row 6 is the percent of tasks delayed on arrival (PTDEL).
The results show that the combined utilities of the two agents when using the heuristic strategies are significantly higher than the combined utilities when using the deterministic and random strategies. The utility obtained from using SHS is significantly higher (p < 0.05) than that of NHS, and 14% more tasks are completed using SHS than using NHS. The improved performance of the heuristic strategies is explained by comparisons of the percent of control time over the same set of environments. As described earlier, control actions do not have associated utility of their own. Domain actions produce utility upon successful execution, and control actions serve as facilitators in choosing the best domain actions given the agent's state information. So resources, such as time, spent directly on control actions do not directly produce utility. When excessive amounts of resources are spent on control actions, the agent's utility is reduced, since resources are bounded and are not available for utility-producing domain actions. The trends in other factors, such as the percent of reschedules, tasks completed and tasks delayed, are also affected by the resources spent on control actions.
The heuristic strategies use control activities in a way that optimizes their use of available resources (time in this case). The deterministic strategy has higher control costs because it always makes the same control choice, the expensive call to the detailed scheduler, independent of context (due to the absence of explicit meta-level control). The random strategy has low control costs, but it does not reason about its choices, leading to bad overall performance.
Experimental results describing the behavior of the two interacting agents in three environments, AMM, AML and ALM, averaged over 300 test episodes, are presented in Table 2, where Column 1 is the environment type, Column 2 represents the performance characteristics of the RL policy after 3000 training episodes, and Columns 3 and 4 represent the performance characteristics of SHS and NHS, respectively. Details about the generation of the training episodes can be found in [5] (Chapter 5). In these tests, one agent was fixed to the best policy it was able to learn in a single-agent environment. The other agent then learned its meta-level control policy within these conditions.
Environment   RL-3000   SHS       NHS
AMM-UTIL      118.56    111.44     89.84
AMM-CT          8.86%     9.21%     8.09%
AML-UTIL      211.45    207.88    130.68
AML-CT         22.56%    23.21%    40.89%
ALM-UTIL      136.05    113.83     92.56
ALM-CT         10.21%    13.11%    15.88%

Table 2. Utility (UTIL) and control time (CT) comparisons over the three environments.
The results show that the combined utilities of the two agents when using the RL strategy are as good as those of the SHS strategy, which uses environment-characteristic information in its decision-making process. The ability of the RL algorithm to make non-myopic decisions leads to optimized use of its resources (time); hence it learns policies which significantly outperform NHS (p < 0.05) in these environments and do as well as, if not better than, SHS. There are twelve features in the abstract representation of the state and each feature can have one of (at most) four different values, so the maximum size of the search space is $\prod_{i=1}^{12} |V_i|$, which is about a million states. Of these million states, only about 100 states on average are visited with high frequency for a specific environment. This is an interesting area of future work, where we would like to study the characteristics of the frequently visited states over multiple environments.
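As a rough sketch of how this abstract state space can be represented and profiled, the fragment below indexes states as feature-value tuples and tallies visit frequencies; the per-feature domain sizes are hypothetical, chosen only to respect the at-most-four-values bound, and are not the domains defined in [5].

```python
from collections import Counter
from math import prod

# Hypothetical per-feature domain sizes for the 12 abstract features
# (each feature has at most four values; the true domains are in [5]).
FEATURE_DOMAIN_SIZES = [4, 4, 3, 4, 2, 4, 3, 4, 4, 2, 3, 4]

# Upper bound on the abstract search space: the product of domain sizes.
MAX_STATES = prod(FEATURE_DOMAIN_SIZES)

visit_counts: Counter = Counter()

def record_visit(state: tuple) -> None:
    """Tally a visit to an abstract state (a 12-tuple of feature values)."""
    visit_counts[state] += 1

def frequently_visited(min_visits: int) -> list:
    """Return the small core of states visited with high frequency
    (about 100 per environment in the experiments reported above)."""
    return [s for s, n in visit_counts.items() if n >= min_visits]
```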
5. Discussion

The experimental evaluation leads to the following conclusions: meta-level control reasoning is advantageous in resource-bounded agents in environments that exhibit non-stationarity, action-outcome uncertainty and partial observability; the high-level features are good indicators of the agent state and facilitate effective meta-level control; and the domain independence of these high-level features allows the solution approach to be generalized to domains where tasks have associated deadlines and utilities, there are alternate ways of achieving the tasks, there is uncertainty in execution performance, and detailed scheduling and coordination of tasks may be required.

We describe a reinforcement learning approach which equips agents to automatically learn meta-level control policies. The empirical reinforcement learning algorithm used is a modified version of the algorithm developed by [9] for a spoken dialog system. Both problem domains share the bottleneck of collecting training data. The algorithm optimizes the meta-level control policy based on limited training data. The utility of this approach is demonstrated experimentally by showing that the meta-level control policies that are automatically learned by the agent perform as well as the carefully hand-generated heuristic policies.
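One generic way to realize such an approach is certainty-equivalence learning: estimate an MDP over the abstract states from the limited batch of training episodes and then solve it by value iteration, in the spirit of [9]. The sketch below is a minimal illustration under that assumption, not the authors' exact algorithm.

```python
from collections import defaultdict

def learn_policy(transitions, actions, gamma=0.95, sweeps=200):
    """Certainty-equivalence sketch: estimate an MDP from a limited batch of
    (state, action, reward, next_state) tuples, then run value iteration."""
    n_sa = defaultdict(int)       # visit counts for (s, a)
    n_sas = defaultdict(int)      # visit counts for (s, a, s')
    r_sum = defaultdict(float)    # summed rewards for (s, a)
    states = set()
    for s, a, r, s2 in transitions:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s2)] += 1
        r_sum[(s, a)] += r
        states.update((s, s2))

    V = {s: 0.0 for s in states}

    def q(s, a):
        if n_sa[(s, a)] == 0:     # unseen pair: neutral estimate
            return 0.0
        r_hat = r_sum[(s, a)] / n_sa[(s, a)]
        expected_next = sum(
            (n_sas[(s, a, s2)] / n_sa[(s, a)]) * V[s2] for s2 in states)
        return r_hat + gamma * expected_next

    for _ in range(sweeps):       # value iteration on the estimated model
        V = {s: max(q(s, a) for a in actions) for s in states}

    # Greedy policy with respect to the converged value estimates.
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```

Under this reading, the limited-data bottleneck shows up as sparse counts in n_sa, which is consistent with the observation above that only a small core of frequently visited abstract states ends up with reliable estimates.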
References

[1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[2] M. Boddy and T. Dean. Decision-theoretic deliberation scheduling for problem solving in time-constrained environments, 1994.
[3] E. A. Hansen and S. Zilberstein. Monitoring anytime algorithms. SIGART Bulletin, 7(2):28–33, 1996.
[4] E. J. Horvitz. Reasoning under varying and uncertain resource constraints. In National Conference on Artificial Intelligence of the American Association for AI (AAAI-88), pages 111–116, 1988.
[5] A. Raja. Meta-Level Control in Multi-Agent Systems. PhD thesis, Computer Science Department, University of Massachusetts at Amherst, September 2003.
[6] A. Raja and V. Lesser. Reasoning about coordination costs in resource-bounded multi-agent systems. In Proceedings of the AAAI 2004 Spring Symposium on Bridging the Multiagent and Multirobotic Research Gap, pages 35–40, Stanford, CA, March 2004.
[7] S. Russell and E. Wefald. Principles of metareasoning. In Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, pages 400–411, 1989.
[8] S. J. Russell, D. Subramanian, and R. Parr. Provably bounded optimal agents. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI-93), pages 338–344, 1993.
[9] S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker. Empirical evaluation of a reinforcement learning spoken dialogue system. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 645–651, 2000.
[10] R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1998.
[11] T. Wagner, A. Garvey, and V. Lesser. Criteria-directed heuristic task scheduling. International Journal of Approximate Reasoning, Special Issue on Scheduling, 19(1–2):91–118, 1998. A version is also available as UMass CS TR-97-59.