International Journal of
Computer Science &
Applications
Volume 4 Issue 2
July 2007
Special Issue on
Communications, Interactions and
Interoperability in Information Systems
Editor-in-Chief
Rajendra Akerkar
ADVISORY EDITOR
Douglas Comer, Department of Computer Science, Purdue University, USA

EDITOR-IN-CHIEF
Rajendra Akerkar, Technomathematics Research Foundation, 204/17 KH, New Shahupuri, Kolhapur 416001, INDIA

MANAGING EDITOR
David Camacho, Universidad Carlos III de Madrid, Spain

ASSOCIATE EDITORS
Ngoc Thanh Nguyen, Wroclaw University of Technology, Poland
Pawan Lingras, Saint Mary's University, Halifax, Nova Scotia, Canada
COUNCIL OF EDITORS
Stuart Aitken, University of Edinburgh, UK
Tetsuo Asano, JAIST, Japan
Costin Badica, University of Craiova, Craiova, Romania
J. F. Baldwin, University of Bristol, UK
Pavel Brazdil, LIACC/FEP, University of Porto, Portugal
Ivan Bruha, McMaster University, Canada
Jacques Calmet, Universität Karlsruhe, Germany
Narendra S. Chaudhari, Nanyang Technological University, Singapore
Walter Daelemans, University of Antwerp, Belgium
K. V. Dinesha, IIIT, Bangalore, India
David Hung-Chang Du, University of Minnesota, USA
Hai-Bin Duan, Beihang University, P. R. China
Yakov I. Fet, Russian Academy of Sciences, Russia
Maria Ganzha, Gizycko Private Higher Educational Institute, Gizycko, Poland
S. K. Gupta, IIT, New Delhi, India
Henry Hexmoor, University of Arkansas, Fayetteville, USA
Ray Jarvis, Monash University, Victoria, Australia
Peter Kacsuk, MTA SZTAKI Research Institute, Budapest, Hungary
Wolfram-Manfred Lippe, University of Muenster, Germany
Huan Liu, Arizona State University, USA
Pericles Loucopoulos, UMIST, Manchester, UK
Lorraine McGinty, University College Dublin, Belfield, Ireland
C. R. Muthukrishnan, Indian Institute of Technology, Chennai, India
Marcin Paprzycki, SWPS and IBS PAN, Warsaw, Poland
Lalit M. Patnaik, Indian Institute of Science, Bangalore, India
Dana Petcu, Western University of Timisoara, Romania
Shahram Rahimi, Southern Illinois University, Illinois, USA
Sugata Sanyal, Tata Institute of Fundamental Research, Mumbai, India
Dharmendra Sharma, University of Canberra, Australia
Ion O. Stamatescu, FEST, Heidelberg, Germany
José M. Valls Ferrán, Universidad Carlos III, Spain
Rajeev Wankar, University of Hyderabad, Hyderabad, India
Krzysztof Wecel, The Poznan University of Economics, Poland
Editorial Office: Technomathematics Research Foundation, 204/17 Kh, New Shahupuri, Kolhapur 416001, India.
E-mail: editor@tmrfindia.org
Copyright 2007 by Technomathematics Research Foundation
All rights reserved. This journal issue or parts thereof may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system now known or to be invented, without written permission from the copyright owner.
Permission to quote from this journal is granted provided that the customary acknowledgement is given to the source.
International Journal of Computer Science & Applications (ISSN 0972-9038) is a high-quality electronic journal published six-monthly by Technomathematics Research Foundation, Kolhapur, India. The IJCSA website is http://www.tmrfindia.org/ijcsa.html
Contents

Editorial (v)
(1-21)
(23-41)
3. Adaptability of Methods for Processing XML Data using Relational Databases: the State of the Art and Open Problems (43-62)
(75-91)
6. What Enterprise Architecture and Enterprise Systems Usage Can and Cannot Tell about Each Other (93-109)
(111-123)
(125-144)
(145-163)
Editorial
The First International Conference on Research Challenges in Information Science (RCIS) aimed at providing an international forum for scientists, researchers, engineers and developers from a wide range of information science areas to exchange ideas and approaches in this evolving field. While presenting research findings and state-of-the-art solutions, authors were especially invited to share experiences on new research challenges. High-quality papers in all information science areas were solicited, and original papers exploring research challenges received especially careful attention from reviewers. Papers that had already been accepted or were currently under review for other conferences or journals were not considered for publication at RCIS07. Of the 103 papers submitted, 31 were accepted; they are published in the RCIS07 proceedings.
This special issue of the International Journal of Computer Science & Applications, dedicated to Communications, Interactions and Interoperability in Information Systems, presents the 9 papers that obtained the highest marks in the reviewing process; they appear here in extended versions. In the context of RCIS07, they are linked to Information System Modelling and Intelligent Agents, Description Logics, Ontologies, and XML-based techniques.
Colette Rolland
Oscar Pastor
Jean-Louis Cavarero
Abstract
In this paper, we propose a new quantitative trust model for argumentation-based
negotiating agents. The purpose of such a model is to provide a secure environment for
agent negotiation within multi-agent systems. The problem of securing agent negotiation
in a distributed setting is core to a number of applications, particularly the emerging
semantic grid computing-based applications such as e-business. Current approaches to trust fail to adequately address the challenges for trust in these emerging applications. These approaches either rely on centralized mechanisms such as digital certificates, and are thus particularly vulnerable to attacks, or are not suitable for argumentation-based negotiation, in which agents use arguments to reason about trust.
Key words: Intelligent Agents, Negotiating Agents, Security, Trust.
1 Introduction
Research in agent communication protocols has received much attention in recent years. In multi-agent systems (MAS), protocols are means of achieving meaningful interactions between autonomous software agents. Agents use these protocols to guide
their interactions with each other. Such protocols describe the allowed communicative
acts that agents can perform when conversing and specify the rules governing a dialogue
between these agents.
Protocols for multi-agent interaction need to be flexible because of the open and dynamic nature of MAS. Traditionally, these protocols are specified as finite state machines or Petri nets without taking the agents' autonomy into account. Therefore, they are not flexible enough to be used by agents expected to be autonomous in open MAS [16]. This is due to the fact that agents must respect the whole protocol specification from the beginning to the end without reasoning about it. To solve this problem, several researchers have recently proposed protocols using dialogue games [6, 11, 15, 17]. Dialogue games are interactions between players, in which each player moves by performing utterances according to a pre-defined set of rules. The flexibility is
achieved by combining different small games to construct complete and more complex
protocols. This combination can be specified using logical rules about which agents can
reason [11].
The idea of these logic-based dialogue game protocols is to enable agents to
effectively and flexibly participate in various interactions with each other. One such type
of interaction that is gaining increasing prominence in the agent community is
negotiation. Negotiation is a form of interaction in which a group of agents, with
conflicting interests, but a desire to cooperate, try to come to a mutually acceptable
agreement on the division of scarce resources. A particularly challenging problem in this
context is security. The problem of securing agent negotiation in a distributed setting is
core to a number of applications, particularly the emerging semantic grid computing-based applications such as e-science (science that is enabled by the use of distributed
computing resources by end-user scientists) and e-business [9, 10].
The objective of this paper is to address this challenging issue by proposing a new
quantitative, probabilistic-based model to trust negotiating agents, which is efficient, in
terms of computational complexity. The idea is that in order to share resources and
allow mutual access, involved agents in e-infrastructures need to establish a framework
of trust that establishes what they each expect of the other. Such a framework must
allow one entity to assume that a second entity will behave exactly as the first entity
expects. Current approaches to trust fail to adequately address the challenges of these emerging e-computing applications. These approaches mostly rely on centralized mechanisms such as digital certificates, and thus are particularly vulnerable to attacks: if some authorities that are trusted implicitly are compromised, there is no other check in the system. By contrast, in the decentralized approach we propose in this paper, where the principals maintain trust in each other for more reasons than a single certificate, an invader can cause only limited harm before being detected.
Recently, some decentralized trust models have been proposed [2, 3, 4, 7, 13, 19] (see
[18] for a survey). However, these models are not suitable for argumentation-based
negotiation, in which agents use their argumentation abilities as a reasoning mechanism.
In addition, some of these models do not consider the case where false information is
collected from other partners. This paper aims at overcoming these limits.
The rest of this paper is organized as follows. In Section 2, we present the negotiation
framework. In Section 3, we present our trustworthiness model. We highlight its
formulation, algorithmic description, and computational complexity. In Section 4, we
describe and discuss implementation issues. In Section 5, we compare our framework to related work, and in Section 6, we conclude.
2 Negotiation Framework
In this section, we briefly present the dialogue game-based framework for negotiating
agents [11, 12]. These agents have a BDI architecture (Beliefs, Desires, and Intentions)
augmented with argumentation and logical and social reasoning. The architecture is
composed of three models: the mental model, the social model, and the reasoning model.
The mental model includes beliefs, desires, goals, etc. The social model captures social
concepts such as conventions, roles, etc. Social commitments made by agents when
negotiating are a significant component of this model because they reflect mental states.
Thus, agents must use their reasoning capabilities to reason about their mental states
before creating social commitments. The agent's reasoning capabilities are represented
by the reasoning model using an argumentation system. Agents also have general
knowledge, such as knowledge about the conversation subject. This architecture has the
advantage of taking into account the three important aspects of agent communication:
mental, social, and reasoning. It is motivated by the fact that conversation is a cognitive
and social activity, which requires a mechanism making it possible to reason about
mental states, about what other agents say (public aspects), and about the social aspects
(conventions, standards, obligations, etc.).
The main idea of our negotiation framework is that agents use their argumentation
abilities in order to justify their negotiation stances, or to influence other agents' negotiation stances, taking the interacting preferences and utilities into account. Argumentation can
be abstractly defined as a dialectical process for the interaction of different arguments
for and against some conclusion. Our negotiation dialogue games are based on formal
dialectics in which arguments are used as a way of expressing decision-making [8, 14].
Generally, argumentation can help multiple agents to interact rationally, by giving and
receiving reasons for conclusions and decisions, within an enriching dialectical process
that aims at reaching mutually agreeable joint decisions. During negotiation, agents can
establish a common knowledge of each other's commitments, find compromises, and
persuade each other to make commitments. In contrast to traditional approaches to
negotiation that are based on numerical values, argument-based negotiation is based on
logic.
An argumentation system is simply a set of arguments together with a binary relation representing the attack relation between the arguments. The following definitions describe these notions formally. Here Σ indicates a possibly inconsistent knowledge base, ⊢ stands for classical inference, and ≡ for logical equivalence.

Definition 1 (Argument). An argument is a pair (H, h) where h is a formula of a logical language and H a subset of Σ such that: i) H is consistent, ii) H ⊢ h, and iii) H is minimal, i.e., no proper subset of H satisfies both i and ii. H is called the support of the argument and h its conclusion.

Definition 2 (Attack Relation). Let (H1, h1) and (H2, h2) be two arguments. (H1, h1) attacks (H2, h2) iff h1 ≡ ¬h2.
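As an illustration of Definitions 1 and 2, the following sketch (not the paper's implementation) restricts formulas to propositional literals, so that logical equivalence with a negation reduces to a syntactic check:

```python
# Illustrative sketch of Definitions 1 and 2 for propositional literals only.
# A full implementation would need a theorem prover for |- and equivalence;
# all names here are assumptions, not the paper's code.

def negate(formula: str) -> str:
    """Syntactic negation: 'p' <-> '~p'."""
    return formula[1:] if formula.startswith("~") else "~" + formula

class Argument:
    def __init__(self, support: frozenset, conclusion: str):
        self.support = support        # H: minimal consistent subset of the knowledge base
        self.conclusion = conclusion  # h: the formula that H entails

def attacks(a1: Argument, a2: Argument) -> bool:
    """(H1, h1) attacks (H2, h2) iff h1 is equivalent to the negation of h2."""
    return a1.conclusion == negate(a2.conclusion)

a = Argument(frozenset({"p"}), "p")
b = Argument(frozenset({"q", "q->~p"}), "~p")
print(attacks(b, a))  # True: ~p rebuts p
print(attacks(a, b))  # True: direct rebuttals attack each other symmetrically
```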
Negotiation dialogue games are specified using a set of logical rules. The allowed
communicative acts are: Make-Offer, Make-Counter-Offer, Accept, Refuse, Challenge,
Inform, Justify, and Attack. For example, according to a logical rule, before making an
offer h, the speaker agent must use its argumentation system to build an argument (H,
h). The idea is to be able to persuade the addressee agent of h if it decides to refuse the offer. The addressee agent, in turn, must use its own argumentation system to select the answer it will give (Make-Counter-Offer, Accept, etc.).
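The logical rule gating Make-Offer can be sketched as follows. The knowledge-base representation (facts plus simple premise/conclusion rules) and all function names are illustrative assumptions; the paper's argumentation system is richer than this deductive-closure check:

```python
# Hypothetical sketch: an agent may only perform Make-Offer(h) if it can
# build a supporting argument (H, h), here approximated by h being in the
# deductive closure of its facts under simple (premises, conclusion) rules.

def closure(facts: set, rules: list) -> set:
    """Forward-chain until no new conclusions can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

def can_make_offer(h: str, facts: set, rules: list) -> bool:
    """The communicative act Make-Offer(h) is allowed only if h is derivable."""
    return h in closure(facts, rules)

kb_facts = {"low_price", "fast_delivery"}
kb_rules = [(("low_price", "fast_delivery"), "good_deal")]
print(can_make_offer("good_deal", kb_facts, kb_rules))     # True
print(can_make_offer("free_shipping", kb_facts, kb_rules)) # False
```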
3 Trustworthiness Model

3.1 Formulation

Let A be the set of agents. We define an agent's trustworthiness in a distributed setting as a probability function as follows:

TRUST(Agb)^Aga = (Nb_Arg(Agb)^Aga + Nb_C(Agb)^Aga) / (T_Nb_Arg(Agb)^Aga + T_Nb_C(Agb)^Aga)    (1)

where Nb_Arg(Agb)^Aga and Nb_C(Agb)^Aga are the numbers of Agb's arguments accepted by Aga and of Agb's commitments satisfied towards Aga, and T_Nb_Arg(Agb)^Aga and T_Nb_C(Agb)^Aga are the corresponding totals.
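Under one reading of Equation 1 (direct trust as the fraction of Agb's accepted arguments and satisfied commitments among the totals), the computation is a simple ratio. The function name and arguments below are illustrative:

```python
def direct_trust(nb_arg: int, nb_c: int, t_nb_arg: int, t_nb_c: int) -> float:
    """Aga's direct trust in Agb: accepted arguments plus satisfied
    commitments, over the total arguments plus total commitments."""
    total = t_nb_arg + t_nb_c
    return (nb_arg + nb_c) / total if total else 0.0

# 8 of 10 arguments accepted, 6 of 10 commitments satisfied:
print(direct_trust(nb_arg=8, nb_c=6, t_nb_arg=10, t_nb_c=10))  # 0.7
```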
Fig. 1. The problem of measuring Agb's trustworthiness by Aga: Aga knows Trust(Ag1), Trust(Ag2), and Trust(Ag3), and each of these agents holds a value Trust(Agb); Aga must combine them to answer Trust(Agb) = ?
TRUST(Agb)^Aga = E(X)    (2)

where E(X) is the expectation of the random variable X and p is the probability that the agent is trustworthy. Thus, p is the probability that we seek, and since X is a Bernoulli variable, E(X) = p. Therefore, it is enough to evaluate the expectation E(X) to find TRUST(Agb)^Aga. However, this expectation is a
theoretical mean that we must estimate. To this end, we can use the Central Limit Theorem (CLT) and the law of large numbers. The CLT states that whenever a random sample of size n (X1, ..., Xn) is taken from any distribution with mean μ, the sample mean (X1 + ... + Xn)/n will be approximately normally distributed with mean μ. As an application of this theorem, the arithmetic mean (average) (X1 + ... + Xn)/n approaches a normal distribution with mean μ (the expectation) and standard deviation σ/√n. Generally, according to the law of large numbers, the expectation can be estimated by the weighted arithmetic mean.
Our random variable X is the weighted average of n independent random variables Xi that correspond to Agb's trustworthiness from the points of view of the confidence agents Agi. These random variables follow the same distribution: the Bernoulli distribution. They are also independent because the probability that Agb is trustworthy according to one agent is independent of the probability that Agb is trustworthy according to another agent. Consequently, the random variable X follows a normal distribution whose mean is the weighted average of the expectations of the independent random variables Xi. The mathematical estimation of the expectation E(X) is given by Equation 3.
M0 = Σ_{i=1}^{n} TRUST(Agi)^Aga · TRUST(Agb)^Agi · N(Agi, Agb) · TR(Δt_Agb^Agi)    (3)

M1 = Σ_{i=1}^{n} TRUST(Agi)^Aga · N(Agi, Agb) · TR(Δt_Agb^Agi)    (4)

where N(Agi, Agb) is the number of interactions between Agi and Agb, and TR(Δt_Agb^Agi) is the timely relevance coefficient defined by Equation 5:

TR(Δt_Agb^Agi) = e^(−λ ln(Δt_Agb^Agi))    (5)
Δt_Agb^Agi is the time difference between the current time and the time at which Agi last updated its information about Agb's trust. λ is an application-dependent coefficient. The intuition behind this formula is to use a function that decreases with the time difference (Fig. 2). Consequently, the more recent the information is, the higher the timely relevance coefficient. The function ln is used for computational reasons when dealing with large numbers. Intuitively, the function used in Equation 5 reflects the reliability of the transmitted information; indeed, it is similar to the well-known reliability function e^(−λt).
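Equation 5's decay can be computed directly; note that e^(−λ ln Δt) simplifies to Δt^(−λ). A small sketch, with the λ value chosen arbitrarily for illustration:

```python
import math

def timely_relevance(dt: float, lam: float) -> float:
    """TR(dt) = exp(-lam * ln(dt)), i.e. dt ** -lam: recent reports (small dt)
    get a coefficient near 1, old reports decay toward 0. Assumes dt >= 1."""
    return math.exp(-lam * math.log(dt))

for dt in (1, 10, 100):
    print(dt, round(timely_relevance(dt, lam=0.5), 3))
# dt = 1 gives 1.0; dt = 100 with lam = 0.5 gives 100 ** -0.5 = 0.1
```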
Fig. 2. The timely relevance coefficient TR(Δt_Agb^Agi) = e^(−λ ln(Δt_Agb^Agi)) as a decreasing function of the time difference Δt_Agb^Agi
M2 = M0 / M1    (6)
The way of combining Equation 3 (M0) and Equation 4 (M1) in the calculation of Equation 6 (M2) is justified by the fact that M2 reflects the mathematical expectation of the random variable X representing Agb's trustworthiness: the sum of each possible outcome multiplied by its probability. This equation shows how trust can be obtained by merging the trustworthiness values transmitted by some mediators. The merging method takes into account the proportional relevance of each trustworthiness value, rather than treating all values equally.
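Assuming the merge is a weighted average in which each witness's report is weighted by the witness's own trustworthiness, the number of interactions, and the timely relevance coefficient (the weight structure is an assumption, consistent with the four factors the conclusion lists), the combination can be sketched as:

```python
import math

def merge_trust(reports, lam=0.5):
    """Each report: (trust_in_witness, witness_rating_of_target,
    nb_interactions, dt). Returns the weighted average M0 / M1, with
    weights combining witness trust, interaction count and time decay.
    The weight structure and lam are illustrative assumptions."""
    m0 = m1 = 0.0
    for trust_i, rating, n, dt in reports:
        w = trust_i * n * math.exp(-lam * math.log(dt))  # timely relevance decay
        m0 += w * rating   # Equation 3: weighted ratings
        m1 += w            # Equation 4: sum of weights
    return m0 / m1 if m1 else 0.0

# (trust in witness, witness's rating of Agb, #interactions, time since update)
reports = [(0.9, 0.8, 12, 2), (0.6, 0.4, 3, 30), (0.8, 0.75, 7, 5)]
print(round(merge_trust(reports), 3))
```

Because it is a weighted average, the result is always bounded by the smallest and largest transmitted ratings, which is exactly the property used below to rule out the lottery paradox.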
According to Equation 6, we have:

∀i, TRUST(Agb)^Agi ≤ w ⟹ M2 ≤ w.

Consequently, if all the trust values sent by the consulted agents about Agb are less than the threshold w, then Agb cannot be considered trustworthy. Thus, the well-known Kyburg lottery paradox can never happen. The lottery paradox was designed to demonstrate that three attractive principles governing rational acceptance, namely that:
1. it is rational to accept a proposition that is very likely true;
2. it is not rational to accept a proposition that you are aware is inconsistent; and
3. if it is rational to accept a proposition A and it is rational to accept another proposition B, then it is rational to accept A ∧ B,
are jointly inconsistent. In our situation, we do not have such a contradiction.
To assess M2, we need the trustworthiness values of the other agents. To deal with this issue, we propose the notion of a trust graph, which we define as follows:
Definition 3 (Trust Graph). A trust graph is a directed and weighted graph. The nodes are agents, and an edge (Agi, Agj) means that agent Agi knows agent Agj. The weight of the edge (Agi, Agj) is a pair (x, y) where x is Agj's trustworthiness from the point of view of Agi and y is the number of interactions between Agi and Agj. The weight of a node is that agent's trustworthiness from the point of view of the source agent.
(belonging to the set Str.Agents) from the point of view of the agent who referred him
(i.e. Agi).
3- When a consulted agent answers by indicating a set of agents, these agents will also be consulted. They can be regarded as potential witnesses. These witnesses are added to a set called Potential_Witnesses. When a potential witness is consulted, it is removed from the set.
4- To ensure that the evaluation process terminates, two limits are used: the maximum number of agents to be consulted (Limit_Nbr_Visited_Agents) and the maximum number of witnesses who must offer an answer (Limit_Nbr_Witnesses). The variable Nbr_Additional_Agents is used to ensure that the first limit is respected when Aga starts to receive the answers of the consulted agents.
Evaluate-Node(Agy) {
    For each Arc(Agx, Agy) {
        If Node(Agx) is not evaluated Then
            Evaluate-Node(Agx)
    }
    m1 := 0, m2 := 0
    For each Arc(Agx, Agy) {
        m1 := m1 + Weight(Node(Agx)) * Weight(Arc(Agx, Agy))
        m2 := m2 + Weight(Node(Agx))
    }
    Weight(Node(Agy)) := m1 / m2
}
Algorithm 2
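Algorithm 2 translates naturally into a recursive traversal. The sketch below assumes a simplified graph in which each arc carries a single trust rating rather than the full (x, y) pair of Definition 3; the data layout is an illustrative assumption:

```python
# Python transcription of Algorithm 2. `preds` maps each node to its list of
# (predecessor, arc_weight) pairs; `weights` is seeded with the node weights
# of the source agent's direct contacts and filled in recursively.

def evaluate_node(node, preds, weights):
    if node in weights:                # already evaluated (or seeded)
        return weights[node]
    m1 = m2 = 0.0
    for pred, arc_weight in preds[node]:
        w = evaluate_node(pred, preds, weights)  # evaluate predecessors first
        m1 += w * arc_weight                     # weighted ratings
        m2 += w                                  # sum of predecessor weights
    weights[node] = m1 / m2 if m2 else 0.0
    return weights[node]

# Aga directly trusts Ag1 (0.8) and Ag2 (0.4); they rate Agb 0.9 and 0.5.
preds = {"Agb": [("Ag1", 0.9), ("Ag2", 0.5)]}
weights = {"Ag1": 0.8, "Ag2": 0.4}
print(evaluate_node("Agb", preds, weights))
```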
4 Implementation
In this section we describe the implementation of our negotiation dialogue game
framework and the trustworthiness model using the JackTM platform (The Agent
Oriented Software Group, 2004). We selected this platform for three main reasons:
1- It is an agent-oriented language offering a framework for multi-agent system
development. This framework can support different agent models.
2- It is built on top of, and fully integrated with, the Java programming language. It includes all components of Java and offers specific extensions to implement agents' behaviors.
3- It supports logical variables and cursors. A cursor is a representation of the results of a query. It is an enumerator that provides query-result enumeration by re-binding the logical variables used in the query. These features are particularly helpful when querying the state of an agent's beliefs. Their semantics is midway between that of logic programming languages (with Java-style type checking) and embedded SQL.
(Figure: the negotiation protocol between agents Ag1 and Ag2, and the trust model agents Trust_Ag1, ..., Trust_Agn involved in the interactions for determining Ag1's trustworthiness.)
The trustworthiness model is implemented using the same principle (events + plans). The requests sent by an agent about the trustworthiness of another agent are events, and the evaluations of agents' trustworthiness are programmed in plans. The trust graph is implemented as a Java data structure (directed graph).
As Java classes, negotiating agents and trust model agents have private data called Belief Data. For example, the different commitments and arguments that are made and manipulated are held in a data structure called CAN, implemented using tables; the different actions expected by an agent in the context of a particular negotiation game are held in a data structure (table) called data_expected_actions; and the trustworthiness values that an agent holds about other agents are recorded in a data structure (table) called data_trust. These data and their types are given in Fig. 4 and Fig. 5.
The main steps of the evaluation process of Agb's trustworthiness are implemented as follows:
1- Respecting the two limits and the threshold w, Aga consults his knowledge base data_trust of type table_trust and sends a request to his confidence agents Agi (i = 1,.., n) about Agb's trustworthiness. The JackTM primitive Send makes it possible to send the request as a JackTM message that we call Ask_Trust, of MessageEvent type. Aga sends this request starting with the confidence agents whose trustworthiness values are highest.
2- In order to answer Aga's request, each agent Agi executes an instance of a JackTM plan that we call Plan_ev_Ask_Trust. Thus, using his knowledge base, each agent Agi offers Aga a trustworthiness value for Agb if Agb is known to Agi. If not, Agi proposes a set of agents that are confidence agents from his point of view, with their trustworthiness values and the number of times he has interacted with them. In the first case, Agi sends Aga a JackTM message that we call Trust_Value. In the second case, Agi sends a message that we call Confidence_Agent. These two messages are of type MessageEvent.
3- When Aga receives the Trust_Value message, he executes a plan: Plan_ev_Trust_Value. According to this plan, Aga adds two pieces of information to a graph structure called graph_data_trust: 1) the agent Agi and his trustworthiness value, as a graph node; 2) the trustworthiness value that Agi offers for Agb and the number of times that Agi interacted with Agb, as an arc relating the node Agi and the node Agb. This first part of the trust graph is kept until the end of the evaluation process of Agb's trustworthiness. When Aga receives the Confidence_Agent message, he executes another plan: Plan_ev_Confidence_Agent. According to this plan, Aga adds three pieces of information to another graph structure, graph_data_trust_sub_level, for each agent Agi: 1) the agent Agi and his trustworthiness value, as a sub-graph node; 2) the nodes Agj representing the agents proposed by Agi; 3) for each agent Agj, the trustworthiness value that Agi assigns to Agj and the number of times that Agi interacted with Agj, as an arc between Agi and Agj. This information, which constitutes a sub-graph of the trust graph, will be used to evaluate the Agj's trustworthiness values using Equation 6. These values are recorded in a new structure: new_data_trust. The structure graph_data_trust_sub_level thus releases its memory once the Agj's trustworthiness values are evaluated. This technique allows us to decrease the space complexity of our algorithm.
4- Steps 1, 2, and 3 are applied again, substituting new_data_trust for data_trust, until all the consulted agents offer a trustworthiness value for Agb or until one of the two limits (Limit_Nbr_Visited_Agents or Limit_Nbr_Witnesses) is reached.
5- Agb's trustworthiness value is evaluated using the information recorded in the structure graph_data_trust by applying Equation 6.
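Steps 1-4 above amount to a bounded, best-first consultation loop. The following sketch abstracts the JackTM message exchange into plain dictionaries; all names and the data layout are illustrative assumptions:

```python
# Hedged sketch of the consultation loop: Aga queries its most trusted
# contacts first; a contact either returns a rating for the target
# (a Trust_Value answer) or refers further witnesses (a Confidence_Agent
# answer), until the target is rated or a limit is reached.

def consult(data_trust, ratings, referrals, target,
            limit_visited=20, limit_witnesses=5):
    reports, visited = [], set()
    # step 1: start with the most trusted confidence agents
    frontier = sorted(data_trust, key=data_trust.get, reverse=True)
    while frontier and len(visited) < limit_visited and len(reports) < limit_witnesses:
        agent = frontier.pop(0)
        if agent in visited:
            continue
        visited.add(agent)
        if agent in ratings and target in ratings[agent]:
            reports.append((agent, ratings[agent][target]))  # Trust_Value answer
        else:
            frontier.extend(referrals.get(agent, []))        # Confidence_Agent referral
    return reports

data_trust = {"Ag1": 0.9, "Ag2": 0.6}
ratings = {"Ag1": {"Agb": 0.8}, "Ag3": {"Agb": 0.7}}
referrals = {"Ag2": ["Ag3"]}
print(consult(data_trust, ratings, referrals, "Agb"))
# [('Ag1', 0.8), ('Ag3', 0.7)]
```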
The different events and plans implementing our trustworthiness model and the negotiating agent constructor are illustrated in Fig. 6. Fig. 7 illustrates an example, generated by our prototype, of the process allowing an agent Ag1 to assess the trustworthiness of another agent Ag2. In this example, Ag2 is considered trustworthy by Ag1 because its trustworthiness value (0.79) is higher than the threshold (0.7).
Fig. 6. Events, plans and the conversational agent constructor implementing the trustworthiness model
representing the events and the plans. For example, the event
Event_Attack_Commitment and the plan Plan_ev_Attack_commitment implement the
Attack game. The architecture of our negotiating agents is illustrated in Fig. 8.
5 Related Work
Recently, some online trust models have been developed (see [20] for a detailed survey). The most widely used are those of eBay and Amazon Auctions. Both are implemented as centralized trust systems so that their users can rate and learn about each other's reputation. For example, on eBay, trust values (or ratings) are +1, 0, or −1, and a user, after an interaction, can rate its partner. The ratings are stored centrally and summed up to give an overall rating. Thus, reputation in these models is a single global value. However, the model can be unreliable, particularly when some buyers do not return ratings. In addition, these models are not suitable for applications in open MAS
such as agent negotiation because they are too simple in terms of their trust rating values
and the way they are aggregated.
(Fig. 8: the architecture of the negotiating agents: dialogue games mapped from Jack events to Jack plans, an argumentation system (Java + logical programming), and a knowledge base and ontology stored as Jack Beliefsets.)
Another centralized approach, called SPORAS, has been proposed by Zacharia and Maes [7]. SPORAS does not store all the trust values, but rather updates the global reputation value of an agent according to its most recent rating. The model uses a learning function for the updating process so that the reputation value can reflect an agent's trust. In addition, it introduces a reliability measure based on the standard deviations of the trust values. However, unlike our model, SPORAS deals with all ratings equally, without considering the different trust degrees. Consequently, it suffers from rating noise. In addition, like eBay, SPORAS is a centralized approach, so it is not suitable for open negotiation systems.
Broadly speaking, there are three main approaches to trust in open multi-agent systems. The first approach is built on an agent's direct experience of an interaction partner. The second approach uses information provided by other agents [2, 3, 4]. The third approach uses certified information provided by referees [9, 19]. In the first approach, methods by which agents can learn and make decisions to deal with trustworthy or untrustworthy agents should be considered. In the models based on the second and third approaches, agents should be able to reliably acquire and reason about the transmitted information. In the third approach, agents should provide third-party referees to witness to their previous performance. Because the first approach is based only on a history of interactions, the resulting models are poor: agents with no prior interaction histories could trust dishonest agents until a sufficient history of interactions is built.
Sabater [13] proposes a decentralized trust model called Regret. Unlike the first-approach models, Regret uses an evaluation technique based not only on an agent's direct experience of its partner's reliability, but also on a witness reputation component. In addition, trust values (called ratings) are dealt with according to their recency relevance: old ratings are given less importance than new ones. However, unlike our model, Regret does not show how witnesses can be located, and thus this component is of limited use. In addition, the model does not deal with the possibility that an agent may lie about its rating of another agent, and because the ratings are simply summed equally, the technique can be sensitive to noise. In our model, this issue is managed by considering the witnesses' trust, and because our merging method takes into account the proportional relevance of each trustworthiness value rather than treating all values equally (see Equation 6, Section 3.1).
Yu and Singh [2, 3, 4] propose an approach based on social networks in which agents, acting as witnesses, can transmit information about each other. The purpose is to tackle the problem of retrieving ratings from a social network through the use of referrals. Referrals are pointers to other sources of information, similar to the links that a search engine ploughs through to obtain a Web page. Through referrals, an agent can provide another agent with alternative sources of information about a potential interaction partner. The social network is represented using a referral network called TrustNet. The trust graph we propose in this paper is similar to TrustNet; however, there are several differences between our approach and Yu and Singh's. Unlike Yu and Singh's approach, in which agents do not use any particular reasoning, our approach is conceived to secure argumentation-based negotiation, in which agents use argumentation-based reasoning. In addition, Yu and Singh do not consider the possibility that an agent may lie about its rating of another agent; they assume all witnesses are totally honest. This problem of inaccurate reports is considered in our approach by taking into account the trust of all the agents in the trust graph, particularly the witnesses. Also, unlike our model, Yu and Singh's model does not treat the timely relevance of information, and all ratings are dealt with equally. Consequently, this approach cannot manage the situation where an agent's behavior changes.
Huynh, Jennings, and Shadbolt [19] tackle the problem of the evaluator itself collecting the information required to assess the trust of its partner, called the target. The problem arises because witness-based models implicitly assume that witnesses are willing to share their experiences. For this reason, they propose an approach called certified reputation, based not only on direct and indirect experiences, but also on third-party references provided by the target agent itself. The idea is that the target agent can present arguments about its reputation: these arguments are references produced by agents that have interacted with the target agent, certifying its credibility (the model proposed by Maximilien and Singh [5] uses the same idea). This approach has the advantage of quickly producing an assessment of the target's trust because it needs only a small number of interactions and does not require the construction of a trust graph. However, it has some serious limitations. Because the referees are proposed by the target agent, this agent can provide only referees that will give positive ratings about it and avoid other referees, probably more credible than the provided ones. Even if the provided agents are credible, their testimony may not reflect the real picture of the target's honesty. The approach can thus privilege opportunistic agents, i.e., agents that are credible only with potential referees. For all these reasons, this approach is not suitable for trusting negotiating agents. In addition, in this approach the evaluator agent should be able to evaluate the honesty of the referees using a witness-based model. Consequently, a trust graph like the one proposed in this paper could be used. This means that, in some situations, the target's trust might not be assessed without asking for witness agents.
6 Conclusion
The contribution of this paper is the proposition and implementation of a new
probabilistic model to trust argumentation-based negotiating agents. The purpose of
such a model is to provide a secure environment for agent negotiation within multi-agent
systems. To our knowledge, this paper is the first work addressing the security issue of
argumentation-based negotiation in multi-agent settings. Our model has the advantage of
being computationally efficient and of combining four important factors: (1) the
trustworthiness of confidence agents; (2) the target's trustworthiness from the
point of view of confidence agents; (3) the number of interactions between confidence
agents and the target agent; and (4) the timely relevance of information transmitted by
confidence agents. The resulting model allows us to produce a comprehensive
assessment of an agent's credibility in an argumentation-based negotiation setting.
Acknowledgements
We would like to thank the Natural Sciences and Engineering Research Council of
Canada (NSERC), le Fonds québécois de la recherche sur la nature et les technologies
(NATEQ), and le Fonds québécois de la recherche sur la société et la culture (FQRSC)
for their financial support. The first author is also supported in part by Concordia
University, Faculty of Engineering and Computer Science (Start-up Grant). Also, we
would like to thank the three anonymous reviewers for their interesting comments and
suggestions.
References
[1] A. Abdul-Rahman, and S. Hailes. Supporting trust in virtual communities. In Proceedings of
the 33rd Hawaii International Conference on System Sciences, 6, IEEE Computer Society
Press, 2000.
[2] B. Yu, and M. P. Singh. An evidential model of distributed reputation management. In
Proceedings of the First International Joint Conference on Autonomous Agents and
Multi-Agent Systems. ACM Press, pages 294-301, 2002.
[3] B. Yu, and M. P. Singh. Detecting deception in reputation management. In Proceedings of
the 2nd International Joint Conference on Autonomous Agents and Multi-Agent Systems.
ACM Press, pages 73-80, 2003.
[4] B. Yu, and M. P. Singh. Searching social networks. In Proceedings of the Second
International Joint Conference on Autonomous Agents and Multi-Agent Systems. ACM Press,
pages 65-72, 2003.
[5] E. M. Maximilien, and M. P. Singh. Reputation and endorsement for web services. ACM
SIGEcom Exchanges, 3(1):24-31, 2002.
[6] F. Sadri, F. Toni, and P. Torroni. Dialogues for negotiation: agent varieties and dialogue
sequences. In Proceedings of the International Workshop on Agents, Theories, Architectures
and Languages. Lecture Notes in Artificial Intelligence (2333):405-421, 2001.
[7] G. Zacharia, and P. Maes. Trust management through reputation mechanisms. Applied
Artificial Intelligence, 14(9):881-908, 2000.
[8] H. Prakken. Relating protocols for dynamic dispute with logics for defeasible argumentation.
Synthese, 127:187-219, 2001.
[9] H. Skogsrud, B. Benatallah, and F. Casati. Model-driven trust negotiation for web services.
IEEE Internet Computing, 7(6):45-52, 2003.
[10] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: enabling the scalable
virtual organization. The International Journal of High Performance Computing
Applications, 15(3), 200-222, 2001.
[11] J. Bentahar, B. Moulin, J-.J. Ch. Meyer, and B. Chaib-draa. A computational model for
conversation policies for agent communication. In J. Leite and P. Torroni, editors,
Computational Logic in Multi-Agent Systems. Lecture Notes in Artificial Intelligence (3487):
178-195, 2005.
[12] J. Bentahar. A pragmatic and semantic unified framework for agent communication. Ph.D.
Thesis, Laval University, Canada, May 2005.
[13] J. Sabater. Trust and Reputation for Agent Societies. Ph.D. Thesis, Universitat Autonoma de
Barcelona, 2003.
[14] L. Amgoud, N. Maudet, and S. Parsons. Modelling dialogues using argumentation. In
Proceedings of the 4th International Conference on Multi-Agent Systems, pages 31-38, 2000.
[15] M. Dastani, J. Hulstijn, and L. V. der Torre. Negotiation protocols and dialogue games. In
Proceedings of Belgium/Dutch Artificial Intelligence Conference, pages 13-20, 2000.
[16] N. Maudet, and B. Chaib-draa. Commitment-based and dialogue-game based protocols: new
trends in agent communication languages. Knowledge Engineering Review. Cambridge
University Press, 17(2):157-179, 2002.
[17] P. McBurney, and S. Parsons. Games that agents play: A formal framework for dialogues
between autonomous agents. Journal of Logic, Language, and Information, 11(3):1-22,
2002.
[18] S. D. Ramchurn, T. D. Huynh, and N. R. Jennings. Trust in multi-agent systems. The
Knowledge Engineering Review, 19(1):1-25, March 2004.
[19] T. D. Huynh, N. R. Jennings, and N. R. Shadbolt. An integrated trust and reputation model
for open multi-agent systems. Journal of Autonomous Agents and Multi-Agent Systems,
pages 119-154, 2006.
[20] T. Grandison, and M. Sloman. A survey of trust in internet applications. IEEE
Communications Surveys & Tutorials, 3(4), 2000.
1 Introduction
Inter-Organizational Workflow (IOW) is essential given the growing need for
organizations to cooperate and coordinate their activities in order to meet the new
demands of highly dynamic and open markets. The different organizations involved in
such cooperation must correlate their respective resources and skills, and coordinate
their respective business processes towards a common goal, corresponding to a value-added service [1], [2].
A fundamental issue for IOW is the coordination of these different distributed,
heterogeneous and autonomous business processes in order to both support semantic
interoperability between the participating processes, and efficiently synchronize the
distributed execution of these processes.
Coordination in IOW raises several problems such as: (i) the definition of the
universe of discourse, without which it would not be possible to solve the various
semantic conflicts that are bound to occur between several autonomous and
This paper also relies on the exploitation of agent and semantic Web approaches
viewed as enabling technologies to deal with the computing context in which IOW is
deployed.
The agent approach brings technical solutions and abstractions to deal with
distribution, autonomy and openness [9], which are inherent to loose IOW. This
approach also provides organizational concepts, such as groups, roles and commitments,
which are useful to structure and rule, at a macro level, the coordination of the different
partners involved in a loose IOW [10]. Using this technology, we also inherit
numerous concrete solutions to deal with coordination in multi-agent systems [11]
(middleware components, sophisticated interaction protocols).
The semantic Web approach facilitates communication and semantic interoperability
between organizations involved in a loose IOW. It provides means to
describe and access common business vocabulary and shared interaction protocols.
The contribution of this paper is fourfold. First, this paper defines a multi-agent
PMS architecture and shows how this architecture can be connected to any WfMS
whose architecture is compliant with the WfMC reference architecture. Second, it
proposes an organizational model, an instance of the Agent Group Role meta-model
[12], to structure and rule the interactions between the components of the PMS and
WfMS architectures. Third, it provides a protocol meta-model, specified with OWL, to
constitute a shared ontology of coordination protocols. This meta-model contains the
information necessary to select an appropriate interaction protocol at run time
according to the current coordination problem to be dealt with. This meta-model is
then refined to integrate a classification of interaction protocols devoted to loose IOW
coordination. Fourth, this paper presents a partial implementation of this work, limited
to a matchmaker protocol that deals with the finding-partners problem.
The remainder of this paper is organized as follows. Section 2 presents the PMS
architecture, stating the role of its components and explaining how they interact with
each other. Section 3 shows how to implement any WfMS engine connectable to a
PMS while remaining compliant with the WfMC reference architecture. For reasons
of homogeneity and flexibility, this engine is also given an agent-based architecture.
Section 4 gives an organizational model that structures and rules
the communication between the different agents involved in a loose IOW. Section 5
addresses protocol engineering issues. It presents the protocol meta-model and
also identifies, among multi-agent system interaction protocols, the ones that are
appropriate for the loose IOW context. Section 6 gives a brief overview of the
implementation of the matchmaker protocol to deal with the finding-partners problem.
Finally, section 7 compares our contribution to related works and concludes the paper.
specify protocols, i.e. their control structures, the actors involved in them and the
information necessary for their execution.
The Protocol Selection Agent (PSA) is an agent whose aim is to help a WfMS
requester select the most appropriate coordination protocol according to the
objective of the conversation to be initiated (finding partners, negotiation between
partners) and some specific criteria (quality of service, due time) depending on
the type of the chosen coordination protocol. These criteria will be presented in
section 4.
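The PSA's selection step can be sketched as a query over a catalogue of protocol descriptions. This is an illustrative sketch only: the catalogue entries, field names and quality figures are assumptions, whereas the real PSA queries the Coordination Protocol Ontology:

```python
# Hypothetical catalogue of protocol descriptions, standing in for the
# Coordination Protocol Ontology (names, objectives and rates assumed).
CATALOGUE = [
    {"name": "Matchmaker",  "objective": "finding-partners", "quality_rate": 0.8},
    {"name": "Broker",      "objective": "finding-partners", "quality_rate": 0.6},
    {"name": "ContractNet", "objective": "negotiation",      "quality_rate": 0.7},
]

def select_protocol(objective, min_quality=0.0):
    """Return the best-rated protocol matching the conversation objective,
    or None when no catalogued protocol qualifies."""
    candidates = [p for p in CATALOGUE
                  if p["objective"] == objective
                  and p["quality_rate"] >= min_quality]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p["quality_rate"])["name"]
```

For a finding-partners conversation, this filter would pick the Matchmaker over the Broker because of its higher assumed quality rate.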
The Protocol Launcher Agent (PLA) is an agent that creates and launches agents,
called Moderators, that will run the protocols.
The Domain Ontology source structures and records different business taxonomies
to solve the semantic interoperability problems that are bound to occur in a loose IOW
context. Indeed, organizations involved in such an IOW must adopt a shared business
view, through a common terminology, before starting their cooperation.
The Coordination Protocol Ontology source describes and records protocol
descriptions, which may be queried by the PSA agent or used as models of behavior
by the PLA agent when creating moderators.
Figure 1: The PMS architecture. The Protocol Design and Selection block (Protocol Selection Agent and Protocol Launcher Agent) exploits the Coordination Protocol Ontology and Domain Ontology sources; the Protocol Execution block comprises the Conversation Server, with its Conversation database, and the Moderators, each with its own Communication Act database.
The Protocol Execution block is composed of two types of agents: the Conversation
Server and as many moderators as there are conversations in progress. It exploits
the Domain Ontology source, maintains the Conversation database and handles a
Communication Act database for each moderator. We now describe the two types of
agents and how they interact with these databases and knowledge sources.
Each moderator manages a single conversation, which conforms to a given
coordination protocol, and a moderator has the same lifetime as the conversation it
manages. A moderator grants roles to agents and ensures that any communication act
that takes place in its conversation is compliant with the protocol's rules. It also
records all the communication acts of its conversation in the Communication Act
database. A moderator also exploits the Domain Ontology source to interact with the
agents involved in its conversation using an adequate vocabulary.
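A moderator's compliance checking and logging can be sketched as follows (an illustrative sketch; the rule representation, mapping each role to its permitted acts, is an assumption):

```python
class Moderator:
    """One moderator per conversation: grants roles, checks that each
    communication act complies with the protocol, and logs it."""

    def __init__(self, rules):
        self.rules = rules   # role -> set of acts that role may perform
        self.roles = {}      # agent name -> granted role
        self.act_db = []     # the Communication Act database

    def grant_role(self, agent, role):
        self.roles[agent] = role

    def submit(self, agent, act):
        """Accept and record the act only if the agent's role permits it."""
        role = self.roles.get(agent)
        if role is None or act not in self.rules.get(role, set()):
            return False  # rejected: not compliant with the protocol's rules
        self.act_db.append((agent, act))
        return True
```

For instance, with rules `{"requester": {"request"}, "provider": {"inform"}}`, a requester's `request` is recorded while any other act it attempts is rejected.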
The Conversation Server publishes and makes accessible global information about
each current conversation (such as its protocol, the identity of the moderator
supervising it, the date of its creation, the requester that initiated it and the
participants involved in it). This information is stored in the Conversation database.
By allowing participants to get information about
the current (and past) conversations and to be notified of new conversations [6], the
Conversation Server makes the interaction space explicit and available. This
interaction space may be public or private according to the policies followed by
moderators. A database-oriented view mechanism may be used to specify what is
public or private, and for whom.
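Such a view mechanism can be sketched as a simple filter over conversation records (an illustrative sketch; the field names are assumptions):

```python
def conversation_view(conversations, agent):
    """Expose a conversation to an agent only if the conversation is public
    or the agent takes part in it -- a database-view-like filter."""
    return [c for c in conversations
            if c["public"] or agent in c["participants"]]
```

A private conversation is thus visible only to its own participants, while public ones are visible to every agent querying the Conversation Server.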
In addition to the two main blocks of the PMS, two other components are needed:
The Message Dispatcher and the Agent Communication Channel.
The Message Dispatcher is an agent that interfaces the PMS with any WfMS. Thus,
any WfMS intending to invoke a PMS's service only needs to know the Message
Dispatcher's address.
Finally, the Agent Communication Channel is an agent that supports the transport
of interaction messages between all the agents of the PMS architecture.
We have chosen agent technology for several reasons. First, as mentioned in the
introduction, this technology is well suited to the loose IOW context since it provides
natural abstractions to deal with the distribution, heterogeneity and autonomy inherent
to this context. Therefore, each organization involved in a loose IOW may be seen as
an autonomous agent whose mission is to coordinate with other workflow agents,
acting on behalf of its organization. Second, as agent technology is also at the basis of
the PMS architecture, it will be easier and more homogeneous, using agent
coordination techniques, to structure and rule the interaction between the agents of
both the PMS and WfMS architectures. Third, as argued in [14], [15], [16], the use of
this technology gives greater flexibility to the modeled workflow processes: the agents
implementing them can easily adapt to their specific requirements.
Figure 2 presents the agent-based architecture we propose for the WES. This
architecture includes: (i) as many agents, called workflow agents, as there are
workflow process instances currently in progress, (ii) an agent manager in charge of
these agents, (iii) a connection server and a new interface, interface 6, that help
workflow agents solicit the PMS for coordination purposes, and finally (iv) an agent
communication channel to support the interaction between these agents.
Regarding the Workflow Agents, the idea is to implement each workflow process
instance (the process definitions being stored in the Workflow Process Database) as a
software process, and to encapsulate this process within an agent. Such a Workflow
Agent includes a workflow engine that, as the workflow process instance progresses,
reads the workflow definition and triggers the action(s) to be performed according to
its current state. This Workflow Agent supports interface 3 with the applications used
to perform the pieces of work associated with process tasks.
Figure 2: The agent-based architecture of the Workflow Enactment Service: a Connection Server, a Knowledge Database, Workflow Agents (supporting interfaces 3 and 4), an Agent Manager (supporting interfaces 1, 2 and 5) and the Workflow Process Database.
The Agent Manager controls and monitors the running of Workflow Agents:
- Upon a request for a new instance of a workflow process, the Agent Manager
creates a new instance of the corresponding Workflow Agent type, initializes its
parameters according to the context, and launches the running of its workflow
engine.
- It ensures the persistency of Workflow Agents that execute long-term business
processes, in which task performances are interleaved with periods of inactivity.
- It coordinates Workflow Agents in their use of the local shared resources.
- It supports interfaces 1, 2 and 5 of the WfMS.
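The first responsibility above (instantiation) can be sketched as follows; class and field names are assumptions for illustration, and persistency and resource coordination are left out:

```python
import itertools

class AgentManager:
    """Creates one Workflow Agent per workflow process instance
    (persistency and resource coordination are omitted here)."""

    def __init__(self, process_db):
        self.process_db = process_db  # workflow process definitions
        self.agents = {}              # running Workflow Agents, by id
        self._ids = itertools.count(1)

    def instantiate(self, process_name, context):
        """Create, initialize and launch a Workflow Agent instance."""
        definition = self.process_db[process_name]
        agent_id = "%s-%d" % (process_name, next(self._ids))
        self.agents[agent_id] = {"definition": definition,
                                 "context": context,
                                 "state": "running"}
        return agent_id
```

Each call embeds a copy of the process definition and the instance-specific context in a new agent record, mirroring the one-agent-per-instance principle.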
In the loose IOW context, workflow agents need to find external workflow agents
running in other organizations and able to contribute to the achievement of their goal.
Connecting them requires finding, negotiation and contracting capacities, but also
the interaction space in small and manageable units, thanks to the notions of role and
group. Openness is also facilitated, since AGR imposes no constraint on the internal
structure of agents.
Figure 3: The organizational model: the Requester-Participation, Request-Creation-Coordination-Protocol, Coordination-Protocol-Selection, Coordination-Protocol-Execution, Creation-Coordination-Protocol, Coordination-Protocol, Request-Participation-Coordination-Protocol and Provider-Participation groups, linking requester and provider Workflow Agents, Connection Servers, Agent Managers, the Message Dispatcher, the Protocol Launcher Agent, the Conversation Server and the Moderators.
Let us now detail how each group operates. First, the Requester-Participation
group enables a requester workflow agent to solicit its connection server in order to
contact the PMS to deal with a coordination problem (finding partners, negotiation
between partners). The Request-Creation-Coordination-Protocol group enables the
connection server to forward this request to the message dispatcher agent. The latter
then contacts, via the Coordination-Protocol-Selection group, the protocol selection
agent that helps the requester workflow agent select a convenient coordination
protocol. The Coordination-Protocol-Execution group then enables the message
dispatcher to connect with the Protocol Launcher Agent (PLA) and ask it for the
creation of a new conversation, which is created by the PLA. More precisely,
Figure 4: Sequence diagram of the conversation set-up: the Requester Connection Server, the Message Dispatcher, the Selection Protocol Agent, the Protocol Launcher Agent and a Moderator exchange request, inform/in-reply-to and confirm acts concerning protocol selection, conversation creation and protocol creation.
<owl:Class rdf:ID="Protocol">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:maxCardinality>
      <owl:onProperty>
        <owl:ObjectProperty rdf:ID="HasTerminalState"/>
      </owl:onProperty>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
  <owl:DatatypeProperty rdf:ID="Name">
    <rdfs:domain rdf:resource="#Protocol"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
  </owl:DatatypeProperty>
  <owl:DatatypeProperty rdf:ID="Description">
    <rdfs:domain rdf:resource="#Protocol"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
  </owl:DatatypeProperty>
  ...
</owl:Class>
Figure 5a: The Protocol Meta Model: OWL representation
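For illustration, agents such as the PSA or the PLA could read such OWL descriptions with any RDF/XML-aware parser. The sketch below uses Python's standard xml.etree on a self-contained fragment in the spirit of figure 5a (namespace declarations are added so that it parses standalone):

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Self-contained fragment modeled on figure 5a.
DOC = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:owl="{OWL}" xmlns:rdfs="{RDFS}">
  <owl:Class rdf:ID="Protocol">
    <owl:DatatypeProperty rdf:ID="Name"/>
    <owl:DatatypeProperty rdf:ID="Description"/>
  </owl:Class>
</rdf:RDF>
"""

root = ET.fromstring(DOC)
# Tag and attribute names are namespaced, hence the Clark {uri}local form.
props = [p.attrib[f"{{{RDF}}}ID"]
         for p in root.iter(f"{{{OWL}}}DatatypeProperty")]
# props now lists the datatype properties attached to Protocol.
```

A production system would rather use a full RDF/OWL toolkit, but this shows that the protocol descriptions remain plain, machine-readable XML.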
Figure 5b: The Protocol Meta Model, UML representation: a Protocol (Name, Description, Casting const., Behavioural const., Parameters) with Member and Creator Roles (Name, Skill, Casting const.), InterventionTypes (Name, Action, Input, Output, Behavioural const.) linked to Roles through PermissionToPerform, a Business Domain link to Ontologies, and InitialState and TerminalState links to States (Condition, PreCondition, PostCondition).
structure of the protocol, that is, the sequences of interventions that may occur in the
course of a conversation. An Intervention Type belongs to the Role linked to it: only
agents playing that Role may perform it.
A Protocol includes a set of member Roles, one of which is allowed to initiate a new
conversation following the Protocol, and a set of Intervention Types linked to these
Roles. The Business Domain link gives access to all the possible Ontologies to which
a Protocol may refer. The InitialState link describes the configuration required to start
a conversation, while the TerminalState link establishes a configuration to reach in
order to consider a conversation completed. A Protocol may include a Description in
natural language for documentation purposes, or information about the Protocol at the
agents' disposal. The Protocol's casting constraints attribute records constraints that
involve several Roles and cannot be defined at the level of individual Roles, such as
the total number of agents allowed to take part in a conversation. Similarly, the
Protocol's behavioral constraints attribute records constraints that cannot be defined at
the level of individual Intervention Types, such as the total number of interventions
that may occur in the course of a conversation. Some of these casting or behavioral
constraints can involve Parameters of the Protocol, i.e., properties whose value is not
fixed by the Protocol but is proper to each conversation and set by the conversation's
initiator.
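The entities of figure 5b can be mirrored by simple data structures. The following sketch is illustrative, with simplified fields (the real meta-model also carries states, conditions, constraints and ontology links):

```python
from dataclasses import dataclass, field

@dataclass
class InterventionType:
    name: str
    role: str  # only agents playing this role may perform it

@dataclass
class Protocol:
    name: str
    member_roles: list
    creator_role: str  # the role allowed to initiate a conversation
    interventions: list = field(default_factory=list)
    parameters: dict = field(default_factory=dict)  # set per conversation

    def may_perform(self, role, intervention_name):
        """Mirror the PermissionToPerform link of figure 5b."""
        return any(i.name == intervention_name and i.role == role
                   for i in self.interventions)
```

A moderator could rely on exactly this kind of check to enforce that an intervention is performed only by agents playing the Role it is linked to.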
Figure 6: Classification of coordination protocols: Protocols are specialized into FindingPartner (Matchmaker, with P2PExecution, ComparisonMode, QualityRate and NumberOfProviders attributes, and Broker), Negotiation, and Contract (ContractNet, with a ContractTemplate), each linked to a Business Domain Ontology.
In the UML schema of figure 6, the Protocol class of figure 5b is refined. Protocols
are specialized, according to their objective, into three abstract classes:
FindingPartner, Negotiation and Contract. These abstract classes are in turn
recursively specialized down to leaves corresponding to protocol models such as
Matchmaker, Broker, Argumentation, Heuristic or ContractTemplate.
Each of these protocols may be used in one of the coordination steps of IOW.
At this last level, the different classes feature new attributes specific to each
one and possibly related to quality of service.
If we consider for example the Matchmaker protocol (the only one developed in
figure 6), we can make the following observations. First, it differs from the Broker in
that it implements a peer-to-peer execution mode with the provider: the identity of the
provider is known and a direct link is established between the requester and the
provider at run time. Then, one may be interested in its comparison modes (plug in,
exact, and/or subsume) [21], a quality rating to compare it to other matchmakers, or
the minimum number of providers it is able to manage.
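The three comparison modes can be illustrated, in a much-simplified form, over single topics of a taxonomy (the real definitions of [21] operate on full service descriptions; the taxonomy below and the mode assignments are assumptions for illustration):

```python
# Toy topic taxonomy: child -> parent (assumed fragment of a domain ontology).
TAXONOMY = {"agents": "ai", "workflow": "ai", "ai": "computer-science"}

def ancestors(topic):
    """Yield the successively more general topics above the given one."""
    while topic in TAXONOMY:
        topic = TAXONOMY[topic]
        yield topic

def compare(request, offer):
    """Much-simplified comparison modes over single topics."""
    if request == offer:
        return "exact"
    if request in ancestors(offer):
        return "plug in"   # the offer is more specific than the request
    if offer in ancestors(request):
        return "subsume"   # the offer is more general than the request
    return "fail"
```

The point is that a flexible matchmaker does not stop at exact equality: it also accepts offers that are more specific or more general than the request, according to the subsumption relations of the ontology.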
6 Implementation
This work has given rise to an implementation project, called ProMediaFlow (Protocols
for Mediating Workflow), aiming at developing the whole PMS architecture. The first
step of this project is to evaluate its feasibility. For this purpose, we have developed a
simulator, limited for the moment to a subset of components and to a single protocol.
Regarding the PMS architecture, we have implemented only the Protocol Execution
block and considered the Matchmaker protocol. Regarding the WfMS architecture, we
have implemented both requester and provider Workflow Agents and their
corresponding Agent Managers and Connection Servers.
This work has been implemented with the MadKit platform [23], which permits the
development of distributed applications using multi-agent principles. MadKit also
supports the organizational AGR meta-model and thus permits a straightforward
implementation of our organizational model (presented in section 4).
In the current version of our implementation, the system is only able to produce
moderators playing the Matchmaker protocol. More precisely, the Matchmaker is able
to compare offers and requests of workflow services (i.e. services implementing a
workflow process) by establishing flexible comparisons (exact, plug in, subsume) [21]
based on a domain ontology. For that purpose, we have also included facilities to
describe workflow services in the simulator. As presented in [24], these offers
and requests are described using both the Petri Nets with Objects (PNO) formalism [25]
and the OWL-S language [26]: the PNO formalism is used to design, analyze,
simulate, check and validate workflow services, which are then automatically derived
into OWL-S specifications to be published through the Matchmaker.
Figure 7 below shows some screenshots of our implementation. More precisely, this
figure shows the following four agents: a Requester Workflow Agent, a Provider
Workflow Agent, the Conversation Server, and a Moderator, which is a Matchmaker.
While windows 1 and 2 represent agents belonging to the WfMS (respectively a
requester workflow agent and a provider workflow agent), windows 3 and 4 represent
agents belonging to the PMS (respectively the conversation server and a moderator
agent playing the Matchmaker protocol).
The top left window (number 1) represents a requester workflow agent belonging to
a WfMS involved in a loose IOW and implementing a workflow process instance. As
shown by window 1, this agent can: (i) specify a requested workflow service
(Specification menu), (ii) advertise this specification through the Matchmaker
(Submission menu), (iii) visualize the providers offering services corresponding to the
specification (Visualization menu), (iv) establish peer-to-peer connections with one of
these providers (Contact menu), and (v) launch the execution of the requested service
(WorkSpace menu). In a symmetric way, the bottom left window (number 2)
represents an agent playing the role of a workflow service provider, with a set of
menus enabling it to manage its offered services. As shown by window 2, the
Specification menu includes three commands to support PNO and OWL-S
specifications: the first permits the specification of a workflow service using the PNO
formalism, the second the analysis and validation of the specified PNO, and the third
derives the corresponding OWL-S specification.
The top right window (number 3) represents the Conversation Server agent
belonging to the PMS architecture. As shown by window 3, this agent can: (i) display
all the conversations (Conversations menu, List of conversations option), (ii) select a
given conversation (Conversations menu, Select a conversation option), and (iii)
obtain all the information related to the selected conversation, i.e. its moderator, its
initiator and its participants (Conversations menu, Detail of a conversation option).
Finally, the bottom right window (number 4) represents a Moderator agent playing the
Matchmaker protocol. This agent can: (i) display all the communication acts of the
supervised conversation (Communication act menu, List of acts option), (ii) select a
given communication act (Communication act menu, Select an act option), (iii) obtain
all the information related to the selected act, i.e. its sender and its content
(Communication act menu, Detail of an act option), and (iv) consult the underlying
domain ontology (Ontology domain menu).
Let us now give some indications about the efficiency of our implementation, and
more precisely of our Matchmaker protocol. Let us first remark that, in the IOW
context, workflows are most often long-term processes that may last several days.
Consequently, we do not need an instantaneous matchmaking process. However, in
order to prove the feasibility of our proposition, we have measured the matchmaker
processing time according to some parameters (notably the number of offers and the
comparison modes) intervening in the complexity formulae of the matchmaking
process [27]. The measures have been realized in the context of a conference review
system case study, where a PC chair subcontracts to the matchmaker the search for
reviewers able to evaluate papers. The reviewers are supposed to have submitted their
capabilities, in terms of topics, to the matchmaker. Papers are classified according to
topics belonging to an OWL ontology. Figure 8 shows the matchmaker's average
processing time for a number of offers (services) varying from 100 to 1000,
considering the plug in, exact and subsume comparison modes.
As illustrated in figure 8, and in accordance with the complexity study of [27], the
exact mode is the most efficient in terms of processing time. To better analyze the
Matchmaker's behavior, we also plan to measure its recall and precision rates, well
known in Information Retrieval [28].
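The measurement procedure can be sketched as a small timing harness (illustrative only: the topics, the trivial matching function and the repeat count are assumptions, not the actual experimental set-up):

```python
import random
import time

def benchmark(match_fn, n_offers, repeats=5):
    """Average wall-clock time (in ms) of one matchmaking pass
    over n_offers randomly drawn offers."""
    topics = ["agents", "workflow", "ai", "databases"]
    offers = [random.choice(topics) for _ in range(n_offers)]
    start = time.perf_counter()
    for _ in range(repeats):
        hits = [o for o in offers if match_fn("ai", o)]
    return (time.perf_counter() - start) * 1000.0 / repeats

# Time a trivial exact-equality matcher on 1000 offers.
avg_ms = benchmark(lambda req, off: req == off, 1000)
```

Varying `n_offers` and plugging in the different comparison modes as `match_fn` reproduces, in miniature, the kind of curves reported in figure 8.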
Figure 8: Average processing time of the Matchmaker, in milliseconds (from about 400 to 540 ms), for 100, 500 and 1000 offers, under the Exact, Plug In and Subsume comparison modes.
- Easy design and development of loose IOW systems. The principle of separation of
concerns improves the understandability of the system, and therefore eases and
speeds up its design, development and maintenance. Following this practice,
protocols may be thought of as autonomous and reusable components that may be
specified and verified independently of any specific workflow system behavior. In
doing so, we can focus on specific coordination issues and build a library of
easy-to-use and reliable protocols. The same holds for workflow systems, since it
becomes possible to focus on their specific capabilities, however they interact with
others.
- Workflow agent reusability. As a consequence of introducing moderators, workflow
agents and agent managers no longer interact with each other directly. Instead, they
just have to agree on the type of conversation they want to adopt. Consequently,
agents impose fewer requirements on their partners, and they are loosely coupled.
Hence, heterogeneous workflow agents can easily be composed.
- Visibility of conversations. Some conversations are private and concern only the
protagonists. But, most of the time, meeting the goal of the conversation depends to
a certain extent on its transparency, i.e. on the possibility given to the agents to be
informed of the conversation's progress. With a good knowledge of the state of the
conversation and the rules of the protocol, each agent can operate with relevance
and in accordance with its objectives. In the absence of a Moderator, the
information concerning a conversation is distributed among the participating
agents. Thus, it is difficult to know which agent holds the sought information,
supposing that this agent has even been designed to supply it. By contrast,
moderators support the transparency of conversations.
Related works may be considered from two complementary points of view: loose
IOW coordination and protocol engineering.
Regarding loose IOW coordination, it can be noted that most works ([10], [14],
[15], [16], [29], [30], [31]) adopt one of the following multi-agent coordination
protocols: organizational structuring, matchmaking, negotiation, contracting, etc.
However, these works neither address all the protocols at the same time nor follow the
principle of separation of concerns. Consequently, they do not address protocol
engineering issues, notably protocol design, selection and execution.
Regarding protocol engineering, the most significant works are [6] and [32]. [6] has
inspired our software engineering approach, but it differs from our work since it
addresses neither workflow applications nor protocol classification and selection
issues. [32] is complementary to ours. It deals with protocol engineering issues,
focusing particularly on the notions of protocol compatibility, equivalence and
replaceability. In fact, this work aims at defining a protocol algebra which could be
very useful to our PMS. At design time, it can be used to establish links between
protocols, while, at run time, these links can be used by the Protocol Selection Agent.
Finally, we must also mention works that address both loose IOW coordination and
protocol engineering issues ([33], [34]). [33] isolates protocols and represents them as
ontologies, but it only considers negotiation protocols in e-commerce. [34] considers
interaction protocols as design abstractions for business processes, provides an
ontology of protocols and investigates protocol composition issues. [34] differs from
our work since it only focuses on description and composition aspects. Finally,
neither of these works ([33], [34]) proposes means for classifying, retrieving and
executing protocols, nor addresses architectural issues as we did through the PMS
definition.
Regarding future work, we plan to complete the implementation of the PMS. The
current implementation is limited to the Matchmaker protocol, and so we intend to
design and implement other coordination protocols (broker, argumentation, heuristic).
We also believe that an adequate combination of this work and the comparison
References
[1] W. van der Aalst: Inter-Organizational Workflows: An Approach Based on
Message Sequence Charts and Petri Nets. Systems Analysis, Modeling and
Simulation, 34(3), 1999, pp. 335-367.
[2] M. Divitini, C. Hanachi, C. Sibertin-Blanc: Inter-Organizational Workflows for
Enterprise Coordination. In: A. Omicini, F. Zambonelli, M. Klusch, and R.
Tolksdorf (eds): Coordination of Internet Agents, Springer-Verlag, 2001, pp.
46-77.
[3] F. Casati, A. Discenza: Supporting Workflow Cooperation within and across
Organizations. 15th Int. Symposium on Applied Computing, Como, Italy, March
2000, pp. 196-202.
[4] P. Grefen, K. Aberer, Y. Hoffner, H. Ludwig: CrossFlow: Cross-Organizational
Workflow Management in Dynamic Virtual Enterprises. Computer Systems
Science and Engineering, 15(5), 2000, pp. 277-290.
[5] O. Perrin, F. Wynen, J. Bitcheva, C. Godart: A Model to Support Collaborative
Work in Virtual Enterprises. 1st Int. Conference on Business Process
Management, Eindhoven, The Netherlands, June 2003, pp 104-119.
[6] C. Hanachi, C. Sibertin-Blanc: Protocol Moderators as active Middle-Agents in
Multi-Agent Systems. Autonomous Agents and Multi-Agent Systems, 8(3),
March 2004, pp. 131-164.
[7] C. Ghezzi, M. Jazayeri, D. Mandrioli, Fundamentals of Software Engineering.
Prentice-Hall International, 1991.
[8] W. van der Aalst: The Application of Petri-Nets to Workflow Management.
Circuits, Systems and Computers, 8(1), February 1998, pp. 21-66.
[9] M. Genesereth, S. Ketchpel: Software Agents. Communications of the ACM,
37(7), July 1994, pp. 48-53.
[10] E. Andonoff, L. Bouzguenda, C. Hanachi, C. Sibertin-Blanc: Finding Partners in
the Coordination of Loose Inter-Organizational Workflow. 6th Int. Conference
on the Design of Cooperative Systems, Hyères, France, May 2004, pp. 147-162.
[11] N. Jennings, P. Faratin, A. Lomuscio, S. Parsons, C. Sierra, M. Wooldridge:
Automated Negotiation: Prospects, Methods and Challenges. Group Decision
and Negotiation, 10 (2), 2001, pp. 199-215.
[12] J. Ferber, O. Gutknecht: A Meta-Model for the Analysis and Design of
Organizations in Multi-Agent Systems, 3rd Int. Conference on Multi-Agents
Systems, Paris, France, July 1998, pp. 128-135.
[13] The Workflow Management Coalition, The Workflow Reference Model.
Technical Report WfMC-TC-1003, November 1994.
[14] L. Zeng, A. Ngu, B. Benatallah, M. O'Dell: An Agent-Based Approach for
Supporting Cross-Enterprise Workflows. 12th Australian Database Conference,
Bond, Australia, February 2001, pp. 123-130.
[15] E. Andonoff, L. Bouzguenda: Agent-Based Negotiation between Partners in
Loose Inter-Organizational Workflow. 5th Int. Conference on Intelligent Agent
Technology, Compiègne, France, September 2005, pp. 619-625.
[16] P. Buhler, J. Vidal: Towards Adaptive Workflow Enactment Using Multi Agent
Systems. Information Technology and Management, 6(1), 2005, pp. 61-87.
Abstract
As XML technologies have become a standard for data representation, it is
necessary to propose and implement efficient techniques for managing XML
data. A natural alternative is to exploit the tools and functions offered by
(object-)relational database systems. Unfortunately, this approach has many
objectors, especially due to the inefficiency caused by structural differences between
XML data and relations. On the other hand, (object-)relational databases have
a long theoretical and practical history and represent a mature technology, i.e.
they can offer properties that no native XML database can offer yet. In this
paper we study techniques which enable improving XML processing based on
relational databases, the so-called adaptive or flexible mapping methods. We provide
an overview of existing approaches, we classify their main features, and we sum up
the most important findings and characteristics. Finally, we discuss possible
improvements and corresponding key problems.
Keywords: XML-to-relational mapping, state of the art, adaptability, relational databases
Introduction
Without any doubt, XML [9] is currently one of the most popular formats
for data representation. It is well-defined, easy to use, and accompanied by various
recommendations such as languages for structural specification, transformation,
querying, updating, etc. This popularity has invoked an enormous endeavor to
propose more efficient methods and tools for managing and processing XML data.
The four most popular approaches are methods which store XML data in a file
system, methods which store and process XML data using an (object-)relational
database system, methods which exploit a pure object-oriented approach, and
native methods that use special indices, numbering schemes [17], and/or data
structures [12] proposed particularly for the tree structure of XML data.
Naturally, each of the approaches has both keen advocates and objectors. The
situation is especially unfavorable for file system-based and object-oriented methods.
The former suffer from the inability to query the data without any additional
1 This work was supported in part by Czech Science Foundation (GACR), grant number
201/06/0756.
preprocessing of the data, whereas the latter approach fails especially in the lack
of a corresponding efficient and comprehensive tool. The highest-performance
techniques are the native ones, since they are proposed particularly for XML
processing and do not need to artificially adapt existing structures to a new
purpose. But the most practically used ones are methods which exploit features
of (object-)relational databases. The reason is that these are still regarded as
universal data processing tools, and their long theoretical and practical history
can guarantee a reasonable level of reliability. Contrary to native methods, it is
not necessary to start from scratch; we can rely on a mature and verified
technology, i.e. properties that no native XML database can offer yet.
Under closer investigation the database-based2 methods can be further classified
and analyzed [19]. We usually distinguish generic methods which store
XML data regardless of the existence of a corresponding XML schema (e.g. [10] [16]),
schema-driven methods based on structural information from an existing schema
of XML documents (e.g. [26] [18]), and user-defined methods which leave all the
storage decisions in the hands of future users (e.g. [2] [1]).
Techniques of the first type usually view an XML document as a directed
labelled tree with several types of nodes. We can further distinguish generic
techniques which purely store components of the tree and their mutual
relationships [10] and techniques which store additional structural information,
usually using a kind of numbering schema [16]. Such a schema enables speeding
up certain types of queries, but usually at the cost of inefficient data updates.
The fact that these techniques do not exploit a possibly existing XML schema can
be regarded as both an advantage and a disadvantage. On the one hand they do not
depend on its existence but, on the other hand, they cannot exploit the additional
structural information. Together with the finding that a significant portion of real
XML documents (52% [5] of randomly crawled or 7.4% [20] of semi-automatically
collected3) have no schema at all, they seem to be the most practical choice.
By contrast, schema-driven methods have the contradictory (dis)advantages. The
situation is even worse for methods which are based particularly on XML Schema
[28] [7] definitions (XSDs) and focus on their special features [18]. As can be
expected, XSDs are used even less (only for 0.09% [5] of randomly crawled or
38% [20] of semi-automatically collected XML documents) and even if they are
used, they often (in 85% of cases [6]) define so-called local tree grammars [22],
i.e. languages that can be defined using DTD [9] as well. The most exploited
non-DTD features are usually simple types [6], whose lack in DTD is crucial,
but which have only a side optimization effect for XML data processing.
Another problem of purely schema-driven methods is that the information XML
schemes provide is not satisfactory. Analysis of both XML documents and
XML schemes together [20] shows that XML schemes are too general. Typical
examples are recursion or the * operator, which allow theoretically infinitely
deep or wide XML documents. Naturally, XML schemes also cannot provide
any information about, e.g., the retrieval frequency of an element / attribute or the
way they are retrieved. Thus not only XML schemes but also the corresponding
2 In the rest of the paper the term database represents an (object-)relational database.
3 Data collected with interference of a human operator who removes damaged, artificial,
too simple, or otherwise useless XML data.
XML documents and XML queries need to be taken into account to get an overall
notion of the demanded XML-processing application.
The last mentioned type of approach, i.e. the user-defined one, is a bit different.
It does not involve methods for automatic database storage but rather tools
for specification of the target database schema and the required XML-to-relational
mapping. It is commonly offered by most known (object-)relational database
systems [3] as a feature that enables users to define what suits them most instead
of being restricted by the disadvantages of a particular technique. Nevertheless, the
key problem is evident: it assumes that the user is skilled in both database and
XML technologies.
Apparently, the advantages of all three approaches are closely related to the
particular situation. Thus it is advisable to propose a method which is able to
exploit the current situation or at least to conform to it. If we analyze
database-based methods more deeply, we can distinguish so-called flexible or adaptive
methods (e.g. [13] [25] [29] [31]). They take into account a given sample set of
XML data and/or XML queries which specify the future usage and adapt the
resulting database schema to them. Such techniques naturally have better
performance results than the fixed ones (e.g. [10] [16] [26] [18]), i.e. methods which
use a pre-defined set of mapping rules and heuristics regardless of the intended
future usage. Nevertheless, they also have a great disadvantage: the target
database schema is adapted only once. Thus if the expected usage changes,
the efficiency of such techniques can become even worse than in the corresponding
fixed case. Consequently, the adaptability needs to be dynamic.
The idea of adapting a technique to a sample set of data is closely related to
analyses of typical features of real XML documents [20]. If we combine the
two ideas, we can assume that a method which focuses especially on common
features will be more efficient than a general one. A similar observation is
already exploited, e.g., in techniques which represent XML documents as a set
of points in multidimensional space [14]. The efficiency of such techniques depends
strongly on the depth of XML documents or the number of distinct paths.
Fortunately, XML analyses confirm that real XML documents are surprisingly
shallow: the average depth does not exceed 10 levels [5] [20].
Considering all the mentioned points, an adaptive enhancement of XML-processing
methods focusing on given or typical situations seems to be a promising type of
improvement. In this paper we study adaptive techniques from various points of
view. We provide an overview of existing approaches, we classify them and their
main features, and we sum up the most important findings and characteristics.
Finally, we discuss possible improvements and corresponding key problems. The
analysis should serve as a starting point for a proposal of an enhancement of
existing adaptive methods as well as of an unprecedented approach. Thus we also
discuss possible improvements of the weak points of existing methods and
solutions to the stated open problems.
The paper is structured as follows: Section 2 contains a brief introduction
to the formalism used throughout the paper. Section 3 describes and classifies the
existing related works, both practical and theoretical, and Section 4 sums up
their main characteristics. Section 5 discusses possible ways of improving
the recent approaches and, finally, the sixth section provides conclusions.
Existing Approaches
Up to now only a few papers have focused on a proposal of an adaptive
database-based XML-processing method. We distinguish two main directions:
cost-driven and user-driven. Techniques of the former group can choose the most
efficient XML-to-relational storage strategy automatically. They usually evaluate
a subset of possible mappings and choose the best one according to a given
sample of XML data, query workload, etc. The main advantage is expressed by
the adverb automatically, i.e. without necessary or undesirable user interference.
By contrast, techniques of the latter group also support several storage
strategies, but the final decision is left in the hands of users. We distinguish these
techniques from the user-defined ones, since their approach is slightly different:
By default they offer a fixed mapping, but users can influence the mapping process
by annotating fragments of the input XML schema with demanded storage
strategies. Similarly to the user-defined techniques, this approach also assumes
a skilled user, but most of the work is done by the system itself. The user is
expected to help the mapping process, not to perform it.
3.1 Cost-Driven Techniques
As mentioned above, cost-driven techniques can choose the best storage strategy
for a particular application automatically, without any interference of a user.
Thus the user can influence the mapping process only through the provided
XML schema, the set of sample XML documents or data statistics, the set of XML
queries and possibly their weights, etc.
Each of the techniques can be characterized by the following five features:
1. an initial XML schema Sinit,
2. a set of XML schema transformations T = {t1, t2, ..., tn}, where ∀i: ti
transforms a given schema S into a schema Si,
3. a fixed XML-to-relational mapping function fmap which transforms a given
XML schema S into a relational schema R,
4. a set of sample data Dsample characterizing the future application, which
usually consists of a set of XML documents {d1, d2, .., dk} valid against
Sinit, and a set of XML queries {q1, q2, .., ql} over Sinit, possibly with
corresponding weights {w1, w2, .., wl}, ∀i: wi ∈ ⟨0, 1⟩, and
5. a cost function fcost which evaluates the cost of a given relational schema
R with regard to the set Dsample.
The required result is an optimal relational schema Ropt, i.e. a schema where
fcost(Ropt, Dsample) is minimal.
A naive but illustrative cost-driven storage strategy based on the idea of
brute force is depicted by Algorithm 1. It first generates a set of
possible XML schemes S using transformations from the set T, starting from
the initial schema Sinit (lines 1–4). Then it searches for the schema s ∈ S with
minimal cost fcost(fmap(s), Dsample) (lines 5–12) and returns the corresponding
optimal relational schema Ropt = fmap(s).
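As a sketch, the brute-force strategy of Algorithm 1 might look as follows in Python. All names are illustrative: schemas, transformations, and the cost model are abstracted as plain values and callables, schemas are assumed hashable, and the (possibly infinite) search space is bounded by a depth limit we introduce here.

```python
def brute_force_search(s_init, transformations, f_map, d_sample, f_cost, max_depth=3):
    """Naive brute-force search (sketch of Algorithm 1): enumerate schemas
    reachable from s_init via the transformation set T, then keep the one
    whose mapped relational schema has minimal cost."""
    # Phase 1 (lines 1-4): generate the space of candidate schemas.
    schemas = {s_init}
    frontier = {s_init}
    for _ in range(max_depth):  # bounded, since the full space may be infinite
        frontier = {t(s) for s in frontier for t in transformations} - schemas
        schemas |= frontier
    # Phase 2 (lines 5-12): pick the schema with the minimal cost.
    s_opt = min(schemas, key=lambda s: f_cost(f_map(s), d_sample))
    return f_map(s_opt)
```

For example, with toy integer "schemas", two transformations, an identity mapping and the cost |r - 5|, the search returns the candidate closest to 5.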
(c) v does not have a parent node which would satisfy the conditions.
4. Each fragment f ∈ G1 which consists of a previously identified node v
and its descendants is replaced with an attribute node having the XML
data type, resulting in a schema graph G2.
5. G2 is mapped to a relational schema using a fixed mapping strategy.
The measure of significance σv of a node v is defined as:

σv = (1/2)·Sv + (1/4)·Dv + (1/4)·Qv = (1/2)·Sv + (1/4)·card(Dv)/card(D) + (1/4)·card(Qv)/card(Q)    (1)

where Sv is derived from the DTD structure as a combination of weights expressing
the position of v in the graph and the complexity of its content model (see
[13]), D ⊆ Dsample is the set of all given documents, Dv ⊆ D is the set of
documents containing v, Q ⊆ Dsample is the set of all given queries, and Qv ⊆ Q is
the set of queries containing v.
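A minimal sketch of computing this measure, assuming documents and queries are represented simply as sets of the nodes they contain; the structural weight Sv is taken as given (its derivation from the DTD is described in [13]):

```python
def significance(s_v, docs, queries, v):
    """Significance measure sigma_v from equation (1) (a sketch).
    s_v: structural weight of node v, assumed precomputed per [13];
    docs, queries: sample sets, each member being a set of node names,
    so membership stands in for "document/query contains v"."""
    d_v = sum(1 for d in docs if v in d)     # card(D_v)
    q_v = sum(1 for q in queries if v in q)  # card(Q_v)
    return 0.5 * s_v + 0.25 * d_v / len(docs) + 0.25 * q_v / len(queries)
```

A node that appears in every sample document and every sample query thus gets the full data and query terms, so its significance is 0.5·Sv + 0.5.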
As we can see, the algorithm optimizes the naive approach mainly by the
facts that the schema graph is preprocessed, i.e. σv is determined for each v ∈ V1,
that the set of transformations T is a singleton, and that the transformation is
performed only if the current node satisfies the above mentioned conditions (a)–(c).
Obviously, the preprocessing ensures that the complexity of the search
algorithm is given by K1·card(V1) + K2·card(E1), where K1, K2 ∈ ℕ. On
the other hand, the optimization is too restrictive in terms of the amount of
possible XML-to-relational mappings.
3.1.2 FlexMap Mapping
Note that except for commutativity and simplifying unions, the transformations
generate equivalent schemas in terms of equivalence of the sets of document
instances. Commutativity does not retain the order of the schema, whereas
simplifying unions generates a more general schema, i.e. a schema with a larger
set of document instances. (However, only inlining and outlining were
implemented and experimentally tested in the FlexMap system.)
The fixed mapping again uses a strategy similar to the Hybrid algorithm, but it
is applied locally on each fragment of the schema specified by the transformation
rules stated by the search algorithm. For example, elements determined to be
outlined are not inlined even though the traditional Hybrid algorithm would do so.
The process of evaluating fcost is significantly optimized. A naive approach
would require construction of a particular relational schema, loading sample
XML data into the relations, and cost analysis of the resulting relational
structures. The FlexMap evaluation exploits StatiX [11], an XML Schema-aware
statistics framework which analyzes the structure of a given XSD and XML
documents and computes their statistical summary, which is then mapped to
relational statistics with regard to the fixed XML-to-relational mapping. Together
with the sample query workload, these are used as input for a classical relational
optimizer which estimates the resulting cost. Thus no relational schema has to
be constructed, and as the statistics are updated at each XML-to-XML
transformation, the XML documents need to be processed only once.
3.1.3
The following method, which is also based on the idea of searching a space of
possible mappings, is presented in [29] as an Adjustable and adaptable method
(AAM). In this case the authors adapt the given problem to the features of genetic
algorithms. It is also the first paper to mention that the problem of finding
a relational schema R for a given set of XML documents and queries Dsample,
s.t. fcost(R, Dsample) is minimal, is NP-hard in the size of the data.
The set T of XML-to-XML transformations consists of inlining and outlining
of subelements. For the purpose of the genetic algorithm, each transformed
schema is represented using a bit string, where each bit corresponds to an edge
of the schema graph and is set to 1 if the element the edge points to is
stored in a separate table, or 0 if the element the edge points to is stored in the
parent table. The bits set to 1 represent borders among fragments, whereas
each fragment is stored in one table corresponding to the so-called Universal table
[10]. The extreme instances correspond to one table for the whole schema (in
the case of the 00...0 bit string), resulting in many null values, and one table per
element (in the case of the 11...1 bit string), resulting in many join operations.
Similarly to the previous strategy, the algorithm chooses only the best possible
continuation at each iteration. The algorithm consists of the following steps:
1. The initial population P0 (i.e. the set of bit strings) is generated randomly.
2. The following steps are repeated until the terminating conditions are met:
(a) Each member of the current population Pi is evaluated and only the
best representatives are selected for further production.
(b) The next generation Pi+1 is produced by the genetic operators crossover,
mutation, and propagate.
The algorithm terminates either after a certain number of transformations or
if a good-enough schema is achieved.
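The loop above can be sketched as a simple genetic search over bit strings. The operator details (one-point crossover, per-bit mutation, elitist propagation) and all parameters below are our illustrative choices, not necessarily those of [29]:

```python
import random

def genetic_search(n_edges, fitness, pop_size=20, generations=50, keep=5, p_mut=0.1):
    """Sketch of a genetic search over AAM-style bit strings: one bit per
    schema-graph edge (1 = separate table, 0 = inlined into the parent).
    Lower fitness = lower estimated cost = better mapping."""
    rnd = random.Random(42)  # fixed seed for reproducibility of the sketch
    population = [[rnd.randint(0, 1) for _ in range(n_edges)] for _ in range(pop_size)]
    for _ in range(generations):
        # (a) evaluate and keep only the best representatives
        population.sort(key=fitness)
        parents = population[:keep]
        # (b) next generation via crossover, mutation, and propagation
        next_gen = [p[:] for p in parents]  # propagate elites unchanged
        while len(next_gen) < pop_size:
            a, b = rnd.sample(parents, 2)
            cut = rnd.randrange(1, n_edges)      # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_edges):             # per-bit mutation
                if rnd.random() < p_mut:
                    child[i] ^= 1
            next_gen.append(child)
        population = next_gen
    return min(population, key=fitness)
```

With a toy fitness such as the number of set bits, the search converges toward the all-zero string, i.e. the fully inlined mapping.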
The cost function fcost is expressed as:

fcost(R, Dsample) = fM(R, Dsample) + fQ(R, Dsample) = Σ_{l=1..n} Cl·Rl + ( Σ_{i=1..m} Si·PSi + Σ_{k=1..q} Jk·PJk )    (2)
3.1.4
The last but not least cost-driven adaptable representative can be found in
paper [31]. The approach is again based on a greedy type of algorithm, in this
case a hill climbing strategy, depicted by Algorithm 3.
Algorithm 3 Hill Climbing Algorithm
Input: Sinit, T, fmap, Dsample, fcost
Output: Ropt
1: Sopt ← Sinit; Ropt ← fmap(Sopt); costopt ← fcost(Ropt, Dsample)
2: Ttmp ← T
3: while Ttmp ≠ ∅ do
4:   t ← any member of Ttmp
5:   Ttmp ← Ttmp \ {t}
6:   Stmp ← t(Sopt)
7:   costtmp ← fcost(fmap(Stmp), Dsample)
8:   if costtmp < costopt then
9:     Sopt ← Stmp; Ropt ← fmap(Stmp); costopt ← costtmp
10:    Ttmp ← T
11:  end if
12: end while
13: return Ropt
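A direct Python transcription of Algorithm 3 might read as follows (a sketch: schemas, transformations, and the cost model are abstracted as plain values and callables):

```python
def hill_climbing(s_init, transformations, f_map, d_sample, f_cost):
    """Sketch of Algorithm 3: take the first transformation that improves
    the current optimum and then reconsider the whole transformation set."""
    s_opt = s_init
    r_opt = f_map(s_opt)
    cost_opt = f_cost(r_opt, d_sample)
    pending = set(transformations)           # T_tmp
    while pending:
        t = pending.pop()                    # any member of T_tmp
        s_tmp = t(s_opt)
        cost_tmp = f_cost(f_map(s_tmp), d_sample)
        if cost_tmp < cost_opt:              # first improving move is taken
            s_opt, r_opt, cost_opt = s_tmp, f_map(s_tmp), cost_tmp
            pending = set(transformations)   # reset T_tmp to the full set T
    return r_opt
```

For instance, with a single increment transformation on toy integer schemas and the cost |r - 5|, the climb from 1 stops exactly at 5.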
As we can see, the hill climbing strategy differs from the simple greedy strategy
depicted in Algorithm 2 in the way it chooses the appropriate transformation
t ∈ T. In the previous case the least expensive transformation that can reduce
the current (sub)optimum is chosen; in this case it is the first such transformation
found. The schema transformations are based on the idea of vertical (V)
or horizontal (H) cutting and merging of the given XML schema fragment(s). The
set T consists of the following four types of (pairwise inverse) operations:
V-Cut(f, (u,v)) cuts fragment f into fragments f1 and f2, s.t. f1 ∪ f2 = f,
where (u, v) is an edge from f1 to f2, i.e. u ∈ f1 and v ∈ f2
V-Merge(f1, f2) merges fragments f1 and f2 into fragment f = f1 ∪ f2
H-Cut(f, (u,v)) splits fragment f into twin fragments f1 and f2 horizontally
from edge (u, v), where u ∉ f and v ∈ f, s.t. ext(f1) ∪ ext(f2) = ext(f)
and ext(f1) ∩ ext(f2) = ∅
H-Merge(f1, f2) merges two twin fragments f1 and f2 into one fragment
f s.t. ext(f1) ∪ ext(f2) = ext(f)
As we can observe, the V-Cut and V-Merge operations are similar to outlining
and inlining of the fragment f2 out of or into the fragment f1. Conversely, the
H-Cut operation corresponds to the splitting of elements used in FlexMap mapping,
i.e. duplication of the shared part, and the H-Merge operation corresponds to the
inverse merging of elements.
The cost function fcost is expressed as:

fcost(R, Dsample) = Σ_{i=1..n} wi·cost(Qi, R)    (3)

where Dsample consists of a sample set of XML documents and a given query
workload {(Qi, wi)}_{i=1,2,...,n}, where Qi is an XML query and wi is its weight.
The cost function cost(Qi, R) for a query Qi which accesses the fragment set {fi1,
..., fim} is expressed as:

cost(Qi, R) = |fi1| for m = 1, and ... for m > 1    (4)

where fij and fik, j ≠ k, are two joined fragments, |Eij| is the number of elements
in ext(fij), and Selij is the selectivity of the path from the root to fij, estimated
using a Markov table. In other words, the formula simulates the cost of joining
the relations corresponding to fragments fij and fik.
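Equation (3) amounts to a weighted sum over the sample workload; a trivial sketch, with the per-query cost model of equation (4) abstracted as a callable:

```python
def workload_cost(r, workload, cost):
    """Weighted workload cost from equation (3): sum of w_i * cost(Q_i, R).
    workload: iterable of (query, weight) pairs; cost: any per-query cost
    model, e.g. the fragment-join estimate of equation (4)."""
    return sum(w * cost(q, r) for q, w in workload)
```

For example, with a dummy cost model that charges one unit per character of the query text, a workload of two weighted queries simply yields the weighted character total.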
The authors further analyze the influence of the choice of the initial schema Sinit
on the efficiency of the search algorithm. They use three types of initial schema
decompositions, leading to the Binary [10], Shared, or Hybrid [26] mapping. The
paper concludes with the finding that a good choice of the initial schema is
crucial and can lead to faster searches for the suboptimal mapping.
3.2 User-Driven Techniques
3.2.1 MDF
Probably the first approach which faces the mentioned issues is proposed in
paper [8] as a Mapping definition framework (MDF). It allows users to specify
the required mapping, checks its correctness and completeness, and completes
possible incompleteness. The mapping specifications are made by annotating
the input XSD with a predefined set of attributes A listed in Table 1.
Attribute       | Target                              | Value                | Function
outline         | attribute or element                | true, false          |
tablename       | attribute, element, or group        | string               |
columnname      | attribute, element, or simple type  | string               |
sqltype         | attribute, element, or simple type  | string               |
structurescheme | root element                        | KFO, Interval, Dewey |
edgemapping     | element                             | true, false          | If the value is true, the element and all its subelements are mapped using Edge mapping.
maptoclob       | attribute or element                | true, false          |

The structurescheme values denote the following strategies:
Key, Foreign Key, and Ordinal Strategy (KFO): each node is assigned
a unique integer ID and a foreign key pointing to the parent ID; the sibling
order is captured using an ordinal value
Interval Encoding: a unique {start, end} interval is assigned to each
node, corresponding to preorder and postorder traversal entering times
Dewey Decimal Classification: each node is assigned a path to the root
node, described using a concatenation of node IDs along the path
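The three structure schemes can be illustrated on a toy element tree. The following sketch assigns all three labelings in one traversal; the nested-tuple tree representation and field names are ours for illustration, not MDF's:

```python
from itertools import count

def label_tree(node, ids=None, clock=None, parent=None, ordinal=1, dewey="1"):
    """Sketch of the three structure schemes on a (tag, children) tree:
    KFO = (unique id, parent id, sibling ordinal); Interval = (start, end)
    preorder/postorder entering times; Dewey = dot-separated root path."""
    ids = ids or count(1)      # shared counter for unique node IDs
    clock = clock or count(1)  # shared traversal clock for intervals
    tag, children = node
    node_id, start = next(ids), next(clock)
    rows = [{"tag": tag, "kfo": (node_id, parent, ordinal),
             "interval": None, "dewey": dewey}]
    for i, child in enumerate(children, 1):
        rows += label_tree(child, ids, clock, node_id, i, f"{dewey}.{i}")
    rows[0]["interval"] = (start, next(clock))  # end time after the subtree
    return rows
```

On the tree a(b, c(d)), for instance, the root gets interval (1, 8), which properly contains the intervals of all its descendants, while d gets the Dewey label "1.2.1".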
Attributes for specifying the names of tables or columns and the data types of
columns can be considered side effects. Parts that are not annotated are stored
using user-predefined rules, but such a mapping is always a fixed one.
3.2.2 XCacheDB System
Annotation | Value  | Function
INLINE     |        |
TABLE      |        |
STORE BLOB |        |
BLOB ONLY  |        |
RENAME     | string |
DATATYPE   | string | The value specifies the data type of the corresponding column created for node v.
3.3 Theoretic Issues
Besides proposals of cost-driven and user-driven techniques, there are also papers
which discuss the corresponding open issues on a theoretical level.
3.3.1 Data Redundancy
As mentioned above, the XCacheDB system allows a certain degree of redundancy,
in particular duplication in BLOB columns and the violation of the BCNF
or 3NF condition. Paper [4] also discusses the strategy on a theoretical level
and defines four classes of XML schema decompositions. Before we state the
definitions, we have to note that the approach is based on a slightly different
3.3.2 Grouping Problem
Paper [15] deals with the idea that searching for a (sub)optimal relational
decomposition is not only related to the given XML schema, query workload, and
XML data, but is also highly influenced by the chosen query translation algorithm
and the cost model. For theoretical purposes a subset of the problem,
the so-called grouping problem, is considered. It deals with possible storage
strategies for shared subelements, i.e. either into one common table (the so-called
fully grouped strategy) or into separate tables (the so-called fully partitioned strategy).
For the analysis of its complexity the authors define two simple cost metrics:
RelCount: the cost of a relational query is the number of relation instances
in the query expression
RelSize: the cost of a relational query is the sum of the numbers of tuples
in the relation instances in the query expression
and three query translation algorithms:
Naive Translation: performs a join between the relations corresponding
to all the elements appearing in the query; a wild-card query is converted
into a union of several queries, one for each satisfying wild-card substitution
Single Scan: a separate relational query is issued for each leaf element,
joining all relations on the path until the least common ancestor of all
the leaf elements is reached
Multiple Scan: the Single Scan algorithm is applied to each relation
containing a part of the result, and the resulting query consists of the
union of the partial queries
On a simple example the authors show that for a wild-card query Q which
retrieves a shared fragment f, the fully partitioned strategy performs better with
the Naive Translation algorithm, whereas the fully grouped strategy performs
better with the Multiple Scan algorithm. Furthermore, they illustrate that the
reliability of the chosen cost model is also closely related to the query translation
strategy. If a query contains a not very selective predicate, the optimizer may
choose a plan that scans the corresponding relations, and thus RelSize is a good
corresponding metric. On the other hand, in the case of a highly selective predicate
the optimizer may choose an index lookup plan, and thus RelCount is a good metric.
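A toy sketch of the two metrics, representing a relational query expression simply as the list of relation instances it mentions (each occurrence counts separately), with table cardinalities supplied in a dictionary:

```python
def rel_count(plan):
    """RelCount metric: the number of relation instances in the expression."""
    return len(plan)

def rel_size(plan, sizes):
    """RelSize metric: the total number of tuples across the relation
    instances in the expression; sizes maps relation name -> tuple count."""
    return sum(sizes[r] for r in plan)
```

A query joining R twice with S thus has RelCount 3 regardless of table sizes, while its RelSize is dominated by the larger relation, which is exactly why the two metrics can rank storage strategies differently.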
Summary
We can sum up the state of the art of the adaptability of database-based
XML-processing methods in the following natural but important findings:
1. As the storage strategy has a crucial impact on query-processing performance,
a fixed mapping based on predefined rules and heuristics is not
universally efficient.
2. It is not an easy task to choose an optimal mapping strategy for a particular
application, and thus it is not advisable to rely only on users' experience
and intuition.
3. As the space of possible XML-to-relational mappings is very large (usually
theoretically infinite) and most of the subproblems are even NP-hard, an
exhaustive search is often impossible. It is necessary to define search
heuristics, approximation algorithms, and/or reliable terminal conditions.
4. The choice of the initial schema can strongly influence the efficiency of the
search algorithm. It is reasonable to start with an at least locally good
schema.
5. A strategy for finding a (sub)optimal XML schema should take into account
not only the given schema, query workload, and XML data statistics, but
also the possible query translations, cost metrics, and their consequences.
6. Cost evaluation of a particular XML-to-relational mapping should not
involve time-consuming construction of the relational schema, loading of XML
data, and analysis of the resulting relational structures. It can be optimized
using cost estimation of XML queries, XML data statistics, etc.
7. Despite the previous claim, the user should be allowed to influence the
mapping strategy. On the other hand, the approach should not demand
a full schema specification but should complete the user-given hints.
8. Even though a storage strategy is able to adapt to a given sample of
schemes, data, queries, etc., its efficiency is still endangered by later
changes of the expected usage.
Open Issues
Although each of the existing approaches brings certain interesting ideas and
optimizations, there is still room for future improvements of the adaptable
methods. We describe and discuss them in this section, starting from (in
our opinion) the least complex ones.
Missing Input Data As we already know, for cost-driven techniques there
are three types of input data: an XML schema Sinit, a set of XML documents
{d1, d2, .., dk}, and a set of XML queries {q1, q2, .., ql}. The problem of a missing
schema Sinit was already outlined in the introduction in connection with the
(dis)advantages of generic and schema-driven methods. As we assume that
adaptability is the ability to adapt to the given situation, a method which does
not depend on the existence of an XML schema but can exploit the information if
it is given is probably a natural first improvement. This idea is also strongly
related to the mentioned problem of the choice of a locally good initial schema Sinit.
The corresponding questions are: Can the user-given schema be considered a
good candidate for Sinit? How can we find a possible better candidate? Can
we find such a candidate for schema-less XML documents? A possible solution
can be found in the exploitation of methods for automatic construction of an XML
schema for a given set of XML documents (e.g. [21] [23]). Assuming that
documents are more precise sources of structural information, we can expect
that a schema generated on their basis will have better characteristics too.
On the other hand, the problem of missing input XML documents can be at
least partly solved using reasonable default settings based on general analyses
of real XML data (e.g. [5] [20]). Furthermore, the surveys show that real XML
data are surprisingly simple, and thus the default mapping strategy does not have
to be complex either. It should rather focus on efficient processing of frequently
used XML patterns.
Finally, the presence of a sample query workload is crucial, since (to our
knowledge) there are no analyses of real XML queries, i.e. no source of information
for default settings. The reason is that collecting such real representatives is not
as straightforward as in the case of XML documents. Currently the best sources of
XML queries are XML benchmarking projects (e.g. [24] [30]), but as the data
and especially the queries are supposed to be used for rating the performance of a
system in various situations, they cannot be considered an example of a real
workload. Naturally, the query statistics can be gathered by the system itself
and the schema can be adapted continuously, as discussed later in the text.
Efficient Solution of Subproblems A surprising fact we have encountered is the number of simplifications in the chosen solutions. As mentioned, some of the techniques omit, e.g., the ordering of elements, mixed contents, or recursion. This finding is somewhat puzzling given that there are proposals for efficient processing of these XML constructs (e.g. [27]) and that adaptive methods should cope with various situations.
A similar observation can be made for user-driven methods. Though the proposed systems are able to store schema fragments in various ways, the default strategy for non-annotated parts of the schema is again a fixed one. An interesting optimization would be to join the two ideas and search for a (sub)optimal mapping of the non-annotated parts using a cost-driven method.
Deeper Exploitation of Information Another open issue is a possible deeper exploitation of the information given by the user. We can identify two main questions: How can the user-given information be better exploited? Is there any other information a user can provide to increase the efficiency?
A possible answer can be found in the idea of pattern matching, i.e. using the user-given schema annotations as hints on how to store particular XML patterns. We can naturally predict that structurally similar fragments should be stored similarly, and thus focus on finding such fragments in the rest of the schema.
The main problem is how to identify the structurally similar fragments. Considering the variety of XML-to-XML transformations, two structurally identical fragments can be expressed using, at first glance, quite different regular expressions. It is therefore necessary to propose particular levels of equivalence of XML schema fragments and algorithms to determine them. Last but not least, such a system should focus on the scalability of the similarity metric and particularly on its reasonable default setting (based on existing analyses of real-world data).
Theoretical Analysis of the Problem As the overview shows, there are various types of XML-to-XML transformations, and the ones mentioned certainly do not cover the whole space of possibilities. Unfortunately, there seems to be no theoretical study of these transformations, their key characteristics, and possible classifications. Such a study could, among other things, focus on equivalent and generalizing transformations and thus serve as a good basis for the pattern-matching strategy. Especially interesting is the question of NP-hardness in connection with the set of allowed transformations and its complexity (similarly to paper [15], which analyzes the theoretical complexity of combinations of cost metrics and query translation algorithms). Such a survey would provide useful information especially for optimizing the search algorithm.
Dynamic Adaptability The last, but not least, issue is connected with the most striking disadvantage of adaptive methods: the problem of possible changes of XML queries or XML data, which can lead to a crucial worsening of the efficiency. As mentioned above, it is also related to the problem of missing input XML queries and ways to gather them. The question of changes of XML data also opens another wide research area, the updatability of the stored data, a feature that is often omitted in current approaches although its importance is crucial.
The solution to these issues, i.e. a system that is able to adapt dynamically, is obvious and challenging, but it is not an easy task. Such a system should in particular avoid total reconstructions of the whole relational schema and the corresponding reinsertion of all the stored data, or such an operation should be performed only in very special cases. On the other hand, this brute-force approach can serve as an inspiration. Supposing that the changes, especially in the case of XML queries, will not be radical, the modifications of the relational schema will be mostly local and
Conclusion
The main goal of this paper was to describe and discuss the current state of the art and the open issues of adaptability in database-based XML-processing methods. First, we stated the reasons why this topic should be studied at all. Then we provided an overview and classification of the existing approaches and summed up the key findings. Finally, we discussed the corresponding open issues and their possible solutions. Our aim was to show that the idea of processing XML data using relational databases is still up to date and should be developed further. The overview shows that even though there are interesting and inspiring approaches, there is still a variety of open problems whose solution can further improve database-based XML processing.
Our future work will naturally follow the open issues stated at the end of this paper and, in particular, investigate the solutions we have mentioned. First, we will focus on improving the user-driven techniques with an adaptive algorithm for non-annotated parts of the schema, together with a deeper exploitation of the user-given hints using pattern-matching methods, i.e. a hybrid user-driven cost-based system. Second, we will deal with the missing theoretical study of schema transformations, their classification, and particularly their influence on the complexity of the search algorithm. Finally, on the basis of the theoretical study and the hybrid system, we will study and experimentally analyze the dynamic enhancement of the system.
References
[1] DB2 XML Extender. IBM. http://www.ibm.com/.
[2] Oracle XML DB. Oracle Corporation. http://www.oracle.com/.
[3] S. Amer-Yahia. Storage Techniques and Mapping Schemas for XML. Technical Report TD-5P4L7B, AT&T Labs-Research, 2003.
[4] A. Balmin and Y. Papakonstantinou. Storing and Querying XML Data
Using Denormalized Relational Databases. The VLDB Journal, 14(1):30–49, 2005.
[5] D. Barbosa, L. Mignet, and P. Veltri. Studying the XML Web: Gathering
Statistics from an XML Sample. World Wide Web, 8(4):413–438, 2005.
[6] G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema:
a Practical Study. In WebDB04: Proc. of the 7th Int. Workshop on the
Web and Databases, pages 79–84, New York, NY, USA, 2004. ACM Press.
[7] P. V. Biron and A. Malhotra. XML Schema Part 2: Datatypes (Second
Edition). W3C, October 2004.
[8] F. Du, S. Amer-Yahia, and J. Freire. ShreX: Managing XML Documents
in Relational Databases. In VLDB04: Proc. of the 30th Int. Conf. on Very Large Data Bases, pages 1297–1300, Toronto, ON, Canada, 2004. Morgan
Kaufmann Publishers Inc.
[9] T. Bray et al. Extensible Markup Language (XML) 1.0 (Fourth Edition).
W3C, September 2006.
[10] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDBMS. IEEE Data Eng. Bull., 22(3):27–34, 1999.
[11] J. Freire, J. R. Haritsa, M. Ramanath, P. Roy, and J. Simeon. StatiX:
Making XML Count. In ACM SIGMOD02: Proc. of the 21st Int. Conf.
on Management of Data, pages 181–192, Madison, Wisconsin, USA, 2002.
ACM Press.
[12] T. Grust. Accelerating XPath Location Steps. In SIGMOD02: Proc. of
the ACM SIGMOD Int. Conf. on Management of Data, pages 109–120,
New York, NY, USA, 2002. ACM Press.
[13] M. Klettke and H. Meyer. XML and Object-Relational Database Systems – Enhancing Structural Mappings Based on Statistics. In Lecture Notes in Computer Science, volume 1997, pages 151–170, 2000.
[14] M. Kratky, J. Pokorny, and V. Snasel. Implementation of XPath Axes
in the Multi-Dimensional Approach to Indexing XML Data. In Proc. of
Current Trends in Database Technology – EDBT04 Workshops, pages 46–60, Heraklion, Crete, Greece, 2004. Springer.
[15] R. Krishnamurthy, V. Chakaravarthy, and J. Naughton. On the Difficulty
of Finding Optimal Relational Decompositions for XML Workloads: A
Complexity Theoretic Perspective. In ICDT03: Proc. of the 9th Int. Conf.
on Database Theory, pages 270–284, Siena, Italy, 2003. Springer.
[16] A. Kuckelberg and R. Krieger. Efficient Structure Oriented Storage of
XML Documents Using ORDBMS. In Proc. of the VLDB02 Workshop
EEXTT and CAiSE02 Workshop DTWeb, pages 131–143, London, UK,
2003. Springer-Verlag.
[17] Q. Li and B. Moon. Indexing and Querying XML Data for Regular Path
Expressions. In VLDB01: Proc. of the 27th Int. Conf. on Very Large Data
Bases, pages 361–370, San Francisco, CA, USA, 2001. Morgan Kaufmann
Publishers Inc.
[18] I. Mlynkova and J. Pokorny. From XML Schema to Object-Relational Database – an XML Schema-Driven Mapping Algorithm. In ICWI04: Proc. of IADIS Int. Conf. WWW/Internet, pages 115–122, Madrid, Spain, 2004. IADIS.
[19] I. Mlynkova and J. Pokorny. XML in the World of (Object-)Relational Database Systems. In ISD04: Proc. of the 13th Int. Conf. on Information Systems Development.
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
1 Introduction
XML has become the most significant standard for data exchange and publication over the Internet. Nevertheless, most business data remain stored in relational databases. XML views over relational data are seen as a general way to publish relational data as XML documents. There are many proposals to overcome the mismatch between the flat relational and the hierarchical XML model (e.g. [1, 2]). Commercial relational database management systems also offer the possibility to publish relational data as XML (e.g. [3, 4]).
The importance of XML technology is increasing tremendously in process management. Web services [5], workflow management systems [6] and B2B standards [7, 8] use XML as a data format. Complex XML documents published and exchanged by business processes are usually defined with XML Schema types. Process activities expect and produce XML documents as parameters. XML documents encapsulated in messages (e.g. WSDL) can trigger new process instances. Frequently, hierarchical XML data used by processes and activities have to be translated into the flat relational data model used by external databases. These systems are very often loosely coupled, which makes it impossible or very difficult to provide view maintenance. On the other hand, the original data can be accessed and modified by other systems or application programs. Therefore, a method of controlling the freshness of a view and invalidating stale views becomes vital.
We developed a view invalidation method for our prototype workflow management system. A special data access module, the so-called generic data access plug-in (GDAP), enables the definition of XML views over relational data. The GDAP offers the possibility to check view freshness and can invalidate a stale view. In the case of view update operations, the GDAP automatically checks that the view is not stale before propagating the update to the original database.
The remainder of this paper is organized as follows: Section 2 presents the overall architecture of our prototype workflow management system and introduces the idea of data access plug-ins, used to provide uniform access to external data in workflows. Section 3 discusses invalidation mechanisms for XML views defined over relational data, and Section 4 describes their actual implementation in our system. We give an overview of related work in Section 5 and finally draw conclusions in Section 6.
[Figure 1: Architecture of the prototype workflow management system (WfMS). The workflow engine works on a workflow repository and is connected, via the data access plug-in manager, to the data access plug-ins that reach external systems; a program interaction manager and a worklist manager with its worklist handler complete the picture.]
The task of a data access plug-in is to translate the operations on XML documents to
the underlying data sources. A data access plug-in exposes to the workflow management
system a simple interface which allows XML documents to be read, written or created
in a collection of many documents of the same XML Schema type. Each document in
the collection is identified by a unique identifier. The plug-in must be able to identify the
document in the collection given only this identifier.
Each data access plug-in also allows an XPath expression to be evaluated on a selected XML document. The XML documents within a workflow can be used by the workflow engine to control the flow of workflow processing; this is done in conditional split nodes by evaluating XPath conditions on documents. If a given document is stored in an external data source and accessed by a data access plug-in, then the XPath condition has to be evaluated by this plug-in. XPath is also used to access data values in XML documents.
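The plug-in interface described in the last two paragraphs might be sketched as follows; this is a Python illustration with invented class and method names, not the actual interface of the prototype system:

```python
from abc import ABC, abstractmethod

class DataAccessPlugin(ABC):
    """Sketch of the interface a data access plug-in exposes to the
    workflow engine (illustrative names). Each plug-in manages one
    collection of XML documents of the same XML Schema type, each
    document identified by a unique identifier."""

    @abstractmethod
    def create(self, document: str) -> str:
        """Store a new document; return its unique identifier."""

    @abstractmethod
    def read(self, doc_id: str) -> str:
        """Return the document identified by doc_id."""

    @abstractmethod
    def write(self, doc_id: str, document: str) -> None:
        """Replace the document identified by doc_id."""

    @abstractmethod
    def evaluate_xpath(self, doc_id: str, xpath: str):
        """Evaluate an XPath expression on the selected document,
        e.g. for a condition in a conditional split node."""
```

A concrete plug-in for a relational source would implement these methods on top of its XML-to-relational mapping.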
change methods. Change methods that implement change operations on the underlying base data can be extended by functionality that informs the plug-ins about potential view invalidations.
3. Passive change detection by event mechanisms: This is the most general approach, because an additional publish-subscribe mechanism is assumed here. GDAPs subscribe views to different events on the database (e.g. change operations on base data). If an event occurs, a notification is sent to the GDAP, which initiates the view invalidation process.
When active change detection techniques are used, on the other hand, it is the GDAP's own responsibility to check, periodically or at defined points in time, whether the underlying base data has changed. Because no concepts of active or object-oriented databases and no publish-subscribe mechanisms are required, these techniques are universal. We distinguish three active change detection methods:
1. A naive approach is to back up the relevant relational base data at view-creation time and compare it to the relational base data at the time when the view validity has to be checked. Differences between these two data sets may lead to view invalidation.
2. To avoid storing huge amounts of backup data, a hash function can be used to compute a hash value of the base data and back up only this value. When a view validity check becomes necessary, a hash value is computed on the new database state and compared to the backed-up value to detect changes. Note that hash collisions are possible: the hash values of two different database states may coincide, in which case a change goes undetected and a stale view is wrongly considered valid.
3. Usage of change logs: Change logs record all changes within the database caused by internal or external actors. Because no database state needs to be backed up at view-creation time, less space is used.
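The hash-based method (2) can be sketched as follows, assuming Python's hashlib; the canonical serialization (a sorted repr of the tuples) is an illustrative choice, not one prescribed by the text:

```python
import hashlib

def snapshot_hash(rows):
    """Digest of a canonical, order-independent serialization of the
    relevant base tuples. Only this digest is backed up at
    view-creation time; a later mismatch signals a change. A hash
    collision would let a real change go undetected."""
    h = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        h.update(row.encode("utf-8"))
    return h.hexdigest()

base_at_view_creation = [(1, "Smith", 3000), (2, "Jones", 3500)]
backup = snapshot_hash(base_at_view_creation)

base_now = [(1, "Smith", 3000), (2, "Jones", 3600)]    # a salary changed
view_may_be_stale = snapshot_hash(base_now) != backup  # True -> invalidate
```

Compared to the naive method (1), only a single digest per view has to be stored instead of a full copy of the base data.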
After changes in the base data have been detected, the second GDAP task is to determine whether the corresponding views have become stale, i.e. to check whether they need to be invalidated. Not all changes in the base relational data lead to invalid views. We developed an algorithm that considers both the type of the change operation and the view definition to decide on view invalidation.
Figure 2 gives an overview of this algorithm. It shows the different effects caused by different change operations: a change of a tuple that is displayed in the view (i.e. that satisfies the selection-condition) always leads to view invalidation. In certain cases, change operations on tuples that are not displayed in the view lead to invalidation too: (1) changes of tuples selected by the where-clause make the view stale; (2) changes invalidate all views whose view-definitions do not contain a where-clause if the changed tuples occur in relations appearing in the from-clause of the view-definition.
These cases also apply to deletes and inserts. A tuple inserted by an insert operation invalidates the view if it is displayed in the view, if it is selected by the where-clause, or if the view-definition does not contain a where-clause and the tuple is inserted into a relation occurring in the from-clause of the view-definition. The same applies to deletes: a view is invalidated if the deleted tuple was displayed in the view, was selected by the where-clause, or was deleted from a relation occurring in the view-definition's from-clause.
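The decision logic of Fig. 2, applied per changed tuple, can be sketched as follows; the boolean parameters are an abstraction of the concrete view-definition checks, introduced here for illustration:

```python
def view_invalidated(tuple_shown_in_view, has_where_clause,
                     where_clause_selects_tuple, relation_in_from_clause):
    """Return True if the change to one tuple invalidates the view."""
    if tuple_shown_in_view:            # tuple satisfies the selection-condition
        return True
    if has_where_clause:               # not shown: decided by the where-clause
        return where_clause_selects_tuple
    return relation_in_from_clause     # no where-clause: any from-clause relation

# A change to a tuple that is neither shown nor selected keeps the view valid:
# view_invalidated(False, True, False, True) -> False
```

The same function covers inserts and deletes, with the flags evaluated for the inserted or deleted tuple.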
[Figure 2: Invalidation decision tree for a changed tuple. If the selection-clause is satisfied (the tuple is shown in the view), the view is invalid. Otherwise, if a where-clause exists, the view is invalid exactly when the where-clause selects the tuple; if no where-clause exists, the view is invalid exactly when the tuple occurs in a relation of the from-clause. In all remaining cases the view stays valid.]
If we assume that, for update-propagation reasons, the identifying primary keys of the base tuples are contained in the view, every tuple of the view can be associated with its base tuple and vice versa. Thus, every single change within the base data can be associated with the view.
4 Implementation
To validate our approach, we implemented a generic data access plug-in and integrated it into our prototype workflow management system described in Section 2. The current implementation of the GDAP for relational databases (Fig. 3) takes advantage of the XML-DBMS middleware for transferring data between XML documents and relational databases [10]. XML-DBMS maps the XML document to the database according to an object-relational mapping in which element types are generally viewed as classes, and attributes and XML text data as properties of those classes. An XML-based mapping language allows the user to define an XML view of relational data by specifying these mappings. XML-DBMS also supports insert, update and delete operations. In our implementation we follow the assumption made by XML-DBMS that the view updateability problem has already been resolved; XML-DBMS checks only basic properties of view updateability, e.g. the presence of primary keys and obligatory attributes in an updateable view. Other issues, like content and structural duplication, are not addressed.
The GDAP controls the freshness of generated XML views using predefined triggers and so-called view-tuple lists (VTLs). A VTL contains the primary keys of the tuples that were selected into the view. VTLs are managed and stored internally by the GDAP. A sample view-tuple list, containing the primary keys of the displayed tuples in each involved relation, is shown in Table 1.
Our GDAP also uses triggers defined on the tables that were used to create a view. These triggers are created together with the view definition and store the primary keys of all tuples inserted, deleted or modified within these relations in a special log. This log is later inspected by the GDAP, and the information gathered during this inspection is used in our view invalidation process. Thus, our implementation uses a variant of active change detection with change logs as described in Section 3.
Two different VTLs are used in the invalidation process: VTLold is computed when the view is generated; VTLnew is computed at the time of the invalidation process. The primary keys of modified records in the original tables, logged by a trigger, are used to
[Figure 3: Implementation of the GDAP. The workflow engine calls the plug-in (implementing Workflow.Plugins.DataAccessPluginInterface, fed by XForms); the plug-in performs view updatability checking and drives the XML-DBMS middleware, configured by ViewDef.xml and LogDef.xml. It maintains VTLold and VTLnew (vtl.bin) and a view-type-ID list (viewList.bin) and uses X-Diff; active change detection works via trigger definition/deletion and a log on the relational DB (with an immutable schema), while DB changes may also originate from external users and applications (passive change detection).]
[Table 1: A sample view-tuple list, mapping a view-id (view 001, view 002, view 003, ...) and the relations involved in it (order, customer, position, article, ...) to the tupleIDs of the displayed tuples.]
[Figure 4 (fragment): if Case 4 or Case 5 is satisfied, the view is invalid; otherwise it is valid.]
be checked. Viewold and Viewnew denote the original view and the view generated during the validation checking process, respectively:
Case 4: If VTLold is not equal to VTLnew, the view is invalid because the set of selected tuples has changed.
Case 5: If VTLold is equal to VTLnew but Viewold is not equal to Viewnew, the view also has to be invalidated: the selected tuples remained the same, but values within these tuples have changed.
To see why it is necessary to check Case 5, consider the following view-definition and the corresponding view listed in Table 2:

SELECT tupleID, name, salary,
       (SELECT max(salary) AS maxSalary
        FROM employees
        WHERE department = 'IT')
FROM employees
WHERE department = 'IT' AND tupleID < 3
If the salary of the employee with the maximum salary is changed (notice that this employee is not selected by the selection-condition), the same tuples are still selected, but maxSalary changes within the view.
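Cases 4 and 5 can then be sketched as follows; check_cases_4_and_5 is a hypothetical helper that receives the view-tuple lists (sets of primary keys) and the materialized view contents:

```python
def check_cases_4_and_5(vtl_old, vtl_new, view_old, view_new):
    """Cases 4 and 5 of the invalidation algorithm (illustrative)."""
    if vtl_old != vtl_new:     # Case 4: the set of selected tuples changed
        return "invalid"
    if view_old != view_new:   # Case 5: same tuples, but values changed
        return "invalid"
    return "valid"

# maxSalary example: tuples 1 and 2 stay selected, but the subquery value changes
view_old = [(1, "A", 3000, 5000), (2, "B", 3200, 5000)]
view_new = [(1, "A", 3000, 5500), (2, "B", 3200, 5500)]
check_cases_4_and_5({1, 2}, {1, 2}, view_old, view_new)   # -> "invalid" (Case 5)
```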
The invalidation checks of Cases 1-4 do not require a view recomputation, and Case 5 only needs to be checked if Cases 1-4 are not satisfied. Notice that while Cases 4 and 5 have to be checked only once, Cases 1-3 have to be checked for every tuple occurring in the change log. The invalidation algorithm used in our GDAPs is summarized in Fig. 4.
Case 5 is checked by comparing two views, as shown above. The disadvantages of this comparison are that the recomputation of the view Viewnew is time and resource consuming, and that the comparison itself may be very inefficient. A more efficient way is to replace Case 5 by Case 5a:
[Table 2: The view resulting from the view-definition above, showing the tuples with tupleID 1 and 2 together with name, salary, and maxSalary.]
5 Related Work
In most existing workflow management systems, the data used to control the flow of workflow instances (i.e. workflow-relevant data) are controlled by the workflow management system itself and stored in the workflow repository. If these data originate in external data sources, the external data are usually copied into the workflow repository. There is no universal standard for accessing external data in workflow management systems; basically, each product uses a different solution [11].
There has been recent interest in publishing relational data as XML documents, often called XML views over relational data. Most of these proposals focus on converting queries over XML documents into SQL queries over the relational tables and on efficient methods of tagging and structuring relational data into XML (e.g. [1, 2, 12]).
The view updateability problem is well known in relational and object-relational databases [13]. The mismatch between the flat relational and the hierarchical XML model is an additional challenge; this problem is addressed in [14]. However, most proposals of updateable XML views [15] and commercial RDBMSs (e.g. [3]) assume that the XML view updateability problem is already solved. The term view freshness is not used in a uniform way; it deals with the currency and timeliness [16] of a (materialized) view. Additional
dimensions of view freshness regarding the frequency of changes are discussed in [17]. We do not distinguish all these dimensions in this paper; here, view freshness means the validity of a view. Different types of invalidation, including over- and under-invalidation, are discussed in [18].
A view is not fresh (stale) if the data used to generate the view were modified. It is important to detect the relevant modifications. In [19] the authors proposed to store both the original XML documents and their materialized XML views in special relational tables and to use an update log to detect relevant updates. A new view generated from the modified base data may differ from a previously generated view. Several methods for change detection in XML documents have been proposed (e.g. [20, 21]). The authors of [22] proposed to first store XML in special relational tables and then to use SQL queries to detect content changes of such documents. In [4], before and after images of an updateable XML view are compared in order to find the differences, which are later used to generate the corresponding SQL statements responsible for updating the relational data. The before and after images are also used to provide optimistic concurrency control.
6 Conclusions
The data aspect of workflows requires more attention. Since workflows typically access databases for performing activities or making flow decisions, the correct synchronization between the base data and the copies of these data in workflow systems is of great importance for the correctness of workflow execution. We described a way of recognizing the invalidation of materialized views of relational data used in workflow execution. To check the freshness of generated views, our algorithm does not require any special data structures in the RDBMS except a log table and triggers. Additionally, view-tuple lists are managed to store the primary keys of the tuples selected into a view. Thus, only a very small amount of overhead data is stored and can be used to invalidate stale views.
The implemented generic data access plug-in enables flexible publication of relational data as XML documents used in loosely coupled workflows. This brings obvious advantages for intra- and inter-organizational exchange of data. In particular, it makes the definition of workflows easier and the coupling between the workflow system and databases more transparent, since it is no longer necessary to perform all the required checks in the individual activities of a workflow.
Acknowledgements
This work is partly supported by the Commission of the European Union within the project WS-Diamond in the FP6 STREP programme.
References
[1] M. Fernandez, Y. Kadiyska, D. Suciu, A. Morishima, and W.-C. Tan, "SilkRoute: A framework for publishing relational data in XML," ACM Trans. Database Syst., vol. 27, no. 4, pp. 438–493, 2002.
72
[2] J. Shanmugasundaram, J. Kiernan, E. J. Shekita, C. Fan, and J. Funderburk, "Querying XML views of relational data," in VLDB 01: Proceedings of the 27th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., 2001, pp. 261–270.
[3] Oracle, XML Database Developer's Guide – Oracle XML DB, Release 2 (9.2), Oracle Corporation, October 2002.
[4] M. Rys, "Bringing the Internet to your database: Using SQL Server 2000 and XML to build loosely-coupled systems," in Proceedings of the 17th International Conference on Data Engineering (ICDE), April 2-6, 2001, Heidelberg, Germany. IEEE Computer Society, 2001, pp. 465–472.
[5] T. Andrews, F. Curbera, H. Dholakia, Y. Goland, J. Klein, F. Leymann, K. Liu, D. Roller, D. Smith, S. Thatte, I. Trickovic, and S. Weerawarana, "Business process execution language for web services (BPEL4WS)," BEA, IBM, Microsoft, SAP, Siebel Systems, Tech. Rep. 1.1, 5 May 2003.
[6] WfMC, "Process definition interface – XML process definition language (XPDL 2.0)," Workflow Management Coalition, Tech. Rep. WFMC-TC-1025, 2005.
[7] ebXML, ebXML Technical Architecture Specification v1.0.4, ebXML Technical Architecture Project Team, 2001.
[8] M. Sayal, F. Casati, U. Dayal, and M.-C. Shan, "Integrating workflow management systems with business-to-business interaction standards," in Proceedings of the 18th International Conference on Data Engineering (ICDE 2002). IEEE Computer Society, 2002, p. 287.
[9] M. Ader, "Workflow and business process management comparative study, volume 2," Workflow & Groupware Strategies, Tech. Rep., June 2003.
[10] R. Bourret, "XML-DBMS middleware," http://www.rpbourret.com/xmldbms/index.htm. Viewed: May 2005.
[14] L. Wang and E. A. Rundensteiner, "On the updatability of XML views published over relational data," in Conceptual Modeling – ER 2004, 23rd International Conference on Conceptual Modeling, Shanghai, China, November 2004, Proceedings, ser. Lecture Notes in Computer Science, P. Atzeni, W. W. Chu, H. Lu, S. Zhou, and T. W. Ling, Eds., vol. 3288. Springer, 2004, pp. 795–809.
GÜNTZER Ulrich
Institute of Computer Science
University of Tübingen
Sand 13
72076 Tübingen, Germany
ulrich.guentzer@informatik.uni-tuebingen.de
Abstract
Preference-based queries, often referred to as skyline queries, play an important role in cooperative query processing. However, their prohibitive result sizes pose a severe challenge to the paradigm's practical applicability. In this paper we discuss the incremental re-computation of skylines based on additional information elicited from the user. Extending the traditional case of totally ordered domains, we consider preferences in their most general form as strict partial orders of attribute values. After obtaining an initial skyline set, our approach aims at incrementally increasing the system's information about the user's wishes. This additional knowledge is then incorporated into the preference information and steadily reduces skyline sizes. In particular, our approach also allows users to specify trade-offs between different query attributes, thus effectively decreasing the query dimensionality. We provide the required theoretical foundations for modeling preferences and equivalences, show how to compute incremented skylines, and prove the correctness of the algorithm. Moreover, we show that incremented skyline computation can take advantage of locality and database indices, so the performance of the algorithm can be increased further.
Keywords: Personalized Queries, Skylines, Trade-Off Management, Preference Elicitation
1 Introduction
Preference-based queries, usually called skyline queries in database research [9], [4], [19], have become a prime paradigm for cooperative information systems. Their major appeal is their intuitiveness of use, in contrast to other query paradigms such as rigid set-based SQL queries, which only too often return an empty result set, or efficient but hard-to-use top-k queries, where the success of a query depends on choosing the right scoring or utility functions.
¹ Part of this work was supported by a grant of the German Research Foundation (DFG) within the Emmy Noether Program of Excellence.
Skyline queries offer user-centered querying: the user just has to specify the basic attributes to be queried and in return retrieves the Pareto-optimal result set. In this set, all possible 'best' objects (where 'best' refers to being optimal with respect to any monotonic optimization function) are returned. Hence, a user cannot miss any important answer. However, this intuitiveness of querying comes at a price. Skyline sets are known to grow exponentially in size with the number of query attributes [8], [14] and may reach unreasonable result sizes (of about half the original database size, cf. [3], [7]) already for as few as six independent query predicates. The problem becomes even worse if, instead of totally ordered domains, user preferences on arbitrary predicates over attribute-based domains are considered. In database retrieval, preferences are usually understood as partial orders [12], [15], [20] of domain values that allow for incomparability between attribute values. This incomparability is reflected in the respective skyline sizes, which are generally significantly bigger than in the totally ordered case. On the other hand, such attribute-based domains, like colors, book titles, or document formats, play an important role in practical applications, e.g. digital libraries or e-commerce. As a general rule of thumb, the more preference information (including its transitive implications) the user gives with respect to each attribute, the smaller the average skyline set can be expected to be. In addition to prohibitive result set sizes, skyline queries are expensive to compute: evaluation times in the range of several minutes or even hours over large databases are not unheard of.
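For totally ordered numeric attributes, the Pareto-optimal result set can be illustrated with a deliberately naive quadratic filter in Python (a sketch only; practical skyline algorithms are considerably more efficient):

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every attribute and
    strictly better in at least one (here: smaller is better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline(points):
    """Naive O(n^2) Pareto filter: keep the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

hotels = [(50, 2.0), (80, 0.5), (60, 1.0), (90, 2.5)]  # (price, distance)
skyline(hotels)   # -> [(50, 2.0), (80, 0.5), (60, 1.0)]; (90, 2.5) is dominated
```

With every additional independent attribute, fewer tuples tend to be dominated, which is exactly the exponential growth of skyline sizes discussed above.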
One possible solution is based on the idea of refining skyline queries incrementally by taking advantage of user interaction. This approach is promising since it benefits skyline sizes as well as evaluation times. Recently, several approaches to user-centered refinement have been proposed:
• using an interactive, exploratory process to steer the progressive computation of skyline objects [17]
• exploiting feedback on a representative sample of the original skyline result [8], [16]
• projecting the complete skyline onto subsets of predicates using pre-computed skycubes [20], [23].
In all three approaches, the benefit of offering intuitive querying and cooperative system behavior to the user can be obtained with a minimum of user interaction to guide the further refinement of the skyline. However, when dealing with a massive amount of result tuples, the first approach requires a certain user expertise to steer the progressive computation effectively. The second approach faces the problem of deriving representative samples efficiently, i.e. avoiding a complete skyline computation for each sample. In the third approach, the necessary pre-computations are expensive in the face of updates of the database instance. Moreover, basic theoretical properties of incremented preferences with respect to possible preference collisions and the induced query modification and query evaluation have been outlined in [13].
In this paper we provide the theoretical foundations for modeling partially ordered preferences and equivalences on attribute domains, present algorithms for incrementally and interactively computing skyline sets, and prove the soundness and consistency of these algorithms (thus giving a comprehensive view of [1], [2], [6]). Seeing preferences in their most general form as partial orders between domain values, this implicitly includes
the case of totally ordered domains. After getting a (usually too big) initial skyline set, our approach aims at interactively increasing the system's information about the user's wishes. The additional knowledge is then incorporated into the preference information and helps to reduce skyline sets. Our contributions are:
• Users are enabled to specify additional preference information (in the sense of domination), as well as equivalences (in the sense of indifference), between attributes, leading to an incremental reduction of the skyline. Here our system efficiently supports the user by automatically ensuring that newly specified preferences and equivalences never violate the consistency of the previously stated preferences.
• Our skyline evaluation algorithm allows specifying such additional information within a certain attribute domain. That means that more preference information about an attribute is elicited from the user; thus the respective preference becomes more complete and skylines will usually become smaller. By canceling out incomparability between attribute values, this can reduce skylines to the (on average considerably smaller) sizes of total-order skylines.
• In addition, our evaluation algorithm also allows specifying additional relationships between preferences on different attributes. This feature allows defining the qualitative importance or equivalence of attributes in different domains, and thus forms a good tool to compare the respective utility or desirability of certain attribute values. The user can thus express trade-offs or compromises he/she is willing to make, and can also adjust imbalances between fine-grained and coarse preference specifications.
• We show that the efficiency of incremented skyline computation can be considerably increased by employing preference diagrams. We derive an algorithm that takes advantage of the locality of incremented skyline set changes with respect to the changes made by the user to the preference diagram. Thereby, the algorithm can operate on a considerably smaller dataset with increased efficiency.
Spanning preferences across attributes (by specifying trade-offs) is the only way, short of dropping entire query predicates, to reduce the dimensionality of the skyline computation and thus severely reduce skyline sizes. Nevertheless, the user stays in full control of the information specified, and all information is added only in a qualitative way, not by unintuitive weightings.
[Figure content: preference graphs over the domains location (beach area, arts district, university district, commercial district, outer suburbs), type (e.g. studio), and price range (e.g. >250 000)]
Figure 1. Three typical user preferences (left) and an enhanced preference (right)
[Figure 2: domination relationships between apartment objects, e.g. (beach area, 2-bedroom, C), (beach area, studio, D), (arts district, loft, D), and (arts district, 2-bedroom, E), with price values C ≤ D ≤ E; solid arrows mark original relationships, dotted arrows mark new relationships induced by the equivalence across preferences P1 and P2: if the price is equal, a beach area studio is equivalent to an arts district loft]
line. Naturally, existing preferences might be extended by adding new preference relationships. But explicit equivalences may also be stated between certain attribute values, expressing actual indifference and thus resulting in new domination relationships, too.
Example (cont.): Let's assume that the skyline still contains too many apartments. Thus, Anna interactively refines her originally stated preferences. For example, she might state that she actually prefers the arts district over the university district and the latter over the commercial district, which would turn the preference P1 into a totally ordered relation. This would, for instance, allow apartments located in the arts and university districts to dominate those located in the commercial district with respect to P1, resulting in a decrease of the size of the Pareto-optimal set of skyline objects. Alternatively, Anna might state that she actually does not care whether her flat is located in the university district or the commercial district, i.e., that these two areas are equally desirable for her. This is illustrated in the right-hand part of Figure 1 as preference P1. In this case, it is reasonable to deduce that all arts district apartments will dominate commercial district apartments with respect to the location preference.
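The effect described in this example can be illustrated with a small Pareto-dominance sketch (illustrative Python, not the authors' implementation; the apartments, prices, and helper names are invented for this example):

```python
def better(pref, x, y):
    """y is strictly better than x under pref, i.e. (x, y) is in pref."""
    return (x, y) in pref

def dominates(o1, o2, prefs):
    """o2 dominates o1: at least as good in every attribute, strictly better in one."""
    at_least = all(o1[i] == o2[i] or better(p, o1[i], o2[i])
                   for i, p in enumerate(prefs))
    strictly = any(better(p, o1[i], o2[i]) for i, p in enumerate(prefs))
    return at_least and strictly

def skyline(objects, prefs):
    """All objects not dominated by any other object (Pareto skyline)."""
    return {o for o in objects
            if not any(dominates(o, o2, prefs) for o2 in objects)}

# Partial order on locations: arts district beats commercial and university district
P1 = {("commercial district", "arts district"),
      ("university district", "arts district")}
# Lower price is better
price = {(900, 800)}
apartments = {("arts district", 900),
              ("commercial district", 800),
              ("university district", 800)}

print(len(skyline(apartments, [P1, price])))          # 3 (much incomparability)

# Anna refines P1: the university district beats the commercial district
P1_refined = P1 | {("commercial district", "university district")}
print(len(skyline(apartments, [P1_refined, price])))  # 2 (the skyline shrinks)
```

Adding the single preference edge makes the commercial district apartment dominated, shrinking the skyline without touching the price attribute.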
Preference relationships over attribute domains lead to domination relationships on database objects when the skyline operator is applied to a given database. These resulting domination relationships are illustrated by the solid arrows in Figure 2. However, users might also weigh some predicates as more important than others and hence might want to model trade-offs they are willing to consider. Our preference modeling approach introduced in [1] allows expressing such trade-offs by providing new preference relations or equivalence relations between different attributes. This means amalgamating some attributes in the skyline query and subsequently reducing the dimensionality of the query.
Example (cont.): While refining her preferences, Anna realizes that she actually considers the area in which her new apartment is located as more important than the actual apartment type; in other words, for her a relaxation in the apartment type is less severe than a relaxation in the area attribute. Thus she states that she would consider a beach area studio (the least desired apartment type in the best area) still as equally desirable as a loft in the arts district (the best apartment type in a less preferred area). By doing that, she has stated a preference on an amalgamation of the attributes apartment type and location. This statement induces new domination relations on database objects (illustrated as the dotted arrows in Figure 2), allowing for example any cheaper beach area 2-bedroom to dominate all equally priced or more expensive arts district lofts (by using the ceteris paribus [18] assumption). In this way, the result set size of the skyline query can be decreased.
3.1
In this section we will provide the basic definitions which are prerequisites for the following sections. We will introduce the notions of base preferences, base equivalences, their amalgamated counterparts, a generalized Pareto composition, and a generalized skyline. The basic constructs are so-called base preferences, defining strict partial orders on attribute domains of database objects (based on [12], [15]):
Definition 1: (Base Preference)
Let D1, D2, …, Dm be m non-empty domains (i.e. sets of attribute values) of the attributes Attr1, Attr2, …, Attrm, so that Di is the domain of Attri. Furthermore, let O ⊆ D1 × D2 × … × Dm be a set of database objects and let attri : O → Di be a function mapping each object in O to a value of the domain Di.
Then a Base Preference Pi ⊆ Di² is a strict partial order on the domain Di.
The intended interpretation of (x, y) ∈ Pi with x, y ∈ Di (or alternatively written x <Pi y) is the intuitive statement 'I like attribute value y (of the domain Di) better than attribute value x (of the same domain)'. This implies that for o1, o2 ∈ O, (attri(o1), attri(o2)) ∈ Pi means 'I like object o2 better than object o1 with respect to its i-th attribute value'.
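Viewed operationally, a base preference is just a set of ordered pairs over its domain, and the strict-partial-order requirement can be checked directly (a minimal sketch with invented helper names, not code from the paper):

```python
def is_strict_partial_order(pref, domain):
    """Check irreflexivity and transitivity of a preference relation."""
    for x in domain:
        if (x, x) in pref:          # irreflexivity: no value beats itself
            return False
    for (a, b) in pref:             # transitivity: (a,b) and (b,c) imply (a,c)
        for (c, d) in pref:
            if b == c and (a, d) not in pref:
                return False
    return True

# Example: a location preference (outer suburbs worst, beach area best)
P1 = {("outer suburbs", "commercial district"),
      ("outer suburbs", "beach area"),
      ("commercial district", "beach area")}
D1 = {"outer suburbs", "commercial district", "beach area"}
print(is_strict_partial_order(P1, D1))  # True
```

For strict relations, irreflexivity together with transitivity already implies asymmetry, so no separate check is needed.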
In addition to specifying preferences on a domain Di, we also allow defining equivalences, as given in Definition 2.
Definition 2: (Base Equivalence and Compatibility)
Let O be a set of database objects and Pi a base preference on Di as given in Definition 1.
Then we define a Base Equivalence Qi ⊆ Di² as an equivalence relation (i.e. Qi is reflexive, symmetric and transitive) which is compatible with Pi, i.e.:
a) Qi ∩ Pi = ∅ (meaning no equivalence in Qi contradicts any strict preference in Pi)
b) Pi ∘ Qi = Qi ∘ Pi = Pi (the domination relationships expressed transitively using Pi and Qi must always be contained in Pi)
In particular, as Qi is an equivalence relation, Qi trivially contains the pairs (x, x) for all x ∈ Di.
The interpretation of base equivalences is similarly intuitive as for base preferences: (x, y) ∈ Qi with x, y ∈ Di (or alternatively written x ~Qi y) means 'I am indifferent between attribute values x and y of the domain Di'.
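The two compatibility conditions of Definition 2 translate directly into set intersection and relational composition (an illustrative sketch; the relation names are hypothetical):

```python
def compose(r, s):
    """Relational composition: (a, d) whenever (a, b) in r and (b, d) in s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def is_compatible(P, Q):
    """Check conditions a) and b) of Definition 2 for a preference P and
    an equivalence Q over the same domain."""
    if P & Q:                       # a) an equivalence contradicts a preference
        return False
    return compose(P, Q) == P and compose(Q, P) == P   # b)

# Domain values with a < b, and c declared equivalent to b:
P = {("a", "b"), ("a", "c")}
Q = {("b", "c"), ("c", "b"), ("a", "a"), ("b", "b"), ("c", "c")}
print(is_compatible(P, Q))  # True
```

Note that P must already contain ("a", "c") here: if b and c are equivalent and b beats a, compatibility forces c to beat a as well.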
[Figure 3: base preferences P1 (location: beach area, arts district, university district, commercial district, outer suburbs) and P2 (type: loft, maisonette, 2-bedroom, studio)]
This means: Given two tuples xμ, yμ from the amalgamated domains described by μ, the function AmalPref(xμ, yμ) returns a set of relationships between database objects of the form (o1, o2), where the attributes of o1 projected on the amalgamated domains equal those of xμ, the attributes of o2 projected on the amalgamated domains equal those of yμ, and furthermore all other attributes which are not within the amalgamated attributes are identical for o1 and o2. The last requirement denotes the well-known ceteris paribus [18] condition ('all other things being equal'). The relationships created by that function may be incorporated into P as long as they don't violate P's consistency. The conditions and detailed mechanics allowing this incorporation are the topic of a later section.
Please note the typical cross shape introduced by trade-offs: a relaxation in one attribute is compared to a relaxation in the second attribute. Though in the Pareto sense the two respective objects are not comparable (in Figure 3 the arts district studio has a better value with respect to location, whereas the university district loft has a better value with respect to type), amalgamation adds the respective preference and equivalence relationships between those objects.
Definition 6: (Amalgamated Equivalence Functions)
Let μ ⊆ {1, …, m} be a set with cardinality k. Using π as the projection in the sense of relational algebra, we define the function
AmalEq(xμ, yμ) : ( ×i∈μ Di )² → O²
(xμ, yμ) ↦ { (o1, o2) ∈ O² | ∀i ∈ μ : [ (πAttri(o1) = πAttri(xμ) ∧ πAttri(o2) = πAttri(yμ)) ∨ (πAttri(o2) = πAttri(xμ) ∧ πAttri(o1) = πAttri(yμ)) ] ∧ ∀i ∈ {1, …, m} \ μ : (πAttri(o1) = πAttri(o2)) }
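Under the convention that objects are attribute tuples and μ a set of attribute positions, Definition 6 can be sketched as follows (illustrative code; the per-attribute disjunction makes the returned relation symmetric, and the last conjunct enforces the ceteris paribus condition):

```python
def amal_eq(x_mu, y_mu, mu, objects):
    """Object pairs matching x_mu/y_mu on the amalgamated attributes mu
    (in either orientation) and identical on all remaining attributes."""
    pairs = set()
    for o1 in objects:
        for o2 in objects:
            rest_equal = all(o1[i] == o2[i]
                             for i in range(len(o1)) if i not in mu)
            mu_match = all((o1[i] == x_mu[i] and o2[i] == y_mu[i]) or
                           (o2[i] == x_mu[i] and o1[i] == y_mu[i])
                           for i in mu)
            if rest_equal and mu_match:
                pairs.add((o1, o2))
    return pairs

# Attributes: (location, type, price); amalgamate location and type:
mu = {0, 1}
x_mu = ("beach area", "studio", None)      # price position unused
y_mu = ("arts district", "loft", None)
apartments = {("beach area", "studio", 800),
              ("arts district", "loft", 800),
              ("arts district", "loft", 900)}
pairs = amal_eq(x_mu, y_mu, mu, apartments)
print(len(pairs))  # 2 (the equally priced pair, in both orientations)
```

The 900-priced loft is excluded because its price differs from the studio's, i.e. the ceteris paribus conjunct fails.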
The function differs from the amalgamated preference function in that it returns symmetric relationships, i.e. if (o1, o2) ∈ Q, then (o2, o1) also has to be in Q. Furthermore, these relationships have to be incorporated into Q instead of P, as long as they don't violate consistency. But due to the compatibility characteristic, P can also be affected by new relationships in Q.
Based on an object preference P, we can finally derive the skyline set (given by Definition 7) which is returned to the user. This set contains all possible best objects with respect to the underlying user preferences; that is, it contains only those objects that are not dominated by any other object in the object-level preference P. Note that we call this set the generalized skyline, as it is derived from P, which is initially the Pareto order but may also be extended with additional relationships (e.g. trade-offs). Note that the generalized skyline set is derived solely using P; still, it respects Q by introducing new relationships into P based on recent additions to Q (cf. Definition 8).
Definition 7: (Generalized Skyline)
The Generalized Skyline S ⊆ O is the set containing all optimal database objects with respect to a given object preference P and is defined as
S := { o ∈ O | ¬∃ o′ ∈ O : (o, o′) ∈ P }
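Definition 7 reads off directly as a filter over the object set (a naive quadratic sketch for illustration, not the paper's evaluation algorithm):

```python
def generalized_skyline(objects, P):
    """All objects o for which no o2 exists with (o, o2) in P (Definition 7)."""
    return {o for o in objects
            if not any((o, o2) in P for o2 in objects)}

objects = {"o1", "o2", "o3"}
P = {("o2", "o1"), ("o3", "o1")}   # o1 is preferred to both o2 and o3
print(sorted(generalized_skyline(objects, P)))  # ['o1']
```

Since P may contain arbitrary additional relationships (e.g. trade-offs), the same filter works unchanged for the generalized case.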
3.2
The last section provided the basic definitions required for dealing with preference and equivalence sets. In the following, we provide a method for the incremental specification and enhancement of object preference sets P and equivalence sets Q. We also show under which conditions the addition of new object preferences / equivalences is safe, and how compatibility and soundness can be ensured.
The basic approach for dealing with incremental preference and equivalence sets is illustrated in Figure 4. First, the base preferences P1 to Pm (Definition 1) and their corresponding base equivalences Q1 to Qm (Definition 2) are elicited. Based on these, the initial object preference P (Definition 4) is created using the generalized Pareto aggregation (Definition 3). The initial object equivalence Q starts as a minimal relation as defined in Definition 4. The generalized Pareto skyline (Definition 7) of P is then displayed to the user, and the iterative phase of the process starts. Users now have the opportunity to specify additional base preferences or equivalences or amalgamated relationships (Definition 5, Definition 6), as described in the previous section. The set of new object relationships resulting from the ceteris paribus functions of the newly stated user information is then checked for compatibility using the constraints given in this section. If the new object relationships are compatible with P and Q, they are inserted, and thus the incremented sets P* and Q* are formed. If the relationships are not consistent with the previously stated information, the last addition is discarded and the user is notified. The user can thus state more and more information until the generalized skyline is lean enough for manual inspection.
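The accept-or-discard step of this iterative phase can be sketched as follows (illustrative code; an addition is rejected here when the symmetric part of the transitive closure would swallow a previously stated strict preference, which is a simplification of the compatibility constraints of this section):

```python
def transitive_closure(r):
    """Transitive closure of a binary relation given as a set of pairs."""
    closure = set(r)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

def try_increment(P, Q, new_pairs):
    """Tentatively add new object relationships; discard them on inconsistency."""
    T = transitive_closure(P | Q | new_pairs)
    Q_star = {(a, b) for (a, b) in T if (b, a) in T}  # symmetric part of T
    P_star = T - Q_star
    if not P <= P_star:          # an old strict preference collapsed: reject
        return P, Q, False
    return P_star, Q_star, True

P = {("b", "a")}                 # "a is better than b"
Q = set()
# Stating the opposite preference contradicts the earlier statement:
P2, Q2, accepted = try_increment(P, Q, {("a", "b")})
print(accepted)                  # False

# A consistent addition is accepted and incorporated:
P3, Q3, accepted2 = try_increment(P, Q, {("c", "b")})
print(accepted2)                 # True
```

On rejection the previous P and Q are returned unchanged, matching the behavior where the last addition is discarded and the user is notified.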
[Figure 4: the incremental process. Base preferences P1, …, Pm and base equivalences Q1, …, Qm are combined by the Pareto aggregation into P and Q, which are then iteratively incremented to P* and Q*]
Lemma 2: P* ∘ T ⊆ P*
Proof: analogous to Lemma 1.
ad 3) Since Q ⊆ T and Q is symmetric, Q ⊆ Q*. Analogously, E ⊆ T and E is symmetric, so E ⊆ Q*. Thus, Q ∪ E ⊆ Q*.
ad 4) We have to show three implications for the equivalence of a), b) and c):
a) ⇒ c): Assume there would exist a cycle (x0, x1, …, xn-1, xn) with x0 = xn and edges from (P ∪ Q ∪ S ∪ E) where at least one edge is from P ∪ S; assume without loss of generality (x0, x1) ∈ P ∪ S. We know (x1, xn) ∈ T and (x0, x1) ∈ T; since x0 = xn, both (x0, x1) and (x1, x0) are in T, therefore (x0, x1) ∈ Q* and (x0, x1) ∉ P*. Thus, the statement P ∪ S ⊆ P* cannot hold, in contradiction to a).
c) ⇒ b): We have to show T ∩ (P ∪ S)conv = ∅. Assume there would exist a chain (x0, x1, …, xn-1, xn) with (x0, xn) ∈ (P ∪ S)conv and (xi-1, xi) ∈ (P ∪ Q ∪ S ∪ E) for 1 ≤ i ≤ n. Because of (x0, xn) ∈ (P ∪ S)conv it follows that (xn, x0) ∈ P ∪ S, and thus (x0, x1, …, xn-1, xn, x0) would have been a cycle in (P ∪ Q ∪ S ∪ E) with at least one edge from P or S, which is a contradiction to c).
b) ⇒ a): If the statement P ∪ S ⊆ P* did not hold, there would be x and y with (x, y) ∈ P ∪ S but (x, y) ∉ P*. Since (x, y) ∈ T, it would follow that (x, y) ∈ Q*. But then also (y, x) ∈ Q* ∩ (P ∪ S)conv ⊆ T ∩ (P ∪ S)conv would hold, which is a contradiction to b).
This completes the equivalence of the three conditions; now we have to show that from any of them we can deduce Q* = (Q ∪ E)+. Let us assume condition c) holds.
First we show Q* ⊆ (Q ∪ E)+. Let (x, y) ∈ Q*, then also (y, x) ∈ Q*. Thus we have two chain representations (x, y) = (x0, x1, …, xn-1, xn) and (y, x) = (y0, y1, …, ym-1, ym), where all edges are in (P ∪ Q ∪ S ∪ E) and xn = y = y0 and x0 = x = ym. If both representations are concatenated, a cycle is formed with edges from (P ∪ Q ∪ S ∪ E). Using condition c) we know that none of these edges can be in P ∪ S. Thus, (x, y) ∈ (Q ∪ E)+.
The inclusion Q* ⊇ (Q ∪ E)+ holds trivially, since (Q ∪ E)+ ⊆ T and (Q ∪ E)+ is symmetric, because both Q and E are symmetric.
The evaluation of skylines thus comes down to calculating P* and Q* as given by Definition 8, after we have checked their consistency as described in Theorem 1, i.e. verified that no inconsistent information has been added. A nice advantage of our system is that at any point we can incrementally check the applicability of a statement elicited from the user or from a different source, like e.g. profile information, and then accept or reject it. Therefore, skyline computation and preference elicitation are interleaved in a transparent process.
3.3
In the last sections, we provided the basic theoretical foundations for incremented preference and equivalence sets. In addition, we showed how to use the generalized Pareto aggregation for incremented skyline computation based on the previous skyline objects. In this section we will improve this algorithm's efficiency by exploiting the local nature of incrementally added preference / equivalence information.
While in the last section we facilitated skyline computation by modeling the full object preference P and object equivalence Q, we will now enhance the algorithm to be based on transitively reduced (and thus considerably smaller) preference diagrams. Preference diagrams are based on the concept of Hasse diagrams but, in contrast, do not require a full transitive reduction (i.e. some transitive information may remain in the diagram). Basically, a preference diagram is a simple graph representation of attribute values and preference edges, as follows:
Definition 9: (Preference Diagrams)
Let P be a preference in the form of a finite strict partial order. A preference diagram PD(P) for preference P denotes a (not necessarily minimal) graph such that its transitive closure PD(P)+ = P.
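Definition 9 can be checked mechanically: a candidate graph is a preference diagram for P exactly when its transitive closure reproduces P (an illustrative sketch):

```python
def transitive_closure(r):
    """Transitive closure of a binary relation given as a set of pairs."""
    closure = set(r)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

def is_preference_diagram(PD, P):
    """PD is a valid (not necessarily minimal) diagram for P iff PD+ = P."""
    return transitive_closure(PD) == P

P = {("c", "b"), ("b", "a"), ("c", "a")}      # total order a > b > c
print(is_preference_diagram({("c", "b"), ("b", "a")}, P))              # True
print(is_preference_diagram({("c", "b"), ("b", "a"), ("c", "a")}, P))  # True
```

The second call shows that a redundant transitive edge is allowed, in contrast to a Hasse diagram.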
Please note that there may be several preference diagrams provided (or incrementally completed) by the user to express the same preference information (which is given by the transitive closure of the graph). Thus the preference diagram may contain redundant transitive information, if it was explicitly stated by the user during the elicitation process. This is particularly useful when the diagram is used for user-interface purposes [2].
In the remainder of this section, we want to avoid handling the bulky and complex-to-manage incremented preference P*, and instead only incorporate increments of new preference information, as well as new equivalence information, into the preference diagram. The following two theorems show how to do this.
Theorem 2: (Calculation of P*)
Let O be a set of database objects and P, Pconv, and Q as in Definition 8, and let E := {(x, y), (y, x)} be new equivalence information such that (x, y), (y, x) ∉ (P ∪ Pconv ∪ Q). Then P* can be calculated as
P* = (P ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q)).
Proof: Assume (a, b) ∈ T as defined in Definition 8. The edge can be represented by a chain (a0, a1), …, (an-1, an), where each edge (ai-1, ai) ∈ (P ∪ Q ∪ E) and a0 := a, an := b. This chain can even be transcribed into a representation with edges from (P ∪ Q ∪ E) where at most one single edge is from E. This is because, if there were two (or more) edges from E, namely (ai-1, ai) and (aj-1, aj) (with i < j), then there are four possibilities:
a) both edges are (x, y) or both edges are (y, x); in both of these cases the sequence (ai, ai+1), …, (aj-1, aj) forms a cycle and can be omitted;
b) the first edge is (x, y) and the second edge is (y, x), or vice versa; in both of these cases (ai-1, ai), …, (aj-1, aj) forms a cycle and can be omitted, leaving no edge from E at all.
Since we have defined Q as compatible with P in Definition 8, we know that (P ∪ Q)+ = (P ∪ Q), and since elements of T can be represented with at most one edge from E, we get T = P ∪ Q ∪ ((P ∪ Q) ∘ E ∘ (P ∪ Q)).
In this case both edges in E are consistent with the already known information, because there are no cyclic paths in T containing edges of P (cf. condition 1.4.c in [1]): if there were a cycle with edges in (P ∪ Q ∪ E) and at least one edge from P (i.e. the new equivalence information would create an inconsistency in P*), the cycle could be represented as (a0, a1) ∈ P, and (a1, a2), …, (an-1, a0) would contain at most one edge from E; thus the cycle is either of the form P ∘ (P ∪ Q), or of the form P ∘ (P ∪ Q) ∘ E ∘ (P ∪ Q). In the first case there can be no cycle, because otherwise P and Q would already have been inconsistent. If there were a cycle in the second case, there would exist objects a, b ∈ O such that (a, x) ∈ P, (x, y) ∈ E, and (y, b), (b, a) ∈ (P ∪ Q); then (y, x) = (y, a) ∘ (a, x) ∈ (P ∪ Q) ∘ P ⊆ P, contradicting (x, y) ∉ Pconv.
Because of T = (P ∪ Q ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q) ∪ (Q ∘ E ∘ Q)) and P* = T \ Q*, and since (P ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q)) ∩ Q* = ∅ (if the intersection were not empty then, due to Q* being symmetric, there would be a cycle in P* with edges from (P ∪ Q ∪ E) and at least one edge from P, contradicting condition 1.4 above), we finally get P* = (P ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q)).
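On small examples, Theorem 2's composition formula can be cross-checked against the closure-based construction (an illustrative sketch; Definition 8 is assumed here to define T as the transitive closure of P ∪ Q ∪ E, Q* as its symmetric part, and P* = T \ Q*):

```python
def compose(r, s):
    """Relational composition: (a, d) whenever (a, b) in r and (b, d) in s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def transitive_closure(r):
    closure = set(r)
    while True:
        new = compose(closure, closure)
        if new <= closure:
            return closure
        closure |= new

P = {("a", "b"), ("c", "d")}           # two strict preference edges
Q = {(v, v) for v in "abcd"}           # minimal (reflexive) equivalence
E = {("b", "c"), ("c", "b")}           # new equivalence between b and c

T = transitive_closure(P | Q | E)
Q_star = {(u, v) for (u, v) in T if (v, u) in T}
P_star_def = T - Q_star                # P* from the closure-based definition

# P* from the local composition formula of Theorem 2:
P_star_thm = (P | compose(compose(P, E), P)
                | compose(compose(Q, E), P)
                | compose(compose(P, E), Q))
print(P_star_thm == P_star_def)        # True
```

Here the new equivalence makes the previously incomparable edges ("a", "c"), ("a", "d") and ("b", "d") appear in both constructions.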
We have now found a way to derive P* in the case of a new incremental equivalence relationship, but P* is still a large relation containing all transitive information. We will now show that we can also get P* by just manipulating a respective preference diagram in a very local fashion. Locality here amounts to only having to deal with edges that are directly adjacent, in the preference diagram, to the additional edges in E. Let us define an abbreviated form of writing such edges:
Definition 10: (Set Shorthand Notations)
Let R be a binary relation over a set of database objects O and let x ∈ O. We write:
(_ R x) := { y ∈ O | (y, x) ∈ R } and
(x R _) := { y ∈ O | (x, y) ∈ R }
If R is an equivalence relation, we write the objects in the equivalence class of x in R as:
R[x] := { y ∈ O | (x, y), (y, x) ∈ R }
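The shorthand sets of Definition 10 correspond to simple lookups (an illustrative sketch with invented function names):

```python
def pred_set(R, x):
    """(_ R x): all y with (y, x) in R."""
    return {y for (y, z) in R if z == x}

def succ_set(R, x):
    """(x R _): all y with (x, y) in R."""
    return {y for (z, y) in R if z == x}

def eq_class(R, x):
    """R[x]: the equivalence class of x when R is an equivalence relation."""
    return {y for (a, y) in R if a == x and (y, x) in R}

R = {("a", "b"), ("c", "b"), ("b", "d")}
print(sorted(pred_set(R, "b")))   # ['a', 'c']
print(sorted(succ_set(R, "b")))   # ['d']

Q = {("a", "a"), ("b", "b"), ("a", "b"), ("b", "a")}
print(sorted(eq_class(Q, "a")))   # ['a', 'b']
```

With an index on each entry of the stored relation, both lookups become single index scans rather than full enumerations.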
With these abbreviations we will show which object sets have to be considered for actually calculating P* via a given preference diagram:
Theorem 3: (Calculation of PD(P)*)
Let O be a set of database objects and P, Pconv, and Q as in Definition 8, and let
E := {(x, y), (y, x)} be new equivalence information such that
(x, y), (y, x) ∉ (P ∪ Pconv ∪ Q).
If PD(P) ⊆ P is some preference diagram of P, and with
In general these sets will be rather small: first, they are derived only from the preference diagram, which is usually considerably smaller than the preference P; second, in these small sets there will usually be only few edges originating or ending in x or y. Furthermore, these sets can be computed easily using an index on the first and the second entry of the binary relations PD(P) and Q. Getting a set like, e.g., (_ PD(P) x) is then just an inexpensive index lookup of the type 'select all first entries from PD(P) where the second entry is x'.
Therefore we can calculate the incremented preference P* by simple manipulations on PD(P) and the computation of a transitive closure, as shown in the commuting diagram in Figure 5.
[Figure 5: commuting diagram. Extending P by E and taking the transitive closure yields the same relation as extending PD(P) by E and taking the transitive closure: P* = (PD(P)*)+]
each edge (ai-1, ai) ∈ (P ∪ Q ∪ S) and a0 := a, an := b. This chain can even be transcribed into a representation with edges from (P ∪ Q ∪ S) where the edge (x, y) occurs at most once. This is because, if (x, y) occurred twice, the two edges would enclose a cycle that can be removed.
Since we have assumed Q to be compatible with P in Definition 8, we know T = P ∪ Q ∪ ((P ∪ Q) ∘ S ∘ (P ∪ Q)), and (like in Theorem 2) the edge (x, y) is consistent with the already known information, because there are no cyclic paths in T containing edges of P (cf. 4.c in Theorem 1): if there were a cycle with edges in (P ∪ Q ∪ S) and at least one edge from P (i.e. the new preference information would create an inconsistency in P*), the cycle could be represented as (a0, a1) ∈ P, and (a1, a2), …, (an-1, a0) would contain at most one edge from S; thus the cycle is either of the form P, or of the form P ∘ (P ∪ Q) ∘ S ∘ (P ∪ Q). In the first case there can be no cycle, because otherwise P would already have been inconsistent. If there were a cycle in the second case, there would exist objects a, b ∈ O such that (a, x) ∈ P and (y, b), (b, a) ∈ (P ∪ Q); then (y, x) = (y, a) ∘ (a, x) ∈ (P ∪ Q) ∘ P ⊆ P, contradicting (x, y) ∉ Pconv.
Similarly, there is no cycle with edges in (Q ∪ S) and at least one edge from S either: if there were such a cycle, it could be transformed into the form S ∘ Q, i.e. (x, y) ∘ (a, b) would be a cycle with (a, b) ∈ Q, forcing (a, b) = (y, x) ∈ Q and thus, due to Q's symmetry, a contradiction to (x, y) ∉ Q.
Because of T = (P ∪ Q ∪ (P ∘ S ∘ P) ∪ (Q ∘ S ∘ P) ∪ (P ∘ S ∘ Q) ∪ (Q ∘ S ∘ Q)) and P* = T \ Q*, and since (P ∪ (P ∘ S ∘ P) ∪ (Q ∘ S ∘ P) ∪ (P ∘ S ∘ Q) ∪ (Q ∘ S ∘ Q)) ∩ Q* = ∅ (if the intersection were not empty then, due to Q* being symmetric, there would be a cycle in P* with edges from (P ∪ Q ∪ S) and at least one edge from P, contradicting condition 1.4 above), we finally get P* = (P ∪ (P ∘ S ∘ P) ∪ (Q ∘ S ∘ P) ∪ (P ∘ S ∘ Q) ∪ (Q ∘ S ∘ Q)).
4 Conclusion
In this paper we laid the foundation to efficiently compute incremented skylines driven by user interaction. Building on and extending the often-used notion of Pareto optimality, our approach allows users to interactively model their preferences and explore the resulting generalized skyline sets. New domination relationships can be specified by incrementally providing additional information like new preferences, equivalence relations, or acceptable trade-offs. Moreover, we investigated the efficient evaluation of incremented generalized skylines by considering only those relations that are directly affected by a user's changes in preference information. The actual computation takes advantage of the local nature of incremental changes in preference information, leading to far superior performance over the baseline algorithms.
Although this work advances the application of the skyline paradigm in real-world applications, several challenges still remain largely unresolved. For instance, the time necessary for computing initial skylines is still too high, hampering the paradigm's applicability in large-scale scenarios. Here, introducing suitable index structures, heuristics, and statistics might prove beneficial.
References
[1] W.-T. Balke, U. Güntzer, C. Lofi. Eliciting Matters - Controlling Skyline Sizes by Incremental Integration of User Preferences. Int. Conf. on Database Systems for Advanced Applications (DASFAA), Bangkok, Thailand, 2007.
[2] W.-T. Balke, U. Güntzer, C. Lofi. User Interaction Support for Incremental Refinement of Preference-Based Queries. 1st IEEE Int. Conf. on Research Challenges in Information Science (RCIS), Ouarzazate, Morocco, 2007.
[3] W.-T. Balke, U. Güntzer, W. Siberski. Getting Prime Cuts from Skylines over Partially Ordered Domains. Datenbanksysteme in Business, Technologie und Web (BTW), Aachen, Germany, 2007.
[4] W.-T. Balke, U. Güntzer. Multi-objective Query Processing for Database Systems. Int. Conf. on Very Large Data Bases (VLDB), Toronto, Canada, 2004.
[5] W.-T. Balke, M. Wagner. Through Different Eyes - Assessing Multiple Conceptual Views for Querying Web Services. Int. World Wide Web Conference (WWW), New York, USA, 2004.
[6] W.-T. Balke, U. Güntzer, W. Siberski. Exploiting Indifference for Customization of Partial Order Skylines. Int. Database Engineering and Applications Symp. (IDEAS), Delhi, India, 2006.
[7] W.-T. Balke, J. Zheng, U. Güntzer. Efficient Distributed Skylining for Web Information Systems. Int. Conf. on Extending Database Technology (EDBT), Heraklion, Greece, 2004.
[8] W.-T. Balke, J. Zheng, U. Güntzer. Approaching the Efficient Frontier: Cooperative Database Retrieval Using High-Dimensional Skylines. Int. Conf. on Database Systems for Advanced Applications (DASFAA), Beijing, China, 2005.
[8] J. Bentley, H. Kung, M. Schkolnick, C. Thompson. On the Average Number of Maxima in a Set of Vectors and Applications. Journal of the ACM (JACM), vol. 25(4), 1978.
[9] S. Börzsönyi, D. Kossmann, K. Stocker. The Skyline Operator. Int. Conf. on Data Engineering (ICDE), Heidelberg, Germany, 2001.
[10] C. Boutilier, R. Brafman, C. Geib, D. Poole. A Constraint-Based Approach to Preference Elicitation and Decision Making. AAAI Spring Symposium on Qualitative Decision Theory, Stanford, USA, 1997.
[11] L. Chen, P. Pu. Survey of Preference Elicitation Methods. EPFL Technical Report IC/2004/67, Lausanne, Switzerland, 2004.
[12] J. Chomicki. Preference Formulas in Relational Queries. ACM Transactions on Database Systems (TODS), vol. 28(4), 2003.
[13] J. Chomicki. Iterative Modification and Incremental Evaluation of Preference Queries. Int. Symp. on Foundations of Information and Knowledge Systems (FoIKS), Budapest, Hungary, 2006.
[14] P. Godfrey. Skyline Cardinality for Relational Processing. Int. Symp. on Foundations of Information and Knowledge Systems (FoIKS), Wilhelminenburg Castle, Austria, 2004.
[15] W. Kießling. Foundations of Preferences in Database Systems. Int. Conf. on Very Large Data Bases (VLDB), Hong Kong, China, 2002.
[16] V. Koltun, C. Papadimitriou. Approximately Dominating Representatives. Int. Conf. on Database Theory (ICDT), Edinburgh, UK, 2005.
[17] D. Kossmann, F. Ramsak, S. Rost. Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. Int. Conf. on Very Large Data Bases (VLDB), Hong Kong, China, 2002.
[18] M. McGeachie, J. Doyle. Efficient Utility Functions for Ceteris Paribus Preferences. Conf. on Artificial Intelligence and Conf. on Innovative Applications of Artificial Intelligence (AAAI/IAAI), Edmonton, Canada, 2002.
[19] D. Papadias, Y. Tao, G. Fu, B. Seeger. An Optimal and Progressive Algorithm for Skyline Queries. Int. Conf. on Management of Data (SIGMOD), San Diego, USA, 2003.
[20] J. Pei, W. Jin, M. Ester, Y. Tao. Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces. Int. Conf. on Very Large Data Bases (VLDB), Trondheim, Norway, 2005.
[21] T. Saaty. A Scaling Method for Priorities in Hierarchical Structures. Journal of Mathematical Psychology, 1977.
[22] T. Xia, D. Zhang. Refreshing the Sky: The Compressed Skycube with Efficient Support for Frequent Updates. Int. Conf. on Management of Data (SIGMOD), Chicago, USA, 2006.
[23] Y. Yuan, X. Lin, Q. Liu, W. Wang, J. Yu, Q. Zhang. Efficient Computation of the Skyline Cube. Int. Conf. on Very Large Data Bases (VLDB), Trondheim, Norway, 2005.
1 Introduction
In the past decade, companies and public sector organizations developed an increased understanding that true connectedness and participation in the networked economy, or in virtual value webs, would not happen merely through applications of technology like Enterprise Resource Planning (ERP), Enterprise Application Integration middleware, or web services. The key lesson they learnt was that it would happen only if organizations changed the way they run their operations and integrated them well into cross-organizational business processes [1]. This takes at least 2-3 years and implies the need to (i) align changes in the business processes to technology changes and (ii) be able to anticipate and support complex decisions impacting each of the partner organizations in a network and their enterprise systems (ES).
In this context, Enterprise Architecture (EA) increasingly becomes critical, for
Section 5 reports on how both classes of models agree and disagree. Section 6
reports on and discusses findings from our case study. In Section 7, we check the
consistency between the findings from the literature survey and the ones from
the case study. We summarize conclusions and research plans in Section 8.
architects (E2ACMM). These two criteria both mean to assess the extent to
which business stakeholders are actively kept involved in the architecture
processes. When compared on a symbol-by-symbol, contents-by-contents and
code-by-code basis, the definitions of these two criteria indicate that they both
mean to measure a common aspect, namely how frequently and how actively
business representatives participate in the architecture process and what the level
of business representatives' awareness of architecture is.
An extract of our analysis findings is presented in Table 1. It reports on a set of assessment criteria that we found to be semantically equivalent in two models, namely the E2ACMM [23] and the DoC ACMM [4].

E2ACMM                   | DoC ACMM
Business Linkage         | Strategic Governance
Governance               | Architecture Process
Architecture Development | Architecture Communication
IT security              | Architecture Development
training. What these models have in common is that they are all meant as theoretical frameworks for analysing, both retrospectively and prospectively, the business value of ES. It is important to note that organizations repeatedly go through various maturity stages when they undertake major upgrades or replacements of ES.
As system evolution adds the concept of time to these frameworks, they tend
to structure ES experiences in terms of stages, starting conditions, goals, plans
and quality of execution. First, the model by Markus et al. [16] allocates
elements of ES success to three different points in time during the system life
cycle in an organization: (i) the project phase in which the system is configured
and rolled out, (ii) the shakedown phase in which the organization goes live
and integrates the system in their daily routine, and (iii) the onward and upward
phase, in which the organization gets used to the system and is going to
implement additions. Success in the shakedown phase and in the onward and
upward phase is influenced by ES usage maturity. For example, observations
like (i) a high level of successful improvement initiatives, (ii) a high level of
employees' willingness to work with the system, and (iii) frequent adaptations in
new releases, are directly related to a high level of ES usage maturity. Second,
the ERP Maturity Model by Ernst & Young, India [7] places the experiences in
context of creating an adaptable ERP solution that meets changing processes,
organization structures and demand patterns. This model structures ERP
adopters' experiences into three stages: (i) chaos, in which the adopter may lose
the alignment of processes and ERP definition, reverts to old habits and routines,
and complements the ERP system usage with workarounds, (ii) stagnancy, in
which organizations are reasonably satisfied with the implemented solution but
had hoped for higher return-on-investment rates and, therefore, they refine
and improve the ES usage to get a better business performance, and (iii) growth
in which the adopter seeks strategic support from the ES and moves its focus
over to profit, working capital management and people growth. Third, the staged
maturity model by Holland et al. [13] suggests three stages, as shown in Table 2.
It is based on five assessment criteria that reflect how ERP adopters progress
to a more mature level based on increased ES usage.
Our comparative analysis of the definitions of the assessment criteria pointed
out that the number of common factors that make up the criteria of these three
models is less than 30%. The common factors are: (1) a shared vision of how the
ES contributes to the organization's bottom line, (2) use of the ES for strategic
purposes, (3) tight integration of processes and ES, and (4) executive
sponsorship. In the next section, we refer to these common criteria when we
compare the models for assessing ES usage maturity to the ones for assessing
architecture maturity.
99
Table 2: The staged maturity model of Holland et al. [13]. Constructs: Strategic
Use of IT, Organizational Sophistication, Penetration of the ERP System,
Vision, Drivers & Lessons.

Strategic Use of IT
  Stage 1: retention of responsible people; no CIO (anymore); IS does not
  support strategic decision-making
  Stage 2: ES is used at a low level for strategic decision-making; IT strategy
  is regularly reviewed; high ES importance
  Stage 3: strong vision; organization-wide IT strategy; CIO on the senior
  management team

Organizational Sophistication
  Stage 1: no process orientation; very little thought about information flows;
  no culture change
  Stage 2: significant organizational change; improved transactional efficiency
  Stage 3: process-oriented organization; top-level support and strong
  understanding of ERP implications; truly integrated organization

Penetration of the ERP System
  Stage 2: most business groups / departments are supported; high usage by
  employees
  Stage 3: users find the system easy to use; higher-level uses are identified;
  other IT systems can be connected

Drivers & Lessons
  Key drivers (early stages): reduction in costs; replacement of legacy systems;
  integrating all business processes; improved access to management information
  Key drivers (later stages): single supply chain; replacement of legacy systems;
  performance-oriented culture; internal and external benchmarking
ES MMMs contain: vision; strategic decision-making, transformation & support;
coherence between the big-picture view & local project views; process definition;
alignment of people, processes & applications with goals; business involvement
& buy-in; drivers.
as business initiatives rather than IT projects and had strong commitment from
top management.
IT investment and acquisition strategy: IT was critical to the company's
success and market share. Investments in applications were made as the result of
a strategic planning process.
Architecture process: The architecture process was institutionalized as a part
of the corporate Project Office. It was documented in terms of key activities and
key deliverables. It was supported by means of standards and tools.
Architecture development: All major areas of business, e.g. all core business
processes, a major portion of the support processes, and all data subject areas,
were architected according to Martin's methodology [17]. The architecture team
had a good understanding of which architecture elements were rigid and which
were flexible.
Architecture communication: Architecture was communicated by the
Project Office Department and by the process owners. The IT team was not
consistently successful in marketing the architecture services. There were ups
and downs, as poor stakeholder involvement impacted the effectiveness of the
architecture team's interventions.
IT security: IT security was considered one of the highest corporate
priorities. The manager of this function was part of the business, not of the IT
function, and reported directly to the Vice-President of Business Development.
this led to cultural conflicts. Another problem was the unwillingness to change
the organization. People were afraid that the new ways of working were not as
easy as before and, therefore, they undermined the process.
Penetration of the ERP system: The degree of involvement of process
owners in the implementation directly determined the results. The process
owners were committed to reusing their old processes, which led to significant
customization efforts. The penetration of the ERP can be assessed by two
indicators: the number of people who use the system or the number of processes
covered. The latter gives a clearer picture of use than the former, because many
employees hold functions that have nothing to do with the ES. Examples of such
functions were field technicians in cell-site building and call-center
representatives. In our case study organization, 30-40% of the business processes
were covered by SAP, and coverage was still being extended.
Vision: The company wanted to achieve a competitive advantage by
implementing the ES. Because this was a costly initiative, it made consistent
efforts to maximize the value of the ES investments and to extend them to
non-core activities and the back office.
Drivers & Lessons: The company's drivers were: (i) integration of sites and
locations, (ii) reducing transaction costs, and (iii) replacement of legacy
applications. There was a steep learning curve throughout the process. Some
requirements engineering activities, such as requirements prioritization and
negotiation, went wrong at first, but solutions were found during the process.
More about the lessons learned in the requirements process can be found in [2].
Drivers-and-Lessons criterion.
7. We found no correlation between a highly-mature Architecture
Development and the ES UMM criterion of Organizational Sophistication.
Stakeholders saw process architecture deliverables as tools to communicate their
workflow models to other process owners. All agreed that process models made
process knowledge explicit. But business users also raised a shared concern
about the short life-span of the architecture-compliant ERP process models. Due
to the market dynamics in the telecommunications sector, process models tended
to become outdated, on average, every six weeks. Modelling turned out to be an
expensive exercise, taking on average at least three days of an architect's
full-time work and one day of a process owner's time. Keeping the models up to
date was found resource-consuming, and business users saw little value in doing this.
To sum up, high architecture maturity does not necessarily imply coordination
in determining ES priorities and drivers; neither can it turn an ES initiative into a
systematic learning process.
While the architecture maturity in the beginning of the project was very high,
the organization could not set up a smooth implementation process for the first
six ERP projects. So, at the time of the first assessment, the ES usage maturity
was low (stage 1) although the company had clarity on the strategic use of IT
and treated the ES implementation projects as business initiatives and not as IT
projects.
Survey Study    Case Study
yes             yes
yes             yes
yes             no
yes             yes
yes             no
yes             yes
yes             no
no              yes

Table 4: Consistency check in the findings of the survey and the case study
Next, our findings suggest that three factors were identified in the survey but not
in the case study. One factor was found in the case study but not in the survey.
8 Conclusions
In the past decade, awareness of IT governance in organizations increased and
many have also increased their spending in EA and ES with the expectation that
these investments will bring improved business results. However, some
organizations appear to be more mature than others in how they use EA and ES
for their advantage and do get better value out of their spending. This has
opened the need to understand what it takes for an organization to be more
mature in EA and ES usage and how an organization measures up by using one
of the numerous maturity models available in the market. Our study is one
attempt to answer this question. We outlined a comparative strategy for
researching the multiple facets of a correlation relationship existing between
these two types of maturity models, namely for EA and ES. We used a survey
study and a case study of one company's ERP experiences in order to get a
deeper understanding of how these assessment criteria refer to each other. We
found that the two types of maturity models rest on a number of overlapping
assessment criteria; however, the interpretation of these criteria in each maturity
model can differ. Furthermore, our findings suggest that a well-established
architecture function in a company does not imply that there is support for an
ES-implementation. This leads to the conclusion that high architecture maturity
does not automatically guarantee high ES usage maturity.
In terms of research methods, our experiences in merging a case study and a
literature survey study suggest that a multi-analyses approach is necessary for a
deeper understanding of the correlations between architecture and ES usage. The
present study shows that a multi-analyses method helps revise our view of
maturity to better accommodate the cases of ES and EA from an IT governance
perspective and provides rationale for doing so. By applying a multi-analyses
approach to this research problem, our study departs from past framework
comparison studies. Moreover, this study extends previous research by providing
a conceptual basis to explicitly link the assessment criteria of two types of
models in terms of symbols, contents and codified good practices. In our case
study, we have chosen to use qualitative assessments of EA and ES maturity,
instead of determining quantitative maturity measurement according to the
models. The nature of the semiotic analysis, however, makes specific
descriptions of linkages between EA and ES usage difficult.
Many open and far-reaching questions result from this first exploration. Our
initial but not exhaustive list includes the following lines for future research:
1. Apply content analysis methods [24] to selected architecture and ES usage
models to check the repeatability of the findings of this research.
2. Analyze how EA is used in managing strategic change. This will be done
by carrying out case studies at company sites.
3. Refining ES UMM concepts. The ES UMM was developed at the time of
the year 2000 ERP boom and certainly needs revisions to reflect the most recent
References
[1] J. Champy, X-Engineering the Corporation, Warner Books, New York, 2002.
[2] M. Daneva, ERP requirements engineering practice: lessons learned, IEEE Software,
March/April 2004.
[3] T. Davenport, The future of enterprise system-enabled organizations, Information Systems
Frontiers 2(2), 2000, pp. 163-180.
[4] Department of Commerce (DoC), USA Government: Introduction IT Architecture Capability
Maturity Model, 2003, https://secure.cio.noaa.gov/hpcc/docita/files/
acmm_rev1_1_05202003.pdf
[5] L. Dube, G. Pare, Rigor in information systems positivist research: current practices, trends, and
recommendations, MIS Quarterly, 27(4), 2003, pp. 597-635.
[6] U. Eco, A Theory of Semiotics, Bloomington, Indiana, University of Indiana Press, 1996.
[7] Ernst & Young LLT: Are you getting the most from your ERP: an ERP maturity model,
Bombay, India, April 2002.
[8] Federal Enterprise Architecture Program Management Office, NASCIO, US Government,
April,
http://www.feapmo.gov/resources/040427%20EA%20Assessment%20Framework.pdf,
2004.
[9] Fonstad, D. Robertson, Transforming a Company, Project by Project: The IT Engagement
Model, Sloan School of Management, CISR report, WP No363, Sept 2006.
[10]N. Fenton, S.L. Pfleeger, Software Metrics: a Rigorous & Practical Approach, International
Thompson Publ., London, 1996.
[11]Gartner Architecture Maturity Assessment, Gartner Group Stamford, Connecticut, Nov. 2002.
[12]W. van Grembergen, R. Saull, Aligning business and information technology through the
balanced scorecard at a major Canadian financial group: its status measured with an IT BSC
maturity model. Proc.of the 34th Hawaii Intl Conf. on System Sciences, 2001.
[13]C. Holland, B. Light, A stage maturity model for enterprise resource planning systems use, The
DATABASE for Advances in Information Systems, 32(2), 2001, pp. 34-45.
[14]J. Lee, K. Siau, S. Hong, Enterprise integration with ERP and EAI, Communications of the
ACM, 46(2), 2002.
[15]M. Markus, Paradigm shifts E-business and business / systems integration, Communications
of the AIS, 4(10), 2000.
[16]M. Markus, S. Axline, D. Petrie, C. Tanis, Learning from adopters' experiences with ERP:
problems encountered and success achieved, Journal of Information Technology, 15, 2000, pp.
245-265.
[17]J. Martin, Strategic Data-planning Methodologies. Prentice Hall, 1982.
[18]META Architecture Program Maturity Assessment: Findings and Trends, META Group,
Stamford, Connecticut, Sept. 2004.
[19]A. Rafaeli, M. Worline, Organizational symbols and organizational culture, in Ashkenasy &
C.P.M. Wilderom (Eds.) International Handbook of Organizational Climate and Culture, 2001,
71-84.
[20]J. Ross, P.Weill, D. Robertson, Enterprise Architecture as Strategy: Building a Foundation for
Business Execution, Harvard Business School Press, July 2006.
[21]P. Weill, J.W. Ross, How effective is your IT governance? MIT Sloan Research Briefing,
March 2005.
[22] R.K. Yin, Case Study Research, Design and Methods, 3rd ed. Newbury Park, Sage Publications,
2002.
[23] J. Schekkerman, Enterprise Architecture Score Card, Institute for Enterprise Architecture
Developments, Amersfoort, The Netherlands, 2004.
[24] S. Stempler, An overview of content analysis, Journal of Practical Assessment, Research &
Evaluation, 7(17), 2001.
Acknowledgement: This research has been carried out with the financial
support of the Netherlands Organisation for Scientific Research (NWO) under
the Collaborative Alignment of Cross-organizational Enterprise Resource
Planning Systems (CARES) project. The authors thank Claudia Steghuis for
providing us with Table 1.
1 Introduction
Humanity has long felt the need for mechanisms that make slow, routine and
complex tasks easier. Calculations and data storage are two examples, and many
tools have been created to solve these problems. The personal computer is a
working instrument that carries out many of them. The interconnection of these
devices allows information to be released and gives users the chance to share
content, thus improving task execution efficiency. The Internet is inevitably a
consequence of the use and proliferation of computers. Its constant evolution can
be framed in terms of several features, such as infrastructure, development tools
(increasingly easy to use) and user interfaces. The objective of this paper is to
explore the new human-computer interface and its implications for society.
Web interfaces are based on languages of their own, which are used to write
documents that, by means of appropriate software (browsers, interpreters), can
be read, visualized and executed. The first programs to present the contents of
these documents used simple interfaces that contained only text. Nowadays they
support text in various formats, charts, images, videos and sounds, among others.
As information visualization software progresses, along with the need to make this
Jacques Schreiber, Gunter Feldens
Eduardo Lawisch, Luciano Alves
information available to more users, more complex interfaces arise, using
innovative hardware and software systems. Then come technologies that use
voice and dialogue recognition systems to provide and collect data. The
motivation for developing this work comes from the ever-expanding fixed and
mobile telephone services in Brazil, particularly over the past decade, a
development that will allow a broader range of users to access the contents of the
Web. There are also social motivations, for example, Internet access for citizens
with special needs, such as visually impaired people.
In spite of all this, the protocol has some limitations, among them the inability to
establish a channel for transporting audio data in real time. This is a
considerable limitation and makes it impossible to use WAP as a solution for
supplying Web content through a voice interface.
Figure 3: Architecture details (components: server, request, document,
VoiceXML interpreter context, VoiceXML interpreter, implementation platform)
Analyzing in more detail the tasks executed by the VoiceXML gateway
(Figure 4), it becomes clear that the interpretation of the scripts and the
interaction with the user are actions controlled by the gateway. The gateway
consists of a set of hardware and software elements that form the heart of the
VoiceXML technology (the VoiceXML Interpreter and the VoiceXML Interpreter
Context described above are components of the gateway). Essentially, they
furnish user-interaction mechanisms analogous to browsers in a conventional
HTTP service. Calls are answered by the telephony services and by the signal
processing component.
[Figure 4 components: audio reproduction, TTS services, HTTP clients]
scale telephone centers used by many institutions. The architecture allows users
to request that their call be transferred to an operator, and allows the technology
to implement that request easily.
When a call is received, the VoiceXML Interpreter starts to check and execute
the instructions contained in the VoiceXML scripts. As mentioned before, when
the executing script requests an answer from the user, the interpreter passes
control to the recognition system, which listens to and interprets the user's reply.
The recognition system is totally independent of the other gateway components.
The interpreter may use any compatible client/server recognition system, and
may even change systems during execution in order to improve performance.
Another way of collecting data is key recognition, which results in DTMF tones
that are interpreted to let the user supply information to the system, such as
access passwords.
For the development of this system, a free JSP application server
(http://www.eatj.com) and a free VoiceXML gateway (http://cafe.bevocal.com)
were used. The next phase of the project was the development of the
VoiceXML documents that make up the interface. The objective was to create a
voice interface in Portuguese, but unfortunately free VoiceXML gateways capable
of supporting this language do not exist yet. Therefore, we developed a version
that does not use the graphic signs typical of the Portuguese language, such as
accents and cedillas; even so, the dialogues are understood by Brazilian users.
An analysis of the implementation showed that altering the system to support the
Portuguese language is relatively simple, since the only changes needed are to
replace the text to be pronounced with Portuguese text and to force the platform
to use this language (VoiceXML provides the xml:lang="pt-BR" attribute for this
purpose).
A. System with dialogues in Portuguese
The interface and the dialogues it should provide, as well as the information to
be produced by the system, were studied and designed by the team, under the
guidance of the professor of Special Topics at the Unisc (University of Santa
Cruz do Sul) Computer Science College. After agreeing that the functionalities
offered through the voice interface should be identical to the service existing on
the Web, it was necessary to implement the dialogues in VoiceXML and to
create dynamic pages so that the information contained in the database could be
offered to the student users. Dialogue organization and the corresponding
database were established to make the resulting interface comply with the
specifications (Figure 9).
[Figure 9 components: Start.jsp, SaveEnrollment.jsp, DisciplineRN.java,
EnrollmentRN.java, Function.java]
An analysis of Figure 9 shows that the user starts interacting with the system
through the program start.jsp, which plays an initial greeting and asks for the
student's name and enrolment number:
<%
try {
    // Open a database connection and look up the student by enrolment number
    ConnectionBean conBean = new ConnectionBean();
    Connection con = conBean.getConnection();
    DisciplineRN disc = new DisciplineRN(con);
    GeneralED edResearch = new GeneralED();
    String enrolmentNumber = request.getParameter("nroMatricula");
    edResearch.put("codeStudent", enrolmentNumber);
    String name = disc.listStudent(edResearch);
    DevCollection lstDisciplines;
    // Indices 2..6 map to Monday..Friday
    String day[] =
        {"", "", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "end"};
%>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <field name="enrolment">
    <prompt>
      Welcome to Unisc-Phone, now you can do your enrolment at home,
      by phone. Please inform your enrolment number.
    </prompt>
  </field>
  ...
  <block>
    <%=name%>, now we will start with your re-enrolment.
  </block>
  <form id="enrolment">
    <% for (int noDay = 2; noDay <= 6; noDay++) { %>
    <field name="re-enrol_<%=day[noDay]%>">
      <prompt>
        do you want to re-enroll on <%=day[noDay]%>?
      </prompt>
      <grammar>
        [yes no]
      </grammar>
      <filled>
        <if cond="re-enrol_<%=day[noDay]%> == 'yes'">
          Let's get started...
          <goto nextitem="discipline_<%=day[noDay]%>"/>
        <else/>
          you won't have lessons on <%=day[noDay]%>,
          <% if (noDay == 6) { %>
          The re-enrolment process has been concluded, sending data to server...
          <submit method="post"
            namelist="discipline_<%=day[2]%> discipline_<%=day[3]%>
                      discipline_<%=day[4]%> discipline_<%=day[5]%>
                      discipline_<%=day[6]%>"
            next="http://matvoices.s41.eatj.com/EnrolmentVoice/saveEnrolment.jsp"/>
          ...
</vxml>
[Spoken prompt: "now we will start your re-enrolment"]
4 Conclusions
The rising business opportunities offered by the Internet as a result of
never-ending technology improvement, at both the infrastructure and interface
levels, and the growing number of users translate into a diversification of needs
at the interface level, triggering the appearance of new technologies. It was in
this context that the voice interface was created. VoiceXML comes as a response
to these needs because it allows human-computer dialogues as a means of
providing users with information. Although fully specified, this technology is
still at the study and development stage. The companies that make up the
VoiceXML Forum are doing their best to spread the system rapidly.
Nevertheless, there are still shortfalls, for example the lack of speech recognition
engines and TTS engines for languages like Portuguese that are compatible with
the existing gateways.
The system described above represents a step forward for the Brazilian
scientific community, which lacks practical applications that materialize its
research and point toward the new kinds of human-computer interfaces within
the Brazilian context.
References
[1] "WAP Forum" (http://www.wapforum.org).
[2] P. A. Heeman, "Modeling Speech Repairs and Intonational Phrasing to Improve Speech
Recognition", Computer Science and Engineering, Oregon Graduate Institute of Science and
Technology.
[3] "Speech Recognition Technologies Are NOT All Alike" (http://www.comman~corp.com/).
[4] "Speech Synthesis Markup Language Specification for the Speech Interface Framework"
(http://www.w3.org/TR/2001/WD-speech-synthesis-20010103).
[5] S. Ihnen, "Developing With VoiceXML: Overview and System Architecture", Applications
Development, SpeechHost, Inc.
[6] "WAP White Paper", http://www.wapforum.org/what/WAPWhite_Paper1.pdf
1 Introduction
In recent years, ontologies have played a key role in the Semantic Web [1] area of
research. The term has been used in various areas of Artificial Intelligence [2] (i.e.
knowledge representation, database design, information retrieval, knowledge
management, and so on), so that finding a unique meaning for it becomes a subtle
issue. In a philosophical sense, the term ontology refers to a system of categories
meant to achieve a common sense of the world [3]. From the FRISCO Report [4] point of
view, this agreement has to be established not only through the relationships
between humans and objects, but also through human-to-human interactions.
In the Semantic Web an ontology is a formal conceptualization of a domain of interest, shared among heterogeneous applications. It consists of entities, attributes,
relationships and axioms to provide a common understanding of the real world
[5, 3, 6, 4]. With the support of ontologies, users and systems can communicate
with each other through an easy information integration [7]. Ontologies help people and machines to communicate concisely by supporting information exchange
based on semantics rather than just syntax.
Nowadays, there are ontology applications where information is often vague
and imprecise, for instance, the semantic-based applications of the Semantic Web,
such as e-commerce, knowledge management, web portals, etc. Thus, one of the
key issues in the development of the Semantic Web is to enable machines to
exchange meaningful knowledge across heterogeneous applications to reach the
users' goals. Ontology provides a semantic structure for sharing concepts across different
applications in an unambiguous way. The conceptual formalism supported by a
typical ontology may not be sufficient to represent uncertain information that is
commonly found in many application domains. For example, keywords extracted
by many queries in the same domain may not be considered with the same relevance, since some keywords may be more significant than others. Therefore, the
need to give different interpretations according to context emerges. Furthermore,
humans use linguistic adverbs and adjectives to specify their interests and
needs (e.g., users can be interested in finding a "very fast" car, a wine with a
"very strong" taste, a "fairly cold" drink, and so on). The necessity to handle the
richness of the natural languages used by humans emerges.
A possible solution to treat uncertain data and, hence, to tackle these
problems, is to incorporate fuzzy logic into ontologies. The aim of fuzzy set
theory [8], introduced by L. A. Zadeh [9], is to describe vague concepts through a
generalized notion of set, according to which an object may belong to a set with
a certain degree (typically a real number in the interval [0, 1]). For instance, the
semantic content of a statement like "Cabernet is a deep red acidic wine" might
have a degree, or truth-value, of 0.6. Up to now, fuzzy sets and ontologies have been jointly
used to resolve uncertain information problems in various areas, for example, in
text retrieval [10, 11, 12] or to generate a scholarly ontology from a database in
ESKIMO [13] and FOGA [14] frameworks. The FOGA framework has been recently applied in the Semantic Web context [15].
However, there is not a complete fusion of Fuzzy Set Theory with ontologies in
any of these examples.
In the literature we can find some attempts to integrate fuzzy logic directly into
ontologies, for instance in the context of medical document retrieval [16] and in
ontology-based queries [17]. In particular, in [16] the integration is obtained by
adding a degree of membership to all terms in the ontology to overcome the
overloading problem, while in [17] a query enrichment is performed. This is done
by inserting a weight that introduces a similarity measure among the taxonomic
relations of the ontology. Another proposal is an extension of the ontology
domain with fuzzy concepts and relations [18]; however, it is applied only to
Chinese news summarization. Two well-formed definitions of fuzzy ontology can
be found in [19, 20].
In this paper we refer to the formal definition stated in [19]: a fuzzy ontology
is an ontology extended with fuzzy values assigned to the entities and relations
of the ontology. Furthermore, [19] shows how to insert fuzzy logic into the
ontology domain by extending the KAON 1 editor [21] to handle uncertain
information directly during the ontology definition, in order to enrich the
knowledge domain.
In a recent work, fuzzy ontologies have been used to model knowledge in
creative environments [22]. The goal was to build a digitally enhanced
environment supporting the creative learning process in architecture and
interaction design education. The numerical values are assigned during the fuzzy
ontology definition by the domain expert and by the user queries. There is a
continuous evolution of new relations among concepts and of new concepts
inserted into the fuzzy ontology. This evolutionary process leads the concepts to
arrange themselves in a characteristic topological structure describing a weighted
complex network. Such a network is neither a periodic lattice nor a random
graph [23]. This network has been introduced into an information retrieval
algorithm, in particular one adopted in a computer-aided creative environment.
Many dynamical systems can be modelled as a network, where vertices are the
elements of the system and edges identify the interactions between them [24].
Some examples are biological and chemical systems, neural networks, socially
interacting species, computer networks, the WWW, and so on [25]. Thus, it is
very important to understand the emerging behaviour of a complex network and
to study its fundamental properties. In this paper, we present how fuzzy
ontology relations evolve in time, producing a structure typical of complex
network systems. Two efficiency measures are used to study how well information
is exchanged over the network and how closely the concepts are tied [26].
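To make the notion of network efficiency concrete, here is a minimal Python sketch of a global-efficiency-style measure (the average of inverse shortest-path distances on a weighted graph). This is illustrative only; the exact measures used in [26] may differ in detail.

```python
# Sketch of a global-efficiency measure on a small weighted graph:
#   E = (1 / (N * (N - 1))) * sum over i != j of 1 / d_ij
# where d_ij is the shortest-path distance between vertices i and j.
import heapq

def shortest_paths(graph, source):
    """Dijkstra over a {node: {neighbor: weight}} adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def global_efficiency(graph):
    nodes = list(graph)
    n = len(nodes)
    total = 0.0
    for u in nodes:
        dist = shortest_paths(graph, u)
        for v in nodes:
            if v != u and v in dist and dist[v] > 0:
                total += 1.0 / dist[v]
    return total / (n * (n - 1))

# A triangle with unit weights: every pair at distance 1, so E = 1.0
net = {"a": {"b": 1.0, "c": 1.0},
       "b": {"a": 1.0, "c": 1.0},
       "c": {"a": 1.0, "b": 1.0}}
print(global_efficiency(net))  # 1.0
```

A disconnected or poorly connected network scores closer to 0, which is what makes this kind of measure useful for tracking how well a concept network supports information exchange.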
The rest of the paper is organized as follows: Section 2 presents the importance
of using fuzzy ontologies and the definition of a new concept network based
on their evolution in time. Section 3 introduces scale-free network notation,
small-world phenomena and efficiency measures on weighted network topologies,
while Section 4 discusses a new information retrieval algorithm exploiting the
concept network. In Section 5 we present experimental results that confirm the
scale-free nature of the fuzzy concept network. Section 6 presents other relevant
works found in the literature to situate our scope. Finally, Section 7 reports some
conclusions.
Fuzzy Ontology
In this section an in-depth study of how Fuzzy Set Theory [9] has been integrated into the ontology definition is presented, together with some preliminary applications of fuzzy ontologies. We will also show how to construct a novel concept network model relying on this definition.
1 The KAON project is a meta-project carried out at the Institute AIFB, University of Karlsruhe, and at the Research Center for Information Technologies (FZI).
2.1
A fuzzy set A over a universe U can be written as

A = {(u_i, f_A(u_i)) : u_i ∈ U}    (1)

where f_A : U → [0, 1] is the membership function of the fuzzy set A; f_A(u_i) indicates the degree of membership of u_i in A.
Finally, we can give the definition of fuzzy ontology presented in [19].
Definition 2 A fuzzy ontology is an ontology extended with fuzzy values assigned through the two functions

g : (Concepts ∪ Instances) × (Properties × Property_value) → [0, 1]    (2)

h : Concepts ∪ Instances → [0, 1]    (3)
where g is defined on the relations and h is defined on the concepts of the ontology.
2.2
From the practical point of view, using the given fuzzy ontology definition we can denote not only non-crisp concepts, but we can also directly include the property value according to the definition given in [27]. In particular, the knowledge domain has been extended with the quality concept, which becomes an application of the property values. This solution can be used in the tourist context to better define the meaning of a sentence like "this is a hot day". Furthermore, it is a usual practice to extend the set of concepts already present in the query with other ones which can be derived from an ontology. Generally, given a concept, the query is extended with its parents and children to enrich the set of displayed documents.
With a fuzzy ontology it is possible to establish a threshold value (defined by the
domain expert) in order to extend queries with instances of concepts that satisfy the chosen value [19]. This approach can be compared with [17], where query evaluation is determined through the similarity among the concepts and the hyponymy relations of the ontology.
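The threshold-based query extension described above can be sketched as follows; the toy ontology, its fuzzy values and the threshold are invented for illustration:

```python
# Minimal sketch of fuzzy-threshold query extension ([19]): a query concept is
# extended only with related entities whose fuzzy value reaches a threshold
# chosen by the domain expert. All data below are invented for illustration.

# concept -> list of (related entity, fuzzy value)
ontology = {
    "day": [("hot day", 0.9), ("summer day", 0.7), ("working day", 0.3)],
}

def extend_query(keywords, threshold=0.6):
    extended = list(keywords)
    for kw in keywords:
        for entity, value in ontology.get(kw, []):
            if value >= threshold:  # keep only sufficiently related entities
                extended.append(entity)
    return extended

print(extend_query(["day"]))  # ['day', 'hot day', 'summer day']
```

With a stricter threshold (e.g. 0.95) no instance qualifies and the query is left unchanged.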
In the literature, the problem of efficient query refinement has been tackled with a large number of different approaches in recent years. PASS is a method developed to automatically construct a fuzzy ontology (the associations among the concepts are found by analysing the document keywords [28]) that can be used to refine a user's query [29].
Another possible use of the fuzzy value associated to concepts has been adopted in the context of medical document retrieval, to limit the problems due to overloading of a concept in an ontology [16]. This also permits reducing the number of documents found, hiding those that do not fulfil the user's request.
A relevant goal achieved using fuzzy ontology has been the direct handling of concept modifiers in the knowledge domain. A concept modifier [30] has the effect of altering the fuzzy value of a property. Given a set of linguistic hedges such as "very", "more or less", "slightly", a concept modifier is a chain of one or more hedges, such as "very slightly" or "very very slightly", and so on. So, a user can write a statement like "Cabernet has a very dry taste". It is necessary to associate a membership modifier to any (linguistic) concept modifier.
A membership modifier has a value β > 0 which is used as an exponent to modify the value of the associated concepts [19, 31]. According to its effect on a fuzzy value, a hedge can be classified in two groups: concentration type and dilation type. The effect of a concentration modifier is to reduce the grade of a membership value; thus, in this case, it must be β > 1. For instance, the hedge "very" is usually assigned β = 2. So, if Cabernet has a dry taste with value 0.8, then Cabernet has a very dry taste with value 0.8² = 0.64. On the contrary, a dilation hedge has the effect of raising a membership value, that is, β ∈ (0, 1); the example is analogous to the previous one. This allows not only the enrichment of the semantics that ontologies usually offer, but it also gives the user the possibility to make a request without mandatory constraints.
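A minimal sketch of this mechanism follows. Only β = 2 for "very" is given in the text; the β values for the other hedges are assumptions:

```python
# Concept modifiers as membership-value exponents, following [19, 31].
# beta > 1: concentration (reduces the grade); 0 < beta < 1: dilation (raises it).
HEDGES = {
    "very": 2.0,      # concentration; value given in the text
    "slightly": 0.5,  # dilation; assumed value for illustration
}

def apply_modifier(membership, hedges):
    """Apply a chain of hedges (a concept modifier) to a fuzzy value."""
    value = membership
    for hedge in hedges:
        value = value ** HEDGES[hedge]
    return value

# "Cabernet has a dry taste" with value 0.8, so "very dry" has 0.8^2 = 0.64.
print(round(apply_modifier(0.8, ["very"]), 2))      # 0.64
# A dilation hedge raises the value instead: 0.8^0.5 ≈ 0.894.
print(round(apply_modifier(0.8, ["slightly"]), 3))  # 0.894
```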
2.3
Every time a query is performed, the fuzzy values given to the concepts or to the relations, set by the expert during the ontology definition, are updated. In [22] two formulae, to initialize and to update the fuzzy values, are given. Such expressions take into account the use of concept modifiers.
The dynamical behaviour of a fuzzy ontology is also given by the introduction
of new concepts when a query is performed. In [22] a system has been presented
that allows the fuzzy ontology to adapt to the context in which it is used, in
order to propose an exhaustive approach to directly handle the knowledge-based
fuzzy information. This consists in the determination of a semantic correlation
[22] among the entities (i.e. concepts and instances) that are searched together in
a query.
Definition 3 A correlation is a binary and symmetric relation between entities. It is characterized by a fuzzy value: corr : O × O → [0, 1], where the set O = {o_1, o_2, . . . , o_n} is the set of the entities contained in the ontology.
This defines the degree of relevance for the entities: the closer the corr value is to 1, the more the two considered entities (for instance, concepts) are correlated. An updating formula for each existing correlation is also given. A similar technique is known in the literature as the co-occurrence metric [32, 33].
Integrating the correlation values into the fuzzy ontology is a crucial topic. Indeed, the knowledge of a domain is also given by considering the use of the objects inside the context. An important issue is handling the trade-off between the correct definition of an object (given by the ontology-represented definition of the domain) and the actual meaning assigned to the artifact by humans (i.e. the experience-based context assumed by every person according to his specific knowledge).
From the dynamical nature of the FCN some important topological properties can be determined. In particular, the time evolution of the correlations plays a dominant role in this kind of insight. In this section some formal tools and some efficiency measures are presented. These have been adopted to numerically analyse the FCN evolution and the underlying fuzzy ontology.
The study of the structural properties of the networks underlying complex systems can be very important. For instance, the efficiency of communication and navigation over the Net is strongly related to the topological properties of the Internet and of the World Wide Web. The connectivity structure of a population (the set of social contacts) affects the way ideas are diffused. Only very recently the increasing accessibility of databases of real networks on one side, and the availability of powerful computers on the other, have made possible a series of empirical studies on the properties of social networks.
In their seminal work [23], Watts and Strogatz have shown that the connection topology of some real networks is neither completely regular nor completely random. These networks, named small-world networks [36], exhibit a high clustering coefficient (a measure of the connectedness of a network), like regular lattices, and a small average distance between two generic points (small characteristic path length), like random graphs. Small average distance and high clustering are not the only common features of complex networks. Albert, Barabasi et al. [37] have studied P(k), the degree distribution of a network, and found that many large networks are scale-free, i.e., have a power-law degree distribution P(k) ∼ k^(−γ). Watts and Strogatz have named these networks, which are somehow in between regular and random networks, small-worlds, in analogy with the small-world phenomenon empirically observed in social systems more than 30 years ago [36]. The mathematical characterization of the small-world behaviour is based on the evaluation of two quantities: the characteristic path length L, measuring the typical separation between two generic nodes in the network, and the clustering coefficient C, measuring the average cliquishness of a node. Small-world networks are highly clustered, like regular lattices, and have small characteristic path lengths, like random graphs.
A generic network is usually represented by a weighted graph G = (N, K), where N is a finite set of vertices and K ⊆ (N × N) is the set of edges connecting the nodes. The information related to G is described both by an adjacency matrix A ∈ M(|N|, {0, 1}) and by a weight matrix W ∈ M(|N|, ℝ⁺). Both the matrices A and W are symmetric. The entry a_ij of A is 1 if there is an edge joining vertex i to vertex j, and 0 otherwise. The matrix W contains a weight w_ij related to any edge a_ij. If a_ij = 0 then w_ij = ∞. If the condition w_ij = 1 for any a_ij = 1 is assumed, the graph G corresponds to an unweighted relational network.
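A minimal sketch of this representation follows; the 4-node graph and its weights are invented for illustration:

```python
import math

# Weighted graph as a symmetric adjacency matrix A over {0, 1} and a weight
# matrix W, with w_ij = infinity when no edge joins i and j. The toy graph
# (4 nodes, 3 edges) is invented for illustration.
INF = math.inf
N = 4
edges = {(0, 1): 0.8, (1, 2): 0.5, (0, 2): 0.9}  # undirected, weighted

A = [[0] * N for _ in range(N)]
W = [[INF] * N for _ in range(N)]
for (i, j), w in edges.items():
    A[i][j] = A[j][i] = 1          # both matrices are symmetric
    W[i][j] = W[j][i] = w

def degree(i):
    """Degree of vertex i: the number of edges incident with i."""
    return sum(A[i])

print([degree(i) for i in range(N)])  # [2, 2, 2, 0]
```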
In network analysis a very important quantity is the degree of a vertex, i.e., the number of edges incident with i ∈ N. The degree k(i) of a generic vertex i is defined as:

k(i) = |{(i, j) : (i, j) ∈ K}|    (4)
The average degree of the network is then

⟨k(i)⟩ = (Σ_i k(i)) / (2 |N|)    (5)
The definition given in Equation (6) is valid for a totally connected G, where at least one finite path connecting any couple of vertices exists. Otherwise, when from a node i we cannot reach a node j, the distance d_ij = ∞, and thus the sum in L(G) diverges.
The clustering coefficient C(G) is a measure depending on the connectivity of the subgraph G_i induced by a generic node i and its neighbours. Formally, the subgraph G_i = (N_i, K_i) of a node i ∈ N can be defined as the pair:

N_i = {j ∈ N : (i, j) ∈ K}    (7)

K_i = {(j, k) ∈ K : j ∈ N_i ∧ k ∈ N_i}    (8)
An upper bound on the cardinality of K_i can be stated according to the following observation: if the degree of a given node is k(i), following Equation (4), then

|K_i| ≤ k(i) (k(i) − 1) / 2    (9)
Let us stress that the subgraph G_i does not contain the node i. G_i turns out to be useful in studying the connectivity of the neighbours of a node i after the elimination of the node itself.
The upper bound on the number of edges in a subgraph G_i suggests taking the ratio of the actual number of edges in G_i to the right-hand side of Equation (9). Formally this ratio is defined as:
C_sub(i) = 2 |K_i| / (k(i) (k(i) − 1))    (10)
The quantities C_sub(i) are used to calculate the clustering coefficient C(G) as their mean value:

C(G) = (1 / |N|) Σ_{i∈N} C_sub(i)    (11)
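Equations (10) and (11) can be sketched directly on neighbour sets; the toy graph is invented for illustration:

```python
from itertools import combinations

# Clustering coefficient per Equations (10)-(11): C_sub(i) is the fraction of
# possible edges actually present among i's neighbours; C(G) is the mean over
# all nodes. The toy graph below is invented for illustration.
adj = {  # undirected graph as neighbour sets
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0},
}

def c_sub(i):
    k = len(adj[i])
    if k < 2:                 # fewer than 2 neighbours: no possible edges
        return 0.0
    links = sum(1 for j, m in combinations(adj[i], 2) if m in adj[j])
    return 2 * links / (k * (k - 1))   # Equation (10)

def clustering(graph):
    return sum(c_sub(i) for i in graph) / len(graph)  # Equation (11)

# Node 0's neighbours {1, 2, 3} share one edge (1-2) out of three possible.
print(round(c_sub(0), 3))        # 0.333
print(round(clustering(adj), 3)) # 0.583
```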
and the local efficiency, in analogy with C, can be defined as the average efficiency of the local subgraphs:

E(G_i) = (1 / (k(i)(k(i) − 1))) Σ_{l≠m∈G_i} 1 / d′_lm    (13)

E_loc(G) = (1 / N) Σ_{i∈G} E(G_i)    (14)
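A minimal sketch of Equations (13)-(14) on an unweighted toy graph, where the distances d′_lm come from a breadth-first search; the graph is invented for illustration:

```python
# Local efficiency per Equations (13)-(14): for each node i, E(G_i) averages
# the inverse shortest-path distances inside the subgraph induced by i's
# neighbours (node i itself removed). Unweighted toy graph, invented data.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

def bfs_dist(nodes, source):
    """Shortest-path distances from `source`, restricted to `nodes`."""
    dist = {source: 0}
    frontier = [source]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u] & nodes:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist

def subgraph_efficiency(i):
    nodes = adj[i]                  # G_i: the neighbours of i, without i
    k = len(nodes)
    if k < 2:
        return 0.0
    total = 0.0
    for l in nodes:
        dist = bfs_dist(nodes, l)
        for m in nodes - {l}:
            total += 1 / dist[m] if m in dist else 0.0  # unreachable: 1/inf = 0
    return total / (k * (k - 1))    # Equation (13)

def local_efficiency():
    return sum(subgraph_efficiency(i) for i in adj) / len(adj)  # Equation (14)

print(round(local_efficiency(), 3))  # 0.583
```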
In the last decade, the rapid and wide development of the Internet has brought online an increasingly large amount of documents and textual information. The necessity of a better definition of the Information Retrieval System (IRS) emerged, in order to retrieve the information considered pertinent to
a user query. Information Retrieval is a discipline that involves the organization,
storage, retrieval and display of information. IRSs are designed with the objective
of providing references to documents which contain the information requested by
the user [32]. In an IRS, problems arise when there is the need to handle the uncertainty and vagueness that appear in many different parts of the retrieval process. On the one hand, an IRS is required to understand queries expressed in natural language. On the other hand, it needs to handle the uncertain representation of a document.
In the literature, there are many models of IRS, classified into the following categories: Boolean logic, vector space, probabilistic and fuzzy logic [39, 40]. However, both the efficiency and the effectiveness of these methods are not satisfactory [34]. Thus, other approaches have been proposed to directly handle the knowledge-based fuzzy information. In preliminary attempts the knowledge was represented by a concept matrix, whose elements identify relevance values among concepts [34]. Other, more relevant, approaches add fuzzy types to object-oriented database systems [41].
As stated in Section 2, a crucial topic for semantic information handling is facing the trade-off between the proper definition of an object and its common-sense counterpart. The characteristic weights of the FCN are initially set by an expert of the domain. In particular, the expert sets the initial correlation values on the fuzzy ontology, and the fuzzy concept network construction procedure takes these as initial values for the links among the objects in O (see Definition 4). From then on, the correlation values F(o_i, o_j) are updated according to the queries (both selections and insertions) performed on the documents.
Above all, the FCN gives the possibility to directly incorporate the semantics expressed by natural languages into the graph traversal. This feature lets us intrinsically obtain fuzzy information retrieval algorithms without introducing fuzzification and defuzzification operators. Let us stress that this is possible because the fuzzy logic is directly embedded in the knowledge expressed by the fuzzy ontology.
An example of this kind of approach is presented in the following. The original crisp information retrieval algorithm taken into account has been successfully applied to support the creative processes of architects and interaction designers. In more detail, a new formalization of the algorithm adopted in the ATELIER project (see [22] and Section 5.1) is presented, including a brief step-by-step description.
The FCN has been involved in steps (1) and (4) in order to semantically enrich the results obtained. The algorithm input is the vector of the keywords in the query. The first step of the algorithm uses these keywords to locate the documents (e.g., stored in a relational database) containing them. The keyword vector is then extended with all the new keywords related to each selected document.
In step (1) the queries are extended by navigating the FCN recursively. For each keyword specified in the query, a depth-first visit is performed, stopping the traversal at a fixed depth; in [22] this threshold was set to 3. The edges whose F(o_i, o_j) is 0 are excluded and the neighbour keywords are collected without repetitions.
Each keyword o_i is then assigned the weight

w(o_i) = m(o_i) · ∏_{o_j ∈ K, o_j ≠ o_i} [F(o_i, o_j)]^(α_{o_i,o_j})    (15)

where K is the set of the keywords obtained from step (3) and α_{o_i,o_j} ∈ ℝ is a modifier value used to express the effects of concept modifiers (see [19] and Section 2.2 for details).
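The query-extension step (1) described above can be sketched as a depth-limited visit; the toy network, its F values and all identifiers below are invented for illustration:

```python
# Sketch of step (1): a depth-limited recursive visit of the fuzzy concept
# network from each query keyword, skipping edges whose correlation F is 0
# and stopping at depth 3 as in [22]. All data are invented for illustration.
F = {  # symmetric correlations; zero entries are excluded by the visit
    ("sea", "landscape"): 0.7,
    ("landscape", "mountain"): 0.4,
    ("mountain", "snow"): 0.6,
    ("sea", "boat"): 0.0,  # zero-valued edge: must not be traversed
}

def neighbours(o):
    for (a, b), f in F.items():
        if f > 0 and o in (a, b):
            yield b if o == a else a

def extend(keywords, max_depth=3):
    found = set(keywords)
    def visit(o, depth):
        if depth == max_depth:
            return
        for n in neighbours(o):
            if n not in found:
                found.add(n)
                visit(n, depth + 1)
    for kw in keywords:
        visit(kw, 0)
    return found

print(sorted(extend(["sea"])))  # ['landscape', 'mountain', 'sea', 'snow']
```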
The final score of a document is evaluated through a cosine distance among the weights of each keyword; this is done for normalisation purposes. The resulting values are finally sorted in order to obtain a ranking among the documents.
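The cosine-based scoring can be sketched as follows; the keyword weights and document vectors are invented, since the exact m(o_i) and modifier values are application-specific:

```python
import math

# Sketch of the final scoring step: the extended query and each document are
# represented as keyword-weight vectors and compared with the cosine measure,
# which also normalises for vector length. All weights below are invented.
def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = {"sea": 1.0, "landscape": 0.7}
docs = {
    "doc1": {"sea": 0.9, "boat": 0.4},
    "doc2": {"landscape": 0.8, "mountain": 0.5},
}

# Rank documents by decreasing cosine score.
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranking)  # ['doc1', 'doc2']
```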
Test validation
This section is divided as follows: in the first part we introduce the environment used to test the FCN, whereas in the second part an analytic study of the scale-free properties of these networks is given.
5.1
A creative learning environment is the context chosen to study the behaviour of fuzzy concept networks. In particular, the ATELIER (Architecture and Technologies for Inspirational Learning Environments) project has been examined. ATELIER is an EU-funded project that is part of the Disappearing Computer initiative2. The aim of this project is to build a digitally enhanced environment supporting a creative learning process in architecture and interaction design education. The work of the students is supported by many kinds of devices (e.g., large displays,
2 http://www.disappearing-computer.net
5.2
During the construction of the fuzzy concept network some snapshots have been periodically dumped to file (one snapshot every 50 queries) to be analyzed. To obtain a graphical topological representation, a network analysis tool called AGNA3 has been used. AGNA (Applied Graph and Network Analysis) is a platform-independent application designed for scientists and researchers who employ specific mathematical methods, such as social network analysis, sociometry and sequential analysis. Specifically, AGNA can assist in the study of communication relations in groups, organizational analysis and team building, kinship relations or laws of organization of animal behavior. The most recent version, AGNA 2.1.1, has been used to produce the following pictures of the fuzzy concept networks, starting
from the weighted adjacency matrix defined in Section 3. The link color intensity is proportional to the function F introduced in Definition 2: the more marked the lines, the nearer the F values are to 1. Because of the large number of concepts and correlations, the pictures in Figure 4 have the purpose of showing qualitatively the link distributions and the hub locations. Little semantic information about the ontology can be effectively extracted from these pictures.
3 http://www.geocities.com/imbenta/agna/
Both the global and the local efficiency, as defined in Equation (12) and Equation (14), are calculated on each of these snapshots. The evolution of the efficiency measures is reported in Figures 3(a), 3(b) and 3(c). The solid line corresponds to the global efficiency while the dashed one is the local efficiency.
In Figure 3(a) the efficiency evolution of the HMDB is reported. The global efficiency becomes a dominant effect after an initial transient where the local efficiency is dominant. So, we can deduce the emergence of a hub-connected fuzzy concept network. This consideration is graphically confirmed by the network reported in Figure 4(a). To increase readability we located the hubs on the borders of the figure.
It can be seen clearly that the hub concepts are "people", "man", "woman", "hat", "face" and "portrait". These central concepts have been isolated using the betweenness [42] sociometric measure. The high betweenness values of the hubs with respect to those of the other concepts confirm the measure obtained by Eglo. Indeed, the mean distance among the concepts is kept low thanks to these points appearing very frequently in the paths from and to all the other nodes.
We want to stress that the global efficiency quantifies the presence of hubs in a given network, while the betweenness of the nodes gives a way to identify which of the concepts are actually hub points. On the other hand, Figure 3(b) shows how the local efficiency in a fuzzy concept network built using user queries is higher than its global counterpart. This suggests that the network topology lacks hubs. Furthermore, many nodes have roughly the same number of neighbours. A confirmation of this analysis is given by the fuzzy concept network reported in Figure 4(b). In this case the betweenness index of the concepts shows that no particular point has a significantly higher frequency in the other nodes' paths. Finally, in Figure 3(c) the effect of the network obtained from both the documents and the queries is reported. In this composed kind of test a total of about 1000 queries is taken into account. It is interesting how both the data from the HMDB and the user queries act in a non-linear way on the efficiency measures. The resulting fuzzy concept network shows a dominant Eglo with respect to its Eloc, and some hubs emerge. In particular, Figure 4(c) highlights the fact that the hubs collect a large number of links coming from the other concepts.
The betweenness index for this fuzzy concept network identifies one main hub, the concept "people", with an extremely high value (5 times the mean value of the other hubs). The other principal hubs are "woman", "man", "landscape", "sea", "portrait", "red" and "fruit". In this case the Freeman General Coefficient evaluated on the indexes is slightly lower than in the HMDB case; this is due to the higher clustering among concepts (higher Eloc), see Table 1. The Freeman Coefficient is a function that allows the consolidation of node-level measures into a single value related to the properties of the whole network [42].
The strength of these connections is much more marked than in the similar situation observed for the HMDB-induced fuzzy concept network. This means that the queries' contribution reinforces the semantic correlations among the hub
[Figure 3: three panels, each plotting efficiency (y-axis, 0-0.2) against snapshot number (x-axis); panels (a), (b) and (c).]
concepts.
Figure 3: (a) Efficiency measures for the concept network induced by the knowledge base. (b) Efficiency for the user queries. (c) Efficiency for the joined queries and knowledge-base documents.
To confirm the scale-free nature of the hubs in the fuzzy concept networks we analyzed the statistical distributions of k(i) (see Equation (4)) reported
in Figure 5.
For Figures 5(a) and 5(c) the frequencies decrease according to a power law. This confirms the theoretical expectations very well. The user queries distribution, in Figure 5(b), behaves differently: its high values of Eloc imply, as already stated, a highly clustered structure.
Let us consider how Eloc is related to other classical social network parameters such as the density and the weighted density [42]. The density reflects the connectedness of a given network with respect to its complete graph. It is easy to note that this criterion is quite similar to the one stated in Equation (14): we can consider Eloc as a mean value of the densities evaluated locally in each node of the fuzzy concept network. Unexpectedly, the numerical results in Table 1 show an inversely proportional relation between Eloc and the density. More investigations are required.
The weighted density can be interpreted as a measure of the mean link weight value, namely the mean semantic correlation (see Section 2) among the concepts in the fuzzy concept network. In Table 1 it can be seen that the weighted density values are higher for the systems exhibiting hubs. This is graphically confirmed by Figures 4(a) and 4(c), where the links among the concepts are more marked (i.e. more colored lines correspond to stronger correlations).
[Figure 4: visualisations of the three fuzzy concept networks, panels (a), (b) and (c).]
[Figure 5: three histograms of link counts per concept (x-axis: # link, y-axis: count); panels (a), (b) and (c).]
Figure 5: (a) HMDB fuzzy concept network link distribution. (b) User queries
fuzzy concept network link distribution. (c) Complete knowledge-base fuzzy concept network link distribution.
Table 1: Comparison of the efficiency measures of other complex systems w.r.t. our fuzzy concept networks.

Fuzzy Concept Network      Eglo   Freeman Coeff.  Eloc   Density  Weighted Density
HMDB                       0.094  0.07            0.074  0.035    0.01
Queries                    0.053  0.02            0.144  0.015    0.006
Complete (HMDB+Queries)    0.141  0.06            0.079  0.036    0.013
Related Work
has been applied to ontologies related to the contexts covered in the Web Mining field of research. Another ontology model somehow related to our approach is the seed ontology. A seed ontology creates a semantic network through co-occurrence analysis: it considers exclusively how many times the keywords are used together. The disambiguation-related problems are resolved by consulting WordNet. The major advantage of these nets is that they allow the efficient identification of the most probable candidates for inclusion in an extended ontology.
Ontology-based information retrieval approaches are among the most promising methodologies to improve the quality of the responses given to users. The definition of the FCN implies a better calculation of the relevance of searched documents. A different approach to ontology-based information retrieval has been proposed in [46]. In this work a semantic network is built to represent the semantic contents of a document. The topological structure of this network is used in the following way: every time a query is performed, a keyword vector is created in order to select the appropriate concepts that characterize the contents of the documents according to the search criteria. We are investigating how to integrate the FCN with this different kind of semantic network; in fact, both of these methodologies could be effectively used to achieve the goal of IRSs.
Conclusions
Acknowledgements
The work presented in this paper has been partially supported by the ATELIER project (IST-2001-33064).
References
[1] T. Berners-Lee, J. Hendler, and O. Lassila, The semantic web, Scientific American, vol. 284, pp. 34-43, 2001.
[2] N. Guarino, Formal ontology and information systems, 1998. [Online]. Available:
citeseer.ist.psu.edu/guarino98formal.html
[3] T. Gruber, A Translation Approach to Portable Ontology Specifications, Knowledge Acquisition, vol. 5, pp. 199-220, 1993.
[4] E. Falkenberg, W. Hesse, P. Lindgreen, B. Nilsson, J. Oei, C. Rolland, R. Stamper,
F. V. Assche, A. Verrijn-Stuart, and K. Voss, Frisco : A framework of information
system concepts, IFIP, The FRISCO Report (Web Edition) 3-901882-01-4, 1998.
[5] N. Lammari and E. Métais, Building and maintaining ontologies: a set of algorithms, Data and Knowledge Engineering, vol. 48, pp. 155-176, 2004.
[6] N. Guarino and P. Giaretta, Ontologies and Knowledge Bases: Towards a Terminological Clarification, in Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, N. Mars, Ed. Amsterdam: IOS Press, 1995, pp. 25-32.
[7] V. W. Soo and C. Y. Lin, Ontology-based information retrieval in a multi-agent system for digital library, in 6th Conference on Artificial Intelligence and Applications, 2001, pp. 241-246.
[8] G. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice
Hall, 1995.
[9] L. A. Zadeh, Fuzzy sets, Inform. and Control, vol. 8, pp. 338-353, 1965.
[10] P. Bouquet, J. Euzenat, E. Franconi, L. Serafini, G. Stamou, and S. Tessaris, Specification of a common framework for characterizing alignment, IST Knowledge web
NoE, vol. 2.2.1, 2004.
[11] S. Singh, L. Dey, and M. Abulaish, A Framework for Extending Fuzzy Description Logic to Ontology based Document Processing, in Proceedings of AWIC 2004, ser. LNAI, vol. 3034. Springer-Verlag, 2004, pp. 95-104.
[12] M. Abulaish and L. Dey, Ontology Based Fuzzy Deductive System to Handle Imprecise Knowledge, in Proceedings of the 4th International Conference on Intelligent Technologies (InTech 2003), 2003, pp. 271-278.
[13] C. Matheus, Using Ontology-based Rules for Situation Awareness and Information
Fusion, in Position Paper presented at the W3C Workshop on Rule Languages for
Interoperability, April 2005.
[14] T. Quan, S. Hui, and T. Cao, FOGA: A Fuzzy Ontology Generation Framework
for Scholarly Semantic Web, in Knowledge Discovery and Ontologies (KDO-2004).
Workshop at ECML/PKDD, 2004.
[15] Q. T. Tho, S. C. Hui, A. Fong, and T. H. Cao, Automatic fuzzy ontology generation for semantic web, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 842-856, 2006.
[16] D. Parry, A fuzzy ontology for medical document retrieval, in Proceedings of The Australian Workshop on DataMining and Web Intelligence (DMWI2004), Dunedin, 2004, pp. 121-126.
[17] T. Andreasen, J. F. Nilsson, and H. E. Thomsen, Ontology-based querying, in Flexible Query-Answering Systems, 2000, pp. 15-26. [Online]. Available: citeseer.ist.psu.edu/682410.html
[18] L. Chang-Shing, J. Zhi-Wei, and H. Lin-Kai, A fuzzy ontology and its application to news summarization, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 35, pp. 859-880, 2005.
[19] S. Calegari and D. Ciucci, Integrating Fuzzy Logic in Ontologies, in ICEIS, Y. Manolopoulos, J. Filipe, P. Constantopoulos, and J. Cordeiro, Eds., 2006, pp. 66-73.
[20] E. Sanchez and T. Yamanoi, Fuzzy ontologies for the semantic web, in Proceedings of FQAS, 2006, pp. 691-699.
[21] AA.VV., Karlsruhe Ontology and Semantic Web Tool Suite (KAON), 2005,
http://kaon.semanticweb.org.
[22] S. Calegari and M. Loregian, Using dynamic fuzzy ontologies to understand creative environments, in LNCS - FQAS, H. L. Larsen, G. Pasi, D. O. Arroyo, T. Andreasen, and H. Christiansen, Eds. Springer, 2006, pp. 404-415.
[23] D. J. Watts and S. H. Strogatz, Collective dynamics of small-world networks, Nature, vol. 393, no. 6684, pp. 440-442, June 4 1998.
[24] V. Latora and M. Marchiori, Efficient behavior of small-world networks, Phys.
Rev. Lett., vol. 87, no. 19, p. 198701(4), 2001.
[25] Y. Bar-Yam, Dynamics of Complex Systems. Reading, MA: Addison-Wesley, 1997.
[35] D. Lucarella and R. Morara, First: fuzzy information retrieval system, J. Inf. Sci., vol. 17, no. 2, pp. 81-91, 1991.
[36] S. Milgram, The small world problem, Psychology Today, vol. 2, pp. 60-67, 1967.
[37] R. Albert, H. Jeong, and A.-L. Barabasi, Error and attack tolerance of complex networks, Nature, vol. 406, pp. 378-382, 2000.
[38] P. Crucitti, V. Latora, M. Marchiori, and A. Rapisarda, Efficiency of scale-free networks: Error and attack tolerance, Physica A, vol. 320, pp. 622-642, 2003.
[39] G. Salton and M. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1984.
[40] F. Crestani and G. Pasi, Soft information retrieval: applications of fuzzy sets theory and neural networks, in Neuro-Fuzzy Techniques for Intelligent Information Systems, N. Kasabov and R. Kozma, Eds. Physica Verlag, 1999, pp. 287-315.
[41] N. Marín, O. Pons, and M. A. V. Miranda, A strategy for adding fuzzy types to an object-oriented database system, Int. J. Intell. Syst., vol. 16, no. 7, pp. 863-880, 2001.
[42] S. Wasserman and K. Faust, Social Network Analysis. Cambridge: Cambridge University Press, 1994.
[43] J. Lee, M. Kim, and Y. Lee, Information retrieval based on conceptual distance in is-a hierarchies, J. Documentation, vol. 49, no. 2, pp. 188-207, 1993.
[44] J. Sowa, Encyclopedia of Artificial Intelligence.
Martine Collardb
EXECO Project
Laboratoire I3S - CNRS UMR 4070
Les Algorithmes, 2000 route des Lucioles,
BP.121 - 06903 Sophia-Antipolis - France
Email: a rmartine@i3s.unice.fr, b mcollard@i3s.unice.fr
Abstract
This paper discusses different approaches for integrating biological knowledge in gene expression analysis. We are interested in the fifth step of the microarray analysis procedure, which focuses on knowledge discovery via interpretation of the microarray results. We present a state of the art of methods for processing this step and we propose a classification into three facets: prior or knowledge-based, standard or expression-based, and co-clustering. First we briefly discuss the purpose and usefulness of our classification. Then the following sections give an insight into each facet. We summarize each section with a comparison between remarkable approaches.
Keywords: data mining, knowledge discovery, bioinformatics, microarray, biological sources
of information, gene expression, integration.
1 Introduction
Nowadays, one of the main challenges in gene expression technologies is to highlight the main co-expressed1 and co-annotated2 gene groups using at least one of the different sources of biological information [1]. In other words, the issue is the interpretation of microarray results via the integration of gene expression profiles with the corresponding biological gene annotations extracted from biological databases.
Analyzing microarray data consists of five steps: protocol and image analysis, statistical data treatment, gene selection, gene classification, and knowledge discovery via data interpretation [2]. Figure 1 shows the goal of the fifth analysis step, devoted to interpretation, which is the integration between two domains: the numeric one, represented by the gene expression profiles, and the knowledge one, represented by gene annotations issued from different sources of biological information.
At the beginning of gene expression technologies, research focused on the numeric side. Thus, a variety of data analysis approaches have been reported ([3, 4, 5, 6, 7, 8]) that identify groups of co-expressed genes based only on expression profiles, without taking biological knowledge into account. A common characteristic of purely numerical approaches is that they determine gene groups (or clusters) of potential interest but leave to the expert the task of discovering and interpreting the biological similarities hidden within these groups. These methods are useful because they guide the analysis of the co-expressed gene groups. Nevertheless, their results are often incomplete, because they do not include biological considerations based on biologists' prior knowledge.
Figure 1: Interpretation of microarray results via integration of gene expression profiles with
corresponding sources of biological information
In order to process the interpretation step in an automatic or semi-automatic way, the bioinformatics community is faced with an ever-increasing volume of sources of biological information on gene annotations. We have classified them into the following six sources of biological information: molecular databases (GenBank, EMBL, UniGene, etc.); semantic sources such as thesauri, ontologies, taxonomies or semantic networks (UMLS, GO, Taxonomy, etc.); experience databases (GEO, ArrayExpress, etc.); bibliographic databases (Medline, Biosis, etc.); gene/protein-related specific sources (OMIM, KEGG, etc.); and minimal microarray information, as seen in Figure 1. Exploiting these different sources of biological information is quite a complex task, so scientists have developed several tools for manipulating them or integrating them into more complex databases [9], [10].
This paper presents a complete survey of the different approaches for automatic integration
of biological knowledge with gene expression data. A first discussion of these methods is
presented by Chuaqui in [11]. Here we present an original classification of the different
microarray analysis interpretation approaches.
The interpretation step may be defined as the result of the integration of gene expression profile analysis with the corresponding gene annotations. This integration process consists in grouping together co-expressed and co-annotated genes. Based on this definition, three research axes may be distinguished: the prior or knowledge-based axis, the standard or expression-based axis, and the co-clustering axis. Our classification emphasizes the weight of the scheduling of the integration process on the final results [12, 13, 14, 15].
Indeed, the main criterion underlying the classification we propose is the scheduling of the phases which alternately consider gene measures or gene annotations. In prior or knowledge-based approaches, the co-annotated gene groups are built first and the gene expression profiles are then integrated. In standard or expression-based approaches, co-expressed gene groups are built first and the gene annotations are then integrated. Finally, co-clustering approaches integrate co-expressed and co-annotated gene groups at the same time.
This paper is organized in the following way: each section fully explains the corresponding interpretation axis, giving an insight into and a comparison of its remarkable approaches. We then develop a discussion across the three interpretation axes.
2 Prior or Knowledge-Based Axis
2.1 Knowledge-Based Methodology
1. Co-Annotated Gene Groups Composition. There exist several ways to build co-annotated gene groups; we present here one structured way of building them. First, one needs to choose among the different sources of biological information. Each kind of information is stored in a specific format (XML, SQL, etc.) and has intrinsic characteristics, so the analysis process needs to deal with each biological source format. Another issue is to choose a nomenclature for each gene identity that is coherent with the sources of information and thereafter with the expression data. Next, all the annotations of each gene are collected from one or more sources of information. Finally, the genes that share the same annotation are gathered into a subset. Thus, we obtain all the co-annotated gene groups, as shown in Figure 2.
2. Gene Expression Profiles Integration. There are different ways to integrate gene expression profiles with the previously built co-annotated gene groups; here we present one current way to do it. First, expression profile measures are taken for each gene. Then, a variability measure such as the fold change, the t-statistic or the F-score [16] is used to build a sorted list of gene ranks based on the expression profiles. Finally, this measure is incorporated gene by gene into the co-annotated groups. Thus, we obtain co-annotated gene groups enriched with the expression profile information, as shown in Figure 2.
3. Selection of the Significant Co-Annotated and Co-Expressed Gene Groups. At this stage all co-annotated and co-expressed gene groups are built. The next step is to reveal which of these groups or subgroups are statistically significant. To tackle this issue, the most frequent technique is statistical hypothesis testing. Here are the four steps of statistical hypothesis testing:
a) Formulate the null hypothesis, H0 (commonly, that the genes that are co-annotated and co-expressed were expressed together as the result of pure chance), versus the alternative hypothesis, H1 (commonly, that the co-expressed and co-annotated gene groups are found together because of a biological effect combined with a component of chance variation).
b) Identify a test statistic: the test is based on a probability distribution that will be used to assess the truth of the null hypothesis.
c) Compute the p-value, i.e. the probability, under H0, of observing a test statistic at least as extreme as the one obtained.
d) Compare the p-value against an acceptable significance level α; if p ≤ α, the observed effect is declared statistically significant and H0 is rejected.
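As a concrete instance of such a test, here is a minimal sketch of the one-tailed hypergeometric computation that several of the surveyed approaches rely on (the counts are illustrative, not taken from any cited study):

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """One-tailed p-value P(X >= k): the chance of seeing at least k annotated
    genes in a selection of n genes, when K of the N genes carry the annotation."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy example: both of 2 selected genes are annotated, out of N=20 genes
# of which K=4 carry the annotation.
p = hypergeom_pvalue(k=2, n=2, K=4, N=20)
print(round(p, 4))
```

A small p-value rejects H0 (step d above): the overlap between the selected genes and the annotated group is unlikely under pure chance.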
2.2 Remarkable Knowledge-Based Approaches
We present here four representative approaches: GSEA [17], iGA [18], PAGE [19] and CGGA [20]. In the following we describe each of them, emphasizing some parameters in particular: the source of biological information, the expression profile measure, the expression variability measure, and the hypothesis testing parameters and details (type of test, test statistic, distribution, corrections, etc.).
1. Gene Set Enrichment Analysis, GSEA
This approach [17] proposes a statistical method designed to detect coordinated changes in the expression profiles of pre-defined groups of co-annotated genes. This method was born from the need to interpret metabolic pathway results, where a group of genes is supposed to move together along the pathway.
In the first step, it builds a priori defined gene sets using specific sources of information, namely the NetAffx and GenMapp metabolic pathway databases.
In the second step, it takes the Signal-to-Noise Ratio (SNR) to measure the expression profile of each gene within the co-annotated group. Then it builds a sorted list of genes for each of the co-annotated groups.
Third, it uses a non-parametric statistic, the enrichment score (ES), based on a normalized Kolmogorov–Smirnov statistic, for hypothesis testing. It takes as null hypothesis:
H0: The rank ordering of genes is random with regard to the sample.
Then, it assesses the statistical significance of the maximal ES by running a set of permutations among the samples. Finally, it compares the maximal ES with a threshold α, obtaining the significant co-expressed and co-annotated gene groups.
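The running-sum idea behind the ES can be sketched as follows (simplified with respect to the published GSEA statistic; gene names are made up):

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up on gene-set members and down
    otherwise; the ES is the maximum deviation of this running sum from zero."""
    n_hit = len(gene_set)
    n_miss = len(ranked_genes) - n_hit
    running = best = 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in gene_set else -1.0 / n_miss
        best = max(best, running)
    return best

# A set whose members cluster at the top of the ranking scores highly.
print(enrichment_score(["a", "b", "c", "d", "e", "f"], {"a", "b"}))
```

In GSEA proper, the significance of the maximal ES is then estimated by recomputing it over many permutations of the sample labels.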
2. Parametric Analysis of Gene Set Enrichment, PAGE
This approach [19] detects co-expressed genes within a priori co-annotated groups of genes like GSEA, but it implements a parametric method.
In the first step, it builds a priori defined gene sets from the Gene Ontology (GO)4, NetAffx5 and GenMapp6 metabolic databases.
In the second step, it takes the fold change to measure the expression profile of each gene within the co-annotated group. Then, it builds a z-score from the corresponding fold changes of the two comparative groups (normal versus non-normal) as variability expression measure.
Third, it uses the z-score as parametric test statistic. It invokes the central limit theorem [21] to argue that when the sampling size of a co-annotated group is large enough, its mean fold change follows a normal distribution, and uses the null hypothesis:
H0: The z-score within the groups has a standard normal distribution.
Thus, if the z-score of a co-annotated gene group is too extreme to be compatible with the standard normal distribution, the group is declared significantly co-expressed.
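The CLT argument can be sketched as follows (illustrative values; `page_zscore` and `upper_tail_p` are hypothetical helpers, not the published implementation):

```python
import math
from statistics import mean, pstdev

def page_zscore(all_changes, group_changes):
    """PAGE-style z-score: under H0 the mean fold change of a random gene set
    of size m is approximately Normal(mu, sigma / sqrt(m)) by the CLT."""
    mu, sigma = mean(all_changes), pstdev(all_changes)
    m = len(group_changes)
    return (mean(group_changes) - mu) * math.sqrt(m) / sigma

def upper_tail_p(z):
    """One-tailed p-value from the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

all_fc = [0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0]  # whole-array fold changes (toy)
z = page_zscore(all_fc, [2.0, 2.0])                 # a co-annotated gene pair
print(round(z, 3), round(upper_tail_p(z), 3))
```

Because the test is parametric, no permutations are needed: the p-value comes directly from the normal tail.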
3. Iterative Group Analysis, iGA
This approach [18] finds co-expressed gene groups within a priori functionally enriched groups sharing the same functional annotation.
In a first step, it builds a priori functionally enriched groups of genes from the Gene Ontology (GO) or other sources of biological information.
In a second step, it uses the fold change gene expression measure to build a complete sorted list of genes. Then, it generates a reduced sorted list specific to each functionally enriched group.
In a third step, it iteratively calculates the probability of change for each functionally enriched group (based on the cumulative hypergeometric distribution). It states the null hypothesis:
H0: The top x genes are associated by chance within the functionally enriched group.
Then, it assesses the statistical significance of each group by comparing the probability-of-change p-value against a user-determined value.
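The iterative part can be sketched as a minimum over prefix p-values (a simplified reading of iGA on toy data; the published method is more elaborate, e.g. it handles both list extremities):

```python
from math import comb

def hyper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes out of N, of which K are annotated."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def min_prefix_pvalue(ranked_genes, annotated):
    """For every prefix of the ranked list ending on an annotated gene, test the
    annotated count against the hypergeometric null; keep the smallest p-value."""
    N, K, hits, best = len(ranked_genes), len(annotated), 0, 1.0
    for n, gene in enumerate(ranked_genes, start=1):
        if gene in annotated:
            hits += 1
            best = min(best, hyper_tail(hits, n, K, N))
    return best

print(min_prefix_pvalue(["g1", "g2", "g3", "g4", "g5", "g6"], {"g1", "g2"}))
```

Iterating over prefixes removes the need to fix a significance cutoff on the ranked list in advance.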
4. Co-expressed Gene Group Analysis, CGGA
This approach [20] automatically finds co-expressed and co-annotated gene groups.
4 http://www.geneontology.org
5 http://www.affymetrix.com/analysis
6 http://www.genmapp.org
In a first step, it builds a priori defined gene groups from a source of biological information, for instance the Gene Ontology (GO) and KEGG.
In a second step, it uses the fold change as gene expression measure. Then, it composes the F-score from the corresponding genes' fold changes. Using the F-score of each gene, it builds a sorted list of gene ranks. Then, it generates a reduced list of gene ranks specific to the co-annotated enriched group.
In a third step, it states the null hypothesis:
H0: x genes from a co-annotated gene group (or subgroup) are co-expressed by chance.
A hypergeometric distribution is assumed and a p-value is calculated from the cumulative distribution. This p-value is compared against α to reveal all the significant co-expressed and co-annotated gene groups, including all the possible subgroups.
2.3 Comparison of Knowledge-Based Approaches
Table I presents a brief summary of the four prior approaches described in the last section. For each approach, the four following parameters are presented: sources of biological information used, expression profile measure, variability expression measure, and hypothesis testing details (test statistic, distribution and particular characteristics).
First of all, the four approaches are concerned with metabolic pathways within biological processes, but they use different sources of information: iGA, PAGE and CGGA use the Gene Ontology, while GSEA uses manual metabolic annotations, GenMapp and NetAffx. CGGA is the only one which uses the KEGG database combined with the Gene Ontology.
For expression profile parameters, GSEA is the only one whose choice is the SNR measure, while the others opted for the fold change measure. PAGE and CGGA use respectively the z-score and F-score variability measures to detect the changes in gene expression profiles.
For hypothesis testing, GSEA is the only one which uses a non-parametric method, based on a maximal ES statistic and sampling to calculate the p-value. On the contrary, PAGE (normal distribution), CGGA (hypergeometric distribution) and iGA (hypergeometric distribution) chose a parametric approach. iGA chose a hypothesis test based on the most over-expressed or under-expressed genes (in the rank list) of a co-annotated group, while CGGA searches all the possible co-expressed subgroups within a co-annotated group (the internal position of the subgroup within the group does not matter).
| Approach | Biological Source of Information | Expression Profile Measure | Variability Expression Measure | Hypothesis Testing Details |
|---|---|---|---|---|
| GSEA (Mootha et al. 2003) | Manual annotations, NetAffx and GenMapp | SNR (Signal-to-Noise Ratio) | Mean expression difference | One-tailed test. Maximal ES statistic (normalized Kolmogorov–Smirnov). Permutation-based p-value. |
| iGA (Breitling et al. 2004) | GO | Fold change | Fold change | One-tailed test. Modified Fisher's exact statistic: the most over- or under-expressed genes in a group. Hypergeometric distribution. |
| PAGE (Kim et al. 2005) | GO | Fold change | z-score | One-tailed test. z-score statistic. Normal distribution. |
| CGGA (Martinez et al. 2006) | GO (MF, BP and CC) and KEGG | Fold change | F-score | One-tailed test. Modified Fisher's exact statistic: all over- or under-expressed genes in a group. Hypergeometric distribution; binomial distribution for large N. Bonferroni correction. |

TABLE I: Comparison between four knowledge-based integration approaches
3 Standard or Expression-Based Axis
3.1 Expression-Based Methodology
1. Gene Expression Profiles Classification. There exist several methods for classifying gene expression profiles from cleaned microarray data, i.e. a data matrix of thousands of genes measured under tens of biological conditions. Various supervised and unsupervised methods tackle the gene classification issue. Among the most common methods, we can mention hierarchical clustering, k-means, Diana, Agnes, Fanny [22], model-based clustering [23], support vector machines (SVM), self-organizing maps (SOM), and even association rules (see more details in [24]).
The target of these methods is to classify genes into clusters sharing similar gene expression profiles, as shown in the first step of Figure 3.
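This classification step can be sketched with a plain k-means on toy two-condition profiles (deterministic initialization for reproducibility; this stands in for any of the clustering methods above rather than reproducing a specific cited tool):

```python
import math

def kmeans(profiles, k, iters=20):
    """Plain k-means over expression profiles (one list of per-condition
    values per gene). Initializes from the first k profiles for determinism."""
    centers = [list(p) for p in profiles[:k]]
    clusters = []
    for _ in range(iters):
        # Assign each profile to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in profiles:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute centers as per-condition means; keep old center if empty.
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return clusters

# Two under-expressed and two over-expressed toy profiles split into two clusters.
profiles = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0]]
print([len(c) for c in kmeans(profiles, 2)])
```

Real analyses would normalize the profiles and pick k by a validity index rather than fixing it by hand.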
2. Biological Annotations Integration. Once clusters of genes are built by similar expression levels, each gene's annotations are extracted from sources of biological information. As in the prior axis, this step deals with different formats of information. A list of annotations is composed for each gene, and then all the annotations are integrated into the clusters of genes (previously built by co-expression profiles). Thus, subsets of co-annotated and co-expressed gene groups are built within each cluster. Figure 3 illustrates this process: three clusters of similar expression profiles are first built, and then all the individual gene annotations are collected and incorporated into each cluster. For example, in the first under-expressed (green) group we find three subsets of co-annotated genes: respiratory complex (Gene E and Gene D), gluconeogenesis (Gene G and Gene Y) and tricarboxylic acid cycle (Gene E and Gene T). We can observe intersections of genes within the under-expressed cluster because of the different annotations that each gene may have. Thus, we obtain all the co-annotated gene groups.
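The subset construction in the Figure 3 example can be sketched as a simple inversion of the gene-to-annotation mapping within a cluster:

```python
def co_annotated_subsets(cluster, annotations, min_size=2):
    """Group a cluster's genes by shared annotation term; a gene may appear
    in several subsets because it can carry several annotations."""
    subsets = {}
    for gene in cluster:
        for term in annotations.get(gene, ()):
            subsets.setdefault(term, set()).add(gene)
    return {t: genes for t, genes in subsets.items() if len(genes) >= min_size}

# The Figure 3 example: genes E, D, G, Y, T of the under-expressed cluster.
annotations = {
    "E": ["respiratory complex", "tricarboxylic acid cycle"],
    "D": ["respiratory complex"],
    "G": ["gluconeogenesis"],
    "Y": ["gluconeogenesis"],
    "T": ["tricarboxylic acid cycle"],
}
print(co_annotated_subsets(["E", "D", "G", "Y", "T"], annotations))
```

Gene E lands in two subsets, which is exactly the intersection behavior described above.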
3. Selection of the Significant Co-Annotated and Co-Expressed Gene Groups. At this stage all the co-expressed and co-annotated gene groups are built, and the issue is to reveal which of these groups or of their possible subgroups are statistically significant. The most current technique in use is statistical hypothesis testing (see Figure 3).
Figure 3: Interpretation of microarray results via integration of gene expression profiles with corresponding sources of biological information
After this full three-step methodology, the expression-based approaches present the interpretation results as significant co-expressed and co-annotated groups of genes.
The next section presents some of the most representative approaches and methods of the expression-based axis. Since these approaches are quite numerous, we have classified them according to their main source of biological information: minimal information approaches, ontology approaches and bibliographic source approaches.
3.2 Expression-Based Semantic Approaches
Expression-based semantic approaches fundamentally integrate semantic annotations (contained in ontologies, thesauri, semantic networks, etc.) into co-expressed gene groups. Nowadays, semantic sources of biological information, i.e. structured and controlled vocabularies, are among the best available sources of information for analyzing microarray data in order to discover meaningful rules and patterns [1].
Expression-based semantic approaches are currently widely exploited. In this section we present seven of them: FunSpec [25], OntoExpress [26], Quality Tool [27], EASE [28], THEA [29], Graph Theoretic Modeling [12] and GENERATOR [30]. Each approach uses the Gene Ontology (GO) as source of biological annotation, sometimes combined with other gene/protein-related specific sources such as MIPS, KEGG, Pfam or Smart, or with molecular databases such as EMBL or SwissProt.
In recent years, GO has been chosen preferentially over other sources of information because of its unambiguous and comprehensible structure, which explains the recent explosion of many more expression-based GO approaches. Among these, we can cite the tools which integrate gene expression data with GO, such as GoMiner [31], FatiGO [32], GOstat [33], GOToolBox [34], GFINDer [35], CLENCH [36] and BiNGO [37]. An up-to-date GO compendium gives more integration methods, GO searching tools, GO browsing tools and related GO tools.
In the next section, we describe seven remarkable expression-based semantic solutions.
3.3 Remarkable Expression-Based Semantic Approaches
3.4 Expression-Based Bibliographic Approaches
Nowadays, bibliographic databases represent one of the richest and most frequently updated sources of biological information. This type of information, however, is under-exploited by researchers because of the highly unstructured, free-format characteristics of the published information and because of its overwhelming volume. The main challenges that come with bibliographic database integration are to manage interactions with textual sources (abstracts, articles, etc.) and to resolve syntactic problems that appear in biological language, such as synonyms or ambiguities. At the moment, some text mining methods and tools have been developed to manipulate this kind of biological textual information. Among these methods we can mention Suiseki [43], which focuses on the extraction and visualization of protein interactions; MedMiner [44], which takes advantage of GeneCards13 as a knowledge source and offers gene information related to specific keywords; XplorMed [45], which presents specified gene information through user interaction; EDGAR [46], which extracts information about drugs and genes relevant to cancer from the biomedical literature; and GIS [47], which retrieves and analyzes gene-related information from PubMed14 abstracts. These methods are useful as stand-alone applications, but they do not integrate gene expression profiles.
13 http://www.genecards.org
14 http://www.pubmed.gov
| Approach | Biological Source of Information | Distinctive Characteristic |
|---|---|---|
| FunSpec (Robinson et al. 2002) | GO, MIPS, EMBL and Pfam | One-tailed test. Fisher's exact statistic. Hypergeometric distribution. Bonferroni correction. |
| OntoExpress (Draghici et al. 2002) | GO (MF, BP and CC) | One-tailed tests. Fisher's exact and χ² statistics. Binomial and hypergeometric distributions. |
| Quality Tool (Gibbons et al. 2002) | GO (MF, BP and CC) | One-tailed test. z-score statistic. Normal distribution. |
| EASE (Hosack et al. 2003) | GO, KEGG, Pfam, Smart and SwissProt | One-tailed test. Fisher's exact statistic. Hypergeometric distribution. EASE correction. |
| THEA (Pasquier et al. 2004) | GO (MF, BP and CC) | One-tailed test. Fisher's exact statistic. Hypergeometric distribution. Friendly interface for quick annotation and cluster analysis. |
| Graph Theoretic Modeling (Sung 2004) | GO (MF, BP and CC) | One-tailed test. Average statistic. Non-parametric. |
| GENERATOR (Pehkonen et al. 2005) | GO (MF, BP and CC) | One-tailed test. Fisher's exact statistic. Hypergeometric and binomial distributions. Bonferroni correction. |
| AnnotationTool (Masys et al. 2001) | Medline (abstracts), MeSH (keywords), UMLS | One-tailed test. Estimated vs. observed likelihood. Semi-parametric: empirical likelihood. |

TABLE II: Expression-based approaches
3.5 Comparison of Expression-Based Approaches
4 Co-Clustering Axis
From the beginning of gene expression technologies, clustering algorithms focused on grouping gene expression profiles with biological conditions [16]. Sources of biological information, and well-structured ontologies such as GO and KEGG in particular, are constantly growing in quantity and quality, and they have opened the interpretation challenge of grouping heterogeneous data such as numeric gene expression profiles and textual gene annotations. Co-clustering approaches focus their effort on answering this challenge. Each co-clustering approach has its specific parameters: biological source of information, clustering method and integration algorithm. They generally follow the three-step methodology described below.
New co-clustering integration approaches are currently one of the interpretation challenges in gene expression technologies. At the moment, few co-clustering approaches have been reported, since the principal barrier is the difficulty of building clustering methods that fit heterogeneous sources of information. Among the co-clustering approaches we can cite Co-Cluster [15] and Bicluster [14], described in the subsection on remarkable co-clustering methods.
15 http://www.nlm.nih.gov/mesh
16 http://umlsks.nlm.nih.gov
4.1 Co-Clustering Methodology
In a first step, they state two different measures: one measure to manipulate gene expression profiles and another one for gene annotations, handled in an independent manner.
In a second step, they apply an integration criterion (merging function, graphical function, etc.) within the co-clustering algorithm to build the co-expressed and co-annotated gene groups simultaneously.
In the last step, they select the significant co-expressed and co-annotated gene groups; the most recent solutions [49], [50], [51], [52], [53], [54] and [55] test the quality of the final clusters.
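One plausible way to realize such an integration criterion, shown here only as a hypothetical sketch rather than the metric of any cited method, is a weighted combination of an expression distance and an annotation distance:

```python
import math

def jaccard_distance(ann_a, ann_b):
    """Annotation distance: 1 minus the Jaccard overlap of two annotation sets."""
    if not (ann_a or ann_b):
        return 0.0
    return 1.0 - len(ann_a & ann_b) / len(ann_a | ann_b)

def combined_distance(expr_a, expr_b, ann_a, ann_b, w=0.5):
    """Convex combination: w weighs the expression side, 1 - w the knowledge side."""
    return w * math.dist(expr_a, expr_b) + (1.0 - w) * jaccard_distance(ann_a, ann_b)

# Identical profiles but disjoint annotations still keep the genes apart.
d = combined_distance([1.0, 2.0], [1.0, 2.0], {"glycolysis"}, {"cell cycle"}, w=0.5)
print(d)  # 0.5: zero expression distance, full annotation distance, half weight
```

The weight w makes the trade-off discussed in Section 5 explicit: w near 1 recovers an expression-based method, w near 0 a knowledge-based one.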
4.2 Remarkable Co-Clustering Methods
| Approach | Biological Source of Information | Gene Expression Profiles Measure and Gene Distance | Co-Clustering Details | Co-Expressed and Co-Annotated Gene Group Selection Details |
|---|---|---|---|---|
| Co-Cluster (Hanisch et al. 2003) | KEGG | Fold change; Pearson's correlation coefficient as gene-to-gene distance | Hierarchical average-linkage clustering | Silhouette coefficient |
| Biclustering (Liu J. et al. 2004) | GO (MF, BP and CC) | Fold change; gene tendency based on gene rank between biological conditions | SHTP: Smart Hierarchical Tendency-Preserving algorithm | One-tailed Fisher's test; α threshold construction |

TABLE III: Co-clustering integration approaches
GO annotations tree becomes the selected significant group of co-annotated and co-expressed
genes by tendency.
4.3 Comparison of Co-Clustering Approaches
Table III presents a brief summary of the two co-clustering approaches explained in the last subsection. It is based on four parameters: source of biological information, expression profile measure, co-clustering details, and co-expressed and co-annotated gene group selection details.
Both approaches select well-structured ontologies: the KEGG database in Co-Cluster and GO in Bi-Cluster. These ontologies have a graph-based representation that allows the clustering algorithm to integrate gene expression profiles with gene database annotations.
For manipulating gene expression measures, both methods use fold change expression measures. Nevertheless, Co-Cluster chooses Pearson's correlation coefficient as gene-to-gene distance, while Bi-Cluster chooses a gene tendency measure based on the gene rank between biological conditions.
Concerning co-clustering details, both Co-Cluster and Bi-Cluster have chosen a hierarchical clustering method. However, Co-Cluster has opted for a typical hierarchical average-linkage algorithm, while Bi-Cluster has developed the Smart Hierarchical Tendency-Preserving (SHTP) algorithm.
Regarding gene group selection, Co-Cluster uses the silhouette coefficient to determine the quality of the clusters built, selecting the significant ones. On the other hand, Bi-Cluster opts for a selection in two stages: first it uses a standard one-tailed Fisher's test to calculate the p-value for the co-annotated and co-expressed gene groups, and then it builds a particular threshold for each of them. Finally, as in the previous approaches, it compares the p-value against α to select or reject the co-expressed and co-annotated gene group.
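The silhouette coefficient used for this selection can be computed per gene as s = (b - a) / max(a, b), where a is the mean distance to the gene's own cluster and b the mean distance to the nearest other cluster [57]. A small sketch on toy profiles:

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """s = (b - a) / max(a, b); close to 1 means the point sits well inside
    its cluster, close to -1 means it is nearer to another cluster."""
    others_in_own = [q for q in own_cluster if q != point]
    a = sum(math.dist(point, q) for q in others_in_own) / max(len(others_in_own), 1)
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

low = [[0.0, 0.0], [0.0, 1.0]]
high = [[10.0, 0.0], [10.0, 1.0]]
s = silhouette([0.0, 0.0], low, [high])
print(round(s, 3))
```

Averaging s over a cluster's genes gives the per-cluster quality score that drives the selection.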
5 Discussion
The bioinformatics community has developed many approaches to tackle the microarray interpretation challenge; we classify them into three different interpretation axes: prior, standard and co-clustering. The important intrinsic characteristics of each axis have been developed above.
Standard or expression-based approaches give importance to gene expression profiles. However, microarray history has revealed intrinsic errors in microarray measures and protocols that accumulate during the whole microarray analysis process. Thus, the expression-based interpretation results can be severely biased [13], [14].
On the other hand, prior or knowledge-based approaches give importance to biological knowledge. Nevertheless, all sources of biological information impose many integration constraints: the database format or structure, the small quantity of annotated genes, or the difficulty of maintaining up-to-date and well-revised annotations, for instance. Consequently, the knowledge-based interpretation results can be poor or quite small in relation to the whole studied biological process.
Co-clustering approaches represent the best compromise in terms of integration, giving the same weight to expression profiles and biological knowledge. But they have to deal with the algorithmic issue of integrating these two elements at the same time, and they are often forced to give more weight to one of them. In the last section above, we have seen two examples: the Co-Cluster algorithm gives more weight to knowledge, with expression profiles used to guide the clustering analysis, while the Bi-Cluster algorithm gives more weight to tendency in expression profiles, with GO annotations used to guide the clustering analysis.
Indeed, improvements in microarray data quality, in the microarray analysis process and in the completeness of biological information sources should make the interpretation results less dependent on the chosen interpretation axis. As long as these main elements are not sufficiently reliable, the choice of the interpretation approach remains of crucial importance for the final interpretation results.
References
[1] T. Attwood and C. J. Miller, Which craft is best in bioinformatics? Computer Chemistry, vol. 25, pp. 329–339, 2001.
[2] A. Zhang, Advanced analysis of gene expression microarray data, 1st ed., ser. Science, Engineering, and Biology Informatics. World Scientific, 2006, vol. 1.
[3] R. Cho, M. Campbell, and E. Winzeler, A genome-wide transcriptional analysis of the mitotic cell cycle, Molecular Cell, vol. 2, pp. 65–73, 1998.
[4] J. DeRisi, L. Iyer, and V. Brown, Exploring the metabolic and genetic control of gene expression on a genomic scale, Science, vol. 278, pp. 680–686, 1997.
[5] M. Eisen, P. Spellman, P. Brown, and D. Botstein, Cluster analysis and display of genome-wide expression patterns, in Proceedings of the National Academy of Sciences of the USA, vol. 95, no. 25, 1998, pp. 14863–14868.
[6] P. Tamayo and D. Slonim, Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, in Proceedings of the National Academy of Sciences of the USA, vol. 96, 1999, pp. 2907–2912.
[7] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church, Systematic determination of genetic network architecture, Nature Genetics, vol. 22, pp. 281–285, 1999.
[8] A. Ben-Dor, R. Shamir, and Z. Yakhini, Clustering gene expression patterns, Computational Biology, vol. 6, pp. 281–297, 1999.
[9] C. Blaschke, L. Hirschman, and A. Valencia, Co-clustering of biological networks and gene expression data, Bioinformatics, vol. 18, pp. S145–S154, 2002.
[27] D. Gibbons and F. Roth, Judging the quality of gene expression-based clustering methods using gene annotation, Genome Research, vol. 12, pp. 1574–1581, 2002.
[28] D. Hosack and G. Dennis, Identifying biological themes within lists of genes with EASE, Genome Biology, vol. 4, no. 70, 2003.
[29] C. Pasquier, F. Girardot, K. Jevardat, and R. Christen, THEA: ontology-driven analysis of microarray data, Bioinformatics, vol. 20, no. 16, 2004.
[30] P. Pehkonen, G. Wong, and P. Toronen, Theme discovery from gene lists for identification and viewing of multiple functional groups, BMC Bioinformatics, vol. 6, p. 162, 2005.
[31] W. Feng, G. Wang, B. Zeeberg, K. Guo, A. Fojo, D. Kane, W. Reinhold, S. Lababidi, J. Weinstein, and M. Wang, Development of gene ontology tool for biological interpretation of genomic and proteomic data, in AMIA Annual Symposium Proceedings, 2003, p. 839.
[32] F. Al-Shahrour, R. Diaz-Uriarte, and J. Dopazo, FatiGO: a web tool for finding significant associations of gene ontology terms with groups of genes, Bioinformatics, vol. 20, no. 4, pp. 578–580, 2004.
[33] T. Beissbarth and T. Speed, GOstat: find statistically overrepresented gene ontologies within a group of genes, Bioinformatics, vol. 20, no. 9, pp. 1464–1465, 2004.
[34] D. Martin, C. Brun, E. Remy, P. Mouren, D. Thieffry, and B. Jacq, GOToolBox: functional analysis of gene datasets based on gene ontology, Genome Biology, vol. 5, no. 12, 2004.
[35] M. Masseroli, D. Martucci, and F. Pinciroli, GFINDer: genome function integrated discoverer through dynamic annotation, statistical analysis, and mining, Nucleic Acids Research, vol. 32, pp. 293–300, 2004.
[36] N. Shah and N. Fedoroff, CLENCH: a program for calculating cluster enrichment using the gene ontology, Bioinformatics, vol. 20, pp. 1196–1197, 2004.
[37] S. Maere, K. Heymans, and M. Kuiper, BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks, Bioinformatics, vol. 21, pp. 3448–3449, 2005.
[38] R. Fisher, On the interpretation of χ2 from contingency tables, and the calculation of P, Journal of the Royal Statistical Society, vol. 85, no. 1, pp. 87–94, 1922.
[39] K. Kerr and G. Churchill, Statistical design and the analysis of gene expression microarray data, Genetics Research, vol. 77, pp. 123–128, 2001.
[40] M. Man, Z. Wang, and Y. Wang, POWER_SAGE: comparing statistical tests for SAGE experiments, Bioinformatics, vol. 16, no. 11, pp. 953–959, 2000.
[41] L. Fisher and G. Van Belle, Biostatistics: a methodology for the health sciences. New York, USA: Wiley and Sons, 1993.
[42] T. Cover and J. Thomas, Elements of information theory. New York, USA: Wiley-Interscience, 1991.
[43] C. Blaschke, R. Hoffmann, A. Valencia, and J. Oliveros, Extracting information automatically from biological literature, Comparative and Functional Genomics, vol. 2, no. 5, pp. 310–313, 2001.
[44] L. Tanabe, U. Scherf, L. Smith, J. Lee, L. Hunter, and J. Weinstein, MedMiner: an internet text-mining tool for biomedical information, with application to gene expression profiling, Biotechniques, vol. 27, no. 6, pp. 1210–1217, 1999.
[45] C. Perez-Iratxeta, P. Bork, and M. Andrade, Exploring Medline abstracts with XplorMed, Drugs of Today, vol. 38, no. 6, pp. 381–389, 2002.
[46] T. Rindflesch, L. Tanabe, J. Weinstein, and L. Hunter, EDGAR: extraction of drugs, genes and relations from the biomedical literature, in Proceedings of the Pacific Symposium on Biocomputing, 2000, pp. 517–528.
[47] J. Chiang, H. Yu, and H. Hsu, GIS: a biomedical text-mining system for gene information discovery, Bioinformatics, vol. 20, no. 1, pp. 120–121, 2004.
[48] D. Masys, Use of keyword hierarchies to interpret gene expression patterns, Bioinformatics, vol. 17, pp. 319–326, 2001.
[49] K. Zhang and H. Zhao, Assessing reliability of gene clusters from gene expression data, Functional Integrative Genomics, pp. 156–173, 2000.
[50] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, On clustering validation techniques, Intelligent Information Systems, vol. 17, pp. 107–145, 2001.
[51] F. Azuaje, A cluster validity framework for genome expression data, Bioinformatics, vol. 18, pp. 319–320, 2002.
[52] A. Ben-Hur, A. Elisseeff, and I. Guyon, A stability based method for discovering structure in clustered data, in Pacific Symposium on Biocomputing, vol. 7, 2002, pp. 6–17.
[53] S. Datta and S. Datta, Comparisons and validation of clustering techniques for microarray gene expression data, Bioinformatics, vol. 4, pp. 459–466, 2003.
[54] C. Giurcaneanu, I. Tabus, I. Shmulevich, and W. Zhang, Stability-based cluster analysis applied to microarray data, in Proceedings of the Seventh International Symposium on Signal Processing and its Applications, 2003, pp. 57–60.
[55] M. Smolkin and D. Ghosh, Cluster stability scores for microarray data in cancer studies, BMC Bioinformatics, vol. 4, no. 36, 2003.
[56] D. Elihu, P. Nechama, and L. Menachem, Mercury exposure and effects at a thermometer factory, Scandinavian Journal of Work, Environment and Health, vol. 8, no. 1, pp. 161–166, 2004.
[57] P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[58] J. Liu, J. Yang, and W. Wang, Gene ontology friendly biclustering of expression profiles, in Computational Systems Bioinformatics Conference, CSB 2004 Proceedings, 2004, pp. 436–447.