Context-Based Diversification
for Keyword Queries Over XML Data
Jianxin Li, Chengfei Liu, Member, IEEE, and Jeffrey Xu Yu, Senior Member, IEEE
Abstract: While keyword queries empower ordinary users to search vast amounts of data, their ambiguity makes it difficult to answer them effectively, especially when they are short and vague. To address this challenging problem, in this paper we propose an approach that automatically diversifies XML keyword search based on its different contexts in the XML data. Given a short and vague keyword query and the XML data to be searched, we first derive keyword search candidates of the query by a simple feature selection model. We then design an effective XML keyword search diversification model to measure the quality of each candidate. After that, two efficient algorithms are proposed to incrementally compute the top-k qualified query candidates as the diversified search intentions. Two selection criteria are targeted: the k selected query candidates must be the most relevant to the given query while covering a maximal number of distinct results. Finally, a comprehensive evaluation on real and synthetic data sets demonstrates the effectiveness of our proposed diversification model and the efficiency of our algorithms.
Index Terms: XML keyword search, context-based diversification
INTRODUCTION
TABLE 1
Top 10 Selected Feature Terms of q (columns: keyword, features; keywords: database, query)
them perform diversification as a post-processing or re-ranking step of document retrieval based on an analysis of the result set and/or the query logs. In IR, keyword search diversification is designed at the topic or document level. For example, Agrawal et al. [7] model user intents at the topical level of a taxonomy, and Radlinski and Dumais [11] obtain the possible query intents by mining query logs. However, such taxonomies and query logs are not always available. In addition, the diversified results in IR are often modeled at the document level. To improve the precision of query diversification in structured databases or semi-structured data, it is desirable to consider both the structure and the content of the data in the diversification model. Therefore, the problem of keyword search diversification needs to be reconsidered for structured databases and semi-structured data. Liu et al. [12] presented the first work that measures the difference of XML keyword search results by comparing their feature sets. However, the selection of feature sets in [12] is limited to metadata in XML, and it is also a post-processing method that analyzes the search results.
Different from the above post-processing methods, another line of work addresses the problem of intent-based keyword query diversification by constructing structured query candidates [13], [14]. The basic idea is to first map each keyword to a set of attributes (metadata), and then construct a large number of structured query candidates by merging the attribute-keyword pairs. They assume that each structured query candidate represents a type of search intention, i.e., a query interpretation. However, these works are difficult to apply in real applications due to the following three limitations:
TABLE 2
Part of the Statistics for q (#results of each expanded query candidate)

"database systems query" +
  language: 71   expansion: 5    optimization: 68   evaluation: 13   complexity: 1
  log: 12        efficient: 17   distributed: 50    semantic: 14     translation: 8

"relational database query" +
  language: 40   expansion: 0    optimization: 20   evaluation: 8    complexity: 0
  log: 2         efficient: 11   distributed: 5     semantic: 7      translation: 5
PROBLEM DEFINITION
TABLE 3
Mutual Information Scores w.r.t. Terms in q = {database, query}

Term           Mutual score (x10^4)      Term           Mutual score (x10^4)
system         7.06                      image          1.73
language       3.63                      log            1.17
relational     3.84                      sequence       1.31
expansion      2.97                      efficient      1.03
protein        2.79                      search         1.1
optimization   2.3                       distributed    0.99
distributed    2.25                      model          1.04
evaluation     1.71                      semantic       0.86
oriented       2.06                      large          1.02
complexity     1.41                      translation    0.70
Let Prob(x, T) be the probability of term x occurring in R(T), i.e., Prob(x, T) = |R(x, T)| / |R(T)|, where |R(x, T)| is the number of results containing x. Let Prob(x, y, T) be the probability of terms x and y co-occurring in R(T), i.e., Prob(x, y, T) = |R(x, y, T)| / |R(T)|. If terms x and y are independent, then knowing x does not give any information about y and vice versa, so their mutual information is zero. At the other extreme, if terms x and y are identical, then knowing x determines the value of y and vice versa. Therefore, this simple measure can be used to quantify how much the observed word co-occurrences maximize the dependency among feature terms while reducing their redundancy. In this work, we use the widely accepted mutual information model as follows:
MI(x, y, T) = Prob(x, y, T) \cdot \log \frac{Prob(x, y, T)}{Prob(x, T) \cdot Prob(y, T)}.   (1)
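To make Equation (1) concrete, the following is a minimal Python sketch (not the authors' implementation) of how the mutual information of two feature terms could be estimated from precomputed result sets; the input structures results_with (term -> set of result ids) and total_results are illustrative assumptions.

import math

def mutual_information(x, y, results_with, total_results):
    """Estimate MI(x, y, T) as in Eq. (1) from per-term result sets.

    results_with maps a term to the set of result ids in R(T) that contain it;
    total_results is |R(T)|. Both are hypothetical inputs for illustration.
    """
    rx = results_with.get(x, set())
    ry = results_with.get(y, set())
    p_x = len(rx) / total_results        # Prob(x, T) = |R(x, T)| / |R(T)|
    p_y = len(ry) / total_results        # Prob(y, T) = |R(y, T)| / |R(T)|
    p_xy = len(rx & ry) / total_results  # Prob(x, y, T): co-occurrence probability
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return 0.0                       # independent or never observed together
    return p_xy * math.log(p_xy / (p_x * p_y))

# Terms that co-occur more often than independence predicts get a positive score.
results_with = {"database": {1, 2, 3, 5}, "query": {2, 3, 5, 7}, "protein": {8}}
print(mutual_information("database", "query", results_with, total_results=10))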
(2)

2.2.1

\frac{|R(f_{i,j_i}, T)|}{|R(T)|}   (3)

Prob(k_i \mid f_{i,j_i}, T) = \frac{|R(s_i, T)|}{|R(f_{i,j_i}, T)|}   (5)

\frac{|R(q_{new}, T)|}{|R(T)|} = \frac{|\bigcap_{s_i \in q_{new}} R(s_i, T)|}{|R(T)|}   (6)

\prod_{s_i \in q_{new}} \frac{|R(s_i, T)|}{|R(f_{i,j_i}, T)|} \cdot \frac{|\bigcap_{s_i \in q_{new}} R(s_i, T)|}{|R(T)|}   (7)

Although the above equations can model the probabilities of the generated query candidates (i.e., the relevance between these query candidates and the original query w.r.t. the data), different query candidates may have overlapping result sets. Therefore, we should also take into account the novelty of the results of these query candidates.

2.2.2

DIF(q_{new}, Q, T) = \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \{\bigcup_{q' \in Q} R(q', T)\} \wedge v_x \preceq v_y\}|}{|R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\}|},   (8)

where R(q_{new}, T) represents the set of SLCA results generated by q_{new}; \bigcup_{q' \in Q} R(q', T) represents the set of SLCA results generated by the queries in Q, which excludes duplicate and ancestor nodes; v_x \preceq v_y means that v_x is a duplicate of v_y (for =) or v_x is an ancestor of v_y (for <); and R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\} is an SLCA result set that satisfies the exclusive property. By doing this, we can avoid presenting repeated or overlapping SLCA results to users. In other words, the consideration of novelty allows us to incrementally refine the diversified results into more specific ones as we incrementally process more query candidates.

Recall that our problem is to find the top-k qualified query candidates and their relevant SLCA results. To do this, we can compute the absolute score of the search intention for each generated query candidate. However, to reduce the computational cost, an alternative is to calculate the relative scores of the queries, which leads to the following equation transformation. Substituting Equations (7) and (8) into Equation (2), we obtain the final equation

score(q_{new}) = \gamma \cdot \prod_{s_i \in q_{new}} \frac{|R(s_i, T)|}{|R(f_{i,j_i}, T)|} \cdot \frac{|\bigcap_{s_i \in q_{new}} R(s_i, T)|}{|R(T)|} \cdot \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \bigcup_{q' \in Q} R(q', T) \wedge v_x \preceq v_y\}|}{|R(q_{new}, T) \cup \bigcup_{q' \in Q} R(q', T)|}

\mapsto \prod_{s_i \in q_{new}} \frac{|R(s_i, T)|}{|R(f_{i,j_i}, T)|} \cdot |\bigcap_{s_i \in q_{new}} R(s_i, T)| \cdot \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \bigcup_{q' \in Q} R(q', T) \wedge v_x \preceq v_y\}|}{|R(q_{new}, T) \cup \bigcup_{q' \in Q} R(q', T)|},   (9)

where k_i \in q, s_i \in q_{new}, f_{i,j_i} \in s_i, and \gamma = 1 / Prob(q \mid T), in which Prob(q \mid T) can be any value in (0, 1] because it does not affect the expanded query candidates w.r.t. an original keyword query q and the data T.
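As a concrete illustration of how Equations (7)-(9) combine relevance and novelty, the sketch below scores a new query candidate from plain Python sets of SLCA node ids; the helper is_anc_or_dup stands in for the v_x \preceq v_y test, and all names and inputs are assumptions for illustration rather than the paper's data structures.

def dif(new_results, prev_results, is_anc_or_dup):
    """Novelty factor DIF(q_new, Q, T) of Eq. (8): fraction of genuinely new SLCA results."""
    novel = {vx for vx in new_results
             if not any(is_anc_or_dup(vx, vy) for vy in prev_results)}
    union = new_results | prev_results
    return len(novel) / len(union) if union else 0.0

def relative_score(pair_counts, new_results, prev_results, is_anc_or_dup):
    """Relative score in the spirit of Eq. (9).

    pair_counts is a list of (|R(s_i, T)|, |R(f_i, T)|) counts, one per expanded keyword;
    new_results plays the role of R(q_new, T), the intersection of the R(s_i, T) sets.
    """
    relevance = 1.0
    for r_si, r_fi in pair_counts:
        relevance *= r_si / r_fi
    return relevance * len(new_results) * dif(new_results, prev_results, is_anc_or_dup)

# Toy usage: a duplicate-only dominance test and two expanded keywords.
same = lambda vx, vy: vx == vy
print(relative_score([(40, 80), (12, 30)], {1, 2, 3}, {2}, same))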
Since

\frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \{\bigcup_{q' \in Q} R(q', T)\} \wedge v_x \preceq v_y\}|}{|R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\}|} \le 1,

we can derive that

|R(s_i, T)| \cdot \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists v_y \in \{\bigcup_{q' \in Q} R(q', T)\} \wedge v_x \preceq v_y\}|}{|R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\}|} \le \min\{|L_{k_i}|\}.  \square
6:  f ← ComputeSLCA({l_i^x, l_j^y});
7:  prob_qnew ← prob_sk * |f|;
8:  if F is empty then
9:      score(q_new) ← prob_qnew;
10: else
11:     for all result candidates r_x ∈ f do
12:         for all result candidates r_y ∈ F do
13:             if r_x = r_y or r_x is an ancestor of r_y then
14:                 f.remove(r_x);
15:             else if r_x is a descendant of r_y then
16:                 F.remove(r_y);
17:     score(q_new) ← prob_qnew * |f| * |f| / (|f| + |F|);
18: if |Q| < k then
19:     put ⟨q_new : score(q_new)⟩ into Q;
20:     put ⟨q_new : f⟩ into F;
21: else if score(q_new) > min{score(q'_new) | q'_new ∈ Q} then
22:     replace ⟨q'_new : score(q'_new)⟩ with ⟨q_new : score(q_new)⟩;
23:     F.remove(q'_new);
24: return Q and result set F;
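The core bookkeeping of the listed pseudocode, enforcing the exclusive SLCA property between the new results f and the kept results F (lines 11-16) and maintaining the top-k candidate set (lines 18-23), could be sketched in Python as follows; the function and parameter names are assumptions for illustration, not the paper's implementation.

def merge_results(f, F, is_ancestor):
    """Drop new results dominated by kept ones, and kept results dominated by new ones."""
    for rx in list(f):
        for ry in list(F):
            if rx == ry or is_ancestor(rx, ry):
                f.remove(rx)    # rx duplicates or is an ancestor of a kept result: discard it
                break           # rx is gone; move on to the next new result
            if is_ancestor(ry, rx):
                F.remove(ry)    # the kept result is less specific than rx: replace it

def update_topk(Q, F, q_new, score, results, k):
    """Keep only the k best-scoring query candidates and their result sets."""
    if len(Q) < k:
        Q[q_new], F[q_new] = score, results
    else:
        worst = min(Q, key=Q.get)          # lowest-scoring candidate currently kept
        if score > Q[worst]:
            del Q[worst], F[worst]
            Q[q_new], F[q_new] = score, results

Using a heap keyed by score instead of the min() scan would avoid the linear pass over Q when k is large.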
the interrelationships between the intermediate SLCA candidates that have already been computed for the generated query candidates Q and the nodes that will be merged for answering the newly generated query candidate q_new. We then present the detailed description and the algorithm of the anchor-based pruning solution.
individually; no new and distinct SLCA results can be generated across the areas.
Proof. For the keyword nodes in the ancestor area, all of them are ancestors of the anchor node v_a, which is an SLCA node w.r.t. Q. All these keyword nodes can be directly discarded due to the exclusive property of SLCA semantics. For the keyword nodes in the area L_{v_a}^{pre} (or L_{v_a}^{next}), we need to compute their SLCA candidates, but the computation can be bounded by the preorder (or postorder) traversal number of v_a, where we assume the range encoding scheme is used in this work. For the keyword nodes in L_{v_a}^{des}, if they can produce SLCA candidates that are bounded in the range of v_a, then these candidates will be taken as new and distinct results, while v_a will be removed from the result set. The last step is to certify that the keyword nodes across any two areas cannot produce new and distinct SLCA results. This is because the only possible SLCA candidates they may produce are the node v_a and its ancestor nodes, and these nodes cannot become new and distinct SLCA results due to the anchor node v_a. For instance, if we select some keyword nodes from L_{v_a}^{pre} and some keyword nodes from L_{v_a}^{des}, then their corresponding SLCA candidate must be an ancestor of v_a, which cannot become a new and distinct SLCA result due to the exclusive property of SLCA semantics. Similarly, no new result can be generated if we select some keyword nodes from the other two areas L_{v_a}^{pre} and L_{v_a}^{next} (or L_{v_a}^{des} and L_{v_a}^{next}). \square
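Under the range encoding assumed above, where each node carries the pre- and postorder numbers that delimit its subtree, the four areas around an anchor node reduce to simple interval comparisons. The sketch below is only illustrative: the Node structure and the sample numbers are assumptions, not taken from Fig. 2.

from dataclasses import dataclass

@dataclass
class Node:
    label: str
    start: int  # preorder number: where the node's subtree begins
    end: int    # postorder number: where the node's subtree ends

def partition_by_anchor(keyword_nodes, anchor):
    """Split keyword nodes into the pre, next, anc and des areas around an anchor SLCA node.

    Nodes in the ancestor area can be discarded (exclusive SLCA property); the other
    three areas are evaluated independently, as argued in the proof above.
    """
    pre, nxt, anc, des = [], [], [], []
    for v in keyword_nodes:
        if v.end < anchor.start:
            pre.append(v)   # entirely before the anchor's subtree
        elif v.start > anchor.end:
            nxt.append(v)   # entirely after the anchor's subtree
        elif v.start < anchor.start and v.end > anchor.end:
            anc.append(v)   # ancestor of the anchor: cannot yield a new, more specific SLCA
        else:
            des.append(v)   # inside the anchor's subtree
    return pre, nxt, anc, des

anchor = Node("anchor", 16, 23)
nodes = [Node("a", 3, 5), Node("b", 14, 25), Node("c", 18, 19), Node("d", 26, 27)]
print([[n.label for n in area] for area in partition_by_anchor(nodes, anchor)])
# -> [['a'], ['d'], ['b'], ['c']]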
Example 3. Fig. 2 gives a small tree that contains four different terms k_1, k_2, f_1 and f_2, and the encoding ranges are attached to all the nodes. Assume f_1 and f_2 are two feature terms of a given keyword query {k_1, k_2}, and f_1 has a higher correlation with the query than f_2. After the first intended query {k_1, k_2, f_1} is evaluated, the node v_10 will be the SLCA result. When we process the next intended query {k_1, k_2, f_2}, the node v_10 will be an anchor node that partitions the keyword nodes into four parts by using its encoding range [16, 23]: L^{pre} = {v_2, v_3, v_4, v_5, v_6, v_7, v_9}, in which each node's postorder number is less than 16; L^{next} = {v_14, v_15, v_16, v_17}, in which each node's preorder number is larger than 23; L^{anc} = {v_1, v_8}, in which each node's preorder number is less than 16 while its postorder number is larger than 23; and the remaining nodes belong to L^{des} = {v_11, v_12, v_13}. Although v_8 contains f_2, it cannot contribute more specific results than the anchor node v_10 due to the exclusive property of SLCA semantics. So we only need to evaluate {k_1, k_2, f_2} over the three areas L^{pre}, L^{next} and L^{des} individually. Since no node contains f_2 in L^{des},
EXPERIMENTS
\frac{1}{C_k^2} \sum_{q_i, q_j \in topk\text{-}DS(q),\, i \neq j} \left(1 - \frac{\sum_{r_i \in R_{q_i}} \sum_{r_j \in R_{q_j}} Sim(r_i, r_j)}{|R_{q_i}| \cdot |R_{q_j}|}\right)   (10)

where topk-DS(q) represents the top-k diversified query suggestions; |R_{q_i}| represents the number of selected most relevant results for the diversified query suggestion q_i; and Sim(r_i, r_j) represents the similarity of two results r_i and r_j, measured by an n-gram model.
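For the similarity term Sim(r_i, r_j) in Equation (10), a word-level n-gram overlap is one plausible instantiation; the Python sketch below uses a Jaccard overlap of bigrams and an unweighted average over suggestion pairs, which follows the spirit of the metric but is not necessarily the exact model used in the experiments.

from itertools import combinations

def ngrams(text, n=2):
    """Word-level n-grams of a result's text, used as its similarity signature."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def sim(r_i, r_j, n=2):
    """Jaccard overlap of n-gram sets, standing in for Sim(r_i, r_j)."""
    a, b = ngrams(r_i, n), ngrams(r_j, n)
    return len(a & b) / len(a | b) if a | b else 0.0

def diversity(result_lists, n=2):
    """Average pairwise dissimilarity across the result lists of the k suggestions."""
    pair_scores = []
    for R_i, R_j in combinations(result_lists, 2):
        total = sum(sim(r_i, r_j, n) for r_i in R_i for r_j in R_j)
        pair_scores.append(1 - total / (len(R_i) * len(R_j)))
    return sum(pair_scores) / len(pair_scores) if pair_scores else 0.0

print(diversity([["xml keyword search results"], ["relational database query optimization"]]))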
contain a few specific keywords and their produced structured queries (i.e., query suggestions in DivQ) have significantly different structures. The experimental study shows that our context-sensitive diversification model outperforms the DivQ probabilistic model due to the consideration of both the word context (i.e., the term correlation graph) and the structure (i.e., the exclusive property of SLCA).
suggestions, as well as the novelty of the results to be generated by these query suggestions.
RELATED WORK
Diversifying the results of document retrieval has been introduced in [6], [7], [8], [9]. Most of the techniques perform diversification as a post-processing or re-ranking step of document retrieval. Related work on result diversification in IR also includes [25], [26], [27]. Santos et al. [25] used a probabilistic framework to diversify document ranking, by which web search result diversification is addressed. They also applied a similar model to discuss search result diversification through sub-queries in [26]. Gollapudi and Sharma [27] proposed a set of natural axioms that a diversification system is expected to satisfy, by which it can improve user satisfaction with the diversified results. Different from the above works, our diversification model in this paper is designed to process keyword queries over structured data. We have to consider the structure of the data in our model and algorithms, rather than being limited to pure text data like the above methods. In addition, our algorithms can incrementally generate query suggestions and evaluate them. The diversified search results can be returned with the qualified query suggestions without depending on the whole result set of the original keyword query.
Recently, several relevant works have discussed the problem of result diversification in structured data. For instance, [28] conducted a thorough experimental evaluation of various diversification techniques implemented in a common framework and proposed a threshold-based method to control the tradeoff between relevance and diversity in their diversification metric. However, it is a big challenge for users to set the threshold value. Hasan et al. [29] developed efficient algorithms to find the top-k most diverse set of results for structured queries over semi-structured data. As we know, a structured query can express a much clearer search intention of a user; therefore, diversifying structured query results is less significant than diversifying keyword search results. In [30], Panigrahi et al. focused on the selection of a diverse item set, without considering the structural relationships of the items to be selected. The most relevant work to ours is the DivQ approach in [13], where Demidova et al. first identified the attribute-keyword pairs for an original keyword query and then constructed a large number of structured queries by connecting the attribute-keyword pairs using the data schema (the attributes can be mapped to corresponding labels in the schema). The challenging problem in [13] is that two generated structured queries with slightly different structures may still be considered different types of search intentions, which may hurt the effectiveness of diversification, as shown in our experiments. In contrast, our diversification model in this work utilizes mutually co-occurring feature term sets as contexts to represent different query suggestions, and the feature terms are selected based on both their mutual correlation and their distinct result sets. The structure of the data is considered by satisfying the exclusive property of SLCA semantics.
CONCLUSIONS
ACKNOWLEDGMENTS
This work was supported by the ARC Discovery Projects under Grants DP110102407 and DP120102627, and was partially supported by a grant from the RGC of the Hong Kong SAR (No. 418512).
REFERENCES
[1] Y. Chen, W. Wang, Z. Liu, and X. Lin, "Keyword search on structured and semi-structured data," in Proc. SIGMOD Conf., 2009, pp. 1005-1010.
[2] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram, "XRANK: Ranked keyword search over XML documents," in Proc. SIGMOD Conf., 2003, pp. 16-27.
[3] C. Sun, C. Y. Chan, and A. K. Goenka, "Multiway SLCA-based keyword search in XML data," in Proc. 16th Int. Conf. World Wide Web, 2007, pp. 1043-1052.
[4] Y. Xu and Y. Papakonstantinou, "Efficient keyword search for smallest LCAs in XML databases," in Proc. SIGMOD Conf., 2005, pp. 537-538.
[5] J. Li, C. Liu, R. Zhou, and W. Wang, "Top-k keyword search over probabilistic XML data," in Proc. IEEE 27th Int. Conf. Data Eng., 2011, pp. 673-684.
[6] J. G. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," in Proc. SIGIR, 1998, pp. 335-336.
[7] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, "Diversifying search results," in Proc. 2nd ACM Int. Conf. Web Search Data Mining, 2009, pp. 5-14.
[8] H. Chen and D. R. Karger, "Less is more: Probabilistic models for retrieving fewer relevant documents," in Proc. SIGIR, 2006, pp. 429-436.
[9] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon, "Novelty and diversity in information retrieval evaluation," in Proc. SIGIR, 2008, pp. 659-666.
[10] A. Angel and N. Koudas, "Efficient diversity-aware search," in Proc. SIGMOD Conf., 2011, pp. 781-792.
[11] F. Radlinski and S. T. Dumais, "Improving personalized web search using result diversification," in Proc. SIGIR, 2006, pp. 691-692.
[12] Z. Liu, P. Sun, and Y. Chen, "Structured search result differentiation," Proc. VLDB Endowment, vol. 2, no. 1, pp. 313-324, 2009.
[13] E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl, "DivQ: Diversification for keyword search over structured databases," in Proc. SIGIR, 2010, pp. 331-338.
[14] J. Li, C. Liu, R. Zhou, and B. Ning, "Processing XML keyword search by constructing effective structured queries," in Advances in Data and Web Management. New York, NY, USA: Springer, 2009, pp. 88-99.
[15] H. Peng, F. Long, and C. H. Q. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
[16] C. O. Sakar and O. Kursun, "A hybrid method for feature selection based on mutual information and canonical correlation analysis," in Proc. 20th Int. Conf. Pattern Recognit., 2010, pp. 4360-4363.
[17] N. Sarkas, N. Bansal, G. Das, and N. Koudas, "Measure-driven keyword-query expansion," Proc. VLDB Endowment, vol. 2, no. 1, pp. 121-132, 2009.
[18] N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa, "Seeking stable clusters in the blogosphere," in Proc. 33rd Int. Conf. Very Large Data Bases, 2007, pp. 806-817.
[19] S. Brin, R. Motwani, and C. Silverstein, "Beyond market baskets: Generalizing association rules to correlations," in Proc. SIGMOD Conf., 1997, pp. 265-276.
[20] W. DuMouchel and D. Pregibon, "Empirical Bayes screening for multi-item associations," in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2001, pp. 67-76.
[21] A. Silberschatz and A. Tuzhilin, "On subjective measures of interestingness in knowledge discovery," in Proc. 1st Int. Conf. Knowl. Discovery Data Mining, 1995, pp. 275-281.
[22] [Online]. Available: http://dblp.uni-trier.de/xml/
[23] [Online]. Available: http://monetdb.cwi.nl/xml/
[24] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422-446, 2002.
[25] R. L. T. Santos, C. Macdonald, and I. Ounis, "Exploiting query reformulations for web search result diversification," in Proc. 19th Int. Conf. World Wide Web, 2010, pp. 881-890.
[26] R. L. T. Santos, J. Peng, C. Macdonald, and I. Ounis, "Explicit search result diversification through sub-queries," in Proc. 32nd Eur. Conf. Adv. Inf. Retrieval, 2010, pp. 87-99.
[27] S. Gollapudi and A. Sharma, "An axiomatic approach for result diversification," in Proc. 18th Int. Conf. World Wide Web, 2009, pp. 381-390.
[28] M. R. Vieira, H. L. Razente, M. C. N. Barioni, M. Hadjieleftheriou, D. Srivastava, C. Traina Jr., and V. J. Tsotras, "On query result diversification," in Proc. IEEE 27th Int. Conf. Data Eng., 2011, pp. 1163-1174.
[29] M. Hasan, A. Mueen, V. J. Tsotras, and E. J. Keogh, "Diversifying query results on semi-structured data," in Proc. 21st ACM Int. Conf. Inf. Knowl. Manag., 2012, pp. 2099-2103.
[30] D. Panigrahi, A. D. Sarma, G. Aggarwal, and A. Tomkins, "Online selection of diverse results," in Proc. 5th ACM Int. Conf. Web Search Data Mining, 2012, pp. 263-272.
Jianxin Li received the BE and ME degrees in computer science from Northeastern University, Shenyang, China, in 2002 and 2005, respectively, and the PhD degree in computer science from Swinburne University of Technology, Australia, in 2009. He is currently a postdoctoral research fellow at Swinburne University of Technology. His research interests include database query processing and optimization, and social network analytics.