
1176 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 8, AUGUST 2010

BinRank: Scaling Dynamic Authority-Based Search Using Materialized Subgraphs

Heasoo Hwang, Andrey Balmin, Berthold Reinwald, and Erik Nijkamp

Abstract—Dynamic authority-based keyword search algorithms, such as ObjectRank and personalized PageRank, leverage semantic
link information to provide high-quality, high-recall search in databases and the Web. Conceptually, these algorithms require a query-
time PageRank-style iterative computation over the full graph. This computation is too expensive for large graphs, and not feasible at
query time. Alternatively, building an index of precomputed results for some or all keywords involves very expensive preprocessing.
We introduce BinRank, a system that approximates ObjectRank results by utilizing a hybrid approach inspired by materialized views in
traditional query processing. We materialize a number of relatively small subsets of the data graph in such a way that any keyword
query can be answered by running ObjectRank on only one of the subgraphs. BinRank generates the subgraphs by partitioning all the
terms in the corpus based on their co-occurrence, executing ObjectRank for each partition using the terms to generate a set of random
walk starting points, and keeping only those objects that receive non-negligible scores. The intuition is that a subgraph that contains all
objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these
terms. We demonstrate that BinRank can achieve subsecond query execution time on the English Wikipedia data set, while producing
high-quality search results that closely approximate the results of ObjectRank on the original graph. The Wikipedia link graph contains
about 10^8 edges, which is at least two orders of magnitude larger than what prior state-of-the-art dynamic authority-based search
systems have been able to demonstrate. Our experimental evaluation investigates the trade-off between query execution time, quality
of the results, and storage requirements of BinRank.

Index Terms—Online keyword search, ObjectRank, scalability, approximation algorithms.

1 INTRODUCTION

THE PageRank algorithm [1] utilizes the Web graph link structure to assign global importance to Web pages. It works by modeling the behavior of a "random Web surfer" who starts at a random Web page and follows outgoing links with uniform probability. The PageRank score is independent of a keyword query. Recently, dynamic versions of the PageRank algorithm have become popular. They are characterized by a query-specific choice of the random walk starting points. In particular, two algorithms have received a lot of attention: Personalized PageRank (PPR) for Web graph data sets [2], [3], [4], [5] and ObjectRank for graph-modeled databases [6], [7], [8], [9], [10].

PPR is a modification of PageRank that performs search personalized on a preference set that contains Web pages that a user likes. For a given preference set, PPR performs a very expensive fixpoint iterative computation over the entire Web graph, while it generates personalized search results. Therefore, the issue of scalability of PPR has attracted a lot of attention [3], [4], [5].

ObjectRank [6] extends (personalized) PageRank to perform keyword search in databases. ObjectRank uses a query term posting list as a set of random walk starting points and conducts the walk on the instance graph of the database. The resulting system is well suited for "high recall" search, which exploits different semantic connection paths between objects in highly heterogeneous data sets. ObjectRank has successfully been applied to databases that have social networking components, such as bibliographic data [6] and collaborative product design [9].

However, ObjectRank suffers from the same scalability issues as personalized PageRank, as it requires multiple iterations over all nodes and links of the entire database graph. The original ObjectRank system has two modes: online and offline. The online mode runs the ranking algorithm once the query is received, which takes too long on large graphs. For example, on a graph of articles of English Wikipedia¹ with 3.2 million nodes and 109 million links, even a fully optimized in-memory implementation of ObjectRank takes 20-50 seconds to run, as shown in Fig. 3. In the offline mode, ObjectRank precomputes top-k results for a query workload in advance. This precomputation is very expensive and requires a lot of storage space for precomputed results. Moreover, this approach is not feasible for all terms outside the query workload that a user may search for, i.e., for all terms in the data set dictionary. For example, on the same Wikipedia data set, the full dictionary precomputation would take about a CPU-year.

• H. Hwang is with the Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, Mail Code 0404, La Jolla, CA 92093-0404. E-mail: heasoo@cs.ucsd.edu.
• A. Balmin and B. Reinwald are with IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120. E-mail: abalmin@us.ibm.com, reinwald@almaden.ibm.com.
• E. Nijkamp is with the Technische Universität Berlin, Straße des 17. Juni 135, D-10623 Berlin, Germany. E-mail: nijkamp@mailbox.tu-berlin.de.

Manuscript received 15 May 2009; revised 16 Sept. 2009; accepted 26 Nov. 2009; published online 3 May 2010.
Recommended for acceptance by Y. Ioannidis, D. Lee, and R. Ng.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDESI-2009-05-0428.
Digital Object Identifier no. 10.1109/TKDE.2010.85.

1. http://en.wikipedia.org.

1041-4347/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society
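To make the cost discussed above concrete, the query-time computation at the heart of PPR and ObjectRank is a fixpoint power iteration over the entire graph. The following is an illustrative sketch only, not the authors' implementation: the function name, toy graph, and parameter values are assumptions made here for exposition.

```python
# Illustrative sketch (not the paper's code): a PageRank-style power
# iteration personalized on a base set. Every query repeats this loop
# over ALL nodes and edges, which is what becomes infeasible online
# on a graph with ~10^8 edges.

def personalized_pagerank(out_links, base_set, d=0.85, epsilon=1e-8):
    """Iterate r = d*A*r + (1-d)*q until r converges."""
    nodes = list(out_links)
    # q: random-jump distribution, uniform over the base set.
    q = {v: (1.0 / len(base_set) if v in base_set else 0.0) for v in nodes}
    r = dict(q)
    while True:
        nxt = {v: (1.0 - d) * q[v] for v in nodes}
        for u in nodes:
            if out_links[u]:
                # uniform split over out-edges; ObjectRank would weight
                # this by edge type instead
                share = d * r[u] / len(out_links[u])
                for v in out_links[u]:
                    nxt[v] += share
        if max(abs(nxt[v] - r[v]) for v in nodes) < epsilon:
            return nxt
        r = nxt

# Toy instance: node "d" has no in-links, so it keeps a zero score.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = personalized_pagerank(graph, base_set={"a"})
```

Even at this toy scale, each iteration touches every edge; the cost profile is the same for ObjectRank once edge-type weights replace the uniform split, which is why BinRank moves this work offline onto small precomputed subgraphs.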

In this paper, we introduce a BinRank system that employs a hybrid approach where query time can be traded off for preprocessing time and storage. BinRank closely approximates ObjectRank scores by running the same ObjectRank algorithm on a small subgraph, instead of the full data graph. The subgraphs are precomputed offline. The precomputation can be parallelized with linear scalability. For example, on the full Wikipedia data set, BinRank can answer any query in less than 1 second, by precomputing about a thousand subgraphs, which takes only about 12 hours on a single CPU.

BinRank query execution easily scales to large clusters by distributing the subgraphs between the nodes of the cluster. This way, more subgraphs can be kept in RAM, thus decreasing the average query execution time. Since the distribution of the query terms in a dictionary is usually very uneven, the throughput of the system is greatly improved by keeping duplicates of popular subgraphs on multiple nodes of the cluster. The query term is routed to the least busy node that has the corresponding subgraph.

There are two dimensions to the subgraph precomputation problem: 1) how many subgraphs to precompute and 2) how to construct each subgraph that is used for approximation. The intuition behind our approach is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects w.r.t. one of these terms. For 1), we group all terms into a small number (around 1,000 in the case of Wikipedia) of "bins" of terms based on their co-occurrence in the entire data set. For 2), we execute ObjectRank for each bin using the terms in the bins as random walk starting points and keep only those nodes that receive non-negligible scores.

Our experimental evaluation highlights the tuning of the system needed to balance the query performance with the size and number of the precomputed subgraphs. Intuitively, query performance is highly correlated to the size of the subgraph, which, in turn, is highly correlated with the number of documents in the bin. Thus, normally, it is sufficient to create bins with a certain size limit to achieve a specific target running time. However, there is some variability in the process and some bins may still result in unusually large subgraphs and slow queries. To address this, we employ an adaptive iterative process that further splits the problematic subgraphs to guarantee that a vast majority of queries will be executed within the allotted time budget.

Other approximation techniques have been considered before to improve the scalability of dynamic authority-based search algorithms. Monte Carlo algorithms are introduced in [4] and [5] for approximation during precomputation. HubRank [8] uses the same approximation as [4], but performs precomputation only for "hub" nodes. Sampling-based techniques have also been suggested for online use. However, although these techniques claim online query processing, they have only been demonstrated on graphs with fewer than 10^6 links. In contrast, we demonstrate superior scalability of our approach on a Wikipedia graph that is two orders of magnitude larger. We also show that our approximation using ObjectRank itself is more precise than the sampling-based techniques.

Our contributions are:

• The idea of approximating ObjectRank by using Materialized subgraphs (MSGs), which can be precomputed offline to support online querying for a specific query workload, or the entire dictionary.
• Use of ObjectRank itself to generate MSGs for "bins" of terms.
• A greedy algorithm that minimizes the number of bins by clustering terms with similar posting lists.
• Extensive experimental evaluation on the Wikipedia data set that supports our performance and search quality claims. The evaluation demonstrates superiority of BinRank over other state-of-the-art approximation algorithms.

The rest of the paper is organized as follows: We start with a survey of related work in Section 2. We give an overview of the ObjectRank algorithm in Section 3. Materialized subgraphs are introduced in Section 4, and the bin construction algorithm is described in Section 5. In Section 6, we suggest the adaptive MSG recomputation method that improves the performance of BinRank. Section 7 describes the architecture of the BinRank system. Section 8 walks through the experimental evaluation. We conclude in Section 9.

2 RELATED WORK

The issue of scalability of PPR [3] has attracted a lot of attention. PPR performs a very expensive fixpoint iterative computation over the entire graph, while it generates personalized search results. To avoid the expensive iterative calculation at runtime, one can naively precompute and materialize all the possible personalized PageRank vectors (PPVs). Although this method guarantees fast user response time, such precomputation is impractical as it requires a huge amount of time and storage, especially when done on large graphs. In this section, we examine hub-based and Monte Carlo style methods that address the scalability problem of PPR, and give an overview of HubRank [8], which integrates the two approaches to improve the scalability of ObjectRank. Even though these approaches enabled PPR to be executed on large graphs, they either limit the degree of personalization or deteriorate the quality of the top-k result lists significantly.

Hub-based approaches materialize only a selected subset of PPVs. Topic-sensitive PageRank [2] suggests materialization of 16 PPVs of selected topics and linearly combining them at query time. The personalized PageRank computation suggested in [3] enables a finer-grained personalization by efficiently materializing significantly more PPVs (e.g., 100 K) and combining them using the hub decomposition theorem and dynamic programming techniques. However, it is still not a fully personalized PageRank, because it can personalize only on a preference set subsumed within a hub set H.

Monte Carlo methods replace the expensive power iteration algorithm with a randomized approximation algorithm [4], [5]. In order to personalize PageRank on any arbitrary preference set while maintaining just a small amount of precomputed results, Fogaras et al. [4] introduce the fingerprint algorithm that simulates the random walk model of PageRank and stores the ending nodes of sampled walks. Since each random walk is independent, fingerprint

generation can be easily parallelized, and the quality of search results improves as the number of fingerprints increases. However, as mentioned in [4], the precision of search results generated by the fingerprint algorithm is somewhat less than that of power-iteration-based algorithms, and sometimes, the quality of its results may be inadequate, especially for nodes that have many close neighbors. In [5], a Monte Carlo algorithm that takes into account not only the last visited nodes, but also all visited nodes during the sampled walks, is proposed. It also showed that Monte Carlo algorithms with iterative start outperform those with random start.

HubRank [8] is a search system based on ObjectRank that improved the scalability of ObjectRank by combining the above two approaches. It first selects a fixed number of hub nodes by using a greedy hub selection algorithm that utilizes a query workload in order to minimize the query execution time. Given a set of hub nodes H, it materializes the fingerprints of the hub nodes in H. At query time, it generates an active subgraph by expanding the base set with its neighbors. It stops following a path when it encounters a hub node whose PPV was materialized, or when the distance from the base set exceeds a fixed maximum length. HubRank recursively approximates the PPVs of all active nodes, terminating with the computation of the PPV for the query node itself. During this computation, the PPV approximations are dynamically pruned in order to keep them sparse. As stated in [8], the dynamic pruning plays a key role in outperforming ObjectRank by a noticeable margin. However, by limiting the precision of hub vectors, HubRank may get somewhat inaccurate search results, as stated in [8]. Also, since it materializes only the PPVs of H, just as [3], the efficiency of query processing and the quality of query results are very sensitive to the size of H and the hub selection scheme. Finally, Chakrabarti [8] did not show any large-scale experimental results to verify the scalability of HubRank.

In Section 8, we perform quality and scalability experiments on the full English Wikipedia data set exported in October 2007, to show that BinRank is an efficient ObjectRank approximation method that generates a high-quality top-k list for any keyword query in the corpus. For comparative evaluation of the performance of BinRank, we implemented Monte Carlo algorithm 4 of [5], which was shown to outperform the other variations in [5]. We also implemented HubRank [8] to check its scalability on our Wikipedia data set.

Unlike [4], which proves convergence to the exact solution on arbitrary graphs, and [8] and [3], which offer exact methods at the expense of limiting the choice of personalization, our solution is entirely heuristic. However, extensive experimental evaluation confirms that on real-world graphs, BinRank can strike a good balance between query performance and closeness of approximation.

3 OBJECTRANK BACKGROUND

In this section, we describe the essentials of ObjectRank [6], [9], [10]. We first explain the data model and query processing, and then discuss the result quality and scalability issues that motivate this paper.

3.1 Data Model

ObjectRank performs top-k relevance search over a database modeled as a labeled directed graph. The data graph G(V, E) models objects in a database as nodes, and the semantic relationships between them as edges. A node v ∈ V contains a set of keywords and its object type. For example, a paper in a bibliographic database can be represented as a node containing its title and labeled with its type, "paper." A directed edge e ∈ E from u to v is labeled with its relationship type λ(e). For example, when a paper u cites another paper v, ObjectRank includes in E an edge e = (u → v) that has a label "cites." It can also create a "cited by"-type edge from v to u. In ObjectRank, the role of edges between objects is the same as that of hyperlinks between Web pages in PageRank. However, note that edges of different edge types may transfer different amounts of authority. By assigning different edge weights to different edge types, ObjectRank can capture important domain knowledge such as "a paper cited by important papers is important, but citing important papers should not boost the importance of a paper." Let w(t) denote the weight of edge type t. ObjectRank assumes that the weights of edge types are provided by domain experts.

3.2 Query Processing

For a given query, ObjectRank returns the top-k objects relevant to the query. We first describe the intuition behind ObjectRank, introduce the ObjectRank equation, and then elaborate on important calibration factors.

ObjectRank query processing can be illustrated using the random surfer model. A random surfer starts from a random node vi among the nodes that contain the given keyword. These random surfer starting points are called a base set. For a given keyword t, the keyword base set of t, BS(t), consists of the nodes in which t occurs. Note that any node in G can be part of the base set, which makes ObjectRank support the full degree of personalization. At each node, the surfer follows outgoing edges with probability d, or jumps back to a random node in the base set with probability (1 - d).² At a node v, when the surfer determines which edge to follow, each edge e originating from v is chosen with probability w(λ(e)) / OutDeg(λ(e), v), where OutDeg(t, v) denotes the number of outgoing edges of v whose edge type is t. The ObjectRank score of vi is the probability r(vi) that a random surfer is found at vi at a certain moment. Let r denote the vector of ObjectRank scores [r(v1), ..., r(vi), ..., r(vn)]^T, and A be an n × n matrix with Aij the probability that a random surfer moves from vj to vi by traversing an edge. Also, let q be a normalized base set vector s / |BS(t)|, where |BS(t)| is the size of the base set BS(t) and s is a base set vector [s_v1, ..., s_vi, ..., s_vn]^T, where s_vi = 1 if vi is in BS(t) and 0 otherwise. The ObjectRank equation is

r = dAr + (1 - d)q.    (1)

2. Throughout this paper, we assume that d = 0.85.

For a given query t, the ObjectRank algorithm uses the power iteration method to get the fixpoint of r, the ObjectRank vector w.r.t. t, where the (k+1)th ObjectRank vector is calculated as follows:

r^(k+1) = dAr^(k) + (1 - d)q.    (2)

The algorithm terminates when r converges, which is determined by using a term convergence threshold τ_t = ε / |BS(t)|. The constant ε is one of the main ObjectRank calibration parameters, as it controls the speed of convergence and the precision of r.

3.3 Quality and Scalability

ObjectRank returns top-k search results for a given query using both the content and the link structure in G. Since it utilizes the link structure that captures the semantic relationships between objects, an object that does not contain a given keyword but is highly relevant to the keyword can be included in the top-k list. This is in contrast to the static PageRank approach that only returns objects containing the keyword, sorted according to their PageRank score. This key difference is one of the main reasons for ObjectRank's superior result quality, as demonstrated by the relevance feedback survey reported in [6].

However, the iterative computation of ObjectRank vectors described in Section 3.2 is too expensive to execute at runtime. For a given query, ObjectRank iterates over the entire graph G to calculate the ObjectRank vector r until |r_i^(k+1) - r_i^(k)| is less than the convergence threshold for every r_i^(k+1) in r^(k+1) and r_i^(k) in r^(k). This is a very strict stopping condition, and the iterative computation may take a very long time if G has a large number of nodes and edges. Therefore, instead of evaluating a keyword query at query time, the original ObjectRank system [6] precomputes the ObjectRank vectors of the keywords in H, the set of keywords, during the preprocessing stage, and then stores a list of <ObjId, RankValue> pairs per keyword. However, the preprocessing stage of ObjectRank is expensive, as it requires |H| ObjectRank executions and O(|V| × |H|) bits of storage. In fact, according to the worst-case bounds for PPR index size proven in [4], the index size must be Ω(|V| × |H|) bits for any system that returns the exact ObjectRank vectors.

4 RELEVANT SUBGRAPHS

Our goal is to improve the scalability of ObjectRank while maintaining the high quality of top-k result lists. We focus on the fact that ObjectRank does not need to calculate the exact full ObjectRank vector r to answer a top-k keyword query (k ≪ |V|). We identify three important properties of ObjectRank vectors that are directly relevant to the result quality and the performance of ObjectRank. First, for many of the keywords in the corpus, the number of objects with non-negligible ObjectRank values is much less than |V|. This means that just a small portion of G is relevant to a specific keyword. Here, we say that an ObjectRank value of v, r(v), is non-negligible if r(v) is above the convergence threshold. The intuition for applying the threshold is that differences between scores that are within the threshold of each other are noise after ObjectRank execution. Thus, scores below the threshold are effectively indistinguishable from zero, and objects that have such scores are not at all relevant to the query term. Second, we observed that the top-k results of any keyword term t generated on subgraphs of G composed of nodes with non-negligible ObjectRank values, w.r.t. the same t, are very close to those generated on G. Third, when an object has a non-negligible ObjectRank value for a given base set BS1, it is guaranteed that the object gains a non-negligible ObjectRank score for another base set BS2 if BS1 ⊆ BS2. Thus, a subgraph of G composed of nodes with non-negligible ObjectRank values, w.r.t. a union of base sets of a set of terms, could potentially be used to answer any one of these terms.

Based on the above observations, we speed up the ObjectRank computation for query term q by identifying a subgraph of the full data graph that contains all the nodes and edges that contribute to accurate ranking of the objects w.r.t. q. Ideally, every object that receives a nonzero score during the ObjectRank computation over the full graph should be present in the subgraph and should receive the same score. In reality, however, ObjectRank is a search system that is typically used to obtain only the top-k result list. Thus, the subgraph only needs to have enough information to produce the same top-k list. We shall call such a subgraph a Relevant subgraph (RSG) of a query.

Definition 4.1. The top-k result list of the ObjectRank of keyword term t on data graph G(V, E), denoted by OR(t, G, k), is a list of k objects from V sorted in descending order of their ObjectRank scores w.r.t. a base set that is the set of all objects in V that contain keyword term t.

Definition 4.2. A Relevant Subgraph (RSG(t, G, k)) of a data graph G(V, E) w.r.t. a term t and a list size k is a graph Gs(Vs, Es) such that Vs ⊆ V, Es ⊆ E, and OR(t, G, k) = OR(t, Gs, k).

It is hard to find an exact RSG for a given term, and it is not feasible to precompute one for every term in a large workload. However, we introduce a method to closely approximate RSGs. Furthermore, we observed that a single subgraph can serve as an approximate RSG for a number of terms, and that it is quite feasible to construct a relatively small number of such subgraphs that collectively cover, i.e., serve as approximate RSGs for, all the terms that occur in the data set.

Definition 4.3. An Approximate Relevant Subgraph (ARSG(t, G, k, c)) of a data graph G(V, E) with respect to a term t, list size k, and confidence limit c ∈ [0, 1], is a graph Gs(Vs, Es) such that Vs ⊆ V, Es ⊆ E, and τ(OR(t, G, k), OR(t, Gs, k)) > c.

Kendall's τ is a measure of similarity between two lists [11]. This measure is commonly used to describe the quality of approximation of top-k lists of an exact ranking (R_E) and an approximate ranking (R_A) that may contain ties (nodes with equal ranks) [4], [8]. A pair of nodes that is strictly ordered in both lists is called concordant if both rankings agree on the ordering, and discordant otherwise. A pair is an e-tie if R_E does not order the nodes of the pair, and an a-tie if R_A does not order them. Let C, D, E, and A denote the number of concordant, discordant, e-tie, and a-tie pairs, respectively. Then, Kendall's τ similarity between two rankings, R_E and R_A, is defined as

τ(R_E, R_A) = (C - D) / sqrt((M - E)(M - A)),

where M is the total number of possible pairs, M = n(n - 1)/2, and n = |R_E ∪ R_A|. We linearly scale τ to the [0, 1] interval as in [4], [8].

Definition 4.4. An ARSG cover of a data graph G(V, E), w.r.t. a keyword term workload W, list size k, and confidence limit c ∈ [0, 1], is a set of graphs Γ such that for every term t ∈ W, there exists Gs ∈ Γ that is an ARSG(t, G, k, c), and inversely, every Gs ∈ Γ is an ARSG(t, G, k, c) for at least one term t ∈ W.

We construct an ARSG for term t by executing ObjectRank with some set of objects B as the base set and restricting the graph to include only the nodes with non-negligible ObjectRank scores NOR(B), i.e., those above the convergence threshold τ_t of the ObjectRank algorithm. We call the induced subgraph G[NOR(B)] a materialized subgraph for set B, denoted by MSG(B).

The main challenge of this approach is identifying a base set B that will provide a good RSG approximation for term t. We focus on sets B that are supersets of the base set of t. This relationship gives us the following important result:

Theorem 4.5. If BS1 ⊆ BS2, then (v ∈ MSG(BS1) ⇒ v ∈ MSG(BS2)).

Proof. Let BS1 and BS2 be subsets of V that satisfy BS1 ⊆ BS2. Also, let r1, r2, and r2\1 be the ObjectRank vectors and q1, q2, and q2\1 be the normalized base set vectors corresponding to BS1, BS2, and (BS2 - BS1), respectively. Then, by applying the linearity theorem of [3] to the ObjectRank equation (1), we get the following equation:

α1 r1 + α2\1 r2\1 = dA(α1 r1 + α2\1 r2\1) + (1 - d)(α1 q1 + α2\1 q2\1),

where α1 = |BS1| / |BS2| and α2\1 = |BS2 - BS1| / |BS2|. Since BS1 ⊆ BS2, α1 + α2\1 = 1, which satisfies the linearity theorem. Note that since α1 q1 + α2\1 q2\1 = q2, it follows that α1 r1 + α2\1 r2\1 = r2 holds.

Now, let us consider a node v ∈ G that is in MSG(BS1). Since we just showed r2 = α1 r1 + α2\1 r2\1, r2(v) = α1 r1(v) + α2\1 r2\1(v) also holds. Thus, r2(v) ≥ α1 r1(v), because α2\1 ≥ 0 and r2\1(v) ≥ 0. Also, since v ∈ MSG(BS1), r1(v) > ε / |BS1| by the definition of MSG. Since α1 = |BS1| / |BS2| and r1(v) > ε / |BS1|, we get r2(v) ≥ α1 r1(v) > (|BS1| / |BS2|) · (ε / |BS1|) = ε / |BS2|. Since r2(v) > ε / |BS2|, by the definition of MSG, v ∈ MSG(BS2). □

According to this theorem, for a given term t, if the term base set BS(t) is a subset of B, all the important nodes relevant to t are always subsumed within MSG(B), i.e., all the non-negligible end points of random walks originating from starting nodes containing t are present in the subgraph generated using B.

However, note that even though two nodes v1 and v2 are guaranteed to be found both in G and in MSG(B), the ordering of their ObjectRank scores might not be preserved on MSG(B), as we do not include intermediate nodes if their ObjectRank scores are below the convergence threshold. Missing intermediate nodes could deteriorate the quality of ObjectRank scores computed on MSG(B). However, it is unlikely that many walks terminating on relevant nodes will pass through irrelevant nodes. Thus, even if MSG(B, G) is not an RSG(t, G, k), it is very likely to be an ARSG(t, G, k, c) with high confidence c. Our experimental evaluation supports this intuition.

In this paper, we construct MSGs by clustering all the terms of the dictionary, or of a query workload if one is available, into a set of term "bins." We create a base set B for every bin by taking the union of the posting lists of the terms in the bin and construct MSG(B) for every bin. We remember the mapping of terms to bins, and at query time, we can uniquely identify the corresponding bin for each term and execute the term on the MSG of this bin.

Theorem 4.5 supports our intuition that a bin's MSG is very likely to be an ARSG for each term in the bin with fairly high confidence. Thus, the set of all bin MSGs will be an ARSG cover with sufficiently high confidence. Our empirical results support this claim. For example, after a reasonable tuning of parameter settings (ε = 0.0005 and maximum B size of 4,000 documents), 90 percent of our random workload terms ran on their respective bin MSGs with τ(OR(t, G, 100), OR(t, MSG, 100)) > 0.9. Moreover, the other 10 percent of terms, which had τ100 < 0.9, were all very infrequent terms. The most frequent among them appeared in eight documents. τ100 tends to be relatively small for infrequent terms, because there simply may not be 100 objects with meaningful relationships to the base set objects.

5 BIN CONSTRUCTION

As outlined above, we construct a set of MSGs for the terms of a dictionary or a workload by partitioning the terms into a set of term bins based on their co-occurrence. We generate an MSG for every bin based on the intuition that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms.

There are two main goals in constructing term bins. First, controlling the size of each bin to ensure that the resulting subgraph is small enough for ObjectRank to execute in a reasonable amount of time. Second, minimizing the number of bins to save preprocessing time. After all, we know that precomputing ObjectRank for all terms in our corpus is not feasible.

To achieve the first goal, we introduce a maxBinSize parameter that limits the size of the union of the posting lists of the terms in the bin, called the bin size. As discussed above, ObjectRank uses a convergence threshold that is inversely proportional to the size of the base set, i.e., the bin size in the case of subgraph construction. Thus, there is a strong correlation between the bin size and the size of the materialized subgraph. As shown in Section 8, the value of maxBinSize should be determined by the quality and performance requirements of the system.

The problem of minimizing the number of bins is NP-hard. In fact, if all posting lists are disjoint, this problem reduces to the classical NP-hard bin packing problem [12]. We apply a greedy algorithm that picks an unassigned term with the largest posting list to start a bin, and loops to add the term with the largest overlap with the documents already in the bin. We use a number of heuristics to minimize the required number of set intersections, which dominate the complexity

of the algorithm. The tight upper bound on the number of set intersections that our algorithm needs to perform is the number of pairs of terms that co-occur in at least one document. To speed up the execution of set intersections for larger posting lists, we use KMV synopses [13] to estimate the size of set intersections.

Fig. 1. Bin computation algorithm.

The algorithm in Fig. 1 works on term posting lists from a text index. As the algorithm fills up a bin, it maintains a list of document IDs that are already in the bin, and a list of candidate terms that are known to overlap with the bin (i.e.,

bin, we need to intersect the bin with every term t' that co-occurs with t, in order to check if t' is subsumed by the bin completely, and can be placed into the bin "for free."

For example, consider N terms with posting lists of size X each that all co-occur in one document d0, with no other co-occurrences. If the maximum bin size is 2(X - 1), a bin will have to be created for every term. However, to get to that situation, our algorithm will have to check intersections for every pair of terms. Thus, the upper bound on the number of intersections is tight.

In fact, it is easy to see from the above example that no algorithm that packs the bins based on the maximum overlap can do so with fewer than N(N - 1)/2 set intersections in the worst case. Fortunately, real-world text databases have structures that are far from the worst case, as shown in Section 8.

Lastly, we show that the number of bins the algorithm uses to pack a set of posting lists is at most 2γ·OPT, where γ indicates the degree of overlap across the posting lists and OPT is the minimal number of bins. Note that since BinRank constructs an MSG for each bin during preprocessing, 2γ·OPT is also an upper bound on the number of MSGs.

Theorem 5.1. Given a set S of posting lists Si, suppose that there exists γ ≥ 1 such that Σ_{Si∈S} |Si| ≤ γ · |∪_{Si∈S} Si|. Then, the approximation ratio of PackTermsIntoBins is 2γ.

Proof. Let OPT and OPT' denote the optimal number of bins and the number of bins PackTermsIntoBins uses.

• Claim 1: OPT ≥ Σ_{Si∈S} |Si| / (γ · maxBinSize).
  Since no bin can hold a total capacity of more than maxBinSize,
  OPT ≥ |∪_{Si∈S} Si| / maxBinSize.
  Also, since γ satisfies |∪_{Si∈S} Si| ≥ Σ_{Si∈S} |Si| / γ,
  OPT ≥ |∪_{Si∈S} Si| / maxBinSize ≥ Σ_{Si∈S} |Si| / (maxBinSize · γ).
  Therefore, Claim 1 holds.
• Claim 2: |∪_{Si∈S} Si| > (OPT' - 1) · maxBinSize / 2.
Since no more than one bin is less than half full,
their posting lists contain at least one document that was S
j S j > ðOP T 0  1Þ maxBinSize . Also, since
already placed into the bin). The main idea of this greedy P Si 2S i S 2 P
jSi j
j S i j for  1, Si 2S jSi j

algorithm is to pick a candidate term with a posting list that Si


S 2S S i 2S
j Si 2S Si j. ; Claim2 holds. P S
overlaps the most with documents already in the bin, without j Si j
T 0 1
posting list union size exceeding the maximum bin size. By Claim 1 and Claim 2, OP T maxBinSize Si 2S
> OP2 ,
0
OP T 1 0
While it is more efficient to prepare bins for a particular i.e., OP T > 2 . ; OP T
2OP T u
t
workload that may come from a system query log, it is
dangerous to assume that a query term that has not been 6 ADAPTIVE MSG RECOMPUTATION
seen before will not be seen in the future. We demonstrate
that it is feasible to use the entire data set dictionary as the We construct bins of up to a certain number of documents
workload, in order to be able to answer any query. based on the intuition that a limited bin size will limit the
Due to caching of candidate intersection results in lines 12- resulting MSG size, which, in turn, will limit the running time
14 of the algorithm, the upper bound on the number of set of the query. As we demonstrate in Section 8, this intuition
intersections performed by this algorithm is the number of holds for the average case; however, for a small minority of
pairs of co-occurring terms in the data set. Indeed, in the MSGs and queries, the running time can be 2-3 times higher
worst case, for every term t that has just been placed into the than the average. Fortunately, we can detect problematic
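The greedy loop of PackTermsIntoBins described above can be sketched as follows. This is our own illustrative Python sketch, not the paper's implementation: it uses exact set intersections where the real system estimates them with KMV synopses, and plain dicts of posting lists in place of a text index.

```python
def pack_terms_into_bins(posting_lists, max_bin_size):
    """Greedy bin packing sketch. posting_lists: dict term -> set of doc IDs."""
    unassigned = dict(posting_lists)
    bins = []
    while unassigned:
        # Seed the bin with the largest remaining posting list.
        seed = max(unassigned, key=lambda t: len(unassigned[t]))
        bin_terms, bin_docs = [seed], set(unassigned.pop(seed))
        while True:
            best, best_overlap = None, 0
            for term, docs in unassigned.items():
                overlap = len(docs & bin_docs)
                # A candidate must overlap the bin and keep the union bounded.
                if overlap > best_overlap and len(docs | bin_docs) <= max_bin_size:
                    best, best_overlap = term, overlap
            if best is None:
                break
            bin_terms.append(best)
            bin_docs |= unassigned.pop(best)
        bins.append((bin_terms, bin_docs))
    return bins
```

The inner scan over all unassigned terms is where the candidate-list and intersection-caching heuristics of the real algorithm matter; without them, the worst case degenerates to the pairwise-intersection bound discussed above.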
Fortunately, we can detect problematic MSGs and replace them with more efficient ones during the preprocessing stage.

Recall that the ObjectRank running time scales linearly with two parameters: the number of iterations required and the size of the graph. The number of iterations is correlated with the size of the base set, so for a given MSG, queries with the largest base sets are going to be the slowest. And for queries with fixed-size base sets, the running time will largely depend on the number of links in the graph. In fact, we report in Section 8.5 a 94 percent correlation between the number of links of an MSG and the BinRank running time for queries with large base sets. This observation enables us to reliably identify problematic MSGs based only on their link counts.

However, the correlation between the bin sizes and the MSG link counts is less obvious. Fig. 15 shows that the link count for MSGs follows a normal distribution even with all the bin and MSG generation parameters fixed. Thus, setting the generation parameters in a way that no MSG exceeds a certain link-count threshold is not going to be practical. Instead, we set the parameters in such a way that only a small minority of MSGs exceed the limit, and then deal with this minority separately.

One way to deal with dangerously large MSGs is to recompute them with a larger convergence threshold, thus making them smaller. However, this may diminish the subsequent query result quality, so instead we choose to keep the same ε, but regenerate the bins that produced these MSGs with a smaller maxBinSize.

To do this, we introduce a new threshold maxMSGSize and generate a set of rejected bins RB that resulted in MSGs with the number of links larger than maxMSGSize. We then generate a new set of workload terms W′, which consists of all the keywords of all bins in RB, and rerun the PackTermsIntoBins algorithm with W′ and the new maxBinSize set to half of the original one. The new set of bins replaces RB, and the new MSGs are produced and tested against maxMSGSize. If some MSGs still fail the test, the process can be repeated iteratively.

7 SYSTEM ARCHITECTURE

Fig. 2. System architecture.

Fig. 2 shows the architecture of the BinRank system. During the preprocessing stage (left side of the figure), we generate MSGs as defined in Section 4. During the query processing stage (right side of the figure), we execute the ObjectRank algorithm on the subgraphs instead of the full graph and produce high-quality approximations of top-k lists at a small fraction of the cost. In order to save preprocessing cost and storage, each MSG is designed to answer multiple term queries. We observed in the Wikipedia data set that a single MSG can be used for 330-2,000 terms, on average.

7.1 Preprocessing

The preprocessing stage of BinRank starts with a set of workload terms W for which MSGs will be materialized. If an actual query workload is not available, W includes the entire set of terms found in the corpus. We exclude from W all terms with posting lists longer than a system parameter maxPostingList. The posting lists of these terms are deemed too large to be packed into bins. We execute ObjectRank for each such term individually and store the resulting top-k lists. Naturally, maxPostingList should be tuned so that there are relatively few of these frequent terms. In the case of Wikipedia, we used maxPostingList = 2,000 and only 381 terms out of about 700,000 had to be precomputed individually. This process took 4.6 hours on a single CPU.

For each term w ∈ W, BinRank reads a posting list T from the Lucene index (http://lucene.apache.org) and creates a KMV synopsis T′ that is used to estimate set intersections.

The bin construction algorithm, PackTermsIntoBins, partitions W into a set of bins composed of frequently co-occurring terms. The algorithm takes a single parameter maxBinSize, which limits the size of a bin posting list, i.e., the union of the posting lists of all terms in the bin. During the bin construction, BinRank stores the bin identifier of each term in the Lucene index as an additional field. This allows us to map each term to the corresponding bin and MSG at query time.

The ObjectRank module takes as input a set of bin posting lists B and the entire graph G(V, E), with a set of ObjectRank parameters: the damping factor d and the threshold value ε. The threshold determines the convergence of the algorithm as well as the minimum ObjectRank score of MSG nodes.

Our ObjectRank implementation stores a graph as a row-compressed adjacency matrix. In this format, the entire Wikipedia graph consumes 880 MB of storage and can be loaded into main memory for MSG generation. In case the entire data graph does not fit in main memory, we can apply parallel PageRank computation techniques such as the hypergraph partitioning schemes described in [14].

The MSG generator takes the graph G and the ObjectRank result w.r.t. a term bin b, and then constructs a subgraph Gb(V′, E′) by including only the nodes u with r_t(u) ≥ θ_b, where θ_b is the convergence threshold of b, that is, ε/|BS(b)|. Given the set of MSG nodes V′, the corresponding set of edges E′ is copied from the in-memory copy of G. The edge construction takes 1.5-2 seconds for a typical MSG with about 5 million edges.

Once the MSG is constructed in memory, it is serialized to a binary file on disk in the same row-compressed adjacency matrix format to facilitate fast deserialization. We observed that deserializing a 40 MB MSG on a single SATA disk drive takes about 0.6 seconds. In general, deserialization speed can be greatly improved by increasing the transfer rate of the disk subsystem.

7.2 Query Processing

For a given keyword query q, the query dispatcher retrieves from the Lucene index the posting list bs(q) (used as the base set for the ObjectRank execution) and the bin identifier b(q). Given a bin identifier, the MSG mapper determines whether the corresponding MSG is already in memory. If it is not, the MSG deserializer reads the MSG representation from disk. The BinRank query processing module uses all available memory as an LRU cache of MSGs.

For smaller data graphs, it is possible to dramatically reduce MSG storage requirements by storing only the set of MSG nodes V′ and generating the corresponding set of edges E′ only at query time. However, in our Wikipedia data set, that would introduce an additional delay of 1.5-2 seconds, which is not acceptable in a keyword search system.

The ObjectRank module gets the in-memory instance of the MSG, the base set, and a set of ObjectRank calibrating parameters: 1) the damping factor d; 2) the convergence threshold ε; and 3) the number of top-k list entries k. Once the ObjectRank scores are computed and sorted, the resulting document IDs are used to retrieve and present the top-k objects to the user.

Multikeyword queries are processed as follows: For a given conjunctive query composed of n terms {t1, ..., tn}, the ObjectRank module gets the MSGs {MSG(b(t1)), ..., MSG(b(tn))} and evaluates each term over the corresponding MSG. Then, it multiplies the ObjectRank scores obtained over the MSGs to generate the top-k list for the query. For a disjunctive query, the ObjectRank module sums the ObjectRank scores w.r.t. each term calculated using the MSGs to produce the BinRank scores.

One of the advantages of the BinRank query execution engine is that it can easily utilize large clusters of nodes. In this case, we distribute the MSGs between the nodes and employ Hadoop (http://hadoop.apache.org) to start an MSG cache and an ObjectRank engine Web service on every node. A set of dispatcher processes, each with its own replica of the Lucene index, routes the queries to the appropriate nodes.

8 EXPERIMENTS

We present our experimental evaluation in this section. We first describe our experimental setup using English Wikipedia articles. Then, we show scalability numbers for ObjectRank, followed by numbers for BinRank. Finally, we present a performance comparison of BinRank with the Monte Carlo method and HubRank.

8.1 Setup

We evaluate the performance of the BinRank algorithm on the collection of English Wikipedia articles exported in October 2007. We parsed the 13.8 GB dump file and extracted 3.2M articles and 109M intrawiki links of 10 types (e.g., "Regular links," "Category links," "See also links," etc.). All the experiments in this section are performed over the labeled graph Gwiki = (Vwiki, Ewiki) that is composed of the Wikipedia articles as nodes and the intrawiki links as edges. We used the standard row-compressed matrix format to represent the link structure and weight dissipation rates of Ewiki compactly. We were able to store the 3.2M × 3.2M transition matrix of Gwiki with 109M nonzero elements in only 880 MB. We created a Lucene text index of the Wikipedia article titles, which takes up 154 MB. The dictionary of the index contains 698,214 terms.

We chose to index only article titles, by analogy with the original ObjectRank [6] setup that used only publication titles from DBLP. It is important for ObjectRank to have a base set of objects that are highly related to a search term. However, a large article can mention a term without being meaningfully related to it. For that reason, the title index works better than an index on the full text of the articles. In order to use the full article text index, the ObjectRank algorithm would have to be augmented to take into account the Lucene search scores of the base set documents. This is one of our future research directions.

For our experiments, we implemented the BinRank system (and other algorithms for performance comparisons) in Java and performed experiments on a single PC with a Pentium 4 3.40 GHz CPU and 2.0 GB of RAM.

8.2 ObjectRank on the Full Wikipedia Graph

ObjectRank on Gwiki takes too long to be executed online and consumes around 880 MB of memory just for the link information of Gwiki.
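The multikeyword combination rule of Section 7.2 (multiply per-term scores for conjunctive queries, sum them for disjunctive ones, with absent nodes scored zero) can be sketched as follows; the function and parameter names are ours, not the system's:

```python
def combine_scores(per_term_scores, conjunctive=True, k=10):
    """per_term_scores: list of dicts, node ID -> ObjectRank score for one term."""
    all_nodes = set().union(*per_term_scores)
    combined = {}
    for node in all_nodes:
        scores = [s.get(node, 0.0) for s in per_term_scores]
        if conjunctive:
            total = 1.0
            for x in scores:
                total *= x  # a zero score for a missing node zeroes the product
        else:
            total = sum(scores)
        combined[node] = total
    # Return the top-k list sorted by the combined score.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Under this rule, a conjunctive query returns an empty result exactly when the MSGs of its terms share no nodes, matching the behavior discussed in Section 8.4.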
Fig. 3. The number of keywords and average ObjectRank execution time on the Wikipedia graph per frequency range (ε is fixed to 5.0E-4).

As shown in Fig. 3, it takes around 20-50 seconds (30 seconds on average) to compute the dynamically generated top-k list for a given single keyword query, even with our optimized, in-memory ObjectRank execution engine. For frequent keywords that have posting lists with more than 200 documents, ObjectRank is likely to take longer. Since frequent keywords are found in many articles, they are more likely to be meaningfully connected to many other articles through many paths, resulting in a wider search space for ObjectRank to evaluate and rank.

Fig. 3 also shows the keyword frequency distribution obtained from the Lucene text index built on the article titles. The total number of keywords in the index is 698,214, and the keyword frequencies follow the typical power law distribution.

8.3 BinRank

During the BinRank preprocessing stage, we generate bins for all the keywords in the corpus. Once the bins are constructed, we generate an MSG per bin by executing ObjectRank on Gwiki using the union of the posting lists of the terms in a bin as a single base set. We first describe the performance of the bin construction and MSG generation, and then measure the query result quality and the impact of maxBinSize.

8.3.1 Preprocessing

Bin construction. To measure the performance of the bin construction stage, we examine the bin construction time and the number of bins constructed with different maxBinSize values.

Fig. 4. Performance of bin construction.

We construct bins for all terms in our Lucene index, except for the 381 most frequent terms, which have posting lists longer than the system parameter maxPostingList = 2,000. Recall from Section 7 that such terms are deemed to be too frequent, so we precompute their ObjectRank authority vectors individually. This process takes 4.6 hours.

To pack the remaining 697,833 keywords into bins, we construct bins with various maxBinSize values, as shown in Fig. 4. Note that as maxBinSize increases, the bin construction algorithm generates fewer bins while consuming more time. The running time goes up because the greedy algorithm needs to try more intersections of larger sets to fill the larger bins. However, even with maxBinSize = 12,000, BinRank generates all 345 bins in only 1,106 seconds. This is a small fraction of the total preprocessing time, which is dominated by MSG construction, as we will see next.

Note that Wikipedia page titles are a very simple case for bin generation, as the typical document size is extremely small. We also tested the bin construction algorithm on the full text of Wikipedia pages. In this case, the total size of the posting lists in the text index was 84 million versus 4.8 million for titles. The algorithm produced 6,340 bins with maxBinSize = 5,000, performing over 4 billion intersections. The packing process took about 70 hours.

Fig. 5. The effect of maxBinSize on the MSG construction cost (ε is fixed to 5.0E-4).

MSG generation. Once the bins are constructed, we generate an MSG for each bin. For our Wikipedia data set, we generated a comprehensive set of MSGs with 24 combinations of the two parameters maxBinSize and ε. For each combination, we measure the performance of BinRank, i.e., the query time and the quality of top-k lists.

maxBinSize determines the number of bins to be constructed, and thus, the number of MSGs generated (the second column in Fig. 5). The construction time and average size go up with the maxBinSize. Intuitively, the larger the base set, the more objects will be related to it. And the more objects have nontrivial scores, the more iterations it will take the ObjectRank algorithm to reach the fixpoint. Fig. 5 supports this intuition.

Fig. 6. The effect of ε on the MSG construction cost (maxBinSize is fixed to 4,000).

Note that the total MSG construction time decreases significantly as the maxBinSize increases.
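The per-bin ObjectRank runs used to generate MSGs are PageRank-style power iterations seeded by a base set. Below is a minimal sketch under our own simplifications (uniform edge weights, a single link type, dangling mass dropped, illustrative names), not the paper's Java implementation:

```python
def object_rank(graph, base_set, d=0.85, epsilon=1e-4, max_iters=100):
    """graph: dict node -> list of out-neighbors; base_set: set of seed nodes."""
    nodes = list(graph)
    base_weight = 1.0 / len(base_set)
    scores = {n: (base_weight if n in base_set else 0.0) for n in nodes}
    for _ in range(max_iters):
        # Random-surfer restart mass goes only to base set nodes.
        nxt = {n: ((1 - d) * base_weight if n in base_set else 0.0) for n in nodes}
        for u, out in graph.items():
            if out:  # dangling nodes simply leak their mass in this sketch
                share = d * scores[u] / len(out)
                for v in out:
                    nxt[v] += share
        # Stop when no score changes by more than the convergence threshold.
        if max(abs(nxt[n] - scores[n]) for n in nodes) < epsilon:
            scores = nxt
            break
        scores = nxt
    return scores
```

Keeping only the nodes whose final score exceeds the bin threshold θ_b = ε/|BS(b)| yields the node set V′ of the MSG, as described in Section 7.1.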
However, the average MSG size increases at the same time, which leads to slower query execution time. Thus, there is a clear trade-off between preprocessing time and query time in BinRank.

Fig. 6 shows the effect of ε on the MSG construction time and the size of the MSGs. A smaller ε implies that ObjectRank will need more iterations to reach the convergence point, and more nodes will have scores above the bin convergence threshold θ_b = ε/|BS(b)|. Thus, both construction time and MSG size decrease as ε increases.

An interesting observation from Figs. 5 and 6 is that the storage requirement of BinRank, i.e., the total size of the MSGs, is controlled by the choice of ε and is virtually unaffected by maxBinSize. Of course, the quality of BinRank's score approximations is also strongly affected by ε, as we show next. Thus, one has to strike a balance between the quality of results and the storage overhead. For example, BinRank produces extremely high-quality results with ε = 5.0E-4. However, this setup requires 44 GB of storage for MSGs, which is 50 times the size of Gwiki. Another way to approach this trade-off is to say that the amount of disk, or even better, RAM available to the system will determine the quality of results.

As discussed in Section 7, it is possible to reduce the MSG storage requirements by materializing MSG nodes only and extracting links at query time. The edge extraction adds 1.5-2 seconds to the query time, but the storage requirements in this case go down from 44 GB to only 203 MB, which is similar to the size of our Lucene index, 154 MB.

8.3.2 Query Processing

Quality measures. For a given keyword query, BinRank generates an approximate top-k list using the corresponding MSG. The exact top-k list is obtained by executing ObjectRank on Gwiki with a small ε = 1.0E-4. The two lists are compared using the same three quality measures as in [4]: relative aggregated goodness (RAG), precision at K, and Kendall's τ.

Let OR(kw, K) and BR(kw, K) denote the accurate top-K list by ObjectRank and the approximate top-K list by BinRank for a given keyword kw. In our experiments, both top-k lists are lists of Wikipedia article IDs sorted by the authority score. Let ORScore(n, kw) denote the exact keyword-specific authority score of a node n computed by ObjectRank. RAG and precision measure the quality of BR(kw, K) by considering the top-k lists as sets, say ORSet(kw, K) and BRSet(kw, K). RAG is the ratio of the aggregated exact authority scores of the nodes in BR(kw, K) to those of the nodes in OR(kw, K). Precision at K computes the ratio of the size of the intersection to K:

  RAG(K) = Σ_{n ∈ BRSet(kw,K)} ORScore(n, kw) / Σ_{n ∈ ORSet(kw,K)} ORScore(n, kw),

  Prec(K) = |BRSet(kw, K) ∩ ORSet(kw, K)| / K.

Kendall's τ, as defined in Section 4, compares the orderings of the top-k lists, i.e., OR(kw, K) and BR(kw, K). It is the most stringent quality measure of the three that we use. A τ value of 1 means that the lists are identical, and 0 that they are disjoint or in inverse order.

Since we primarily aim to get high-quality top-k lists within a reasonable amount of query time, we want to find good combinations of maxBinSize and ε for BinRank. To tune these parameters, we compute the quality measures for all 24 sets of MSGs described above, with six different maxBinSize values and four different ε values. The smallest maxBinSize, 2,000, is chosen to be the same as the maximum posting list size for terms that are put into bins. We run a workload of 92 randomly selected query terms on all of these 24 sets of MSGs.

Fig. 7. The effect of maxBinSize on the BinRank running time.

Effect of maxBinSize on query time and quality of top-k lists. With ε = 5.0E-4, we generated MSGs with six different maxBinSize values, starting from the smallest maxBinSize of 2,000. Fig. 7 shows that the query time increases linearly as maxBinSize increases. This is because the average size of the MSGs also increases linearly, as depicted in Fig. 5. For example, when maxBinSize is 2,000, an MSG is 21 MB, but it increases to 42 MB if maxBinSize increases to 4,000.

Fig. 8. The effect of maxBinSize on the top-100 accuracy (ε is fixed to 5.0E-4).

Next, we investigate the effects of the MSG size, which is determined by the maxBinSize, on the accuracy of top-k lists. Fig. 8 shows the average accuracy of top-100 lists measured by the three goodness measures given ε = 5.0E-4. First, all the measures are in the [0.95, 1] range, indicating that the quality of the top-100 lists obtained by BinRank is very good. Second, as maxBinSize increases from 2,000 to 12,000, the accuracy remains the same or improves very slightly.
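The two set-based measures defined above translate directly into code. A small sketch with hypothetical inputs (lists of node IDs for the two top-K lists, plus a map of exact ObjectRank scores):

```python
def rag_at_k(br_top_k, or_top_k, or_scores):
    """RAG(K): exact authority mass of the approximate top-K set,
    relative to that of the exact top-K set."""
    approx = sum(or_scores[n] for n in br_top_k)
    exact = sum(or_scores[n] for n in or_top_k)
    return approx / exact

def precision_at_k(br_top_k, or_top_k):
    """Prec(K): fraction of the exact top-K set recovered by the approximation."""
    k = len(or_top_k)
    return len(set(br_top_k) & set(or_top_k)) / k
```

Note that RAG can stay close to 1 even when precision drops, as long as the substituted nodes carry nearly the same exact authority scores.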
However, we do not see a noticeable improvement in the quality of top-k lists. In contrast, the accuracy of top-k is sensitive to ε, as shown in Fig. 11.

Fig. 9. The effect of maxBinSize on the top-k accuracy with fixed ε.

Fig. 9 illustrates the relationship between maxBinSize, ε, and the accuracy of top-k lists. It shows the distribution of τ_5 through τ_1,000 with 12 combinations of the parameters: all six different maxBinSize values and two ε values, 5.0E-3 and 5.0E-4. One can see that the 12 lines form two clusters, one for ε = 5.0E-3 (bottom) and the other for ε = 5.0E-4 (top).

For a given ε and a set of maxBinSize values, if a larger maxBinSize does not improve the quality of top-k lists by a big margin, then we do not see any good reason to increase maxBinSize. Actually, it decreases the preprocessing time in Fig. 5 by reducing the number of MSGs, but increases the query processing time, as shown in Fig. 7. For example, with ε = 5.0E-4, we can see from Fig. 5 that the average size of the MSGs is 127 MB when maxBinSize = 12,000, while it is 42 MB for maxBinSize = 4,000. However, Fig. 9 shows that the top-k lists generated on these two sets of MSGs are very similar on average. We computed the standard deviations of the τ values of top-k lists with varying maxBinSize values and a fixed ε. They are very low: stdev(τ_20) = 0.00627 and stdev(τ_100) = 0.00672.

However, we cannot reduce maxBinSize without considering the total MSG construction time. One might want to construct bins with a very small maxBinSize. Setting aside the accuracy issue, BinRank will construct too many bins to complete the MSG construction stage in a given time budget. The extreme case is to precompute and materialize MSGs or authority vectors for all the keywords in the dictionary, which is infeasible especially when the size of the dictionary and the size of the full graph are huge, as in our Wikipedia data set.

Fig. 10. The effect of ε on the BinRank running time.

Effect of ε on query time and quality of top-k lists. As observed in Fig. 6, as ε decreases, the average size of the MSGs increases. It takes more time to generate top-k lists on a larger MSG, on average, as shown in Fig. 10.

Fig. 11. The effect of ε on the top-100 accuracy (maxBinSize is fixed to 4,000).

Now, we analyze the effect of ε on the quality of top-k lists. Unlike with maxBinSize, the quality of the top-k lists improves noticeably as ε decreases, as can be seen in Fig. 11.

To measure how much an MSG covers the context of the keywords in the corresponding bin, we computed the RankMass coverage metric [15] of the sets of MSGs generated with five different ε values. In our experiments, we define the RankMass of an MSG w.r.t. a keyword as the ratio of the aggregated authority scores of the nodes in the MSG to the sum of all authority scores in Gwiki = (Vwiki, Ewiki). Let MSG(b, e, m) denote the set of nodes in the MSG generated for a bin b with ε = e and maxBinSize = m. Let us assume that the bin b contains a workload keyword kw. Then, the RankMass of the MSG w.r.t. kw is:

  RankMass(MSG(b, e, m), kw) = Σ_{v ∈ MSG(b,e,m)} OR(v, kw) / Σ_{v ∈ Vwiki} OR(v, kw).

We computed the average RankMass coverage of an MSG using all the keywords in our workload, which shows how well an MSG covers the context of the keywords in the corresponding bin. As we can expect, with a decreasing ε, the RankMass increases rapidly.

For example, if we compare the two sets of MSGs constructed with maxBinSize = 4,000 and maxBinSize = 12,000,

  avg(|MSG(b, 5E-4, 12,000)|) = 3 × avg(|MSG(b, 5E-4, 4,000)|),

but the average RankMass only increases by 5.7 percent.
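The RankMass coverage defined above is a simple ratio; a minimal sketch, with illustrative names:

```python
def rank_mass(msg_nodes, or_scores):
    """msg_nodes: set of node IDs in the MSG;
    or_scores: node -> exact authority score OR(v, kw) over the full graph."""
    covered = sum(score for v, score in or_scores.items() if v in msg_nodes)
    total = sum(or_scores.values())
    return covered / total
```

Because authority scores are heavily skewed toward a few high-ranking nodes, even an MSG covering a small fraction of the nodes can capture most of the RankMass, which is the effect the measurements above quantify.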
The average size of the MSGs for maxBinSize = 4,000 is 1.52 percent of |Vwiki|, while that for maxBinSize = 12,000 is 4.59 percent of |Vwiki|. However, if we decrease ε from 1E-3 to 5E-4, the average size of the MSGs increases from 0.98 percent of |Vwiki| to 1.52 percent of |Vwiki|, while the RankMass increases by 7.0 percent.

8.4 BinRank for Multikeyword Queries

In this section, we investigate the performance of BinRank for multikeyword queries. Given a multikeyword query q composed of n keywords k1 ... kn, BinRank first evaluates each ki over the MSG corresponding to the keyword, MSG(ki). Then, it combines the rank scores computed over those MSGs according to the query semantics to produce the top-k list for q.

We observed from our experimental results that if a multikeyword query contains highly relevant keywords, such as "martial" AND "arts" or "fine" AND "performing," BinRank assigns those relevant keywords to the same bin, and thus evaluates those keywords using the same MSG. In this case, the top-k accuracy of the query is higher than for randomly generated multikeyword queries. However, if the keywords composing a multikeyword query are assigned to different bins and the query is conjunctive, BinRank has to evaluate each keyword over a different MSG and combine the scores. We assign zero scores to nodes not in the MSG. Hence, if a conjunctive query contains keywords whose MSGs do not overlap, BinRank will return an empty result. However, we observed no such cases throughout our experiments, because certain highly popular subgraphs of Gwiki obtain non-negligible scores regardless of the keywords assigned to a bin.

We randomly generated 600 multikeyword queries to measure the top-k accuracy of conjunctive and disjunctive queries containing 2 to 4 keywords. Throughout our experiments, we use maxBinSize = 4,000 and ε = 5.0E-4. We do not report the statistics of the BinRank running time for multikeyword queries, because it is dominated by the running time of BinRank for each individual query term.

Fig. 12. The top-100 accuracy for disjunctive queries (maxBinSize = 4,000 and ε = 5.0E-4).

We can see from Fig. 12 that the top-100 accuracy of disjunctive queries is higher than the top-100 accuracy for single-keyword queries, shown in Figs. 8 and 11. As shown in Fig. 9, the accuracy of a top-K list drops as K increases, because the scores of highly ranked nodes are more stable than those of the rest. Since the top-100 list for a disjunctive query tends to include the top-K nodes (K ≤ 100) of the top-100 lists obtained over the MSGs, its accuracy is at least as high as that for single-keyword queries, or slightly higher.

Fig. 13. The top-100 accuracy for conjunctive queries (maxBinSize = 4,000 and ε = 5.0E-4).

As shown in Fig. 13, RAG(100) and Prec(100) are above 0.9, indicating that the top-100 lists obtained by BinRank include most of the nodes in the top-100 lists generated by ObjectRank over Gwiki. However, the average τ_100 for conjunctive queries remains in the [0.75, 0.8] range, which is lower than that for single-keyword queries (Fig. 8) or disjunctive queries (Fig. 12). Therefore, we can see that for a given conjunctive query, BinRank generates a top-k list that contains all the nodes in the top-k list obtained over Gwiki, but the ordering of the nodes in the BinRank top-k list is not highly accurate. This is mainly because the MSGs are not large enough to cover all the important paths through which a significant amount of authority flows into or between the top-100 nodes, even though most of the links between the top-100 nodes exist on the corresponding MSGs. To improve the top-k accuracy of conjunctive queries, we can increase the coverage of the MSGs by using a smaller ε. Note that increasing maxBinSize does not improve the top-k accuracy by a big margin, as shown in Fig. 8.

However, with a smaller ε, BinRank generates larger MSGs, increasing the query execution time. In particular, we observed that some MSGs require unacceptably long running times. Given a time budget, we want to identify such MSGs and recompute them, as described in Section 8.5.

8.5 Adaptive MSG Recomputation

In this section, we first examine the entire set of MSGs to understand their features. We obtain 1,043 bins, and then generate a set of MSGs M using the BinRank parameters maxBinSize = 4,000 and ε = 5.0E-4. The average number of nodes and links of an MSG is 48,616 and 5.2M, respectively, which is just 1.52 percent of |Vwiki| and 4.83 percent of |Ewiki|. Recall that Gwiki has 3.2M nodes and 109M links.

Next, to evaluate the quality of the MSGs in M, we pick a set of keywords Q by selecting the keyword with the largest frequency among the keywords assigned to each bin. The range of keyword frequency in Q is [1, 2,000]. We select the most frequent keyword for each bin since these are very likely to result in the slowest BinRank execution time out of all the keywords in the bin, as discussed in Section 6.

The average BinRank execution time for the queries in Q is 856 ms, which is much faster than the average ObjectRank execution time on Gwiki, 30 seconds. However, we observe that some queries in Q require almost 2 seconds to evaluate, which is sometimes not acceptable.
1188 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 8, AUGUST 2010

Fig. 14. The effect of the number of links on an MSG on the BinRank Fig. 15. The distribution of the number of links on an MSG (1,043 MSGs
running time (maxBinSize ¼ 4;000 and  ¼ 5:0E-4). The Pearson generated by using maxBinSize ¼ 4;000 and  ¼ 5:0E-4).
correlation coefficient is 0.938.
cases so that the BinRank running time does not exceed a certain time budget, which is set to 1 second throughout these experiments.

As we discussed in Section 6, the BinRank query running time depends on the features of the query (e.g., the base set size) and those of the corresponding MSG (e.g., the number of nodes and the number of links). Other factors, such as the connectivity of links on an MSG and the topology of base set nodes, also affect the BinRank running time, but they are harder to quantify, and the simple features prove to be sufficiently good predictors.

The correlation coefficients, denoted by r, between the BinRank running time and each of the three simple features are the following:

- r1 = 0.564: with the number of nodes on an MSG.
- r2 = 0.700: with the number of links on an MSG.
- r3 = 0.459: with the base set size of a query.

r2 is noticeably higher than r1 or r3, which indicates that the number of links on an MSG is more tightly correlated with the BinRank running time than the other two features. Actually, since r2 is obtained from all the queries in Q, whose base set sizes vary significantly within [1, 2,000], we can see the effect of the number of links on an MSG more clearly after reducing the effect of the base set size. To do this, we select a set of 292 queries with high frequency, [1,000, 2,000], and denote it as Qhf. From Fig. 14, obtained using Qhf, we can clearly observe a very strong correlation between the number of links on an MSG and the BinRank running time. With a very high R² value, the BinRank running time of a query is almost linear in the number of links on the corresponding MSG. Also, the correlation coefficient between the number of links on an MSG and the BinRank running time using the high-frequency keywords in Qhf is 0.938, which also indicates a very strong correlation.

By exploiting this strong correlation, we select the MSGs whose BinRank running time will be above a certain time budget with high probability. As can be seen in Fig. 15, the number of links on an MSG almost follows a normal distribution N(μ, σ²), where μ = 5.2E6 and σ = 1.0E6. Our experiments show that among the 1,043 MSGs in M, 144 MSGs have more than (μ + σ) links, and among them, 138 MSGs (94.4 percent) require more than 1 sec to produce top-k lists for the largest frequency keyword in the corresponding bin. In contrast, the probability that the worst-case BinRank running time exceeds 1 sec is just 16.4 percent for MSGs with fewer than (μ + σ) links. If we pick the MSGs with fewer than μ links, only 5.6 percent of them spend more than 1 sec to compute top-k lists in the worst case. Therefore, by default, BinRank sets the maxMSGSize parameter to (μ + σ) and recomputes bins for all the MSGs with higher link counts, using the halved maxBinSize, as described in Section 6. Recall from Fig. 7 that reducing the maxBinSize linearly reduces the query time, thus dramatically reducing the number of queries running over budget.

In the general case, maxMSGSize could be set to (μ + xσ), where good candidates for x are within [0, 1], as we can see in our experimental results. In the future, we plan to investigate optimizing x while considering such factors as the time and space budget for MSG generation. For example, if x = 0, we need to regenerate about 50 percent of the MSGs, while we regenerate only 14 percent of them when x = 1.

Another approach we are planning to investigate is to base maxMSGSize on actual query performance measurements. The BinRank running time also follows a normal distribution N(μ_t, σ_t²), and the time budget, 1 second in our experiments, corresponds to μ_t + 0.58σ_t. Since the BinRank running time and the number of links on an MSG are highly correlated, as shown in Fig. 14, we can use 0.58 as x to select the MSGs to regenerate.

8.6 Performance Comparison of BinRank with Monte Carlo Method and HubRank

In this section, we present a performance comparison of BinRank with Monte Carlo-style methods and HubRank. We implemented Monte Carlo Algorithm 4, "MC complete path stopping at dangling nodes," introduced in [5], and HubRank [8], which combines a hub-based approach with a Monte Carlo method called fingerprint.

For a given keyword query, the Monte Carlo algorithm simulates random walks starting from the nodes containing the keyword. Within a specified number of walks, it samples exactly the same number of random walks per starting point. The authority score of a node is the total number of visits to the node divided by the total number of visits. Fig. 16 shows the performance of the Monte Carlo algorithm in terms of accuracy of top-k lists and various query times.
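As a concrete illustration of the scoring rule just described (the total number of visits to a node divided by the total number of visits), the following Python sketch simulates complete-path random walks on a toy graph. It is a simplified stand-in, not the implementation evaluated here; the adjacency-list representation, damping factor, and walk count are assumptions made for the example.

```python
import random
from collections import Counter

def mc_complete_path(graph, base_set, walks_per_node=100, damping=0.85, seed=0):
    """Monte Carlo 'complete path' sketch: run the same number of random
    walks from every base-set node, count every node visited along each
    path, and stop a walk at a dangling node or with probability 1 - damping."""
    rng = random.Random(seed)
    visits = Counter()
    for start in base_set:
        for _ in range(walks_per_node):
            node = start
            while True:
                visits[node] += 1
                out = graph.get(node, [])
                # Stop at dangling nodes; otherwise continue with probability `damping`.
                if not out or rng.random() >= damping:
                    break
                node = rng.choice(out)
    total = sum(visits.values())
    # Authority score: visits to the node divided by the total number of visits.
    return {n: v / total for n, v in visits.items()}

# Toy adjacency-list graph; suppose nodes "a" and "b" contain the query keyword.
graph = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
scores = mc_complete_path(graph, base_set=["a", "b"])
top = sorted(scores, key=scores.get, reverse=True)
```

Increasing walks_per_node tightens the score estimates at a proportional cost in time, which mirrors the accuracy/latency trade-off reported below for Fig. 16.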
Fig. 16. Top-k accuracy of the Monte Carlo algorithm with various query times.

We used our workload keyword queries and executed the Monte Carlo algorithm with different total numbers of sampled walks. As the number of sampled walks increases, the algorithm generates higher quality top-k lists, which usually takes more time.

However, we can see that the τ values in Fig. 16 are not as high as those of BinRank in Fig. 8. With maxBinSize = 2,000 or 4,000 and θ = 5E-4, BinRank generates high-quality top-k lists of τ ≥ 0.95 in 350-750 ms, on average, as shown in Figs. 8 and 10. However, according to Fig. 16, the Monte Carlo algorithm generates top-k lists of only τ ≈ 0.70 within the same amount of time. To get high-quality top-k lists, it would take the Monte Carlo algorithm around 7 seconds per query term, which is probably not acceptable in an online search system.

We also implemented HubRank [8] in order to measure its scalability and top-k quality over Gwiki. We selected hubs, and then materialized a large number of fingerprints, while keeping the hub set fairly focused on our experimental query workload to save preprocessing cost. For a given keyword query, HubRank generates the active graph of the query by expanding the base set's neighborhood until it is bounded by hub nodes or nodes very far from the given query node. Since Gwiki contains 3.2M nodes and 109M links, we often needed to compute many (thousands) of active vectors to answer a single query, where each active vector is a (sparse) vector of 3.2 million numbers. Due to this requirement, for most queries, we could not keep all the necessary active vectors in memory. The authors of [8] also reported that their implementation ran out of memory on a few queries, while they were running their experiments on a graph with less than a million edges. A two orders of magnitude increase in the size of the graph made this problem ubiquitous and prevented us from obtaining comparable results.

9 SUMMARY AND CONCLUSIONS

In this paper, we proposed BinRank as a practical solution for scalable dynamic authority-based ranking. It is based on partitioning and approximation using a number of materialized subgraphs. We showed that our tunable system offers a nice trade-off between query time and preprocessing cost.

We introduce a greedy algorithm that groups co-occurring terms into a number of bins for which we compute materialized subgraphs. Note that the number of bins is much smaller than the number of terms. The materialized subgraphs are computed offline by using ObjectRank itself. The intuition behind the approach is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms. Our extensive experimental evaluation confirms this intuition.

For future work, we want to study the impact of other keyword relevance measures besides term co-occurrence, such as thesauri or ontologies, on the performance of BinRank. By increasing the relevance of the keywords in a bin, we expect that the quality of the materialized subgraphs, and thus the top-k quality and the query time, can be improved.

We also want to study better solutions for queries whose random surfer starting points are provided by Boolean conditions. And ultimately, although our system is tunable, its configuration, ranging from the number and size of bins to the tuning of the ObjectRank algorithm itself (edge weights and thresholds), is quite challenging, and a wizard to aid users is desirable.

To further improve the performance of BinRank, we plan to integrate BinRank and HubRank [8] by executing HubRank on the MSGs that BinRank generates. Currently, we use the ObjectRank algorithm on MSGs at query time. Even though HubRank is not as scalable as BinRank, it performs better than ObjectRank on smaller graphs such as MSGs. In this way, we can leverage the synergy between BinRank and HubRank.

REFERENCES

[1] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks, vol. 30, nos. 1-7, pp. 107-117, 1998.
[2] T.H. Haveliwala, "Topic-Sensitive PageRank," Proc. Int'l World Wide Web Conf. (WWW), 2002.
[3] G. Jeh and J. Widom, "Scaling Personalized Web Search," Proc. Int'l World Wide Web Conf. (WWW), 2003.
[4] D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós, "Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments," Internet Math., vol. 2, no. 3, pp. 333-358, 2005.
[5] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova, "Monte Carlo Methods in PageRank Computation: When One Iteration Is Sufficient," SIAM J. Numerical Analysis, vol. 45, no. 2, pp. 890-904, 2007.
[6] A. Balmin, V. Hristidis, and Y. Papakonstantinou, "ObjectRank: Authority-Based Keyword Search in Databases," Proc. Int'l Conf. Very Large Data Bases (VLDB), 2004.
[7] Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma, "Object-Level Ranking: Bringing Order to Web Objects," Proc. Int'l World Wide Web Conf. (WWW), pp. 567-574, 2005.
[8] S. Chakrabarti, "Dynamic Personalized PageRank in Entity-Relation Graphs," Proc. Int'l World Wide Web Conf. (WWW), 2007.
[9] H. Hwang, A. Balmin, H. Pirahesh, and B. Reinwald, "Information Discovery in Loosely Integrated Data," Proc. ACM SIGMOD, 2007.
[10] V. Hristidis, H. Hwang, and Y. Papakonstantinou, "Authority-Based Keyword Search in Databases," ACM Trans. Database Systems, vol. 33, no. 1, pp. 1-40, 2008.
[11] M. Kendall, Rank Correlation Methods. Hafner Publishing Co., 1955.
[12] M.R. Garey and D.S. Johnson, "A 71/60 Theorem for Bin Packing," J. Complexity, vol. 1, pp. 65-106, 1985.
[13] K.S. Beyer, P.J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla, "On Synopses for Distinct-Value Estimation under Multiset Operations," Proc. ACM SIGMOD, pp. 199-210, 2007.
[14] J.T. Bradley, D.V. de Jager, W.J. Knottenbelt, and A. Trifunovic, "Hypergraph Partitioning for Faster Parallel PageRank Computation," Proc. Second European Performance Evaluation Workshop (EPEW), pp. 155-171, 2005.
[15] J. Cho and U. Schonfeld, "Rankmass Crawler: A Crawler with High PageRank Coverage Guarantee," Proc. Int'l Conf. Very Large Data Bases (VLDB), 2007.

Heasoo Hwang is currently working toward the PhD degree in computer science at the University of California, San Diego. Her primary research interests lie in the effective and efficient search and management of large-scale graph-structured data sets. She spent two summers with the IBM Almaden Research Center, where she worked on improving the efficiency of dynamic link-based search over graph-structured data, which motivated this BinRank paper. The BinRank system is a part of her thesis work.

Andrey Balmin received the PhD degree from the University of California at San Diego, where he devised the ObjectRank algorithm as part of his thesis work. He is a research staff member at IBM's Almaden Research Center, where his research interests include search, querying, and management of semistructured and graph data.

Berthold Reinwald received the PhD degree from the University of Erlangen-Nuernberg, Germany, in 1993. Since 1993, he has been with the IBM Almaden Research Center, where he is currently a research staff member. His current research interests include scalable analytics and cloud data management. He is a member of the ACM.

Erik Nijkamp is working toward the graduate degree in computer science at the Technical University of Berlin. He is now a research assistant in the Department of Database Systems and Information Management. His research interests focus on aspects of distributed computing and machine learning.