IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 8, AUGUST 2010
Abstract—Dynamic authority-based keyword search algorithms, such as ObjectRank and personalized PageRank, leverage semantic link information to provide high-quality, high-recall search in databases and on the Web. Conceptually, these algorithms require a query-time PageRank-style iterative computation over the full graph. This computation is too expensive for large graphs, and not feasible at query time. Alternatively, building an index of precomputed results for some or all keywords involves very expensive preprocessing. We introduce BinRank, a system that approximates ObjectRank results by utilizing a hybrid approach inspired by materialized views in traditional query processing. We materialize a number of relatively small subsets of the data graph in such a way that any keyword query can be answered by running ObjectRank on only one of the subgraphs. BinRank generates the subgraphs by partitioning all the terms in the corpus based on their co-occurrence, executing ObjectRank for each partition using the terms to generate a set of random walk starting points, and keeping only those objects that receive non-negligible scores. The intuition is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms. We demonstrate that BinRank can achieve subsecond query execution time on the English Wikipedia data set, while producing high-quality search results that closely approximate the results of ObjectRank on the original graph. The Wikipedia link graph contains about 10^8 edges, which is at least two orders of magnitude larger than what prior state-of-the-art dynamic authority-based search systems have been able to demonstrate. Our experimental evaluation investigates the trade-off between query execution time, quality of the results, and storage requirements of BinRank.
1 INTRODUCTION
In this paper, we introduce BinRank, a system that employs a hybrid approach where query time can be traded off for preprocessing time and storage. BinRank closely approximates ObjectRank scores by running the same ObjectRank algorithm on a small subgraph, instead of the full data graph. The subgraphs are precomputed offline. The precomputation can be parallelized with linear scalability. For example, on the full Wikipedia data set, BinRank can answer any query in less than 1 second, by precomputing about a thousand subgraphs, which takes only about 12 hours on a single CPU.

BinRank query execution easily scales to large clusters by distributing the subgraphs between the nodes of the cluster. This way, more subgraphs can be kept in RAM, thus decreasing the average query execution time. Since the distribution of query terms in a dictionary is usually very uneven, the throughput of the system is greatly improved by keeping duplicates of popular subgraphs on multiple nodes of the cluster. The query term is routed to the least busy node that has the corresponding subgraph.

There are two dimensions to the subgraph precomputation problem: 1) how many subgraphs to precompute and 2) how to construct each subgraph that is used for approximation. The intuition behind our approach is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects w.r.t. one of these terms. For 1), we group all terms into a small number (around 1,000 in the case of Wikipedia) of "bins" of terms based on their co-occurrence in the entire data set. For 2), we execute ObjectRank for each bin using the terms in the bin as random walk starting points and keep only those nodes that receive non-negligible scores.

Our experimental evaluation highlights the tuning of the system needed to balance the query performance with the size and number of the precomputed subgraphs. Intuitively, query performance is highly correlated with the size of the subgraph, which, in turn, is highly correlated with the number of documents in the bin. Thus, it is normally sufficient to create bins with a certain size limit to achieve a specific target running time. However, there is some variability in the process, and some bins may still result in unusually large subgraphs and slow queries. To address this, we employ an adaptive iterative process that further splits the problematic subgraphs to guarantee that a vast majority of queries will be executed within the allotted time budget.

Other approximation techniques have been considered before to improve the scalability of dynamic authority-based search algorithms. Monte Carlo algorithms are introduced in [4] and [5] for approximation during precomputation. HubRank [8] uses the same approximation as [4], but performs precomputation only for "hub" nodes. Other approaches suggest sampling-based techniques for online processing. However, although these techniques claim online query processing, they have only been demonstrated on graphs with fewer than 10^6 links. In contrast, we demonstrate superior scalability of our approach on a Wikipedia graph that is two orders of magnitude larger. We also show that our approximation using ObjectRank itself is more precise than the sampling-based techniques.

Our contributions are:

- The idea of approximating ObjectRank by using Materialized subgraphs (MSGs), which can be precomputed offline to support online querying for a specific query workload, or the entire dictionary.
- Use of ObjectRank itself to generate MSGs for "bins" of terms.
- A greedy algorithm that minimizes the number of bins by clustering terms with similar posting lists.
- Extensive experimental evaluation on the Wikipedia data set that supports our performance and search quality claims. The evaluation demonstrates superiority of BinRank over other state-of-the-art approximation algorithms.

The rest of the paper is organized as follows: We start with a survey of related work in Section 2. We give an overview of the ObjectRank algorithm in Section 3. Materialized subgraphs are introduced in Section 4, and the bin construction algorithm is described in Section 5. In Section 6, we suggest the adaptive MSG recomputation method that improves the performance of BinRank. Section 7 describes the architecture of the BinRank system. Section 8 walks through the experimental evaluation. We conclude in Section 9.

2 RELATED WORK

The issue of scalability of PPR [3] has attracted a lot of attention. PPR performs a very expensive fixpoint iterative computation over the entire graph, while it generates personalized search results. To avoid the expensive iterative calculation at runtime, one can naively precompute and materialize all the possible personalized PageRank vectors (PPVs). Although this method guarantees fast user response time, such precomputation is impractical, as it requires a huge amount of time and storage, especially when done on large graphs. In this section, we examine hub-based and Monte Carlo style methods that address the scalability problem of PPR, and give an overview of HubRank [8], which integrates the two approaches to improve the scalability of ObjectRank. Even though these approaches enabled PPR to be executed on large graphs, they either limit the degree of personalization or deteriorate the quality of the top-k result lists significantly.

Hub-based approaches materialize only a selected subset of PPVs. Topic-sensitive PageRank [2] suggests materialization of 16 PPVs of selected topics and linearly combining them at query time. The personalized PageRank computation suggested in [3] enables a finer-grained personalization by efficiently materializing significantly more PPVs (e.g., 100 K) and combining them using the hub decomposition theorem and dynamic programming techniques. However, it is still not a fully personalized PageRank, because it can personalize only on a preference set subsumed within a hub set H.

Monte Carlo methods replace the expensive power iteration algorithm with a randomized approximation algorithm [4], [5]. In order to personalize PageRank on any arbitrary preference set while maintaining just a small amount of precomputed results, Fogaras et al. [4] introduce the fingerprint algorithm, which simulates the random walk model of PageRank and stores the ending nodes of sampled walks. Since each random walk is independent, fingerprint
generation can be easily parallelized, and the quality of search results improves as the number of fingerprints increases. However, as mentioned in [4], the precision of search results generated by the fingerprint algorithm is somewhat less than that of power-iteration-based algorithms, and sometimes, the quality of its results may be inadequate, especially for nodes that have many close neighbors. In [5], a Monte Carlo algorithm that takes into account not only the last visited nodes, but also all visited nodes during the sampled walks, is proposed. It also showed that Monte Carlo algorithms with iterative start outperform those with random start.

HubRank [8] is a search system based on ObjectRank that improved the scalability of ObjectRank by combining the above two approaches. It first selects a fixed number of hub nodes by using a greedy hub selection algorithm that utilizes a query workload in order to minimize the query execution time. Given a set of hub nodes H, it materializes the fingerprints of the hub nodes in H. At query time, it generates an active subgraph by expanding the base set with its neighbors. It stops following a path when it encounters a hub node whose PPV was materialized, or when the distance from the base set exceeds a fixed maximum length. HubRank recursively approximates the PPVs of all active nodes, terminating with the computation of the PPV for the query node itself. During this computation, the PPV approximations are dynamically pruned in order to keep them sparse. As stated in [8], the dynamic pruning plays a key role in outperforming ObjectRank by a noticeable margin. However, by limiting the precision of hub vectors, HubRank may get somewhat inaccurate search results, as stated in [8]. Also, since it materializes only the PPVs of H, just as [3] does, the efficiency of query processing and the quality of query results are very sensitive to the size of H and the hub selection scheme. Finally, Chakrabarti [8] did not show any large-scale experimental results to verify the scalability of HubRank.

In Section 8, we perform quality and scalability experiments on the full English Wikipedia data set exported in October 2007, to show that BinRank is an efficient ObjectRank approximation method that generates a high-quality top-k list for any keyword query in the corpus. For comparative evaluation of the performance of BinRank, we implemented the Monte Carlo algorithm 4 in [5] that was shown to outperform other variations in [5]. We also implemented HubRank [8] to check its scalability on our Wikipedia data set.

Unlike [4], which proves convergence to the exact solution on arbitrary graphs, and [8] and [3], which offer exact methods at the expense of limiting the choice of personalization, our solution is entirely heuristic. However, extensive experimental evaluation confirms that, on real-world graphs, BinRank can strike a good balance between query performance and closeness of approximation.

3 OBJECTRANK BACKGROUND

In this section, we describe the essentials of ObjectRank [6], [9], [10]. We first explain the data model and query processing, and then, discuss the result quality and scalability issues that motivate this paper.

3.1 Data Model

ObjectRank performs top-k relevance search over a database modeled as a labeled directed graph. The data graph G(V, E) models objects in a database as nodes, and the semantic relationships between them as edges. A node v ∈ V contains a set of keywords and its object type. For example, a paper in a bibliographic database can be represented as a node containing its title and labeled with its type, "paper." A directed edge e ∈ E from u to v is labeled with its relationship type λ(e). For example, when a paper u cites another paper v, ObjectRank includes in E an edge e = (u → v) that has a label "cites." It can also create a "cited by"-type edge from v to u. In ObjectRank, the role of edges between objects is the same as that of hyperlinks between Web pages in PageRank. However, note that edges of different edge types may transfer different amounts of authority. By assigning different edge weights to different edge types, ObjectRank can capture important domain knowledge such as "a paper cited by important papers is important, but citing important papers should not boost the importance of a paper." Let w(t) denote the weight of edge type t. ObjectRank assumes that the weights of edge types are provided by domain experts.

3.2 Query Processing

For a given query, ObjectRank returns the top-k objects relevant to the query. We first describe the intuition behind ObjectRank, introduce the ObjectRank equation, and then, elaborate on important calibration factors.

ObjectRank query processing can be illustrated using the random surfer model. A random surfer starts from a random node vi among the nodes that contain the given keyword. These random surfer starting points are called a base set. For a given keyword t, the keyword base set of t, BS(t), consists of the nodes in which t occurs. Note that any node in G can be part of the base set, which makes ObjectRank support the full degree of personalization. At each node, the surfer follows outgoing edges with probability d, or jumps back to a random node in the base set with probability (1 − d).² At a node v, when it determines which edge to follow, each edge e originating from v is chosen with probability w(λ(e))/OutDeg(λ(e), v), where OutDeg(t, v) denotes the number of outgoing edges of v whose edge type is t. The ObjectRank score of vi is the probability r(vi) that a random surfer is found at vi at a certain moment.

Let r denote the vector of ObjectRank scores [r(v1), ..., r(vi), ..., r(vn)]^T, and let A be an n × n matrix with Aij being the probability that a random surfer moves from vj to vi by traversing an edge. Also, let q be a normalized base set vector s/|BS(t)|, where |BS(t)| is the size of the base set BS(t) and s is a base set vector [s_v1, ..., s_vi, ..., s_vn]^T, where s_vi = 1 if vi is in BS(t) and 0 otherwise. The ObjectRank equation is

    r = dAr + (1 − d)q.    (1)

For a given query t, the ObjectRank algorithm uses the power iteration method to get the fixpoint of r, the ObjectRank vector w.r.t. t, where the (k+1)th ObjectRank vector is calculated as follows:

2. Throughout this paper, we assume that d = 0.85.
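As a concrete illustration of the random-surfer computation and (1), here is a minimal sketch of the power iteration with typed edge weights and base-set restart. The toy graph, edge weights, and parameter defaults are illustrative assumptions of this sketch, not the paper's implementation (which is in Java):

```python
def object_rank(nodes, edges, weights, base_set, d=0.85, eps=1e-8, max_iter=1000):
    """Power iteration for r = d*A*r + (1-d)*q (Eq. 1).
    edges: list of (u, v, edge_type); weights: edge_type -> w(t).
    The surfer leaves u along edge e with probability
    w(type(e)) / OutDeg(type(e), u), damped by d."""
    # per-node, per-type outgoing edge counts: OutDeg(t, u)
    outdeg = {}
    for u, v, t in edges:
        outdeg[(u, t)] = outdeg.get((u, t), 0) + 1
    # normalized base set vector q = s / |BS(t)|
    q = {v: (1.0 / len(base_set) if v in base_set else 0.0) for v in nodes}
    r = dict(q)
    eps_t = eps / len(base_set)   # term convergence threshold
    for _ in range(max_iter):
        nxt = {v: (1 - d) * q[v] for v in nodes}
        for u, v, t in edges:
            # authority flowing from u to v along a type-t edge
            nxt[v] += d * r[u] * weights[t] / outdeg[(u, t)]
        converged = max(abs(nxt[v] - r[v]) for v in nodes) < eps_t
        r = nxt
        if converged:
            break
    return r
```

On a two-node graph with a single "cites" edge out of the base-set node, the score of the cited node converges to d times the restart mass of the citing node, matching the recurrence directly.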
HWANG ET AL.: BINRANK: SCALING DYNAMIC AUTHORITY-BASED SEARCH USING MATERIALIZED SUBGRAPHS 1179
    r^(k+1) = dAr^(k) + (1 − d)q.    (2)

The algorithm terminates when r converges, which is determined by using a term convergence threshold ε_t = ε/|BS(t)|. The constant ε is one of the main ObjectRank calibration parameters, as it controls the speed of convergence and the precision of r.

3.3 Quality and Scalability

ObjectRank returns the top-k search results for a given query using both the content and the link structure in G. Since it utilizes the link structure that captures the semantic relationships between objects, an object that does not contain a given keyword but is highly relevant to the keyword can be included in the top-k list. This is in contrast to the static PageRank approach that only returns objects containing the keyword, sorted according to their PageRank scores. This key difference is one of the main reasons for ObjectRank's superior result quality, as demonstrated by the relevance feedback survey reported in [6].

However, the iterative computation of ObjectRank vectors described in Section 3.2 is too expensive to execute at runtime. For a given query, ObjectRank iterates over the entire graph G to calculate the ObjectRank vector r^(k+1) until |r_i^(k+1) − r_i^(k)| is less than the convergence threshold ε_t for every r_i^(k+1) in r^(k+1) and r_i^(k) in r^(k). This is a very strict stopping condition. This iterative computation may take a very long time if G has a large number of nodes and edges. Therefore, instead of evaluating a keyword query at query time, the original ObjectRank system [6] precomputes the ObjectRank vectors of the keywords in H, the set of keywords, during the preprocessing stage, and then, stores a list of <ObjId, RankValue> pairs per keyword. However, the preprocessing stage of ObjectRank is expensive, as it requires |H| ObjectRank executions and O(|V| · |H|) bits of storage. In fact, according to the worst-case bounds for PPR index size proven in [4], the index size must be Ω(|V| · |H|) bits, for any system that returns the exact ObjectRank vectors.

4 RELEVANT SUBGRAPHS

Our goal is to improve the scalability of ObjectRank while maintaining the high quality of top-k result lists. We focus on the fact that ObjectRank does not need to calculate the exact full ObjectRank vector r to answer a top-k keyword query (k ≪ |V|). We identify three important properties of ObjectRank vectors that are directly relevant to the result quality and the performance of ObjectRank. First, for many of the keywords in the corpus, the number of objects with non-negligible ObjectRank values is much less than |V|. This means that just a small portion of G is relevant to a specific keyword. Here, we say that an ObjectRank value of v, r(v), is non-negligible if r(v) is above the convergence threshold. The intuition for applying the threshold is that differences between scores that are within the threshold of each other are noise after ObjectRank execution. Thus, scores below the threshold are effectively indistinguishable from zero, and objects that have such scores are not at all relevant to the query term. Second, we observed that the top-k results of any keyword term t generated on subgraphs of G composed of nodes with non-negligible ObjectRank values, w.r.t. the same t, are very close to those generated on G. Third, when an object has a non-negligible ObjectRank value for a given base set BS1, it is guaranteed that the object gains a non-negligible ObjectRank score for another base set BS2 if BS1 ⊆ BS2. Thus, a subgraph of G composed of nodes with non-negligible ObjectRank values, w.r.t. a union of base sets of a set of terms, could potentially be used to answer any one of these terms.

Based on the above observations, we speed up the ObjectRank computation for query term q by identifying a subgraph of the full data graph that contains all the nodes and edges that contribute to accurate ranking of the objects w.r.t. q. Ideally, every object that receives a nonzero score during the ObjectRank computation over the full graph should be present in the subgraph and should receive the same score. In reality, however, ObjectRank is a search system that is typically used to obtain only the top-k result list. Thus, the subgraph only needs to have enough information to produce the same top-k list. We shall call such a subgraph a Relevant subgraph (RSG) of a query.

Definition 4.1. The top-k result list of the ObjectRank of keyword term t on data graph G(V, E), denoted by OR(t, G, k), is a list of k objects from V sorted in descending order of their ObjectRank scores w.r.t. a base set that is the set of all objects in V that contain keyword term t.

Definition 4.2. A Relevant Subgraph (RSG(t, G, k)) of a data graph G(V, E) w.r.t. a term t and a list size k is a graph Gs(Vs, Es) such that Vs ⊆ V, Es ⊆ E, and OR(t, G, k) = OR(t, Gs, k).

It is hard to find an exact RSG for a given term, and it is not feasible to precompute one for every term in a large workload. However, we introduce a method to closely approximate RSGs. Furthermore, we observed that a single subgraph can serve as an approximate RSG for a number of terms, and that it is quite feasible to construct a relatively small number of such subgraphs that collectively cover, i.e., serve as approximate RSGs for, all the terms that occur in the data set.

Definition 4.3. An Approximate Relevant Subgraph (ARSG(t, G, k, c)) of a data graph G(V, E) with respect to a term t, list size k, and confidence limit c ∈ [0, 1], is a graph Gs(Vs, Es) such that Vs ⊆ V, Es ⊆ E, and τ(OR(t, G, k), OR(t, Gs, k)) > c.

Kendall's τ is a measure of similarity between two lists [11]. This measure is commonly used to describe the quality of approximation of the top-k lists of an exact ranking (R_E) and an approximate ranking (R_A) that may contain ties (nodes with equal ranks) [4], [8]. A pair of nodes that is strictly ordered in both lists is called concordant if both rankings agree on the ordering, and discordant otherwise. A pair is an e-tie if R_E does not order the nodes of the pair, and an a-tie if R_A does not order them. Let C, D, E, and A denote the number of concordant, discordant, e-tie, and a-tie pairs, respectively. Then, Kendall's τ similarity between two rankings, R_E and R_A, is defined as
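The pair counting just described can be sketched as follows. The closing formula itself is cut off in this excerpt, so the normalization in `tau` below (dividing C − D by the total number of pairs) is an assumption of this sketch, not necessarily the paper's exact definition:

```python
from itertools import combinations

def pair_counts(rank_e, rank_a):
    """Count concordant (C), discordant (D), e-tie (E), and a-tie (A)
    pairs between an exact and an approximate ranking.
    rank_e / rank_a map node -> rank position; equal values mean a tie."""
    C = D = E = A = 0
    for u, v in combinations(sorted(rank_e), 2):
        e_cmp = (rank_e[u] > rank_e[v]) - (rank_e[u] < rank_e[v])
        a_cmp = (rank_a[u] > rank_a[v]) - (rank_a[u] < rank_a[v])
        if e_cmp == 0:
            E += 1                  # exact ranking does not order the pair
        if a_cmp == 0:
            A += 1                  # approximate ranking does not order it
        if e_cmp != 0 and a_cmp != 0:
            C += e_cmp == a_cmp     # both order it: agree...
            D += e_cmp != a_cmp     # ...or disagree
    return C, D, E, A

def tau(rank_e, rank_a):
    """Assumed normalization: (C - D) over the total number of pairs."""
    C, D, _, _ = pair_counts(rank_e, rank_a)
    n = len(rank_e)
    return (C - D) / (n * (n - 1) // 2)
```

With this normalization, identical tie-free rankings score 1 and fully reversed rankings score −1, while ties in either list pull the similarity toward 0.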
Fig. 1. Bin computation algorithm.

of the algorithm. The tight upper bound on the number of set intersections that our algorithm needs to perform is the number of pairs of terms that co-occur in at least one document. To speed up the execution of set intersections for larger posting lists, we use KMV synopses [13] to estimate the size of set intersections.

The algorithm in Fig. 1 works on term posting lists from a text index. As the algorithm fills up a bin, it maintains a list of document IDs that are already in the bin, and a list of candidate terms that are known to overlap with the bin (i.e., their posting lists contain at least one document that was already placed into the bin). The main idea of this greedy algorithm is that, when a term t is added to a bin, we need to intersect the bin with every term t′ that co-occurs with t, in order to check whether t′ is subsumed by the bin completely, and can be placed into the bin "for free."

For example, consider N terms with posting lists of size X each that all co-occur in one document d0, with no other co-occurrences. If the maximum bin size is 2(X − 1), a bin will have to be created for every term. However, to get to that situation, our algorithm will have to check intersections for every pair of terms. Thus, the upper bound on the number of intersections is tight.

In fact, it is easy to see from the above example that no algorithm that packs the bins based on the maximum overlap can do so with fewer than N(N − 1)/2 set intersections in the worst case. Fortunately, real-world text databases have structures that are far from the worst case, as shown in Section 8.

Lastly, we show that the number of bins the algorithm uses to pack a set of posting lists is at most 2αOPT, where α indicates the degree of overlap across posting lists and OPT is the minimal number of bins. Note that since BinRank constructs an MSG for each bin during preprocessing, 2αOPT is also the upper bound on the number of MSGs.

Theorem 5.1. Given a set S of posting lists Si, suppose that there exists α ≥ 1 such that Σ_{Si∈S} |Si| ≤ α · |∪_{Si∈S} Si|. Then, the approximation ratio of PackTermsIntoBins is 2α.

Proof. Let OPT and OPT′ denote the optimal number of bins and the number of bins PackTermsIntoBins uses, respectively.

- Claim 1: OPT ≥ Σ_{Si∈S} |Si| / (α · maxBinSize).
  Since no bin can hold a total capacity of more than maxBinSize,

      OPT ≥ |∪_{Si∈S} Si| / maxBinSize.

  Also, since α satisfies Σ_{Si∈S} |Si| ≤ α · |∪_{Si∈S} Si|,

      OPT ≥ |∪_{Si∈S} Si| / maxBinSize ≥ Σ_{Si∈S} |Si| / (α · maxBinSize).

  Therefore, Claim 1 holds.

- Claim 2: |∪_{Si∈S} Si| > (OPT′ − 1) · maxBinSize / (2α).
  Since no more than one bin is less than half full, Σ_{Si∈S} |Si| > (OPT′ − 1) · maxBinSize / 2. Also, since Σ_{Si∈S} |Si| ≤ α · |∪_{Si∈S} Si| for α ≥ 1,

      |∪_{Si∈S} Si| > (OPT′ − 1) · maxBinSize / (2α).

Combining the two claims, OPT ≥ |∪_{Si∈S} Si| / maxBinSize > (OPT′ − 1) / (2α), and therefore OPT′ < 2α · OPT + 1, i.e., OPT′ ≤ 2α · OPT. □
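A minimal sketch in the spirit of the greedy packing analyzed above. The actual PackTermsIntoBins pseudocode is given in Fig. 1, which is not reproduced in this excerpt, so details such as the candidate-selection rule below (preferring the term whose posting list adds the fewest new documents, so that subsumed terms come "for free") are assumptions of this sketch:

```python
def pack_terms_into_bins(posting_lists, max_bin_size):
    """Greedy sketch: posting_lists maps term -> set of document IDs.
    A bin is a (terms, docs) pair whose document union must not
    exceed max_bin_size."""
    bins = []
    remaining = dict(posting_lists)
    while remaining:
        # seed a new bin with an arbitrary remaining term
        term, docs = next(iter(remaining.items()))
        bin_terms, bin_docs = [term], set(docs)
        del remaining[term]
        while True:
            # pick the candidate that grows the bin the least;
            # a fully subsumed term adds zero new documents
            best, best_new = None, None
            for t, pl in remaining.items():
                new = len(pl - bin_docs)
                if len(bin_docs) + new <= max_bin_size and (
                        best is None or new < best_new):
                    best, best_new = t, new
            if best is None:
                break           # no candidate fits: the bin is closed
            bin_docs |= remaining.pop(best)
            bin_terms.append(best)
        bins.append((bin_terms, bin_docs))
    return bins
```

Each bin's document union stays within max_bin_size, every term lands in exactly one bin, and heavily overlapping terms tend to share a bin, which is what keeps the bin count near the 2α bound in practice.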
MSGs and replace them with more efficient ones during the preprocessing stage.

Recall that the ObjectRank running time scales linearly with two parameters: the number of iterations required and the size of the graph. The number of iterations is correlated with the size of the base set, so for a given MSG, queries with the largest base sets are going to be the slowest. And for queries with fixed-size base sets, the running time will largely depend on the number of links in the graph. In fact, we report in Section 8.5 a 94 percent correlation between the number of links in an MSG and the BinRank running time for queries with large base sets. This observation enables us to reliably identify problematic MSGs based only on their link counts.

However, the correlation between the bin sizes and the MSG link counts is less obvious. Fig. 15 shows that the link count for MSGs follows a normal distribution even with all the bin and MSG generation parameters fixed. Thus, setting the generation parameters in a way that no MSG exceeds a certain link-count threshold is not going to be practical. Instead, we set the parameters in such a way that only a small minority of MSGs exceed the limit, and then, deal with this minority separately.

One way to deal with dangerously large MSGs is to recompute them with a larger convergence threshold, thus making them smaller. However, this may diminish the subsequent query result quality, so instead we choose to keep the same ε, but regenerate the bins that produced these MSGs with a smaller maxBinSize.

To do this, we introduce a new threshold maxMSGSize and generate a set of rejected bins RB that resulted in MSGs with a number of links larger than maxMSGSize. We then generate a new set of workload terms W′, which consists of all the keywords of all bins in RB, and rerun the PackTermsIntoBins algorithm with W′ and the new maxBinSize set to half of the original one. The new set of bins replaces RB, and the new MSGs are produced and tested against maxMSGSize. If some MSGs still fail the test, the process can be repeated iteratively.

7 SYSTEM ARCHITECTURE

Fig. 2 shows the architecture of the BinRank system. During the preprocessing stage (left side of the figure), we generate MSGs as defined in Section 4. During the query processing stage (right side of the figure), we execute the ObjectRank algorithm on the subgraphs instead of the full graph and produce high-quality approximations of top-k lists at a small fraction of the cost. In order to save preprocessing cost and storage, each MSG is designed to answer multiple term queries. We observed in the Wikipedia data set that a single MSG can be used for 330-2,000 terms, on average.

7.1 Preprocessing

The preprocessing stage of BinRank starts with a set of workload terms W for which MSGs will be materialized. If an actual query workload is not available, W includes the entire set of terms found in the corpus. We exclude from W all terms with posting lists longer than a system parameter maxPostingList. The posting lists of these terms are deemed too large to be packed into bins. We execute ObjectRank for each such term individually and store the resulting top-k lists. Naturally, maxPostingList should be tuned so that there are relatively few of these frequent terms. In the case of Wikipedia, we used maxPostingList = 2,000 and only 381 terms out of about 700,000 had to be precomputed individually. This process took 4.6 hours on a single CPU.

For each term w ∈ W, BinRank reads a posting list T from the Lucene³ index and creates a KMV synopsis T′ that is used to estimate set intersections.

The bin construction algorithm, PackTermsIntoBins, partitions W into a set of bins composed of frequently co-

3. http://lucene.apache.org.
occurring terms. The algorithm takes a single parameter over MSGs to generate the top-k list for the query. For a
maxBinSize, which limits the size of a bin posting list, i.e., disjunctive query, the ObjectRank module sums the Objec-
the union of posting lists of all terms in the bin. During the tRank scores w.r.t. each term calculated using MSGs to
bin construction, BinRank stores the bin identifier of each produce BinRank scores.
term into the Lucene index as an additional field. This One of the advantages of BinRank query execution
allows us to map each term to the corresponding bin and engine is that it can easily utilize large clusters of nodes. In
MSG at query time. this case, we distribute MSGs between the nodes and
The ObjectRank module takes as input a set of bin employ Hadoop4 to start an MSG cache and an ObjectRank
posting lists B and the entire graph GðV ; EÞ with a set of engine Web service on every node. A set of dispatcher
ObjectRank parameters, the damping factor d, and the processes, each with its own replica of the Lucene index,
threshold value . The threshold determines the conver- routes the queries to the appropriate nodes.
gence of the algorithm as well as the minimum ObjectRank
score of MSG nodes.
Our ObjectRank implementation stores a graph as a row- 8 EXPERIMENTS
compressed adjacency matrix. In this format, the entire We present our experimental evaluation in this section. We
Wikipedia graph consumes 880 MB of storage and can be first describe our experimental setup using English Wiki-
loaded into main memory for MSG generation. In case that pedia articles. Then, we show scalability numbers for
the entire data graph does not fit in main memory, we can ObjectRank followed by numbers for BinRank. Finally, we
apply parallel PageRank computation techniques such as present a performance comparison of BinRank with Monte
hypergraph partitioning schemes described in [14]. Carlo Method and HubRank.
The MSG generator takes the graph G and the ObjectRank
result w.r.t. a term bin b, and then, constructs a subgraph 8.1 Setup
Gb ðV 0 ; E 0 Þ by including only nodes with rt ðuÞ b . b is the We evaluate the performance of the BinRank algorithm on
convergence threshold of b, that is, |BS(b)|·ε. Given the set of MSG nodes V′, the corresponding set of edges E′ is copied from the in-memory copy of G. The edge construction takes 1.5-2 seconds for a typical MSG with about 5 million edges.

Once the MSG is constructed in memory, it is serialized to a binary file on disk in the same row-compressed adjacency matrix format to facilitate fast deserialization. We observed that deserializing a 40 MB MSG on a single SATA disk drive takes about 0.6 seconds. In general, deserialization speed can be greatly improved by increasing the transfer rate of the disk subsystem.

7.2 Query Processing

For a given keyword query q, the query dispatcher retrieves from the Lucene index the posting list bs(q) (used as the base set for the ObjectRank execution) and the bin identifier b(q). Given a bin identifier, the MSG mapper determines whether the corresponding MSG is already in memory. If it is not, the MSG deserializer reads the MSG representation from disk. The BinRank query processing module uses all available memory as an LRU cache of MSGs.

For smaller data graphs, it is possible to dramatically reduce MSG storage requirements by storing only the set of MSG nodes V′ and generating the corresponding set of edges E′ at query time. However, in our Wikipedia data set, that would introduce an additional delay of 1.5-2 seconds, which is not acceptable in a keyword search system.

The ObjectRank module gets the in-memory instance of the MSG, the base set, and a set of ObjectRank calibrating parameters: 1) the damping factor d; 2) the convergence threshold ε; and 3) the number of top-k list entries k. Once the ObjectRank scores are computed and sorted, the resulting document ids are used to retrieve and present the top-k objects to the user.

Multikeyword queries are processed as follows: for a given conjunctive query composed of n terms {t1, ..., tn}, the ObjectRank module gets the MSGs {MSG(b(t1)), ..., MSG(b(tn))} and evaluates each term over the corresponding MSG. Then, it multiplies the ObjectRank scores obtained [...]

We use the collection of English Wikipedia articles exported in October 2007. We parsed the 13.8 GB dump file and extracted 3.2M articles and 109M intrawiki links of 10 types (e.g., "Regular links," "Category links," "See also links," etc.). All the experiments in this section are performed over the labeled graph Gwiki = (Vwiki, Ewiki), which is composed of the Wikipedia articles as nodes and the intrawiki links as edges. We used the standard row-compressed matrix format to represent the link structure and weight dissipation rates of Ewiki compactly. We were able to store the 3.2M × 3.2M transition matrix of Gwiki with 109M nonzero elements in only 880 MB. We created a Lucene text index of the Wikipedia article titles, which takes up 154 MB. The dictionary of the index contains 698,214 terms.

We chose to index only article titles, by analogy with the original ObjectRank [6] setup that used only publication titles from DBLP. It is important for ObjectRank to have a base set of objects that are highly related to a search term. However, a large article can mention a term without being meaningfully related to it. For that reason, the title index works better than an index on the full text of the articles. To use a full article text index, the ObjectRank algorithm would have to be augmented to take into account the Lucene search scores of the base set documents. This is one of our future research directions.

For our experiments, we implemented the BinRank system (and other algorithms for performance comparisons) in Java and ran them on a single PC with a Pentium 4 3.40 GHz CPU and 2.0 GB of RAM.

8.2 ObjectRank on the Full Wikipedia Graph

ObjectRank on Gwiki takes too long to be executed online and consumes around 880 MB of memory just for the link information of Gwiki. As shown in Fig. 3, it takes around 20-50 seconds (30 seconds on average) to compute the dynamically generated top-k list for a given single keyword.
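To make the per-query cost concrete: the ObjectRank execution described in Section 7.2 is a personalized-PageRank-style power iteration restarted from the base set, governed by the damping factor d and convergence threshold ε. The following is a minimal sketch of that style of computation, not the measured Java implementation; the graph encoding and all names are our own assumptions:

```python
def object_rank(out_edges, base_set, d=0.85, eps=1e-4, k=10):
    """PageRank-style power iteration personalized on a base set.

    out_edges: {node: [(target, weight), ...]}, weights being the
               transition probabilities dissipated along each edge.
    base_set:  nodes from the query's posting list (random-walk restarts).
    """
    restart = {n: 1.0 / len(base_set) for n in base_set}
    scores = dict(restart)
    while True:
        # restart mass plus authority propagated along out-edges
        nxt = {n: (1.0 - d) * p for n, p in restart.items()}
        for n, p in scores.items():
            for target, w in out_edges.get(n, ()):
                nxt[target] = nxt.get(target, 0.0) + d * p * w
        converged = all(
            abs(nxt.get(n, 0.0) - scores.get(n, 0.0)) < eps
            for n in set(nxt) | set(scores)
        )
        scores = nxt
        if converged:
            return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```

Each iteration touches every edge once, which is why running this loop over the roughly 10^8 edges of Gwiki at query time is infeasible, and why BinRank runs it over a much smaller MSG instead.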
1184 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 8, AUGUST 2010
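The materialization rule of Section 7.1, keeping exactly those nodes whose ObjectRank scores with respect to a bin's base set exceed the threshold |BS(b)|·ε, together with the edges among them, amounts to a filter over the full-graph score vector. A minimal sketch of that step (our own illustration; function and parameter names are assumptions):

```python
def materialize_msg(scores, out_edges, base_set_size, eps):
    """Keep only nodes whose ObjectRank score w.r.t. the bin's base set
    exceeds |BS(b)| * eps, plus the edges among the surviving nodes."""
    threshold = base_set_size * eps
    kept = {n for n, s in scores.items() if s > threshold}
    msg_edges = {n: [(t, w) for t, w in out_edges.get(n, []) if t in kept]
                 for n in kept}
    return kept, msg_edges
```

Lowering eps raises the number of surviving nodes and edges, which is the trade-off between MSG size and accuracy examined below.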
Fig. 12. The top-100 accuracy for disjunctive queries (maxBinSize = 4,000 and ε = 5.0E-4).

Fig. 13. The top-100 accuracy for conjunctive queries (maxBinSize = 4,000 and ε = 5.0E-4).

|Vwiki|. However, if we decrease ε from 1E-3 to 5E-4, the average size of MSGs increases from 0.98 percent of |Vwiki| to 1.52 percent of |Vwiki|, while the RankMass increases by 7.0 percent.

8.4 BinRank for Multikeyword Queries

In this section, we investigate the performance of BinRank for multikeyword queries. Given a multikeyword query q composed of n keywords k1 ... kn, BinRank first evaluates each ki over the MSG corresponding to the keyword, MSG(ki). Then, it combines the rank scores computed over those MSGs according to the query semantics to produce the top-k list for q.

We observed from our experimental results that if a multikeyword query contains highly related keywords, such as "martial" AND "arts" or "fine" AND "performing," BinRank assigns those keywords to the same bin, and thus evaluates them over the same MSG. In this case, the top-k accuracy of the query is higher than that of randomly generated multikeyword queries. However, if the keywords composing a multikeyword query are assigned to different bins and the query is conjunctive, BinRank has to evaluate each keyword over a different MSG and combine the scores. We assign zero scores to nodes not in an MSG. Hence, if a conjunctive query contains keywords whose MSGs do not overlap, BinRank will return an empty result. However, we observed no such cases throughout our experiments, because certain highly popular subgraphs of Gwiki obtain non-negligible scores regardless of the keywords assigned to a bin.

We randomly generated 600 multikeyword queries to measure the top-k accuracy of conjunctive and disjunctive queries containing 2 to 4 keywords. Throughout these experiments, we use maxBinSize = 4,000 and ε = 5.0E-4. We do not report statistics on the BinRank running time for multikeyword queries, because it is dominated by the running time of BinRank for each individual query term.

We can see from Fig. 12 that the top-100 accuracy of disjunctive queries is higher than the top-100 accuracy for single-keyword queries, shown in Figs. 8 and 11. As shown in Fig. 9, the accuracy of a top-k list drops as K increases, because the scores of highly ranked nodes are more stable than those of the rest. Since the top-100 list for a disjunctive query tends to include the top-K nodes (K ≤ 100) of the top-100 lists obtained over the MSGs, its accuracy is at least as high as that for single-keyword queries, or slightly higher.

As shown in Fig. 13, RAG(100) and Prec(100) are above 0.9, indicating that the top-100 lists obtained by BinRank include most of the nodes in the top-100 lists generated by ObjectRank over Gwiki. However, the average Kendall's τ(100) for conjunctive queries remains in the [0.75, 0.8] range, which is lower than for single-keyword queries (Fig. 8) or disjunctive queries (Fig. 12). Therefore, we can see that for a given conjunctive query, BinRank generates a top-k list that contains all the nodes in the top-k list obtained over Gwiki, but the ordering of nodes in the BinRank top-k list is not highly accurate. This is mainly because the MSGs are not large enough to cover all the important paths through which a significant amount of authority flows into or between the top-100 nodes, even though most of the links between the top-100 nodes exist on the corresponding MSGs. To improve the top-k accuracy of conjunctive queries, we can increase the coverage of MSGs by using a smaller ε. Note that increasing maxBinSize does not improve the top-k accuracy by a big margin, as shown in Fig. 8.

However, with a smaller ε, BinRank generates larger MSGs, increasing query execution time. In particular, we observed that some MSGs require unacceptably long running times. Given a time budget, we want to identify such MSGs and recompute them as described in Section 8.5.

8.5 Adaptive MSG Recomputation

In this section, we first examine the entire set of MSGs to understand their features. We obtain 1,043 bins, and then generate a set of MSGs M using the BinRank parameters maxBinSize = 4,000 and ε = 5.0E-4. The average numbers of nodes and links on an MSG are 48,616 and 5.2M, respectively, which is just 1.52 percent of |Vwiki| and 4.83 percent of |Ewiki|. Recall that Gwiki has 3.2M nodes and 109M links.

Next, to evaluate the quality of the MSGs in M, we pick a set of keywords Q by selecting the keyword with the largest frequency among the keywords assigned to each bin. The range of keyword frequencies in Q is [1, 2,000]. We select the most frequent keyword of each bin since it is very likely to result in the slowest BinRank execution time out of all keywords in the bin, as discussed in Section 6.

The average BinRank execution time for queries in Q is 856 ms, which is much faster than the average ObjectRank execution time on Gwiki, 30 seconds. However, we observe that some queries in Q require almost 2 seconds to evaluate, which is sometimes not acceptable. The goal of the MSG recomputation algorithm is to predict and prevent such cases so that the BinRank running time does not exceed a certain time budget, which is set to 1 second throughout these experiments.

Fig. 14. The effect of the number of links on an MSG on the BinRank running time (maxBinSize = 4,000 and ε = 5.0E-4). The Pearson correlation coefficient is 0.938.

Fig. 15. The distribution of the number of links on an MSG (1,043 MSGs generated by using maxBinSize = 4,000 and ε = 5.0E-4).

As we discussed in Section 6, the BinRank query running time depends on features of the query (e.g., the base set size) and of the corresponding MSG (e.g., the number of nodes and the number of links). Other factors, such as the connectivity of links on an MSG and the topology of the base set nodes, also affect the BinRank running time, but they are harder to quantify, and the simple features prove to be sufficiently good predictors.

The correlation coefficients, denoted by r, between the BinRank running time and each of the three simple features are the following:

•  r1 = 0.564: with the number of nodes on an MSG.
•  r2 = 0.700: with the number of links on an MSG.
•  r3 = 0.459: with the base set size of a query.

r2 is noticeably higher than r1 or r3, which indicates that the number of links on an MSG is more tightly correlated with the BinRank running time than the other two features. Moreover, since r2 is obtained from all the queries in Q, whose base set sizes vary significantly within [1, 2,000], we can see the effect of the number of links on an MSG more clearly after reducing the effect of the base set size. To do so, we select a set of 292 high-frequency queries, with frequencies in [1,000, 2,000], and denote it as Qhf. From Fig. 14, obtained using Qhf, we can clearly observe a very strong correlation between the number of links on an MSG and the BinRank running time. With a very high R² value, the BinRank running time of a query is almost linear in the number of links on the corresponding MSG. Also, the correlation coefficient between the number of links on an MSG and the BinRank running time using the high-frequency keywords in Qhf is 0.938, which likewise indicates a very strong correlation.

By exploiting this strong correlation, we select the MSGs whose BinRank running time will be above a certain time budget with high probability. As can be seen in Fig. 15, the number of links on an MSG almost follows a normal distribution N(μ, σ²), where μ = 5.2E6 and σ = 1.0E6. Our experiments show that among the 1,043 MSGs in M, 144 MSGs have more than (μ + σ) links, and among them, 138 MSGs (94.4 percent) require more than 1 sec to produce top-k lists for the largest-frequency keyword in the corresponding bin. In contrast, the probability that the worst-case BinRank running time exceeds 1 sec is just 16.4 percent for MSGs with fewer than (μ + σ) links. If we pick the MSGs with fewer than μ links, only 5.6 percent of them spend more than 1 sec to compute top-k lists in the worst case. Therefore, by default, BinRank sets the maxMSGSize parameter to (μ + σ) and recomputes the bins of all the MSGs with higher link counts, using the halved maxBinSize, as described in Section 6. Recall from Fig. 7 that reducing maxBinSize linearly reduces the query time, thus dramatically reducing the number of queries running over budget.

In the general case, maxMSGSize could be set to (μ + xσ), where good candidates for x lie within [0, 1], as we can see in our experimental results. Under the normality assumption, a threshold of μ + xσ flags a fraction 1 − Φ(x) of the MSGs for regeneration: about 50 percent at x = 0 and about 16 percent at x = 1, consistent with our measurements. In the future, we plan to investigate optimizing x while considering factors such as the time and space budget for MSG generation. For example, if x = 0, we need to regenerate about 50 percent of the MSGs, while we regenerate only 14 percent of them when x = 1.

Another approach we plan to investigate is to base maxMSGSize on actual query performance measurements. The BinRank running time also follows a normal distribution N(μt, σt²), and the time budget, 1 second in our experiments, corresponds to μt + 0.58σt. Since the BinRank running time and the number of links on an MSG are highly correlated, as shown in Fig. 14, we can use 0.58 as x to select the MSGs to regenerate.

8.6 Performance Comparison of BinRank with Monte Carlo Method and HubRank

In this section, we present a performance comparison of BinRank with Monte Carlo-style methods and with HubRank. We implemented the Monte Carlo algorithm "MC complete path stopping at dangling nodes" (Algorithm 4 in [5]), and HubRank [8], which combines a hub-based approach with a Monte Carlo method based on fingerprints.

For a given keyword query, the Monte Carlo algorithm simulates random walks starting from the nodes containing the keyword. Within a specified total number of walks, it samples exactly the same number of random walks per starting point. The authority score of a node is the total number of visits to that node divided by the total number of visits. Fig. 16 shows the performance of the Monte Carlo algorithm in terms of the accuracy of top-k lists for various query times.
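The baseline just described, "MC complete path stopping at dangling nodes," admits a compact sketch (our own illustration for exposition, not the implementation measured in Fig. 16; names and the graph encoding are assumptions):

```python
import random

def mc_complete_path(out_edges, start_nodes, d=0.85,
                     walks_per_start=1000, seed=0):
    """Monte Carlo 'complete path' estimator: each walk continues with
    probability d along a uniformly random out-edge, stops at dangling
    nodes, and every visited node is counted."""
    rng = random.Random(seed)
    visits = {}
    total = 0
    for s in start_nodes:
        for _ in range(walks_per_start):  # same number of walks per start
            node = s
            while True:
                visits[node] = visits.get(node, 0) + 1
                total += 1
                targets = out_edges.get(node)
                if not targets or rng.random() > d:  # dangling or stop
                    break
                node = rng.choice(targets)
    # authority score = visits to the node / total number of visits
    return {n: v / total for n, v in visits.items()}
```

Accuracy grows with walks_per_start, which is the knob that trades query time for top-k quality in the comparison above.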
HWANG ET AL.: BINRANK: SCALING DYNAMIC AUTHORITY-BASED SEARCH USING MATERIALIZED SUBGRAPHS 1189
[14] J.T. Bradley, D.V. de Jager, W.J. Knottenbelt, and A. Trifunovic, "Hypergraph Partitioning for Faster Parallel PageRank Computation," Proc. Second European Performance Evaluation Workshop (EPEW), pp. 155-171, 2005.
[15] J. Cho and U. Schonfeld, "RankMass Crawler: A Crawler with High PageRank Coverage Guarantee," Proc. Int'l Conf. Very Large Data Bases (VLDB), 2007.

Heasoo Hwang is currently working toward the PhD degree in computer science at the University of California, San Diego. Her primary research interests lie in the effective and efficient search and management of large-scale graph-structured data sets. She spent two summers at the IBM Almaden Research Center, where she worked on improving the efficiency of dynamic link-based search over graph-structured data, which motivated this BinRank paper. The BinRank system is a part of her thesis work.

Berthold Reinwald received the PhD degree from the University of Erlangen-Nuernberg, Germany, in 1993. Since 1993, he has been with the IBM Almaden Research Center, where he is currently a research staff member. His current research interests include scalable analytics and cloud data management. He is a member of the ACM.

Erik Nijkamp is working toward the graduate degree in computer science at the Technical University of Berlin. He is now a research assistant in the Department of Database Systems and Information Management. His research interests focus on aspects of distributed computing and machine learning.