
1. INTRODUCTION
Keyword query interfaces (KQIs) for databases have attracted much attention in the last
decade due to their flexibility and ease of use in searching and exploring data. Since any
entity in a data set that contains the query keywords is a potential answer, keyword queries
typically have many possible answers. A KQI must identify the information need behind a
keyword query and rank the answers so that the desired answers appear at the top of the list.
Unless otherwise noted, we refer to a keyword query simply as a query in the remainder of this
project. Databases contain entities, and entities contain attributes that take attribute values.
Some of the difficulties of answering a query are as follows. First, unlike queries in
languages such as SQL, users do not normally specify the desired schema element(s) for each
query term; a KQI must therefore find the desired attributes associated with each term in the
query.
Second, the schema of the output is not specified, i.e., users do not give enough
information to single out exactly their desired entities. Recently, there have been collaborative
efforts to provide standard benchmarks and evaluation platforms for keyword search methods
over databases. For some queries, the XML search methods which we implemented return
rankings of considerably lower quality than their average ranking quality over all queries.
Hence, some queries are more difficult than others.
Moreover, no matter which ranking method is used, we cannot deliver a reasonable
ranking for these queries. To the best of our knowledge, there has not been any work on
predicting or analyzing the difficulty of queries over databases. Researchers have proposed
methods to detect difficult queries over plain-text document collections. However, these
techniques are not applicable to our problem since they ignore the structure of the database.
In particular, as mentioned earlier, a KQI must assign each query term to a schema element
in the database, and it must also distinguish the desired result type(s). We show empirically
that direct adaptations of these techniques are ineffective for structured data. In this thesis,
we analyze the characteristics of difficult queries over databases and propose a novel method
to detect such queries. We take advantage of the structure of the data to gain insight into the
degree of difficulty of a query given the database.

2. LITERATURE REVIEW
EFFICIENT IR-STYLE KEYWORD SEARCH OVER RELATIONAL DATABASES
Applications in which plain text coexists with structured data are pervasive.
Commercial relational database management systems (RDBMSs) generally provide querying
capabilities for text attributes that incorporate state-of-the-art information retrieval (IR)
relevance ranking strategies, but this search functionality requires that queries specify the
exact column or columns against which a given list of keywords is to be matched. This
requirement can be cumbersome and inflexible from a user perspective: good answers to a
keyword query might need to be assembled in perhaps unforeseen ways by joining tuples
from multiple relations. This observation has motivated recent research on free-form keyword
search over RDBMSs. In this paper, we adapt IR-style document-relevance ranking strategies
to the problem of processing free-form keyword queries over RDBMSs.
Our query model can handle queries with both AND and OR semantics, and exploits
the sophisticated single-column text-search functionality often available in commercial
RDBMSs. We develop query-processing strategies that build on a crucial characteristic of
IR-style keyword search: only the few most relevant matches according to some definition
of relevance are generally of interest. Consequently, rather than computing all matches for
a keyword query, which leads to inefficient executions, our techniques focus on the top-k
matches for the query, for moderate values of k. A thorough experimental evaluation over real
data shows the performance advantages of our approach.
IR ENGINE
Modern RDBMSs include IR-style text indexing functionality at the attribute level.
The IR Engine module of our architecture exploits this functionality to identify all database
tuples that have a non-zero score for a given query. The IR Engine relies on the IR Index,
which is an inverted index that associates each keyword that appears in the database with a
list of occurrences of the keyword; each occurrence of a keyword is recorded as a
tuple-attribute pair. Our implementation uses Oracle Text, which keeps a separate index for
each relation attribute. We combine these individual indexes to build the IR Index. When a
query Q arrives, the IR Engine uses the IR Index to extract from each relation R the tuple set
R^Q = {t ∈ R | Score(t,Q) > 0}, which consists of the tuples of R with a non-zero score for Q.
The tuples t in the tuple sets are ranked in descending order of Score(t,Q), as required
by the top-k query processing algorithms.
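The extraction step can be sketched as follows. This is a simplified illustration assuming a hypothetical in-memory inverted index and a stand-in score function for the RDBMS text-scoring facility; it is not the actual Oracle Text implementation.

from collections import defaultdict

def build_tuple_sets(query_keywords, ir_index, score):
    """Extract, for each relation R, the tuple set R^Q = {t in R | Score(t, Q) > 0},
    sorted in descending order of Score(t, Q).

    ir_index: dict keyword -> list of (relation, tuple_id) occurrences (hypothetical).
    score:    callable (relation, tuple_id, query_keywords) -> float, a stand-in
              for the RDBMS full-text scoring function.
    """
    candidates = defaultdict(set)            # relation -> candidate tuple ids
    for w in query_keywords:
        for relation, tuple_id in ir_index.get(w, []):
            candidates[relation].add(tuple_id)

    tuple_sets = {}
    for relation, tuple_ids in candidates.items():
        scored = [(t, score(relation, t, query_keywords)) for t in tuple_ids]
        scored = [(t, s) for t, s in scored if s > 0]
        # Rank tuples in descending Score(t, Q), as the top-k algorithms require.
        scored.sort(key=lambda ts: ts[1], reverse=True)
        tuple_sets[relation] = scored
    return tuple_sets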

EXECUTION ALGORITHMS
We now present algorithms for a core operation in our system: given a set of CNs
together with a set of non-free tuple sets, the Execution Engine needs to efficiently identify
the top-k joining trees of tuples that can be derived. First, we describe the Naive algorithm, a
simple adaptation of query processing algorithms used in prior work [11, 1]. Second, we
present the Sparse algorithm, which improves on the Naive algorithm by dynamically
pruning some CNs during query evaluation. Third, we describe the Single Pipelined
algorithm, which calculates the top-k results for a single CN in a pipelined way. Fourth, we
present the Global Pipelined algorithm, which generalizes the Single Pipelined algorithm to
multiple CNs and can then be used to calculate the final result for top-k queries. Finally, we
introduce the Hybrid algorithm, which combines the virtues of both the Global Pipelined and
the Sparse algorithms.

NAIVE ALGORITHM
The Naive algorithm issues a SQL query for each CN for a top-k query. The results
from each CN are combined in a sort-merge manner to identify the final top-k results of the
query. This approach is an adaptation of the execution algorithms of prior work [11, 1, 12] for
keyword-search queries. As a simple optimization in our experiments, we only retrieve the
top-k results from each CN according to the scoring function, and we enable the top-k hint
functionality available in the Oracle 9.1 RDBMS. In the case of Boolean-AND semantics,
the Naive algorithm (as well as the Sparse algorithm presented below) involves an additional
filtering step on the stream of results to check for the presence of all keywords.
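The control flow of the Naive algorithm can be sketched as below, assuming a hypothetical run_cn_query helper that stands in for the SQL query issued per CN, and an optional predicate for the Boolean-AND filtering step.

import heapq

def naive_top_k(candidate_networks, run_cn_query, k, require_all_keywords=None):
    """Naive evaluation sketch: evaluate every CN independently, then merge.

    run_cn_query: callable (cn, k) -> list of (tuple_tree, score), the top-k
                  results of the SQL query generated for that CN (hypothetical).
    require_all_keywords: optional predicate used under Boolean-AND semantics to
                  drop trees that miss some query keyword.
    """
    merged = []
    for cn in candidate_networks:
        for tree, score in run_cn_query(cn, k):
            if require_all_keywords and not require_all_keywords(tree):
                continue                      # AND semantics: drop incomplete matches
            merged.append((score, tree))
    # Combine the per-CN result streams and keep the overall top-k.
    return heapq.nlargest(k, merged, key=lambda st: st[0])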
SPARSE ALGORITHM
The Naive algorithm exhaustively processes every CN associated with a query. We
can improve query-processing performance by discarding at any point in time any
(unprocessed) CN that is guaranteed not to produce a top-k match for the query. Specifically,
the Sparse algorithm computes a bound MPSi on the maximum possible score of a tuple tree
derived from a CN Ci. If MPSi does not exceed the actual score of k already produced tuple
trees, then CN Ci can be safely removed from further consideration. To calculate MPSi, we
apply the combining function to the top tuples (due to the monotonicity property in Definition
2) of the non-free tuple sets of Ci. That is, MPSi is the score of a hypothetical joining tree of
tuples T that contains the top tuples from every non-free tuple set in Ci. As a further

optimization, the CNs for a query are evaluated in ascending size order. This way, the
smallest CNs, which are the least expensive to process and are the most likely to produce
high-score tuple trees using the combining function above, are evaluated first.
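A rough sketch of this pruning logic follows; top_tuple_scores, combine, and run_cn_query are hypothetical stand-ins for the sorted tuple-set heads, the monotone combining function, and the per-CN SQL evaluation, respectively.

def sparse_top_k(candidate_networks, top_tuple_scores, combine, run_cn_query, k):
    """Sparse algorithm sketch: skip CNs that cannot reach the current top-k.

    top_tuple_scores: callable cn -> list of best tuple scores, one per non-free
                      tuple set of the CN (the head of each sorted tuple set).
    combine:          the monotone score-combining function of the ranking model.
    run_cn_query:     callable (cn, k) -> list of (tuple_tree, score) (hypothetical).
    """
    # Evaluate CNs in ascending size order (size assumed to be len(cn)): small CNs
    # are cheap and tend to produce high-score trees, tightening the bound early.
    results = []
    for cn in sorted(candidate_networks, key=len):
        mps = combine(top_tuple_scores(cn))    # maximum possible score for this CN
        if len(results) >= k and mps <= min(s for s, _ in results):
            continue                           # CN cannot contribute a top-k tree
        for tree, score in run_cn_query(cn, k):
            results.append((score, tree))
        results = sorted(results, key=lambda st: st[0], reverse=True)[:k]
    return results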
SINGLE PIPELINED ALGORITHM
The Single Pipelined algorithm receives as input a candidate network C and the non-free
tuple sets TS1, . . . , TSv that participate in C. Recall that each of these non-free tuple
sets corresponds to one relation, and contains the tuples in the relation with a non-zero match
for the query. Furthermore, the tuples in TSi are sorted in descending order of their Score for
the query. (Note that the attribute scores Score(ai,Q) and the tuple scores Score(t,Q) associated
with each tuple t ∈ TSi are initially computed by the IR Engine, as described above, and do not
need to be re-calculated by the Execution Engine.) The output of the Single Pipelined algorithm
consists of a stream of joining trees of tuples T in descending Score(T,Q) order.
SPARK: Top-k Keyword Query in Relational Databases
With the increasing amount of text data stored in relational databases, there is a
demand for RDBMS to support keyword queries over text data. As a search result is often
assembled from multiple relational tables, traditional IR-style ranking and query evaluation
methods cannot be applied directly. In this paper, we study the effectiveness and the
efficiency issues of answering top-k keyword queries in relational database systems. We
propose a new ranking formula by adapting existing IR techniques based on a natural notion
of virtual document. Compared with previous approaches, our new ranking method is simple
yet effective, and agrees with human perceptions. We also study efficient query processing
methods for the new ranking method, and propose algorithms that have minimal accesses to
the database. We have conducted extensive experiments on large-scale real databases using
two popular RDBMSs. The experimental results demonstrate significant improvement over the
alternative approaches in terms of retrieval effectiveness and efficiency.
MODELLING A JOINED TUPLE TREE AS A VIRTUAL DOCUMENT
We propose a solution based on the idea of modelling a JTT as a virtual document.
Consequently, the entire results produced by a CN will be modelled as a document collection.
The rationale is that most of the CNs carry certain distinct semantics. E.g., C^Q ⋈ P^Q gives
all details about complaints and their related products that are collectively relevant to the

query Q and form integral logical information units. In fact, the actual lodgment of a
complaint would contain both the product information and the detailed comment; it was split
into multiple tables only because of the normalization requirement imposed by the physical
implementation of the RDBMS. A very similar notion of virtual document was proposed in [30].
Our definition differs from [30] in that ours is query-specific and dynamic. For example, a
customer tuple is only joined with complaints matching the query to form a virtual document
at run time, rather than being joined with all the complaints as in [30]. By adopting such a
model, we can naturally compute the IR-style relevance score without using an esoteric
score aggregation function. More specifically, we assign an IR ranking score to a JTT T as
score_a(T, Q) = \sum_{w \in T \cap Q} \frac{1 + \ln(1 + \ln(tf_w(T)))}{(1 - s) + s \cdot \frac{dl_T}{avdl_{CN^*(T)}}} \cdot \ln(idf_w)    (1)

where tf_w(T) = \sum_{t \in T} tf_w(t) and idf_w = \frac{N_{CN^*(T)} + 1}{df_w(CN^*(T))}. Here CN(T) denotes the CN
which the JTT T belongs to, and CN^*(T) is identical to CN(T) except that all full-text selection
conditions are removed. CN^*(T) is also written as CN^* if there is no ambiguity.
Consider the CN C^Q ⋈ P^Q; its CN^* is C ⋈ P, with N_{CN^*} = 3, df_maxtor = 2 and
df_netvista = 3.
In our proposed method, the contributions of the same keyword in different relations
are first combined and then attenuated by the term frequency normalization. Therefore,
tf_maxtor(c2 ⋈ p2) = 0 and tf_netvista(c2 ⋈ p2) = 2, while tf_maxtor(c1 ⋈ p1) = 1 and
tf_netvista(c1 ⋈ p1) = 1. According to Equation (1), and omitting the size normalization,
score_a(c2 ⋈ p2) = 0.44, while score_a(c1 ⋈ p1) = 0.98. Thus, c1 ⋈ p1 is ranked higher
than c2 ⋈ p2, which agrees with human judgments.
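These numbers can be reproduced with a small sketch of Equation (1) in which the size-normalization denominator is taken as 1, using the statistics given above (N_{CN^*} = 3, df_maxtor = 2, df_netvista = 3).

import math

def score_a(tf_per_keyword, df_per_keyword, n_cn_star):
    """SPARK-style score of a JTT, with the document-length normalization
    term omitted (treated as 1), as in the worked example above."""
    score = 0.0
    for w, tf in tf_per_keyword.items():
        if tf == 0:
            continue
        idf = (n_cn_star + 1) / df_per_keyword[w]
        score += (1 + math.log(1 + math.log(tf))) * math.log(idf)
    return score

df = {"maxtor": 2, "netvista": 3}
print(round(score_a({"maxtor": 1, "netvista": 1}, df, 3), 2))  # c1 join p1 -> 0.98
print(round(score_a({"maxtor": 0, "netvista": 2}, df, 3), 2))  # c2 join p2 -> 0.44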
There are still two technical issues remaining: how to obtain df_w(CN^*) and N_{CN^*},
and how to obtain avdl_{CN^*}. Computing df_w(CN^*) and N_{CN^*} exactly would incur
prohibitive cost. Therefore, we resort to an approximate solution: we estimate
p = \frac{df_w(CN^*)}{N_{CN^*}}, such that the idf value of the term in CN^* can be approximated
as \frac{1}{p}. Consider CN^* = R_1 ⋈ R_2 ⋈ ... ⋈ R_l, and denote the percentage of tuples in
R_j that match the keyword w as p_w(R_j).
We can derive

\frac{df_w(CN^*)}{N_{CN^*} + 1} \approx \frac{df_w(CN^*)}{N_{CN^*}} = p \approx 1 - \prod_{j} (1 - p_w(R_j))

by assuming that (a) N_{CN^*} is a large number, and (b) tuples matching keyword w are
uniformly and independently distributed in each relation R_j. In a similar fashion, we estimate
avdl_{CN^*} as \sum_j avdl_{R_j}.
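As an illustration, the estimate can be coded directly; the relation match fractions used in the example below are hypothetical.

def estimate_idf(p_w_per_relation):
    """Approximate the idf of term w in CN* = R1 join ... join Rl as 1 / p, where
    p ~= 1 - prod_j (1 - p_w(Rj)) under the uniformity/independence assumptions.

    p_w_per_relation: list of p_w(Rj), the fraction of tuples of Rj that
                      match the keyword w.
    """
    p = 1.0
    for p_w in p_w_per_relation:
        p *= (1.0 - p_w)
    p = 1.0 - p
    return float("inf") if p == 0 else 1.0 / p

# Example (hypothetical fractions): w matches 10% of R1's tuples and 5% of R2's.
print(estimate_idf([0.10, 0.05]))   # 1 / 0.145, roughly 6.9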
KEYWORD++: A FRAMEWORK TO IMPROVE KEYWORD SEARCH OVER
ENTITY DATABASES
Keyword search over entity databases (e.g., product, movie databases) is an important
problem. Current techniques for keyword search on databases may often return incomplete
and imprecise results. On the one hand, they either require that relevant entities contain all (or
most) of the query keywords, or that relevant entities and the query keywords occur together
in several documents from a known collection. Neither of these requirements may be satisfied
for a number of user queries. Hence results for such queries are likely to be incomplete in that
highly relevant entities may not be returned. On the other hand, although some returned
entities contain all (or most) of the query keywords, the intention of the keywords in the
query could be different from that in the entities. Therefore, the results could also be
imprecise. To remedy this problem, in this paper, we propose a general framework that can
improve an existing search interface by translating a keyword query to a structured query.
Specifically, we leverage the keyword-to-attribute-value associations discovered in the results
returned by the original search interface. We show empirically that the translated structured
queries alleviate the above problems.
SCORE THRESHOLDING
CATEGORICAL AND NUMERICAL PREDICATES:
The aggregated correlation scores are compared with a threshold. A mapping is kept if
its aggregate score is above the threshold. As we discussed above, the keyword-to-predicate
mapping is more robust with more differential query pairs. Intuitively, for those keywords
with a small number of differential query pairs, we need to set a high threshold to ensure
high statistical significance, as they tend to have high variations in the aggregated scores. The
threshold value can be lower when we have a larger number of differential query pairs.
Setting up thresholds with multiple criteria has been studied in the literature. In this paper, we
adopt a solution similar to that in [6]. The main idea is to consider each keyword-to-predicate
mapping as a point in a two-dimensional space, where the x-axis is the aggregate score and
the y-axis is the number of differential query pairs. A set of positive and negative samples is
provided. The method searches for an optimal set of thresholding points (e.g., 5 points) in the
space, such that the points lying to the top-right of any of these thresholding points include as
many positive samples, and as few negative samples, as possible. With that approach, we
generate a set of thresholds, each of which corresponds to a range with respect to the number
of differential query pairs.
Note that we have two different metrics for categorical and numerical predicates.
Hence, their scores are not comparable to each other, and it is necessary to set up a
threshold for each of them separately. Furthermore, after score thresholding, we normalize
the scores: each score s that is above the threshold (with respect to the corresponding number
of differential query pairs) is converted to a relative score.
TEXTUAL AND NULL PREDICATES:
For each keyword k, we compute the score for all possible categorical and
numerical predicates. If none of these predicates has a score that passes the threshold, then the
keyword is not strongly correlated with any categorical or numerical predicate. Therefore, we
assign a textual or null predicate to k. Specifically, for any keyword k that does not appear in
the textual attributes of the relation, or that belongs to a stop word list, we assign the null
predicate TRUE with score 0 to k. Otherwise, we assign a textual predicate Contains(A, k) with
score 0 if k appears in a textual attribute A.
MAPPING PROBABILITIES
As briefly discussed in the introduction, we can infer the mapping between each query term
and an XML element based on collection statistics. More formally, using Bayes' theorem, we
can estimate the posterior probability P_M(E_j | w) that a given query term w is mapped into
the XML element E_j by combining the prior probability P_M(E_j) and the probability
P_M(w | E_j) of the term occurring in a given element type:

P_M(E_j \mid w) = \frac{P_M(w \mid E_j) \, P_M(E_j)}{P(w)} = \frac{P_M(w \mid E_j) \, P_M(E_j)}{\sum_{E_k \in E} P_M(w \mid E_k) \, P_M(E_k)}

P_M(w | E_j) is calculated by dividing the number of occurrences of the term w by the
total term count in the element E_j across the whole collection.
In other words, P_M(w | E_j) is the probability of generating the word w from a virtual
document created by combining all E_j elements in the collection. Also, P_M(E_j) denotes the
prior probability of the element E_j being mapped into any query term before observing
collection statistics. Following this estimation procedure, the mapping probability can be
viewed as a normalized query-likelihood score for each element type given a query. It can also
be interpreted as an effort to capture the 'significance' of a query term for each element type.
For instance, the query term 'romance' can be found in both the genre and plot elements of a
movie XML document, yet 'romance' is more significant in the genre element than in the plot
element, and therefore genre is more likely to be the element the user intended.
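A small sketch of this estimation is given below; the collection statistics and priors are hypothetical, and the element-level term counts are assumed to be pre-aggregated, which is a simplification of what a search-engine index would actually store.

from collections import Counter

def mapping_probabilities(term, element_stats, priors):
    """Estimate P_M(E_j | w) from collection statistics via Bayes' theorem.

    element_stats: dict element_type -> Counter of term occurrences aggregated
                   over all elements of that type in the collection (hypothetical).
    priors:        dict element_type -> P_M(E_j), the prior mapping probability.
    """
    likelihood = {}
    for e, counts in element_stats.items():
        total = sum(counts.values())
        likelihood[e] = (counts[term] / total) if total else 0.0   # P_M(w | E_j)
    norm = sum(likelihood[e] * priors[e] for e in element_stats)
    if norm == 0:
        return {e: 0.0 for e in element_stats}
    return {e: likelihood[e] * priors[e] / norm for e in element_stats}

# Hypothetical movie-collection statistics for the term "romance".
stats = {"genre": Counter({"romance": 300, "drama": 700}),
         "plot":  Counter({"romance": 200, "love": 400, "war": 9400})}
priors = {"genre": 0.5, "plot": 0.5}
print(mapping_probabilities("romance", stats, priors))  # genre dominates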
Another point worth remarking is that it is relatively cheap to calculate this mapping
probability since it is based on collection statistics, which is already available in the search
engine index. Therefore, no additional indexing is required for mapping probability
calculation. Also, this can be done before a user issues a query, thereby having virtually no
impact on the perceived speed of retrieval.
KEYWORD SEARCH WITH BANKS
Keyword searching in BANKS is done using proximity based ranking, based on
foreign key links and other types of links. We model the database as a graph, with the tuples
as nodes and cross references between them as edges. BANKS allows query keywords to
match data (tokens appearing in any textual attribute), and meta data (e.g., column or relation
name). The greatest value of BANKS lies in near zero-effort Web publishing of relational
data which would otherwise remain invisible to the Web [2]. BANKS may be used to publish
organizational data, bibliographic data, and electronic catalogs.
Search facilities for such applications can be hand-crafted: many Web sites provide
forms to carry out limited types of queries on their backend databases. For example, a
university Web site may provide a form interface to search for faculty and students. Searching
for departments would require yet another form, as would searching for courses offered.
Creating an interface for each such task is laborious, and is also confusing to users since they
must first expend effort finding which form to use.
An approach taken in some cases is to export data from the database to Web pages,
and then provide text search on Web documents. This approach results in duplication of data,
with resultant problems of keeping the versions up-to-date, in addition to space and time
overheads. Further, with highly connected data, it is not feasible to export every possible
combination. For instance, a bibliographic database can export details of each paper as a Web
document, but a query that requires finding a citation link between two papers would not be
supported.

ALGORITHM FOR QUERY SEGMENTATION


For two segmentations S_1 and S_2 of the same query, suppose they differ at only one
segment boundary, i.e.,

S_1 = s_{i_1 k_1}, s_{i_2 k_2}, \ldots, s_{i_{l-1} k_{l-1}}, s_{i_l k_l}, s_{i_{l+1} k_{l+1}}, s_{i_{l+2} k_{l+2}}, \ldots, s_{i_m k_m}
S_2 = s_{i_1 k_1}, s_{i_2 k_2}, \ldots, s_{i_{l-1} k_{l-1}}, s'_{i_l k_l}, s_{i_{l+2} k_{l+2}}, \ldots, s_{i_m k_m}

where s'_{i_l k_l} = s_{i_l k_l} s_{i_{l+1} k_{l+1}} is the concatenation of s_{i_l k_l}
and s_{i_{l+1} k_{l+1}}.
That means we favour segmentations with a higher probability of generating the query. So,
in the above, the language model assigns a higher probability to P(S_1) than to P(S_2) if and
only if P(s_{i_l k_l}, s_{i_{l+1} k_{l+1}}) > P(s'_{i_l k_l}).
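A minimal sketch of this comparison is shown below; seg_prob is a hypothetical segment language model, and segment independence is assumed so that the shared segments cancel out and only the differing boundary matters.

import math

def log_prob_of_segmentation(segments, seg_prob):
    """Log-probability of a segmentation under a segment language model,
    assuming segment independence (a simplifying assumption)."""
    return sum(math.log(seg_prob(s)) for s in segments)

def prefer_split(seg_a, seg_b, merged, seg_prob):
    """Return True if keeping seg_a and seg_b as two segments (as in S1) is more
    likely than merging them into the single segment `merged` (as in S2); the
    other segments are identical in S1 and S2 and therefore cancel out."""
    return seg_prob(seg_a) * seg_prob(seg_b) > seg_prob(merged)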
SUPERVISED LEARNING APPROACH
This is like the classic named entity recognition (NER) problem in natural language
processing. It involves query labelling and training with the labelled queries. NER systems
have been created that use linguistic grammar-based techniques as well as statistical models.
Hand-crafted grammar-based systems typically obtain better precision, but at the cost of
lower recall and months of work by experienced computational linguists. Statistical NER
systems typically require a large amount of manually annotated training data. Statistical NER
systems usually find the sequence of tags N that maximizes the probability p(N | S), where S
is the sequence of words in a sentence and N is the sequence of named-entity tags assigned to
the words in S. Supervised learning methods represent linguistic information in the form of
features. Each feature indicates the occurrence of a certain attribute in a context that contains
a linguistic ambiguity.
That context is the text surrounding the ambiguity that is relevant to the disambiguation
process. The features used can be of distinct natures: word collocations, part-of-speech labels,
keywords, topic and domain information, etc. A method using supervised learning tries to
classify a context containing an ambiguous word or compound word into one of its possible
senses by means of a classification function. This function is obtained after a training process
on a sense-tagged corpus. The information source for this training is the set of results of the
feature evaluations on each context, that is, each context has its vector of feature values. The
supervised learning method used in this paper to do such analysis is based on a Maximum
Entropy Markov Model (MEMM).


3. PROBLEM DESCRIPTION
The actual data mining task is the automatic or semi-automatic analysis of large
quantities of data to extract previously unknown interesting patterns such as groups of data
records, unusual records and dependencies. This usually involves using database techniques
such as spatial indexes. These patterns can then be seen as a kind of summary of the input
data, and may be used in further analysis or, for example, in machine learning and predictive
analytics.
Keyword query interfaces (KQIs) for databases have attracted much attention in the
last decade due to their flexibility and ease of use in searching and exploring the data. Since
any entity in a data set that contains the query keywords is a potential answer, keyword
queries typically have many possible answers. KQIs must identify the information needs
behind keyword queries and rank the answers so that the desired answers appear at the top of
the list. Unless otherwise noted, we refer to a keyword query simply as a query in the
remainder of this project.


4. EXISTING SYSTEM
Databases contain entities, and entities contain attributes that take attribute values.
Users do not give enough information to single out exactly their desired entities; even with
structured data, finding the desired answers to keyword queries is still a hard task. This
becomes more apparent when looking closer at the ranking quality. Pre-retrieval methods
predict the difficulty of a query without computing its results. These methods usually use the
statistical properties of the terms in the query to measure the specificity, ambiguity, or
term-relatedness of the query to predict its difficulty. They generally assume that the more
discriminative the query terms are, the easier the query will be. Empirical studies indicate that
these methods have limited prediction accuracy.

4.1 DRAWBACKS OF EXISTING SYSTEM

Existing systems suffer from low ranking quality.

They perform very poorly on a subset of queries.

In a traditional keyword-search system over XML data, a user composes a query,
submits it to the system, and retrieves relevant answers from the XML data.


5. PROPOSED SYSTEM
We introduce the problem of predicting the degree of difficulty of keyword
queries over databases. We also analyze the reasons that make a keyword query difficult for
KQIs to answer. We propose the Structured Robustness (SR) score, which measures the
difficulty of a keyword query based on the differences between the rankings of the same
query over the original and noisy versions of the same database, where the noise spans
both the content and the structure of the result entities.
We introduce efficient algorithms to compute the SR score, given that such a
measure is only useful when it can be computed with a small cost overhead compared to the
query execution cost. Our most efficient algorithm accurately estimates the SR score by
combining two seemingly independent steps into a single step.
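A simplified sketch of the SR score computation follows; rank and corrupt are hypothetical stand-ins for the ranking module and the noise generator, and Spearman's rank correlation is used here as one possible ranking-similarity measure.

from scipy.stats import spearmanr

def structured_robustness_score(query, db, rank, corrupt, n_trials):
    """Sketch of the SR score: rank the query over the original database and
    over noisy versions of it, and report the average rank correlation.

    rank:    callable (query, db) -> ordered list of result-entity ids.
    corrupt: callable db -> a noisy copy of db, where the noise touches both
             the content and the structure of the result entities (hypothetical).
    """
    original = rank(query, db)
    correlations = []
    for _ in range(n_trials):
        noisy = rank(query, corrupt(db))
        # Compare the two rankings over the entities of the original ranking;
        # entities missing from the noisy ranking get the worst position.
        positions = {e: i for i, e in enumerate(noisy)}
        noisy_ranks = [positions.get(e, len(noisy)) for e in original]
        rho, _ = spearmanr(range(len(original)), noisy_ranks)
        correlations.append(rho)
    return sum(correlations) / n_trials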

5.1 ADVANTAGES OF PROPOSED SYSTEM

Easily mapped to both XML and relational data.

Higher prediction accuracy and minimized time overhead.

Each query with multiple keywords needs to be answered efficiently.


6. SYSTEM METHODOLOGY
STRUCTURED ROBUSTNESS ALGORITHM
We present the Structured Robustness Algorithm (SR Algorithm), which computes the exact
SR score based on the top-K result entities. Each ranking algorithm uses some statistics about
query terms or attribute values over the whole content of the DB. Examples of such statistics
are the number of occurrences of a query term in all attribute values of the DB, or the total
number of attribute values in each attribute and entity set. These global statistics are stored in
M (metadata) and I (inverted indexes) in the SR Algorithm pseudocode.
The SR Algorithm generates the noise in the DB on the fly during query processing. Since
it corrupts only the top-K entities, which are returned by the ranking module anyway,
it does not perform any extra I/O access to the DB, except to look up some statistics.
Moreover, it uses information that is already computed and stored in the inverted indexes
and does not require any extra index.
Some of the reasons for the SR Algorithm's inefficiency are the following. First, Line 5 of
the SR Algorithm loops over every attribute value in each top-K result and tests whether it must
be corrupted. As noted before, one entity may have hundreds of attribute values. We must note
that attribute values that do not contain any query term still must be corrupted (Lines 8-10
of the SR Algorithm) for the second and third levels of corruption defined in Equation 10.
This is because their attributes or entity sets may contain some query keywords, which
largely increases the number of attribute values to be corrupted. For instance, for IMDB,
which has only two entity sets, the SR Algorithm corrupts all attribute values in the top-K
results for all query keywords. Second, ranking algorithms for DBs are relatively slow, and the
SR Algorithm has to re-rank the top-K entities N times, which is time consuming.
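The corruption test in Line 5 can be sketched as follows; attribute_matches, entity_set_matches, and corrupt_value are hypothetical helpers, and the sketch only illustrates why values without any query term still get corrupted when their attribute or entity set matches a keyword.

def corrupt_top_k(top_k_entities, query_terms, attribute_matches,
                  entity_set_matches, corrupt_value):
    """Corrupt the attribute values of the top-K result entities on the fly.

    attribute_matches(attr):     True if the attribute (over the whole DB)
                                 contains some query keyword.
    entity_set_matches(es):      True if the entity set contains some query keyword.
    corrupt_value(value, terms): returns a noisy version of the value (hypothetical).
    """
    corrupted = []
    for entity in top_k_entities:
        noisy_attrs = {}
        for attr, value in entity["attributes"].items():
            value_hit = any(t in value for t in query_terms)
            # Levels 2 and 3: corrupt even non-matching values when their
            # attribute or entity set contains query keywords elsewhere in the DB.
            if value_hit or attribute_matches(attr) or entity_set_matches(entity["entity_set"]):
                noisy_attrs[attr] = corrupt_value(value, query_terms)
            else:
                noisy_attrs[attr] = value
        corrupted.append({**entity, "attributes": noisy_attrs})
    return corrupted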


SGS-APPROX:
The results of applying SGS-Approx on INEX and SemSearch show the following. Since
re-ranking is done on the fly during the corruption, SR-time is reported as corruption time
only. The efficiency improvement on the INEX dataset is slightly worse than for QAO-Approx,
but the quality (correlation score) remains high. SGS-Approx outperforms QAO-Approx in
terms of both efficiency and effectiveness on the SemSearch dataset.
COMBINATION OF QAO-APPROX AND SGS-APPROX:
We can combine the QAO-Approx and SGS-Approx algorithms to achieve better
performance. We present the results of the combined algorithm for the INEX and SemSearch
databases, respectively. Since we use SGS-Approx, the SR-time consists only of corruption
time. Our results show that the combination of the two algorithms works more efficiently than
either of them with the same value of N.
IR-STYLE RANKING ALGORITHM:
The best value of MAP for the IR-Style ranking algorithm over INEX is 0.134 for
K = 20, which is very low. Note that we tried both Equation 12 and the vector space
model originally used. Thus, we do not study the quality of performance prediction for the
IR-Style ranking algorithm over INEX. On the other hand, the IR-Style ranking algorithm
using Equation 12 delivers a larger MAP value than PRMS on the SemSearch dataset. Hence,
we only present results on SemSearch. Table 5 shows the Pearson's correlation of the SR score
with the average precision for different values of K, for N = 250 and (A, T, S) = (1, 0.1, 0.6).
We also plot the SR score against the average precision for K = 20.
BASELINE PREDICTION METHODS:
We use the Clarity score (CR), the Unstructured Robustness Method (URM), Weighted
Information Gain (WIG), Normalized Query Commitment (NQC), and the prevalence of query
keywords as baseline query difficulty prediction algorithms for databases. CR and URM are
two popular post-retrieval query difficulty prediction techniques over text documents. WIG
and NQC are also post-retrieval predictors; they have been proposed more recently and are
shown to achieve better query difficulty prediction accuracy than CR and URM.


To implement CR, URM, WIG, and NQC, we concatenate the XML elements and tags
of each entity into a text document and assume all entities (now text documents) belong to
one entity set. The mapping probabilities in the PRMS ranking formula are set to 1 for every
query term; hence, PRMS becomes a language model retrieval method for text documents [7].
We have separately trained the parameters of these methods on each dataset, using the whole
query workload as the training data, to obtain the optimal settings for these methods.
PREVALENCE OF QUERY KEYWORDS
If the query keywords appear in many entities, attributes, or entity sets, it is harder for a
ranking algorithm to locate the desired entities. Given a query Q, we compute the average
number of attributes (AA(Q)), the average number of entity sets (AES(Q)), and the average
number of entities (AE(Q)) in which each keyword in Q occurs. We consider each of these
three values as an individual baseline difficulty prediction metric. We also multiply these three
metrics to create another baseline metric, denoted AS(Q). Intuitively, if these metrics have
higher values for a query Q, then Q should be harder and have lower average precision. Thus,
we use the inverses of these values, denoted iAA(Q), iAES(Q), iAE(Q), and iAS(Q),
respectively.
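A sketch of these baselines, assuming pre-computed collection statistics per keyword, could look like this.

def prevalence_baselines(query_terms, occurrences):
    """Compute the inverse-prevalence baselines iAA, iAES, iAE and iAS.

    occurrences: dict term -> dict with keys "attributes", "entity_sets" and
                 "entities" giving the number of attributes, entity sets and
                 entities in which the term occurs (hypothetical statistics).
    """
    m = len(query_terms)
    aa  = sum(occurrences[t]["attributes"]  for t in query_terms) / m
    aes = sum(occurrences[t]["entity_sets"] for t in query_terms) / m
    ae  = sum(occurrences[t]["entities"]    for t in query_terms) / m
    as_ = aa * aes * ae
    inv = lambda x: 1.0 / x if x else float("inf")
    return {"iAA": inv(aa), "iAES": inv(aes), "iAE": inv(ae), "iAS": inv(as_)}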


7. SYSTEM IMPLEMENTATION

The system comprises the following modules:

Keyword searching using BANKS
IR Ranking
Candidate Network Generator
Steiner-tree-based search
Joined Tuple Tree Algorithm

Keyword searching using BANKS

Relational databases are commonly searched using structured query languages. The
user needs to know the data schema to be able to ask suitable queries. Search engines on the
Web have popularized an alternative unstructured querying and browsing paradigm that is
simple and user-friendly. Users type in keywords and follow hyperlinks to navigate from one
document to the other. No knowledge of schema is needed. In relational databases,
information needed to answer a keyword query is often split across the tables/tuples, due to
normalization. The BANKS system enables data and schema browsing together with
keyword-based search for relational databases.
IR Ranking
With the amount of available text data in relational databases growing rapidly, the
need for ordinary users to search such information is dramatically increasing. Even though
the major RDBMSs have provided full-text search capabilities, they still require users to have
knowledge of the database schemas and use a structured query language to search
information. This search model is complicated for most ordinary users. Inspired by the big
success of information retrieval (IR) style keyword search on the web, keyword search in
relational databases has recently emerged as a new research topic.
Candidate Network Generator
The Candidate Network Generator takes as input the set of keywords k1, . . . , km, the
non-empty tuple sets R_{Ki}, and the maximum candidate network size T, and outputs a
complete and non-redundant set of candidate networks. The key challenge is to avoid the
generation of redundant joining networks of tuple sets.
The solution to this problem requires an analysis of the conditions that force a joining
network of tuples to be non-minimal; the condition for the totality of the network is
straightforward. As the amount of information stored in databases increases, so does the need
for efficient information discovery. Keyword search enables information discovery without
requiring the user to know the schema of the database, SQL or some QBE-like interface,
or the roles of the various entities and terms used in the query.
DISCOVER is a system that performs keyword search in relational databases and does not
require knowledge of the database schema or of a querying language. It proceeds in three
steps: first, it generates the smallest set of candidate networks.
Steiner-tree-based search
A relational database can be modeled as a database graph G = (V, E) such that there is
a one-to-one mapping between a tuple in the database and a node in V. G can be considered
a directed graph with two types of edges: a forward edge (u, v) ∈ E iff there is a foreign key
from u to v, and a back edge (v, u) iff (u, v) is a forward edge in E. An edge (u, v) indicates a
close relationship between the tuples u and v (i.e., they can be directly joined together), and
the introduction of two edge types allows differentiating the importance of u to v and vice
versa. When such separation is not necessary for some applications, G becomes an undirected
graph. To support keyword search over relational data, G is typically modeled as a weighted
graph, with a node weight for each node v ∈ V to represent its prestige level and an edge
weight for each edge (u, v) ∈ E to represent the strength of the closeness relationship
between the two tuples.
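A sketch of this graph construction is shown below, using the networkx library as an illustrative choice; the actual BANKS implementation is not tied to it.

import networkx as nx

def build_database_graph(tuples, foreign_keys):
    """Model the database as a directed graph: one node per tuple, a forward
    edge (u, v) for each foreign-key reference from u to v and a back edge
    (v, u) for the opposite direction, so the two directions can carry
    different weights if an application needs them.

    tuples:       iterable of tuple identifiers, e.g. (relation, primary_key).
    foreign_keys: iterable of (u, v) pairs meaning "u references v".
    """
    g = nx.DiGraph()
    g.add_nodes_from(tuples)
    for u, v in foreign_keys:
        g.add_edge(u, v, kind="forward", weight=1.0)
        g.add_edge(v, u, kind="back", weight=1.0)
    return g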
Joined Tuple Tree Algorithm
This module addresses the effectiveness and efficiency issues of answering top-k keyword
queries in relational database systems. It uses a ranking formula obtained by adapting existing
IR techniques based on a natural notion of a virtual document. Compared with previous
approaches, this ranking method is simple yet effective, and agrees with human perceptions.
Extensive experiments have been conducted on large-scale real databases using two popular
RDBMSs. The module focuses on the problem of supporting effective and efficient top-k
keyword search in relational databases. While many RDBMSs support full-text search, they
only allow retrieving relevant tuples from within the same relation.

8. RESULTS AND DISCUSSION


According to our performance study of QAO-Approx, SGS-Approx, and the
combined algorithm over both datasets, the combined algorithm delivers the best balance of
improvement in efficiency and reduction in effectiveness for both datasets. On both datasets,
the combined algorithm achieves high prediction accuracy (a Pearson's correlation score of
about 0.5) with an SR-time of around 1 second. Using the combined algorithm over INEX
with N set to 20, the Pearson's and Spearman's correlation scores are 0.513 and 0.396
respectively, and the time decreases to about 1 second. For the SR Algorithm on INEX,
when N decreases to 10, the Pearson's correlation is 0.537, but the SR-time is over 9.8 seconds,
which is not ideal. If we use the combined algorithm on SemSearch, the Pearson's and
Spearman's correlation scores are 0.495 and 0.587 respectively and the SR-time is about 1.1
seconds when N = 50. However, to achieve a similar running time, SGS-Approx needs to
decrease N to 10, with an SR-time of 1.2 seconds, a Pearson's correlation of 0.49 and a
Spearman's correlation of 0.581. Thus, the combined algorithm is the best choice to predict
the difficulty of queries both efficiently and effectively.
The time to compute the SR score depends only on the top-K results, since only the
top-K results are corrupted and re-ranked (see Section 6). Increasing the data set size will only
increase the query processing time, which is not the focus of this project. The complexity of
the data schema could have an impact on the efficiency of our model. A simpler schema does
not necessarily mean shorter SR computation time, since more attribute values of the same
attribute types of interest exist and therefore more attribute values need to be corrupted. The
latter is supported by the longer corruption times incurred by INEX, which has a simpler
schema than SemSearch.


9. CONCLUSION
We introduced the novel problem of predicting the effectiveness of keyword queries
over DBs. We showed that the current prediction methods for queries over unstructured data
sources cannot be effectively used to solve this problem. We set forth a principled framework
and proposed novel algorithms to measure the degree of the difficulty of a query over a DB,
using the ranking robustness principle. Based on our framework, we proposed novel
algorithms that efficiently predict the effectiveness of a keyword query. Our extensive
experiments show that the algorithms predict the difficulty of a query with relatively low
errors and negligible time overheads.


10. FUTURE ENHANCEMENT


11. APPENDIX
[1] V. Hristidis, L. Gravano, and Y. Papakonstantinou, "Efficient IR-style keyword search over
relational databases," in Proc. 29th VLDB Conf., Berlin, Germany, 2003, pp. 850-861.
[2] Y. Luo, X. Lin, W. Wang, and X. Zhou, "SPARK: Top-k keyword query in relational
databases," in Proc. 2007 ACM SIGMOD, Beijing, China, pp. 115-126.
[3] V. Ganti, Y. He, and D. Xin, "Keyword++: A framework to improve keyword search over
entity databases," in Proc. VLDB Endowment, Singapore, Sept. 2010, vol. 3, no. 1-2, pp.
711-722.
[4] J. Kim, X. Xue, and B. Croft, "A probabilistic retrieval model for semistructured data," in
Proc. ECIR, Toulouse, France, 2009, pp. 228-239.
[5] N. Sarkas, S. Paparizos, and P. Tsaparas, "Structured annotations of web queries," in Proc.
2010 ACM SIGMOD Int. Conf. Manage. Data, Indianapolis, IN, USA, pp. 771-782.
[6] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan, "Keyword searching
and browsing in databases using BANKS," in Proc. 18th ICDE, San Jose, CA, USA, 2002,
pp. 431-440.
[7] C. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. New
York, NY: Cambridge University Press, 2008.
[8] A. Trotman and Q. Wang, "Overview of the INEX 2010 data centric track," in 9th Int.
Workshop INEX 2010, Vught, The Netherlands, pp. 132.
[9] T. Tran, P. Mika, H. Wang, and M. Grobelnik, "SemSearch '10," in Proc. 3rd Int. WWW
Conf., Raleigh, NC, USA, 2010.
[10] S. C. Townsend, Y. Zhou, and B. Croft, "Predicting query performance," in Proc. SIGIR
'02, Tampere, Finland, pp. 299-306.
[11] A. Nandi and H. V. Jagadish, "Assisted querying using instant-response interfaces," in
Proc. SIGMOD '07, Beijing, China, pp. 1156-1158.
[12] E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl, "DivQ: Diversification for
keyword search over structured databases," in Proc. SIGIR '10, Geneva, Switzerland, pp.
331-338.
[13] Y. Zhou and B. Croft, "Ranking robustness: A novel framework to predict query
performance," in Proc. 15th ACM Int. CIKM, Geneva, Switzerland, 2006, pp. 567-574.
[14] B. He and I. Ounis, "Query performance prediction," Inf. Syst., vol. 31, no. 7, pp.
585-594, Nov. 2006.
[15] K. Collins-Thompson and P. N. Bennett, "Predicting query performance via
classification," in Proc. 32nd ECIR, Milton Keynes, U.K., 2010, pp. 140-152.
[16] A. Shtok, O. Kurland, and D. Carmel, "Predicting query performance by query-drift
estimation," in Proc. 2nd ICTIR, Heidelberg, Germany, 2009, pp. 305-312.
[17] Y. Zhou and W. B. Croft, "Query performance prediction in web search environments,"
in Proc. 30th Annu. Int. ACM SIGIR, New York, NY, USA, 2007, pp. 543-550.
[18] Y. Zhao, F. Scholer, and Y. Tsegay, "Effective pre-retrieval query performance prediction
using similarity and variability evidence," in Proc. 30th ECIR, Berlin, Germany, 2008, pp.
52-64.
[19] C. Hauff, L. Azzopardi, and D. Hiemstra, "The combination and evaluation of query
performance prediction methods," in Proc. 31st ECIR, Toulouse, France, 2009, pp. 301-312.
[20] C. Hauff, V. Murdock, and R. Baeza-Yates, "Improved query difficulty prediction for the
Web," in Proc. 17th CIKM, Napa Valley, CA, USA, 2008, pp. 439-448.
[21] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. San Francisco,
CA: Morgan Kaufmann, 2011.


SAMPLE SCREENS

