
Answering Top-k Queries Using Views
By:
Gautam Das (Univ. of Texas),
Dimitrios Gunopulos (Univ. of California Riverside),
Nick Koudas (Univ. of Toronto),
Dimitris Tsirogiannis (Univ. of Toronto)




Presented By:
Kushal Shah
Lipsa Patel



Views
Definition: Views

Declaring Views

Advantages of using Views

Views
A view may be thought of as a table that is derived
from one or more underlying base tables.

Two kinds:
1. Virtual: Not stored in the database; just a
query for constructing the relation.
2. Materialized: Actually constructed and
stored.
Declaring Views
CREATE [MATERIALIZED] VIEW <name> AS <query>;

Virtual is the default; the MATERIALIZED keyword makes the view stored.
Advantages of using Views
If a database has several tables and we want to see only
specific columns from specific tables, we can use views.

Security: permissions configured on a view allow specific
users to see only specific columns.
Top-k Query
Top-k Query Processing Definition

Top-k Example

Algorithms for Top-k Query Processing
Top-k Query Processing
Top-k query processing = finding the k objects that have the highest overall score.
Top-k Example
Users' preferences regarding the ordering of the tuples of a
relation can be expressed as a scoring function on the
attributes of the relation, e.g.
f_q = 3x1 + 2x2 + 5x3
The top-k problem is to find the k tuples with the highest
score according to a given scoring function.

Relation R:
tid  X1  X2  X3
1    82   1  59
2    53  19  83
3    29  99  15
4    80  45   8
5    28  32  39

Scores under f_q:
tid  Score
2    612
1    543
4    370
3    360
5    343
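As a minimal sketch of the example above, the snippet below scores the five tuples of R with f_q = 3x1 + 2x2 + 5x3 and keeps the k best. The names R, f_q and top_k are illustrative only, and a full scan is assumed simply because the relation is tiny.

```python
import heapq

# Relation R from the slide: tid -> (x1, x2, x3).
R = {1: (82, 1, 59), 2: (53, 19, 83), 3: (29, 99, 15),
     4: (80, 45, 8), 5: (28, 32, 39)}

def f_q(t):
    """Scoring function of the example: 3*x1 + 2*x2 + 5*x3."""
    x1, x2, x3 = t
    return 3 * x1 + 2 * x2 + 5 * x3

def top_k(relation, score, k):
    """Return the k best (score, tid) pairs, highest score first."""
    return heapq.nlargest(k, ((score(t), tid) for tid, t in relation.items()))

print(top_k(R, f_q, 2))
```

With k = 5 this reproduces the score table of the slide; the algorithms discussed next aim to avoid scoring every tuple.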
Algorithms for Top-k Query Processing
Which algorithms exist (related work), and how does this
approach complement them?
TA [Fagin]
PREFER [Hristidis]: stores multiple copies of a relation,
each ordered according to a different scoring function.

To answer a top-k query, PREFER uses the single copy whose
scoring function is closest to the scoring function of the query.
Example: Simple Database Model
N objects, each with M attribute grades; one sorted list per attribute.

Object ID  Attribute 1  Attribute 2
a          0.9          0.85
b          0.8          0.7
c          0.72         0.2
d          0.6          0.9
...

Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6), ...
Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), ...




Example: Threshold Algorithm
L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6), ...
L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), ...

Step 1: parallel sorted access to each list.
For each object seen:
- get all grades by random access
- determine Min(A1, A2)
- if it is among the 2 highest seen, keep it in the buffer

ID  A1    A2    Min(A1, A2)
a   0.9   0.85  0.85
d   0.6   0.9   0.6

Example: Threshold Algorithm
Step 2: determine the threshold value from the grades last seen
under sorted access: T = min(L1, L2).
T = min(0.9, 0.9) = 0.9
- If 2 objects have an overall grade >= T, stop;
  else go to the next entry position in each sorted list and repeat step 1.

ID  A1    A2    Min(A1, A2)
a   0.9   0.85  0.85
d   0.6   0.9   0.6

Example: Threshold Algorithm
Step 1 (again): parallel sorted access to each list.
For each object seen:
- get all grades by random access
- determine Min(A1, A2)
- if it is among the 2 highest seen, keep it in the buffer

ID  A1    A2    Min(A1, A2)
a   0.9   0.85  0.85
d   0.6   0.9   0.6
b   0.8   0.7   0.7

Example: Threshold Algorithm
Step 2 (again): determine the threshold value from the grades last seen
under sorted access: T = min(L1, L2).
T = min(0.8, 0.85) = 0.8
- If 2 objects have an overall grade >= T, stop;
  else go to the next entry position in each sorted list and repeat step 1.

ID  A1    A2    Min(A1, A2)
a   0.9   0.85  0.85
b   0.8   0.7   0.7
Example: Threshold Algorithm
Situation at the stopping condition
T = min(0.72, 0.7) = 0.7
Both buffered objects now have an overall grade >= T, so the algorithm stops.

ID  A1    A2    Min(A1, A2)
a   0.9   0.85  0.85
b   0.8   0.7   0.7
c   0.72  0.2   0.2   (seen, not in the top 2)
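The walkthrough above can be sketched in a few lines of Python. This is an illustrative version of TA with Min as the aggregate, restricted to the four objects listed on the slides (the elided "..." entries are left out); the function name and the returned lock-step depth are assumptions made for the sketch.

```python
# Sorted lists from the slides, truncated to the four visible objects.
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def threshold_algorithm(lists, k, agg=min):
    """Return (top-k [(object, grade)], number of lock-step rounds)."""
    grades = [dict(lst) for lst in lists]   # random-access index per list
    seen = {}                               # object -> overall grade
    for depth in range(len(lists[0])):
        # Step 1: sorted access in parallel, then random access for all grades.
        for lst in lists:
            obj, _ = lst[depth]
            if obj not in seen:
                seen[obj] = agg(g[obj] for g in grades)
        # Step 2: threshold from the grades last seen under sorted access.
        threshold = agg(lst[depth][1] for lst in lists)
        top = sorted(seen.items(), key=lambda p: -p[1])[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth + 1
    return sorted(seen.items(), key=lambda p: -p[1])[:k], len(lists[0])

print(threshold_algorithm([L1, L2], k=2))
```

On this data the loop stops in the third round, when T = 0.7 and the buffer holds a (0.85) and b (0.7), matching the slides.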
Related Work for Top-k Query Processing
TA: Sequential as well as Random Access

PREFER
Approach for Top-k Query Processing
Top-k Query Answering using Views

Views are Materialized (incurring space overhead)

Advantages of using views: increased performance
because views are small in size

Space-Performance tradeoff
Example Views
Three-attribute relation R:
tid  X1  X2  X3
1    82   1  59
2    53  19  83
3    29  99  15
4    80  45   8
5    28  32  39

V1: top-5 query using function f1 = 2x1 + 5x2
tid  Score
3    553
4    385
5    216
2    201
1    169

V2: top-5 query using function f2 = x2 + 2x3
tid  Score
2    351
1    237
5    177
3    159
4    88
Top-k ranking queries in SQL-like syntax: SELECT TOP [k] FROM R ORDER BY Score(q)
Score(q): a function that assigns a numeric score to any tuple t
Ranking views: views whose only purpose is to rank tuples.
A ranking view is the materialized result of a previously asked top-k query.
Can we answer new top-k queries efficiently using ranking
views? Let's see.

Formal Definitions
Ranking Queries

Ranking Views
Ranking Queries
Ranking queries: top-k ranking queries in SQL-like
syntax: SELECT TOP [k] FROM R WHERE Range(q) ORDER BY
Score(q)
A ranking query may be expressed as a triple Q = (Score(q),
k, Range(q)), where
Score(q) = a function that assigns a numeric score to any tuple t
Range(q) = the selection condition on the tuples of R
Semantics: retrieve the k tuples with the top scores
among those satisfying the selection condition.
Ranking Views
Materialized ranking view V:
For a previously executed query Q1 = (Score_Q1, k1, Range_Q1),
the corresponding materialized ranking view is a set of
k1 pairs (tid, Score_Q1(tid)),
ordered by decreasing value of Score_Q1(tid).
Problems we are going to solve
Top-k Query Answer using Views

View Selection
Top-k Query Answer using Views
Given: Set U of views
Query Q

Obtain an answer to Q combining all the information
conveyed by the views in U

Solution: Algorithm named LPTA
View Selection
Problem: given a collection of views V = {V1, ..., Vr} that includes the base
views and a query Q, determine the most efficient subset U
of V on which to execute Q.
Input to LPTA: the subset U.
Baseline for answering a ranking query: running TA on the base
views.
Find the subset U that, when used by LPTA,
1. provides the answer to the query, and
2. provides it faster than running TA on the base
views.
Outline
LPTA Algorithm
View Selection Problem
LPTA: Linear Programming Adaptation of the Threshold Algorithm

1. Scoring function of the query Q: f_Q = 3x1 + 10x2
2. Scoring functions of the views in the subset U:
   V1: f_V1 = 2x1 + 5x2
   V2: f_V2 = x1 + 2x2
LPTA for top-k query answering using views (here a top-1 query)
A view is a set of (tuple identifier, score) pairs.
LPTA requires sorted access on each view in
non-increasing order of that score.

LPTA Example
Relation R:
tid  x1  x2  x3
1    82   1  59
2    53  19  83
3    29   1   2
4    80  22  90
5    28   8  87
6    12  55  82
7    16  99  42
8    18  42  67
9    42   1  23
10   23  21  88

V1: top-5 query, f1 = 2x1 + 5x2
tid  Score
7    527
6    299
4    270
8    246
2    201

V2: top-3 query, f2 = x2 + 2x3
tid  Score
6    219
4    202
10   197

Goal: answer a top-2 query using LPTA.
LPTA Setting
The algorithm initializes the top-k buffer to empty.
It then reads V1 and V2 in lock-step. For each tid read, it performs a
random access on R to retrieve the tuple and computes its score
according to the query function f3 = 3x1 + 10x2 + 5x3.

First sorted access: tid 7 from V1 (score 527) and tid 6 from V2 (score 219).
Tuples retrieved from R:
tid  x1  x2  x3
7    16  99  42
6    12  55  82

Top-2 buffer: (7, 1248), (6, 996)
Next: check the stopping condition.
Check for Stopping Condition
The domain of each attribute of R is [1, 100].
Let sd1 and sd2 be the scores last read from V1 and V2. The unseen tuples
must satisfy the following inequalities:
0 <= x1, x2, x3 <= 100        (1)
2x1 + 5x2 <= sd1              (2)
x2 + 2x3 <= sd2               (3)
Here sd1 = 527 and sd2 = 219.
Unseen_max = the solution of the linear program that maximizes the
function f3 = 3x1 + 10x2 + 5x3 subject to these inequalities, i.e. the
maximum possible score (with respect to the ranking query's scoring
function) of any tuple not yet visited in the views.
The algorithm terminates when the top-k buffer is full and
Unseen_max <= topk_min.
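A minimal sketch of this loop on the example relation follows. It is not the authors' implementation: instead of a real LP solver it finds Unseen_max by brute-force enumeration of the vertices of the feasible region, which is only practical for this three-attribute toy case, and the names score, ranking_view, unseen_max and lpta are chosen for illustration. The bounds 0 and 100 follow inequality (1).

```python
from fractions import Fraction
from itertools import combinations

# Relation R from the example: tid -> (x1, x2, x3).
R = {1: (82, 1, 59), 2: (53, 19, 83), 3: (29, 1, 2), 4: (80, 22, 90),
     5: (28, 8, 87), 6: (12, 55, 82), 7: (16, 99, 42), 8: (18, 42, 67),
     9: (42, 1, 23), 10: (23, 21, 88)}

def score(f, t):
    """Linear scoring function: dot product of coefficients and tuple."""
    return sum(c * x for c, x in zip(f, t))

def ranking_view(f, k):
    """Materialize the top-k (tid, score) pairs under f, best first."""
    pairs = [(tid, score(f, t)) for tid, t in R.items()]
    return sorted(pairs, key=lambda p: -p[1])[:k]

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule; None if singular."""
    d = det3(rows)
    if d == 0:
        return None
    sol = []
    for col in range(3):
        m = [list(r) for r in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        sol.append(Fraction(det3(m), d))
    return sol

def unseen_max(query, view_funcs, last_scores, lo=0, hi=100):
    """Maximize the query score subject to f_v(x) <= sd_v and lo <= x_i <= hi.

    An LP optimum lies at a vertex, and every vertex is the intersection
    of three constraints, so all triples of constraints are tried.
    """
    cons = list(zip(view_funcs, last_scores))
    for i in range(3):
        unit = [0, 0, 0]
        unit[i] = 1
        cons.append((tuple(unit), hi))                # x_i <= hi
        cons.append((tuple(-u for u in unit), -lo))   # -x_i <= -lo
    best = None
    for trio in combinations(cons, 3):
        x = solve3([c for c, _ in trio], [b for _, b in trio])
        if x is None:
            continue
        if all(score(c, x) <= b for c, b in cons):    # feasible vertex
            value = score(query, x)
            best = value if best is None else max(best, value)
    return best

def lpta(query, views, view_funcs, k):
    """Return (top-k list, lock-step depth), or None if the views run out."""
    seen = {}
    for depth in range(min(len(v) for v in views)):
        last = []
        for view in views:
            tid, s = view[depth]                # sorted access on the view
            last.append(s)
            seen[tid] = score(query, R[tid])    # random access on R
        top = sorted(seen.items(), key=lambda p: -p[1])[:k]
        if len(top) == k and unseen_max(query, view_funcs, last) <= top[-1][1]:
            return top, depth + 1
    return None

f1, f2, f3 = (2, 5, 0), (0, 1, 2), (3, 10, 5)
V1, V2 = ranking_view(f1, 5), ranking_view(f2, 3)
result = lpta(f3, [V1, V2], [f1, f2], k=2)
print(result)
```

Under these assumptions the first check gives Unseen_max = 1338 with sd1 = 527 and sd2 = 219, which exceeds topk_min = 996, so the sketch continues; after the second lock-step access the bound falls below 996 and the loop stops with tuples 7 and 6 in the buffer.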
Calculating Unseen_max

Unseen_max = the solution of the linear program that maximizes the
function f3 = 3x1 + 10x2 + 5x3 subject to inequalities (1)-(3).
A linear programming problem may be defined as the problem of
maximizing or minimizing a linear function subject to linear
constraints. The constraints may be equalities or inequalities. Here is
a simple example.
Find numbers x1 and x2 that maximize the sum x1 + x2 subject to the
constraints
x1 >= 0, x2 >= 0, and
x1 + 2x2 <= 4
4x1 + 2x2 <= 12
-x1 + x2 <= 1
Objective function: the function to maximize, here x1 + x2.
Convex region: this system of inequalities defines a
convex region.

Occasionally, the maximum occurs
along an entire edge or face of the
constraint set, but then the maximum
occurs at a corner point as well.
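The corner-point property can be checked directly on the small example. The sketch below assumes the constraints read x1 >= 0, x2 >= 0, x1 + 2x2 <= 4, 4x1 + 2x2 <= 12 and -x1 + x2 <= 1 (the inequality signs are reconstructed), intersects every pair of constraint lines, keeps the feasible intersection points, and evaluates x1 + x2 at each. The function name corner_points is illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is (a1, a2, b), meaning a1*x1 + a2*x2 <= b.
CONSTRAINTS = [(-1, 0, 0), (0, -1, 0), (1, 2, 4), (4, 2, 12), (-1, 1, 1)]

def corner_points(cons):
    """Feasible intersections of pairs of constraint boundary lines."""
    points = set()
    for (a1, a2, b), (c1, c2, d) in combinations(cons, 2):
        det = a1 * c2 - a2 * c1
        if det == 0:
            continue                     # parallel boundaries never meet
        x1 = Fraction(b * c2 - a2 * d, det)
        x2 = Fraction(a1 * d - b * c1, det)
        if all(p * x1 + q * x2 <= r for p, q, r in cons):
            points.add((x1, x2))
    return points

best = max(corner_points(CONSTRAINTS), key=lambda p: p[0] + p[1])
print(best, best[0] + best[1])
```

Enumeration like this grows quickly with the number of constraints; it is shown only to make the geometric statement concrete.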
LPTA - Example
[Figure: views V1 and V2 shown as sorted lists of (tid, s) pairs, five entries each, with the first entry of each view read; top-1 query. Relation R(X1, X2) on the normalized domain [0,1], drawn as the unit square O(0,0), P(1,0), R(1,1), T(0,1) with axes X1 and X2, the vectors V1, V2 and Q, and the stopping condition marked.]
Views and the top-k query are represented by vectors denoting the direction of increasing score.
Sorted access on V1: a sweeping line perpendicular to V1, moving from infinity toward the origin.
Score of a tuple with respect to the query: project the tuple onto the vector of the query.
Score of a tuple with respect to a view: project the tuple onto the vector of the view.
Unseen_max: the maximum possible score, with respect to the scoring function of the query, of any tuple not yet visited in the views.
LPTA - Example (cont)
[Figure: the same setting after the second lock-step access, with the first two entries of V1 and V2 read; top-1 query on the unit square O(0,0), P(1,0), R(1,1), T(0,1), with vectors V1, V2 and Q and the stopping condition marked.]
The algorithm stops early if the scoring functions of the views are
similar to the scoring function of the query.
LPTA Algorithm Pseudo Code
There is Sequential as well as Random Access.
Sequential access on views
Random Access on base table to find the tuple
Comparison of LPTA with TA
LPTA becomes TA when the set of views U equals the set of
base views.

Execution cost: both perform sequential as well as
random accesses.
These I/O operations play a significant role and
overshadow the cost of CPU operations such as
updating the top-k buffer, testing the stopping condition,
and so on.

Determining Factor for Performance: LPTA versus TA
Sequential and random accesses are highly correlated: every sequential
access incurs a random access.
As a result, the determining factor for performance is
(the distance from the beginning of the view that each algorithm has
to traverse, i.e. read sequentially, before halting with
the correct answer) x (the number of views participating in
the process).
d = number of lock-step iterations, r = number of views
Running cost: O(dr)
Outline
LPTA Algorithm
View Selection Problem
View Selection Problem
Given a collection of views V = {V1, ..., Vr} and a
query Q, determine the most efficient subset U ⊆ V
on which to execute Q.
Conceptual Discussion of View Selection
Two-attribute relation (two dimensions)
Multi-attribute relation (any number of dimensions)
The domain of each attribute is normalized to [0,1]
An m-attribute relation is referred to as m-dimensional
View Selection: Two Dimensions (same side)
[Figure: unit square OPRT with O(0,0), P(1,0), R(1,1), T(0,1) and axes X, Y; vectors Q, V1 and V2; point M, the minimum top-k tuple; lines AB, AB1 and AB2.]
Two views V1 and V2 and the query Q are represented by vectors.
Both view vectors are on the same side (clockwise) of the query vector.
AB: the line perpendicular to Q that passes through M and intersects the unit square.
ABR: the region containing the top-k tuples.
ABPOT: the region containing the remaining tuples.
Sorted access to V1: a sweeping line perpendicular to V1, moving from infinity toward the origin.
Stopping condition for V1: the sweepline crosses AB1, because the convex polygon AB1POT then contains the unseen tuples and score(unseen) <= Score(M).
Number of sorted accesses: V1 = NumTuples(AB1R), V2 = NumTuples(AB2R)
View Selection: Two Dimensions (same side)
Conclusion
V2 is slower than V1.
If several views in two dimensions are available and
all their vectors are on one side of the query vector,
then it is optimal for LPTA to use the view whose vector is
closest to the query vector.
Estimating the Number of Tuples
One could estimate and compare the numbers of tuples by
simply comparing the areas of the respective triangles.
That approach assumes a uniform
distribution within the triangles, which is often quite
unrealistic.
Our approach to view selection
uses the conceptual conclusions together with
knowledge of the actual data distribution.
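View Selection: Two Dimensions (either side of query)
[Figure: unit square with O(0,0), P(1,0), R(1,1), T(0,1) and axes X, Y; vectors Q, V1 and V2, with V1 and V2 on either side of Q; point M, the minimum top-k tuple; lines AB, A1B and AB1.]
Either V1 alone or V2 alone can be used for execution.
If only V1 is used to answer the query, the stopping condition is
reached once the sweepline perpendicular to V1 crosses position A1B.
For V2 alone, the corresponding position is AB1.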
View Selection: Two Dimensions (either side of query)
[Figure: as on the previous slide, with the additional sweepline positions A1^1 B1^1 (for V1) and A2^1 B2^1 (for V2).]
Is running LPTA on both V1 and V2 better than running it on only one of
them? Are two views better than one?
The stopping condition is reached when the sweeplines cross A1^1 B1^1
and A2^1 B2^1 respectively, such that
1) the intersection point of A1^1 B1^1 and A2^1 B2^1 (the sweeplines
   perpendicular to V1 and V2) is on the line AB, and
2) NumTuples(A1^1 B1^1 R) = NumTuples(A2^1 B2^1 PR), since the algorithm sweeps each view in lock-step.
LPTA on Both Views versus One
With two views, each sweepline stops before the position it would
have to reach if only that one view were used.
Total number of sorted accesses for two views:
NumTuples(A1^1 B1^1 R) + NumTuples(A2^1 B2^1 PR) = 2 NumTuples(A1^1 B1^1 R)
Let N_min = Min(NumTuples(A1BR), NumTuples(AB1PR), 2 NumTuples(A1^1 B1^1 R)).
If N_min = NumTuples(A1BR): use V1.
If N_min = NumTuples(AB1PR): use V2.
Else: use both V1 and V2.
Theorem for the Two-Dimensional Case
Theorem 1: Let V = {V1, ..., Vr} be a set of views and Q a query over a
two-dimensional dataset.
Va = the view closest to the query in the anticlockwise direction
Vc = the view closest to the query in the clockwise direction
So Va and Vc are on either side of the query.
Optimal execution of LPTA requires the use of either
Va or Vc, i.e., the use of a subset of {Va, Vc}.

View Selection: Higher Dimensions
Extension of Theorem 1
Theorem 2: Let V = {V1, ..., Vr} be a set of views and Q a query over an
m-dimensional dataset.
Optimal execution of LPTA requires the use of a subset
of views U ⊆ V such that |U| <= m.

Outline
LPTA Algorithm
View Selection Problem
Cost Estimation Framework
Cost Estimation Framework: Running LPTA
Cost estimation framework: estimates the cost of running LPTA
when a specific set of views is used to answer a query.
Cost = total number of sequential accesses on the views.
Example: using 2 views to answer a query, cost = 6 sequential accesses.
[Figure: vectors Q, V1 and V2, the line AB and the minimum top-k tuple.]
Can we find that cost without actually running LPTA?
Cost Estimation Framework
without Running LPTA
EstimateCost(Q, U): Returns an estimate of the cost
of running LPTA on exactly this set of views: U

Used within SelectViews(Q,V) to search the subset
U that minimizes EstimateCost(Q,U)

EstimateCost(Q,U) takes into account
Multi-attribute views
Non-uniform data distribution
Simulating LPTA on Histograms rather than on Views U
Equi-depth histograms: the number of tuples in
each bucket is the same.
Base table R: n tuples (example: n = 10).
H_i: equi-depth histogram with b buckets (example: b = 2)
that represents the distribution of points along the attribute X_i.

Each bucket represents n/b data points (example: 10/2 = 5).
Simulating LPTA on Histograms rather than on Views U
In our estimation procedure,
H_Q represents the distribution of the scores of all tuples
of the database according to the scoring function of Q.

We cannot calculate the score of every tuple, so
H_Q is approximated.
Simulation of LPTA on Histograms
Simulate LPTA in bucket-by-bucket lock-step to estimate the cost.
[Figure: histograms H_Q, H_V1 and H_V2, with topk_min marked on H_Q and the approximated value of the scoring function.]
H_Q approximates the score distribution of the query Q.
H_V1, H_V2: b-bucket histograms of the score distributions of the
views, with n/b tuples per bucket.
We cannot afford to run LPTA on the views U themselves.
topk_min is estimated in advance because we do not have access to the
actual tuples or their tids. Its value is estimated from H_Q by
determining the bucket that contains the k-th highest tuple.
Since topk_min is very likely inside this bucket, we use linear
interpolation within the bucket to estimate it.

The procedure is cheap because there is one iteration of the
LPTA algorithm for every n/b tuples, using the values
at the bucket boundaries.
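One way to read the interpolation step is sketched below. It assumes that the equi-depth histogram H_Q is given by its ascending bucket boundaries, that scores are spread evenly inside a bucket, and that the k-th highest score sits a fraction rank/(n/b) below the bucket's upper boundary. These details, the function name estimate_topk_min, and the boundary values in the usage line are assumptions for illustration, not taken from the slides.

```python
def estimate_topk_min(boundaries, n, k):
    """Estimate the k-th highest score from an equi-depth histogram.

    boundaries: ascending bucket boundaries (b + 1 values for b buckets)
    n: total number of tuples, so each bucket holds n / b of them
    """
    b = len(boundaries) - 1
    per_bucket = n / b
    # Number of buckets, counted from the top, that lie entirely above
    # the k-th highest tuple.
    skipped = min(int((k - 1) // per_bucket), b - 1)
    hi = boundaries[b - skipped]
    lo = boundaries[b - skipped - 1]
    # Rank of the k-th tuple inside its bucket (1 = highest in bucket).
    rank_in_bucket = k - skipped * per_bucket
    # Linear interpolation between the bucket boundaries.
    return hi - (rank_in_bucket / per_bucket) * (hi - lo)

# Hypothetical histogram: 10 tuples, 2 buckets, boundaries 0 / 400 / 1000.
print(estimate_topk_min([0, 400, 1000], n=10, k=2))
```

Because only bucket boundaries are touched, the estimate costs a handful of arithmetic operations regardless of n.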
Calculating the Estimated Cost
Number of buckets visited along each view: d (example: 3)
Number of views: r1 (example: 2)
Number of tuples per bucket: n/b (example: 10)

Compute the smallest number of tuples n1 that need to be
scanned from the last bucket before stopping (example: n1 = 2).

Estimated number of sorted accesses: ((d-1) n/b + n1) r1
Example: ((2)(10) + 2) 2 = 44
The running time is therefore O((d-1) + log n1) lock-step iterations.
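The estimate is simple enough to check by hand; the short sketch below just encodes the formula ((d-1) n/b + n1) r1 with the slide's example values. The function and parameter names are illustrative.

```python
def estimated_sorted_accesses(d, tuples_per_bucket, n1, r1):
    """((d - 1) * n/b + n1) * r1: full buckets plus the partial last bucket,
    counted once for each of the r1 views."""
    return ((d - 1) * tuples_per_bucket + n1) * r1

# Slide example: d = 3 buckets, n/b = 10 tuples per bucket, n1 = 2, r1 = 2.
print(estimated_sorted_accesses(d=3, tuples_per_bucket=10, n1=2, r1=2))
```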

Outline
LPTA Algorithm
View Selection Problem
Cost Estimation Framework
View Selection Algorithms
EstimateCost(Q, U) Pseudo-code
SelectViews(Q, V): select the subset of views U
that minimizes EstimateCost.
Exhaustive (E) approach: estimate the cost of all
possible subsets of V and select the subset of views
with the smallest cost.
Feasible for databases with few attributes.
Greedy approach: keep expanding the set of views
to use until the estimated cost stops decreasing.
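The greedy strategy can be sketched as follows. The cost estimator is passed in as a callable; estimate_cost here is a hypothetical stand-in for EstimateCost(Q, U), and the toy cost table exists only so the example runs.

```python
def select_views_greedy(query, views, estimate_cost):
    """Grow the view set one view at a time while the estimate improves."""
    chosen, best_cost = [], float("inf")
    remaining = list(views)
    while remaining:
        # Try adding each remaining view and keep the cheapest extension.
        candidate = min(remaining,
                        key=lambda v: estimate_cost(query, chosen + [v]))
        cost = estimate_cost(query, chosen + [candidate])
        if cost >= best_cost:
            break                     # estimated cost stopped decreasing
        chosen.append(candidate)
        remaining.remove(candidate)
        best_cost = cost
    return chosen, best_cost

# Hypothetical cost table, standing in for a histogram-based estimator.
TOY_COST = {
    frozenset(["V1"]): 10, frozenset(["V2"]): 8, frozenset(["V3"]): 12,
    frozenset(["V1", "V2"]): 6, frozenset(["V2", "V3"]): 9,
    frozenset(["V1", "V2", "V3"]): 7,
}

def toy_estimate_cost(query, subset):
    return TOY_COST[frozenset(subset)]

print(select_views_greedy("Q", ["V1", "V2", "V3"], toy_estimate_cost))
```

Unlike the exhaustive approach, this evaluates only a few subsets per round, at the price of possibly missing the best subset.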
SelectViews(Q, V) Pseudo-code
SelectViewsSpherical (SVS)
Assumes a uniform data distribution; very cheap.
Requires the solution of a single linear program: fix a score s and
maximize the scoring function of the query, max(f_Q), subject to the
inequalities f_V <= s for the scoring function f_V of each view.
Selected views: those whose hyperplanes intersect at
the point that maximizes the scoring function.
[Figure: unit square with corners (0,0), (1,0), (0,1), the query vector Q and the view hyperplanes at score s.]
Select Views By Angle (SVA)
Sort the views by increasing angle with respect to the query vector.
Selected views: a view closer to the query results in a shorter
running time for the algorithm.
[Figure: unit square with corners (0,0), (1,0), (0,1), the query vector Q and view vectors V1, V2, V3, V4 at increasing angles from Q.]
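A minimal sketch of the sorting step: each view and the query is represented by its coefficient vector, and views are ranked by the angle they make with the query. Keeping the first m views is an assumption motivated by the bound |U| <= m; the vectors for V3 and V4 are made up for the example, while Q, V1 and V2 use the scoring functions 3x1 + 10x2, 2x1 + 5x2 and x1 + 2x2 from the LPTA slides.

```python
import math

def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def select_views_by_angle(query, views, m):
    """Names of the m views whose vectors are closest in angle to the query."""
    ranked = sorted(views, key=lambda name: angle(query, views[name]))
    return ranked[:m]

Q = (3, 10)
VIEWS = {"V1": (2, 5), "V2": (1, 2), "V3": (1, 0), "V4": (0, 1)}
print(select_views_by_angle(Q, VIEWS, m=2))
```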
Outline
LPTA Algorithm
View Selection Problem
Cost Estimation Framework
View Selection Algorithms
Experimental Evaluation
Experimental Evaluation
Two types of dataset: real and synthetic (uniform,
and Zipf with varying skew).
The real dataset contains 30K tuples from a website
specializing in automobiles.
Experiments conducted:
Performance comparison of LPTA, PREFER and
TA
Performance of LPTA using each of the view
selection algorithms
Scalability of the LPTA algorithm

Performance comparison of
LPTA, PREFER and TA
[Figure panels: uniform dataset, 3-d; real dataset, 2-d.]
Conclusions
Using views for top-k query answering
LPTA: linear programming adaptation of TA
View selection problem, cost estimation framework,
view selection algorithms
Experimental evaluation
References
Gautam Das, Dimitrios Gunopulos, Nick Koudas, Dimitris Tsirogiannis:
Answering Top-k Queries Using Views

Ronald Fagin, Amnon Lotem, Moni Naor:
Optimal Aggregation Algorithms for Middleware

aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt
