
OLAP Over Uncertain and Imprecise Data

T.S. Jayram (IBM Almaden)
with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar Vaithyanathan (IBM)

Dimensions in OLAP
Automobile: All > Sedan (Civic, Camry), Truck (F150, Sierra)
Location: All > West (CA, TX), East (NY, MA)

Measures, Facts, and Queries
Fact: Auto = F150, Loc = NY, Repair = $200
Query: Auto = Truck, Loc = East, SUM(Repair) = ?
[Figure: grid with the Automobile hierarchy on one axis and the Location hierarchy on the other; facts p1 to p8 are placed in individual cells]

Extend the OLAP model to handle data ambiguity
  Imprecision
  Uncertainty

Imprecision
Fact: Auto = F150, Loc = East, Repair = $200
[Figure: the same Automobile by Location grid, now with facts p1 to p11; some facts are recorded at a non-leaf level such as East]

Representing Imprecision using Dimension Hierarchies
Dimension hierarchies lead to a natural space of “partially specified” objects
Sources of imprecision: incomplete data, multiple sources of data
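
To make "partially specified" concrete, here is a minimal Python sketch that expands a fact given at any hierarchy level into the leaf cells it could occupy. The parent/child assignments are read off the Automobile and Location hierarchy slide; the names `CHILDREN`, `leaves`, and `region` are hypothetical and chosen for illustration only.

```python
# Hypothetical encoding of the two dimension hierarchies shown in the slides.
CHILDREN = {
    "Automobile": {
        "All": ["Sedan", "Truck"],
        "Sedan": ["Civic", "Camry"],
        "Truck": ["F150", "Sierra"],
    },
    "Location": {
        "All": ["West", "East"],
        "West": ["CA", "TX"],
        "East": ["NY", "MA"],
    },
}


def leaves(dimension, node):
    """Return the leaf-level values under `node` in the given dimension."""
    kids = CHILDREN[dimension].get(node)
    if not kids:
        return [node]  # node is already a leaf
    result = []
    for kid in kids:
        result.extend(leaves(dimension, kid))
    return result


def region(auto, loc):
    """All (Auto, Loc) leaf cells that a possibly imprecise fact could occupy."""
    return [(a, l) for a in leaves("Automobile", auto) for l in leaves("Location", loc)]


print(region("F150", "NY"))    # precise fact: exactly one cell
print(region("F150", "East"))  # imprecise in Location: two candidate cells
```

A precise fact maps to a single cell, while a fact such as (F150, East) maps to a region of several cells.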
Motivating Example
Query: COUNT
[Figure: facts p1 to p5 on the Truck (F150, Sierra) by East (MA, NY) grid]
We propose desiderata that enable appropriate definition of query semantics for imprecise data

Desideratum I: Consistency
Consistency specifies the relationship between answers to related queries on a fixed data set
[Figure: facts p1 to p5 on the Truck (F150, Sierra) by East (MA, NY) grid]

Desideratum II: Faithfulness
Faithfulness specifies the relationship between answers to a fixed query on related data sets
[Figure: Data Set 1, Data Set 2, and Data Set 3, each showing facts p1 to p5 on the F150/Sierra by MA/NY grid]

Formal definitions of both Consistency and Faithfulness depend on the underlying aggregation operator
Can we define query semantics that satisfy these desiderata?

Query Semantics: Possible Worlds [Kripke63,…]
[Figure: the example data set (facts p1 to p5 on the F150/Sierra by MA/NY grid) and its four possible worlds w1 to w4]

Possible Worlds Query Semantics
Given all possible worlds together with their probabilities, queries are easily answered (using expected values)
But the number of possible worlds is exponential!
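
As an illustration of this semantics, the sketch below enumerates every possible world for the five example facts and computes the expected SUM for one cell. The fact values and weights are taken from the extended data model table in these slides. Treating each imprecise fact as completed independently, so that a world's probability is the product of the chosen weights, is an assumption of this sketch. The names `FACTS`, `possible_worlds`, and `expected_sum` are hypothetical.

```python
from itertools import product

# Each fact: (id, Repair, [(cell, probability), ...]) with cell = (Auto, Loc).
FACTS = [
    ("p1", 100, [(("F150", "NY"), 1.0)]),
    ("p2", 500, [(("Sierra", "NY"), 1.0)]),
    ("p3", 150, [(("F150", "MA"), 0.6), (("F150", "NY"), 0.4)]),
    ("p4", 200, [(("Sierra", "MA"), 1.0)]),
    ("p5", 100, [(("F150", "MA"), 0.5), (("Sierra", "MA"), 0.5)]),
]


def possible_worlds(facts):
    """Yield (probability, world); a world is a list of (cell, repair) pairs.

    Assumption: imprecise facts are completed independently of each other.
    """
    choices = [[(cell, w, repair) for cell, w in comps] for _, repair, comps in facts]
    for combo in product(*choices):
        prob = 1.0
        world = []
        for cell, w, repair in combo:
            prob *= w
            world.append((cell, repair))
        yield prob, world


def expected_sum(facts, query_cell):
    """Expected value of SUM(Repair) for one cell, taken over all worlds."""
    total = 0.0
    for prob, world in possible_worlds(facts):
        total += prob * sum(r for cell, r in world if cell == query_cell)
    return total


worlds = list(possible_worlds(FACTS))
print(len(worlds))                          # 4 worlds for two imprecise facts
print(expected_sum(FACTS, ("F150", "MA")))  # about 140
```

Each additional imprecise fact multiplies the number of worlds by its number of completions, which is the exponential growth the slide warns about.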
Allocation
Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
  Size increase is linear in number of (completions of) imprecise facts
  Queries operate over this extended version
Key contributions:
  Appropriate characterization of the large space of allocation policies
  Designing efficient allocation policies that take into account the correlations in the data
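
A minimal sketch of the simplest policy, uniform allocation, under the assumption that each fact arrives with its list of candidate cells. The function name `uniform_allocation` and the row layout are hypothetical; the fact values mirror the example facts p1 to p5.

```python
def uniform_allocation(facts):
    """Split each fact evenly across its possible completions.

    facts: list of (fact_id, [(auto, loc), ...], repair).
    Returns one extended row per (fact, completion) pair.
    """
    rows = []
    for fact_id, completions, repair in facts:
        weight = 1.0 / len(completions)  # uniform policy: equal share per completion
        for auto, loc in completions:
            rows.append(
                {"FactID": fact_id, "Auto": auto, "Loc": loc,
                 "Repair": repair, "Weight": weight}
            )
    return rows


facts = [
    (1, [("F150", "NY")], 100),
    (2, [("Sierra", "NY")], 500),
    (3, [("F150", "MA"), ("F150", "NY")], 150),    # imprecise in Location
    (4, [("Sierra", "MA")], 200),
    (5, [("F150", "MA"), ("Sierra", "MA")], 100),  # imprecise in Automobile
]

extended = uniform_allocation(facts)
for row in extended:
    print(row)
```

The output has one row per completion, so its size grows linearly with the number of completions. Uniform allocation gives fact 3 weights of 0.5 and 0.5; the slides' extended table shows 0.6 and 0.4 for that fact, so that table presumably reflects a different policy.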
Storing Allocations using Extended Data Model
[Figure: facts p1 to p5 on the Truck (F150, Sierra) by East (MA, NY) grid]

ID  FactID  Auto    Loc  Repair  Weight
1   1       F150    NY   100     1.0
2   2       Sierra  NY   500     1.0
3   3       F150    MA   150     0.6
4   3       F150    NY   150     0.4
5   4       Sierra  MA   200     1.0
6   5       F150    MA   100     0.5
7   5       Sierra  MA   100     0.5
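
A minimal sketch of how SUM and COUNT could be evaluated directly over these extended rows, weighting each row by its allocation. The rows are copied from the table; the function names `weighted_sum` and `weighted_count` are hypothetical.

```python
# (ID, FactID, Auto, Loc, Repair, Weight), copied from the extended data table.
ROWS = [
    (1, 1, "F150", "NY", 100, 1.0),
    (2, 2, "Sierra", "NY", 500, 1.0),
    (3, 3, "F150", "MA", 150, 0.6),
    (4, 3, "F150", "NY", 150, 0.4),
    (5, 4, "Sierra", "MA", 200, 1.0),
    (6, 5, "F150", "MA", 100, 0.5),
    (7, 5, "Sierra", "MA", 100, 0.5),
]


def weighted_sum(rows, autos, locs):
    """SUM(Repair) over the query region, each row scaled by its weight."""
    return sum(rep * w for _, _, a, l, rep, w in rows if a in autos and l in locs)


def weighted_count(rows, autos, locs):
    """COUNT over the query region: the total allocated weight inside it."""
    return sum(w for _, _, a, l, _, w in rows if a in autos and l in locs)


# Query on a single cell: Auto = F150, Loc = MA.
print(weighted_sum(ROWS, {"F150"}, {"MA"}), weighted_count(ROWS, {"F150"}, {"MA"}))

# Query on the parent region: Auto = Truck, Loc = East.
truck, east = {"F150", "Sierra"}, {"NY", "MA"}
print(weighted_sum(ROWS, truck, east), weighted_count(ROWS, truck, east))
```

Because every fact's weights add up to 1, the SUM for (Truck, East) equals the total of the SUMs for its four child cells, which is the kind of relationship between related queries that the consistency desideratum describes.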


Classifying Allocation Policies

                                 Measure Correlation
                                 Ignored    Used
Dimension Correlation  Ignored   Uniform
                       Used      Count      EM

Results on Query Semantics
Evaluating queries over extended version of data yields expected value of the aggregation operator over all possible worlds
  intuitively, the correct value to compute
Efficient query evaluation algorithms for SUM, COUNT
  consistency and faithfulness for SUM, COUNT are satisfied under appropriate conditions
Dynamic programming algorithm for AVERAGE
  Unfortunately, consistency does not hold for AVERAGE

Alternative Semantics for AVERAGE
APPROXIMATE AVERAGE
  E[SUM] / E[COUNT] instead of E[SUM/COUNT]
  simpler and more efficient
  satisfies consistency
  extends to aggregation operators for uncertain measures
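
To show that the two quantities differ, the sketch below computes both for the query Auto = F150, Loc = East on the example facts. The exact value is obtained by brute-force enumeration of worlds, which is used here only for illustration and is not the dynamic programming algorithm mentioned in the slides. Independent completion of imprecise facts is an assumption, and all function names are hypothetical.

```python
from itertools import product

# (Repair, [(cell, weight), ...]) for facts p1..p5; weights from the extended table.
FACTS = [
    (100, [(("F150", "NY"), 1.0)]),
    (500, [(("Sierra", "NY"), 1.0)]),
    (150, [(("F150", "MA"), 0.6), (("F150", "NY"), 0.4)]),
    (200, [(("Sierra", "MA"), 1.0)]),
    (100, [(("F150", "MA"), 0.5), (("Sierra", "MA"), 0.5)]),
]


def in_query(cell):
    """Query region: Auto = F150, Loc = East (NY or MA)."""
    auto, loc = cell
    return auto == "F150" and loc in {"NY", "MA"}


def approximate_average(facts):
    """E[SUM] / E[COUNT], computed from the weighted rows only."""
    exp_sum = sum(r * w for r, comps in facts for cell, w in comps if in_query(cell))
    exp_count = sum(w for _, comps in facts for cell, w in comps if in_query(cell))
    return exp_sum / exp_count


def exact_expected_average(facts):
    """E[SUM/COUNT] by enumerating worlds (assumes independent completions)."""
    choices = [[(r, cell, w) for cell, w in comps] for r, comps in facts]
    total = 0.0
    for combo in product(*choices):
        prob = 1.0
        values = []
        for r, cell, w in combo:
            prob *= w
            if in_query(cell):
                values.append(r)
        # p1 is always inside the region, so the count is never zero here.
        total += prob * sum(values) / len(values)
    return total


print(approximate_average(FACTS))     # about 120.0
print(exact_expected_average(FACTS))  # about 120.83
```

The approximate form needs only two weighted aggregates over the extended data, which is why it is the simpler and cheaper of the two.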
Uncertainty
Measure value is modeled as a probability distribution function over some base domain
  e.g., measure Brake is a pdf over values {Yes, No}
  sources of uncertainty: measures extracted from text using classifiers
Adapt well-known concepts from statistics to derive appropriate aggregation operators
Our framework and solutions for dealing with imprecision also extend to uncertain measures

Summary
Consistency and faithfulness
  desiderata for designing query semantics for imprecise data
Allocation is the key to our framework
Efficient algorithms for aggregation operators with appropriate guarantees of consistency and faithfulness
Iterative algorithms for allocation policies

Correlation-based Allocation
Involves defining an objective function to capture some underlying correlation structure
  a more stringent requirement on the allocations
  solving the resulting optimization problem yields the allocations
EM-based iterative allocation policy
  interesting highlight: allocations are re-scaled iteratively by computing appropriate aggregations
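
The slides do not spell out the objective function, so the following is only a sketch of what an iterative re-scaling loop of this kind might look like, using fact counts per cell as the aggregate. It starts from uniform weights, aggregates the current mass in each cell, and then re-scales each imprecise fact's weights in proportion to that mass. The function name `em_count_allocation`, its parameters, and the count-based choice are assumptions, not the authors' algorithm.

```python
def em_count_allocation(precise_counts, imprecise, iterations=20):
    """Iteratively re-scale allocations in proportion to aggregated cell counts.

    precise_counts: dict cell -> number of precise facts in that cell.
    imprecise: dict fact_id -> non-empty list of candidate cells.
    Returns dict fact_id -> dict cell -> weight (weights sum to 1 per fact).
    """
    # Start from the uniform policy.
    alloc = {
        fid: {c: 1.0 / len(cells) for c in cells}
        for fid, cells in imprecise.items()
    }
    for _ in range(iterations):
        # Aggregate: mass per cell = precise facts + current fractional allocations.
        mass = dict(precise_counts)
        for weights in alloc.values():
            for cell, w in weights.items():
                mass[cell] = mass.get(cell, 0.0) + w
        # Re-scale: each fact's weights become proportional to the cell masses.
        alloc = {
            fid: {c: mass[c] / sum(mass[d] for d in cells) for c in cells}
            for fid, cells in imprecise.items()
        }
    return alloc


# Precise facts p1, p2, p4 and imprecise facts p3, p5 from the running example.
precise = {("F150", "NY"): 1, ("Sierra", "NY"): 1, ("Sierra", "MA"): 1}
imprecise = {
    "p3": [("F150", "MA"), ("F150", "NY")],
    "p5": [("F150", "MA"), ("Sierra", "MA")],
}
result = em_count_allocation(precise, imprecise)
for fid, weights in result.items():
    print(fid, weights)
```

In this toy run the weights drift toward the cells that already hold precise facts, which gives a rough sense of how an allocation can pick up correlations present in the data.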
