OLAP Over Uncertain and Imprecise Data: T.S. Jayram (IBM Almaden)

with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar Vaithyanathan (IBM)
Dimensions in OLAP
Automobile hierarchy: All > Sedan, Truck; Sedan > Civic, Camry; Truck > F150, Sierra
Location hierarchy: All > West, East; West > CA, TX; East > NY, MA
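A minimal sketch of how the two hierarchies above could be held in memory, assuming a plain child-to-parent map. The names `AUTO_PARENT`, `LOC_PARENT`, and `ancestors` are hypothetical and do not come from the slides.

```python
# Child -> parent maps for the two dimension hierarchies on the slide.
# The map names are illustrative, not taken from the talk.
AUTO_PARENT = {
    "Civic": "Sedan", "Camry": "Sedan",
    "F150": "Truck", "Sierra": "Truck",
    "Sedan": "All", "Truck": "All",
}
LOC_PARENT = {
    "CA": "West", "TX": "West",
    "NY": "East", "MA": "East",
    "West": "All", "East": "All",
}

def ancestors(value, parent):
    """Return the roll-up chain from `value` to the root of its hierarchy."""
    chain = [value]
    while chain[-1] in parent:      # the root ("All") has no parent entry
        chain.append(parent[chain[-1]])
    return chain

print(ancestors("F150", AUTO_PARENT))  # ['F150', 'Truck', 'All']
```

A fact recorded at a leaf (F150, NY) rolls up to every pair of ancestors, which is what lets a query such as Auto = Truck, Loc = East pick it up.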
Measures, Facts, and Queries
Example fact: Auto = F150, Loc = NY, Repair = $200
Example query: Auto = Truck, Loc = East, SUM(Repair) = ?
[Figure: facts p1 to p8 placed in cells of the Automobile by Location grid (Civic, Camry, F150, Sierra by CA, TX, NY, MA)]
Extend the OLAP model to handle data ambiguity
Imprecision
Uncertainty
Imprecision
Example imprecise fact: Auto = F150, Loc = East, Repair = $200
[Figure: the Automobile by Location grid, now with facts p9, p10, and p11 added to p1 through p8]
Representing Imprecision using Dimension Hierarchies
Dimension hierarchies lead to a natural space of “partially specified” objects
Sources of imprecision: incomplete data, multiple sources of data
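As a sketch of what "partially specified" means here: a fact whose dimension value sits above the leaf level can be completed to several leaf-level cells. The lookup tables and the helpers `completions` and `is_imprecise` are hypothetical names chosen for illustration; the hierarchy contents and the example fact (Auto = F150, Loc = East) come from the slides.

```python
from itertools import product

# Leaf values below each node of the two hierarchies shown in this deck.
AUTO_LEAVES = {
    "All": ["Civic", "Camry", "F150", "Sierra"],
    "Sedan": ["Civic", "Camry"],
    "Truck": ["F150", "Sierra"],
    "Civic": ["Civic"], "Camry": ["Camry"],
    "F150": ["F150"], "Sierra": ["Sierra"],
}
LOC_LEAVES = {
    "All": ["CA", "TX", "NY", "MA"],
    "West": ["CA", "TX"],
    "East": ["NY", "MA"],
    "CA": ["CA"], "TX": ["TX"], "NY": ["NY"], "MA": ["MA"],
}

def completions(auto, loc):
    """All leaf-level cells a (possibly imprecise) fact could occupy."""
    return list(product(AUTO_LEAVES[auto], LOC_LEAVES[loc]))

def is_imprecise(auto, loc):
    """A fact is imprecise when it maps to more than one cell."""
    return len(completions(auto, loc)) > 1

print(completions("F150", "East"))   # [('F150', 'NY'), ('F150', 'MA')]
print(is_imprecise("F150", "NY"))    # False
```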
Motivating Example
Query: COUNT
[Figure: facts p1 to p5 over Truck (F150, Sierra) by East (NY, MA)]
We propose desiderata that enable appropriate definition of query semantics for imprecise data
Desideratum I: Consistency
Consistency specifies the relationship between answers to related queries on a fixed data set
[Figure: facts p1 to p5 over Truck (F150, Sierra) by East (NY, MA)]
Desideratum II: Faithfulness
Faithfulness specifies the relationship between answers to a fixed query on related data sets
[Figure: Data Set 1, Data Set 2, Data Set 3; each shows facts p1 to p5 over F150, Sierra by NY, MA]
Formal definitions of both Consistency and Faithfulness depend on the underlying aggregation operator
Can we define query semantics that satisfy these desiderata?
Query Semantics
Possible Worlds [Kripke63,…]
[Figure: the data set with facts p1 to p5 over F150, Sierra by NY, MA, and its four possible worlds w1 to w4]
Possible Worlds Query Semantics
Given all possible worlds together with their probabilities, queries are easily answered (using expected values)
But number of possible worlds is exponential!
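A small sketch of the possible-worlds view, using the five facts and weights from the extended-data table in this deck. Two things are assumptions made for illustration: the weights are read as completion probabilities, and facts are completed independently of one another. The names `FACTS`, `possible_worlds`, and `expected_count` are hypothetical.

```python
from itertools import product

# fact_id -> (repair, [(cell, probability), ...]); numbers follow the
# extended-data table in this deck. Independence across facts is assumed.
FACTS = {
    1: (100, [(("F150", "NY"), 1.0)]),
    2: (500, [(("Sierra", "NY"), 1.0)]),
    3: (150, [(("F150", "MA"), 0.6), (("F150", "NY"), 0.4)]),
    4: (200, [(("Sierra", "MA"), 1.0)]),
    5: (100, [(("F150", "MA"), 0.5), (("Sierra", "MA"), 0.5)]),
}

def possible_worlds(facts):
    """Yield (probability, {fact_id: cell}) for every possible world."""
    ids = sorted(facts)
    for choice in product(*(facts[i][1] for i in ids)):
        prob = 1.0
        world = {}
        for fact_id, (cell, p) in zip(ids, choice):
            prob *= p
            world[fact_id] = cell
        yield prob, world

def expected_count(facts, query_cells):
    """Expected COUNT of facts in `query_cells`, taken over all worlds."""
    return sum(
        prob * sum(1 for cell in world.values() if cell in query_cells)
        for prob, world in possible_worlds(facts)
    )

print(len(list(possible_worlds(FACTS))))          # 4 worlds
print(expected_count(FACTS, {("F150", "MA")}))    # about 1.1
```

With k imprecise facts of two completions each the loop visits 2^k worlds, which is the exponential blow-up the slide warns about.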
Allocation
Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
  Size increase is linear in number of (completions of) imprecise facts
  Queries operate over this extended version
Key contributions:
  Appropriate characterization of the large space of allocation policies
  Designing efficient allocation policies that take into account the correlations in the data
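To make the idea concrete, here is a sketch of the simplest policy, uniform allocation, where each possible completion of a fact receives an equal share of weight. The two facts are the precise and imprecise examples from earlier slides; the fact ids, the tuple layout, and the function name `uniform_allocation` are assumptions for illustration.

```python
# (fact_id, region, repair): region lists the leaf cells the fact may occupy.
# Fact ids are hypothetical; the two facts mirror the slide examples
# (F150, NY, $200) and (F150, East, $200).
FACTS = [
    (1, [("F150", "NY")], 200),                   # precise fact
    (2, [("F150", "NY"), ("F150", "MA")], 200),   # imprecise: Loc = East
]

def uniform_allocation(facts):
    """Build extended-data rows (fact_id, auto, loc, repair, weight)."""
    rows = []
    for fact_id, region, repair in facts:
        weight = 1.0 / len(region)    # equal share per completion
        for auto, loc in region:
            rows.append((fact_id, auto, loc, repair, weight))
    return rows

for row in uniform_allocation(FACTS):
    print(row)
```

The output has one row per completion of each fact, so its size grows linearly with the number of completions rather than with the number of possible worlds.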
Storing Allocations using Extended Data Model
ID  FactID  Auto    Loc  Repair  Weight
1   1       F150    NY   100     1.0
2   2       Sierra  NY   500     1.0
3   3       F150    MA   150     0.6
4   3       F150    NY   150     0.4
5   4       Sierra  MA   200     1.0
6   5       F150    MA   100     0.5
7   5       Sierra  MA   100     0.5
[Figure: facts p1 to p5 over Truck (F150, Sierra) by East (NY, MA)]
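A sketch of how SUM and COUNT might be evaluated directly over the extended table, assuming each row simply contributes its measure (or 1, for COUNT) scaled by its weight. The rows are the seven listed in the table; `ROWS`, `weighted_sum`, and `weighted_count` are hypothetical names.

```python
# (ID, FactID, Auto, Loc, Repair, Weight), copied from the table in this deck.
ROWS = [
    (1, 1, "F150",   "NY", 100, 1.0),
    (2, 2, "Sierra", "NY", 500, 1.0),
    (3, 3, "F150",   "MA", 150, 0.6),
    (4, 3, "F150",   "NY", 150, 0.4),
    (5, 4, "Sierra", "MA", 200, 1.0),
    (6, 5, "F150",   "MA", 100, 0.5),
    (7, 5, "Sierra", "MA", 100, 0.5),
]

def weighted_sum(rows, autos, locs):
    """SUM(Repair) over rows in the query region, scaled by weight."""
    return sum(repair * w for _, _, auto, loc, repair, w in rows
               if auto in autos and loc in locs)

def weighted_count(rows, autos, locs):
    """COUNT over rows in the query region, scaled by weight."""
    return sum(w for _, _, auto, loc, _, w in rows
               if auto in autos and loc in locs)

# Query: Auto = F150, Loc = MA
print(weighted_sum(ROWS, {"F150"}, {"MA"}))    # about 140 (150*0.6 + 100*0.5)
print(weighted_count(ROWS, {"F150"}, {"MA"}))  # about 1.1 (0.6 + 0.5)
```

A single pass over seven rows answers the query, with no need to enumerate possible worlds.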

Classifying Allocation Policies
                                Measure Correlation
                                Ignored     Used
Dimension Correlation  Ignored  Uniform
                       Used     Count       EM
Results on Query Semantics
Evaluating queries over extended version of data yields expected value of the aggregation operator over all possible worlds
  intuitively, the correct value to compute
Efficient query evaluation algorithms for SUM, COUNT
  consistency and faithfulness for SUM, COUNT are satisfied under appropriate conditions
Dynamic programming algorithm for AVERAGE
  Unfortunately, consistency does not hold for AVERAGE
Alternative Semantics for AVERAGE
APPROXIMATE AVERAGE
  E[SUM] / E[COUNT] instead of E[SUM/COUNT]
  simpler and more efficient
  satisfies consistency
  extends to aggregation operators for uncertain measures
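The difference between the two quantities can be seen on a tiny example. The sketch below uses the facts and weights from the extended-data table in this deck and the query Auto = F150, Loc = East. It assumes facts are completed independently, and it computes the exact expectation by brute-force enumeration rather than by the dynamic programming algorithm mentioned in the talk. All function and variable names are hypothetical.

```python
from itertools import product

# fact_id -> (repair, [(cell, weight), ...]), from the table in this deck.
FACTS = {
    1: (100, [(("F150", "NY"), 1.0)]),
    2: (500, [(("Sierra", "NY"), 1.0)]),
    3: (150, [(("F150", "MA"), 0.6), (("F150", "NY"), 0.4)]),
    4: (200, [(("Sierra", "MA"), 1.0)]),
    5: (100, [(("F150", "MA"), 0.5), (("Sierra", "MA"), 0.5)]),
}
QUERY = {("F150", "NY"), ("F150", "MA")}   # Auto = F150, Loc = East

def approximate_average(facts, query):
    """E[SUM] / E[COUNT], computed from the weighted rows alone."""
    exp_sum = sum(r * w for r, alloc in facts.values()
                  for cell, w in alloc if cell in query)
    exp_count = sum(w for _, alloc in facts.values()
                    for cell, w in alloc if cell in query)
    return exp_sum / exp_count

def expected_average(facts, query):
    """E[SUM / COUNT] by enumerating worlds (independence assumed)."""
    items = list(facts.values())
    total = 0.0
    for choice in product(*(alloc for _, alloc in items)):
        prob, s, n = 1.0, 0.0, 0
        for (repair, _), (cell, w) in zip(items, choice):
            prob *= w
            if cell in query:
                s += repair
                n += 1
        if n:   # worlds with an empty answer are skipped; none occur here
            total += prob * s / n
    return total

print(approximate_average(FACTS, QUERY))  # about 120.0  (300 / 2.5)
print(expected_average(FACTS, QUERY))     # about 120.83
```

The two answers are close but not equal, and the first needs only two weighted sums.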
Uncertainty
Measure value is modeled as a probability distribution function over some base domain
  e.g., measure Brake is a pdf over values {Yes, No}
  sources of uncertainty: measures extracted from text using classifiers
Adapt well-known concepts from statistics to derive appropriate aggregation operators
Our framework and solutions for dealing with imprecision also extend to uncertain measures
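The slides do not name the aggregation operator used for uncertain measures. One well-known candidate from statistics is the linear opinion pool, which averages the input pdfs; the sketch below assumes that choice purely for illustration. The two Brake pdfs are made-up inputs and `linear_opinion_pool` is a hypothetical name.

```python
def linear_opinion_pool(pdfs, weights=None):
    """Combine discrete pdfs (dict: value -> probability) by weighted averaging.

    Assumption: a linear opinion pool stands in for the aggregation
    operator; the slides do not specify which operator is used.
    """
    if weights is None:
        weights = [1.0 / len(pdfs)] * len(pdfs)   # equal weight per fact
    pooled = {}
    for pdf, w in zip(pdfs, weights):
        for value, p in pdf.items():
            pooled[value] = pooled.get(value, 0.0) + w * p
    return pooled

# Hypothetical Brake pdfs for two facts, e.g. produced by a text classifier.
brake_pdfs = [
    {"Yes": 0.9, "No": 0.1},
    {"Yes": 0.2, "No": 0.8},
]
print(linear_opinion_pool(brake_pdfs))   # roughly {'Yes': 0.55, 'No': 0.45}
```

The result is again a pdf over {Yes, No}, so the aggregate has the same type as the measure it summarizes.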
Summary
Consistency and faithfulness
  desiderata for designing query semantics for imprecise data
Allocation is the key to our framework
Efficient algorithms for aggregation operators with appropriate guarantees of consistency and faithfulness
Iterative algorithms for allocation policies
Correlation-based Allocation
Involves defining an objective function to capture some underlying correlation structure
  a more stringent requirement on the allocations
  solving the resulting optimization problem yields the allocations
EM-based iterative allocation policy
  interesting highlight: allocations are re-scaled iteratively by computing appropriate aggregations
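A rough sketch of what such an iterative re-scaling loop could look like, under the assumption that the aggregation is a per-cell count (precise facts plus currently allocated weight) and that each imprecise fact's weights are renormalized in proportion to those counts. This is an illustrative reading of the slide, not the talk's stated algorithm. The inputs mirror the five facts in the extended-data table, and all names are hypothetical.

```python
def em_count_allocation(precise_counts, regions, iterations=20):
    """Iteratively re-scale allocations in proportion to per-cell counts.

    precise_counts: {cell: number of precise facts in that cell}
    regions:        {fact_id: [cells the imprecise fact may occupy]}
    """
    # Start from a uniform allocation over each fact's region.
    alloc = {f: {c: 1.0 / len(cells) for c in cells}
             for f, cells in regions.items()}
    for _ in range(iterations):
        # Aggregation step: precise count plus allocated weight per cell.
        mass = dict(precise_counts)
        for weights in alloc.values():
            for cell, w in weights.items():
                mass[cell] = mass.get(cell, 0.0) + w
        # Re-scaling step: renormalize each fact's weights over its region.
        new_alloc = {}
        for f, cells in regions.items():
            total = sum(mass[c] for c in cells)
            new_alloc[f] = {c: mass[c] / total for c in cells}
        alloc = new_alloc
    return alloc

# Precise facts 1, 2, 4 and imprecise facts 3, 5 from the table in this deck.
PRECISE_COUNTS = {("F150", "NY"): 1, ("Sierra", "NY"): 1, ("Sierra", "MA"): 1}
REGIONS = {
    3: [("F150", "MA"), ("F150", "NY")],
    5: [("F150", "MA"), ("Sierra", "MA")],
}
for fact_id, weights in em_count_allocation(PRECISE_COUNTS, REGIONS).items():
    print(fact_id, weights)
```

Under this count-based objective, weight drifts toward cells that already hold precise facts, so the resulting numbers need not match the weights shown in the deck's table.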
