Panos Vassiliadis
University of Ioannina (joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)
Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL
PrOPr 2007
Extract-Transform-Load (ETL)
ETL: importance
ETL and Data Cleaning tools cost
30% of effort and expenses in the budget of the DW
55% of the total costs of DW runtime
80% of the development time in a DW project
most vendors implement a core set of operators and provide GUI to create a data flow
no care for the algorithmic choices
no care about the order of the transformations
(hopefully) no care for the details of the inter-attribute mappings
Now:
[Diagram: DW, Physical templates, Physical scenario, Engine]
Vision:
[Diagram: ETL tool, DW, Logical scenario, Optimizer, Physical templates, Physical scenario, Engine]
Detail independence
Automate (as much as possible):
Conceptual: the details of the inter-attribute mappings
Logical: the order of the transformations
Physical: the algorithmic choices
[Diagram: ETL tool, DW, Logical scenario, Optimizer, Physical templates, Physical scenario, Engine]
Datastore mappings
Datastore annotation
Logical Model
[Figure: example ETL workflow from Sources through the DSA to the DW. Recordsets: DS.PSNEW1, DS.PSOLD1, DS.PSNEW2, DS.PSOLD2, DS.PS1, DS.PS2, LOOKUP_PS. Activities: DIFF1, DIFF2, SK1, SK2, AddAttr2, NotNULL, AddDate (DATE=SYSDATE), $2, Aggregate2, PK, U, V2. Rejected rows are routed to Log.]
Logical Model
Main question:
What information should we put inside a metadata repository to be able to answer questions like:
what is the architecture of my DW back stage?
which attributes/tables are involved in the population of an attribute?
what part of the scenario is affected if we delete an attribute?
Architecture Graph
[Figure: the example ETL workflow from Sources through the DSA to the DW, with recordsets DS.PSNEW1, DS.PSOLD1, DS.PSNEW2, DS.PSOLD2, DS.PS1, DS.PS2, LOOKUP_PS and activities DIFF1, DIFF2, SK1, SK2, AddAttr2, NotNULL, AddDate, $2, Aggregate2, PK, U, V2; rejected rows are routed to Log.]
Architecture Graph
Example
Optimization
Execution order
Logical Optimization
Can we push selection early enough?
Can we aggregate before $2 takes place?
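The first question can be illustrated with a minimal sketch. It assumes a hypothetical two-step flow: a per-row cost conversion, followed by a selection on an attribute the conversion does not touch. The row layout, the rate, and the function names are illustrative assumptions, not taken from the slides. Under these assumptions both orders return the same rows, but the reordered flow converts fewer of them.

```python
# Hypothetical rows: (pkey, qty, cost). The layout is an assumption.
ROWS = [(1, 10, 5.0), (2, 0, 7.5), (3, 4, 2.0), (4, 0, 1.0)]
RATE = 0.5  # illustrative conversion rate, not a real exchange rate


def convert(rows, counter):
    """Per-row cost conversion; counts how many rows it had to process."""
    out = []
    for pkey, qty, cost in rows:
        counter["converted"] += 1
        out.append((pkey, qty, cost * RATE))
    return out


def select_nonzero_qty(rows):
    """Selection on qty, an attribute the conversion does not modify."""
    return [r for r in rows if r[1] > 0]


late, early = {"converted": 0}, {"converted": 0}
result_late = select_nonzero_qty(convert(ROWS, late))    # selection last
result_early = convert(select_nonzero_qty(ROWS), early)  # selection pushed early

print(result_late == result_early, late["converted"], early["converted"])
```

The swap is safe here only because the selection reads an attribute that the conversion leaves unchanged; a selection on the converted cost could not be moved this way without rewriting its condition.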
Logical to Physical
Identify the best possible physical implementation for a given logical ETL workflow
[Diagram: ETL tool, Conceptual to logical mapper, Logical templates, DW, Logical scenario, Optimizer, Physical templates, Physical scenario, Engine]
Problem formulation
Given a logical-level ETL workflow G_L
Compute a physical-level ETL workflow G_P
Such that:
the semantics of the workflow do not change
all constraints are met
the cost is minimal
Solution
We model the problem of finding the physical implementation of an ETL process as a state-space search problem.
States. A state is a graph G_P that represents a physical-level ETL workflow.
The initial state G_P^0 is produced by randomly assigning physical implementations to logical activities, subject to preconditions and constraints.
Transitions. Given a state G_P, a new state G_P' is generated by replacing the implementation of a physical activity a_P of G_P with another valid implementation for the same activity.
Extension: introduction of a sorter activity (at the physical level) as a new node in the graph.
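The state-space formulation can be sketched in a few lines. This is a minimal sketch under stated assumptions: the activity names, implementation names, and costs are hypothetical; the cost of a state is taken to be a plain sum; preconditions, constraints, and sorter introduction are left out; and hill climbing is used only as one simple search strategy, since the slides do not name one.

```python
import random

# Hypothetical catalogue: logical activity -> {physical implementation: cost}.
# Names and costs are illustrative assumptions.
IMPLEMENTATIONS = {
    "join": {"nested_loops": 900.0, "sort_merge": 300.0, "hash": 200.0},
    "aggregate": {"sort_based": 250.0, "hash_based": 120.0},
    "filter": {"scan": 50.0},
}


def cost(state):
    """Cost of a state: here simply the sum of the chosen implementations."""
    return sum(IMPLEMENTATIONS[a][impl] for a, impl in state.items())


def initial_state(rng):
    """Random assignment of one implementation per logical activity."""
    return {a: rng.choice(sorted(impls)) for a, impls in IMPLEMENTATIONS.items()}


def transitions(state):
    """All states reachable by swapping the implementation of one activity."""
    for activity, impls in IMPLEMENTATIONS.items():
        for impl in impls:
            if impl != state[activity]:
                yield {**state, activity: impl}


def search(rng):
    """Hill climbing: move to the cheapest neighbour until none improves."""
    current = initial_state(rng)
    while True:
        best = min(transitions(current), key=cost, default=current)
        if cost(best) >= cost(current):
            return current
        current = best


plan = search(random.Random(0))
print(plan, cost(plan))
```

Because the assumed cost is a sum of independent per-activity costs, hill climbing always ends at the cheapest assignment here. With sorters or order-dependent costs, the choices interact and a greedy walk may stop at a local minimum.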
Sorter introduction
Sorters: impact
We intentionally introduce orderings (via appropriate physical-level sorter activities) towards obtaining physical plans of lower cost.
Semantics: unaffected
Price to pay:
cost of sorting the stream of processed data
Gain:
it is possible to employ order-aware algorithms that significantly reduce processing cost
it is possible to amortize the cost over activities that utilize common useful orderings
Sorter gains
Without order:
Cost(G) = 100.000 + 10.000 + 3*[5.000*log2(5.000) + 5.000] = 309.316
If sorter S_{A,B} is added to V:
Cost(G) = 100.000 + 10.000 + 2*5.000 + [5.000*log2(5.000) + 5.000] = 247.877
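The two cost expressions can be checked numerically, reading the dots as thousands separators. In this short sketch the variable names and the reading of each term are assumptions. The first expression reproduces 309.316. The second, evaluated exactly as printed, gives about 186.439; the printed total of 247.877 is obtained when one further 5.000*log2(5.000) term is added, which would be consistent with charging the sorter's own sorting cost. That reading is an assumption, not something the slide states.

```python
from math import log2

n = 5_000                     # assumed: rows reaching each of the three activities
upstream = 100_000 + 10_000   # the two leading terms of both expressions
sort_based = n * log2(n) + n  # the bracketed term: n*log2(n) + n

without_sorter = upstream + 3 * sort_based
with_sorter_as_printed = upstream + 2 * n + sort_based
# Assumption: additionally charge n*log2(n) for the sorter itself.
with_sorter_plus_sort = with_sorter_as_printed + n * log2(n)

print(round(without_sorter))          # 309316
print(round(with_sorter_as_printed))  # 186439
print(round(with_sorter_plus_sort))   # 247877
```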
Interesting orders
A asc
A desc
{A,B, [A,B]}
[Diagram: DW, Logical scenario, Optimizer, Physical templates, Physical scenario, Engine, annotated with the labels WHY, WHAT, and HOW]
what is the architecture of my DW back stage?
  it is described as the Architecture Graph
which attributes/tables are involved in the population of an attribute?
what part of the scenario is affected if we delete an attribute?
  follow the appropriate path in the Architecture Graph
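"Following the appropriate path" can be sketched as a plain graph traversal. This is a minimal sketch assuming the Architecture Graph is reduced to attribute-level provider edges. The node names are loosely modelled on the example workflow, while DW.PARTS and all edges are hypothetical. A backward walk answers the lineage question and a forward walk answers the impact question.

```python
from collections import deque

# Hypothetical provider edges: source attribute -> attributes it populates.
# The edges and the DW.PARTS table name are assumptions for illustration.
PROVIDES = {
    "DS.PS1.PKEY": ["SK1.PKEY"],
    "LOOKUP_PS.SKEY": ["SK1.SKEY"],
    "SK1.PKEY": ["SK1.SKEY"],
    "SK1.SKEY": ["DW.PARTS.PKEY"],
    "DS.PS1.COST": ["DW.PARTS.COST"],
}


def reachable(start, edges):
    """Breadth-first search: every node reachable from start along edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(edges):
    """Flip every edge so we can walk from a target back to its sources."""
    flipped = {}
    for src, targets in edges.items():
        for tgt in targets:
            flipped.setdefault(tgt, []).append(src)
    return flipped


# Which attributes are involved in populating DW.PARTS.PKEY? (backward walk)
lineage = reachable("DW.PARTS.PKEY", reverse(PROVIDES))
# What is affected if DS.PS1.PKEY is deleted? (forward walk)
impact = reachable("DS.PS1.PKEY", PROVIDES)
print(sorted(lineage), sorted(impact))
```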
Because there is a process (described by the Architecture Graph at the logical level + the conceptual model) that produces this kind of tuples
Hard! If there is a way to derive an inverse workflow that links the DW tuples to their sources, you can answer it.
Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data
Widom's work on record lineage
(update takes place at the source, DW + data marts must be updated)
Done, although in a tedious way: log sniffing, mainly. Also, diff comparison of extracted snapshots
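The diff comparison of extracted snapshots can be sketched as follows. This is a minimal illustration, assuming each snapshot fits in memory as a mapping from primary key to row. The function name and the sample data are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns three dicts: rows inserted, rows deleted, and rows whose
    content changed between the old and the new snapshot.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, deleted, updated


# Hypothetical snapshots: pkey -> (qty, cost)
yesterday = {1: (10, 5.0), 2: (3, 7.5), 3: (4, 2.0)}
today = {1: (10, 5.0), 2: (3, 8.0), 4: (1, 1.0)}

inserted, deleted, updated = diff_snapshots(yesterday, today)
print(inserted, deleted, updated)
```

Only the three resulting sets need to be propagated to the DW and the data marts, rather than the full extract.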
When errors are discovered during the ETL process, how are they handled?
(update takes place at the data staging area, sources must be updated)
Too hard to back-fuse data into the sources, both for political and workload issues.
Currently, this is not automated.
Thank you!