
Data Provenance in ETL Scenarios

Panos Vassiliadis
University of Ioannina (joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)

Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL

PrOPr 2007


Data Warehouse Environment


Extract-Transform-Load (ETL)


ETL: importance
ETL and data-cleaning tools account for:
30% of the effort and expenses in the budget of the DW
55% of the total costs of DW runtime
80% of the development time in a DW project

The ETL market is a multi-million-dollar market: IBM paid $1.1 billion for Ascential.

ETL tools in the market


 

software packages vs. in-house development
no standard, no common model: most vendors implement a core set of operators and provide a GUI to create a data flow

Fundamental research question


Now: ETL designers work directly at the physical level (typically, via libraries of physical-level templates).
Challenge: can we design ETL flows as declaratively as possible?
Detail independence:
no care for the algorithmic choices
no care about the order of the transformations
(hopefully) no care for the details of the inter-attribute mappings

Now:
[Figure: current practice. The involved data stores and a library of physical templates feed a physical scenario, which the engine executes against the DW.]

Vision:
[Figure: the envisioned ETL tool. Schema mappings drive a conceptual-to-logical mapper; logical templates yield a logical scenario; an optimizer uses physical templates to produce the physical scenario, which the engine executes against the DW.]

Detail independence
[Figure: the same ETL-tool pipeline, annotated with what is automated at each level.]
Automate (as much as possible):
Conceptual: the details of the inter-attribute mappings
Logical: the order of the transformations
Physical: the algorithmic choices

Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Conceptual Model: first attempts


Conceptual Model: The Data Mapping Diagram




Extension of UML to handle inter-attribute mappings


Conceptual Model: The Data Mapping Diagram




Aggregating computes the quarterly sales for each product.


Conceptual Model: Skoutas' annotations


Application vocabulary
VC = {product, store}
VP_product = {pid, pName, quantity, price, type, storage}
VP_store = {sid, sName, city, street}
VF_pid = {source_pid, dw_pid}
VF_sid = {source_sid, dw_sid}
VF_price = {dollars, euros}
VT_type = {software, hardware}
VT_city = {paris, rome, athens}

Datastore mappings

Datastore annotation
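A minimal sketch of how such an application vocabulary could be represented programmatically. The set names mirror the slide; the dictionary structure and the validity check below are assumptions for illustration, not part of the original model:

```python
# Application vocabulary: classes (VC), per-class properties (VP),
# per-property representation formats (VF), and value types (VT).
vocabulary = {
    "VC": {"product", "store"},
    "VP": {"product": {"pid", "pName", "quantity", "price", "type", "storage"},
           "store":   {"sid", "sName", "city", "street"}},
    "VF": {"pid":   {"source_pid", "dw_pid"},
           "sid":   {"source_sid", "dw_sid"},
           "price": {"dollars", "euros"}},
    "VT": {"type": {"software", "hardware"},
           "city": {"paris", "rome", "athens"}},
}

def annotation_is_valid(cls, props):
    """A datastore annotation should only use a known class and its properties."""
    return cls in vocabulary["VC"] and props <= vocabulary["VP"][cls]

# annotation_is_valid("product", {"pid", "price"}) is True
# annotation_is_valid("product", {"city"}) is False ("city" belongs to store)
```

A check like this is the point of the annotations: datastore mappings can be validated against the shared vocabulary before any flow is generated.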


Conceptual Model: Skoutas' annotations


The class hierarchy and the definition for class DS1_Products


Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Logical Model
[Figure: an example logical ETL scenario. Parts data from sources S1.PARTS and S2.PARTS is transferred (FTP1, FTP2) into the DSA; DIFF activities compare new and old snapshots (DS.PSNEW/DS.PSOLD) to isolate changed records; surrogate-key activities SK1 and SK2 consult LOOKUP_PS; further activities ($2, A2EDate, AddDate with DATE=SYSDATE, NotNull and PK checks) and a union populate DW.PARTS; aggregations (MIN(COST) per PKEY, DAY and AVG(COST) per PKEY, MONTH) feed views V1 and V2; rejected rows are sent to log files.]

Logical Model
Main question: what information should we put inside a metadata repository to be able to answer questions like:
what is the architecture of my DW back stage?
which attributes/tables are involved in the population of an attribute?
what part of the scenario is affected if we delete an attribute?


Architecture Graph
[Figure: the same example scenario drawn as an Architecture Graph, with data stores, activities, and attributes as nodes and provider relationships as edges.]

Architecture Graph
Example


Architecture Graph
Example


Optimization
Execution order

which is the proper execution order?



Optimization
Execution order

order equivalence? SK,f1,f2 or SK,f2,f1 or ... ?


Logical Optimization
Can we push selection early enough? Can we aggregate before $2 takes place?
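To illustrate the first question: a selection commutes with a transformation whenever it reads no attribute the transformation writes, and pushing it early shrinks the data the transformation must process. A toy sketch; the names and data are illustrative, not taken from the scenario:

```python
# Ten toy records; the selection keeps only PKEY < 3.
rows = [{"PKEY": k, "COST": 100 + k} for k in range(10)]

def convert(row):
    """Transformation: writes COST only (e.g. a currency conversion)."""
    return {**row, "COST": row["COST"] * 0.9}

def keep(row):
    """Selection: reads PKEY only."""
    return row["PKEY"] < 3

late  = [r for r in map(convert, rows) if keep(r)]   # convert all 10 rows, then filter
early = [convert(r) for r in rows if keep(r)]        # filter first, convert only 3 rows

# late == early: the plans are equivalent because keep() reads no
# attribute that convert() writes -- but the pushed-down plan does less work.
```

The same reasoning, applied per activity pair, is what a logical optimizer checks before reordering a flow.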


Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Logical to Physical
[Figure: the ETL-tool pipeline; the optimizer, which maps the logical scenario to a physical scenario via physical templates, is highlighted.]

Goal: identify the best possible physical implementation for a given logical ETL workflow.

Problem formulation
Given a logical-level ETL workflow GL, compute a physical-level ETL workflow GP such that:
the semantics of the workflow do not change
all constraints are met
the cost is minimal


Solution
We model the problem of finding the physical implementation of an ETL process as a state-space search problem.

States. A state is a graph GP that represents a physical-level ETL workflow. The initial state G0P is produced by a random assignment of physical implementations to logical activities w.r.t. preconditions and constraints.

Transitions. Given a state GP, a new state GP' is generated by replacing the implementation of a physical activity aP of GP with another valid implementation for the same activity.

Extension: the introduction of a sorter activity (at the physical level) as a new node in the graph. We intentionally introduce sorters to reduce execution and resumption costs.
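The state-space search can be sketched in a few lines. Everything here is illustrative: the activity names, the candidate implementations, and the cost numbers are assumptions, and a real optimizer would use a cost model and search heuristics rather than exhaustive enumeration:

```python
# Hypothetical cost table: for each logical activity, the candidate
# physical implementations and their (assumed) costs.
IMPLS = {
    "SK1":     {"nested_loops": 120, "merge_based": 80},
    "NotNull": {"scan": 10},
    "Join":    {"hash_join": 200, "sort_merge": 150},
}

def cost(state):
    """Cost of a physical workflow = sum of its chosen implementations."""
    return sum(IMPLS[act][impl] for act, impl in state.items())

def neighbors(state):
    """Transitions: swap one activity's implementation for another valid one."""
    for act, impl in state.items():
        for alt in IMPLS[act]:
            if alt != impl:
                yield {**state, act: alt}

def search(initial):
    """Walk the (tiny) state space exhaustively, keeping the cheapest state."""
    best, frontier, seen = initial, [initial], set()
    while frontier:
        s = frontier.pop()
        key = tuple(sorted(s.items()))
        if key in seen:
            continue
        seen.add(key)
        if cost(s) < cost(best):
            best = s
        frontier.extend(neighbors(s))
    return best

initial = {"SK1": "nested_loops", "NotNull": "scan", "Join": "hash_join"}
best = search(initial)
# best == {"SK1": "merge_based", "NotNull": "scan", "Join": "sort_merge"}
```

The sorter extension would add a second kind of transition that inserts a new sorter node rather than only swapping implementations.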

Sorters: impact
We intentionally introduce orderings (via appropriate physical-level sorter activities) towards obtaining physical plans of lower cost.
Semantics: unaffected.
Price to pay: the cost of sorting the stream of processed data.
Gain: it is possible to employ order-aware algorithms that significantly reduce processing cost, and to amortize the sorting cost over all activities that utilize common useful orderings.

Sorter gains

Without order:
cost(σi) = n, costSO = n·log2(n) + n
With an appropriate order available:
cost(σi) = seli · n, costSO = n

Example: Cost(G) = 100,000 + 10,000 + 3·[5,000·log2(5,000) + 5,000] ≈ 309,316.
If sorter SA,B is added to V: Cost(G) ≈ 247,877.

Interesting orders

A asc, A desc, and, over the attribute set {A, B}, the combined order [A,B]
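The totals on the Sorter gains slide can be reproduced mechanically. A sketch, assuming the cost formula n·log2(n) + n for a sort-based activity; the exact decomposition of the second total is an assumption, chosen so that it matches the reported 247,877 (sorter cost n = 5,000, with two of the three activities still sorting):

```python
import math

n = 5_000  # tuples reaching each of the three downstream activities

def sort_based(n):
    """Cost of a sort-based activity that must sort its own input: n*log2(n) + n."""
    return n * math.log2(n) + n

# No useful order: all three activities pay the full sorting cost.
cost_without = 100_000 + 10_000 + 3 * sort_based(n)   # ~309,316

# With sorter S_{A,B} added (assumed decomposition, see lead-in):
cost_with = 100_000 + 10_000 + 5_000 + 2 * sort_based(n)   # ~247,877
```

Whatever the exact breakdown, the point of the slide survives the arithmetic: paying for one sort up front is cheaper than letting each order-sensitive activity sort for itself.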

Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


A principled architecture for ETL


[Figure: the ETL-tool pipeline annotated by level. The conceptual level (schema mappings, conceptual-to-logical mapping) captures WHY; the logical scenario captures WHAT; the physical scenario, produced by the optimizer from physical templates and executed by the engine against the DW, captures HOW.]

Logical Model: Questions revisited


What information should we put inside a metadata repository to be able to answer questions like:

What is the architecture of my DW back stage? It is described as the Architecture Graph.
Which attributes/tables are involved in the population of an attribute? What part of the scenario is affected if we delete an attribute? Follow the appropriate path in the Architecture Graph.
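As a sketch of the second answer: impact analysis is plain reachability over the Architecture Graph. The node names below loosely echo the example scenario but are illustrative, and the adjacency-dict encoding is an assumption:

```python
from collections import deque

# Toy Architecture Graph: each edge points from a provider node to the
# node it populates.
provides = {
    "S1.PARTS.COST": ["DS.PS1.COST"],
    "DS.PS1.COST":   ["DW.PARTS.COST"],
    "DW.PARTS.COST": ["V1.MIN_COST", "V2.AVG_COST"],
}

def affected_by(node):
    """Everything reachable downstream from `node` is affected if we delete it."""
    seen, queue = set(), deque([node])
    while queue:
        cur = queue.popleft()
        for nxt in provides.get(cur, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Deleting the source COST attribute affects the staging copy, the DW
# attribute, and both derived views.
# affected_by("S1.PARTS.COST") ==
#   {"DS.PS1.COST", "DW.PARTS.COST", "V1.MIN_COST", "V2.AVG_COST"}
```

Running the traversal in the opposite direction (over reversed edges) answers the companion question of which attributes and tables participate in populating a given attribute.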

Fundamental questions on provenance & ETL

Why do we have a certain record in the DW?
Because there is a process (described by the Architecture Graph at the logical level, plus the conceptual model) that produces this kind of tuples.

Where did this record come from in my DW?
Hard! If there is a way to derive an inverse workflow that links the DW tuples to their sources, you can answer it. Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data. See Widom's work on record lineage.

Fundamental questions on provenance & ETL


How are updates to the sources managed?
(The update takes place at the source; the DW and data marts must be updated.) Done, although in a tedious way: mainly log sniffing, and also diff comparison of extracted snapshots.

When errors are discovered during the ETL process, how are they handled?
(The update takes place at the data staging area; the sources must be updated.) Too hard to back-fuse data into the sources, for both political and workload reasons. Currently, this is not automated.
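The diff comparison of extracted snapshots mentioned above can be sketched as follows. The keys and values are illustrative; a real extractor would key the snapshots on the source's primary key:

```python
# Yesterday's and today's extracted snapshots, keyed by primary key.
old = {1: ("bolt", 10), 2: ("nut", 5), 3: ("screw", 7)}
new = {1: ("bolt", 10), 2: ("nut", 8), 4: ("washer", 2)}

# Set operations on the key views classify every change.
inserted = {k: new[k] for k in new.keys() - old.keys()}
deleted  = {k: old[k] for k in old.keys() - new.keys()}
updated  = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}

# inserted == {4: ("washer", 2)}
# deleted  == {3: ("screw", 7)}
# updated  == {2: ("nut", 8)}
```

This is exactly what the DIFF activities in the example scenario do against the DS.PSNEW/DS.PSOLD snapshot pairs, only expressed over in-memory dicts instead of staged relations.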

Fundamental questions on provenance & ETL


What happens if there are updates to the schema of the involved data sources?
Currently this is not automated, although automating the task is part of the detail-independence vision.

What happens if we must update the workflow structure and semantics?
Nothing is versioned; still, there have not really been any user requests for this to be supported.

What is the equivalent of citations in ETL?
Nothing, really.


Thank you!

