
Data Provenance in ETL Scenarios

Panos Vassiliadis
University of Ioannina (joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)

Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL

PrOPr 2007


Data Warehouse Environment


Extract-Transform-Load (ETL)


ETL: importance
ETL and data-cleaning tools account for:
30% of the effort and expenses in the budget of the DW
55% of the total costs of DW runtime
80% of the development time in a DW project

The ETL market is a multi-million-dollar market: IBM paid $1.1 billion for Ascential.

ETL tools in the market


 

software packages vs. in-house development
no standard, no common model: most vendors implement a core set of operators and provide a GUI to create a data flow

Fundamental research question


Now: ETL designers work directly at the physical level (typically, via libraries of physical-level templates).
Challenge: can we design ETL flows as declaratively as possible?
Detail independence:
no care for the algorithmic choices
no care about the order of the transformations
(hopefully) no care for the details of the inter-attribute mappings

Now:
[Figure: current practice. The involved data stores and a library of physical templates feed a physical scenario, which the engine executes against the DW.]

Vision:
[Figure: the envisioned ETL tool. Schema mappings drive a conceptual-to-logical mapper; logical templates yield a logical scenario; an optimizer uses physical templates to produce the physical scenario, which the engine executes against the DW.]

Detail independence
[Figure: the same ETL-tool pipeline, annotated with what is automated at each level.]
Automate (as much as possible):
Conceptual: the details of the inter-attribute mappings
Logical: the order of the transformations
Physical: the algorithmic choices

Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Conceptual Model: first attempts


Conceptual Model: The Data Mapping Diagram




Extension of UML to handle inter-attribute mappings


Conceptual Model: The Data Mapping Diagram




Aggregating computes the quarterly sales for each product.


Conceptual Model: Skoutas' annotations


Application vocabulary
VC = {product, store}
VP_product = {pid, pName, quantity, price, type, storage}
VP_store = {sid, sName, city, street}
VF_pid = {source_pid, dw_pid}
VF_sid = {source_sid, dw_sid}
VF_price = {dollars, euros}
VT_type = {software, hardware}
VT_city = {paris, rome, athens}

Datastore mappings

Datastore annotation
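A minimal sketch of how such an application vocabulary could be represented programmatically. The set names mirror the slide; the dictionary structure and the validity check below are assumptions for illustration, not part of the original model:

```python
# Application vocabulary: classes (VC), per-class properties (VP),
# per-property representation formats (VF), and value types (VT).
vocabulary = {
    "VC": {"product", "store"},
    "VP": {"product": {"pid", "pName", "quantity", "price", "type", "storage"},
           "store":   {"sid", "sName", "city", "street"}},
    "VF": {"pid":   {"source_pid", "dw_pid"},
           "sid":   {"source_sid", "dw_sid"},
           "price": {"dollars", "euros"}},
    "VT": {"type": {"software", "hardware"},
           "city": {"paris", "rome", "athens"}},
}

def annotation_is_valid(cls, props):
    """A datastore annotation should only use a known class and its properties."""
    return cls in vocabulary["VC"] and props <= vocabulary["VP"][cls]

# annotation_is_valid("product", {"pid", "price"}) is True
# annotation_is_valid("product", {"city"}) is False ("city" belongs to store)
```

A check like this is the point of the annotations: datastore mappings can be validated against the shared vocabulary before any flow is generated.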


Conceptual Model: Skoutas' annotations


The class hierarchy and the definition for class DS1_Products


Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Logical Model
[Figure: an example logical ETL scenario. Parts data from sources S1.PARTS and S2.PARTS is transferred (FTP1, FTP2) into the DSA; DIFF activities compare new and old snapshots (DS.PSNEW/DS.PSOLD) to isolate changed records; surrogate-key activities SK1 and SK2 consult LOOKUP_PS; further activities ($2, A2EDate, AddDate with DATE=SYSDATE, NotNull and PK checks) and a union populate DW.PARTS; aggregations (MIN(COST) per PKEY, DAY and AVG(COST) per PKEY, MONTH) feed views V1 and V2; rejected rows are sent to log files.]

Logical Model
Main question: what information should we put inside a metadata repository to be able to answer questions like:
what is the architecture of my DW back stage?
which attributes/tables are involved in the population of an attribute?
what part of the scenario is affected if we delete an attribute?


Architecture Graph
[Figure: the same example scenario drawn as an Architecture Graph, with data stores, activities, and attributes as nodes and provider relationships as edges.]

Architecture Graph
Example


Architecture Graph
Example


Optimization
Execution order

which is the proper execution order?



Optimization
Execution order

order equivalence? SK,f1,f2 or SK,f2,f1 or ... ?


Logical Optimization
Can we push selection early enough? Can we aggregate before $2 takes place?
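To illustrate the first question: a selection commutes with a transformation whenever it reads no attribute the transformation writes, and pushing it early shrinks the data the transformation must process. A toy sketch; the names and data are illustrative, not taken from the scenario:

```python
# Ten toy records; the selection keeps only PKEY < 3.
rows = [{"PKEY": k, "COST": 100 + k} for k in range(10)]

def convert(row):
    """Transformation: writes COST only (e.g. a currency conversion)."""
    return {**row, "COST": row["COST"] * 0.9}

def keep(row):
    """Selection: reads PKEY only."""
    return row["PKEY"] < 3

late  = [r for r in map(convert, rows) if keep(r)]   # convert all 10 rows, then filter
early = [convert(r) for r in rows if keep(r)]        # filter first, convert only 3 rows

# late == early: the plans are equivalent because keep() reads no
# attribute that convert() writes -- but the pushed-down plan does less work.
```

The same reasoning, applied per activity pair, is what a logical optimizer checks before reordering a flow.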


Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


Logical to Physical
[Figure: the ETL-tool pipeline; the optimizer, which maps the logical scenario to a physical scenario via physical templates, is highlighted.]

Goal: identify the best possible physical implementation for a given logical ETL workflow.

Problem formulation
Given a logical-level ETL workflow GL, compute a physical-level ETL workflow GP such that:
the semantics of the workflow do not change
all constraints are met
the cost is minimal


Solution
We model the problem of finding the physical implementation of an ETL process as a state-space search problem.

States. A state is a graph GP that represents a physical-level ETL workflow. The initial state G0P is produced by a random assignment of physical implementations to logical activities w.r.t. preconditions and constraints.

Transitions. Given a state GP, a new state GP' is generated by replacing the implementation of a physical activity aP of GP with another valid implementation for the same activity.

Extension: the introduction of a sorter activity (at the physical level) as a new node in the graph. We intentionally introduce sorters to reduce execution and resumption costs.
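The state-space search can be sketched in a few lines. Everything here is illustrative: the activity names, the candidate implementations, and the cost numbers are assumptions, and a real optimizer would use a cost model and search heuristics rather than exhaustive enumeration:

```python
# Hypothetical cost table: for each logical activity, the candidate
# physical implementations and their (assumed) costs.
IMPLS = {
    "SK1":     {"nested_loops": 120, "merge_based": 80},
    "NotNull": {"scan": 10},
    "Join":    {"hash_join": 200, "sort_merge": 150},
}

def cost(state):
    """Cost of a physical workflow = sum of its chosen implementations."""
    return sum(IMPLS[act][impl] for act, impl in state.items())

def neighbors(state):
    """Transitions: swap one activity's implementation for another valid one."""
    for act, impl in state.items():
        for alt in IMPLS[act]:
            if alt != impl:
                yield {**state, act: alt}

def search(initial):
    """Walk the (tiny) state space exhaustively, keeping the cheapest state."""
    best, frontier, seen = initial, [initial], set()
    while frontier:
        s = frontier.pop()
        key = tuple(sorted(s.items()))
        if key in seen:
            continue
        seen.add(key)
        if cost(s) < cost(best):
            best = s
        frontier.extend(neighbors(s))
    return best

initial = {"SK1": "nested_loops", "NotNull": "scan", "Join": "hash_join"}
best = search(initial)
# best == {"SK1": "merge_based", "NotNull": "scan", "Join": "sort_merge"}
```

The sorter extension would add a second kind of transition that inserts a new sorter node rather than only swapping implementations.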

Sorters: impact
We intentionally introduce orderings (via appropriate physical-level sorter activities) towards obtaining physical plans of lower cost.
Semantics: unaffected.
Price to pay: the cost of sorting the stream of processed data.
Gain: it is possible to employ order-aware algorithms that significantly reduce processing cost, and to amortize the sorting cost over all activities that utilize common useful orderings.

Sorter gains

Without order:
cost(σi) = n, costSO = n·log2(n) + n
With an appropriate order available:
cost(σi) = seli · n, costSO = n

Example: Cost(G) = 100,000 + 10,000 + 3·[5,000·log2(5,000) + 5,000] ≈ 309,316.
If sorter SA,B is added to V: Cost(G) ≈ 247,877.

Interesting orders

A asc, A desc, and, over the attribute set {A, B}, the combined order [A,B]
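The totals on the Sorter gains slide can be reproduced mechanically. A sketch, assuming the cost formula n·log2(n) + n for a sort-based activity; the exact decomposition of the second total is an assumption, chosen so that it matches the reported 247,877 (sorter cost n = 5,000, with two of the three activities still sorting):

```python
import math

n = 5_000  # tuples reaching each of the three downstream activities

def sort_based(n):
    """Cost of a sort-based activity that must sort its own input: n*log2(n) + n."""
    return n * math.log2(n) + n

# No useful order: all three activities pay the full sorting cost.
cost_without = 100_000 + 10_000 + 3 * sort_based(n)   # ~309,316

# With sorter S_{A,B} added (assumed decomposition, see lead-in):
cost_with = 100_000 + 10_000 + 5_000 + 2 * sort_based(n)   # ~247,877
```

Whatever the exact breakdown, the point of the slide survives the arithmetic: paying for one sort up front is cheaper than letting each order-sensitive activity sort for itself.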

Outline
Introduction
Conceptual Level
Logical Level
Physical Level
Provenance & ETL


A principled architecture for ETL


[Figure: the ETL-tool pipeline annotated by level. The conceptual level (schema mappings, conceptual-to-logical mapping) captures WHY; the logical scenario captures WHAT; the physical scenario, produced by the optimizer from physical templates and executed by the engine against the DW, captures HOW.]

Logical Model: Questions revisited


What information should we put inside a metadata repository to be able to answer questions like:

What is the architecture of my DW back stage? It is described as the Architecture Graph.
Which attributes/tables are involved in the population of an attribute? What part of the scenario is affected if we delete an attribute? Follow the appropriate path in the Architecture Graph.
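As a sketch of the second answer: impact analysis is plain reachability over the Architecture Graph. The node names below loosely echo the example scenario but are illustrative, and the adjacency-dict encoding is an assumption:

```python
from collections import deque

# Toy Architecture Graph: each edge points from a provider node to the
# node it populates.
provides = {
    "S1.PARTS.COST": ["DS.PS1.COST"],
    "DS.PS1.COST":   ["DW.PARTS.COST"],
    "DW.PARTS.COST": ["V1.MIN_COST", "V2.AVG_COST"],
}

def affected_by(node):
    """Everything reachable downstream from `node` is affected if we delete it."""
    seen, queue = set(), deque([node])
    while queue:
        cur = queue.popleft()
        for nxt in provides.get(cur, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Deleting the source COST attribute affects the staging copy, the DW
# attribute, and both derived views.
# affected_by("S1.PARTS.COST") ==
#   {"DS.PS1.COST", "DW.PARTS.COST", "V1.MIN_COST", "V2.AVG_COST"}
```

Running the traversal in the opposite direction (over reversed edges) answers the companion question of which attributes and tables participate in populating a given attribute.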

Fundamental questions on provenance & ETL

Why do we have a certain record in the DW?
Because there is a process (described by the Architecture Graph at the logical level, plus the conceptual model) that produces this kind of tuples.

Where did this record come from in my DW?
Hard! If there is a way to derive an inverse workflow that links the DW tuples to their sources, you can answer it. Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data. See Widom's work on record lineage.

Fundamental questions on provenance & ETL


How are updates to the sources managed?
(The update takes place at the source; the DW and data marts must be updated.) Done, although in a tedious way: mainly log sniffing, and also diff comparison of extracted snapshots.

When errors are discovered during the ETL process, how are they handled?
(The update takes place at the data staging area; the sources must be updated.) Too hard to back-fuse data into the sources, for both political and workload reasons. Currently, this is not automated.
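The diff comparison of extracted snapshots mentioned above can be sketched as follows. The keys and values are illustrative; a real extractor would key the snapshots on the source's primary key:

```python
# Yesterday's and today's extracted snapshots, keyed by primary key.
old = {1: ("bolt", 10), 2: ("nut", 5), 3: ("screw", 7)}
new = {1: ("bolt", 10), 2: ("nut", 8), 4: ("washer", 2)}

# Set operations on the key views classify every change.
inserted = {k: new[k] for k in new.keys() - old.keys()}
deleted  = {k: old[k] for k in old.keys() - new.keys()}
updated  = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}

# inserted == {4: ("washer", 2)}
# deleted  == {3: ("screw", 7)}
# updated  == {2: ("nut", 8)}
```

This is exactly what the DIFF activities in the example scenario do against the DS.PSNEW/DS.PSOLD snapshot pairs, only expressed over in-memory dicts instead of staged relations.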

Fundamental questions on provenance & ETL


What happens if there are updates to the schema of the involved data sources?
Currently this is not automated, although automating the task is part of the detail-independence vision.

What happens if we must update the workflow structure and semantics?
Nothing is versioned; still, there have not really been any user requests for this to be supported.

What is the equivalent of citations in ETL?
Nothing, really.


Thank you!

