
Extract, Transform, Load

(ETL)

SAD 2007/08 H.Galhardas

Overview
General ETL issues:
Data staging area (DSA)
Building dimensions
Building fact tables
Extract
Load
Transformation/cleaning
Commercial (and open-source) tools
The AJAX data cleaning and transformation
framework



DW Phases
Design phase
Modeling, DB design, source selection,
Loading phase
First load/population of the DW
Based on all data in sources
Refreshment phase
Keep the DW up-to-date wrt. source data
changes


The ETL Process


The most underestimated process in DW development

The most time-consuming process in DW development


Often, 80% of development time is spent on ETL

Extract
Extract relevant data
Transform
Transform data to DW format
Build keys, etc.
Cleansing of data
Load
Load data into DW
Build aggregates, etc.



DW Architecture

(Figure: the classic data warehouse architecture.)
Data sources (operational DBs and other sources) feed a data staging area
via Extract / Transform / Load / Refresh. The data storage layer holds the
data warehouse and data marts, governed by a monitor & integrator and by
metadata. An OLAP server (OLAP engine) sits on top, serving the front-end
tools: analysis, query/reports, and data mining.

ETL process

The Extract / Transform / Load / Refresh path from operational DBs and
other sources, through the data staging area, into the data warehouse is
implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services



Data Staging Area
Transit storage for data underway in the ETL
process
Transformations/cleansing done here
No user queries (though some allow them)
Sequential operations (few) on large data volumes
Performed by central ETL logic
Easily restarted
No need for locking, logging, etc.
RDBMS or flat files? (DBMS have become better at this)
Finished dimensions copied from DSA to relevant
marts


ETL construction process


Plan
1) Make high-level diagram of source-destination flow
2) Test, choose and implement ETL tool
3) Outline complex transformations, key generation and job sequence for
every destination table
Construction of dimensions
4) Construct and test static dimension build
5) Construct and test change mechanisms for one dimension
6) Construct and test remaining dimension builds
Construction of fact tables and automation
7) Construct and test initial fact table build
8) Construct and test incremental update
9) Construct and test aggregate build
10) Design, construct, and test ETL automation



Building Dimensions
Static dimension table
Assignment of keys: map production keys to DW surrogate keys using a
mapping table
Combination of data sources: find a common key
Handling dimension changes
Slowly changing dimensions
Find newest DW key for a given production key
Table for mapping production keys to DW keys
must be updated
Load of dimensions
Small dimensions: replace
Large dimensions: load only changes
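The key handling above can be sketched in a few lines. This is an illustrative sketch only (the names `key_map`, `dw_key_for` and the type-2-style "issue a new key on change" policy are assumptions, not from the slides): a mapping from production keys to DW surrogate keys, where the newest DW key for a production key is returned, and a change in the dimension row triggers a new surrogate key.

```python
# Sketch: production-key -> DW surrogate-key mapping for a slowly
# changing dimension. Illustrative names; not the slides' implementation.
from itertools import count

_counter = count(1)   # surrogate key generator
key_map = {}          # production key -> list of DW keys (newest last)

def dw_key_for(prod_key, changed=False):
    """Return the newest DW key for a production key, issuing a new
    surrogate key on first sight or when the dimension row changed."""
    if prod_key not in key_map or changed:
        key_map.setdefault(prod_key, []).append(next(_counter))
    return key_map[prod_key][-1]
```

Looking up the same unchanged production key always yields the same (newest) DW key; a change adds a new entry to the mapping table, which is exactly the table that "must be updated" above.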


Building fact tables


Two types of load:
Initial load
ETL for all data up till now
Done when DW is started the first time
Often problematic to get correct historical data
Very heavy: large data volumes
Incremental update
Move only changes since last load
Done periodically (../month/week/day/hour/...) after DW start
Less heavy: smaller data volumes
Dimensions must be updated before facts
The relevant dimension rows for new facts must be in place
Special key considerations if initial load must be performed
again



Types of data sources
Non-cooperative sources
Snapshot sources: provide only a full copy of the source
Specific sources: each is different, e.g., legacy systems
Logged sources: write a change log (DB log)
Queryable sources: provide a query interface, e.g., SQL
Cooperative sources
Replicated sources: publish/subscribe mechanism
Call-back sources: call external code (ETL) when changes occur
Internal action sources: only internal actions when changes occur
(DB triggers are an example)

The extract strategy depends heavily on the source types



Extract phase
Goal: fast extract of relevant data
Extract from source systems can take a long time

Types of extracts:
Extract applications (SQL): co-existence with other applications
DB unload tools: much faster than SQL-based extracts
Extract applications are sometimes the only solution

Often too time-consuming to ETL all data at every load:
Extracts can take days/weeks
Drain on the operational systems
Drain on DW systems
=> Extract/ETL only changes since last load (delta)



Computing deltas
Much faster to only ETL changes since last load

A number of methods can be used


Store sorted total extracts in DSA
Delta can easily be computed from current+last extract
+ Always possible
+ Handles deletions
- Does not reduce extract time
Put update timestamp on all rows
Updated by DB trigger
Extract only where timestamp > time for last extract
+ Reduces extract time
+/- Less operational overhead
- Cannot (alone) handle deletions
- Source system must be changed
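The first method above (compute the delta from the current and last total extracts) can be sketched as follows. The slides suggest sorted extracts and a merge; a hash-based variant is shown here because it is shorter, and row/key names are illustrative.

```python
# Sketch: compute a delta (inserts / updates / deletes) from the previous
# and current full extracts held in the DSA, keyed by a primary key.
# Rows are dicts; "id" is an illustrative key column.

def compute_delta(last, current, key="id"):
    old = {r[key]: r for r in last}
    new = {r[key]: r for r in current}
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, updates, deletes
```

Note how this variant, like the sorted-extract method, handles deletions (a key present last time but missing now), which the timestamp method alone cannot.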

Load (1)
Goal: fast loading into DW
Loading deltas is much faster than total load

SQL-based update is slow


Large overhead (optimization, locking, etc.) for every SQL call
DB load tools are much faster
Some load tools can also perform UPDATEs
Index on tables slows load a lot
Drop index and rebuild after load
Can be done per partition
Parallelization
Dimensions can be loaded concurrently
Fact tables can be loaded concurrently
Partitions can be loaded concurrently



Load (2)
Relationships in the data
Referential integrity must be ensured
Can be done by loader
Aggregates
Must be built and loaded at the same time as the detail data
Today, RDBMSes can often do this
Load tuning
Load without log
Sort load file first
Make only simple transformations in loader
Use loader facilities for building aggregates
Use loader within the same database

Should DW be on-line 24*7?


Use partitions or several sets of tables
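The load-tuning advice above (avoid per-row SQL calls, build indexes after the load) can be sketched with sqlite3 standing in for a real DW loader; table and column names are illustrative.

```python
# Sketch: "bulk-load first, build index after" pattern, using sqlite3 as a
# stand-in for a DW bulk loader. Illustrative schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, product TEXT, amount REAL)")

rows = [(d, "p%d" % (d % 3), float(d)) for d in range(1000)]

# One batched call inside one transaction, instead of 1000 separate
# INSERT statements each paying optimization/locking overhead.
with conn:
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Index built once, after the load, rather than maintained row by row.
conn.execute("CREATE INDEX idx_sales_day ON sales(day)")

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

A real loader adds the further tricks from the slide (no logging, pre-sorted load files, per-partition loads); the ordering of work is the point here.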

Overview
General ETL issues:
Data staging area (DSA)
Building dimensions
Building fact tables
Extract
Load
Transformation/cleaning
Commercial (and open-source) tools
The AJAX data cleaning and transformation
framework



Data Cleaning

The activity of converting source data into target data without errors,
duplicates, and inconsistencies, i.e., cleaning and transforming to get
high-quality data!


Why Data Cleaning and Transformation?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., occupation=
noisy: containing errors or outliers (spelling, phonetic
and typing errors, word transpositions, multiple values
in a single free-form field)
e.g., Salary=-10
inconsistent: containing discrepancies in codes or
names (synonyms and nicknames, prefix and suffix
variations, abbreviations, truncation and initials)
e.g., Age=42 Birthday=03/07/1997
e.g., Was rating 1,2,3, now rating A, B, C
e.g., discrepancy between duplicate records
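The Age/Birthday inconsistency above is mechanically checkable. A sketch, under stated assumptions: the birthday is normalized to ISO format first (the slide's 03/07/1997 is ambiguous), a reference date must be supplied, and the function name is illustrative.

```python
# Sketch: flag records where a stored Age disagrees with Birthday,
# relative to a given reference date. Illustrative names and formats.
from datetime import date

def age_consistent(age, birthday_iso, today):
    """True if `age` matches the age implied by `birthday_iso` on `today`."""
    born = date.fromisoformat(birthday_iso)
    years = today.year - born.year - (
        (today.month, today.day) < (born.month, born.day))
    return years == age
```

For a birthday in 1997, Age=42 fails this check for any reference date near the course year, which is exactly the kind of discrepancy a cleaning step should surface.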
Why Is Data Dirty?
Incomplete data comes from:
non available data value when collected
different criteria between the time the data was collected and the time
it is analyzed
human/hardware/software problems
Noisy data comes from:
data collection: faulty instruments
data entry: human or computer errors
data transmission
Inconsistent (and redundant) data comes from:
Different data sources, so non uniform naming conventions/data
codes
Functional dependency and/or referential integrity violation

Why Is Data Cleaning Important?

Data warehouses need consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of
the work of building a data warehouse

No quality data, no quality decisions!


Quality decisions must be based on quality data (e.g.,
duplicate or missing data may cause incorrect or even
misleading statistics)
Types of data cleaning
Conversion, parsing and normalization
Text coding, date formats, etc.
Most common type of cleansing
Special-purpose cleansing
Normalize spellings of names, addresses, etc.
Remove duplicates, e.g., duplicate customers
Domain-independent cleansing
Approximate, fuzzy joins on not-quite-matching
keys
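The first category above (conversion, parsing and normalization) can be sketched as follows. The accepted date formats and the function names are assumptions for illustration, not from the slides.

```python
# Sketch: conversion/normalization cleansing steps for dates and names.
# The recognized formats are illustrative.
from datetime import datetime

def normalize_date(s):
    """Parse a date written in any of a few known formats into ISO 8601."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%b %d, %Y"):
        try:
            return datetime.strptime(s.strip(), fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError("unrecognized date: %r" % s)

def normalize_name(s):
    """Collapse whitespace and normalize capitalization."""
    return " ".join(s.split()).title()
```

Records whose dates match none of the known formats raise an error, which fits naturally with the exception-tuple mechanism discussed later for AJAX.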


Data quality vs cleaning

Data quality = data cleaning +
Data enrichment: enhancing the value of internally held data by appending
related attributes from external sources (for example, consumer
demographic attributes or geographic descriptors)
Data profiling: analysis of data to capture statistics (metadata) that
provide insight into the quality of the data and aid in the
identification of data quality issues
Data monitoring: deployment of controls to ensure ongoing conformance of
data to the business rules that define data quality for the organization
Data stewards responsible for data quality
DW-controlled improvement
Source-controlled improvement
Construct programs to check data quality
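Data profiling, as defined above, reduces to collecting per-column statistics. A minimal sketch (record shape and names are illustrative):

```python
# Sketch: minimal data profiling - per-column row count, null count and
# distinct-value count, the statistics that hint at quality issues.

def profile(rows):
    stats = {}
    for row in rows:
        for col, val in row.items():
            s = stats.setdefault(col, {"rows": 0, "nulls": 0, "values": set()})
            s["rows"] += 1
            if val is None:
                s["nulls"] += 1
            else:
                s["values"].add(val)
    return {c: {"rows": s["rows"], "nulls": s["nulls"],
                "distinct": len(s["values"])}
            for c, s in stats.items()}
```

A high null ratio or an unexpectedly low distinct count is precisely the kind of metadata a "check data quality" program would report to the data stewards.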



Overview
General ETL issues:
Data staging area (DSA)
Building dimensions
Building fact tables
Extract
Load
Transformation/cleaning
Commercial (and open-source) tools
The AJAX data cleaning and transformation
framework


ETL tools
ETL tools from the big vendors, e.g.,
Oracle Warehouse Builder
IBM DB2 Warehouse Manager
Microsoft Integration Services
Offer much functionality at a reasonable price (included):
Data modeling
ETL code generation
Scheduling DW jobs

Many others
Hundreds of tools
Often specialized tools for certain jobs (insurance cleansing, ...)
The best tool does not exist:
Choose based on your own needs
Check first if the standard tools from the big vendors are OK
ETL and data quality tools
http://www.etltool.com/
Magic Quadrant for Data Quality Tools, 2007
Some open source ETL tools:
Talend
Enhydra Octopus
Clover.ETL
Not so many open source quality/cleaning
tools


Application context
Integrate data from different sources
E.g., populating a DW from different operational data stores
Eliminate errors and duplicates within a single source
E.g., duplicates in a file of customers
Migrate data from a source schema into a different fixed target schema
E.g., discontinued application packages
Convert poorly structured data into structured data
E.g., processing data collected from the Web



The AJAX data transformation
and cleaning framework


Motivating example (1)

Source:
DirtyData(paper: String)

Data cleaning & transformation maps it into the target schema:
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
Events(eventKey, name)
PubsAuthors(pubKey, authorKey)



Motivating example (2)

DirtyData:
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer
Widom. Making Views Self-Maintainable for Data Warehousing. In
Proceedings of the Conference on Parallel and Distributed Information
Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-
maintianable for data warehousing, PDIS95

After data cleaning & transformation:
Authors: DQua | Dallan Quass; AGup | Ashish Gupta; JWid | Jennifer Widom; ...
Publications: QGMW96 | Making Views Self-Maintainable for Data
Warehousing | PDIS | null | null | null | null | Miami Beach, Florida,
USA | 1996
Events: PDIS | Conference on Parallel and Distributed Information Systems
PubsAuthors: QGMW96 | DQua; QGMW96 | AGup; ...


Modeling a data cleaning process

A data cleaning process is modeled by a directed acyclic graph of data
transformations:

DirtyData
-> Formatting
-> Standardization (using Cities and Tags dictionaries)
-> Extraction (producing DirtyAuthors, DirtyTitles, DirtyEvents, ...)
-> Duplicate Elimination
-> Authors (and the other clean tables)
Existing technology
Ad-hoc programs written in a programming language like C or Java, or
using an RDBMS proprietary language
Programs are difficult to optimize and maintain

Data transformation scripts using an ETL
(Extraction-Transformation-Loading) tool or a data quality tool


Problems of ETL and data quality solutions (1)

Data cleaning transformations are scattered across application domains
(App. Domain 1, App. Domain 2, App. Domain 3, ...)
The semantics of some data transformations is defined in terms of their
implementation algorithms
Problems of ETL and data quality solutions (2)

(Figure: a cleaning process consumes dirty data and produces clean data
plus rejected data.)
There is a lack of interactive facilities to tune a data cleaning
application program

AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms

A declarative language for logical operators


SQL extension

A debugger facility for tuning a data cleaning application program
Based on a mechanism of exceptions



AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms

A declarative language for logical operators


SQL extension

A debugger facility for tuning a data cleaning application program
Based on a mechanism of exceptions


Logical level: parametric operators

View: arbitrary SQL query
Map: iterator-based one-to-many mapping with arbitrary user-defined
functions
Match: iterator-based approximate join
Cluster: uses an arbitrary clustering function
Merge: extends SQL group-by with user-defined aggregate functions
Apply: executes an arbitrary user-defined algorithm



Logical level

DirtyData -> Formatting -> Standardization (Cities, Tags) -> Extraction
-> DirtyAuthors, DirtyTitles, ... -> Duplicate Elimination -> Authors

Logical level -> Physical level

The same pipeline (DirtyData -> ... -> Authors) appears on both sides;
each logical operator is assigned a physical execution algorithm
(TC = transitive closure, NL = nested loop):

Duplicate Elimination: Merge (Java Scan), Cluster (TC), Match (NL)
Extraction: Map (Java Scan)
Standardization: Map (Java Scan)
Formatting: Map (SQL Scan)
AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms

A declarative language for logical operators


SQL extension

A debugger facility for tuning a data cleaning application program
Based on a mechanism of exceptions


Match
Input: 2 relations
Finds data records that correspond to the
same real object
Calls distance functions for comparing field
values and computing the distance
between input tuples
Output: 1 relation containing matching
tuples and possibly 1 or 2 relations
containing non-matching tuples
Example

Duplicate Elimination over DirtyAuthors as a chain of operators:

DirtyAuthors -> Match (producing MatchAuthors) -> Cluster -> Merge
-> Authors

Example

The Match step of the Duplicate Elimination chain
(DirtyAuthors -> Match -> Cluster -> Merge -> Authors), declaratively:

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors


Example

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors

Input:
DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag

Output:
MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...
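The semantics of this Match can be sketched in Python, with editDistance written out as the classic Levenshtein dynamic program. The threshold value (2) and the nested-loop evaluation are illustrative; the slides discuss better execution strategies next.

```python
# Sketch of the Match operator: approximate self-join on author names
# using Levenshtein edit distance. maxDist = 2 is an illustrative choice.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match(records, max_dist=2):
    """Nested-loop evaluation: emit every pair closer than max_dist."""
    out = []
    for i, (k1, n1) in enumerate(records):
        for k2, n2 in records[i + 1:]:
            if edit_distance(n1, n2) < max_dist:
                out.append((k1, k2, n1, n2))
    return out
```

With this small threshold only the closest name variants pair up; matching "johann christoph freytag" with its abbreviations, as in the slide's output, needs a larger or length-normalized threshold.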

Implementation of the match operator

For all s1 in S1, s2 in S2:
(s1, s2) is a match if editDistance(s1, s2) < maxDist



Nested loop

Compute editDistance(s1, s2) for every pair in S1 x S2.

Very expensive evaluation when handling large amounts of data
Need alternative execution algorithms for the same logical specification

A database solution

CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;

No optimization supported for a Cartesian product with external
function calls
Window scanning

Sort S on some key, then slide a fixed-size window over it and compare
only records that fall within the same window.

May lose some matches
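A minimal sketch of window scanning (the sort key and window size are parameters the user would choose; names are illustrative):

```python
# Sketch: window scanning (sorted-neighborhood blocking). Sort the records,
# then emit only candidate pairs that lie within w positions of each other,
# instead of all n*(n-1)/2 pairs of a nested loop.

def window_pairs(records, key=lambda r: r, w=3):
    s = sorted(records, key=key)
    for i in range(len(s)):
        for j in range(i + 1, min(i + w, len(s))):
            yield s[i], s[j]
```

The candidate set shrinks from quadratic to roughly n*w pairs, at the cost the slide notes: two true matches that sort more than w positions apart are never compared.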

String distance filtering

Map each string to its length: the length difference |len(s1) - len(s2)|
is a lower bound on editDistance(s1, s2). With maxDist = 1, "John Smith"
and "Jogn Smith" need only be compared against strings of length - 1
("John Smit"), the same length, and length + 1 ("John Smithe") before
editDistance is ever called.
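The pruning predicate can be sketched directly; for clarity the sketch tests it inside a nested loop, whereas a real implementation would bucket strings by length first (the function name and the sample names from the slide are illustrative).

```python
# Sketch: distance filtering on string length. Since |len(s) - len(t)| is
# a lower bound on editDistance(s, t), any pair whose lengths differ by
# more than max_dist cannot match and is pruned without computing the
# expensive edit distance.

def candidate_pairs(strings, max_dist=1):
    """Keep only pairs that could possibly be within max_dist edits."""
    out = []
    for i, s in enumerate(strings):
        for t in strings[i + 1:]:
            if abs(len(s) - len(t)) <= max_dist:
                out.append((s, t))
    return out
```

This is exactly the "map = length; dist = abs" annotation shown on the next slide: map each record to its length and filter on the absolute difference.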



Annotation-based optimization

The user specifies types of optimization
The system suggests which algorithm to use

Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map = length; dist = abs %
INTO MatchAuthors

Declarative specification

DEFINE FUNCTIONS AS
  Choose.uniqueString(OBJECT[]) RETURN STRING
    THROWS CiteSeerException
  Generate.generateId(INTEGER) RETURN STRING
  Normal.removeCitationTags(STRING) RETURN STRING (600)

DEFINE ALGORITHMS AS
  TransitiveClosure
  SourceClustering(STRING)

DEFINE INPUT DATA FLOWS AS
  TABLE DirtyData
    (paper STRING (400)
    );
  TABLE City
    (city STRING (80),
     citysyn STRING (80)
    )
  KEY city, citysyn;

DEFINE TRANSFORMATIONS AS
  CREATE MAPPING mapKeDiDa
  FROM DirtyData Dd
  LET keyKdd = generateId(1)
  {SELECT keyKdd AS paperKey, Dd.paper AS paper
   KEY paperKey
   CONSTRAINT NOT NULL mapKeDiDa.paper
  }
AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms

A declarative language for logical operators


SQL extension

A debugger facility for tuning a data cleaning application program
Based on a mechanism of exceptions


Management of exceptions

Problem: mark tuples that are not handled by the cleaning criteria of
an operator

Solution: specify the generation of exception tuples within a logical
operator, when:
exceptions are thrown by external functions
output constraints are violated
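The mechanism can be sketched as an operator wrapper that routes failing tuples to an exceptions stream instead of aborting the run; the function names and the two-output shape are illustrative, mirroring the "clean data / rejected data" split of the earlier slide.

```python
# Sketch: apply a cleaning function tuple by tuple, collecting tuples
# whose external function threw into an exceptions stream for later
# inspection by the debugger.

def apply_with_exceptions(fn, tuples):
    clean, exceptions = [], []
    for t in tuples:
        try:
            clean.append(fn(t))
        except Exception as e:              # external function threw
            exceptions.append((t, str(e)))  # keep the tuple + the reason
    return clean, exceptions
```

The exceptions stream is what a debugger facility can then trace backward and forward through the transformation graph.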



Debugger facility

Supports the (backward and forward) data derivation of tuples w.r.t. an
operator, to debug exceptions

Supports interactive data modification and, in the future, the
incremental execution of logical operators

Architecture



To see it working....

Workshop-UQ?
Tomorrow at Tagus Park,
10H-16H

