
Extract, Transform, Load

(ETL)

SAD 2007/08 H.Galhardas

Overview
General ETL issues:
Data staging area (DSA)
Building dimensions
Building fact tables
Extract
Load
Transformation/cleaning
Commercial (and open-source) tools
The AJAX data cleaning and transformation
framework



DW Phases
Design phase
Modeling, DB design, source selection,
Loading phase
First load/population of the DW
Based on all data in sources
Refreshment phase
Keep the DW up-to-date wrt. source data
changes


The ETL Process


The most underestimated process in DW development

The most time-consuming process in DW development


Often, 80% of development time is spent on ETL

Extract
Extract relevant data
Transform
Transform data to DW format
Build keys, etc.
Cleansing of data
Load
Load data into DW
Build aggregates, etc.



DW Architecture

(Figure: the classic data warehouse architecture.)
Data sources (operational DBs and other sources) feed a data staging area
via Extract / Transform / Load / Refresh. The data storage layer holds the
data warehouse and data marts, governed by a monitor & integrator and by
metadata. An OLAP server (OLAP engine) sits on top, serving the front-end
tools: analysis, query/reports, and data mining.

ETL process

The Extract / Transform / Load / Refresh path from operational DBs and
other sources, through the data staging area, into the data warehouse is
implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services



Data Staging Area
Transit storage for data underway in the ETL
process
Transformations/cleansing done here
No user queries (though some allow them)
Sequential operations (few) on large data volumes
Performed by central ETL logic
Easily restarted
No need for locking, logging, etc.
RDBMS or flat files? (DBMS have become better at this)
Finished dimensions copied from DSA to relevant
marts


ETL construction process


Plan
1) Make high-level diagram of source-destination flow
2) Test, choose and implement ETL tool
3) Outline complex transformations, key generation and job sequence for
every destination table
Construction of dimensions
4) Construct and test static dimension build
5) Construct and test change mechanisms for one dimension
6) Construct and test remaining dimension builds
Construction of fact tables and automation
7) Construct and test initial fact table build
8) Construct and test incremental update
9) Construct and test aggregate build
10) Design, construct, and test ETL automation



Building Dimensions
Static dimension table
Assignment of keys: map production keys to DW surrogate keys using a
mapping table
Combination of data sources: find a common key
Handling dimension changes
Slowly changing dimensions
Find newest DW key for a given production key
Table for mapping production keys to DW keys
must be updated
Load of dimensions
Small dimensions: replace
Large dimensions: load only changes
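The key handling above can be sketched in a few lines. This is an illustrative sketch only (the names `key_map`, `dw_key_for` and the type-2-style "issue a new key on change" policy are assumptions, not from the slides): a mapping from production keys to DW surrogate keys, where the newest DW key for a production key is returned, and a change in the dimension row triggers a new surrogate key.

```python
# Sketch: production-key -> DW surrogate-key mapping for a slowly
# changing dimension. Illustrative names; not the slides' implementation.
from itertools import count

_counter = count(1)   # surrogate key generator
key_map = {}          # production key -> list of DW keys (newest last)

def dw_key_for(prod_key, changed=False):
    """Return the newest DW key for a production key, issuing a new
    surrogate key on first sight or when the dimension row changed."""
    if prod_key not in key_map or changed:
        key_map.setdefault(prod_key, []).append(next(_counter))
    return key_map[prod_key][-1]
```

Looking up the same unchanged production key always yields the same (newest) DW key; a change adds a new entry to the mapping table, which is exactly the table that "must be updated" above.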


Building fact tables


Two types of load:
Initial load
ETL for all data up till now
Done when DW is started the first time
Often problematic to get correct historical data
Very heavy: large data volumes
Incremental update
Move only changes since last load
Done periodically (../month/week/day/hour/...) after DW start
Less heavy: smaller data volumes
Dimensions must be updated before facts
The relevant dimension rows for new facts must be in place
Special key considerations if initial load must be performed
again



Types of data sources
Non-cooperative sources
Snapshot sources: provide only a full copy of the source
Specific sources: each is different, e.g., legacy systems
Logged sources: write a change log (DB log)
Queryable sources: provide a query interface, e.g., SQL
Cooperative sources
Replicated sources: publish/subscribe mechanism
Call-back sources: call external code (ETL) when changes occur
Internal action sources: only internal actions when changes occur
(DB triggers are an example)

The extract strategy depends heavily on the source types



Extract phase
Goal: fast extract of relevant data
Extract from source systems can take a long time

Types of extracts:
Extract applications (SQL): co-existence with other applications
DB unload tools: much faster than SQL-based extracts
Extract applications are sometimes the only solution

Often too time-consuming to ETL all data at every load:
Extracts can take days/weeks
Drain on the operational systems
Drain on DW systems
=> Extract/ETL only changes since last load (delta)



Computing deltas
Much faster to only ETL changes since last load

A number of methods can be used


Store sorted total extracts in DSA
Delta can easily be computed from current+last extract
+ Always possible
+ Handles deletions
- Does not reduce extract time
Put update timestamp on all rows
Updated by DB trigger
Extract only where timestamp > time for last extract
+ Reduces extract time
+/- Less operational overhead
- Cannot (alone) handle deletions
- Source system must be changed
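The first method above (compute the delta from the current and last total extracts) can be sketched as follows. The slides suggest sorted extracts and a merge; a hash-based variant is shown here because it is shorter, and row/key names are illustrative.

```python
# Sketch: compute a delta (inserts / updates / deletes) from the previous
# and current full extracts held in the DSA, keyed by a primary key.
# Rows are dicts; "id" is an illustrative key column.

def compute_delta(last, current, key="id"):
    old = {r[key]: r for r in last}
    new = {r[key]: r for r in current}
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, updates, deletes
```

Note how this variant, like the sorted-extract method, handles deletions (a key present last time but missing now), which the timestamp method alone cannot.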

Load (1)
Goal: fast loading into DW
Loading deltas is much faster than total load

SQL-based update is slow


Large overhead (optimization, locking, etc.) for every SQL call
DB load tools are much faster
Some load tools can also perform UPDATEs
Index on tables slows load a lot
Drop index and rebuild after load
Can be done per partition
Parallelization
Dimensions can be loaded concurrently
Fact tables can be loaded concurrently
Partitions can be loaded concurrently



Load (2)
Relationships in the data
Referential integrity must be ensured
Can be done by loader
Aggregates
Must be built and loaded at the same time as the detail data
Today, RDBMSes can often do this
Load tuning
Load without log
Sort load file first
Make only simple transformations in loader
Use loader facilities for building aggregates
Use loader within the same database

Should DW be on-line 24*7?


Use partitions or several sets of tables
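The load-tuning advice above (avoid per-row SQL calls, build indexes after the load) can be sketched with sqlite3 standing in for a real DW loader; table and column names are illustrative.

```python
# Sketch: "bulk-load first, build index after" pattern, using sqlite3 as a
# stand-in for a DW bulk loader. Illustrative schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, product TEXT, amount REAL)")

rows = [(d, "p%d" % (d % 3), float(d)) for d in range(1000)]

# One batched call inside one transaction, instead of 1000 separate
# INSERT statements each paying optimization/locking overhead.
with conn:
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Index built once, after the load, rather than maintained row by row.
conn.execute("CREATE INDEX idx_sales_day ON sales(day)")

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

A real loader adds the further tricks from the slide (no logging, pre-sorted load files, per-partition loads); the ordering of work is the point here.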

Overview
General ETL issues:
Data staging area (DSA)
Building dimensions
Building fact tables
Extract
Load
Transformation/cleaning
Commercial (and open-source) tools
The AJAX data cleaning and transformation
framework



Data Cleaning

The activity of converting source data into target data without errors,
duplicates, and inconsistencies, i.e., cleaning and transforming to get
high-quality data!


Why Data Cleaning and Transformation?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., occupation=
noisy: containing errors or outliers (spelling, phonetic
and typing errors, word transpositions, multiple values
in a single free-form field)
e.g., Salary=-10
inconsistent: containing discrepancies in codes or
names (synonyms and nicknames, prefix and suffix
variations, abbreviations, truncation and initials)
e.g., Age=42 Birthday=03/07/1997
e.g., Was rating 1,2,3, now rating A, B, C
e.g., discrepancy between duplicate records
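The Age/Birthday inconsistency above is mechanically checkable. A sketch, under stated assumptions: the birthday is normalized to ISO format first (the slide's 03/07/1997 is ambiguous), a reference date must be supplied, and the function name is illustrative.

```python
# Sketch: flag records where a stored Age disagrees with Birthday,
# relative to a given reference date. Illustrative names and formats.
from datetime import date

def age_consistent(age, birthday_iso, today):
    """True if `age` matches the age implied by `birthday_iso` on `today`."""
    born = date.fromisoformat(birthday_iso)
    years = today.year - born.year - (
        (today.month, today.day) < (born.month, born.day))
    return years == age
```

For a birthday in 1997, Age=42 fails this check for any reference date near the course year, which is exactly the kind of discrepancy a cleaning step should surface.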
Why Is Data Dirty?
Incomplete data comes from:
non available data value when collected
different criteria between the time the data was collected and the time
it is analyzed
human/hardware/software problems
Noisy data comes from:
data collection: faulty instruments
data entry: human or computer errors
data transmission
Inconsistent (and redundant) data comes from:
Different data sources, so non uniform naming conventions/data
codes
Functional dependency and/or referential integrity violation

Why Is Data Cleaning Important?

Data warehouses need consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of
the work of building a data warehouse

No quality data, no quality decisions!


Quality decisions must be based on quality data (e.g.,
duplicate or missing data may cause incorrect or even
misleading statistics)
Types of data cleaning
Conversion, parsing and normalization
Text coding, date formats, etc.
Most common type of cleansing
Special-purpose cleansing
Normalize spellings of names, addresses, etc.
Remove duplicates, e.g., duplicate customers
Domain-independent cleansing
Approximate, fuzzy joins on not-quite-matching
keys
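The first category above (conversion, parsing and normalization) can be sketched as follows. The accepted date formats and the function names are assumptions for illustration, not from the slides.

```python
# Sketch: conversion/normalization cleansing steps for dates and names.
# The recognized formats are illustrative.
from datetime import datetime

def normalize_date(s):
    """Parse a date written in any of a few known formats into ISO 8601."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%b %d, %Y"):
        try:
            return datetime.strptime(s.strip(), fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError("unrecognized date: %r" % s)

def normalize_name(s):
    """Collapse whitespace and normalize capitalization."""
    return " ".join(s.split()).title()
```

Records whose dates match none of the known formats raise an error, which fits naturally with the exception-tuple mechanism discussed later for AJAX.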


Data quality vs cleaning

Data quality = data cleaning +
Data enrichment: enhancing the value of internally held data by appending
related attributes from external sources (for example, consumer
demographic attributes or geographic descriptors)
Data profiling: analysis of data to capture statistics (metadata) that
provide insight into the quality of the data and aid in the
identification of data quality issues
Data monitoring: deployment of controls to ensure ongoing conformance of
data to the business rules that define data quality for the organization
Data stewards responsible for data quality
DW-controlled improvement
Source-controlled improvement
Construct programs to check data quality
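Data profiling, as defined above, reduces to collecting per-column statistics. A minimal sketch (record shape and names are illustrative):

```python
# Sketch: minimal data profiling - per-column row count, null count and
# distinct-value count, the statistics that hint at quality issues.

def profile(rows):
    stats = {}
    for row in rows:
        for col, val in row.items():
            s = stats.setdefault(col, {"rows": 0, "nulls": 0, "values": set()})
            s["rows"] += 1
            if val is None:
                s["nulls"] += 1
            else:
                s["values"].add(val)
    return {c: {"rows": s["rows"], "nulls": s["nulls"],
                "distinct": len(s["values"])}
            for c, s in stats.items()}
```

A high null ratio or an unexpectedly low distinct count is precisely the kind of metadata a "check data quality" program would report to the data stewards.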



Overview
General ETL issues:
Data staging area (DSA)
Building dimensions
Building fact tables
Extract
Load
Transformation/cleaning
Commercial (and open-source) tools
The AJAX data cleaning and transformation
framework


ETL tools
ETL tools from the big vendors, e.g.,
Oracle Warehouse Builder
IBM DB2 Warehouse Manager
Microsoft Integration Services
Offer much functionality at a reasonable price (included):
Data modeling
ETL code generation
Scheduling DW jobs

Many others
Hundreds of tools
Often specialized tools for certain jobs (insurance cleansing, ...)
The best tool does not exist:
Choose based on your own needs
Check first if the standard tools from the big vendors are OK
ETL and data quality tools
http://www.etltool.com/
Magic Quadrant for Data Quality Tools, 2007
Some open source ETL tools:
Talend
Enhydra Octopus
Clover.ETL
Not so many open source quality/cleaning
tools


Application context
Integrate data from different sources
E.g., populating a DW from different operational data stores
Eliminate errors and duplicates within a single source
E.g., duplicates in a file of customers
Migrate data from a source schema into a different fixed target schema
E.g., discontinued application packages
Convert poorly structured data into structured data
E.g., processing data collected from the Web



The AJAX data transformation
and cleaning framework


Motivating example (1)

Source:
DirtyData(paper: String)

Data cleaning & transformation maps it into the target schema:
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
Events(eventKey, name)
PubsAuthors(pubKey, authorKey)



Motivating example (2)

DirtyData:
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer
Widom. Making Views Self-Maintainable for Data Warehousing. In
Proceedings of the Conference on Parallel and Distributed Information
Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-
maintianable for data warehousing, PDIS95

After data cleaning & transformation:
Authors: DQua | Dallan Quass; AGup | Ashish Gupta; JWid | Jennifer Widom; ...
Publications: QGMW96 | Making Views Self-Maintainable for Data
Warehousing | PDIS | null | null | null | null | Miami Beach, Florida,
USA | 1996
Events: PDIS | Conference on Parallel and Distributed Information Systems
PubsAuthors: QGMW96 | DQua; QGMW96 | AGup; ...


Modeling a data cleaning process

A data cleaning process is modeled by a directed acyclic graph of data
transformations:

DirtyData
-> Formatting
-> Standardization (using Cities and Tags dictionaries)
-> Extraction (producing DirtyAuthors, DirtyTitles, DirtyEvents, ...)
-> Duplicate Elimination
-> Authors (and the other clean tables)
Existing technology
Ad-hoc programs written in a programming language like C or Java, or
using an RDBMS proprietary language
Programs are difficult to optimize and maintain

Data transformation scripts using an ETL
(Extraction-Transformation-Loading) tool or a data quality tool


Problems of ETL and data quality solutions (1)

Data cleaning transformations are scattered across application domains
(App. Domain 1, App. Domain 2, App. Domain 3, ...)
The semantics of some data transformations is defined in terms of their
implementation algorithms
Problems of ETL and data quality solutions (2)

(Figure: a cleaning process consumes dirty data and produces clean data
plus rejected data.)
There is a lack of interactive facilities to tune a data cleaning
application program

AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms

A declarative language for logical operators


SQL extension

A debugger facility for tuning a data cleaning application program
Based on a mechanism of exceptions



AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms

A declarative language for logical operators


SQL extension

A debugger facility for tuning a data cleaning application program
Based on a mechanism of exceptions


Logical level: parametric operators

View: arbitrary SQL query
Map: iterator-based one-to-many mapping with arbitrary user-defined
functions
Match: iterator-based approximate join
Cluster: uses an arbitrary clustering function
Merge: extends SQL group-by with user-defined aggregate functions
Apply: executes an arbitrary user-defined algorithm



Logical level

DirtyData -> Formatting -> Standardization (Cities, Tags) -> Extraction
-> DirtyAuthors, DirtyTitles, ... -> Duplicate Elimination -> Authors

Logical level -> Physical level

The same pipeline (DirtyData -> ... -> Authors) appears on both sides;
each logical operator is assigned a physical execution algorithm
(TC = transitive closure, NL = nested loop):

Duplicate Elimination: Merge (Java Scan), Cluster (TC), Match (NL)
Extraction: Map (Java Scan)
Standardization: Map (Java Scan)
Formatting: Map (SQL Scan)
AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms

A declarative language for logical operators


SQL extension

A debugger facility for tuning a data cleaning application program
Based on a mechanism of exceptions


Match
Input: 2 relations
Finds data records that correspond to the
same real object
Calls distance functions for comparing field
values and computing the distance
between input tuples
Output: 1 relation containing matching
tuples and possibly 1 or 2 relations
containing non-matching tuples
Example

Duplicate Elimination over DirtyAuthors as a chain of operators:

DirtyAuthors -> Match (producing MatchAuthors) -> Cluster -> Merge
-> Authors

Example

The Match step of the Duplicate Elimination chain
(DirtyAuthors -> Match -> Cluster -> Merge -> Authors), declaratively:

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors


Example

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors

Input:
DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag

Output:
MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...
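The semantics of this Match can be sketched in Python, with editDistance written out as the classic Levenshtein dynamic program. The threshold value (2) and the nested-loop evaluation are illustrative; the slides discuss better execution strategies next.

```python
# Sketch of the Match operator: approximate self-join on author names
# using Levenshtein edit distance. maxDist = 2 is an illustrative choice.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match(records, max_dist=2):
    """Nested-loop evaluation: emit every pair closer than max_dist."""
    out = []
    for i, (k1, n1) in enumerate(records):
        for k2, n2 in records[i + 1:]:
            if edit_distance(n1, n2) < max_dist:
                out.append((k1, k2, n1, n2))
    return out
```

With this small threshold only the closest name variants pair up; matching "johann christoph freytag" with its abbreviations, as in the slide's output, needs a larger or length-normalized threshold.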

Implementation of the match operator

For all s1 in S1, s2 in S2:
(s1, s2) is a match if editDistance(s1, s2) < maxDist



Nested loop

Compute editDistance(s1, s2) for every pair in S1 x S2.

Very expensive evaluation when handling large amounts of data
Need alternative execution algorithms for the same logical specification

A database solution

CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;

No optimization supported for a Cartesian product with external
function calls
Window scanning

Sort S on some key, then slide a fixed-size window over it and compare
only records that fall within the same window.

May lose some matches
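A minimal sketch of window scanning (the sort key and window size are parameters the user would choose; names are illustrative):

```python
# Sketch: window scanning (sorted-neighborhood blocking). Sort the records,
# then emit only candidate pairs that lie within w positions of each other,
# instead of all n*(n-1)/2 pairs of a nested loop.

def window_pairs(records, key=lambda r: r, w=3):
    s = sorted(records, key=key)
    for i in range(len(s)):
        for j in range(i + 1, min(i + w, len(s))):
            yield s[i], s[j]
```

The candidate set shrinks from quadratic to roughly n*w pairs, at the cost the slide notes: two true matches that sort more than w positions apart are never compared.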

String distance filtering

Map each string to its length: the length difference |len(s1) - len(s2)|
is a lower bound on editDistance(s1, s2). With maxDist = 1, "John Smith"
and "Jogn Smith" need only be compared against strings of length - 1
("John Smit"), the same length, and length + 1 ("John Smithe") before
editDistance is ever called.
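The pruning predicate can be sketched directly; for clarity the sketch tests it inside a nested loop, whereas a real implementation would bucket strings by length first (the function name and the sample names from the slide are illustrative).

```python
# Sketch: distance filtering on string length. Since |len(s) - len(t)| is
# a lower bound on editDistance(s, t), any pair whose lengths differ by
# more than max_dist cannot match and is pruned without computing the
# expensive edit distance.

def candidate_pairs(strings, max_dist=1):
    """Keep only pairs that could possibly be within max_dist edits."""
    out = []
    for i, s in enumerate(strings):
        for t in strings[i + 1:]:
            if abs(len(s) - len(t)) <= max_dist:
                out.append((s, t))
    return out
```

This is exactly the "map = length; dist = abs" annotation shown on the next slide: map each record to its length and filter on the absolute difference.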



Annotation-based optimization

The user specifies types of optimization
The system suggests which algorithm to use

Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map = length; dist = abs %
INTO MatchAuthors

Declarative specification

DEFINE FUNCTIONS AS
  Choose.uniqueString(OBJECT[]) RETURN STRING
    THROWS CiteSeerException
  Generate.generateId(INTEGER) RETURN STRING
  Normal.removeCitationTags(STRING) RETURN STRING (600)

DEFINE ALGORITHMS AS
  TransitiveClosure
  SourceClustering(STRING)

DEFINE INPUT DATA FLOWS AS
  TABLE DirtyData
    (paper STRING (400)
    );
  TABLE City
    (city STRING (80),
     citysyn STRING (80)
    )
  KEY city, citysyn;

DEFINE TRANSFORMATIONS AS
  CREATE MAPPING mapKeDiDa
  FROM DirtyData Dd
  LET keyKdd = generateId(1)
  {SELECT keyKdd AS paperKey, Dd.paper AS paper
   KEY paperKey
   CONSTRAINT NOT NULL mapKeDiDa.paper
  }
AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms

A declarative language for logical operators


SQL extension

A debugger facility for tuning a data cleaning application program
Based on a mechanism of exceptions


Management of exceptions

Problem: mark tuples that are not handled by the cleaning criteria of
an operator

Solution: specify the generation of exception tuples within a logical
operator, when:
exceptions are thrown by external functions
output constraints are violated
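The mechanism can be sketched as an operator wrapper that routes failing tuples to an exceptions stream instead of aborting the run; the function names and the two-output shape are illustrative, mirroring the "clean data / rejected data" split of the earlier slide.

```python
# Sketch: apply a cleaning function tuple by tuple, collecting tuples
# whose external function threw into an exceptions stream for later
# inspection by the debugger.

def apply_with_exceptions(fn, tuples):
    clean, exceptions = [], []
    for t in tuples:
        try:
            clean.append(fn(t))
        except Exception as e:              # external function threw
            exceptions.append((t, str(e)))  # keep the tuple + the reason
    return clean, exceptions
```

The exceptions stream is what a debugger facility can then trace backward and forward through the transformation graph.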



Debugger facility

Supports the (backward and forward) data derivation of tuples w.r.t. an
operator, to debug exceptions

Supports interactive data modification and, in the future, the
incremental execution of logical operators

Architecture



To see it working....

Workshop-UQ?
Tomorrow at Tagus Park,
10H-16H

