(ETL)
Overview
General ETL issues:
Data staging area (DSA)
Building dimensions
Building fact tables
Extract
Load
Transformation/cleaning
Commercial (and open-source) tools
The AJAX data cleaning and transformation framework
Extract
Extract relevant data
Transform
Transform data to DW format
Build keys, etc.
Cleansing of data
Load
Load data into DW
Build aggregates, etc.
[Figure: data warehouse architecture. Panels: Data Sources (operational DBs, other sources), Data Staging, Data Storage (data warehouse, data marts, metadata, monitor & integrator), OLAP Engine (OLAP server), Front-End Tools (analysis, query/reports, data mining). Arrows: Extract, Transform, Load, Refresh, Serve.]
SAD 2007/08, H. Galhardas
ETL process
Extract phase
Goal: fast extract of relevant data
Extract from source systems can take a long time
Types of extracts:
Extract applications (SQL): co-existence with other applications
DB unload tools: much faster than SQL-based extracts
Extract applications are sometimes the only solution
Load (1)
Goal: fast loading into DW
Loading deltas is much faster than total load
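A minimal sketch of the difference between a total load and a delta load. The `sales` table, its `updated_at` change marker, and the function names are assumptions made for illustration; none of them come from the slides.

```python
import sqlite3

def make_dw():
    """Create an in-memory stand-in for the DW with one hypothetical table."""
    dw = sqlite3.connect(":memory:")
    dw.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
    )
    return dw

def full_load(dw, source_rows):
    """Total load: empty the target and reinsert every source row."""
    dw.execute("DELETE FROM sales")
    dw.executemany(
        "INSERT INTO sales (id, amount, updated_at) VALUES (?, ?, ?)", source_rows
    )
    return len(source_rows)  # number of rows written

def delta_load(dw, source_rows, last_load_time):
    """Delta load: write only rows changed after the previous load."""
    delta = [row for row in source_rows if row[2] > last_load_time]
    # INSERT OR REPLACE overwrites an existing row with the same primary key.
    dw.executemany(
        "INSERT OR REPLACE INTO sales (id, amount, updated_at) VALUES (?, ?, ?)", delta
    )
    return len(delta)  # number of rows written
```

The number of rows written by `delta_load` depends on how much changed since the last load, not on the size of the source, which is where the speed-up comes from when changes are few.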
ETL tools
ETL tools from the big vendors, e.g.,
Oracle Warehouse Builder
IBM DB2 Warehouse Manager
Microsoft Integration Services
Offer much functionality at a reasonable price (included)
Data modeling
ETL code generation
Scheduling DW jobs
Many others
Hundreds of tools
Often specialized tools for certain jobs (e.g., insurance cleansing)
Application context
Integrate data from different sources
E.g., populating a DW from different operational data stores
Example dirty input: relation DirtyData(paper: String), holding citation strings such as these two, which refer to the same paper:
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer
Widom. Making Views Self-Maintainable for Data Warehousing. In
Proceedings of the Conference on Parallel and Distributed
Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-
maintianable for data warehousing, PDIS95
[Figure: DirtyData is transformed into DirtyTitles, ..., DirtyEvents, and DirtyAuthors, followed by duplicate elimination.]
[Figure: a cleaning process over dirty data, shown for App. Domain 1, App. Domain 2, and App. Domain 3.]
There is a lack of interactive facilities to tune
a data cleaning application program
AJAX features
An extensible data quality framework
Logical operators as extensions of relational algebra
Physical execution algorithms
[Figure: data flow starting from DirtyData. Formatting, Standardization, and Extraction produce DirtyTitles, ..., DirtyAuthors, Cities, and Tags, followed by duplicate elimination. A second version of the graph adds the labels Map / SQL Scan and Match / NL.]
Match
Input: 2 relations
Finds data records that correspond to the same real object
Calls distance functions for comparing field values and computing the distance between input tuples
Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
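The operator described above can be sketched in a few lines. This is an illustrative sketch, not AJAX's implementation: the function names, the choice of Levenshtein edit distance, and the list-based relations are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def match(r1, r2, distance, max_dist):
    """Compare every tuple of r1 with every tuple of r2.

    Returns the matching pairs (with their distance) plus the tuples of
    each input that matched nothing.
    """
    matches = []
    matched1, matched2 = set(), set()
    for i, t1 in enumerate(r1):
        for j, t2 in enumerate(r2):
            d = distance(t1, t2)
            if d < max_dist:
                matches.append((t1, t2, d))
                matched1.add(i)
                matched2.add(j)
    unmatched1 = [t for i, t in enumerate(r1) if i not in matched1]
    unmatched2 = [t for j, t in enumerate(r2) if j not in matched2]
    return matches, unmatched1, unmatched2
```

Passing the distance function as a parameter mirrors the idea that the operator calls user-supplied distance functions rather than fixing one.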
Example
[Figure: duplicate elimination on authors: DirtyAuthors -> Match -> MatchAuthors -> Cluster -> Merge -> Authors]
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
For s1 ∈ S1, s2 ∈ S2:
(s1, s2) is a match if editDistance(s1, s2) < maxDist
...
A database solution
CREATE TABLE MatchAuthors AS
  SELECT authorKey1, authorKey2, distance
  FROM (SELECT a1.authorKey AS authorKey1,
               a2.authorKey AS authorKey2,
               editDistance(a1.name, a2.name) AS distance
        FROM DirtyAuthors a1, DirtyAuthors a2) pairs
  WHERE distance < maxDist;
No optimization is supported for a Cartesian product with external function calls
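The database solution can be tried out with Python's `sqlite3` module, which allows registering a Python function as an SQL function. This is a sketch under assumptions: SQLite stands in for whatever DBMS the slides had in mind, `INSERT ... SELECT` replaces `CREATE TABLE ... AS` so that maxDist can be bound as a parameter, and the call counter is added only to show how often the external function runs.

```python
import sqlite3

def edit_distance(a, b):
    """Levenshtein distance (dynamic programming over two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def match_authors_sql(names, max_dist):
    """Run the self-join and return (matching rows, external function calls)."""
    calls = 0

    def counted_distance(a, b):
        nonlocal calls
        calls += 1
        return edit_distance(a, b)

    con = sqlite3.connect(":memory:")
    # Expose the Python function to SQL under the name used on the slide.
    con.create_function("editDistance", 2, counted_distance)
    con.execute("CREATE TABLE DirtyAuthors (authorKey INTEGER PRIMARY KEY, name TEXT)")
    con.executemany("INSERT INTO DirtyAuthors (name) VALUES (?)", [(n,) for n in names])
    con.execute("CREATE TABLE MatchAuthors (authorKey1 INTEGER, authorKey2 INTEGER, distance INTEGER)")
    con.execute(
        """
        INSERT INTO MatchAuthors
        SELECT authorKey1, authorKey2, distance
        FROM (SELECT a1.authorKey AS authorKey1,
                     a2.authorKey AS authorKey2,
                     editDistance(a1.name, a2.name) AS distance
              FROM DirtyAuthors a1, DirtyAuthors a2) AS pairs
        WHERE distance < ?
        """,
        (max_dist,),
    )
    rows = con.execute(
        "SELECT authorKey1, authorKey2, distance FROM MatchAuthors ORDER BY 1, 2"
    ).fetchall()
    return rows, calls
```

Because the engine cannot see inside `editDistance`, it evaluates the function for every pair in the Cartesian product, so the call count grows at least quadratically with the number of authors.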
Window scanning
[Figure: window scanning over relation S]
[Figure: relations S1 and S2 compared by string length; surviving labels: "John Smit", "length - 1"]
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map= length; dist = abs %
INTO MatchAuthors
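The distance-filtering annotation above maps each name to its length and compares lengths with an absolute difference. One reading, offered here as an assumption rather than as AJAX's actual algorithm, relies on the fact that the edit distance between two strings is never smaller than the difference of their lengths, so pairs whose lengths differ by maxDist or more can be skipped without calling editDistance. The function names and the bucketing by length are illustrative.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Levenshtein distance (dynamic programming over two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def match_naive(names, max_dist):
    """Cartesian product: one edit-distance call per pair."""
    matches, calls = [], 0
    for i, s1 in enumerate(names):
        for j, s2 in enumerate(names):
            calls += 1
            d = edit_distance(s1, s2)
            if d < max_dist:
                matches.append((i, j, d))
    return sorted(matches), calls

def match_with_length_filter(names, max_dist):
    """Only compare names whose lengths differ by less than max_dist."""
    by_length = defaultdict(list)  # map = length
    for idx, name in enumerate(names):
        by_length[len(name)].append(idx)
    matches, calls = [], 0
    for i, s1 in enumerate(names):
        # Candidate lengths L satisfy abs(L - len(s1)) < max_dist.
        for length in range(len(s1) - max_dist + 1, len(s1) + max_dist):
            for j in by_length.get(length, ()):
                calls += 1
                d = edit_distance(s1, names[j])
                if d < max_dist:
                    matches.append((i, j, d))
    return sorted(matches), calls
```

Both functions return the same matches; the filtered version simply avoids the expensive comparison for pairs that the cheap length test already rules out.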
Declarative specification
DEFINE FUNCTIONS AS
Choose.uniqueString(OBJECT[]) RETURN STRING
THROWS CiteSeerException
Generate.generateId(INTEGER) RETURN STRING
Normal.removeCitationTags(STRING) RETURN STRING
(600)
DEFINE ALGORITHMS AS
TransitiveClosure
SourceClustering(STRING)
DEFINE TRANSFORMATIONS AS
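The declarations above name TransitiveClosure as a clustering algorithm without showing how it works. A minimal sketch, assuming it groups records that are connected through chains of matching pairs; the union-find structure and the function name are choices made for this illustration, not details from the slides.

```python
def transitive_closure_clusters(keys, matched_pairs):
    """Group keys so that any two keys linked by a chain of matches share a cluster."""
    parent = {k: k for k in keys}

    def find(x):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for k in keys:
        clusters.setdefault(find(k), []).append(k)
    return sorted(sorted(members) for members in clusters.values())
```

For example, if author 1 matches author 2 and author 2 matches author 3, all three end up in one cluster even if authors 1 and 3 were never matched directly.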
Management of exceptions
Architecture
Workshop-UQ?
Tomorrow at Tagus Park,
10H-16H