ETL Overview
- General ETL issues
  - The ETL process
  - Building dimensions
  - Building fact tables
  - Extract
  - Transformations/cleansing
  - Load
- A concrete ETL tool
  - Demo
  - Example ETL flow
The ETL Process
- Extract: extract relevant data
- Transform: transform data to DW format; build keys, etc.; cleanse the data
- Load: load data into the DW; build aggregates, etc.
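A rough end-to-end illustration of these three steps in SQL; every table and column name here is hypothetical:

```sql
-- Extract: copy the relevant source rows into a staging table
INSERT INTO stage_sales (prod_id, cust_name, sale_date, amount)
SELECT prod_id, cust_name, sale_date, amount
FROM src_sales;

-- Transform/cleanse: a simple string cleanup in the staging area
UPDATE stage_sales
SET cust_name = LTRIM(RTRIM(cust_name));

-- Load: look up DW surrogate keys and move the rows into the fact table
INSERT INTO sales_fact (prod_key, date_key, amount)
SELECT k.dw_key, CONVERT(CHAR(8), s.sale_date, 112), s.amount  -- 112 = yyyymmdd
FROM stage_sales s
JOIN prod_keymap k ON k.prod_id = s.prod_id;
```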
[Figure: DW architecture - operational systems feed the data staging area; presentation servers hold data marts with aggregate-only data plus conformed dimensions and facts on the warehouse bus; on the query side, query services (warehouse browsing, access and security, query management, standard reporting, activity monitor) support reporting tools and data mining; metadata spans the architecture.]
The Data Staging Area (DSA)
- No user queries (though some systems allow them)
- Sequential operations (few) on large data volumes
  - Performed by central ETL logic
  - Easily restarted
  - No need for locking, logging, etc.
  - RDBMS or flat files? (DBMSes have become better at this)
- Finished dimensions are copied from the DSA to the relevant marts
- Allows centralized backup/recovery
  - Often too time consuming to load all data marts again from scratch after a failure
  - Thus, backup/recovery facilities are needed
  - Better to do this centrally in the DSA than in all data marts
ETL Construction Process
Plan
1) Make a high-level diagram of the source-destination flow
2) Test, choose and implement an ETL tool
3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions
4) Construct and test building static dimension
5) Construct and test change mechanisms for one dimension
6) Construct and test remaining dimension builds
Construction of fact tables and automation
7) Construct and test initial fact table build
8) Construct and test incremental update
9) Construct and test aggregate build (you do this later)
10) Design, construct, and test ETL automation
Building Dimensions
Static dimension table
- Relatively easy?
- Assignment of keys: production keys to DW keys using a mapping table (sketched below)
- Combination of data sources: find a common key?
- Check one-one and one-many relationships using sorting
Handling dimension changes
- Described in the last lecture
- Find the newest DW key for a given production key
- The table mapping production keys to DW keys must be updated
Load of dimensions
- Small dimensions: replace
- Large dimensions: load only changes
Initial load
- ETL for all data up till now
- Done when the DW is started for the first time
- Often problematic to get correct historical data
- Very heavy: large data volumes
Incremental update
- Move only changes since the last load
- Done periodically (e.g., monthly/weekly/daily/hourly) after DW start
- Less heavy: smaller data volumes
- The relevant dimension rows for new facts must be in place
- Special key considerations if the initial load must be performed again
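A minimal sketch in SQL of such a key map and the "newest key" lookup; all names (prod_keymap, prod_id, dw_key, @prod_id) are hypothetical:

```sql
-- Mapping table from production keys to DW surrogate keys
CREATE TABLE prod_keymap (
  prod_id    INT NOT NULL,      -- production key from the source system
  dw_key     INT NOT NULL,      -- DW surrogate key
  valid_from DATETIME NOT NULL  -- when this mapping became current
);

-- Find the newest DW key for a given production key
SELECT k.dw_key
FROM prod_keymap k
WHERE k.prod_id = @prod_id
  AND k.valid_from = (SELECT MAX(valid_from)
                      FROM prod_keymap
                      WHERE prod_id = k.prod_id);
```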
Extract
Goal: fast extract of relevant data
- Extract from source systems can take a long time
Types of extracts:
- Extract applications (SQL): co-existence with other applications
- DB unload tools: much faster than SQL-based extracts
- Extract applications are sometimes the only solution
- Extracts can take days/weeks
  - Drain on the operational systems and on the DW systems
  => Extract/ETL only changes since the last load (the delta)
Computing Deltas
Much faster to only ETL changes since last load
A number of methods can be used (two are sketched in SQL below):
- Store full extracts in the DSA: the delta can easily be computed from the current + last extract
  + Always possible
  + Handles deletions
  - Does not reduce extract time
- Put an update timestamp on all source rows, updated by a DB trigger, and extract only rows where the timestamp > time of the last extract
  + Reduces extract time
  +/- Less operational overhead
  - Cannot (alone) handle deletions
  - The source system must be changed
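The tables, columns, and the @last_extract_time variable below are assumptions for illustration, not part of the lecture material:

```sql
-- Method 1: compute the delta by diffing the current and previous full extracts
SELECT c.*
FROM extract_current c
LEFT JOIN extract_previous p ON p.prod_id = c.prod_id
WHERE p.prod_id IS NULL          -- inserted rows
   OR p.row_hash <> c.row_hash;  -- updated rows (row_hash: precomputed row checksum)

-- Method 2: extract only rows touched since the last extract,
-- using a timestamp column maintained by a trigger in the source
SELECT *
FROM src_orders
WHERE last_modified > @last_extract_time;
```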
Changed Data Capture
Messages
- Applications insert messages in a queue at updates
  + Works for all types of updates and systems
  - Operational applications must be changed + operational overhead
DB triggers
- Triggers execute actions on INSERT/UPDATE/DELETE
  + Operational applications need not be changed
  + Enables real-time update of the DW
  - Operational overhead
DB log
- Find changes directly in the DB log, which is written anyway
  + Operational applications need not be changed
  + No operational overhead
  - Not possible in some DBMSs (SQL Server, Oracle, DB2 can do it)
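A minimal sketch of the trigger-based variant in Transact-SQL, with hypothetical table names; a real implementation would also record the kind of change:

```sql
-- Record the keys of changed rows so the next ETL run can pick them up
CREATE TRIGGER trg_orders_capture
ON src_orders
AFTER INSERT, UPDATE, DELETE
AS
  INSERT INTO orders_changes (order_id, change_time)
  SELECT order_id, GETDATE() FROM inserted   -- inserted and updated rows
  UNION
  SELECT order_id, GETDATE() FROM deleted;   -- deleted and updated rows
```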
ETL Overview
- General ETL issues
  - The ETL process
  - Building dimensions
  - Building fact tables
  - Extract
  - Transformations/cleansing
  - Load
- A concrete ETL tool
  - Demo
  - Example ETL flow
Common Transformations
Data type conversions
- EBCDIC -> ASCII/Unicode
- String manipulations
- Date/time format conversions
Normalization/denormalization
- To the desired DW format
- Depending on the source format
Building keys
- A table matches production keys to surrogate DW keys
- Observe correct handling of history, especially for a total reload
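For instance, date/time and string conversions could look like this in SQL (staging table and source formats assumed for illustration):

```sql
-- Standardize a text date ('dd/mm/yyyy') and clean up a name column
UPDATE stage_customer
SET birth_date = CONVERT(DATETIME, birth_date_raw, 103),  -- style 103 = dd/mm/yyyy
    full_name  = LTRIM(RTRIM(UPPER(full_name)));
```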
Data Quality
- Data almost never has decent quality
- Data in the DW must be:
  - Precise: DW data must match known numbers, or an explanation is needed
  - Complete: the DW has all relevant data, and the users know it
  - Consistent: no contradictory data; aggregates fit with the detail data (a check is sketched below)
  - Unique: the same thing is called the same and has the same key (e.g., customers)
  - Timely: data is updated frequently enough, and the users know when
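Consistency, for example, can be checked with a query like the following (the detail and aggregate tables are hypothetical):

```sql
-- Flag months where the stored aggregate does not fit the detail data
SELECT d.sale_month,
       SUM(d.amount) AS detail_total,
       MAX(a.amount) AS aggregate_total
FROM sales_detail d
JOIN sales_month_agg a ON a.sale_month = d.sale_month
GROUP BY d.sale_month
HAVING SUM(d.amount) <> MAX(a.amount);
```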
Cleansing
BI does not work on raw data
- Pre-processing is necessary for good results
- Problems can disturb BI analyses if not handled:
  - Spellings, codings, ...
  - Production keys, comments, ...
  - City name instead of ZIP code, ...
  - Customer data, ...
  - Extra values used for "programmer hacks"
Cleansing
Mark facts with a Data Status dimension
- Normal, abnormal, outside bounds, impossible, ...
- Facts can be taken in/out of analyses
Handle special values that can disturb BI tools
- Replace with NULLs (or estimates)
Uniform treatment of NULLs
- Use an explicit NULL value rather than a "normal" value (0, -1, ...)
- Use NULLs only for measure values (estimates instead?)
- Use special dimension keys for NULL dimension values (sketched below)
Mark facts with changed status
- Customer about to cancel contract, ...
Aggregate facts?
- Performance
- Statistical significance only for higher levels
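A sketch of the special-dimension-key technique in SQL; the names and the reserved key value are assumptions:

```sql
-- Point facts with a missing customer at a reserved 'Unknown customer' dimension row
UPDATE stage_sales
SET cust_key = 0  -- key 0 is assumed reserved for the 'Unknown customer' row
WHERE cust_key IS NULL;
```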
Transformations for data mining
- Divide the data into sets:
  - Training: used to train the model
  - Test: used to check the model's generality (overfitting)
  - Evaluation: the model uses the evaluation set to find real clusters, ...
- Add derived attributes
  - Often more interesting for mining (profit, ...)
  - Means more possibilities for data mining
- Discretize: continuous values to intervals, textual values to numerical
  - Demanded by some tools (neural nets)
- Emphasize rare cases: duplicate rare elements in the training set
Improving Data Quality
Appoint data stewards
- A given steward has the responsibility for certain tables
- Includes manual inspections and corrections!
DW-controlled improvement
- Default values ("Not yet assigned 157" is a note to the data steward)
Source-controlled improvements
- The optimal?
Construct programs that check the data quality
- Are totals as expected?
- Do results agree with an alternative source?
Do not fix all problems
- Allow management to see "weird" data in their reports?
Load
Goal: fast loading into DW
- Loading deltas is much faster than a total load
- SQL-based update is slow: large overhead (optimization, locking, etc.) for every SQL call
- DB load tools are much faster (a bulk-load sketch follows below); some load tools can also perform UPDATEs
- Indexes slow the load a lot: drop indexes and rebuild them after the load (can be done per partition)
- Parallelization: dimensions, fact tables, and partitions can all be loaded concurrently
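A bulk-load sketch in Transact-SQL combining the drop-index and load-tool advice; the file path, table, and index names are hypothetical:

```sql
-- Drop the index, bulk load the delta file, then rebuild the index
DROP INDEX sales_fact.ix_sales_date;

BULK INSERT sales_fact
FROM 'C:\dsa\sales_delta.txt'
WITH (FIELDTERMINATOR = ';', TABLOCK);  -- TABLOCK reduces locking overhead

CREATE INDEX ix_sales_date ON sales_fact (date_key);
```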
Load
Relationships in the data
- Referential integrity must be ensured (a check is sketched below)
- Can be done by the loader
Aggregates
- Must be built and loaded at the same time as the detail data
- Today, RDBMSes can often do this
Load tuning
- Load without log
- Sort the load file first
- Make only simple transformations in the loader
- Use loader facilities for building aggregates
- Use the loader within the same database
- Use partitions or several sets of tables (like MS Analysis)
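Referential integrity can be checked before the load with a query like this (hypothetical staging and dimension tables):

```sql
-- Find staged fact rows that reference a missing dimension row
SELECT f.*
FROM stage_fact f
LEFT JOIN product_dim p ON p.prod_key = f.prod_key
WHERE p.prod_key IS NULL;
```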
ETL Tools
ETL tools from the big vendors
- Oracle Warehouse Builder
- IBM DB2 Warehouse Manager
- Microsoft Data Transformation Services
- Typical functionality: data modeling, ETL code generation, scheduling of DW jobs
Many others
- Hundreds of tools
- Often specialized tools for certain jobs (insurance cleansing, ...)
- Choose based on your own needs
- Check first if the standard tools from the big vendors are OK
Issues
Files versus streams/pipes
- Streams/pipes: no disk overhead, fast throughput
- Files: easier restart, often the only possibility
ETL tool or hand coding?
- Code: easy start, co-existence with the IT infrastructure
- Tool: better productivity on subsequent projects
Load frequency
- ETL time depends on the data volumes
- A daily load is much faster than a monthly one
- Applies to all steps in the ETL process
ETL Overview
- General ETL issues
  - The ETL process
  - Building dimensions
  - Building fact tables
  - Extract
  - Transformations/cleansing
  - Load
- A concrete ETL tool
  - Demo
  - Example ETL flow
Microsoft DTS
Part of SQL Server 2000
A range of tools
- Import/Export Wizard: simple transformations
- DTS Designer: complex transformations
- DTSRun + DTSRunUI: execute DTS packages
- SQLAgent: schedules execution
Usage
- Through the GUI: basic functionality
- Programmatically: advanced functionality (the DTS object model allows programming)
Packages
The central concept in DTS
Package = an organized set of:
- Connections (data sources)
- Tasks
- Transformations
- Task dependencies (workflow)
DTS Package
[Figure: an example DTS package]
Data Sources/Connections
All ODBC/OLE DB data sources
Almost everything!
Examples:
- SQL Server
- Access
- Excel
- dBase
- Paradox
- Oracle
- HTML files
- Text files
Tasks
Transform data tasks
- Transform Data task: move data and optionally apply column-level transformations
- Data Driven Query task: perform Transact-SQL, including stored procedures, INSERT, UPDATE, DELETE
- Parallel Data Pump task: parallel version of the two above
- Use Lookup for joins with other data sources
- Bulk Insert task: fast load into SQL Server
- Execute SQL task: run SQL statements during execution
- Copy SQL Server Objects task: from one instance to another
- Transfer Database task: copy a whole database
- Transfer Error Messages task: from one instance to another
- Transfer Logins task: from one instance to another
- Transfer Jobs task: from one instance to another
- Transfer Master Stored Procedures task: from one instance to another
Tasks 2
Package job tasks
- ActiveX Script task: write ActiveX code (VBScript, JScript, PerlScript)
- Dynamic Properties task: retrieve external values
- Execute Package task: execute other DTS packages
- Execute Process task: execute external programs/batch files
- FTP task: get data from a file or HTTP
- Message Queue task: get data from MS Message Queues
- Send Mail task: used for notifications, alerts, etc.
- Analysis Services task: process Analysis Services cubes, etc.
- Data Mining task: create a prediction query and output table from a mining model
Task Dependencies
Four possibilities
Transform Data Task
- Source and target row sets are created
- A data pump instance is created
- Transformations are performed for each row
- Error logging: error text file, source error rows file, destination error rows file
- Data pump phases allow added functionality (next slide)
- The source data may be a query
[Figure: the data pump phases]
Query Designer
[Figure: the DTS Query Designer]
Transformations
- Copy Column: copy data directly
- ActiveX Script: custom transformations, can be arbitrarily complex
- Date/time conversions: convert the date/time format of a source column
- String transformations: e.g., remove blanks
- Read File: open the file named by the source column and copy its contents to the destination
- Write File: copy source column 1 to a file whose name is given by source column 2
Saving packages
- Local Packages
- Metadata Services (repository)
- Structured storage (file)
- Visual Basic
Demo notes
- The example package is available from the web page
- Steps: DROP, CREATE, source, destination
- Error messages? But dependencies can be set up
Execute package
Exercise: build an ETL flow using MS DTS that can do an initial (first-time) load of the data warehouse. This should include logic for generating the special DW surrogate integer keys for the tables. Discuss and implement basic transformations/data cleansing.
Extensions:
- Extend the ETL flow to handle incremental loads, i.e., updates to the DW, both for dimensions and facts.
- Extend the DW design and the ETL logic to handle slowly changing dimensions of Type 2 (a SQL sketch follows below).
- Implement more advanced transformations/data cleansing.
- Perform error handling in the ETL flow.
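For the Type 2 extension, the core update could be sketched like this in SQL; the dimension layout (valid_from/valid_to columns, an IDENTITY surrogate key) is one common design, not the only one:

```sql
-- Close the current version of a changed customer...
UPDATE customer_dim
SET valid_to = GETDATE()
WHERE cust_prod_id = @prod_id
  AND valid_to IS NULL;

-- ...and insert a new version; an IDENTITY column supplies the new surrogate key
INSERT INTO customer_dim (cust_prod_id, name, city, valid_from, valid_to)
VALUES (@prod_id, @name, @city, GETDATE(), NULL);
```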
Hints
- Use Query Analyzer to test SQL before putting it into DTS
- Do one (or just a few) thing(s) at a time:
  - Build the first step and check that the result is as expected
  - Add the second step, execute both, and check the result
  - Add the third step, ...
- Only needed if doing incremental load
- Versions: only needed if handling slowly changing dimensions
Summary
- General ETL issues
  - The ETL process
  - Building dimensions
  - Building fact tables
  - Extract
  - Transformations/cleansing
  - Load
- A concrete ETL tool
  - Demo
  - Example ETL flow