ETL Processing
Prepared by Stephen A. Brobst
[Architecture diagram: operational data passes through data transformation into an enterprise warehouse with integrated data marts; replication feeds dependent data marts or departmental warehouses. IT users work against the warehouse; business users work against the marts.]
Copyright 2000, 2001. Stephen A. Brobst. Do not duplicate or distribute without written permission.
Complexity of required transformations (each level is sketched below):
- Simple scalar transformations, e.g., 0/1 => M/F.
- One-to-many element transformations, e.g., a 6x30 address field => street1, street2, city, state, zip.
- Many-to-many element transformations, e.g., householding and individualization of customer records.
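A minimal Python sketch of the three complexity levels. The field names, the fixed-width address layout, and the naive address-keyed householding rule are all illustrative assumptions, not the deck's method:

```python
# Sketch of the three transformation complexity levels.
# Field names and the fixed-width address layout are hypothetical.

def scalar_transform(code: str) -> str:
    """Simple scalar transformation: 0/1 => M/F."""
    return {"0": "M", "1": "F"}[code]

def one_to_many(address: str) -> dict:
    """One-to-many: split a 6x30 fixed-width address field into
    street1, street2, city, state, zip."""
    fields = [address[i * 30:(i + 1) * 30].strip() for i in range(6)]
    return {"street1": fields[0], "street2": fields[1],
            "city": fields[2], "state": fields[3], "zip": fields[4]}

def householding(customers: list[dict]) -> dict:
    """Many-to-many: group individual customer records into households,
    here naively keyed on a normalized address."""
    households: dict = {}
    for c in customers:
        key = c["address"].upper().strip()
        households.setdefault(key, []).append(c["name"])
    return households

print(scalar_transform("0"))  # M
addr = ("12 Elm St".ljust(30) + "".ljust(30) + "Boston".ljust(30)
        + "MA".ljust(30) + "02101".ljust(30) + "".ljust(30))
print(one_to_many(addr)["city"])  # Boston
print(householding([{"name": "A", "address": "12 Elm St"},
                    {"name": "B", "address": "12 elm st "}]))
```

Real householding uses far more sophisticated matching (name variants, fuzzy addresses); the point is only that many input records map to many output records.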
What does the solution look like?
- Metadata-driven transformation architecture.
- Modular software solutions with component building blocks.
- Parallel software and hardware architectures.
A Word of Warning
- The data quality in the source systems will be much worse than you expect.
- You must allocate explicit time and resources to facilitate data clean-up.
- Data quality is a continuous improvement process; a TQM program must be instituted to be successful.
- Use the house-of-quality technique to prioritize and focus data quality efforts.
ETL Processing
It is important to look at the big picture. Data acquisition time may include:
- Extracts from source systems.
- Data movement.
- Transformations.
- Data loading.
- Index maintenance.
- Statistics collection.
- Summary data maintenance.
- Data mart construction.
- Backups.
Loading Strategies
Once we have transformed data, there are three primary loading strategies:
1. Full data refresh with block slamming into empty tables.
2. Incremental data refresh with block slamming into existing (populated) tables.
3. Trickle feed with continuous data acquisition using row-level insert and update operations.
Loading Strategies
We must also worry about rolling off old data as its economic value drops below the cost of storing and maintaining it.
[Diagram: new data rolls into the warehouse as old data rolls off.]
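A hedged sketch of a roll-off job. The table and column names (sales, sale_date) and the two-year retention window are assumptions; sqlite3 stands in for the warehouse RDBMS:

```python
# Sketch of rolling off old data once it ages past a retention window.
# Table/column names and the ~24-month window are assumptions; production
# systems often drop whole date partitions instead of deleting rows.
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('2000-01-15', 10.0), ('2001-06-01', 20.0)")

cutoff = (date(2001, 12, 31) - timedelta(days=730)).isoformat()
# Optionally archive the aged rows first, then roll them off.
conn.execute("DELETE FROM sales WHERE sale_date < ?", (cutoff,))
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # rows remaining
```

Partition-level drops avoid the per-row delete cost, which matters at warehouse volumes.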
Loading Strategies
The choice of loading strategy depends on tradeoffs between data freshness and performance, as well as on data volatility characteristics. What is the goal?
- Increased data freshness.
- Increased data loading performance.
[Diagram: spectrum from real-time availability to low update rates.]
Loading Strategies
Should consider:
- Data storage requirements.
- Impact on query workloads.
- Ratio of existing to new data.
- Insert versus update workloads.
Loading Strategies
Tradeoffs in data loading with a high percentage of data changes per data block:
[Chart: Rows/CPU/sec vs. rows affected per database block, comparing Full Refresh (tens of thousands of rows/CPU/sec) with Incremental Update (hundreds).]
Loading Strategies
Tradeoffs in data loading with a low percentage of data changes per data block:
[Chart: Rows/CPU/sec vs. rows affected per database block, comparing Shadow Table + Insert-Select, Incremental Update, Incremental Insert, Trickle Feed, and Table Copy.]
Full Refresh
This is a good (simple) strategy for small tables or when a high percentage of rows (greater than 10%) changes on each refresh, e.g., reference lookup tables or account tables where balances change on each refresh.
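A minimal sketch of a full refresh using view switching (the technique named later in the ELT discussion): block-load into an empty shadow table, then repoint the view. All object names are assumptions, and sqlite3 stands in for the warehouse RDBMS:

```python
# Sketch of a full refresh: load into an empty shadow table, then switch
# a view to point at it. Names are assumptions; a real DW would use a
# bulk load utility rather than SQL INSERTs for the load itself.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rate_lookup_a (code TEXT, rate REAL);
    CREATE TABLE rate_lookup_b (code TEXT, rate REAL);
    CREATE VIEW rate_lookup AS SELECT * FROM rate_lookup_a;  -- queries use the view
""")

# Refresh cycle: load the idle (empty) table, then repoint the view.
conn.executemany("INSERT INTO rate_lookup_b VALUES (?, ?)",
                 [("STD", 0.05), ("PREM", 0.03)])
conn.executescript("""
    DROP VIEW rate_lookup;
    CREATE VIEW rate_lookup AS SELECT * FROM rate_lookup_b;
    DELETE FROM rate_lookup_a;  -- empty it for the next refresh
""")
print(conn.execute("SELECT * FROM rate_lookup").fetchall())
```

Because the load always targets an empty table, the fast block-slamming path applies, and queries never see a half-loaded table.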
- Remove referential integrity (RI) constraints from table definitions for loading operations.
- Make the assumption that data cleansing takes place in the transformations.
Trickle Feed
Acquire data on a continuous basis into the RDBMS using row-level SQL insert and update operations (see the sketch below).
- Data is made available to the DW immediately rather than waiting for batch loading to complete.
- Much higher overhead for data acquisition on a per-record basis compared to batch strategies.
- Row-level locking mechanisms allow queries to proceed during data acquisition.
- Typically relies on Enterprise Application Integration (EAI) for data delivery.
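A minimal sketch of row-level acquisition as an insert-or-update per incoming record. Table and column names are assumptions; sqlite3 stands in for the warehouse RDBMS:

```python
# Sketch of trickle feed acquisition: one row-level insert-or-update per
# incoming record. Table/column names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acct_id TEXT PRIMARY KEY, balance REAL)")

def apply_event(acct_id: str, balance: float) -> None:
    """Row-level update; fall back to insert if the row does not exist yet."""
    cur = conn.execute("UPDATE account SET balance = ? WHERE acct_id = ?",
                       (balance, acct_id))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO account VALUES (?, ?)", (acct_id, balance))
    conn.commit()  # per-record commit: high overhead, maximal freshness

apply_event("A-1", 100.0)
apply_event("A-1", 80.0)
print(conn.execute("SELECT * FROM account").fetchall())  # [('A-1', 80.0)]
```

The per-record round trip and commit are exactly the overhead the batch strategies amortize away.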
Trickle Feed
A tradeoff exists between data freshness and insert efficiency:
- Buffering rows for insertion allows fewer round trips to the RDBMS, but waiting to accumulate rows into the buffer impacts data freshness.
- Suggested approach: use a threshold that buffers up to M rows, but never waits more than N seconds before sending a buffer of data for insertion (sketched below).
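A sketch of that M-rows-or-N-seconds policy. The constants, the TrickleBuffer class, and the flush callable are placeholders, not part of the deck:

```python
# Sketch of the suggested buffering policy: flush after at most M rows
# or N seconds, whichever comes first. M, N, and flush() are placeholders.
import time

M_ROWS = 500       # flush when the buffer reaches this size
N_SECONDS = 2.0    # ...or when the oldest buffered row is this stale

class TrickleBuffer:
    def __init__(self, flush):
        self.flush = flush          # callable that sends a batch to the RDBMS
        self.rows = []
        self.oldest = None          # arrival time of the oldest buffered row

    def add(self, row):
        if not self.rows:
            self.oldest = time.monotonic()
        self.rows.append(row)
        self._maybe_flush()

    def _maybe_flush(self):
        stale = (self.oldest is not None
                 and time.monotonic() - self.oldest >= N_SECONDS)
        if len(self.rows) >= M_ROWS or stale:
            self.flush(self.rows)
            self.rows, self.oldest = [], None

buf = TrickleBuffer(flush=lambda batch: print(f"inserting {len(batch)} rows"))
for i in range(1200):
    buf.add((i, "payload"))         # flushes at 500 and 1000 rows
```

Note this sketch only checks staleness when a row arrives; a real implementation would also flush from a timer so a quiet feed cannot strand rows past N seconds.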
ETL versus ELT
ETL is extract, transform, load, in which transformation takes place on a transformation server, either by an engine or by generated code. ELT is extract, load, transform, in which data transformations take place in the relational database on the data warehouse server.
Of course, hybrids are also possible...
ETL Processing
ETL processing performs the transform operations prior to loading data into the RDBMS (see the sketch after this list):
1. Extract data from the source systems.
2. Transform data into a form consistent with the target tables.
3. Load the data into the target tables (or to shadow tables).
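A minimal end-to-end sketch of the three steps. The extract is stubbed with inline data, and all table and field names are assumptions:

```python
# Sketch of the three ETL steps; extract() is stubbed with inline data and
# all names are assumptions. Transformation happens before the load.
import sqlite3

def extract():
    """1. Extract: pull raw records from a source system (stubbed here)."""
    return [("0", "12 Elm St|Boston|MA"), ("1", "9 Oak Ave|Salem|MA")]

def transform(raw):
    """2. Transform: make records consistent with the target table."""
    for gender_code, addr in raw:
        street, city, state = addr.split("|")
        yield ({"0": "M", "1": "F"}[gender_code], street, city, state)

def load(rows, conn):
    """3. Load into the target (or shadow) table."""
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (gender TEXT, street TEXT, city TEXT, state TEXT)")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM customer").fetchall())
```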
ETL Processing
ETL processing is typically performed using resources on the source system platform(s) or on a dedicated transformation server.
ETL Processing
Perform the transformations on the source system platform if spare resources exist there and the transformations achieve significant data reduction (so less data moves across the network).
Perform the transformations on a dedicated transformation server if the source systems are highly distributed, lack capacity, or have a high cost per unit of computing.
ETL Processing
Two approaches for ETL processing (a toy illustration of the engine approach follows):
1. Engine: ETL processing using an interpretive engine that applies transformation rules based on metadata specifications, e.g., Ascential, Informatica.
2. Code generation: ETL processing using code generated from a metadata specification, e.g., Ab Initio, ETI.
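A toy illustration of the engine approach: an interpreter walks a metadata specification and applies each rule at runtime; a code generator would instead emit equivalent source from the same metadata. The rule vocabulary here is invented for illustration and is not any vendor's format:

```python
# Toy "engine" approach: interpret transformation rules from a metadata
# spec at runtime. The rule vocabulary is invented for illustration.
SPEC = [
    {"op": "map",    "field": "gender", "table": {"0": "M", "1": "F"}},
    {"op": "upper",  "field": "state"},
    {"op": "rename", "field": "zip", "to": "postal_code"},
]

def apply_rules(record: dict, spec: list) -> dict:
    out = dict(record)
    for rule in spec:
        if rule["op"] == "map":
            out[rule["field"]] = rule["table"][out[rule["field"]]]
        elif rule["op"] == "upper":
            out[rule["field"]] = out[rule["field"]].upper()
        elif rule["op"] == "rename":
            out[rule["to"]] = out.pop(rule["field"])
    return out

print(apply_rules({"gender": "0", "state": "ma", "zip": "02101"}, SPEC))
# {'gender': 'M', 'state': 'MA', 'postal_code': '02101'}
```

The engine pays interpretation overhead per record; code generation trades that for a compile step and raw execution speed.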
ELT Processing
1. First, load raw data into empty tables using RDBMS block-slamming utilities.
2. Next, use SQL to transform the raw data into a form appropriate to the target tables. Ideally, the SQL is generated by a metadata-driven tool rather than hand coded.
3. Finally, use insert-select into the target table for incremental loads, or view switching if a full refresh strategy is used. (A sketch follows.)
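A minimal sketch of the ELT flow executing inside the RDBMS. Table and column names are assumptions, and sqlite3 stands in for the warehouse:

```python
# Sketch of ELT inside the RDBMS: block-load raw data into a staging
# table, transform with SQL, then insert-select into the target.
# Table and column names are assumptions; sqlite3 stands in for the DW.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (gender_code TEXT, city TEXT);   -- raw landing
    CREATE TABLE customer     (gender TEXT,      city TEXT);   -- target

    -- Step 1: raw load (a real DW would use a bulk utility here).
    INSERT INTO stg_customer VALUES ('0', 'boston'), ('1', 'salem');

    -- Steps 2 and 3: transform in SQL and insert-select into the target.
    INSERT INTO customer (gender, city)
    SELECT CASE gender_code WHEN '0' THEN 'M' ELSE 'F' END,
           UPPER(city)
    FROM stg_customer;
""")
print(conn.execute("SELECT * FROM customer").fetchall())
```

The transform is a single set-oriented SQL statement, which is exactly why ELT suits batch workloads (as the next slides note).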
ELT Processing
The DW server is the transformation server for ELT processing.
[Diagram: files from the source systems move across the network and are loaded into the DW server with Teradata Fastload.]
ELT Processing
- ELT leverages the built-in scalability and manageability of the parallel RDBMS and hardware platform.
- Must allocate sufficient staging area space to support the load of raw data and the execution of the transformation SQL.
- Works well only for batch-oriented transforms, because SQL is optimized for set processing.
Bottom Line
- ETL is a significant task in any DW deployment.
- Many options for data loading strategies: evaluate tradeoffs in performance, data freshness, and compatibility with the source systems environment.
- Many options for ETL/ELT deployment: evaluate tradeoffs in where and how transformations should be applied.