ETL Vs ELT

ETL -vs- ELT
DEALING WITH LARGE DATASETS
(C) 2010 - Blue Chips Technology - All rights Reserved

Tuesday, April 27, 2010
Agenda
  Blue Chips Overview
  Evolution of etL
  The new hybrid
  My $1,000,000 mistake
  Background on our data warehouse
  SMP -vs- MPP
  Embracing eLt
  Gulp
  New Hybrid implemented
  Gulp Again
  SPIL methodology invented
  Lessons Learned
Evolution of Data Migration

Hand Code (Wild West)
  No metadata
  No interfaces
  No centralization
  No methodology
  No constraints
  Could be highly tuned
  Could leverage database
[Diagram layers: Hand Code]

Engine (Pioneers)
  Performance (bottleneck)
  Infrastructure costs
  Does not leverage database
  Metadata
  Robust interfaces
  Centralization
  Promotes standards/reuse
  Ease of use
  Built-in schedulers
  Nice GUIs
[Diagram layers: Engine, Hand Code]

Code Generators (Settlers)
  Not many players
  Enough rope to hang oneself
  No schedulers or process flow
  No runtime fees
  Great performance
  Flexibility
  Metadata
  Robust interfaces
  Centralization
  Promotes standards
  Ease of use
  Minimizes infrastructure cost
  Can leverage database
[Diagram layers: Code Generators, Engine, Hand Code]

Hybrids (Explorers)
  Best of both worlds
  Infrastructure costs
  Not many players
  Code generation is weak
  Metadata
  Robust interfaces
  Centralization
  Promotes standards
  Ease of use
  Can leverage database
  Flexibility
  Tuning options
[Diagram layers: Hybrids, Code Generators, Engine, Hand Code]
The New Hybrid

  Hybrids and code generators never took hold.
  Chances are you are either hand coding or using some sort of ETL engine.
  ELT has made SQL and stored procedures the language of choice.
  Because there are so many stored procedure variants, ELT is generally implemented with SQL and workflow software.
  The approach seems to be to use your standard ETL tool to launch SQL statements (the new hybrid).
[Diagram layers: New Hybrid, Hybrids, Code Generators, Engine, Hand Code]
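
The "new hybrid" pattern, in which a workflow or ETL tool only launches SQL and the database does the transformation, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the presenter's implementation: Python's built-in sqlite3 module stands in for the warehouse, and the table names (stg_sales, fact_sales), the sample rows, and the run_elt function are all hypothetical.

```python
import sqlite3

# Hypothetical raw feed; in practice this would arrive as a file or extract.
RAW_ROWS = [
    ("2010-04-01", "store-1", " 10.50"),
    ("2010-04-01", "store-1", "4.50 "),
    ("2010-04-02", "store-2", "12.25"),
]

# The "workflow" is an ordered list of SQL statements launched by the tool.
TRANSFORM_STEPS = [
    """CREATE TABLE fact_sales AS
       SELECT sale_date, store_id,
              SUM(CAST(TRIM(amount) AS REAL)) AS total_amount
       FROM stg_sales
       GROUP BY sale_date, store_id""",
    "DROP TABLE stg_sales",
]


def run_elt(conn, rows):
    """Load raw rows untouched, then let the database transform them."""
    # Extract and Load: land the data as-is in a staging table.
    conn.execute(
        "CREATE TABLE stg_sales (sale_date TEXT, store_id TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)
    # Transform: the tool only issues SQL; the engine does the work.
    for statement in TRANSFORM_STEPS:
        conn.execute(statement)
    conn.commit()
    return conn.execute(
        "SELECT sale_date, store_id, total_amount "
        "FROM fact_sales ORDER BY sale_date, store_id"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for row in run_elt(connection, RAW_ROWS):
        print(row)
    connection.close()
```

Note that the staging table and the transformed table both live inside the database for a time, so the transformation consumes database space as well as database CPU.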
Potential eLt Pitfall
A day in the life of eLt
Core Data Warehouse

  20+ data providers
  150 million records per month
  5.5 billion record transaction table
  2 node Teradata NCR World-Mark cluster
  6TB Storage
  ~$1.2m Capex
Supporting Infrastructure
  2 FTP Servers
  4 ETL Servers
  2 Data Acquisition Servers
  4 App Servers
  2 Domain Controllers
  1 Gateway
  1 File Server, .5TB raw storage
  1 Staging server, .75TB raw storage
Shared Everything Architecture

[Chart: SMP Performance Curve. Series: CPUs and Performance, plotted as # of CPUs against throughput.]

MPP Architecture: A Unit of Parallelism

The MPP environment is set up as a federation of dedicated servers, blades, or nodes. At its lowest level, each blade has its own dedicated CPU, memory, I/O bus, and disk storage.
[Diagram: one blade with Dedicated CPU, Dedicated Memory, Dedicated I/O, Dedicated Disk]

MPP Example
When a query is executed, it is replicated and executed across all blades.
Because of the shared-nothing architecture, the query takes only as long as a single blade needs to go through its own data.
If I had 1 billion records distributed across 100 blades, each blade would only have to query 10 million rows. A billion-row query would respond as if it were querying only 10 million records.
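
The replicate-and-combine behavior described above can be sketched in a few lines of Python. This is a toy model, not how Teradata is implemented: the function names (distribute, blade_query, mpp_query), the modulo distribution rule, and the scaled-down row counts are assumptions made for illustration, and Python threads only mimic the structure of parallel blades rather than delivering true parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, blade_count):
    """Spread rows across blades by a simple key rule (row id modulo blades)."""
    blades = [[] for _ in range(blade_count)]
    for row_id, amount in rows:
        blades[row_id % blade_count].append((row_id, amount))
    return blades


def blade_query(blade_rows, threshold):
    """The replicated query: each blade scans only its own slice of the data."""
    return sum(amount for _, amount in blade_rows if amount > threshold)


def mpp_query(blades, threshold):
    """Send the same query to every blade, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(blades)) as pool:
        partials = pool.map(lambda rows: blade_query(rows, threshold), blades)
        return sum(partials)


if __name__ == "__main__":
    # Scaled down from the slide's 1 billion rows on 100 blades.
    all_rows = [(i, i % 50) for i in range(100_000)]
    blades = distribute(all_rows, blade_count=100)
    print("rows scanned per blade:", len(blades[0]))
    print("same answer as one big scan:",
          mpp_query(blades, threshold=25) == blade_query(all_rows, threshold=25))
```

Each blade touches only its own share of the rows, which is why the elapsed time tracks the size of one slice rather than the size of the whole table.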

Shared Nothing Architecture

Perfect linearity means true scalability
Double the parallel units, cut the response time in half
[Chart: Perfect Linearity. Response time for 16, 32, and 64 parallel units.]
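
As a worked illustration of "double the parallel units, cut the response time in half", the small function below applies the perfect-linearity rule. The function name response_time and the 80-second baseline are hypothetical values chosen for the example, not measurements from the deck, and real systems only approximate this ideal.

```python
def response_time(baseline_seconds, baseline_units, units):
    """Response time under perfect linear scaling.

    The total work is fixed, so time is inversely proportional to the
    number of parallel units.
    """
    if baseline_units <= 0 or units <= 0:
        raise ValueError("parallel unit counts must be positive")
    return baseline_seconds * baseline_units / units


if __name__ == "__main__":
    # Hypothetical baseline: a query that takes 80 seconds on 16 units.
    for units in (16, 32, 64):
        print(units, "units:", response_time(80.0, 16, units), "seconds")
```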

Embracing eLt
[Diagram labels: Source Data Acquisition, Extract and Load, Teradata DB (Transform and Store), Direct Access, SAS, Reports, Hosted Datamart]
Gulp
  The eLt strategy was a complete success.
  Data transformation performance was incredibly fast.
  We were meeting operational windows that our competitors could not.
  As a result, we received a new data feed that represented an additional 36 months of data.
  Then the gulp!
  Reality hit when we realized that we did not have enough space within our database to perform the transformations on the new data feed.
  We were deeply committed to the eLt strategy and had little recourse but to spend another $1,000,000 to expand our environment.
My $1,000,000 Mistake
Core Data Warehouse

  21+ data providers
  150 million records per month
  5.5 billion record transaction table
  4 node Teradata NCR World-Mark cluster
  12TB Storage
  ~$2.0m Capex
Hybrid etL and eLt
SPIL Methodology
SPIL (Stage, Pre-Integrate, Load)
[Diagram: SPIL pipeline. Stages: Data Staging, Pre-Integration, Load (Teradata). Labels in order: Data Sources, Extract / Transform / Load, Validated Data, Sort / Join / Merge, Validated Data, MLOAD / UPSERT, Production Data]
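
The slides give only the stage names, so the following is one possible reading of SPIL rather than the presenter's implementation: validate incoming rows in a staging step, do the sort, join, and merge work outside the warehouse, and leave the database with nothing to do but a final upsert. Everything named here is hypothetical (the functions stage, pre_integrate, load, and run_spil, the validation rule, the production table), and sqlite3's INSERT OR REPLACE merely stands in for a Teradata MLOAD / UPSERT.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter


def stage(raw_rows):
    """Stage: keep rows that pass a basic (hypothetical) validation rule."""
    return [(key, provider, float(amount))
            for key, provider, amount in raw_rows
            if key and amount.strip()]


def pre_integrate(validated, providers):
    """Pre-Integrate: sort, join, and merge outside the database."""
    ordered = sorted(validated, key=itemgetter(0))            # sort by account
    merged = []
    for key, group in groupby(ordered, key=itemgetter(0)):   # merge per account
        group = list(group)
        provider_name = providers.get(group[-1][1], "unknown")  # join to lookup
        merged.append((key, provider_name, sum(r[2] for r in group)))
    return merged


def load(conn, rows):
    """Load: one upsert into the production table, with no in-database transform."""
    conn.executemany(
        "INSERT OR REPLACE INTO production (account, provider, balance) "
        "VALUES (?, ?, ?)", rows)
    conn.commit()


def run_spil(conn, raw_rows, providers):
    load(conn, pre_integrate(stage(raw_rows), providers))
    return conn.execute(
        "SELECT account, provider, balance FROM production ORDER BY account"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE production "
                 "(account TEXT PRIMARY KEY, provider TEXT, balance REAL)")
    conn.execute("INSERT INTO production VALUES ('A1', 'Old Provider', 0.0)")
    raw = [("A2", "p1", "5.0"), ("A1", "p2", "1.5"),
           ("", "p1", "9.0"), ("A1", "p2", "2.5")]
    names = {"p1": "Provider One", "p2": "Provider Two"}
    for row in run_spil(conn, raw, names):
        print(row)
    conn.close()
```

The design point is that the heavy intermediate work happens before the load, so the database holds only validated, pre-integrated rows and does not need extra space for transformation.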
