
ETL -vs- ELT

DEALING WITH LARGE DATASETS

(C) 2010 - Blue Chips Technology - All rights Reserved


Tuesday, April 27, 2010

Agenda
Blue Chips Overview
Evolution of etL
The new hybrid
My $1,000,000 mistake
Background on our data warehouse
SMP -vs- MPP
Embracing eLt
Gulp
New Hybrid implemented
Gulp Again
SPIL methodology invented
Lessons Learned


Evolution of Data Migration


Hand Code (Wild West)
No Metadata
No Interfaces
No Centralization
No Methodology
No Constraints
Could Be Highly Tuned
Could Leverage Database

[Diagram, top to bottom: Hand Code]


Evolution of Data Migration


Engine (Pioneers)
Performance (Bottleneck)
Infrastructure costs
Does not leverage database
Metadata
Robust interfaces
Centralization
Promotes standards/reuse
Ease of use
Built-in schedulers
Nice GUIs

[Diagram, top to bottom: Engine, Hand Code]


Evolution of Data Migration


Code Generators (Settlers)
Not many players
Enough rope to hang oneself
No schedulers or process flow
No runtime fees
Great performance
Flexibility
Metadata
Robust interfaces
Centralization
Promotes standards
Ease of use
Minimizes infrastructure cost
Can leverage database

[Diagram, top to bottom: Code Generators, Engine, Hand Code]


Evolution of Data Migration


Hybrids (Explorers)
Best of both worlds
Infrastructure costs
Not many players
Code generation is weak
Metadata
Robust interfaces
Centralization
Promotes standards
Ease of use
Can leverage database
Flexibility
Tuning options

[Diagram, top to bottom: Hybrids, Code Generators, Engine, Hand Code]


The New Hybrid


Hybrids and code generators never took hold.
Chances are you are either hand coding or using some sort of ETL engine.
ELT has made SQL and stored procedures the language of choice.
Because there are so many stored procedure variants, ELT is generally implemented with SQL and workflow software.
The approach seems to be to use your standard ETL tool to launch SQL statements (the new hybrid).

[Diagram, top to bottom: Hybrids, Code Generators, Engine, Hand Code]
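
The "new hybrid" idea of a workflow tool launching SQL statements can be sketched roughly as follows. This is only an illustration, not the presenter's implementation: Python's built-in sqlite3 stands in for the warehouse, and the table names, column names, and step names are all hypothetical.

```python
import sqlite3

# Hypothetical step lists: each step is (name, SQL statement).
# sqlite3 is only a stand-in for the warehouse database.
DDL_STEPS = [
    ("create_staging", "CREATE TABLE stg_sales (id INTEGER, amount TEXT)"),
    ("create_target", "CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)"),
]
TRANSFORM_STEPS = [
    ("cast_and_filter",
     "INSERT INTO sales (id, amount) "
     "SELECT id, CAST(amount AS REAL) FROM stg_sales WHERE amount <> ''"),
]


def run_steps(conn, steps):
    """Launch each SQL statement in order, the way a workflow tool would."""
    completed = []
    for name, sql in steps:
        conn.execute(sql)
        completed.append(name)
    conn.commit()
    return completed


def run_elt(rows):
    """Extract and load raw rows first, then transform inside the database."""
    conn = sqlite3.connect(":memory:")
    run_steps(conn, DDL_STEPS)
    # Load: raw values land in staging untouched.
    conn.executemany("INSERT INTO stg_sales (id, amount) VALUES (?, ?)", rows)
    # Transform: the database engine does the work, driven by SQL.
    run_steps(conn, TRANSFORM_STEPS)
    result = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
    conn.close()
    return result
```

Calling `run_elt([(1, "10.5"), (2, ""), (3, "4")])` loads all three raw rows into staging and leaves only the two rows with a usable amount in the target table. The workflow layer contributes ordering and scheduling; the heavy lifting stays in the database.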


New Hybrid

Potential eLt Pitfall

A day in the life of eLt


Core Data Warehouse


20+ data providers
150 million records per month
5.5 billion record transaction table
2 node Teradata NCR World-Mark cluster
6TB Storage
~$1.2m Capex


Supporting Infrastructure
2 FTP Servers
4 ETL Servers
2 Data Acquisition Servers
4 App Servers
2 Domain Controllers
1 Gateway
1 File Server (0.5TB raw storage)
1 Staging Server (0.75TB raw storage)


Shared Everything Architecture


[Chart: SMP Performance Curve. Axes: # of CPUs and Throughput. Series: CPUs, Performance.]


MPP Architecture: A Unit of Parallelism


The MPP environment is set up as a federation of dedicated servers, blades, or nodes. At its lowest level, each blade has its own dedicated CPU, memory, I/O bus, and disk storage.

[Diagram: one blade with Dedicated I/O, Dedicated Memory, Dedicated CPU, Dedicated Disk]


MPP Example

When a query is executed, it is replicated and executed across all blades.

Because of the shared-nothing architecture, the query takes only as long as it takes a single blade to go through its own share of the data.

If 1 billion records are distributed across 100 blades, each blade has to query only 10 million rows. A billion-row query would respond as if it were querying only 10 million records.
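
The arithmetic on this slide, and the replicate-then-combine pattern behind it, can be sketched as below. This is a toy model under stated assumptions: rows are assigned to blades by key modulo as a stand-in for whatever distribution the database really uses, and the shards are scanned one after another here, whereas a real MPP system scans them concurrently. All function names are hypothetical.

```python
def rows_per_blade(total_rows, blades):
    """Rows each blade scans when the data is spread evenly."""
    return total_rows // blades


def distribute(rows, blades):
    """Assign each (key, value) row to a blade by key modulo.

    Modulo is an assumed stand-in for the database's own distribution scheme.
    """
    shards = [[] for _ in range(blades)]
    for key, value in rows:
        shards[key % blades].append((key, value))
    return shards


def replicated_sum(shards):
    """Run the same aggregate on every shard, then combine the partial results.

    The loop is sequential in this sketch; on MPP hardware each shard
    would be scanned by its own blade at the same time.
    """
    partials = [sum(value for _, value in shard) for shard in shards]
    return sum(partials)
```

With the slide's figures, `rows_per_blade(1_000_000_000, 100)` gives 10,000,000 rows per blade, which is why the billion-row query behaves like a 10-million-row query.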


Shared Nothing Architecture

Perfect linearity means true scalability.

Double the Parallel Units - Cut the Response Time in Half

[Chart: Perfect Linearity. Bars for 16, 32, and 64 Parallel Units; axis ticks from 0.100 to 80.000.]
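
"Double the parallel units, cut the response time in half" amounts to response time being inversely proportional to the number of units. A minimal sketch of that ideal case follows; the function name and the baseline figure in the usage note are hypothetical, and real systems only approximate this behavior.

```python
def ideal_response_time(baseline_seconds, baseline_units, units):
    """Response time under perfect linear scaling.

    Time is inversely proportional to the number of parallel units,
    so doubling the units halves the result.
    """
    if units <= 0 or baseline_units <= 0:
        raise ValueError("unit counts must be positive")
    return baseline_seconds * baseline_units / units
```

For example, assuming a made-up baseline of 60 seconds on 16 parallel units, the ideal time is 30 seconds on 32 units and 15 seconds on 64 units.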


Embracing eLt
[Diagram: Source Data Acquisition -> Extract and Load -> Teradata DB (Transform and Store) -> Hosted Datamart DB -> Direct Access, SAS, Reports]


Gulp
The eLt strategy was a complete success.
Data transformation performance was incredibly fast.
We were meeting operational windows that our competitors could not.
As a result, we received a new data feed representing an additional 36 months of data.
Then the gulp!
Reality hit when we realized that we did not have enough space in our database to perform the transformations on the new data feed.
We were deeply committed to the eLt strategy and had little recourse but to spend another $1,000,000 to expand our environment.
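
The pitfall here is that in-database transformation needs working space on top of the stored data. A rough headroom check, run before accepting a feed, might look like the sketch below. Everything in it is an assumption for illustration: the function name, the `workspace_factor` multiplier, and the stored and incoming sizes in the usage note are hypothetical; only the 6TB and 12TB capacities come from the slides.

```python
def transform_headroom_tb(capacity_tb, stored_tb, incoming_tb, workspace_factor=2.0):
    """Free space left after landing a feed and transforming it in-database.

    workspace_factor is a hypothetical multiplier covering the raw landed
    copy plus intermediate tables; tune it to the actual workload.
    A negative result means the feed does not fit.
    """
    required_tb = incoming_tb * workspace_factor
    return capacity_tb - stored_tb - required_tb
```

With made-up figures of 4TB already stored and a 1.5TB incoming feed, a 6TB system comes out 1TB short, while a 12TB system has 5TB to spare.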


My $1,000,000 Mistake


Core Data Warehouse


21+ data providers
150 million records per month
5.5 billion record transaction table
4 node Teradata NCR World-Mark cluster
12TB Storage
~$2.0m Capex


Hybrid etL and eLt


SPIL Methodology
SPIL (Stage, Pre-Integrate, Load)

[Diagram, three phases feeding Teradata:]
Data Staging: Data Sources -> Extract, Transform, Load -> Validated Data
Pre-Integration: Validated Data -> Sort, Join, Merge -> Production Data
Load: Production Data -> MLOAD / UPSERT -> Teradata
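
The three SPIL phases can be illustrated with a small sketch. It is a toy model, not the presenter's tooling: plain Python lists and a dict stand in for the staging area and the Teradata target, the dict update only imitates what MLOAD / UPSERT would do, and the record fields (`id`, `provider`, `amount`) and function names are hypothetical.

```python
def stage(records):
    """Stage: keep only records that pass basic validation."""
    return [r for r in records
            if r.get("id") is not None and r.get("amount") is not None]


def pre_integrate(validated, provider_names):
    """Pre-integrate outside the database: sort, join, and merge.

    Sort by key, join each record to reference data, and merge duplicates
    so that the last record seen for a key wins.
    """
    merged = {}
    for rec in sorted(validated, key=lambda r: r["id"]):  # stable sort
        name = provider_names.get(rec["provider"], "unknown")
        merged[rec["id"]] = dict(rec, provider_name=name)
    return [merged[key] for key in sorted(merged)]


def load(target, production_rows):
    """Load: upsert by key; the dict is a stand-in for the warehouse table."""
    for row in production_rows:
        target[row["id"]] = row
    return target
```

Because the sorting, joining, and merging happen before the load, the database only has to apply already-integrated rows, which is the space-saving point of moving that work out of the warehouse.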
