
DataStage - estimate how long it will take to build a DataStage job


updated Mar 28, 2008 10:48 pm
Contents

• 1 Introduction

• 2 Job Complexity Estimation

• 2.1 General Job Complexity Overheads

• 2.2 DataStage Job Patterns

• 2.3 Skill Weighting

• 2.4 Data Volume Weighting

• 3 Examples

• 4 Conclusion

[edit]Introduction
This HowTo attempts to define a method for estimating DataStage ETL job development effort by calculating job complexity from the job pattern and the specific challenges within the job.
This estimation model covers developer effort only: the design and development of the job, unit testing to ensure the job runs and fulfils the specification, and defect-fix support for the job through the other testing phases.
Feel free to modify this entry with your own thoughts or add to the entry discussion. If you have an alternative estimating technique, you can add it to this entry as an additional section or as a new HowTo topic.

[edit]Job Complexity Estimation


Identify your set of ETL job patterns. Large data migration projects tend to reuse a small number of common, repetitive job patterns.
Each job pattern defines a base time for building that type of job. A set of general overheads lists stage types that may add time to the base, and a set of specific overheads applies to particular job patterns.

[edit]General Job Complexity Overheads


These are stages and actions that add to the complexity and development time of any job:

• 0.25 for key lookup, join or merge

• 0.5 for range lookup

• 0.5 for change data capture

• 0.5 for field level validation (checking nulls, dates, numbers)

• 0.25 for modify stage

• 0.25 for undocumented source file definition (sequential, xml)


These stages carry no real overhead, taking little time to load and configure. They are part of the base estimate for each job pattern: sort, aggregation, filter, copy, transformer, database, dataset, sequential file.
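
As a rough, non-authoritative illustration, the general overheads can be held in a simple lookup table so they can be summed per job. The key names and the helper function below are invented for this sketch; the weights are the ones listed above.

    # Hypothetical sketch: general complexity overheads (in days) from the list above.
    GENERAL_OVERHEADS = {
        "key_lookup_join_merge": 0.25,
        "range_lookup": 0.5,
        "change_data_capture": 0.5,
        "field_level_validation": 0.5,
        "modify_stage": 0.25,
        "undocumented_source_definition": 0.25,
    }

    def general_overhead_days(stages):
        # Sum the overhead days for the overhead-attracting stages in a job.
        # Stages with no real overhead (sort, aggregation, filter, copy,
        # transformer, database, dataset, sequential file) are simply absent
        # from the table and contribute 0.
        return sum(GENERAL_OVERHEADS.get(stage, 0.0) for stage in stages)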

[edit]DataStage Job Patterns


These are the common ETL job patterns. Each entry gives the name of the job, the number of base days to develop it and any specific overheads for that job pattern. Each job estimate is made up of three parts: the base job time, the general overheads from above and the specific overheads for the pattern. Add these together for the raw estimate; a small sketch of this calculation follows the pattern list below.
The base time is the time to create the job, put all the stages together, import the metadata and do unit testing to get it to a running state.
Relational Table Prepare Job Prepares a data source for a load into a relational
database table.

• 1 day base

• 0.25 for rejection of rows with missing parents.

• 0.5 for output of augmentation requests for missing parents.

• 0.5 for any type of hierarchy validation.


Table Append, Load, Insert, Update or Delete Loads data into a relational table via
insert, update or delete.

• 0.5 day base

• 0.25 for a bulk load.

• 0.25 for user-defined SQL.

• 0.25 for before-sql or after-sql requirements.

• 0.5 for rollback of the table after a failed load.

• 1 for restart from the last row after a failed load.


Dimension Prepare Job Takes a source database SQL statement or a source flat file and creates staging files with transformed data. These jobs typically involve extract, lookup, validation, change data capture and transformation. They take the same amount of time to build as relational table loads but need extra unit testing time due to the increased combinations of changed-data scenarios.

• 1 day base

• 0.5 for type I unit testing

• 0.5 for type II unit testing


Fact Prepare Job Loads fact data. The validation against dimensions is covered by the general lookup overhead. Fact jobs tend to be the most complex, with many source tables, validation against multiple dimensions and the most complex functions.

• 3 days base

• 0.25 per source table (adds to SQL select, lookup and change capture complexity)

• 0.25 for calculations (this is a custom setting)


End to End Load A job that extracts, prepares and loads data in one is a combination of the estimates above: merge the times for the prepare and load jobs into one.
The fact job has a high base estimate and attracts a lot of general overheads such as
lookups and joins.
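
As a minimal sketch of the calculation described at the start of this section, a pattern estimate is the base days plus the general and pattern-specific overheads. The function name and the example figures below are illustrative only; the weights come from the lists above.

    # Hypothetical sketch: raw job estimate = base days + general + specific overheads.
    def job_days(base_days, general_overheads=(), specific_overheads=()):
        return base_days + sum(general_overheads) + sum(specific_overheads)

    # Example: a Fact Prepare Job with two key lookups and change data capture
    # (general overheads), plus three source tables and calculations (specific).
    fact_prepare = job_days(
        base_days=3,
        general_overheads=[0.25, 0.25, 0.5],
        specific_overheads=[0.25, 0.25, 0.25, 0.25],
    )
    # fact_prepare == 5.0 days, before skill and volume weighting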

[edit]Skill Weighting
The skill weighting alters the estimated time based on the experience and confidence of
the developer. For lack of an alternative the skills is defined as the number of years of
experience with the tool. An experience consultant has the base weighting of 1 (no affect
on estimates) with less experienced staff attracting more time.

• 4+ years = 1

• 3-4 years = 1.25

• 1-2 years = 1.5

• 6 months to 1 year = 1.75

• Up to 6 months = 2

• Novice = 3
The total number of days estimated for each job is multiplied by the skill weighting to get
a final estimate.

[edit]Data Volume Weighting


Jobs that handle very high volumes of data can take longer to develop: more time is spent making the job as efficient as possible, unit testing with large volumes takes longer, and optimisation testing adds time. A combined weighting sketch follows the list below.

• Low volume = 1

• Medium volume = 1.25

• High volume = 1.5

• Very high volume = 2
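
Assuming the volume factor is applied the same way as the skill weighting (multiplied against the raw day total), the two weightings can be combined as in this illustrative sketch; the dictionary labels are invented names for the bands listed above.

    # Hypothetical sketch: final estimate = raw days x skill weighting x volume weighting.
    SKILL_WEIGHTING = {
        "4+ years": 1.0, "3-4 years": 1.25, "1-2 years": 1.5,
        "6 months to 1 year": 1.75, "up to 6 months": 2.0, "novice": 3.0,
    }
    VOLUME_WEIGHTING = {"low": 1.0, "medium": 1.25, "high": 1.5, "very high": 2.0}

    def weighted_days(raw_days, skill="4+ years", volume="low"):
        return raw_days * SKILL_WEIGHTING[skill] * VOLUME_WEIGHTING[volume]

    # e.g. weighted_days(5.0, skill="1-2 years", volume="high") == 11.25 days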

[edit]Examples
I extract a list of invoices from a source system and load them into relational data store tables. An invoice is made up of invoice header records and invoice item records (an invoice can have many items on it).

• Extract invoice header to flat file: 1 day

• Extract invoice item to flat file: 1 day

• Prepare invoice header file: 1 (base) + 1 (four joins) + 0.5 (change data capture) +
0.5 (validate fields) = 3 days.
• Load invoice header data: 1 day.

• Prepare invoice item file: 1 (base) + 0.5 (two joins) + 0.25 (reject missing header)
+ 0.5 (change data capture) = 2.25 days.

• Load invoice item data: 1 (base) + 0.5 (before-sql disable constraints, bulk load,
after-sql enable constraints) = 1.5.
An expert (4+ years) would complete all jobs in 9.75 days. A novice would take 29.25
days with the x3 weighting.
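
The same example can be totalled with the sketch functions above; the per-job figures below are the ones listed, not recomputed.

    # Hypothetical sketch: totalling the invoice example and applying the x3 novice weighting.
    invoice_jobs = {
        "extract invoice header": 1.0,
        "extract invoice item": 1.0,
        "prepare invoice header": 3.0,
        "load invoice header": 1.0,
        "prepare invoice item": 2.25,
        "load invoice item": 1.5,
    }
    expert_total = sum(invoice_jobs.values())   # 9.75 days at skill weighting 1
    novice_total = expert_total * 3             # 29.25 days at skill weighting 3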

[edit]Conclusion
These are just guidelines; feel free to jump in and add your own estimates for difficult stages or tasks in a job. These guidelines could then go into a modelling spreadsheet that estimates all ETL jobs on a project.
The model does not take into account the complexity of business rules, such as having to write a lot of custom transformation code or custom stages, nor does it account for a large number of columns in change capture and validation jobs.
