
Introduction

To DataStage

What is DataStage?
Design jobs for Extraction, Transformation, and
Loading (ETL)
Ideal tool for data integration projects such as data
warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use
within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and
execution environments

DataStage Server and Clients

DataStage Administrator

Client Logon

DataStage Manager

DataStage Designer

DataStage Director

Developing in DataStage

Define global and project properties in Administrator


Import meta data into Manager
Build job in Designer
Compile in Designer
Validate, run, and monitor in Director

DataStage Projects

Project Properties
Projects can be created and deleted in
Administrator
Project properties and defaults are set in
Administrator

Setting Project Properties

To set project properties, log onto Administrator, select your


project, and then click Properties

Licensing Tab

Projects General Tab

Environment Variables

Permissions Tab

Tracing Tab

Tunables Tab

Parallel Tab

What Is Metadata?

Data

Source
Meta
Data

Transform

Meta Data
Repository

Target
Meta
Data

DataStage Manager

Import and Export

Any object in Manager can be exported to a file


Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project
to another
Use to share DataStage jobs and projects with other
developers

Export Procedure

In Manager, click Export>DataStage Components


Select DataStage objects for export
Specify the type of export: DSX or XML
Specify file path on client machine

Exporting DataStage Objects

Exporting DataStage Objects

Import Procedure
In Manager, click Import>DataStage Components
Select DataStage objects for import

Importing DataStage Objects

Import Options

Metadata Import
Import format and column definitions from
sequential files
Import relational table column definitions
Imported as Table Definitions
Table definitions can be loaded into job stages

Sequential File Import Procedure


In Manager, click Import>Table Definitions>Sequential File
Definitions
Select directory containing sequential file and then the file
Select Manager category
Examine format and column definitions and edit if
necessary

Manager Table Definition

Importing Sequential Metadata

What Is a Job?
Executable DataStage program
Created in DataStage Designer, but can use
components from Manager
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH)

Job Development Overview


In Manager, import metadata defining sources and targets
In Designer, add stages defining data extractions and loads
Add Transformers and other stages to define data
transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job

Designer Work Area

Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers

Job
properties

Compile

Tools Palette

Adding Stages and Links


Stages can be dragged from the tools palette or from
the stage type branch of the repository view
Links can be drawn from the tools palette or by right
clicking and dragging from one stage to another

Designer - Create New Job

Drag Stages and Links Using Palette

Assign Meta Data

Editing a Sequential Source Stage

Editing a Sequential Target

Transformer Stage
Used to define constraints, derivations, and column
mappings
A column mapping maps an input column to an output
column
In this module we will just define column mappings (no
derivations)

Transformer Stage Elements

Create Column Mappings

Creating Stage Variables

Result

Adding Job Parameters


Makes the job more flexible
Parameters can be:
Used in constraints and derivations
Used in directory and file names
Parameter values are determined at run time
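For example, a hypothetical job parameter named FilePath could be
referenced in a Sequential File stage's file name using DataStage's
#parameter# substitution syntax (a sketch, not taken from this course's
exercises):

#FilePath#/customers_in.txt

At run time, the value supplied for FilePath in the Run Options dialog
replaces the #FilePath# token.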

Adding Job Documentation


Job Properties
Short and long descriptions
Shows in Manager
Annotation stage
Is a stage on the tool palette
Shows on the job GUI (work area)

Job Properties Documentation

Annotation Stage on the Palette

Annotation Stage Properties

Final Job Work Area with


Documentation

Compiling a Job

Errors or Successful Message

Prerequisite to Job Execution


Result from Designer compile

DataStage Director
Can schedule, validate, and run jobs
Can be invoked from DataStage Manager or Designer
Tools > Run Director

Running Your Job

Run Options Parameters and Limits

Director Log View

Message Details are Available

Other Director Functions


Schedule job to run on a particular date/time
Clear job log
Set Director options
Row limits
Abort after x warnings

Process Flow

Administrator: add/delete projects, set defaults
Manager: import meta data, back up projects
Designer: assemble jobs, compile, and execute
Director: execute jobs, examine job run logs

Administrator Licensing and Timeout

Administrator Project
Creation/Removal

Functions
specific to a
project.

Administrator Project Properties

RCP for parallel jobs should be enabled
Variables for parallel processing

Administrator Environment
Variables

Variables are
category
specific

OSH is what is
run by the EE
Framework

DataStage Manager

Export Objects to MetaStage

Push meta data to MetaStage

Designer Workspace

Can execute the job from Designer

DataStage Generated OSH

The EE
Framework
runs OSH

Director Executing Jobs

Messages from
previous run in
different color

Stages

Can now customize the Designer's palette

Select desired stages and drag to favorites

Popular Developer Stages

Row
generator

Peek

Row Generator

Can build test data


Edit row in
column tab

Repeatable
property

Peek
Displays field values
Will be displayed in job log or sent to a file
Skip records option
Can control number of records to be displayed
Can be used as stub stage for iterative development
(more later)

Why EE is so Effective
Parallel processing paradigm
More hardware, faster processing
Level of parallelization is determined by a
configuration file read at runtime
Emphasis on memory
Data read into memory and lookups performed like
hash table

Parallel Processing Systems

DataStage EE enables parallel processing: executing your
application on multiple CPUs simultaneously
If you add more resources
(CPUs, RAM, and disks) you increase system performance

Example: a system containing 6 CPUs (or processing nodes)
and disks

Scalable Systems: Examples

Three main types of scalable systems
Symmetric Multiprocessors (SMP): shared memory and
disk
Clusters: UNIX systems connected via networks
MPP: Massively Parallel Processing

SMP: Shared Everything

Multiple CPUs with a single operating system
Programs communicate using shared memory
All CPUs share system resources
(OS, memory with single linear address space,
disks, I/O)

When used with Enterprise Edition:
Data transport uses shared memory
Simplified startup

(Diagram: four CPUs sharing memory and system resources)

Traditional Batch Processing

(Diagram: operational and archived source data flows through Transform,
Clean, and Load, with a write to disk and a read from disk between each
operation, into the data warehouse target)

Traditional approach to batch processing:
Write to disk and read from disk before each processing operation
Sub-optimal utilization of resources
a 10 GB stream leads to 70 GB of I/O
processing resources can sit idle during I/O
Very complex to manage (lots and lots of small jobs)
Becomes impractical with big data volumes

Pipeline Multiprocessing

Data Pipelining
Transform, clean, and load processes are executing simultaneously on the same processor
rows are moving forward through the flow
Start a downstream process while an upstream process is still running.
This eliminates intermediate storing to disk, which is critical for big data.
This also keeps the processors busy.
Still has limits on scalability

(Diagram: operational and archived source data flows through Transform,
Clean, and Load directly into the data warehouse target, with no
intermediate disk)

Partition Parallelism

Data Partitioning
Break up big data into partitions
Run one partition on each processor
4X faster on 4 processors; with data big enough:
100X faster on 100 processors
This is exactly how the parallel databases work!
Data Partitioning requires the same transform on all partitions:
Aaron Abbott and Zygmund Zorn undergo the same transform

(Diagram: source data is split into partitions A-F, G-M, N-T, and U-Z
across Nodes 1-4, each running the same Transform)

Combining Parallelism Types

Putting It All Together: Parallel Dataflow

(Diagram: source data is partitioned, then pipelined through Transform,
Clean, and Load into the data warehouse target)

Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly

(Diagram: source data is partitioned A-F, G-M, N-T, U-Z and transformed
by customer last name, repartitioned and cleaned by customer zip code,
then repartitioned again and loaded by credit card number into the data
warehouse target, without landing to disk)

EE Program Elements

Dataset: uniform set of rows in the Framework's internal representation


- Three flavors:
1. file sets
*.fs : stored on multiple Unix files as flat files
2. persistent: *.ds : stored on multiple Unix files in Framework
format
read and written using the DataSet Stage
3. virtual:
*.v : links, in Framework format, NOT stored on disk
- The Framework processes only datasets, hence the possible need for
import
- Different datasets typically have different schemas
- Convention: "dataset" = Framework data set.

Partition: subset of rows in a dataset earmarked for processing by the


same node (virtual CPU, declared in a configuration file).
- All the partitions of a dataset follow the same schema: that of the
dataset

DataStage EE Architecture
DataStage provides the data integration platform; the Orchestrate
Framework provides application scalability.

(Diagram: an Orchestrate program is a sequential data flow (Import,
Clean 1, Clean 2, Merge, Analyze) over flat files and relational data;
the Orchestrate Application Framework and Runtime System adds
centralized error handling and event logging, the configuration file,
performance visualization, parallel access to data in files and RDBMS,
parallel pipelining, inter-node communications, and parallelization
of operations)

DataStage Enterprise Edition:
Best-of-breed scalable data integration platform
No limitations on data volumes or throughput

Introduction to DataStage EE
DSEE:
Automatically scales to fit the machine
Handles data flow among multiple CPUs and disks
With DSEE you can:
Create applications for SMPs, clusters and MPPs
Enterprise Edition is architecture-neutral
Access relational databases in parallel
Execute external applications in parallel
Store data across multiple disks and nodes

Job Design VS. Execution

Developer assembles data flow using the Designer

and gets: parallel access, propagation, transformation, and load.


The design is good for 1 node, 4 nodes,
or N nodes. To change # nodes, just swap configuration file.
No need to modify or recompile the design

Partitioners and Collectors


Partitioners distribute rows into partitions
implement data-partition parallelism
Collectors = inverse partitioners
Live on input links of stages running
in parallel (partitioners)
sequentially (collectors)
Use a choice of methods

Example Partitioning Icons


partitioner

Types of Sequential Data Stages


Sequential
Fixed or variable length
File Set
Lookup File Set
Data Set

Sequential Stage Introduction


The EE Framework processes only datasets
For files other than datasets, such as flat files,
Enterprise Edition must perform import and export
operations; this is performed by import and export
OSH operators generated by Sequential or FileSet
stages
During import or export DataStage performs format
translations into, or out of, the EE internal format
Data is described to the Framework in a schema

How the Sequential Stage Works


Generates Import/Export operators, depending on
whether stage is source or target
Performs direct C++ file I/O streams

Using the Sequential File Stage


Both import and export of general files (text, binary) are performed by the
Sequential File Stage.

Importing/Exporting Data

EE internal format

Data import:

Data export

EE internal format

Working With Flat Files


Sequential File Stage
Normally will execute in sequential mode
Can be parallel if reading multiple files (file pattern
option)
Can use multiple readers within a node
DSEE needs to know
How file is divided into rows
How row is divided into columns

Processes Needed to Import Data


Recordization
Divides input stream into records
Set on the format tab
Columnization
Divides the record into columns
Default set on the format tab but can be overridden
on the columns tab
Can be incomplete if using a schema or not even
specified in the stage if using RCP

File Format Example

(Diagram: a record consists of fields separated by a field delimiter,
ending with the last field and a record delimiter (nl); one example shows
Final Delimiter = end, the other Final Delimiter = comma)

Sequential File Stage


To set the properties, use stage editor
Page (general, input/output)
Tabs (format, columns)
Sequential stage link rules
One input link
One output link (except for reject link definition)
One reject link
Will reject any records not matching meta data in
the column definitions

Job Design Using Sequential Stages

Stage categories

General Tab Sequential Source

Multiple output
links

Show records

Properties Multiple Files

Click to add more files having the same meta data.

Properties - Multiple Readers

Multiple readers option allows you to set the number of readers

Format Tab

Record into columns


File into records

Read Methods

Reject Link
Reject mode = output
Source
All records not matching the meta data (the column
definitions)
Target
All records that are rejected for any reason
Meta data: one column, data type = raw

File Set Stage

Can read or write file sets


Files suffixed by .fs
File set consists of:
1. Descriptor file contains location of raw data files
+ meta data
2. Individual raw data files
Can be processed in parallel

File Set Stage Example


Descriptor file

File Set Usage


Why use a file set?
2G limit on some file systems
Need to distribute data among nodes to prevent
overruns
If used in parallel, runs faster than a sequential file

Lookup File Set Stage


Can create file sets
Usually used in conjunction with Lookup stages

Lookup File Set > Properties


Key column
specified

Key column
dropped in
descriptor file

Data Set

Operating system (Framework) file


Suffixed by .ds
Referred to by a control file
Managed by Data Set Management utility from GUI
(Manager, Designer, Director)
Represents persistent data
Key to good performance in set of linked jobs

Persistent Datasets
Accessed from/to disk with the DataSet Stage.
Two parts:
Descriptor file (e.g. input.ds):
contains metadata and the data location, but NOT the data
itself
record (
  partno: int32;
  description: string;
)
Data file(s):
contain the data
multiple Unix files (one per node), accessible in parallel
e.g. node1:/local/disk1/, node2:/local/disk2/

Data Set Stage

Is the data partitioned?

Engine Data Translation


Occurs on import
From sequential files or file sets
From RDBMS
Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS
Engine is most efficient when processing internally
formatted records (i.e. data contained in datasets)

Managing DataSets

GUI (Manager, Designer, Director) tools > data set management


Alternative methods
Orchadmin
Unix command line utility
List records
Remove data sets (will remove all components)
Dsrecords
Lists number of records in a dataset

Data Set Management

Display data

Schema

Data Set Management From Unix

Alternative method of managing file sets and data sets


Dsrecords
Gives record count
Unix command-line utility
$ dsrecords ds_name
e.g. $ dsrecords myDS.ds
156999 records
Orchadmin
Manages EE persistent data sets
Unix command-line utility
e.g. $ orchadmin rm myDataSet.ds

Job Presentation

Document using the annotation stage

Job Properties Documentation


Organize jobs into
categories

Description shows in DS Manager and MetaStage

Naming conventions
Stages named after the
Data they access
Function they perform
DO NOT leave defaulted stage names like
Sequential_File_0
Links named for the data they carry
DO NOT leave defaulted link names like DSLink3

Stage and Link Names

Stages and links renamed to the data they handle

Create Reusable Job Components

Use Enterprise Edition shared containers when feasible

Container

Use Iterative Job Design


Use copy or peek stage as stub
Test job in phases: small first, then increasing in
complexity
Use Peek stage to examine records

Copy or Peek Stage Stub


Copy stage

Transformer Stage Techniques

Suggestions: Always include a reject link.


Always test for null value before using a column in a function.
Try to use RCP and only map columns that have a derivation
other than a copy. More on RCP later.
Be aware of Column and Stage variable Data Types.
Often user does not pay attention to Stage Variable type.
Avoid type conversions.
Try to maintain the data type as imported.

The Copy Stage


With 1 link in, 1 link out:

the Copy Stage is the ultimate "no-op" (place-holder):


Partitioners

Sort / Remove Duplicates

Rename, Drop column


can be inserted on:

input link (Partitioning): Partitioners, Sort, Remove Duplicates)

output link (Mapping page): Rename, Drop.


Sometimes replace the transformer:
Rename,
Drop,
Implicit type Conversions

Developing Jobs
1. Keep it simple
Jobs with many stages are hard to debug and maintain.
2. Start small and build to the final solution
Use view data, copy, and peek.
Start from the source and work out.
Develop with a 1-node configuration file.
3. Solve the business problem before the performance problem.
Don't worry too much about partitioning until the sequential
flow works as expected.
4. If you have to write to disk, use a persistent data set.

Final Result

Good Things to Have in each Job


Use job parameters
Some helpful environmental variables to add to job
parameters
$APT_DUMP_SCORE
Report OSH to message log
$APT_CONFIG_FILE
Establishes runtime parameters for the EE engine,
e.g. degree of parallelization
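As a sketch (the project and job names here are hypothetical), adding
$APT_CONFIG_FILE as a job parameter lets you point a run at a different
configuration from the dsjob command line:

$ dsjob -run -param APT_CONFIG_FILE=/opt/configs/4node.apt MyProject MyJob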

Setting Job Parameters

Click to add
environment
variables

DUMP SCORE Output

Setting APT_DUMP_SCORE yields:

Double-click

Partitioner
And
Collector

Mapping
Node--> partition

Use Multiple Configuration Files


Make a set for 1X, 2X, ...
Use different ones for test versus production
Include as a parameter in each job

Parallel Database Connectivity


Traditional Client-Server:
Only the RDBMS is running in parallel
Each application has only one connection

Enterprise Edition:
The parallel server runs the APPLICATIONS (e.g. Sort, Load)
The application has parallel connections to the RDBMS

(Diagram: clients connecting to a parallel RDBMS under each approach)

RDBMS Access
Supported Databases
Enterprise Edition provides high
performance / scalable interfaces for:
DB2
Informix
Oracle
Teradata

RDBMS Access
Automatically convert RDBMS table layouts to/from
Enterprise Edition Table Definitions
RDBMS nulls converted to/from nullable field values
Support for standard SQL syntax for specifying:
field list for SELECT statement
filter for WHERE clause
Can write an explicit SQL query to access RDBMS
EE supplies additional information in the SQL query

RDBMS Stages

DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise

RDBMS Usage

As a source
Extract data from table (stream link)
Extract as table, generated SQL, or user-defined SQL
User-defined can perform joins, access views
Lookup (reference link)
Normal lookup is memory-based (all table data read
into memory)
Can perform one lookup at a time in DBMS (sparse
option)
Continue/drop/fail options
As a target
Inserts
Upserts (Inserts and updates)
Loader

RDBMS Source Stream Link

Stream link

DBMS Source - User-defined SQL

Columns in SQL
statement must
match the meta data
in columns tab

DBMS Source Reference Link

Reject link

Lookup Reject Link

Output option
automatically creates
the reject link

Null Handling
Must handle null condition if lookup record is not found
and continue option is chosen
Can be done in a transformer stage

Lookup Stage Mapping

Link name

Lookup Stage Properties


Reference link

Must have same column


name in input and reference
links. You will get the results
of the lookup in the output
column.

DBMS as a Target

DBMS As Target
Write Methods
Delete
Load
Upsert
Write (DB2)
Write mode for load method
Truncate
Create
Replace
Append

Target Properties
Generated code
can be copied

Upsert mode
determines
options

Checking for Nulls


Use Transformer stage to test for fields with null values
(Use IsNull functions)
In Transformer, can reject or load default value

Concepts

The Enterprise Edition Platform


Script language - OSH (generated by DataStage Parallel
Canvas, and run by DataStage Director)
Communication - conductor, section leaders, players.
Configuration files (only one active at a time, describes H/W)
Meta data - schemas/tables
Schema propagation - RCP
EE extensibility - Buildop, Wrapper
Datasets (data in Framework's internal representation)

DS-EE Stage Elements

EE Stages Involve A Series Of Processing Steps


Input Data Set schema:
prov_num:int16;
member_num:int8;
custid:int32;

Output Data Set schema:


prov_num:int16;
member_num:int8;
custid:int32;

(Diagram: an EE Stage consists of an input interface, a partitioner, the
business logic (a piece of application logic running against individual
records, parallel or sequential), and an output interface)

DSEE Stage Execution


Dual parallelism eliminates bottlenecks!
EE delivers parallelism in two ways:
Pipeline
Partition
Block buffering between components
Eliminates need for program load balancing
Maintains orderly data flow

(Diagram: producer and consumer stages connected by pipeline
parallelism and partition parallelism)

Stages Control Partition Parallelism


Execution Mode (sequential/parallel) is controlled by Stage
default = parallel for most Ascential-supplied Stages
Developer can override default mode
Parallel Stage inserts the default partitioner (Auto) on its
input links
Sequential Stage inserts the default collector (Auto) on
its input links
Developer can override default
execution mode (parallel/sequential) of Stage >
Advanced tab
choice of partitioner/collector on Input > Partitioning
tab

How Parallel Is It?


Degree of parallelism is determined by the
configuration file

Total number of logical nodes in default


pool, or a subset if using "constraints".
Constraints are assigned to specific pools as defined in
configuration file and can be referenced in the stage

OSH
DataStage EE GUI generates OSH scripts
Ability to view OSH turned on in Administrator
OSH can be viewed in Designer using job properties
The Framework executes OSH
What is OSH?
Orchestrate shell
Has a UNIX command-line interface

OSH Script

An osh script is a quoted string which specifies:


The operators and connections of a single
Orchestrate step
In its simplest form, it is:
osh op < in.ds > out.ds

Where:
op is an Orchestrate operator
in.ds is the input data set
out.ds is the output data set
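As a sketch of a slightly larger step (operator names chosen for
illustration; copy and peek are existing Orchestrate operators mentioned
in this course), two operators can be connected through a virtual dataset
with a Unix-style pipe:

osh "copy < in.ds | peek > out.ds"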

OSH Operators
OSH Operator is an instance of a C++ class inheriting from
APT_Operator
Developers can create new operators
Examples of existing operators:
Import
Export
RemoveDups

Enable Visible OSH in Administrator

Will be enabled
for all projects

View OSH in Designer

Operator

Schema

Elements of a Framework Program

Operators
Datasets: set of rows processed by the Framework
Orchestrate data sets:
persistent (terminal) *.ds, and
virtual (internal) *.v.
Also: flat file sets *.fs
Schema: data description (metadata) for datasets and links.

Datasets
Consist of partitioned data and schema
Can be persistent (*.ds)
or virtual (*.v, Link)
Overcome the 2 GB file limit

What you program (GUI): Operator A
What gets generated (OSH): $ osh operator_A > x.ds
What gets processed: one copy of Operator A running on each node
(Node 1 ... Node 4), writing the data files of x.ds
Multiple files per partition
Each file up to 2 GBytes (or larger)

Computing Architectures: Definition

Uniprocessor: one CPU with memory and dedicated disk
SMP System (Symmetric Multiprocessor): multiple CPUs with shared
memory and shared disk
Clusters and MPP Systems: shared nothing; each CPU has its own
memory and its own disk

Job Execution: Orchestrate

(Diagram: a Conductor node plus Processing nodes, each running a
Section Leader (SL) and Player (P) processes)

Conductor - initial DS/EE process
Step Composer
Creates Section Leader processes (one per node)
Consolidates messages, outputs them
Manages orderly shutdown.
Section Leader
Forks Player processes (one per Stage)
Manages up/down communication.
Players
The actual processes associated with Stages
Combined players: one process only
Send stderr to SL
Establish connections to other players for data
flow

Working with Configuration Files


You can easily switch between config files:
'1-node' file - for sequential execution, lighter reports; handy for
testing
'MedN-nodes' file - aims at a mix of pipeline and data-partitioned
parallelism
'BigN-nodes' file - aims at full data-partitioned parallelism

Only one file is active while a step is running
The Framework queries (first) the environment variable:
$APT_CONFIG_FILE
The # of nodes declared in the config file need not match the # of
CPUs
The same configuration file can be used on development and
target machines

Scheduling Nodes, Processes, and CPUs

DS/EE does not:
know how many CPUs are available
schedule

Who knows what?
Nodes = # logical nodes declared in the config file (known to the user)
Ops = # ops (approx. # blue boxes in V.O.; known to Orchestrate)
Processes = # Unix processes = Nodes * Ops (known to Orchestrate)
CPUs = # available CPUs (known to the O/S)

Who does what?
DS/EE creates (Nodes * Ops) Unix processes
The O/S schedules these processes on the CPUs

Re-Partitioning
Parallel to parallel flow may incur reshuffling:
Records may jump between nodes
(Diagram: a partitioner redistributing records between node 1 and node 2)

Partitioning Methods

Auto
Hash
Entire
Range
Range Map

Collectors
Collectors combine partitions of a dataset into a single
input stream to a sequential Stage

(Diagram: a collector combines data partitions into a single input stream
feeding a sequential Stage)

Collectors do NOT synchronize data

Partitioning and Repartitioning Are Visible On Job Design

Partitioning and Collecting Icons

Partitioner

Collector

Reading Messages in Director

Set APT_DUMP_SCORE to true


Can be specified as job parameter
Messages sent to Director log
If set, parallel job will produce a report showing the
operators, processes, and datasets in the running job

Messages With APT_DUMP_SCORE = True

Transformed Data

Transformed data is:
An outgoing column is a derivation that may, or may not, include
incoming fields or parts of incoming fields
May be comprised of system variables
Frequently uses functions performed on something (e.g. incoming
columns)
Functions are divided into categories, e.g.:
Date and time
Mathematical
Logical
Null handling
More

Stages Review

Stages that can transform data


Transformer
Parallel
Basic (from Parallel palette)
Aggregator (discussed in later module)

Sample stages that do not transform data


Sequential
FileSet
DataSet
DBMS

Transformer Stage Functions

Control data flow


Create derivations

Flow Control
Separate records flow down links based on data
condition specified in Transformer stage constraints
Transformer stage can filter records
Other stages can filter records but do not exhibit
advanced flow control
Sequential can send bad records down reject link
Lookup can reject records based on lookup failure
Filter can select records based on data value

Rejecting Data

Reject option on sequential stage


Data does not agree with meta data
Output consists of one column with binary data type
Reject links (from Lookup stage) result from the drop option of the
property If Not Found
Lookup failed
All columns on reject link (no column mapping option)
Reject constraints are controlled from the constraint editor of the
transformer
Can control column mapping
Use the Other/Log checkbox

Rejecting Data Example

Property
Reject Mode
= Output

If Not
Found
property

Constraint
Other/log
option

Transformer Stage Properties

Transformer Stage Variables


First of transformer stage entities to execute
Execute in order from top to bottom
Can write a program by using one stage variable to
point to the results of a previous stage variable
Multi-purpose
Counters
Hold values for previous rows to make comparison
Hold derivations to be used in multiple field derivations
Can be used to control execution of constraints

Stage Variables

Show/Hide
button

Transforming Data
Derivations
Using expressions
Using functions
Date/time
Transformer Stage Issues
Sometimes require sorting before the transformer
stage, e.g. using a stage variable as an accumulator and
needing to break on a change of column value
Checking for nulls

Checking for Nulls


Nulls can get introduced into the dataflow because of
failed lookups and the way in which you chose to
handle this condition
Can be handled in constraints, derivations, stage
variables, or a combination of these
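A minimal sketch of such a derivation (the link and column names are
hypothetical), using the IsNull function mentioned in this course to
substitute a default value when the lookup did not find a match:

If IsNull(lkp_out.CustName) Then "UNKNOWN" Else lkp_out.CustName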

Transformer - Handling Rejects

Constraint Rejects
All expressions are false
and reject row is
checked

Transformer: Execution Order


Derivations in stage variables are executed first
Constraints are executed before derivations
Column derivations in earlier links are executed before later links
Derivations in higher columns are executed before lower columns

Parallel Palette - Two Transformers


Transformer (Parallel > Processing)
Is the non-Universe transformer
Has a specific set of functions
No DS routines available
Basic Transformer (All > Processing > Basic Transformer)
Makes server-style transforms available on the parallel palette
Can use DS routines
Program in Basic for both transformers

Transformer Functions From


Derivation Editor

Date & Time


Logical
Null Handling
Number
String
Type Conversion

Sorting Data
Important because
Some stages require sorted input
Some stages may run faster, e.g. the Aggregator
Can be performed
Option within stages (use input > partitioning tab
and set partitioning to anything other than auto)
As a separate stage (more complex sorts)

Sorting Alternatives

Alternative representation of same flow:

Sort Option on Stage Link

Sort Stage

Sort Stage - Outputs

Specifies how the output is derived

Sort Specification Options


Input Link Property
Limited functionality
Max memory/partition is 20 MB, then spills to scratch
Sort Stage
Tunable to use more memory before spilling to
scratch.
Note: Spread I/O by adding more scratch file systems to
each node of the APT_CONFIG_FILE

Removing Duplicates
Can be done by Sort stage
Use unique option
OR
Remove Duplicates stage
Has more sophisticated ways to remove duplicates

Combining Data

There are two ways to combine data:


Horizontally:
Several input links; one output link (+ optional rejects) made
of columns from different input links. E.g.,
Joins
Lookup
Merge
Vertically:
One input link, one output link with column combining values
from all input rows. E.g.,
Aggregator

Join, Lookup & Merge


Stages
These "three Stages"
combine two or more input links
according to values of user-designated "key"
column(s).

They differ mainly in:


Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)

Not all Links are Created Equal

Enterprise Edition distinguishes between:
- The Primary Input (Framework port 0)
- Secondary - in some cases "Reference" (other ports)

Naming convention:
Primary Input: port 0
Secondary Input(s): ports 1, ...

             Joins    Lookup         Merge
Primary:     Left     Source         Master
Secondary:   Right    LU Table(s)    Update(s)

Tip:
Check the "Input Ordering" tab to make sure the intended Primary is listed
first

Join Stage Editor


Link Order
immaterial for Inner
and Full Outer Joins
(but VERY important
for Left/Right Outer
and Lookup and
Merge)
One of four variants:

Inner

Left Outer

Right Outer

Full Outer

Several key columns


allowed

1. The Join Stage


Four types:

Inner
Left Outer
Right Outer
Full Outer

2 sorted input links, 1 output link


"left outer" on primary input, "right outer" on secondary input
Pre-sort make joins "lightweight": few rows need to be in RAM

2. The Lookup Stage


Combines:
one source link with
one or more duplicate-free table links
(Diagram: a source input (port 0) and one or more lookup tables (LUTs)
feed the Lookup stage, which produces an output link and a reject link)

no pre-sort necessary
allows multiple-key LUTs
flexible exception handling for
source input rows with no match

The Lookup Stage


Lookup Tables should be small enough to fit into
physical memory (otherwise, performance hit due
to paging)
On an MPP you should partition the lookup tables
using entire partitioning method, or partition them
the same way you partition the source link
On an SMP, no physical duplication of a Lookup
Table occurs

The Lookup Stage

Lookup File Set


Like a persistent data set only it
contains metadata about the key.
Useful for staging lookup tables

RDBMS LOOKUP
NORMAL
Loads to an in memory hash
table first
SPARSE
Select for each row.
Might become a performance
bottleneck.

3. The Merge Stage

Combines
one sorted, duplicate-free master (primary) link with
one or more sorted update (secondary) links.
Pre-sort makes merge "lightweight": few rows need to be in RAM (as with
joins, but opposite to lookup).
Follows the Master-Update model:
Master row and one or more updates row are merged if they have the
same value in user-specified key column(s).
A non-key column occurs in several inputs? The lowest input port number
prevails (e.g., master over update; update values are ignored)
Unmatched ("Bad") master rows can be either
kept
dropped
Unmatched ("Bad") update rows in input link can be captured in a "reject"
link
Matched update rows are consumed.

The Merge Stage


(Diagram: a master link (port 0) and one or more update links feed the
Merge stage, which produces an output link and reject links)

Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates can be captured in rejects
Lightweight
Space/time tradeoff: presorts vs. in-RAM table

Synopsis: Joins, Lookup, & Merge

                                   Joins                        Lookup                              Merge
Model                              RDBMS-style relational       Source - in RAM LU Table            Master - Update(s)
Memory usage                       light                        heavy                               light
# and names of Inputs              exactly 2: 1 left, 1 right   1 Source, N LU Tables               1 Master, N Update(s)
Mandatory Input Sort               both inputs                  no                                  all inputs
Duplicates in primary input        OK (x-product)               OK                                  Warning!
Duplicates in secondary input(s)   OK (x-product)               Warning!                            OK only when N = 1
Options on unmatched primary       NONE                         [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary     NONE                         NONE                                capture in reject set(s)
On match, secondary entries are    reusable                     reusable                            consumed
# Outputs                          1                            1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)          Nothing (N/A)                unmatched primary entries           unmatched secondary entries

In this table:
, <comma> = separator between primary and secondary input links
(out and reject links)

The Aggregator Stage


Purpose: Perform data aggregations
Specify:
Zero or more key columns that define the
aggregation units (or groups)
Columns to be aggregated
Aggregation functions:
count (nulls/non-nulls), sum,
max/min/range
The grouping method (hash table or pre-sort) is a
performance issue

Grouping Methods

Hash: results for each aggregation group are stored in a hash table,
and the table is written out after all input has been processed
doesn't require sorted data
good when the number of unique groups is small. The running tally for
each group's aggregate calculations needs to fit easily into
memory. Requires about 1 KB/group of RAM.
Example: average family income by state requires about 0.05 MB of RAM
Sort: results for only a single aggregation group are kept in memory;
when a new group is seen (key value changes), the current group is written
out.
requires input sorted by grouping keys
can handle unlimited numbers of groups
Example: average daily balance by credit card

Aggregator Functions

Sum
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation

Aggregator Properties

Aggregation Types

Aggregation types

Containers
Two varieties
Local
Shared
Local
Simplifies a large, complex diagram
Shared
Creates reusable object that many jobs can include

Creating a Container

Create a job
Select (loop) portions to containerize
Edit > Construct container > local or shared

Configuration File Concepts


Determine the processing nodes and disk space
connected to each node
When the system changes, you need only change the
configuration file; no need to recompile jobs
When DataStage job runs, platform reads configuration
file
Platform automatically scales the application to fit
the system

Processing Nodes Are


Locations on which the framework runs applications
Logical rather than physical construct
Do not necessarily correspond to the number of CPUs
in your system
Typically one node for two CPUs
Can define one processing node for multiple physical
nodes or multiple processing nodes for one physical
node

Optimizing Parallelism
Degree of parallelism determined by number of nodes
defined
Parallelism should be optimized, not maximized
Increasing parallelism distributes work load but also
increases Framework overhead
Hardware influences degree of parallelism possible
System hardware partially determines configuration

More Factors to Consider


Communication amongst operators
Should be optimized by your configuration
Operators exchanging large amounts of data should
be assigned to nodes communicating by shared
memory or high-speed link
SMP leave some processors for operating system
Desirable to equalize partitioning of data
Use an experimental approach
Start with small data sets
Try different parallelism while scaling up data set
sizes

Configuration File
Text file containing string data that is passed to the
Framework
Sits on server side
Can be displayed and edited
Name and location found in environmental variable
APT_CONFIG_FILE
Components
Node
Fast name
Pools
Resource

Node Options

Node name name of a processing node used by EE


Typically the network name
Use the command uname -n to obtain the network name
Fastname
Name of node as referred to by fastest network in the system
Operators use physical node name to open connections
NOTE: for SMP, all CPUs share single connection to network
Pools
Names of pools to which this node is assigned
Used to logically group nodes
Can also be used to group resources
Resource
Disk
Scratchdisk

Sample Configuration File


{
node "Node1"
{
fastname "BlackHole"
pools "" "node1"
resource disk "/usr/dsadm/Ascential/DataStage/Datasets"
{pools "" }
resource scratchdisk
"/usr/dsadm/Ascential/DataStage/Scratch" {pools "" }
}
}
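A sketch of how the same file might be extended to two nodes (the second
host name is illustrative only; each node gets its own disk and scratchdisk
resources):

{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
  node "Node2"
  {
    fastname "BlackHole2"
    pools "" "node2"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}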

Disk Pools
pool "bigdata"

Disk pools allocate storage


By default, EE uses the default
pool, specified by ""

Sorting Requirements
Resource pools can also be specified for sorting:
The Sort stage looks first for scratch disk resources in a
sort pool, and then in the default disk pool
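A sketch of how a node's scratch disk could be added to the sort pool in
the configuration file (the path is illustrative), so the Sort stage finds
it before falling back to the default pool:

resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch_sort" {pools "" "sort"}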

Resource Types

Disk
Scratchdisk
DB2
Oracle
Saswork
Sortwork
Can exist in a pool
Groups resources together

Using Different Configurations

Lookup stage where DBMS is using a sparse lookup type

Building a Configuration File

Scoping the hardware:


Is the hardware configuration SMP, Cluster, or MPP?
Define each node structure (an SMP would be single node):
Number of CPUs
CPU speed
Available memory
Available page/swap space
Connectivity (network/back-panel speed)
Is the machine dedicated to EE? If not, what other applications
are running on it?
Get a breakdown of the resource usage (vmstat, mpstat, iostat)
Are there other configuration restrictions? E.g. DB only runs on
certain nodes and ETL cannot run on them?

Wrappers vs. Buildop vs. Custom


Wrappers are good if you cannot or do not want to
modify the application and performance is not critical.
Buildops: good if you need custom coding but do not
need dynamic (runtime-based) input and output
interfaces.
Custom (C++ coding using framework API): good if you
need custom coding and need dynamic input and
output interfaces.

Building Wrapped Stages


You can wrapper a legacy executable:
Binary
Unix command
Shell script
and turn it into a Enterprise Edition stage capable,
among other things, of parallel execution
As long as the legacy executable is:
amenable to data-partition parallelism
no dependencies between rows
pipe-safe
can read rows sequentially
no random access to data
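As a sketch, a simple Unix filter that satisfies these criteria (reads rows
from stdin, writes rows to stdout, keeps no state across rows) is a good
wrapping candidate, for example:

tr '[:lower:]' '[:upper:]'

Each input row is converted independently, so the command is pipe-safe and
amenable to data-partition parallelism.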

Wrappers (Cont'd)

Wrappers are treated as a black box


EE has no knowledge of contents
EE has no means of managing anything that occurs inside
the wrapper
EE only knows how to export data to and import data from
the wrapper
User must know at design time the intended behavior of the
wrapper and its schema interface
If the wrappered application needs to see all records prior
to processing, it cannot run in parallel.

LS Example

Can this command be wrappered?

Creating a Wrapper

To create the ls stage

Used in this job ---

Wrapper Starting Point


Creating Wrapped Stages
From Manager:
Right-Click on Stage Type
> New Parallel Stage > Wrapped

We will "Wrapper an existing


Unix executables the ls
command

Wrapper - General Page

Name of stage

Unix command to be wrapped

The "Creator" Page


Conscientiously
maintaining the Creator
page for all your wrapped
stages will eventually earn
you the thanks of others.

Wrapper Properties Page


If your stage will have properties appear, complete the
Properties page

This will be the name


of the property as it
appears in your stage

Wrapper - Wrapped Page

Interfaces: input and output columns. These should first be entered into the
table definitions meta data (DS
Manager); let's do that now.

Interface schemas
Layout interfaces describe what columns the stage:
Needs for its inputs (if any)
Creates for its outputs (if any)
Should be created as tables with columns in Manager

Column Definition for Wrapper


Interface

How Does the Wrapping Work?

Define the schema for export


and import
Schemas become interface
schemas of the operator and
allow for by-name
column access

(Diagram: the input schema is exported to the UNIX executable via stdin
or a named pipe; the executable's output on stdout or a named pipe is
imported back using the output schema)

Update the Wrapper Interfaces


This wrapper will have no input interface, i.e. no input link.
The location will come as a job parameter that will be passed
to the appropriate stage property. Therefore, only the Output
tab entry is needed.

Resulting Job

Wrapped stage

Job Run
Show file from Designer palette

Wrapper Story: Cobol Application

Hardware Environment:
IBM SP2, 2 nodes with 4 CPUs per node.
Software:
DB2/EEE, COBOL, EE
Original COBOL Application:
Extracted source table, performed lookup against table in DB2, and
loaded results to target table.
4 hours 20 minutes sequential execution
Enterprise Edition Solution:
Used EE to perform Parallel DB2 Extracts and Loads
Used EE to execute COBOL application in Parallel
EE Framework handled data transfer between
DB2/EEE and COBOL application
30 minutes 8-way parallel execution

Buildops
Buildop provides a simple means of extending beyond the functionality
provided by EE, but does not use an existing executable (like the wrapper)
Reasons to use Buildop include:
Speed / Performance
Complex business logic that cannot be easily represented
using existing stages
Lookups across a range of values
Surrogate key generation
Rolling aggregates
Build once and reusable everywhere within project, no
shared container necessary
Can combine functionality from different stages into one

BuildOps
The DataStage programmer encapsulates the business
logic
The Enterprise Edition interface called buildop
automatically performs the tedious, error-prone tasks:
invoke needed header files, build the necessary
plumbing for a correct and efficient parallel execution.
Exploits extensibility of EE Framework

BuildOp Process Overview


From Manager (or Designer):
Repository pane:
Right-Click on Stage Type
> New Parallel Stage > {Custom | Build | Wrapped}

"Build" stages
from within Enterprise Edition
"Wrapping existing Unix
executables

General Page
Identical
to Wrappers,
except:

Under the Build


Tab, your program!

Logic Tab for Business Logic


Enter Business C/C++
logic and arithmetic in four
pages under the Logic tab
Main code section goes in
Per-Record page- it will be
applied to all rows
NOTE: Code will need to
be Ansi C/C++ compliant.
If the code does not compile
outside of EE, it won't
compile within EE either!

Code Sections under Logic Tab


Temporary
variables
declared [and
initialized] here

Logic here is executed


once BEFORE
processing the FIRST
row

Logic here is executed


once AFTER
processing the LAST
row

I/O and Transfer


Under Interface tab: Input, Output & Transfer pages

First line:
output 0

Optional renaming
of
output port from
default "out0"

Write row
Input page: 'Auto Read'
Read next row

In-Repository
Table Definition

'False' setting,
not to interfere
with Transfer page

I/O and Transfer

First line:
Transfer of index 0

Transfer all columns from input to output.


If page left blank or Auto Transfer = "False" (and RCP = "False")
Only columns in output Table Definition are written

BuildOp Simple Example

Example - sumNoTransfer
Add input columns "a" and "b"; ignore other
columns
that might be present in input
Produce a new "sum" column
Do not transfer input columns
a:int32; b:int32

sumNoTransfer
sum:int32
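A minimal sketch of the Per-Record code for this example (assuming the
buildop interface lets the code reference input and output columns
directly by name, as set up on the Interface pages):

// add the two input columns and write the result to the new output column
sum = a + b;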

No Transfer
From Peek:

NO TRANSFER
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred

Transfer

TRANSFER
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)

Columns vs. Temporary C++ Variables

Columns
DS-EE type
Defined in Table Definitions
Value refreshed from row to row

Temp C++ variables
C/C++ type
Need declaration (in Definitions or Pre-Loop page)
Value persistent throughout "loop" over rows, unless modified in code

Custom Stage
Reasons for a custom stage:
Add EE operator not already in DataStage EE
Build your own Operator and add to DataStage EE
Use EE API
Use Custom Stage to add new operator to EE canvas

Custom Stage
DataStage Manager > select Stage Types branch > right click

Custom Stage
Number of input and
output links allowed

Name of Orchestrate
operator to be used

Custom Stage Properties Tab

The Result

Establishing Meta Data

Data definitions
Recordization and columnization
Fields have properties that can be set at individual field level
Data types in GUI are translated to types used by EE
Described as properties on the format/columns tab (outputs or
inputs pages) OR
Using a schema file (can be full or partial)
Schemas
Can be imported into Manager
Can be pointed to by some job stages (i.e. Sequential)

Data Formatting Record Level


Format tab
Meta data described on a record basis
Record level properties

Data Formatting Column Level


Defaults for all columns

Column Overrides
Edit row from within the columns tab
Set individual column properties

Extended Column Properties

Field
and
string
settings

Extended Properties String


Type ASCII to EBCDIC
Note the ability to convert

Editing Columns

Properties depend
on the data type

Schema
Alternative way to specify column definitions for data
used in EE jobs
Written in a plain text file
Can be written as a partial record definition
Can be imported into the DataStage repository
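A sketch of what such a schema file might contain (the column names are
illustrative), using the record ( ... ) syntax shown earlier for dataset
descriptors:

record (
  custid: int32;
  custname: string[max=30];
  balance: decimal[10,2];
  last_update: nullable timestamp;
)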

Creating a Schema
Using a text editor
Follow correct syntax for definitions
OR
Import from an existing data set or file set
On DataStage Manager import > Table Definitions >
Orchestrate Schema Definitions
Select checkbox for a file with .fs or .ds

Importing a Schema

Schema location can be


on the server or local
work station

Data Types

Date
Decimal
Floating point
Integer
String
Time
Timestamp

Vector
Subrecord
Raw
Tagged

Runtime Column Propagation

DataStage EE is flexible about meta data. It can cope with the


situation where meta data isn't fully defined. You can define part
of your schema and specify that, if your job encounters extra
columns that are not defined in the meta data when it actually
runs, it will adopt these extra columns and propagate them
through the rest of the job. This is known as runtime column
propagation (RCP).
RCP is always on at runtime.
Design and compile time column mapping enforcement.
RCP is off by default.
Enable first at project level. (Administrator project properties)
Enable at job level. (job properties General tab)
Enable at Stage. (Link Output Column tab)

Enabling RCP at Project Level

Enabling RCP at Job Level

Enabling RCP at Stage Level

Go to output links columns tab


For transformer you can find the output links columns tab by first
going to stage properties

Using RCP with Sequential Stages


To utilize runtime column propagation in the sequential
stage you must use the use schema option
Stages with this restriction:
Sequential
File Set
External Source
External Target

Runtime Column Propagation

When RCP is Disabled


DataStage Designer will enforce Stage Input Column to
Output Column mappings.
At job compile time modify operators are inserted on output
links in the generated osh.

Runtime Column Propagation

When RCP is Enabled


DataStage Designer will not enforce mapping rules.
No Modify operator inserted at compile time.
Danger of a runtime error if incoming column names do not
match outgoing link column names (names are case sensitive).

Job Control Options


Manually write job control
Code generated in Basic
Use the job control tab on the job properties page
Generates basic code which you can modify
Job Sequencer
Build a controlling job much the same way you build
other jobs
Comprised of stages and links
No basic coding

Job Sequencer

Build like a regular job


Type Job Sequence
Has stages and links
Job Activity stage
represents a
DataStage job
Links represent
passing control
Stages

Example
Job Activity
stage
contains
conditional
triggers

Job Activity Properties

Job to be executed
select from dropdown

Job parameters
to be passed

Job Activity Trigger

Trigger appears as a link in the diagram


Custom options let you define the code

Options
Use custom option for conditionals
Execute if job run successful or warnings only
Can add wait for file to execute
Add execute command stage to drop real tables and
rename new tables to current tables

Job Activity With Multiple Links

Different links
having different
triggers

Sequencer Stage
Build job sequencer to control job for the collections
application

Can be set to
all or any

Notification Stage

Notification

Notification Activity

Sample DataStage log from Mail Notification

Notification Activity Message


E-Mail Message

Environment Variables

Parallel Environment Variables

Environment Variables Stage Specific

Environment Variables

Environment Variables Compiler

The Director
Typical Job Log Messages:

Environment variables
Configuration File information
Framework Info/Warning/Error messages
Output from the Peek Stage
Additional info with "Reporting" environments
Tracing/Debug output
Must compile job in trace mode
Adds overhead

Job Level Environmental Variables

Job Properties, from the Menu Bar of Designer

Director will prompt you before each run

Troubleshooting
If you get an error during compile, check the following:

Compilation problems
If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
If Buildop errors occur, try buildop from the command line
Some stages may not support RCP; this can cause a column mismatch
Use the Show Error and More buttons
Examine the generated OSH
Check environment variable settings

Very little integrity checking during compile, should run validate from Director.

Highlights source of error

Generating Test Data


Row Generator stage can be used
Column definitions
Data type dependent
Row Generator plus lookup stages provides good way
to create robust test data from pattern files

Thank You !!!

