Introduction to DataStage
What is DataStage?
Design jobs for Extraction, Transformation, and
Loading (ETL)
Ideal tool for data integration projects such as data
warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use
within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and
execution environments
DataStage Administrator
Client Logon
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
DataStage Projects
Project Properties
Projects can be created and deleted in
Administrator
Project properties and defaults are set in
Administrator
Licensing Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
What Is Metadata?
[Diagram: Source, Transform, and Target each exchange meta data with the Repository]
DataStage Manager
Export Procedure
Import Procedure
In Manager, click Import>DataStage Components
Select DataStage objects for import
Import Options
Metadata Import
Import format and column definitions from
sequential files
Import relational table column definitions
Imported as Table Definitions
Table definitions can be loaded into job stages
What Is a Job?
Executable DataStage program
Created in DataStage Designer, but can use
components from Manager
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH)
Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers
Job properties
Compile
Tools Palette
Transformer Stage
Used to define constraints, derivations, and column
mappings
A column mapping maps an input column to an output
column
In this module we will define only column mappings (no
derivations)
Result
Compiling a Job
DataStage Director
Can schedule, validate, and run jobs
Can be invoked from DataStage Manager or Designer
Tools > Run Director
Process Flow
Administrator: project creation/removal
Functions specific to a project
Administrator: environment variables
Variables are category-specific
OSH is what is run by the EE Framework
DataStage Manager
Designer Workspace
The EE Framework runs OSH
Messages from the previous run appear in a different color
Stages
Row Generator
Peek
Row Generator
Repeatable property
Peek
Displays field values
Will be displayed in job log or sent to a file
Skip records option
Can control number of records to be displayed
Can be used as stub stage for iterative development
(more later)
Why EE is so Effective
Parallel processing paradigm
More hardware, faster processing
Level of parallelization is determined by a
configuration file read at runtime
Emphasis on memory
Data is read into memory and lookups are performed against an
in-memory hash table
[Diagram: traditional batch ETL  Operational and Archived Data are staged to disk between the Clean and Load steps before reaching the Data Warehouse target]
Pipeline Multiprocessing
Data Pipelining
Transform, clean, and load processes execute simultaneously on the same processor
Rows keep moving forward through the flow
[Diagram: Operational and Archived Data flow through Transform, Clean, and Load in one pipeline into the Data Warehouse target]
This eliminates intermediate staging to disk, which is critical for big data.
This also keeps the processors busy.
Still has limits on scalability
Partition Parallelism
Data Partitioning
Break up big data into partitions
[Diagram: source data is partitioned (A-F, G-M, N-T, U-Z) across Node 1-Node 4, each running its own Transform]
Pipelining
[Diagram: Source → Transform → Clean → Load → Data Warehouse target]
Repartitioning
Putting It All Together: Parallel Dataflow
with Repartitioning on-the-fly
[Diagram: partitioned source data (A-F, G-M, N-T, U-Z) flows through Transform and Clean with repartitioning on the fly between stages, then is loaded into the Data Warehouse target; the entire flow is pipelined]
EE Program Elements
DataStage EE Architecture
DataStage: provides the data integration platform
Orchestrate Framework: provides inter-node communications, parallelization of operations, and performance visualization
[Diagram: Flat Files and Relational Data pass through Import → Clean 1 / Clean 2 → Merge → Analyze under control of a Configuration File]
Introduction to DataStage EE
DSEE:
Automatically scales to fit the machine
Handles data flow among multiple CPUs and disks
With DSEE you can:
Create applications for SMPs, clusters and MPPs
Enterprise Edition is architecture-neutral
Access relational databases in parallel
Execute external applications in parallel
Store data across multiple disks and nodes
Importing/Exporting Data
Data import: converts external data into the EE internal format
Data export: converts the EE internal format back to external data
[Diagram: exported record layouts  Field 1 … Last field separated by a Field Delimiter; Final Delimiter = end (newline after the last field) or Final Delimiter = comma (comma plus newline after the last field)]
Stage categories
Multiple output links
Show records
Format Tab
Read Methods
Reject Link
Reject mode = output
Source
All records not matching the meta data (the column
definitions)
Target
All records that are rejected for any reason
Meta data: one column, data type = raw
Key column dropped in descriptor file
Data Set
Persistent Datasets
Accessed from/to disk with DataSet Stage.
Two parts:
Descriptor file: contains meta data and data location, but NOT the data itself
Data file(s): contain the data
Example descriptor schema (input.ds):
record (
  partno: int32;
  description: string;
)
Managing DataSets
Display data
Schema
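Outside the GUI, persistent data sets are managed with the orchadmin command-line utility, because the descriptor alone does not contain the data and ordinary file commands would orphan the data files. A hedged sketch (subcommand spellings vary by release; verify against your installation):

orchadmin describe input.ds   # show schema and data-file locations
orchadmin dump input.ds       # print the records
orchadmin rm input.ds         # remove descriptor and data files together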
Job Presentation
Naming conventions
Stages named after the data they access or the function they perform
DO NOT leave defaulted stage names like Sequential_File_0
Links named for the data they carry
DO NOT leave defaulted link names like DSLink3
Container
Developing Jobs
1. Keep it simple
Jobs with many stages are hard to debug and maintain.
2. Start small and build to the final solution
Use view data, copy, and peek.
Start from the source and work outward.
Develop with a 1-node configuration file.
3. Solve the business problem before the performance problem
Don't worry too much about partitioning until the sequential
flow works as expected.
4. If you have to write to disk, use a persistent data set.
Final Result
Click to add environment variables
Double-click
Partitioner and Collector
Mapping: node → partition
[Diagram: multiple clients submit work to Enterprise Edition, which runs Sort and Load operations in parallel against a Parallel RDBMS]
RDBMS Access
Supported Databases
Enterprise Edition provides high performance,
scalable interfaces for:
DB2
Informix
Oracle
Teradata
RDBMS Access
Automatically convert RDBMS table layouts to/from
Enterprise Edition Table Definitions
RDBMS nulls converted to/from nullable field values
Support for standard SQL syntax for specifying:
field list for SELECT statement
filter for WHERE clause
Can write an explicit SQL query to access RDBMS
EE supplies additional information in the SQL query
RDBMS Stages
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
RDBMS Usage
As a source
Extract data from table (stream link)
Extract as table, generated SQL, or user-defined SQL
User-defined can perform joins, access views
Lookup (reference link)
Normal lookup is memory-based (all table data read
into memory)
Can perform one lookup at a time in DBMS (sparse
option)
Continue/drop/fail options
As a target
Inserts
Upserts (Inserts and updates)
Loader
Stream link
Columns in the SQL statement must match the meta data in the columns tab
Reject link
The Output option automatically creates the reject link
Null Handling
Must handle the null condition if a lookup record is not found
and the continue option is chosen
Can be done in a Transformer stage, as sketched below
Link name
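A hedged sketch of such a Transformer output derivation (link and column names illustrative):

If IsNull(lkp.description) Then 'UNKNOWN' Else lkp.description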
DBMS as a Target
Write Methods
Delete
Load
Upsert
Write (DB2)
Write mode for load method
Truncate
Create
Replace
Append
Target Properties
Generated code can be copied
Upsert mode determines options
Concepts
Piece of application logic running against individual records
Parallel or sequential
[Diagram: an EE Stage has an Input Interface, a Partitioner, Business Logic, and an Output Interface; a Producer feeds a Consumer through a Pipeline within each Partition]
OSH
DataStage EE GUI generates OSH scripts
Ability to view OSH turned on in Administrator
OSH can be viewed in Designer using job properties
The Framework executes OSH
What is OSH?
Orchestrate shell
Has a UNIX command-line interface
OSH Script
Example: op < in.ds > out.ds
Where:
op is an Orchestrate operator
in.ds is the input data set
out.ds is the output data set
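As a concrete sketch of how such scripts look on the command line (copy and import are real Orchestrate operators; file names and exact option spellings are illustrative and should be checked against your version):

osh "copy < in.ds > out.ds"                                # one operator, data set in, data set out
osh "import -file in.txt -schema record(a:int32;) | peek"  # two operators connected by a virtual data set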
OSH Operators
OSH Operator is an instance of a C++ class inheriting from
APT_Operator
Developers can create new operators
Examples of existing operators:
Import
Export
RemoveDups
Will be enabled for all projects
Elements of a Framework Program
Operators
Datasets: sets of rows processed by the Framework
Consist of partitioned data and schema
Can be persistent (*.ds) or virtual (*.v, a link)
Overcome the 2 GB file limit
Schemas
What you program:
[Diagram: the GUI design compiles to OSH; at run time Operator A is replicated across Node 1-Node 4, sharing disk]
[Diagram: hardware architectures  Uniprocessor (one CPU, memory, disk); SMP System (Symmetric Multiprocessor: multiple CPUs with shared memory and disk); Shared Nothing (each processing node has its own CPU, memory, and disks)]
[Diagram: each Processing Node runs a Section Leader (SL) and Player (P) processes]
Section Leader
Forks Player processes (one per stage)
Manages up/down communication
Players
The actual processes associated with stages
Combined players: one process only
Send stderr to the SL
Establish connections to other players for data flow
$APT_CONFIG_FILE names the configuration file read at run time
Parallelism can be scaled by switching files, e.g., a 'MedN-nodes' file for testing (first), then a 'BigN-nodes' file for full parallelism
[Diagram: the user expresses Ops in Orchestrate; the Framework creates Nodes x Ops processes, which the O/S schedules onto the CPUs]
Re-Partitioning
Parallel-to-parallel flow may incur reshuffling:
Records may jump between nodes
[Diagram: a partitioner moves records between node 1 and node 2]
Partitioning Methods
Auto
Hash
Entire
Range
Range Map
Collectors
Collectors combine partitions of a dataset into a single
input stream to a sequential Stage
[Diagram: data partitions feed a collector, which produces the single stream consumed by a sequential stage]
[Screenshot callouts: Partitioner, Collector, Transformed Data]
Stages Review
Flow Control
Separate records flow down links based on data
condition specified in Transformer stage constraints
Transformer stage can filter records
Other stages can filter records but do not exhibit
advanced flow control
Sequential can send bad records down reject link
Lookup can reject records based on lookup failure
Filter can select records based on data value
Rejecting Data
Sequential stage: Reject Mode = Output property
Lookup stage: If Not Found property
Transformer stage: constraint with the Other/log option
Stage Variables
Show/Hide button
Transforming Data
Derivations
Using expressions
Using functions
Date/time
Transformer Stage Issues
Sorting is sometimes required before the Transformer
stage, e.g., when using a stage variable as an accumulator
that must break on a change of column value
Checking for nulls
Constraint rejects
A row goes to the reject link when all constraint expressions
are false and 'reject row' is checked
Sorting Data
Important because
Some stages require sorted input
Some stages may run faster, e.g., the Aggregator
Can be performed
As an option within stages (use the input > partitioning tab
and set partitioning to anything other than Auto)
As a separate Sort stage (for more complex sorts)
Sorting Alternatives
Sort Stage
Removing Duplicates
Can be done by Sort stage
Use unique option
OR
Remove Duplicates stage
Has more sophisticated ways to remove duplicates
Combining Data
Joins
Lookup
Merge
Input links: Join uses Left and Right; Lookup uses a Source and LU Table(s); Merge uses a Master and Update(s)
Tip:
Check the 'Input Ordering' tab to make sure the intended primary is listed
first
Join types: Inner, Left Outer, Right Outer, Full Outer
Lookup
[Diagram: a Source link and one or more lookup tables (LUTs) feed the Lookup stage, which has an Output link and a Reject link]
No pre-sort necessary
Allows multiple-key LUTs
Flexible exception handling for source input rows with no match
RDBMS Lookup
NORMAL
Loads the lookup table into an in-memory hash
table first
SPARSE
Issues a select for each source row.
Might become a performance
bottleneck.
Combines
one sorted, duplicate-free master (primary) link with
one or more sorted update (secondary) links.
Pre-sort makes merge "lightweight": few rows need to be in RAM (as with
joins, but opposite to lookup).
Follows the Master-Update model:
A master row and one or more update rows are merged if they have the
same value in the user-specified key column(s).
If a non-key column occurs in several inputs, the lowest input port number
prevails (e.g., master over update; update values are ignored)
Unmatched ("Bad") master rows can be either
kept
dropped
Unmatched ("Bad") update rows in input link can be captured in a "reject"
link
Matched update rows are consumed.
[Diagram: one Master link and one or more Update links feed the Merge stage, which produces an Output link and Reject link(s)]
Lightweight
Space/time tradeoff: presorts vs. an in-RAM table
Joins / Lookup / Merge comparison:

                               Joins                    Lookup                              Merge
Model                          RDBMS-style relational   Source + in-RAM LU Table(s)         Master - Update(s)
Memory usage                   light                    heavy                               light
Number and names of inputs     left, right              1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort           all inputs               no                                  all inputs
Duplicates in primary input    OK                       OK                                  Warning!
Duplicates in secondary input  OK                       Warning!                            OK only when N = 1
Unmatched primary entries      NONE                     [fail] | continue | drop | reject   [keep] | drop
Unmatched secondary entries    NONE                     NONE                                capture in reject set(s)
On match, secondary entries    captured                 reusable                            consumed
# Outputs                      1                        1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)      Nothing (N/A)            unmatched primary entries           unmatched secondary entries
Grouping Methods
Hash: results for each aggregation group are stored in a hash table,
and the table is written out after all input has been processed
Doesn't require sorted data
Good when the number of unique groups is small. The running tally for
each group's aggregate calculations needs to fit easily into
memory. Requires about 1 KB of RAM per group.
Example: average family income by state requires about 0.05 MB of RAM
(50 groups x 1 KB)
Sort: results for only a single aggregation group are kept in memory;
when a new group is seen (the key value changes), the current group is
written out.
Requires input sorted by the grouping keys
Can handle an unlimited number of groups
Example: average daily balance by credit card
Aggregator Functions
Sum
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation
Aggregator Properties
Aggregation types
Containers
Two varieties
Local
Shared
Local
Simplifies a large, complex diagram
Shared
Creates reusable object that many jobs can include
Creating a Container
Create a job
Select (drag a loop around) the portion to containerize
Edit > Construct container > local or shared
Optimizing Parallelism
Degree of parallelism determined by number of nodes
defined
Parallelism should be optimized, not maximized
Increasing parallelism distributes work load but also
increases Framework overhead
Hardware influences degree of parallelism possible
System hardware partially determines configuration
Configuration File
Text file containing string data that is passed to the
Framework
Sits on the server side
Can be displayed and edited
Name and location are given by the environment variable
APT_CONFIG_FILE
Components
Node
Fast name
Pools
Resource
Node Options
Disk Pools
pool "bigdata"
Sorting Requirements
Resource pools can also be specified for sorting:
The Sort stage looks first for scratch disk resources in a
sort pool, and then in the default disk pool
Resource Types
Disk
Scratchdisk
DB2
Oracle
Saswork
Sortwork
Resources can exist in a pool, which groups resources together
Wrappers (Contd)
LS Example
Creating a Wrapper
Name of stage
Interfaces: input and output columns; these should first be entered into the
Table Definitions meta data (DataStage Manager); let's do that now.
Interface schemas
Layout interfaces describe what columns the stage:
Needs for its inputs (if any)
Creates for its outputs (if any)
Should be created as tables with columns in Manager
[Diagram: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema]
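For the ls example, the wrapped command emits one file name per line on stdout, so a single-column output interface schema suffices (column name illustrative):

record (filename: string;)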
Resulting Job
Wrapped stage
Job Run
Show file from Designer palette
Buildops
Buildop provides a simple means of extending beyond the functionality
provided by EE, but does not use an existing executable (unlike the wrapper)
Reasons to use Buildop include:
Speed / Performance
Complex business logic that cannot be easily represented
using existing stages
Lookups across a range of values
Surrogate key generation
Rolling aggregates
Build once and reusable everywhere within project, no
shared container necessary
Can combine functionality from different stages into one
BuildOps
The DataStage programmer encapsulates the business
logic
The Enterprise Edition interface called buildop
automatically performs the tedious, error-prone tasks:
invoking the needed header files and building the necessary
plumbing for correct and efficient parallel execution.
Exploits extensibility of EE Framework
"Build" stages
from within Enterprise Edition
"Wrapping existing Unix
executables
General Page
Identical to Wrappers, except:
First line: output 0
Optional renaming of the output port from the default "out0"
Write row
Input page: 'Auto Read'
Read next row
In-Repository Table Definition
'False' setting, so as not to interfere with the Transfer page
First line: Transfer of index 0
Example - sumNoTransfer
Add input columns "a" and "b"; ignore other columns
that might be present in the input
Produce a new "sum" column
Do not transfer the input columns
[Diagram: input schema a:int32; b:int32 → sumNoTransfer stage → output schema sum:int32 (no transfer)]
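The per-record logic for this buildop is a single line on the stage's code page; with Auto Read and Auto Write, the Framework loops it over every row (a sketch, assuming the interface columns above):

// runs once per input row; a, b, and sum are interface columns
sum = a + b;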
From Peek:
NO TRANSFER
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
Transfer
TRANSFER
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
Columns (DS-EE type)
Defined in Table Definitions
Value refreshed from row to row
Variables (C/C++ type)
Need declaration (in the Definitions or Pre-Loop page)
Value persistent throughout the 'loop' over rows, unless modified in code
Custom Stage
Reasons for a custom stage:
Add EE operator not already in DataStage EE
Build your own Operator and add to DataStage EE
Use EE API
Use Custom Stage to add new operator to EE canvas
Custom Stage
DataStage Manager > select Stage Types branch > right click
Custom Stage
Number of input and output links allowed
Name of the Orchestrate operator to be used
The Result
Data definitions
Recordization and columnization
Fields have properties that can be set at the individual field level
Data types in the GUI are translated to types used by EE
Described as properties on the format/columns tab (Outputs or
Inputs pages), OR
Using a schema file (can be full or partial)
Schemas
Can be imported into Manager
Can be pointed to by some job stages (e.g., Sequential)
Column Overrides
Edit a row from within the columns tab
Set individual column properties
Field and string settings
Editing Columns
Properties depend on the data type
Schema
Alternative way to specify column definitions for data
used in EE jobs
Written in a plain text file
Can be written as a partial record definition
Can be imported into the DataStage repository
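A minimal sketch of a schema file, using the same record syntax shown earlier for data set descriptors (column names illustrative):

record (
  custid: int32;
  name: string[max=30];
  balance: decimal[10,2];
)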
Creating a Schema
Using a text editor
Follow correct syntax for definitions
OR
Import from an existing data set or file set
In DataStage Manager: Import > Table Definitions >
Orchestrate Schema Definitions
Select the checkbox for a file with a .fs or .ds extension
Importing a Schema
Data Types
Date
Decimal
Floating point
Integer
String
Time
Timestamp
Vector
Subrecord
Raw
Tagged
Job Sequencer
Example
Job Activity stage contains conditional triggers
Job to be executed: select from the dropdown
Job parameters to be passed
Options
Use the custom option for conditionals
Execute if the job ran successfully or with warnings only
Can add a wait-for-file activity before executing
Add an execute command stage to drop the real tables and
rename the new tables to the current tables
Different links having different triggers
Sequencer Stage
Build a job sequencer to control jobs for the collections
application
Can be set to all or any
Notification Stage
Notification
Notification Activity
Environment Variables
The Director
Typical Job Log Messages:
Environment variables
Configuration File information
Framework Info/Warning/Error messages
Output from the Peek Stage
Additional info with "Reporting" environment variables
Tracing/Debug output
Must compile job in trace mode
Adds overhead
Director will prompt you before each run
Troubleshooting
If you get an error during compile, check the following:
Compilation problems
If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
If Buildop errors occur, try running buildop from the command line
Some stages may not support RCP; this can cause column mismatches
Use the Show Error and More buttons
Examine the generated OSH
Check environment variable settings
Very little integrity checking happens during compile; run Validate from the Director