
IBM Information Server

InfoSphere Information Server provides a


single unified platform that enables companies
to understand, cleanse, transform, and deliver
trustworthy and context-rich information.

1-1
Copyright Sennovate 2010. All rights reserved.

Ver.1.0

Products in IBM Information Server

IBM InfoSphere DataStage


IBM InfoSphere QualityStage
IBM InfoSphere Information Services Director
IBM InfoSphere Information Analyzer
IBM Information Server FastTrack
IBM InfoSphere Business Glossary
Other Companion Products
IBM InfoSphere Federation Server
Rational Data Architect
InfoSphere Replication Server
Event Publisher

IBM Information Server architecture


IBM Information Server architecture is a
client-server architecture made up of client-based
design, administration and operation tools that
access a set of server-based data integration
capabilities through a common services layer.


IBM Information Server architecture


Client tier
The client tier includes the following:
IBM InfoSphere DataStage and QualityStage clients
Administrator
Director
Designer


Server tier
The Server tier includes
Services
Engine
Repository
Working areas
Information Services Director resource providers


Services tier
Three general categories of Services
Design
Execution
Metadata


Repository tier
The Shared Repository is used to share all the IBM
Information Server product module objects.
The common repository contains the following types of
metadata that are required to support InfoSphere
DataStage:
Project metadata
Operational metadata
Design metadata


Engine tier
This is a parallel engine that executes IBM Information
Server tasks.


Working areas
These are the temporary storage areas used by the
components.


Information Service providers


Information service providers are the sources of
operations for services such as DataStage and
QualityStage.


Topologies
IBM InfoSphere Information Server supports multiple
topologies to meet a variety of data integration,
hardware, and business requirements.
Consider performance needs when selecting a
topology.
The supported topologies are as follows
Two-tier
Three-tier
Cluster
Grid


Topologies
Two-tier
The engine, application server, and metadata
repository are all on the same computer,
while the clients are on different machines.
Three-tier
The engine is on one machine; the application
server and metadata repository are co-located on
another machine.
Clients are on a third machine.


Topologies
Cluster
This is a slight variation of a three tier topology.
The engine is duplicated over multiple computers.
In a cluster environment, a single parallel job execution
can span multiple computers, each with its own engine.
The processing of a job on multiple machines is driven
by a configuration file associated with the job.


Topologies
Grid topology
Grid computing allows you to specify more processing
power.
This is similar to a cluster, but the machines on which a
job executes are determined dynamically through the
generation of a dynamic configuration file.


Two-tier


Three tier


Clusters and Grids


IBM Infosphere DataStage


A popular ETL tool.
Enables organizations to design data flows that extract
information from multiple source systems, transform it
to make it more valuable, and then deliver it to one or
more target databases.
IBM InfoSphere DataStage is part of the IBM Information
Server architecture.
It has
Shared components
Runtime architecture


DataStage architecture


Shared Components of DataStage


Common User Interface
Designer
Director
Administrator
Common services
Common repository
Common parallel processing engine
Common connectors


Runtime architecture
Jobs are created using the Designer.
The jobs are compiled into parallel job flows and
reusable components that execute on the parallel
Information Server engine.
The Designer generates an OSH (Orchestrate Shell)
script.
OSH script
Uses the familiar syntax of a Unix shell.


Four core capabilities of DataStage


Connectivity to a wide range of mainframe, legacy, and
enterprise application databases, file formats, and
external information sources.
A prebuilt library of more than 300 functions, including
data validation rules and complex transformations.
Maximum throughput using a parallel, high-performance
processing architecture.
Provides development, deployment, and maintenance
features. It leverages metadata for analysis and
maintenance.


InfoSphere DataStage elements


The central DataStage elements are
Projects
Created with the Administrator.
Each project contains other components such as jobs,
stages, links, containers, and table definitions.
Jobs
A job defines a sequence of steps (stages).
Stages
Components of a job.
Links
Data flow from one stage to another.
Containers
Groups of stages.
Table definitions

A Job


Example of a job


Types of Jobs
Parallel Jobs
Server Jobs
Job Sequences


Parallel Job
Executed by the DataStage parallel engine.
Built-in functionality for pipeline and partition
parallelism.
Compiled into an OSH (Orchestrate Shell) script.
OSH executes operators, which are executable C++ class
instances.
Runtime monitoring in DataStage Director.


Server Jobs

Executed by the DataStage server engine


Compiled into BASIC
Runtime monitoring in DataStage Director


Job Sequences
Master jobs that kick off server or parallel jobs
and other activities.
Runtime monitoring in DataStage Director
Executed by the Server engine


Stages
Active stage
Active stages model the flow of data and provide
mechanisms for combining data streams,
aggregating data, and converting data from one
data type to another
Alters the number of rows from source to target.
Passive Stage
A passive stage handles access to databases for
the extraction or writing of data.
Does not alter the number of rows from source
to target.


Parallel processing
Parallel processing is the use of multiple processors to
execute the different parts of the same program
simultaneously.


Representation of a job without parallelism


Two types of parallel processing

Pipeline
Partitioning
These can be combined (partitioning and pipelining)


Pipeline Parallelism
Transform, clean, and load processes execute
simultaneously
Like a conveyor belt moving rows from process to
process
Starts a downstream process while the upstream process
is still running
Advantages
Reduces disk usage for staging areas
Keeps processors busy
Still has limits on scalability
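The conveyor-belt idea can be pictured with Python generators: each stage pulls rows from the previous one as they are produced, so downstream work starts before upstream work finishes. This is only a minimal sketch of the concept (the stage names are made up here); DataStage actually runs stages as concurrent operating-system processes.

```python
# Illustrative sketch of pipeline parallelism using Python generators.
# Each "stage" consumes rows from the previous stage as soon as they are
# produced -- no staging area is written to disk between stages.

def extract(rows):
    for row in rows:              # upstream: produce one row at a time
        yield row

def transform(rows):
    for row in rows:              # downstream starts before upstream ends
        yield {**row, "salary": row["salary"] * 2}

def load(rows):
    return [row for row in rows]  # final stage materializes the stream

source = [{"id": 1, "salary": 100}, {"id": 2, "salary": 200}]
result = load(transform(extract(source)))
```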


Pipeline Parallelism


Partition Parallelism
Divides the incoming stream of data into subsets to be
separately processed by an operation.
Subsets are called partitions (nodes)
This is key to scalability
Each partition of data is processed by the same
operation
E.g., if the operation is Filter, each partition will be
filtered in exactly the same way
Facilitates near-linear scalability
8 times faster on 8 processors
24 times faster on 24 processors
This assumes the data is evenly distributed
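The idea that every partition runs the same operation can be sketched as follows. This is a conceptual model only, under the assumption of a simple contiguous split; in DataStage each partition would execute in parallel on its own processing node rather than sequentially.

```python
# Sketch of partition parallelism: split the data into partitions, then run
# the SAME operation (here a filter) on every partition independently.
import math

def partition(rows, n):
    """Split rows into n contiguous partitions (one per processing node)."""
    size = math.ceil(len(rows) / n)
    return [rows[i * size:(i + 1) * size] for i in range(n)]

def run_on_each_partition(parts, operation):
    # Conceptually parallel; applied sequentially in this sketch.
    return [operation(p) for p in parts]

rows = list(range(10))
parts = partition(rows, 4)
evens = run_on_each_partition(parts, lambda p: [r for r in p if r % 2 == 0])
```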

Partitioned Parallelism


Three-Node Partitioning


Parallel Jobs Combine Partitioning and Pipelining


Parallel processing environments


SMP (symmetric multiprocessing)
Some hardware resources are shared among
processors.
The processors communicate via shared memory.
There is a single operating system.
Cluster or MPP (massively parallel processing)
Each processor has exclusive access to hardware
resources.
In a cluster, the systems are physically dispersed.
In MPP, all are in the same physical system.


Configuration file
DataStage gets the information about the system from
the configuration file.
Resources needed for the job are organized based on
the configuration file.
The configuration file describes every processing
node.
When the system changes, change the file, not the job.
Configuration file provides the hardware configuration.
The path of the configuration file is identified in the
DataStage Administrator.
The environment variable APT_CONFIG_FILE contains
the path of the configuration file

Information in configuration file


Nodes
It identifies the number of nodes in the parallel
processing.
Resource disk
Data files are stored here
Resource scratch disk
Specifies a path that is used by parallel jobs for
buffering (temporary data).


Sample Configuration file


{
node "dev1"
{
fastname "etltools-dev"
pool ""
resource disk "/data/etltools-tutorial/d1" { }
resource disk "/data/etltools-tutorial/d2" { }
resource scratchdisk "/data/etltools-tutorial/temp" { }
}
node "dev2"
{
fastname "etltools-dev"
pool ""
resource disk "/data/etltools-tutorial/d1" { }
resource scratchdisk "/data/etltools-tutorial/temp" { }
}
}


Partitioning and Collecting

Partitioning breaks incoming rows into multiple streams of


rows (one for each node)
Each partition of rows is processed separately by the
stage/operator
Collecting returns partitioned data back to a single stream
Partitioning / Collecting is specified on stage input links


Partitioning methods

Round Robin Partitioner


Random Partitioner
Same partitioner
Entire Partitioning
Hash partitioner
Modulus partitioner
Range partitioner
DB2 Partitioner
Auto Partitioner


Round Robin Partitioner


The first record goes to the first partitioning node, the
second to the second, and so on.
When DataStage reaches the last node, it starts over
again.
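The wrap-around behavior can be sketched in a few lines. This is only an illustration of the distribution rule, not DataStage's implementation.

```python
# Minimal sketch of the Round Robin partitioner: record 1 goes to node 1,
# record 2 to node 2, ..., then back to node 1. Produces evenly sized
# partitions regardless of the data values.

def round_robin_partition(records, num_nodes):
    partitions = [[] for _ in range(num_nodes)]
    for i, rec in enumerate(records):
        partitions[i % num_nodes].append(rec)  # wrap around after last node
    return partitions

parts = round_robin_partition(["a", "b", "c", "d", "e"], 3)
# parts -> [['a', 'd'], ['b', 'e'], ['c']]
```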


Random Partitioner
Records are randomly distributed over all partitioning
nodes.
Like round robin, random partitioning can rebalance
the partitions of an input data set to guarantee that
each processing node receives an approximately
equal-sized partition.
The random partitioning has a slightly higher
overhead than round robin because of the extra
processing required to calculate a random value for
each record.


Same partitioner
The stage using the data set as input performs no
repartitioning and takes as input the partitions output
by the preceding stage.
With this partitioning method, records stay on the
same processing node; that is, they are not
redistributed.
Same is the fastest partitioning method.
This is normally the method DataStage uses when
passing data between stages in your job.


Entire Partitioning
Every instance of a stage on every processing node
receives the complete data set as input.
It is useful when you want the benefits of parallel
execution, but every instance of the operator needs
access to the entire input data set.


Hash partitioner
Set based on a zip code field, where a large percentage of your
records Partitioning is based on a function of one or more columns
(the hash partitioning keys) in each record. The hash partitioner
examines one or more fields of each input record (the hash key
fields).
Records with the same values for all hash key fields are assigned to
the same processing node.
This method is useful for ensuring that related records are in the
same partition, which might be a prerequisite for a processing
operation.
Hash partitioning does not necessarily result in an even distribution
of data between partitions.
For example, if you hash partition a data are from one or two zip
codes, you can end up with a few partitions containing most of your
records. This behavior can lead to bottlenecks because some nodes
are required to process more records than other nodes.
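A sketch of the "same key, same node" guarantee, using Python's built-in `hash` as a stand-in for DataStage's actual hash function (which differs):

```python
# Hash partitioning sketch: records with the same key value always land in
# the same partition; how even the spread is depends on the key values.

def hash_partition(records, key, num_nodes):
    partitions = [[] for _ in range(num_nodes)]
    for rec in records:
        node = hash(rec[key]) % num_nodes   # same key -> same node, always
        partitions[node].append(rec)
    return partitions

records = [{"zip": "10001"}, {"zip": "94105"}, {"zip": "10001"}]
parts = hash_partition(records, "zip", 4)
```

Both "10001" records are guaranteed to be in the same partition, whichever partition that turns out to be.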

Modulus partitioner
Partitioning is based on a key column modulo the
number of partitions. This method is similar to hash by
field, but involves simpler computation.
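The simpler computation is just the key value modulo the partition count, as in this sketch:

```python
# Modulus partitioning sketch: partition = integer key value % number of
# partitions. Cheaper than a general hash, but needs a single integer key.

def modulus_partition(records, key, num_nodes):
    partitions = [[] for _ in range(num_nodes)]
    for rec in records:
        partitions[rec[key] % num_nodes].append(rec)
    return partitions

parts = modulus_partition(
    [{"empno": 7}, {"empno": 8}, {"empno": 11}], "empno", 4)
# empno 7 -> partition 3, empno 8 -> partition 0, empno 11 -> partition 3
```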


Range partitioner
Divides a data set into approximately equal-sized
partitions, each of which contains records with key
columns within a specified range. This method is also
useful for ensuring that related records are in the
same partition.
A range partitioner divides a data set into
approximately equal size partitions based on one or
more partitioning keys. Range partitioning is often a
preprocessing step to performing a total sort on a data
set.
In order to use a range partitioner, you have to make a
range map. You can do this using the Write Range Map
stage.
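The mechanics can be sketched as follows: boundary values are picked from a sorted sample (the role the Write Range Map stage plays), then each record is routed to the partition whose key range contains it. The boundary-picking rule here is a simplification.

```python
# Range partitioning sketch: build a "range map" of key boundaries from a
# sorted sample, then route each record by binary search on the boundaries.
import bisect

def make_range_map(sorted_sample, num_nodes):
    """Pick num_nodes - 1 boundary values that split the sample evenly."""
    step = len(sorted_sample) // num_nodes
    return [sorted_sample[step * i] for i in range(1, num_nodes)]

def range_partition(records, key, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        partitions[bisect.bisect_right(boundaries, rec[key])].append(rec)
    return partitions

sample = sorted([5, 10, 20, 30, 40, 50])
bounds = make_range_map(sample, 3)                       # -> [20, 40]
parts = range_partition([{"k": 7}, {"k": 25}, {"k": 45}], "k", bounds)
```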

DB2 Partitioner
Partitions an input data set in the same way that DB2
would partition it.
For example, if you use this method to partition an
input data set containing update information for an
existing DB2 table, records are assigned to the
processing node containing the corresponding DB2
record. Then, during the execution of the parallel
operator, both the input record and the DB2 table
record are local to the processing node. Any reads and
writes of the DB2 table would entail no network
activity.


Auto Partitioner
Leaves it to DataStage to determine the best
partitioning method to use, depending on the type of
stage and what the previous stage in the job has
done.
Typically DataStage uses round robin when
initially partitioning data, and Same for the
intermediate stages of a job.


Collecting
Collecting is the process of joining multiple partitions
into a single data set.
Collecting methods
Round robin
Ordered collector
Sort merge collector
Auto collector


Round robin
Reads a record from the first input partition, then from
the second partition, and so on. After reaching the last
partition, starts over.
After reaching the final record in any partition, it skips
that partition in the remaining rounds.
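The skip-when-exhausted rule can be sketched like this (an idealized model; a real collector reads whatever is ready):

```python
# Round robin collector sketch: read one record from each partition in turn,
# skipping partitions that have run out of records.

def round_robin_collect(partitions):
    queues = [list(p) for p in partitions]
    out, i = [], 0
    while any(queues):
        if queues[i]:                     # skip exhausted partitions
            out.append(queues[i].pop(0))
        i = (i + 1) % len(queues)
    return out

collected = round_robin_collect([[1, 4], [2], [3, 5, 6]])
# -> [1, 2, 3, 4, 5, 6]
```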


Ordered collector
Reads all records from the first partition, then all
records from the second partition, and so on.
This collection method preserves the order of totally
sorted input data sets. In a totally sorted data set,
both the records in each partition and the partitions
themselves are ordered.
This might be useful as a preprocessing action before
exporting a sorted data set to a single data file.


Sort merge collector


Produces a globally sorted sequential stream from
rows that are sorted within each partition.
Sort Merge produces a sorted sequential stream
(non-deterministic on un-keyed columns) using the
following algorithm:
always pick the partition that produces the row with
the smallest key value.
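That "always pick the smallest key" algorithm is exactly what Python's `heapq.merge` implements, which makes for a compact sketch:

```python
# Sort Merge collector sketch: each input partition is already sorted on the
# collecting key; repeatedly emit the smallest head-of-partition value to
# produce one globally sorted stream.
import heapq

partitions = [
    [1, 4, 9],     # each partition sorted on the collecting key
    [2, 3, 10],
    [5, 6, 7],
]
collected = list(heapq.merge(*partitions))   # globally sorted output
```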


Auto collector
The default algorithm reads rows from a partition as
soon as they are ready.
This may lead to producing different row orders in
different runs with identical data. The execution is
non-deterministic.


Administrator
Administrator is a client program used to carry out
configuration tasks in DataStage.
It has three pages
General
The general page is used to set server-wide
properties.
Project
This lists the projects available and options to
add, edit and delete projects.
NLS
National Language support features.


Attaching to DataStage


Administrator


Project Page
Add - Adds a new DataStage project.
Delete - Deletes a project. This button is enabled only
if you have administrator status.
Properties - Sets the properties of the selected
project.
Cleanup - Cleans up files in the selected project.
NLS - Changes project maps and locales.
Command - Executes DataStage engine commands
directly against the selected project.


Project page


Add Project


Creating a project


Project Properties Pages

General
Permissions
Tracing
Schedule
Mainframe
Tunables
Parallel
Sequence


Project Properties General tab


General tab in Project properties


Enable job administration from the DataStage Director
Lets DataStage operators release the resources of a job that
has aborted or hung, and so return the job to a state in which
it can be rerun when the cause of the problem has been fixed.

Enable runtime column propagation for parallel jobs.


Stages in parallel job can handle undefined columns.

Define a project-wide setting for auto-purge of the job log.


This feature prevents the job log file from becoming
too large.

Set up environment variables.


Set values for environment variables.
Create new environment variables, which can be used
like parameters.


Permissions tab


Permissions tab
Assign user categories to operating system user groups, or enable
operators to view all the details of an event in a job log file.
The Permissions tab is enabled only if you have logged on to
DataStage using a name that gives you administrator status.


Tracing tab


Tracing tab
This is to enable or disable tracing on the server.


Schedule


Schedule tab
Set up a user name and password to use for running
scheduled DataStage jobs.
The Schedule tab is enabled only if you have logged
on to a Windows NT server.


Mainframe job properties


Tunables tab - Configure cache settings


Parallel tab


Sequence tab


Importing and Exporting Objects


From the Director, choose Tools >> Run Manager.
Choose Export >> Components.


Designer
A graphical user interface for creating DataStage
applications known as Jobs.
It is a design interface for both Infosphere DataStage
and Infosphere QualityStage
Jobs
A job defines a sequence of steps.
After designing, jobs are compiled and run on the
parallel processing engine.


Stages
The individual steps that make up the job are called
stages.
Some of the DataStage Prebuilt stages are sort,
merge, join, filter, transform, lookup and aggregate.
Stages provide 80 to 90 percent of the application
logic required for enterprise data integration
applications.
Each stage has properties that specify how it
processes data.


Some of the Stages in DataStage


Common List of stages in DataStage

Aggregator Stage
Complex Flat file Stage
Column Export Stage
Data Set Stage
Distributed Transaction
FTP Enterprise
Funnel
Join
Lookup
Merge
Sequential file Stage

Stages list (contd.)

Sort Stage
Surrogate Key generator
Transformer
Remove Duplicate stage


Steps in creating a job

Open designer and connect to the project


Choose the type of job to be created.
Import table definition
Drag and drop the stages
Link the stages
Set the properties of the stage
Save and compile the job
Execute the job by choosing Tools>>Run Director
Example
A simple job that groups department-wise and sums
salary from a flat file

Connect to the Project


Choose the type of Job


Parallel Job Canvas

(Screenshot callouts: Repository objects, Canvas, Stages palette)


Import Sequential file definition


Sequential file
definition


Choose Directory and file to import


Import
option


Define Columns and format


Stages and Links


Sequential file - Source


Format for Sequential file

Other properties


Columns tab - Load

Load the
columns


Select the columns needed


Columns loaded


Sequential file Target Properties


Aggregate Stage Properties


Select group by column
List of columns


Choose Output columns


Input page Aggregator Stage


Output page


Job


Save job


Compile the job


Run Director


Status View


Run the job


Annotation Stage
This stage is used to insert notes into the diagram
window.
Two types of Annotation
Annotation
Description Annotation


Stages in combining data


Combining Data Based on a Key Column
Lookup Stage
Merge Stage
Join Stage


Lookup Stage
It is used to perform lookup operations
The Lookup stage can have a
Single input link
Single output link
Optional reject link
Any number of reference links
Lookup combines records based on the key column.


Lookup Stage


Lookup Stage

Link the key
Specify key


Lookup

Choose to handle rejects

Lookup Stage


Join Stage
It performs a join operation on two or more inputs to
the stage.
This is similar to an SQL join.
It provides
Inner
Full Outer
Left Outer
Right Outer


Join Stage


Join Stage

Choose the key for join

Join Stage

Choose the join type

Join Stage


Merge Stage
The Merge stage is a processing stage.
It can have
More than one input link
A single output link
The same number of reject links as update links


Merge Stage


Merge Stage

Choose the merge key

Merge

Keep or drop

Merge Stage


Comparison

Stream input
Merge: 2 to N
Join: 2 to N
Lookup: 1
Reference input
Merge: NA
Join: NA
Lookup: 1 to N
Output
Merge: merged data (master-update type)
Join: SQL-type joined data
Lookup: if no duplicates are expected in the lookup data,
one row for every input stream record; else, if one
reference stream provides legitimate duplicates, multiple
rows for those records
Sorting requirements
Merge: all inputs
Join: all inputs
Lookup: stream input only
Duplicates
Merge: not allowed, except in the last update link
Join: allowed
Lookup: allowed in the stream input; up to 1 reference
link can handle duplicates, in the others a single (first)
value is returned
Partitioning
Merge: merge key
Join: join key
Lookup: usually set to Entire for the lookup data
Unmatched rows
Merge: master - drop/keep, warning/no warning;
update - drop/reject
Join: depends on join type (NULL values on outer join)
Lookup: unmatched stream - reject/keep
Memory
Merge: very few rows in memory, as data is sorted and
no duplicates are expected
Join: few rows, as data is sorted; higher (sequential,
optimized) I/O for high-speed sort on input and
reference data sets
Lookup: lookup data held in memory - may page for
large volumes; not suitable for large reference data;
when looking up against a database, the DB stage can
be set to provide sparse lookup support
Use when
Merge: larger sorted data
Join: large data
Lookup: small reference data lookup


Funnel Stage
It combines multiple inputs into a single output.
The stage can have any number of input links but a
single output link.
The metadata of all the inputs must be identical.
Funnel Stage Operates in 3 modes
Continuous funnel
Sort funnel
Sequence funnel


Funnel Stage


Funnel Stage
Choose the
funnel type


Funnel stage


Types of funnel
Continuous Funnel
Continuous funnel combines records of the input data in no
guaranteed order.
It takes one record from each input link in turn.
If data is not available on an input link, the stage skips to
the next link rather than waiting.
Sort Funnel
Sort Funnel combines the input records in the order defined
by the value(s) of one or more key columns, and the order
of the output records is determined by these sorting keys.
Sequence Funnel
Sequence copies all records from the first input data set to
the output data set, then all the records from the second
input data set, and so on.
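The three modes can be sketched on two inputs with identical metadata. The continuous funnel is shown as a deterministic interleave for illustration only; in reality its order depends on which link has data ready.

```python
# Sketch of the three funnel modes on two inputs with identical metadata.
import itertools

a = [{"id": 1}, {"id": 3}]
b = [{"id": 2}, {"id": 4}]

# Sequence funnel: all of input 1, then all of input 2.
sequence = list(itertools.chain(a, b))

# Sort funnel: output ordered by the key column(s).
sort_funnel = sorted(a + b, key=lambda r: r["id"])

# Continuous funnel: one record from each link in turn (idealized --
# the real stage takes whichever record is available, in no set order).
continuous = [r for pair in itertools.zip_longest(a, b) for r in pair if r]
```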


Development and Debug Stages

Head Stage
Tail Stage
Peek Stage
Column Generator Stage
Row Generator Stage
Write Range Map Stage


Head Stage
It can have a single input link and a single output link.
Selects the first N rows from each partition of an input
data set and copies the selected rows to the output
data set.
This is used to debug large data sets.
Property settings include the following
Number of records to copy
Partition from which records are copied
Location
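Note that N applies per partition, not to the data set as a whole, as this sketch shows:

```python
# Head stage sketch: copy the first N rows from EACH partition (not the
# first N rows overall) to the output data set.

def head(partitions, n):
    return [part[:n] for part in partitions]

parts = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
out = head(parts, 2)    # first 2 rows of every partition
# -> [[1, 2], [5, 6], [7, 8]]
```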


Stage Page Head Stage


General
General Properties can be provided here
Properties
Properties include the number of rows per
partition, all rows, or rows to skip
Advanced
Different execution modes and combinability
mode


Input Page Head Stage

General
Partitioning
Column
Advanced


Output Page Head Stage

General
Mapping
Column
Advanced


Tail Stage
It can have a single input link and a single output link.
It selects the last N records from each partition and
copies them to the output data set.


Peek Stage
It has a single input link and any number of output links.
It lets you print the record column values either to the
job log or to a separate output link as it copies records
from input to output.
It is helpful for monitoring the progress of an
application or diagnosing a bug in the application.


Sample Stage
It has a single input link and any number of output links.
Samples an input data set.
Percent mode - extracts rows by selecting them by
means of a random number generator and writes a
percentage of them to the output data set.
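Percent-mode selection can be sketched as each row independently passing with the given probability. The seed parameter here is an illustrative addition for repeatability, not a claim about the stage's options.

```python
# Sample stage (percent mode) sketch: each row passes with probability
# percent/100, driven by a random number generator.
import random

def percent_sample(rows, percent, seed=42):
    rng = random.Random(seed)          # seeded here only for repeatability
    return [r for r in rows if rng.random() * 100 < percent]

sampled = percent_sample(range(1000), 10.0)   # roughly 100 of 1000 rows
```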


Column Generator Stage


It can have a single input link and a single output link.
Column Generator adds columns to the incoming data
and generates mock data for these columns for each
row processed.
The new data set is the output.


Row Generator Stage


Row generator has no input link and a single output
link
Row generator produces mock data fitting the given
metadata.
It is used for testing when there is no data available
It has Stage page and Output Page


Write Range Map Stage


The Write Range Map stage takes an input data set
produced by sampling and sorting a data set and
writes it to a file in a form usable by the range
partitioning method.
A typical use for the Write Range Map stage would be
in a job which used the Sample stage to sample a data
set, the Sort stage to sort it and the Write Range Map
stage to write the resulting data set to a file.


ODBC Stages
ODBC stage is used to extract, write or aggregate
data.
Each ODBC stage can have any number of input links
or output links.
Specify the input link using the following methods
An SQL statement
A user-defined SQL query
A stored procedure


Import ODBC table definition


ODBC stage


Choose the tables to import


ODBC STAGE


ODBC stage


Output Mapping


Output mapping


OCI stage: Import plug-in metadata definition


OCI Stage


Data source name and user details

Database name


Choose the objects


Choose the tables for importing


OCI Stage and transformer stage


OCI STAGE - Properties


Choose the table


Choose the columns


Surrogate Key generator stage


Surrogate Key Generator


Sort and Filter


Sort stage Properties

Choose the key


Output Mapping


Filter Stage

Filter condition


Transformer stage


Transformer Stage

Column derivation
Constraints

Transformer Stage

Stage variables

Transformer stage


Transformer


Transformer


Transformer conditions
Scenario
A product file has pcode and product colour columns.
Products with yellow colour are moved to one file,
blue to another, and the rest to a third.
This task is done using Transformer stage
constraints.
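The routing logic of the scenario can be sketched in Python (illustrative only — in DataStage the three conditions would be written as constraint expressions on the three output links):

```python
def route_products(rows):
    # One list per output link; the final "else" plays the otherwise link.
    yellow, blue, other = [], [], []
    for row in rows:
        if row["colour"] == "yellow":
            yellow.append(row)
        elif row["colour"] == "blue":
            blue.append(row)
        else:
            other.append(row)
    return yellow, blue, other

products = [{"pcode": 1, "colour": "yellow"},
            {"pcode": 2, "colour": "blue"},
            {"pcode": 3, "colour": "red"}]
yellow, blue, other = route_products(products)
```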


Transformer stage with 3 output links


Transformer stage constraints


Change Data capture stage


Compares two data sets and records the differences
between them.
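A simplified model of the comparison in Python (keyed on a single column; the change-code labels are assumptions standing in for the stage's insert/edit/delete outputs):

```python
def change_capture(before, after, key):
    # Index both data sets by key, then record inserts, edits, and deletes.
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}
    changes = []
    for k, row in after_by_key.items():
        if k not in before_by_key:
            changes.append(dict(row, change_code="insert"))
        elif row != before_by_key[k]:
            changes.append(dict(row, change_code="edit"))
    for k, row in before_by_key.items():
        if k not in after_by_key:
            changes.append(dict(row, change_code="delete"))
    return changes

before = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
after  = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
diff = change_capture(before, after, "id")
```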


Job Properties


Job Parameters
It is possible to set up parameters for a job.
Job parameters are defined in the Job Properties
window, where a default value is provided.


Defining Parameters in the Job Properties window

Parameters option


Using Job Parameters

To use a parameter in the job, reference it as
#parametername#
Job parameters can be used directly in
expressions
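The effect of the #parametername# notation can be sketched in Python (an illustration of the substitution, not the engine's implementation):

```python
import re

def resolve(text, params):
    # Replace each #name# token with the corresponding job parameter value.
    return re.sub(r"#(\w+)#", lambda m: str(params[m.group(1)]), text)

path = resolve("#SourceDir#/input.txt", {"SourceDir": "/data/in"})
```

Here SourceDir is a hypothetical parameter name; any parameter defined on the job's Parameters page could be referenced the same way.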


Using Parameters in the Filter stage

Parameter


Containers
A container is a group of stages and links.
Containers are used to modularize job designs.
DataStage provides 2 types of containers:
Local Container
Shared Container


Types of Containers
Local Container
Created within a job and accessible only within that
job.
Shared Containers
Created and stored separately in the repository, in the same way as jobs.
There are 2 types of Shared Containers:
Server Shared Containers
Server shared containers can also be used in parallel
jobs
Parallel Shared Containers


Creating Local Containers


If a job is complex, group stages and links in a
container. To save existing stages and links in a local
container:
Select the stages
Choose Edit > Construct Container > Local
To insert an empty container:
Click Container on the tool palette
Double-click the stage, then add stages and links


Deconstructing local Container


To convert a container into a group of discrete stages and
links in the job:
Select the container stage and choose Deconstruct
from the shortcut menu


Editing local containers


Choose the container and click Edit > Properties


Shared Container
To store existing stages and links in a shared
container:
Choose the stages and links
Choose Edit > Construct Container > Shared
Parameters of the components are copied to the shared
container as container parameters
Saving it is the same as saving a job.


Job Sequences
Specifies a sequence of jobs to run.
The sequence can contain control information,
i.e., it is possible to specify different courses of action
to be taken depending on whether a job succeeds or
fails.
A job sequence can be scheduled and run using
DataStage Director.


Example of a Job sequence


Designing a Job Sequence


Designing a job sequence is similar to designing jobs:
Create the job sequence
Add activities from the tool palette
Join activities with triggers to define the control flow
Control activities
Defined in a similar way to activities; they allow
more control over sequence execution
Each activity has a set of properties and parameters
The job sequence itself also has properties and parameters
Job sequences support automatic exception handling


Restartable sequence
Job sequences are optionally restartable.
Checkpoint information enables DataStage to restart
the sequence from the point of failure.
It is possible to enable or disable checkpoints.
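Checkpoint-based restart can be sketched in Python (the activity names and the in-memory checkpoint set are assumptions; a real sequence records its checkpoints in the repository):

```python
def run_sequence(activities, checkpoints):
    # Run activities in order, skipping any already recorded as complete.
    executed = []
    for name, action in activities:
        if name in checkpoints:
            continue                 # completed in a previous run
        action()
        checkpoints.add(name)        # checkpoint recorded on success
        executed.append(name)
    return executed

log = []
activities = [("extract", lambda: log.append("e")),
              ("sort", lambda: log.append("s")),
              ("load", lambda: log.append("l"))]
checkpoints = {"extract"}            # the first run got this far
ran = run_sequence(activities, checkpoints)
```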


Creating a Job Sequence


File > New > Job Sequence


Job Sequence

Repository

Palette


Activity Stages
Job
Specifies a Server or parallel job
Routine
Specifies any routine but not transforms.
Exec Command
Specifies an operating system command to execute.
Email Notification
Specifies that an email notification is to be sent at
this point of the sequence (using SMTP)
Wait-for-file
Waits for a particular file to appear or disappear


Activity Stages
Nested conditions
Allows you to further branch the execution of a
sequence depending on a condition.
Sequencer
Allows you to synchronize the control flow of
multiple activities in a job sequence.
Start and end loop
Together these two stages allow you to implement a
For...Next or For...Each loop within your sequence
Terminator
Allows you to specify that, if certain situations occur,
the jobs a sequence is running shut down cleanly

Activity stages
User Variable
Allows you to define variables within a sequence.
These variables can then be used later on in the
sequence, for example to set job parameters.
Exception handler
It is executed if a job in the sequence fails to run
(other exceptions are handled by triggers) or if a
job aborts and the "Automatically handle activities
that fail" option is set for the sequence.
Only one exception handler is allowed per sequence.


Triggers
Triggers provide control information to the stage
activities.
They specify different courses of action to be taken
based on the job status.
Trigger names must be unique.
Types of Triggers
Conditional
Unconditional
Otherwise


Job Sequence Properties

Specifies parameters
Displays code


Scenario for job Sequence


5 input files are available in a folder, all with the same
layout
A single server job is available to sort an input file
Wait for a trigger file to start the job
Send a message to a computer after job completion
(success or failure)
Handle exceptions


For...Next loop to execute the Sort job for 5 input files

Waits for the trigger file to appear

Executes an OS command to send a message

Executes the Sort job

When any failure occurs, control is transferred here


Wait-For-File and Start and End Loop Activities

Wait-For-File Activity
Waits for a specified file to appear or
disappear
Appear option does not delete the file
after finding it

Start and End Loop Activity
Implements a For...Next or For...Each loop
The current value of the counter is stored in
stage_label.$Counter


Programming in DataStage
Programming components
Routines
Transforms
Functions
Expressions
Subroutines
Macros
Precedence rules


Routines
Routines are stored in the Routines folder by default.
The following components are classified as routines
Transform functions
Before/After Subroutines
While designing a job it is possible to specify:
Custom UniVerse functions
ActiveX functions


Executing jobs from command line


dsjob -run [-mode [NORMAL | RESET | VALIDATE]] [-param name=value]
[-warn n] [-rows n] [-wait] [-stop] [-jobstatus] [-userstatus]
[-local] [-opmetadata [TRUE | FALSE]] [-disableprjhandler]
[-disablejobhandler] [-useid] project job|job_id

For example (with a hypothetical project and job):
dsjob -run -mode NORMAL -param SourceDir=/data -jobstatus myproject SortJob


Commands
dsadmin command
DSXImportService command
SyncProject command


Performance tuning in DS

Ensure proper indexes are created.
Partition the tables wherever required.
Use multiple nodes.
Use APT_DUMP_SCORE to examine the job score.
Where possible, let the database sort with ORDER BY
rather than using a Sort stage.


Scenarios
Scenario 1
If there are 3 jobs in a sequence and job 1 fails while
running, how do we still run the other 2 jobs?
Set the trigger in the activity's Properties to Unconditional.
Scenario 2
Try a left outer join using the Lookup stage


Server Job Stages in 8.1.2

Complex Flat File stage
Folder stage
Hashed File stage
Sequential File stage
Aggregator stage
Command stage
InterProcess stage
FTP Plug-in stage
Link Collector stage
Link Partitioner stage


Server job stages

Merge Stage
Pivot Stage
Row merger Stage
Row Splitter Stage
Sort Stage
Transformer Stage


Parallel Job Stages


File Stages
Data Set Stage
Sequential File stage
File set Stage
Lookup fileset Stage
External source stage
External Target stage
Complex Flat file stage


Parallel job Stages


Processing stages
Transformer Stage
Basic Transformer Stage
Aggregator Stage
Join Stage
Lookup Stage
Merge Stage
Sort Stage
Funnel Stage
Remove Duplicate Stage


Parallel job stages

Compress Stage
Expand stage
Copy Stage
Modify Stage
Filter Stage
External filter Stage
Change capture stage
Change apply Stage
Difference Stage
Compare Stage


Parallel job Stages

Encode stage
Decode Stage
Switch Stage
FTP Enterprise stage
Generic stage
Surrogate key generator stage
Slowly Changing dimension Stage
Pivot Enterprise Stage
Checksum stage


Restructure stages

Column Import Stage
Column Export Stage
Make Subrecord stage
Split Subrecord stage
Combine record stage
Promote subrecord stage
Make Vector Stage
Split Vector Stage

