
IBM Information Server

InfoSphere Information Server provides a


single unified platform that enables companies
to understand, cleanse, transform, and deliver
trustworthy and context-rich information.

1-1
Copyright Sennovate 2010. All rights reserved.

Ver.1.0

Products in IBM Information Server

IBM InfoSphere DataStage


IBM InfoSphere QualityStage
IBM InfoSphere Information Services Director
IBM InfoSphere Information Analyzer
IBM Information Server FastTrack
IBM InfoSphere Business Glossary
Other Companion Products
IBM InfoSphere Federation Server
Rational Data Architect
InfoSphere Replication Server
Event Publisher

IBM Information Server architecture


IBM Information Server architecture is a
client-server architecture made up of client-based
design, administration and operation tools that
access a set of server-based data integration
capabilities through a common services layer.


IBM Information Server architecture


Client tier
The client tier includes the following:
IBM InfoSphere DataStage and QualityStage clients
Administrator
Director
Designer


Server tier
The Server tier includes
Services
Engine
Repository
Working areas
Information Services Director resource providers


Services tier
Three general categories of Services
Design
Execution
Metadata


Repository tier
The Shared Repository is used to share all the IBM
Information Server product module objects.
The common repository contains the following types of
metadata that are required to support InfoSphere
DataStage:
Project metadata
Operational metadata
Design metadata


Engine tier
This is a parallel engine that executes IBM Information
Server tasks.


Working areas
These are the temporary storage areas used by the
components.


Information Service providers


Information service providers are the sources of
operations for services such as DataStage and
QualityStage.


Topologies
IBM InfoSphere Information Server supports multiple
topologies to meet a variety of data integration,
hardware, and business requirements.
Consider performance needs when selecting a
topology.
The supported topologies are as follows
Two-tier
Three-tier
Cluster
Grid


Topologies
Two-tier
The engine, application server, and metadata
repository are all on the same computer,
while the clients are on different machines.
Three-tier
The engine is on one machine; the application
server and metadata repository are co-located on
another machine.
Clients are on a third machine.


Topologies
Cluster
This is a slight variation of a three tier topology.
The engine is duplicated over multiple computers.
In a cluster environment, a single parallel job execution
can span multiple computers, each with its own engine.
The processing of a job on multiple machines is driven
by a configuration file associated with the job.


Topologies
Grid topology
Grid computing allows you to specify more processing
power.
This is similar to a cluster, but the machines on which a
job executes are determined dynamically through the
generation of a dynamic configuration file.


Two-tier


Three tier


Clusters and Grids


IBM Infosphere DataStage


A popular ETL tool.
Enables organizations to design data flows that extract
information from multiple source systems, transform it
to make it more valuable, and then deliver it to one or
more target databases.
IBM InfoSphere DataStage is part of the IBM Information
Server architecture.
It has
Shared components
Runtime architecture


DataStage architecture


Shared Components of DataStage


Common User Interface
Designer
Director
Administrator
Common services
Common repository
Common parallel processing engine
Common connectors


Runtime architecture
Jobs are created using the Designer.
The jobs are compiled into parallel job flows and
reusable components that execute on the parallel
Information Server engine.
The Designer generates an OSH (Orchestrate Shell)
script.
OSH script
Uses the familiar syntax of a Unix shell.


Four core capabilities of DataStage


Connectivity to a wide range of mainframe, legacy, and
enterprise application databases, file formats, and
external information sources.
A prebuilt library of more than 300 functions, including
data validation rules and complex transformations.
Maximum throughput using a parallel, high-performance
processing architecture.
Provides development, deployment, and maintenance
features. It leverages metadata for analysis and
maintenance.


InfoSphere DataStage elements


The central DataStage elements are
Projects
Created with the Administrator.
Each project contains other components such as jobs,
stages, links, containers, and table definitions.
Jobs
A job defines a sequence of steps (stages).
Stages
Components of a job.
Links
Data flow from one stage to another.
Containers
Groups of stages.
Table definitions

A Job


Example of a job


Types of Jobs
Parallel Jobs
Server Jobs
Job Sequences


Parallel Job
Executed by the DataStage parallel engine.
Built-in functionality for pipeline and partition
parallelism.
Compiled into an OSH (Orchestrate Shell) script.
OSH executes operators, which are executable C++ class
instances.
Runtime monitoring in DataStage Director.


Server Jobs

Executed by the DataStage server engine


Compiled into BASIC
Runtime monitoring in DataStage Director


Job Sequences
Master jobs that kick off server or parallel jobs
and other activities.
Runtime monitoring in DataStage Director
Executed by the Server engine


Stages
Active stage
Active stages model the flow of data and provide
mechanisms for combining data streams,
aggregating data, and converting data from one
data type to another
Alters the number of rows from source to target.
Passive Stage
A passive stage handles access to databases for
the extraction or writing of data.
Does not alter the number of rows from source
to target.


Parallel processing
Parallel processing is the use of multiple processors to
execute the different parts of the same program
simultaneously.


Representation of a job without parallelism


Two types of parallel processing

Pipeline
Partitioning
These can be combined (partitioning and pipelining)


Pipeline Parallelism
Transform, clean, and load processes execute
simultaneously
Like a conveyor belt moving rows from process to
process
Starts a downstream process while the upstream process
is still running
Advantages
Reduces disk usage for staging areas
Keeps processors busy
Still has limits on scalability
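The conveyor-belt idea can be pictured with Python generators: each stage pulls rows from the previous one as they are produced, so downstream work starts before upstream work finishes. This is only a minimal sketch of the concept (the stage names are made up here); DataStage actually runs stages as concurrent operating-system processes.

```python
# Illustrative sketch of pipeline parallelism using Python generators.
# Each "stage" consumes rows from the previous stage as soon as they are
# produced -- no staging area is written to disk between stages.

def extract(rows):
    for row in rows:              # upstream: produce one row at a time
        yield row

def transform(rows):
    for row in rows:              # downstream starts before upstream ends
        yield {**row, "salary": row["salary"] * 2}

def load(rows):
    return [row for row in rows]  # final stage materializes the stream

source = [{"id": 1, "salary": 100}, {"id": 2, "salary": 200}]
result = load(transform(extract(source)))
```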


Pipeline Parallelism


Partition Parallelism
Divides the incoming stream of data into subsets to be
separately processed by an operation.
Subsets are called partitions (nodes)
This is key to scalability
Each partition of data is processed by the same
operation
E.g., if the operation is Filter, each partition will be
filtered in exactly the same way
Facilitates near-linear scalability
8 times faster on 8 processors
24 times faster on 24 processors
This assumes the data is evenly distributed
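The idea that every partition runs the same operation can be sketched as follows. This is a conceptual model only, under the assumption of a simple contiguous split; in DataStage each partition would execute in parallel on its own processing node rather than sequentially.

```python
# Sketch of partition parallelism: split the data into partitions, then run
# the SAME operation (here a filter) on every partition independently.
import math

def partition(rows, n):
    """Split rows into n contiguous partitions (one per processing node)."""
    size = math.ceil(len(rows) / n)
    return [rows[i * size:(i + 1) * size] for i in range(n)]

def run_on_each_partition(parts, operation):
    # Conceptually parallel; applied sequentially in this sketch.
    return [operation(p) for p in parts]

rows = list(range(10))
parts = partition(rows, 4)
evens = run_on_each_partition(parts, lambda p: [r for r in p if r % 2 == 0])
```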

Partitioned Parallelism


Three-Node Partitioning


Parallel Jobs Combine Partitioning and Pipelining


Parallel processing environments


SMP (symmetric multiprocessing)
Some hardware resources are shared among
processors.
The processors communicate via shared memory.
There is a single operating system.
Cluster or MPP (massively parallel processing)
Each processor has exclusive access to hardware
resources.
In a cluster, the systems are physically dispersed.
In MPP, all are in the same physical system.


Configuration file
DataStage gets the information about the system from
the configuration file.
Resources needed for the job are organized based on
the configuration file.
The configuration file describes every processing
node.
When the system changes, change the file, not the job.
Configuration file provides the hardware configuration.
The path of the configuration file is identified in the
DataStage Administrator.
The environment variable APT_CONFIG_FILE contains
the path of the configuration file

Information in configuration file


Nodes
It identifies the number of nodes in the parallel
processing.
Resource disk
Data files are stored here
Resource scratch disk
Specifies a path that is used by parallel jobs for
buffering (temporary data).


Sample Configuration file


{
node "dev1"
{
fastname "etltools-dev"
pool ""
resource disk "/data/etltools-tutorial/d1" { }
resource disk "/data/etltools-tutorial/d2" { }
resource scratchdisk "/data/etltools-tutorial/temp" { }
}
node "dev2"
{
fastname "etltools-dev"
pool ""
resource disk "/data/etltools-tutorial/d1" { }
resource scratchdisk "/data/etltools-tutorial/temp" { }
}
}


Partitioning and Collecting

Partitioning breaks incoming rows into multiple streams of


rows (one for each node)
Each partition of rows is processed separately by the
stage/operator
Collecting returns partitioned data back to a single stream
Partitioning / Collecting is specified on stage input links


Partitioning methods

Round Robin Partitioner


Random Partitioner
Same partitioner
Entire Partitioning
Hash partitioner
Modulus partitioner
Range partitioner
DB2 Partitioner
Auto Partitioner


Round Robin Partitioner


The first record goes to the first partitioning node, the
second to the second, and so on.
When DataStage reaches the last node, it starts over
again.
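The wrap-around behavior can be sketched in a few lines. This is only an illustration of the distribution rule, not DataStage's implementation.

```python
# Minimal sketch of the Round Robin partitioner: record 1 goes to node 1,
# record 2 to node 2, ..., then back to node 1. Produces evenly sized
# partitions regardless of the data values.

def round_robin_partition(records, num_nodes):
    partitions = [[] for _ in range(num_nodes)]
    for i, rec in enumerate(records):
        partitions[i % num_nodes].append(rec)  # wrap around after last node
    return partitions

parts = round_robin_partition(["a", "b", "c", "d", "e"], 3)
# parts -> [['a', 'd'], ['b', 'e'], ['c']]
```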


Random Partitioner
Records are randomly distributed over all partitioning
nodes.
Like round robin, random partitioning can rebalance
the partitions of an input data set to guarantee that
each processing node receives an approximately
equal-sized partition.
The random partitioning has a slightly higher
overhead than round robin because of the extra
processing required to calculate a random value for
each record.


Same partitioner
The stage using the data set as input performs no
repartitioning and takes as input the partitions output
by the preceding stage.
With this partitioning method, records stay on the
same processing node; that is, they are not
redistributed.
Same is the fastest partitioning method.
This is normally the method DataStage uses when
passing data between stages in your job.


Entire Partitioning
Every instance of a stage on every processing node
receives the complete data set as input.
It is useful when you want the benefits of parallel
execution, but every instance of the operator needs
access to the entire input data set.


Hash partitioner
Set based on a zip code field, where a large percentage of your
records Partitioning is based on a function of one or more columns
(the hash partitioning keys) in each record. The hash partitioner
examines one or more fields of each input record (the hash key
fields).
Records with the same values for all hash key fields are assigned to
the same processing node.
This method is useful for ensuring that related records are in the
same partition, which might be a prerequisite for a processing
operation.
Hash partitioning does not necessarily result in an even distribution
of data between partitions.
For example, if you hash partition a data are from one or two zip
codes, you can end up with a few partitions containing most of your
records. This behavior can lead to bottlenecks because some nodes
are required to process more records than other nodes.
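A sketch of the "same key, same node" guarantee, using Python's built-in `hash` as a stand-in for DataStage's actual hash function (which differs):

```python
# Hash partitioning sketch: records with the same key value always land in
# the same partition; how even the spread is depends on the key values.

def hash_partition(records, key, num_nodes):
    partitions = [[] for _ in range(num_nodes)]
    for rec in records:
        node = hash(rec[key]) % num_nodes   # same key -> same node, always
        partitions[node].append(rec)
    return partitions

records = [{"zip": "10001"}, {"zip": "94105"}, {"zip": "10001"}]
parts = hash_partition(records, "zip", 4)
```

Both "10001" records are guaranteed to be in the same partition, whichever partition that turns out to be.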

Modulus partitioner
Partitioning is based on a key column modulo the
number of partitions. This method is similar to hash by
field, but involves simpler computation.
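The simpler computation is just the key value modulo the partition count, as in this sketch:

```python
# Modulus partitioning sketch: partition = integer key value % number of
# partitions. Cheaper than a general hash, but needs a single integer key.

def modulus_partition(records, key, num_nodes):
    partitions = [[] for _ in range(num_nodes)]
    for rec in records:
        partitions[rec[key] % num_nodes].append(rec)
    return partitions

parts = modulus_partition(
    [{"empno": 7}, {"empno": 8}, {"empno": 11}], "empno", 4)
# empno 7 -> partition 3, empno 8 -> partition 0, empno 11 -> partition 3
```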


Range partitioner
Divides a data set into approximately equal-sized
partitions, each of which contains records with key
columns within a specified range. This method is also
useful for ensuring that related records are in the
same partition.
A range partitioner divides a data set into
approximately equal size partitions based on one or
more partitioning keys. Range partitioning is often a
preprocessing step to performing a total sort on a data
set.
In order to use a range partitioner, you have to make a
range map. You can do this using the Write Range Map
stage.
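The mechanics can be sketched as follows: boundary values are picked from a sorted sample (the role the Write Range Map stage plays), then each record is routed to the partition whose key range contains it. The boundary-picking rule here is a simplification.

```python
# Range partitioning sketch: build a "range map" of key boundaries from a
# sorted sample, then route each record by binary search on the boundaries.
import bisect

def make_range_map(sorted_sample, num_nodes):
    """Pick num_nodes - 1 boundary values that split the sample evenly."""
    step = len(sorted_sample) // num_nodes
    return [sorted_sample[step * i] for i in range(1, num_nodes)]

def range_partition(records, key, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        partitions[bisect.bisect_right(boundaries, rec[key])].append(rec)
    return partitions

sample = sorted([5, 10, 20, 30, 40, 50])
bounds = make_range_map(sample, 3)                       # -> [20, 40]
parts = range_partition([{"k": 7}, {"k": 25}, {"k": 45}], "k", bounds)
```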

DB2 Partitioner
Partitions an input data set in the same way that DB2
would partition it.
For example, if you use this method to partition an
input data set containing update information for an
existing DB2 table, records are assigned to the
processing node containing the corresponding DB2
record. Then, during the execution of the parallel
operator, both the input record and the DB2 table
record are local to the processing node. Any reads and
writes of the DB2 table would entail no network
activity.


Auto Partitioner
Leaves it to DataStage to determine the best
partitioning method to use, depending on the type of
stage and what the previous stage in the job has
done.
Typically DataStage uses round robin when
initially partitioning data, and Same for the
intermediate stages of a job.


Collecting
Collecting is the process of joining multiple partitions
into a single data set.
Collecting methods
Round robin
Ordered collector
Sort merge collector
Auto collector


Round robin
Reads a record from the first input partition, then from
the second partition, and so on. After reaching the last
partition, starts over.
After reaching the final record in any partition, it skips
that partition in the remaining rounds.
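The skip-when-exhausted rule can be sketched like this (an idealized model; a real collector reads whatever is ready):

```python
# Round robin collector sketch: read one record from each partition in turn,
# skipping partitions that have run out of records.

def round_robin_collect(partitions):
    queues = [list(p) for p in partitions]
    out, i = [], 0
    while any(queues):
        if queues[i]:                     # skip exhausted partitions
            out.append(queues[i].pop(0))
        i = (i + 1) % len(queues)
    return out

collected = round_robin_collect([[1, 4], [2], [3, 5, 6]])
# -> [1, 2, 3, 4, 5, 6]
```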


Ordered collector
Reads all records from the first partition, then all
records from the second partition, and so on.
This collection method preserves the order of totally
sorted input data sets. In a totally sorted data set,
both the records in each partition and the partitions
themselves are ordered.
This might be useful as a preprocessing action before
exporting a sorted data set to a single data file.


Sort merge collector


Produces a globally sorted sequential stream from
rows that are sorted within each partition.
Sort Merge produces a sorted sequential stream
(non-deterministic on un-keyed columns) using the
following algorithm:
always pick the partition that produces the row with
the smallest key value.
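That "always pick the smallest key" algorithm is exactly what Python's `heapq.merge` implements, which makes for a compact sketch:

```python
# Sort Merge collector sketch: each input partition is already sorted on the
# collecting key; repeatedly emit the smallest head-of-partition value to
# produce one globally sorted stream.
import heapq

partitions = [
    [1, 4, 9],     # each partition sorted on the collecting key
    [2, 3, 10],
    [5, 6, 7],
]
collected = list(heapq.merge(*partitions))   # globally sorted output
```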


Auto collector
The default algorithm reads rows from a partition as
soon as they are ready.
This may lead to producing different row orders in
different runs with identical data. The execution is
non-deterministic.


Administrator
Administrator is a client program used to carry out
configuration tasks in DataStage.
It has three pages
General
The general page is used to set server-wide
properties.
Project
This lists the projects available and options to
add, edit and delete projects.
NLS
National Language support features.


Attaching to DataStage


Administrator


Project Page
Add - Adds a new DataStage project.
Delete - Deletes a project. This button is enabled only
if you have administrator status.
Properties - Sets the properties of the selected
project.
Cleanup - Cleans up files in the selected project.
NLS - Changes project maps and locales.
Command - Executes DataStage engine commands
directly against the selected project.


Project page


Add Project


Creating a project


Project Properties Pages

General
Permissions
Tracing
Schedule
Mainframe
Tunables
Parallel
Sequence


Project Properties General tab


General tab in Project properties


Enable job administration from the DataStage Director
Lets DataStage operators release the resources of a job that
has aborted or hung, and so return the job to a state in which
it can be rerun when the cause of the problem has been fixed.

Enable runtime column propagation for parallel jobs.


Stages in parallel job can handle undefined columns.

Define a project-wide setting for auto-purge of the job log.


This feature prevents the job log file from becoming
too large.

Set up environment variables.


Set values for environment variables.
Create new environment variables, which can be used
like parameters.


Permissions tab


Permissions tab
Assign user categories to operating system user groups, or enable
operators to view all the details of an event in a job log file.
The Permissions tab is enabled only if you have logged on to
DataStage using a name that gives you administrator status.


Tracing tab


Tracing tab
This is to enable or disable tracing on the server.


Schedule


Schedule tab
Set up a user name and password to use for running
scheduled DataStage jobs.
The Schedule tab is enabled only if you have logged
on to a Windows NT server.


Mainframe job properties


Tunables tab - Configure cache settings


Parallel tab


Sequence tab


Importing and Exporting Objects


From the Director, choose Tools >> Run Manager.
Choose Export >> Components.


Designer
A graphical user interface for creating DataStage
applications known as Jobs.
It is a design interface for both Infosphere DataStage
and Infosphere QualityStage
Jobs
A job defines a sequence of steps.
After designing, jobs are compiled and run on the
parallel processing engine.


Stages
The individual steps that make up the job are called
stages.
Some of the DataStage Prebuilt stages are sort,
merge, join, filter, transform, lookup and aggregate.
Stages provide 80 to 90 percent of the application
logic required for enterprise data integration
applications.
Each stage has properties that specify how it
processes data.


Some of the Stages in DataStage


Common List of stages in DataStage

Aggregator Stage
Complex Flat file Stage
Column Export Stage
Data Set Stage
Distributed Transaction
FTP Enterprise
Funnel
Join
Lookup
Merge
Sequential file Stage

Stages list (contd.)

Sort Stage
Surrogate Key generator
Transformer
Remove Duplicate stage


Steps in creating a job

Open designer and connect to the project


Choose the type of job to be created.
Import table definition
Drag and drop the stages
Link the stages
Set the properties of the stage
Save and compile the job
Execute the job by choosing Tools>>Run Director
Example
A simple job that groups department-wise and sums
salary from a flat file

Connect to the Project


Choose the type of Job


Parallel Job Canvas

(Screenshot callouts: Repository objects, Canvas, Stages palette)


Import Sequential file definition


Sequential file
definition


Choose Directory and file to import


Import
option


Define Columns and format


Stages and Links


Sequential file - Source


Format for Sequential file

Other properties


Columns tab - Load

Load the
columns


Select the columns needed


Columns loaded


Sequential file Target Properties


Aggregate Stage Properties


Select group by column
List of columns


Choose Output columns


Input page Aggregator Stage


Output page


Job


Save job


Compile the job


Run Director


Status View


Run the job


Annotation Stage
This stage is used to insert notes into the diagram
window.
Two types of Annotation
Annotation
Description Annotation


Stages in combining data


Combining Data Based on a Key Column
Lookup Stage
Merge Stage
Join Stage


Lookup Stage
It is used to perform lookup operations
The Lookup stage can have a
Single input link
Single output link
Optional reject link
Any number of reference links
Lookup combines records based on the key column.


Lookup Stage


Lookup Stage

Link the key
Specify key


Lookup

Choose to handle rejects

Lookup Stage


Join Stage
It performs a join operation on two or more inputs to
the stage.
This is similar to an SQL join.
It provides
Inner
Full Outer
Left Outer
Right Outer


Join Stage


Join Stage

Choose the key for join

Join Stage

Choose the join type

Join Stage


Merge Stage
The Merge stage is a processing stage.
It can have
More than one input link
A single output link
The same number of reject links as update links


Merge Stage


Merge Stage

Choose the merge key

Merge

Keep or drop

Merge Stage


Comparison

Stream input
Merge: 2 to N
Join: 2 to N
Lookup: 1
Reference input
Merge: NA
Join: NA
Lookup: 1 to N
Output
Merge: merged data (master-update type)
Join: SQL-type joined data
Lookup: if no duplicates are expected in the lookup data,
one row for every input stream record; else, if one
reference stream provides legitimate duplicates, multiple
rows for those records
Sorting requirements
Merge: all inputs
Join: all inputs
Lookup: stream input only
Duplicates
Merge: not allowed, except in the last update link
Join: allowed
Lookup: allowed in the stream input; up to 1 reference
link can handle duplicates, in the others a single (first)
value is returned
Partitioning
Merge: merge key
Join: join key
Lookup: usually set to Entire for the lookup data
Unmatched rows
Merge: master - drop/keep, warning/no warning;
update - drop/reject
Join: depends on join type (NULL values on outer join)
Lookup: unmatched stream - reject/keep
Memory
Merge: very few rows in memory, as data is sorted and
no duplicates are expected
Join: few rows, as data is sorted; higher (sequential,
optimized) I/O for high-speed sort on input and
reference data sets
Lookup: lookup data held in memory - may page for
large volumes; not suitable for large reference data;
when looking up against a database, the DB stage can
be set to provide sparse lookup support
Use when
Merge: larger sorted data
Join: large data
Lookup: small reference data lookup


Funnel Stage
It combines multiple inputs into a single output.
The stage can have any number of input links but a
single output link.
The metadata of all the inputs must be identical.
Funnel Stage Operates in 3 modes
Continuous funnel
Sort funnel
Sequence funnel


Funnel Stage


Funnel Stage
Choose the
funnel type


Funnel stage


Types of funnel
Continuous Funnel
Continuous funnel combines records of the input data in no
guaranteed order.
It takes one record from each input link in turn.
If data is not available on an input link, the stage skips to
the next link rather than waiting.
Sort Funnel
Sort Funnel combines the input records in the order defined
by the value(s) of one or more key columns, and the order
of the output records is determined by these sorting keys.
Sequence Funnel
Sequence copies all records from the first input data set to
the output data set, then all the records from the second
input data set, and so on.
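The three modes can be sketched on two inputs with identical metadata. The continuous funnel is shown as a deterministic interleave for illustration only; in reality its order depends on which link has data ready.

```python
# Sketch of the three funnel modes on two inputs with identical metadata.
import itertools

a = [{"id": 1}, {"id": 3}]
b = [{"id": 2}, {"id": 4}]

# Sequence funnel: all of input 1, then all of input 2.
sequence = list(itertools.chain(a, b))

# Sort funnel: output ordered by the key column(s).
sort_funnel = sorted(a + b, key=lambda r: r["id"])

# Continuous funnel: one record from each link in turn (idealized --
# the real stage takes whichever record is available, in no set order).
continuous = [r for pair in itertools.zip_longest(a, b) for r in pair if r]
```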


Development and Debug Stages

Head Stage
Tail Stage
Peek Stage
Column Generator Stage
Row Generator Stage
Write Range Map Stage


Head Stage
It can have a single input link and a single output link.
Selects the first N rows from each partition of an input
data set and copies the selected rows to the output
data set.
This is used to debug large data sets.
Property settings include the following
Number of records to copy
Partition from which records are copied
Location
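Note that N applies per partition, not to the data set as a whole, as this sketch shows:

```python
# Head stage sketch: copy the first N rows from EACH partition (not the
# first N rows overall) to the output data set.

def head(partitions, n):
    return [part[:n] for part in partitions]

parts = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
out = head(parts, 2)    # first 2 rows of every partition
# -> [[1, 2], [5, 6], [7, 8]]
```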


Stage Page Head Stage


General
General Properties can be provided here
Properties
Properties include the number of rows per
partition, all rows, or rows to skip
Advanced
Different execution modes and combinability
mode


Input Page Head Stage

General
Partitioning
Column
Advanced


Output Page Head Stage

General
Mapping
Column
Advanced


Tail Stage
It can have a single input link and a single output link.
It selects the last N records from each partition and
copies them to the output data set.


Peek Stage
It has a single input link and any number of output links.
It lets you print the record column values either to the
job log or to a separate output link as it copies records
from input to output.
It is helpful for monitoring the progress of an
application or diagnosing a bug in the application.


Sample Stage
It has a single input link and any number of output links.
Samples an input data set.
Percent mode - extracts rows by selecting them by
means of a random number generator and writes a
percentage of them to the output data set.
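Percent-mode selection can be sketched as each row independently passing with the given probability. The seed parameter here is an illustrative addition for repeatability, not a claim about the stage's options.

```python
# Sample stage (percent mode) sketch: each row passes with probability
# percent/100, driven by a random number generator.
import random

def percent_sample(rows, percent, seed=42):
    rng = random.Random(seed)          # seeded here only for repeatability
    return [r for r in rows if rng.random() * 100 < percent]

sampled = percent_sample(range(1000), 10.0)   # roughly 100 of 1000 rows
```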


Column Generator Stage


It can have a single input link and a single output link.
Column Generator adds columns to the incoming data
and generates mock data for these columns for each
row processed.
The new data set is the output.


Row Generator Stage


Row generator has no input link and a single output
link
Row generator produces mock data fitting the given
metadata.
It is used for testing when there is no data available
It has Stage page and Output Page


Write Range Map Stage


The Write Range Map stage takes an input data set
produced by sampling and sorting a data set and
writes it to a file in a form usable by the range
partitioning method.
A typical use for the Write Range Map stage would be
in a job which used the Sample stage to sample a data
set, the Sort stage to sort it and the Write Range Map
stage to write the resulting data set to a file.


ODBC Stages
ODBC stage is used to extract, write or aggregate
data.
Each ODBC stage can have any number of input links
or output links.
Specify the input link using the following methods
An SQL statement
A user-defined SQL query
A stored procedure


Import ODBC table definition


ODBC stage


Choose the tables to import


ODBC STAGE


ODBC stage


Output Mapping


Output mapping


OCI stage: Import plug-in metadata definition


OCI Stage


Data source name and user details

Database name


Choose the objects


Choose the tables for importing


OCI Stage and transformer stage


OCI STAGE - Properties


Choose the table


Choose the columns


Surrogate Key generator stage


Surrogate Key Generator


Sort and Filter


Sort stage Properties

Choose the key


Output Mapping


Filter Stage

Filter condition


Transformer stage


Transformer Stage

Column derivation
Constraints

Transformer Stage

Stage variables

Transformer stage


Transformer


Transformer


Transformer conditions
Scenario
A product file has pcode and product colour columns.
Products with yellow colour are moved to one file,
blue to another, and the rest to a third.
This task is done using Transformer stage
constraints.
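The routing logic of the scenario can be sketched in Python (illustrative only — in DataStage the three conditions would be written as constraint expressions on the three output links):

```python
def route_products(rows):
    # One list per output link; the final "else" plays the otherwise link.
    yellow, blue, other = [], [], []
    for row in rows:
        if row["colour"] == "yellow":
            yellow.append(row)
        elif row["colour"] == "blue":
            blue.append(row)
        else:
            other.append(row)
    return yellow, blue, other

products = [{"pcode": 1, "colour": "yellow"},
            {"pcode": 2, "colour": "blue"},
            {"pcode": 3, "colour": "red"}]
yellow, blue, other = route_products(products)
```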


Transformer stage with 3 output links


Transformer stage constraints


Change Data capture stage


Compares two data sets and records the differences
between them.
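A simplified model of the comparison in Python (keyed on a single column; the change-code labels are assumptions standing in for the stage's insert/edit/delete outputs):

```python
def change_capture(before, after, key):
    # Index both data sets by key, then record inserts, edits, and deletes.
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}
    changes = []
    for k, row in after_by_key.items():
        if k not in before_by_key:
            changes.append(dict(row, change_code="insert"))
        elif row != before_by_key[k]:
            changes.append(dict(row, change_code="edit"))
    for k, row in before_by_key.items():
        if k not in after_by_key:
            changes.append(dict(row, change_code="delete"))
    return changes

before = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
after  = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
diff = change_capture(before, after, "id")
```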


Job Properties


Job Parameters
It is possible to set up parameters for a job.
Job parameters are defined in the Job Properties
window, where a default value is provided.


Defining Parameters in the Job Properties window

Parameters option


Using Job Parameters

To use a parameter in the job, reference it as
#parametername#
Job parameters can be used directly in
expressions
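The effect of the #parametername# notation can be sketched in Python (an illustration of the substitution, not the engine's implementation):

```python
import re

def resolve(text, params):
    # Replace each #name# token with the corresponding job parameter value.
    return re.sub(r"#(\w+)#", lambda m: str(params[m.group(1)]), text)

path = resolve("#SourceDir#/input.txt", {"SourceDir": "/data/in"})
```

Here SourceDir is a hypothetical parameter name; any parameter defined on the job's Parameters page could be referenced the same way.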


Using Parameters in the Filter stage

Parameter


Containers
A container is a group of stages and links.
Containers are used to modularize job designs.
DataStage provides 2 types of containers:
Local Container
Shared Container


Types of Containers
Local Container
Created within a job and accessible only within that
job.
Shared Containers
Created and stored separately in the repository, in the same way as jobs.
There are 2 types of Shared Containers:
Server Shared Containers
Server shared containers can also be used in parallel
jobs
Parallel Shared Containers


Creating Local Containers


If a job is complex, group stages and links in a
container. To save existing stages and links in a local
container:
Select the stages
Choose Edit > Construct Container > Local
To insert an empty container:
Click Container on the tool palette
Double-click the stage, then add stages and links


Deconstructing local Container


To convert a container into a group of discrete stages and
links in the job:
Select the container stage and choose Deconstruct
from the shortcut menu


Editing local containers


Choose the container and click Edit > Properties


Shared Container
To store existing stages and links in a shared
container:
Choose the stages and links
Choose Edit > Construct Container > Shared
Parameters of the components are copied to the shared
container as container parameters
Saving it is the same as saving a job.


Job Sequences
Specifies a sequence of jobs to run.
The sequence can contain control information,
i.e., it is possible to specify different courses of action
to be taken depending on whether a job succeeds or
fails.
A job sequence can be scheduled and run using
DataStage Director.


Example of a Job sequence


Designing a Job Sequence


Designing a job sequence is similar to designing jobs:
Create the job sequence
Add activities from the tool palette
Join activities with triggers to define the control flow
Control activities
Defined in a similar way to activities; they allow
more control over sequence execution
Each activity has a set of properties and parameters
The job sequence itself also has properties and parameters
Job sequences support automatic exception handling


Restartable sequence
Job sequences are optionally restartable.
Checkpoint information enables DataStage to restart
the sequence from the point of failure.
It is possible to enable or disable checkpoints.
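Checkpoint-based restart can be sketched in Python (the activity names and the in-memory checkpoint set are assumptions; a real sequence records its checkpoints in the repository):

```python
def run_sequence(activities, checkpoints):
    # Run activities in order, skipping any already recorded as complete.
    executed = []
    for name, action in activities:
        if name in checkpoints:
            continue                 # completed in a previous run
        action()
        checkpoints.add(name)        # checkpoint recorded on success
        executed.append(name)
    return executed

log = []
activities = [("extract", lambda: log.append("e")),
              ("sort", lambda: log.append("s")),
              ("load", lambda: log.append("l"))]
checkpoints = {"extract"}            # the first run got this far
ran = run_sequence(activities, checkpoints)
```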


Creating a Job Sequence


File > New > Job Sequence


Job Sequence

Repository

Palette


Activity Stages
Job
Specifies a Server or parallel job
Routine
Specifies any routine but not transforms.
Exec Command
Specifies an operating system command to execute.
Email Notification
Specifies that an email notification is to be sent at
this point of the sequence (using SMTP)
Wait-for-file
Waits for a particular file to appear or disappear


Activity Stages
Nested conditions
Allows you to further branch the execution of a
sequence depending on a condition.
Sequencer
Allows you to synchronize the control flow of
multiple activities in a job sequence.
Start and end loop
Together these two stages allow you to implement a
For...Next or For...Each loop within your sequence
Terminator
Allows you to specify that, if certain situations occur,
the jobs a sequence is running shut down cleanly

Activity stages
User Variable
Allows you to define variables within a sequence.
These variables can then be used later on in the
sequence, for example to set job parameters.
Exception handler
It is executed if a job in the sequence fails to run
(other exceptions are handled by triggers) or if a
job aborts and the "Automatically handle activities
that fail" option is set for the sequence.
Only one exception handler is allowed per sequence.


Triggers
Triggers provide control information to the stage
activities.
They specify different courses of action to be taken
based on the job status.
Trigger names must be unique.
Types of Triggers
Conditional
Unconditional
Otherwise


Job Sequence Properties

Specifies parameters
Displays code


Scenario for job Sequence


5 input files are available in a folder, all with the same
layout
A single server job is available to sort an input file
Wait for a trigger file to start the job
Send a message to a computer after job completion
(success or failure)
Handle exceptions


For...Next loop to execute the Sort job for 5 input files

Waits for the trigger file to appear

Executes an OS command to send a message

Executes the Sort job

When any failure occurs, control is transferred here


Wait-For-File and Start and End Loop Activities

Wait-For-File Activity
Waits for a specified file to appear or
disappear
Appear option does not delete the file
after finding it

Start and End Loop Activity
Implements a For...Next or For...Each loop
The current value of the counter is stored in
stage_label.$Counter


Programming in DataStage
Programming components
Routines
Transforms
Functions
Expressions
Subroutines
Macros
Precedence rules


Routines
Routines are stored in the Routines folder by default.
The following components are classified as routines
Transform functions
Before/After Subroutines
While designing a job it is possible to specify:
Custom UniVerse functions
ActiveX functions


Executing jobs from command line


dsjob -run [-mode [NORMAL | RESET | VALIDATE]] [-param name=value]
[-warn n] [-rows n] [-wait] [-stop] [-jobstatus] [-userstatus]
[-local] [-opmetadata [TRUE | FALSE]] [-disableprjhandler]
[-disablejobhandler] [-useid] project job|job_id

For example (with a hypothetical project and job):
dsjob -run -mode NORMAL -param SourceDir=/data -jobstatus myproject SortJob


Commands
dsadmin command
DSXImportService command
SyncProject command


Performance tuning in DS

Ensure proper indexes are created.
Partition the tables wherever required.
Use multiple nodes.
Use APT_DUMP_SCORE to examine the job score.
Where possible, let the database sort with ORDER BY
rather than using a Sort stage.


Scenarios
Scenario 1
If there are 3 jobs in a sequence and job 1 fails while
running, how do we still run the other 2 jobs?
Set the trigger in the activity's Properties to Unconditional.
Scenario 2
Try a left outer join using the Lookup stage


Server Job Stages in 8.1.2

Complex Flat File stage
Folder stage
Hashed File stage
Sequential File stage
Aggregator stage
Command stage
InterProcess stage
FTP Plug-in stage
Link Collector stage
Link Partitioner stage


Server job stages

Merge Stage
Pivot Stage
Row merger Stage
Row Splitter Stage
Sort Stage
Transformer Stage


Parallel Job Stages


File Stages
Data Set Stage
Sequential File stage
File set Stage
Lookup fileset Stage
External source stage
External Target stage
Complex Flat file stage


Parallel job Stages


Processing stages
Transformer Stage
Basic Transformer Stage
Aggregator Stage
Join Stage
Lookup Stage
Merge Stage
Sort Stage
Funnel Stage
Remove Duplicate Stage


Parallel job stages

Compress Stage
Expand stage
Copy Stage
Modify Stage
Filter Stage
External filter Stage
Change capture stage
Change apply Stage
Difference Stage
Compare Stage


Parallel job Stages

Encode stage
Decode Stage
Switch Stage
FTP Enterprise stage
Generic stage
Surrogate key generator stage
Slowly Changing dimension Stage
Pivot Enterprise Stage
Checksum stage


Restructure stages

Column Import Stage
Column Export Stage
Make Subrecord stage
Split Subrecord stage
Combine record stage
Promote subrecord stage
Make Vector Stage
Split Vector Stage

