
Transformer Looping Functions for Pivoting Data

Convert a single row into multiple rows using the Transformer Looping Function (pivoting of data using the parallel Transformer in DataStage 8.5, 8.7 and 9.1)

Refer to this link for more details: Looping Concept in Datastage

Now you can argue that this is possible using a Pivot stage, but for the sake of this article let's try doing it using a Transformer!
Below is a screenshot of our input data

We are going to read the above data from a sequential file and transform it to look like this

So let's get to the job design.

Step 1: Read the input data.

Step 2:

Logic for Looping in Transformer Properties

In the adjacent image you can see a new box called Loop Condition. This is where we are going to control
the loop variables.

Below is the screenshot when we expand the Loop Condition box

The Loop While constraint is used to implement functionality similar to a WHILE statement in programming. So, as with a while statement, we need a condition that determines how many times the loop is supposed to be executed. To achieve this, the @ITERATION system variable was introduced.

In our example we need to loop the data 3 times to get the column data onto subsequent rows.
So let's set the Loop While condition to:

@ITERATION <= 3

Now create a new loop variable with the name LoopName.

The derivation for this loop variable should be:

If @ITERATION = 1 Then DSLink2.Name1
Else If @ITERATION = 2 Then DSLink2.Name2
Else DSLink2.Name3

Below is a screenshot illustrating the same

Now all we have to do is map this loop variable, LoopName, to our output Name column.

Let's map the output to a Sequential File stage and see if the output is as desired.

After running the job, we did a view data on the output stage and here is the data as desired.
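Since the screenshots are not reproduced here, a small illustrative example of the pivot (the ID column and the sample values are hypothetical; the input columns Name1, Name2 and Name3 are the ones referenced above):

Input (one row):
  ID | Name1 | Name2 | Name3
  1  | Tom   | Dick  | Harry

Output (three rows, one per loop iteration):
  ID | Name
  1  | Tom
  1  | Dick
  1  | Harry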

Making some tweaks to the above design, we can implement things like:
1. Adding new rows to existing rows
2. Splitting data in a single column into multiple rows
and many more.

Posted by Devendra Kumar Yadav at 4:37 AM

Partitioning considerations For Best Performance Of datastage Jobs

This post gives you complete details on how we can improve the
performance of DataStage parallel jobs using appropriate partitioning
methods.

Refer to these links as well:


1. Datastage Partitioning Methods and Use
2. Datastage Jobs Performance Improvement Tips1
3. Datastage Performance Tuning Tips

1.0 Partitioning considerations:


Choose a partition method which makes sure that the number of rows per partition is close
to equal. This minimizes the processing workload and thereby improves the overall run
time. Any stage that processes a group of related records must be partitioned using a keyed
partition technique (e.g. the Aggregator, Remove Duplicates, Change Capture, Change Apply,
Join and Merge stages, as well as Transformers that process groups of related records).

Minimize repartitioning, as it decreases performance unless the partition distribution is
highly skewed. Repartitioning adds network transport overhead, and the even
distribution of data among partitions also gets disturbed.

Specify hash partitioning for stages that require processing of groups of related records.
Partitioning keys should include only those key columns that are necessary for proper
grouping. If the grouping is on a single integer key column, go for Modulus partitioning on the
same key column. If the data is highly skewed and the key column values and distribution
will not change significantly over time, use the Range partitioning technique. (A short
summary follows after this list.)

Use Round Robin partitioning to distribute data evenly across all partitions (if grouping is not
needed). This is strongly suggested when the input data arrives sequentially or is heavily
skewed. Same partitioning requires minimum resources and can be used to optimize a job
and to eliminate repartitioning of already partitioned data.

When the input data set is sorted in parallel, use the Sort Merge collector, which
will produce a single sorted stream of rows. When the input data set is sorted in parallel
and range partitioned, the Ordered collector method is preferred for collection.

For a round-robin partitioned input data set, use the Round Robin collector to reconstruct rows in
input order, as long as the data set has not been repartitioned or reduced.

Minimize the use of sorts in a job.
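As a quick reference, here is an illustrative mapping of common situations to partitioning methods (the key column CUST_ID is a hypothetical example):

Join / Aggregator / Remove Duplicates keyed on CUST_ID  ->  Hash on CUST_ID
Grouping on a single integer key                        ->  Modulus on that key
Skewed data with a stable key distribution              ->  Range on CUST_ID
No grouping needed, even spread required                ->  Round Robin
Data already partitioned correctly upstream             ->  Same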

Figure: Partitioning tab in a Datastage stage properties

Posted by Devendra Kumar Yadav at 12:22 AM

Datastage Jobs Best Practices and Performance Tuning

This post gives you complete details on how we can improve the
performance of DataStage parallel jobs, and the best practices we have
to follow while creating DataStage jobs.
This post will help you on the following topics:
1. Performance Tuning Guidelines
1.1 General Job Design
1.2 Transformer Stage
1.3 Data grouping Stages
1.4 ODBC Stages

Refer to this link as well: Parallel Job Performance Tuning Tips1

1.0 Performance Tuning Guidelines


1.1 General Job Design
Jobs need to be developed using the modular development approach. Large jobs can be
broken down into smaller modules, which helps in improving performance.

In scenarios where the same data (a huge number of records) is to be shared among more than
one job in the same project, use the Data Set stage approach instead of re-reading the same
data again.

Eliminate unused columns

Eliminate unused references

If the input file has a huge number of records and the business logic allows splitting up of
the data, then run the job in parallel to get a significant improvement in
performance.

1.2 Transformer stage

Use the parallel Transformer stage instead of Filter/Switch stages. (Filter/Switch stages take
more resources to execute; for example, in the case of the Filter stage the where clause gets
evaluated at run time, which requires more resources and thereby degrades job performance.)

Figure: Example of using a Transformer stage instead of using a filter stage. The filter condition is
given in the constraint section of the transformer stage properties.
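As an illustration, a filter condition simply becomes a constraint expression on the Transformer output link; the link name DSLink2 and the column STATUS here are hypothetical:

DSLink2.STATUS = 'A'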

Use BuildOp stage only when the required logic cannot be implemented using the parallel
transformer stage.

Avoid calling routines in derivations in the Transformer stage; implement the logic in the
derivation itself. This avoids the overhead of a procedure call.

Implement common logic using stage variables and reference those stage variables in the derivations.
During processing, execution starts with the stage variables, then the constraints, and then the
individual column derivations. If there is a prerequisite formula that is needed by both the
constraints and several column derivations, define it in a stage variable so that it is evaluated
once per row and reused, rather than recalculated in every derivation. If the formula needs to
differ for each individual derivation, it is advisable to place the code at the derivation level
rather than in a stage variable.

Figure: Example of defining stage variables and using them in the derivations.
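A minimal sketch of the idea (the link name DSLink2, the columns PRICE and QTY, and the stage variable name svLineTotal are hypothetical):

Stage variable svLineTotal, derivation:
  DSLink2.PRICE * DSLink2.QTY

Constraint on the output link:
  svLineTotal > 0

Derivation of the output column LINE_TOTAL:
  svLineTotal

The multiplication is evaluated once per row in the stage variable and then reused by both the constraint and the column derivation.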

1.3 Data grouping stages


When dealing with stages like Aggregator, Filter, etc., always try to use sorted data for
better performance.

Figure: Sorting the input data on the grouping keys in an aggregator stage

The example shown in the figure is the properties window for an Aggregator stage that
finds the sum of a quantity column, grouping on the columns shown above. In such
scenarios, we sort the input data on the same columns so that records with the same
values in the grouping columns come together, thereby increasing performance. Also note
that if we are using more than one node, the input data set should be partitioned on the
grouping columns so that similar records end up on the same node.
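A minimal sketch of such a configuration (the column names STORE_ID, PRODUCT_ID and QTY are hypothetical, and property names are approximate):

Sort stage       : sort keys   = STORE_ID, PRODUCT_ID   (input hash-partitioned on the same keys)
Aggregator stage : group       = STORE_ID, PRODUCT_ID
                   aggregation = Sum of QTY
                   Method      = Sort   (since the input arrives pre-sorted)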

1.4 ODBC Stages


If possible, sort the data in the ODBC stage itself; this reduces the overhead of DataStage sorting
the data. Don't use the Sort stage when we have an ORDER BY clause in the ODBC SQL.

Select only the required records, and remove unwanted rows as early as possible, so that the job
does not have to deal with unnecessary records that degrade performance.

Using a constraint to filter records is much slower compared to having a
SELECT ... WHERE in the ODBC stage. Use the power of the database wherever possible and
reduce the overhead on DataStage.

Figure: Using the User-defined SQL option in ODBC stages to reduce the overhead on DataStage by
specifying the WHERE and ORDER BY clauses in the SQL used to get the data.
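For instance, a user-defined SQL statement of this shape pushes both the filter and the sort into the database (the table and column names are hypothetical):

SELECT cust_id, cust_name, region
FROM   customer
WHERE  region = 'APAC'
ORDER BY cust_id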

Avoid using the LIKE operator in user-defined queries in ODBC stages. One thing to be noted
here is that if our custom SQL genuinely requires it, for example a filter on some string pattern,
we will be forced to use LIKE to get the requirement done.

Avoid using stored procedures unless the functionality cannot be implemented in DataStage jobs.
Posted by Devendra Kumar Yadav at 12:07 AM
TUESDAY, OCTOBER 22, 2013

Know about Conductor Node, Section Leaders and Players Processes in Datastage

Refer to this link as well for more details: Job Run Time Architecture
Jobs developed with DataStage Enterprise Edition (EE) are independent of the actual
hardware and degree of parallelism used to run the job. The parallel Configuration File
provides a mapping at runtime between the job and the actual runtime infrastructure and
resources by defining logical processing nodes.
To facilitate scalability across the boundaries of a single server, and to
maintain platform independence, the parallel framework uses a multi-process
architecture.
The runtime architecture of the parallel framework uses a process-based
architecture that enables scalability beyond server boundaries while avoiding
platform-dependent threading calls. The actual runtime deployment for a given job design
is composed of a hierarchical relationship of operating system processes, running on one
or more physical servers:

Section Leaders (one per logical processing node): used to create and manage player
processes which perform the actual job execution. The Section Leaders also manage
communication between the individual player processes and the master Conductor Node.

Players: one or more logical groups of processes used to execute the data flow logic. All
players are created as groups on the same server as their managing Section Leader
process.
Conductor Node (one per job): the main process used to start up jobs, determine
resource assignments, and create Section Leader processes on one or more processing
nodes. It acts as a single coordinator for status and error messages and manages orderly
shutdown when processing completes or in the event of a fatal error. The conductor
process runs on the primary server.
In short, it is the main process that:
1. Starts up jobs.
2. Determines resource assignments.
3. Creates the Section Leaders (which in turn create and manage the player
processes that perform the actual job execution).
4. Acts as the single coordinator for status and error messages.
5. Manages orderly shutdown when processing completes or in the event of a fatal
error.
When the job is initiated the primary process (called the conductor) reads the job
design, which is a generated Orchestrate shell (osh) script. The conductor also reads the
parallel execution configuration file specified by the current setting of the
APT_CONFIG_FILE environment variable.
Once the execution nodes are known (from the configuration file), the conductor causes a
coordinating process called a section leader to be started on each node: by forking a child
process if the node is on the same machine as the conductor, or by remote shell execution
if the node is on a different machine from the conductor (things are a little more dynamic
in a grid configuration, but essentially this is what happens).
Communication between the conductor, section leaders and player processes in a
parallel job is effected via TCP.
Scenario to Calculate the Processes:
Sample APT_CONFIG_FILE (the "conductor" pool entries identify the conductor node):
{
  node "node1"
  {
    fastname "DevServer1"
    pools "conductor"
    resource disk "/datastage/Ascential/DataStage/Datasets/node1" {pools "conductor"}
    resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "DevServer1"
    pools ""
    resource disk "/datastage/Ascential/DataStage/Datasets/node2" {pools ""}
    resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node2" {pools ""}
  }
}
Here is how the processes add up:
For every job that starts there will be one (1) conductor process (started on the conductor
node), one (1) section leader for each node in the configuration file, and typically one (1)
player process (this may or may not hold) for each stage in your job on each node.
So if you have a job that uses a two (2) node configuration file and has 3 stages then your
job will have
1 conductor process
2 section leaders (2 nodes * 1 section leader per node)
6 player processes (3 stages * 2 nodes)

Your dump score may show that your job will run 9 processes on 2 nodes.
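Put as a rule of thumb (assuming one player per stage per node, which, as noted above, may not always hold):

total processes = 1 conductor + N section leaders + (S * N) players
                = 1 + 2 + (3 * 2) = 9    for N = 2 nodes and S = 3 stages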
This kind of information is very helpful when determining the impact that a particular job
or process will have on the underlying operating system and system resources.
Posted by Devendra Kumar Yadav at 11:53 PM

Situations to choose Parallel or Server Datastage Jobs


1. The choice of server or parallel depends upon time to implement, functionality
and cost.
2. When we have lots of functionality to implement for lower volumes, the hardware is
limited, and ease of implementation matters, we can go for server jobs.
3. Parallel jobs are costlier due to the scale of hardware required and are harder to
implement, but they offer extreme processing capabilities for very large volumes, with a
vast array of operators for high-performance manipulation.
4. When the data volume is low, it is better to go for a server job, as parallel jobs can
have a longer start-up time.

5. When the data volume is high, it is better to choose a parallel job over a server job.
A parallel job will be a lot faster than a server job even if it runs on a single node.
The obvious incentive for going parallel is data volume. Parallel jobs can remove
bottlenecks and run across multiple nodes in a cluster for almost unlimited
scalability. At this point parallel jobs become the faster and easier option. A
parallel Sort stage is a lot faster than the server equivalent. A Transformer stage in a
parallel job is faster than the same transformations in a server job; even on one node
with a compiled Transformer stage, the parallel version was three times faster. Even on a
1-node configuration that does not allow much parallel processing, we can still get
big performance improvements from an Enterprise Edition job. The improvements
can be multiplied by 10 or more if we work on 2-CPU machines with two
nodes in most stages.
6. Parallel jobs take advantage of both pipeline parallelism and partitioning
parallelism.
7. We can improve the performance of a server job by enabling inter-process row
buffering. This helps stages to exchange data as soon as it is available on the link.
The IPC stage also helps a passive stage to read data from another stage as soon as data is
available. In other words, stages do not have to wait for the entire set of records
to be read first and then transferred to the next stage. Link Partitioner and Link
Collector stages can be used to achieve a certain degree of partitioning
parallelism.
8. Lookup against a sequential file is possible in parallel jobs but not in server
jobs.
9. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language).
OSH executes operators - instances of executable C++ classes, pre-built
components representing stages used in Datastage jobs.
Server jobs are compiled into BASIC, which is interpreted pseudo-code. This is
why parallel jobs run faster, even if processed on one CPU.
10. The major difference between Infosphere Datastage Enterprise and Server edition
is that Enterprise Edition (EE) introduces Parallel jobs. Parallel jobs support a
completely new set of stages, which implement the scalable and parallel data
processing mechanisms. In most cases parallel jobs and stages look similar to the
Datastage Server objects; however, their capabilities are very different. In rough
outline:
Parallel jobs are executable DataStage programs, managed and controlled by the
DataStage Server runtime environment.
Parallel jobs have a built-in mechanism for pipelining, partitioning and parallelism.
In most cases no manual intervention is needed to implement those techniques
optimally.
Parallel jobs are a lot faster in ETL tasks such as sorting, filtering and aggregating.

Refer to this link to know more about parallel job stages: Parallel Jobs Stages

Posted by Devendra Kumar Yadav at 11:02 PM
Surrogate Key Generator Implementation in Datastage 8.1, 8.5 & 9.1

The Surrogate Key Generator stage is a processing stage that generates surrogate key
columns and maintains the key source.
A surrogate key is a unique primary key that is not derived from the data that it
represents, therefore changes to the data will not change the primary key. In a star
schema database, surrogate keys are used to join a fact table to a dimension table.
The Surrogate Key Generator stage can be used to:
1. Create or delete the key source before other jobs run
2. Update a state file with a range of key values
3. Generate surrogate key columns and pass them to the next stage in the job
4. View the contents of the state file

Generated keys are 64-bit integers, and the key source can be a state file or a database
sequence.

Surrogate keys are used to join a dimension table to a fact table in a star schema
database.

When the SCD stage performs a dimension lookup :


A) If a matching record is found, it retrieves the value of the existing surrogate key.
B) If a match is not found, the stage obtains a new surrogate key value by using the
derivation of the Surrogate Key column on the Dim Update tab.

The SCD stage can generate new surrogate keys by using a key source that you
created with a Surrogate Key Generator stage, as described in Surrogate Key Generator.

If you want to use your own method to handle surrogate keys, you should
derive the Surrogate Key column from a source column.

You can replace the dimension information in the source data stream with the surrogate key
value by mapping the Surrogate Key column to the output link.

Creating the Key Source:

Drag the Surrogate Key Generator stage from the palette to the parallel job canvas, with no input
or output links.

Double-click the Surrogate Key Generator stage and click on the Properties tab.

Properties:
Key Source Action = Create
Source Type = Flat File or Database sequence (in this case we are using Flat File)

When you run the job, it will create an empty file.
If you want to check the contents, change View State File = Yes and check the job
log for details: skey_genstage,0: State file /tmp/skeycutomerdim.stat is empty.
If you try to create the same file again, the job will abort with the following error:
skey_genstage,0: Unable to create state file /tmp/skeycutomerdim.stat: File exists.
Deleting the key source:

Updating the State File:

To update the state file, add a Surrogate Key Generator stage to the job with a single input link
from another stage.
We use this process to update the state file if it is corrupted or deleted.
1. Open the Surrogate Key Generator stage editor and go to the Properties tab.

If the state file exists we can update it; otherwise we can create and update it.
We are using the SkeyValue parameter to update the state file via a Transformer stage.

Generating Surrogate Keys:

Now that we have created the state file, we will generate keys using it.
Click on the Surrogate Key Generator stage, go to Properties, and type a name for the
surrogate key column in the Generated Output Column Name property.

Go to Output and define the mapping as below.

We are using a Row Generator with 10 rows, so when we run the job we see 10 surrogate key
values in the output. I then updated the state file with 100, and below is the output.

If you want to generate the key values from the beginning, you can use the following properties
in the Surrogate Key Generator stage (a consolidated example follows the list below).

A. If the key source is a flat file, specify how keys are generated:
1. To generate keys in sequence from the highest value that was last used, set the
Generate Key from Last Highest Value property to Yes. Any gaps in the key range
are ignored.
2. To specify a value to initialize the key source, add the File Initial Value property to
the Options group, and specify the start value for key generation.
3. To control the block size for key ranges, add the File Block Size property to the
Options group, set this property to User specified, and specify a value for the block
size.
B. If there is no input link, add the Number of Records property to the Options group, and
specify how many records to generate.
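Pulling these options together, a minimal illustrative set of stage properties for generating keys from the flat-file key source created earlier (the output column name CUST_SKEY is hypothetical, and exact property names may vary slightly by version):

Source Type                          = Flat File
State file                           = /tmp/skeycutomerdim.stat
Generated Output Column Name         = CUST_SKEY
Generate Key from Last Highest Value = Yes
Number of Records                    = 10    (only needed when the stage has no input link)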
