
DATASTAGE POINTS

You can view and modify the table definitions at any point during the design of your application.
1) Server jobs do not support partitioning techniques, but parallel jobs do.
2) Server jobs do not support SMP and MPP platforms; parallel jobs support both SMP and MPP.
3) Server jobs run on a single node, whereas parallel jobs run on multiple nodes.
4) Prefer server jobs when the source data volume is low; when the data volume is huge, prefer parallel jobs.

What is metadata?
Data about data. A table definition, which describes the structure of a table, is an
example of metadata.

What is the difference between maps and locales?


Maps : Defines the character sets that the project can use.
Locales : Defines the local formats for dates, times, sorting order, and so on that the project can
use.

What is the difference between DataStage and Informatica?


DataStage supports parallel processing, which Informatica does not.
Links are objects in DataStage; in Informatica the equivalent is port-to-port connectivity.
In Informatica it is easy to implement Slowly Changing Dimensions, which is somewhat more complex in
DataStage.
DataStage does not support complete error handling.

What are the types of stage?


A stage can be passive or active. A passive stage handles access to databases for the
extraction or writing of data. Active stages model the flow of data and provide mechanisms for
combining data streams, aggregating data, and converting data from one data type to another.
There are two types of stages:
Built-in stages: Supplied with DataStage and used for extracting, aggregating, transforming, or
writing data.
Plug-in stages: Additional stages defined in the DataStage Manager to perform tasks that the
built-in stages do not support.

How can we improve the performance in DataStage?


In the server canvas we can improve performance in two ways:

Firstly, we can increase memory usage by enabling inter-process row buffering in the job properties.
Secondly, by inserting an IPC stage we break a process into two processes. We can use this
stage to connect two passive stages or two active stages.

What is APT_CONFIG in datastage?


DataStage understands the architecture of the system through this file (APT_CONFIG_FILE).
For example, this file contains information about node names, disk storage, and so on.
APT_CONFIG is just an environment variable used to identify the *.apt file. Don't confuse the variable with the
*.apt file itself, which holds the node information and the configuration of the SMP/MPP server.
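As a sketch, a minimal single-node configuration file looks like the fragment below; the node name, fastname and resource paths are placeholders, not values taken from this document.

{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/iis/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/iis/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}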

What are Sequencers?


Sequencers are job control programs that execute other jobs with preset job parameters.

How do you register plug-ins?


Using DataStage Manager.

What are the command line functions that import and export the DS jobs?
dsimport.exe : imports the DataStage components.
dsexport.exe : exports the DataStage components.
14 Good design tips in Datastage
1) When you need to run the same sequence of jobs again and again, better create a sequencer
with all the jobs that you need to run. Running this sequencer will run all the jobs. You can
provide the sequence as per your requirement.
2) If you are using a Copy or a Filter stage either immediately after or immediately before a
Transformer stage, you are reducing efficiency by using more stages than necessary, because a Transformer
can do the job of both a Copy stage and a Filter stage.
3) Use Sort stages instead of Remove duplicate stages. Sort stage has got more grouping options
and sort indicator options.
4) Turn off Runtime Column Propagation wherever it's not required.
5) Make use of Modify, Filter, Aggregator, Column Generator, etc. stages instead of the Transformer
stage only if the anticipated volumes are high and performance becomes a problem. Otherwise
use the Transformer; it is much easier to code a Transformer than a Modify stage.
6) Avoid propagation of unnecessary metadata between the stages. Use the Modify stage to drop
the metadata. The Modify stage will drop the metadata only when it is explicitly specified using the DROP
clause.
7) Add reject files wherever you need reprocessing of rejected records or you think considerable
data loss may happen. Try to keep reject links at least on Sequential File stages and on stages
writing to databases.

8) Make use of an ORDER BY clause when a DB stage is being used in a join. The intention is to make
use of the database's power for sorting instead of DataStage resources. Keep the join partitioning as
Auto. Indicate the "don't sort (previously sorted)" option between the DB stage and the Join stage, using a Sort stage, when using the
ORDER BY clause.
9) While doing Outer joins, you can make use of dummy variables for just null checking instead of
fetching an explicit column from the table.
10) Data partitioning is a very important part of parallel job design. It's always advisable to leave
the data partitioning as Auto unless you are comfortable with partitioning, since all DataStage
stages are designed to perform in the required way with Auto partitioning.
11) Do remember that Modify drops the Metadata only when it is explicitly asked to do so using
KEEP/DROP clauses.
12) Range Look-up: Range lookup is equivalent to the BETWEEN operator. Lookup against a
range of values was difficult to implement in previous DataStage versions. By having this
functionality in the Lookup stage, comparing a source column to a range of two lookup columns,
or a lookup column to a range of two source columns, can be easily implemented.
13) Use a Copy stage to dump out data to intermediate Peek stages or sequential debug files.
Copy stages get removed at compile time, so they do not add overhead.
14) Where you are using a Copy stage with a single input and a single output, you should ensure
that you set the Force property to True in the stage editor. This prevents DataStage from deciding
that the Copy operation is superfluous and optimizing it out of the job.
Development Guidelines
Modular development techniques should be used to maximize re-use of DataStage jobs and
components:
Job parameterization allows a single job design to process similar logic instead of creating
multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of
the same job to run simultaneously.
_ A set of standard job parameters should be used in DataStage jobs for source and target
database parameters (DSN, user, password, etc.) and directories where files are stored. To ease
re-use, these standard parameters and settings should be made part of a Designer Job Parameter
Sets.
_ Create a standard directory structure outside of the DataStage project directory for source and
target files, intermediate work files, and so forth.
_ Where possible, create re-usable components such as parallel shared containers to encapsulate
frequently-used logic.
_ DataStage template jobs should be created, containing:
_ Standard parameters such as source and target file paths, and database login properties
_ Environment variables and their default settings
_ Annotation blocks

_ Job Parameters should always be used for file paths, file names, database login settings.
_ Standardized Error Handling routines should be followed to capture errors and rejects.
Component Usage
The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere
DataStage Enterprise Edition:
_ Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a
parallel job. BASIC Routines are appropriate only for job control sequences.
_ Always use parallel Data Sets for intermediate storage between jobs unless that specific data
also needs to be shared with other applications.
_ Use the Copy stage as a placeholder for iterative design, and to facilitate default type
conversions.
_ Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch
stages.
_ Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.
DataStage Datatypes
The following guidelines should be followed with DataStage data types:
_ Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data
types. If possible, import table definitions for source databases using the Orchestrate Schema
Importer (orchdbutil) utility.
_ Leverage default type conversions using the Copy stage or across the Output mapping tab of
other stages.
Partitioning Data
In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the
Information Server Engine will choose the type of partitioning at runtime based on stage
requirements, degree of parallelism, and source and target systems. While Auto partitioning will
generally give correct results, it might not give optimized performance. As the job developer, you
have visibility into requirements, and can optimize within a job and across job flows. Given the
numerous options for keyless and keyed partitioning, the following objectives form a
methodology for assigning partitioning:
_ Objective 1
Choose a partitioning method that gives close to an equal number of rows in each partition, while
minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing
overall run time.
_ Objective 2
The partition method must match the business requirements and stage functional requirements,
assigning related records to the same partition if required.

Any stage that processes groups of related records (generally using one or more key columns)
must be partitioned using a keyed partition method.
This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge,
Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps
that process groups of related records.
_ Objective 3
Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or
Grid configurations.
Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.
_ Objective 4
Partition method should not be overly complex. The simplest method that meets the above
objectives will generally be the most efficient and yield the best performance.
Using the above objectives as a guide, the following methodology can be applied:
a. Start with Auto partitioning (the default).
b. Specify Hash partitioning for stages that require groups of related records
as follows:
Specify only the key column(s) that are necessary for correct grouping as long as the number
of unique values is sufficient
Use Modulus partitioning if the grouping is on a single integer key column
Use Range partitioning if the data is highly skewed and the key column values and distribution
do not change significantly over time (Range Map can be reused)
c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all
partitions. This is especially useful if the input Data Set is highly skewed or sequential.
d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning.
Be mindful that Same partitioning retains the degree of parallelism of the upstream stage.
Within a flow, examine up-stream partitioning and sort order and attempt to preserve for downstream processing. This may require re-examining key column usage within stages and reordering stages within a flow (if business requirements permit).
Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is
particularly useful if downstream jobs are run with the same degree of parallelism (configuration
file) and require the same partition and sort order.
Collecting Data
Given the options for collecting data into a sequential stream, the following guidelines form a
methodology for choosing the appropriate collector type:

1. When output order does not matter, use the Auto collector (the default).
2. Consider how the input Data Set has been sorted:
When the input Data Set has been sorted in parallel, use Sort Merge collector to produce a
single, globally sorted stream of rows.
When the input Data Set has been sorted in parallel and Range partitioned, the Ordered
collector might be more efficient.
3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input
Data Sets, as long as the Data Set has not been re-partitioned or reduced.

Sorting Data
Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition
data flow:
1. Start with a link sort.
2. Specify only necessary key column(s).
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a Link sort for options that are not available on a Link
sort:
The Restrict Memory Usage option is one such option. If you want more memory
available for the sort, you can only set that via the Sort stage, not on a link sort. The
environment variable
$APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per
partition.
Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, Output
Statistics.
Always specify the DataStage Sort Utility for standalone Sort stages.
Use Sort Key Mode = Don't Sort (Previously Sorted) to resort a sub-grouping of a
previously-sorted input Data Set.
5. Be aware of automatically-inserted sorts:
Set $APT_SORT_INSERTION_CHECK_ONLY to verify but not establish
required sort order.
6. Minimize the use of sorts within a job flow.
7. To generate a single, sequential ordered result set, use a parallel Sort and a
Sort Merge collector.
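As a sketch of the memory setting mentioned above, the variable could be exported in dsenv or defined as a project-level environment variable; the 512 MB value is only an illustration, not a recommendation from this document.

# Illustrative only: give each tsort partition 512 MB of sort memory
APT_TSORT_STRESS_BLOCKSIZE=512; export APT_TSORT_STRESS_BLOCKSIZE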

Stage Specific Guidelines


Transformer
Take precautions when using expressions or derivations on nullable columns within the parallel
Transformer:
Always convert nullable columns to in-band values before using them in an expression or
derivation.
Always place a reject link on a parallel Transformer to capture / audit possible rejects.
Lookup
It is most appropriate when reference data is small enough to fit into available shared memory. If
the Data Sets are larger than available memory resources, use the Join or Merge stage.
Limit the use of database Sparse Lookups to scenarios where the number of input rows is
significantly smaller (for example 1:100 or more) than the number of reference rows, or for
exception processing.
Join
Be particularly careful to observe the nullability properties for input links to any form of Outer
Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in
the Join stage input in order to identify unmatched records.
Aggregators
Use Hash method Aggregators only when the number of distinct key column values is small. A
Sort method Aggregator should be used when the number of distinct key values is large or
unknown.
Database Stages
The following guidelines apply to database stages:
Where possible, use the Connector stages or native parallel database stages for maximum
performance and scalability.
The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel
stage is not available for the given source or target database.
When using Oracle, DB2, or Informix databases, use Orchestrate Schema Importer (orchdbutil)
to properly import design metadata.
Take care to observe the data type mappings.
Datastage Coding Checklist
Ensure that the null handling properties are taken care of for all the nullable fields. Do not set the
null field value to some value which may be present in the source.
Ensure that all character fields are trimmed before any processing. Extra spaces
in the data may lead to errors like lookup mismatches, which are hard to detect.

Always save the metadata (for source, target or lookup definitions) in the repository to ensure
reusability and consistency.
In case the partition type for the next immediate stage is to be changed, the Propagate
partition option should be set to Clear in the current stage.
Make sure that appropriate partitioning and sorting are used in the stages, wherever
possible. This enhances performance. Make sure that you understand the partitioning being
used; otherwise leave it as Auto.
Make sure that pathname/format details are not hard coded and that job parameters are used
for them. These details are generally set as environment variables.
Ensure that all file names from external sources are parameterized. This will save the
developer the trouble of changing the job if a file name changes. File
names/datasets created in the job for intermediate purposes can be hard coded.
Ensure that the environment variable $APT_DISABLE_COMBINATION is set to False.
Ensure that $APT_STRING_PADCHAR is set to spaces.
The parameters used across the jobs should have the same names. This helps to avoid
unnecessary confusion.
Use a 4-node configuration file for unit testing/system testing the job.
If there are multiple jobs to be run for the same module, archive the source files in the after-job
routine of the last job.
Check whether the file exists in the landing directory before moving the sequential file. The
mv command will move the landing directory if the file is not found (see the sketch after this checklist).
Verify whether the appropriate after-job routine is called in the job.
Verify that the correct link counts are used in the after-job routine for the ACR log file.
Check whether the log statements are correct for that job.
Ensure that the unix files created by any DataStage job are created by the same unix user who
has run the job.
Check the Director log if the error message is not readable.
Verify that the job name, stage names, link names, and input file names are as per standards. Ensure that the
job developed adheres to the naming standards defined for the software artifacts.
Job description must be clear and readable.
Make sure that the Short Job Description is filled in using a Description Annotation and that it contains
the job name as part of the description. Don't use a plain Annotation for the job description.
Check that the parameter values are assigned to the jobs through the sequencer.
Verify whether Runtime Column Propagation (RCP) is disabled or not.

Ensure that reject links are output from the Sequential File stage which reads the data file, to
log the records which are rejected.
Check whether datasets are used instead of sequential files for intermediate storage
between the jobs. This enhances performance in a set of linked jobs.
Reject records should be stored as sequential files. This makes the analysis of rejected
records outside DataStage easier.
Ensure that a dataset coming from another job uses the same metadata which is saved in the
repository.
Verify that the intermediate files that are used by downstream jobs have unix read
access/permission for all users.
For fixed-width files, the final delimiter should be set to none in the file format properties.
Verify that all lookup reference files have unix permission 744. This will ensure that other
users don't overwrite or delete the reference file.
The stage variables should be in the correct order, e.g. a stage variable used in a
calculation should be higher in the order than the variable that performs the calculation.
If any processing stage requires a key (like Remove Duplicates, Merge, Join, etc.), the keys,
sorting keys and partitioning keys should be the same and in the same order.
Make sure that sparse lookups are not used when large volumes of data are handled.
Check the lookup keys, and confirm they are correct.
Do not validate on a null field in a Transformer. Use appropriate data types for the stage
variables. Use IsNull(), IsNotNull() or Seq() for doing such validations.
In Funnel, all the input links must be hash partitioned on the sort keys.
Verify if any Transformer is set to run in sequential mode. It should be in parallel mode.
RCP should be enabled in the Copy stage before shared container.
Verify whether columns generated using the Column Generator are created using part and
part count.
In Remove Duplicate stage, ensure that the correct record (according to the requirements) is
retained.
Every database object referenced is accessed through the parameter schema name.
Always reference database object using the schema name.
Use Upper Case for column names and table names in SQL queries.
Check that the parameter values are assigned to the jobs through the sequencer.
For every Job Activity stage in a sequencer, ensure that "Reset if required, then run" is selected
where relevant.
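A minimal sketch of the landing-directory check mentioned in the checklist above; the directory names, file name and variables are illustrative assumptions, not project standards.

# Illustrative guard before archiving a landing file in an after-job script
LANDING_DIR=/data/landing        # example path
ARCHIVE_DIR=/data/archive        # example path
FILE_NAME=customers.dat          # example file name

if [ -f "$LANDING_DIR/$FILE_NAME" ]; then
    mv "$LANDING_DIR/$FILE_NAME" "$ARCHIVE_DIR/"
else
    echo "$FILE_NAME not found in $LANDING_DIR - nothing to move"
fi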

Know your DataStage Jobs Status without Director

#!/usr/bin/ksh              # Declares a Korn shell
#
# Other possible interpreters:
##!/usr/bin/sh              # Declares a Bourne shell
##!/usr/bin/bash            # Declares a Bourne-Again shell
##!/usr/bin/csh             # Declares a C shell
##!/usr/bin/tsh             # Declares a T shell
#
###########################################################
# SCRIPT:   dsjobStatus.sh                                 #
# AUTHOR:   Atul Singh                                     #
# DATE:     Jan 04, 2013                                   #
#                                                          #
# PLATFORM: AIX, HP-UX, Linux, Solaris & all *nix          #
#                                                          #
# PURPOSE:  This script takes 2 arguments and fetches the  #
#           DataStage job status and the last start and    #
#           end time of the job.                           #
###########################################################

. /opt/IBM/InformationServer/Server/DSEngine/dsenv > /dev/null 2>&1


if [[ $# -eq 2 ]]; then
    PROJECT="$1"
    JOB="$2"

    # dsjob -jobinfo reports the job status and its last start/end times
    out=`dsjob -jobinfo $PROJECT $JOB | egrep 'Job Status|Job Start Time|Last Run Time'`
    echo "$PROJECT\t$JOB\t$out"
else
    echo "Please execute the script like : $0 PROJECT_NAME JOB_NAME"
fi
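A hypothetical invocation, with placeholder project and job names; the exact output lines depend on your dsjob version.

# Example run
./dsjobStatus.sh MyProject LoadCustomerDim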
Tips & Tricks for debugging a DataStage job
This article talks about DataStage debugging techniques. These can be applied to a job which
is not producing proper output data, or
to a job that is aborting or generating warnings.
Use the Data Set Management utility, which is available in the Tools menu of the DataStage
Designer or the DataStage Manager, to examine the schema, look at row counts, and delete a
Parallel Data Set. You can also view the data itself.
Check the DataStage job log for warnings or abort messages. These may indicate an
underlying logic problem or an unexpected data type conversion. Check all the messages; PX
jobs almost always generate a lot of warnings in addition to the problem area.
Run the job with message handling (both job level and project level) disabled to find out if
there are any warnings that are unnecessarily converted to information messages or dropped
from the logs.
Enable APT_DUMP_SCORE, with which you will be able to see how different stages are
combined. Some errors/logs mention that the error is in APT_CombinedOperatorController stages.
The stages that form part of an APT_CombinedOperatorController can be found using the
dump score created after enabling this environment variable.
This environment variable causes DataStage to add one log entry which tells how stages
are combined into operators and what virtual datasets are used. It also tells how the operators are
partitioned and how many partitions are created.
One can also enable the APT_RECORD_COUNTS environment variable. Also enable
OSH_PRINT_SCHEMAS to ensure that the runtime schema of a job matches the design-time schema
that was expected.
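A minimal sketch of how these switches might be turned on, assuming they are added to dsenv or defined as project-level environment variables in the Administrator client; the values shown are illustrative.

# Illustrative debugging switches - remove them again once the problem is found
APT_DUMP_SCORE=True; export APT_DUMP_SCORE             # log how stages are combined into operators
APT_RECORD_COUNTS=True; export APT_RECORD_COUNTS       # log per-operator record counts
OSH_PRINT_SCHEMAS=True; export OSH_PRINT_SCHEMAS       # log runtime schemas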
Sometimes the underlying data contains special characters (like null characters) in the
database or files, and this can also cause trouble in the execution. If the data is in a table or
dataset, then export it to a sequential file (using a DS job). Then use the command cat -tev or
od -xc to find out the special characters.
One can also use wc -lc filename, which displays the number of lines and characters in the
specified ASCII text file. Sometimes this is also useful.
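For example, assuming the data was exported to a hypothetical file /tmp/export_check.txt, it could be inspected like this:

# Show non-printing characters, a hex/character dump, and line/character counts
cat -tev /tmp/export_check.txt | head -20
od -xc /tmp/export_check.txt | head -20
wc -lc /tmp/export_check.txt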
Modular approach: If the job is very bulky with many stages in it and you are unable to
locate the error, one option is to go for a modular approach. In this approach, one has to do the
execution step by step. E.g. if a job has 10 stages, then create a copy of the job, keep only, say, the
first 3 stages, and run the job. Check the result, and if the result is fine, then add some more
stages (maybe one or two) and run the job again. Repeat this until you are able to locate
the error.
Partitioned approach with data: This approach is very useful if the job is running fine for some
sets of data and failing for other sets of data, or failing for a large number of rows. In this approach, one
has to run the job on a selected number of rows and/or partitions using the DataStage @INROWNUM
(and @PARTITIONNUM in PX) system variables. E.g. a job when run with 10K rows works fine and is failing with 1M
rows. Now one can use @INROWNUM and run the job for, say, the first 0.25 million rows. If the first
0.25 million are fine, then run from 0.26 million to 0.5 million, and so on.
Please note, if the job is a parallel job then one also has to consider the number of partitions in the job.
Another option in such a case is to run the job on only one node (for example by setting
APT_EXECUTION_MODE to sequential, or by using a configuration file with one node).
Execution mode: Sometimes, if the partitions are confusing, one can run the job in
sequential mode. There are two ways to achieve this:
Use the environment variable APT_EXECUTION_MODE and set it to sequential mode.
Use a configuration file with only one node.
A parallel job fails and the error does not tell which row it failed for: In this case, if the job is
simple, we should try to build an equivalent server job and run it. Server jobs can report the errors
along with the rows which are in error. This is very useful when DB errors like
primary/unique key violations or any other DB error are reported by a PX job.
Sometimes when dealing with a DB, if the rows are not getting loaded as expected, adding
reject links to the DB stages can help us locate the rows with issues.
In a big job, adding some intermediate datasets/Peek stages to find out the data values at
certain levels can help. E.g. if there are 10 stages and after that the data goes to a dataset, there
may be different operations done at different stages. After 2 or 3 stages, add Peek stages or send
data to datasets using Copy stages. Check the values at these intermediate points and see
if they can shed some light on the issue.
ODBC Configuration in DataStage
To configure DataStage ODBC connections, you need to edit three files to set up the required
ODBC connections. These are:
dsenv
.odbc.ini
uvodbc.config
All three are located in the $DSHOME directory. Copies of uvodbc.config are also placed in the
project directories.
Non-wire drivers require different setup information than wire drivers. Non-wire drivers require
information about the location of the database client software. Wire drivers require information
about the database itself.

For information about configuring the ODBC environment for your specific database, see the Data
Direct Drivers Reference manual odbcref.pdf file located in the
$DSHOME/Server/branded_odbc/books/odbc directory. You should also check the ODBCREAD.ME
file in the branded_odbc directory. There is also an html file located in branded_odbc/odbchelp.
An important feature of DataStage is the ability to import table definitions from existing data
sources. Depending on the data source, users will use a combination of two methods to import
metadata into DataStage. The first method is to use the DSDB2 Plug-in Import (Import -> Table
Definitions -> Plug-In -> DSDB2). This method works great when you are dealing with a database
that has just a few hundred tables. When a database grows to 1000 tables and above, it is a
better idea to use the ODBC import, because the ODBC import allows the user to filter the tables
that they want to see, thereby removing the need to list all of the tables in the database just to
select the one that is wanted.
dsenv file
The WebSphere DataStage server has a centralized file for storing environment variables called
dsenv in $DSHOME. $DSHOME identifies the WebSphere DataStage installation directory. The
default directory is /opt/IBM/InformationServer/Server/DSEngine.
For more details please visit :
dsenv file in DataStage
Example "DSENV" file
.odbc.ini file
The .odbc.ini file gives information about connecting to the database (wire protocol drivers) or
the database client (non-wire protocol drivers). If your system uses a mix of drivers, your
.odbc.ini file will contain a mix of entry types.
For more details please visit :
.odbc.ini file in DataStage
.odbc.ini EXAMPLE file
uvodbc.config file
Use the uvodbc.config file to specify the DSNs for the databases that you are connecting to
through ODBC.
For more details please visit :
uvodbc.config File
Example "DSENV" file
Hi Friends
Here I am sharing an example of the "DSENV" file.
#!/bin/sh

####################################################################
#
# dsenv - DataStage environment file
#
# Licensed Materials - Property of IBM
# (c) Copyright IBM Corp. 1997, 2007 All Rights Reserved.
#
# This is unpublished proprietary source code of IBM Corporation.
# The copyright notice above does not evidence any actual or
# intended publication of such source code.
#
# This script is sourced by the DataStage dsrpcd daemon to establish
# proper environment settings for DataStage client connections.
#
# This script may also be sourced by bourne shells to establish
# proper environment settings for local DataStage use.
#
####################################################################
# PLATFORM SPECIFIC SECTION
set +u
#if [ -z "$DSHOME" ] && [ -f "/.dshome" ]
#then
DSHOME=`cat /.dshome`
export DSHOME
#fi
if [ -z "$DSHOME" ]
then
DSHOME=/iis/IBM/InformationServer/Server/DSEngine; export DSHOME
fi

if [ -z "$DSRPCD_PORT_NUMBER" ]
then
true
##DSRPCD_PORT_NUMBER_TAG##
fi

if [ -z "$APT_ORCHHOME" ]
then
APT_ORCHHOME=/iis/IBM/InformationServer/Server/PXEngine; export APT_ORCHHOME
fi
#if [ -z "$UDTHOME" ]
#then
UDTHOME=/iis/IBM/InformationServer/Server/DSEngine/ud41 ; export UDTHOME
UDTBIN=/iis/IBM/InformationServer/Server/DSEngine/ud41/bin ; export UDTBIN
#fi
#if [ -z "$ASBHOME" ] && [ -f "$DSHOME/.asbnode" ]
#then
ASBHOME=`cat $DSHOME/.asbnode`
export ASBHOME
#fi
#if [ -z "$ASBHOME" ]
#then
#ASBHOME=`dirname \`dirname $DSHOME\``/ASBNode
#export ASBHOME
#fi
if [ -n "$DSHOME" ] && [ -d "$DSHOME" ]
then
ODBCINI=$DSHOME/.odbc.ini; export ODBCINI
HOME=${HOME:-/}; export HOME

#LANG="<langdef>";export LANG
#LC_ALL="<langdef>";export LC_ALL
#LC_CTYPE="<langdef>";export LC_CTYPE
#LC_COLLATE="<langdef>";export LC_COLLATE
#LC_MONETARY="<langdef>";export LC_MONETARY
#LC_NUMERIC="<langdef>";export LC_NUMERIC
#LC_TIME="<langdef>";export LC_TIME
#LC_MESSAGES="<langdef>"; export LC_MESSAGES

# Added by Kev
LANG="en_US";export LANG
LC_ALL="EN_US.UTF-8";export LC_ALL
LC_CTYPE="EN_US.UTF-8";export LC_CTYPE
LC_COLLATE="EN_US.UTF-8";export LC_COLLATE
LC_MONETARY="EN_US.UTF-8";export LC_MONETARY
LC_NUMERIC="EN_US.UTF-8";export LC_NUMERIC
LC_TIME="EN_US.UTF-8";export LC_TIME
LC_MESSAGES="EN_US.UTF-8"; export LC_MESSAGES
# End of addition
# Old libpath
#
LIBPATH=`dirname $DSHOME`/branded_odbc/lib:`dirname
$DSHOME`/DSComponents/lib:`dirname $DSHOME`/DSComponents/bin:$DSHOME/lib:
$DSHOME/uvdlls:`dirname $DSHOME`/PXEngine/lib:$ASBHOME/apps/jre/bin:
$ASBHOME/apps/jre/bin/classic:$ASBHOME/lib/cpp:$ASBHOME/apps/proxy/cpp/aix-allppc_64:$LIBPATH
LIBPATH=`dirname $DSHOME`/branded_odbc/lib:`dirname
$DSHOME`/DSComponents/lib:`dirname $DSHOME`/DSComponents/bin:$DSHOME/lib:
$DSHOME/uvdlls:`dirname $DSHOME`/PXEngine/lib:$ASBHOME/apps/jre/bin:
$ASBHOME/apps/jre/bin/classic:$ASBHOME/lib/cpp:$ASBHOME/apps/proxy/cpp/aix-allppc_64:/usr/mqm/lib64:$LIBPATH
export LIBPATH

ulimit -d unlimited
ulimit -m unlimited
ulimit -s unlimited
ulimit -f unlimited
# below changed to unlimited from 1024
ulimit -n unlimited
LDR_CNTRL=MAXDATA=0x60000000@USERREGS
export LDR_CNTRL
fi
# General Path Enhancements
PATH=$PATH:/iis/IBM/InformationServer/Server/PXEngine/bin:/iis/IBM/InformationServer/Server/DSEngine/bin:/usr/mqm/bin:/var/mqm:/usr/mqm
############################
#
# Source DB2 instance
#
############################
. /home/db2/sqllib/db2profile
####################################################
# DB2 Ver 9.7 Parameter Variables Set Up for
# Multiple instance connection
####################################################
DB2INSTANCE=db2; export DB2INSTANCE
DB2DIR=/opt/IBM/db2/V9.7; export DB2DIR
INSTHOME=/home/db2; export INSTHOME
DB2CODEPAGE=1208; export DB2CODEPAGE

##########################################################
#
# Append the DB2 directories to the PATH
##########################################################
#

PATH=$PATH:$INSTHOME/sqllib:$INSTHOME/sqllib/adm:$INSTHOME/sqllib/misc:$DB2DIR/bin
export PATH
THREADS_FLAG=native; export THREADS_FLAG
LIBPATH=/usr/IBM/db2/V9.7/lib64:/usr/IBM/db2/V9.7/lib32:$LIBPATH;export LIBPATH
DATASTAGE_JRE=/iis/IBM/InformationServer/ASBNode/apps/jre; export DATASTAGE_JRE
APT_CONFIG_FILE=/iis/IBM/InformationServer/Server/Configurations/default.apt; export APT_CONFIG_FILE
Update the uvodbc.config File
The uvodbc.config file is located in the root of the project directory. (The project directory can be
determined by opening DataStage Administrator, clicking on the Projects tab, and selecting the
project.)
This step simply adds the data source to the drop-down list on the ODBC Import screen.
Telnet to the datastage server machine
Change to the project directory
Edit the uvodbc.config file
Add two lines that look like this, where you replace the word DSN with the name of the
data source:
< DSN >
DBMSTYPE = ODBC
Save the File
Example file :
[ODBC DATA SOURCES]
<localuv>
DBMSTYPE = UNIVERSE
network = TCP/IP
service = uvserver
host = 127.0.0.1
<Sybase1>
DBMSTYPE = ODBC
<Sybase2>

DBMSTYPE = ODBC
<Oracle8>
DBMSTYPE = ODBC
<Informix>
DBMSTYPE = ODBC
Setting environment variables for the parallel engine in DataStage
You set environment variables to ensure smooth operation of the parallel engine. Environment
variables are set on a per-project basis from the Administrator client.
Procedure
1. Click Start > All Programs > IBM Information Server > IBM InfoSphere DataStage and
QualityStage Administrator, and log in to the Administrator client.
2. Click the Project tab, and select a project.
3. Click Properties.
4. On the General tab, click Environment.
5. Set the values for the environment variables as necessary.

Environment variables for the parallel engine


Set the listed environment variables depending on whether your environment meets the
conditions stated in each variable.
Network settings
1. APT_IO_MAXIMUM_OUTSTANDING

If the system connects to multiple processing nodes through a network, set the
APT_IO_MAXIMUM_OUTSTANDING environment variable to specify the amount of memory, in
bytes, to reserve for the parallel engine on every node for TCP/IP communications. The default
value is 2 MB.
If TCP/IP throughput at that setting is so low that there is idle processor time, increment it by
doubling the setting until performance improves. If the system is paging, however, or if your job
fails with messages about broken pipes or broken TCP connections, the setting is probably too
high.
2. APT_RECVBUFSIZE

If any of the stages within a job has a large number of communication links between nodes,
specify this environment variable with the TCP/IP buffer space that is allocated for each
connection. Specify the value in bytes.

The APT_SENDBUFSIZE and APT_RECVBUFSIZE values are the same. If you set one of these
environment variables, the other is automatically set to the same value. These environment
variables override the APT_IO_MAXIMUM_OUTSTANDING environment variable that sets the total
amount of TCP/IP buffer space that is used by one partition of a stage.
3. APT_SENDBUFSIZE

If any of the stages within a job has a large number of communication links between nodes,
specify this environment variable with the TCP/IP buffer space that is allocated for each
connection. Specify the value in bytes.
The APT_SENDBUFSIZE and APT_RECVBUFSIZE values are the same. If you set one of these
environment variables, the other is automatically set to the same value. These environment
variables override the APT_IO_MAXIMUM_OUTSTANDING environment variable that sets the total
amount of TCP/IP buffer space that is used by one partition of a stage.
4. Transform library

If you are working on a non-NFS MPP system, set the APT_COPY_TRANSFORM_OPERATOR
environment variable to true to enable Transformer stages to work in this environment. IBM
InfoSphere DataStage and QualityStage Administrator users must have the appropriate
privileges to create project directory paths on all the remote nodes at runtime. This environment
variable is set to false by default.
5. Job monitoring

By default, the job monitor uses time-based monitoring in the InfoSphere DataStage and
QualityStage Director. The job monitor window is updated every five seconds. You
can also specify that the monitoring is based on size; for example, the job monitor window is
updated based on the number of new entries. To base monitoring on the number of new entries,
set a value for the APT_MONITOR_SIZE environment variable. If you override the default setting
for APT_MONITOR_TIME, the setting of the APT_MONITOR_SIZE environment variable is also
overridden.

6. Detailed information about jobs

To produce detailed information about jobs as they run, set the APT_DUMP_SCORE value to True.
By default, this environment variable is set to False.
7. C++ compiler

The environment variables APT_COMPILER and APT_LINKER are set at installation time to point to
the default locations of the supported compilers. If your compiler is installed on a different
computer from the parallel engine, you must change the default environment variables for every
project by using the Administrator client.
8. Temporary directory

By default, the parallel engine uses the C:\tmp directory for some temporary file storage. If you
do not want to use this directory, assign the path name to a different directory by using the
environment variable TMPDIR.
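As a sketch only, such variables can also be exported engine-wide in dsenv instead of being set per project; the values and paths below are illustrative assumptions, not recommendations from this document.

# Illustrative settings - tune to your own environment before use
APT_IO_MAXIMUM_OUTSTANDING=4194304; export APT_IO_MAXIMUM_OUTSTANDING    # 4 MB per node for TCP/IP
APT_COPY_TRANSFORM_OPERATOR=true; export APT_COPY_TRANSFORM_OPERATOR     # non-NFS MPP systems only
APT_MONITOR_SIZE=5000; export APT_MONITOR_SIZE                           # size-based job monitoring
TMPDIR=/iis/scratch/tmp; export TMPDIR                                   # alternative temporary directory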

How to deploy a configuration file in DataStage


Hi Friends
Now, how do we deploy/apply the configuration file?

Deploying the new configuration file


Now that you have created a new configuration file, you use this new file instead of the default file. You use the
Administrator client to deploy the new file. You must have DataStage Administrator privileges to use the Administrator client
for this purpose.

To deploy the new configuration file:
1. Select Start > Programs > IBM Information Server > IBM WebSphere DataStage and QualityStage Administrator.
2. In the Administrator client, click the Projects tab to open the Projects window.
3. In the list of projects, select the tutorial project that you are currently working with.
4. Click Properties.
5. In the General tab of the Project Properties window, click Environment.
6. In the Categories tree of the Environment variables window, select the Parallel node.
7. Select the APT_CONFIG_FILE environment variable, and edit the file name in the path name under the Value column
heading to point to your new configuration file.

You deployed your new configuration file.

Applying the new configuration file


Now you run the sample job again.
You will see how the configuration file overrides other settings in your job design.

To apply the configuration file:
1. Open the Director client and select the sample job.
2. Click the << button to reset the job so that you can run it again.
3. Run the sample job.
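Where scripting is preferred over the Director client, a hedged sketch of roughly equivalent dsjob calls (project and job names are placeholders) might look like this, after sourcing dsenv:

# Reset a previously aborted run, then run the job and wait for its status
dsjob -run -mode RESET MyProject SampleJob
dsjob -run -jobstatus MyProject SampleJob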

How to Create a Configuration File in DataStage


Hi Friends
Configuration files play a most important role in parallel DataStage jobs. This is the file that provides the parallel
environment for a job to run in parallel.
Today I am going to share how to create/edit the configuration file.

Creating a configuration file


Open the Designer client and follow the steps below:
1. Select Tools > Configurations to open the Configurations editor.
2. Select default from the Configurations list to open the default configuration file. If you want to edit it, copy the content
into a new file and edit it there. Please do not try to edit the default file directly.
3. From the drop-down, select the NEW category to create a new configuration file.
4. Type/paste the new content of the configuration file here.
5. Click Check to ensure that your configuration file is valid.
6. In the Configuration name field of the Save Configuration As window, type a name for your new configuration. For example,
type node4.
7. Click Save and select Save configuration from the menu.
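For step 4, a sketch of what the pasted content might look like for a four-node configuration on a single SMP server; the host name and resource paths are placeholders, not values from this article.

{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/iis/data/node1" {pools ""}
        resource scratchdisk "/iis/scratch/node1" {pools ""}
    }
    node "node2"
    {
        fastname "etlserver"
        pools ""
        resource disk "/iis/data/node2" {pools ""}
        resource scratchdisk "/iis/scratch/node2" {pools ""}
    }
    node "node3"
    {
        fastname "etlserver"
        pools ""
        resource disk "/iis/data/node3" {pools ""}
        resource scratchdisk "/iis/scratch/node3" {pools ""}
    }
    node "node4"
    {
        fastname "etlserver"
        pools ""
        resource disk "/iis/data/node4" {pools ""}
        resource scratchdisk "/iis/scratch/node4" {pools ""}
    }
}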

Must know for an UNIX programmer- Some tips


This article gives an overview of commands which are very useful for executing complex tasks in a simple manner in a UNIX
environment. It can serve as a reference during critical situations.
Multi line comments in Shell program

As such, the shell does not provide any multi-line commenting feature. However, there is a workaround. To comment a block of
lines, we can enclose the statements to be commented between : ' and '. Below is an example of the same.
: '
This is a multi line comment that
does not include a single quote in the content.
'

However, the above syntax works only when you don't have a single quote in the content. In order to circumvent that problem
one can use the HERE document for multi-line comment as given below:
: <<COMMENT
This is a multi line comment that
does not include a single quote in the content.
COMMENT
The literal COMMENT is just indicative and one can use any syntactically valid literal marking the start and end of comment.

Deleting blank lines from the file


The sed (viz. Stream Editor) command can be used to delete the empty lines from text files.
sed -e '/^$/d' testfile
Here it is assumed that testfile contains some empty lines, which are being removed.

Making control characters visible in a script


When we open a file using a standard Unix editor like vi, the control characters present in the file are not visible and so, in
several scenarios, the output is not as expected. Again, the sed command can come in handy in such scenarios.
sed -n 'l' testfile

List all the child process of a given parent process ID


In Unix, a script can call many other scripts and hence a parent process can have many child processes. To know the order in
which the scripts are called and the set of child processes for a particular parent process, the pstree command can be used:
pstree -p 10001
where the process ID of the parent process is 10001. This command helps in identifying all the forked child processes of
any script.

Finding the IP address using the hostname


Quite often we are required to know the IP address of the machine that we need to work upon. It could also be the other way
round, i.e. finding the host name of the server whose IP address is known. The command nslookup comes to the rescue in
such situations. Assume that one needs to find the IP address of the machine with hostname 'testserver.in.ibm.com'; then the
command would look like:
nslookup testserver.in.ibm.com

Upon executing the above command, the nslookup consults the Domain Name Servers and fetches the IP addresses of the
given hostname.

Changing the timestamp on a file to a past date


Whenever a file is created, it takes the timestamp of the current system time. To change the timestamp of files to a past date,
let's say 10 years back, the touch command is a boon. This was very useful for my project to execute a scenario where we
needed to delete a particular kind of file that was one year old. These files were part of a newly created project and
apparently we didn't have this kind of file on the server. We changed the timestamp of the files to a past date, executed the test cases
and completed testing.
Let us take an example where we would like to update the date of the file testfile to 14 Sep 2010 01:12:34; then the
command would be as follows:
touch -t 201009140112.34 testfile
where the second argument to the command represents the timestamp in the format
<year><month><day><hour><minutes>.<seconds>.
As seen in all the commands above, there are good utilities available on the UNIX operating system which allow the user to do
complex tasks with the least difficulty. By using these commands we can certainly reduce the task time and reduce delays
during project deliverables.
