The standards and guidelines for the use of the Analytical Pattern's ETL tool, IBM InfoSphere DataStage.
1. Overview
The purpose of this document is to define and direct the use of the IBM InfoSphere
DataStage product for the development and deployment of ETL processing, primarily for analytical
systems at Ford Motor Company.
This document lays out standards, guidelines and tips for DataStage development.
A. Standard
A standard is a rule. Adhering to a standard is strictly mandated.
B. Guideline
A guideline is a good practice to follow. Following a guideline is recommended, but not mandatory.
C. Tip
A tip is neither a standard nor a guideline; it is a suggestion to make things easier and faster for a
developer.
Definition of ETL
ETL is defined as a set of software tools and related processes used to Extract (E) data
from a source, perform any Transformations (T) or cleansing, and Load (L) the result into a target
database. This is the primary process by which source data from transactional systems is prepared
for storage and end-user delivery.
These processes generally read data from an input source (flat file, relational table,
message queue, etc.). This is known as the Extraction. Extractions from the input source can
become complex when joining two or more relational tables together to create an input source.
Once sources are extracted, the data is passed through either an engine or a code-based
process to modify, enhance, or eliminate data elements based on the business rules identified.
This is known as the Transformation. Transformations can become very complex, depending on
the business rules identified as well as the sorting/merging of data that occurs during
processing.
Though it may seem simple on the surface, this layer of the architecture can get very
complex very quickly based on the number of source feeds being managed, the quality of the
data being provided, the requirements to standardize the data, the business rules identified to
transform the data, and the management of errors/rejects.
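As a rough illustration of the Extract/Transform/Load pattern described above, consider the minimal Python sketch below. The file names, column names, and business rules are hypothetical, and a real job would use DataStage stages rather than hand-written code.

    import csv

    def extract(path):
        # Extract: read rows from a flat-file source.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transform: apply business rules -- here, reject rows missing a
        # key and standardize a code field (both rules are hypothetical).
        for row in rows:
            if not row.get("part_no"):
                continue  # reject: fails a business rule
            row["plant_code"] = row["plant_code"].strip().upper()
            yield row

    def load(rows, path):
        # Load: write the cleansed rows to the target (a file stands in
        # for a database table in this sketch).
        rows = list(rows)
        if not rows:
            return
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

    load(transform(extract("source.csv")), "target.csv")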
The DataStage environment established for Ford Motor Company is a result of the
definition and directions set forth by the Analytical Pattern and in keeping with the concepts of the
Corporate Information Factory (CIF) framework. (See Appendix A).
A. The powerful ETL solution supports the collection, integration and transformation
of large volumes of data, with data structures ranging from simple to highly complex.
IBM InfoSphere DataStage manages data arriving in real-time as well as data
received on a periodic or scheduled basis.
B. The scalable platform enables companies to solve large-scale business problems
through high-performance processing of massive data volumes. By leveraging the
parallel processing capabilities of multiprocessor hardware platforms, IBM
InfoSphere DataStage can scale to satisfy the demands of ever-growing data
volumes, stringent real-time requirements, and ever shrinking batch windows.
C. Comprehensive source and target support for a virtually unlimited number of
heterogeneous data sources and targets in a single job includes text files; complex
data structures in XML; ERP systems such as SAP and PeopleSoft; almost any
database (including partitioned databases); web services; and business intelligence
tools like SAS.
D. Real-time data integration support captures messages
from Message Oriented Middleware (MOM) queues using JMS or WebSphere MQ
adapters to seamlessly combine data into conforming operational and historical
analysis perspectives. IBM InfoSphere Information Services Director provides a
service-oriented architecture (SOA) for publishing data integration logic as shared
services that can be reused across the enterprise. These services are capable of
simultaneously supporting high-speed, high reliability requirements of transactional
processing and the high volume bulk data requirements of batch processing.
E. Advanced maintenance and development enables developers to maximize
speed, flexibility and effectiveness in building, deploying, updating and managing
their data integration infrastructure. Full data integration reduces the development
and maintenance cycle for data integration projects by simplifying administration
and maximizing development resources.
F. Complete connectivity between any data source and any application ensures
that the most relevant, complete and accurate data is integrated and used by the
most popular enterprise application software brands, including SAP, Siebel, Oracle,
and PeopleSoft.
G. Flexibility to perform information integration directly on the mainframe. InfoSphere
DataStage for Linux on System z provides:
a. The ability to leverage existing mainframe resources in order to maximize the value of
your IT investments
b. The scalability, security, manageability and reliability of the mainframe
c. The ability to add mainframe information integration workload without added z/OS
operational costs
Function              Description                                              Tool
Data Transformation   Tools that provide for the processing and                IBM InfoSphere DataStage
                      manipulation of data
Source Code Control   Check in/out source code executables to control          Accurev
                      and track changes
Batch Scheduling      Tools for end-to-end scheduling and monitoring           ETL Server - AutoSys
                      of batch jobs
a. The application team has to follow the process outlined in Step A. The lead
time is 2 weeks for each environment. In an ideal situation, as soon as the
SOW is approved, the DataStage team will create a project within 2 weeks.
(Time delays must be accounted for while the file systems are created on
the server by the server hosting teams.)
b. The QA and PROD projects will be created upon request; the application teams
are advised to keep the DataStage administration team in the loop and to plan
well ahead. Please allocate sufficient time for QA testing before
requesting PROD.
c. The application team should come up with 3-character and 4-character acronyms for the
project (refer to the "Project Names" section).
d. See Appendix B for more details.
A new DataStage user requires the creation of a UNIX userid on the DataStage
server and the placement of that user into the appropriate UNIX group for access to the
proper DataStage project.
The changes are captured, evaluated and incorporated into the existing standards and
guidelines if approved. The change process flow is as below:
1. A representative from the Product Line and/or Application team can submit a request to revise
a standard or guideline to the CoE by submitting a request in the CHANGE LOG:
https://proj.sp.ford.com/sites/BIDSETLCoE/Lists/CHANGE%20LOG/AllItems.aspx
2. The requester must attach a supporting document to the change request above.
3. The requester will then be invited to the next weekly CoE meeting, where the change will be
discussed by the CoE members present.
4. The CoE will then designate who communicates the change to the community.
5. It is recommended that the representative from the Product Line and/or Project be a CoE
member.
8. Compliance to Standards
Within a systems development community, standards help teams to better read,
troubleshoot, and understand code from project to project. It is recognized that as standards
evolve and change, this eventually results in non-compliance for all teams. Professional
judgment needs to be applied when complying with standards, and exceptions should be sought,
with business justification, only if truly needed.
Policy Statements
1. When a new job or module is created, it must comply with current standards.
2. When an existing job or module is updated and/or modified, it must comply with current
standards.
3. Jobs or modules that are not being touched are not required to be brought up to current
standards. It is recognized that the benefit of this does not always equal the effort. It is
encouraged that they get updated as time or opportunity allows, especially if the result of those
jobs or modules will be tested in the release.
4. Even the best design & code reviews will occasionally miss the enforcement of a standard.
This does not grant an automatic exception if caught in subsequent reviews. Such errors should
be corrected per the above compliance standards.
A. Exceptions
1. A representative from the Product Line and/or Application team can submit a request for an
exception to a standard to the CoE by submitting a request in the EXCEPTION LOG.
2. The requester will then be invited to the next weekly CoE meeting, where the exception will be
discussed by the CoE members present.
3. All exceptions are time-bound, with a maximum duration of 36 months. The CoE will track each
exception in an Exception Log.
4. It is recommended that the representative from the Product Line and/or Project be a CoE
member.
5. It is the responsibility of the Application team to share any approved exceptions during a
Design or Code review. The DataStage team member will verify that the exception is current in the
Exception Log. The non-compliance must still be documented in the Design or Code review, but
the exception should then be noted, with no requirement to make the update for the review
comment.
9. Security
DataStage jobs, when run as batch jobs from Linux, use the credentials stored in the
password file in the security directory of the project. If personal IDs are stored in the password
file, all project team members would use those credentials to run DataStage jobs, which is a
security violation. The security standards and guidelines below ensure that no such
violation occurs.
A. Standard
Do not use personal database IDs in the password files in the security directory of the
project.
B. Guideline
It is recommended to use generic database IDs in all three environments (DEV, QA
and PROD) when running DataStage batch jobs from Linux. If you choose to follow
other compensating controls, those controls need to be documented in the
product line ACR and reviewed with the SCC of the product line.
It is allowed to use one's own personal database ID when running DataStage jobs from
the DataStage client, or to access data outside of the DataStage environment, provided the ID
and password are secured when accessing data and the product line approves this usage.
b. TSNK_PROD_proj
1. T = Corporate Defined Identifier
2. SNK = Application Identifier
3. _ = constant
4. PROD = Environment
5. _proj = constant suffix
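A minimal sketch of how this naming pattern could be composed and checked programmatically. The helper function, the environment list, and the regular expression below are illustrative, not part of the standard.

    import re

    # One-letter corporate identifier, three-character application
    # identifier, underscore, environment, and the "_proj" suffix.
    PROJECT_NAME = re.compile(r"^[A-Z][A-Z0-9]{3}_(DEV|QA|PROD)_proj$")

    def project_name(corp, app, env):
        return f"{corp}{app}_{env}_proj"

    name = project_name("T", "SNK", "PROD")  # -> "TSNK_PROD_proj"
    assert PROJECT_NAME.match(name)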
Job Categories are used to organize the Jobs folder. They are optional
depending on the number of jobs in the project. It is however a good practice to
segregate jobs into categories in order to enhance the organization of the project.
Format Example A:
a. 1-2 Category Identifier
b. 3-3 Always an Underscore
c. 4-18 Freeform alphanumeric
MA_MaterialAccounting
a. MA = Category Identifier
b. _ = constant
c. MaterialAccounting = freeform alphanumeric
Format Example B:
d. 1-3 Category Identifier
e. 4-4 Always an Underscore
f. 5-18 Freeform alphanumeric
850_sync_metrics
d. 850 = Category Identifier
e. _ = constant
f. sync_metrics = freeform alphanumeric
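Both category formats can be checked mechanically. The regular expressions below are an illustrative sketch, not part of the standard, and the position-based length limits are not enforced here.

    import re

    # Format A: 2-character category identifier, underscore, free-form text.
    FORMAT_A = re.compile(r"^[A-Za-z0-9]{2}_[A-Za-z0-9]+$")
    # Format B: 3-character category identifier, underscore, free-form text
    # (underscores allowed in the free-form part).
    FORMAT_B = re.compile(r"^[A-Za-z0-9]{3}_[A-Za-z0-9_]+$")

    assert FORMAT_A.match("MA_MaterialAccounting")
    assert FORMAT_B.match("850_sync_metrics")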
1-4 First four of the DataStage Project Name (Ex MCKS, TSNK)
5-9 Free Form Application Standard Text. Different applications group and categorize
jobs in different ways; this area is for that purpose. This should be standard within an
application.
Here are different examples of how this area can be used
Ex A:
5-7 App Group - Function
LOD - Load job
UNL - Unload Job
FTP - FTP Job
EDT - Edit Job
MSC - Miscellaneous Job
8-9 AppCategory - unique 2 character subject area identifier
SL - Sales
SV - Service
CO - Common
SM - Smart Vincent
Ex B:
5-7 App Group - Source
GCP - GCP source related
OTG - OTG source related
INT - Integrator source related
PTM - PTMS source related
8-9 AppCategory - Relation
IS - Issue related
ST - Static data related
PA - Part related
SP - Supplier related
BY - Buyer related
Ex C:
5-7 App Group - Source
GCP - GCP source related
Ex D:
5-8 App Group - Source
GCPP - GCP source related
OTGG - OTG source related
INTT - Integrator source related
PTMS - PTMS source related
9 AppCategory - Inbound/Outbound
I - Inbound
O - Outbound
10 - Component Type
C = Container
G = Custom Stage
J = Executable Job
R = Custom Routine
S = Sequence Job (Also called "Control Jobs". These jobs control the execution of
"Executable Jobs"; they do not contain any data flows.)
T = Template Job (The purpose of "Template Jobs" is to provide a consistent pattern
for the beginnings of other jobs.)
Sample A:
MCKSLODSYJ850LoadCustPrefTbl
MCKS = Corporate Defined Identifier and Application Identifier.
LOD = Application Group.
SY = Category Identifier.
J = Executable Job.
850LoadCustPrefTbl = Freeform alphanumeric/underscores.
Sample B:
FEDWCKSIBJUnloadCKSData
FEDW = Corporate Defined Identifier and Application Identifier.
CKS = Application Group
IB = Category Identifier
J = Executable Job.
UnloadCKSData = Freeform alphanumeric/underscores
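A sketch of how a job name following the sample layout above could be decomposed. The parse_job_name helper is hypothetical; the 4+3+2+1 offsets assume the two-character category form, and Ex D's four-character group would shift them.

    COMPONENT_TYPES = {"C": "Container", "G": "Custom Stage",
                       "J": "Executable Job", "R": "Custom Routine",
                       "S": "Sequence Job", "T": "Template Job"}

    def parse_job_name(name):
        return {
            "project":        name[0:4],   # e.g. MCKS
            "app_group":      name[4:7],   # e.g. LOD
            "category":       name[7:9],   # e.g. SY
            "component_type": COMPONENT_TYPES[name[9]],
            "description":    name[10:],   # free-form
        }

    print(parse_job_name("MCKSLODSYJ850LoadCustPrefTbl"))
    # {'project': 'MCKS', 'app_group': 'LOD', 'category': 'SY',
    #  'component_type': 'Executable Job',
    #  'description': '850LoadCustPrefTbl'}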
D. Invocation ID guidelines
1. Create a job parameter to be used for supplying the Invocation ID at run time.
2. The Invocation ID should be something pertinent to an instance of the run such
as:
a. Batch Id
b. Plant Code
c. Run Number
See Appendix E for a list of suggested abbreviations for Stages and Links.
1. Name ALL stages and links. DO NOT accept the default names.
2. Use descriptive names to differentiate between reference, reject, and stream
(input/output) links for a specified stage.
3. Avoid using in/input or out/output in link names, since links join multiple
stages. A link used for output of one stage may be the input to the following
stage.
4. Start with a letter; can contain alphanumeric and underscores.
5. Use underscores as word separators.
6. Keep length to < 33 characters. If a longer name is required please check with
your lead developer and DataStage -CoE.
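Rules 4 through 6 can be expressed as a single regular expression. The following is an illustrative sketch, not an official validator.

    import re

    # Rules 4-6: starts with a letter, alphanumerics and underscores
    # only, fewer than 33 characters in total.
    VALID_NAME = re.compile(r"^[A-Za-z]\w{0,31}$")

    assert VALID_NAME.match("lnk_customer_ref")
    assert not VALID_NAME.match("1_bad_name")  # must start with a letter
    assert not VALID_NAME.match("x" * 40)      # too long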
2. If you have a lengthy ETL process that you wish to restart, you need to break the
process into multiple jobs controlled by a job sequence. The process
can then be restarted at the beginning of any of the jobs within the job
sequence. Intermediate results should be landed to Parallel DataSets,
unless the results need to be shared with processes outside of
DataStage.
3. Identify and design the common modules first.
4. Lay out the file and table structures, including the touch points for each module, to
ease development and integration. Make sure the entire team is involved when
determining the input and output layouts for each module.
5. Import record layouts and table schemas from the appropriate metadata source.
Developers should not key in columns and column metadata into their jobs.
6. For interim files used between DataStage jobs (not shared with other
applications), record layouts should be created within the tool and shared by all
DataStage jobs referencing the file.
7. Check all field characteristics (type, length, null/not null) in file layouts and data
models to ensure they match. DataStage is very particular about ensuring that
these match.
8. Developers need complete knowledge of the ETL tool set and of the applicability of
the different tools, to make system integration smooth (e.g., where to use DataStage,
PL/SQL, Autosys, etc.).
9. When planning job flows, use flowcharts to lay out the jobs, their data sources,
intermediate files, and target files; this assists in determining order of execution,
concurrency, and dependencies between jobs.
B. Development
11. The order of the stage variables within a transformer stage is critical if one stage
variable refers to another. Execution happens from top to bottom.
12. Job parameterization allows a single job design to process similar logic instead of
creating multiple copies of the same job. The Multiple-Instance job property
allows multiple invocations of the same job to run simultaneously.
13. Avoid using a Copy or Filter stage immediately before or after a Transformer
stage; a Transformer does the job of both a Copy stage and a Filter stage, so the
extra stage only reduces efficiency.
14. Make use of an ORDER BY clause when a DB stage is used in a join. The
intention is to use the database's power for sorting instead of DataStage
resources.
15. It is recommended to use the DB2 connector instead of the legacy DB2 stages. The
legacy DB2 stages are deprecated and there will not be any long-term
support for issues related to them; the DB2 connector, by contrast, is the emerging
stage type in IBM DataStage.
16. It is recommended to avoid setting "Unicode" in the extended attributes of the
Teradata connector stage unless it is intended for specific purposes such as
multi-language support. During the 8.5 migration, several applications noticed data
issues when "Unicode" was set in the extended attributes of the Teradata connector
stage unintentionally.
17. It is recommended to give meaningful names to columns in DataStage jobs.
Avoid names such as "t" for column names; some product lines faced
issues when a column was named "t", as this happens to be a DataStage internal
keyword as well.
C. Debugging
1. Use Peek stages in PX to view stage results; a sketch of the idea follows this list.
The number of rows to peek can be set to 1, but not to zero. If there are not too many
Peeks in your job, you can set the "Number of records" property to 1 before promoting
the job from the test environment. Determine the impact on the log if too many Peek
stages remain in production jobs, even if each displays only a single record.
2. Use debug stages such as Column Generator and Row Generator to generate any
data needed.
3. Use performance statistics in Director log and in Designer to view stage results.
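Outside the Designer canvas, the Peek idea can be pictured in a few lines of Python, as sketched below. The helper is illustrative only and not a DataStage API.

    def peek(rows, label, n=1):
        # Pass rows through unchanged, logging the first n -- the same
        # idea as a Peek stage with "Number of records" set to 1.
        for i, row in enumerate(rows):
            if i < n:
                print(f"[peek:{label}] {row}")
            yield row

    for row in peek([{"id": 1}, {"id": 2}], label="after_transform"):
        pass  # downstream stages would consume the rows here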
D. Performance
5. Minimize the number of Transformers. For data type conversions, renaming and
removing columns, other stages (for example, Copy, Modify) might be more
appropriate. Note that unless dynamic (parameterized) conditions are required, a
Transformer is always faster than a Filter or Switch stage. Avoid using the BASIC
Transformer, especially in large-volume data flows.
6. Use BuildOps only when existing Transformers do not meet performance
requirements, or when complex reusable logic is required. Because BuildOps are
built in C++, there is greater control over the efficiency of code, at the expense of
ease of development (and more skilled developer
requirements).
7. Minimize the number of partitioners in a job. When possible, ensure data is as
evenly distributed as possible.
8. Minimize and combine use of Sorts where possible.
Subdirectories under each of these main directories are allowed based on project
needs.
F. Job Parameters
1. Create parameters for all items that can vary by environment (dev, test, QA,
Prod). Some of those items are:
a. Database Objects such as table names and database names
b. Source and Target connection string information such as userids, passwords,
and server names
2. Establish a naming standard within the project for repeatedly used parameters to
ensure consistency and ease of maintenance, such as:
a. dbname = database name
b. orauser = Oracle userid
c. orapass = Oracle password
d. tduser = Teradata userid
e. tdpass = Teradata password.
3. Parameters should be stored in a file or files that can be accessed at job execution time.
(A parsing sketch for one of these formats follows this list.)
a. If a parameter contains a password, the password must be in the security
folder. Establish a consistent format for all parameter files for your
application. Whichever method you choose will require the appropriate
handling of the values:
i. String of names and values separated by a delimiter
1. parametername,parametervalue,parametername,parametervalue
ii. Strings of values separated by a delimiter
1. parametervalue,parametervalue,parametervalue
iii. One parameter name and value per record
1. parametername=parametervalue
4. Parameter sets simplify the management of job parameters by providing a central
location for managing large lists of parameters, and they offer a convenient way of
defining multiple sets of values in the form of value files.
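A sketch of reading the "one parameter name and value per record" format (format iii above). The file name and the "#" comment convention are assumptions.

    def read_parameters(path):
        # Parse lines of the form parametername=parametervalue.
        params = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                name, _, value = line.partition("=")
                params[name.strip()] = value.strip()
        return params

    # e.g. a file containing "dbname=SALESDB" yields {'dbname': 'SALESDB'}
    params = read_parameters("job_params.txt")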
This section provides some general guidelines for several new stages and new
functionality provided by Infosphere Datastage 8.5.
1. Join stage:
When joining tables in the same database or on the same database
server, the join should be done by the DBMS and not by the DataStage Join stage.
2. Lookup stage:
a) Restrict use to small look-up tables, ones that can be stored in physical
memory. The number of rows that can be used depends on the number of
columns, their data types (and associated storage lengths), and the available
memory.
b) Avoid single-select (Sparse option) lookups against DB2 and Oracle when
the number of master records times the number of lookups exceeds 1
million. Instead, use the RDBMS in the Lookup stage with the Normal option
to pre-load the reference data into memory. When dealing with a flat file or
Teradata table, the data is always loaded into memory for processing (the
Sparse option is not available with Teradata reference data). Use the Lookup
stage when the lookup data set is small and contains no duplicate keys;
otherwise, use a Join or Merge.
c) Limit the use of database Sparse Lookups to scenarios where the number of
input rows is significantly smaller (for example, 1:100 or more) than the
number of reference rows, or when performing exception processing.
d) When the lookup source is a table and the master file contains only a small
number of distinct lookup keys, consider using the Sparse option. Otherwise,
the entire table is read into memory.
e) The Lookup stage will drop records from the master file where the key
contains null values, without displaying a message.
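The in-memory behavior described in points (a) and (e) can be pictured with a plain dictionary lookup. This is an illustrative sketch only; here unmatched rows are also dropped, as with an inner lookup.

    def lookup_join(master_rows, reference_rows, key):
        # Reference data is loaded fully into memory, as with the Normal
        # option, so it must be small and free of duplicate keys.
        ref = {r[key]: r for r in reference_rows}
        for row in master_rows:
            k = row.get(key)
            if k is None:
                continue  # mirrors (e): null keys are dropped silently
            if k in ref:
                yield {**row, **ref[k]}

    master = [{"part": "A1"}, {"part": None}, {"part": "B2"}]
    reference = [{"part": "A1", "desc": "bracket"}]
    print(list(lookup_join(master, reference, "part")))
    # -> only the A1 row survives; null-key and unmatched rows drop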
3. Funnel stage:
In previous releases of DataStage, the Funnel stage could block job performance
when the number of source rows on input links was unbalanced. DataStage v7
corrects this behavior with the continuous funnel type. New jobs created in
DataStage v7 will default to the continuous funnel type. When importing jobs
created in earlier versions, ensure that the funnel type is set to continuous
funnel. The Funnel stage requires all the input streams to have the same metadata.
This is strictly enforced from version 8.7 onward by throwing a compilation error
when the metadata differs across input streams.
The Pivot Enterprise stage is a processing stage that pivots data horizontally and
vertically. Horizontal pivoting maps a set of columns in an input row to a single column in
multiple output rows. Vertical pivoting maps a set of rows in the input data to single or
multiple output columns.
a. Specifying a horizontal pivot operation and mapping output columns
You can specify the horizontal pivot operation and then map the resulting
columns onto output columns. Use horizontal pivot operations to map a set
of input columns into multiple output rows.
3. Open the Pivot Enterprise stage and click the Properties tab.
4. Specify the Pivot type as Horizontal on the Pivot Action tab.
5. Specify the horizontal pivot operation on the Pivot Definitions tab of
the Stage page by doing the following tasks:
a. In the Name field, type the name of the output column that
will contain the pivoted data (the pivot column).
b. Specify the SQL type and, if necessary (for example if the
SQL type is decimal), the length and scale for the pivoted
data.
c. Double-click the Derivation field to open the Column
Selection window.
d. In the Available Columns list, select the columns that you
want to combine in the pivot column.
e. Click the right arrow to move the selected column to the
Selected Columns list.
f. Click OK to return to the Pivot tab.
g. If you want the stage to number the pivoted rows by
generating a pivot index column, select Pivot Index.
6. Click the Output page to go to the Mapping tab.
7. Specify your output data by doing the following tasks:
a. Select your pivot column or columns from the Columns table
in the right pane and drag them over to the output link table
in the left pane.
b. Drag all the columns that you want to appear in your output
data from the available columns on the left side of the
window to the available output columns on the right side of
the window.
c. Click the Columns tab to view the data that the stage will
output when the job runs.
8. Click OK to save your pivot operation and close the Pivot Enterprise
stage.
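Stripped of the GUI steps, the horizontal pivot amounts to the following transformation. This is a Python sketch with hypothetical column names, not the stage's implementation.

    def horizontal_pivot(rows, pivot_cols, pivot_name, index=False):
        # Map a set of columns in each input row to a single column
        # spread across multiple output rows.
        for row in rows:
            keep = {k: v for k, v in row.items() if k not in pivot_cols}
            for i, col in enumerate(pivot_cols):
                out = {**keep, pivot_name: row[col]}
                if index:
                    out["pivot_index"] = i  # the optional Pivot Index
                yield out

    rows = [{"cust": "C1", "q1_sales": 100, "q2_sales": 150}]
    print(list(horizontal_pivot(rows, ["q1_sales", "q2_sales"], "sales",
                                index=True)))
    # [{'cust': 'C1', 'sales': 100, 'pivot_index': 0},
    #  {'cust': 'C1', 'sales': 150, 'pivot_index': 1}]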
b. Specifying a vertical pivot operation and mapping output columns
You can configure a pivot operation to vertically pivot data and then map
the resulting columns onto output columns; the procedure parallels the
horizontal case, with the Pivot type set to Vertical.
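The vertical case is the inverse of the horizontal sketch. A sketch under the same assumptions follows; the output column naming scheme is hypothetical.

    from collections import defaultdict

    def vertical_pivot(rows, group_key, value_col, width):
        # Map a set of input rows (grouped by key) onto multiple
        # output columns.
        groups = defaultdict(list)
        for row in rows:
            groups[row[group_key]].append(row[value_col])
        for key, values in groups.items():
            out = {group_key: key}
            for i in range(width):
                out[f"{value_col}_{i + 1}"] = values[i] if i < len(values) else None
            yield out

    rows = [{"cust": "C1", "sales": 100}, {"cust": "C1", "sales": 150}]
    print(list(vertical_pivot(rows, "cust", "sales", width=2)))
    # [{'cust': 'C1', 'sales_1': 100, 'sales_2': 150}]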
The SCD stage reads source data on the input link, performs a
dimension table lookup on the reference link, and writes data on the output link.
The output link can pass data to another SCD stage, to a different type of
processing stage, or to a fact table. The dimension update link is a separate
output link that carries changes to the dimension. You can perform these steps in
a single job or a series of jobs, depending on the number of dimensions in your
database and your performance requirements.
SCD stages support both SCD Type 1 and SCD Type 2 processing. SCD
Type 1 overwrites an attribute in a dimension table, whereas SCD Type 2 adds a
new row to the dimension table.
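The difference between the two types can be sketched as follows. Column names such as effective_date and expiry_date are assumptions for illustration, not the stage's actual fields.

    from datetime import date

    def scd_type1(dim_rows, key, attrs):
        # Type 1: overwrite the attributes in place; no history is kept.
        for row in dim_rows:
            if row["key"] == key:
                row.update(attrs)

    def scd_type2(dim_rows, key, attrs, today=None):
        # Type 2: expire the current row and add a new one, preserving
        # history via effective/expiry dates (column names assumed).
        today = today or date.today()
        for row in dim_rows:
            if row["key"] == key and row["expiry_date"] is None:
                row["expiry_date"] = today  # close the current version
        dim_rows.append({"key": key, **attrs,
                         "effective_date": today, "expiry_date": None})

    dim = [{"key": "C1", "city": "Detroit",
            "effective_date": date(2020, 1, 1), "expiry_date": None}]
    scd_type2(dim, "C1", {"city": "Dearborn"})
    # dim now holds the expired Detroit row plus a current Dearborn row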
Input data to SCD stages must accurately represent the order in which
events occurred. You might need to presort your input data by a sequence
number or a date field. If a job has multiple SCD stages, you must ensure that
the sort order of the input data is correct for each stage.
If the SCD stage is running in parallel, the input data must be hash
partitioned by key. Hash partitioning allows all records with the same business
key to be handled by the same process. The SCD stage divides the dimension
table across processes by building a separate lookup table for each process.
Each SCD stage processes a single dimension, but job design is flexible.
You can design one or more jobs to process dimensions, update the
dimension table, and load the fact table.
Processing dimensions
You can create a separate job for each dimension, one job for all
dimensions, or several jobs, each of which has several dimensions.
Updating dimensions
You can update the dimension table as the job runs by linking the
SCD stage to a database stage, or you can update the dimension table later
by sending dimension changes to a flat file that you use in a separate job.
Actual dimension changes are applied to the lookup table in memory and are
mirrored to the dimension update link, giving you the flexibility to handle a
series of changes to the same dimension row.
You can load the fact table as the final step of the job that updates
the last dimension, or in a separate job.
a. ODBC Connector:
IV. Arrays:
The Connector supports array insert operations in the target context. The
Connector buffers the specified number of input rows before inserting
them into the database in a single operation. This provides better
performance when inserting large numbers of rows.
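The buffering behavior can be sketched as follows. In this sketch, sqlite3 stands in for an ODBC target, and the table and column names are hypothetical.

    import sqlite3  # stands in for an ODBC target in this sketch

    def array_insert(conn, rows, array_size=100):
        # Buffer array_size input rows, then insert them in a single
        # operation, mirroring the connector's array behavior.
        buf = []
        for row in rows:
            buf.append(row)
            if len(buf) == array_size:
                conn.executemany("INSERT INTO target(id, val) VALUES (?, ?)", buf)
                buf.clear()
        if buf:
            conn.executemany("INSERT INTO target(id, val) VALUES (?, ?)", buf)
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target(id INTEGER, val TEXT)")
    array_insert(conn, [(i, f"row{i}") for i in range(250)], array_size=100)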
V. SQL Builder:
Users design the SQL statements for the Connector with the SQL Builder
tool.
b. WebSphere MQ Connector:
c. Teradata Connector:
Operator   Equivalent Utility   Advantages                            Disadvantages
Export     FastExport           Fastest export method.                Uses utility slot; no single-AMP SELECTs.
Load       FastLoad             Fastest load method.                  Uses utility slot; INSERT only; locks table;
                                                                      no views; no secondary indexes.
Update     MultiLoad            INSERT, UPDATE, DELETE, views,        Uses utility slot; locks table; no unique
                                non-unique secondary indexes.         secondary indexes; table inaccessible on abort.
Stream     TPump                INSERT, UPDATE, DELETE, views,        Slower than the Update operator.
                                secondary indexes; no utility slot;
                                no table lock.
d. DB2 Connector
The DB2 Connector includes the following features, similar to the ODBC
Connector:
- Source, target, and lookup contexts
- Reject links
- Passing LOBs by reference
- Arrays
- SQL Builder
- Pre/post run statements
- Metadata import
- Support for DB2 version V9.1
The Connector is based on the CLI client interface. It can connect to any
database cataloged on the DB2 client. The DB2 client must be collocated with
the Connector, but the actual database might be local or remote to the
Connector.
There are separate sets of connection properties for the job setup phase
(conductor) and the execution phase (player nodes), so the same database might be
cataloged differently on the conductor and player nodes.
New features
The following list describes the use of UserSQL without the reject link:
All statements in the UserSQL property are passed either all of the input
records in the current batch (as specified by the Array size property) or none.
In other words, events in previous statements do not control the number of
records passed to the statements that follow.
If FailOnError=Yes, the first statement that fails causes the job to fail, and the
current transaction, as defined by the Record Count property, is rolled back.
No more statements from the UserSQL property are executed after that. For
example, if there are three statements in the property and the second one
fails, the first one has already executed and the third one is not executed;
none of the statements has its work committed, because of the error.
If FailOnError=No, all statements still get all the records, but any
statement errors are ignored and the statements continue to be executed. For
example, if there are three statements in the UserSQL property and the
second one fails, all three are executed and any successful rows are
committed; the failed rows are ignored.
The following list describes the use of UserSQL with the reject link:
All statements in the UserSQL property are passed either all of the input
records in the current batch (as specified by the Array size property) or none.
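The FailOnError semantics described above can be sketched as follows. This is a simplified model using a generic DB-API connection, not the connector's actual implementation.

    def run_user_sql(conn, statements, batch, fail_on_error):
        # Every statement receives the whole input batch; FailOnError
        # selects fail-fast-with-rollback versus ignore-and-continue.
        cur = conn.cursor()
        for stmt in statements:
            try:
                cur.executemany(stmt, batch)
            except Exception:
                if fail_on_error:
                    conn.rollback()  # current transaction rolled back;
                    raise            # later statements never execute
                # FailOnError=No: error ignored, remaining statements run
        conn.commit()  # work of the successful statements is committed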
e. Oracle Connector
a. Distributed transactions:
Support for guaranteed delivery of transactions arriving in the form of
MQ messages. On success, the messages are processed by
the job and the data is written to the target Oracle database. On
failure, the messages are rolled back to the queue.
To use the Distributed Transaction stage, you need MQ 6.0 and
Oracle 10g R2.
g. Table action:
Performing Create, Replace, or Truncate table actions in the job is
supported before writing data to the table.
Input link column definitions are automatically used to define the target
table columns.
f. Essbase connector
You can use the Big Data File stage to access files on the Hadoop
Distributed File System (HDFS). You use the Big Data File stage to read and
write HDFS files.
The Big Data File stage is similar in function to the Sequential File stage.
You can use the stage to process multiple files and preserve the multiple files on
the output. You can use the Big Data File stage in jobs that run in parallel or
sequential mode. However, you cannot use the Big Data File stage in server
jobs.
As a target, the Big Data File stage can have a single input link and a
single reject link. As a source, you can use the Big Data File stage to read data
from one or more files.
If you are working with IBM InfoSphere BigInsights, ensure that all patches
are installed.
You must ensure that Hadoop Distributed File System (HDFS) library
and configuration files can be accessed by InfoSphere Information Server. You
set the environment variables on the engine tier. You configure environment
paths on the computer where the InfoSphere DataStage engine is installed.
7. Transformer loops
You can specify that the Transformer stage repeats a loop multiple times
for each input row that it reads. The stage can generate multiple output rows
corresponding to a single input row while a particular condition is true. You can
use stage variables in the specification of this condition. Stage variables are
evaluated once after each input row is read, and therefore hold the same value
while the loop is executed. You can also specify loop variables. Loop variables
are evaluated each time that the loop condition evaluates to true. Loop
variables can change for each iteration of the loop, and for each output row that
is generated. The name of a variable must be unique across both stage variables
and loop variables.
Any loop variables you declare are shown in a table in the right pane of
the links area. The table looks like the output link table and the stage
variables table. You can maximize or minimize the table by clicking the
arrow in the table title bar.
The table lists the loop variables together with the expressions that are
used to derive their values. Link lines join the loop variables with input
columns used in the expressions. Links from the right side of the table
link the variables to the output columns that use them, or to the stage
variables that they use.
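A loop that emits several output rows per input row can be sketched as a Python generator. The qty-driven condition below is an assumed example, not the stage's syntax.

    def transformer_with_loop(rows):
        for row in rows:
            # Stage-variable analogue: evaluated once per input row and
            # constant during the loop (the qty-driven count is assumed).
            copies = row["qty"]
            # Loop-variable analogue: changes on every iteration.
            for loop_index in range(copies):
                yield {**row, "seq": loop_index + 1}

    print(list(transformer_with_loop([{"item": "A", "qty": 3}])))
    # -> three output rows for the single input row, seq 1..3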
Guidelines:
a. Try to keep the canvas clean. Try to maintain a left-to-right, top-down design approach.
b. When writing intermediate results that will be shared between parallel jobs, try to write to data
sets whenever possible. When writing to a dataset, ensure that the data is partitioned, and that
the partitions, and sort order, are retained at every subsequent stage. This is helpful specifically
when processing large volumes of data.
c. Whenever possible, try to run smaller jobs concurrently in order to optimize overall processing
time.
d. Parameterize jobs as much as possible.
e. Use shared containers to share common logic across a number of jobs. Remember that shared
containers are inserted when a job is compiled. If the shared container is changed, the jobs using
it will need recompiling.
f. Avoid unnecessary type conversion. If you are using stage variables on a Transformer stage,
ensure that their data types match the expected result types.
g. Avoid using a transformer for tasks that can be accomplished through other parallel stages. Type
conversion, column name changes, filter by column are examples where modify or filter stage can
be used instead of a transformer.
Use a Copy stage rather than a Transformer for simple operations such as:
1. Providing a job design placeholder on the canvas. (Provided you do not set the
Force property to True on the Copy stage, the copy will be optimized out of the
job at run time.)
2. Renaming columns.
3. Dropping columns.
4. Implicit type conversions.
h. Careful job design can improve the performance of sort operations, both in standalone Sort
stages and in on-link sorts specified on the Input/Partitioning tab of other stage types. Be aware of
DataStage-inserted sorts in cases where they are not required. Stages like Join require sorted input
data. When the input data is sorted externally, as in a database, a Sort stage with the "Don't sort,
previously sorted" option on the hash key columns prior to the Join stage will ensure that the data
is not re-sorted and will preserve pipeline parallelism.
Look at job designs and try to reorder the job flow to combine operations around the same
sort keys if possible, and coordinate your sorting strategy with your hashing strategy. It is sometimes
possible to rearrange the order of business logic within a job flow to leverage the same sort order,
partitioning, and groupings. If data has already been partitioned and sorted on a set of key columns,
specify the "Don't sort, previously sorted" option for the key columns in the Sort stage. This reduces
the cost of sorting and takes greater advantage of pipeline parallelism. When writing to parallel data
sets, sort order and partitioning are preserved. When reading from these data sets, try to maintain
this sorting if possible by using the Same partitioning method.
i. Remove unneeded columns. Remove unneeded columns as early as possible within the job
flow. Every additional unused column requires additional buffer memory, which can impact
performance and makes each row transfer from one stage to the next more expensive. If possible,
when reading from databases, use a select list to read just the columns required, rather than the
entire table.
j. Avoid reading from sequential files using the Same partitioning method. Unless you have
specified more than one source file, this will result in the entire file being read into a single
partition, making the entire downstream flow run sequentially unless you explicitly repartition.
k. Wherever possible allow the source database to handle transformations by adding sort or
transformation logic to extract SQL, instead of using the Transformer stage.
l. The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is
small enough to fit into available physical memory. Each lookup reference requires a contiguous
block of physical memory. The Lookup stage requires all but the first input (the primary input) to fit
into physical memory. If the reference to a lookup is directly from a database table and the
number of input rows is significantly smaller than the reference rows, 1:100 or more, a Sparse
Lookup may be appropriate. If performance issues arise while using Lookup, consider using the
Join stage. The Join stage must be used if the datasets are larger than available memory
resources.
m. Intermediate files between jobs should be written to dataset files over sequential files whenever
possible.
n. When importing fixed-length data, the number of readers per node property on the Sequential File
stage can often provide a noticeable performance boost as compared with a single process
reading the data.
o. Evenly partitioned data: Because of the nature of parallel jobs, the entire flow runs only as fast as
its slowest component. If data is not evenly partitioned, the slowest component is often slow due
to data skew. If one partition has ten records and another has ten million, then a parallel job
cannot make ideal use of the resources. Setting the environment variable
APT_RECORD_COUNTS displays the number of records per partition for each component.
Ideally, counts across all partitions should be roughly equal. Differences in data volumes between
keys often skew data slightly, but any significant difference in volume (e.g., more than 5-10%)
should be a warning sign that alternate keys, or an alternate partitioning strategy, may be
required.
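A quick way to picture the check is sketched below. The partition counts would come from APT_RECORD_COUNTS output, and the 10% threshold follows the text above.

    def skew_report(partition_counts, threshold=0.10):
        # partition_counts: records per partition, as displayed when the
        # APT_RECORD_COUNTS environment variable is set.
        biggest, smallest = max(partition_counts), min(partition_counts)
        skew = (biggest - smallest) / biggest
        if skew > threshold:
            print(f"warning: partition skew {skew:.0%} "
                  f"(min {smallest}, max {biggest} records)")

    skew_report([1_000_000, 998_500, 10, 1_002_000])  # heavily skewed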
Tips:
a. To rearrange an existing job design, or to insert new stage types into an existing job flow,
first disconnect the links from the stage to be changed; the links will then retain any
metadata associated with them.
b. The Copy stage is a good placeholder between stages if you anticipate that new stages
or logic will be needed in the future without damaging existing properties and derivations.
When inserting a new stage, simply drag the input and output links from the Copy
placeholder to the new stage. Unless the Force property is set in the Copy stage,
DataStage optimizes the actual copy out at runtime.
c. You can set the OSH_PRINT_SCHEMAS environment variable to verify that runtime
schemas match the job design column definitions.
Appendices
1. Create 5 similar new Generic IDs for your project in SILAS, with a proper description in the
"title" section.
Naming standard - 7 letters (4-letter project name + 3-letter type):
<ndwh>dev,<ndwh>qa,<ndwh>prd,<ndwh>mig,<ndwh>sup,<ndwh>sec
2. The application team should request SILAS registration for any IDs, such as FTP IDs, that
need it.
3. After the IDs are created in SILAS, the application team should request these IDs in the DEV,
QA and PROD environments as a group or UNIX ID, depending on the type of ID:
http://www.request.ford.com/RequestCenter/myservices/navigate.do?query=orderform&si
d=383&
4. After the IDs/groups are created on UNIX, the batch ID should be added to the respective groups.
STEP 1. Use the IS credentials to log in to the console. (If the password needs to be reset, submit an
RC ticket to the DataStage admin team.)
STEP 2. Log in via the client using the IS credentials (the credentials used to log in to the webpage in
Step 1) to access the Project.
All standards mentioned in the ETL Standards and Guidelines document for DataStage version 7.5
should also be followed as a reference for stages and functionality that already existed in DataStage
version 7.5 and carried over to version 8.0. The evolving BIDS Standards and Guidelines document
should also be referenced periodically for updated standards.