The standards and guidelines for the use of the Analytical Pattern's ETL tool, IBM InfoSphere DataStage.
1. Overview
The purpose of this document is to define and direct the use of the IBM InfoSphere
DataStage product for the development and deployment of ETL processing, primarily for analytical
systems at Ford Motor Company.
This document lays out standards, guidelines and tips for DataStage development.
A. Standard
A standard is a rule. Adhering to a standard is strictly mandated.
B. Guideline
A guideline is a good practice to follow. Following a guideline is recommended, but not mandatory.
C. Tip
A tip is neither a standard nor a guideline; it is a suggestion to make things easier and faster for a
developer.
Definition of ETL
ETL is defined as a set of software tools and related processes used to Extract (E) data
from a source, perform any Transformations (T) or cleansing, and Load (L) the result into a target
database. This is the primary process by which source data from transactional systems is prepared
for storage and end-user delivery.
These processes generally read data from an input source (flat file, relational table,
message queue, etc.). This is known as the Extraction. Extractions from the input source can
become complex when joining two or more relational tables together to create an input source.
Once sources are extracted, the data is passed through either an engine or a code-based
process to modify, enhance, or eliminate data elements based on the business rules identified.
This is known as the Transformation. Transformations can become very complex, depending on
the business rules identified as well as the sorting/merging of data that occurs during
processing.
Though it may seem simple on the surface, this layer of the architecture can get very
complex very quickly based on the number of source feeds being managed, the quality of the
data being provided, the requirements to standardize the data, the business rules identified to
transform the data, and the management of errors/rejects.
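As a rough illustration of the Extract/Transform/Load pattern described above, consider the minimal Python sketch below. The file names, column names, and business rules are hypothetical, and a real job would use DataStage stages rather than hand-written code.

    import csv

    def extract(path):
        # Extract: read rows from a flat-file source.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transform: apply business rules -- here, reject rows missing a
        # key and standardize a code field (both rules are hypothetical).
        for row in rows:
            if not row.get("part_no"):
                continue  # reject: fails a business rule
            row["plant_code"] = row["plant_code"].strip().upper()
            yield row

    def load(rows, path):
        # Load: write the cleansed rows to the target (a file stands in
        # for a database table in this sketch).
        rows = list(rows)
        if not rows:
            return
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

    load(transform(extract("source.csv")), "target.csv")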
The DataStage environment established for Ford Motor Company is a result of the
definition and directions set forth by the Analytical Pattern and in keeping with the concepts of the
Corporate Information Factory (CIF) framework. (See Appendix A).
A. The powerful ETL solution supports the collection, integration and transformation
of large volumes of data, with data structures ranging from simple to highly complex.
IBM InfoSphere DataStage manages data arriving in real-time as well as data
received on a periodic or scheduled basis.
B. The scalable platform enables companies to solve large-scale business problems
through high-performance processing of massive data volumes. By leveraging the
parallel processing capabilities of multiprocessor hardware platforms, IBM
InfoSphere DataStage can scale to satisfy the demands of ever-growing data
volumes, stringent real-time requirements, and ever shrinking batch windows.
C. Comprehensive source and target support for a virtually unlimited number of
heterogeneous data sources and targets in a single job includes text files; complex
data structures in XML; ERP systems such as SAP and PeopleSoft; almost any
database (including partitioned databases); web services; and business intelligence
tools like SAS.
D. Real-time data integration support captures messages
from Message Oriented Middleware (MOM) queues using JMS or WebSphere MQ
adapters to seamlessly combine data into conforming operational and historical
analysis perspectives. IBM InfoSphere Information Services Director provides a
service-oriented architecture (SOA) for publishing data integration logic as shared
services that can be reused across the enterprise. These services are capable of
simultaneously supporting high-speed, high reliability requirements of transactional
processing and the high volume bulk data requirements of batch processing.
E. Advanced maintenance and development enables developers to maximize
speed, flexibility and effectiveness in building, deploying, updating and managing
their data integration infrastructure. Full data integration reduces the development
and maintenance cycle for data integration projects by simplifying administration
and maximizing development resources.
F. Complete connectivity between any data source and any application ensures
that the most relevant, complete and accurate data is integrated and used by the
most popular enterprise application software brands, including SAP, Siebel, Oracle,
and PeopleSoft.
G. Flexibility to perform information integration directly on the mainframe. InfoSphere
DataStage for Linux on System z provides:
a. The ability to leverage existing mainframe resources in order to maximize the value of
your IT investments
b. The scalability, security, manageability and reliability of the mainframe
c. The ability to add mainframe information integration workload without added z/OS
operational costs
Function              Description                                              Tool
Data Transformation   Tools that provide for the processing and                IBM InfoSphere DataStage
                      manipulation of data
Source Code Control   Check in/out source code executables to control          Accurev
                      and track changes
Batch Scheduling      Tools for end-to-end scheduling and monitoring           ETL Server - AutoSys
                      of batch jobs
a. The application team has to follow the process outlined in Step A. The lead
time is 2 weeks for each environment. In an ideal situation, as soon as the
SOW is approved, the DataStage team will create a project within 2 weeks.
(Time delays must be accounted for while the file systems are created on
the server by the server hosting teams.)
b. The QA and PROD projects will be created upon request; the application teams
are advised to keep the DataStage administration team in the loop and to plan
well ahead. Please allocate sufficient time for QA testing before
requesting PROD.
c. The application team should come up with 3-character and 4-character acronyms for the
project (refer to the "Project Names" section).
d. See Appendix B for more details.
A new DataStage user requires the creation of a UNIX userid on the DataStage
server and the placement of that user into the appropriate UNIX group for access to the
proper DataStage project.
The changes are captured, evaluated and incorporated into the existing standards and
guidelines if approved. The change process flow is as below:
1. A representative from the Product Line and/or Application team can submit a request to revise
a standard or guideline to the CoE by submitting a request in the CHANGE LOG:
https://proj.sp.ford.com/sites/BIDSETLCoE/Lists/CHANGE%20LOG/AllItems.aspx
2. The requester must attach a supporting document to the change request above.
3. The requester will then be invited to the next weekly CoE meeting, where the change will be
discussed by the CoE members present.
4. The CoE will then designate who communicates the change to the community.
5. It is recommended that the representative from the Product Line and/or Project be a CoE
member.
8. Compliance to Standards
Within a systems development community, standards help teams to better read,
troubleshoot, and understand code from project to project. It is recognized that as standards
evolve and change, this eventually results in non-compliance for all teams. Professional
judgment needs to be applied when complying with standards, and exceptions should be sought,
with business justification, only if truly needed.
Policy Statements
1. When a new job or module is created, it must comply with current standards.
2. When an existing job or module is updated and/or modified, it must comply with current
standards.
3. Jobs or modules that are not being touched are not required to be brought up to current
standards. It is recognized that the benefit of this does not always equal the effort. It is
encouraged that they get updated as time or opportunity allows, especially if the result of those
jobs or modules will be tested in the release.
4. Even the best design & code reviews will occasionally miss the enforcement of a standard.
This does not grant an automatic exception if caught in subsequent reviews. Such errors should
be corrected per the above compliance standards.
A. Exceptions
1. A representative from the Product Line and/or Application team can submit a request for an
exception to a standard to the CoE by submitting a request in the EXCEPTION LOG.
2. The requester will then be invited to the next weekly CoE meeting, where the exception will be
discussed by the CoE members present.
3. All exceptions are time-bound, with a maximum duration of 36 months. The CoE will track each
exception in an Exception Log.
4. It is recommended that the representative from the Product Line and/or Project be a CoE
member.
5. It is the responsibility of the Application team to share any approved exceptions during a
Design or Code review. The DataStage team member will verify that the exception is current in the
Exception Log. The non-compliance must still be documented in the Design or Code review, but
the exception should then be noted, with no requirement to make the update for the review
comment.
9. Security
DataStage jobs, when run as batch jobs from Linux, use the credentials stored in the
password file in the security directory of the project. If personal IDs are stored in the password
file, all project team members would use those credentials to run DataStage jobs, which is a
security violation. The security standards and guidelines below ensure that no such
violation occurs.
A. Standard
Do not use personal database IDs in the password files in the security directory of the
project.
B. Guideline
It is recommended to use generic database IDs in all three environments (DEV, QA
and PROD) when running DataStage batch jobs from Linux. If you choose to follow
other compensating controls, those controls need to be documented in the
product line ACR and reviewed with the SCC of the product line.
It is allowed to use one's own personal database ID when running DataStage jobs from
the DataStage client, or to access data outside of the DataStage environment, provided the ID
and password are secured when accessing data and the product line approves this usage.
b. TSNK_PROD_proj
1. T = Corporate Defined Identifier
2. SNK = Application Identifier
3. _ = constant
4. PROD = Environment
5. _proj = constant suffix
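A minimal sketch of how this naming pattern could be composed and checked programmatically. The helper function, the environment list, and the regular expression below are illustrative, not part of the standard.

    import re

    # One-letter corporate identifier, three-character application
    # identifier, underscore, environment, and the "_proj" suffix.
    PROJECT_NAME = re.compile(r"^[A-Z][A-Z0-9]{3}_(DEV|QA|PROD)_proj$")

    def project_name(corp, app, env):
        return f"{corp}{app}_{env}_proj"

    name = project_name("T", "SNK", "PROD")  # -> "TSNK_PROD_proj"
    assert PROJECT_NAME.match(name)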
Job Categories are used to organize the Jobs folder. They are optional
depending on the number of jobs in the project. It is however a good practice to
segregate jobs into categories in order to enhance the organization of the project.
Format Example A:
a. 1-2 Category Identifier
b. 3-3 Always an Underscore
c. 4-18 Freeform alphanumeric
MA_MaterialAccounting
a. MA = Category Identifier
b. _ = constant
c. MaterialAccounting = freeform alphanumeric
Format Example B:
d. 1-3 Category Identifier
e. 4-4 Always an Underscore
f. 5-18 Freeform alphanumeric
850_sync_metrics
d. 850 = Category Identifier
e. _ = constant
f. sync_metrics = freeform alphanumeric
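Both category formats can be checked mechanically. The regular expressions below are an illustrative sketch, not part of the standard, and the position-based length limits are not enforced here.

    import re

    # Format A: 2-character category identifier, underscore, free-form text.
    FORMAT_A = re.compile(r"^[A-Za-z0-9]{2}_[A-Za-z0-9]+$")
    # Format B: 3-character category identifier, underscore, free-form text
    # (underscores allowed in the free-form part).
    FORMAT_B = re.compile(r"^[A-Za-z0-9]{3}_[A-Za-z0-9_]+$")

    assert FORMAT_A.match("MA_MaterialAccounting")
    assert FORMAT_B.match("850_sync_metrics")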
1-4 First four of the DataStage Project Name (Ex MCKS, TSNK)
5-9 Free Form Application Standard Text. Different applications group and categorize
jobs in different ways; this area is for that purpose. This should be standard within an
application.
Here are different examples of how this area can be used
Ex A:
5-7 App Group - Function
LOD - Load job
UNL - Unload Job
FTP - FTP Job
EDT - Edit Job
MSC - Miscellaneous Job
8-9 AppCategory - unique 2 character subject area identifier
SL - Sales
SV - Service
CO - Common
SM - Smart Vincent
Ex B:
5-7 App Group - Source
GCP - GCP source related
OTG - OTG source related
INT - Integrator source related
PTM - PTMS source related
8-9 AppCategory - Relation
IS - Issue related
ST - Static data related
PA - Part related
SP - Supplier related
BY - Buyer related
Ex C:
5-7 App Group - Source
GCP - GCP source related
Ex D:
5-8 App Group - Source
GCPP - GCP source related
OTGG - OTG source related
INTT - Integrator source related
PTMS - PTMS source related
9 AppCategory - Inbound/Outbound
I - Inbound
O - Outbound
10 - Component Type
C = Container
G = Custom Stage
J = Executable Job
R = Custom Routine
S = Sequence Job (Also called "Control Jobs". These jobs control the execution of
"Executable Jobs"; they do not contain any data flows.)
T = Template Job (The purpose of "Template Jobs" is to provide a consistent pattern
for the beginnings of other jobs.)
Sample A:
MCKSLODSYJ850LoadCustPrefTbl
MCKS = Corporate Defined Identifier and Application Identifier.
LOD = Application Group.
SY = Category Identifier.
J = Executable Job.
850LoadCustPrefTbl = Freeform alphanumeric/underscores.
Sample B:
FEDWCKSIBJUnloadCKSData
FEDW = Corporate Defined Identifier and Application Identifier.
CKS = Application Group
IB = Category Identifier
J = Executable Job.
UnloadCKSData = Freeform alphanumeric/underscores
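A sketch of how a job name following the sample layout above could be decomposed. The parse_job_name helper is hypothetical; the 4+3+2+1 offsets assume the two-character category form, and Ex D's four-character group would shift them.

    COMPONENT_TYPES = {"C": "Container", "G": "Custom Stage",
                       "J": "Executable Job", "R": "Custom Routine",
                       "S": "Sequence Job", "T": "Template Job"}

    def parse_job_name(name):
        return {
            "project":        name[0:4],   # e.g. MCKS
            "app_group":      name[4:7],   # e.g. LOD
            "category":       name[7:9],   # e.g. SY
            "component_type": COMPONENT_TYPES[name[9]],
            "description":    name[10:],   # free-form
        }

    print(parse_job_name("MCKSLODSYJ850LoadCustPrefTbl"))
    # {'project': 'MCKS', 'app_group': 'LOD', 'category': 'SY',
    #  'component_type': 'Executable Job',
    #  'description': '850LoadCustPrefTbl'}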
D. Invocation ID guidelines
1. Create a job parameter to be used for supplying the Invocation ID at run time.
2. The Invocation ID should be something pertinent to an instance of the run such
as:
a. Batch Id
b. Plant Code
c. Run Number
See Appendix E for a list of suggested abbreviations for Stages and Links.
1. Name ALL stages and links. DO NOT accept the default names.
2. Use descriptive names to differentiate between reference, reject, and stream
(input/output) links for a specified stage.
3. Avoid using in/input or out/output in link names, since links join multiple
stages. A link used for output of one stage may be the input to the following
stage.
4. Start with a letter; can contain alphanumeric and underscores.
5. Use underscores as word separators.
6. Keep length to < 33 characters. If a longer name is required please check with
your lead developer and DataStage -CoE.
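Rules 4 through 6 can be expressed as a single regular expression. The following is an illustrative sketch, not an official validator.

    import re

    # Rules 4-6: starts with a letter, alphanumerics and underscores
    # only, fewer than 33 characters in total.
    VALID_NAME = re.compile(r"^[A-Za-z]\w{0,31}$")

    assert VALID_NAME.match("lnk_customer_ref")
    assert not VALID_NAME.match("1_bad_name")  # must start with a letter
    assert not VALID_NAME.match("x" * 40)      # too long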
2. If you have a lengthy ETL process that you wish to restart, you need to break the
process into multiple jobs controlled by a job sequence. The process
can then be restarted at the beginning of any of the jobs within the job
sequence. Intermediate results should be landed to Parallel DataSets,
unless the results need to be shared with processes outside of
DataStage.
3. Identify and design the common modules first.
4. Lay out the file and table structures, including the touch points for each module, to
ease development and integration. Make sure the entire team is involved when
determining the input and output layouts for each module.
5. Import record layouts and table schemas from the appropriate metadata source.
Developers should not key in columns and column metadata into their jobs.
6. For interim files used between DataStage jobs (not shared with other
applications), record layouts should be created within the tool and shared by all
DataStage jobs referencing the file.
7. Check all field characteristics (type, length, null/not null) in file layouts and data
models to ensure they match. DataStage is very particular about ensuring that
these match.
8. Developers need complete knowledge of the ETL tool set and of the applicability of
the different tools, to make system integration smooth (e.g., where to use DataStage,
PL/SQL, Autosys, etc.).
9. When planning job flows, use flowcharts to lay out the jobs, their data sources,
intermediate files, and target files; this assists in determining order of execution,
concurrency, and dependencies between jobs.
B. Development
11. The order of the stage variables within a transformer stage is critical if one stage
variable refers to another. Execution happens from top to bottom.
12. Job parameterization allows a single job design to process similar logic instead of
creating multiple copies of the same job. The Multiple-Instance job property
allows multiple invocations of the same job to run simultaneously.
13. Avoid using a Copy or Filter stage immediately before or after a Transformer
stage; a Transformer does the job of both a Copy stage and a Filter stage, so the
extra stage only reduces efficiency.
14. Make use of an ORDER BY clause when a DB stage is used in a join. The
intention is to use the database's power for sorting instead of DataStage
resources.
15. It is recommended to use the DB2 connector instead of the legacy DB2 stages. The
legacy DB2 stages are deprecated and there will not be any long-term
support for issues related to them; the DB2 connector, by contrast, is the emerging
stage type in IBM DataStage.
16. It is recommended to avoid setting "Unicode" in the extended attributes of the
Teradata connector stage unless it is intended for specific purposes such as
multi-language support. During the 8.5 migration, several applications noticed data
issues when "Unicode" was set in the extended attributes of the Teradata connector
stage unintentionally.
17. It is recommended to give meaningful names to columns in DataStage jobs.
Avoid names such as "t" for column names; some product lines faced
issues when a column was named "t", as this happens to be a DataStage internal
keyword as well.
C. Debugging
1. Use Peek stages in PX to view stage results; a sketch of the idea follows this list.
The number of rows to peek can be set to 1, but not to zero. If there are not too many
Peeks in your job, you can set the "Number of records" property to 1 before promoting
the job from the test environment. Determine the impact on the log if too many Peek
stages remain in production jobs, even if each displays only a single record.
2. Use debug stages such as Column Generator and Row Generator to generate any
data needed.
3. Use performance statistics in Director log and in Designer to view stage results.
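Outside the Designer canvas, the Peek idea can be pictured in a few lines of Python, as sketched below. The helper is illustrative only and not a DataStage API.

    def peek(rows, label, n=1):
        # Pass rows through unchanged, logging the first n -- the same
        # idea as a Peek stage with "Number of records" set to 1.
        for i, row in enumerate(rows):
            if i < n:
                print(f"[peek:{label}] {row}")
            yield row

    for row in peek([{"id": 1}, {"id": 2}], label="after_transform"):
        pass  # downstream stages would consume the rows here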
D. Performance
5. Minimize the number of Transformers. For data type conversions, renaming and
removing columns, other stages (for example, Copy, Modify) might be more
appropriate. Note that unless dynamic (parameterized) conditions are required, a
Transformer is always faster than a Filter or Switch stage. Avoid using the BASIC
Transformer, especially in large-volume data flows.
6. Use BuildOps only when existing Transformers do not meet performance
requirements, or when complex reusable logic is required. Because BuildOps are
built in C++, there is greater control over the efficiency of code, at the expense of
ease of development (and more skilled developer
requirements).
7. Minimize the number of partitioners in a job. When possible, ensure data is as
evenly distributed as possible.
8. Minimize and combine use of Sorts where possible.
Subdirectories under each of these main directories are allowed based on project
needs.
F. Job Parameters
1. Create parameters for all items that can vary by environment (dev, test, QA,
Prod). Some of those items are:
a. Database Objects such as table names and database names
b. Source and Target connection string information such as userids, passwords,
and server names
2. Establish a naming standard within the project for repeatedly used parameters to
ensure consistency and ease of maintenance, such as:
a. dbname = database name
b. orauser = Oracle userid
c. orapass = Oracle password
d. tduser = Teradata userid
e. tdpass = Teradata password.
3. Parameters should be stored in a file or files that can be accessed at job execution time.
(A parsing sketch for one of these formats follows this list.)
a. If a parameter contains a password, the password must be in the security
folder. Establish a consistent format for all parameter files for your
application. Whichever method you choose will require the appropriate
handling of the values:
i. String of names and values separated by a delimiter
1. parametername,parametervalue,parametername,parametervalue
ii. Strings of values separated by a delimiter
1. parametervalue,parametervalue,parametervalue
iii. One parameter name and value per record
1. parametername=parametervalue
4. Parameter sets simplify the management of job parameters by providing a central
location for managing large lists of parameters, and they offer a convenient way of
defining multiple sets of values in the form of value files.
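A sketch of reading the "one parameter name and value per record" format (format iii above). The file name and the "#" comment convention are assumptions.

    def read_parameters(path):
        # Parse lines of the form parametername=parametervalue.
        params = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                name, _, value = line.partition("=")
                params[name.strip()] = value.strip()
        return params

    # e.g. a file containing "dbname=SALESDB" yields {'dbname': 'SALESDB'}
    params = read_parameters("job_params.txt")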
This section provides some general guidelines for several new stages and new
functionality provided by Infosphere Datastage 8.5.
1. Join stage:
When joining tables in the same database or on the same database
server, the join should be done by the DBMS and not by the DataStage Join stage.
2. Lookup stage:
a) Restrict use to small look-up tables, ones that can be stored in physical
memory. The number of rows that can be used depends on the number of
columns, their data types (and associated storage lengths), and the available
memory.
b) Avoid single-select (Sparse option) lookups against DB2 and Oracle when
the number of master records times the number of lookups exceeds 1
million. Instead, use the RDBMS in the Lookup stage with the Normal option
to pre-load the reference data into memory. When dealing with a flat file or
Teradata table, the data is always loaded into memory for processing (the
Sparse option is not available with Teradata reference data). Use the Lookup
stage when the lookup data set is small and contains no duplicate keys;
otherwise, use a Join or Merge.
c) Limit the use of database Sparse Lookups to scenarios where the number of
input rows is significantly smaller (for example, 1:100 or more) than the
number of reference rows, or when performing exception processing.
d) When the lookup source is a table and the master file contains only a small
number of distinct lookup keys, consider using the Sparse option. Otherwise,
the entire table is read into memory.
e) The Lookup stage will drop records from the master file where the key
contains null values, without displaying a message.
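The in-memory behavior described in points (a) and (e) can be pictured with a plain dictionary lookup. This is an illustrative sketch only; here unmatched rows are also dropped, as with an inner lookup.

    def lookup_join(master_rows, reference_rows, key):
        # Reference data is loaded fully into memory, as with the Normal
        # option, so it must be small and free of duplicate keys.
        ref = {r[key]: r for r in reference_rows}
        for row in master_rows:
            k = row.get(key)
            if k is None:
                continue  # mirrors (e): null keys are dropped silently
            if k in ref:
                yield {**row, **ref[k]}

    master = [{"part": "A1"}, {"part": None}, {"part": "B2"}]
    reference = [{"part": "A1", "desc": "bracket"}]
    print(list(lookup_join(master, reference, "part")))
    # -> only the A1 row survives; null-key and unmatched rows drop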
3. Funnel stage:
In previous releases of DataStage, the Funnel stage could block job performance
when the number of source rows on input links was unbalanced. DataStage v7
corrects this behavior with the continuous funnel type. New jobs created in
DataStage v7 will default to the continuous funnel type. When importing jobs
created in earlier versions, ensure that the funnel type is set to continuous
funnel. The Funnel stage requires all the input streams to have the same metadata.
This is strictly enforced from version 8.7 onward by throwing a compilation error
when the metadata differs across input streams.
The Pivot Enterprise stage is a processing stage that pivots data horizontally and
vertically. Horizontal pivoting maps a set of columns in an input row to a single column in
multiple output rows. Vertical pivoting maps a set of rows in the input data to single or
multiple output columns.
a. Specifying a horizontal pivot operation and mapping output columns
You can specify the horizontal pivot operation and then map the resulting
columns onto output columns. Use horizontal pivot operations to map a set
of input columns into multiple output rows.
3. Open the Pivot Enterprise stage and click the Properties tab.
4. Specify the Pivot type as Horizontal on the Pivot Action tab.
5. Specify the horizontal pivot operation on the Pivot Definitions tab of
the Stage page by doing the following tasks:
a. In the Name field, type the name of the output column that
will contain the pivoted data (the pivot column).
b. Specify the SQL type and, if necessary (for example if the
SQL type is decimal), the length and scale for the pivoted
data.
c. Double-click the Derivation field to open the Column
Selection window.
d. In the Available Columns list, select the columns that you
want to combine in the pivot column.
e. Click the right arrow to move the selected column to the
Selected Columns list.
f. Click OK to return to the Pivot tab.
g. If you want the stage to number the pivoted rows by
generating a pivot index column, select Pivot Index.
6. Click the Output page to go to the Mapping tab.
7. Specify your output data by doing the following tasks:
a. Select your pivot column or columns from the Columns table
in the right pane and drag them over to the output link table
in the left pane.
b. Drag all the columns that you want to appear in your output
data from the available columns on the left side of the
window to the available output columns on the right side of
the window.
c. Click the Columns tab to view the data that the stage will
output when the job runs.
8. Click OK to save your pivot operation and close the Pivot Enterprise
stage.
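Stripped of the GUI steps, the horizontal pivot amounts to the following transformation. This is a Python sketch with hypothetical column names, not the stage's implementation.

    def horizontal_pivot(rows, pivot_cols, pivot_name, index=False):
        # Map a set of columns in each input row to a single column
        # spread across multiple output rows.
        for row in rows:
            keep = {k: v for k, v in row.items() if k not in pivot_cols}
            for i, col in enumerate(pivot_cols):
                out = {**keep, pivot_name: row[col]}
                if index:
                    out["pivot_index"] = i  # the optional Pivot Index
                yield out

    rows = [{"cust": "C1", "q1_sales": 100, "q2_sales": 150}]
    print(list(horizontal_pivot(rows, ["q1_sales", "q2_sales"], "sales",
                                index=True)))
    # [{'cust': 'C1', 'sales': 100, 'pivot_index': 0},
    #  {'cust': 'C1', 'sales': 150, 'pivot_index': 1}]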
b. Specifying a vertical pivot operation and mapping output columns
You can configure a pivot operation to vertically pivot data and then map
the resulting columns onto output columns; the procedure parallels the
horizontal case, with the Pivot type set to Vertical.
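The vertical case is the inverse of the horizontal sketch. A sketch under the same assumptions follows; the output column naming scheme is hypothetical.

    from collections import defaultdict

    def vertical_pivot(rows, group_key, value_col, width):
        # Map a set of input rows (grouped by key) onto multiple
        # output columns.
        groups = defaultdict(list)
        for row in rows:
            groups[row[group_key]].append(row[value_col])
        for key, values in groups.items():
            out = {group_key: key}
            for i in range(width):
                out[f"{value_col}_{i + 1}"] = values[i] if i < len(values) else None
            yield out

    rows = [{"cust": "C1", "sales": 100}, {"cust": "C1", "sales": 150}]
    print(list(vertical_pivot(rows, "cust", "sales", width=2)))
    # [{'cust': 'C1', 'sales_1': 100, 'sales_2': 150}]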
The SCD stage reads source data on the input link, performs a
dimension table lookup on the reference link, and writes data on the output link.
The output link can pass data to another SCD stage, to a different type of
processing stage, or to a fact table. The dimension update link is a separate
output link that carries changes to the dimension. You can perform these steps in
a single job or a series of jobs, depending on the number of dimensions in your
database and your performance requirements.
SCD stages support both SCD Type 1 and SCD Type 2 processing. SCD
Type 1 overwrites an attribute in a dimension table, whereas SCD Type 2 adds a
new row to the dimension table.
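The difference between the two types can be sketched as follows. Column names such as effective_date and expiry_date are assumptions for illustration, not the stage's actual fields.

    from datetime import date

    def scd_type1(dim_rows, key, attrs):
        # Type 1: overwrite the attributes in place; no history is kept.
        for row in dim_rows:
            if row["key"] == key:
                row.update(attrs)

    def scd_type2(dim_rows, key, attrs, today=None):
        # Type 2: expire the current row and add a new one, preserving
        # history via effective/expiry dates (column names assumed).
        today = today or date.today()
        for row in dim_rows:
            if row["key"] == key and row["expiry_date"] is None:
                row["expiry_date"] = today  # close the current version
        dim_rows.append({"key": key, **attrs,
                         "effective_date": today, "expiry_date": None})

    dim = [{"key": "C1", "city": "Detroit",
            "effective_date": date(2020, 1, 1), "expiry_date": None}]
    scd_type2(dim, "C1", {"city": "Dearborn"})
    # dim now holds the expired Detroit row plus a current Dearborn row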
Input data to SCD stages must accurately represent the order in which
events occurred. You might need to presort your input data by a sequence
number or a date field. If a job has multiple SCD stages, you must ensure that
the sort order of the input data is correct for each stage.
If the SCD stage is running in parallel, the input data must be hash
partitioned by key. Hash partitioning allows all records with the same business
key to be handled by the same process. The SCD stage divides the dimension
table across processes by building a separate lookup table for each process.
Each SCD stage processes a single dimension, but job design is flexible.
You can design one or more jobs to process dimensions, update the
dimension table, and load the fact table.
Processing dimensions
You can create a separate job for each dimension, one job for all
dimensions, or several jobs, each of which has several dimensions.
Updating dimensions
You can update the dimension table as the job runs by linking the
SCD stage to a database stage, or you can update the dimension table later
by sending dimension changes to a flat file that you use in a separate job.
Actual dimension changes are applied to the lookup table in memory and are
mirrored to the dimension update link, giving you the flexibility to handle a
series of changes to the same dimension row.
You can load the fact table as the final step of the job that updates
the last dimension, or in a separate job.
a. ODBC Connector:
IV. Arrays:
The Connector supports array insert operations in the target context. The
Connector buffers the specified number of input rows before inserting
them into the database in a single operation. This provides better
performance when inserting large numbers of rows.
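The buffering behavior can be sketched as follows. In this sketch, sqlite3 stands in for an ODBC target, and the table and column names are hypothetical.

    import sqlite3  # stands in for an ODBC target in this sketch

    def array_insert(conn, rows, array_size=100):
        # Buffer array_size input rows, then insert them in a single
        # operation, mirroring the connector's array behavior.
        buf = []
        for row in rows:
            buf.append(row)
            if len(buf) == array_size:
                conn.executemany("INSERT INTO target(id, val) VALUES (?, ?)", buf)
                buf.clear()
        if buf:
            conn.executemany("INSERT INTO target(id, val) VALUES (?, ?)", buf)
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target(id INTEGER, val TEXT)")
    array_insert(conn, [(i, f"row{i}") for i in range(250)], array_size=100)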
V. SQL Builder:
Users design the SQL statements for the Connector with the SQL Builder
tool.
b. WebSphere MQ Connector:
c. Teradata Connector:
Operator   Equivalent Utility   Advantages                            Disadvantages
Export     FastExport           Fastest export method.                Uses utility slot; no single-AMP SELECTs.
Load       FastLoad             Fastest load method.                  Uses utility slot; INSERT only; locks table;
                                                                      no views; no secondary indexes.
Update     MultiLoad            INSERT, UPDATE, DELETE, views,        Uses utility slot; locks table; no unique
                                non-unique secondary indexes.         secondary indexes; table inaccessible on abort.
Stream     TPump                INSERT, UPDATE, DELETE, views,        Slower than the Update operator.
                                secondary indexes; no utility slot;
                                no table lock.
d. DB2 Connector
The DB2 Connector includes the following features, similar to the ODBC
Connector:
- Source, target, and lookup contexts
- Reject links
- Passing LOBs by reference
- Arrays
- SQL Builder
- Pre/post run statements
- Metadata import
- Support for DB2 version V9.1
The Connector is based on the CLI client interface. It can connect to any
database cataloged on the DB2 client. The DB2 client must be collocated with
the Connector, but the actual database might be local or remote to the
Connector.
There are separate sets of connection properties for the job setup phase
(conductor) and the execution phase (player nodes), so the same database might be
cataloged differently on the conductor and player nodes.
New features
The following list describes the use of UserSQL without the reject link:
All statements in the UserSQL property are passed either all of the input
records in the current batch (as specified by the Array size property) or none.
In other words, events in previous statements do not control the number of
records passed to the statements that follow.
If FailOnError=Yes, the first statement that fails causes the job to fail, and the
current transaction, as defined by the Record Count property, is rolled back.
No more statements from the UserSQL property are executed after that. For
example, if there are three statements in the property and the second one
fails, the first one has already executed and the third one is not executed;
none of the statements has its work committed, because of the error.
If FailOnError=No, all statements still get all the records, but any
statement errors are ignored and the statements continue to be executed. For
example, if there are three statements in the UserSQL property and the
second one fails, all three are executed and any successful rows are
committed; the failed rows are ignored.
The following list describes the use of UserSQL with the reject link:
All statements in the UserSQL property are passed either all of the input
records in the current batch (as specified by the Array size property) or none.
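The FailOnError semantics described above can be sketched as follows. This is a simplified model using a generic DB-API connection, not the connector's actual implementation.

    def run_user_sql(conn, statements, batch, fail_on_error):
        # Every statement receives the whole input batch; FailOnError
        # selects fail-fast-with-rollback versus ignore-and-continue.
        cur = conn.cursor()
        for stmt in statements:
            try:
                cur.executemany(stmt, batch)
            except Exception:
                if fail_on_error:
                    conn.rollback()  # current transaction rolled back;
                    raise            # later statements never execute
                # FailOnError=No: error ignored, remaining statements run
        conn.commit()  # work of the successful statements is committed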
e. Oracle Connector
a. Distributed transactions:
Support for guaranteed delivery of transactions arriving in the form of
MQ messages. On success, the messages are processed by
the job and the data is written to the target Oracle database. On
failure, the messages are rolled back to the queue.
To use the Distributed Transaction stage, you need MQ 6.0 and
Oracle 10g R2.
g. Table action:
Performing Create, Replace, or Truncate table actions in the job is
supported before writing data to the table.
Input link column definitions are automatically used to define the target
table columns.
f. Essbase connector
You can use the Big Data File stage to access files on the Hadoop
Distributed File System (HDFS). You use the Big Data File stage to read and
write HDFS files.
The Big Data File stage is similar in function to the Sequential File stage.
You can use the stage to process multiple files and preserve the multiple files on
the output. You can use the Big Data File stage in jobs that run in parallel or
sequential mode. However, you cannot use the Big Data File stage in server
jobs.
As a target, the Big Data File stage can have a single input link and a
single reject link. As a source, you can use the Big Data File stage to read data
from one or more files.
If you are working with IBM InfoSphere BigInsights, ensure that all patches
are installed.
You must ensure that Hadoop Distributed File System (HDFS) library
and configuration files can be accessed by InfoSphere Information Server. You
set the environment variables on the engine tier. You configure environment
paths on the computer where the InfoSphere DataStage engine is installed.
7. Transformer loops
You can specify that the Transformer stage repeats a loop multiple times
for each input row that it reads. The stage can generate multiple output rows
corresponding to a single input row while a particular condition is true. You can
use stage variables in the specification of this condition. Stage variables are
evaluated once after each input row is read, and therefore hold the same value
while the loop is executed. You can also specify loop variables. Loop variables
are evaluated each time that the loop condition evaluates to true. Loop
variables can change for each iteration of the loop, and for each output row that
is generated. The name of a variable must be unique across both stage variables
and loop variables.
Any loop variables you declare are shown in a table in the right pane of
the links area. The table looks like the output link table and the stage
variables table. You can maximize or minimize the table by clicking the
arrow in the table title bar.
The table lists the loop variables together with the expressions that are
used to derive their values. Link lines join the loop variables with input
columns used in the expressions. Links from the right side of the table
link the variables to the output columns that use them, or to the stage
variables that they use.
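A loop that emits several output rows per input row can be sketched as a Python generator. The qty-driven condition below is an assumed example, not the stage's syntax.

    def transformer_with_loop(rows):
        for row in rows:
            # Stage-variable analogue: evaluated once per input row and
            # constant during the loop (the qty-driven count is assumed).
            copies = row["qty"]
            # Loop-variable analogue: changes on every iteration.
            for loop_index in range(copies):
                yield {**row, "seq": loop_index + 1}

    print(list(transformer_with_loop([{"item": "A", "qty": 3}])))
    # -> three output rows for the single input row, seq 1..3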
Guidelines:
a. Try to keep the canvas clean. Try to maintain a left-to-right, top-down design approach.
b. When writing intermediate results that will be shared between parallel jobs, try to write to data
sets whenever possible. When writing to a dataset, ensure that the data is partitioned, and that
the partitions, and sort order, are retained at every subsequent stage. This is helpful specifically
when processing large volumes of data.
c. Whenever possible, try to run smaller jobs concurrently in order to optimize overall processing
time.
d. Parameterize jobs as much as possible.
e. Use shared containers to share common logic across a number of jobs. Remember that shared
containers are inserted when a job is compiled. If the shared container is changed, the jobs using
it will need recompiling.
f. Avoid unnecessary type conversion. If you are using stage variables on a Transformer stage,
ensure that their data types match the expected result types.
g. Avoid using a transformer for tasks that can be accomplished through other parallel stages. Type
conversion, column name changes, filter by column are examples where modify or filter stage can
be used instead of a transformer.
Use a Copy stage rather than a Transformer for simple operations such as:
1. Providing a job design placeholder on the canvas. (Provided you do not set the
Force property to True on the Copy stage, the copy will be optimized out of the
job at run time.)
2. Renaming columns.
3. Dropping columns.
4. Implicit type conversions.
h. Careful job design can improve the performance of sort operations, both in standalone Sort
stages and in on-link sorts specified on the Input/Partitioning tab of other stage types. Be aware of
DataStage-inserted sorts in cases where they are not required. Stages like Join require sorted input
data. When the input data is sorted externally, as in a database, a Sort stage with the "Don't sort,
previously sorted" option on the hash key columns prior to the Join stage will ensure that the data
is not re-sorted and will preserve pipeline parallelism.
Look at job designs and try to reorder the job flow to combine operations around the same
sort keys if possible, and coordinate your sorting strategy with your hashing strategy. It is sometimes
possible to rearrange the order of business logic within a job flow to leverage the same sort order,
partitioning, and groupings. If data has already been partitioned and sorted on a set of key columns,
specify the "Don't sort, previously sorted" option for the key columns in the Sort stage. This reduces
the cost of sorting and takes greater advantage of pipeline parallelism. When writing to parallel data
sets, sort order and partitioning are preserved. When reading from these data sets, try to maintain
this sorting if possible by using the Same partitioning method.
i. Remove unneeded columns. Remove unneeded columns as early as possible within the job
flow. Every additional unused column requires additional buffer memory, which can impact
performance and makes each row transfer from one stage to the next more expensive. If possible,
when reading from databases, use a select list to read just the columns required, rather than the
entire table.
j. Avoid reading from sequential files using the Same partitioning method. Unless you have
specified more than one source file, this will result in the entire file being read into a single
partition, making the entire downstream flow run sequentially unless you explicitly repartition.
k. Wherever possible allow the source database to handle transformations by adding sort or
transformation logic to extract SQL, instead of using the Transformer stage.
l. The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is
small enough to fit into available physical memory. Each lookup reference requires a contiguous
block of physical memory. The Lookup stage requires all but the first input (the primary input) to fit
into physical memory. If the reference to a lookup is directly from a database table and the
number of input rows is significantly smaller than the reference rows, 1:100 or more, a Sparse
Lookup may be appropriate. If performance issues arise while using Lookup, consider using the
Join stage. The Join stage must be used if the datasets are larger than available memory
resources.
m. Intermediate files between jobs should be written to dataset files over sequential files whenever
possible.
n. When importing fixed-length data, the number of readers per node property on the Sequential File
stage can often provide a noticeable performance boost as compared with a single process
reading the data.
o. Evenly partitioned data: Because of the nature of parallel jobs, the entire flow runs only as fast as
its slowest component. If data is not evenly partitioned, the slowest component is often slow due
to data skew. If one partition has ten records and another has ten million, then a parallel job
cannot make ideal use of the resources. Setting the environment variable
APT_RECORD_COUNTS displays the number of records per partition for each component.
Ideally, counts across all partitions should be roughly equal. Differences in data volumes between
keys often skew data slightly, but any significant difference in volume (e.g., more than 5-10%)
should be a warning sign that alternate keys, or an alternate partitioning strategy, may be
required.
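A quick way to picture the check is sketched below. The partition counts would come from APT_RECORD_COUNTS output, and the 10% threshold follows the text above.

    def skew_report(partition_counts, threshold=0.10):
        # partition_counts: records per partition, as displayed when the
        # APT_RECORD_COUNTS environment variable is set.
        biggest, smallest = max(partition_counts), min(partition_counts)
        skew = (biggest - smallest) / biggest
        if skew > threshold:
            print(f"warning: partition skew {skew:.0%} "
                  f"(min {smallest}, max {biggest} records)")

    skew_report([1_000_000, 998_500, 10, 1_002_000])  # heavily skewed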
Tips:
a. To rearrange an existing job design, or to insert new stage types into an existing job flow,
first disconnect the links from the stage to be changed; the links will then retain any
metadata associated with them.
b. The Copy stage is a good placeholder between stages if you anticipate that new stages
or logic will be needed in the future without damaging existing properties and derivations.
When inserting a new stage, simply drag the input and output links from the Copy
placeholder to the new stage. Unless the Force property is set in the Copy stage,
DataStage optimizes the actual copy out at runtime.
c. You can set the OSH_PRINT_SCHEMAS environment variable to verify that runtime
schemas match the job design column definitions.
Appendices
1. Create 5 similar new Generic IDs for your project in SILAS, with a proper description in the
"title" section.
Naming standard - 7 letters (4-letter project name + 3-letter type):
<ndwh>dev,<ndwh>qa,<ndwh>prd,<ndwh>mig,<ndwh>sup,<ndwh>sec
2. The application team should request SILAS registration for any IDs, such as FTP IDs, that
need it.
3. After the IDs are created in SILAS, the application team should request these IDs in the DEV,
QA and PROD environments as a group or UNIX ID, depending on the type of ID:
http://www.request.ford.com/RequestCenter/myservices/navigate.do?query=orderform&si
d=383&
4. After the IDs/groups are created on UNIX, the batch ID should be added to the respective groups.
STEP 1. Use the IS credentials to log in to the console. (If the password needs to be reset, submit an
RC ticket to the DataStage admin team.)
STEP 2. Log in via the client using the IS credentials (the credentials used to log in to the webpage in
Step 1) to access the Project.
All standards mentioned in the ETL Standards and Guidelines document for DataStage version 7.5
should also be followed as a reference for stages and functionality that already existed in DataStage
version 7.5 and carried over to version 8.0. The evolving BIDS Standards and Guidelines document
should also be referenced periodically for updated standards.