
EDI Ascential DataStage Best Practice


TABLE OF CONTENTS
1.1. DataStage Best Practices and Standards
1.2. Naming Conventions
1.3. Job Design


1.1. DataStage Best Practices and Standards

In July 2003, Fifth Third Bank purchased the Ascential Tool Suite to provide a software infrastructure for enterprise-wide Data Quality Analysis, Data Movement, Data Cleansing and Meta Data initiatives. This document is a living one, changing over time as understanding of the tools, the bank's tasks, and the environment improves. It will be referenced during technical quality assurance exercises throughout the development life cycle.

1.2. Naming Conventions


Type              Standard                          Example
Job Category      Source+Type                       SapSelect, StageLoadDw
Job Name          Type+TargetFile                   LoadCustomerHash, AggrSales, SeqSales
Routine Category  Project+Functional Description    UnifiStringRoutines, AscentialDateRoutines
Routine Name      RoutineResult                     OracleTimestamp, CompareLastRow
Stage Name        TableName                         Customer
                  Schema+TableName                  SalesFact
Link Name         TableName + ["In" or "Out"]       SalesIn
                  Schema+Table                      SapSales
                  Type+Table                        HashCustomer

1.3. Job Design


Whenever possible, select only the columns from the data sources that you actually intend to use in the jobs. Remove unused columns from ODBC or HASH input stage definitions.

Use row selection at the data source, whenever possible. For ODBC data source stages, add WHERE clauses to eliminate rows that are not of interest. This reduces the amount of data to be processed, validated and rejected.

Perform joins, validation, and simple transformations at the source database, whenever possible. Databases are good at these things, and the more you can do during the process of data extraction, the less processing will be needed within the DataStage job.

Use a database bulk loader stage rather than an ODBC stage to populate a target warehouse table (where available). The bulk loader stages can often take advantage of database options not available via ODBC, although the functionality is generally limited to INSERTing new rows.


Optimizing an ODBC stage
o Set the "Transaction isolation level" to the lowest level consistent with the goals of the job. This minimizes the target database overhead in journaling.
o Set the "Rows per transaction" to 0 to have all inserts/updates performed in a single transaction, or set it to at least 50-100 to have transaction commits on chunks of rows. This minimizes the overhead associated with transaction commits.
o Set the "Parameter array size" to a number that is at least 1024 divided by the total number of bytes per row to be written (e.g., set it to 10 or greater for 100-byte rows). This tells DataStage to pass that many rows in each ODBC request. This "array-binding" capability requires an ODBC driver that supports "SQLParamOptions" (ODBC API Extension level 2).
o When using an ODBC stage to populate a target table, use straight INSERT or UPDATE rather than other update actions. This allows target database array binding, which is fast. Other update actions may require an attempt to UPDATE a row first, followed by an INSERT if the UPDATE fails.

When Oracle is a remote target database
o If the target database is remote Oracle, use the ORABULK stage to create the necessary data and control files for running the Oracle SQL*LOAD utility. Use an AFTER routine on the ORABULK stage to automate copying the data and control files to the target machine and starting SQL*LOAD (see the sketch below).
o To maximize loading performance, use either the structure of the DataStage job or the ORABULK stage to create multiple data files, and then use the Oracle SQL*LOAD "PARALLEL" option to perform loading in parallel. Use the SQL*LOAD "DIRECT" option (if available and configured on the Oracle database) to bypass SQL for even greater loading performance.
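The AFTER routine can be a short DataStage BASIC before/after subroutine that shells out with DSExecute. The following is a minimal sketch of that idea only; the host name, file paths, and connect string are assumptions, not site standards.

   * AFTER-routine body (standard arguments InputArg, ErrorCode): copy the
   * ORABULK data and control files to the target host, then start SQL*Loader
   * with the DIRECT and PARALLEL options. A non-zero ErrorCode stops the job.

   * Ship the files produced by the ORABULK stage (hypothetical paths/host).
   CopyCmd = "scp /data/stage/sales*.dat /data/stage/sales.ctl oracle@dwhost:/load"
   Call DSExecute("UNIX", CopyCmd, Output, SysRet)

   If SysRet = 0 Then
      * DIRECT bypasses SQL processing; PARALLEL lets the multiple data
      * files be loaded concurrently.
      LoadCmd = "ssh oracle@dwhost sqlldr userid=dw/dwpwd control=/load/sales.ctl direct=true parallel=true"
      Call DSExecute("UNIX", LoadCmd, Output, SysRet)
   End

   ErrorCode = SysRet

The actual copy and load commands depend on the connectivity available between the DataStage server and the Oracle host (FTP, scp, NFS mounts) and on how SQL*Loader is invoked there.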

Reference Lookups
o Avoid remote reference lookups on secondary inputs to Transformer stages. If you must perform such lookups on every data row (e.g., for validation and transformation), arrange to copy the reference data into a local HASHed file first (e.g., via a "BEFORE" exit routine or another ODBC=>HASH job), and then use the local HASHed file for reference lookups (see the sketch after this list).
o You can request that the file/table being used for lookups be copied into memory. This works best when reference files fit into real memory and do not cause extra paging.
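One way to stage the reference data is to run the ODBC=>HASH job from job control (or a before-job subroutine) ahead of the lookups. The job name below is invented for this sketch; the DSAttachJob/DSRunJob calls are the standard job-control interface.

   * Run a staging job that copies the reference table into a local hashed
   * file, and stop if it does not finish cleanly.
   StageHandle = DSAttachJob("CustomerToHashCustomer", DSJ.ERRFATAL)
   ErrCode = DSRunJob(StageHandle, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(StageHandle)
   Status = DSGetJobInfo(StageHandle, DSJ.JOBSTATUS)
   ErrCode = DSDetachJob(StageHandle)

   If Status <> DSJS.RUNOK Then
      Call DSLogFatal("Reference data staging job failed", "JobControl")
   End

The main job's hashed file lookup then runs entirely locally, and the pre-load to memory option can be enabled on that stage where the file fits in real memory.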

If multiple CPUs are available on the DataStage server machine, split the work into multiple parallel paths within one job, or into multiple jobs. (Note: a parallel path travels from a passive stage (SEQ, HASH or ODBC) through independent active stages to another passive stage. If two paths share any active stage, they are not parallel; they are performed in a single process and cannot take advantage of multiple CPUs.)

Perform aggregation as early as possible in a job. This minimizes the amount of data that needs to be processed in later job stages. Aggregation is faster if the input data is sorted on some or all of the columns to be grouped.

Avoid intermediate files where possible, since they add considerable disk I/O to the job.


Avoid validation and transformation that is not really needed. Every item of work done on a column value in a row adds per-row overhead. If you have control over the quality of the arriving data, do not validate things that will always be true.

Move large data files/tables in bulk where possible, instead of using remote database access. Use a "BEFORE" routine to copy source data into a local file or table.

Transform Functions
o Avoid writing custom transform functions unless you must; use BASIC expressions and built-in functions as much as possible. For example, suppose an input data row contains a separate DATE field in "mm/dd/yy" format and a TIME field in "hh:mm:ss" format, and the output row to the target database needs a single TIMESTAMP field, which requires a string of the form "yyyy-mm-dd hh:mm:ss". Rather than define a custom transform function, use the following derivation expression:

   OCONV(ICONV(Input.DATE,"D2/"),"D4-YMD[4,2,2]"):" ":Input.TIME

o When writing custom transform functions (e.g., in BASIC), avoid known slow BASIC operations (e.g., concatenation of new values onto the beginning or end of a string, pattern matching). Use built-in BASIC functions and expressions as much as possible, because they are implemented efficiently, and avoid calling additional BASIC programs. Perform initialization into private named COMMON once to minimize per-row overhead (see the sketch below). Keep in mind that every transform will be executed once per row.
o If multiple custom transforms are needed, implement them where possible as separate transform functions instead of one function with switch arguments. This minimizes the code path on individual columns that need only one of the transforms.
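As a sketch of the named COMMON pattern, the body of a custom transform function for the DATE/TIME example above could look like the following. The routine arguments (InDate, InTime) and the COMMON block name are illustrative; in a DataStage server routine the result is returned by assigning it to Ans.

   * Transform function body: arguments InDate ("mm/dd/yy") and InTime
   * ("hh:mm:ss"); the result must be assigned to Ans.
   Common /ToTimestampInit/ Initialized, OutFmt

   If NOT(Initialized) Then
      * One-time setup, executed once per process rather than once per row.
      OutFmt = "D4-YMD[4,2,2]"
      Initialized = 1
   End

   * Built-in ICONV/OCONV do the conversion; no per-row string building.
   Ans = OCONV(ICONV(InDate, "D2/"), OutFmt) : " " : InTime

The caching is trivial here, but the same pattern pays off when the per-row work depends on an expensive setup step, such as reading a configuration file or opening a hashed file.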

Parameterize DataStage Jobs
o Job parameters allow you to design flexible, reusable jobs. If you build a particular file, file location, time period, or product into the job design itself, then each time you want to use the job for a different file, file location, time period, or product you must edit the design and recompile the job. Instead of entering such inherently variable factors as part of the job design, set up job parameters that represent the processing variables (see the sketch below).
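For example, a small job-control routine can re-run one compiled job with different parameter values. The job name "LoadSalesFact" and the parameters "SourceFile" and "ProcessDate" are hypothetical; DSSetParam is the standard call for supplying parameter values at run time.

   * Run the same compiled job once per source file, changing only parameters.
   SourceFiles = "sales_east.dat" : @FM : "sales_west.dat"

   For I = 1 To DCount(SourceFiles, @FM)
      JobHandle = DSAttachJob("LoadSalesFact", DSJ.ERRFATAL)
      ErrCode = DSSetParam(JobHandle, "SourceFile", SourceFiles<I>)
      ErrCode = DSSetParam(JobHandle, "ProcessDate", OCONV(DATE(), "D4-YMD[4,2,2]"))
      ErrCode = DSRunJob(JobHandle, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(JobHandle)
      ErrCode = DSDetachJob(JobHandle)
   Next I

The same compiled job can also be launched with different parameter values interactively from the DataStage Director, with no design changes or recompilation.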
