This paper outlines what is new in the WebSphere DataStage 8.0 release. This exciting new release is a revolution
in data integration and transformation, containing many enhancements and new features, including many
requested by customers. New features that are specific to the parallel framework are also noted in
this document.
For more detailed information on these features, please consult the product documentation.
Note: This document will be updated prior to the General Availability of the DataStage 8.0 release with additional
information about the release.
What are the key WebSphere DataStage 8.0 new features and
capabilities?
• New WebSphere Metadata Server Foundation to better integrate the products across IBM
Information Server and support the enterprise; with new metadata services, DataStage provides
graphical impact analysis and job diff right within the DataStage and QualityStage Designer
• Completely Integrated Data Quality to ensure the most accurate, complete information is made
available
• Significant ease-of-use enhancements to improve usability and productivity
• New and Expanded Transformation Functionality to aid DataStage users in job design
• Continued focus on Performance and Throughput Improvements while providing detailed
performance analysis and system resource estimation
• Connectivity improvements, including the next generation of connectors
• Common and Enhanced Installation, Configuration, Administration and Reporting across
IBM Information Server
• General Enhancements related to specific customer technical support requests
• Migration, Upgrade, and Platform Support
Architecture
IBM Information Server embraces the design concepts of Service-Oriented Architecture (SOA) to deliver
multiple discrete services that hide the complexities of distributed configurations, thus allowing services to
concentrate on their functionality. With this architecture, individual components within IBM Information
Server can be used to compose intricate tasks without custom programming.
A Service-Oriented Architecture enables the design of common components which are accessible and
shared by all the other elements of the platform. These Common components allow all of the products in
IBM Information Server to operate in a uniform and well integrated manner. By eliminating duplication of
functions the architecture makes efficient use of hardware resources and reduces the amount of
development and administrative effort required to deploy a data integration platform.
[Figure: the five top-level components of the IBM Information Server architecture, including the Common Services and Common Connectivity layers]
The figure above shows the five top-level components of the IBM Information Server architecture. The
common repository and services are explained in the sections below. The common services layer is
deployed on a J2EE-compliant application server, the IBM WebSphere Application Server for the 8.0
release.
WebSphere Metadata Server also provides a number of services, located on the server for performance
and scalability, that DataStage takes advantage of. These provide, for instance, impact analysis in a
context the DataStage user can readily use. More on this in Ease-of-Use Enhancements below.
Logging
All products in IBM Information Server log messages to a common place. Customers using multiple
products in IBM Information Server no longer have to look in multiple logs for problem
determination. For DataStage users, the log is still available in the Director and through the existing
command line interfaces. Users can also view logs from the Web Console for IBM Information Server, a
browser-based interface. Administrators can assign a user a role that permits viewing logs only. This
allows developers, for instance, to view logs from a browser to aid problem determination in a
production environment, while not being allowed to do anything else, such as start jobs, change jobs,
or delete logs. This is critical for locked-down production environments. Administrators can also create
log views that restrict users to specific entries in a log.
Security
See the Security section of Enhanced Installation, Configuration, and Administration below.
Ease-of-Use Enhancements
With the Metadata Server described above, DataStage is taking advantage of the services directly in the
DataStage and QualityStage Designer. These include:
• Impact Analysis
• Object Difference (job, table, and routine difference)
• Quick Find and Advanced Find
Note: These new capabilities are available for all DataStage (and QualityStage) users, regardless of job
type (server, parallel, and job sequences).
Impact Analysis
From the contextual menu of an object in the repository tree or in some cases the stages on the design
canvas, several new options are available, including Find Dependencies and Find Where Used.
The results show “What does this item depend on?” and “Where is this item used?”, respectively, both
in a tabular view and graphically. This brings more information to the DataStage and QualityStage user
to assess, for instance, the impact of a change.
The Impact Analysis also allows the selection of an object from the result list and then shows where and
how that object is used in a flow in the Impact Analysis – Path Viewer.
In this example, Job HVCustomerContainerStanFreq has a process stage, which has the Standardized
output link, containing 20 columns, which came from the BankDemoAccounts table.
The object editor can also be launched from this viewer.
The graphical view has navigation features including a bird’s eye view and zooming. Results can be
printed or saved to an XML file for additional processing, or remote user viewing, and can also be
published to the Reporting Console.
Object Difference
Object difference is now available for jobs, routines, and table definitions. A textual report in a DataStage
context is returned.
Hot links inside the report bring the user to the relevant editor in the Designer for the object selected.
Jobs and table definitions can also be compared across projects. The user is required to log into the
second project. The results are then shown as described above. This new feature will significantly
improve productivity for DataStage and QualityStage users.
The Job Report, first introduced with the DataStage 7.5 release, along with the Impact Analysis, Job Diff,
and Advanced Find results, can be printed, and all of them can be published to the Reporting Console for
viewing from a web browser by anyone with access (see below).
Find
Customers have built up their DataStage repositories with literally tens of thousands of objects – jobs, job
sequences, shared containers, table definitions, and more. Finding those objects can, at times, be a
daunting task, even for the most organized and well documented repositories.
The 8.0 release adds new Quick and Advanced Find features to make it easier to locate objects and work
with them.
Available at the top of the Repository View and from the toolbar, the Quick Find allows users to locate
objects with the following capabilities:
• Find Name (full and partial)
• Wild card support
• Find next
• Filter on object type
• Include the objects’ descriptions in the search
The Advanced Find, available from an Object’s contextual menu and the
tool bar, allows the user to add more advanced filtering criteria to the
Find:
• Object type
• Creation
o Date/time
o User
• Last modification
o Date/time
o User
• Where used (What other objects use this object)
• Dependencies (What does this object use)
Now a user can, for example, find all the jobs changed within the past
week by Keith.
Advanced options include the ability to restrict matching by case and to match on “name
and description” or “name or description”.
Results from both the Find and Advanced Find are the same as from
Impact Analysis results (tabular view). Quick Find is available anywhere
you browse the repository, for example from a stage editor when you are
browsing for a table definition and in the new Export dialog.
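The Advanced Find criteria above amount to predicates over object metadata. A sketch of the “jobs changed within the past week by Keith” query follows; the field names and repository contents are invented for illustration:

```python
from datetime import datetime, timedelta

def advanced_find(objects, obj_type=None, modified_by=None,
                  modified_within=None, now=None):
    """Filter repository objects on type, last-modifying user, and recency."""
    now = now or datetime.now()
    hits = []
    for obj in objects:
        if obj_type and obj["type"] != obj_type:
            continue
        if modified_by and obj["modified_by"] != modified_by:
            continue
        if modified_within and now - obj["modified_at"] > modified_within:
            continue
        hits.append(obj["name"])
    return hits

# Illustrative repository contents.
repo = [
    {"name": "JobA", "type": "job", "modified_by": "Keith",
     "modified_at": datetime(2006, 6, 5)},
    {"name": "TableDefB", "type": "table", "modified_by": "Keith",
     "modified_at": datetime(2006, 6, 5)},
    {"name": "JobC", "type": "job", "modified_by": "Ann",
     "modified_at": datetime(2006, 6, 5)},
]
result = advanced_find(repo, obj_type="job", modified_by="Keith",
                       modified_within=timedelta(days=7),
                       now=datetime(2006, 6, 8))
print(result)  # ['JobA']
```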
The new repository provides locking semantics that now allow more than one
user to have a job open. The first user in opens the job for write, and the
second user opens the job read-only (the second user is presented with a
dialog informing them who has the job locked). This is a long-standing
enhancement request from customers that provides increased collaboration.
In addition, the repository tree now has an “expanded view” to optionally show
more details of repository items. The properties visible are configurable and
each column in this expanded view is sortable.
Job Parameter Sets enable users to share job parameters and their associated values across multiple
jobs. This makes it easier to share common properties and also enables easier deployment of jobs
across machines. And, since Job Parameter Sets are objects in the repository, the user can perform
impact analysis to see where (which jobs) use a particular Job Parameter Set.
A new parameter set dialog allows parameters to be added to the parameter set. Environment variables
can also be added to a parameter set.
The Values tab on the Parameter Set dialog is used to specify sets of values to be used for the
parameters in this set when executing a job. Each set of values is stored in a file of the given name when
the parameter set is saved. Parameter set files are stored in the DataStage directory at the same level
as the project folder. The Values file can be changed dynamically and the values will be picked up by the
job when it is run.
Once a parameter set is associated with a job, it can be used in the stages of the job by referencing
ParameterSet.Parameter.
When a job is executed either through the Director (see below) or the command line (dsjob...-param
KTParamSet=TestSystem), the user can specify which set of values to use for the parameter set.
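Conceptually, a parameter set is a named group of parameters plus one or more named value files, and a stage property reference like ParameterSet.Parameter resolves against whichever value file is chosen at run time. The following is a simplified model of that resolution; the parameter names and values are invented for illustration:

```python
# Illustrative parameter set with defaults and two named value files.
param_set = {
    "name": "KTParamSet",
    "defaults": {"DBName": "dev_db", "BatchSize": "1000"},
    "value_files": {
        "TestSystem": {"DBName": "test_db"},
        "Production": {"DBName": "prod_db", "BatchSize": "5000"},
    },
}

def resolve(ref, value_file=None):
    """Resolve a 'SetName.ParamName' reference against a chosen value file."""
    set_name, param = ref.split(".")
    assert set_name == param_set["name"]
    values = dict(param_set["defaults"])
    if value_file:
        # Values in the chosen file override the defaults.
        values.update(param_set["value_files"][value_file])
    return values[param]

print(resolve("KTParamSet.DBName", "TestSystem"))    # test_db
print(resolve("KTParamSet.BatchSize", "TestSystem")) # 1000 (falls back to default)
```

Swapping the value file name at execution time changes every resolved value at once, which is what makes deployment across machines easier.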
Documentation Improvements
DSEE-TFRS-00013
The record_format variable must have a sub-property type. {0} was the returned value.
Explanation:
The record_format variable was used without a sub-property.
User response:
You must use either the implicit or varying sub-property when you use the record_format variable. Use varying to
specify one of the following blocked or spanned formats: V, VB, VS, or VBS. Data is imported by using the selected
format. If you use the implicit sub-property, data is imported or exported as a stream with no explicit record boundaries.
Lookup Enhancement
The Lookup stage has been extended to support lookup on a range of values. It now allows a single or
multiple row result of “input field A is between table field B and table field C.” This is very useful for date
processing.
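A range lookup of this kind (“input field A is between table field B and table field C”) can be sketched as follows. Matching a date to the reference row whose validity range contains it is the typical use; the table contents here are invented for illustration:

```python
from datetime import date

# Illustrative reference table: each row carries a validity range
# (fields B and C in the description above).
rate_table = [
    {"rate": 0.04, "valid_from": date(2005, 1, 1), "valid_to": date(2005, 12, 31)},
    {"rate": 0.05, "valid_from": date(2006, 1, 1), "valid_to": date(2006, 12, 31)},
]

def range_lookup(key, table, low="valid_from", high="valid_to"):
    """Return every row where low <= key <= high (multi-row results allowed)."""
    return [row for row in table if row[low] <= key <= row[high]]

matches = range_lookup(date(2006, 3, 15), rate_table)
print([m["rate"] for m in matches])  # [0.05]
```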
Slowly Changing Dimension (SCD) Stage
The new SCD stage processes each source record against the dimension table:
o Typically the dimension row will be found. If not, a dimension row needs to be created.
If a dimension row is found but needs to be updated, the update is performed
o The source data is augmented by the inclusion of the surrogate key, and is reduced by
the elimination of non-fact data (i.e., data that is present in the input only in the case that
a dimension row would need to be created or updated)
• The record is written or loaded into the fact table (with all surrogate keys)
The SCD stage also introduces a new “Fast Path” concept for improved usability and faster
implementation. The fast path walks the user through the screens/tabs of the stage properties required
to process the stage. Help is available for each tab by hovering the mouse over the “I” in the lower left.
The first tab of the fast path in the dialog above defines the output link from the SCD stage.
The second step matches the source column with the dimension column to define the lookup. For
performance reasons, DataStage will only load the latest dimension records into memory for each
partition.
The next tab defines how to create the surrogate key information. As described above, DataStage now
handles surrogate key generation and management across job runs. In this example, a specific file is
used. A job parameter can also be used to specify a file name. Alternatively, a database (DB2 or
Oracle) can be used.
Step 4 defines how to detect changes to dimension records and what data to use when records are
created or updated. For Type 2 dimensions, for instance, the user defines the current record indicator
versus the history records of the dimension.
Finally, map the output columns coming out of the SCD stage. The next stage could be another
dimension, or any other DataStage stage.
The new SCD stage will greatly enhance productivity of users that are working with star schema data
warehouses.
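The dimension-processing steps above can be sketched end to end for a Type 2 dimension: look up the current row, expire it on change, insert a new row with a fresh surrogate key, and pass the key on to the fact record. This is an illustrative model of the technique, not the stage's implementation:

```python
import itertools

surrogate_keys = itertools.count(1)
dimension = {}  # business key -> list of versions; last version is current

def process_type2(business_key, attrs):
    """Return the surrogate key for this source record, maintaining history."""
    history = dimension.setdefault(business_key, [])
    current = history[-1] if history else None
    if current and current["attrs"] == attrs:
        return current["sk"]             # unchanged: reuse the existing key
    if current:
        current["is_current"] = False    # expire the old version
    row = {"sk": next(surrogate_keys), "attrs": attrs, "is_current": True}
    history.append(row)                  # insert the new current version
    return row["sk"]

sk1 = process_type2("CUST-42", {"city": "Boston"})
sk2 = process_type2("CUST-42", {"city": "Boston"})  # no change: same key
sk3 = process_type2("CUST-42", {"city": "Austin"})  # change: new key, old row expired
print(sk1, sk2, sk3)  # 1 1 2
```

Note that only the latest version of each business key needs to be in memory to make this decision, which is why DataStage loads just the current dimension records per partition.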
©2006 IBM Corporation. All Rights Reserved. Page 18
What’s New in WebSphere DataStage 8.0
Performance Improvements
The enhancements described in this section are specific to the DataStage parallel environment.
Resource Estimation
Predicting hardware resources needed to run DataStage jobs in order to meet your processing time
requirements can sometimes be more of an art than a science. With new sophisticated analytical
information and deep understanding of the parallel framework, IBM has added Resource Estimation to
DataStage (and QualityStage) 8.0.
With a job open, a new toolbar option is
available called Resource Estimation.
This option opens a new dialog called
Resource Estimation. The Resource
Estimation works by first creating a model of
the DataStage job. There are two types of
models that can be created:
• Static. The static model does not actually run the job to create the
model. CPU utilization cannot be estimated, but disk space can be. The record size is always
fixed. The “best case” scenario is considered when the input data is propagated. The “worst
case” scenario is considered when computing record size.
• Dynamic. The Resource Estimation tool actually runs the job with a sample of the data. Both
CPU and disk space are estimated. This is a more predictable model to use for estimating.
Resource Estimation is used to project the resources required to execute the job based on varying data
volumes for each input data source.
A projection is then executed using the model selected. The results show the total CPU needed, disk
space requirements, scratch space requirements, and more.
Different projections can be run with different data volumes and each can be saved. Graphical charts are
also available for analysis, which allow the user to drill into each stage and each partition. A report can
be generated or printed with these estimations.
This new feature will greatly assist users in estimating the time and machine resources needed for job
execution.
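At its simplest, the dynamic model amounts to measuring resource use on a sample run and extrapolating to the projected input volume. The sketch below is a deliberately simplified linear projection; the real tool models per-stage and per-partition behavior:

```python
def project(sample_rows, sample_cpu_sec, sample_disk_mb, target_rows):
    """Scale CPU and disk measured on a sample run up to the target volume,
    assuming (for illustration) that both grow linearly with row count."""
    factor = target_rows / sample_rows
    return {"cpu_sec": sample_cpu_sec * factor,
            "disk_mb": sample_disk_mb * factor}

# Sample run: 10,000 rows took 2 CPU-seconds and 50 MB of disk.
estimate = project(10_000, 2.0, 50.0, 1_000_000)
print(estimate)  # {'cpu_sec': 200.0, 'disk_mb': 5000.0}
```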
Performance Analysis
Isolating job performance bottlenecks during a job execution or even seeing what else was being
performed on the machine during the job run can be extremely difficult. DataStage 8.0 adds a new
capability called Performance Analysis. It is enabled through a job property on the Execution tab which
collects data at job execution time. Note: by default, this option is disabled. Once enabled and with a job
open, a new toolbar option is available called Performance Analysis.
Connectivity Improvements
New GUI
A new common GUI is provided for each common connector. A navigator panel allows users to select
stages and links easily with Explorer-style navigation. Drag and drop connection objects make it easy to
configure a connection (see below). The SQL Builder is integrated to assist users in building SQL
statements. The source/target and properties are validated at design time, with warning indicators for
properties requiring user attention. Job parameters can be used/inserted for any property.
[Figure: the new connector stage editor, showing the Stage/Link overview panel, stage and link properties in the Explorer model, and the built-in connection test]
BLOB Support
With the new common connectors, DataStage has been extended to support BLOBs. BLOB support
allows BLOBs to be moved from a data source to a target without paying a huge performance penalty.
As BLOBs typically are not manipulated as part of a data integration flow, they are referenced by a
location in the job rather than sending the BLOB through the DataStage job itself. Only when the target is
written is the BLOB moved. BLOB support will be added as the new common connectors are released
after the 8.0 release.
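Passing BLOBs “by reference” means the flow carries only a small locator until the target is written. A sketch of the idea follows; the locator format and record fields are invented for illustration:

```python
def extract(records):
    """Source stage: replace each BLOB payload with a locator."""
    blob_store, refs = {}, []
    for i, rec in enumerate(records):
        locator = f"blob://{i}"
        blob_store[locator] = rec.pop("image")   # payload stays at the source
        refs.append({**rec, "image_ref": locator})
    return blob_store, refs

def load(blob_store, refs):
    """Target stage: resolve locators to payloads only at write time."""
    out = []
    for r in refs:
        rec = {k: v for k, v in r.items() if k != "image_ref"}
        rec["image"] = blob_store[r["image_ref"]]
        out.append(rec)
    return out

store, rows = extract([{"id": 1, "image": b"\x89PNG..."}])
# ...rows flow through the job carrying only small locators...
written = load(store, rows)
print(written[0]["id"], len(written[0]["image"]))
```

Because the intermediate stages only ever see the locator, the large payload crosses the network once, at load time, rather than at every stage boundary.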
Connection Objects
Connection objects are new objects that hold connection path information (username, password,
database name, etc.) for a particular source or target, which allows connection information to be saved
and reused. Connection objects can be created manually, during metadata import, and from a stage editor.
Connection objects are used to save stage connection properties to be later used when building a job.
They can be dragged and dropped from the repository tree and also be used for metadata import from
that source or target. Drag and drop the table imported from that source or target onto the canvas to
create a pre-populated stage instance. Connection objects are used “by reference” at design time. The
stage editor displays the current state of the Data Connection, not the state when it was first loaded in the
stage instance.
Installation
The installation process has been completely re-written, with all of the software of IBM Information Server
in one platform installation process and media. Multiple CDs just for DataStage are a thing of the past.
Also, authorization codes are gone. Licensing is done by IBM through a simple licensing file that is read
at installation time.
Security
User creation, group assignment, and role assignment are now done in the Web Console for IBM
Information Server. Integration with LDAP or Active Directory is also provided. All products in IBM
Information Server, including DataStage, authenticate using this new service. This provides one place
for user ID administration for all products.
Note: LAN Manager support is removed from the DataStage Client logon screen with the 8.0 release.
You will no longer see the “Omit” check box.
DataStage Administration
The DataStage Administrator client tool still exists for DataStage (and QualityStage) specific
administration tasks. The DataStage Administrator client tool is used to set up DataStage and
QualityStage projects, assign users and roles, and perform other DataStage-specific tasks. Only authorized
DataStage administrator-level users can use the DataStage Administrator tool.
The DataStage user roles have been expanded with the DataStage 8.0 release.
• There is a new “DataStage Administrator” role at the IBM Information Server level for DataStage
and QualityStage use of the DataStage Administrator tool.
• A new “Super Operator” role can run and view objects in the Designer, but cannot change
them.
Reporting Console
A new browser-based Reporting Console is provided with IBM Information Server. Reports are available
to users who have access. The products of Information Server, such as Information Analyzer, publish
reports to the Reporting Console. Information Analyzer will publish reports on the results of data profiling.
DataStage and QualityStage can publish reports such as the job report, results of Find & Impact Analysis,
and more.
New Source-to-Target and Target-to-Source Job and Database reports are available in the Reporting
Console. These allow users to build a report by selecting a job and the columns to see in the
report. The report traverses the job either forward (source-to-target) or in reverse (target-to-source),
following the selected columns and their transformations inside the job.
General Enhancements
An expand FILLER capability has been added to the CFF stage in the WebSphere DataStage MVS
Edition. DataStage MVS Edition also gets the benefit of many of the services explained above including
“where used” impact analysis and Find.
DataStage Enterprise Edition has better handling of failed conversions in the transformer. For example,
when a string-to-decimal conversion fails, the failure used to be reported only in the log; the record is
now sent down the reject link, if one exists for the transformer.
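The reject-link behavior can be sketched as: attempt the conversion, and route the record to a reject output instead of merely logging when it fails. This is a simplified model of the transformer semantics, with invented record fields:

```python
from decimal import Decimal, InvalidOperation

def transform(records):
    """Split records into converted output and rejects on failed conversion."""
    output, rejects = [], []
    for rec in records:
        try:
            output.append({**rec, "amount": Decimal(rec["amount"])})
        except InvalidOperation:
            rejects.append(rec)  # sent down the reject link, not just logged
    return output, rejects

out, rej = transform([{"id": 1, "amount": "19.99"},
                      {"id": 2, "amount": "N/A"}])
print(len(out), len(rej))  # 1 1
```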
In DataStage Enterprise Edition, the CFF stage adds better support for scaled COMP types, such as
S9(16)V99. Previously DataStage would read this in as an integer, and the user would need to divide by
100 to get the right value. DataStage now handles this transparently.
More Information
Contact your IBM representative or log on to www.ibm.com.