
Information Server 8.1 Introduction

Course Outline
This course will explain the concepts of DataStage, its architecture, and how to apply it to a 'real life' scenario in a business case study in which you'll solve business problems. We will begin by looking at the big picture and discuss why businesses need ETL tools and where DataStage fits in the product set. Once we've talked about the very basic architecture of DataStage, we'll investigate a business case study and learn about a company called Amalgamated Conglomerate Corporation (ACC) - a fictitious holding company - and its business and technical needs. We'll then go about solving this company's problems with DataStage in a Guided Tour Product Simulation. In a practice environment, you'll become ACC's Senior DataStage Developer. In this capacity, you will assess an existing DataStage Job, build your own Job, modify a Job, and build a Sequence Job. Using the DataStage clients, you'll log onto the DataStage server and look at a Job that was built previously by the former DataStage Developer at ACC. You'll then build your own Job by importing metadata, building a Job design, compiling, running, troubleshooting, and then fixing this Job. You'll then modify a Job and finally build a special type of Job called a Sequence Job. Let's get started! This section begins by talking about the DataStage tool, what it does, and some of the concepts that underpin its use. We'll discuss how it fits in the Information Server suite of products and how it can be purchased as a stand-alone product or in conjunction with other products in the suite. We'll briefly cover DataStage's architecture and its clients.

Business Intelli Solutions Inc

www.businessintelli.com


What Is DataStage? InfoSphere DataStage is an ETL (Extract, Transform, and Load) tool that is part of the InfoSphere suite of products. It functions as a stand-alone product or in conjunction with other products in the suite. DataStage provides a visual UI, which you can use in a point-and-click fashion (a non-linear, visual style of programming) to quickly build DataStage Jobs that will perform extractions, transformations, and loads of data for use in data warehousing, system migrations, data integration projects, and data marts.

Sometimes DataStage is purchased with QualityStage, which performs data cleansing (we'll touch on it in this training, but be sure to take the Introduction to QualityStage FlexLearning module as well). When implementing DataStage as a stand-alone product, you will still need to install InfoSphere Information Server components with it. Other components or products in the suite can be added at a later time. The other products in the suite are QualityStage, Information Analyzer, Business Glossary, and Metadata Workbench. InfoSphere Information Server is a suite of tools that is used to manage your information needs. Several components come with Information Server. One of the main components is DataStage, which provides the ETL capability. IBM InfoSphere Metadata Workbench provides end-to-end metadata management, depicting the relationships between sources and consumers.

IBM InfoSphere Change Data Capture Provides a real-time change data capture (CDC) and replication solution across heterogeneous environments.

IBM InfoSphere Change Data Capture for Oracle Replication Provides a real-time data distribution and high availability/disaster recovery solution for Oracle environments.

IBM InfoSphere Information Analyzer Profiles and establishes an understanding of source systems and monitors data rules.

IBM InfoSphere Business Glossary Creates, manages, and searches metadata definitions.

IBM InfoSphere QualityStage Standardizes and matches information across heterogeneous sources.

IBM InfoSphere DataStage Extracts, transforms, and loads data between multiple sources and targets.

IBM InfoSphere DataStage MVS Edition Provides native data integration capabilities for the mainframe.

IBM InfoSphere Federation Server Defines integrated views across diverse and distributed information sources, including cost-based query optimization and integrated caching.

IBM InfoSphere Information Services Director allows information access and integration processes to be published as reusable services in a service oriented architecture.

IBM InfoSphere Information Server FastTrack simplifies and streamlines communication between the business analyst and the developer by capturing business requirements and automatically translating them into DataStage ETL jobs.

Connectivity Software provides efficient and cost-effective cross-platform, high-speed, real-time, batch, and change-only integration for your data sources. This training will focus on DataStage, which gives you the ability to import, export, create, and manage metadata from a wide variety of sources for use within DataStage ETL Jobs. After Jobs are created, they can be scheduled, run, and monitored, all within the DataStage environment.

DataStage provides a point-and-click user interface in which there is a Canvas. You drag and drop icons that represent objects (Stages and Links) from a Palette onto the Canvas to build Jobs. For instance, you might drag and drop icons for a source, a transformation, and a target onto the Canvas. The data might then flow from a Source Stage via a Link to a Transformation Stage, then through another Link to a Database Stage. Stages and Links present a graphical environment that guides you, the developer, through the needed steps. You are presented with fill-in-the-blank-style boxes that enable you to quickly and easily configure data flows. When connecting to a database, for example, you drag a proprietary database Stage icon onto the Canvas; you don't have to worry about how that database's native code works to still be able to leverage its high performance. After configuring some particulars and doing a compile, the result is a working Job that is executable within the DataStage environment, as well as a Job design that provides a graphical view of how the data or information is flowing and being used.


Information Server Backbone In order to understand the environment in which we will work as Developers, let's first very briefly look at the architecture and framework behind the InfoSphere Information Server and its various components (collectively known as InfoSphere). InfoSphere has various product modules within it, including the Information Services Director, the Business Glossary, Information Analyzer, DataStage, QualityStage, and Federation Server.

These product modules are all supported by common metadata access services and metadata analysis services. They all sit on top of the repository. They all work with one another and can be used as an integrated product suite, or DataStage can be deployed as a stand-alone application. Whichever component or components you have, you will need to install the basics of the Information Server itself, including the repository and the application layer. From there, this 3-tiered architecture fulfills many information integration needs, allowing metadata and terminology collected with one product during one phase of a project to flow to other products throughout the enterprise, enabling common understandings of data and leveraging common discoveries.


Let's just review the products in the suite again. DataStage, by itself, can pull and push metadata, table definitions, and information from and to various targets and sources of data; DataStage provides the components needed to get data from one place to another and manage it on an ongoing basis. The Information Services Director lets you turn your Jobs into real-time Web Services. The Business Glossary maps your business terms to your technical terms so that all the members of your organization can have common nomenclature during all business activities and stages of technical development, thus reducing ambiguity among varying audiences. The Information Analyzer allows you to analyze data in databases, files, and data stores and gives you the knowledge to determine what types of Jobs need to be developed. DataStage and QualityStage allow you to then manipulate the data and cleanse it in-line while you're working with it. DataStage and QualityStage have been integrated into a single canvas so that all your activities within a Job can flow from a traditional DataStage Stage to one of the QualityStage Stages, thereby allowing you to cleanse your data inline in the course of doing your other ETL activities. This tutorial will focus primarily on the ETL activities, with a little bit of knowledge of the QualityStage Stages that go along with them. The Federation Server is a tool that lets you pull together disparate and heterogeneous data stores into a single point of view for the organization's entire data world: databases still stored on mainframes, as well as on a variety of other computers, are made to look like one repository.

These product modules feed DataStage, allowing you to create Jobs that will move data and/or cleanse it as necessary in a highly efficient manner. Using DataStage, you will be able to build data warehouses and other new repositories for integrated forms and views of data for your entire organization. DataStage is also very useful in data migration activities, EDI, and other means of data communication. As mergers and acquisitions occur, for example, the underlying IT systems can, using DataStage, be brought together into a single trusted view or common unified view with which to make better business decisions and implementations. DataStage can help with ordering and receiving activities as well, by putting data onto and/or pulling data from an EDI wire, parsing it effectively, and moving it into a data warehouse or to online operating systems so that you can have quicker access and more robust delivery of your business data. DataStage also integrates with tools beyond databases, such as MQ Series, which provides reliable delivery of messaging traffic. DataStage has both real-time and batch capabilities. Now that we've talked about the overall Information Server's architecture and components, let's take a more detailed look at DataStage's components.


DataStage Architecture DataStage has a 3-tiered architecture consisting of clients, dedicated engines, and a shared repository. These are all components within the InfoSphere Information Server using its shared and common repository. The clients include an Administrator client, a Designer client, and the Director client.

The Administrator is used for setting up and managing individual DataStage Projects and each Project's related common project information. The Designer is used to develop and manage DataStage applications and all their related components, such as metadata. The Designer is where you will be doing most of your development. The Director allows you to actively monitor runtime activities, watch Jobs, schedule Jobs, and keep track of events once your Job has been developed, deployed, and is running.


Inside DataStage, there are two engines: the traditional Server engine (used for DataStage Server Jobs) and the Parallel engine (used for DataStage Parallel Jobs). Server Jobs are typically single-threaded and have evolved over the years through the various versions of DataStage since its inception. The Parallel engine allows you to create Jobs that can, at runtime, dynamically increase or decrease their degree of parallelism to optimize the use of the resources at hand within your hardware. The Parallel engine can dynamically change the way in which it performs its parallelism.

The Shared or 'Common' repository holds all of the information such as Job designs, logs of the Job runs, and metadata that you will need while developing Jobs. Now, let's take a look at the Administrator client's UI.


Administrator Client We're looking at the Administrator's UI under the tab entitled 'General'. Here, you can enable job administration from within the Director client, enable runtime column propagation (RCP) for all of your Parallel Jobs, enable editing of internal Job references, enable the sharing of metadata when importing from Connectors (the DataStage objects that automatically connect to various proprietary sources and databases with minimal configuration), and configure auto-purge settings for your log. What is RCP and how does it work?
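In short, RCP lets columns that the Job design never explicitly names flow through each Stage to the target at runtime. The following plain-Python sketch (not DataStage code; the column names are hypothetical) illustrates the idea: the design-time logic touches only the columns it knows about, and everything else propagates untouched.

```python
# Conceptual sketch of runtime column propagation: with RCP on, columns
# absent from the Job design still travel through the Stage to the target.

def transform(row, designed_columns):
    """Apply design-time logic only to known columns; pass the rest through."""
    out = dict(row)                      # extra columns propagate untouched
    for col in designed_columns:
        if col in out and isinstance(out[col], str):
            out[col] = out[col].upper()  # the 'design-time' transformation
    return out

source_rows = [
    {"CUST_ID": 1, "NAME": "smith", "REGION": "east"},  # REGION not in design
    {"CUST_ID": 2, "NAME": "jones", "REGION": "west"},
]

designed = ["NAME"]  # the only column the Job design names
target_rows = [transform(r, designed) for r in source_rows]
print(target_rows[0])  # REGION reaches the target although the design never mentioned it
```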


DataStage is a role-based tool. Usually, when it is installed, the Administrator assigns roles such as Developer or Operator. This means that the UI may have different options available to different people. The Administrator, developers, and operators may not be the same person. Thus, when you look at the actual product, your UI may be different from others'. Under the Permissions tab, DataStage can assign functions to the individuals who will be using the tool. A person with the role of administrator may have more options than a developer does. There may be differing functions made available to different people. This can all be designated for each separate DataStage Project. You can thus determine which user gets what role for that particular Project.

Traces are available should you wish to put tracing on the engine. The Schedule tab allows us to dictate which scheduler will be used. In the Windows environment, we would need to designate who will be the authorized user to run Jobs from the scheduler.

The Mainframe tab allows us to set needed options when working with the mainframe version of DataStage. This would allow us to create Jobs that could then be ported up to the mainframe and then executed in their native form with the appropriate JCL. Other tabs let us set the memory requirements on a per-Project basis. The Parallel tab gives us options as to how the parallel operations will work with the parallel engine. Under the Sequence tab, the Sequencer options are set. And under the Remote tab of the

Administrator client, we make settings for remote parallel Job deployment on the USS system or remote Job deployment on a Grid.

DataStage Designer The second of the three DataStage clients is the Designer. As an ETL Developer, this is where you'll spend a good deal of your time. Let's examine the main areas of the UI. In the upper-left (shown highlighted) is the Repository Browser area. It shows us the metadata objects and Jobs found within DataStage. Toward the lower-left is the Palette. The Palette contains all of the Stages (drag-and-droppable icons). You would select a Database source icon, for instance, and drag it to the right.

On the right is the Canvas. This is where you drop the Stages that you have chosen to be part of your Job design. You can see that the Job in the graphic above is a Parallel Job and that it contains two Stages that have been joined together with a Link. Links can be added from the Palette directly, or you can right-click on a Stage, drag, and then drop on the next Stage to create the Link dynamically. Now, in this particular example, data will flow from a Row Generator Stage to the Peek Stage.
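To make the data flow concrete, here is a rough Python analogue (illustrative only, not DataStage code) of that two-Stage Job: a Row Generator producing synthetic rows and a Peek printing the first few while passing them all downstream.

```python
# Plain-Python analogue of the Row Generator -> Peek Job above.

def row_generator(n):
    """Mimic the Row Generator Stage: emit n synthetic rows."""
    for i in range(n):
        yield {"id": i, "value": f"row-{i}"}

def peek(link, first=5):
    """Mimic the Peek Stage: print the first few rows, pass every row on."""
    for i, row in enumerate(link):
        if i < first:
            print(row)        # in DataStage these lines land in the Job log
        yield row

link = row_generator(10)       # the Link carries rows between the Stages
rows = list(peek(link, first=3))
```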

The Designer is where you will create the kinds of working code that you need in order to properly move your data from one source to another. There are over 100 Stages that come with the product out of the box (at installation time, there are approximately 60 base Stages installed, an additional 40 that are optional, and 50-100 more based on service packs and add-ons; you can also build your own custom Stages).

DataStage Director The third DataStage client is the Director. The Director gives us a runtime view of the Job. Above, we can see an individual Job log. The messages and other events shown for this Job run are logged to this area and kept inside the Shared Repository. We can see the types of things that occur during a Job run, such as the Job starting, environment variables being set, the main activities of the program, and then eventually the successful completion of the Job. Should there be any warnings or runtime errors, they can be found and analyzed here to determine what went wrong in order to take corrective action. Color-coding indicates severity: green messages are informational; a yellow message is a warning that may not directly affect the running of the Job but should probably be looked at; and a red message is a fatal error, which must be resolved before the Job can run successfully.


DataStage Repository In the Repository Browser, we have various folder icons that represent different categories within the DataStage Project. Some come automatically with the Project, and then you can add your own and name them. Organize them in a way suitable to your DataStage Project, providing for easy export and import. Standard folders include those for Jobs, Table Definitions, Rule Sets, and other DataStage objects. Tip: Create a folder named after your initiative and then move all the appropriate objects for just that initiative under that one common location.

Table Definitions are also known as metadata or schemas. These terms are often used interchangeably. Each typically contains a collection of information that defines the metadata about the individual record or row within a table or file. A piece of metadata can describe column names, column lengths, or columns' data types or it can describe fields within a file.
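One way to picture what a Table Definition holds is as per-column metadata for a record or row. The structure below is purely hypothetical (it is not the repository's actual format, and the table and column names are invented); it just shows the kind of information involved.

```python
# Hypothetical representation of a Table Definition: column-level metadata
# describing names, data types, lengths, and nullability.

table_definition = {
    "name": "CUSTOMER",
    "columns": [
        {"name": "CUST_ID", "type": "Integer", "length": 10, "nullable": False},
        {"name": "NAME",    "type": "VarChar", "length": 50, "nullable": False},
        {"name": "REGION",  "type": "Char",    "length": 4,  "nullable": True},
    ],
}

# The same kind of metadata describes a database table's columns or a file's fields.
for col in table_definition["columns"]:
    print(f'{col["name"]:8} {col["type"]:8} ({col["length"]})')
```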

Other folders include one for Routines (sub-routines able to be called from within a Job). Shared Containers are pieces of DataStage Jobs that can be pulled together, used, and reused as modular components. Stage Types (highlighted) is a master list of all the Stages that are available to you in this Project. Stages can also be added later through various

options, depending on what product(s) have already been installed and on what your administrator has made available to you. The Standardization Rules folder contains out-of-the-box or default rules for QualityStage.

The Transforms folder holds Server-engine functions that can be defined once and then reused, macro-style, wherever they are called. WAVES Rules are also used by QualityStage, for its address verification modules. Match Specifications are used by QualityStage's Match Stages. Machine Profiles are used with IBM mainframes. IMS Viewsets are for working with legacy IMS types of database management systems.

Steps to Create a DataStage Job This section talks about the steps to building your first Job. First, you'll need to understand some basic concepts, then we'll talk about setting up an environment, next you'll connect to the sources, then we'll talk about how to import table definitions, and then we'll provide you with an understanding of the various types of Stages and how they are used. Then, we'll talk about working with RCP, creating Parameter Sets, and understanding how to use the CLI, and, in the next lesson, you will put all of this knowledge to use and begin building Jobs in a case-study scenario. This section covers the following:
1. Understand Some Underpinnings
   1. Types of Jobs
   2. Design Elements of Parallel Jobs
   3. Pipeline Parallelism
   4. Partition Parallelism
   5. Three-Node Partitioning
   6. Job Design Versus Execution
   7. Architectural Setup Option
2. Setting Up a New DataStage Environment
   1. Setting up the DataStage Engine for the First Time
   2. DataStage Administrator Projects Tab
   3. Environment Reporting Variables
   4. DataStage Administrator Permissions Tab
   5. Export Window
   6. Choose Stages; Passive Stages
   7. Connect to Databases
3. Import Table Definitions
4. Active Stages: The Basics
5. An Important Active Stage: The Transformer Stage
6. Advanced Concepts When Building Jobs: RCP and More
7. Pulling Jobs Together: Parameter Sets, CLI, etc.


Let's now shift our discussion to the process of developing Jobs in DataStage. What are the steps involved in developing Jobs? First, you define global and Project properties using the Administrator client that we mentioned earlier in the tutorial. This includes defining how you want the Job to be run. For instance, do you want Runtime Column Propagation, or any other common variables in your environment that you might be using throughout the Project? We'll discuss this in greater depth later in this tutorial.

Next, you go into the Designer client and import metadata into the repository for use, later, in building your Jobs. Then, using the Designer tool, you actually build the Job and compile it. The next step is to use the Director tool to run and monitor the Job (Jobs can also be run from within the Designer, but the Job log messages must be viewed from within the Director client). So, as you're testing your Jobs, it can be a good idea to have both tools open. This way, you can get very detailed information about your Job and all of the things that are happening in it.


Administrator Client Let's say that we want to create a Job that extracts data from our primary operational system. We could go to the Administrator tool and set up our Jobs to be able to use RCP, and configure the auto-purge of the logs to run however often we want. Still in the Administrator, we could then put in some common environment variables, for instance, the name of the database that is being accessed, the user ID, and the password. These can be set there and encrypted in a common location so that everyone can use them without necessarily exposing security in your system.

Designer Client To then create this kind of extract Job, we would next want to go into our Designer tool to first import metadata from the database itself for the table or tables from which we'll be pulling. Also within the Designer, we want to use Stages that will connect us to the database. We do this by dragging and dropping the appropriate Stage onto the Canvas. Inside the Stage(s), we will use the variables (that we defined in our common environment using the Administrator tool) to define the Job as we see it needs to be done. We do this by double-clicking the Stage in question and filling out the particulars that it requests. We'll pull that data out and store it either in a local repository such as a flat file (by dragging a Sequential File Stage onto the Canvas and creating a Link between the two Stages) or in one of DataStage's custom Data Sets (a proprietary data store which keeps the data in a partitioned, ready-to-use, high-speed format so that it doesn't need to be parsed for re-use, as it would otherwise need to be from a flat file). Or we could, for instance, put it directly into another database (by dragging a database Stage onto the Canvas at the end of that data flow).
The database Stages come pre-built to meet the needs of many proprietary databases and satisfy their native connectivity requirements thus providing high-speed transfer capability while eliminating the need to do a lot of the configuration.


Types of Jobs Let's talk about the three types of Jobs that we would most likely be developing. These fit into the categories of Parallel Jobs, Job Sequences (which control other Jobs), and Server Jobs (the legacy Jobs from earlier versions of DataStage).

Parallel Jobs are executed by DataStage using the parallel engine. They have built-in functionality for what are called pipeline and partition parallelism. These are means of efficiency that can be set. For instance, you can configure your parallelism dynamically. We'll discuss these things in greater detail later in the tutorial. When Jobs are compiled, they are compiled into a DataStage-specific language called OSH (Orchestrate Scripting Language). The OSH executes various Operators.

Operators are pre-built functions relating to the Stages in our Job design. You can also create custom Operators using a toolkit and the C++ language. If you have special data that needs to be written to a special device, such as one for scanning and tracking control on delivery trucks, then you could write a custom module to put the data directly into that format without having to go through any other intermediate tools. The executable work that the OSH performs is accomplished with C++ class instances. These are then monitored in the DataStage Director with the runtime Job monitoring process.

Job Sequences, as we mentioned, are Jobs that control other Jobs. Master Sequencer Jobs can, for example, kick off other Jobs and other activities including Command Line activities, other control logic, or looping that needs to go on in order to re-run a job over and over

again. The Job Sequencer can control all your Jobs at once without having to schedule each individual Job with the scheduler. Although you still have the ability to call any/each Job individually, you can also group them together through the Job Sequencer and call them that way.

In addition, you can perform activities based on Job failures, either specifically, when a particular Job fails, or globally, when anything fails and an exception is thrown. A common Command Line API is provided. Thus, you can embed it into third-party schedulers and can then execute Jobs in whatever fashion you choose.
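For instance, DataStage's dsjob command-line utility can be invoked from a scheduler's wrapper script. The sketch below builds such an invocation in Python; the Project and Job names are hypothetical, and you should check your installation's documentation for the exact dsjob options available in your version.

```python
# Sketch of driving a DataStage Job from a third-party scheduler via the
# dsjob command-line utility. Project/Job names here are invented examples.

import subprocess

def build_dsjob_run(project, job, wait_for_status=True):
    """Build a dsjob invocation that runs a Job, optionally waiting on its status."""
    cmd = ["dsjob", "-run"]
    if wait_for_status:
        cmd.append("-jobstatus")   # wait for the Job to finish and report its status
    cmd += [project, job]
    return cmd

cmd = build_dsjob_run("ACC_Project", "LoadCustomerDim")
print(cmd)
# A scheduler wrapper would then execute it, e.g.:
# subprocess.run(cmd, check=True)
```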

The third type of Job is the Server Job. Server Jobs are legacy DataStage Jobs that continue to exist and are supported as DataStage applications, but you may also find them useful for their string-parsing capabilities. These are executed by the DataStage Server engine: they are compiled into a form of BASIC and then run against the Server engine. Runtime monitoring is built into the DataStage engine and can also be viewed from the Director client. Server Jobs cannot be parallelized in the same way that Parallel Jobs can. However, you can achieve parallel processing either by sequencing these Jobs with the Job Sequencer to run at the same time, or by using other options to gain efficiencies in their performance. Legacy Server Jobs do not follow the same parallelization scheme that Parallel Jobs do. Therefore, they cannot easily have their parallelism changed dynamically at runtime, as Parallel Jobs can.


Design Elements of Parallel Jobs One of the nice things about having a visual medium like DataStage is that you can easily take the notions of Extract, Transform, and Load and represent them visually in a manner that is familiar to people. For instance, you can drag an Extract Stage onto the visual Canvas and drop it on the left of the screen, flow through transformations in the middle of the Canvas, and end with output or Load Stages on the right side. DataStage lets you place elements such as Stages and Links in any fashion that you wish, using an orderly, flow-chart-style, left-to-right, top-to-bottom visual design. This flow makes future interpretation of your Jobs simple. Those who are new to DataStage are able to look at, see, and understand the activities and flows that are going on within the DataStage Job. This also means that other Developers' Jobs will be intuitively understandable to you. When we start designing a Parallel Job, we should think of the basic elements, which are Stages and Links. Stages, as we mentioned, get implemented under the covers as OSH Operators (the pre-built components that you don't have to worry about and that will execute based on your Job design on the graphical Canvas).

Next, we have Passive Stages, which represent the E and the L of ETL (Extract and Load). These allow you to read data and write data. For example, Passive Stages include things such as the Sequential File Stage, the Stage to read DB2, the Oracle Stage, and Peek Stages that allow you to debug and look at some of the Job's interim activity without having to land your data somewhere. Other Passive Stages include the MQ Series Stage and various third-party component Stages such as the SAP and Siebel Stages.

Processing or Active Stages are the T of ETL (Transform). They do the transformation and other heavy-lifting work. These include the Transformer Stage, which allows you to transform data, manipulate column order, and perform a variety of other functions on your data to enrich it and potentially to validate and pre-cleanse it before applying some of the QualityStage Stages to it.

You will learn to rely heavily on the Transformer Stage. It provides a variety of functionality, including the ability to filter rows as well as to individually modify each column of data that goes out. It allows you to propagate the outgoing data in a number of ways and, in some cases, to merge two streams of data, or pieces of two streams, into one common stream. The Transformer is also assisted by specific Stages that filter data, perform aggregation (summing or counting data, or finding first and last values), generate data (such as generating rows or columns), and split or merge data (directing data down multiple paths simultaneously, joining streams into one common, larger row, or funneling multiple streams into rows within the same data set). Now that we've talked about some Stages, let's talk about the things that help the data flow from one Stage to the next.
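The two Transformer behaviours described above, row filtering via a constraint and per-column derivations, can be pictured roughly as follows (plain Python, not DataStage code; the column names are invented for illustration):

```python
# Rough analogue of a Transformer Stage: a row-level constraint filters rows,
# and per-column derivations reshape each row that passes.

def transformer(rows, constraint, derivations):
    """constraint: row -> bool; derivations: output column -> function of the row."""
    for row in rows:
        if constraint(row):
            yield {col: derive(row) for col, derive in derivations.items()}

input_rows = [
    {"qty": 5, "price": 2.0, "sku": "a1"},
    {"qty": 0, "price": 9.9, "sku": "b2"},   # rejected by the constraint
]

output = list(transformer(
    input_rows,
    constraint=lambda r: r["qty"] > 0,
    derivations={
        "SKU":   lambda r: r["sku"].upper(),      # column-level derivation
        "TOTAL": lambda r: r["qty"] * r["price"],
    },
))
print(output)  # [{'SKU': 'A1', 'TOTAL': 10.0}]
```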

Links are the pipes through which the data moves from Stage to Stage. DataStage is a link-based system. Although the Stages are visually predominant on the Canvas, in actuality, all the action is happening on the Links. When you open up a Stage by double-clicking it, you get various dialog boxes that refer to input and output (input to that Stage and output from that Stage). These represent the Links coming into and out of the Stage (when viewed from the Canvas).

The settings that you define in these two areas within the Stage affect the Link coming into and the Link going out of the Stage. Your settings, including the metadata that you chose, are all kept with the Link. Therefore, when you modify Jobs, it is easy to move those pieces of data (all the metadata and accompanying information) from one Stage type to another, because they stay with the Links, without losing data or having to re-enter it.


Pipeline Parallelism The purpose of parallelism is to create more efficiency and get more work out of your machine(s). This is done in two ways, the first of which is Pipeline Parallelism. Here, each of the Operators (which correspond to the various Stages that we see on our Canvas) sets itself up within its own processing space and passes data rapidly from one to the other, without having to be re-instantiated continually and without having to do all the work in a given Stage before starting the next portion of the pipelining process.

For example, this allows us to execute transformation, cleansing, and loading processes simultaneously. You can think of it as a conveyor belt moving the rows from process to process. This reduces disk usage for staging areas by not requiring you to land data as often, and it uses your processors more efficiently, maximizing the investment in your hardware. There are some limits on scalability, depending on how much CPU horsepower is available and how much memory is available to hold data in flight.
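
The conveyor-belt idea can be sketched with ordinary Python generators (an illustration only, not DataStage code): each stage pulls rows from the previous one as they become available, rather than waiting for a complete intermediate file to be landed.

```python
# A sketch of pipeline parallelism: each "stage" consumes rows as soon
# as the previous stage emits them, like a conveyor belt, instead of
# finishing one stage completely before the next begins.

def extract(rows):
    for row in rows:                 # source stage emits one row at a time
        yield row

def transform(rows):
    for row in rows:                 # cleansing stage runs as rows arrive
        yield {**row, "name": row["name"].upper()}

def load(rows):
    return list(rows)                # target stage collects the output

source = [{"name": "alice"}, {"name": "bob"}]
result = load(transform(extract(source)))
```

In real DataStage, the operators run as separate processes that truly overlap in time; generators only interleave work in one process, but the data-flow shape is the same.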

Partition Parallelism The second form of parallelism is called Partition Parallelism. This is more of a divide-and-conquer approach, in which we divide an incoming stream of data into subsets to be processed separately by an Operator. Each of these subsets is called a partition, and each partition of data is processed by the same Operator. For instance, let's say that you want to do a filtering operation: each partition will be filtered in exactly the same way, with multiple copies of the filter set up to handle the partitions. This facilitates near-linear scalability of your hardware, meaning we could run 8 times faster on 8 processors, 24 times faster on 24 processors, and so on. This assumes that the data is evenly distributed and that no other factors, such as network latency or other external issues, limit throughput.
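
A rough Python illustration (not DataStage code; the four-way split and the even-number filter are arbitrary choices for the sketch): the data is divided into partitions, and an identical copy of the same operator runs against each partition, here handed to a pool of workers.

```python
# A sketch of partition parallelism: divide the data into n partitions
# and run the SAME operator (here, a filter keeping even numbers)
# against each partition concurrently.

from concurrent.futures import ThreadPoolExecutor

def filter_op(partition):
    # every partition is processed by an identical copy of the operator
    return [x for x in partition if x % 2 == 0]

def partition_data(data, n):
    # divide the incoming stream into n subsets
    parts = [[] for _ in range(n)]
    for i, x in enumerate(data):
        parts[i % n].append(x)
    return parts

data = list(range(20))
partitions = partition_data(data, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(filter_op, partitions))   # one worker per partition
evens = sorted(x for part in results for x in part)
```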

Three-Node Partitioning Data can be partitioned into multiple streams. The way in which we partition depends on how we need to process the data. The simplest method is Round Robin, in which we assign each incoming row to the next partition in turn to create a balanced workload, much like a card dealer in Vegas dealing the deck evenly to every player around the table.

Other times, we need to segregate the data into partitions based on some key values, whether numeric or alphabetic. This is known as Hash partitioning (with numeric keys you can also use Modulus partitioning), and it keeps the relevant data together by these common values. This is very important when we are doing things such as sorting the data, or removing duplicates, on these key values. Should rows with the same key become separated into different partitions, the Operation will not work fully correctly: some of the data that should have been removed may sit in another partition and thereby be mistakenly kept alive, moving down another stream in the parallel Operation. Then, at the other end of the partitioning process, when all the data is collected, we still have multiple copies where there shouldn't be. Had we correctly put all the rows for a key into one partition, only one copy would survive, which was our desired outcome.
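
The two methods can be sketched side by side in plain Python (illustrative only; the customer values are made up). Round robin balances the load; hash keeps every row with a given key value in the same partition, which is what de-duplication and sorting rely on.

```python
# Round robin: deal rows out in turn for a balanced workload.
# Hash: route each row by its key, so equal keys are never split
# across partitions.

def round_robin(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)   # same key -> same partition
    return parts

rows = [{"cust": c} for c in ["A", "B", "A", "C", "B", "A"]]
balanced = round_robin(rows, 3)
by_key = hash_partition(rows, 3, "cust")
# every copy of a given "cust" value now sits in a single partition,
# so a duplicate-removal operator in that partition sees them all
```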

The kind of partitioning that you perform matters a lot! One of the reasons Round Robin is the default is that evenly distributed data is processed much more quickly. If you have to partition by range or by some hash partitioning scheme, the quantities of data in each partition may become imbalanced, and one or two of the partitions may finish long before the others, reducing overall performance.

Job Design Versus Execution Let's look at an example that will give you an idea of what happens when we design a Job and then what happens when we run it. At the top of the graphic, you can see data flowing from two Oracle Stages into a Merge operation. From there, we take the merged data, perform an aggregation on it, and finally send the data out to a DB2 database.

In the background, let's say that we want to partition this data four ways (in DataStage terminology, on 4 nodes). Here's what we would see: for each of the Oracle Stages (toward the left of the data flow) and the other Stages in this design, four instances would be created to handle the four separate streams of data. In this kind of activity, when doing summation, it is extremely important that we partition (segregate) our data in a way that works with our design.

Here, we want to partition by the values on which we are going to do the summation. This way, we don't accidentally sum in two different places and lose parts of our rows. Let's say that we have 5 rows that need to come together for a sum: we want them all in the same partition, so that all the summation can be done together. Otherwise, we might get four values out of this phase (one from each of the 4 instances of the Aggregator Stage), with only one of the partitions summing anything at all!
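
This requirement can be demonstrated with a short Python sketch (illustrative only; group and amount names are made up): by hash-partitioning on the grouping key before aggregating, every row for a key lands in exactly one partial sum, so merging the partials never double-counts or splits a group.

```python
# Why aggregation needs key-based partitioning: the per-key sums only
# come out right if every row for a key sits in the same partition.

from collections import defaultdict

def hash_partition(rows, n, key):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)   # co-locate equal keys
    return parts

def aggregate(partition, key, value):
    sums = defaultdict(int)
    for row in partition:
        sums[row[key]] += row[value]
    return dict(sums)

rows = [{"grp": "x", "amt": 1}, {"grp": "y", "amt": 2},
        {"grp": "x", "amt": 3}, {"grp": "y", "amt": 4}]
partials = [aggregate(p, "grp", "amt")
            for p in hash_partition(rows, 4, "grp")]
totals = {}
for part in partials:
    totals.update(part)   # safe: each key appears in exactly one partial
```

Had the rows been dealt out round robin instead, two instances could each emit a partial sum for the same group, and the final merge would have to add them back together (or would silently drop one).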

Architectural Setup Options Now that we've talked about the considerations that we must be aware of as we do our partitioning, both in the design of our Job and in the choice of partitioning strategy, let's continue our discussion in a different direction. Note that we will discuss these partitioning strategies in greater detail later in the tutorial. For now, let's talk about some architectural setup options. This section of the tutorial covers some of the underpinnings of DataStage in general, and specifically some of the architectural setups that we can use.

In one setup, we have deployed all of the Information Server DataStage components on a single machine; all of the components are hosted in one environment. This includes the WebSphere Domain, the Xmeta repository using DB2 (other databases, such as Oracle or SQL Server, can be used for the repository), and the DataStage Server itself. The clients can be stored on the same machine as well if it is a Windows machine. In other words, let's say that we are carrying a laptop around: it can contain all the co-located components (and it could still be connected to by a remote client). What are some reasons for putting all of these components on a single machine? Perhaps that is all the hardware you have available, or perhaps you have very poor network performance and don't want the resulting latency of going across the network wire to the repository.

Now let's talk about putting the components on two machines. A typical two-machine setup places the Metadata repository and the Domain layer on Machine A and the DataStage Server on Machine B, with the clients potentially connecting to Machine B remotely. One of the benefits of this separation is that the metadata repository on Machine A can be utilized concurrently (by a product such as Information Analyzer) without impacting the performance of DataStage as it runs on Machine B.

In a third scenario, the major components are separated onto different machines. The database repository is on Machine A (maybe because you are using a machine that already has your database farms installed on it). The Metadata Server backbone is installed on Machine B so that Information Analyzer and Business Glossary activities don't risk negatively impacting any development on the DataStage Server: several DataStage development teams can keep working and utilizing their resources most effectively. Finally, you have the DataStage Server on Machine C, connected to by remote clients.

Setting Up a New DataStage Environment In the last section, we talked about some different setup options. It should be said that you must determine the optimal setup for your environment depending on your own conditions and requirements. In this next section of the tutorial, we are going to talk about DataStage's architecture under the covers and some of the ways that we can set up the product. We'll talk about setting up the DataStage engine itself for the first time using the Administrator client, and then see how to set up DataStage Projects. Within a Project, we'll see how to import and export table definitions. Next, we'll look at Jobs within Projects and talk about a number of important development objects, such as Passive DataStage Stages. For example, we'll talk about the Data Set Stage, the Sequential File Stage, and various relational data Stages. We'll also talk about some Active Stages, which are important when we are trying to combine and/or manipulate data. These include the Lookup Stage, the Join Stage, the Merge Stage, and the Transformer Stage (at the heart of many Jobs). Then we'll talk about some key concepts that you'll want to know about as you develop Jobs, such as the use of Runtime Column Propagation (RCP), how to create Shared Containers for re-usability, and how to put together Sequencers to organize all of your Jobs. Then we'll talk about ways of implementing and deploying our DataStage application that can make things easier on you and the team, including creating Parameter Sets, using various Run options, and using the Command Line Interface (CLI), through which Jobs can be called from a third-party scheduler. Of course, we'll discuss parallelism more in depth. Specifically, we'll talk about where we perform parallelism, whether within Sequencers or at the Job level, and strategies for different ways to apply it. Within InfoSphere, we'll talk in more detail about the two forms of parallelism (Pipeline parallelism and Partition parallelism). We'll see how these work and what happens when a parallel Job gets compiled. Finally, we'll discuss how to apply these concepts and scale our previously constructed DataStage application up or down depending on the resources available to us.

Setting up the DataStage Engine for the First Time When we're setting up a new DataStage engine, the first thing to do is go into the Administrator client and make sure that some of the basics have been taken care of. When we log on, we are presented with the Attach to DataStage dialog box shown above. It shows the Domain to which we want to connect (with the corresponding port number of the application server), the user ID and password of the DataStage Administrator with which we will connect, and the DataStage Server name or its IP address.

Often, you will find that the Server's name is the same as the beginning portion of the Domain name (but this does not have to be the case; it depends on the architecture and name of the computer and how it is set up in your company's Domain Name Server (DNS)).

DataStage Administrator Projects Tab Within the Administrator client, we have a tab entitled Projects. We use this for each of the individual Projects that we want to set up within DataStage. The Command button allows you to issue native DataStage engine commands when necessary, such as the LIST.READU command, which shows you any locks that are still active in DataStage (as well as any other commands associated with the DataStage engine or with UniVerse, which was the predecessor of the DataStage engine).

Below that, you can see the Project pathname box. This shows us where the Project directory is located. This is extremely important, as you must know where the Project is located on the machine (it resides on the machine on which the DataStage Server has been installed). Let's take a look at the Properties button. It will show us the settings for the selected Project (datastage1 in the graphic).

The Properties button brings up the dialog box shown above. The tabs across the top each have their own options. One thing that you'll probably want to do is check the first option, which enables job administration in Director. Enabling job administration in Director allows developers to do things such as stopping their jobs, resetting their jobs, clearing job information, and so on. One time that you may NOT want to do this is in a production environment. Otherwise, you typically do want to give people the ability to administer Jobs in the Director client as they see fit.

Next is the option to enable RCP (Runtime Column Propagation) for parallel Jobs. We'll cover this in more depth a little later in the tutorial; just be aware that this is where you enable it for each Project (here, datastage1). The Enable editing of internal references in jobs option helps when dealing with lineage and impact analysis (if you change Jobs or objects over time). The Share metadata when importing from Connectors option likewise allows us to perform cross-application impact analysis.

The auto-purge of the job log option is very important. You always want to set up default purge options so that the logs don't fill up and consume large amounts of disk space on your machine (the machine on which the DataStage Server is installed, not necessarily the machine that houses your clients, such as the Administrator client).

You have the choice of purging logs based on the number of runs or the number of days old. One strategy is to keep the 5 previous runs in your development and test environments, which may go for long periods between runs of a particular Job. In your production environment, where you plan on running your Jobs daily, you may instead want to purge every 30 days.

Next is a button that allows you to protect a Project. It is useful in production, where you don't want people to make modifications to Jobs; since most people want to make changes regularly during development work, it is not very useful for development environments. The Environment button allows you to set environment variables for your entire Project. These are Project-specific and separate from your overall DataStage Environment variables, which are global to the entire Server; Project-specific Environment variables apply only to this one particular Project. The Generate operational metadata checkbox allows data about the runs themselves, in particular the row counts, to be captured for later analysis. Metadata is usually understood as descriptive (how big is a field, what type of data is it), but operational metadata tells us things like how frequently a Job has been run and how much volume has gone through it. This can be important later as you determine your scalability and future considerations such as capacity planning.

When you click the Environment button, it opens the Environment variables dialog box. There are two panes in the window. On the left, we have our categories of environment variables, such as General (which includes the Parallel environment variables) and User Defined. For each category, the pane on the right displays the name of each environment variable, its prompt, and the value that will be used for the entire DataStage Project.

Environment Reporting Variables A standard category is the reporting environment variables, such as APT_DUMP_SCORE, APT_MSG_FILELINE, and APT_NO_JOBMON. By setting these here, you can have all the developers on the Project use them in a standardized way. However, each of these variables can also be set individually in each Job. That is to say, at runtime, one Job can use a different value from the rest of the Project for the same environment variable.
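
As a rough illustration, a project might standardize its reporting variables with settings like the following. The variable names are the ones mentioned above; the values shown are common choices for a development project, not requirements, and any one Job can still override them at runtime.

```shell
# Project-level reporting defaults (set via Administrator > Environment).
APT_DUMP_SCORE=True      # print the parallel "score": operators, nodes, datasets
APT_MSG_FILELINE=True    # include source file/line info in framework messages
APT_NO_JOBMON=False      # leave the Job monitor enabled
```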

DataStage Administrator Permissions Tab The Permissions tab is where we set up users and groups with a variety of authorizations. We add a user and then assign a product role. The option allowing a DataStage Operator to view the full log is typically enabled, unless you have a particular reason not to, such as sensitive data or Operators being overwhelmed by too much information. Normally, we allow Operators to view the full log so that, if they encounter problems during a DataStage Job, they can easily communicate that information to their second line of support when escalating a problem.

Once we create a new user, we can see in the User Role drop-down menu that there are several roles available for us to assign to that user or group. These consist of Operator, Super Operator, Developer, and Production Manager.

Let's continue our discussion by moving over to the Tracing tab. This allows us to perform engine traces. Typically, you will not use this unless you are working with IBM support on a difficult problem that cannot be solved any other way.

The Parallel tab contains some options. The checkbox at the top allows you to see, from within your Job properties box, the Orchestrate Shell (OSH) scripting language that has been generated; it creates a tab entitled OSH. Further down, there are advanced options for things such as setting format defaults for the various data types used in the parallel framework for this Project.

Under the tab entitled Sequence, we can add checkpoints so that the Job sequences are restartable upon any failure. We can automatically handle activities that fail, log warnings after activities that finish with a status other than OK, and log report messages after each Job run.

Work on Projects and Jobs Once we get our engine configured and set our basic Project properties, the next thing that we will want to do is work on a new Project. If we generate all of our DataStage Jobs from scratch, we won't be re-using anything from previous Projects. Often, however, it makes sense to leverage work that we have already done by importing things from a previous Project: program objects such as Jobs, Sequences, and Shared Containers; metadata objects such as Table Definitions and Data Connections; things used globally at runtime, such as Parameter Sets; and other objects, such as routines and transforms, that we want to bring along and re-use.

Exporting and Importing Objects

Export Window To make this happen, the first thing that we need to do is export from one environment; then we will be able to import. Here in the Export window, we can highlight certain objects and add them to our export set. One important thing to consider is the Job components to export drop-down menu, shown highlighted. This allows you to export the job designs with executables, where applicable. Let's say that we are promoting all of these objects to a production environment that does not have its own compiler: you will then need these pre-compiled Jobs ready to run when they reach their target. You can also opt to exclude read-only items, which is a typical default.

Next, you choose the file into which you would like to export. This is a text file that will live on your client machine; all of the export functionality is done through the client components. Once we have identified all the objects that we want to export, we simply click the Export button to begin.

We export either into an XML file or the native DSX file. Once this is accomplished, we can begin our import. For the import, we use the DataStage Designer client, click its import option, and select the objects to import from the flat file into our new environment. Let's take a look at that dialog box.

The file that we are importing may also serve as a backup, in that it allows us to import object-by-object. Our export is much like a database export in so far as it allows us to pull out one object from the export at a time. Should someone accidentally delete something (in the new environment), instead of losing the entire box, we lose only one object, which can then be restored from the export file without losing any of the other work that you've done in the meantime.

On the Import screen, you will need to identify the export file. You then have a choice to Import all or Import selected. At that time, you can choose to Overwrite without query if you know that everything you are importing is the most current. You can also choose to perform an Impact analysis to see which objects are related to each other. Now that you have set things up and are ready to begin developing Jobs, let's talk about some very important and key components that we will use during our development.

Choose Stages Now that we've talked about setting up our Project, let's focus on building Jobs. In particular, let's talk about the DataStage Stages that we see on our Canvas. There are two classes of Stages: Passive Stages and Active Stages.

Passive Stages Passive Stages are typically used for data in and data out, such as working with sequential files, databases, and Data Sets (Data Sets are DataStage's proprietary high-speed data format). Passive Stages only either input data or only output data: they are the beginning or ending points of the data flow.

Active Stages, by contrast, actually perform operations on the data: as rows come in, they immediately go out. For example, the Transformer Stage, the Filter Stage, and the Modify Stage pass data along as they receive it, manipulating it on the way. Active Stages are continuous-flow Stages that go in the middle of the data flow. Most notable of the Active type is the Transformer Stage, which has many functions. We'll get to this Stage a little later. First, let's cover the Passive Stages.

Types of File Data Let's describe the types of file data that we'll be dealing with as we begin to develop Jobs. Primarily, this falls into three categories: sequential file data (of either fixed or variable length); Data Sets, which are the DataStage proprietary format for high-speed access; and complex flat file data, most notably used with older COBOL-type structures. Complex flat files allow us to look at complicated data structures embedded inside of files, much the way that COBOL does.

How Sequential Data Is Handled When we are using the Sequential File Stage, it is important to know that DataStage has an import/export operator working on the data (not to be confused with importing and exporting objects, components, and metadata). The Sequential File Stage's import takes the data from its native format, parses it, and puts it into an internal structure (within DataStage) so that it can be used in the Job; likewise, on the way out, its export takes that internal structure and writes it back out into a flat file format.

After the Job has been run, we can look in the log and see messages relating to things such as how many records were imported successfully and how many were rejected. Records get rejected when they cannot be converted correctly during the import or the export. Later, we'll see how we can push those rejected rows out.

Normally, this Stage executes in sequential mode, but you can use one of the features of the Sequential File Stage to read multiple files at the same time; these reads will execute in parallel. If we set up multiple readers, we can also read chunks of a single file in parallel.

The Stage needs to know how the file is divided into rows, and it needs to know the record format (i.e., how each row is divided into columns). Typically, a field of data within a record is either delimited or in a fixed position. We specify the record delimiter (such as the new-line character) and the field delimiter (such as the comma) so that the Stage can parse the file.
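
The import step described above can be sketched in a few lines of Python (an illustration of the idea, not the actual operator): split the raw text into records on the record delimiter, split each record into columns on the field delimiter, and set aside anything that does not parse into the expected column count as a reject.

```python
# Sketch of the Sequential File Stage's import: records are separated
# by a record delimiter, fields by a field delimiter, and rows that
# cannot be converted to the expected layout are rejected.

def import_sequential(text, n_cols, field_delim=",", record_delim="\n"):
    good, rejects = [], []
    for record in text.strip().split(record_delim):
        cols = record.split(field_delim)
        if len(cols) == n_cols:
            good.append(cols)
        else:
            rejects.append(record)   # goes down the reject link
    return good, rejects

raw = "1,apple\n2,banana\n3,cherry,extra\n"
rows, rejects = import_sequential(raw, 2)
```

In a Job, the rejected records are exactly what flows down the dashed reject link to, for example, a Peek Stage.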

Job Design Using Sequential Stages The graphic above shows how we can read data from the Selling_Group_Mapping file, send it into our Copy Stage, and load it into the target flat file. However, if anything is not read correctly on the Selling_Group_Mapping link, it will be sent down the dashed reject link entitled Source_Rejects and, in this case, read by a Peek Stage. Over on the right, by the target, if anything isn't written correctly, it will be sent out on the Target_File_Target_Rejects link to the TargetRejects Peek Stage. To configure the Sequential File Stage, we double-click it.

Sequential Source Columns Tab Within the Sequential File Stage, on the Output tab, under the Columns sub-tab, we configure our input file column definitions. The column metadata (Column Names, Keys, SQL Types, Lengths, Scales, etc.) looks just like the metadata that we have seen elsewhere for this file. Next, let's say that we clicked on the sub-tab entitled Properties.

Under the Properties tab, we are presented with options that help us define how we will read the file: for instance, the file's name, its read method, and other options such as whether or not the first line is a column header, whether or not we will reject unreadable rows, and so on.

Datasets The other primary file type that we use is the Data Set. This kind of file stores data in a binary format that is non-readable without a DataStage viewer. It preserves the partitioning that we established in our Jobs; this way, the data contained in the Data Set is already partitioned and ready to go into the next parallel Job in the proper format. Data Sets are very important to our parallel operations in that they work within the parallel framework, use all the same terminology, and are therefore accessed more easily.

We use a Data Set as an intermediate point to land data that does not need to be interchanged with any application outside of DataStage: it will only be used within DataStage Jobs, going from one Job to the next (or within a single Job). It is not a useful form of data storage for sending to other people as an FTP file, for EDI, or for any of the other normal forms of data interchange. It is a proprietary DataStage file format that can be used within the InfoSphere world for intermediate staging of data where we do not want to use a database or a sequential file.
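
The idea of landing partitioned intermediate data can be sketched in Python (a loose analogy only: the one-file-per-partition layout and pickled format below are illustrative, not the real Data Set format, which is proprietary). Each partition is written and read back as-is, so no re-partitioning is needed between steps.

```python
# Sketch of the Data Set concept: intermediate data lands as one binary
# file per partition, and the next step picks the partitions up intact.

import os
import pickle
import tempfile

def write_dataset(partitions, directory):
    paths = []
    for i, part in enumerate(partitions):
        path = os.path.join(directory, f"part{i}.bin")
        with open(path, "wb") as f:
            pickle.dump(part, f)          # binary, not human-readable
        paths.append(path)
    return paths

def read_dataset(paths):
    out = []
    for p in paths:
        with open(p, "rb") as f:
            out.append(pickle.load(f))    # partitioning is preserved
    return out

with tempfile.TemporaryDirectory() as d:
    parts = [[1, 2], [3, 4]]              # two partitions from a prior step
    restored = read_dataset(write_dataset(parts, d))
```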

Data Set Management Utility You may need to look at a Data Set to verify that it looks right or to find specific values as part of your testing. You can look at a Data Set using the Designer client: there is a tool option that allows you to do data set management. It first lets you seek out the file that you want. It then opens a screen showing how the file is separated: the number of partitions and nodes, and how the data is balanced between them. If you would like, it also includes an option for displaying your data. Let's look at this display.

Data and Schema Displayed A Data Set cannot be read like a typical sequential file by native tools such as the vi editor or the Notepad editor in Windows. When looking at it in the Data Viewer, we see data in the normal tabular format, just as we view all other DataStage data using the View command.

Connect to Databases Now that we've talked about some of the basic Passive Stages for accessing file data, let's take a look at some of the basic Database Stages that allow us to connect to a variety of relational databases. Before we can access our databases, however, we need to know how to import relational data. So, let's begin by talking a little bit about working with relational data.

To import relational data, we use the Table Definition Import utility within the Designer client, or the orchdbutil utility. The orchdbutil is the preferred method for getting correct type conversions. However, in any situation, you need to verify that your type conversions have come through correctly and will suit the downstream needs (they must suit the needs of subsequent Stages and other activities within DataStage).

You will also work with Data Connection objects. Data Connection objects store all of our database connection information in one single named object so that, when we go into the other Stages in our Jobs, we won't need to re-enter every piece of detailed information, such as the user ID and password of each database we are connecting to.

Next, we need to see what Stages are available to access the relational data. First are the Connector Stages: they provide parallel support, are the most functional, and provide a consistent GUI and functionality across all the relational data types. Then we have the Enterprise Stages. These are the legacy Stages from previous versions of DataStage; they provide parallel support as well, via the parallel extender set. There are also Plug-in Stages, the oldest family of connectivity Stages for databases and other relational sources. These were ported to the current version of Information Server in order to support DataStage Server Jobs and their functionality. When you have one of these Stages, it gives you the ability to select the data you want: we can use SELECT statements as we would in any database. With DataStage, you get a feature called the SQL Builder utility that quickly builds up such statements, allowing you to get the data that you desire. When writing data, DataStage has similar functionality that allows you to build INSERT, UPDATE, and DELETE statements, also using the SQL Builder.


Import Table Definitions The first thing that we want to do is import our table definitions. To import these table definitions, we can use either ODBC or the Orchestrate schema definitions. Orchestrate schema imports are better because the data types tend to be more accurate. From within the Designer client, you simply click Import > Table Definitions > Orchestrate Schema Definitions. Alternatively, you could import table definitions using the ODBC option. Likewise, you can use Plug-in table definitions or other sources from which metadata can be imported, including legacy information such as COBOL file definitions, which can later be reused when pulling the same data out of corresponding databases.



Orchestrate Schema Import The Orchestrate Schema Import utility requires certain pieces of information, including the database type, the name of the database you are importing from, the server on which it is hosted, and a username and password.

ODBC Import When we're using the ODBC Import, we first select the ODBC data source name (DSN). This will need to have been set up by your System Administrator before you use the screen shown above (otherwise, all the entries will be blank). Most DSNs will need a username and password, although some may come pre-configured.


Connector Stage Types There are several Connector Stage types, including ODBC, which conforms to the ODBC 3.5 standard, is level 3 compliant, and is certified for use with Oracle, DB2 UDB, SQL Server, and various other databases. This also includes the suite of DataDirect drivers, which allow you to connect to these various data sources and give you unlimited connections.

The next Connector type is the DB2 UDB Stage, which supports DB2 versions 8.1 and 8.2. There is also a Connector for WebSphere MQ that allows us to connect to MQ Series queues; it can be used with MQ 5.3 and 6.0 for client/server, and there is also WSMB 5.0. The last Connector Stage type is for Teradata, which gives us fast access to Teradata databases.


Connector Stages Now that we've done some of the basic setup for our relational data, we want to be able to use it on our Canvas. You can see, in the graphic above, that an ODBC Connector has been dragged and dropped on the left. Next we would configure it so that we can pull our data from an ODBC source. Let's take a look inside the Stage.


Stage Editor Inside the Connector Stage, you can see a Navigator panel, an area in which we can see the link properties, and, below that, the various other properties of the connection itself. Once we have chosen the connector for our ODBC Connector Stage, we will want to configure an SQL statement that will allow us to pull data. We can use the SQL Builder tool highlighted on the right, manually enter our own SQL, or use SQL that is file-based.


Connector Stage Properties When configuring the Connector Stage, we should have our connection information and know what SQL we will be using. This may include any transaction and session management information that we entered (this would be our record count and commit control, shown highlighted). Additionally, we can enter any Before SQL or After SQL commands that may need to occur, such as dropping indexes, re-creating indexes, removing constraints, or various other directives to the database.
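For example, a common pattern is to drop an index before a bulk load and rebuild it afterwards. A sketch of what might go in the Before SQL and After SQL properties follows; the index and table names are hypothetical:

```sql
-- Before SQL: remove the index so the load is not slowed by index maintenance
DROP INDEX WH.ITEMS_IX1;

-- After SQL: rebuild the index once the load has completed
CREATE INDEX WH.ITEMS_IX1 ON WH.ITEMS (ITEM_ID);
```

The database executes these statements before the first row is written and after the last row, respectively.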


Building a Query Using SQL Builder When you build a query using the SQL Builder utility, there are certain things you'll need to know: be sure that you are using the proper table definition, and be sure that the Locator tab information is specified fully and correctly. You can drag the table definition onto the SQL Builder Canvas, and then drag on the columns that you want to select.


Data Connection Data Connections are objects in DataStage that allow us to store all of the characteristics and important information that we need to connect to our various data sources. The Data Connection stores the database parameters and values as a named object in the DataStage repository. It is associated with a Stage type, and the property values can be specified in a Job Stage of that type by loading the Data Connection into the Stage.


Creating a New Data Connection Object You can see the icon that represents the Data Connection highlighted above.


Select the Stage Type Inside the Data Connection, we can select the Stage type that we're interested in using and want to associate the Data Connection with. Then we specify any parameters that are needed by that particular database. Obviously, different databases have different required parameters, so the selection of the Stage type is important.


Loading the Data Connection Once we have built our Data Connection, we will want to load it into the Stage that we have selected.

Active Stages Lookup, Merge, Join Stages Aside from Passive Stages, the other main classification of Stages is the Active Stage. The Active Stages that we'll be looking at cover a number of functions, such as data combination, transformation of data, filtering of data, and modification of metadata, each with a corresponding Active Stage.


Data Combination First, let's talk about data combination. For this, the three Stages that we use are the Lookup Stage, the Merge Stage, and the Join Stage. These Stages combine two or more input links. They differ mainly in the way that they use memory. They also differ in how they treat rows of data when there are unmatched key values. Some of these Stages also have input requirements, such as needing the data sorted or de-duplicated prior to combination.

The Lookup Stage Some features of the Lookup Stage: it requires exactly one primary input link, it can have multiple reference links, and it can have only one primary output link (plus an optional reject link if that option is selected).


Lookup Failure Actions The Lookup Stage is limited to one output link; however, with its lookup failure options, you can include a reject link. Other lookup failure options include continuing the row of data through the process (passing the data through with a null value for what should have been looked up), dropping the row, or failing the Job (which will abort the Job).



The Lookup Stage can also return multiple matching rows, so you must be careful how you use it. The Lookup Stage builds a Hash table in memory from the Lookup file(s). This data is indexed by using a hash key, which gives it a high-speed lookup capability. You should make sure that the Lookup data is small enough to fit into physical memory.


Lookup Types There are different types of Lookups depending on whether we want to do an equality match, a caseless match, or a range lookup on the reference link.


Lookup Example Let's see how a Lookup Stage might typically fit into a Job. We have a primary input coming in from the left, and the data flows into the Lookup Stage. Coming in from the top is our reference data. Notice that the reference link is dashed (a string of broken lines). As you drag objects onto the Canvas and build the Job, always draw the primary link before the reference link. Coming out of our Lookup Stage is the output link going to the target. What would happen if we clicked on the Lookup Stage to see how it's configured? Let's take a look.


Lookup Stage With an Equality Match Within the Lookup Stage, there is a multi-pane window. Everything on the left represents the data coming into the Lookup Stage: both the Primary link and the Reference link(s). Everything on the right is coming out of the Lookup Stage. On the top of the screen is a more graphical representation of each row. Toward the bottom of the screen is a more metadata-oriented view in which you can see each of the links and their respective characteristics. This same metaphor applies to most other Stages as well.

Notice that the metadata references (shown highlighted lower left) look like the table definitions that we used earlier to pull metadata into the DataStage Project.


Highlighted toward the top is the area where you connect column(s) from the input area (the Item input link) by dragging them down onto the corresponding Reference link and connecting them to the equivalent fields there (Warehouse Item in the lower blue area). In this case, we are using a single column to create an equality condition: where Warehouse Item is equal to Item.


Next, we can move all the columns from the input area on the left to the output link area on the right, but replace the Item description (highlighted in the right box) with the Warehouse Item description (the one that we found on the Reference link). So, we are taking all of our input columns and adding a new column to them based on what we looked up on the Warehouse Item ID.


When you click on the icon with the golden chains (shown highlighted), it brings up a new dialog screen.


Specifying Lookup Failure Actions This dialog box shows us the constraints that we can use. Here we can determine what action to take should our lookup fail: continue the row, drop the row, fail the Job, or reject the row. Also, if you use a condition during your lookup (highlighted under Condition Not Met), you can use these same options of continue, drop, fail, or reject for when your condition is not met.


Join Stage The Join Stage also combines data. There are four types of joins: inner, left outer, right outer, and full outer. The input links must be sorted before they come into a Join Stage. The Join Stage uses much less memory and is known as a light-weight Stage: the data doesn't need to be indexed because it comes into the Stage already sorted. We must identify a left and a right link as they come in, and the Stage supports additional intermediate links. As in an SQL query, in the Join Stage we use key columns to pull our data together.
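To relate the four join types to familiar SQL, the left outer variant of what the Join Stage does is roughly equivalent to the query below. The link names are used as table names purely for illustration; inner, right outer, and full outer follow the same pattern with a different JOIN keyword:

```sql
-- Left outer join: keep all left-link rows, matching right-link rows where found
SELECT L.ITEM_ID, L.QTY, R.ITEM_DESC
FROM   LEFT_LINK  L
LEFT OUTER JOIN RIGHT_LINK R
       ON L.ITEM_ID = R.ITEM_ID
```

The join key in the ON clause corresponds to the key column you specify in the Join Stage editor.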


Job with Join Stage Above is an example of a Job that uses the Join Stage. Notice the green Join Stage icon in the middle. Similar to the Lookup Stage, here, we have a primary input, a secondary input, and an output. But, in this case, notice that the right outer link input is not a dashed line (Highlight on the upper link). Remember that the reference link that we saw earlier did have a dashed line.


Join Stage Editor When we get inside the Join Stage, we will use the Properties editor. Similar editors are found in other Stages. Here, we will use the attributes necessary to create the join. We will specify the join type that we want and what kind of key we will be joining on.


The Merge Stage The Merge Stage is quite similar to the Join Stage. Much like the Join Stage, its input links must be sorted. Instead of a left and a right link, we have a Master link and one or more Secondary links. It is a light-weight Stage in that it uses little memory (because we expect the data to be sorted before it enters the Stage, there are no keys or indexes held in memory). Unmatched Master rows can be kept or dropped, which gives the effect of a left-outer join. Unmatched Secondary rows can be captured on a Reject link and dispositioned accordingly.


When the data is not pre-sorted, the sorting can be done using an explicit Sort Stage, or you can use an on-link sort. If you have an on-link sort, you will see its icon on the link (the icon is shown in the lower-left of the graphic). The icon toward the right of each link above is the partitioning icon (partitioning fans the data out, so the icon looks like a fan; collecting gathers the partitioned data back up).


Coming out of the Merge Stage, we have a solid line for the main output and a dashed line for the rejects. The latter will capture the incoming rows that were not matched in the merge.


Merge Stage Properties Inside the Merge Stage, you must specify the key attributes by which you will pull together the two sets of data, what to do with unmatched Master records, and whether to warn on rejected updates. You must also specify whether or not to warn on unmatched Masters. This is another way to combine data, which will create more columns within one row and, in some cases, produce multiple rows.


The Funnel Stage The Funnel Stage provides a way to bring many rows from different input links into one common output link. The important thing here is that all sources must have identical metadata! If your links do not, they must first pass through some kind of Transformer Stage or Modify Stage to make all of their metadata match before coming into the Funnel Stage.


The Funnel Stage works in three modes. Continuous mode combines records in no particular order: the first one coming into the Funnel is the first one put out. Sort mode combines the input records in an order defined by keys; this produces sorted output if all input links are sorted by the same key. Sequence mode outputs all of the records from the first input link, then all from the second input link, and so on (based on the number of links coming into your Funnel Stage).


Funnel Stage Example Above, you can see an example of the Funnel Stage's use.

Let's continue our discussion of Active Stages. Here, we'll briefly talk about the Sort Stage and the Aggregator Stage. Obviously, the Sort Stage is used to sort data that requires sorting, as was the case with the Join Stage and the Merge Stage. As mentioned, sorts can also be done on the input link to a given Stage. This on-link sort is configured within a given Stage on its Input link's Partitioning tab: just set the partitioning to anything other than Auto.


Alternatively, you can use a separate Sort Stage. One of the advantages of the Sort Stage is that it gives you more options for controlling memory usage during the sort. The advantage of on-link sorting (within the Stage) is that it is quick and easy and can often be done in conjunction with any partitioning that you'll be doing.


Sorting Alternatives The Sort Stage is highlighted above. In this example, we have sequential data coming into a Sort Stage before it is moved to a Remove_Duplicates Stage and then sent out to a Data Set. Below, we have a different example in which a file is being sorted directly into a Data Set, but in this case the sort is happening directly on the link (an on-link sort).


Our next Active Stage is the Aggregator Stage. The purpose of this Stage is to perform data aggregations. When using it, you specify one or more key columns, which define the aggregation units (or groups), and the columns to be aggregated. Next, you specify the aggregation functions. Some examples of functions are counting values (such as nulls/non-nulls), summing values, or determining the maximum, minimum, or ranges of values that you are looking for. The grouping method (hash table or pre-sort) is often a performance consideration.

If your data is already sorted, then the pre-sort is not necessary. If you don't want to take the time to sort your data prior to aggregation, the default hash option will let you pull the data in without any sorting ahead of time.
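In SQL terms, what the Aggregator Stage produces is roughly analogous to a GROUP BY query like the one below. The table and column names are invented for the example:

```sql
-- STORE_ID plays the role of the grouping key column;
-- the aggregates correspond to Aggregator functions (count, sum, max)
SELECT   STORE_ID,
         COUNT(*)    AS ROW_COUNT,
         SUM(AMOUNT) AS TOTAL_AMOUNT,
         MAX(AMOUNT) AS MAX_AMOUNT
FROM     SALES
GROUP BY STORE_ID
```

Each output row from the Stage corresponds to one group, just as each result row of the query corresponds to one STORE_ID.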


Job with Aggregator Stage In this Job, we can see the Aggregator Stage's icon. Data flows from a Data Set (at the top of the data flow) and is copied two ways. One copy (to the right) is sent to a Join Stage. The other (straight down) is sent to an Aggregator Stage to aggregate the row counts and then (flowing from the Aggregator to the Join) is passed back into the same Join, so that each row that came into this Job can also have the total row count attached to it. The Aggregator Stage uses a Greek sigma character as a symbol for summation.


Remove Duplicates Stage Our next Active Stage is the Remove Duplicates Stage. Removing duplicates can also be accomplished using the Sort Stage with the Unique option. However, this leaves us with no choice as to which duplicate to keep: the Unique option provides a Stable sort, which always retains the first row in the group, or a non-Stable sort, in which it is indeterminate which row will be kept.

The alternative is to use the Remove Duplicates Stage. This gives you more sophisticated ways to remove duplicates. In particular, it lets you choose whether to retain the first or the last of the duplicates in the group.


This sample Job shows an example of the Remove Duplicates Stage, and we can see what the icon looks like. From the Sort Stage (toward the left of the graphic), the data comes into a Copy Stage toward the bottom, and the sorted data then runs up into the Remove Duplicates Stage. The data is then sent out to a Data Set.

An Important Active Stage: The Transformer Stage One of the most important Active Stages is the Transformer Stage. It provides four things: Column Mappings, Derivations, Constraints, and Expressions that can be referenced.


It is one of the most powerful Stages. With column mappings, we can change the metadata, its layout, and its content. Also within the Transformer Stage are derivations. These are written in BASIC-like code, and the final compiled code is generated C++ object code. There are a number of out-of-the-box derivations that you can use, allowing you to do things such as string manipulation, character detection, concatenation, and other activities. You can also write your own custom derivations and save them in your Project for use in many Jobs.

Other features of the Transformer include Constraints. The Constraint allows you to filter data much like the Filter Stage does. You can direct data down different Output links and process it differently or process different forms of the data.

The Transformer's expressions are used by constraints and/or derivations. We can use the input columns coming into the Transformer to determine how we want to derive or constrain our data. We can use Job parameters and/or functions (whether the native ones provided with DataStage or ones that you have custom-created). We can use system variables and constants, Stage variables (local in scope to just the Stage, as opposed to global for the whole Job), and external routines within our Transformer.
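A few typical derivation expressions, to give a flavor of the syntax. The link and column names here are hypothetical; UpCase, Trim, Len, and the ':' concatenation operator are standard Transformer functions and operators:

```text
UpCase(lnk_in.LAST_NAME)
Trim(lnk_in.FIRST_NAME) : " " : Trim(lnk_in.LAST_NAME)
Len(lnk_in.POSTCODE)
```

The first forces a column to upper case, the second concatenates two trimmed strings with the ':' operator, and the third returns the length of a string.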


Job with a Transformer Stage The Transformer Stage has an icon that looks like a T shape with an arrow. In the example Job above, we have the data coming into the Transformer. There is a reject link coming out of the Transformer (created with the Otherwise option) and two other output links as well, populating two different files.


Inside the Transformer Stage Inside a Transformer Stage, the look-and-feel is similar to that of the Lookup Stage we talked about earlier, in that we have an Input link (highlighted on the left) and all of our Output links on the right. The two output links (the two boxes on the right) give us a graphical view, specify the column names, and give us places to enter any derivations. The top area (the upper-most blue box on the right) allows us to create Stage variables, which we can use to make local calculations in the process of doing all of our other derivations.

Again, as with the Lookup Stage, the bottom half of the screen resembles the table definitions that you will see inside of your repository (highlight on center lower half). This gives you the very detailed information about the metadata on each column.


(Yellow highlight on the golden-chain button in the toolbar at the top.) Clicking the golden-chain button brings you to the section that allows you to define Constraints.


Defining a Constraint You can type your constraint into the highlighted area, which also uses BASIC-like terminology. Here you can create an expression that will direct certain data into one link or another, or keep it out of a link. When you right-click in any of the white space, DataStage brings up options for the Expression Builder (highlighted toward the left).


Alternatively, you can use the Ellipsis (...) button to bring up the same menu. The example above shows an UpCase function and a Job parameter named Channel_Description. At the beginning of the expression, we use an input column, which we compare to an upper-cased version of one of our Job parameters.


Defining a Derivation When we are defining or building a derivation, we also use BASIC-like code to create expressions within a particular cell. These derivations can occur within the Stage Variable area or within any one of the links. As with constraints, a right-click or clicking the Ellipsis (...) button brings up the context-sensitive code menu to help you build your expression.


If Then Else Derivation Using our Transformer Stage, we can also build If-Then-Else derivations, in which we put conditional logic that affects the data on our output link.
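A sketch of such a conditional derivation, using a hypothetical link and column name:

```text
If lnk_in.QTY > 100 Then "BULK" Else "STANDARD"
```

This populates the output column with "BULK" for large quantities and "STANDARD" otherwise; If/Then/Else expressions can also be nested for more complex logic.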

String Functions and Operators Some things included in the Transformer Stage's functionality are string functions and operators. The Transformer can use a Substring operator, it can Upcase/Downcase, and it can find the length of strings. It provides a variety of other string functions as well, such as string substitution and finding positional information.

Checking for NULLs Other out-of-the-box functionality of the Transformer includes checking for NULLs. DataStage can identify whether there are NULLs, assist us in handling them, and test, set, or replace NULLs as we see fit.
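Two illustrative expressions using the standard NULL-handling functions (the link and column names are hypothetical): IsNull() tests for a NULL, and NullToValue() substitutes a default value.

```text
If IsNull(lnk_in.MIDDLE_NAME) Then "" Else lnk_in.MIDDLE_NAME
NullToValue(lnk_in.DISCOUNT, 0)
```

The first replaces a NULL middle name with an empty string; the second returns 0 whenever the discount column is NULL.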


Transformer Functions Other functions of the Transformer include those for Date and Time, logic, NULL handling, Numbers, Strings, and those for type-conversion.

Transformer Execution Order It is important to know that, when a Transformer is executed, there is a very specific order of execution. The first things executed in the Transformer for each row are the derivations in the Stage variables. Next, the constraints for each link going out of the Transformer are executed. Then, within each link, the column derivations are executed, with the earlier links processed before the later links.

The derivations in higher columns are executed before lower columns: everything has a top-down flow to it.
1. Derivations in Stage variables (from top to bottom)
2. Constraints (from top to bottom)
3. Within each link, column derivations (from top to bottom)


Let's look at a quick example of how this might work. At the beginning, we execute all the Stage variables (from top to bottom) and then each constraint (from top to bottom). For each constraint that is satisfied, the Transformer fires off the corresponding output link, and within that link we do all the columns from top to bottom. This means that, if we do something in a column above, we can use its results in the column below. TIP: You must be very, very careful if you ever re-arrange your metadata (because of how the order of execution could affect your results within a Transformer).


Transformer Reject Links You can also have Reject links coming out of your Transformer. Reject links differentiate themselves from other links by how they are designated inside the constraint. Let's quickly see how this is done.


Off to the right, we have the Otherwise/Log option or heading. Checking the respective checkbox creates a Reject link as necessary.


Our Otherwise link is drawn as a straight line (such as the Missing Data link shown above), whereas a Reject link has a dashed line (such as the link all the way to the left).

Advanced Concepts When Building Jobs We've discussed the key objects in DataStage and we have set up DataStage. We've been able to use Passive Stages to pull data in from various files or relational database sources, and we've talked about various Active Stages that let us manipulate our data. Now, we need to cover a few other important concepts that will help us in our development of DataStage Jobs: Runtime Column Propagation and Shared Containers.

Runtime Column Propagation (RCP) One of the most important concepts is that of Runtime Column Propagation (RCP). When RCP is turned on, columns of data can flow through a Stage without being explicitly defined in the Stage. We'll see some examples of this later. With RCP enabled, the target columns in a Stage need not have any columns explicitly mapped to them: no column mapping is enforced at design-time. With RCP, the input columns are mapped to unmapped columns by name. This is all done behind the scenes for you.



Let's talk about how the implicit columns get into a Job. The implicit columns are read from sequential files associated with schemas, or from tables when read from relational databases using a select, or they can be explicitly defined as an output column in a Stage earlier in the data flow of our Job.

The main benefits of RCP are twofold. First, it gives us greater Job flexibility: a Job can process input with different layouts, so we don't have to have a separate Job for multiple record types that are only slightly different. Second, RCP gives us the ability to create more re-usable components, such as Shared Containers (there are many ways to create re-usable bits of DataStage ETL code). This way you can create a component of logic and apply it to a single named column while all the other columns flow through untouched.

When we want to enable RCP, it must be enabled at the Project level. If it's not enabled there, then you won't find the option in your Job. At the Job level, we can enable it for the entire Job or just for individual Stages. Within each Stage, we look at the Output Column tab to decide whether or not we want to use RCP. You must enable RCP for the entire Project if you intend to use it. If you do, you can also choose whether each newly created link uses RCP by default. This default setting can be overridden at the Job level or at the link level within each Stage.



Here's an example of where we enable RCP just within a Stage (highlight on the checkbox in the Enabling RCP at Stage Level slide). When you see this checkmark, you know that columns coming out of this Stage will be RCP-enabled.

When RCP is Disabled Let's say that we had four columns coming into this Stage on the input link shown above on the left, and four columns going out (on the right). If we only drew derivations to two of the columns (the upper two on the right), the two lower columns would remain in a RED state (invalid derivations) and the Job will not compile. However, let's see what we can do with RCP.


When RCP is enabled, we can see that the output link's columns (lower right) are no longer in the red state. By name matching, DataStage will automatically assign the two columns on the input link (on the left) to the two columns going out on the output link (lower right).

Alternatively, we could leave these two columns off the output link entirely (the ones in the lower right of the previous graphic) and they would be carried invisibly through subsequent Stages. This allows one Job to handle many row types that share the upper two columns explicitly while carrying the remaining columns implicitly.

In other words, we can run this Job one time with a record that has four columns, another time with eight columns, and another time with six. The only thing the records all have to share is that the first two columns must be the same and must be explicit.
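The name-matching behavior just described can be sketched in plain Python (this is a conceptual illustration of how RCP treats columns, not DataStage code; the column names are hypothetical):

```python
def propagate(input_row, derivations):
    """Illustration of RCP-style column flow: columns with explicit
    derivations are computed; every other input column is passed
    through untouched, matched by name."""
    out = {name: fn(input_row) for name, fn in derivations.items()}
    for name, value in input_row.items():
        out.setdefault(name, value)  # implicit, name-matched propagation
    return out

# A four-column row where only NAME has an explicit derivation;
# the other three columns flow through without being mapped.
row = {"CUST_ID": 123, "NAME": "Joe Smith",
       "SPEC_HANDLING_CODE": "X", "DISTR_CHANNEL_DESC": "Retail"}
out = propagate(row, {"NAME": lambda r: r["NAME"].upper()})
print(out)
```

Running the same function with a six- or eight-column row would work unchanged, which is exactly the flexibility the text describes.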


Runtime Column Propagation (RCP) In Detail

Because Runtime Column Propagation (RCP), schemas, Shared Containers (chunks of DataStage code that can be re-used), and related features give you greater flexibility and reusability within your DataStage Jobs, they are worth discussing in greater detail. RCP can take time to grasp; even people who have been in the field for years can have difficulty appreciating both its power and its pitfalls, so let's review and continue our discussion of it.

When RCP is turned on, it allows columns of data to flow through a Stage without being explicitly defined in that Stage. Target columns in the Stage do not need to have any columns explicitly mapped to them, so no column mapping is enforced at design time. Input columns are mapped to unmapped output columns by name. RCP is how implicit columns get into a Job.

We can define the metadata of implicit columns in several ways. One is a schema file used in conjunction with a Sequential File Stage; as you'll recall, a schema file is another way to define metadata, much like a table definition. Using a schema, a Sequential File Stage or a Modify Stage can take unknown data and give it an explicit metadata tag. We can also pull metadata in implicitly when we read from a database with a select statement, or define it explicitly as an output column in an earlier Stage in the data flow.

The main benefits of RCP are Job flexibility, so that we can process inputs with different or varying layouts, and re-usability. It also works very well with DataStage's Shared Containers (user-defined chunks of DataStage ETL logic). Let's look at an example of RCP so that we can understand it better.
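As an illustration, a minimal schema file for use with a Sequential File Stage might look like the following sketch (the column names are hypothetical, and the exact record-property syntax should be checked against the DataStage schema documentation):

```
record
  {final_delim=end, delim=',', quote=double}
(
  CUST_ID: int32;
  CUST_NAME: string[max=40];
)
```

A Job reading a file with this schema under RCP can carry CUST_ID and CUST_NAME implicitly, without those columns being defined in the Stage itself.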


The graphic above summarizes the main points of RCP, but by revisiting the example from a few moments ago we can gain a deeper understanding. So let's review our example in which RCP is NOT used.


We're looking inside a Transformer Stage at a typical example of columns when RCP has not been enabled. We can see four columns on the input link to the left and four outgoing columns on the output link to the lower right. But the two lower columns, SPEC_HANDLING_CODE and DISTR_CHANNEL_DESC, aren't connected to the input columns and don't have any derivations. They are therefore shown in red, indicating incomplete derivations, and this Job won't compile.


When RCP is Enabled

Here is the exact same situation, but with RCP enabled. In the same Transformer, the bottom two columns are no longer in a red state, so this Job will compile. When the Job runs, the two columns on the bottom left will be assigned by name to the two columns on the bottom right.

If we had additional explicitly named input link columns beyond the four highlighted on the left, they would be carried invisibly along to the next Stage, where they could be used or ignored as necessary. This removes the need for individual column checking at design time, which has mixed benefits: it gives us a lot of flexibility, but it can create headaches at runtime if you are not aware of data passing through your Job, and that can negatively impact downstream activities. You must therefore be very careful when using RCP.

Important things to know when using RCP are what data is coming in, what metadata needs to be processed, and what data is going out of the Stage, both in terms of the number of columns and the metadata types. If they



do not match, then what you are bringing in will need to be modified at some point in the processing to work with the final target metadata of the Job. For instance, if you bring in one column as numeric data but it needs to go out as character data, you'll need to perform some type of conversion function on it. If a column comes in with one length but needs to go out with a longer length, you'll need to handle it within either a Transformer Stage or a Modify Stage. Likewise, if you have a longer column coming in, you'll need to make sure it is truncated properly to fit the outgoing column so that it doesn't overflow and cause a runtime failure.

Business Use Case Scenario

Let's talk about a good use of RCP in business terms. Say you have a large conglomerate with five companies underneath it in a corporate hierarchy. The conglomerate wants to take data from all five of these companies' operational systems and populate a data warehouse with Customer information. Each of the subsidiary systems has a different layout for its Customer data, but each contains its own Customer ID field. However, these Customer ID fields may overlap: the value 123 at Company A may represent a totally different Customer than the value 123 at Company B, even though they are the same value. The conglomerate has a single Customer ID that can track each of them distinctly, but the subsidiaries will not know what their corporate conglomerate Customer ID is until it comes time to pull the data into the warehouse, because their systems were built without any knowledge of a corporate Customer ID.
The conglomerate, however, knows about it. It can pull the customer data from all of the subsidiaries, add another column indicating which company the data came from, and then process the data through a DataStage Job that relates the subsidiary's ID to the corporate ID via a cross-reference stored in the corporate data warehouse.

Two columns would be needed: the source system ID and the source system Customer ID. The Job could then use this information to do a Lookup against a reference in the conglomerate's warehouse to find the corresponding corporate Customer ID, bring the data together, and write it out with the new number attached. The data could then be processed downstream knowing which corporate Customer ID each subsidiary row belongs to. Now, suppose we process Customer records from the five companies separately: one company might have 10 columns in its Customer table while another has 20. We are only going to identify two of those columns explicitly coming in and three coming out of our Stage. In a file-to-file example, we read the input file, process the data, and write it out; even though our Job only sees the two or three columns that we



have explicitly identified, when we write our output file all of the columns that came from the input file will be written, in addition to the new conglomerate Customer ID that we looked up. This is the case with RCP on. We therefore need only one Job instead of five separate Jobs, one for each data layout coming from each subsidiary.
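The single-Job approach just described can be sketched conceptually (the column names and cross-reference values below are hypothetical; this illustrates the idea, not DataStage internals):

```python
# Cross-reference from (source_system_id, source_customer_id) to the
# conglomerate's Customer ID, as held in the corporate warehouse.
XREF = {(1, "123"): "9000001", (2, "123"): "9000002"}

def add_conglomerate_id(row, source_system_id):
    """One Job handles every subsidiary's layout: only the key column
    is referenced explicitly; the rest of the row (10 columns for one
    company, 20 for another) passes through untouched."""
    key = (source_system_id, row["CUST_ID"])
    enriched = dict(row)                          # all input columns survive
    enriched["CONGLOM_CUST_ID"] = XREF.get(key)   # lookup result appended
    return enriched

row_a = {"CUST_ID": "123", "NAME": "Joe Smith", "PHONE": "555-0100"}
print(add_conglomerate_id(row_a, 1))
```

The same function works no matter how many extra columns a subsidiary's record carries, which is what lets one Job replace five.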

Shared Containers

Shared Containers are encapsulated pieces of DataStage Job designs: components of Jobs stored as separate, named objects that can be reused in other Job designs. With them, we can apply stored Transformer business logic to a variety of situations. In the previous example, we performed logic that matched Customer information from subsidiaries with a corporate Customer ID. That logic could be encapsulated, stored, and re-applied.


Creating a Shared Container

In the example above, the Developer has selected the Copy Stage and the Transformer Stage to become a Shared Container. Choosing Edit > Construct Container > Shared from the menu completes the task. The same Job would then have a slightly different look. Let's see what it would look like.


Using a Shared Container in a Job

Above, you can see the Shared Container icon where the selected Stages used to be. This same Shared Container can now be used in many other Jobs. Shared Containers can use RCP, although they don't have to. The combination of Shared Containers and RCP is a very powerful tool for re-use, so let's examine combining the two.

If we create a Shared Container in conjunction with RCP, then the metadata inside the Shared Container only needs to describe the columns necessary for the functions performed in that Container. Recall the example of the conglomerate and its five subsidiaries. If the Shared Container is where we do the Lookup to find our corporate conglomerate number, all we really need to pass into it explicitly are the two key columns; any other columns (different subsidiaries have different numbers of columns) are passed in as well. On the way out of the Transformer that adds the conglomerate Customer ID, all of the columns are passed out together with the newly added corporate identifier that we looked up inside the Shared Container.


Mapping Input/Output Links to the Container

When mapping the input and output links to the Shared Container, you select the Container link to which each Job link will be mapped, along with the appropriate columns. By using RCP, we only have to specify the few columns that really matter while we are inside the Shared Container. When we come back out of the Shared Container, we can re-identify any of the columns as necessary, depending on what we want to do with them.

Shared Containers Continued

The main idea behind Shared Containers is that they let us create re-usable code that can be used in a variety of Jobs, even Jobs that weren't intended to process just Customer data. For instance, we might want to process Order data where we still need to look up the conglomerate's Customer ID for each Order. We can pass in the whole Order record while exposing only the Customer number (the subsidiary's Customer identifier) and the source subsidiary it came from, so that the Shared Container can do the Lookup, produce the appropriate output including the conglomerate Customer ID, and tag it onto the Order. Then, as we process the Order into our data warehouse, we will know not only the source Customer ID that appeared on the billing invoice and other information from that particular subsidiary, but we will also be able to tie it to our conglomerate database and distinguish it from customers of other subsidiaries.
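This kind of reuse can be sketched as one function applied to differently shaped records (hypothetical column names; a conceptual illustration of a Shared Container used with RCP, not DataStage code):

```python
def xref_lookup(row, source_system_id, xref):
    """Sketch of a Shared Container with RCP: only CUST_ID is exposed
    explicitly; whatever else the record carries (Customer columns or
    Order columns) flows through and comes out with the conglomerate
    Customer ID attached."""
    out = dict(row)
    out["CONGLOM_CUST_ID"] = xref[(source_system_id, row["CUST_ID"])]
    return out

xref = {(1, "123"): "9000001"}
customer = {"CUST_ID": "123", "NAME": "Joe Smith"}
order = {"CUST_ID": "123", "ORDER_NO": "A-77", "AMOUNT": 19.95}
# The same "container" logic serves both a Customer feed and an Order feed.
print(xref_lookup(customer, 1, xref))
print(xref_lookup(order, 1, xref))
```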



Again, Shared Containers do not necessarily have to be RCP-enabled but the combination of the two is a very powerful re-usability tool.

Job Control

Now that we've created the DataStage environment, talked about how to pull in metadata, and seen how to build Jobs, the next topic answers the question: how do we control all of these Jobs and process them in an orderly flow without having to run each one individually? The main element we use for this is the Job Sequence. A Job Sequence is a special type of Job: a master controlling Job that controls the execution of a set of subordinate Jobs. The subordinates may themselves be Sequences (sub-Sequences), or they can be ordinary Parallel or Server Jobs.



One of the things a Job Sequence does is pass values to the subordinate Jobs' parameters. It also controls the order of execution using Links. These Links differ from ordinary Job Links in that no metadata is passed on them; a Sequence Link tells us what should be executed next and when. The Sequence specifies the conditions under which the subordinate Jobs are executed using a mechanism known as Triggers.

We can specify a complex flow of control using Loops and the All or Some options, and we can also do things such as Wait For File before a Job starts. For example, if we were waiting for a file to be FTPed to our system, the Sequence can wait until that file arrives; when it does, the rest of the activities within the Sequence are kicked off.



Job Sequences also allow us to perform system activities. For example, you can use the e-mail option, which ties into your system's mail facility to send an e-mail. They also allow us to execute system commands and executables that we might normally run from the command line; any command-line option can be executed from within the Sequence. For example, you could send a message to your Tivoli workload manager by writing a command that posts a message to the Tivoli command center so that operators can see what has happened.

A Sequence can include restart checkpoints. Should some Job within your Sequence fail, you don't have to go back and re-run all of the Jobs: the Sequence skips over the ones that succeeded and picks up by re-running the one that failed.
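Checkpointed restart can be sketched as skipping completed steps (a conceptual illustration with hypothetical Job names, not how DataStage stores its checkpoints):

```python
def run_with_checkpoints(run_job, jobs, done):
    """Sketch of Sequence restart checkpoints: Jobs already recorded
    as complete are skipped on a re-run, and execution resumes at the
    Job that failed last time."""
    for name in jobs:
        if name in done:
            continue              # checkpointed on a previous run: skip
        if run_job(name) != "OK":
            return done           # failure: keep checkpoints for restart
        done.add(name)
    return done

# First run: Job2 fails, so only Job1 is checkpointed.
checkpoints = run_with_checkpoints(
    lambda j: "FAIL" if j == "Job2" else "OK",
    ["Job1", "Job2", "Job3"], set())
# Restart: Job1 is skipped; Job2 and Job3 now run.
run_with_checkpoints(lambda j: "OK", ["Job1", "Job2", "Job3"], checkpoints)
```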

When you create a Job Sequence, you open a new Sequence and specify whether or not it is restartable. Then, on the Sequence canvas, you add Activity Stages that execute Jobs, Stages that execute system commands and other executables, and special-purpose Stages that perform functions such as looping or setting up local variables. Next, you add Links between the Stages. These specify the order in which the Jobs are executed; the order of sub-Sequences, Jobs, loops, system commands, and e-mails is determined by the flow that you create on the canvas.


You can specify triggers on the Links. For instance, you might create a condition so that, coming out of Activity A, one Link goes to Activity B if the Activity A Job finished successfully, but a different Link goes to Activity C if the Job errors out. The error Link might lead to an e-mail that sends an error message to someone's console and stops the Sequence at that point. Job Sequences also let us specify global error handling, such as the ability to trap errors, and we can enable and disable restart checkpoints within the Sequence. Now let's look at some of the Stages that can be used within a Sequence.


Here is a graphic of some of the Stages that can be used within a Sequence. These include the EndLoop Activity Stage, the Exception Handler Stage, the Execute Command Stage (which runs native OS commands), the Job Activity Stage (used to execute a DataStage Job), and the Nested Condition Stage. In addition, there are the Notification Activity Stage (typically tied into DataStage's e-mail function), the Routine Activity Stage (used to call DataStage Server routines), the Sequencer Stage (which provides a rendezvous point that coordinates multiple Links once we have started many Jobs in parallel), and the StartLoop Activity Stage. There are also the Terminator Activity Stage (for forced stops), the UserVariables Activity Stage (used to create local variables), and the Wait For File Activity Stage.


Sequence Job Example

Here we can see an example of a Sequence Job. At the top left, this Job Sequence starts with a Stage that has nothing coming into it, so it starts executing immediately. Normally the Stage at the bottom would as well; however, the bottom Stage is an Exception Handler Stage and will only kick off if there is an exception at runtime. In a different example, if there were many Stages independent of one another, they would normally all start at the same time within a typical Job Sequence. In the top flow above, though, the first Stage waits for a file, then kicks off Job 1; if Job 1 finishes successfully, the flow follows the green Link to Job 2, and so on. The green color of the Link indicates that there is a trigger on it (the trigger essentially says: if the Job was successful, go to Job 2). If Job 3 completes successfully, the flow goes on to execute a command. The red Links indicate error activity. Should any of Jobs 1, 2, or 3 fail, the flow follows a red Link to a Sequencer Stage (which can require all or any of its inputs to complete); in this case, if any of the Links reach it, it triggers a notification that e-mails a warning that there has been a Job failure.
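The trigger logic in the flow above can be sketched as simple conditional branching (Job names hypothetical; this models the green/red Link semantics, not DataStage's implementation):

```python
def run_sequence(run_job, jobs):
    """Sketch of Sequence trigger semantics: on success ('green' Link)
    control passes to the next Job; on failure ('red' Link) control
    diverts to a notification and the flow stops."""
    for name in jobs:
        if run_job(name) != "OK":
            return f"notify: {name} failed"   # red link -> e-mail warning
    return "execute command"                  # green path completed

outcome = run_sequence(lambda j: "OK", ["Job1", "Job2", "Job3"])
print(outcome)
```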



Exception Handler Stage

One final note about our Sequence Job: it is not a scheduler unto itself; it just sequences the activities. Some external scheduler, whether the DataStage scheduler or a third-party enterprise scheduler, must start the Sequence Job, and only from there will it call all the other Jobs for you. Think of it as an entry point into your application from whatever external source you intend to use, whether that is kicking it off from the command line or using something like a Tivoli workload manager to execute the Job on a daily, monthly, or conditional schedule. To do that, we need to learn to execute Jobs from the command line. We will talk a little about how the command line can be used in this manner, as well as how it helps us deploy our applications effectively and efficiently.

Pulling Jobs Together

Parameter Sets
Running Jobs from the Command Line
Other Director Functions

Parameter Sets

Let's start by talking about Parameter Sets. One of the things that helps us pull our Jobs together more effectively is the use of Parameter Sets. Parameters are needed for various functions within our Jobs and are extremely useful in providing Jobs with flexibility and versatility. A Parameter Set allows us to store a number of parameters in a single named object. One or more value files can then be named and specified; a value file stores the values for each parameter within the Parameter Set, and these values are picked up at runtime.


Parameter Sets can be added to a Job's parameter list, which is on the Parameters tab within the Job Properties. This is very convenient: as we develop our Jobs, we don't have to enter a long list of parameters for every single Job, risking mis-keying or omissions that would prevent our parameters from passing successfully from the Sequence Job into our lower-level Jobs.

An example of a Parameter Set might include everything that describes a certain environment. We might set up a Parameter Set named Environment containing parameters such as the host name, the host type, the database name, the database type, usernames, and passwords (not to be confused with DataStage Connection Objects). We could then have one set of values in a value file used specifically for development, another set for testing, and another for production. As we move a Job from one environment to the next, we simply select the appropriate value file for that environment from the Parameter Set.
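Conceptually, a Parameter Set with per-environment value files behaves like the following sketch (the set name, parameter names, and values are all illustrative):

```python
# Named value files inside one hypothetical Parameter Set, "Environment".
ENVIRONMENT = {
    "dev":  {"HostName": "devbox",  "DBName": "DEVDB",  "User": "dev_usr"},
    "test": {"HostName": "testbox", "DBName": "TESTDB", "User": "tst_usr"},
    "prod": {"HostName": "prodbox", "DBName": "PRODDB", "User": "prd_usr"},
}

def resolve(value_file):
    """Pick up every parameter value at runtime from the chosen value
    file, instead of keying each parameter into each Job by hand."""
    return ENVIRONMENT[value_file]

print(resolve("test")["DBName"])
```

Promoting a Job from test to production then amounts to selecting a different value file, not editing the Job.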


Running Jobs from the Command Line

Now that we've learned the basics of how to sequence our Jobs and group our parameters, we have an application. Say we've built, developed, and tested a Job and are ready to put it into production. The next question is: how do we execute it? Although it can be executed by hand using the DataStage Director client, this is not the typical way applications are run. Instead, a series of command-line options allows us to start a Job from the native command line, or to have a scheduler call our Jobs. The graphic above shows some of the particulars.



Other Director Functions

We mentioned that we can schedule Jobs using the DataStage Director, but more often than not, Jobs are scheduled by an external scheduler. To do this, we must run Jobs from the command line, as most schedulers use command-line execution.

The InfoSphere suite provides a utility for DataStage called dsjob. dsjob is a multifunctional command-line interface that allows us to perform a number of functions, one of the most common being to run a Job. By executing the dsjob command with various options after it, you can automate a number of functions from a whole host of other tools, since the command line is the most common method for launching other activities. The first example in the graphic above runs a Job; in it, we pass parameters that include the number of rows that we want to run, the name of the Project we want to run from, and the name of the Job.

dsjob also has other functions, such as reporting a Job's run status, a summary of its messages, and even Link information. The example above (in the second bullet point) displays a summary of all Job messages in the log. You can use the -logsum or -logdetail options of the dsjob command to write the log output to a file, which can then be archived, or even picked up by another DataStage Job and loaded into a database (any database; it doesn't have to be part of the DataStage repository). With DataStage version 8.1, you can store logged information directly in the Xmeta repository or keep it locally in the DataStage engine. You may choose the latter for performance reasons, so that you don't tie up your Xmeta database with logging activity while the Job is running, which could affect your network if Xmeta is located remotely from your DataStage server. You can then pull the logging information out afterwards: let the logs accumulate in the local DataStage engine's repository and, only at the end, extract that data and process it into your master repository (whether Xmeta or some other relational data source) for long-term use. Because this happens after the application has finished, it does not hurt the application's runtime performance. As mentioned earlier, Xmeta is the repository for all of DataStage. dsjob's functions are documented in the Parallel Job Advanced Developer's Guide. Typically, the dsjob command is encapsulated inside some kind of shell script (on Windows, UNIX, or whatever system DataStage is running on), and we want to make sure all of its options are correct before issuing the command.
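A wrapper script typically assembles and validates the dsjob options before issuing the command. A sketch of such a wrapper (the -run and -param flags are standard dsjob options, but the exact syntax varies by platform and version; the project and job names are hypothetical):

```python
def build_dsjob_run(project, job, params):
    """Assemble a dsjob invocation of the general form:
    dsjob -run -param NAME=VALUE ... project job"""
    cmd = ["dsjob", "-run"]
    for name, value in params.items():
        cmd += ["-param", f"{name}={value}"]  # one -param flag per value
    return cmd + [project, job]

cmd = build_dsjob_run("ACCProj", "FeedJobD", {"SourceSystemID": "4"})
print(" ".join(cmd))
```

A scheduler would then execute the assembled command (for example via a shell script), rather than an operator running the Job by hand in Director.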


Product Simulation

Welcome to the Guided Tour Product Simulation. Earlier in the tutorial, we discussed many of the important concepts in DataStage. In this interactive section, you will use the product, without having to perform any installations, to create DataStage Jobs, navigate the UI, and learn by doing. We will first introduce a case study in which you will learn about a business named Amalgamated Conglomerate Corporation (ACC), which has hired you as their DataStage Developer! ACC is a large holding company that owns many subsidiary companies. In our scenario, we are going to follow what happens to ACC over the period of a year as it acquires new companies. In this section you will do the following:

1. Understand the Business Problem and Case Scenario
2. Assess the Existing Architecture
3. Look at a Feed Job
4. Create a Feed Job
5. Modify the Consolidation Job
6. Create a Sequence Job


Let's look at ACC's corporate structure as it stands at the beginning of the year, on January 1st. As you can see in the graphic above, ACC has four subsidiary companies: Acme Universal Manufacturing (Company A), Big Box Stores (Company B), Cracklin Communications (Company C), and Eco Research and Design (Company E). All of these belong to ACC, and information from all of them will need to be pulled together from time to time.

Now some time has gone by, and on March 3rd the state of ACC has changed: your employer has acquired a new company called Disks of the World (Company D). Disks of the World will now need to feed its information into ACC's Data Warehouse. Let's go back to January 1st and see what the feeds looked like before Disks of the World came on board.


Each of the individual subsidiaries (Companies A, B, C, and E) feeds its data into a common customer Data Mart, in two steps. Let's quickly look at how this works from a more technical standpoint. Above, we can see four feeds, one for each company and its customers. The feeds are all written to files; each FEED FILE is generated by an individual DataStage DS FEED JOB.

These files are later pulled together by a consolidation Job (a DataStage Job) and loaded into the COMMON CUSTOMER DATA MART. The reason we have several different feeds running to files is that each subsidiary puts out its data at a different interval, but we only want to consolidate in one fell swoop (ACC's business requirement is that we put all the feed files into the Data Mart at one time). Acme (Company A) puts out its data once per day. Big Box (Company B) has such a large feed that it is produced once an hour. Cracklin Communications (Company C) has a large feed too, but not quite as large as Company B's, so it is produced four times per day. Eco Research (Company E) has such a small customer base that it produces its feed only once per week. ACC's business requirement, though, is a daily consolidation of these feeds into the Common Customer Data Mart. The consolidation will therefore pull in many files from Companies B



and C, probably only one from Company A, and then, once per week, there will also be one in the heap from Company E.

Now let's move ahead in time to March 3rd. You'll recall that ACC has acquired another company, Disks of the World. ACC's DataStage Developer (that's you) will now need to add a new DataStage feed for the customer data coming from Company D into the Common Customer Data Mart. You will build a DataStage feed that accommodates Company D's need to produce its data multiple times per day during its sales cycle. This feed will be landed to the same set of files (the feed files at the center of the graphic), which are then picked up by the CONSOLIDATION DS JOB that combines them into the mart. Finally, you will build a special type of Job called a Sequence Job, which is used to control the running of other DataStage Jobs. In other words, we'll see how all of the Jobs (all the DS FEED JOBs and the CONSOLIDATION JOB) can be brought together using a Sequence Job that calls them in the correct order. That way, we don't have to schedule them individually in our corporate-wide scheduling software; we just schedule the one Sequence.



In sum, as the DataStage Developer for ACC, you will need to do the following:

1. View the existing Acme DS FEED JOB to see how the previous DataStage Developer at ACC built it.
2. Create a new DataStage Job, shown highlighted above (Company D's DS FEED JOB).
3. Modify ACC's DS CONSOLIDATION JOB to bring this new feed (from Company D) into the Data Mart.
4. Later in this Guided Tour Product Simulation, create a Job that controls all of these Jobs.
5. A technical problem has been identified and reported to ACC in the past. You will need to understand the issue so that, as you develop the new DS FEED JOB, you help solve it: ACC needs to recognize any particular customer as having come from a particular subsidiary.
6. One thing all of these DS FEED JOBs have in common is that they need to find a common customer ID for the corporation. Since each subsidiary developed its own systems independently of ACC, the numbering systems they use for their customers are different and may overlap inappropriately. For example, Company A has a Customer 123 that represents Joe Smith, while over at Company B, Customer 123 represents Emmanuel Jones Enterprises. Even though these two share the same number, they are not the same customer. If we put them into the Common Customer Data Mart as-is, we would have ambiguity and not know which customer we were dealing with.
7. The Common Customer Data Mart has its own numbering system for these customers. It differentiates by using the Customer ID that comes from the company (Joe Smith = Customer 123) along with an additional identifier that indicates which company the data came from (Company A is known to the Data Mart as Source System ID = 1, for instance).
   These two fields, when combined, create a new field that serves as an alternate key and a unique identifier for Joe in the Common Customer Data Mart. This number can distinguish any customer from any subsidiary.
8. Our DataStage Feed Jobs will need to find out whether their customer is already known to the Common Customer Data Mart.
9. If the customer is not known, we hand it off to ACC's Data Mart team to add it in a separate process (a black-box Job that we will not see in this tutorial), which inserts the new customer into the Data Mart for the first time.
10. However, if the customer is already known to the Mart, the Job must find the unique ID and add it to the data (for example, Customer_123SourceSysID_1 is added to an ADDRESS update for Joe Smith). This way, when it comes time to load the Common Customer Data Mart (with the Consolidation DS Job), we won't have ambiguity in our data. We will be able to load it correctly and distinctly, and Joe's address is updated.
11. Let's talk about how the Consolidation DataStage Job would do this.
12. ACC's Business Requirements
13. Each Feed Job designates the source system (company) from which its data is coming; in other words, each Feed Job uniquely identifies its subsidiary. For example, Source System 1 means the data comes from Company A (Acme), Source System 2 means Company B (Big Box Stores), and so forth. When the data is processed, the source system number is passed into the Job as a parameter. The Job must then perform a lookup, using both the source system ID and the Customer ID from the source system, into the table from the Common Customer Data Mart
to see if this compound key already exists. This is called a reverse lookup. In other words, the Job uses those two columns to find the true key for the Common Customer Data Mart, which is then returned into the Job so that that column (the true key) can be carried along with all the other native source columns in the row as it passes through the Job.
14. Additionally, we need to code the Job so that it converts all of the source-system-specific metadata from the source system's layout into the target layout. Coming into the Job, the metadata appears as it would in the source system (Acme's layout of the customer), but on the way out we should see the same data captured in columns and fields that conform to the Common Customer Data Mart's metadata structure.
15. The Common Customer ID is a unique key within the Common Customer Data Mart. This key has no meaning to any individual source system, so the only way a source system can find it is through this feed process. The feed process enriches the data by looking into the Common Customer Data Mart and finding the Common Customer ID, which differentiates all of our customers across all of our subsidiary companies. If the customer is new, it is sent to another process for insertion into the Customer Data Mart. That insertion Job is not discussed in this tutorial and is considered a black-box process used by Amalgamated Conglomerate Corporation to add new customers.
16. In summary, the unique Customer ID in the Common Customer Data Mart compounds the Source System Identifier with a source-specific Customer ID into a single field that, for example, might contain a value like 9788342.
17. This field is unique, so no customer overlaps with another.
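The reverse lookup and layout conversion described above can be sketched in a few lines. This is a hypothetical Python illustration, not DataStage logic: the cross-reference table, column names, and sample rows are all invented for the example. In the real Job, the lookup would be a Lookup stage against a Data Mart table, with the source system ID supplied as a Job parameter.

```python
# Hypothetical cross-reference from the Common Customer Data Mart:
# (source_system_id, source_customer_id) -> Common Customer ID.
XREF = {
    (1, "123"): "9788342",  # Company A's Customer 123 (Joe Smith)
    (2, "123"): "9788343",  # Company B's Customer 123 (Emmanuel Jones Enterprises)
}

def feed_row(source_row, source_system_id):
    """Reverse-lookup the Common Customer ID and convert the row to
    the target layout. Returns None for a customer not yet in the Mart
    (that row would be handed off to the separate insertion process)."""
    key = (source_system_id, source_row["cust_no"])
    common_id = XREF.get(key)
    if common_id is None:
        return None  # new customer: hand off to the black-box insertion Job
    # Convert the source layout (invented column names) to the target layout.
    return {
        "COMMON_CUSTOMER_ID": common_id,
        "SOURCE_SYSTEM_ID": source_system_id,
        "CUSTOMER_NAME": source_row["name"],
        "ADDRESS": source_row["addr"],
    }

# Customer 123 from Company A and Customer 123 from Company B resolve to
# *different* Common Customer IDs, so the ambiguity described above is gone.
joe = feed_row({"cust_no": "123", "name": "Joe Smith", "addr": "1 Main St"}, 1)
```

Note how the compound key, not the bare customer number, is what makes each subsidiary's Customer 123 distinct in the Mart.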



Look At Acme DS Feed Job

Now that we know the business requirements, the architecture, and the development strategy, let's look at the DS Feed Job for the Acme subsidiary (Company A) that was built previously by another DataStage Developer at ACC (highlighted above). After we've seen how it works, and after ACC makes another acquisition in March, we will move to a section of the tutorial in which you will build a similar DS Feed Job for Disks of the World (Company D).

In order to look at the DataStage Feed Job for Acme Universal Manufacturing, we'll need to bring up the DataStage Designer client. In this Job, we will see how it pulls the data from Acme's operational database and puts that data into a file after looking up the appropriate Common Customer ID from the Data Mart. Then we'll look at the Consolidation Job, where we will see how all of the individual Feed files are brought together prior to being loaded into the Common Customer Data Mart. We need the highlighted client to look at, create, and modify DataStage Jobs.



Our login screen comes up. Here, we'll need to enter username and password information. At the top of the screen, you can see the domain in which we are working. At the bottom of the screen, we need to make sure that we are working with the correct DataStage Project; in this case it is the AmalCorp Project (Amalgamated Conglomerate Corporation Project). Then, with a password filled in for you (for purposes of this tutorial), enter a username of student and click the OK button. (The NEXT button below has been disabled for the remainder of the Guided Tour Product Simulation.)

Our Designer client now connects to the DataStage Server, which authenticates and authorizes us through the Server's security layer. A screen then pops up within which we can work on our Job. First, we are asked whether we want to create a Job and, if so, what type. For right now, since we will look at an existing Job, just click the Cancel button.
