
DATASTAGE ESSENTIALS


Table of Contents

Module 1: Introduction to DataStage ............................ 1-01
Course Objectives .......................................................................................................... 1-02
Module Objectives ......................................................................................................... 1-03
What is a Data Warehouse? ........................................................................................... 1-04
What is DataStage? ........................................................................................................ 1-06
Developing in DataStage ............................................................................................... 1-13

Module 2: Installing DataStage ..................................... 2-01


Module Objectives ......................................................................................................... 2-02
DataStage Server Installation......................................................................................... 2-03
DataStage Clients Installation........................................................................................ 2-06

Module 3: Configuring Projects ..................................... 3-01


Module Objectives ......................................................................................................... 3-02
Project Properties ........................................................................................................... 3-03
Project Properties General Tab ...................................................................................... 3-05
Permissions Tab ............................................................................................................. 3-06
Tracing Tab.................................................................................................................... 3-07
Schedule Tab.................................................................................................................. 3-08
Tunables Tab .................................................................................................................. 3-09

Module 4: Designing and Running Jobs ........................ 4-01


Module Objectives ......................................................................................................... 4-02
What is a Job? ................................................................................................................ 4-03
Job Development Overview........................................................................................... 4-04
Designer Work Area ...................................................................................................... 4-05
Sequential Stage............................................................................................................. 4-11
Transformer Stage.......................................................................................................... 4-14


Compiling a Job.............................................................................................................. 4-19
Running Your Job ........................................................................................................... 4-20

Module 5: Working with Metadata................................. 5-01


Module Objectives ......................................................................................................... 5-02
What is Metadata?.......................................................................................................... 5-03
Import and Export .......................................................................................................... 5-06
Metadata Import............................................................................................................. 5-14
Loading Table Definitions ............................................................................................. 5-18

Module 6: Working with Relational Data ....................... 6-01


Module Objectives ......................................................................................................... 6-02
Working with Relational Data ........................................................................................ 6-03
Setting up an ODBC Connection................................................................................... 6-04
Importing ODBC Metadata............................................................................................ 6-06
Extracting Relational Data............................................................................................. 6-07
ODBC Target Stage ....................................................................................................... 6-15

Module 7: Constraints and Derivations .......................... 7-01


Module Objectives ......................................................................................................... 7-02
To Define a Constraint................................................................................................... 7-04
Derivations..................................................................................................................... 7-09
Stage Variables .............................................................................................................. 7-13

Module 8: Creating BASIC Expressions ........................ 8-01


Module Objectives ......................................................................................................... 8-02
DataStage BASIC .......................................................................................................... 8-03
System Variables ........................................................................................................... 8-07
DataStage Routines........................................................................................................ 8-12
Data Elements................................................................................................................. 8-13
DataStage Transforms.................................................................................................... 8-14


Date Manipulation ......................................................................................................... 8-18
Using Iconv and Oconv ................................................................................................. 8-19
Date Transforms............................................................................................................. 8-25

Module 9: Troubleshooting ............................................ 9-01


Module Objectives ......................................................................................................... 9-02
Job Log........................................................................................................................... 9-03
Event Window ............................................................................................................... 9-04
DataStage Debugger ...................................................................................................... 9-10
Setting Breakpoints........................................................................................................ 9-12
Debug Window .............................................................................................................. 9-14

Module 10: Defining Lookups ...................................... 10-01


Module Objectives ....................................................................................................... 10-02
Hashed Files................................................................................................................. 10-03
Creating and Loading a Hashed File............................................................................ 10-04
Extracting Data from a Hashed File............................................................................. 10-07
Defining a Lookup ....................................................................................................... 10-11

Module 11: Aggregating Data ...................................... 11-01


Module Objectives ....................................................................................................... 11-02
Aggregation Functions................................................................................................. 11-03
Aggregator Stage ......................................................................................................... 11-06
Defining Aggregations................................................................................................. 11-11

Module 12: Job Control................................................ 12-01


Module Objectives ....................................................................................................... 12-02
Job Parameters ............................................................................................................. 12-03
Before and After Routines ........................................................................................... 12-08


Job Control Functions .................................................................................................. 12-10
Controlling Jobs ........................................................................................................... 12-12
Job Sequencer .............................................................................................................. 12-12
DataStage Containers................................................................................................... 12-22

Module 13: Working with Plug-Ins ............................... 13-01


Module Objectives ....................................................................................................... 13-02
What is a Plug-In?........................................................................................................ 13-03
Installing a Plug-In....................................................................................................... 13-04
Sort Stage...................................................................................................................... 13-15

Module 14: Scheduling and Reporting ........................ 14-01


Module Objectives ....................................................................................................... 14-02
Job Scheduling ............................................................................................................. 14-03
Job Reporting ............................................................................................................... 14-06


Module 1

Introduction to DataStage


A data warehouse is a central database that integrates data from many operational sources within an organization. The data is transformed, summarized, and organized to support business analysis and report generation. A data warehouse is:
- A repository of data
- Optimized for analysis
- Used to support business projections, comparisons, and assessments

The data is:
- Extracted from operational sources
- Integrated, summarized, filtered, cleansed, and denormalized
- Historical


Data marts are like data warehouses but smaller in scope. Frequently an organization will have both an enterprise-wide data warehouse and data marts that extract data from it for specialized purposes. Data marts:
- Are like data warehouses but smaller in scope
- Organize data from a single subject area or department
- Solve a small set of business requirements
- Are cheaper and faster to build than a data warehouse
- Distribute data away from the data warehouse


DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses. It provides the tools you need to build, manage, and expand them. With DataStage, you can build solutions faster and give users access to the data and reports they need. With DataStage you can:
- Design the jobs that extract, integrate, aggregate, load, and transform the data for your data warehouse or data mart.
- Create and reuse metadata and job components.
- Run, monitor, and schedule these jobs.
- Administer your development and execution environments.


DataStage is client/server software. The server stores all DataStage objects and metadata in a repository, which consists of the UniVerse RDBMS. The clients interface with the server. The clients run on Windows 95 or later (Windows 98, NT, 2000). The server runs on Windows NT 4.0 and Windows 2000. Most versions of UNIX are supported. See the installation release notes for details. The DataStage client components are:
- Administrator: Administers DataStage projects and conducts housekeeping on the server.
- Designer: Creates DataStage jobs that are compiled into executable programs.
- Director: Used to run and monitor the DataStage jobs.
- Manager: Allows you to view and edit the contents of the repository.


True or False? The DataStage Server and clients must be running on the same machine.

Answer: False. Typically, there are many client machines each accessing the same DataStage Server running on a separate machine. The Server can be running on Windows NT or UNIX. The clients can be running on a variety of Windows platforms.


Use the Administrator to specify general server defaults, add and delete projects, and set project properties. The Administrator also provides a command interface to the UniVerse repository. Use the Administrator Project Properties window to:
- Set job monitoring limits and other Director defaults on the General tab.
- Set user group privileges on the Permissions tab.
- Enable or disable server-side tracing on the Tracing tab.
- Specify a user name and password for scheduling jobs on the Schedule tab.
- Specify hashed file stage read and write cache sizes on the Tunables tab.

General server defaults can be set on the Administrator DataStage Administration window (not shown):
- Change license information.
- Set server connection timeout.

The DataStage Administrator is discussed in detail in a later module.


Use the Manager to store and manage reusable metadata for the jobs you define in the Designer. This metadata includes table and file layouts and routines for transforming extracted data. Manager is also the primary interface to the DataStage repository. In addition to table and file layouts, it displays the routines, transforms, and jobs that are defined in the project. Custom routines and transforms can also be created in Manager.


The DataStage Designer allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating, and loading data into warehouse tables. The Designer provides a visual data flow method to easily interconnect and configure reusable components. Use Designer to:
- Specify how the data is extracted.
- Specify data transformations.
- Decode (denormalize) data going into the data mart using reference lookups. For example, if the sales order records contain customer IDs, you can look up the name of the customer in the CustomerMaster table. This avoids the need for a join when users query the data mart, thereby speeding up access.
- Aggregate data.
- Split data into multiple outputs on the basis of defined constraints.
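The reference-lookup idea can be sketched outside DataStage. The following is a hypothetical Python illustration (the table and column names are invented, and this is not how DataStage implements it): the customer name is resolved once at load time, so the data mart row already carries it and no join is needed at query time.

```python
# Hypothetical illustration of a reference lookup during a data-mart load.
# In DataStage this is done graphically with a lookup; names are invented.

customer_master = {            # reference data, keyed by customer ID
    "C001": "Acme Corp",
    "C002": "Globex Inc",
}

sales_orders = [               # incoming order rows carrying only the ID
    {"order_id": 1, "cust_id": "C001", "amount": 250.0},
    {"order_id": 2, "cust_id": "C002", "amount": 99.5},
]

def denormalize(orders, lookup):
    """Attach the customer name to each order row via a reference lookup."""
    out = []
    for row in orders:
        enriched = dict(row)
        enriched["cust_name"] = lookup.get(row["cust_id"], "UNKNOWN")
        out.append(enriched)
    return out

rows = denormalize(sales_orders, customer_master)
print(rows[0]["cust_name"])  # Acme Corp
```

Because the name is stored directly on each loaded row, a query against the data mart never has to join back to the customer table.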

You can easily move between the Director, Designer, and Manager by selecting commands in the Tools menu.


Use the Director to validate, run, schedule, and monitor your DataStage jobs. You can also gather statistics as the job runs.


The overall development workflow is:
- Define your project's properties: Administrator
- Open (attach to) your project
- Import metadata that defines the format of data stores your jobs will read from or write to: Manager
- Design the job: Designer
  - Define data extractions (reads)
  - Define data flows
  - Define data integration
  - Define data transformations
  - Define data constraints
  - Define data loads (writes)
  - Define data aggregations
- Compile and debug the job: Designer
- Run and monitor the job: Director


All your work is done in a DataStage project. Before you can do anything, other than some general administration, you must open (attach to) a project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator. A project is associated with a directory. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata. You must open (attach to) a project before you can do any work in it. Projects are self-contained. Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them. Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from accessing the same job at the same time.


True or False? DataStage Designer is used to build and compile your Extraction, Transformation, and Load (ETL) jobs.

Answer: True. With Designer you can graphically build your job by placing graphical components (called "stages") on a canvas. After you build it, your job is compiled in Designer.


True or False? DataStage Manager is used to execute your jobs after you build them.

Answer: False. DataStage Manager is your primary interface to the DataStage repository. Use Manager to manage metadata and other DataStage objects.


True or False? DataStage Director is used to execute your jobs after they have been built.

Answer: True. Use Director to validate and run your jobs. You can also monitor the job while it is running.


True or False? DataStage Administrator is used to set global and project properties.

Answer: True. You can set some global properties, such as connection timeout, as well as project properties, such as permissions.


Module 2

Installing DataStage


The DataStage server should be installed before the DataStage clients are installed. The server can be installed on Windows NT (including Workstation and Server), Windows 2000, or UNIX. This module describes the Windows NT installation. The exact system requirements depend on your version of DataStage. See the installation CD for the latest system requirements. To install the server you will need the installation CD and a license for the DataStage server. The license contains the following information:
- Serial number
- Project count: the maximum number of projects you can have installed on the server. This includes new projects as well as previously created projects to be upgraded.
- Expiration date
- Authorization code

This information must be entered exactly as written in the license.


The installation wizard guides you through the following steps:
- Enter license information
- Specify server directories
- Select program folder
- Create new projects and/or upgrade existing projects


The DataStage clients should be installed after the DataStage server is installed. The clients can be installed on Windows 95, Windows 98, Windows NT, or Windows 2000. There are two editions of DataStage. The Developer's edition contains all the client applications (in addition to the server). The Operator's edition contains just the client applications needed to run and monitor DataStage jobs (in addition to the server), namely, the Director and Administrator.

To install the Developer's edition you need a license for DataStage Developer. To install the Operator's edition you need a license for DataStage Director. The license contains the following information:
- Serial number
- User limit
- Expiration date
- Authorization code

This information must be entered exactly as written in the license.


The DataStage services must be running on the server machine in order to run any DataStage client applications. To start or stop the DataStage services in Windows 2000, open the DataStage Control Panel window in the Windows 2000 Control Panel. Then click Start All Services (or Stop All Services). These services must be stopped when installing or reinstalling DataStage. UNIX note: In UNIX, these services are started and stopped using the uv.rc script with the stop or start command options. The exact name varies by platform. For SUN Solaris, it is /etc/rc2.d/S99uv.rc.


Module 3

Configuring Projects


In DataStage all development work is done within a project. Projects are created during installation and after installation using Administrator. Each project is associated with a directory. The directory stores the objects (jobs, metadata, custom routines, etc.) created in the project. Before you can work in a project you must attach to it (open it). You can set the default properties of a project using DataStage Administrator.


Click Properties on the DataStage Administration window to open the Project Properties window. There are five active tabs. (The Mainframe tab is only enabled if your license supports mainframe jobs.) The default is the General tab. If you select the Enable job administration in Director box, you can perform some administrative functions in Director without opening Administrator. When a job is run in Director, events are logged describing the progress of the job. For example, events are logged when a job starts, when it stops, and when it aborts. The number of logged events can grow very large. The Auto-purge of job log box allows you to specify conditions for purging these events. You can limit the logged events either by number of days or by number of job runs.
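The auto-purge policy can be pictured as a simple filter over log entries. Below is a hypothetical Python sketch of the two purge conditions (by age in days, or by keeping only the last N job runs); the entry fields are invented and this is not how DataStage implements purging.

```python
# Hypothetical sketch of the two auto-purge policies for a job log.
# Each entry records the run it belongs to and its age in days.

log = [
    {"run": 1, "age_days": 10, "msg": "job started"},
    {"run": 1, "age_days": 10, "msg": "job finished"},
    {"run": 2, "age_days": 3,  "msg": "job started"},
    {"run": 3, "age_days": 1,  "msg": "job aborted"},
]

def purge_by_days(entries, max_days):
    """Keep only entries no older than max_days."""
    return [e for e in entries if e["age_days"] <= max_days]

def purge_by_runs(entries, keep_runs):
    """Keep only entries belonging to the most recent keep_runs job runs."""
    recent = sorted({e["run"] for e in entries})[-keep_runs:]
    return [e for e in entries if e["run"] in recent]

print(len(purge_by_days(log, 5)))   # 2
print(len(purge_by_runs(log, 2)))   # 2
```

Either policy keeps the log bounded; the day-based policy discards by wall-clock age, while the run-based policy always preserves complete recent runs.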


Use this page to set user group permissions for accessing and using DataStage. All DataStage users must belong to a recognized user role before they can log on to DataStage. This helps to prevent unauthorized access to DataStage projects. There are three roles of DataStage user:
- DataStage Developer, who has full access to all areas of a DataStage project.
- DataStage Operator, who can run and manage released DataStage jobs.
- <None>, who does not have permission to log on to DataStage.

UNIX note: In UNIX, the groups displayed are defined in /etc/group.


This tab is used to enable and disable server-side tracing. The default is for server-side tracing to be disabled. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files. Users with in-depth knowledge of the system software can use it to help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client. Warning: Tracing causes a lot of server system overhead. This should only be used to diagnose serious problems.


Use the Schedule tab to specify a user name and password for running scheduled jobs in the selected project. If no user is specified here, the job runs under the same user name as the system scheduler.


On the Tunables tab, you can specify the sizes of the memory caches used when reading rows in hashed files and when writing rows to hashed files. Hashed files are mainly used for lookups and are discussed in a later module.


Module 4

Designing and Running Jobs


A job is an executable DataStage program. In DataStage, you can design and run jobs that perform many useful data warehouse tasks, including data extraction, data conversion, data aggregation, and data loading. DataStage jobs are:
- Designed and built in Designer.
- Scheduled, invoked, and monitored in Director.
- Executed under the control of DataStage.


In this module, you will go through the whole job development process with a simple job. The one exception is metadata import: in this module you will manually define the metadata rather than importing it.


In the center is the Designer canvas. On it you place stages and links from the Tools Palette on the right. On the bottom left is the Repository window, which displays the branches in Manager. Items in Manager, such as jobs and table definitions, can be dragged to the canvas area. Click View>Repository to display the Repository window. Click View>Property Browser to display the Property Browser window. This window displays the properties of objects selected on the canvas.


The toolbar at the top contains quick access to the main functions of Designer.


The tool palette contains icons that represent the components you can add to your job design.

Most of the stages shown here are automatically installed when you install DataStage. You can also install additional stages, called plug-ins, for special purposes. For example, there is a plug-in called Sort that can be used to sort data. Plug-ins are discussed in a later module.


There are two kinds of stages:
- Passive stages define read and write access to data sources and repositories: Sequential, ODBC, Hashed.
- Active stages define how data is filtered and transformed: Transformer, Aggregator, Sort plug-in.


True or False? The Sequential stage is an active stage.

Answer: False. The Sequential stage is considered a passive stage because it is used to extract or load sequential data from a file. It is not used to transform or modify data.


The Sequential stage is used to extract data from a sequential file or to load data into a sequential file. The main things you need to specify when editing the Sequential file stage are:
- Path and name of the file
- File format
- Column definitions
- If the stage is being used as a target, the write action: overwrite the existing file or append to it.
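To make those items concrete, here is a hypothetical Python sketch of what a Sequential source stage does conceptually: read a named file, apply a delimited format, and map each field to a defined column. The data and column names are invented; DataStage itself does all of this through the stage's property dialogs, not code.

```python
# Conceptual sketch of a sequential source: format plus column definitions.
# Not DataStage code; the sample data and columns are invented.
import csv
import io

# Stand-in for a comma-delimited sequential file with no header row.
data = io.StringIO("101,Smith,250.00\n102,Jones,99.50\n")

columns = ["cust_id", "last_name", "amount"]   # the column definitions

reader = csv.reader(data, delimiter=",")       # the file format
rows = [dict(zip(columns, rec)) for rec in reader]

print(rows[0]["last_name"])  # Smith
```

The key point is that the file itself carries no metadata: the format and column definitions you supply are what turn a flat byte stream into named columns.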


Defining a sequential target stage is similar to defining a sequential source stage. You are defining the format of the data flowing into the stage, that is, from the input links. Define each input link listed in the Input name box. You are defining the file the job will write to. If the file doesn't exist, it will be created. Specify whether to overwrite or append the data in the Update action set of buttons. On the Format tab, you can specify a different format for the target file than you specified for the source file. If the target file doesn't exist, you will not (of course!) be able to view its data until after the job runs. If you click the View data button, DataStage will return a "Failed to open" error. The column definitions you defined in the source stage for a given (output) link will appear already defined in the target stage for the corresponding (input) link. Think of a link as a pipe: what flows in one end flows out the other end. The format going in is the same as the format going out.
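The overwrite-versus-append choice maps directly onto file open modes. A hypothetical Python sketch (the file name is invented, and this is an illustration of the concept, not DataStage internals):

```python
# Sketch of the two Update actions for a sequential target file.
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "target_demo.txt")  # invented name

def write_rows(rows, append=False):
    """Overwrite the target file, or append to it, like the Update action."""
    mode = "a" if append else "w"      # append vs overwrite
    with open(path, mode) as f:        # file is created if it doesn't exist
        for row in rows:
            f.write(row + "\n")

write_rows(["row1", "row2"])           # overwrite: file now has 2 lines
write_rows(["row3"], append=True)      # append: file now has 3 lines

with open(path) as f:
    print(len(f.readlines()))  # 3
```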


The Transformer stage is the primary active stage. Other active stages perform more specialized types of transformations. In the Transformer stage you can specify:
- Column mappings
- Derivations
- Constraints

A column mapping maps an input column to an output column. Values are passed directly from the input column to the output column. Derivations calculate the values to go into output columns based on values in zero or more input columns. Constraints specify the conditions under which incoming rows will be written to output links.
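The three concepts can be sketched as a per-row pipeline. The following is a hypothetical Python illustration (column names invented); in DataStage these are defined graphically in the Transformer grid, with derivations and constraints written as BASIC expressions, not Python.

```python
# Sketch of Transformer concepts: a constraint filters rows, a mapping
# copies a column through, and a derivation computes a new column value.
# Column names are invented for illustration.

input_rows = [
    {"qty": 5, "unit_price": 10.0, "region": "EAST"},
    {"qty": 0, "unit_price": 99.0, "region": "WEST"},
    {"qty": 2, "unit_price": 4.5,  "region": "EAST"},
]

def transform(rows):
    out = []
    for r in rows:
        if not r["qty"] > 0:                      # constraint: only rows that
            continue                              # pass go to the output link
        out.append({
            "region": r["region"],                # column mapping: passed through
            "total": r["qty"] * r["unit_price"],  # derivation: computed value
        })
    return out

print(transform(input_rows))
```

The second input row fails the constraint, so only two rows reach the output link, each carrying one mapped column and one derived column.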


Notice the following elements of the Transformer:
- The top left pane displays the columns of the input links. If there are multiple input links, multiple sets of columns are displayed.
- The top right pane displays the contents of the output links. We haven't defined any fields here yet. If there are multiple output links, multiple sets of columns are displayed. For now, ignore the Stage Variables window in this pane; it is discussed in a later module.
- The bottom area shows the column definitions (metadata) for the input and output links. If there are multiple input and/or output links, there will be multiple tabs.


Add one or more Annotation stages to the canvas to document your job. An Annotation stage works like a text box with various formatting options. You can optionally show or hide the Annotation stages by pressing a button on the toolbar. There are two Annotation stages. The Description Annotation stage is discussed in a later module.


Type the text in the box. Then specify the various options including: Text font and color Text box color Vertical and horizontal text justification


Before you can run your job, you must compile it. To compile it, click File>Compile or click the Compile button on the toolbar. The Compile Job window displays the status of the compile. If an error occurs: Click Show Error to identify the stage where the error occurred. Click More to retrieve more information about the error.


As you know, you run your jobs in Director. You can open Director from within Designer by clicking Tools>Run Director. In a similar way, you can move between Director, Manager, and Designer. There are two methods for running a job: run it immediately, or schedule it to run at a later time or date.

To run a job immediately: Select the job in the Job Status view. The job must have been compiled. Click Job>Run Now or click the Run Now button in the toolbar. The Job Run Options window is displayed.


This shows the Director Status view. To run a job, select it and then click Job>Run Now.


The Job Run Options window is displayed when you click Job>Run Now. This window allows you to stop the job after: A certain number of rows. A certain number of warning messages.

You can validate your job before you run it. Validation performs some checks that are necessary in order for your job to run successfully. These include: Verifying that connections to data sources can be made. Verifying that files can be opened. Verifying that SQL statements used to select data can be prepared.

Click Run to run the job after it is validated. The Status column displays the status of the job run.


Click the Log button in the toolbar to view the job log. The job log records events that occur during the execution of a job. These events include control events, such as the starting, finishing, and aborting of a job; informational messages; warning messages; error messages; and program-generated messages.



Module 5: Working with Metadata


DataStage Manager is a graphical tool for managing the contents of your DataStage project repository, which contains metadata and other DataStage components such as jobs and routines. Metadata is data about data that describes the formats of sources and targets. This includes general format information such as whether the record columns are delimited and, if so, the delimiting character. It also includes the specific column definitions.


The left pane contains the project tree. There are eight main branches, but you can create subfolders under each. Select a folder in the project tree to display its contents. In this example, a folder named DS304 has been created that contains some of the jobs in the project.
Data Elements branch: Lists the built-in and custom data elements. (Data elements are extensions of data types, and are discussed in a later module.)
Jobs branch: Lists the jobs in the current project.
Routines branch: Lists the built-in and custom routines. Routines are blocks of DataStage BASIC code that can be called within a job. (Routines are discussed in a later module.)
Shared Containers branch: Shared Containers encapsulate sets of DataStage components into a single stage. (Shared Containers are discussed in a later module.)
Stage Types branch: Lists the types of stages that are available within a job. Built-in stages include the sequential and transformer stages you used in Designer.
Table Definitions branch: Lists the table definitions available for loading into a job.
Transforms branch: Lists the built-in and custom transforms. Transforms are functions you can use within a job for data conversion. (Transforms are discussed in a later module.)


DataStage Manager manages two different types of objects. Metadata describing sources and targets: called table definitions in Manager. These are not to be confused with relational tables. DataStage table definitions are used to describe the format and column definitions of any type of source: sequential, relational, hashed file, etc. Table definitions can be created in Manager or Designer, and they can also be imported from the sources or targets they describe. DataStage components: Every object in DataStage (jobs, routines, table definitions, etc.) is stored in the DataStage repository. Manager is the interface to this repository. DataStage components, including whole projects, can be exported from and imported into Manager.


Any set of DataStage objects, including whole projects, which are stored in the Manager Repository, can be exported to a file. This export file can then be imported back into DataStage. Import and export can be used for many purposes, including: Backing up jobs and projects. Maintaining different versions of a job or project. Moving DataStage objects from one project to another. Just export the objects, move to the other project, then re-import them into the new project. Sharing jobs and projects between developers. The export files, when zipped, are small and can be easily emailed from one developer to another.


Click Export>DataStage Components in Manager to begin the export process. Any object in Manager can be exported to a file. Use this procedure to backup your work or to move DataStage objects from one project to another. Select the types of components to export. You can select either the whole project or select a portion of the objects in the project. Specify the name and path of the file to export to. By default, objects are exported to a text file in a special format. By default, the extension is dsx. Alternatively, you can export the objects to an XML document. The directory you export to is on the DataStage client, not the server.


True or False? You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file. True: Incorrect. Metadata describing files and relational tables are stored as "Table Definitions". Table definitions can be exported and imported as any DataStage objects can. False: Correct! Metadata describing files and relational tables are stored as "Table Definitions". Table definitions can be exported and imported as any DataStage objects can.


True or False? The directory you export to is on the DataStage client machine, not on the DataStage server machine. True: Correct! The directory you select for export must be addressable by your client machine. False: Incorrect. The directory you select for export must be addressable by your client machine.



To import DataStage components, click Import>DataStage Components. Select the file to import. Click Import all to begin the import process or Import selected to view a list of the objects in the import file. You can import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning.



Table definitions define the formats of a variety of data files and tables. These definitions can then be used and reused in your jobs to specify the formats of data stores. For example, you can import the format and column definitions of the Customers.txt file. You can then load this into the sequential source stage of a job that extracts data from the Customers.txt file. You can load this same metadata into other stages that access data with the same format. In this sense the metadata is reusable. It can be used with any file or data store with the same format. If the column definitions are similar to what you need you can modify the definitions and save the table definition under a new name. You can also use the same table definition for different types of data stores with the same format. For example, you can import a table definition from a sequential file and use it to specify the format for an ODBC table. In this sense the metadata is loosely coupled with the data whose format it defines. You can import and define several different kinds of table definitions including: Sequential files, ODBC data sources, UniVerse tables, hashed files.


To start the import, click Import>Table Definitions>Sequential File Definitions. The Import Meta Data (Sequential) window is displayed. Select the directory containing the sequential files. The Files box is then populated with the files you can import. Select the file to import. Select or specify a category (folder) to import into. The format is: <Category>\<Sub-category> <Category> is the first-level sub-folder under Table Definitions. <Sub-category> is (or becomes) a sub-folder under the type.


In Manager, select the category (folder) that contains the table definition. Double-click the table definition to open the Table Definition window. Click the Columns tab to view and modify any column definitions. Select the Format tab to edit the file format specification.


Module 6: Working with Relational Data


You can perform the same tasks with relational data that you can with sequential data. You can extract, filter, and transform data from relational tables. You can also load data into relational tables. Although you can work with many relational databases through native drivers (including UniVerse, UniData, and Oracle), you can access many more relational databases using ODBC. In the ODBC stage, you can specify your query to one or more tables in the database interactively, type the query in, or paste in an existing query.


Before you can access data through ODBC you must define an ODBC data source. In Windows NT, this can be done using the (32 bit) ODBC Data Source Administrator in the Control Panel. The ODBC Data Source Administrator has several tabs. For use with DataStage, you should define your data sources on the System DSN tab (not User DSN). You can install drivers for most of the common relational database systems from the DataStage installation CD. Click Add to define a new data source. When you click Add a list of available drivers is displayed. Select the appropriate driver and then click Finish. Different relational databases have different requirements. As an example, we will define a Microsoft Access data source. Type the name of the data source in the Data Source Name box. Click Select to define a connection to an existing database. Type the name and location of the database. Click Create to define a connection to a new database.



Importing table definitions from ODBC databases is similar to importing sequential file definitions. Click Import>Table Definitions>ODBC Table Definitions in Manager to start the process. The DSN list displays the data sources that are defined for the DataStage Server. Select the data source you want to import from and, if necessary, provide a user name and password. The Import Metadata window is displayed. It lists all tables in the database that are available for import. Select one or more tables and a category to import to, and then click OK.


Extracting data from a relational table is similar to extracting data from a sequential file except that you use an ODBC stage instead of a sequential stage. In this example, we'll extract data from a relational table and load it into a sequential file.


Specify the ODBC data source name in the Data source name box on the General tab of the ODBC stage. You can click the Get SQLInfo button to retrieve the quote character and schema delimiters from the ODBC database.


Specify the table name on the General tab of the Outputs tab. Select Generated query to define the SQL SELECT statement interactively using the Columns and Selection tabs. Select User-defined SQL query to write your own SQL SELECT statement to send to the database.


Load the table definitions from Manager on the Columns tab. The procedure is the same as for sequential files. When you click Load, the Select Columns window is displayed. Select the columns data is to be extracted from.


Optionally, specify a WHERE clause and other additional SQL clauses on the Selection tab.


The View SQL tab enables you to view the SELECT statement that will be used to select the data from the table. The SQL displayed is read-only. Click View Data to test the SQL statement against the database.


If you want to define your own SQL query, click User-defined SQL query on the General tab and then write or paste the query into the SQL for primary inputs box on the SQL Query tab.



Editing an ODBC target stage is similar to editing an ODBC source stage. It includes the following tasks: Specify the data source containing the target table. Specify the name of the table. Select the update action. You can choose from a variety of INSERT and/or UPDATE actions. Optionally, create the table. Load the column definitions from the Manager table definition.


Some of the options are different in the ODBC stage when it is used as a target. Select the type of action to perform from the Update action list. You can optionally have DataStage create the target table or you can load to an existing table. On the View SQL tab you can view the SQL statement used to insert the data into the target table.


On the Edit DDL tab you can generate and modify the CREATE TABLE statement used to create the target table. If you make any changes to column definitions, you need to regenerate the CREATE TABLE statement by clicking the Create DDL button.


On the Transaction Group tab you can specify the features of the transaction. By default, all the rows are written to the target table before a COMMIT. In the Rows per transaction box, you can specify a specific number of rows to write before the COMMIT. Specify the isolation level in the Isolation Level box.
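As a sketch of the idea, here is the same commit-every-N-rows behavior in Python using the standard sqlite3 module (the table, data, and batch size are invented for the illustration; in DataStage the real setting is applied by the ODBC stage itself):

```python
# Model of "Rows per transaction": COMMIT after every N inserted rows
# instead of issuing one COMMIT after all rows are written.
import sqlite3

ROWS_PER_TRANSACTION = 2  # illustrative value

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")

rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
commits = 0
for i, row in enumerate(rows, start=1):
    conn.execute("INSERT INTO target VALUES (?, ?)", row)
    if i % ROWS_PER_TRANSACTION == 0:  # COMMIT every N rows
        conn.commit()
        commits += 1
conn.commit()  # final commit for any remaining rows
count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(count, commits)  # 5 rows loaded, 2 intermediate commits
```

Smaller transaction batches limit how much work is lost or rolled back if the load fails partway through, at the cost of more commit overhead.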



True or False? Using a single ODBC stage, you can only extract data from a single table. True: Incorrect. You can join data from multiple tables within a single data source. False: Correct! You can join data from multiple tables within a single data source.



Module 7: Constraints and Derivations


A constraint specifies the condition under which data flows through a link. For example, suppose you want to split the data in the customer file into separate customer address files based on the customer's country. We need to define a constraint on the USACustomersOut link so that only USA customers are written to the USACustomers.txt file. Similarly for Mexican and German customers. The Others.txt file will contain all other customers.


Click the Constraints button in the toolbar at the top of the Transformer Stage window to open the Transformer Stage Constraints window. The Transformer Stage Constraints window lists all the links out of the transformer. Double-click on the cell next to a link to create the constraint. Rows that are not written out to previous links are written to a rejects link. A row of data is sent down all the links it satisfies. If there is no constraint on a (non-rejects) link, all rows will be sent down the link.


This shows the Constraints window. Constraints are defined for each of the top three links. The Reject Row box is selected for the last link. All rows that fail to satisfy the top three links will be sent down this link.
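The routing logic described above can be sketched in Python (illustrative only; the Mexico and Germany link names are assumptions modeled on the USACustomersOut link from the example):

```python
# Rough analogue of constraint/reject behavior: each row is tested against
# each link's constraint, and rows that satisfy none go down the rejects link.
customers = [{"name": "Ann", "country": "USA"},
             {"name": "Luz", "country": "Mexico"},
             {"name": "Ole", "country": "Norway"}]

links = {"USACustomersOut": lambda r: r["country"] == "USA",
         "MexicoCustomersOut": lambda r: r["country"] == "Mexico",
         "GermanyCustomersOut": lambda r: r["country"] == "Germany"}

outputs = {name: [] for name in links}
outputs["Rejects"] = []

for row in customers:
    matched = False
    for name, constraint in links.items():
        if constraint(row):          # a row goes down every link it satisfies
            outputs[name].append(row)
            matched = True
    if not matched:                  # Reject Row: no constraint held
        outputs["Rejects"].append(row)

print([r["name"] for r in outputs["Rejects"]])  # ['Ole']
```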


True or False? A constraint specifies a condition under which incoming rows of data will be written to an output link. True: Correct! You can separately define a constraint for each output link. If no constraint is written for a particular output link, then all rows will be written to that link. False: Incorrect. You can separately define a constraint for each output link. If no constraint is written for a particular output link, then all rows will be written to that link.


True or False? A Rejects link can be placed anywhere in the link ordering. True: Incorrect. A Rejects link should be placed last in the link ordering, if it is to get every row that doesn't satisfy any of the other constraints. False: Correct! A Rejects link should be placed last in the link ordering, if it is to get every row that doesn't satisfy any of the other constraints.



A derivation is an expression that specifies the value to be moved into a target column (field). Every target column must have a derivation. The simplest derivation is an input column: the value in the input column is moved to the target column. To construct a derivation for a target column, double-click on the derivation cell next to the target column. Derivations are constructed in the same way that constraints are constructed: type constants, type or enter operators from the Operator shortcut menu, and type or enter operands from the Operand shortcut menu.

What's the difference between derivations and constraints? Constraints apply to links; derivations apply to columns. Constraints are conditions, either true or false; derivations specify a value to go into a target column.


In this example the concatenation of several fields is moved into the LocationString target field. For example, "San Jose" : ", " : "CA" : " " : "94089" : " " : "USA" becomes: San Jose, CA 94089 USA. The colon (:) is the concatenation operator. You can insert it from the Operator menu or type it in.
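For comparison, here is the same concatenation in Python, where + plays the role of the BASIC : operator (the field values are those from the example):

```python
# Building the LocationString value by concatenation, Python-style.
city, state, zip_code, country = "San Jose", "CA", "94089", "USA"

# BASIC:  city : ", " : state : " " : zip : " " : country
location_string = city + ", " + state + " " + zip_code + " " + country
print(location_string)  # San Jose, CA 94089 USA
```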


True or False? If the constraint for a particular link is not satisfied, then the derivations defined for that link are not executed. True: Correct! Constraints have precedence over derivations. Derivations in an output link are only executed if the constraint is satisfied. False: Incorrect. Constraints have precedence over derivations. Derivations in an output link are only executed if the constraint is satisfied.



You can create stage variables for use in your column derivations and constraints. Stage variables store values without writing them out to a target file or table. They can be used in expressions just like constants, input columns, and other operands. Stage variables retain their values across reads. This allows them to be used as counters and accumulators. You can also use them to compare a current input value to a previous input value. To create a new stage variable, click the right mouse button over the Stage Variables window and then click Append New Stage Variable (or Insert New Stage Variable). After you create it, you specify a derivation for it in the same way as for columns.
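A Python sketch of the counter and previous-value ideas (the variable names and data are invented for the illustration, and an ordinary loop variable stands in for the stage variable):

```python
# A stage variable keeps its value across rows, so it can count and can
# remember the previous row's value for comparison with the current one.
rows = [10, 10, 25, 25, 30]

sv_count = 0        # "stage variable" used as a counter
sv_previous = None  # "stage variable" holding the previous row's value
changes = []

for value in rows:
    if value != sv_previous:
        sv_count += 1          # count distinct runs of values
    changes.append(sv_count)
    sv_previous = value        # retained for the next row

print(changes)  # [1, 1, 2, 2, 3]
```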


This lists the execution order: Derivations in stage variables are executed before constraints. This allows them to be used in constraints. Next constraints are executed. Then column derivations are executed. Derivations in higher columns are executed before lower columns.


True or False? Derivations for stage variables are executed before derivations for any output link columns. True: Correct! So you can be sure that the derivations for any of the stage variables referenced in column derivations will have already been executed. False: Incorrect. The derivations for stage variables are executed first. So you can be sure that the derivations for any of the stage variables referenced in column derivations will have already been executed.


Module 8: Creating Basic Expressions


DataStage BASIC is a form of BASIC that has been customized to work with DataStage. In the previous module you learned how to define constraints and derivations. Derivations and constraints are written using DataStage BASIC. Job control routines, which are discussed in a later module, are also written in DataStage BASIC. This module will not attempt to teach you BASIC programming. Our focus is on what you need to know in order to construct complex DataStage constraints and derivations.


For more information about BASIC operators than is provided here, search for BASIC Operators in Help. You can insert these operators from the Operators menu (except for the IF operator, which is on the Operands menu).
Arithmetic operators: -, +, *, /
Relational operators: =, <, >, <=, >=
Logical operators: AND, OR, NOT
IF operator: IF min_lvl < 0 THEN "Out of Range" ELSE "In Range"
Concatenation operator (:): "The employee's name is " : lname : ", " : fname
Substring operator ([start, length]). The first character is 1 (not 0). "APPL3245"[1, 4] → "APPL"; "APPL3245"[5, 2] → "32"
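The 1-based substring semantics can be modeled in Python (an analogy, not DataStage code):

```python
# BASIC's substring operator s[start, length] counts from 1; the
# equivalent Python slice is s[start-1 : start-1+length].
def substring(s, start, length):
    """Model of BASIC's s[start, length] with a 1-based start position."""
    return s[start - 1 : start - 1 + length]

print(substring("APPL3245", 1, 4))  # APPL
print(substring("APPL3245", 5, 2))  # 32
```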


For more information about BASIC functions than is provided here, look up Alphabetical List of BASIC Functions and Statements in Help. BASIC functions include the standard Pick BASIC functions. Click Function from the Operands menu to insert a function. Here are a few of the more common functions:
TRIM(string), TRIM(string, character), TRIMF, TRIMB — for example, TRIM("  xyz  ") → "xyz"
LEN(string)
UPCASE(string), DOWNCASE(string)
ICONV, OCONV — ICONV is used to convert values to an internal format; OCONV is used to convert values from an internal format. Very powerful functions, often used for date and time conversions and manipulations. These functions are discussed later in the module.


For more information about system variables than is provided here, look up System Variables in Help. Click System Variable from the Operands menu to insert a system variable.
@DATE, @TIME: Date/time the job started
@YEAR, @MONTH, @DAY: Extracted from @DATE
@INROWNUM, @OUTROWNUM: Numbers of rows read from the input link / written to the output link
@LOGNAME: User logon name
@NULL: NULL value
@TRUE, @FALSE
@WHO: Name of the current project


True or False? TRIM is a system variable. True: Incorrect. TRIM is a DataStage function that removes surrounding spaces in a character string. False: Correct! TRIM is a DataStage function that removes surrounding spaces in a character string.


True or False? @INROWNUM is a DataStage function. True: Incorrect. System variables all begin with the @-sign. @INROWNUM is a system variable that contains the number of the last row read from the input link. False: Correct! System variables all begin with the @-sign. @INROWNUM is a system variable that contains the number of the last row read from the input link.


DataStage is supplied with a number of functions you can use to obtain information about your jobs and projects. You can insert these functions into derivations. DS functions and macros are discussed in a later module.


DS (DataStage) routines are defined in DataStage Manager. There are several types of DS routines. The type you can insert into your derivations and constraints is the Transform Function type. A DS Transform Function routine consists of a predefined block of BASIC statements that takes one or more arguments and returns a single value. You can define your own routines, but there are also a number of pre-built routines that are supplied with DataStage. The pre-built routines include a number of routines for manipulating dates, such as ConvertMonth, QuarterTag, and Timestamp.


Data elements are extended data types. For example, a phone number is a kind of string. You could define a data element called PHONE.NUMBER to precisely define this type. Data elements are defined in DataStage Manager. A number of built-in types are supplied with DataStage. For example MONTH.TAG represents a string of the form YYYY-MM.



DS Transforms are similar to DS Transform Function routines. They take one or more arguments and return a single value. There are two primary differences: The argument(s) and return value have specific data elements associated with them. In this sense, they transform data from one data element type to another data element type. Unlike DS routines, they do not consist of blocks of BASIC statements. Rather, they consist of a single (though possibly very complex) BASIC expression.

You can define your own DS Transforms, but there are also a number of pre-built transforms that are supplied with DataStage. The pre-built transforms include a number of routines for manipulating strings and dates.


Date manipulation in DataStage can be done in several ways: Using the Iconv and Oconv functions with the D conversion code. Using the built-in date transforms. Using the built-in date routines. Using routines in the DataStage Software Development Kit (SDK).

Using routines in the DataStage Software Development Kit (SDK) is covered in another DataStage course. Your instructor can provide further details. The SDK routines are installed in the Manager Routines\sdk folder.


For detailed help on Iconv and Oconv, see their entries in the Alphabetical List of BASIC Functions and Statements in Help. Use Iconv to convert a string date in a variety of formats to the internal DataStage integer format. Use Oconv to convert an internal date to a string date in a variety of formats. Use these two functions together to convert a string date from one format to another. The internal format for a date is based on a reference date of December 31, 1967, which is day 0. Dates before are negative integers; dates after are positive integers. Use the D conversion code to specify the format of the date to be converted to an internal date by Iconv, or the format of the date to be output by Oconv.
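A quick Python check of this internal representation, using day 0 = December 31, 1967:

```python
# The internal date integer is simply the number of days since the
# reference date; earlier dates come out negative.
from datetime import date

REFERENCE = date(1967, 12, 31)  # internal day 0

def to_internal(d):
    """Days since the reference date (negative for earlier dates)."""
    return (d - REFERENCE).days

print(to_internal(date(1967, 12, 31)))  # 0
print(to_internal(date(1993, 2, 14)))   # 9177
print(to_internal(date(1967, 12, 30)))  # -1
```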


For detailed help (more than you probably want), see D Code under Iconv or Oconv in Help. Taking the conversion code D4-MDY[2,2,4] apart:
D — Date conversion code
4 — Number of digits in the year
- — Separator
MDY — Ordering is month, day, year
[2,2,4] — Number of digits for M, D, Y, respectively

Note: The number in brackets for Y (namely 4) overrides the number following D. Iconv ignores some of the characters. Any separator will do. Number of characters is ignored if there are separators.


Iconv("12-31-67", "D2-MDY[2,2,2]") → 0
Iconv("12311967", "D MDY[2,2,4]") → 0
Iconv("31-12-1967", "D-DMY[2,2,4]") → 0
Oconv(0, "D2-MDY[2,2,4]") → 12-31-1967
Oconv(0, "D2/DMY[2,2,2]") → 31/12/67
Oconv(10, "D/YDM[4,2,A10]") → 1968/10/JANUARY

The last example illustrates the use of an additional formatting option. The A10 option says to express the month name alphabetically, length 10 characters.

Oconv(Iconv("12-31-67", "D2-MDY[2,2,2]"), "D/YDM[4,2,A10]") → 1967/31/DECEMBER

This example shows how to convert from one string representation to another.
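The round-trip conversion can be modeled with Python's datetime (an analogy, not the real Iconv/Oconv; the two-digit-year pivot below is a simplification chosen so that 67 means 1967):

```python
# Parse the input format into the day-0 = 1967-12-31 internal integer,
# then re-emit it in the output format.
from datetime import date, timedelta

REFERENCE = date(1967, 12, 31)
MONTHS = ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
          "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"]

def iconv_mdy(s):
    """Like Iconv(s, "D2-MDY[2,2,2]"): MM-DD-YY string -> internal integer."""
    mm, dd, yy = (int(p) for p in s.split("-"))
    yy += 1900 if yy >= 30 else 2000  # simplified century pivot (assumption)
    return (date(yy, mm, dd) - REFERENCE).days

def oconv_ydm(day):
    """Like Oconv(day, "D/YDM[4,2,A10]"): internal integer -> Y/D/MONTH."""
    d = REFERENCE + timedelta(days=day)
    return f"{d.year}/{d.day}/{MONTHS[d.month - 1]}"

internal = iconv_mdy("12-31-67")
print(internal)             # 0
print(oconv_ydm(internal))  # 1967/31/DECEMBER
```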



DataStage provides a number of built-in transforms you can use for date conversions. The following data elements are used with the built-in transforms:

Data element   String format   Example
DATE.TAG       YYYY-MM-DD      1999-02-24
WEEK.TAG       YYYYWnn         1999W06
MONTH.TAG      YYYY-MM         1999-02
QUARTER.TAG    YYYYQn          1999Q4
YEAR.TAG       YYYY            1999
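Most of these tag formats can be produced from a date in Python like this (an illustration, not DataStage code; WEEK.TAG is omitted because its exact week-numbering rule isn't spelled out here):

```python
# Building tag strings from a date.
from datetime import date

def date_tag(d):    return d.strftime("%Y-%m-%d")               # DATE.TAG
def month_tag(d):   return d.strftime("%Y-%m")                  # MONTH.TAG
def quarter_tag(d): return f"{d.year}Q{(d.month - 1) // 3 + 1}" # QUARTER.TAG
def year_tag(d):    return str(d.year)                          # YEAR.TAG

d = date(1999, 2, 24)
print(date_tag(d), month_tag(d), quarter_tag(d), year_tag(d))
# 1999-02-24 1999-02 1999Q1 1999
```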


True or False? You can use Oconv to convert a string date from one format to another. True: Incorrect. Oconv by itself can't do this. You would first use Iconv to convert the input string into a day integer. Then you can use Oconv to convert the day integer into the output string. False: Correct! Oconv by itself can't do this. You would first use Iconv to convert the input string into a day integer. Then you can use Oconv to convert the day integer into the output string.


The transforms can be grouped into the following categories:
String to day number: formatted string → internal date integer
Day number to date string: internal date integer → formatted string
Date string to date string: DATE.TAG string → formatted string


The following transforms convert strings of the specified format (MONTH.TAG, QUARTER.TAG, ...) to an internal date representing the first or last day of the period.

Function                      Tag          Description
MONTH.FIRST, MONTH.LAST       MONTH.TAG    Returns a numeric internal date corresponding to the first/last day of a month
QUARTER.FIRST, QUARTER.LAST   QUARTER.TAG  Returns a numeric internal date corresponding to the first/last day of a quarter
WEEK.FIRST, WEEK.LAST         WEEK.TAG     Returns a numeric internal date corresponding to the first day (Monday) / last day (Sunday) of a week
YEAR.FIRST, YEAR.LAST         YEAR.TAG     Returns a numeric internal date corresponding to the first/last day of a year

Examples:

MONTH.FIRST("1993-02") → 9164
MONTH.LAST("1993-02") → 9191
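Internal dates count days from 31 December 1967 (day 0), which is how these values can be checked: 9164 is 1993-02-01 and 9191 is 1993-02-28. Because the transforms return ordinary internal dates, their results can be passed straight to Oconv, as in this sketch of a derivation:

```
* Format the period boundaries back into readable dates.
Oconv(MONTH.FIRST("1993-02"), "D-YMD[4,2,2]")   ;* first day of the month
Oconv(MONTH.LAST("1993-02"), "D-YMD[4,2,2]")    ;* last day of the month
```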

The following functions convert internal dates to strings in various formats (DATE.TAG, MONTH.TAG, and so on).

Function      Argument type   Description
DATE.TAG      Internal date   Converts internal date to string in DATE.TAG format
MONTH.TAG     Internal date   Converts internal date to string in MONTH.TAG format
QUARTER.TAG   Internal date   Converts internal date to string in QUARTER.TAG format
WEEK.TAG      Internal date   Converts internal date to string in WEEK.TAG format
DAY.TAG       Internal date   Converts internal date to string in DAY.TAG format

Examples:
MONTH.TAG(9177) → 1993-02
DATE.TAG(9177) → 1993-02-14

The following functions convert strings in DATE.TAG format to strings in various other formats (DAY.TAG, MONTH.TAG, and so on).

Function         Tag        Description
TAG.TO.MONTH     DATE.TAG   Convert DATE.TAG to MONTH.TAG
TAG.TO.QUARTER   DATE.TAG   Convert DATE.TAG to QUARTER.TAG
TAG.TO.WEEK      DATE.TAG   Convert DATE.TAG to WEEK.TAG
TAG.TO.DAY       DATE.TAG   Convert DATE.TAG to DAY.TAG

Examples:
TAG.TO.MONTH("1993-02-14") → 1993-02
TAG.TO.QUARTER("1993-02-14") → 1993Q1

Module 9

Troubleshooting

Events are logged to the job log file when a job is validated, run, or reset. You can use the log file to troubleshoot jobs that fail during validation or a run. Various entries are written to the log, including when:
- The job starts
- The job finishes
- An active stage starts
- An active stage finishes
- Rows are rejected (yellow icons)
- Errors occur (red icons)
- DataStage informational reports are logged
- User-invoked messages are displayed

The event window shows the events that are logged for a job during its run. The job log contains the following information:

Column name   Description
Occurred      Time the event occurred
On date       Date the event occurred
Type          The event type:
              Info - Informational. No action required.
              Warning - An error occurred. Investigate the cause of the warning, as this may indicate a serious error.
              Fatal - A fatal error occurred.
              Control - The job starts and finishes.
              Reject - Rejected rows are output.
              Reset - A job or the log is reset.
Event         A message describing the event. The system displays the first line of the message. If a message has an ellipsis (...) at the end, it contains more than one line. You can view the full message in the Event Detail window.

Clearing the log

To clear the log, click Job>Clear Log.

Double-click on an event to open the Event Detail window, which gives you more information. When an active stage finishes, DataStage logs an informational message that describes how many rows were read into the stage and how many were written out. This provides you with valuable information that can indicate possible errors.

The Monitor can be used to display information about a job while it is running. To start the Monitor, click Tools>New Monitor. Once in Monitor, click the right mouse button and then select Show links to display information about each of the input and output links.

When you are testing a job, you can save time by limiting the number of rows and warnings.

Server-side tracing is enabled in Administrator. It is designed to help customer support analysts troubleshoot serious problems. When enabled, it logs a record to a trace file whenever a DataStage client interacts with the server. Caution: Because of the overhead caused by server-side tracing, it should only be used when working with customer support.

DataStage provides a debugger for testing and debugging your job designs. The debugger runs within Designer. With the DataStage debugger you can: Set breakpoints on job links, including conditional breakpoints. Step through your job link-by-link or row-by-row. Watch the values going into link columns.

To begin debugging a job, click View>Debug Bar to display the debug toolbar. The toolbar provides access to all of the debugging functions:

Button                 Description
Go                     Start/continue debugging.
Next Link              The job continues until the next action occurs on the link.
Next Row               The job continues until the next row is processed or until another link with a breakpoint is encountered.
Stop Job               Stops the job at the point it is at. Click Go to continue.
Job Parameters         Set limits on rows and warnings.
Edit Breakpoints       Displays the Edit Breakpoints window, in which you can edit existing breakpoints.
Toggle Breakpoint      Set or clear a breakpoint on a selected link.
Clear All Breakpoints  Removes breakpoints from all links.
View Job Log           Open Director and view the job log.
Debug Window           Show/hide the Debug Window, which displays link column values.

To set a breakpoint on a link, select the link and then click the Toggle Breakpoint button. A black circle appears on the link.

Click the Edit Breakpoints button to open the Edit Breakpoints window. Existing breakpoints are listed in the lower pane. To set a condition for a breakpoint, select the breakpoint and then specify the condition in the upper pane. You can either specify a number of rows to process before breaking, or specify an expression that triggers the break when it evaluates to true.

Click the Debug Window button to open the Debug Window. The top pane lists all the columns defined for all links. The Local Data column lists the data currently in the column. The Current Break box at the top of the window lists the link where execution stopped. To add a column to the lower pane (where it is isolated), select the column and then click Add Watch. If a breakpoint is set, execution stops at that link when a row is written to the link.

You can step through the job row by row or link by link. Next Row extracts a row of data and stops at the next link with a breakpoint that the row is written to. For example, if a breakpoint is set on the MexicoCustomersOut link, execution stops at that link when a Mexican customer is read; if no breakpoint is set on it, execution does not stop there. Execution will stop at the CustomersIn link (even if there is no breakpoint set on it) because all rows are read through that link. Next Link stops at the next link that data is written to.

Module 10

Defining Lookups

A hashed file is a file that distributes records into one or more evenly sized groups based on a primary key. The primary key value is processed by a "hashing algorithm" to determine the location of the record. The number of groups in the file is referred to as its modulus. In this example, there are 4 groups (modulus 4). Hashed files are used for reference lookups in DataStage because of their fast performance: the hashing algorithm determines the group the record is in, and because each group contains only a small number of records, the record can be quickly located within the group. If write caching is enabled, DataStage does not write hashed file records directly to disk. Instead, it caches the records in memory and writes the cached records to disk when the cache is full. This improves performance. You can specify the size of the cache on the Tunables tab in Administrator.
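As a toy illustration of the placement idea only (the real UniVerse hashing algorithm is more sophisticated than this character-code sum):

```
* Sum the key's character codes as a stand-in hash, then take the
* remainder modulo the file's modulus to pick the record's group.
Key = "7896"                 ;* e.g. a stor_id primary key value
Total = 0
For I = 1 To Len(Key)
   Total = Total + Seq(Key[I,1])
Next I
Group = Mod(Total, 4) + 1    ;* modulus 4 gives groups 1 to 4
```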

To create and load a hashed file, create a job that has a Hashed File stage as a target. For example, here's a simple job that creates and loads the StoresHashed hashed file, which contains a list of stores and their addresses keyed by stor_id. Loading a hashed file with data is similar to loading a sequential file with data.

You can extract data from a hashed file:
- As a stream.
- As a lookup.

The process of extracting data from a hashed file as a stream is similar to extracting data from a sequential file. A hashed file stage used as a source has an additional tab called the Selection tab. Use it to specify a condition for filtering the data from the hashed file.

Your job can delete a hashed file and then recreate it. To do this, select the Delete file before create box in the Create file options window on the hashed file target stage. To delete a hashed file without recreating it in a job, you can execute the DELETE.FILE command. To execute this command, log onto Administrator, select the project (account) containing the hashed file, and then click Command to open the Command Interface window. In the Command box, type DELETE.FILE followed by the name of the hashed file, and then click Execute. The DELETE.FILE command can also be executed in a Before/After Routine. (Before/After Routines are discussed in a later module.)

Extracting data from a hashed file as a lookup involves several steps. In this example, data is read in from the Sales.txt file using the Sales stage. For each record read, the store ID is looked up in the STORES hashed file. The dashed line indicates reference input as opposed to stream input. Keep the following in mind:
- Reference lookups from hashed files must go into a Transformer stage.
- Multiple lookups can be done at the same Transformer. To specify an ordering, open the transformer and then click the Input Link Execution Order icon at the top. The procedure is the same as defining target link execution order.
- Lookups cannot be done from sequential files.
- Lookups can also be done from relational (ODBC) tables.

The lookup expression or join is defined in the transformer stage.

Click the right mouse button over the hashed file key column and select Edit Key Expression. This defines the value to look up in the hashed file. In this example, the stor_id input column defines the expression.

Any valid expression can be specified (not just a column mapping). You can drag input columns to the key column, just as when defining derivations for target columns.

Output from the lookup file is mapped to fields in the target link. If the lookup fails (the result of the expression is not found in the hashed file), NULLs are returned in all the lookup link columns. You can test for NULLs in a derivation to determine whether the lookup succeeded.
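For example, a target-column derivation can substitute a default value when the lookup fails (the link and column names here are illustrative):

```
* The lookup link's columns are NULL when the key was not found
* in the hashed file; substitute a default in that case.
If IsNull(StoresLookup.stor_name) Then "UNKNOWN STORE" Else StoresLookup.stor_name
```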

Module 11

Aggregating Data

The data sources you're extracting data from can contain many thousands of rows of data. You can summarize groups of data in each column using the functions listed below.

Function            Description
Minimum             Returns the lowest value in the column.
Maximum             Returns the highest value in the column.
Count               Counts the number of values in the column.
Sum                 Sums the values in the column.
Average             Returns the average of the values in the column.
Standard deviation  Returns the standard deviation for the values in the column.

The first three functions (minimum, maximum, count) apply to non-numeric as well as numeric data. The last three only make sense when applied to numeric data.

Here's an example of some data you might want to summarize. The Sales file contains a list of sales. You might want the following aggregations, for example:
- Total sales amount by store (by product, by year, by month, etc.)
- Average sales amount by store (by product, etc.)
- Total (or average) quantity sold by product

In this example, we will determine the average sales amount for each product sold. The Sales stage is used to read the data. The transformer performs some initial calculations on the data; for instance, the sales amount for each order (qty * price) is calculated. Calculations can't be defined in the Aggregator stage. The Aggregator stage can have at most one input link, and it can't be a reference link. In this example, the incoming columns include the product ID (product_id) and the sales amount (sales_amount) for each order. The output link defines the aggregations.
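The initial calculation is simply a derivation on the Transformer output column that feeds the Aggregator (the input link name is an assumption):

```
* Derivation for the sales_amount column passed to the Aggregator
SalesIn.qty * SalesIn.price
```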

This lists the main tasks in defining aggregations.

True or False? Suppose you want to aggregate over derived values. For example, you want to SUM(qty * unit_price). You can perform this derivation within the Aggregator stage.

Answer: False. You cannot perform derivations within the Aggregator stage. If you want to aggregate derived values, perform the derivation in an output column in a prior Transformer stage, and then aggregate over that incoming column in the Aggregator stage.

The Inputs Columns tab specifies the incoming columns. Aggregations are performed in memory. If the data is presorted before it is aggregated, this can greatly improve the way in which the Aggregator stage handles the data. Use the Sort and Sort Order columns to specify whether and how the data is sorted. The Sort column is used to specify which columns are sorted and their order. For example, if the incoming data is sorted by stor_id and product_id, in that order, then stor_id would be column 1 and product_id would be column 2. In the Sort Order column specify the sort order, that is, whether the data is sorted in ascending or descending order or some other more complex ordering. The aggregator stage does not itself sort the data. Sorting must be performed in an earlier stage, for example, using an ODBC stage or sort plug-in.

Define the aggregation for each output column. Select the column(s) to group by. You will not be able to specify an aggregate function for the group by column(s).

Double-click on the Derivation cell to open the Derivation window. This window is special in the aggregator stage. It allows you to select a column and an aggregate function for the column.

Module 12

Job Control

Job parameters allow you to design flexible, reusable jobs. If you want to process data based on a particular file, file location, time period, or product, you can include these settings as part of your job design. However, if you do this, when you want to use the job again for a different file, file location, time period, or product, you must edit the design and recompile the job. Instead of entering inherently variable factors as part of the job design, you can set up parameters which represent processing variables. When you run or schedule a job with parameters, DataStage prompts for the required information before continuing. Job parameters can be used in many places in DataStage Designer, including:
- Passive stage file and table names.
- Passive stage directory paths.
- Account names for hashed files.
- Transformer stage constraints.
- Transformer stage derivations.

Recall this job. Customers from different countries are written out to separate files. The problem here is that the countries are hard-coded into the job design. What if we want a file containing, for example, Canadian customers? We can add a new output stage from the transformer and define a new constraint. Then recompile and run the job. A more flexible method is to use a parameter in the constraint in place of a specific country string such as USA. Then during runtime, the user can specify the particular country.
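As a sketch, the hard-coded constraint and its parameterized replacement might look like this in the Transformer (the link, column, and parameter names are illustrative):

```
* Hard-coded: only USA customers pass down this output link
CustomersIn.country = "USA"

* Parameterized: the country is supplied at run time via the
* Country job parameter (referenced by name inside expressions)
CustomersIn.country = Country
```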

To define job parameters for a job, open the job in Designer and then click Edit>Job Properties. Click the Parameters tab on the Job Properties window.

True or False? When job parameters are used in passive stages such as Sequential File stages, they must be surrounded with pound (#) signs.

Answer: True. You must surround the name of the job parameter with pound signs; otherwise, DataStage won't recognize it as a job parameter.
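For example, a Sequential File stage's file name could be assembled from two job parameters (the parameter names are illustrative):

```
#SourceDir#\customers_#Country#.txt
```

At run time, DataStage substitutes the current parameter values between each pair of # signs.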

Before and after routines are DS routines that run before or after a job, or before or after a transformer. DS Before/After routines are defined in Manager. Three built-in Before/After routines are supplied with DataStage: ExecDOS, ExecSH, and ExecTCL. These routines can be used to execute Windows DOS, UNIX, and UniVerse commands, respectively. The command, together with any output, is added to the job log as an informational message. You can also define custom Before/After routines. They are similar to other routines except that they have only two arguments: an input argument and an error code argument.

Click Edit>Job Properties on the Designer window or the Stage Properties button in a transformer or other active stage. In either case, a window is displayed in which you can select a Before/After routine and specify an input parameter. Input parameters can contain job parameters. In this example, the target file is copied to a temporary directory after the job runs using the standard Windows DOS copy command.

DataStage is supplied with a number of functions you can use to control jobs and obtain information about jobs. For detailed information about these functions, see Job Control in Help. These functions can be executed in the Job control tab of the Job Properties window, within DS routines, and within column derivations.

Here are some of the job control functions.

BASIC function     Description
DSAttachJob        Specify the job you want to control
DSSetParam         Set parameters for the job you want to control
DSSetJobLimit      Set limits for the job you want to control
DSRunJob           Request that a job is run
DSWaitForJob       Wait for a called job to finish
DSGetProjectInfo   Get information about the current project
DSGetJobInfo       Get information about the controlled job or current job
DSGetStageInfo     Get information about a stage in the controlled job or current job

DSGetLinkInfo      Get information about a link in a controlled job or current job
DSGetParamInfo     Get information about a controlled job's parameters
DSGetLogEntry      Get the log event from the job log
DSGetLogSummary    Get a number of log events on the specified subject from the job log
DSGetNewestLogId   Get the newest log event, of a specified type, from the job log
DSLogEvent         Log an event to the job log of a different job
DSLogInfo          Log an informational message to the job log
DSStopJob          Stop a controlled job
DSDetachJob        Return a job handle previously obtained from DSAttachJob
DSSetUserStatus    Set a status message to return as a termination message when the job finishes

The job control routines and other BASIC statements written in the Job control tab are executed after the job in which they are defined runs. This enables you to run a job that controls other jobs. In fact this can be all the job does. For example, suppose you want a job that first loads a hashed file and then uses that hashed file in a lookup. You can define this as a single job. Alternatively, you can define this as two separate jobs (as we did earlier) and then define a master controlling job that first runs the load and then runs the lookup.

Create an empty job and then click Edit>Job Properties. Click the Job control tab. Select the jobs you want to run, one at a time, in the Add Job box and then click Add. The job control functions and other BASIC statements are added to the edit box. Add and modify the statements as necessary. In this example:
- DSRunJob is used to run the load job.
- DSWaitForJob waits for the job to finish. You don't want the lookup to be performed until the hashed file is fully loaded.
- DSGetJobInfo gets information about the status of the job. If an error occurred, the job is aborted before the lookup job is run.
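A minimal sketch of that pattern (the job names are assumptions, and the code Designer generates includes more error handling):

```
* Run the load job and wait for it before starting the lookup job.
hLoad = DSAttachJob("LoadStoresHashed", DSJ.ERRFATAL)
ErrCode = DSRunJob(hLoad, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hLoad)
Status = DSGetJobInfo(hLoad, DSJ.JOBSTATUS)
If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
   Call DSLogFatal("Load job failed - lookup not started", "JobControl")
End
* The hashed file is now fully loaded; run the lookup job.
hLook = DSAttachJob("LookupSales", DSJ.ERRFATAL)
ErrCode = DSRunJob(hLook, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hLook)
```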

The Job Sequencer enables you to graphically create controlling jobs without using the job control functions. Job control code is automatically generated from your graphical design. Job Sequences resemble standard DataStage jobs: they consist of stages and links. However, it is a different set of stages and links. Among the stages are Job Activity stages, which are used to run DataStage jobs. Links are used to specify the sequence of execution, and triggers can be defined on the links to specify the condition under which control passes through them. There are other Activity stages, including:
- Routine Activity stages for executing a DataStage Routine.
- Execute Command stages for executing Windows, UNIX, or DataStage commands.
- Notification stages for sending email notifications.

Here is an example of a Job Sequence. The stages are Job Activity stages. The first stage validates a job that loads a lookup hashed file. The second stage runs the job if the validation succeeded. The third stage runs a job that does a lookup from the hashed file. The links execute these three stages in sequence. Triggers are defined on each of the links, so that control is passed to the next stage only if the previous stage executed without errors. To create a new Job Sequence, click the New button and then select Job Sequence.

This shows a Job Activity stage. Select the job to run in the Job name box. Select how you want to run it in the Execution action box. The Parameters box lists all the parameters defined for the job. Select a parameter and then click Insert Parameter Value to specify a value to be passed to the parameter.

Triggers specify the condition under which control passes through a link. Select the type of trigger in the Expression Type box. The types include:
- Unconditional: Pass control unconditionally.
- Otherwise: Pass control if none of the triggers on the links are executed.
- OK: Pass control if the job ran without errors or warnings.
- Failed: Pass control if the job failed.
- Warning: Pass control if the job ran with warnings.
- UserStatus: Pass control if the User Status variable contains the specified value. The User Status variable can be set in a job or Routine using the DSSetUserStatus job control function.
- Custom: Specify your own condition in DataStage BASIC.

True or False? Triggers can be defined on the Job Activity Triggers tab for each input link.

Answer: False. Triggers are defined on output links. They determine whether execution will continue down the link.

DataStage Containers encapsulate a set of job design components (stages and links) into a single stage icon. There are two kinds of Containers: Local and Shared. Local Containers exist only within the single job in which they are used; use them to simplify complex job designs. Shared Containers exist outside of any specific job. They are listed in the Shared Containers branch in Manager and can be added to any job. Shared Containers are frequently used to share a commonly used set of job components. A Container contains two unique stages: the Container Input stage is used to pass data into the Container, and the Container Output stage is used to pass data out of the Container.

This shows the components that make up an example Container. The same job components are used, with the exception of the Container Input stage, shown on the left, and the Container Output stage, shown on the right.

This shows a job with a Job Container (the stage in the middle). Data is passed into the Container from the link on the left. Data is retrieved from the Container in the link on the right. The Container processes the data using the set of stages and links it is designed with.

Module 13

Working with Plug-Ins

A plug-in is a custom-built stage (active or passive) that you can install and use in DataStage in addition to the built-in stages. Plug-ins provide additional functionality without the need for new versions of DataStage to be released. Plug-ins can be written in either C or C++. Sample code is loaded in the /sample directory when DataStage is installed. A number of plug-ins are provided by Ascential. These include:
- Plug-in stages pre-installed with DataStage. These are found in the Stage Types/PlugIn branch in Manager and include the Oracle bulk loader.
- Plug-in stages on the installation CD. These include:
  - Additional bulk loaders.
  - An ftp plug-in for accessing data using the ftp protocol.
  - A sort plug-in for sorting data.
  - A merge plug-in for integrating data.
  - Plug-ins for accessing RDBMSs, such as Oracle, through native drivers.
- Chargeable plug-in stages available from Ascential. These include:

- External Data Access (EDA) for access to mainframe systems.
- Change Data Capture (CDC) for obtaining only changed records.

Plug-in stages written to the DataStage C API may also be available from third-party vendors. Once a plug-in is installed, you can use it in your jobs just as you can the built-in stages.

You can view installed plug-ins for a project in Manager.

Documentation for the plug-ins that come with DataStage is provided in PDF format on the DataStage installation CD. In addition, you can open the plug-in in Manager. The Stage Type window provides a variety of information in its four tabs:
- A description of the plug-in.
- Plug-in creator information.
- Plug-in dependencies.
- Plug-in properties.

Most of what you need to do when you use a plug-in in a job is to set its properties correctly. Plug-ins provide online documentation for each property when you open the Properties tab in Designer.

The sort plug-in can have one input link and one output link. The input link specifies the records of data to be sorted. The output link outputs the data in sorted order.

This lists the main tasks involved in defining a sort using the DataStage Sort plug-in.

The sort stage has three tabs:
- Inputs tab: Specify the format of the data to be sorted.
- Stage tab: On the Properties sub-tab, set the properties that define the sort.
- Outputs tab: Specify the format of the data after it's sorted.

True or False? Job parameters can be used in the Sort plug-in stage.

Answer: True. As when using job parameters in sequential stages, surround the job parameter names with # signs.

Module 14

Scheduling and Reporting

Jobs are scheduled in Director. A job can be scheduled to run in a number of different ways:
- Once today at a specified time.
- Once tomorrow at a specified time.
- On a specific day and at a particular time.
- Daily at a particular time.
- On the next occurrence of a particular date and time.

Each job can be scheduled to run on any number of occasions and can be run with different job parameter values on the different occasions. Jobs run on the DataStage server under the user name specified on the Schedule tab in Administrator. If no user name is specified, a job runs under the same name as the Windows NT Schedule service. If DataStage is running on Windows NT, DataStage uses the Windows NT Schedule service to schedule jobs. If you intend to use the DataStage scheduler, be sure to verify that the Windows NT Schedule service is running. To start it, open the Windows NT Control Panel and then open the Services icon. You can then manually start the service or set the service to start automatically each time the computer is started.

True or False? When a scheduled job runs, it runs under the user ID of the person who scheduled it.

Answer: False. When a user manually runs a job in Director, the job runs under the user ID of the person who started it. When a scheduled job runs, it runs under the user ID specified in Administrator.

In addition to the simple reports you can generate in Designer and Director using File>Print, DataStage provides a flexible and powerful reporting tool. The DataStage Reporting Assistant is invoked from DataStage Manager. You can generate reports at various levels within a project, including:
- Entire project
- Selected jobs
- Selected table definitions
- Selected routines and transforms
- Selected plug-in stages

Information generated for reporting purposes is stored in an ODBC database on the DataStage client. You can use this information for printing a report, writing a report to a file, or for browsing. By default, DataStage stores the reporting information in a Microsoft Access data source named DSReporting that is defined when the Reporting Assistant is installed.

This shows an example of a report created for a job.

This lists the main tasks involved in generating a report.

True or False? The DataStage Reporting Assistant stores the data it uses in its reports in an ODBC database.

Answer: True. This data source is set up on your client machine when the DataStage clients are installed.
