
DataStage Fundamentals (version 8 parallel jobs)

Staffordshire, December 2010

Module 4: Metadata and Metadata Management


Metadata is a term that means "information about data". DataStage relies
heavily upon metadata to describe the data that are to be processed, the
format of the data, the processing that is required, and so on. Additional
metadata can be used to answer end users' questions such as "where did
this value come from?"
Information Server has a unified metadata layer through which metadata
can be shared among many products, including DataStage. In addition
each DataStage project has its own, local repository for metadata. All of
this metadata is available to DataStage users, and therefore must be
managed rigorously.

Objectives
Having completed this module you will be able to:

- list three classes of metadata
- import DataStage components from a given DataStage export file
- inspect metadata in the Repository using Designer
- use Quick Find and Advanced Find in the Repository
- define "nullable"
- export DataStage components from the Repository

Metadata
The word metadata combines the Greek prefix "meta", meaning beyond or after, with "data". "Data" is the plural past participle of the Latin verb "dare", meaning "to give"; the singular is "datum", something (that has been) given. So "data" literally means things (that have been) given.
In information technology (IT), the word "metadata" is usually taken to
mean "information that describes data", and everyone claims to understand
what "data" means in IT.
Metadata allow questions about the data to be answered. For example, an
end user may be looking at a pie chart in which one sector contains 42%
of the overall total. The user may be interested to know how up to date the
data are, whether they are complete, what relationship they bear to the
operational systems' data, and what processing they underwent between
there and the pie chart.
There are several classes of metadata. Authorities differ on how many.
For a DataStage developer the three most important are listed here.

- Business metadata incorporates all knowledge about the data that the business has (or ought to have). This might include business rules (for example "a customer number has the following format", "metric measures of distance are converted to US measures in the DW", "order date must be no later than current date during data entry", and so on), and ownership and/or responsibility (for example "the product price table is owned by, and kept up to date by, the sales management group"). Quite often, business metadata are produced by people with titles such as business analyst or metadata steward.

- Technical metadata are those that describe the technical aspects of data, such as the format (particularly of text files), the rows and columns, SQL data types, and so on. Technical metadata also describe the processing that occurs to the data, not only during ETL but also during original data entry and any reformatting that BI tools might perform. These often become specifications with which programmers/developers work.

- Process metadata are no less important. These record what processing actually took place, and whether all records were processed or some were rejected. On that basis, questions such as how up to date the data are, how complete they are, and what their lineage is can be answered.

DataStage metadata are stored in two locations. There is a local repository for each DataStage project, and there is the Information Server common metadata repository through which all suite products can share metadata.


Business Metadata in DataStage


Business metadata are typically maintained outside of DataStage, perhaps
by a business analyst using the Business Glossary product (another
product in the Information Server suite). DataStage does not directly use
business metadata, but having it available can assist developers in, for
example, assigning correct validation logic. The usual place where
business metadata is to be found in the DataStage repository is in Data
Elements. Business metadata can also be found in annotations in job
designs and in description fields on jobs, stages and links.

Figure 4-1 Example of Data Element

Figure 4-1 shows an example of a Data Element. The SQL tab allows the most likely data type for this element to be recorded, though it is not enforced. The other two tabs define the data element's relationships with DataStage Transforms, which are available only in server jobs, not in parallel jobs (this class is about parallel jobs, so Transforms will not be discussed).
Data elements can be added to any table definition, to highlight that a particular field has business metadata to be carried with it. Usage analyses can be performed on data elements, for example to answer developer questions such as "which jobs process revenue?" (assuming there is a data element called Revenue or something similar).


Technical Metadata in DataStage


There are four main areas into which technical metadata for parallel jobs
may be grouped. These are configurations, table definitions, source code
(primarily of routines) and ETL job designs themselves.
Configurations cover a wide range of things, most of which we discuss in
other modules or in the Administrator class.
The obvious one is the parallel execution configuration files. Each of
these provides a list of hosts and resources (nodes) on which parallel
execution can take place. Different configuration files can be used for
different tasks; for example a one-way configuration is best suited for
processing a single row, a one-way or two-way configuration is suited to a
small volume of data, whereas a 36-way configuration could process a
very large volume of data indeed.
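As an illustrative sketch only (the host name and resource paths here are invented, not DataStage defaults), a simple two-node parallel configuration file takes this general shape:

```
{
  node "node1" {
    fastname "etl-server"
    pools ""
    resource disk "/ibm/ds/data/node1" {pools ""}
    resource scratchdisk "/ibm/ds/scratch/node1" {pools ""}
  }
  node "node2" {
    fastname "etl-server"
    pools ""
    resource disk "/ibm/ds/data/node2" {pools ""}
    resource scratchdisk "/ibm/ds/scratch/node2" {pools ""}
  }
}
```

Adding further node entries (possibly with different fastname values, to span hosts) is how configurations scale out; which configuration file a given job run uses is governed by the APT_CONFIG_FILE environment variable.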
Other things have to be configured too. For example connections to
ODBC data sources have to be configured centrally (as Data Source
Names) and then each project has a configuration file uvodbc.config that
specifies which of those DSNs is accessible from the project.
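A minimal uvodbc.config entry, sketched here with an invented DSN name (Pricing), follows the pattern of a DSN name in angle brackets followed by a DBMSTYPE line:

```
[ODBC DATA SOURCES]
<localuv>
DBMSTYPE = UNIVERSE
network = TCP/IP
service = uvserver
host = 127.0.0.1

<Pricing>
DBMSTYPE = ODBC
```

The localuv entry ships with DataStage; any additional DSN listed here, such as Pricing in this sketch, must already have been configured centrally as a Data Source Name.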
Another example is the deadlock daemon configuration file, which the
DataStage administrator uses to specify whether the deadlock daemon
runs automatically, at what interval, etc. The name of this particular file is
dsdlockd.config and it is in the server engine directory.
Depending on the stage types being used, database client software may
need to be installed and configured, and the appropriate environment
variables set up. These can be set up in a script called dsenv (in the server
engine directory, and which all DataStage processes execute), and/or as
project-specific environment variables (configured via the Administrator
client and stored in a file called DSParams in the project directory on the
server).
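For example (the paths are invented, and the exact variables depend on which database client is installed), entries appended to dsenv for an Oracle client might look like this:

```shell
# dsenv is a Bourne-shell script sourced by all DataStage server processes,
# so client environment variables set here are inherited by every job.
ORACLE_HOME=/opt/oracle/client_11; export ORACLE_HOME
PATH=$PATH:$ORACLE_HOME/bin; export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib; export LD_LIBRARY_PATH
```

The same variables could instead (or additionally) be defined as project-specific environment variables through the Administrator client.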
Table definition is a generic term that describes the record layout that is to be processed. Even though there may be no table involved, the term has historical origins and has become a bit of a tradition. Technically, when referring to parallel jobs, we should refer to the record schema. Complex flat files produced by COBOL applications (typically on mainframes) use the term file definition. The metadata model used in DataStage and in Information Server allows all of these to be stored in the same way.
DataStage's table definition editor has a Layout tab where you can view a table definition as a standard definition (using SQL data types), as a COBOL file definition (using COBOL data types), or as an Orchestrate record schema (using C++ data types). Because these are selected using option buttons, it is clear that, no matter which view you are using, you are looking at the same underlying piece of metadata.
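For illustration (the column names here are invented), the Orchestrate record schema view of a table definition takes this general form:

```
record (
  CustomerID: int32;
  CustomerName: string[max=50];
  OrderDate: date;
  OrderValue: nullable decimal[10,2];
)
```

The same columns would appear in the other two views as SQL data types (for example Integer, VarChar) or as COBOL picture clauses.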


Routines are of three kinds. Parallel routines, which can be called from
the Transformer stage in parallel jobs, are created outside of DataStage in
the C++ language and compiled and linked. An entry is placed into the
repository to record the name, arguments and location of the library or
object containing the routine.
Server routines are created within DataStage, using DataStage BASIC as
the programming language. They are stored directly in the repository and
are of two kinds.

- Before/after subroutines can be used in parallel jobs, in server jobs, and in active stages in server jobs.

- Transform functions can be used in the server Transformer stage, in the BASIC Transformer stage in parallel jobs, and in Routine activities in job sequences. (There is a full course called Programming with DataStage BASIC available. We do not have time to cover creation of routines in this class.)

Job designs are created using DataStage Designer and are stored directly in the repository.

Process Metadata in DataStage


Each time a job runs, it keeps a log of its activity and periodically updates
status information such as CPU usage and row counts. This information is
stored in the Repository, and may be viewed using the Director client and
reported on using the reporting console of Information Server.
Environment variable options allow the collection of extra information
about processing; most of these are in the Reporting folder.

Figure 4-2 Environment Variables That Control Reporting


Metadata Repository
Metadata need to be stored somewhere so that they can be used. For
DataStage, metadata are stored in the metadata repository.
In fact there are two metadata repositories. Each DataStage project has a
local repository and there is a central, unified metadata repository for all
Information Server products. Unified in this context means that the
metadata are stored in a format such that they are accessible via the
metadata delivery and analysis services by any Information Server
product.
When using DataStage you are not, in general, aware of which of the two repositories holds any particular piece of metadata. Metadata may be in one, the other, or both. You access the metadata repository through the Repository toolbar in the Designer client.

Figure 4-3 Repository Toolbar

Figure 4-3 shows the Repository toolbar in a project called (as indicated
by the tab) Demonstrations. Chances are that your Repository will have a
different set of folders, since the structure is completely customizable.
Some of the folders in Figure 4-3 relate to QualityStage jobs, some relate
to mainframe jobs. If you do not have these capabilities installed on your
DataStage server, then you may not see these folders.


In the Designer client, the repository is organized as a tree, in which you create as many branches as needed. Be careful, though, not to create so complex a structure that it becomes impossible to maintain.

Figure 4-4 Repository with One Branch Expanded

In Figure 4-4 the Routines branch in the Repository has been expanded.
One of the second-level folders is called Examples, and a third-level folder
called Functions contains examples of (server) functions. These examples
ship with DataStage.
If you ever need to refer to a particular sub-folder in the repository, use
backslash characters between the folder names. For example the open
folder in Figure 4-4 would be specified by the name
\Routines\Examples\Functions

Whenever you have a repository object (other than a job or a shared container) open for editing, its full name is reported in the title bar of the editing dialog. See, for example, Figure 4-1, which shows a Data Element open for editing.
Components that ship with DataStage have a flag set in their repository records which causes them to be read-only when opened in Designer. If you need to adapt their functionality, make a copy of the object; the copy will not be flagged read-only.
To make a copy, select the object's name, right-click, and choose Create a copy from the menu. For example, if the object is called XYZ, then the new object will be called CopyOfXYZ. You can (indeed, you probably should) rename it as soon as possible. The same right-click menu has a Rename option available, or you can rename most objects within their editing dialog.

Creating New Categories

To create a new folder anywhere in the repository, right-click on the folder which will be the parent of the new folder. (There is a project folder at the very top of the tree that can serve as the parent of new top-level folders.)
Choose New from the pop-up menu, then Folder from the subsequently displayed menu. In Figure 4-5 this process is illustrated by creating a new sub-folder in the ParameterSets branch of the Repository.

Figure 4-5 Creating a New Folder


This will open a revised Repository toolbar with the newly created folder, named NewFolder, selected (highlighted) and waiting for its name to be changed.
This is shown in Figure 4-6. You should, of course, rename the new folder immediately to something more meaningful.

Figure 4-6 Newly Created Folder

Deleting a Folder
When you select a folder in the repository and press the Del button on
your keyboard, or right-click the folder and choose Delete from the menu,
you might be deleting not just the folder but also its entire contents.
To help guard against the possibility of accidental deletion, a confirmation
dialog appears asking you to confirm deletion of the selected items.
In the case of deleting a single folder this dialog will have only the named
folder in its list. However, deletion can be initiated from the result of a
search of the Repository, so that the confirmation dialog allows you to
limit the items to be deleted to just those which you select in the dialog
itself.

Searching the Repository: Quick Find

There are two tools for searching the repository: Quick Find and Advanced Find. Let's look at Quick Find first.


Whenever you are working in the Repository there is almost always a link to Open Quick Find, usually at the top right or bottom left of the dialog in which you are working. In Figure 4-6, for example, the link is in the top right.
The Quick Find dialog allows you to specify the name for which you want to search in the repository. This name may include the asterisk (*) as a wildcard character. No other form of regular expression is supported.

Figure 4-7 Quick Find Dialog

The Types to find drop down list allows you to limit the search to
particular DataStage object types, using the standard checked tree
method, illustrated in Figure 4-8.


Figure 4-8 Quick Find Dialog Types to find List (Partial)

The Include description check box allows the search to include the description fields of DataStage objects when searching for the indicated string or wildcard pattern.
The initial result of Quick Find is an expanded repository tree with the first object in which the search was successful highlighted. Next and Prev buttons allow this tree view to be navigated.

Figure 4-9 Quick Find Initial Result

In Figure 4-9 the results of an unconstrained search for Month are shown. Sixteen hits were obtained, the first being the DateGenericToTimestamp routine in the \Routines\sdk\Date folder.


Clicking on the 16 matches link or on the Adv button opens the Advanced Find capability. Alternatively you can right-click on any of the selected objects (or, indeed, on any of the objects) and perform other activities, such as rename, export, or "where used" and dependency analyses.

Searching the Repository: Advanced Find

The Advanced Find dialog offers the same search capabilities as Quick Find, but with a greater range of filters available.

Figure 4-10 Advanced Find

Figure 4-10 shows the same search as was illustrated for Quick Find,
namely for the word Month occurring in the object name or description.
In Advanced Find, however, you can specify different words in the
description while still filtering on the object name. Other filtering options
are displayed on the bottom left of the dialog.
Type is the same filter as in Quick Find, allowing the search to be limited
to specific DataStage object types.
Creation and Modification filters allow the user who created or modified
the object and/or the date on which creation or modification occurred to be
selected as filters for the search.
As you can see in Figure 4-11, when a filter condition is selected there is a visual cue (a green circle with a yellow check mark) that indicates that this is the case. This cue remains visible even when that particular part of the filter has been minimized.

Figure 4-11 Advanced Find Created Filter Dialog

Where used allows you to set up a list of repository objects so that the
search finds only objects that use the objects in your list.
Dependencies of allows you to set up a list of repository objects so that
the search finds only objects that are dependencies of any of the objects in
your list. For example, a job can be a dependency of a job sequence, a
routine can be a dependency of a job or even of another routine.
Type specific allows you to nominate a table definition that will be used to find those table definitions in the repository that are related via the same "shared table". In this context, a shared table is a table definition in the common, unified metadata repository for Information Server.
There are four Options. Search can be case sensitive or not, can be within
the last result set only (or not), can include nested results for dependency
searches, and can search for a match in object name or description or both.


Table Definitions
DataStage uses the term table definition to mean any form of record
layout definition. The term has its origin in database terminology but has
been extended, for DataStage use, to mean record layout metadata from
any source.
So, for example, DataStage records the format of a sequential file as its
table definition. DataStage records the format of a COBOL-generated
complex flat file as its table definition. The term used by the parallel
execution engine (Orchestrate) is a record schema. Other tools, such as
CASE tools, have their own terminology and structure.
Both DataStage and Information Server use a generic storage format, which means that, whatever the origin, the record layout can be ascertained and used by any stage type. For example, you may have imported a table definition from a database server and now wish to create a text file of records to be loaded into that table; all you do is use that table definition in the Sequential File stage, even though it was not imported from a sequential file.
To see this for yourself, choose any table definition in the repository and open it. Select the Layout tab. From here you can view the table definition as a standard table definition (using SQL data types), as a COBOL file definition (using COBOL data types), or as an Orchestrate record schema (using C++ data types).


Figure 4-12 Table Definition Layout Tab

Exporting DataStage Components

DataStage components (that is, any objects in the repository) can be exported into a text file. Two formats are available.

- A DSX (DataStage export) file is the original format used by DataStage. It is the more compact of the two formats, a factor that might be considered if, for example, contemplating emailing the export file.

- The other format uses XML (Extensible Markup Language), which identifies each component with its own pair of tags, as well as using tags and a style sheet to represent the relationships between components.

Exporting DataStage components is accomplished via the Export menu in Designer, or by choosing Export from the results of a Quick Find or an Advanced Find. The Repository Export dialog allows you to specify what to export, and where.


Figure 4-13 DataStage Repository Export Dialog

The Items to Export pane contains the list of items to be exported. The Add link re-invokes Quick Find to locate more items. Eventually you have a list of items in this field, some or all of which you have selected to be exported. The status bar at the bottom of the window reports how many objects have been selected and how many of these will be ignored (not exported). For example, if any read-only items have been selected and Exclude read-only items is set, then these read-only items will be ignored.
The export file is always on the client machine.¹ The Type of export field governs the format of the export file and also its filename suffix; DSX files have .dsx as their suffix, while XML export files have .xml as theirs.
If Append to existing file is not selected and the export file already exists, an "OK to overwrite?" message box asks you to confirm that you wish to overwrite it.
Other Options (the button opens a further dialog) include whether to
include source code with routines, job executables and data quality
specifications (QualityStage), whether to include data type definitions
(DTD) in an XML export file, whether to export property definitions as
internal values or externalized strings in an XML export file, and whether
to use a non-default style sheet for an XML export file.
Once you have a DataStage export file, it may be used as the source from which to import DataStage components.
A DataStage export file can, therefore, be used as a backup of an existing project (or of selected components in it), or can be used to transfer components from one project to another. It may also happen that you are asked to create an export file as part of an attempt by your support provider to solve some particular problem that you may be having, especially if you are able to reproduce the problem reliably.²
In version 8.1 a tool called Information Server Manager is available. This
also can export components, either into a "deployment package" or into an
"archive file". Its command line interface is a command called istool.
Naturally Information Server Manager can import from the packages or
archive files that it creates.
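As a sketch only (the domain, hosts, credentials, project name and path here are all invented, and option syntax varies between versions; check the manuals mentioned in the footnote), a command-line export might look like:

```
dscmdexport /D=services-host:9080 /H=engine-host /U=dsadm /P=secret DemoProject C:\exports\DemoProject.dsx
```

Because the export file is written by the client, the target pathname is a Windows pathname on the machine where the command runs.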

¹ There are command line interfaces to export, namely dscmdexport and dsexport. These are beyond the scope of these notes. Details may be found in the appropriate manuals.
² One of the best tools that a support analyst can have is a reproducible case.


Importing DataStage Components

Using DataStage Designer's Import menu you can import DataStage components that have been exported from any DataStage project.³
Under the Import menu the first two options are DataStage components, for importing from a DSX-format file, and DataStage components (XML), for importing from an XML-format file. The DataStage Repository Import dialog is relatively simple.

Figure 4-14 DataStage Repository Import Dialog

Obviously it is necessary to identify the export file. A browser is provided to help to find it. Recall that the export file is always on the client machine and that, therefore, it will always have a Windows pathname.
Import all imports every object in the export file. Import selected reads
the entire export file into memory and displays the objects in a screen
from which individual objects may be selected for import. This is a useful
way to import a single component when no export of that single object is
available.
Overwrite without query suppresses the "OK to overwrite?" message box that is displayed if an object in the export file already exists in the project. This check box should be used with great care.
Perform impact analysis checks and reports (if Import selected is chosen) whether each object being imported already exists in the project, and whether any object depends on it.

³ If the source project was using a different character set from the character set used in the current project into which you wish to import, you may need to edit the export file to mention the new character set. It is possible, in this case, that some text strings may not import accurately. However, all DataStage objects themselves should import successfully.


Importing Table Definitions

As noted earlier, a table definition in DataStage describes the record layout in any data source; it does not have to be a database table. Table definitions can be imported into the repository from a number of sources, as illustrated in Figure 4-15.

Figure 4-15 DataStage Table Definition Import Menu

In later modules we will investigate a couple of these in somewhat more detail. As a general principle, however, each opens a wizard that takes you through identifying the metadata source, retrieving the definitions from that source, and storing them in a particular category in the repository.
The Connector import wizard allows table definitions to be imported into
the DataStage repository from the unified Information Server repository.
Sequential File definitions are unusual in that you also have to specify
format information, as well as importing/defining column definitions.
Customarily table definitions are stored in the Table Definitions branch of the repository with two levels of category: data source type and data source name. For example, a table definition imported from an ODBC data source called Pricing would conventionally be stored in the category \Table Definitions\ODBC\Pricing. For sequential file table definitions the data source name is the name of the directory in which the file exists. For example, if the file award_flights.txt is in the directory /usr/Flights, then the default repository category for storing its table definition would be \Table Definitions\Sequential\Flights.
There is no compulsion to follow this convention. Your site may, for
example, store table definitions and other metadata in folders reflecting
the subject area being processed.

Nullable Fields
In database tables some fields are marked NOT NULL, while others are nullable, meaning that they may contain NULL.
In the context of database tables, NULL indicates that there is no known value for this field in the current row. How NULL is stored differs between databases, and is immaterial.
Because NULL is unknown, there are very few operations that can be
performed with it. For example, adding 34 to an unknown value yields a
still-unknown value.
Functions in parallel Transformer stages are particularly intolerant of NULL; you need to handle NULL specifically.
Two tests may be performed on nullable fields: you can ask whether the value IS NULL or whether the value IS NOT NULL. Each of these questions returns a true/false answer. No other operation is legitimate.
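A sketch of the defensive style this demands in a parallel Transformer derivation (the link and column names here are invented):

```
If IsNull(InLink.OrderValue) Then 0 Else InLink.OrderValue
```

Without the IsNull() guard, a NULL arriving in the derivation could cause the output row to be rejected.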
NULL can be generated by database joins (and in the DataStage Join
stage) when performing outer joins. Fields from the outer source will
return NULL if there is no match on the join condition.
Within a DataStage table definition any field can be marked as Nullable or
not. However, if there is any possibility that this field may contain NULL
then it must be marked Nullable.
Text files have no data types, and therefore no implicit concept of NULL.
With sequential files, therefore, it is necessary to specify some text string
which, if encountered, will be understood to represent NULL. This is
covered in more detail in the module on Sequential Files.
DataStage's internal representation of NULL is usually a single byte whose binary value is 10000000.⁴ However, in environments where this byte is used to represent the Euro currency symbol, a different byte value can be configured for DataStage to use. DataStage's internal NULL is referred to as an "out-of-band" null.

⁴ This value can be represented as an int8 field whose value is -128.


An "in-band" null is a special value, legal for the data type, that is used to represent NULL even though it is not NULL. For example, an in-band null for DateHired (data type Date) might be 1800-01-01: a legal date, but impossible in real data as a date hired. Therefore any representation of NULL in a Sequential File stage is, effectively, an in-band null.
Conversion functions exist for switching between out-of-band and in-band null, and for generating null. These are different in the Transformer stage and the Modify stage.
Table 4-1 Null Handling Functions

Description                    Modify Stage    Transformer Stage
Test for null                  null()          IsNull()
Test for not null              notnull()       IsNotNull()
Convert null to value          handle_null()   NullToValue(), NullToEmpty(), NullToZero()
Convert to in-band null        handle_null()   NullToValue(), NullToEmpty(), NullToZero()
Convert to out-of-band null    make_null()     SetNull()
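For example (the field name is invented), a Modify stage specification that replaces NULL with an in-band value, and a roughly equivalent Transformer derivation, might be sketched as:

```
DateHired = handle_null(DateHired, '1800-01-01')

NullToValue(InLink.DateHired, '1800-01-01')
```

In each case the second argument is the in-band value that stands in for the out-of-band null.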


Review
The term metadata is usually understood to mean information about data
or the processing of the data. Business metadata incorporates knowledge
that the business has about the data, such as business rules, ownership and
responsibility. Technical metadata includes things like table definitions,
routine code and the like. Process metadata describes what happened to
the data, when, and with what result/success. DataStage stores metadata
in both the central Information Server repository and in its own local
repository.
The Repository toolbar in DataStage (Figure 4-3) does not reveal in which
location any particular item of metadata is stored. It is organized into
folders, over which you have complete control. But it is wise to follow
some systematic way of storing metadata. The terms category and
pathname are both used to describe the location of a particular folder, or
component in a folder, in the Repository.
DataStage has two search utilities, Quick Find and Advanced Find. The
latter has more filters, and allows a greater range of things to be done with
the results of the search.
One critical item of metadata for DataStage is the table definition, a
term that encapsulates any collection of column definitions. These can be
imported using a number of different tools. It is also possible to export
any combination of components from the Repository into a file that can be
subsequently used to import some or all of these components into another
DataStage project.
NULL is a concept: that of a data item whose value is unknown. Every database has its own way of representing NULL internally, as does the DataStage server. Functions exist to test whether a data item is null (or is not null), to substitute a value where this is true, and to generate out-of-band null where needed. Some activities, such as outer joins, can also return NULL.

Further Reading
Parallel Job Developer's Guide, Chapter 2
Designer Client Guide, Chapters 2 and 13
