Objectives
Having completed this module you will be able:
to define "nullable"
Metadata
The word metadata comes from a Greek prefix "meta", meaning above,
and "data". "Data" is the plural past participle of the Latin verb "dare",
meaning "to give". The singular past participle is "datum", something
(that has been) given. So "data" literally means things (that have been)
given.
In information technology (IT), the word "metadata" is usually taken to
mean "information that describes data", and everyone claims to understand
what "data" means in IT.
Metadata allow questions about the data to be answered. For example, an
end user may be looking at a pie chart in which one sector contains 42%
of the overall total. The user may be interested to know how up to date the
data are, whether they are complete, what relationship they bear to the
operational systems' data, and what processing they underwent between
there and the pie chart.
There are several classes of metadata. Authorities differ on how many.
For a DataStage developer the three most important are listed here.
Figure 4-1 shows an example of a Data Element. The SQL tab allows the
most likely data type for this element to be recorded, though it is not
enforced. The other two tabs define the data element's relationships with
DataStage Transforms, which are available only in server jobs, not in
parallel jobs (this class is about parallel jobs, so Transforms will not be
discussed).
Data elements can be added to any table definition, to highlight that a
particular field has business metadata to be carried with it. Usage analyses
can be performed on data elements, for example to answer developer
questions such as "which jobs process revenue?" (assuming there is a data
element called Revenue or something similar).
Routines are of three kinds. Parallel routines, which can be called from
the Transformer stage in parallel jobs, are created outside of DataStage in
the C++ language and compiled and linked. An entry is placed into the
repository to record the name, arguments and location of the library or
object containing the routine.
Server routines are created within DataStage, using DataStage BASIC as
the programming language. They are stored directly in the repository and
are of two kinds.
Job designs are created using the DataStage Designer client and are stored directly
in the repository.
Metadata Repository
Metadata need to be stored somewhere so that they can be used. For
DataStage, metadata are stored in the metadata repository.
In fact there are two metadata repositories. Each DataStage project has a
local repository and there is a central, unified metadata repository for all
Information Server products. Unified in this context means that the
metadata are stored in a format such that they are accessible via the
metadata delivery and analysis services by any Information Server
product.
When using DataStage you are not, in general, aware of which of the two
repositories holds your particular metadata. Metadata may be in one, the
other, or both. You access the metadata repository through the Repository
toolbar in the Designer client.
Figure 4-3 shows the Repository toolbar in a project called (as indicated
by the tab) Demonstrations. Chances are that your Repository will have a
different set of folders, since the structure is completely customizable.
Some of the folders in Figure 4-3 relate to QualityStage jobs, some relate
to mainframe jobs. If you do not have these capabilities installed on your
DataStage server, then you may not see these folders.
In Figure 4-4 the Routines branch in the Repository has been expanded.
One of the second-level folders is called Examples, and a third-level folder
called Functions contains examples of (server) functions. These examples
ship with DataStage.
If you ever need to refer to a particular sub-folder in the repository, use
backslash characters between the folder names. For example the open
folder in Figure 4-4 would be specified by the name
\Routines\Examples\Functions
Rename option available, or you can rename most objects within their
editing dialog.
This will open a revised Repository toolbar with the newly created folder,
named NewFolder, selected (highlighted) and waiting for its name to be
changed.
This is shown in Figure 4-6. You should, of course, rename the new folder
immediately to something more meaningful.
Deleting a Folder
When you select a folder in the repository and press the Del button on
your keyboard, or right-click the folder and choose Delete from the menu,
you might be deleting not just the folder but also its entire contents.
To help guard against the possibility of accidental deletion, a confirmation
dialog appears asking you to confirm deletion of the selected items.
In the case of deleting a single folder this dialog will have only the named
folder in its list. However, deletion can also be initiated from the results
of a Repository search, so the confirmation dialog allows you to limit the
items to be deleted to just those that you select in the dialog itself.
Whenever you are working in the Repository there is almost always a link
to Open Quick Find, usually at the top right or bottom left of the dialog
in which you are working. In Figure 4-6, for example, the link is in the
top right.
The Quick Find dialog allows you to specify the name for which you want
to search in the repository. This name may include an asterisk as a wildcard
character. No other wildcard or regular expression syntax is supported.
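To make the wildcard rule concrete, here is a minimal Python sketch of asterisk-only matching. It is an illustration rather than DataStage's own implementation: the function name matches is hypothetical, and both case-insensitive comparison and whole-name matching are assumptions made for the example.

```python
import re


def matches(pattern: str, name: str) -> bool:
    """Return True if name matches pattern, where '*' is the only wildcard."""
    # Escape every literal piece so that characters such as '.' or '?'
    # are matched as themselves; only '*' becomes "any run of characters".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Assumption: matching ignores case and must cover the whole name.
    return re.fullmatch(regex, name, flags=re.IGNORECASE) is not None


# DateGenericToTimestamp is the routine named in the text; DateX is made up.
print(matches("Date*", "DateGenericToTimestamp"))
print(matches("*Timestamp", "DateGenericToTimestamp"))
print(matches("Date?", "DateX"))
```

In this sketch the question mark in the last call is treated as an ordinary character, which is what "no other wildcard" implies.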
The Types to find drop down list allows you to limit the search to
particular DataStage object types, using the standard checked tree
method, illustrated in Figure 4-8.
In Figure 4-9 the results of an unconstrained search for Month are shown.
Sixteen hits were obtained, the first being in the DateGenericToTimestamp
routine in the \Routines\sdk\Date folder.
Clicking on the 16 matches link or on the Adv button opens the Advanced
Find capability. Alternatively you can right-click on any of the objects,
whether selected or not, and perform other activities such as rename,
export, or a where used or dependencies analysis.
Figure 4-10 shows the same search as was illustrated for Quick Find,
namely for the word Month occurring in the object name or description.
In Advanced Find, however, you can specify different words in the
description while still filtering on the object name. Other filtering options
are displayed on the bottom left of the dialog.
Type is the same filter as in Quick Find, allowing the search to be limited
to specific DataStage object types.
Creation and Modification filters allow the search to be restricted by the
user who created or modified the object, by the date on which creation or
modification occurred, or by both.
As you can see in Figure 4-11, when a filter condition is selected there is a
visual cue (green circle with yellow check mark) that indicates that this is
the case. This cue remains visible even though that particular part of the
filter has been minimized.
Where used allows you to set up a list of repository objects so that the
search finds only objects that use the objects in your list.
Dependencies of allows you to set up a list of repository objects so that
the search finds only objects that are dependencies of any of the objects in
your list. For example, a job can be a dependency of a job sequence, and a
routine can be a dependency of a job or even of another routine.
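The difference between the two filters can be sketched in Python as two directions of lookup over the same "uses" relationship. This is only a model of the idea, not the way DataStage stores or searches its repository. All object names are invented, and the nested flag is an assumption about what following dependencies transitively might look like.

```python
from typing import Dict, Set

# Hypothetical repository: each object maps to the objects it uses directly.
USES: Dict[str, Set[str]] = {
    "SeqNightly": {"JobLoadSales"},
    "JobLoadSales": {"RoutineCleanDate"},
    "RoutineCleanDate": {"RoutineTrim"},
    "RoutineTrim": set(),
}


def dependencies_of(name: str, nested: bool = False) -> Set[str]:
    """Objects that name uses; with nested=True, follow the chain down."""
    found: Set[str] = set()
    pending = list(USES.get(name, set()))
    while pending:
        item = pending.pop()
        if item in found:
            continue  # already visited, which also guards against cycles
        found.add(item)
        if nested:
            pending.extend(USES.get(item, set()))
    return found


def where_used(name: str) -> Set[str]:
    """Objects that use name directly."""
    return {owner for owner, used in USES.items() if name in used}


print(sorted(dependencies_of("SeqNightly")))
print(sorted(dependencies_of("SeqNightly", nested=True)))
print(sorted(where_used("RoutineCleanDate")))
```

Here the job is a dependency of the sequence, one routine is a dependency of the job, and the other routine is a dependency of the first routine, mirroring the examples in the text.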
Type specific allows you to set a table definition that will be used to find
those table definitions in the repository that are related via the same shared
Table. In this context, a shared Table is a table definition in the
common, unified metadata Repository for Information Server.
There are four Options. Search can be case sensitive or not, can be within
the last result set only (or not), can include nested results for dependency
searches, and can search for a match in object name or description or both.
Table Definitions
DataStage uses the term table definition to mean any form of record
layout definition. The term has its origin in database terminology but has
been extended, for DataStage use, to mean record layout metadata from
any source.
So, for example, DataStage records the format of a sequential file as its
table definition. DataStage records the format of a COBOL-generated
complex flat file as its table definition. The term used by the parallel
execution engine (Orchestrate) is a record schema. Other tools, such as
CASE tools, have their own terminology and structure.
Both DataStage and Information Server use a generic storage format
which means that, whatever the origin, the record layout can be
ascertained and used by any stage type. For example, you may have
imported a table definition from a database server and now wish to create
a text file of records to be loaded into that table. All you need to do is
use that table definition in the Sequential File stage, even though it was
not imported from a sequential file.
To make this point, choose any table definition in the repository and open
it. Select the Layout tab. From here you can view the table definition as a
standard table definition (using SQL data types), as a COBOL file
definition (using COBOL data types) or as an Orchestrate record schema
(using C data types).
The Items to Export pane contains the list of items to be exported. The
Add link re-invokes Quick Find to locate more items. Eventually you
have a list of items in this field, some or all of which you have selected to
be exported. The status bar at the bottom of the window reports how
many objects have been selected and how many of these will be ignored
(not exported). For example, if any read-only items have been selected
and Exclude read-only items is set, then these read-only items will be
ignored.
The export file is always on the client machine (see footnote 1). The Type
of export field governs the format of the export file and also its filename
suffix: DSX files have .dsx as their suffix, while XML export files have
.xml as their suffix.
If Append to existing file is not selected and the export file already
exists, an "OK to overwrite?" message box asks you to confirm that you
wish to overwrite the existing export file.
Other Options (the button opens a further dialog) include whether to
include source code with routines, job executables and data quality
specifications (QualityStage), whether to include a document type
definition (DTD) in an XML export file, whether to export property
definitions as internal values or externalized strings in an XML export
file, and whether to use a non-default style sheet for an XML export file.
Once you have a DataStage export file, it may be used as source from
which to import DataStage components.
A DataStage export file can, therefore, be used as a backup of an existing
project (or of selected components in it), or can be used to transfer
components from one project to another. You may also be asked to create
an export file as part of an attempt by your support provider to solve some
particular problem that you are having, especially if you are able to
reproduce the problem reliably (see footnote 2).
In version 8.1 a tool called Information Server Manager is available. It
can also export components, either into a "deployment package" or into an
"archive file". Its command line interface is a command called istool.
Naturally Information Server Manager can import from the packages or
archive files that it creates.
Footnote 1: There are command line interfaces to export, namely dscmdexport
and dsexport. These are beyond the scope of these notes. Details may be found
in the appropriate manuals.
Footnote 2: One of the best tools that a support analyst can have is a
reproducible case.
If the source project used a different character set from the one used in the
project into which you wish to import, you may need to edit the export file to
name the new character set. In this case some text strings may not import
accurately. However, all DataStage objects themselves should import successfully.
Nullable Fields
In database tables, some fields are marked NOT NULL while others are
"nullable", meaning that they may contain NULL.
In the context of database tables, NULL indicates that there is no known
value for this field in the current row. How NULL is stored differs from
one database to another, and is immaterial here.
Because NULL is unknown, there are very few operations that can be
performed with it. For example, adding 34 to an unknown value yields a
still-unknown value.
Functions in parallel Transformer stages are particularly intolerant of
NULL, so you need to handle NULL explicitly.
Two tests may be performed with nullable fields: you can ask whether
the value IS NULL or whether the value IS NOT NULL. Each of these
questions returns a true/false answer. No other operation is legitimate.
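The behaviour described above can be imitated in a few lines of Python. This is a sketch of the concept only, not DataStage Transformer code: Python's None is assumed to stand in for NULL, and the function names are hypothetical.

```python
from typing import Optional


def add(value: Optional[int], amount: int) -> Optional[int]:
    """Add amount to value, propagating NULL (None) instead of failing."""
    if value is None:      # the IS NULL test
        return None        # unknown + amount is still unknown
    return value + amount


def is_null(value: Optional[int]) -> bool:
    """The IS NULL test: always answers True or False, never NULL."""
    return value is None


def is_not_null(value: Optional[int]) -> bool:
    """The IS NOT NULL test."""
    return value is not None


print(add(8, 34))
print(add(None, 34))
print(is_null(None), is_not_null(None))
```

Note that the arithmetic is only attempted once the null test has been passed, which is the kind of explicit handling the text calls for.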
NULL can be generated by outer joins, both in databases and in the
DataStage Join stage. When a row has no match on the join condition, the
fields that would have come from the unmatched source are returned as NULL.
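As a rough illustration of how an outer join produces NULL, the following Python sketch performs a left outer join on two small, invented data sets. None is assumed to stand in for NULL, and the function and variable names are hypothetical. It does not represent how the Join stage is implemented.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical sample data: (customer_id, name) and customer_id -> revenue.
customers: List[Tuple[int, str]] = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
revenue: Dict[int, float] = {1: 120.0, 3: 75.5}


def left_outer_join(
    left: List[Tuple[int, str]], right: Dict[int, float]
) -> List[Tuple[int, str, Optional[float]]]:
    """Keep every left row; unmatched right-hand fields become None (NULL)."""
    return [(key, name, right.get(key)) for key, name in left]


for row in left_outer_join(customers, revenue):
    print(row)
```

Customer 2 has no revenue row, so its revenue field comes back as None. That is why the output column has to be treated as nullable.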
Within a DataStage table definition any field can be marked as Nullable or
not. However, if there is any possibility that this field may contain NULL
then it must be marked Nullable.
Text files have no data types, and therefore no implicit concept of NULL.
With sequential files, therefore, it is necessary to specify some text string
which, if encountered, will be understood to represent NULL. This is
covered in more detail in the module on Sequential Files.
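The idea of a text string that stands for NULL can be sketched in Python as follows. The marker "NULL", the comma delimiter, the sample rows and the function name are all assumptions for illustration. In DataStage the string is whatever you specify in the stage or column properties.

```python
import csv
import io
from typing import List, Optional

NULL_TEXT = "NULL"  # assumed marker; the real string is whatever you configure


def read_rows(text: str, null_text: str = NULL_TEXT) -> List[List[Optional[str]]]:
    """Parse comma-separated text, turning the null marker into None."""
    reader = csv.reader(io.StringIO(text))
    return [
        [None if field == null_text else field for field in row]
        for row in reader
    ]


# Invented sample: the second record has no known hire date.
sample = "1001,Ann,2004-03-15\n1002,Bob,NULL\n"
print(read_rows(sample))
```

A genuine field value equal to the marker would also be read as null, so the marker should be a string that cannot occur in real data.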
DataStage's internal representation of NULL is usually a single byte
whose binary value is 10000000 (hexadecimal 80). However, in environments
where this byte is used to represent the Euro currency symbol, a different
byte value can be configured for DataStage to use. DataStage's internal
NULL is referred to as an out-of-band null.
An in-band null is a special value, legal for the data type, that is used to
represent NULL even though it is not itself NULL. For example, an in-band
null for DateHired (data type Date) might be 1800-01-01, which is a legal
date but impossible as a date hired. It follows that any representation of
NULL in a Sequential File stage is, effectively, an in-band null.
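The two kinds of null and the conversions between them can be sketched in Python. The date 1800-01-01 comes from the example in the text. None is assumed to stand in for the out-of-band null, and the function names are hypothetical rather than DataStage functions.

```python
from datetime import date
from typing import Optional

# In-band null from the text: a legal date that cannot be a real hire date.
IN_BAND_NULL = date(1800, 1, 1)


def to_out_of_band(value: date) -> Optional[date]:
    """Turn the in-band sentinel into a true null (None)."""
    return None if value == IN_BAND_NULL else value


def to_in_band(value: Optional[date]) -> date:
    """Turn a true null (None) into the in-band sentinel."""
    return IN_BAND_NULL if value is None else value


hired = to_out_of_band(date(1800, 1, 1))
print(hired)
print(to_in_band(hired))
```

The round trip only works if the sentinel never occurs as genuine data, which is why an impossible value such as 1800-01-01 is chosen.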
Conversion functions exist for switching between out-of-band and in-band
null, and for generating null. These are different in the Transformer stage
and the Modify stage.
Table 4-1 Null Handling Functions

Description                         Modify stage     Transformer stage
Test whether a value is null        null()           IsNull()
Test whether a value is not null    notnull()        IsNotNull()
Substitute a value for null         handle_null()    NullToValue(), NullToEmpty(), NullToZero()
Generate an out-of-band null        make_null()      SetNull()
Review
The term metadata is usually understood to mean information about data
or the processing of the data. Business metadata incorporates knowledge
that the business has about the data, such as business rules, ownership and
responsibility. Technical metadata includes things like table definitions,
routine code and the like. Process metadata describes what happened to
the data, when, and with what result/success. DataStage stores metadata
in both the central Information Server repository and in its own local
repository.
The Repository toolbar in DataStage (Figure 4-3) does not reveal in which
location any particular item of metadata is stored. It is organized into
folders, over which you have complete control. But it is wise to follow
some systematic way of storing metadata. The terms category and
pathname are both used to describe the location of a particular folder, or
component in a folder, in the Repository.
DataStage has two search utilities, Quick Find and Advanced Find. The
latter has more filters, and allows a greater range of things to be done with
the results of the search.
One critical item of metadata for DataStage is the table definition, a
term that encapsulates any collection of column definitions. These can be
imported using a number of different tools. It is also possible to export
any combination of components from the Repository into a file that can be
subsequently used to import some or all of these components into another
DataStage project.
NULL is a concept, that of a data item whose value is unknown. Every
database has its own way of representing NULL internally, as does the
DataStage server. Functions exist to test whether a data item is null (or is
not null), to substitute a value where this is true, and to generate
out-of-band null where needed. Some activities, such as outer joins, can also
return NULL.
Further Reading
Parallel Job Developer's Guide, Chapter 2
Designer Client Guide, Chapters 2 and 13