
White Paper

What Is QualityStage?
DSXchange.com
All rights reserved.
This white paper is one in a series of non-technical explanations of the various
products that are supported by IBM InfoSphere Information Server. White papers in
this series include:
What Is Business Glossary?
What Is Information Analyzer?
What Is FastTrack?
What Is QualityStage?
What Is DataStage?
What Is Metadata Workbench?
Information Server is a suite of products and services that combine seamlessly to enable
an organization to deliver trusted information to all consumers of that information, as
well as to deliver knowledge about that information to the organization itself.
QualityStage
QualityStage is a tool intended to deliver high quality data required for success in a
range of enterprise initiatives including business intelligence, legacy consolidation and
master data management. It does this primarily by identifying components of data that
may be in columns or free format, standardizing the values and formats of those data,
using the standardized results and other generated values to determine likely duplicate
records, and building a best of breed record out of these sets of potential duplicates.
Through its intuitive user interface QualityStage substantially reduces time and cost to
implement Customer Relationship Management (CRM), data warehouse/business
intelligence (BI), data governance, and other strategic IT initiatives and maximizes their
return on investment by ensuring their data quality.
With QualityStage it is possible, for example, to construct consolidated customer and
household views, enabling more effective cross-selling, up-selling, and customer
retention, and to help to improve customer support and service, for example by
identifying a company's most profitable customers.
The cleansed data provided by QualityStage allows creation of business intelligence on
individuals and organizations for research, fraud detection, and planning.
Out of the box QualityStage provides for cleansing of name and address data and some
related types of data such as email addresses, tax IDs and so on. However, QualityStage
is fully customizable to be able to cleanse any kind of classifiable data, such as
infrastructure, inventory, health data, and so on.
QualityStage Heritage
The product now called QualityStage has its origins in a product called INTEGRITY from a
company called Vality. Vality was acquired by Ascential Software in 2003, and the
product was renamed QualityStage. This first version of QualityStage reflected its
heritage (for example, it had only batch-mode operation) and, indeed, its mainframe
antecedents (for example file name components limited to eight characters).
Ascential did not do much with the inner workings of QualityStage, which was, after all,
already a mature product. Ascential's emphasis was to provide two new modes of
operation for QualityStage. One was a plug-in for DataStage that allowed data
cleansing/standardization to be performed (by QualityStage jobs) as part of an ETL data
flow. The other was to provide for QualityStage to use the parallel execution technology
(Orchestrate) that Ascential had as a result of its acquisition of Torrent Systems in 2001.
IBM acquired Ascential Software at the end of 2005. Since then the main direction has
been to put together a suite of products that share metadata transparently and share a
common set of services for such things as security, metadata delivery, reporting, and so
on. In the particular case of QualityStage, it now shares a common Designer client with
DataStage: from version 8.0 onwards QualityStage jobs run as, or as part of, DataStage
jobs, at least in the parallel execution environment.
QualityStage Functionality
QualityStage performs four tasks: investigation, standardization,
matching and survivorship. We will look at each of these in turn. Under the covers
QualityStage incorporates a set of probabilistic matching algorithms that can find
potential duplicates in data despite variations in spelling, numeric or date values, use of
non-standard forms, and various other obstacles to performing the same tasks using
deterministic methods. For example, if you have what appears to be the same
employee record where the name is the same but date of hire differs by a day or two, a
deterministic algorithm would show two different employees whereas a probabilistic
algorithm would show the potential duplicate.
("Deterministic" means absolute in this sense: either something is equal or it is not.
"Probabilistic" leaves room for some degree of uncertainty: a value may be close enough to be
considered equal. Needless to say, the degree of uncertainty used within QualityStage
is configurable by the designer.)
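This distinction can be sketched in a few lines of Python. The employee example below is hypothetical (the field names and the two-day tolerance are assumptions, and real QualityStage logic is far more sophisticated), but it shows why a probabilistic comparison flags what a deterministic one misses:

```python
from datetime import date

def deterministic_match(a: dict, b: dict) -> bool:
    # Absolute comparison: the records either are equal or they are not.
    return a == b

def probabilistic_match(a: dict, b: dict, day_tolerance: int = 2) -> bool:
    # "Close enough" comparison: same name, and hire dates within a
    # configurable tolerance are still treated as a potential duplicate.
    if a["name"] != b["name"]:
        return False
    return abs((a["hired"] - b["hired"]).days) <= day_tolerance

rec1 = {"name": "MARY JONES", "hired": date(2001, 3, 14)}
rec2 = {"name": "MARY JONES", "hired": date(2001, 3, 15)}

print(deterministic_match(rec1, rec2))  # False: two "different" employees
print(probabilistic_match(rec1, rec2))  # True: a potential duplicate
```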
Investigation
By investigation we mean inspection of the data to reveal certain types of information
about those data. There is some overlap between QualityStage investigation and the
kinds of profiling results that are available using Information Analyzer, but not so much
overlap as to suggest the removal of functionality from either tool. QualityStage can
undertake three different kinds of investigation.
Character discrete investigation looks at the characters in a single field (domain) to
report what values or patterns exist in that field. For example a field might be expected
to contain only codes A through E. A character discrete investigation looking at the
values in that field will report the number of occurrences of every value in the field (and
therefore any out-of-range values, empty or null values, and so on). "Pattern" in this
context means whether each character is alphabetic, numeric, blank or something else.
This is useful in planning cleansing rules; for example, a telephone number may be
represented with or without delimiters and with or without parentheses surrounding the
area code, all in the same field. To come up with a standard format, you need to be
aware of what formats actually exist in the data. The result of a character discrete
investigation (which can also examine just part of a field, for example the first three
characters) is a frequency distribution of values or patterns; the developer determines which.
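A minimal sketch of such a pattern analysis in Python (the class letters a/n/b below and the sample phone numbers are illustrative, not QualityStage's actual notation):

```python
from collections import Counter

def char_pattern(value: str) -> str:
    # Map each character to its class: alphabetic, numeric, blank,
    # or the character itself for punctuation.
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)  # punctuation is kept as-is
    return "".join(out)

phones = ["(617) 555-1234", "617-555-9876", "6175550000"]

# Frequency distribution of patterns: reveals how many distinct
# formats coexist in the one field.
freq = Counter(char_pattern(p) for p in phones)
for pattern, count in freq.most_common():
    print(pattern, count)
```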
Character concatenate investigation is exactly the same as character discrete
investigation except that the contents of more than one field can be examined as if they
were in a single field: the fields are, in some sense, concatenated prior to the
investigation taking place. The results of a character concatenate investigation can be
useful in revealing whether particular sets of patterns or values occur together.
Word investigation is probably the most important of the three for the entire
QualityStage suite, performing a free-format analysis of the data records. It performs
two different kinds of task: one is to report which words/tokens are already known, in
terms of the currently selected rule set; the other is to report how those words are to
be classified, again in terms of the currently selected rule set. Word investigation has
no overlap with Information Analyzer (the data profiling tool).
A rule set includes a set of tables that list the "known" words or tokens. For example,
the GBNAME rule set contains a list of names that are known to be first names in Great
Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the
GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs and so on, that
can not only be recognized as name prefixes (titles, if you prefer) but can in some cases
reveal additional information, such as gender.
When a word investigation reports about classification, it does so by producing a
pattern. This shows how each known word in the data record is classified, and the order
in which each occurs. For example, under the USNAME rule set the name WILLIAM F.
GAINES III would report the pattern "FI?G": the "F" indicates that "William" is a known first
name, the "I" indicates that "F." is an initial, the "?" indicates that "Gaines" is not a known
word in this context, and the "G" indicates that "III" is a generation, as would be "Senior",
"IV" and "fils". Punctuation may be included or ignored.
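The following sketch mimics this classification with a toy table, a stand-in for the USNAME rule set's much larger tables; the table contents and class letters follow the example above:

```python
# Tiny classification table standing in for a rule set's token tables.
CLASS_TABLE = {
    "WILLIAM": "F",   # known first name
    "MARGARET": "F",
    "III": "G",       # generation
    "SENIOR": "G",
    "IV": "G",
}

def classify_token(token: str) -> str:
    token = token.rstrip(".")          # here, punctuation is ignored
    if token in CLASS_TABLE:
        return CLASS_TABLE[token]
    if len(token) == 1 and token.isalpha():
        return "I"                     # a single letter is an initial
    return "?"                         # unknown word in this context

def word_pattern(name: str) -> str:
    return "".join(classify_token(t) for t in name.upper().split())

print(word_pattern("William F. Gaines III"))  # FI?G
```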
Rule sets also come into play when performing standardization (discussed below).
Classification tables contain not only the words/tokens that are known and classified,
but also contain the standard form of each (for example, "William" might be recorded as
the standard form for "Bill") and may contain an uncertainty threshold (for example,
"Felliciity" might still be recognizable as "Felicity" even though it is misspelled in the
original data record). Probabilistic matching is one of the significant strengths of
QualityStage.
Investigation might also be performed to review the results of standardization,
particularly to see whether there are any unhandled patterns or text that could be
better handled if the rule set itself were tweaked, either with improved classification
tables or through a mechanism called rule set overrides.
Standardization
Standardization, as the name suggests, is the process of generating standard forms of
data that might more reliably be matched. For example, by generating the standard
form "William" from "Bill", there is an increased likelihood of finding the match
between "William Gates" and "Bill Gates". Other standard forms that can be generated
include phonetic equivalents (using NYSIIS and/or Soundex), and something like
initials, for example the first two characters from each of five fields.
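Of the phonetic equivalents mentioned, Soundex is simple enough to show in full. This is a compact implementation of the classic algorithm; QualityStage's own implementation may differ in detail:

```python
def soundex(name: str) -> str:
    # Classic Soundex: keep the first letter, encode the rest as digits,
    # drop vowels, collapse adjacent duplicate codes, pad to four characters.
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3",
              "L": "4", "MN": "5", "R": "6"}

    def code(ch: str) -> str:
        for letters, digit in groups.items():
            if ch in letters:
                return digit
        return ""  # vowels and H, W, Y carry no code

    name = name.upper()
    result = name[0]
    prev = code(name[0])
    for ch in name[1:]:
        d = code(ch)
        if d and d != prev:
            result += d
        if ch not in "HW":   # H and W do not separate duplicate codes
            prev = d
    return (result + "000")[:4]

print(soundex("William"), soundex("Willem"))  # W450 W450: same phonetic form
```

Because "William" and "Willem" encode identically, a match on the Soundex field succeeds where a match on the raw spelling would fail.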
Each standardization specifies a particular rule set. As well as word/token classification
tables, a rule set includes specification of the format of an output record structure, into
which original and standardized forms of the data, generated fields (such as gender) and
reporting fields (for example whether a user override was used and, if so, what kind of
override) may be written.
It may be that standardization is the desired end result of using QualityStage. For
example, street address components such as "Street", "Avenue" or "Road" are often
represented differently in data, perhaps abbreviated differently in different records.
Standardization can convert all the non-standard forms into whatever standard format
the organization has decided that it will use.
This kind of QualityStage job can be set up as a web service. For example, a data entry
application might send in an address to be standardized. The web service would return
the standardized address to the caller.
More commonly standardization is a preliminary step towards performing matching.
More accurate matching can be performed if standard forms of words/tokens are
compared than if the original forms of these data are compared.
Matching
Matching is the real heart of QualityStage. Different probabilistic algorithms are
available for different types of data. Using the frequencies developed during
investigation (or subsequently), the information content (or rarity value) of each value
in each field can be estimated. The less common a value, the more information it
contributes to the decision. A separate agreement weight or disagreement weight is
calculated for each field in each data record, incorporating both its information content
(likelihood that a match actually has been found) and its probability that a match has
been found purely at random. These weights are summed across the fields in the record to
come up with an aggregate weight that can be used as the basis for reporting that a
particular pair of records probably are, or probably are not, duplicates of each other.
There is a third possibility, a grey area in the middle, which QualityStage refers to as
the "clerical review" area: record pairs in this category need to be referred to a human
to make the decision because there is not enough certainty either way. Over time the
algorithms can be tuned, with things like improved rule sets, weight overrides and different
settings of probability levels, so that fewer and fewer "clericals" are found.
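This weighting scheme is essentially the classic Fellegi-Sunter model of record linkage. The sketch below uses invented m-probabilities (the chance a field agrees in a true match), u-probabilities (the chance it agrees purely at random) and invented cutoffs; in QualityStage these values derive from the frequency data and designer settings:

```python
import math

# Assumed per-field probabilities. A rare value implies a small u,
# hence a large agreement weight: rarity contributes more information.
FIELDS = {
    "surname": {"m": 0.95, "u": 0.01},
    "city":    {"m": 0.90, "u": 0.20},
}

def field_weight(field: str, agrees: bool) -> float:
    m, u = FIELDS[field]["m"], FIELDS[field]["u"]
    if agrees:
        return math.log2(m / u)            # agreement weight (positive)
    return math.log2((1 - m) / (1 - u))    # disagreement weight (negative)

def composite_weight(comparisons: dict) -> float:
    # Sum field weights into the aggregate weight for the record pair.
    return sum(field_weight(f, a) for f, a in comparisons.items())

def classify(weight: float, lo: float = 0.0, hi: float = 6.0) -> str:
    if weight >= hi:
        return "match"
    if weight <= lo:
        return "non-match"
    return "clerical review"               # the grey area in the middle

w = composite_weight({"surname": True, "city": False})
print(round(w, 2), classify(w))
```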
Matching makes use of a concept called blocking, which is an unfortunately-chosen
term that means that potential sets of duplicates form blocks (or groups, or sets) which
can be treated as separate sets of potentially duplicated values. Each block of potential
duplicates is given a unique ID, which can be used by the next phase (survivorship) and
can also be used to set up a table of linkages between the blocks of potential duplicates
and the keys to the original data records that are in those blocks. This is often a
requirement when de-duplication is being performed, for example when combining
records from multiple sources, or generating a list of unique addresses from a customer
file, et cetera.
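Blocking can be sketched as grouping records on a key and comparing only within each group, which avoids the full N-by-N comparison; the field names below, including a precomputed surname Soundex, are assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "zip": "02139", "surname_sx": "S530"},
    {"id": 2, "zip": "02139", "surname_sx": "S530"},
    {"id": 3, "zip": "90210", "surname_sx": "S530"},
]

def block(records, keys):
    # Group record IDs sharing the same blocking key; only pairs inside
    # a block become candidates for the detailed weight comparison.
    blocks = defaultdict(list)
    for r in records:
        blocks[tuple(r[k] for k in keys)].append(r["id"])
    return blocks

candidate_pairs = [
    pair
    for members in block(records, ["zip", "surname_sx"]).values()
    for pair in combinations(members, 2)
]
print(candidate_pairs)  # [(1, 2)]: only records in the same block
```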
More than one pass through the data may be required to identify all the potential
duplicates. For example, one customer record may refer to a customer with a street
address, but another record for the same customer may include the customer's post
office box address. Searching for duplicate addresses would not find this customer; an
additional pass based on some other criterion would also be required. QualityStage does
provide for multiple passes, either fully passing through the data for each pass, or only
examining the unmatched records on subsequent passes (which is usually faster).
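A multi-pass flow can be sketched as follows: each pass groups on a different composite key, and later passes see only the leftovers of earlier ones. The record layout and keys below are invented; a real match would use weighted comparison rather than exact key equality:

```python
from collections import defaultdict

def one_pass(records, keys):
    # Group records agreeing on every field in `keys`; records missing a
    # key field, or alone in their group, stay unmatched for the next pass.
    buckets = defaultdict(list)
    for i, r in enumerate(records):
        k = tuple(r.get(f) for f in keys)
        if None not in k:
            buckets[k].append(i)
    dup_sets = [idxs for idxs in buckets.values() if len(idxs) > 1]
    grouped = {i for idxs in dup_sets for i in idxs}
    groups = [[records[i] for i in idxs] for idxs in dup_sets]
    unmatched = [r for i, r in enumerate(records) if i not in grouped]
    return groups, unmatched

def run_passes(records, pass_keys):
    matched, remaining = [], list(records)
    for keys in pass_keys:
        groups, remaining = one_pass(remaining, keys)
        matched.extend(groups)
    return matched, remaining

customers = [
    {"name": "A SMITH", "street": "12 ELM ST", "pobox": None},
    {"name": "A SMITH", "street": None,        "pobox": "PO BOX 9"},
    {"name": "A SMITH", "street": "12 ELM ST", "pobox": None},
    {"name": "A SMITH", "street": None,        "pobox": "PO BOX 9"},
]

# Pass 1 matches on name + street; pass 2 retries leftovers on name + PO box.
matched, leftover = run_passes(customers, [("name", "street"), ("name", "pobox")])
print(len(matched), len(leftover))  # 2 0
```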
Survivorship
As the name suggests survivorship is about what becomes of the data in these blocks of
potential duplicates. The idea is to get the "best of breed" data out of each block, based
on built-in or custom rules such as "most frequently occurring non-missing value",
"longest string", "most recently updated" and so on.
The data that fulfil the requirements of these rules can then be handled in a couple of
ways. One technique is to come up with a master record, a "single version of the
truth" that will become the standard for the organization. Another possibility is that
the improved data could be populated back into the source systems from which they were
derived; for example, if one source were missing date of birth, this could be populated
because the date of birth was obtained from another source (or from more than one). If this
is not the requirement (perhaps for legal reasons), then a table containing the linkage
between the source records and the master record keys can be created, so that the
original source systems can also refer to the single source of truth, and
vice versa.
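A survivorship step can be sketched as one rule per column applied across a block of duplicates; the rules below are invented examples of the built-in rule types named above, and the records are illustrative:

```python
from collections import Counter

def survive(block, rules):
    # Build a best-of-breed master record from a block of duplicates,
    # applying one survivorship rule per column to its non-missing values.
    master = {}
    for column, rule in rules.items():
        values = [r[column] for r in block if r.get(column)]
        master[column] = rule(values) if values else None
    return master

RULES = {
    "name":  lambda vs: Counter(vs).most_common(1)[0][0],  # most frequent value
    "email": lambda vs: max(vs, key=len),                  # longest string
    "dob":   lambda vs: vs[0],                             # first non-missing
}

block = [
    {"name": "MARY JONES", "email": "mj@example.com",         "dob": None},
    {"name": "MARY JONES", "email": "mary.jones@example.com", "dob": "1970-01-02"},
    {"name": "M JONES",    "email": None,                     "dob": None},
]
print(survive(block, RULES))
```

Note how the missing date of birth in two records is filled from the one source that has it, exactly the back-population scenario described above.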
Address Verification and Certification
QualityStage can do more than simple matching. Address verification can be
performed; that is, whether or not the address is in a valid format can be reported. Out of
the box, address verification can be performed down to city level for most countries. For
an extra charge, an additional module for world-wide address verification (WAVES) can
be purchased, which will give address verification down to street level for most
countries.
For some countries, where the postal systems provide appropriate data (for example
CASS in the USA, SERP in Canada, and AMAS/DPID in Australia), address certification can be
performed: in this case, an address is given to QualityStage and looked up against a
database to report whether or not that particular address actually exists. These
modules carry an additional price, but that includes IBM obtaining regular updates to
the data from the postal authorities and providing them to the QualityStage licensee.
QualityStage Designer
The Designer is a Windows-based graphical user interface in which the various
QualityStage tasks (called jobs) can be specified. This user interface has been created
to use a "design in the same way that you think" paradigm. That is, you select
components and metadata that describe precisely what you want to do, and draw a
picture of the flow of activities.
This is then converted into something that can be executed in batch or in real time as a
web service. A single mouse click effects the conversion (the process is called
compilation in the Designer).
QualityStage Designer also includes facilities for constructing and testing rule sets and
rule set overrides, for obtaining quality metrics and for tuning the entire process, for
example by setting thresholds for match/non-match, applying weight overrides based
on external knowledge (typically of the rarity value or sampling strategy), qualifying the
uncertainty thresholds applicable to the various probabilistic matching algorithms, and
more.
Figure 1 shows a QualityStage job design that performs standardization of customer
records and generates frequency distributions of selected fields. These frequency
distributions will be used in a subsequent match job.
Figure 1: QualityStage Designer
QualityStage Benefits
QualityStage provides powerful, accurate matching based on probabilistic matching
technology that is easy to set up and maintain, and delivers among the highest
match rates available in the market.
An easy-to-use graphical user interface (GUI) with an intuitive, point-and-click approach
to specifying automated data quality processes (data investigation, standardization,
matching, and survivorship) reduces the time needed to deploy data cleansing
applications.
QualityStage offers a thorough data investigation and analysis process for any kind of
free-format data. Through its tight integration with DataStage and other Information
Server products it also offers fully integrated management of the metadata associated
with those data.
There exists rigorous scientific justification for the probabilistic algorithms used in
QualityStage; results are easy to audit and validate.
Worldwide address standardization, verification and enrichment capabilities, including
certification modules for the United States, Canada, and Australia, add to the value of
cleansed address data.
Domain-agnostic data cleansing capabilities handle product data, phone numbers,
email addresses, birth dates, events, and other comment and descriptive fields.
Common data quality anomalies, such as data in the wrong field or data
spilling over into the next field, can be identified and addressed.
Extensive reporting provides metrics that yield business intelligence about the data and
help tune the application for quality assurance.
Service-oriented architecture (SOA) enablement with InfoSphere Information Services
Director allows you to take the data quality logic built using IBM InfoSphere
Information Server and publish it as an "always on, available everywhere" service in an
SOA in minutes.
The bottom line is that QualityStage helps to ensure that systems deliver accurate,
complete, trusted information to business users both within and outside the enterprise.
More Information
http://www-01.ibm.com/software/data/infosphere/qualitystage
IBM, the IBM logo, InfoSphere, WebSphere, Information Server, Information Analyzer,
FastTrack, Business Glossary, QualityStage, DataStage and Metadata Workbench are
trademarks or registered trademarks of International Business Machines Corporation in
the United States, other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks
of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft,
Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation
in the United States, other countries, or both. Intel, Intel Inside (logos), MMX and
Pentium are trademarks of Intel Corporation in the United States, other countries, or
both. UNIX is a registered trademark of The Open Group in the United States and other
countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or
both.
Other company, product or service names might be trademarks or service marks of
others.
About the Author
Ray Wurlod
Ray is a self-employed trainer and consultant for the IBM DataStage, IBM UniVerse,
IBM Red Brick Warehouse and DataStage XE suites of products.
Ray has taught advanced classes in the USA, the UK and Germany, and has been used
frequently as a training consultant by IBM to conduct advanced in-house training
classes. Additionally, Ray has presented training classes in almost every country in the
Asia-Pacific region, and has been involved in technical presentations and
implementations throughout the region.
Ray joined Prime Computer of Australia in 1986. He later joined VMARK Software
(original developers of DataStage) after Prime Computer sold its database businesses
to VMARK. Ray's principal role with VMARK, and subsequently with Ardent, Informix
and IBM, was as a DataStage trainer, but he was also actively involved in technical support.
While with VMARK and Ardent he was actively involved in the development of
DataStage, creating a complete training curriculum for use in the Asia-Pacific region.
He has also developed training curriculum and train-the-trainer programs for the
UniVerse RDBMS, including its NLS (national language support) implementation. When
Ardent sold its database businesses to Informix Software, Ray continued his
involvement in Data Warehouse technology by becoming expert with the Red Brick
Warehouse product, a database designed specifically for Data Warehouse (star schema)
implementations.
When Informix was acquired by IBM Ray continued his concentration on training, while
additionally focusing on Data Warehousing applications.
