
IQ8 Data Cleanse

Modifiers Guide

User Modifiable Dictionary for the IQ8 Data Cleanse transform
November 2004

Query the dictionaries to investigate parsing and mixed-casing
Create a custom parsing dictionary
Create custom capitalization dictionaries
Edit the pattern file for custom pattern matching
Edit the rule file for custom parsing
Run QuickParse
UMD command line and configuration file

Notices
Published in the United States of America by Firstlogic, Inc., 100 Harborview Plaza,
La Crosse, Wisconsin 54601-4071.
Customer Care

Technical help is free for customers who are current on their ESP. Advisors are
available from 8 a.m. to 6 p.m. central time, Monday through Friday. When you call,
have at hand the users manual and the version number of the product you are using.
Call from a location where you can operate your software while speaking on the
phone. To save time, fax or e-mail your questions, and an advisor will call or e-mail
back with answers prepared. Or visit our Knowledge Base on the Customer Portal
web site, where you can find answers on your own, right away, at any time of the day
or night.
Our Customer Care group also manages our customer database and order processing.
Call them for order status, shipment tracking, reporting damaged shipments or flawed
media, changes in contact information, and so on.

Phone                    888-788-9004 in the U.S. and Canada; elsewhere +1-608-788-9000
Web site                 http://www.firstlogic.com/customer
E-mail                   customer@firstlogic.com
Product literature       888-215-6442, fax 608-788-1188, or information@firstlogic.com
Corporate receptionist   608-782-5000, or fax 608-788-1188

What do you think of this guide?

The Firstlogic Technical Publications group strives to bring you the most useful and
accurate publications possible. Please give us your opinion about our documentation
by filling out the brief survey at http://www.firstlogic.com/customer/surveys/default.asp.
We appreciate your feedback! Thank you!

Legal notices

© 2004 Firstlogic, Inc. All rights reserved. This publication and accompanying software are protected by U.S. copyright
law and international treaties. No part of this publication or accompanying software may be copied, transferred, or
distributed to any person without the express written permission of Firstlogic, Inc.
National ZIP+4 Directory © 2004 United States Postal Service. Firstlogic Directories © 2004 Firstlogic, Inc. All City,
ZCF, state ZIP+4, regional ZIP+4, and supporting directories are also protected under the Firstlogic copyright. Firstlogic,
Inc. is a nonexclusive interface distributor of the USPS and holds a nonexclusive license to publish and sell ZIP+4
databases on optical and magnetic media. Firstlogic publishes this document and offers the Firstlogic product to the public
under a nonexclusive license from the United States Postal Service. The price of the Firstlogic product is not established,
controlled, or approved by the U.S. Postal Service.
Firstlogic, Inc., or any authorized dealer distributing this product, makes no warranty, expressed or implied, with respect to
this computer software product or with respect to this manual or its contents, its quality, performance, merchantability, or
fitness for any particular purpose or use. It is solely the responsibility of the purchaser to determine its suitability for a
particular purpose or use. Firstlogic, Inc. will in no event be liable for direct, indirect, incidental, or consequential damages
resulting from any defect or omission in this software product, this manual, the program disks, or related items and
processes, including, but not limited to, any interruption of service, loss of business or anticipatory profit, even if
Firstlogic, Inc. has been advised of the possibility of such damages. This statement of limited liability is in lieu of all other
warranties or guarantees, expressed or implied, including warranties of merchantability or fitness for a particular purpose.
1L, 1L (ball design), ACE, ACSpeed, DataJet, DocuRight, eDataQuality, Entry Planner, Firstlogic, Firstlogic InfoSource,
FirstPrep, FirstSolutions, GeoCensus, idCentric, IQ Insight, iSummit, Label Studio, MailCoder, Match/Consolidate,
PostWare, Postalsoft, Postalsoft Address Dictionary, Postalsoft Business Edition by Firstlogic, Postalsoft DeskTop Mailer,
Postalsoft DeskTop PostalCoder, Postalsoft DeskTop Presort, Postalsoft Manifest Reporter, PrintForm, RapidKey, Total
Rewards, and TrueName are registered trademarks of Firstlogic, Inc. DataRight, IRVE, and TaxIQ are trademarks of
Firstlogic, Inc. CASS, DPV, eLOT, FASTforward, NCOAlink and ZIP are trademarks of the United States Postal Service.
All other trademarks are the property of their respective owners.

IQ8 Data Cleanse Modifiers Guide

Contents

Preface .............................................................................................................5
Chapter 1:
Custom parsing dictionaries......................................................................... 7
Step 1: Query the dictionary.............................................................................9
Step 2: Create a parsing transaction file.........................................................12
Step 3: Put your entries in the transaction file ...............................................13
Step 4: Build your custom parsing dictionary................................................15
Step 5: Maintain and update your custom dictionary.....................................16
Sample transaction: Add a new word.............................................................17
Sample transaction: Add a title phrase...........................................................18
Sample transaction: Add a multiple-word firm name ....................................20
Sample transaction: Add a firm that looks like a personal name ...................22
Sample transaction: Modify information codes .............................................23
Sample transaction: Modify standards and standard-types ............................24
Sample transaction: Add an acronym for acronym conversion .....................25
Rules for working with match standards........................................................26
Chapter 2:
Custom capitalization dictionaries ............................................................ 29
Step 1: Create a capitalization transaction file ...............................................30
Step 2: Put your entries in the transaction file ...............................................31
Step 3: Build your custom capitalization dictionary ......................................32
Step 4: Update your custom dictionary ..........................................................33
Chapter 3:
User-defined pattern matching (UDPM)................................................... 35
Overview of UDPM .......................................................................................36
Working with the pattern file .........................................................................37
Introduction to regular expressions ................................................................39
Creating regular expressions ..........................................................................42
Example of defining a pattern ........................................................................43
Alternate expressions .....................................................................................45
Multiple rules .................................................................................................46
Example of a user-defined pattern file ...........................................................47
Chapter 4:
Modify the rule file...................................................................................... 51
What is the rule file? ......................................................................................52
How the rule file is organized ........................................................................53
Rule example..................................................................................................54
Definition section of a parsing rule ................................................................55
Action section of a parsing rule......................................................................58
Example of a parsing rule...............................................................................62

Chapter 5:
Check parsing results with QuickParse.................................................... 67
Get started with QuickParse .......................................................................... 68
Run QuickParse ............................................................................................. 70
Appendix A:
UMD configuration file, umd.cfg................................................................ 75
Appendix B:
UMD command line..................................................................................... 77
Appendix C:
Information codes and standard-type codes ............................................. 79
Index.............................................................................................................. 85


Preface

About this guide

This guide explains the User-Modifiable Dictionary (UMD), which is a tool for
viewing and customizing dictionary files.
This guide explains how to use the command-line version of UMD to create
custom parsing dictionaries and custom capitalization dictionaries.
Another way to modify the Data Cleanse transform's behavior to suit your
needs is to edit the pattern file (drludpm.dat). You edit the pattern file in order to
parse user-defined data patterns.
Additionally, you can edit the rules that are used to parse different types of name
and firm data. For more information, see "Modify the rule file" on page 51.
The last chapter of this guide explains how to check your results with QuickParse.
Use QuickParse to quickly see how data that you input would parse if input
through the Data Cleanse transform.

Related documents

Before using UMD, you should understand how your Firstlogic product uses the
dictionaries. For details, see your product documentation.

Conventions

This document follows these conventions:


Convention       Description

Bold             We use bold type for file names, paths, emphasis, and text that you
                 should type exactly as shown. For example, "Type iq8\bin."

Italics          We use italics for emphasis and text for which you should substitute
                 your own data or values. For example, "Type a name for your
                 project, and the .xml extension (projectname.xml)."

Menu commands    We indicate commands that you choose from menus in the following
                 format: Menu Name > Command Name. For example, "Choose File
                 > New."

Changes          We use a change bar in the right margin to mark product changes
                 since the last version.

                 We use this symbol to alert you to important information and
                 potential problems.

                 We use this symbol to point out special cases that you should know
                 about.

                 We use this symbol to draw your attention to tips that may be useful
                 to you.


Chapter 1:
Custom parsing dictionaries

What is a parsing
dictionary?

The parsing dictionary identifies and parses name, title, and firm data. The parser
looks up words in the parsing dictionary to retrieve information. The parser then
uses the dictionary information, as well as the rule file, to identify and parse
name, title, and firm data.
The parsing dictionary contains entries for words and phrases. Each entry tells
how the word or phrase might be used. For example, the dictionary indicates that
the word Engineering can be used in a firm name (such as Smith Engineering,
Inc.) or job title (such as VP of Engineering).
The dictionary also contains other information:

Type of information
in dictionary         Description

Acronyms              The dictionary contains the standard and acronymic forms of
                      words. For example, the dictionary indicates that Inc. is the
                      standardized form of Incorporated and IBM is the acronym
                      for Intl Business Machines.

Match standards       The dictionary contains match standards (potential matches).
                      For example, Patrick and Patricia are match standards for
                      Pat.

Gender                The dictionary contains gender data. For example, it indicates
                      that Anne is a feminine name and Mr. is a masculine prename.

Address               The dictionary contains address components for address
                      rules.

Why create a custom dictionary?

Our base parsing dictionary contains thousands of name, title, and firm entries.
You might tailor the dictionary to better suit your data. For example:

• You might customize the dictionary to correct specific parsing behavior. For
  example, given the name Mary Jones, CRNA, the word CRNA is parsed as a
  job title. In reality, CRNA is a postname (Certified Registered Nurse
  Anesthetist). To correct this, you could add CRNA to the parsing dictionary as
  a postname.
• You might add regional or ethnic names, special titles, or industry jargon.
  For example, if you process data for the real estate industry, you might add
  postnames such as CRS (Certified Residential Specialist) and ABR
  (Accredited Buyer Representative).
• If a specific title or firm name is parsed incorrectly, you can add an entry for
  the entire phrase. For example, the parser previously identified Hewlett
  Packard as a personal name, so we added Hewlett Packard to the dictionary
  as a firm name.


Overview of creating a
dictionary

To create a custom parsing dictionary, follow these basic steps:


1. Use UMD Show to query our base parsing dictionary. Look for existing
entries for the words you wish to add or change.
2. Put your custom entries in a transaction file. A transaction file is a database
containing the additions and changes you wish to make to our dictionary.
3. Build your custom dictionary. UMD Build takes our base dictionary, makes
the additions and changes specified in the transaction file, and creates the
custom dictionary.
[Figure: UMD Build takes three inputs and produces one output. Inputs: the
source dictionary (our parsing dictionary, parsing.dct); the transaction file (a
database containing your additions and changes); and the supporting files (files
that enable UMD to read the transaction file). Output: the custom dictionary (a
new dictionary containing entries from the source dictionary with your additions
and changes).]

Qualifications

Preparing custom parsing dictionaries is a task for a data-management


professional. If you employ UMD in all its capabilities, dictionary editing is
almost an engineering task. Dictionary editing is not a clerical task.

A note about
examples

The sample queries and transactions in this chapter are for example only. By the
time you read this manual, the particular examples may have been added to our
base parsing dictionaries, so your query results may differ from what is shown.


Step 1: Query the dictionary


Before you add a word to the dictionary, query our base parsing dictionary,
parsing.dct, to see whether there is already an entry for the word.
Run UMD Show

To query a dictionary, run UMD Show. To run UMD Show, use the command line
(see UMD Show on page 77). UMD Show is interactive. You enter a query and
UMD Show responds, either with data or a message that your query was not
found in the dictionary.

Query a single word

To query a single word, type the word at the Enter> prompt. Do not include any
punctuation. If the word is in the dictionary, UMD Show displays the dictionary
entry:
C:\umd /s parsing.dct
Using a parsing Dictionary.
Enter a query, or press <Esc> to exit.
Enter> Beth
Usage: 99
Intl Code(s): USENGLISH
Info Code(s): NAME_STRONG_FN NAMEGEN5
Standard(s) for BETH:
- BETHANY
NAME_MTC
- BETHEL
NAME_MTC
- ELIZABETH
NAME_MTC

For descriptions of the information codes and standard-type codes, see
"Information codes and standard-type codes" on page 79.

If the word is not in the dictionary, UMD Show tells you the entry was not found:
C:\umd /s parsing.dct
Using a parsing Dictionary.
Enter a query, or press <Esc> to exit.
Enter> Michelangelo
Text not found in dictionary.

Query a title phrase

To look up a multiple-word title, you must query the lookup form of the title,
that is, the same form that the parser would look up (see the procedure below).
Note: If a line contains consecutive words that are marked as phrase
words, the parser retrieves the standard for each word, removes any
punctuation, and looks up the phrase.

Procedure                                                   Example

1. Start with the raw title.                                Chief Executive Officer

2. Query each word and get the standard (TITLE_MTC) for     Chf. Exec. Off.
   each. If an appropriate match standard does not exist,
   use the original word.

3. Remove all punctuation.                                  Chf Exec Off
   This is the form of the title that you should query.

Get the first appropriate match standard for each word in the phrase, then
query the lookup form of the phrase:

C:\umd /s parsing.dct
Enter a query, or press <ESC> to exit.
Enter> Chief
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): PRENAME TITLE TITLE_INIT TITLE_TERM PREGEN3 PHRASE_WRD
FIRMMISC
Standard(s) for CHIEF:
- CHIEF
FIRM_MTC, FIRM_STD, PRENAME_MTC, PRENAME_STD,
TITLE_MTC, TITLE_STD
Enter a query, or press <ESC> to exit.
Enter> Executive
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): TITLE TITLE_INIT TITLE_TERM PHRASE_WRD FIRMMISC
Standard(s) for EXECUTIVE:
- EXECUTIVE FIRM_MTC, FIRM_STD, TITLE_MTC, TITLE_STD
Enter a query, or press <ESC> to exit.
Enter> Officer
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): PRENAME_ALONE NAME_STRONG_LN TITLE TITLE_ALONE
TITLE_TERM PREGEN2 NAMEGEN1 PHRASE_WRD FIRMMISC
Standard(s) for OFFICER:
- OFFICER FIRM_MTC, FIRM_STD, NAME_MTC, PRENAME_MTC,
PRENAME_STD, TITLE_MTC, TITLE_STD
Enter a query, or press <ESC> to exit.
Enter> Chief Executive Officer
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): TITLE_ALONE
Standard(s) for CHIEF EXECUTIVE OFFICER:
- CEO
TITLE_ACR, TITLE_MTC, TITLE_STD
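
The three-step procedure for building a title's lookup form can be sketched in Python. This is only an illustration of the steps, not part of UMD: the `TITLE_MTC` mapping is a hypothetical stand-in for the match standards you would collect by querying each word in UMD Show, and its values are taken from the example column of the procedure table.

```python
import string

# Hypothetical stand-in for the first TITLE_MTC standard of each word,
# as you would collect it by querying each word in UMD Show.
TITLE_MTC = {"CHIEF": "Chf.", "EXECUTIVE": "Exec.", "OFFICER": "Off."}

# Translation table that deletes every ASCII punctuation character.
_NO_PUNCT = str.maketrans("", "", string.punctuation)


def title_lookup_form(raw_title, standards=TITLE_MTC):
    """Return the lookup form of a multiple-word title."""
    words = []
    for word in raw_title.split():
        key = word.translate(_NO_PUNCT).upper()
        # Step 2: use the match standard if one exists, else the original word.
        words.append(standards.get(key, word))
    # Step 3: remove all punctuation, then normalize the spacing.
    return " ".join(" ".join(words).translate(_NO_PUNCT).split())


print(title_lookup_form("Chief Executive Officer"))  # Chf Exec Off
```

Because a word without a match standard is kept as typed, a variant such as "Chief Exec. Officer" reduces to the same lookup form under this sketch.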

Query a multiple-word
firm name

If you want to query a firm name that is also a personal name, such as
Hewlett Packard or Johnson & Johnson, see "Query a firm name that looks
like a personal name" on page 11.
To look up a multiple-word firm name, you must query the lookup form of the
firm name, that is, the same form that the parser would look up:

Procedure                                                 Example

1. Start with the raw firm name.                          The General Motors Corporation

2. Remove the words and, or, of, the, and for.            General Motors Corporation

3. Remove firm terminator words such as Corp, Inc,        General Motors
   Ltd, Co, and so on.

4. Query each remaining word. Get the standard            Gen. Motors
   (FIRM_MTC) for each. If an appropriate match
   standard does not exist, use the original word.

5. Remove all punctuation.                                Gen Motors
   This is the lookup form of the firm name.
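
The five steps above can be sketched as a small Python function. It is an illustration under stated assumptions, not UMD code: `FIRM_MTC` is a hypothetical mapping standing in for the match standards returned by UMD Show, and `FIRM_TERMINATORS` is only a partial list, because the guide names a few terminators "and so on".

```python
import string

NOISE_WORDS = {"AND", "OR", "OF", "THE", "FOR"}
# Partial list: the guide names Corp, Inc, Ltd, Co "and so on".
FIRM_TERMINATORS = {"CORP", "CORPORATION", "INC", "LTD", "CO"}
# Hypothetical FIRM_MTC standards, as collected from UMD Show queries.
FIRM_MTC = {"GENERAL": "Gen."}

_NO_PUNCT = str.maketrans("", "", string.punctuation)


def firm_lookup_form(raw_name, standards=FIRM_MTC):
    """Return the lookup form of a multiple-word firm name."""
    kept = []
    for word in raw_name.split():
        key = word.translate(_NO_PUNCT).upper()
        # Steps 2 and 3: drop noise words and firm terminators.
        if not key or key in NOISE_WORDS or key in FIRM_TERMINATORS:
            continue
        # Step 4: substitute the match standard when one exists.
        kept.append(standards.get(key, word))
    # Step 5: remove all punctuation.
    return " ".join(" ".join(kept).translate(_NO_PUNCT).split())


print(firm_lookup_form("The General Motors Corporation"))  # Gen Motors
```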


Query the lookup form of the firm name:


C:\umd /s parsing.dct
Using a Parsing Dictionary.
Enter a query, or press <ESC> to exit.
Enter> Gen Motors
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): FIRMMISC FIRMNAME
Standard(s) for GEN MOTORS:
- GM
FIRM_ACR, FIRM_MTC, FIRM_STD

For descriptions of the information codes and standard-type codes, see Appendix C.

Query a firm name that looks like a personal name

Some firms are named after people, for example Hewlett Packard or Johnson
and Johnson.
To look up this type of firm name, you must query the lookup form of the firm
name, that is, the same form of the name that the parser would look up:

Procedure                                                 Example

1. Start with the raw firm name.                          Johnson and Johnson Corp.

2. Remove all punctuation characters.                     Johnson and Johnson Corp

3. Remove the words and, or, of, the, and for.            Johnson Johnson Corp

4. Remove all firm-terminator words, such as              Johnson Johnson
   Corporation, Inc, Ltd, and so on.
   This is the lookup form of the firm name.

If all of the words in a line are identified as both FIRMNAME and NAME
words, the parser removes noise words and punctuation, then looks to see
whether the name is listed as a firm name. If so, the line is parsed as a firm
name. If not, the line is parsed as a personal name.
Query the lookup form of the firm name:
C:\umd /s parsing.dct
Using a Parsing Dictionary.
Enter a query, or press <ESC> to exit.
Enter> Johnson Johnson
Usage: 1
Intl Code(s): USENGLISH
Info Code(s): FIRMNAME
Standard(s) for JOHNSON JOHNSON:
- JOHNSON JOHNSON
FIRM_MTC, FIRM_STD
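
The lookup procedure and the parser's firm-versus-person decision described above can be sketched together in Python. This is a simplified model, not the parser itself: `KNOWN_FIRMS` is a hypothetical set standing in for dictionary entries flagged FIRMNAME, the terminator list is partial, and the function assumes every word in the line is already identified as both a FIRMNAME and a NAME word.

```python
import string

NOISE_WORDS = {"AND", "OR", "OF", "THE", "FOR"}
FIRM_TERMINATORS = {"CORP", "CORPORATION", "INC", "LTD", "CO"}  # partial list
# Hypothetical set of lookup forms listed in the dictionary as firm names.
KNOWN_FIRMS = {"JOHNSON JOHNSON"}

_NO_PUNCT = str.maketrans("", "", string.punctuation)
_SKIP = NOISE_WORDS | FIRM_TERMINATORS


def name_like_firm_lookup_form(raw_name):
    """Strip punctuation, noise words, and firm terminators, in that order."""
    words = raw_name.translate(_NO_PUNCT).split()
    return " ".join(w for w in words if w.upper() not in _SKIP)


def classify_line(raw_name, known_firms=KNOWN_FIRMS):
    """Assumes every word is already flagged as both FIRMNAME and NAME."""
    form = name_like_firm_lookup_form(raw_name).upper()
    return "firm" if form in known_firms else "personal name"


print(name_like_firm_lookup_form("Johnson and Johnson Corp."))  # Johnson Johnson
print(classify_line("Johnson & Johnson"))                        # firm
```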


Step 2: Create a parsing transaction file


A transaction file is a database that contains all the additions and changes that you
want to make to the parsing dictionary. The first time you create a custom parsing
dictionary, you must create a transaction file.

Create a transaction database

If you're updating an existing custom dictionary, use your existing
transaction file. Your dictionary will be easier to manage if you store your
entries in one transaction file, rather than scattering them among many
files.

The quickest, easiest way to create a transaction database file and its supporting
files is to use the output file feature of UMD Show. (See "UMD Show" on
page 77.)
1. Use UMD Show to query our base parsing dictionary, parsing.dct. Include
the /o option on the command line. Use the file name that you plan to use for
your custom dictionary, but with the extension .trn, for example
my_parse.trn.
If you plan to use a database program or spreadsheet program to edit the
file, we recommend creating a dBASE3 or ASCII file. If you plan to use a
text editor or word processor to edit the file, we recommend creating a
delimited file. However, be aware that our UMD Views program doesn't
support updating of delimited files.
2. Query a word that is in the dictionary, such as Bob.
3. To save the query to your output file, press Enter. To exit, press Escape.
C:\umd /s parsing.dct /o my_parse.trn /d dBase3
Using a Parsing Dictionary
Enter a query, or press <ESC> to exit.
Enter> Bob
Usage: 99
Intl Code(s): USENGLISH
Info Code(s): NAME_STRONG_FN NAMEGEN1
Standard(s) for BOB:
- ROBERT
NAME_MTC
Enter a query, or press <Enter> to save, or press <ESC> to exit
Enter>
Previous query appended to C:\my_parse.trn.
Enter a query, or press <ESC> to exit.
Enter> <Esc>

UMD Show creates an output database file, for example my_parse.trn.
You can use this database as your transaction file.

Keep supporting files with transaction file


When you create a transaction database as described above, UMD Show creates a
supporting file such as my_parse.def. For ASCII and delimited transaction files,
UMD Show also creates an additional supporting file such as my_parse.fmt or
my_parse.dmt. To open and read the transaction file, UMD requires these files.
If you move the transaction file to a new location, make sure you also move the
corresponding supporting files.
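
As a small aid when relocating a transaction file, the following Python sketch lists the supporting file names that may sit beside it. The function name is hypothetical, and the sketch does not know which file type you created, so it simply returns every candidate extension named in this section (.def, .fmt, .dmt); not all of them will exist for a given transaction file.

```python
from pathlib import PureWindowsPath

# Extensions named in this guide: .def for every transaction file,
# plus .fmt or .dmt for ASCII and delimited files.
SUPPORTING_EXTENSIONS = (".def", ".fmt", ".dmt")


def supporting_candidates(transaction_file):
    """Return the supporting-file paths that may accompany a transaction file."""
    trn = PureWindowsPath(transaction_file)
    return [str(trn.with_suffix(ext)) for ext in SUPPORTING_EXTENSIONS]


for path in supporting_candidates(r"c:\pathname\my_parse.trn"):
    print(path)
```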


Step 3: Put your entries in the transaction file


To add records to your transaction file, use a text editor or database program. For
each record, provide the information described below.
For examples, see the sample transactions starting with Sample transaction: Add
a new word on page 17.
Parse transaction entries

Field       Data to enter

Action      Choose one:
            Code   Description
            N      Create a new entry or overwrite the existing entry.
            A      Add information to an existing entry.
                   Change the usage or gender data for an existing entry.
                   Delete information from an existing entry, or delete the entry.

Primary     Type the word or phrase that you want to add or whose entry you want to
            modify. Fifty-four characters maximum, not case-sensitive. Do not include
            any punctuation.
            For phrases and multiple-word firm names, use the lookup form. To get
            the lookup form, see "Query a title phrase" on page 9 and "Query a firm
            name that looks like a personal name" on page 11.

Secondary   Type one of the following:
            • The preferred standardized form of the Primary.
            • A match standard (for information and guidelines, see "Rules for
              working with match standards" on page 26).
            • The acronym form of the Primary.

Intl        Type USENGLISH.

Info        Type all information codes that apply, if not already in the dictionary. Put
            one space (no punctuation) between codes.
            For a list of information codes, see "Information codes" on page 79.

Stdtype     Type all standard-type codes that apply, if not already in the dictionary.
            Put one space (no punctuation) between codes.
            For a list of standard-type codes, see "Standard-type codes" on page 79.
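
Before running a build, you may want to pre-check records against the field rules above. The following Python sketch is a hypothetical helper, not a UMD feature: the `ParseTransaction` type and `record_problems` function are invented for illustration, and they check only the rules stated in this table (Primary length and punctuation, and the Intl value).

```python
import string
from typing import NamedTuple

MAX_PRIMARY_LENGTH = 54  # "Fifty-four characters maximum"


class ParseTransaction(NamedTuple):
    """One record of a parsing transaction file (hypothetical in-memory form)."""
    action: str
    primary: str
    secondary: str = ""
    usage: str = ""
    intl: str = "USENGLISH"
    info: str = ""     # information codes separated by single spaces
    stdtype: str = ""  # standard-type codes separated by single spaces


def record_problems(record):
    """Return a list of rule violations found in one record."""
    problems = []
    if len(record.primary) > MAX_PRIMARY_LENGTH:
        problems.append("Primary is longer than 54 characters")
    if any(ch in string.punctuation for ch in record.primary):
        problems.append("Primary contains punctuation")
    if record.intl != "USENGLISH":
        problems.append("Intl should be USENGLISH")
    return problems


good = ParseTransaction("N", "ABSFC", "ABSFC", "0", "USENGLISH",
                        "HONPOST", "HONPOST_STD HONPOST_MTC")
print(record_problems(good))  # []
```

UMD Build performs its own validation; a check like this only catches the simplest mistakes earlier.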


Required fields

For each action, you must provide certain information. In the table below, a check
mark ( ) means that you must provide information for that field.

[Table: required fields for each type of change.
Columns: Action, Primary, Secondary, Usage, Intl, Information, Stdtype.
Rows: Create a new entry; Add a standard to an existing entry; Add information
to an existing entry; Delete an entire entry; Delete a standard; Delete a
standard-type code; Delete an information code; Change usage data for an
existing entry; Change gender data for an existing entry.
Besides check marks, some cells read "Must be blank" or refer to Note 1 through
Note 6, which are explained below.]

Notes on the table


1) Required only if the necessary Info code is not already in the existing
dictionary entry. For example, if you add the Stdtype code TITLE_MTC, you
must specify the Info code TITLE unless one of those Info codes is already
specified in the existing dictionary entry.
2) Required if the Info field contains anything besides PHRASE_WRD.
3) Must include all of the Info codes listed in the existing dictionary entry.
4) Must include all of the Stdtype codes listed in the existing dictionary entry.
5) This field is ignored. UMD automatically deletes dependent standard types.
6) PREGENx or NAMEGENx only. The existing dictionary entry must contain a
corresponding gender code. For example, if the existing entry contains the gender
code NAMEGEN1, you may change it to any other NAMEGENx code.


Step 4: Build your custom parsing dictionary


After you put all of your entries in your transaction file, use UMD Build to build
your custom parsing dictionary.
UMD Build

To build your dictionary, run UMD Build. The easiest way to convey your
instructions to UMD is through the UMD configuration file.
1. Open a copy of the configuration file umd.cfg.
2. Type entries for the UMD Build parameters. Specify our dictionary,
parsing.dct, as your Source Dictionary.
For descriptions of the configuration-file parameters, see UMD
configuration file, umd.cfg on page 75.
# UMD Show
Output File Name (path & file name) .... =
Output File Type (See NOTE) ............ =
#
# UMD Build
Dictionary Type (See NOTE) ............. = Parsing
Source Dictionary (path & dct) ......... = c:\pathname\parsing.dct
Transaction File Name (path & file name) = c:\pathname\my_parse.trn
Target Dictionary (path & dct) ......... = c:\pathname\my_parse.dct
Verify Input File Only (YES/NO) ........ = NO
Error Message Log File (path & name) ... = c:\pathname\my_parse.log
Work Directory (path) .................. = r:\temp

3. Save the configuration file.
4. Run UMD with the /cfg option. For example:
umd /cfg my_parse.cfg
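
If you maintain several custom dictionaries, you might generate the UMD Build block of the configuration file from a script. The Python sketch below is a hypothetical helper that reproduces the label, dot-leader, and value layout shown in the sample; the parameter labels come from that sample, and whether UMD requires the dot leaders to line up exactly is an assumption this sketch does not verify.

```python
LABEL_WIDTH = 40  # width of the label-plus-dots column in the sample file


def pad_label(label, width=LABEL_WIDTH):
    """Pad a parameter label with a space and dot leaders up to the column width."""
    if len(label) >= width:
        return label
    return (label + " ").ljust(width, ".")


def render_build_section(settings):
    """Render the '# UMD Build' block from a mapping of label to value."""
    lines = ["# UMD Build"]
    for label, value in settings.items():
        lines.append(pad_label(label) + " = " + value)
    return "\n".join(lines)


build_settings = {
    "Dictionary Type (See NOTE)": "Parsing",
    "Source Dictionary (path & dct)": r"c:\pathname\parsing.dct",
    "Target Dictionary (path & dct)": r"c:\pathname\my_parse.dct",
    "Verify Input File Only (YES/NO)": "NO",
}
print(render_build_section(build_settings))
```

The output is plain text that you would paste into, or write over, the UMD Build portion of your copy of umd.cfg before running UMD with the configuration-file option.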
Verify and build

Before UMD builds your custom dictionary, it checks to make sure the entries in
your transaction file are valid. If a validation error or warning occurs, look at the
error log file. If an error occurred, fix your transaction file, then run UMD Build
again.
If the transaction file is free of errors, UMD builds your custom dictionary.
During the build process, UMD takes the source dictionary, makes the changes
and additions specified in your transaction file, and creates your custom
dictionary.


Step 5: Maintain and update your custom dictionary


Use your existing
transaction file

If you want to update your custom dictionary, put your changes and additions in
your existing transaction file. Your custom dictionary will be much easier to
manage if you accumulate all your entries in one transaction file, rather than
scattering them among many files.
When you rebuild your parsing dictionary, always use our base parsing
dictionary, parsing.dct, as the source dictionary.

Keep your dictionary up to date

Whenever we send you a new parsing.dct file, build an updated custom
dictionary by running your transaction file against our new dictionary. This
allows you to benefit from the additions and improvements that we have made to
the base dictionary.
If you do not run your transaction file against each new base dictionary, the
differences between our base dictionary and your custom dictionary will increase.
This will affect your parsing results and impede our ability to provide technical
support.
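
Why rebuilding from the base dictionary matters can be seen in a toy model of the build. The Python sketch below is not how UMD Build is implemented; it models a dictionary as a mapping from word to information codes, handles only the two actions used in this chapter's samples (N and A), and uses invented base-dictionary contents.

```python
def build_custom(base, transactions):
    """Toy model of a build: copy the base, then apply each transaction."""
    custom = {word: set(codes) for word, codes in base.items()}
    for action, primary, info_codes in transactions:
        key = primary.upper()
        if action == "N":    # create a new entry or overwrite the existing one
            custom[key] = set(info_codes)
        elif action == "A":  # add information to an existing entry
            custom[key] |= set(info_codes)  # KeyError if the entry is missing
    return custom


# Invented contents: the newer base dictionary gained an entry for CRNA.
old_base = {"PRESIDENT": {"TITLE"}}
new_base = {"PRESIDENT": {"TITLE"}, "CRNA": {"HONPOST"}}
transactions = [("N", "ABSFC", ["HONPOST"]), ("A", "President", ["PHRASE_WRD"])]

custom = build_custom(new_base, transactions)
print(sorted(custom))  # ['ABSFC', 'CRNA', 'PRESIDENT']
```

Running the same transactions against the newer base yields a custom dictionary that contains both your changes and the new base entries, which is the benefit described above.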


Sample transaction: Add a new word


Suppose your data file contains the name line Harold Smith, ABSFC. You notice
that the word ABSFC is being parsed as a job title. However, ABSFC is really a
postname (Able Bodied Seaman First Class).
When you query ABSFC, you discover it is not in the parsing dictionary:
C:\umd /s parsing.dct
Using a Parsing Dictionary.
Enter a query, or press <ESC> to exit.
Enter> ABSFC
Text not found in dictionary.

To add the word to the dictionary, you would add the following record to your
transaction file:

Field name    Transaction entry

Action        N
Primary       ABSFC
Secondary     ABSFC
Usage         0
Intl          USENGLISH
Info          HONPOST
Stdtype       HONPOST_STD HONPOST_MTC

Capitalization


As you add words to the parsing dictionary, make a note of any words that have
unusual mixed-case capitalization. To get the correct mixed-case capitalization,
you must also add these words to your custom capitalization dictionary.
For example, if you add ABSFC to the parsing dictionary, you should also add it
to your custom capitalization dictionary. Otherwise, the mixed-casing will be
Absfc rather than ABSFC.
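
The effect described above can be illustrated with a short Python sketch. It is a simplified model only: `CUSTOM_CAPS` is a hypothetical stand-in for a custom capitalization dictionary, and the fallback of capitalizing just the first letter is an assumption that approximates default mixed-casing for a single word.

```python
# Hypothetical stand-in for a custom capitalization dictionary.
CUSTOM_CAPS = {"ABSFC": "ABSFC"}


def mixed_case(word, custom_caps=CUSTOM_CAPS):
    """Return the custom casing if one is defined, else capitalize the first letter."""
    return custom_caps.get(word.upper(), word.capitalize())


print(mixed_case("ABSFC", custom_caps={}))  # Absfc
print(mixed_case("ABSFC"))                  # ABSFC
```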


Sample transaction: Add a title phrase


There is a lot of overlap between words that can be used in firm names and words
that can be used in job titles. For example, the words Vice, President, and
Marketing can all be used in firm names and in job titles. As a result, the parser
may incorrectly identify Vice President of Marketing as a firm name rather than a
title. To correct this kind of parsing behavior, you can add a title phrase to the
dictionary.
Two main tasks

To add a title phrase to the dictionary, you must do two things:

• Make sure each word is in the dictionary and has the information codes
  PHRASE_WRD and TITLE.
• Enter the lookup form of the phrase so that the parser will find it (see
  "Query a title phrase" on page 9). Otherwise, the entry will have no effect on
  parsing results.

To add a title phrase to the dictionary:

1. Query the lookup form of the phrase (see "Query a title phrase" on page 9).
For example, to add the phrase Vice President of Marketing to the dictionary,
use the lookup form Vice Pres of Mktg.
2. If the phrase is not in the dictionary, create a new entry in your transaction
file. Use the lookup form as the primary and secondary:
Field name    Transaction entry

Action        N
Primary       Vice Pres of Mktg
Secondary     Vice Pres of Mktg
Usage         0
Intl          USENGLISH
Info          TITLE
Stdtype       TITLE_STD TITLE_MTC

3. Query each word in the original phrase (for example, Vice, President, of, and
Marketing). Make sure each word meets the following requirements:
• It has the information code PHRASE_WRD.
• It has the information code TITLE.
• The first title match standard (TITLE_MTC) is the same as the word
  used in the phrase entry.
4. If a word is not in the dictionary or does not meet the requirements listed in
Step 3, add the word (or modify it) by putting an entry in your transaction
file.
For our example, the word President is in the dictionary but is not identified
as a phrase word, so we need to mark it as a PHRASE_WRD. We also need
to mark the word of as a TITLE word and a PHRASE_WRD.
Field name    Entry for President    Entry for of
Action        A                      A
Primary       President              of
Secondary     Pres.                  of
Usage
Intl
Info          PHRASE_WRD             PHRASE_WRD TITLE
Stdtype                              TITLE_MTC TITLE_STD

For best results, perform steps 3 and 4 for variant spellings and
abbreviations of each word. For our example, we would check to make
sure that Pres and Mktg are marked as phrase words. This enables the
parser to recognize variant raw forms of the phrase (such as Vice Pres. of
Marketing, Vice President of Mktg., and Vice Pres. of Mktg.) in addition to
the original phrase Vice President of Marketing.
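The check in steps 3 and 4 can be sketched in Python. This is only an illustration: the QUERIED_INFO_CODES table is a hypothetical stand-in for information codes you would copy by hand from UMD Show queries, and the codes listed for Vice and Marketing are assumptions. The sketch covers the two information-code requirements only, not the TITLE_MTC requirement.

```python
# Hypothetical snapshot of information codes, as if copied from UMD Show queries.
# The codes listed for Vice and Marketing are illustrative assumptions.
QUERIED_INFO_CODES = {
    "VICE": {"TITLE", "PHRASE_WRD"},
    "PRESIDENT": {"TITLE"},
    "MARKETING": {"TITLE", "PHRASE_WRD", "FIRMNAME"},
}

# Step 3: every word in a title phrase needs both of these codes.
REQUIRED_CODES = {"PHRASE_WRD", "TITLE"}


def missing_codes(phrase):
    """Map each word that fails the step 3 check to the codes it still needs."""
    problems = {}
    for word in phrase.split():
        # A word that is not in the snapshot has no codes at all.
        have = QUERIED_INFO_CODES.get(word.upper(), set())
        missing = REQUIRED_CODES - have
        if missing:
            problems[word] = sorted(missing)
    return problems


print(missing_codes("Vice President of Marketing"))
```

Each word reported by missing_codes would need an entry in the transaction file, as described in step 4.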


Sample transaction: Add a multiple-word firm name


If a multiple-word firm name such as Emery Worldwide is parsed incorrectly, you
can add the firm name to the dictionary.

If a firm name looks like a personal name, such as Hewlett Packard or
Johnson & Johnson, the procedure is different from the one shown on this
page. See Sample transaction: Add a firm that looks like a personal
name on page 22.
To the parser, a line looks like a personal name if all of the words in the
line are marked as NAME words. For example, Check N Go looks like a
personal name because the words Check, N, and Go are all NAME words.

Two main tasks

To add a multiple-word firm name to the dictionary, you must do two things:
Make sure at least one of the words is in the dictionary and has the
FIRMNAME information code.
Enter the lookup form of the firm name so that the parser will find it (see
Step 1: Query the dictionary on page 9). Otherwise, the entry will have no
effect on parsing results.

To add a multi-word entry to the dictionary:

1. If the firm name looks like a personal name (for example, Hewlett Packard,
Merrill Lynch, Johnson & Johnson), see Query a firm name that looks like a
personal name on page 11.
2. In your transaction file, create a new entry for the lookup form of the firm
name (see Query a multiple-word firm name on page 10).
Field name    Transaction entry
Action        N
Primary       Emery Worldwide
Secondary     Emery Worldwide
Usage         0
Intl          USENGLISH
Info          FIRMNAME
Stdtype       FIRM_STD FIRM_MTC

3. Make sure that both of the words in the firm name (for example, both Emery
and Worldwide) meet the following requirements:

The entry includes the FIRMNAME information code.


The first firm match standard (FIRM_MTC) is the word that you used
in your firm-name entry in step 2.
4. If the firm name doesn't meet the requirements in step 3, add an entry to your
transaction file.


Most often, you'll need to mark one of the words as a FIRMNAME word, as
shown here.
Field name    Transaction entry
Action        A
Primary       Emery
Secondary     Emery
Usage
Intl
Info          FIRMNAME
Stdtype       FIRM_MTC FIRM_STD


Sample transaction: Add a firm that looks like a personal name


Many firms are named after people (for example, Hewlett Packard or Merrill
Lynch). The parser often identifies these as personal names rather than firm names.
To correct this, you can add the firm name to the dictionary.
Two main tasks

If a firm name looks like a personal name (see footnote 1), you must do two things:
Make sure each word is in the dictionary and has both the NAME and
FIRMNAME information codes.
Create an entry for the lookup form of the firm name.

To add a firm name that looks like a personal name:

1. Query the lookup form of the firm name (see Query a firm name that
looks like a personal name on page 11).
2. If the firm name is not in the dictionary, create a new entry in your
transaction file. Use the lookup form as the primary and secondary.
Field name    Transaction entry
Action        N
Primary       Robert W. Baird
Secondary     Robert W. Baird
Usage         0
Intl          USENGLISH
Info          FIRMNAME
Stdtype       FIRM_MTC FIRM_STD

3. Query each word. Make sure it is in the dictionary and is identified as both a
NAME and a FIRMNAME. If not, add the word (or modify it) by putting an
entry in your transaction file. In our example, Robert W Baird, all three words
are in the dictionary, but none has the FIRMNAME information code.
For each word, we would put an entry in the transaction file to add the
FIRMNAME information code, as shown here for Robert.
Field name    Transaction entry
Action        A
Primary       Robert
Secondary     Robert
Usage
Intl
Info          FIRMNAME
Stdtype       FIRM_MTC FIRM_STD

1. To the parser, a line looks like a personal name if all of the words in the line are marked as NAME words. For
example, Check N Go looks like a personal name because the words Check, N, and Go are all NAME words.


Sample transaction: Modify information codes


Suppose your data file contains the line John Smith PsyD. You notice that this line
is parsed as a firm name rather than a personal name. Although the name of John
Smith's business might possibly be John Smith PsyD, you would prefer to parse
this as a name rather than a firm.
When you query the dictionary, you notice that PsyD is listed as a firm word:
C:\umd /s parsing.dct
Using a Parsing Dictionary.
Enter a query, or press <ESC> to exit.
Enter> PsyD
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): FIRMMISC
Standard(s) for PSYD:
- PSYD
FIRM_MTC, FIRM_STD

In your custom dictionary, you could specify that PsyD is also an honorary
postname (Doctor of Psychology). To do this, modify the existing entry to add the
honorary-postname codes:
Field name    Transaction entry
Action        A
Primary       PsyD
Secondary     PsyD
Usage
Intl
Info          HONPOST
Stdtype       HONPOST_STD HONPOST_MTC

Notice that when you add a new information code, you must also specify at least
one standard for that type of information. In this case, we specified PsyD as the
standard and match standard for honorary postnames.


Sample transaction: Modify standards and standard-types


Suppose you want to standardize your data to make it as consistent as possible. In
job titles, the word Engineer is standardized to Engineer, but you would prefer to
standardize it to Eng. instead.
In the dictionary, the title standard for Engineer is Engineer:
C:\umd /s parsing.dct
Using a Parsing Dictionary.
Enter a query, or press <ESC> to exit.
Enter>engineer
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): TITLE TITLE_ALONE TITLE_TERM PHRASE_WRD FIRMMISC
Standard(s) for ENGINEER:
- ENGINEER
FIRM_STD, TITLE_STD
- ENGR.
FIRM_MTC, TITLE_MTC

To change the title standard to Eng., you need to do two things:

Delete the TITLE_STD code from the standard Engineer.
Add the standard Eng. and identify it as a title standard (TITLE_STD).
You would put these entries in your transaction file:

Field name    Entry to add Eng.    Entry to delete TITLE_STD code
              as a standard        from the standard Engineer
Action        A                    D
Primary       Engineer             Engineer
Secondary     Eng.                 Engineer
Usage
Intl
Info
Stdtype       TITLE_STD            TITLE_STD


Sample transaction: Add an acronym for acronym conversion


You can convert a prename, postname, job title, firm name, or firm location to an
acronym. The Data Cleanse transform produces an acronym only when one is
available in the parsing dictionary; it doesn't generate initials by algorithm or
rule.
How the parser generates acronyms

Before looking for an acronym, the parser removes all punctuation and noise
words and gets the first appropriate match standard for each word. You must use
the same phrase that the parser will actually look up. Otherwise, the parser won't
find your entry and won't generate the acronym.

To add an acronym

Procedure                                             Example
1. Start with the raw phrase.                         Certified Residential Specialist
2. Remove the words and, or, of, the, and for.        Certified Residential Specialist
3. For firm names, remove firm terminator words       Certified Residential Specialist
   such as Corp, Inc, Ltd, Co, and so on.
4. Query each remaining word. Get the first           Cert. Residential Specialist
   appropriate match standard for each. For
   example, if you are adding a firm match
   standard, get the first FIRM_MTC (see
   footnote 1).
5. Remove all punctuation. This is the lookup         Cert Residential Specialist
   form of the acronym phrase.

1. If the word is not in the dictionary, create a new entry for the word (see page 17). If the word is in the dictionary but
does not list an appropriate match standard, create an entry to add the appropriate information code and match-standard
type (see pages 23 and 24). For example, for the word Residential we would add the information code HONPOST and the
standard-type code HONPOST_MTC.
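As a rough sketch, the five steps above could be written in Python as follows. The FIRST_MATCH_STANDARD table is a hypothetical stand-in for the match standards you would read from dictionary queries, and the terminator list holds only the four words named in step 3. None of this is part of UMD itself.

```python
import string

NOISE_WORDS = {"and", "or", "of", "the", "for"}        # step 2
FIRM_TERMINATORS = {"corp", "inc", "ltd", "co"}        # step 3; extend as needed

# Hypothetical first match standards, as if copied from UMD Show queries.
# Words that are missing here are assumed to be their own match standard.
FIRST_MATCH_STANDARD = {"certified": "Cert."}


def bare(word):
    """Lowercase a word and trim surrounding punctuation for table lookups."""
    return word.strip(string.punctuation).lower()


def lookup_form(raw_phrase, is_firm=False):
    """Derive the lookup form of an acronym phrase from the raw phrase."""
    words = raw_phrase.split()                                      # step 1
    words = [w for w in words if bare(w) not in NOISE_WORDS]        # step 2
    if is_firm:
        words = [w for w in words if bare(w) not in FIRM_TERMINATORS]  # step 3
    words = [FIRST_MATCH_STANDARD.get(bare(w), w) for w in words]   # step 4
    strip_punct = str.maketrans("", "", string.punctuation)         # step 5
    cleaned = [w.translate(strip_punct) for w in words]
    return " ".join(w for w in cleaned if w)


print(lookup_form("Certified Residential Specialist"))
```

The result is the string to use as the primary of the acronym entry.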

To add the phrase to the dictionary, put an entry in your transaction file. Use the
lookup form of the phrase as the primary, and use the acronym itself as the
secondary:
Field name    Transaction entry
Action        N
Primary       Cert Residential Specialist
Secondary     CRS
Usage         0
Intl          USENGLISH
Info          HONPOST
Stdtype       HONPOST_ACR


Rules for working with match standards


Each entry in the parsing dictionary may include one or more match standards.
You can use match standards to improve the performance of your Match
transform.
How match standards work

To simplify this discussion, we discuss match standards for personal names.

In the dictionaries, a match standard is a one-way relationship, a pointer from one
name to another:
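[Diagram: each name points to its match standard]
Alberto   -> Albert
Allen     -> Alan
Alfredo   -> Alfred
Alex      -> Alexander
Alphonso  -> Alphonse
Alonzo    -> Almon
Al        -> Albert, Alan, Alfred, Alexander, Alphonse, Almon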

For the name Al, the match standards are Albert, Alan, Alfred, Alexander,
Alphonse, and Almon.
For the name Alberto, the match standard is Albert. (Likewise, for Allen the
match standard is Alan; for Alfredo, Alfred; and so on.)
If two different names return the same match standard, you can use your
matching software to do multiway comparisons and find a match. For example,
since Alberto and Al both return Albert as a match standard, your matching
software could match Alberto Smith to Al Smith.
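The idea of comparing names through shared match standards can be sketched in Python. The table below holds only the relationships stated in this section, and the may_match function is an illustration of the comparison, not the algorithm used by any particular matching software.

```python
# One-way pointers from a name to its match standards, taken from this section.
MATCH_STANDARDS = {
    "AL": ["ALBERT", "ALAN", "ALFRED", "ALEXANDER", "ALPHONSE", "ALMON"],
    "ALBERTO": ["ALBERT"],
    "ALLEN": ["ALAN"],
    "ALFREDO": ["ALFRED"],
    "ALBERT": ["ALBERT"],
    "ALAN": ["ALAN"],
    "ALFRED": ["ALFRED"],
}


def candidates(name):
    """Return the name itself plus every match standard it points to."""
    key = name.upper()
    return {key, *MATCH_STANDARDS.get(key, [])}


def may_match(first, second):
    """Two names are match candidates if they share at least one standard."""
    return bool(candidates(first) & candidates(second))


print(may_match("Alberto", "Al"))     # both return ALBERT
print(may_match("Alberto", "Allen"))  # no shared standard
```

Because the relationship is one-way, Alberto and Allen are not treated as candidates for each other even though Al points to the standards of both.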
Here are partial dictionary entries for the name Al and its direct match standards.
Primary      Standard
ALBERT       ALBERT
ALAN         ALAN
ALFRED       ALFRED
ALEXANDER    ALEXANDER
ALPHONSE     ALPHONSE
ALMON        ALMON
AL           ALBERT, ALAN, ALFRED, ALEXANDER, ALPHONSE

Notice that each match standard has its own entry, and that in that entry, the
standard is the same as the primary.
Work with match standards

To use a word as a match standard, it should have its own entry in the dictionary
(or have its own entry in the transaction file; see footnote 2). In that entry, the
word must be a match standard of itself. In other words, the match standard must
be the same as the query word.
2. Technically, you could also use a word as a match standard if that word does not have an entry in the dictionary. For
example, you could use Michelangelo as a match standard because Michelangelo is not in the dictionary. In practice,
however, if you use a word as a match standard, you'll probably also want that word to have its own entry in the dictionary,
so we make that assumption in our guidelines.


For example, you could use the word Dr as a match standard because it is in the
dictionary and has itself, Dr, as a match standard:
Enter a query, or press <ESC> to exit.
Enter> Dr
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): PRENAME_ALONE HONPOST PREGEN3 SUFFIX
Standard(s) for DR:
- DR. HONPOST_MTC, HONPOST_STD, PRENAME_MTC, PRENAME_STD
- DR ADDRESS_STD

If a word is a match standard of itself, you can use it as the same
type of match standard for another word.

Field name    Transaction entry
Action        A
Primary       Doc
Secondary     DR.
Usage
Intl
Info          PRENAME PREGEN3
Stdtype       PRENAME_STD PRENAME_MTC

Spelling and punctuation. The spelling and punctuation of the
Secondary in the transaction entry must exactly match the Standard in the
existing dictionary entry.


Chapter 2:
Custom capitalization dictionaries

This chapter explains how to create and maintain custom capitalization
dictionaries.
What is a capitalization dictionary?

In a custom capitalization dictionary, you can specify the correct casing for a
word in different situations. For example, you can specify that when MCKAYE is
used as a last name, the casing should be McKaye.

Why create a custom dictionary?

Most users find that our capitalization dictionary, pwcap.dct, produces good
mixed-case results. However, if a word is not cased as you would like, you can
enter that word in a custom capitalization dictionary.
For example, if you want the word TECHTEL to be cased as TechTel, you could
add the word TechTel to your custom dictionary.
Most of our products allow you to use two capitalization dictionaries at once, so
we expect that most users will employ our base dictionary as is and build their
own, separate dictionary as an extension. When you use your dictionary, you can
give it priority over ours by specifying your dictionary as Dictionary #2.

Create transactions,
build your dictionary

For each entry you want to place in your capitalization dictionary, you will create
a record, or transaction, in a database called a transaction file.
After you make all of your entries in the transaction file, you will run the UMD
Build process. UMD Build reads the entries from your transaction file and creates
your custom dictionary.

Query our dictionary or yours

You can look up words in our dictionary, pwcap.dct, or your custom dictionary.
For example, if you want to see how we capitalize the word PHD, you can query
the dictionary:
c:\umd /s pwcap.dct
Using a Capital Dictionary.
Enter a query, or press <ESC> to exit.
Enter> PHD
PHD is capitalized as follows:
-PhD is used with EVERY occurrence.

For more details about querying a capitalization dictionary, see Query your
dictionary on page 33.


Step 1: Create a capitalization transaction file


A transaction file is a database that contains all of your entries for a particular
custom capitalization dictionary. Each entry in your transaction file will create
one entry in your custom dictionary.

Caution: If you are working with an existing custom dictionary, use the existing
transaction file for that dictionary. Do not create more than one transaction
file for each custom dictionary.

Create a transaction database

The quickest, easiest way to create a transaction database and its supporting files
is to use the output file feature of UMD Show. (See UMD Show on page 77.)
1. Use UMD Show to query our base capitalization dictionary, pwcap.dct.
Include the /o option on the command line. Use the base file name that you
plan to use for your custom dictionary, but with the extension .trn (for
example, my_cap.trn). See footnote 3.
2. Query a word that is in the dictionary, such as PhD.
3. Press Enter to save the query to your output file. Press Esc to exit.
C:\umd /s pwcap.dct /o my_cap.trn /d dBase3
Using a Capital Dictionary
Enter a query, or press <ESC> to exit.
Enter> phd
PHD is capitalized as follows:
-PhD is used with EVERY occurrence
Enter a query, or press <ENTER> to save, or press <ESC> to exit
Enter>
Previous query appended to C:\my_cap.trn.
Enter a query, or press <ESC> to exit.
Enter> <Esc>

UMD Show will create an output database file (for example, my_cap.trn). You
can use this database as your transaction file. For instructions on adding your
entries to the transaction file, see Step 2: Put your entries in the transaction file
on page 31.
Keep supporting files
with transaction file

When you create a transaction database as described above, UMD Show creates a
supporting file such as my_cap.def. If the transaction file is ASCII or delimited
ASCII, UMD Show also creates an additional file such as my_cap.fmt or
my_cap.dmt.
To open and read the transaction file, UMD needs these files. If you move the
transaction file to a new location, make sure you also move the corresponding
supporting files.

3. If you plan to use a database program or spreadsheet program to edit the file, we recommend creating a dBASE3 or
ASCII file. If you plan to use a text editor or word processor to edit the file, we recommend creating a delimited file.


Step 2: Put your entries in the transaction file


For each word you want to add to your custom capitalization dictionary, create a
record in your transaction file. Use a text editor or a database program to add
records to your transaction file.
Capitalization
transaction entries

The table below describes what information to put in each field in your
transaction file.
Field        Data to enter

Action       Choose one:
               N   Create a new entry.
               D   Delete the existing entry from the source dictionary
                   (see footnote 1).

Primary      Type the word in the preferred casing, 54 characters maximum.
             Type a single word (no spaces). Do not include any punctuation.

Attribute    Specify when this casing should be used. Include all that apply,
             separated by one space (no punctuation):
               PRENAME        Prenames
               FIRSTNAME      First names
               LASTNAME       Last names
               PRELASTNAME    Last-name prefixes
               POSTNAME       Postnames
               TITLE          Job titles
               FIRM           Firm data
               ADDRESS        Address lines (see footnote 2)
               CITY           City names
               STATE          State names
               EVERY          Every occurrence
             You may type the entire word or just the portion shown in bold
             (for example, FIRSTNAME or FIRS).
             If the Action field contains D, you may leave this field blank.

1. This command is used rarely, if ever. If you want to delete an entry from your custom dictionary, simply delete that
record from your transaction file, then rebuild the dictionary. If you don't like the casing for a word in our base dictionary,
pwcap.dct, you don't need to delete the entry from our dictionary. Instead, put the desired casing in your custom
dictionary. When you process data, specify your dictionary as Dictionary #2 so that your entry will override ours.
2. Do not specify the ADDRESS, CITY, or STATE attribute unless the product that uses the dictionary has
address-parsing capability.
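Before building, it can help to check entries against the field rules in the table above. The Python sketch below is a hypothetical helper, not part of UMD. It accepts only the full attribute names, because the abbreviated forms are not listed here.

```python
import string

VALID_ACTIONS = {"N", "D"}
VALID_ATTRIBUTES = {
    "PRENAME", "FIRSTNAME", "LASTNAME", "PRELASTNAME", "POSTNAME",
    "TITLE", "FIRM", "ADDRESS", "CITY", "STATE", "EVERY",
}


def check_entry(action, primary, attribute=""):
    """Return a list of problems found in one capitalization transaction entry."""
    problems = []
    if action not in VALID_ACTIONS:
        problems.append("Action must be N or D")
    if not primary:
        problems.append("Primary is empty")
    if len(primary) > 54:
        problems.append("Primary is longer than 54 characters")
    if any(ch.isspace() for ch in primary):
        problems.append("Primary must be a single word")
    if any(ch in string.punctuation for ch in primary):
        problems.append("Primary must not include punctuation")
    attributes = attribute.split()
    # The Attribute field may be left blank only for delete entries.
    if not attributes and action != "D":
        problems.append("Attribute is required unless Action is D")
    for name in attributes:
        if name not in VALID_ATTRIBUTES:
            problems.append(f"Unknown attribute: {name}")
    return problems


print(check_entry("N", "McCathie", "EVERY"))
print(check_entry("N", "Mc Kaye", "LASTNAME"))
```

An empty list means the entry passed every check in this sketch; UMD Build may still apply checks of its own.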

Sample entries

Here are some sample entries.

Action    Primary     Attribute
          dos         PRELASTNAME
          McCathie    EVERY
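To illustrate how attribute-specific casing and dictionary priority fit together, here is a small Python model. It is an assumption-laden sketch: the in-memory tables stand in for built dictionaries, the lookup order (custom dictionary first, then base) models giving your dictionary priority, and the fallback from a specific attribute to EVERY is a guess at sensible behavior rather than documented product logic.

```python
# Entries taken from examples in this guide; the storage layout is hypothetical.
BASE_DICTIONARY = {"PHD": {"EVERY": "PhD"}}
CUSTOM_DICTIONARY = {
    "TECHTEL": {"FIRM": "TechTel"},
    "DOS": {"PRELASTNAME": "dos"},
    "MCCATHIE": {"EVERY": "McCathie"},
}


def cased(word, attribute, dictionaries=(CUSTOM_DICTIONARY, BASE_DICTIONARY)):
    """Return the preferred casing for a word used as the given attribute.

    Dictionaries are searched in order, so the first one has priority.
    Returns None when no dictionary has a casing for this usage.
    """
    key = word.upper()
    for dictionary in dictionaries:
        entry = dictionary.get(key)
        if entry is None:
            continue
        if attribute in entry:
            return entry[attribute]
        if "EVERY" in entry:
            return entry["EVERY"]
    return None


print(cased("TECHTEL", "FIRM"))
print(cased("phd", "TITLE"))
print(cased("TECHTEL", "LASTNAME"))
```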


Step 3: Build your custom capitalization dictionary


After you put all of your entries in your transaction file, use UMD Build to build
your custom capitalization dictionary.
UMD Build

To build your dictionary, run UMD Build. The easiest way to convey your
instructions to UMD is through the UMD configuration file.
1. Open a copy of the configuration file umd.cfg.
2. Type your instructions in the UMD Build parameters. For descriptions of the
parameters, see Appendix A.
# UMD Show
Output File Name (path & file name) .... =
Output File Type (See NOTE) ............ =
#
# UMD Build
Dictionary Type (see NOTE) ............. = Capital
Source Dictionary (path & dct) ......... =
Transaction File Name (path & file name) = c:\pathname\my_cap.trn
Target Dictionary (path & dct) ......... = c:\pathname\my_cap.dct
Verify Input File only (YES/NO) ........ = NO
Error Message Log File (path & name) ... = c:\pathname\my_cap.log
Work Directory (path) .................. = r:\temp
...

3. Save the configuration file. We recommend using the same base file name as
your dictionary, but with the extension .cfg (for example, my_cap.cfg).
4. Run UMD with the /cfg option. For example:
umd /cfg my_cap.cfg
During the build process, UMD reads the entries from your transaction file and
creates your custom dictionary.
Tips:
We recommend that you use the same base file name for the
transaction file and custom dictionary, and store both files in the
same location.
We recommend that you accumulate all your custom entries in one
transaction file and build your custom dictionary from the
transaction file only. If you do this, you will not need to specify a
source dictionary when you run UMD Build.


Step 4: Update your custom dictionary


If you want to add a new entry to your custom dictionary, edit your transaction
file, then rebuild your custom dictionary.
Use your existing
transaction file

To update an existing dictionary, add your new entries to your existing transaction
file. Your custom dictionary will be much easier to manage if you accumulate all
of your entries in one transaction file, rather than scattering them among many
files.
To add a word, create a record for that word in the transaction file. To delete a
word, delete the record for that word from the transaction file. For dBASE3 files,
UMD supports non-destructive delete marking.

Rebuild your custom dictionary

After you add your new entries, run UMD Build as instructed on page 32. UMD
will rebuild your custom dictionary based on your updated transaction file.

Query your dictionary

You may wish to query your custom capitalization dictionary to see whether it
contains a particular word. To query a dictionary, run UMD Show (see UMD
Show on page 77). For example:
umd /s my_cap.dct
If you look up a word that is in the dictionary, UMD displays the preferred casing
and tells you when that casing is used:
c:\umd /s my_cap.dct
Using a Capital Dictionary.
Enter a query, or press <ESC> to exit.
Enter> TECHTEL
TECHTEL is capitalized as follows:
-TechTel is used with FIRM occurrences.


Chapter 3:
User-defined pattern matching (UDPM)

One way you can modify the Data Cleanse transform's behavior to suit your
needs is to edit the pattern file (default name is drludpm.dat). You edit the
pattern file in order to parse with your own data patterns.

Test your results with


QuickParse

Proceed with care. The pattern file controls how incoming data is
parsed. Accordingly, changing this file changes how items are parsed.
Before you define new patterns to parse, proceed with great caution. If
you aren't careful when you add user-defined patterns, you may receive
unexpected and unwanted results. Make backups of your files just in case
you need to revert to previous parsing rules.

Treat this ability as you would treat any feature you add to an application. Before
you integrate any modification into your enterprise, you should put your
modification through a stringent cycle of research, quality assurance, testing,
regression testing, and so on. Don't find out at release time that you've changed
something for the worse. As always, we recommend that you maintain adequate
backups and test your results with the QuickParse utility. For more information
see Check parsing results with QuickParse on page 67.


Overview of UDPM
With the User-Defined Pattern Matching (UDPM) system, you can parse data that
the Data Cleanse transform currently doesn't parse. For example, records in your
database may contain a customer ID number that is unique to your company. With
UDPM, you can locate and parse this number.
The UDPM utility provides a method for you to define one or several data
patterns that are specific to the types of data you want to parse, such as part
numbers, customer account numbers, employee numbers, product numbers, or any
other specific pattern of data that you need to parse.
Parsing patterns

The Data Cleanse transform parses UDPM patterns that are either by themselves
on the input line or surrounded by noise text. For example, input text could be
Here is the part number 123AB. Or just 123AB. If you have defined a pattern
that fits this part number 123AB, then the Data Cleanse transform parses 123AB.
How does Data Cleanse do it? You edit Data Cleanse's UDPM Pattern File
(drludpm.dat).

Pattern file

The user-defined patterns are stored in a pattern file. The pattern file is a plain
text file that you can edit in any text-editing program. This pattern file consists of
a definition section and a rule section. The definition section is where you can
define subcomponents using a syntax that uses Perl (PCRE) regular expressions.
You can then combine these subcomponents with other elements of valid regular
expression in the rule section.
Regular expressions are powerful and offer a flexible, widely used syntax.
Combined with the Data Cleanse transform's rule language, they let you create
definitions and rules to parse your own patterns of data. For more information
about creating regular expressions, see Introduction to regular expressions on
page 39.


Working with the pattern file


This section explains what you need to know before you edit the pattern file, as
well as how to actually create your own definitions and rules for parsing your
own defined patterns.
User-defined pattern
example

Each user-defined pattern file is composed of a definition section and a rule
section. Below is an example that defines a pattern for dates.

This heading line is required:

DRL UDPM Pattern File v1.0

Each comment line must start with a pound (#) symbol:

# DO NOT MODIFY, MOVE, OR DELETE THE ABOVE LINE!

Definition section. This section is used to define the patterns for the
subcomponents of the data that you want to parse:

Month = 0?[1-9]|1[0-2];
Separator = [/-];
Year = [0-9]{2,2};
!end_def

Rule section. This section is used to set the rule for how to parse your
user-defined components, based on the subcomponent definitions you create:

ILINE1:UDPM1:Date=({Month}){Separator}({Year});

Ending the file: You must end the file with a hard return. If you don't
insert a carriage return/linefeed at the end of the file, the file won't be
read correctly.
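A quick structural check of a pattern file can catch the mistakes described above before you run a job. The Python sketch below is a hypothetical helper, not a tool that ships with the product. It checks only the heading line, the !end_def marker, the final line break, and the trailing semicolons seen on the sample definition and rule lines.

```python
REQUIRED_HEADING = "DRL UDPM Pattern File v1.0"


def check_pattern_file(text):
    """Return a list of structural problems found in pattern-file text."""
    problems = []
    if not text.endswith("\n"):
        problems.append("file does not end with a line break")
    lines = text.splitlines()
    if not lines or lines[0].strip() != REQUIRED_HEADING:
        problems.append("first line is not the required heading")
    # Ignore blank lines and comment lines when checking the body.
    body = [line.strip() for line in lines[1:]]
    body = [line for line in body if line and not line.startswith("#")]
    if "!end_def" not in body:
        problems.append("missing !end_def")
    for line in body:
        if line != "!end_def" and not line.endswith(";"):
            problems.append(f"line does not end with a semicolon: {line}")
    return problems


sample = (
    "DRL UDPM Pattern File v1.0\n"
    "# DO NOT MODIFY, MOVE, OR DELETE THE ABOVE LINE!\n"
    "Month = 0?[1-9]|1[0-2];\n"
    "Separator = [/-];\n"
    "Year = [0-9]{2,2};\n"
    "!end_def\n"
    "ILINE1:UDPM1:Date=({Month}){Separator}({Year});\n"
)
print(check_pattern_file(sample))
```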

Before you can begin creating user-defined patterns, it's important that you
understand what makes up these two sections of the file. See the following:
Definition section of a user-defined pattern on page 38
Rules section of a user-defined pattern on page 38


Definition section of a
user-defined pattern

In the definition section, you may define the subcomponents that will make up the
data pattern that you want to parse. The definition you create is a combination of
a simple language specific to the Data Cleanse transform and Perl (PCRE) regular
expressions. The diagram and table below explain how the definition section of a
user-defined pattern is set up.
Month = 0?[1-9]|1[0-2];
Separator = [/-];
Year = [0-9]{2,2};
!end_def

Note that each line must end with a semicolon (;) except the !end_def line.

Element               Description

Macro name            The first part of each macro's definition is the macro
                      name, followed by the equals (=) sign. This name is later
                      used in the rule section of the user-defined pattern.
                      You can define any number of macros in the definition
                      section.

Regular expression    Following the equals sign after the subcomponent name,
                      you use a simple regular expression to define what the
                      subcomponent will equal.

End definition        The !end_def command indicates the end of the pattern
                      definition. This is a required element.

For more information about creating regular expressions, see Introduction to
regular expressions on page 39.
Rules section of a
user-defined pattern

In the rule section, you must explain the rule or rules for how to parse the
subcomponents that you defined in the definition section. The diagram and table
below explain how the rule section of a user-defined pattern is set up.
ILINE1:UDPM1:Date=({Month}){Separator}({Year});
Element                    Description

Input and output fields    Each rule begins with the input field and output
                           field, separated by colons.

Rule name                  Each rule includes a unique name (pattern label),
                           followed by the equals (=) sign. This label is used
                           in your job when you use the udpm_fldbylabel
                           function for output.

Macros                     To add a macro to a rule, you must use the macro
                           name as designated in the definition section. The
                           macro name must be surrounded by curly brackets { }.
                           You can use any number of macros per rule.
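To see how the sample definitions and rule fit together, the following Python sketch splits the rule into its parts and substitutes each macro into the rule expression. It assumes plain textual substitution, which may differ from how the Data Cleanse transform expands macros internally, and it uses Python's re module rather than PCRE. The parse_rule and expand helpers are hypothetical names.

```python
import re

# Definitions and rule copied from the sample pattern file.
DEFINITIONS = {
    "Month": r"0?[1-9]|1[0-2]",
    "Separator": r"[/-]",
    "Year": r"[0-9]{2,2}",
}
RULE = "ILINE1:UDPM1:Date=({Month}){Separator}({Year});"


def parse_rule(rule):
    """Split a rule into input field, output field, label, and expression."""
    head, expression = rule.rstrip(";").split("=", 1)
    input_field, output_field, label = head.split(":")
    return input_field, output_field, label, expression


def expand(expression, definitions):
    """Replace each {MacroName} with its definition (textual substitution)."""
    # Macro names start with a letter, so bounds such as {2,2} are left alone.
    return re.sub(r"\{([A-Za-z_]\w*)\}",
                  lambda m: definitions[m.group(1)], expression)


input_field, output_field, label, expression = parse_rule(RULE)
pattern = expand(expression, DEFINITIONS)
match = re.search(pattern, "Renewal date 12/04")
print(label, pattern, match.group(0), match.groups())
```

Note that the parentheses around {Month} in the rule keep the alternation in the Month definition from spilling over into the rest of the expression.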

Introduction to regular expressions


What is a regular expression?

A regular expression is a string of characters that a computer application can use
as a pattern for assessing whether data fed to it matches that pattern. The program
can then operate on data that fits the pattern, regardless of its non-essential
variety.
The regular expression can be as precise or as fuzzy as is needed to catch things
that match, but not catch things that don't.

Type       Example                Description

Precise    [Hh]orse               This example allows only these two options
                                  for matching: Horse and horse.

Fuzzy      [A-Z][[:digit:]]{5}    This example allows any data that begins with
                                  an uppercase letter and is followed by five
                                  digits. This could be a large number of
                                  matches.

PCRE (Perl)

Keep in mind that there are several varieties, or families, of regular expressions.
When you refer to additional documentation on functions and capabilities of
regular expressions, be sure you're researching PCRE (Perl) regular expressions.

How does it work?

The process of using regular expressions follows these steps:

1. Data is fed to the processing engine (specified with input fields).
2. The processor tries the entire regular expression from the position before the
first input character. If it finds a matching string from there, it stops looking.
3. If no matching string is found there, the processor bumps forward one
character and tries the entire regular expression again.
4. If no matching string is found there, the process is repeated at each input
character until a matching string is found or the end of the input is reached.

Pattern:       ([0-9]{4}[[:space:]]){4}
Input data:    AB 1234 1234 1234 1234

The input data didn't match the pattern in the first, second, or third
positions of the input data. But in the position just before the fourth
character of the input data, the processor finds that the remaining data
string matches the pattern (four digits and a space, repeated four times).
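The position-by-position search described above can be imitated in Python. Two assumptions apply: Python's re module has no [[:space:]] class, so the sketch uses \s in its place, and the input is given a trailing space so that the fourth repetition of "four digits and a space" can match.

```python
import re

# Python equivalent of ([0-9]{4}[[:space:]]){4}, using \s for whitespace.
PATTERN = re.compile(r"([0-9]{4}\s){4}")


def first_match_position(pattern, data):
    """Try the whole pattern at each position, moving forward one character."""
    for position in range(len(data) + 1):
        match = pattern.match(data, position)
        if match:
            return position, match.group(0)
    return None


data = "AB 1234 1234 1234 1234 "
print(first_match_position(PATTERN, data))
# re.search performs the same left-to-right scan internally.
print(PATTERN.search(data).start())
```

Positions are counted from zero in Python, so position 3 is the point just before the fourth character.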


Operators

For operators, regular expressions use special characters (also known as
metacharacters). These can operate on a valid regular expression in powerful
ways. Operators normally follow a character or group of characters. Available
operators include the following:

Char   Description                                 Example

?      Zero or one occurrence of the preceding     [1-9]? equals no digit or any
       element.                                    one digit from 1 to 9.

*      Zero or more occurrences of the preceding   [1-9]* equals no digit or any
       element.                                    number of digits from 1 to 9.

+      One or more occurrences of the preceding    [1-9]+ equals at least one
       element.                                    digit from 1 to 9.

.      Any single character of input.

|      Logical OR.                                 a|b equals either a or b.

^      Position indicator: the beginning of the    [^skty] equals any single
       input.                                      character except s, k, t,
       When not a position indicator (as the       or y.
       first character inside brackets):
       negation of the list.

$      Position indicator: the end of the input.

()     A subpattern. Subpatterns are regular       (0?[1-9])|(1[0-2])
       expressions within the parentheses, and
       are used to group constructs.

{}     A bound, with one integer (to specify a     {2,5} means at least two
       number) or two integers, separated by a     times but no more than five
       comma (to specify a range).                 times.
       Note: If the { is followed by a character   {2,2} means exactly two
       other than a digit, the { is an ordinary    times.
       character, not the beginning of a bound.

\      Marks the next character as one of the      \$ means the dollar symbol
       following:                                  (instead of the usual
       a special character or a literal (toggles)  meaning of $ as a position
       a backreference (not used in Data           indicator).
       Cleanse pattern matching)
       an octal escape
       Note: It's illegal to end a regular
       expression with the \ character.

Using metacharacters literally

To use any of the metacharacters as a literal character, precede it with the
backslash character (\). For example, to find cost as a dollar sign and two digits,
you could create a subexpression like this: \$[0-9][0-9]
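The operators and the backslash escape can be tried out in Python, whose re module shares this part of the syntax with PCRE. This is only an illustration in a different engine, not a test of the Data Cleanse transform itself.

```python
import re

# Quantifiers: ?, *, + and bounds.
assert re.fullmatch(r"[1-9]?", "") is not None       # zero occurrences allowed
assert re.fullmatch(r"[1-9]+", "") is None           # at least one required
assert re.fullmatch(r"a{2,5}", "aaa") is not None    # between two and five
assert re.fullmatch(r"a{2,5}", "aaaaaa") is None     # six is too many

# Negation inside brackets.
assert re.fullmatch(r"[^skty]", "a") is not None
assert re.fullmatch(r"[^skty]", "s") is None

# Escaped dollar sign: a literal "$" followed by two digits.
cost = re.compile(r"\$[0-9][0-9]")
found = cost.search("Total: $45")
print(found.group(0))

# Unescaped, "$" means end of input, so nothing can follow it.
unescaped = re.compile(r"$[0-9][0-9]").search("Total: $45")
print(unescaped)

# re.escape adds the backslashes for you.
print(re.escape("$9.99"))
```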


Character Classes

You may also see or use a number of character class names inside brackets. These
class names must be surrounded by colons and brackets, as well, so the
expression will look like:
[[:alnum:]]

or

[[:digit:]]

Name      Description
alnum     Alphabetic characters and numeric characters.
digit     Digits.
punct     Punctuation characters.
alpha     Alphabetic characters.
graph     Non-blank (not spaces, control characters, and so on).
space     Whitespace characters (space, tab, newline, carriage return, and so on).
blank     Space and tab.
lower     Lowercase alphabetic characters.
upper     Uppercase alphabetic characters.
cntrl     Control characters.
print     Printable characters (not control characters and the like, but includes spaces).
xdigit    Digits allowed in a hexadecimal number (0-9a-fA-F).
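If you prototype an expression outside the product, keep in mind that not every regular expression engine understands these class names. Python's built-in re module, for example, does not. The sketch below is an illustration only and is not part of Data Cleanse: the function posix_to_python and the ASCII-only equivalents in POSIX_CLASSES are assumptions made for this example.

```python
import re

# ASCII-only stand-ins for the class names in the table above (an assumption;
# the product may treat extended characters differently).
POSIX_CLASSES = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
    "blank": r" \t",
    "space": r" \t\n\r\f\v",
    "punct": r"!-/:-@\[-`{-~",
    "graph": r"!-~",
    "print": r" -~",
    "cntrl": r"\x00-\x1f\x7f",
}


def posix_to_python(pattern: str) -> str:
    """Replace each [:name:] inside a bracket list with an ASCII range."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)


print(posix_to_python("[[:alnum:]]"))           # [A-Za-z0-9]
print(posix_to_python("(aei)[[:digit:]]{9}"))   # (aei)[0-9]{9}
```

The translated string can then be passed to re.fullmatch or re.search for a quick experiment before you edit the pattern file.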


Creating regular expressions


This section is an overview of some basics you will need to know when you
create regular expressions.
Brackets [] present options for a single input character

Brackets enclose lists, any of whose members can match a single input character. For example:

This string:    Will match:
[aei]           only an a, e, or i character
[^aei]          any one character not an a, e, or i

Negation. If the first character of the list is ^, then the list matches only what is not in the character set.
Range. A hyphen specifies a range of characters as defined by their collating sequence. For example:
[a-d]    matches a, or b, or c, or d
[2-4]    matches 2, or 3, or 4
[A-Z]    matches any upper-case letter

Grouping characters with parentheses (...)

You can group characters into strings (subexpressions) by using parentheses. Parentheses enclose strings of characters that must each match in order.
This string:    Will match:
(aei)           the following string: a, then e, then i
(mn|xy)         a string of either mn or xy

Each subexpression is a complete regular expression, which can also include other subexpressions. These subexpressions can become very useful in your job.
For example: ((aei)(ou)) is three subexpressions:
first, the string aei must be found (the first subexpression)
followed immediately by the string ou (the second subexpression)
the combination (the entire expression) is the third subexpression
For example: ((aei)[[:digit:]]{9}) includes two subexpressions:
first, the string aei must be found (the first subexpression)
followed immediately by nine digits
the combination (aei and nine digits) is the second subexpression

Quantity {} indicators operate on the entire subexpression

For example: ((aei)?(ou)+) includes two subexpressions:
First, the string aei must be found zero or one times
followed immediately by at least one occurrence of the string ou.
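The bracket, grouping, and quantity examples above can be tried in any regular expression engine. The following is a minimal sketch in Python, offered only as a way to experiment; it is not how Data Cleanse evaluates patterns. The helper whole_match is a name invented for this example, and [0-9] stands in for [[:digit:]] because Python's re module does not accept POSIX class names.

```python
import re


def whole_match(pattern: str, text: str) -> bool:
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Bracket lists match exactly one input character.
print(whole_match(r"[aei]", "e"))     # True
print(whole_match(r"[^aei]", "e"))    # False
print(whole_match(r"[a-d]", "c"))     # True

# Parentheses group characters that must match in order.
print(whole_match(r"(mn|xy)", "xy"))  # True

# Quantity indicators apply to the whole subexpression before them.
print(whole_match(r"((aei)?(ou)+)", "ouou"))              # True
print(whole_match(r"((aei)?(ou)+)", "aeiaeiou"))          # False
print(whole_match(r"((aei)[0-9]{9})", "aei123456789"))    # True

# Nested subexpressions are numbered by their opening parenthesis.
print(re.fullmatch(r"((aei)(ou))", "aeiou").groups())     # ('aeiou', 'aei', 'ou')
```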


Example of defining a pattern


First and foremost, know your data. You have to know what the data represents in
order to know what is acceptable and what is not. Only then can you create a
regular expression that's flexible enough to match the full range of good data,
while at the same time excluding even close matches of data that isn't valid for
your pattern.
There are many ways to approach the task. Here is a general step-by-step
approach as applied to a fairly straightforward example: finding an account
number that should be a series of two alphabetical characters and either six or
seven numbers.
Step 1:
Write examples of good data

Write several examples of the data you want to parse. For example:
AN3463599
AL4659800
AB342699

Step 2:
Identify any literal elements

Decide if there are any characters (either alphabetical or numeric) that must be
seen without variance, for the input to be valid. If so, put a pair of brackets around
each. For example, you may decide that the first letter must always be "A", but
that all the others may vary to some extent.
AN3463599
AL4659800
AB342699

Pattern so far (first character only):
[A]

Step 3:
Define the ranges of the variable elements

For each of the variable elements, decide the range of the optional data. Here,
we've decided that the second letter can be any of four possible letters: B, C, L, or
N. In addition, the last two digits must be either 99 or 00 (representing a year).
AN3463599
AL4659800
AB342699

Pattern so far (the digits in the middle are not yet defined):
[A][BCLN] ... (99|00)


Step 4:
Define acceptable variances in quantity

Define any acceptable differences in the quantity of the elements. In this example,
it's acceptable for the input data to have either 4 or 5 digits after the first two
letters, and before the last two digits. Use quantity indicators to show that
acceptable range:
AN3463599
AL4659800
AB342699

Pattern so far:
[A][BCLN][0-9]{4,5}(99|00)

Step 5:
Expand the pattern for acceptable deviances

Finally, think about how your data was input, or what its source was, in order to
predict what sort of deviances you might find in the format of the data that would
not precisely fit your pattern so far, but which would reflect useful data; that is,
not exactly right, but close enough to be usefully parsed.
In this example, you may note that some of your record data starts with small
letters instead of capital letters, and that in some cases, the data was input with a
space between the two letters and the numbers, like these:
AN3463599
AL4659800
Ab 3452399
an 623400

The expanded pattern:
[Aa][BCLNbcln][[:space:]]?[0-9]{4,5}(99|00)

The additions to the first two bracket lists ([Aa] and [BCLNbcln]) make small
letters acceptable for the first two alpha characters.

The added character class ([[:space:]]?) makes it OK for one space or tab to be
after the first two alpha characters (or not).

Test and refine

Try out your expression with QuickParse or by running the job on a small portion
of your data. If necessary, adjust your expression to suit the results.
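Before running QuickParse, you can also sanity-check the expression from the five steps above in a scripting language. The sketch below uses Python and is only an approximation: \s stands in for [[:space:]] (Python's re module has no POSIX class names), and the function is_account_number, along with the choice to require that the whole input match, are assumptions for this illustration rather than a description of Data Cleanse behavior.

```python
import re

# Python rendering of [Aa][BCLNbcln][[:space:]]?[0-9]{4,5}(99|00)
ACCOUNT = re.compile(r"[Aa][BCLNbcln]\s?[0-9]{4,5}(99|00)")


def is_account_number(text: str) -> bool:
    """True when the whole input fits the account number pattern."""
    return ACCOUNT.fullmatch(text) is not None


# The good examples from the steps above.
for sample in ["AN3463599", "AL4659800", "Ab 3452399", "an 623400"]:
    print(sample, is_account_number(sample))   # each prints True

# Close matches that should be rejected.
for sample in ["AD3463599", "AN3463598", "AN34699"]:
    print(sample, is_account_number(sample))   # each prints False
```

Testing near misses as well as good data is the quickest way to see whether the expression is too loose.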

Literal Characters
(alpha, numeric, &)

To specify a literal character, just include it in the expression. You can use any of
the extended ASCII characters, like A through Z, a through z, 0 through 9, and
the specialized characters, like @, #, $, and so on.
Note: Some of these characters are Regular Expression MetaCharacters,
and must be treated as described next.


Alternate expressions
You can search for more than one variation of a pattern. For example, here's how
you could search for data that fits the two patterns of Wisconsin license plates:
Through 1999, these were 3 letters followed by 3 numbers (ABC123). Then,
starting in 2000, the state switched to a pattern of 3 numbers followed by 3 letters
(123ABC).
In the pattern matching file:

1. Set up one line to match the early numbers (for example, ABC123).
2. Add a second line to match the late numbers.
3. End the definition section.
4. Make a rule that accepts either expression.

The four lines below correspond to steps 1 through 4:

early=[A-Z]{3}[[:space:]]?[[:digit:]]{3};
late=[[:digit:]]{3}[[:space:]]?[A-Z]{3};
!end_def
IUDPM1:L_UDPM1:wis_plate=({early}|{late});

Implications for output

Alternative expressions don't complicate output, because output is governed by
the rules, not the definitions. In this example, data that matches either the "early"
or "late" alternatives is considered a match for the wis_plate rule.
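To see how a rule built from two alternatives behaves, you can mimic it in a general-purpose regular expression engine. This Python sketch is an illustration under assumptions, not the product's implementation: the names EARLY, LATE, and WIS_PLATE are invented here, \s and [0-9] replace [[:space:]] and [[:digit:]], and each alternative is written with three digits, as the plate description requires.

```python
import re

# Three letters then three digits, or three digits then three letters,
# each with an optional separating whitespace character.
EARLY = r"[A-Z]{3}\s?[0-9]{3}"
LATE = r"[0-9]{3}\s?[A-Z]{3}"

# Equivalent in spirit to wis_plate=({early}|{late})
WIS_PLATE = re.compile(f"({EARLY}|{LATE})")

for plate in ["ABC123", "123ABC", "ABC 123", "AB1234", "ABC12"]:
    print(plate, WIS_PLATE.fullmatch(plate) is not None)
```

The first three inputs match and the last two do not, whichever alternative is tried first.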


Multiple rules
You can also search for more than just one pattern. For example, you also want to
search for data that matches the pattern of a Wisconsin auto title number, which
is: 2 year digits, 3 0-9 digits, 1 A-Z alpha, 4 0-9 digits, a hyphen, and
finally a 0-9 digit. An example of a Wisconsin auto title is 95172L1031-0.
In the pattern matching file:

1. Keep the expressions for the license plates, and define another for titles.
2. Add another rule for the title expression.

The definitions (step 1) come before !end_def, and the rules (step 2) come after it:

early=[A-Z]{3}[[:space:]]?[[:digit:]]{3};
late=[[:digit:]]{3}[[:space:]]?[A-Z]{3};
title=[0-9]{5}[L][13][0][[:digit:]]{2}[-][0-9];
!end_def
IUDPM1:UDPM1:wis_plate=({early}|{late});
IUDPM1:UDPM1:wis_title=({title});
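A quick way to check which rule a value would satisfy is to try each expression in turn. The Python sketch below is hypothetical: RULES and first_matching_rule are names made up for this example, trying rules in file order is an assumption about precedence, and \s and [0-9] replace the POSIX class names. Note that the title expression is copied as printed in this guide, so it accepts only the letter L followed by 1 or 3 and then 0, which is narrower than the general title format described in the text.

```python
import re

# Rule name -> Python rendering of the rule's expression (order matters here).
RULES = {
    "wis_plate": r"([A-Z]{3}\s?[0-9]{3}|[0-9]{3}\s?[A-Z]{3})",
    "wis_title": r"([0-9]{5}[L][13][0][0-9]{2}[-][0-9])",
}


def first_matching_rule(text: str):
    """Return the name of the first rule whose expression matches all of text."""
    for name, pattern in RULES.items():
        if re.fullmatch(pattern, text):
            return name
    return None


print(first_matching_rule("95172L1031-0"))   # wis_title
print(first_matching_rule("ABC123"))         # wis_plate
print(first_matching_rule("95172M1031-0"))   # None
```

If your titles can contain other letters or digits in those positions, the last line shows the kind of input the expression as written would miss.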

Example of a user-defined pattern file


In this section, we take a hypothetical situation and walk you through the steps to
set up the user-defined pattern file in order to parse the necessary data. This
section only explains how to set up the user-defined pattern file in order to get the
parsing results that you want.
The scenario

Assume your database contains standard customer information, such as firm
name, contact name, and so on. In addition, you have a specific Payment Type/
Number field that this application doesn't currently parse. The Payment Type/
Number field includes a code of either a "P", indicating payment by purchase
order, or "C", indicating purchase by credit card. This field also contains either a
purchase order number, or a credit card number, corresponding to the payment
code.
If you put the database into a table format, it might look something like this:

Firm                    Contact          Payment Type/Number
Imaginary Industries    Joe Edwards      Ppo123-456
Fake, Inc.              Mary Peterson    c1234-5678-9123-4567

Step 1: Identify the subcomponents

Before you can set up the definition section of your pattern file, you have to
decide what subcomponents you might look for in the data you want to parse. You
don't need to worry about the Firm field or Contact field, because the application
can already parse those types of data. You need only worry about the Payment
Type/Number field, which can be composed of three main subcomponents:
Payment type code
Purchase order number
Credit card number

Step 2:
Define the subcomponents

Now that you've identified the subcomponents that you need, you can go about
defining them. The first subcomponent is the payment type abbreviation. This
code is always going to be either "C" for credit card or "P" for purchase order.
However, as shown in the example table, it may be entered as either lowercase or
uppercase.
To edit the pattern file, open it in any text editing program. We recommend you
make a backup copy of the drludpm.dat file, in case you want to revert to the
original file later.

Payment type subcomponent

As with all definitions for user-defined patterns, you need to name it first, and
then use regular expressions to look for the pattern. Because a bracket list already
means "any one of these characters", no pipe is needed inside the brackets (a pipe
there would be treated as one more literal character). The definition for this
subcomponent should look like this:
payment_type=[pP]|[cC];

With this definition, we are telling the application to look for a "P" or a "C" that can be
uppercase or lowercase.


Purchase order subcomponent

The next subcomponent is a purchase order number. In the example, a purchase
order always has this format: POxyy-yyz, where x is either 1 or 2, y is any digit
from 0 through 9, and z is any digit from 0 through 9 (but may not exist for all purchase
orders). In this case, the definition would look like this:
po=[Pp][Oo][12][0-9][0-9]-[0-9][0-9][0-9]?;

Credit card subcomponent

The last subcomponent is a credit card number, in the typical format of xxxx-xxxx-xxxx-xxxx, where x is any digit from 0 through 9. You also want to check for a hyphen
or space between the groups of digits, though this hyphen or space may not be
there. The definition would look like this:
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];

Completed definition section

After adding the !end_def command to designate the end of the subcomponent
definitions, the file would look something like this:
payment_type=[pP]|[cC];
po=[Pp][Oo][12][0-9][0-9]-[0-9][0-9][0-9]?;
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
!end_def;

Step 3:
Define the rule(s)

If you closed the file now, the application would do nothing differently. Though
you have explained some patterns, you need to create a rule to actually look for
those subcomponents, as well as to explain how those subcomponents should
appear before they are parsed.
First you indicate the fields the data is coming in on and going to. You separate
the names of these input and output fields with a colon.
Next you name the rule, and explain it in terms of the subcomponents that you've
defined. You want to tell the application to look for a payment_type
subcomponent, immediately followed by either a po subcomponent or credit_card
subcomponent. The rule only applies for the input field specified. The rule would
look like this:
ILINE1:UDPM1:account_info=({payment_type})({po}|{credit_card});

Here ILINE1 is the input field and UDPM1 is the output field. They are followed by
the pattern, whose parenthesized groups are the subpatterns.

Remember that the order listed in the rule is very important. The Data Cleanse
transform will only apply a rule if it finds the subcomponents in the exact order
listed in your rule.
Step 4:
Save the modified pattern file

Now that you have completed the sample file, it should look like this:
DRL UDPM Pattern File v1.0
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
po=[Pp][Oo][12][0-9][0-9]-[0-9][0-9][0-9]?;
payment_type=[pP]|[cC];
!end_def;
ILINE1:UDPM1:account_info=({payment_type})({po}|{credit_card});

Save the file to save your user-defined pattern.
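To get a feel for how named definitions combine into a rule, the sample file can be imitated in a few lines of script. The following Python sketch is only an approximation written for this guide's example data: the names DEFINITIONS, expand, and parse_account_info are hypothetical, the credit card definition is shortened with {4} bounds, the bracket lists are written without pipes, and nothing here reflects how Data Cleanse reads the pattern file internally.

```python
import re

# The three subcomponent definitions, in Python regular expression syntax.
DEFINITIONS = {
    "payment_type": r"[pP]|[cC]",
    "po": r"[Pp][Oo][12][0-9][0-9]-[0-9][0-9][0-9]?",
    "credit_card": r"[0-9]{4}[ -]?[0-9]{4}[ -]?[0-9]{4}[ -]?[0-9]{4}",
}

# The pattern part of the account_info rule.
RULE = "({payment_type})({po}|{credit_card})"


def expand(rule: str, definitions: dict) -> str:
    """Substitute each {name} in the rule with its definition."""
    return re.sub(r"\{([a-z_]+)\}",
                  lambda m: "(?:" + definitions[m.group(1)] + ")",
                  rule)


ACCOUNT_INFO = re.compile(expand(RULE, DEFINITIONS))


def parse_account_info(text: str):
    """Return the two subpatterns (payment type, number), or None."""
    m = ACCOUNT_INFO.fullmatch(text)
    return m.groups() if m else None


print(parse_account_info("Ppo123-456"))             # ('P', 'po123-456')
print(parse_account_info("c1234-5678-9123-4567"))   # ('c', '1234-5678-9123-4567')
print(parse_account_info("Xpo123-456"))             # None
```

Each definition is wrapped in a non-capturing group so that only the parentheses written in the rule itself produce subpatterns.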


Step 5:
Submit user-defined data to Data Cleanse

Now that you've created a user-defined pattern, you can configure your
application to apply the rules to incoming records. You define your input fields in
the rule section of the pattern file.
Use the ILINE field when your user-defined data appears on a line mixed in with
other data. If your user-defined data is in its own discrete field, use one of the
IUDPM1 ... IUDPM4 fields. Use this field in the rule section of your file.

Step 6:
Retrieve user-defined
data

Finally, you can configure your application to retrieve the parsed user-defined
data that you have defined in your pattern file.
You can request the entire user-defined field (as defined in your rule) with the
UDPM field. Or, you can request individual subcomponents in your rule with the
UDPM_SUB1-5 components.


Chapter 4:
Modify the rule file

One of the ways you can modify the Data Cleanse transform's behavior is by
creating or editing parsing rules in the rule file (drlrules.dat). The rule file
controls parsing of name data, firm data, and addresses.

Proceed with care. The rule file controls how incoming data is parsed.
Accordingly, changing this file changes how items are parsed. Proceed with
great caution before you change the parsing rules. If you aren't careful
when you add or edit parsing rules, you may receive unexpected and
unwanted results. Make backups of your files just in case you need to
revert to previous parsing rules.

Test your results with QuickParse

Treat this ability as you would treat any feature you add to an application. Before
you integrate any modification into your enterprise, you should put your
modification through a stringent cycle of research, quality assurance, testing,
regression testing, and so on. Don't find out at release time that you've changed
something for the worse. As always, we recommend that you maintain adequate
backups and test your results with the QuickParse utility. For more information
see "Check parsing results with QuickParse" on page 67.


What is the rule file?

The rule file (drlrules.dat) controls how the Data Cleanse transform parses
groups of output type subcomponents for name and firm data.

Pre-defined rules

The Data Cleanse transform already provides hundreds of rules for many
different possible combinations of data. These rules will likely satisfy the parsing
needs of most users.
However, you may encounter data that isn't being parsed as you'd like it to be.
Or, maybe you would like to tweak a rule so that it returns a different confidence
score by adding a confidence booster. In situations like this, it is very handy to be
able to edit the rule file.

General guidelines

When modifying the rule file, you should keep the following points in mind:

Start small

You should create new rules that define very specific situations. Start
conservatively with very narrow parameters. Only after you master narrowly
defined rules should you proceed to create rules that cover broader situations.

Be careful

You should always double-check the syntax and take great care when you apply
operators. It's easy to enter an inappropriate operator, and it may not be as easy
to spot it later.

Confidence may affect results

If the item parsed incorrectly and you want to write a rule, the confidence still
factors in. Because of this, your results might not be exactly as you had
anticipated.

Test results

You should always test your results very thoroughly. See Check parsing results
with QuickParse on page 67 for more information.


How the rule file is organized


The rule file has a straightforward organization. It consists of a header,
explanatory information, and parsing rules grouped by data type.

The header

The file's header identifies the rule file. You must not alter or delete the header.
DRL Rule File v1.0;
# DO NOT EDIT, MODIFY OR REMOVE THE ABOVE LINE!!!!!
#

Explanatory information

Lines that start with a pound sign (#) are commented out.
# The following types will be used in each example
#
# NAME_DESIGN = ATTN
# PRENAME = MR
# NAME_STRONG_FN = JOHN

Parsing rules by data type

Lastly, the rule file consists of rules grouped by data type. Groups of rules include:
Name rules
Dual name rules
Firm rules
Address Line rules
Last Line rules
Optional rules, not enabled by default

In the excerpt below, the banner comment marks the start of a group (NAME
RULES), and title1 is a rule within that group.

#######################################################
#
# NAME RULES
#
#######################################################
#######################################################
#
# Titles by themselves
#
#     Admin
#
title1 = TITLE_ALONE;
options = begin : end;
action = PERSON conf: 40;
PERSON = 1 : TITLE : L;
end_action


Rule example
The following is an example of a rule that already exists in the rule file
(drlrules.dat).
The rules within the rule file can be divided into two main sections: the definition
section and the action section. These sections can then be further divided into
smaller components. Though each rule is unique, all rules follow the same
structure as explained in the next sections of this chapter.
When you work with rules, keep in mind the differences between the rule file and
the pattern file. Though the pattern file has only one definition section, the rule file has a
separate definition section for each rule defined in the file.
Definition section

Here you name the rule and define what type of data pattern to look for to match
this rule. Lines that start with a pound sign (#) are commented out.

#######################################################
#
# Prename with last name and prename with first, last name
#
#     Mr Smith and Mrs Mary Jones
#
nfdual34 =
# Prename
PRENAME_ALONE +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER +
# connector
CONNECTOR +
# Prename
PRENAME_ALONE +
# first name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
INITIAL | PREFIRST] +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER;

Action section

Here you specify the action that is performed when the rule is matched.

action = PERSON : D;
PERSON = 1 : PRENAME : 1;
PERSON = 1 : LAST_NAME : 2;
PERSON = 2 : PRENAME : 4;
PERSON = 2 : FIRST_NAME : 5;
PERSON = 2 : LAST_NAME : 6;
PERSON = 1 : NAME_CONNECTOR : 3;
end_action

Definition section of a parsing rule


The main purpose of the definition section is to define the components that the
application is looking for in order to match the rule. The diagram and table below
explain the pieces that make up a definition section of a rule.

#######################################################
#
# Prename with last name and prename with first, last name
#
#     Mr Smith and Mrs Mary Jones
#
nfdual34 =
# Prename
PRENAME_ALONE +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER +

In this excerpt, the first comment ("Prename with last name ...") is the rule
description line, the second comment ("Mr Smith and Mrs Mary Jones") is the
example data line, nfdual34 is the rule label, and everything after the equal sign is
the rule definition.

Component               Description

Rule description line   In this line, you can designate a description for the rule. This is an optional line, and therefore it must begin with at least one pound sign (#) so the application treats it as a comment. This line is helpful to use so you know what the rule is intended to parse.

Example data line       Here you can enter an example of the data that you will parse with the rule. We recommend that you use such a line, because it is helpful in locating the rule you want to edit. This is an optional line, and therefore it must begin with at least one pound sign (#) so that it is treated as a comment.

Rule label              The name, or label, for this parsing rule.

Rule definition         The rule definition lists the components that make up the parse. This line and the components that make it up are described in more detail in the following section.

If you think of the rule definition as an equation, it may help you understand it.
The rule label (before the equal sign) can be equated with the definition (after
the equal sign).
#
nfdual34 =
# Prename
PRENAME_ALONE +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER +


Rule label

The rule label must be unique; no two rules can have the same label. The table
below explains how the rule label is created.

Character                                Value   Description
First                                    n       Use n in the first character spot to indicate that this is a name rule. If you use n, you must also specify a second character.
First                                    f       Use f in the first character spot to indicate that this is a firm rule.
Second (only if first character is n)    f       If the first character is n, use f in the second character spot to indicate that the name order of the components in this rule is first name first.
Second (only if first character is n)    l       If the first character is n, use l in the second character spot to indicate that the name order of the components in this rule is last name first.
Second (only if first character is n)    a       If the first character is n, use a in the second character spot to indicate that the name order of the components in this rule is ambiguous.
Additional                               any     Use any combination of letters, numbers, or underscore characters for any additional characters in the rule label. The only stipulation is that each rule label is unique.

Note: Although the additional characters for a rule label are completely up
to you, you should label the rule so you can understand it and you can
separate it from others.
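The labeling convention in the table can be checked mechanically before you add a rule. The sketch below is a hypothetical helper in Python, not a tool supplied with the product: describe_label and duplicate_labels are names invented here, and labels that start with any letter other than n or f (such as title1 in the default file) are simply reported as not covered by the table.

```python
import re
from collections import Counter

NAME_ORDERS = {
    "f": "first name first",
    "l": "last name first",
    "a": "ambiguous",
}


def describe_label(label: str) -> str:
    """Explain a rule label according to the naming table above."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", label):
        raise ValueError(f"invalid characters in label: {label!r}")
    if label[0] == "n":
        if len(label) < 2 or label[1] not in NAME_ORDERS:
            raise ValueError(f"name rule needs f, l, or a second: {label!r}")
        return f"name rule, {NAME_ORDERS[label[1]]}"
    if label[0] == "f":
        return "firm rule"
    return "not covered by the table"


def duplicate_labels(labels) -> list:
    """Return labels that appear more than once (labels must be unique)."""
    return sorted(l for l, n in Counter(labels).items() if n > 1)


print(describe_label("nfdual34"))             # name rule, first name first
print(describe_label("nfname15_extralast"))   # name rule, first name first
print(duplicate_labels(["nfdual34", "title1", "nfdual34"]))   # ['nfdual34']
```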
Rule definition

The rule definition is a combination of token types that the application looks for
when parsing data. This section always follows the equal sign (=).
In this section of the rule file you can use only certain dictionary types. For a
listing of dictionary types, see the Information codes in Appendix C.
In addition to dictionary types, you can use some token types that are not
dictionary components in the rule file.

Valid type          Description
ALPHA_NUM           The token is alphanumeric.
PUNCTUATION         The token is a punctuation character.
CONTAINS_PUNC       The token contains punctuation.
NO_VOWEL            The token contains no vowels.
LOOKUP_NOT_FOUND    The token is not in the dictionary.
LOOKUP_ANY          The token could be any one of the above types (including any of the dictionary types).

Order of types

The Data Cleanse transform looks for the token types you've listed in the precise
order you've listed them. Multiple identifiers are connected by a plus sign (+).
Additionally, by adding an asterisk (*) after a token type, you signify that there
can be one or more of this type of token in the incoming data.


Rule file operators

Within the rule definition, you can use any of the following operators.

Symbol   Also known as            Description
#        Pound sign               Comments out the line. You must place at least one pound sign (#) at the start of any line that is for your reference purposes only. To improve readability, you can use more than one pound sign.
=        Equal sign               Shows the relationship between items, such as equating the first part of a line with the second part.
+        Plus sign (and)          Connects multiple components in rule definition lines.
&        Ampersand                Associates tokens.
[]       Brackets                 Groups elements so you can tell which go together. For an example, see the description for pipe ( | ).
|        Pipe (or)                Toggles between two options (this or that). The entire "or" expression must be enclosed in brackets ([]).
!        Exclamation mark (not)   Negates the statement. Place this operator before the item to negate (for example, !SUFFIX).
?        Question mark            Indicates that the item is optional.
*        Asterisk                 Indicates that the rule allows for one or any number of these components.
:        Colon                    Separates components within action lines and within action item lines.
;        Semicolon                Terminates the line. Is required only at the end of the rule definition and at the end of each action line.
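One way to reason about these operators is to picture each input token as carrying a set of types, and each element of a rule definition as a test on one token. The Python sketch below is a toy model built on that assumption and is not the product's parser: Element, any_of, and rule_matches are invented names, the sample rule is hypothetical rather than taken from drlrules.dat, the reading of & as "and, on the same token" is an interpretation, and the model always requires the rule to cover the whole input.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

Token = FrozenSet[str]   # every type assigned to one input token


@dataclass
class Element:
    accepts: Callable[[Token], bool]   # the test for one token
    quantity: str = ""                 # "" exactly one, "?" optional, "*" one or more


def any_of(*types: str) -> Callable[[Token], bool]:
    """[A | B]: the token has at least one of the listed types."""
    wanted = frozenset(types)
    return lambda token: bool(token & wanted)


def rule_matches(elements: List[Element], tokens: List[Token]) -> bool:
    """True when the elements, joined by +, match the tokens in order."""
    def walk(e: int, t: int) -> bool:
        if e == len(elements):
            return t == len(tokens)
        element = elements[e]
        if element.quantity == "?" and walk(e + 1, t):
            return True   # the optional element was skipped
        while t < len(tokens) and element.accepts(tokens[t]):
            t += 1
            if walk(e + 1, t):
                return True
            if element.quantity != "*":
                break     # only * may consume more than one token
        return False
    return walk(0, 0)


# Hypothetical rule:
# PRENAME_ALONE? + [NAME_STRONG_FN | NAME_WEAK_FN] + [NAME_STRONG_LN | LOOKUP_NOT_FOUND] & !INITIAL
last_name = any_of("NAME_STRONG_LN", "LOOKUP_NOT_FOUND")
rule = [
    Element(any_of("PRENAME_ALONE"), "?"),
    Element(any_of("NAME_STRONG_FN", "NAME_WEAK_FN")),
    Element(lambda token: last_name(token) and "INITIAL" not in token),
]

mr = frozenset({"PRENAME_ALONE"})
john = frozenset({"NAME_STRONG_FN"})
smith = frozenset({"NAME_STRONG_LN"})
print(rule_matches(rule, [mr, john, smith]))   # True
print(rule_matches(rule, [john, smith]))       # True
print(rule_matches(rule, [mr, smith]))         # False
```

The token types assigned to "Mr", "John", and "Smith" here are assumed for the illustration; use QuickParse to see how the product actually classifies your data.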


Action section of a parsing rule


After you define the components that the Data Cleanse transform should look for
when parsing, you must tell the application what to do when it finds a match for
that rule. In other words, you must tell the Data Cleanse transform what action it
must perform. These actions are performed for both the main parsed item, as well
as the subcomponents that make up the main item.
The diagram and table below explain the pieces that make up the action section of
a parsing rule.
options = no_multiline;
action = PERSON : D;
PERSON = 1 : PRENAME : 1;
PERSON = 2 : PRENAME : 3;
PERSON = 1 : FIRST_NAME : 4;
PERSON = 1 : LAST_NAME : 5;
PERSON = 2 : LAST_NAME : 5;
end_action

#   Component            Description
1   Options line         In this line you can specify your parsing preferences. All the components on this line are optional. In fact, the options line is optional. (In the example, the line that begins with options =.)
2   Action line          In this line, you assign the output type of the parsed item. (In the example, the line that begins with action =.)
3   Action item lines    In these lines, you assign the output type for each of these subcomponents. (In the example, the lines that begin with PERSON =.)
4   end_action command   You enter end_action to signify the end of the action section and, in effect, the end of the rule.

The components that make up action lines and action item lines are discussed in
more detail in the following sections.
How to terminate lines

Except for the last line (end_action), you must terminate each line of the rule's
action section with a semicolon (;) after the last component or indicator.

Options line

The options line lists optional components, such as whether matching should start
at the end or beginning of data. The options line is optional to the rule file.

Components

The options line consists of two parts: the label that tells you what the line is for
(start options command) and the options themselves.
options = no_multiline : begin : end;

#   Component               Description
1   Start options command   Enter options = to designate the start of the options line.
2   Option                  Enter one or more of the valid option values (see "Available options" below). Separate option values with a colon (:), and end the line with a semicolon (;).

Available options

The options line accepts only three values as options. An example from the
default rule file (State) shows all three of these options:
# State
STATE;
options = no_multiline : begin : end;
action = LAST_LINE;
LAST_LINE = 1 : LAST_LINE : 0;
end_action

Valid option    Description
begin           When matching, the data must start the line.
end             When matching, the data must end the line.
no_multiline    When matching, the data must be input on a name line or firm line.

Note: When begin and end options are used together in the rule file, the
data to be found must be by itself on a line; it can't be pulled out of the
middle.
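Because the options line accepts only three values, a typo is easy to catch with a small check before you run a job. This Python sketch is a hypothetical convenience, not a feature of the product; parse_options_line and VALID_OPTIONS are names made up for the illustration.

```python
VALID_OPTIONS = {"begin", "end", "no_multiline"}


def parse_options_line(line: str) -> list:
    """Split an options line into its values, rejecting anything unexpected."""
    text = line.strip()
    if not text.endswith(";"):
        raise ValueError("an options line must end with a semicolon")
    label, separator, values = text[:-1].partition("=")
    if separator != "=" or label.strip() != "options":
        raise ValueError("an options line must start with 'options ='")
    options = [value.strip() for value in values.split(":")]
    unknown = [option for option in options if option not in VALID_OPTIONS]
    if unknown:
        raise ValueError(f"unknown option(s): {unknown}")
    return options


print(parse_options_line("options = no_multiline : begin : end;"))
# ['no_multiline', 'begin', 'end']
```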


Action line

The diagram and table below explain the components that make up the action
line.
action = PERSON : D conf: 40;

#   Component              Description
1   Start action command   Enter action = to designate the start of the action section.
2   Output type            Enter the output type for the parsed component. Valid output types are: PERSON, FIRM, ADDRESS, LAST_LINE.
3   Dual rule indicator    If your rule is for two or more people (Mr. and Mrs. John Smith, for example), enter D after the output type. The dual rule indicator is needed in the action line only if the rule is a dual rule. If you need to enter a dual name indicator, follow the output type with a colon (:).
4   Confidence score       Optional. This component (for example, conf: 40) adds a user-defined weight to the calculated confidence, called a confidence booster. You can add any number here to make the score higher than the calculated score.

Action item lines

Each subcomponent used in the rule definition usually has a corresponding action
item line. The diagram and table below explain the components that make up the
action item lines. Reading each line from left to right, the four components are
numbered 1 through 4.
PERSON = 1 : PRENAME : 1;
PERSON = 1 : LAST_NAME : 2;
PERSON = 2 : PRENAME : 4;
PERSON = 2 : FIRST_NAME : 5;
PERSON = 2 : LAST_NAME : 6;
PERSON = 1 : NAME_CONNECTOR : 3;

#   Component and description

1   Output type
    For this component, enter the output type used in the action line, followed by the equal sign (=).

2   Item index number
    Enter the number corresponding to which output item the line is referring to. This number will be 1 or, if dual, 2.
    For example, if your rule was set up to parse a dual name (Mr. and Mrs. John Smith), you have two items: Mr. John Smith and Mrs. John Smith. In this example, you use a 1 in the action item lines that refer to Mr. John Smith (because Mr. is listed first in the input) and a 2 in action item lines that refer to Mrs. John Smith.
    Follow the item index number with a colon (:).

3   Output type subcomponent
    Enter the subcomponent of the output type for the information code that this line corresponds to. For example, if this action item line corresponds to the PRENAME information code in the rule definition, then PRENAME would be the subcomponent you use here.
    For a list of valid subcomponents for each output type, see "Valid output type subcomponents" on page 61.
    Follow the output type subcomponent with a colon (:).

4   Information code index number
    This number indicates the information code in the rule definition that this action item line corresponds to. For example, with the following data:
    Mr Smith and Mrs Mary Jones
    the following action item line
    PERSON = 1 : LAST_NAME : 2;
    refers to Smith. To be Jones, the number would be 6.
    If the action item line applies to all of the information codes in the rule definition, use an index of 0. For example,
    FIRM = 1 : FIRM : 0;
    Use an index of L here to include the whole line (data in the firm line that may not match the rule definition).
    Follow the information code index number with a semicolon (;) to terminate the action item line.
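The relationship between the action line, the action item lines, and the input can be traced by hand, or with a short script like the Python sketch below. It is a simplified model for illustration only: parse_action_line and apply_item_lines are hypothetical names, the input is assumed to be split so that token n matched information code n, and the indexes 0 and L are both treated as "the whole input", which glosses over the distinction the table draws between them.

```python
import re

ACTION_LINE = re.compile(
    r"action\s*=\s*(PERSON|FIRM|ADDRESS|LAST_LINE)"
    r"\s*(?::\s*(D))?\s*(?:conf:\s*(\d+))?\s*;"
)
ITEM_LINE = re.compile(r"(\w+)\s*=\s*(\d+)\s*:\s*(\w+)\s*:\s*(\d+|L)\s*;")


def parse_action_line(line: str) -> dict:
    """Pull the output type, dual indicator, and booster from an action line."""
    m = ACTION_LINE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not an action line: {line!r}")
    output_type, dual, conf = m.groups()
    return {"output_type": output_type,
            "dual": dual == "D",
            "conf": int(conf) if conf else None}


def apply_item_lines(lines, tokens) -> dict:
    """Map each action item line onto the token it points at."""
    parsed = {}
    for line in lines:
        m = ITEM_LINE.fullmatch(line.strip())
        if m is None:
            raise ValueError(f"not an action item line: {line!r}")
        output_type, item, subcomponent, index = m.groups()
        if index in ("0", "L"):
            value = " ".join(tokens)          # simplification, see above
        else:
            value = tokens[int(index) - 1]    # index numbers start at 1
        parsed.setdefault((output_type, int(item)), {})[subcomponent] = value
    return parsed


print(parse_action_line("action = PERSON : D conf: 40;"))
# {'output_type': 'PERSON', 'dual': True, 'conf': 40}

items = [
    "PERSON = 1 : PRENAME : 1;",
    "PERSON = 1 : LAST_NAME : 2;",
    "PERSON = 2 : PRENAME : 4;",
    "PERSON = 2 : FIRST_NAME : 5;",
    "PERSON = 2 : LAST_NAME : 6;",
    "PERSON = 1 : NAME_CONNECTOR : 3;",
]
result = apply_item_lines(items, "Mr Smith and Mrs Mary Jones".split())
print(result[("PERSON", 1)])
# {'PRENAME': 'Mr', 'LAST_NAME': 'Smith', 'NAME_CONNECTOR': 'and'}
print(result[("PERSON", 2)])
# {'PRENAME': 'Mrs', 'FIRST_NAME': 'Mary', 'LAST_NAME': 'Jones'}
```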

Valid output type subcomponents

The following table lists the valid subcomponents for each output type.

Output type   Subcomponents
PERSON        PRENAME, FIRST_NAME, MID_NAME, LAST_NAME, OTH_POST, MAT_POST, TITLE, NAME_DESIG, PRELAST, NAME_SPEC, NAME_CONNECTOR
FIRM          FIRM, FIRM_LOC
ADDRESS       ADDRESS
LAST_LINE     LAST_LINE


Example of a parsing rule


This section takes a hypothetical situation and walks through the steps for setting
up the parsing rule for this situation. For this example, you will create a rule very
similar to one that already exists in the rule file. However, you can use the general
concepts in this example to create any entirely new rule, or edit any other existing
rule.
Heres the situation

Your incoming data often contains names with three last names in the Name field.
For example, one record contains Juan Carlos Fernandez Torres Perez. You want
to create a rule to parse this as one name with discrete components of first name,
middle name, last name, last name, last name.
Note: The Data Cleanse transform already has rules for scenarios with two
last names, but not for a third so we will modify an existing rule to account
for this extra name component.

Step 1: Identify the


data you want to
parse

You already have an example of the data you want to parse: Juan Carlos
Fernandez Torres Perez. Before adding a rule, you should see how this example
currently parses. Using QuickParse you can find that it parses most of the entry
correctly, but it sends the final last name to the extra field. The rule it matches is
nfname15.
Now that you know what you want to parse and how it currently parses, you can
begin to edit the rule file for your scenario. Open drlrules.dat in any text editing
program. It is best to create a backup of this file first so that you can reverse any
changes, if necessary.

Step 2: Create the definition section of the parsing rule

The definition section includes a number of optional lines, and one required
line: the rule definition line. We recommend including comment lines before
your rule definition as an explanation of what the rule will parse. To add comment
lines to this rule, you could enter the following:
####################################################
#
#First name first rule for names with a first name,
#middle name, and 3 last names
#
#Examples: Juan Carlos Fernandez Torres Perez

Next, you must create the rule label. Name each rule label with a descriptive
name. For this example, we will build on the rule we are using as a base and name
it:
nfname15_extralast =

Because this is a name rule, you use n as the first character of the label. Also
because it is a name rule, you must include a second specific character in your
label to indicate the name order. Juan, the first name, is listed first, so you
use f as the second character. From there, the rest of the name is up to you.
Now you need to list the token types that make up the subcomponents of the main
component you hope to parse. In Step 1, you found that adding an extra last name
subcomponent would fix your problem, so you add that here, joining it to the
previous subcomponents with the plus (+) sign:


# name designator (ATTN:)
NAMEDESIG? +
# prename (mr.)
PRENAME_ALONE? +
# first name
[NAME_STRONG_FN |
NAME_WEAK_FN ] +
# middle name
[INITIAL | NAME_STRONG_FN | NAME_WEAK_FN] +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# last name (add this last name subcomponent)
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# maturity post (Jr.)
MATURPOST? +
# honorary post (phd)
HONPOST*? +
# occupational title
TITLE_ALONE*?;

Notice that some subcomponents in this rule are optional. The name designator,
prename, maturity post name, honorary post name, and occupational title are all
followed by the ? operator indicating that they may or may not be present in the
input.
Step 3: Add the options line

We want to apply this rule only on a name line, so we must add the following
options line:
Options = no_multiline;


Step 4: Create the action line

Every action line begins with action = followed by the output type. Follow up
with any optional components for this line, which are separated by colons. The
line ends with a semicolon.
Because the rule we have based this rule on is also a one-person name rule, the
action line remains the same:
action = PERSON;

Step 5: Create the action item lines

To finish the action section, you need to add one action item line for the extra last
name subcomponent. Also, you need to update the index number for the
subcomponents below this inserted line:
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;
end_action

Each action item line begins with PERSON = because this is a name rule. Only
one person is parsed in this rule, so every subcomponent has 1 as the item
index number.
You must specify the output type subcomponent that each line refers to. For
a list of valid subcomponents, see "Valid output type subcomponents". Finally,
remember that the token types in your rule definition correspond
directly to the example of data you want to parse. You must enter the index
number of the token type that the line applies to.
Here are the final action item lines you should have:
PERSON = 1 : NAME_DESIG : 1;
PERSON = 1 : PRENAME : 2;
PERSON = 1 : FIRST_NAME : 3;
PERSON = 1 : MID_NAME : 4;
PERSON = 1 : LAST_NAME : 5;
PERSON = 1 : LAST_NAME : 6;
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;

Step 6: Finish your rule and save

To finish your rule, enter end_action after the action item lines. To enable your
changes, save drlrules.dat. The final rule follows.


####################################################
#First name first rule for names with a first name, middle name and 3 last names
#Examples: Juan Carlos Fernandez Torres Perez
nfname15_extralast =
# name designator (ATTN:)
NAMEDESIG? +
# prename (mr.)
PRENAME_ALONE? +
# first name
[NAME_STRONG_FN |
NAME_WEAK_FN ] +
# middle name
[INITIAL | NAME_STRONG_FN | NAME_WEAK_FN] +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# maturity post (Jr.)
MATURPOST? +
# honorary post (phd)
HONPOST*? +
# occupational title
TITLE_ALONE*?;
action = PERSON;
PERSON = 1 : NAME_DESIG : 1;
PERSON = 1 : PRENAME : 2;
PERSON = 1 : FIRST_NAME : 3;
PERSON = 1 : MID_NAME : 4;
PERSON = 1 : LAST_NAME : 5;
PERSON = 1 : LAST_NAME : 6;
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;
end_action
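
The action item lines in the rule above follow a regular pattern: one line per subcomponent in the rule definition, numbered in order. The Python sketch below generates them from an ordered list, which makes it easy to see how inserting one subcomponent shifts every index after it. The function name is hypothetical and the sketch only produces text; it is not part of the Data Cleanse software.

```python
# Illustrative helper: builds "OUTPUT = item : SUBCOMPONENT : index;" lines
# in the format shown in the final rule above.
def action_item_lines(output_type, item_index, subcomponents):
    """Return one action item line per subcomponent, indexed from 1."""
    return [
        f"{output_type} = {item_index} : {sub} : {pos};"
        for pos, sub in enumerate(subcomponents, start=1)
    ]


subcomponents = [
    "NAME_DESIG", "PRENAME", "FIRST_NAME", "MID_NAME",
    "LAST_NAME", "LAST_NAME",
    "LAST_NAME",  # the extra last name added in this example
    "MAT_POST", "OTH_POST", "TITLE",
]
lines = action_item_lines("PERSON", 1, subcomponents)
print("\n".join(lines + ["end_action"]))
```

Removing the third LAST_NAME entry from the list reproduces the numbering of the two-last-name rule, with MAT_POST back at index 7.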


Chapter 5: Check parsing results with QuickParse

About QuickParse

QuickParse is a tool to help you check parsing results. QuickParse lets you
quickly see how data that you input would parse if input through your Data
Cleanse transform. With QuickParse you can manually type in questionable
records or use an input file and shuffle through the entries.
When you make any modifications to the parsing setup or to the user-modifiable
dictionary, you should use QuickParse to make sure that your changes produce
the results you want.

QuickParse's main window

When you start QuickParse, the program initially opens the window below. This
window is where you find out how the Data Cleanse transform would parse
records.

[Figure: QuickParse main window. Callouts: After you specify the data to be parsed, it shows up here. With these controls you can navigate your input database. The parsed items are displayed here. Components for the selected parsed item are shown here. General information about the selected item is displayed here.]

Setup needed. This window doesn't display any data until you set up
QuickParse. For information on setting up QuickParse, see "Get started
with QuickParse". For more information on the above window,
see "Run QuickParse".


Get started with QuickParse


The first thing you need to do to get started is specify which configuration file
you want to use. The configuration file contains all of the parsing option settings
and tells QuickParse which fields you will be using for input.
Set up QuickParse

1. From the QuickParse window's menu bar, select Setup > QuickParse. The
Setup window opens.
[Figure: Setup window. Callout: Here you specify the type of data to be used as input for QuickParse, whether it's a database or data that you will input manually.]
2. Specify the configuration file you want to use by typing in the path or by
browsing for it.
3. As necessary, specify the type of casing, text type, input setup, and greeting
you want to use.
4. Click OK.
Specify manual and/or database input

With QuickParse you can specify the type of data to be used as input by selecting
either Manual Input or Database Input at the Setup window.

Type of input   Description
Manual          If you choose to input your data manually, QuickParse returns you
                to the program's main window, where you can start typing in
                entries.
Database        If you choose to have a database as your input, QuickParse opens a
                window where you can set up your database (see "Set up your input
                database").


Set up your input database

If you specify database input at QuickParse's Setup window, the Database Setup
window (below) opens. When you specify a valid input file (ASCII, Delimited,
dBase3), the Database fields box displays a list of the fields in the file.

[Figure: Database Setup window. Callouts: Lists the fields in the file. Lists the fields turned on in the specified configuration file. Lists the field mappings.]

To map which input field(s) should be put on a line:

1. Click on the fields in each box (Database fields and DataRight fields), and
click Map. The Mapped fields box then shows what fields the input is coming
in on.
Because fields can only be mapped once, the fields in the Database fields box
and DataRight fields box are removed when they're in use.
2. To delete or change a mapping, click Remove. This will put the appropriate
fields back into the lists to make them available to be mapped again.
3. When you have the fields set up the way you want, click OK. QuickParse
takes you back to the main window where you can begin going through the
records.


Run QuickParse
After you set up QuickParse (see "Get started with QuickParse"), the
application's main window opens.

QuickParse's main window

QuickParse's main window lets you view all the parsed information about any
record in an input file.

[Figure: QuickParse main window, with the following callouts.]
Configuration file. The name of the configuration file you're using is noted on the top title bar.
Input lines activated in .cfg file.
Input data.
Arrow buttons for navigation.
Parsed item line. Shows items the way they were parsed.
What type the item parsed as.
Confidence score.
Type of line the input came in on.
Rule this item hit if addr, name or firm.
How fields are mapped.
Shows all the components of the item selected above.
Data file. If you're using input from a file, the file name and number of records along with which record displayed is listed on the bottom.

Navigate your input file

The arrow buttons (in the center on the left, beneath the input) let you navigate
through your input file. You can move forward to the next or ending record, and
backward to the previous or starting record. You can also go directly to a specific
record by typing its record number in the entry box and then pressing the Enter
key.

Enter records in database mode

If you're using an input file, you still have the ability to type in entries or make
modifications to an entry to see what that change would do. No changes will ever
be made to the original data file. Simply type in the change or addition and click
Parse Current. You can then continue with the file as you were before.


Keep a log file

If you have a record that is not parsing correctly and want to take note of it, you
can save it to a log file. You simply have to set up the log file and then save
entries when you come across them.
Log files can also be extremely helpful to Firstlogic Customer Support when you
call in with questions.

To create a log file:

1. Go to Setup > Logfile. QuickParse opens the following window.

2. Type or browse for the directory you want and type the file name.
3. Specify the type of log file you want created.
4. Click OK to complete the setup.
Each time you start a new session of QuickParse you have to specify a new log
file. You can't append to an existing log file.

To add to a log file

You can add an entry to your log file from QuickParse's main window.
1. When the entry you want recorded is active, click Log. QuickParse opens the
Log Item Information window.

2. Fill out the Correct item type entry, indicating how you wanted the item to
parse. (QuickParse automatically fills out the item text, input, database name,
current item type and input line fields. If the entry came from an input file,
the log file will also include the input record's number.)
3. In the Notes box, enter any additional comments pertaining to this entry.
Note: If you press the Enter key while entering comments in the Notes
box, you'll insert a return character, which shows up in your log file as
another line. You may want to have only one line per log file entry.
4. When the log contains the information you want, click OK.


Change your configuration file

If you want to change the options set in the configuration file you're using,
choose Edit > Config file. QuickParse opens a text box with the configuration
file you specified at setup.

You can change options by commenting and uncommenting items. To
uncomment a line, remove the # to the left of the line. To comment out a line,
insert a # at the very left of the line.
When you work with these options, you have to be very careful. Only one of each
type can be active at a time. For example, to change the AssignPrenames
parameter from Yes to No, you have to not only comment out the Yes line,
but also uncomment the No line.

Yes                      No
AssignPrenames = YES     #AssignPrenames = YES
#AssignPrenames = NO     AssignPrenames = NO

Manage your sessions

Instead of having to go through all the steps of setup every time you open
QuickParse, you have the option to save the session. Saving a session means
you are saving in your registry the name of the configuration file and other
options on the setup screen to be easily accessed.
If you are using database input, the name and the field mappings are also retained.
Remember, you can't append to a log file, so if you're using a log file you'll have
to specify a new one each time you open QuickParse.


To save a session:

1. At QuickParse's main window, choose File > Save Session. The Save
Session window opens.
2. Enter the name of the session, and click OK.

To open a session:

1. At QuickParse's main window, choose File > Open Session.
2. Choose the name of the session to open, and click OK.

To remove a session:

1. At QuickParse's main window, choose File > Remove Session.
2. Choose the name of the session to remove, and click OK.


Appendix A: UMD configuration file, umd.cfg

umd.cfg

Rather than type a long command line, you can use the UMD configuration file.
Make a copy of umd.cfg and save it under a different file name. Then edit and use
your copy.
# UMD Show
Output File Name (path & file name) .... =
Output File Type (See NOTE) ............ =
#
# UMD Build
Dictionary Type (see NOTE) ............. =
Source Dictionary (path & dct) ......... =
Transaction File Name (path & file name) =
Target Dictionary (path & dct) ......... =
Verify Input File only (YES/NO) ........ =
Error Message Log File (path & name) ... =
Work Directory (path) .................. =
#
# Dictionary Types:
#    parsing
#    generic
#    capital
#
# Output File Types:
#    delimited
#    ascii
#    dbase3

Guidelines for editing the configuration file

In the configuration file, do not edit anything to the left of the equal signs. To
insert comments, prefix them with a pound sign (#). Complete either the UMD
Show section or the UMD Build section, not both. (Exception: For UMD Show,
specify the dictionary at the Source Dictionary parameter.)
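
The guideline above (leave everything to the left of the equal sign alone, keep comment lines as they are) can be illustrated with a small Python sketch that fills in values on a copy of the file's text. The function name, the label matching, and the file names my_trans.txt and custom.dct are assumptions made for this illustration; UMD itself does not ship such a helper, and the exact spacing UMD accepts after the equal sign is not confirmed here.

```python
# Illustrative only: writes values to the right of "=" and never edits
# the text to the left of it. Comment lines (starting with #) are kept.
def fill_umd_cfg(template: str, values: dict) -> str:
    wanted = {key.lower(): val for key, val in values.items()}
    out = []
    for line in template.splitlines():
        if line.lstrip().startswith("#") or "=" not in line:
            out.append(line)
            continue
        left, _, _old_value = line.partition("=")
        # The parameter name is the text before the "(...)" hint.
        label = left.split("(")[0].strip().lower()
        if label in wanted:
            out.append(f"{left}= {wanted[label]}")
        else:
            out.append(line)
    return "\n".join(out)


template = "\n".join([
    "# UMD Build",
    "Dictionary Type (see NOTE) ............. =",
    "Source Dictionary (path & dct) ......... =",
    "Transaction File Name (path & file name) =",
    "Target Dictionary (path & dct) ......... =",
])
filled = fill_umd_cfg(template, {
    "Dictionary Type": "parsing",
    "Source Dictionary": "parsing.dct",
    "Transaction File Name": "my_trans.txt",  # hypothetical file name
    "Target Dictionary": "custom.dct",        # hypothetical file name
})
print(filled)
```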

Command line

When you run UMD, include the configuration file as a parameter on the UMD
command line:
Platform   Command line
UNIX       umd -cfg cfg_file.cfg
Windows    umd /cfg cfg_file.cfg

Output File Name, Output File Type

These parameters are for UMD Show mode only. If you want to record query
results in an output file, enter the path and file name. Specify a file type of ASCII,
dBASE3, or Delimited (see footnote 4).

Dictionary Type

Enter the type of dictionary you want to modify. Possible dictionary types are
Parsing, Capital, and Generic (search-and-replace).

4. If you plan to use the output file as a transaction file, we recommend the following: If you plan to use a database
program or spreadsheet program to edit the file, create a dBASE3 or ASCII file. If you plan to use a text editor or word
processor to edit the file, create a delimited file.


Source Dictionary

If you are building a custom dictionary or table, enter the path and file name of
the source dictionary:
If you are creating a parsing dictionary, specify our parsing dictionary
parsing.dct as the source dictionary.
If you are creating a capitalization dictionary or search-and-replace table, you
should usually leave this blank.
If you are querying an existing dictionary (UMD Show), specify the path and file
name of the dictionary you want to query.

Transaction File Name

Type the path and file name of the transaction file containing your entries.

Target Dictionary

Type the path and file name of the custom dictionary you want to create. If the file
already exists, UMD overwrites the existing file (see footnote 5). If you do not
specify a target, UMD uses the source dictionary as the target.

Do not overwrite any of our base dictionaries. Instead, give your custom
dictionary a separate name. Each time you install a software update, we
overwrite our base dictionaries. If you use our file names for your
dictionaries, your custom dictionaries may be overwritten.

Verify Input File Only

If you set this option to Yes, UMD checks all the entries in the transaction file but
does not actually produce the target dictionary. This is handy if you want to verify
during the day and run the build process during the night.
If you set this option to No, UMD checks the entries in the transaction file. If no
verification errors occur, UMD builds the target dictionary.

Error Message Log File

We recommend that you specify an error log file. UMD will write any error or
warning messages to the log file so you can review them later.
If you leave this parameter blank, UMD sends error and warning messages to the
screen (standard output). If any messages scroll off the screen, you will not be
able to retrieve them.

Work Directory

By default, UMD places its temporary work files in the current directory. If you
would like to use some other location, specify a path.
To estimate the space required for work files, use this formula:
Work space = 4 x (size of transaction file + size of source dictionary)
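
As a worked example of this formula, the following Python sketch computes the estimate from two file sizes given in bytes. The function name and the sample sizes are made up for illustration.

```python
# Work space = 4 x (size of transaction file + size of source dictionary)
def estimate_work_space(transaction_bytes: int, source_bytes: int) -> int:
    """Return the estimated work space, in the same unit as the inputs."""
    return 4 * (transaction_bytes + source_bytes)


# Hypothetical sizes: a 2 MB transaction file and a 10 MB source dictionary.
needed = estimate_work_space(2_000_000, 10_000_000)
print(f"Estimated work space: {needed:,} bytes")
```

With these sample sizes the estimate is 48,000,000 bytes, so the work directory should have roughly 48 MB free.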

5. Before overwriting an existing dictionary, UMD makes a backup copy of the existing file. For example, if the
dictionary is named custom.dct, UMD creates a backup file named custom.001. The next time, UMD creates a backup
named custom.002, and so on up to custom.999.
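
The backup naming described in footnote 5 can be sketched in Python as follows. The sketch assumes that UMD picks the first unused number from 001 to 999; the footnote only states the sequence of names, so the selection logic, the function name, and the behavior after custom.999 are assumptions.

```python
from pathlib import PurePath


# Illustrative only: mimics the custom.001, custom.002, ... naming.
def next_backup_name(dictionary: str, existing: set) -> str:
    """Return the first unused backup name for the given dictionary file."""
    stem = PurePath(dictionary).stem  # "custom.dct" -> "custom"
    for number in range(1, 1000):
        candidate = f"{stem}.{number:03d}"
        if candidate not in existing:
            return candidate
    # The source does not say what UMD does after .999.
    raise RuntimeError("no backup names left (assumed limit of 999)")


first = next_backup_name("custom.dct", set())
second = next_backup_name("custom.dct", {"custom.001"})
print(first, second)
```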


Appendix B: UMD command line

You can use one of three command lines with UMD:


UMD Show, for querying an existing dictionary or table
UMD Build, for verifying and building your user-modifiable dictionary
UMD Config, for using the UMD configuration file.
UMD Show

You can query an existing dictionary or table by using the UMD Show command
line.
Platform   Command line
UNIX       umd -s dct_file.dct [-o out_file] [-d db_type]
Windows    umd /s dct_file.dct [/o out_file] [/d db_type]

Parameter        Description
s dct_file.dct   Path and file name of the dictionary to query.
o out_file       Path and file name of the output file. If you save a query, UMD
                 writes it to this file. If the file already exists, UMD appends to the
                 end of the file.
                 Note: You can edit the output file and use it as a transaction file.
d db_type        Database type for the output file. Choose one: dBASE3, ASCII
                 (default), or Delimited (see footnote 1).

1. If you plan to use the output file as a transaction file, we recommend the following: If you plan to use a database
program or spreadsheet program to edit the file, create a dBASE3 or ASCII file. If you plan to use a text editor or word
processor to edit the file, create a delimited file.
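
If you launch UMD Show from a script, the command line in the table above can be assembled as in the following Python sketch. It only builds the argument list and does not run anything. The function name and the file name query_out.txt are hypothetical, and the exact spelling UMD expects for the db_type value is an assumption taken from the table.

```python
# Illustrative only: builds the UMD Show arguments shown in the table.
# UNIX options start with "-", Windows options start with "/".
def umd_show_command(dct_file, out_file=None, db_type=None, windows=False):
    flag = "/" if windows else "-"
    cmd = ["umd", f"{flag}s", dct_file]
    if out_file:
        cmd += [f"{flag}o", out_file]
    if db_type:
        cmd += [f"{flag}d", db_type]
    return cmd


show_cmd = umd_show_command("parsing.dct", out_file="query_out.txt",
                            db_type="Delimited")
print(" ".join(show_cmd))
```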

UMD Build

If you prefer not to use the configuration file, you can place all the UMD Build
parameters on the command line.
Platform   Command line
UNIX       umd dct_type -i trans [-s source] [-t target] [-e err_log] [-p work] [-v]
Windows    umd dct_type /i trans [/s source] [/t target] [/e err_log] [/p work] [/v]

Parameter    Description
dct_type     Dictionary type: Parsing or Capital.
i trans      Path and file name of the transaction file containing your custom
             entries.
s source     Path and file name of the source dictionary to use as a base for your
             custom dictionary.
t target     Path and file name of the custom dictionary to create. If the file already
             exists, UMD will overwrite it (see footnote 1). If you do not specify a
             target, UMD uses the source dictionary as the target.
e err_log    Log file for validation warnings and errors. We recommend that you
             include this parameter.
p work       Path and directory to use for temporary storage of work files. To
             estimate space requirements, use this formula:
             Work space = 4 x (size of transaction file + size of source dictionary)
v            Verify only. If you include this option, UMD checks all the entries in
             the transaction file but does not actually produce the target dictionary.

1. Before overwriting an existing dictionary, UMD makes a backup copy of the existing file. For example, if the
dictionary is named custom.dct, UMD creates a backup file named custom.001. The next time, UMD creates a backup
named custom.002, and so on up to custom.999.
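
The UMD Build command line can be assembled the same way. The Python sketch below builds the argument list from the parameters in the table, adding each optional parameter only when a value is given. The function name and the file names my_trans.txt, custom.dct, and umd_errors.log are hypothetical, and nothing is executed.

```python
# Illustrative only: builds the UMD Build arguments shown in the table.
def umd_build_command(dct_type, trans, source=None, target=None,
                      err_log=None, work=None, verify_only=False,
                      windows=False):
    flag = "/" if windows else "-"
    cmd = ["umd", dct_type, f"{flag}i", trans]
    optional = (("s", source), ("t", target), ("e", err_log), ("p", work))
    for letter, value in optional:
        if value:
            cmd += [f"{flag}{letter}", value]
    if verify_only:
        cmd.append(f"{flag}v")  # check the transaction file, build nothing
    return cmd


build_cmd = umd_build_command(
    "Parsing", "my_trans.txt",
    source="parsing.dct", target="custom.dct",
    err_log="umd_errors.log", verify_only=True,
)
print(" ".join(build_cmd))
```

Running a verify-only pass first and a full build afterward mirrors the day and night workflow suggested for the Verify Input File Only option in the configuration file.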

UMD Config

Rather than type the UMD Show or UMD Build command line, you can specify
file names and options in the UMD configuration file (see "UMD configuration
file, umd.cfg").
To run UMD with the configuration file, use the following command:

Platform   Command line
UNIX       umd -cfg cfg_file.cfg
Windows    umd /cfg cfg_file.cfg

Appendix C: Information codes and standard-type codes

Information codes

If you're creating a parsing transaction entry, type the appropriate information
codes (or "info codes") in the Info field. Put one space (no punctuation) between
codes.
If you're using UMD Show to query a parsing dictionary, these are the codes
shown in the Info Codes field.

Information code    Description

DIRECTIONAL         Refers to the part of the address that gives directional
                    information for delivery, such as N, S, N.E.
FIRMDESIG           Indicates that a firm is to follow.
FIRMINIT            When used in a firm name, likely to be the first word in the
                    firm name.
FIRMLOC             A location within a firm (usually used for internal mail
                    delivery).
                    For example: Department, Mailstop, Room, Building.
FIRMMISC            A word used in firm names.
FIRMNAME            This code is used for firm names that may be parsed
                    incorrectly. For example, Hewlett Packard could be incorrectly
                    parsed as a personal name, so Hewlett, Packard, and Hewlett
                    Packard are all listed as Firm Name words.
FIRMNAME_ALONE      A firm name that can stand on its own.
FIRMTERM            Likely to be the last word in a firm name.
                    For example: Inc, Corp, Ltd, and so on.
HONPOST             A postname that signifies certification, academic degree, or
                    affiliation.
                    For example: CPA, PhD, or USNR.
INITIAL             A middle initial, such as C or J.
MATURPOST           A maturity postname such as Jr or Sr.
NAMEDESIG           A name designator such as Attn or c/o.
NAMEGEN1-5          The gender of the name.
                    NAMEGEN1: 94 to 100 percent chance the person is a
                    man (e.g., Robert).
                    NAMEGEN2: 70 to 93 percent chance the person is a man
                    (e.g., Adrian).
                    NAMEGEN3: The name does not reliably indicate gender,
                    or is a last name.
                    NAMEGEN4: 70 to 93 percent chance the person is a
                    woman (e.g., Lynn).
                    NAMEGEN5: 94 to 100 percent chance the person is a
                    woman (e.g., Anne).
NAMESPEC            A word that may appear in a name line, such as Family,
                    Resident, Occupant.
NUMBER              A number word, such as One, First, or 1st.
PHRASE_WRD          A word that is part of a phrase.
                    For example, the dictionary contains an entry for the phrase
                    VP Mktg. Each word in the phrase, VP and Mktg, is
                    marked as a PHRASE_WRD.
POST_OFFICE         The name of a SCF, ASF, BMC, or ADC.
PREGEN1-5           The gender of a prename.
                    PREGEN1: Masculine. For example, Mr or Senor.
                    PREGEN3: Neutral. For example, Dr or Capt.
                    PREGEN5: Feminine. For example, Ms, Mrs, or Senora.
                    Note: The entry must also include the PRENAME or
                    PRENAME_ALONE code.
PREFIRST            A first-name prefix.
PRELAST             A last-name prefix, such as Van Allen or O'Connor.
PRENAME             A prename, such as Mr, Ms, Senor, Senora, Dr, Capt.
                    Note: The entry must also include one of the PREGEN
                    codes.
PRENAME_ALONE       A prename, such as Mr. or Ms., without the PREGEN code.
PRIVATE_ADDR        Private mailbox, pmb.
REGION              A geographical word such as North, Western, Minnesota, or
                    NY.
TITLE               A word used in a job title, such as Software or Engineer.
TITLE_INIT          A word used at the beginning of the title, such as Vice or
                    Associate.
TITLE_TERM          A word used at the end of the title, such as Engineer or
                    Manager.
TITLE_ALONE         A word that can stand as a single title, such as Accountant or
                    Attorney.
SEC_ADDR            Secondary address, such as an apartment, building, or suite.
STATE               A U.S. state abbreviation, such as WI, MN, or NY.
SUFFIX              A suffix, such as Jr, II, or III.
MIL_ADDR            Part of a domestic military address, such as psc.
MIL_LAST            Part of an overseas military address preceding the two-character
                    "state" abbreviation, such as APO, FPO.
MIL_STATE           Part of an overseas military address indicating the state, such as
                    AE, AP, AA.
NAME_STRONG_FN,     Used for a name whose position (first or last) can be qualified.
NAME_WEAK_FN,       NAME_STRONG_FN: Most likely a first name (e.g., Michael).
NAME_AMBIGUOUS,     NAME_WEAK_FN: Possibly a first name (e.g., Corey).
NAME_WEAK_LN,       NAME_AMBIGUOUS: Indeterminate (not listed).
NAME_STRONG_LN      NAME_WEAK_LN: Possibly a last name (e.g., Hunter).
                    NAME_STRONG_LN: Most likely a last name (e.g., McMichaels).
RR_HC_ADDR          Highway Contract Rural Routes.
CONNECTOR           The word, character, or symbol between other words.
                    For example: "and" and "&".
ZIP                 A ZIP Code.
ZIP4                A ZIP+4 Code.

Standard-type codes

If you're creating a parsing transaction entry, type the appropriate standard-type
codes in the Stdtype field. Put one space (no comma) between codes.
If you're using UMD Show to query a parsing dictionary, these are the codes
shown next to each Standard.
In this table, we use the terminology "Primary" and "Secondary". If you are
using UMD Show, the "Primary" is the word you queried and the "Secondary"
is the Standard.
Standard-type code  Description

ADDRESS_STD         If the Primary is parsed as an address, use this Secondary as
                    the standardized form.
ALL_TEXT_TYPES      If a text standard (STD) is not indicated for a particular type
                    of data, use this standard as the default. Usually used with
                    NUMBER and REGION words.
                    For example, if a word is parsed as a firm name but the
                    dictionary does not list a FIRM_STD for the word, then use the
                    ALL_TEXT_TYPES standard as the text standard.
                    Note: The ALL_TEXT_TYPES is used as a default text standard
                    (STD) only. It is not used as a default match standard or
                    acronym.
FIRM_ACR            If the Primary is parsed as a firm name, use this Secondary as
                    the acronym.
FIRM_MTC            If the Primary is parsed as a firm name, use this Secondary as
                    the match standard.
FIRM_STD            If the Primary is parsed as a firm name, use this Secondary as
                    the standardized form.
FIRMLOC_ACR         If the Primary is parsed as a firm location, use this Secondary
                    as the acronym.
FIRMLOC_MTC         If the Primary is parsed as a firm location, use this Secondary
                    as the match standard.
FIRMLOC_STD         If the Primary is parsed as a firm location, use this Secondary
                    as the standardized form.
HONPOST_ACR         If the Primary is parsed as an honorary postname, use this
                    Secondary as the acronym.
HONPOST_MTC         If the Primary is parsed as an honorary postname, use this
                    Secondary as the match standard.
HONPOST_STD         If the Primary is parsed as an honorary postname, use this
                    Secondary as the standardized form.
LAST_LINE_STD       If the Primary is parsed as a last line, use this Secondary as
                    the standardized form.
MATURPOST_MTC       If the Primary is parsed as a maturity postname, use this
                    Secondary as the match standard.
MATURPOST_STD       If the Primary is parsed as a maturity postname, use this
                    Secondary as the standardized form.
NAME_MTC            If the Primary is parsed as a name, use this Secondary as the
                    match standard.
NAMEDESIG_ACR       If the Primary is parsed as a name designator, use this
                    Secondary as the acronym.
NAMEDESIG_MTC       If the Primary is parsed as a name designator, use this
                    Secondary as the match standard.
NAMEDESIG_STD       If the Primary is parsed as a name designator, use this
                    Secondary as the standardized form.
NAMESPEC_ACR        If the Primary is parsed as a name-special component, use this
                    Secondary as the acronym.
NAMESPEC_MTC        If the Primary is parsed as a name-special component, use this
                    Secondary as the match standard.
NAMESPEC_STD        If the Primary is parsed as a name-special component, use this
                    Secondary as the standardized form.
PRELAST_MTC         If the Primary is parsed as a last-name prefix, use this
                    Secondary as the match standard.
PRELAST_STD         If the Primary is parsed as a last-name prefix, use this
                    Secondary as the standardized form.
PRENAME_ACR         If the Primary is parsed as a prename, use this Secondary as
                    the acronym.
PRENAME_MTC         If the Primary is parsed as a prename, use this Secondary as
                    the match standard.
PRENAME_STD         If the Primary is parsed as a prename, use this Secondary as
                    the standardized form.
TITLE_ACR           If the Primary is parsed as a title, use this Secondary as the
                    acronym.
TITLE_MTC           If the Primary is parsed as a title, use this Secondary as the
                    match standard.
TITLE_STD           If the Primary is parsed as a title, use this Secondary as the
                    standardized form.
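
The ALL_TEXT_TYPES fallback described in the table can be sketched in Python as a simple lookup. This is only a model of the documented behavior: the dictionary layout, the function name, and the sample standards for a made-up word are assumptions, not the format of the real parsing dictionary.

```python
from typing import Optional


# Illustrative only: ALL_TEXT_TYPES is a default for text standards (_STD).
# It is not used as a default match standard (_MTC) or acronym (_ACR).
def lookup_standard(standards: dict, std_type: str) -> Optional[str]:
    if std_type in standards:
        return standards[std_type]
    if std_type.endswith("_STD"):
        return standards.get("ALL_TEXT_TYPES")
    return None


# Hypothetical standards listed for one Primary word.
standards = {"ADDRESS_STD": "N", "ALL_TEXT_TYPES": "North"}

print(lookup_standard(standards, "ADDRESS_STD"))  # explicit address standard
print(lookup_standard(standards, "FIRM_STD"))     # falls back to the default
print(lookup_standard(standards, "FIRM_MTC"))     # no fallback for match standards
```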


