IQ8 Data Cleanse Modifiers Guide
User Modifiable Dictionary for the IQ8 Data Cleanse transform
November 2004

UMD command line and configuration file
Query the dictionaries to investigate parsing and mixed-casing
Create a custom parsing dictionary
Create custom capitalization dictionaries
Edit the pattern file for custom pattern matching
Edit the rule file for custom parsing
Run QuickParse
Notices
Published in the United States of America by Firstlogic, Inc., 100 Harborview Plaza,
La Crosse, Wisconsin 54601-4071.
Customer Care
Technical help is free for customers who are current on their ESP. Advisors are
available from 8 a.m. to 6 p.m. central time, Monday through Friday. When you call,
have at hand the user's manual and the version number of the product you are using.
Call from a location where you can operate your software while speaking on the
phone. To save time, fax or e-mail your questions, and an advisor will call or e-mail
back with answers prepared. Or visit our Knowledge Base on the Customer Portal
web site, where you can find answers on your own, right away, at any time of the day
or night.
Our Customer Care group also manages our customer database and order processing.
Call them for order status, shipment tracking, reporting damaged shipments or flawed
media, changes in contact information, and so on.
Phone: 888-788-9004 in the U.S. and Canada; elsewhere +1-608-788-9000
Web site: http://www.firstlogic.com/customer
E-mail: customer@firstlogic.com
Product literature: 888-215-6442, fax 608-788-1188, or information@firstlogic.com
Corporate receptionist: 608-782-5000, or fax 608-788-1188
What do you think of this guide?
The Firstlogic Technical Publications group strives to bring you the most useful and
accurate publications possible. Please give us your opinion about our documentation
by filling out the brief survey at http://www.firstlogic.com/customer/surveys/default.asp.
We appreciate your feedback! Thank you!
Legal notices
© 2004 Firstlogic, Inc. All rights reserved. This publication and accompanying software are protected by U.S. copyright
law and international treaties. No part of this publication or accompanying software may be copied, transferred, or
distributed to any person without the express written permission of Firstlogic, Inc.
National ZIP+4 Directory © 2004 United States Postal Service. Firstlogic Directories © 2004 Firstlogic, Inc. All City,
ZCF, state ZIP+4, regional ZIP+4, and supporting directories are also protected under the Firstlogic copyright. Firstlogic,
Inc. is a nonexclusive interface distributor of the USPS and holds a nonexclusive license to publish and sell ZIP+4
databases on optical and magnetic media. Firstlogic publishes this document and offers the Firstlogic product to the public
under a nonexclusive license from the United States Postal Service. The price of the Firstlogic product is not established,
controlled, or approved by the U.S. Postal Service.
Firstlogic, Inc., or any authorized dealer distributing this product, makes no warranty, expressed or implied, with respect to
this computer software product or with respect to this manual or its contents, its quality, performance, merchantability, or
fitness for any particular purpose or use. It is solely the responsibility of the purchaser to determine its suitability for a
particular purpose or use. Firstlogic, Inc. will in no event be liable for direct, indirect, incidental, or consequential damages
resulting from any defect or omission in this software product, this manual, the program disks, or related items and
processes, including, but not limited to, any interruption of service, loss of business or anticipatory profit, even if
Firstlogic, Inc. has been advised of the possibility of such damages. This statement of limited liability is in lieu of all other
warranties or guarantees, expressed or implied, including warranties of merchantability or fitness for a particular purpose.
1L, 1L (ball design), ACE, ACSpeed, DataJet, DocuRight, eDataQuality, Entry Planner, Firstlogic, Firstlogic InfoSource,
FirstPrep, FirstSolutions, GeoCensus, idCentric, IQ Insight, iSummit, Label Studio, MailCoder, Match/Consolidate,
PostWare, Postalsoft, Postalsoft Address Dictionary, Postalsoft Business Edition by Firstlogic, Postalsoft DeskTop Mailer,
Postalsoft DeskTop PostalCoder, Postalsoft DeskTop Presort, Postalsoft Manifest Reporter, PrintForm, RapidKey, Total
Rewards, and TrueName are registered trademarks of Firstlogic, Inc. DataRight, IRVE, and TaxIQ are trademarks of
Firstlogic, Inc. CASS, DPV, eLOT, FASTforward, NCOAlink and ZIP are trademarks of the United States Postal Service.
All other trademarks are the property of their respective owners.
Contents
Preface ... 5
Chapter 1: Custom parsing dictionaries ... 7
    Step 1: Query the dictionary ... 9
    Step 2: Create a parsing transaction file ... 12
    Step 3: Put your entries in the transaction file ... 13
    Step 4: Build your custom parsing dictionary ... 15
    Step 5: Maintain and update your custom dictionary ... 16
    Sample transaction: Add a new word ... 17
    Sample transaction: Add a title phrase ... 18
    Sample transaction: Add a multiple-word firm name ... 20
    Sample transaction: Add a firm that looks like a personal name ... 22
    Sample transaction: Modify information codes ... 23
    Sample transaction: Modify standards and standard-types ... 24
    Sample transaction: Add an acronym for acronym conversion ... 25
    Rules for working with match standards ... 26
Chapter 2: Custom capitalization dictionaries ... 29
    Step 1: Create a capitalization transaction file ... 30
    Step 2: Put your entries in the transaction file ... 31
    Step 3: Build your custom capitalization dictionary ... 32
    Step 4: Update your custom dictionary ... 33
Chapter 3: User-defined pattern matching (UDPM) ... 35
    Overview of UDPM ... 36
    Working with the pattern file ... 37
    Introduction to regular expressions ... 39
    Creating regular expressions ... 42
    Example of defining a pattern ... 43
    Alternate expressions ... 45
    Multiple rules ... 46
    Example of a user-defined pattern file ... 47
Chapter 4: Modify the rule file ... 51
    What is the rule file? ... 52
    How the rule file is organized ... 53
    Rule example ... 54
    Definition section of a parsing rule ... 55
    Action section of a parsing rule ... 58
    Example of a parsing rule ... 62
Chapter 5: Check parsing results with QuickParse ... 67
    Get started with QuickParse ... 68
    Run QuickParse ... 70
Appendix A: UMD configuration file, umd.cfg ... 75
Appendix B: UMD command line ... 77
Appendix C: Information codes and standard-type codes ... 79
Index ... 85
Preface
About this guide
This guide explains the User-Modifiable Dictionary (UMD), which is a tool for
viewing and customizing dictionary files.
This guide explains how to use the command-line version of UMD to create
custom parsing dictionaries and custom capitalization dictionaries.
Another way you can modify the Data Cleanse transform's behavior to suit your
needs is to edit the pattern file (drludpm.dat). You edit the pattern file in order to
parse user-defined data patterns.
Additionally, you can edit the rules that are used to parse different types of name
and firm data. For more information, see "Modify the rule file" on page 51.
The last chapter of this guide explains how to check your results with QuickParse.
Use QuickParse to quickly see how data that you input would parse if input
through the Data Cleanse transform.
Related documents
Before using UMD, you should understand how your Firstlogic product uses the
dictionaries. For details, see your product documentation.
Conventions
This document follows these conventions:

Convention       Description
Bold             We use bold type for file names, paths, emphasis, and text that you
                 should type exactly as shown. For example, Type iq8\bin.
Italics          We use italics for emphasis and text for which you should substitute
                 your own data or values. For example, Type a name for your
                 project, and the .xml extension (projectname.xml).
Menu commands    We indicate commands that you choose from menus in the following
                 format: Menu Name > Command Name. For example, Choose File
                 > New.
Changes          We use a change bar in the right margin to mark product changes
                 since the last version.
Symbols          We use margin symbols to alert you to important information and
                 potential problems, to point out special cases that you should know
                 about, and to draw your attention to tips that may be useful to you.
Chapter 1: Custom parsing dictionaries
What is a parsing dictionary?
The parsing dictionary identifies and parses name, title, and firm data. The parser
looks up words in the parsing dictionary to retrieve information. The parser then
uses the dictionary information, as well as the rule file, to identify and parse
name, title, and firm data.
The parsing dictionary contains entries for words and phrases. Each entry tells
how the word or phrase might be used. For example, the dictionary indicates that
the word Engineering can be used in a firm name (such as Smith Engineering,
Inc.) or job title (such as VP of Engineering).
The dictionary also contains other information:
Type of information in dictionary    Description
Acronyms           The dictionary contains the standard and acronymic forms of
                   words. For example, the dictionary indicates that Inc. is the
                   standardized form of Incorporated and IBM is the acronym
                   for Intl Business Machines.
Match standards    The dictionary contains match standards (potential matches).
                   For example, Patrick and Patricia are match standards for
                   Pat.
Gender             The dictionary contains gender data. For example, it indicates
                   that Anne is a feminine name and Mr. is a masculine prename.
Address            The dictionary contains address components for address
                   rules.
Why create a custom dictionary?
Our base parsing dictionary contains thousands of name, title, and firm entries.
You might tailor the dictionary to better suit your data. For example:
You might customize the dictionary to correct specific parsing behavior. For
example, given the name Mary Jones, CRNA, the word CRNA is parsed as a
job title. In reality, CRNA is a postname (Certified Registered Nurse
Anesthetist). To correct this, you could add CRNA to the parsing dictionary as
a postname.
You might tailor the dictionary to better suit your data by adding regional or
ethnic names, special titles, or industry jargon. For example, if you process
data for the real estate industry, you might add postnames such as CRS
(Certified Residential Specialist) and ABR (Accredited Buyer
Representative).
If a specific title or firm name is parsed incorrectly, you can add an entry for
the entire phrase. For example, the parser previously identified Hewlett
Packard as a personal name, so we added Hewlett Packard to the dictionary
as a firm name.
Overview of creating a dictionary
To create a custom parsing dictionary, follow these basic steps:

1. Use UMD Show to query our base parsing dictionary. Look for existing
entries for the words you wish to add or change.
2. Put your custom entries in a transaction file. A transaction file is a database
containing the additions and changes you wish to make to our dictionary.
3. Build your custom dictionary. UMD Build takes our base dictionary, makes
the additions and changes specified in the transaction file, and creates the
custom dictionary.
[Figure: UMD Build reads the source dictionary (our parsing dictionary, parsing.dct), the transaction file (a database containing your additions and changes), and the supporting files (files that enable UMD to read the transaction file), and produces the custom dictionary (a new dictionary containing entries from the source dictionary with your additions and changes).]
Qualifications
Preparing custom parsing dictionaries is a task for a data-management

professional. If you employ UMD in all its capabilities, dictionary editing is
almost an engineering task. Dictionary editing is not a clerical task.
A note about examples
The sample queries and transactions in this chapter are for example only. By the
time you read this manual, the particular examples may have been added to our
base parsing dictionaries, so your query results may differ from what is shown.
Step 1: Query the dictionary

Before you add a word to the dictionary, query our base parsing dictionary,
parsing.dct, to see whether there is already an entry for the word.
Run UMD Show
To query a dictionary, run UMD Show. To run UMD Show, use the command line
(see UMD Show on page 77). UMD Show is interactive. You enter a query and
UMD Show responds, either with data or a message that your query was not
found in the dictionary.
Query a single word
To query a single word, type the word at the Enter> prompt. Do not include any
punctuation. If the word is in the dictionary, UMD Show displays the dictionary
entry:
C:\umd /s parsing.dct
Using a parsing Dictionary.
Enter a query, or press <Esc> to exit.
Enter> Beth
Usage: 99
Intl Code(s): USENGLISH
Info Code(s): NAME_STRONG_FN NAMEGEN5
Standard(s) for BETH:
- BETHANY      NAME_MTC
- BETHEL       NAME_MTC
- ELIZABETH    NAME_MTC
For descriptions of the information codes and standard-type codes, see
"Information codes and standard-type codes" on page 79.
If the word is not in the dictionary, UMD Show tells you the entry was not found:
Using a parsing Dictionary.
Enter a query, or press <Esc> to exit.
Enter> Michelangelo
Text not found in dictionary.
Query a title phrase
To look up a multiple-word title, you must query the lookup form of the title,
that is, the same form as the parser would look up (see the procedure below).
Note: If a line contains consecutive words that are marked as phrase
words, the parser retrieves the standard for each word, removes any
punctuation, and looks up the phrase.
Procedure (with example)
1. Start with the raw title.
   Example: Chief Executive Officer
2. Query each word and get the standard (TITLE_MTC) for each. If an
   appropriate match standard does not exist, use the original word.
   Example: Chf. Exec. Off.
3. Remove all punctuation. This is the form of the title that you should query.
   Example: Chf Exec Off
Get the first appropriate match standard for each word in the phrase:
Enter a query, or press <ESC> to exit.
Enter> Chief
Usage: 0
Info Code(s): PRENAME TITLE TITLE_INIT TITLE_TERM PREGEN3 PHRASE_WRD FIRMMISC
Standard(s) for CHIEF:
- CHIEF        FIRM_MTC, FIRM_STD, PRENAME_MTC, PRENAME_STD, TITLE_MTC, TITLE_STD
Enter> Executive
Usage: 0
Info Code(s): TITLE TITLE_INIT TITLE_TERM PHRASE_WRD FIRMMISC
Standard(s) for EXECUTIVE:
- EXECUTIVE    FIRM_MTC, FIRM_STD, TITLE_MTC, TITLE_STD
Enter> Officer
Usage: 0
Info Code(s): PRENAME_ALONE NAME_STRONG_LN TITLE TITLE_ALONE TITLE_TERM PREGEN2 NAMEGEN1 PHRASE_WRD FIRMMISC
Standard(s) for OFFICER:
- OFFICER      FIRM_MTC, FIRM_STD, NAME_MTC, PRENAME_MTC, PRENAME_STD, TITLE_MTC, TITLE_STD
Query the lookup form of the phrase:
Enter> Chief Executive Officer
Usage: 0
Info Code(s): TITLE_ALONE
Standard(s) for CHIEF EXECUTIVE OFFICER:
- CEO          TITLE_ACR, TITLE_MTC, TITLE_STD
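The title lookup procedure above can be sketched in a few lines of Python. This is only an illustration, not part of UMD: the function name title_lookup_form and the TITLE_STANDARDS table are hypothetical, and the standards in the table are copied from the procedure's example column rather than from a real query. In practice you would take each word's first TITLE_MTC standard from UMD Show.

```python
import string

# Hypothetical excerpt: first title match standard (TITLE_MTC) per word,
# copied from the example column of the procedure, not from a real dictionary.
TITLE_STANDARDS = {
    "CHIEF": "Chf.",
    "EXECUTIVE": "Exec.",
    "OFFICER": "Off.",
}

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def title_lookup_form(raw_title, standards=TITLE_STANDARDS):
    """Return the lookup form of a multiple-word title."""
    words = []
    for word in raw_title.split():
        bare = word.strip(string.punctuation)
        # Step 2: use the match standard, or the original word if none exists.
        words.append(standards.get(bare.upper(), bare))
    # Step 3: remove all punctuation.
    cleaned = [w.translate(_STRIP_PUNCT) for w in words]
    return " ".join(w for w in cleaned if w)


print(title_lookup_form("Chief Executive Officer"))  # prints: Chf Exec Off
```

A word with no entry in the table is passed through unchanged, which mirrors the rule "use the original word" in step 2.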
Query a multiple-word firm name
If you want to query a firm name that is also a personal name, such as
Hewlett Packard or Johnson & Johnson, see "Query a firm name that looks
like a personal name" on page 11.
To look up a multiple-word firm name, you must query the lookup form of the
firm name, that is, the same form as the parser would look up:
Procedure (with example)
1. Start with the raw firm name.
   Example: The General Motors Corporation
2. Remove the words and, or, of, the, and for.
   Example: General Motors Corporation
3. Remove firm terminator words such as Corp, Inc, Ltd, Co, and so on.
   Example: General Motors
4. Query each remaining word. Get the standard (FIRM_MTC) for each. If an
   appropriate match standard does not exist, use the original word.
   Example: Gen. Motors
5. Remove all punctuation. This is the lookup form of the firm name.
   Example: Gen Motors
Query the lookup form of the firm name:
Using a Parsing Dictionary.
Enter> Gen Motors
Usage: 0
Info Code(s): FIRMMISC FIRMNAME
Standard(s) for GEN MOTORS:
- GM    FIRM_ACR, FIRM_MTC, FIRM_STD
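The firm-name procedure can be sketched the same way. The Python below is a hedged illustration only: firm_lookup_form, FIRM_TERMINATORS, and FIRM_STANDARDS are hypothetical names, the terminator list is a partial one assumed from the words this guide mentions, and the single standard shown is taken from the example column of the procedure.

```python
import string

# Words removed in step 2 of the procedure.
NOISE_WORDS = {"AND", "OR", "OF", "THE", "FOR"}
# Assumed partial list; the guide names these terminators "and so on".
FIRM_TERMINATORS = {"CORP", "CORPORATION", "INC", "LTD", "CO"}
# Hypothetical excerpt: first firm match standard (FIRM_MTC) per word.
FIRM_STANDARDS = {"GENERAL": "Gen."}

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def firm_lookup_form(raw_name, standards=FIRM_STANDARDS):
    """Return the lookup form of a multiple-word firm name."""
    kept = []
    for word in raw_name.split():
        key = word.translate(_STRIP_PUNCT).upper()
        # Steps 2 and 3: drop noise words and firm terminator words.
        if not key or key in NOISE_WORDS or key in FIRM_TERMINATORS:
            continue
        # Step 4: use the match standard, or the original word if none exists.
        kept.append(standards.get(key, word))
    # Step 5: remove all punctuation.
    return " ".join(w.translate(_STRIP_PUNCT) for w in kept)


print(firm_lookup_form("The General Motors Corporation"))  # prints: Gen Motors
```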
Query a firm name that looks like a personal name
For descriptions of the information codes and standard-type codes,
see Appendix C.
Some firms are named after people (for example, Hewlett Packard or Johnson
and Johnson).
To look up this type of firm name, you must query the lookup form of the firm
name, that is, the same form of the name that the parser would look up:
Procedure (with example)
1. Start with the raw firm name.
   Example: Johnson and Johnson Corp.
2. Remove all punctuation characters.
   Example: Johnson and Johnson Corp
3. Remove noise words such as and.
   Example: Johnson Johnson Corp
4. Remove all firm-terminator words, such as Corporation, Inc, Ltd, and so on.
   This is the lookup form of the firm name.
   Example: Johnson Johnson
If all of the words in a line are identified as both FIRMNAME and NAME
words, the parser removes noise words and punctuation, then looks to see
whether the name is listed as a firm name. If so, the line is parsed as a firm
name. If not, the line is parsed as a personal name.
Query the lookup form of the firm name:
Enter> Johnson Johnson
Usage: 1
Info Code(s): FIRMNAME
Standard(s) for JOHNSON JOHNSON:
- JOHNSON JOHNSON    FIRM_MTC, FIRM_STD
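The decision described above (a line whose words are all both FIRMNAME and NAME words is parsed as a firm only if its lookup form is listed as a firm name) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the ENTRIES table is a hypothetical dictionary excerpt, classify_line is not a UMD or parser function, and the real parser also applies the rule file, which this sketch ignores.

```python
import string

NOISE_WORDS = {"AND", "OR", "OF", "THE", "FOR"}
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

# Hypothetical dictionary excerpt: entry text -> information codes.
ENTRIES = {
    "JOHNSON": {"NAME", "FIRMNAME"},
    "SMITH": {"NAME", "FIRMNAME"},
    "JOHNSON JOHNSON": {"FIRMNAME"},
}


def classify_line(line, entries=ENTRIES):
    """Classify a line as 'firm', 'person', or 'undetermined'."""
    # Remove punctuation and noise words, as the parser is described as doing.
    words = [w.translate(_STRIP_PUNCT).upper() for w in line.split()]
    words = [w for w in words if w and w not in NOISE_WORDS]
    both = {"NAME", "FIRMNAME"}
    if not words or not all(both <= entries.get(w, set()) for w in words):
        # This sketch covers only lines made entirely of NAME + FIRMNAME words.
        return "undetermined"
    phrase_codes = entries.get(" ".join(words), set())
    return "firm" if "FIRMNAME" in phrase_codes else "person"


print(classify_line("Johnson & Johnson"))  # prints: firm
print(classify_line("Smith Johnson"))      # prints: person
```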
Step 2: Create a parsing transaction file

A transaction file is a database that contains all the additions and changes that you
want to make to the parsing dictionary. The first time you create a custom parsing
dictionary, you must create a transaction file.
If you're updating an existing custom dictionary, use your existing
transaction file. Your dictionary will be easier to manage if you store your
entries in one transaction file, rather than scattering them among many
files.
Create a transaction database
The quickest, easiest way to create a transaction database file and its supporting
files is to use the output file feature of UMD Show. (See "UMD Show" on
page 77.)
1. Use UMD Show to query our base parsing dictionary, parsing.dct. Include
the /o option on the command line. Use the file name that you plan to use for
your custom dictionary, but with the extension .trn (for example,
my_parse.trn).
If you plan to use a database program or spreadsheet program to edit the
file, we recommend creating a dBASE3 or ASCII file. If you plan to use a
text editor or word processor to edit the file, we recommend you create a
delimited file. However, be aware that our UMD Views program doesn't
support updating of delimited files.
2. Query a word that is in the dictionary, such as Bob.
3. To save the query to your output file, press Enter. To exit, press Escape.
C:\umd /s parsing.dct /o my_parse.trn /d dBase3
Using a Parsing Dictionary
Enter> Bob
Usage: 99
Info Code(s): NAME_STRONG_FN NAMEGEN1
Standard(s) for BOB:
- ROBERT    NAME_MTC
Enter a query, or press <Enter> to save, or press <ESC> to exit
Enter>
Previous query appended to C:\my_parse.trn.
Enter> <Esc>
UMD Show will create an output database file (for example, my_parse.trn).
You can use this database as your transaction file.
Keep supporting files with transaction file
When you create a transaction database as described above, UMD Show creates a
supporting file such as my_parse.def. For ASCII and delimited transaction files,
UMD Show also creates an additional supporting file such as my_parse.fmt or
my_parse.dmt. To open and read the transaction file, UMD requires these files.
If you move the transaction file to a new location, make sure you also move the
corresponding supporting files.
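If you script the move, a helper along the following lines keeps the transaction file and its supporting files together. This is a minimal Python sketch, not a Firstlogic utility: the function name move_transaction_file is hypothetical, and it assumes the supporting files share the transaction file's base name and use only the .def, .fmt, and .dmt extensions mentioned above.

```python
import shutil
from pathlib import Path

# Supporting-file extensions named in this guide.
SUPPORT_EXTENSIONS = (".def", ".fmt", ".dmt")


def move_transaction_file(trn_path, dest_dir):
    """Move a transaction file and any supporting files that exist beside it.

    Returns the names of the files that were moved.
    """
    trn_path = Path(trn_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    candidates = [trn_path] + [trn_path.with_suffix(ext) for ext in SUPPORT_EXTENSIONS]
    moved = []
    for path in candidates:
        if path.exists():
            shutil.move(str(path), str(dest_dir / path.name))
            moved.append(path.name)
    return moved
```

For example, move_transaction_file("c:/pathname/my_parse.trn", "d:/dictionaries") would also move my_parse.def if it sits next to the transaction file.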
Step 3: Put your entries in the transaction file

To add records to your transaction file, use a text editor or database program. For
each record, provide the information described below.
For examples, see the sample transactions starting with Sample transaction: Add
a new word on page 17.
Parse transaction entries
Field and data to enter:
Action
    Choose one action:
    Create a new entry or overwrite the existing entry (code N in the sample transactions).
    Add information to an existing entry (code A in the sample transactions).
    Change the usage or gender data for an existing entry.
    Delete information from an existing entry, or delete the entry.
Primary
    Type the word or phrase that you want to add or whose entry you want to
    modify. Fifty-four characters maximum, not case-sensitive, do not include
    any punctuation.
    For phrases and multiple-word firm names, use the lookup form. To get
    the lookup form, see "Query a title phrase" on page 9 and "Query a firm
    name that looks like a personal name" on page 11.
Secondary
    Type one of the following:
    The preferred standardized form of the Primary.
    A match standard (for information and guidelines, see "Rules for working with match standards" on page 26).
    The acronym form of the Primary.
Intl
    Type USENGLISH.
Info
    Type all information codes that apply, if not already in the dictionary. Put
    one space (no punctuation) between codes.
    For a list of information codes, see "Information codes" on page 79.
Stdtype
    Type all standard-type codes that apply, if not already in the dictionary.
    Put one space (no punctuation) between codes.
    For a list of standard-type codes, see "Standard-type codes" on page 79.
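Before you build, it can help to check records against the field rules above. The following Python sketch checks only the two rules that this table states explicitly: the Primary limit of 54 characters with no punctuation, and single spaces between codes. The function names are hypothetical, and UMD Build performs its own validation, which this does not replace.

```python
import re
import string

MAX_PRIMARY_LENGTH = 54  # "Fifty-four characters maximum"


def primary_problems(primary):
    """Return a list of rule violations for a Primary value (empty if none)."""
    problems = []
    if len(primary) > MAX_PRIMARY_LENGTH:
        problems.append(f"Primary is longer than {MAX_PRIMARY_LENGTH} characters")
    found = sorted({ch for ch in primary if ch in string.punctuation})
    if found:
        problems.append("Primary contains punctuation: " + " ".join(found))
    return problems


def normalize_codes(codes):
    """Put exactly one space, and no commas or semicolons, between codes."""
    parts = [p for p in re.split(r"[,;\s]+", codes.strip()) if p]
    return " ".join(parts)


print(primary_problems("Vice Pres of Mktg"))      # prints: []
print(normalize_codes("TITLE_STD,  TITLE_MTC"))   # prints: TITLE_STD TITLE_MTC
```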
Required fields
For each action, you must provide certain information. In the table, a check
mark means that you must provide information for that field.
[Table: required fields for each type of change. Columns: Action, Primary, Secondary, Usage, Intl, Information, Stdtype. Rows: Create a new entry; Add a standard to an existing entry; Add information to an existing entry; Delete an entire entry; Delete a standard; Delete a standard-type code; Delete an information code; Change usage data for an existing entry; Change gender data for an existing entry. Some cells are marked "Must be blank" or refer to Notes 1 through 6 below.]
Notes on the table
1) Required only if the necessary Info code is not already in the existing
dictionary entry. For example, if you add the Stdtype code TITLE_MTC, you
must specify the Info code TITLE unless one of those Info codes is already
specified in the existing dictionary entry.
2) Required if the Info field contains anything besides PHRASE_WRD.
3) Must include all of the Info codes listed in the existing dictionary entry.
4) Must include all of the Stdtype codes listed in the existing dictionary entry.
5) This field is ignored. UMD automatically deletes dependent standard types.
6) PREGENx or NAMEGENx only. The existing dictionary entry must contain a
corresponding gender code. For example, if the existing entry contains the gender
code NAMEGEN1, you may change it to any other NAMEGENx code.
Step 4: Build your custom parsing dictionary

After you put all of your entries in your transaction file, use UMD Build to build
your custom parsing dictionary.
UMD Build
To build your dictionary, run UMD Build. The easiest way to convey your
instructions to UMD is through the UMD configuration file.
1. Open a copy of the configuration file umd.cfg.
2. Type entries for the UMD Build parameters. Specify our dictionary,
parsing.dct, as your Source Dictionary.
For descriptions of the configuration-file parameters, see UMD
configuration file, umd.cfg on page 75.
# UMD Show
Output File Name (path & file name) .... =
Output File Type (See NOTE) ............ =
#
# UMD Build
Dictionary Type (See NOTE) ............. = Parsing
Source Dictionary (path & dct) ......... = c:\pathname\parsing.dct
Transaction File Name (path & file name) = c:\pathname\my_parse.trn
Target Dictionary (path & dct) ......... = c:\pathname\my_parse.dct
Verify Input File Only (YES/NO) ........ = NO
Error Message Log File (path & name) ... = c:\pathname\my_parse.log
Work Directory (path) .................. = r:\temp
3. Save the configuration file.

4. Run UMD with the /cfg option. For example:
umd /cfg my_parse.cfg
Verify and build
Before UMD builds your custom dictionary, it checks to make sure the entries in
your transaction file are valid. If a validation error or warning occurs, look at the
error log file. If an error occurred, fix your transaction file, then run UMD Build
again.
If the transaction file is free of errors, UMD builds your custom dictionary.
During the build process, UMD takes the source dictionary, makes the changes
and additions specified in your transaction file, and creates your custom
dictionary.
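If you rebuild often, you may want to wrap the build in a script. The Python sketch below assumes that umd is on your PATH and uses only the /cfg form of the command shown above. The helper names are hypothetical, and treating a non-empty error log as "needs review" is an assumption, not documented UMD behavior, so always read the log itself.

```python
import subprocess
from pathlib import Path


def umd_build_command(cfg_path, umd_executable="umd"):
    """Return the argument list for a UMD run driven by a configuration file."""
    return [umd_executable, "/cfg", str(cfg_path)]


def log_has_messages(log_path):
    """Return True if the error log exists and is not empty."""
    log_path = Path(log_path)
    return log_path.exists() and log_path.stat().st_size > 0


def build_dictionary(cfg_path, log_path):
    """Run UMD Build; return True when the log suggests nothing to review.

    Assumption: any content in the error log means the transaction file
    should be checked before the custom dictionary is used.
    """
    subprocess.run(umd_build_command(cfg_path), check=False)
    return not log_has_messages(log_path)
```

The log path passed to build_dictionary should match the Error Message Log File parameter in the configuration file, for example c:\pathname\my_parse.log.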
Step 5: Maintain and update your custom dictionary

Use your existing transaction file
If you want to update your custom dictionary, put your changes and additions in
your existing transaction file. Your custom dictionary will be much easier to
manage if you accumulate all your entries in one transaction file, rather than
scattering them among many files.
When you rebuild your parsing dictionary, always use our base parsing
dictionary, parsing.dct, as the source dictionary.
Keep your dictionary up to date
Whenever we send you a new parsing.dct file, build an updated custom
dictionary by running your transaction file against our new dictionary. This
allows you to benefit from the additions and improvements that we have made to
the base dictionary.
If you do not run your transaction file against each new base dictionary, the
differences between our base dictionary and your custom dictionary will increase.
This will affect your parsing results and impede our ability to provide technical
support.
Sample transaction: Add a new word

Suppose your data file contains the name line Harold Smith, ABSFC. You notice
that the word ABSFC is being parsed as a job title. However, ABSFC is really a
postname (Able Bodied Seaman First Class).
When you query ABSFC, you discover it is not in the parsing dictionary:
Enter> ABSFC
Text not found in dictionary.
To add the word to the dictionary, you would add the following record to your
transaction file:
Field name    Transaction entry
Action        N
Primary       ABSFC
Secondary     ABSFC
Usage         0
Intl          USENGLISH
Info          HONPOST
Stdtype       HONPOST_STD HONPOST_MTC
Capitalization
As you add words to the parsing dictionary, make a note of any words that have
unusual mixed-case capitalization. To get the correct mixed-case capitalization,
you must also add these words to your custom capitalization dictionary.
For example, if you add ABSFC to the parsing dictionary, you should also add it
to your custom capitalization dictionary. Otherwise, the mixed-casing will be
Absfc rather than ABSFC.
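The effect can be pictured with a short Python sketch. It is an illustration only: Python's str.capitalize() stands in for the product's default mixed-casing, which is certainly more elaborate, and CUSTOM_CASING is a hypothetical stand-in for entries in a custom capitalization dictionary.

```python
# Hypothetical custom capitalization entries: upper-case word -> preferred casing.
CUSTOM_CASING = {"ABSFC": "ABSFC"}


def mixed_case(word, custom=CUSTOM_CASING):
    """Return the custom casing if one exists, else a simple default casing."""
    return custom.get(word.upper(), word.capitalize())


print(mixed_case("ABSFC", custom={}))  # prints: Absfc
print(mixed_case("ABSFC"))             # prints: ABSFC
```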
Sample transaction: Add a title phrase

There is a lot of overlap between words that can be used in firm names and words
that can be used in job titles. For example, the words Vice, President, and
Marketing can all be used in firm names and in job titles. As a result, the parser
may incorrectly identify Vice President of Marketing as a firm name rather than a
title. To correct this kind of parsing behavior, you can add a title phrase to the
dictionary.
Two main tasks
To add a title phrase to the dictionary, you must do two things:

Make sure each word is in the dictionary and has the information codes
PHRASE_WRD and TITLE.
Enter the lookup form of the phrase so that the parser will find it (see
"Query a title phrase" on page 9). Otherwise, the entry will have no effect on
parsing results.
To add a title phrase to the dictionary:
1. Query the lookup form of the phrase (see "Query a title phrase" on page 9).
For example, to add the phrase Vice President of Marketing to the dictionary,
use the lookup form Vice Pres of Mktg.
2. If the phrase is not in the dictionary, create a new entry in your transaction
file. Use the lookup form as the primary and secondary:
Field name    Transaction entry
Action        N
Primary       Vice Pres of Mktg
Secondary     Vice Pres of Mktg
Usage         0
Intl          USENGLISH
Info          TITLE
Stdtype       TITLE_STD TITLE_MTC
3. Query each word in the original phrase (for example, Vice, President, of, and
Marketing). Make sure each word meets the following requirements:
It has the information code PHRASE_WRD.

It has the information code TITLE.
The first title match standard (TITLE_MTC) is the same as the word
used in the phrase entry.
4. If a word is not in the dictionary or does not meet the requirements listed in
Step 3, add the word (or modify it) by putting an entry in your transaction
file.
For our example, the word President is in the dictionary but is not identified
as a phrase word, so we need to mark it as a PHRASE_WRD. We also need
to mark the word of as a TITLE word and a PHRASE_WRD.
Field name    Entry for President    Entry for of
Action        A                      A
Primary       President              of
Secondary     Pres.                  of
Usage
Intl
Info          PHRASE_WRD             PHRASE_WRD TITLE
Stdtype                              TITLE_MTC TITLE_STD
For best results, perform steps 3 and 4 for variant spellings and
abbreviations of each word. For our example, we would check to make
sure that Pres and Mktg are marked as phrase words. This enables the
parser to recognize variant raw forms of the phrase (such as Vice Pres. of
Marketing, Vice President of Mktg., and Vice Pres. of Mktg.) in addition to
the original phrase Vice President of Marketing.
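The per-word requirements in step 3 lend themselves to a checklist. The Python sketch below is hypothetical throughout: ENTRIES is an invented dictionary excerpt that you would fill in from UMD Show query results, and phrase_word_problems is not a UMD function. It reports which words in a phrase still need a transaction entry.

```python
import string

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

# Hypothetical excerpt: word -> information codes and ordered TITLE_MTC standards.
ENTRIES = {
    "VICE": {"info": {"TITLE", "PHRASE_WRD"}, "title_mtc": ["Vice"]},
    "PRESIDENT": {"info": {"TITLE"}, "title_mtc": ["Pres."]},
    "MARKETING": {"info": {"TITLE", "PHRASE_WRD"}, "title_mtc": ["Mktg."]},
}


def phrase_word_problems(raw_words, lookup_words, entries=ENTRIES):
    """Map each word that fails a step 3 requirement to a list of reasons."""
    problems = {}
    for raw, lookup in zip(raw_words, lookup_words):
        entry = entries.get(raw.upper())
        if entry is None:
            problems[raw] = ["not in dictionary"]
            continue
        issues = [f"missing {code}" for code in ("PHRASE_WRD", "TITLE")
                  if code not in entry["info"]]
        # The first title match standard must equal the word used in the phrase entry.
        first = entry["title_mtc"][0] if entry["title_mtc"] else ""
        if first.translate(_STRIP_PUNCT).upper() != lookup.upper():
            issues.append("first TITLE_MTC differs from phrase entry")
        if issues:
            problems[raw] = issues
    return problems


result = phrase_word_problems(
    ["Vice", "President", "of", "Marketing"],
    ["Vice", "Pres", "of", "Mktg"],
)
print(result)  # prints: {'President': ['missing PHRASE_WRD'], 'of': ['not in dictionary']}
```

Each word reported here corresponds to one record you would add to the transaction file, as in the entries for President and of above.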
Sample transaction: Add a multiple-word firm name

If a multiple-word firm name such as Emery Worldwide is parsed incorrectly, you
can add the firm name to the dictionary.
If a firm name looks like a personal name, such as Hewlett Packard or
Johnson & Johnson, the procedure is different from the one shown on this
page. See "Sample transaction: Add a firm that looks like a personal
name" on page 22.
To the parser, a line looks like a personal name if all of the words in the
line are marked as NAME words. For example, Check N Go looks like a
personal name because the words Check, N, and Go are all NAME words.
Two main tasks
To add a multiple-word firm name to the dictionary, you must do two things:
Make sure each of the words is in the dictionary and has the
FIRMNAME information code.
Enter the lookup form of the firm name so that the parser will find it (see
"Step 1: Query the dictionary" on page 9). Otherwise, the entry will have no
effect on parsing results.
To add a multi-word entry to the dictionary:
1. If the firm name looks like a personal name (for example, Hewlett Packard,
Merrill Lynch, or Johnson & Johnson), see "Query a firm name that looks like a
personal name" on page 11.
2. In your transaction file, create a new entry for the lookup form of the firm
name (see "Query a multiple-word firm name" on page 10).
Field        Transaction entry
Action       N
Primary      Emery Worldwide
Secondary    Emery Worldwide
Usage        0
Intl         USENGLISH
Info         FIRMNAME
Stdtype      FIRM_STD FIRM_MTC
3. Make sure that both of the words in the firm name (for example, both Emery
and Worldwide) meet the following requirements:
The entry includes the FIRMNAME information code.

The first firm match standard (FIRM_MTC) is the word that you used
in your firm-name entry in step 2.
4. If the firm name doesn't meet the requirements in step 3, add an entry to your
transaction file.
Most often, you'll need to mark one of the words as a FIRMNAME word, as
shown here.
Field name   Transaction entry
Action       A
Primary      Emery
Secondary    Emery
Usage
Intl
Info         FIRMNAME
Stdtype      FIRM_MTC FIRM_STD
Sample transaction: Add a firm that looks like a personal name

Many firms are named after people (for example, Hewlett Packard or Merrill
Lynch). The parser often identifies these as personal names rather than firm names.
To correct this, you can add the firm name to the dictionary.
Two main tasks
If a firm name looks like a personal name [1], you must do two things:
Make sure each word is in the dictionary and has both the NAME and
FIRMNAME information codes.
Create an entry for the lookup form of the firm name.
To add a firm name that looks like a personal name:
1. Query the lookup form of the firm name (see "Query a firm name that
looks like a personal name" on page 11).
2. If the firm name is not in the dictionary, create a new entry in your
transaction file. Use the lookup form as the primary and secondary.
Field name   Transaction entry
Action       N
Primary      Robert W. Baird
Secondary    Robert W. Baird
Usage        0
Intl         USENGLISH
Info         FIRMNAME
Stdtype      FIRM_MTC FIRM_STD
3. Query each word. Make sure it is in the dictionary and is identified as both a
NAME and a FIRMNAME. If not, add the word (or modify it) by putting an
entry in your transaction file. In our example, Robert W Baird, all three words
are in the dictionary, but none has the FIRMNAME information code.
For each word, we would put an entry in the transaction file to add the
FIRMNAME information code, as shown here for Robert.
Field name   Transaction entry
Action       A
Primary      Robert
Secondary    Robert
Usage
Intl
Info         FIRMNAME
Stdtype      FIRM_MTC FIRM_STD
1. To the parser, a line looks like a personal name if all of the words in the line are marked as NAME words. For
example, Check N Go looks like a personal name because the words Check, N, and Go are all NAME words.
Sample transaction: Modify information codes

Suppose your data file contains the line "John Smith PsyD". You notice that this line
is parsed as a firm name rather than a personal name. Although the name of John
Smith's business might possibly be John Smith PsyD, you would prefer to parse
this as a name rather than a firm.
When you query the dictionary, you notice that PsyD is listed as a firm word:
Enter> PsyD
Usage: 0
Info Code(s): FIRMMISC
Standard(s) for PSYD:
- PSYD FIRM_MTC, FIRM_STD
In your custom dictionary, you could specify that PsyD is also an honorary
postname (Doctor of Psychology). To do this, modify the existing entry to add the
honorary-postname codes:
Field name   Transaction entry
Action       A
Primary      PsyD
Secondary    PsyD
Usage
Intl
Info         HONPOST
Stdtype      HONPOST_STD HONPOST_MTC
Notice that when you add a new information code, you must also specify at least
one standard for that type of information. In this case, we specified PsyD as the
standard and match standard for honorary postnames.
Sample transaction: Modify standards and standard-types

Suppose you want to standardize your data to make it as consistent as possible. In
job titles, the word Engineer is standardized to "Engineer," but you would prefer to
standardize it to "Eng." instead.
In the dictionary, the title standard for Engineer is "Engineer":
Enter>engineer
Usage: 0
Info Code(s): TITLE TITLE_ALONE TITLE_TERM PHRASE_WRD FIRMMISC
Standard(s) for ENGINEER:
- ENGINEER FIRM_STD, TITLE_STD
- ENGR. FIRM_MTC, TITLE_MTC
To change the title standard to "Eng.", you need to do two things:
Delete the TITLE_STD code from the standard Engineer.
Add the standard Eng. and identify it as a title standard (TITLE_STD).
You would put these entries in your transaction file:
Field name   Entry to add Eng.     Entry to delete TITLE_STD code
             as a standard         from the standard Engineer
Action       A                     D
Primary      Engineer              Engineer
Secondary    Eng.                  Engineer
Usage
Intl
Info
Stdtype      TITLE_STD             TITLE_STD
Sample transaction: Add an acronym for acronym conversion

You can convert a prename, postname, job title, firm name, or firm location to an
acronym. The Data Cleanse transform produces an acronym only when one is
available in the parsing dictionary; it doesn't generate initials by algorithm or
rule.
How the parser generates acronyms
Before looking for an acronym, the parser removes all punctuation and noise
words and gets the first appropriate match standard for each word. You must use
the same phrase that the parser will actually look up; otherwise, the parser won't
find your entry and won't generate the acronym.
To add an acronym
Procedure                                            Example
1. Start with the raw phrase.                        Certified Residential Specialist
3. For firm names, remove firm terminator words
   such as Corp, Inc, Ltd, Co, and so on.
4. Query each remaining word. Get the first          Cert. Residential Specialist
   appropriate match standard for each. For
   example, if you are adding a firm match
   standard, get the first FIRM_MTC. [1]
5. Remove all punctuation. This is the lookup form   Cert Residential Specialist
   of the acronym phrase.
1. If the word is not in the dictionary, create a new entry for the word (see page 17). If the word is in the dictionary but
does not list an appropriate match standard, create an entry to add the appropriate information code and match-standard
type (see pages 23 and 24). For example, for the word Residential we would add the information code HONPOST and the
standard-type code HONPOST_MTC.
To add the phrase to the dictionary, put an entry in your transaction file. Use the
lookup form of the phrase as the primary, and use the acronym itself as the
secondary:
Field name   Transaction entry
Action       N
Primary      Cert Residential Specialist
Secondary    CRS
Usage        0
Intl         USENGLISH
Info         HONPOST
Stdtype      HONPOST_ACR
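The procedure for deriving the lookup form can be sketched in Python. This is only an illustration of the steps described above, not the parser's actual logic: the MATCH_STANDARDS table, the FIRM_TERMINATORS set, and the function name lookup_form are hypothetical, and noise-word removal is left out. In practice you would obtain each match standard by querying the dictionary.

```python
import string

# Hypothetical first match standards, as you might collect them by querying each word.
MATCH_STANDARDS = {
    "CERTIFIED": "Cert.",
    "RESIDENTIAL": "Residential",
    "SPECIALIST": "Specialist",
}

# Hypothetical list of firm terminator words.
FIRM_TERMINATORS = {"CORP", "INC", "LTD", "CO"}

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def lookup_form(raw_phrase, is_firm=False):
    """Return the lookup form of a phrase: match standards with punctuation removed."""
    words = raw_phrase.split()
    if is_firm:
        # For firm names, drop terminator words such as Corp or Inc.
        words = [w for w in words
                 if w.translate(_STRIP_PUNCT).upper() not in FIRM_TERMINATORS]
    # Replace each word with its first match standard (or keep the word if unknown).
    standards = [MATCH_STANDARDS.get(w.translate(_STRIP_PUNCT).upper(), w)
                 for w in words]
    # Remove all punctuation from the result.
    cleaned = [s.translate(_STRIP_PUNCT) for s in standards]
    return " ".join(c for c in cleaned if c)


print(lookup_form("Certified Residential Specialist"))
```

The printed value is the string you would type in the Primary field of the transaction entry; the acronym itself goes in the Secondary field.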
Rules for working with match standards

Each entry in the parsing dictionary may include one or more match standards.
You can use match standards to improve the performance of your Match
transform.
How match standards work
To simplify this discussion, we discuss match standards for personal names.
In the dictionaries, a match standard is a one-way relationship, a pointer from one
name to another:
Alberto   -> Albert
Allen     -> Alan
Alfredo   -> Alfred
Alex      -> Alexander
Alphonso  -> Alphonse
Alonzo    -> Almon
Al        -> Albert, Alan, Alfred, Alexander, Alphonse, Almon
For the name Al, the match standards are Albert, Alan, Alfred, Alexander,
Alphonse, and Almon.
For the name Alberto, the match standard is Albert. (Likewise, for Allen the
match standard is Alan; for Alfredo, Alfred; and so on.)
If two different names return the same match standard, you can use your
matching software to do multiway comparisons and find a match. For example,
since Alberto and Al both return Albert as a match standard, your matching
software could match Alberto Smith to Al Smith.
Here are partial dictionary entries for the name Al and its direct match standards.
Primary      Standard
ALBERT       ALBERT
ALAN         ALAN
ALFRED       ALFRED
ALEXANDER    ALEXANDER
ALPHONSE     ALPHONSE
ALMON        ALMON
AL           ALBERT, ALAN, ALFRED, ALEXANDER, ALPHONSE
Notice that each match standard has its own entry, and that in that entry, the
standard is the same as the primary.
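To see why shared match standards let matching software pair Alberto Smith with Al Smith, consider this small Python sketch. The MATCH_STANDARDS table is a hypothetical excerpt built from the names discussed above, and the function names are illustrative only. Treating an unlisted name as its own standard is an assumption made for the example, not documented behavior.

```python
# Hypothetical excerpt: each name points to its match standards.
MATCH_STANDARDS = {
    "ALBERT": ["ALBERT"],
    "ALAN": ["ALAN"],
    "ALFRED": ["ALFRED"],
    "ALBERTO": ["ALBERT"],
    "ALLEN": ["ALAN"],
    "ALFREDO": ["ALFRED"],
    "AL": ["ALBERT", "ALAN", "ALFRED", "ALEXANDER", "ALPHONSE", "ALMON"],
}


def match_standards(name):
    """Return the set of match standards for a name (assumed: itself if unlisted)."""
    key = name.upper()
    return set(MATCH_STANDARDS.get(key, [key]))


def names_could_match(name_a, name_b):
    """Two names are match candidates if they share at least one match standard."""
    return bool(match_standards(name_a) & match_standards(name_b))


print(names_could_match("Alberto", "Al"))     # share ALBERT
print(names_could_match("Alberto", "Allen"))  # no shared standard
```

Because the relationship is one-way, the comparison is made on the standards that each name returns, never on the raw names themselves.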
Work with match standards
To use a word as a match standard, it should have its own entry in the dictionary
(or have its own entry in the transaction file). [2] In that entry, the word must be a
match standard of itself; in other words, the match standard must be the same as
the query word.
2. Technically, you could also use a word as a match standard if that word does not have an entry in the dictionary. For
example, you could use Michelangelo as a match standard even though Michelangelo is not in the dictionary. In practice,
however, if you use a word as a match standard, you'll probably also want that word to have its own entry in the dictionary,
so we make that assumption in our guidelines.
For example, you could use the word Dr as a match standard because it is in the
dictionary and has itself, Dr, as a match standard:
Enter> Dr
Usage: 0
Info Code(s): PRENAME_ALONE HONPOST PREGEN3 SUFFIX
Standard(s) for DR:
- DR. HONPOST_MTC, HONPOST_STD, PRENAME_MTC, PRENAME_STD
- DR ADDRESS_STD
If a word is a match standard of itself, you can use it as the same
type of match standard for another word.
Field name   Transaction entry
Action       A
Primary      Doc
Secondary    DR.
Usage
Intl
Info         PRENAME PREGEN3
Stdtype      PRENAME_STD PRENAME_MTC
Spelling and punctuation. The spelling and punctuation of the
Secondary in the transaction entry must exactly match the Standard in the
existing dictionary entry.
Chapter 2:
Custom capitalization dictionaries
This chapter explains how to create and maintain custom capitalization
dictionaries.
What is a capitalization dictionary?
In a custom capitalization dictionary, you can specify the correct casing for a
word in different situations. For example, you can specify that when MCKAYE is
used as a last name, the casing should be McKaye.
Why create a custom dictionary?
Most users find that our capitalization dictionary, pwcap.dct, produces good
mixed-case results. However, if a word is not cased as you would like, you can
enter that word in a custom capitalization dictionary.
For example, if you want the word TECHTEL to be cased as TechTel, you could
add the word TechTel to your custom dictionary.
Most of our products allow you to use two capitalization dictionaries at once, so
we expect that most users will employ our base dictionary as is and build their
own, separate dictionary as an extension. When you use your dictionary, you can
give it priority over ours by specifying your dictionary as Dictionary #2.
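The effect of giving one dictionary priority over another can be pictured with a short Python sketch. Everything here is an assumption made for illustration: the two dictionaries are modeled as plain mappings from (word, attribute) to casing, the function name cased is hypothetical, and the fallback to simple capitalization for unknown words is not documented behavior of the product.

```python
# Hypothetical stand-in for the base dictionary, pwcap.dct.
BASE_DICTIONARY = {
    ("PHD", "EVERY"): "PhD",
}

# Hypothetical stand-in for a custom dictionary that is given priority.
CUSTOM_DICTIONARY = {
    ("TECHTEL", "FIRM"): "TechTel",
    ("MCKAYE", "LASTNAME"): "McKaye",
}


def cased(word, attribute, priority=CUSTOM_DICTIONARY, fallback=BASE_DICTIONARY):
    """Look up the preferred casing, consulting the priority dictionary first."""
    key = word.upper()
    for dictionary in (priority, fallback):
        # Try the specific attribute, then an entry that applies to EVERY occurrence.
        for attr in (attribute, "EVERY"):
            if (key, attr) in dictionary:
                return dictionary[(key, attr)]
    # Assumed fallback for words found in neither dictionary.
    return word.capitalize()


print(cased("TECHTEL", "FIRM"))
print(cased("PHD", "POSTNAME"))
```

An entry in the priority dictionary wins whenever both dictionaries contain the same word, which is why you do not need to delete entries from the base dictionary.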
Create transactions, build your dictionary
For each entry you want to place in your capitalization dictionary, you will create
a record, or transaction, in a database called a transaction file.
After you make all of your entries in the transaction file, you will run the UMD
Build process. UMD Build reads the entries from your transaction file and creates
your custom dictionary.
Query our dictionary or yours
You can look up words in our dictionary, pwcap.dct, or your custom dictionary.
For example, if you want to see how we capitalize the word PHD, you can query
the dictionary:
c:\umd /s pwcap.dct
Using a Capital Dictionary.
Enter> PHD
PHD is capitalized as follows:
-PhD is used with EVERY occurrence.
For more details about querying a capitalization dictionary, see "Query your
dictionary" on page 33.
Step 1: Create a capitalization transaction file

A transaction file is a database that contains all of your entries for a particular
custom capitalization dictionary. Each entry in your transaction file will create
one entry in your custom dictionary.
Caution: If you are working with an existing custom dictionary, use the existing
transaction file for that dictionary. Do not create more than one transaction
file for each custom dictionary.
Create a transaction database
The quickest, easiest way to create a transaction database and its supporting files
is to use the output file feature of UMD Show. (See "UMD Show" on page 77.)
1. Use UMD Show to query our base capitalization dictionary, pwcap.dct.
Include the /o option on the command line. Use the base file name that you
plan to use for your custom dictionary, but with the extension .trn (for
example, my_cap.trn). [3]
2. Query a word that is in the dictionary, such as PhD.
3. Press Enter to save the query to your output file. Then press Esc to exit.
C:\umd /s pwcap.dct /o my_cap.trn /d dBase3
Using a Capital Dictionary
Enter> phd
PHD is capitalized as follows:
-PhD is used with EVERY occurrence
Enter a query, or press <ENTER> to save, or press <ESC> to exit
Enter>
Previous query appended to C:\my_cap.trn.
Enter> <Esc>
UMD Show will create an output database file (for example, my_cap.trn). You
can use this database as your transaction file. For instructions on adding your
entries to the transaction file, see "Step 2: Put your entries in the transaction file"
on page 31.
Keep supporting files with transaction file
When you create a transaction database as described above, UMD Show creates a
supporting file such as my_cap.def. If the transaction file is ASCII or delimited
ASCII, UMD Show also creates an additional file such as my_cap.fmt or
my_cap.dmt.
To open and read the transaction file, UMD needs these files. If you move the
transaction file to a new location, make sure you also move the corresponding
supporting files.
3. If you plan to use a database program or spreadsheet program to edit the file, we recommend creating a dBASE3 or
ASCII file. If you plan to use a text editor or word processor to edit the file, we recommend creating a delimited file.
Step 2: Put your entries in the transaction file

For each word you want to add to your custom capitalization dictionary, create a
record in your transaction file. Use a text editor or a database program to add
records to your transaction file.
Capitalization transaction entries
The table below describes what information to put in each field in your
transaction file.
Field       Data to enter
Action      Choose one:
            N  Create a new entry.
            D  Delete the existing entry from the source dictionary. [1]
Primary     Type the word in the preferred casing, 54 characters maximum. Type a
            single word (no spaces). Do not include any punctuation.
Attribute   Specify when this casing should be used. Include all that apply,
            separated by one space (no punctuation):
            PRENAME      Prenames
            FIRSTNAME    First names
            LASTNAME     Last names
            PRELASTNAME  Last-name prefixes
            POSTNAME     Postnames
            TITLE        Job titles
            FIRM         Firm data
            ADDRESS      Address lines [2]
            CITY         City names
            STATE        State names
            EVERY        Every occurrence
            You may type the entire word or just the portion shown in bold (for
            example, FIRSTNAME or FIRS).
            If the Action field contains D, you may leave this field blank.
1. This command is used rarely, if ever. If you want to delete an entry from your custom dictionary, simply delete that
record from your transaction file, then rebuild the dictionary. If you don't like the casing for a word in our base dictionary,
pwcap.dct, you don't need to delete the entry from our dictionary. Instead, put the desired casing in your custom
dictionary. When you process data, specify your dictionary as Dictionary #2 so that your entry will override ours.
2. Do not specify the ADDRESS, CITY, or STATE attribute unless the product that uses the dictionary has
address-parsing capability.
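Before you build, it can help to check each record against the field rules in the table. The following Python sketch shows one way to do that. It is not part of UMD: the function name validate_entry is hypothetical, it accepts only full attribute names (not the abbreviated forms), and it assumes that an N entry needs at least one attribute because only D entries are described as allowing a blank Attribute field.

```python
import string

# Attribute names taken from the field table.
ATTRIBUTES = {
    "PRENAME", "FIRSTNAME", "LASTNAME", "PRELASTNAME", "POSTNAME",
    "TITLE", "FIRM", "ADDRESS", "CITY", "STATE", "EVERY",
}


def validate_entry(action, primary, attributes):
    """Return a list of problems found in one capitalization transaction record."""
    problems = []
    if action not in ("N", "D"):
        problems.append("Action must be N or D")
    if not primary:
        problems.append("Primary is empty")
    if len(primary) > 54:
        problems.append("Primary is longer than 54 characters")
    if any(ch.isspace() for ch in primary):
        problems.append("Primary must be a single word (no spaces)")
    if any(ch in string.punctuation for ch in primary):
        problems.append("Primary must not include punctuation")
    names = attributes.split()
    if action == "N" and not names:
        # Assumption: only D entries may leave the Attribute field blank.
        problems.append("Attribute is required for a new entry")
    for name in names:
        if name.upper() not in ATTRIBUTES:
            problems.append("Unknown attribute: " + name)
    return problems


print(validate_entry("N", "McCathie", "EVERY"))
print(validate_entry("N", "Mc Kaye", "LASTNAME"))
```

An empty list means the record passed these checks; it does not guarantee that UMD Build will accept the record.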
Sample entries
Here are some sample entries.
Action   Primary    Attribute
         dos        PRELASTNAME
         McCathie   EVERY
Step 3: Build your custom capitalization dictionary

After you put all of your entries in your transaction file, use UMD Build to build
your custom capitalization dictionary.
UMD Build
To build your dictionary, run UMD Build. The easiest way to convey your
instructions to UMD is through the UMD configuration file.
1. Open a copy of the configuration file umd.cfg.
2. Type your instructions in the UMD Build parameters. For descriptions of the
parameters, see Appendix A.
# UMD Show
#
# UMD Build
Dictionary Type (see NOTE) ............. = Capital
...                                      = c:\pathname\my_cap.trn
...                                      = c:\pathname\my_cap.dct
Verify Input File only (YES/NO) ........ = NO
...                                      = c:\pathname\my_cap.log
...                                      = r:\temp
3. Save the configuration file. We recommend using the same base file name as
your dictionary, but with the extension .cfg (for example, my_cap.cfg).
4. Run UMD with the /cfg option. For example:
umd /cfg my_cap.cfg
During the build process, UMD reads the entries from your transaction file and
creates your custom dictionary.
Tips:
We recommend that you use the same base file name for the
transaction file and custom dictionary, and store both files in the
same location.
We recommend that you accumulate all your custom entries in one
transaction file and build your custom dictionary from the
transaction file only. If you do this, you will not need to specify a
source dictionary when you run UMD Build.
Step 4: Update your custom dictionary

If you want to add a new entry to your custom dictionary, edit your transaction
file, then rebuild your custom dictionary.
Use your existing transaction file
To update an existing dictionary, add your new entries to your existing transaction
file. Your custom dictionary will be much easier to manage if you accumulate all
of your entries in one transaction file, rather than scattering them among many
files.
To add a word, create a record for that word in the transaction file. To delete a
word, delete the record for that word from the transaction file. For dBASE3 files,
UMD supports non-destructive delete marking.
Rebuild your custom dictionary
After you add your new entries, run UMD Build as instructed on page 32. UMD
will rebuild your custom dictionary based on your updated transaction file.
Query your dictionary
You may wish to query your custom capitalization dictionary to see whether it
contains a particular word. To query a dictionary, run UMD Show (see "UMD
Show" on page 77). For example:
umd /s my_cap.dct
If you look up a word that is in the dictionary, UMD displays the preferred casing
and tells you when that casing is used:
c:\umd /s my_cap.dct
Using a Capital Dictionary.
Enter> TECHTEL
TECHTEL is capitalized as follows:
-TechTel is used with FIRM occurrences.
Chapter 3:
User-defined pattern matching (UDPM)
One way you can modify the Data Cleanse transform's behavior to suit your
needs is to edit the pattern file (default name is drludpm.dat). You edit the
pattern file in order to parse with your own data patterns.
Proceed with care. The pattern file controls how incoming data is
parsed, so changing this file changes how items are parsed. If you aren't
careful when you add user-defined patterns, you may receive unexpected
and unwanted results. Make backups of your files in case you need to
revert to previous parsing rules.
Test your results with QuickParse
Treat this ability as you would treat any feature you add to an application. Before
you integrate any modification into your enterprise, you should put your
modification through a stringent cycle of research, quality assurance, testing,
regression testing, and so on. Don't find out at release time that you've changed
something for the worse. As always, we recommend that you maintain adequate
backups and test your results with the QuickParse utility. For more information,
see "Check parsing results with QuickParse" on page 67.
Overview of UDPM
With the User-Defined Pattern Matching (UDPM) system, you can parse data that
the Data Cleanse transform currently doesn't parse. For example, records in your
database may contain a customer ID number that is unique to your company. With
UDPM, you can locate and parse this number.
UDPM lets you define one or more data patterns specific to the data you want to
parse, such as part numbers, customer account numbers, employee numbers, or
product numbers.
Parsing patterns
The Data Cleanse transform parses UDPM patterns that are either by themselves
on the input line or surrounded by noise text. For example, the input text could be
"Here is the part number 123AB" or just "123AB". If you have defined a pattern
that fits the part number 123AB, then the Data Cleanse transform parses 123AB.
How does Data Cleanse do it? You edit Data Cleanse's UDPM pattern file
(drludpm.dat).
Pattern file
The user-defined patterns are stored in a pattern file. The pattern file is a plain
text file that you can edit in any text-editing program. This pattern file consists of
a definition section and a rule section. The definition section is where you can
define subcomponents using a syntax that uses Perl (PCRE) regular expressions.
You can then combine these subcomponents with other elements of valid regular
expression in the rule section.
Regular expressions are powerful and offer a flexible, widely used syntax.
Combined with the Data Cleanse transform's rule language, they let you create
definitions and rules to parse your own patterns of data. For more information
about creating regular expressions, see "Introduction to regular expressions" on
page 39.
Working with the pattern file

This section explains what you need to know before you edit the pattern file, as
well as how to actually create your own definitions and rules for parsing your
own defined patterns.
User-defined pattern example
Each user-defined pattern file is composed of a definition section and a rule
section. Below is an example that defines a pattern for dates.
Heading line (this line is required):
    DRL UDPM Pattern File v1.0
Comment line (each comment line must start with a pound (#) symbol):
    # DO NOT MODIFY, MOVE, OR DELETE THE ABOVE LINE!
Definition section (defines the patterns for the subcomponents of the data that
you want to parse):
    Month = 0?[1-9]|1[0-2];
    Separator = [/-];
    Year = [0-9]{2,2};
    !end_def
Rule section (sets the rule for how to parse your user-defined components, based
on the subcomponent definitions you create):
    ILINE1:UDPM1:Date=({Month}){Separator}({Year});
Ending the file: You must end the file with a hard return. If you don't
insert a carriage return/linefeed at the end of the file, the file won't be
read correctly.
Before you can begin creating user-defined patterns, it's important that you
understand what makes up these two sections of the file. See the following:
"Definition section of a user-defined pattern" on page 38
"Rules section of a user-defined pattern" on page 38
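To make the relationship between the two sections concrete, the Python sketch below reads the example pattern file and expands each macro reference in the rule into its definition. It is a rough model only: it assumes that macros are substituted as plain text, the function name load_rules is hypothetical, and Python's re module stands in for the PCRE engine that the product uses.

```python
import re

PATTERN_FILE = """\
DRL UDPM Pattern File v1.0
# DO NOT MODIFY, MOVE, OR DELETE THE ABOVE LINE!
Month = 0?[1-9]|1[0-2];
Separator = [/-];
Year = [0-9]{2,2};
!end_def
ILINE1:UDPM1:Date=({Month}){Separator}({Year});
"""


def load_rules(text):
    """Return {rule label: expanded regular expression} for a pattern file."""
    macros, rules = {}, {}
    in_definitions = True
    for line in text.splitlines()[1:]:          # skip the required heading line
        line = line.strip()
        if not line or line.startswith("#"):    # skip blank and comment lines
            continue
        if line == "!end_def":                  # end of the definition section
            in_definitions = False
            continue
        left, right = line.rstrip(";").split("=", 1)
        if in_definitions:
            macros[left.strip()] = right.strip()
        else:
            label = left.split(":")[-1].strip()  # text after the last colon
            # Replace {Name} with the macro text; bounds such as {2,2} are untouched.
            rules[label] = re.sub(r"\{([A-Za-z_]\w*)\}",
                                  lambda m: macros[m.group(1)], right.strip())
    return rules


rules = load_rules(PATTERN_FILE)
print(rules["Date"])
print(re.search(rules["Date"], "Invoice date 11/04").groups())
```

Note that the parentheses around {Month} in the rule are what keep the alternation inside the Month macro from spilling over into the rest of the expression.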
Definition section of a user-defined pattern
In the definition section, you define the subcomponents that will make up the
data pattern that you want to parse. The definition you create is a combination of
a simple language specific to the Data Cleanse transform and Perl (PCRE) regular
expressions. The example and table below explain how the definition section of a
user-defined pattern is set up.
Month = 0?[1-9]|1[0-2];
Separator = [/-];
Year = [0-9]{2,2};
!end_def
Note that each line must end with a semicolon (;) except the !end_def line.
Element             Description
Macro name          The first part of each macro's definition is a name for the
                    macro, followed by the equals (=) sign. This name is later
                    used in the rule section of the user-defined pattern.
                    You can define any number of macros in the definition section.
Regular expression  Following the equals sign after the macro name, you use a
                    simple regular expression to define what the subcomponent
                    will equal.
End definition      The !end_def command indicates the end of the pattern
                    definition. This is a required element.
For more information about creating regular expressions, see "Introduction to
regular expressions" on page 39.
Rules section of a user-defined pattern
In the rule section, you must explain the rule or rules for how to parse the
subcomponents that you defined in the definition section. The example and table
below explain how the rule section of a user-defined pattern is set up.
ILINE1:UDPM1:Date=({Month}){Separator}({Year});
Element           Description
Input and         Each rule begins with the input field and output field,
output fields     separated by colons.
Rule name         Each rule includes a unique name (pattern label), followed by
                  the equals (=) sign. This label is used in your job when you
                  use the udpm_fldbylabel function for output.
Macros            To add a macro to a rule, you must use the macro name as
                  designated in the definition section. The macro name must be
                  surrounded by curly brackets { }. You can use any number of
                  macros per rule.
Introduction to regular expressions

What is a regular expression?
A regular expression is a string of characters that a computer application can use
as a pattern for assessing whether data fed to it matches that pattern. The program
can then operate on data that fits the pattern, regardless of its non-essential
variety.
The regular expression can be as precise or as fuzzy as is needed to catch things
that match, but not catch things that don't.
Type      Example               Description
Precise   [Hh]orse              This example allows only these two options for
                                matching: Horse and horse.
Fuzzy     [A-Z][[:digit:]]{5}   This example allows any data that begins with an
                                uppercase letter and is followed by five digits.
                                This could be a large number of matches.
PCRE (Perl)
Keep in mind that there are several varieties, or families, of regular expressions.
When you refer to additional documentation on functions and capabilities of
regular expressions, be sure you're researching PCRE (Perl) regular expressions.
How does it work?
The process of using regular expressions follows these steps:
1. Data is fed to the processing engine (specified with input fields).
2. The processor tries the entire regular expression from the position before the
first input character. If it finds a matching string from there, it stops looking.
3. If no matching string is found there, the processor bumps forward one
character and tries the entire regular expression again.
4. If no matching string is found there, the process is repeated at each input
character until a matching string is found or the end of the input is reached.
Pattern:     ([0-9]{4}[[:space:]]){4}
Input data:  AB 1234 1234 1234 1234
The input data didn't match the pattern in the first, second, or third positions of
the input data. But in the position just before the fourth character of the input
data, the processor finds that the remaining data string matches the pattern (four
digits and a space, repeated four times).
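The scanning behavior described above can be reproduced with a few lines of Python. This is a sketch under two stated assumptions: Python's re module does not support POSIX classes such as [[:space:]], so \s is swapped in, and the input is assumed to end with a space so that the fourth group can match. The function name first_match_position is hypothetical.

```python
import re

# \s replaces [[:space:]] because Python's re module lacks POSIX character classes.
PATTERN = re.compile(r"([0-9]{4}\s){4}")

# Assumed input: the sample data followed by one trailing space.
TEXT = "AB 1234 1234 1234 1234 "


def first_match_position(pattern, text):
    """Try the whole pattern at each position in turn, as the steps describe."""
    for position in range(len(text) + 1):
        if pattern.match(text, position):   # anchored attempt at this position
            return position
    return None


print(first_match_position(PATTERN, TEXT))
print(PATTERN.search(TEXT).start())
```

Both calls report the same zero-based offset, which corresponds to the position just before the fourth character of the input. The built-in search method performs the same left-to-right scan internally.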
Operators
For operators, regular expressions use special characters (also known as
metacharacters). These can operate on a valid regular expression in powerful
ways. Operators normally follow a character or group of characters. Available
operators include the following:
Char  Description                                    Example
?     Zero or one occurrence of the preceding        [1-9]? equals no digit or any one
      element.                                       digit from 1 to 9.
*     Zero or more occurrences.                      [1-9]* equals no digit or any
                                                     number of digits from 1 to 9.
+     One or more occurrences.                       [1-9]+ equals at least one digit
                                                     from 1 to 9.
.     Any single character of input.
|     Logical OR.                                    a|b equals either a or b.
^     Position indicator: the beginning of the       [^skty] equals any single
      input. When not a position indicator:          character except s, k, t, or y.
      negation of the next element.
$     Position indicator: the end of the input.
()    A subpattern. Subpatterns are regular          (0?[1-9])|(1[0-2])
      expressions within the parentheses, and are
      used to group constructs.
{}    A bound, with one integer (to specify a        {2,5} means at least two times
      number) or two integers, separated by a        but no more than five times.
      comma (to specify a range).                    {2,2} means exactly two times.
      Note: If the { is followed by a character
      other than a digit, the { is an ordinary
      character, not the beginning of a bound.
\     Marks the next character as one of the         \$ means the dollar symbol
      following:                                     (instead of the usual meaning
      a special character or a literal (toggles)     of $ as a position indicator).
      a backreference (not used in Data Cleanse
      pattern matching)
      an octal escape
      Note: It's illegal to end a regular
      expression with the \ character.
Using metacharacters literally
To use any of the metacharacters as a literal character, precede it with the
backslash character (\). For example, to find cost as a dollar sign and two digits,
you could create a subexpression like this: \$[0-9][0-9]
Character Classes
You may also see or use a number of character class names inside brackets. These
class names must be surrounded by colons and brackets, as well, so the
expression will look like:
[[:alnum:]]
or
[[:digit:]]
Name     Description
alnum    Alphabetic characters and numeric characters.
digit    Digits.
punct    Punctuation characters.
alpha    Alphabetic characters.
graph    Non-blank (not spaces, control characters, and so on).
space    Whitespace characters (space, tab, newline, carriage return, and so on).
blank    Space and tab.
lower    Lowercase alphabetic characters.
upper    Uppercase alphabetic characters.
cntrl    Control characters.
print    Non-blank (not control characters and the like, but includes spaces).
xdigit   Digits allowed in a hexadecimal number (0-9a-fA-F).
Creating regular expressions

This section is an overview of some basics you will need to know when you
create regular expressions.
Brackets [] present options for a single input character
Brackets enclose lists, any of whose members can match a single input character.
For example:
This string:   Will match:
[aei]          only an a, e, or i character
[^aei]         any one character not an a, e, or i
Negation. If the first character of the list is ^, then the list matches only what is
not in the character set.
Range. A hyphen specifies a range of characters as defined by their collating
sequence. For example:
[a-d]   matches a, or b, or c, or d
[2-4]   matches 2, or 3, or 4
[A-Z]   matches any uppercase letter
Grouping characters with parentheses (...)
You can group characters into strings (subexpressions) by using parentheses.
Parentheses enclose strings of characters that must each match in order.
This string:   Will match:
(aei)          the following string: a, then e, then i
(mn|xy)        a string of either mn or xy
Each subexpression is a complete regular expression, which can also include
other subexpressions. These subexpressions can become very useful in your job.
For example: ((aei)(ou)) is three subexpressions:
first, the string aei must be found (the first subexpression)
followed immediately by the string ou (the second subexpression)
the combination (the entire expression) is the third subexpression
For example: ((aei)[[:digit:]]{9}) includes two subexpressions:
first, the string aei must be found (the first subexpression)
followed immediately by nine digits
the combination (aei and nine digits) is the second subexpression
Quantity {} indicators operate on the entire subexpression
For example: ((aei)?(ou)+) includes two inner subexpressions:
first, the string aei must be found zero or one times
followed immediately by at least one occurrence of the string ou.
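You can inspect how subexpressions are captured by trying the examples in Python. This sketch uses Python's re module as a stand-in for PCRE, and the helper name subexpressions is hypothetical. Be aware that Python numbers groups by the position of the opening parenthesis, so the enclosing combination is reported first rather than last.

```python
import re


def subexpressions(pattern, text):
    """Return the text captured by each parenthesized group, or None if no match."""
    match = re.search(pattern, text)
    return match.groups() if match else None


# The outer group comes first, then (aei), then (ou).
print(subexpressions(r"((aei)(ou))", "xxaeiouxx"))

# (aei)? may match nothing, and (ou)+ keeps only its last repetition.
print(subexpressions(r"((aei)?(ou)+)", "ououou"))
```

In the second call, the optional group did not take part in the match, so Python reports it as None, while the quantifier applied to (ou) as a whole let the expression consume all three repetitions.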
Example of defining a pattern

First and foremost, know your data. You have to know what the data represents in
order to know what is acceptable and what is not. Only then can you create a
regular expression that's flexible enough to match the full range of good data,
while at the same time excluding even close matches of data that isn't valid for
your pattern.
There are many ways to approach the task. Here is a general step-by-step
approach as applied to a fairly straightforward example: finding an account
number that should be a series of two alphabetical characters and either six or
seven numbers.
Step 1: Write examples of good data
Write several examples of the data you want to parse. For example:
AN3463599
AL4659899
AB342600
Step 2: Identify any literal elements
Decide if there are any characters (either alphabetical or numeric) that must be
seen without variance, for the input to be valid. If so, put a pair of brackets around
each. For example, you may decide that the first letter must always be A, but
that all the others may vary to some extent.
AN3463599
AL4659899
AB342600
Pattern so far: [A]
Step 3: Define the ranges of the variable elements
For each of the variable elements, decide the range of the optional data. Here,
we've decided that the second letter can be any of four possible letters: B, C, L, or
N. In addition, the last two digits must be either 99 or 00 (representing a year).
Pattern so far: [A][BCLN] for the two letters, and (99|00) for the last two digits
Step 4: Define acceptable variances in quantity
Define any acceptable differences in the quantity of the elements. In this example,
it's acceptable for the input data to have either 4 or 5 digits after the first two
letters, and before the last two digits. Use quantity indicators to show that
acceptable range:
[A][BCLN][0-9]{4,5}(99|00)
Step 5: Expand the pattern for acceptable deviances
Finally, think about how your data was input, or what its source was, in order to
predict what sort of deviances you might find in the format of the data that would
not precisely fit your pattern so far, but which would still be useful data (that is,
not exactly right, but close enough to be usefully parsed).
In this example, you may note that some of your record data starts with small
letters instead of capital letters, and that in some cases, the data was input with a
space between the two letters and the numbers, like these:
AN3463599
AL4659800
Ab 3452399
an 623400
[Aa][BCLNbcln][[:space:]]?[0-9]{4,5}(99|00)
The additions [Aa] and [BCLNbcln] make the small letters acceptable for the first
two alpha characters.
The added character class [[:space:]]? makes it OK for one space or tab to be after
the first two alpha characters (or not).
Test and refine
Try out your expression with QuickParse or by running the job on a small portion
of your data. If necessary, adjust your expression to suit the results.
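Before running a job, it can also help to try the expression against your written examples in a scripting language. The sketch below assumes Python's re module, which does not support the [[:space:]] class, so [ \t] (one space or tab) stands in for it; the function name is_account_number is hypothetical, and QuickParse remains the authoritative check.

```python
import re

# Translation of [Aa][BCLNbcln][[:space:]]?[0-9]{4,5}(99|00).
# Assumption: [ \t] replaces [[:space:]], which Python's re does not support.
ACCOUNT = re.compile(r"[Aa][BCLNbcln][ \t]?[0-9]{4,5}(99|00)")

def is_account_number(text):
    """Return True when the whole text fits the account number pattern."""
    return ACCOUNT.fullmatch(text) is not None

good = ["AN3463599", "AL4659800", "Ab 3452399", "an 623400"]
bad = ["AX3463599",   # second letter is not B, C, L, or N
       "AN3463598",   # does not end in 99 or 00
       "AN34699"]     # too few digits before the year

for sample in good + bad:
    print(sample, is_account_number(sample))
```

Keeping a list of known-bad values next to the known-good ones makes it easier to see whether a later change to the expression has made it too permissive.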
Literal Characters
(alpha, numeric, &)
To specify a literal character, just include it in the expression. You can use any of
the extended ASCII characters, like A through Z, a through z, 0 through 9, and
the specialized characters, like @, #, $, and so on.
Note: Some of these characters are Regular Expression MetaCharacters,
and must be treated as described next.
Alternate expressions
You can search for more than one variation of a pattern. For example, here's how
you could search for data that fits the two patterns of Wisconsin license plates:
Through 1999, these were 3 letters followed by 3 numbers (ABC123). Then,
starting in 2000, the state switched to a pattern of 3 numbers followed by 3 letters
(123ABC).
In the pattern matching file:
1. Set up one line to match the early numbers (for example, ABC123).
2. Add a second line to match the late numbers (for example, 123ABC).
3. End the definition section.
4. Make a rule that accepts either expression.
The four lines below correspond to steps 1 through 4:
early=[A-Z]{3}[[:space:]]?[[:digit:]]{3};
late=[[:digit:]]{3}[[:space:]]?[A-Z]{3};
!end_def
IUDPM1:UDPM1:wis_plate=({early}|{late});
Implications for output
Alternative expressions don't complicate output, because output is governed by
the rules, not the definitions. In this example, data that matches either the early
or late alternatives is considered a match for the wis_plate rule.
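The alternation can be tried outside the application as a quick sanity check. This sketch assumes Python's re module, with [ \t] standing in for [[:space:]] and [0-9] for [[:digit:]]; it follows the prose description of three letters and three numbers, and the name is_plate is hypothetical.

```python
import re

# Definitions, translated to Python syntax (an assumption for illustration).
early = r"[A-Z]{3}[ \t]?[0-9]{3}"   # through 1999: ABC123
late = r"[0-9]{3}[ \t]?[A-Z]{3}"    # from 2000 on: 123ABC

# The rule accepts either definition, like ({early}|{late}).
wis_plate = re.compile("(" + early + "|" + late + ")")

def is_plate(text):
    """Return True when the whole text matches either plate format."""
    return wis_plate.fullmatch(text) is not None

for sample in ["ABC123", "123ABC", "ABC 123", "AB1234", "ABC12"]:
    print(sample, is_plate(sample))
```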
Multiple rules
You can also search for more than just one pattern. For example, suppose you also
want to search for data that matches the pattern of a Wisconsin auto title number,
which is: 2 year digits, 3 digits (0-9), 1 letter (A-Z), 4 digits (0-9), a hyphen, and
finally 1 digit (0-9). An example of a Wisconsin auto title is 95172L1031-0.
In the pattern matching file:
1. Keep the expressions for the license plates, and define another for titles.
2. Add another rule for the title expression.
The title definition corresponds to step 1, and the wis_title rule to step 2:
early=[A-Z]{3}[[:space:]]?[[:digit:]]{3};
late=[[:digit:]]{3}[[:space:]]?[A-Z]{3};
title=[0-9]{5}[A-Z][0-9]{4}[-][0-9];
!end_def
IUDPM1:UDPM1:wis_plate=({early}|{late});
IUDPM1:UDPM1:wis_title=({title});
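To see how several rules share one set of definitions, the sketch below builds both rules from a dictionary of definitions and reports which rules a value satisfies. It assumes Python's re module; the title expression follows the prose description (five digits, one letter, four digits, a hyphen, one digit), and the names definitions, rules, and matching_rules are hypothetical.

```python
import re

# Definition section, translated to Python syntax (an assumption).
definitions = {
    "early": r"[A-Z]{3}[ \t]?[0-9]{3}",
    "late": r"[0-9]{3}[ \t]?[A-Z]{3}",
    "title": r"[0-9]{5}[A-Z][0-9]{4}[-][0-9]",
}

# Rule section: each rule refers to definitions by {name}.
rules = {
    "wis_plate": "({early}|{late})",
    "wis_title": "({title})",
}

compiled = {name: re.compile(expr.format(**definitions))
            for name, expr in rules.items()}

def matching_rules(text):
    """Return the names of all rules that the whole text matches."""
    return [name for name, rx in compiled.items() if rx.fullmatch(text)]

for sample in ["95172L1031-0", "ABC123", "hello"]:
    print(sample, matching_rules(sample))
```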
Example of a user-defined pattern file

In this section, we take a hypothetical situation and walk you through the steps to
set up the user-defined pattern file in order to parse the necessary data. This
section only explains how to set up the user-defined pattern file in order to get the
parsing results that you want.
The scenario
Assume your database contains standard customer information, such as firm
name, contact name, and so on. In addition, you have a specific Payment Type/
Number field that this application doesn't currently parse. The Payment Type/
Number field includes a code of either P, indicating payment by purchase
order, or C, indicating purchase by credit card. This field also contains either a
purchase order number or a credit card number, corresponding to the payment
code.
If you put the database into a table format, it might look something like this:

Firm                   Contact         Payment Type_Number
Imaginary Industries   Joe Edwards     Ppo123-456
Fake, Inc.             Mary Peterson   c1234-5678-9123-4567

Step 1: Identify the subcomponents
Before you can set up the definition section of your pattern file, you have to
decide what subcomponents you might look for in the data you want to parse. You
don't need to worry about the Firm field or Contact field, because the application
can already parse those types of data. You need only worry about the Payment
Type/Number field, which can be composed of three main subcomponents:
  Payment type code
  Purchase order number
  Credit card number
Step 2: Define the subcomponents
Now that you've identified the subcomponents that you need, you can go about
defining them. The first subcomponent is the payment type abbreviation. This
code is always going to be either C for credit card or P for purchase order.
However, as shown in the example table, it may be entered as either lowercase or
uppercase.
To edit the pattern file, open it in any text editing program. We recommend you
make a backup copy of the drludpm.dat file, in case you want to revert to the
original file later.
Payment type subcomponent
As with all definitions for user-defined patterns, you need to name it first, and
then use regular expressions to look for the pattern. The definition for this
subcomponent should look like this:
payment_type=[Pp]|[Cc];
With this definition, we are telling the application to look for a P or a C that can be
uppercase or lowercase. Inside brackets, list the acceptable characters without a
pipe between them; a pipe inside brackets is treated as a literal character.
Purchase order subcomponent
The next subcomponent is a purchase order number. In the example, a purchase
order always has the format POxyy-yyz, where x is either 1 or 2, y is any digit
from 0 to 9, and z is any digit from 0 to 9 (but may not exist for all purchase
orders). In this case, the definition would look like this:
po=[Pp][Oo][12][0-9][0-9]-[0-9][0-9][0-9]?;
Credit card subcomponent
The last subcomponent is a credit card number, in the typical format of
xxxx-xxxx-xxxx-xxxx, where x is any digit from 0 to 9. You also want to check for a
hyphen or space between the groups of digits, though this hyphen or space may not
be there. The definition would look like this:
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
Completed definition section
After adding the !end_def command to designate the end of the subcomponent
definitions, the file would look something like this:
payment_type=[Pp]|[Cc];
po=[Pp][Oo][12][0-9][0-9]-[0-9][0-9][0-9]?;
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
!end_def;
Step 3: Define the rule(s)
If you closed the file now, the application would do nothing differently. Though
you have explained some patterns, you need to create a rule to actually look for
those subcomponents, as well as to explain how those subcomponents should
appear before they are parsed.
First you indicate the fields the data is coming in on and going to. You separate
the names of these input and output fields with a colon.
Next you name the rule, and explain it in terms of the subcomponents that you've
defined. You want to tell the application to look for a payment_type
subcomponent, immediately followed by either a po subcomponent or credit_card
subcomponent. The rule only applies for the input field specified. The rule would
look like this:
ILINE1:UDPM1:account_info=({payment_type})({po}|{credit_card});
In this rule, ILINE1 is the input field, UDPM1 is the output field, and the part that
starts with account_info is the pattern. Each parenthesized group in the pattern is
a subpattern.
Remember that the order listed in the rule is very important. The Data Cleanse
transform will only parse a rule if it finds the subcomponents in the exact order
listed in your rule.
Step 4: Save the modified pattern file
Now that you have completed the sample file, it should look like this:
DRL UDPM Pattern File v1.0
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
po=[Pp][Oo][12][0-9][0-9]-[0-9][0-9][0-9]?;
payment_type=[Pp]|[Cc];
!end_def;
ILINE1:UDPM1:account_info=({payment_type})({po}|{credit_card});
Save the file to save your user-defined pattern.
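To make the relationship between the definition section and the rule section concrete, here is a minimal sketch that reads a pattern file of this shape and expands each {name} reference in a rule into its definition. It is an illustration only, not how the Data Cleanse transform is implemented: it assumes Python's re module, writes the letter classes as plain bracket sets such as [Pp], and uses the hypothetical function name load_rules.

```python
import re

PATTERN_FILE = """\
DRL UDPM Pattern File v1.0
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
po=[Pp][Oo][12][0-9][0-9]-[0-9][0-9][0-9]?;
payment_type=[Pp]|[Cc];
!end_def;
ILINE1:UDPM1:account_info=({payment_type})({po}|{credit_card});
"""

def load_rules(text):
    """Return {rule_name: (input_field, output_field, compiled_regex)}."""
    definitions, rules, in_definitions = {}, {}, True
    for raw in text.splitlines()[1:]:            # skip the header line
        line = raw.strip().rstrip(";")
        if not line:
            continue
        if line == "!end_def":                   # definitions end, rules begin
            in_definitions = False
            continue
        left, expression = line.split("=", 1)
        if in_definitions:
            definitions[left.strip()] = expression
        else:
            input_field, output_field, name = left.split(":")
            for def_name, def_regex in definitions.items():
                # Wrap each definition so its alternatives stay together.
                expression = expression.replace(
                    "{" + def_name + "}", "(?:" + def_regex + ")")
            rules[name] = (input_field, output_field, re.compile(expression))
    return rules

rules = load_rules(PATTERN_FILE)
in_field, out_field, account_info = rules["account_info"]
for value in ["Ppo123-456", "c1234-5678-9123-4567", "x999"]:
    match = account_info.fullmatch(value)
    print(value, match.groups() if match else None)
```

The two captured groups correspond to the two parenthesized subpatterns in the rule, which is the same split that the individual subcomponent output fields expose.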

Step 5: Submit user-defined data to Data Cleanse
Now that you've created a user-defined pattern, you can configure your
application to apply the rules to incoming records. You define your input fields in
the rule section of the pattern file.
Use the ILINE field when your user-defined data appears on a line mixed in with
other data. If your user-defined data is in its own discrete field, use one of the
IUDPM1 ... IUDPM4 fields. Use this field in the rule section of your file.
Step 6: Retrieve user-defined data
Finally, you can configure your application to retrieve the parsed user-defined
data that you have defined in your pattern file.
You can request the entire user-defined field (as defined in your rule) with the
UDPM field. Or, you can request individual subcomponents in your rule with the
UDPM_SUB1-5 components.
Chapter 4:
Modify the rule file
One of the ways you can modify the Data Cleanse transform's behavior is by
creating or editing parsing rules in the rule file (drlrules.dat). The rule file
controls parsing name data, firm data, and addresses.
Test your results with QuickParse
Proceed with care. The rule file controls how incoming data is parsed.
Accordingly, changing this file changes how items are parsed. Before you
change the parsing rules, proceed with great caution. If you aren't careful
when you add or edit parsing rules, you may receive unexpected and
unwanted results. Make backups of your files just in case you need to
revert to previous parsing rules.
Treat this ability as you would treat any feature you add to an application. Before
you integrate any modification into your enterprise, you should put your
modification through a stringent cycle of research, quality assurance, testing,
regression testing, and so on. Don't find out at release time that you've changed
something for the worse. As always, we recommend that you maintain adequate
backups and test your results with the QuickParse utility. For more information
see "Check parsing results with QuickParse" on page 67.
What is the rule file?

The rule file (drlrules.dat) controls how the Data Cleanse transform parses
groups of output type subcomponents for name and firm data.
Pre-defined rules
The Data Cleanse transform already provides hundreds of rules for many
different possible combinations of data. These rules will likely satisfy the parsing
needs of most users.
However, you may encounter data that isn't being parsed as you'd like it to be.
Or, maybe you would like to tweak a rule so that it returns a different confidence
score by adding a confidence booster. In situations like this, it is very handy to be
able to edit the rule file.
General guidelines
When modifying the rule file, you should keep the following points in mind:
Start small
You should create new rules that define very specific situations. Start
conservatively with very narrow parameters. Only after you master narrowly
defined rules should you proceed to create rules that cover broader situations.
Be careful
You should always double-check the syntax and take great care when you apply
operators. It's easy to enter an inappropriate operator, and it may not be as easy
to spot it later.
Confidence may affect results
If the item parsed incorrectly and you want to write a rule, the confidence still
factors in. Because of this, your results might not be exactly as you had
anticipated.
Test results
You should always test your results very thoroughly. See "Check parsing results
with QuickParse" on page 67 for more information.
How the rule file is organized

The rule file has a straight-forward organization. It consists of a header,
explanatory information, and parsing rules grouped by data type.
The header
The file's header identifies the rule file. You must not alter or delete the header.
DRL Rule File v1.0;
# DO NOT EDIT, MODIFY OR REMOVE THE ABOVE LINE!!!!!
#
Explanatory information
Lines that start with a pound sign (#) are commented out.
# The following types will be used in each example
#
# NAME_DESIGN = ATTN
# PRENAME = MR
# NAME_STRONG_FN = JOHN
Parsing rules by data type
Lastly, the rule file consists of rules grouped by data type. The file consists of rules
for several types of data. Groups of rules include:
  Name rules
  Dual name rules
  Firm rules
  Address Line rules
  Last Line rules
  Optional rules, not enabled by default
Each group starts with a banner comment (the group), followed by the individual
rules (the rule):
#######################################################
#                                                     #
#                     NAME RULES                      #
#                                                     #
#######################################################
#######################################################
#
# Titles by themselves
#
# Admin
#
title1 = TITLE_ALONE;
options = begin : end;
action = PERSON conf: 40;
PERSON = 1 : TITLE : L;
end_action
Rule example
The following is an example of a rule that already exists in the rule file
(drlrules.dat).
The rules within the rule file can be divided into two main sections: the definition
section and the action section. These sections can then be further divided into
smaller components. Though each rule is unique, all rules follow the same
structure as explained in the next sections of this chapter.
When you work with rules, keep in mind the differences between the rule file and
the pattern file. Though the pattern file has only one definition section, the rule
file has a separate definition section for each rule defined in the file.
Definition section
Here you name the rule and define what type of data pattern to look for to match
this rule. Lines that start with a pound sign (#) are commented out.
#######################################################
#
# Prename with last name and prename with first, last name
#
# Mr Smith and Mrs Mary Jones
#
nfdual34 =
# Prename
PRENAME_ALONE +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER +
# connector
CONNECTOR +
# Prename
PRENAME_ALONE +
# first name
INITIAL | PREFIRST] +
# last name
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER;
Action section
Here you specify the action that is performed when the rule is matched.
action = PERSON : D;
PERSON = 1 : PRENAME : 1;
PERSON = 1 : LAST_NAME : 2;
PERSON = 2 : PRENAME : 4;
PERSON = 2 : FIRST_NAME : 5;
PERSON = 2 : LAST_NAME : 6;
PERSON = 1 : NAME_CONNECTOR : 3;
end_action
Definition section of a parsing rule

The main purpose of the definition section is to define the components that the
application is looking for in order to match the rule. The diagram and table below
explain the pieces that make up a definition section of a rule.
#######################################################
#
# Prename with last name and prename with first, last name    (Rule description line)
#
# Mr Smith and Mrs Mary Jones                                 (Example data line)
#
nfdual34 =                                                    (Rule label)
# Prename
PRENAME_ALONE +                                               (Rule definition)
# last name
Component
Description
Rule description line
In this line, you can designate a description for the rule. This is
an optional line, and therefore it must begin with at least one
pound sign (#) so the application treats it as a comment. This
line is helpful to use so you know what the rule is intended to
parse.
Example data line
Here you can enter an example of the data that you will parse
with the rule. We recommend that you use such a line, because
it is helpful in locating the rule you want to edit.
This is an optional line, and therefore it must begin with at least
one pound sign (#) so that it is treated as a comment.
Rule label
The name, or label, for this parsing rule.
Rule definition
The rule definition lists the components that make up the parse.
This line and the components that make it up are described in
more detail in the following section.
If you think of the rule definition as an equation, it may help you understand it.
The rule label (before the equal sign) can be equated with the description (after
the equal sign).
nfdual34 =           (rule label)
# Prename
PRENAME_ALONE +      (rule definition)
# last name
Rule label
The rule label must be unique; no two rules can have the same label. The table
below explains how the rule label is created.

Character    Value   Description
First        n       Use n in the first character spot to indicate that this is a
                     name rule. If you use n, you must also specify a second
                     character.
First        f       Use f in the first character spot to indicate that this is a
                     firm rule.
Second       f       If the first character is n, use f in the second character
(only if             spot to indicate that the name order of the components in
first                this rule is first name first.
character    l       If the first character is n, use l in the second character
is n)                spot to indicate that the name order is last name first.
             a       If the first character is n, use a in the second character
                     spot to indicate that the name order is ambiguous.
Additional   any     Use any combination of letters, numbers, or underscore
                     characters for any additional characters in the rule label.
                     The only stipulation is that each rule label is unique.

Note: Although the additional characters for a rule label are completely up
to you, you should label the rule so you can understand it and you can
separate it from others.
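The labeling convention can be summarized in a few lines of code. The sketch below is illustrative only; the function name describe_label and the label firm_example are hypothetical, and labels that start with another character (such as title1 in the shipped rule file) are simply reported as "other".

```python
# Second-character meanings for name rules, taken from the table above.
NAME_ORDERS = {"f": "first name first", "l": "last name first", "a": "ambiguous"}

def describe_label(label):
    """Return (rule kind, name order) for a rule label."""
    if label.startswith("n"):
        order = NAME_ORDERS.get(label[1:2])
        if order is None:
            raise ValueError("a name rule needs f, l, or a as its second character")
        return ("name", order)
    if label.startswith("f"):
        return ("firm", None)
    return ("other", None)

for label in ["nfdual34", "nfname15_extralast", "firm_example", "title1"]:
    print(label, describe_label(label))
```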
Rule definition
The rule definition is a combination of token types that the application looks for
when parsing data. This section always follows the equals sign (=).
In this section of the rule file you can use only certain dictionary types. For a
listing of dictionary types, see the Information codes in Appendix C.
In addition to dictionary types, you can use some token types that are not
dictionary components in the rule file.
Valid type          Description
ALPHA_NUM           The token is alphanumeric.
PUNCTUATION         The token is a punctuation character.
CONTAINS_PUNC       The token contains punctuation.
NO_VOWEL            The token contains no vowels.
LOOKUP_NOT_FOUND    The token is not in the dictionary.
LOOKUP_ANY          The token could be any one of the above types (including
                    any of the dictionary types).

Order of types
The Data Cleanse transform looks for the token types you've listed in the precise
order you've listed them in. Multiple identifiers are connected by a plus sign (+).
Additionally, by adding an asterisk (*) after a token type, you signify that there
can be one or more of this type of token in the incoming data.
Rule file operators
Within the rule definition, you can use any of the following operators.
Symbol  Also known as      Description
#       Pound sign         Comments out the line. You must place at least one
                           pound sign (#) at the start of any line that is for your
                           reference purposes only. To improve readability, you
                           can use more than one pound sign.
=       Equal sign         Shows the relationship between items, such as
                           equating the first part of a line with the second part.
+       Plus sign (and)    Connects multiple components in rule definition lines.
&       Ampersand          Associates tokens.
[]      Brackets           Groups elements so you can tell which go together.
                           For an example, see the description for pipe ( | ).
|       Pipe (or)          Toggles between two options (this or that). The entire
                           "or" expression must be enclosed in brackets ([]).
!       Exclamation mark   Negates the statement. Place this operator before the
        (not)              item to negate (for example, !SUFFIX).
?       Question mark      Indicates that the item is optional.
*       Asterisk           Indicates that the rule allows for one or any number of
                           these components.
:       Colon              Separates components within action lines and within
                           action item lines.
;       Semicolon          Terminates the line. Is required only at the end of the
                           rule definition and at the end of each action line.
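One way to build intuition for how these operators combine is to model a rule definition as an ordered list of elements and test it against a list of tokens. The following is a simplified sketch, not the transform's actual matching logic: the Element class, the matches function, and the token types assigned to the sample words are all hypothetical, and the ampersand is read here as combining conditions on the same token, which is an assumption based on how the shipped rules use it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    any_of: frozenset                  # [A | B]: token must have one of these types
    none_of: frozenset = frozenset()   # & !X: token must have none of these types
    optional: bool = False             # ?
    repeat: bool = False               # *

    def accepts(self, token_types):
        return bool(self.any_of & token_types) and not (self.none_of & token_types)

def matches(elements, tokens):
    """True when the whole token list fits the elements in order (+)."""
    if not elements:
        return not tokens
    first, rest = elements[0], elements[1:]
    if first.optional and matches(rest, tokens):
        return True                    # skip an optional element
    if tokens and first.accepts(tokens[0]):
        if matches(rest, tokens[1:]):
            return True                # consume one token and move on
        if first.repeat and matches(elements, tokens[1:]):
            return True                # consume another token of the same kind
    return False

# PRENAME_ALONE? + [NAME_STRONG_FN | NAME_WEAK_FN] +
# [NAME_STRONG_LN | NAME_WEAK_LN] & !INITIAL + HONPOST*?
RULE = [
    Element(frozenset({"PRENAME_ALONE"}), optional=True),
    Element(frozenset({"NAME_STRONG_FN", "NAME_WEAK_FN"})),
    Element(frozenset({"NAME_STRONG_LN", "NAME_WEAK_LN"}),
            none_of=frozenset({"INITIAL"})),
    Element(frozenset({"HONPOST"}), optional=True, repeat=True),
]

# Each token is the set of types it carries (hypothetical assignments).
mr_john_smith = [{"PRENAME_ALONE"}, {"NAME_STRONG_FN"}, {"NAME_STRONG_LN"}]
john_smith_phd_md = [{"NAME_STRONG_FN"}, {"NAME_STRONG_LN"}, {"HONPOST"}, {"HONPOST"}]
john_j = [{"NAME_STRONG_FN"}, {"NAME_WEAK_LN", "INITIAL"}]

print(matches(RULE, mr_john_smith), matches(RULE, john_smith_phd_md), matches(RULE, john_j))
```

The third example fails because the last token is also an INITIAL, which the negated condition excludes.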
Action section of a parsing rule

After you define the components that the Data Cleanse transform should look for
when parsing, you must tell the application what to do when it finds a match for
that rule. In other words, you must tell the Data Cleanse transform what action it
must perform. These actions are performed for both the main parsed item, as well
as the subcomponents that make up the main item.
The diagram and table below explain the pieces that make up the action section of
a parsing rule.
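options = no_multiline;          (1) Options line
action = PERSON : D;             (2) Action line
PERSON = 1 : PRENAME : 1;        (3) Action item lines
PERSON = 1 : LAST_NAME : 2;
end_action                       (4) end_action command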
Component
Description
Options line
In this line you can specify your parsing preferences. All

the components on this line are optional. In fact, the
options line is optional.
Action line
In this line, you assign the output type of the parsed item.
Action item lines
In these lines, you assign the output type for each of these
subcomponents.
end_action command
You enter end_action to signify the end of the action section and, in effect, the end of the rule.
The components that make up action lines and action item lines are discussed in
more detail in the following sections.
How to terminate lines
Except for the last line (end_action), you must terminate each line of the rule's
action section with a semicolon (;) after the last component or indicator.
Options line
The options line lists optional components, such as whether matching should start
at the end or beginning of data. The options line is optional to the rule file.
Components
The options line consists of two parts: the label that tells you what the line is for
(start options command) and the options themselves.
options = no_multiline : begin : end;
  1  options =                     (start options command)
  2  no_multiline : begin : end    (options)

Component               Description
Start options command   Enter options = to designate the start of the options line.
Option                  Enter one or more of the valid option values (see
                        "Available options" below). Separate option values with a
                        colon (:), and end the line with a semicolon (;).

Available options
The options line accepts only three values as options. An example from the
default rule file (State) shows all three of these options:
# State
STATE;
options = no_multiline : begin : end;
action = LAST_LINE;
LAST_LINE = 1 : LAST_LINE : 0;
end_action

Valid option    Description
begin           When matching, the data must start the line.
end             When matching, the data must end the line.
no_multiline    When matching, the data must be input on a nameline or firmline.

Note: When begin and end options are used together in the rule file, the
data to be found must be by itself on a line; it can't be pulled out of the
middle.
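The effect of the begin and end options can be pictured as a constraint on where in the line a match is allowed to sit. The sketch below is a simplified model under that reading; the function name placement_allowed is hypothetical, positions are counted in tokens, and no_multiline is not modeled.

```python
def placement_allowed(start, length, total, options):
    """Check whether a match covering tokens start..start+length-1 of a
    line with `total` tokens satisfies the begin and end options."""
    if "begin" in options and start != 0:
        return False                      # match must start the line
    if "end" in options and start + length != total:
        return False                      # match must end the line
    return True

# A one-token match in the middle of a three-token line:
print(placement_allowed(1, 1, 3, set()))             # no options
print(placement_allowed(1, 1, 3, {"begin"}))
# With both options, the match has to cover the whole line:
print(placement_allowed(0, 3, 3, {"begin", "end"}))
print(placement_allowed(0, 2, 3, {"begin", "end"}))
```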
Action line
The diagram and table below explain the components that make up the action
line.
   1       2      3     4
action = PERSON : D conf: 40;

# Component              Description
1 Start action command   Enter action = to designate the start of the action section.
2 Output type            Enter the output type for the parsed component. Valid
                         output types are:
                           PERSON
                           FIRM
                           ADDRESS
                           LAST_LINE
3 Dual rule indicator    If your rule is for two or more people ("Mr. and Mrs. John
                         Smith," for example), enter D after the output type.
                         The dual rule indicator is needed in the action line only if
                         the rule is a dual rule. If you need to enter a dual name
                         indicator, follow the output type with a colon (:).
4 Confidence score       Optional. This component (for example, conf: 40) adds
                         a user-defined weight to the calculated confidence, called
                         a confidence booster. You can add any number here to
                         make the score higher than the calculated score.

Action item lines
Each subcomponent used in the rule definition usually has a corresponding action
item line. The diagram and table below explain the components that make up the
action item lines.
  1      2         3            4
PERSON = 1 : PRENAME        : 1;
PERSON = 1 : LAST_NAME      : 2;
PERSON = 2 : PRENAME        : 4;
PERSON = 2 : FIRST_NAME     : 5;
PERSON = 2 : LAST_NAME      : 6;
PERSON = 1 : NAME_CONNECTOR : 3;
# Component             Description
1 Output type           For this component, enter the output type used in the
                        action line, followed by the equal sign (=).
2 Item index number     Enter the number corresponding to which output item the
                        line is referring to. This number will be 1 or, if dual, 2.
                        For example, if your rule was set up to parse a dual name
                        (Mr. and Mrs. John Smith), you have two items: Mr. John
                        Smith and Mrs. John Smith. In this example, you use a 1 in
                        the action item lines that refer to Mr. John Smith (because
                        Mr. is listed first in the input) and a 2 in action item lines
                        that refer to Mrs. John Smith.
                        Follow the item index number with a colon (:).
3 Output type           Enter the subcomponent of the output type for the
  subcomponent          information code that this line corresponds to. For
                        example, if this action item line corresponds to the
                        PRENAME information code in the rule definition, then
                        PRENAME would be the subcomponent you use here.
                        For a list of valid subcomponents for each output type, see
                        "Valid output type subcomponents" on page 61.
                        Follow the output type subcomponent with a colon (:).
4 Information code      This number indicates the information code in the rule
  index number          definition that this action item line corresponds to. For
                        example, with the following data:
                          Mr Smith and Mrs Mary Jones
                        the following action line
                          PERSON = 1 : LAST_NAME : 2;
                        refers to Smith. To be Jones, the number would be 6.
                        If the action item line applies to all of the information
                        codes in the rule definition, use an index of 0. For example,
                          FIRM = 1 : FIRM : 0;
                        Use an index of L here to include the whole line (data in
                        the firm line that may not match the rule definition).
                        Follow the information code index number with a
                        semicolon (;) to terminate the action item line.
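The four components of an action item line can be read mechanically. The sketch below splits each line into its components and then assigns the matched tokens of the dual name example to output items. It is an illustration under stated assumptions: the function names parse_action_item and apply_items are hypothetical, the matched dictionary is written by hand rather than produced by a real parse, and only numbered indexes are handled (not 0 or L).

```python
ACTION_ITEMS = [
    "PERSON = 1 : PRENAME : 1;",
    "PERSON = 1 : LAST_NAME : 2;",
    "PERSON = 2 : PRENAME : 4;",
    "PERSON = 2 : FIRST_NAME : 5;",
    "PERSON = 2 : LAST_NAME : 6;",
    "PERSON = 1 : NAME_CONNECTOR : 3;",
]

def parse_action_item(line):
    """Split an action item line into its four components."""
    body = line.strip().rstrip(";")
    output_type, rest = body.split("=", 1)
    item, subcomponent, code_index = (part.strip() for part in rest.split(":"))
    if code_index.isdigit():
        code_index = int(code_index)      # 0 and L are left for the caller
    return output_type.strip(), int(item), subcomponent, code_index

def apply_items(lines, matched):
    """Group matched tokens by item index number and subcomponent."""
    items = {}
    for line in lines:
        _, item, subcomponent, code_index = parse_action_item(line)
        items.setdefault(item, {})[subcomponent] = matched[code_index]
    return items

# Information code position -> matched token for "Mr Smith and Mrs Mary Jones".
matched = {1: "Mr", 2: "Smith", 3: "and", 4: "Mrs", 5: "Mary", 6: "Jones"}
people = apply_items(ACTION_ITEMS, matched)
print(people[1])
print(people[2])
```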
Valid output type subcomponents
The following table lists the valid subcomponents for each output type.

Output type   Subcomponents
PERSON        PRENAME
              FIRST_NAME
              MID_NAME
              LAST_NAME
              OTH_POST
              MAT_POST
              TITLE
              NAME_DESIG
              PRELAST
              NAME_SPEC
              NAME_CONNECTOR
FIRM          FIRM
              FIRM_LOC
ADDRESS       ADDRESS
LAST_LINE     LAST_LINE
Example of a parsing rule

This section takes a hypothetical situation and walks through the steps for setting
up the parsing rule for this situation. For this example, you will create a rule very
similar to one that already exists in the rule file. However, you can use the general
concepts in this example to create an entirely new rule, or edit any other existing
rule.
Here's the situation
Your incoming data often contains names with three last names in the Name field.
For example, one record contains Juan Carlos Fernandez Torres Perez. You want
to create a rule to parse this as one name with discrete components of first name,
middle name, last name, last name, last name.
Note: The Data Cleanse transform already has rules for scenarios with two
last names, but not for a third, so we will modify an existing rule to account
for this extra name component.
Step 1: Identify the data you want to parse
You already have an example of the data you want to parse: Juan Carlos
Fernandez Torres Perez. Before adding a rule, you should see how this example
currently parses. Using QuickParse you can find that it parses most of the entry
correctly, but it sends the final last name to the extra field. The rule it matches is
nfname15.
Now that you know what you want to parse and how it currently parses, you can
begin to edit the rule file for your scenario. Open drlrules.dat in any text editing
program. It is best to create a backup of this file so that you can reverse any
changes, if necessary.
Step 2: Create the definition section of the parsing rule
The definition section includes a number of optional lines, and one required
line: the rule definition line. We recommend including comment lines before
your rule definition as an explanation of what the rule will parse. To add comment
lines to this rule, you could enter the following:
####################################################
#
#First name first rule for names with a first name,
#middle name, and 3 last names
#
#Examples: Juan Carlos Fernandez Torres Perez
Next, you must create the rule label. Name each rule label with a descriptive
name. For this example, we will build on the rule we are using as a base and name
it:
nfname15_extralast =
Because this is a name rule, you use n in the first character position. Again
because it is a name rule, you must include a second specific character in your
label to indicate the name order. Juan, the first name, is listed first, therefore you
use an f as the second character. From there the rest of the name is up to you.
Now you need to list the token types that make up the subcomponents of the main
component you hope to parse. In Step 1, you found that adding an extra last name
subcomponent would fix your problem, so you add that here, joining it to the
previous subcomponents with the plus (+) sign:
# name designator (ATTN:)
NAMEDESIG? +
# prename (mr.)
PRENAME_ALONE? +
# first name
[NAME_STRONG_FN |
NAME_WEAK_FN ] +
# middle name
[INITIAL | NAME_STRONG_FN | NAME_WEAK_FN] +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# maturity post (Jr.)
MATURPOST? +
# honorary post (phd)
HONPOST*? +
# occupational title
TITLE_ALONE*?;
The third last name group is the subcomponent that you add.
Notice that some subcomponents in this rule are optional. The name designator,
prename, maturity post name, honorary post name, and occupational title are all
followed by the ? operator, indicating that they may or may not be present in the
input.
Step 3: Add the options line
We only want to apply this rule on a nameline, so we must add the following
options line:
options = no_multiline;
Step 4: Create the action line
Every action line begins with action = followed by the output type. Follow up
with any optional components for this line, which are separated by colons. The
line ends with a semicolon.
Because the rule we have based this rule on is also a one person name rule, the
action line remains the same:
action = PERSON;
Step 5: Create the action item lines
To finish the action section, you need to add one action item line for the extra last
name subcomponent. Also, you need to update the index number for the
subcomponents below this inserted line:
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;
end_action
Each action item line begins with PERSON = because this is a name rule. There is
only one person parsed in this rule, therefore all subcomponents have a 1 as the
item index number.
You must specify the output type subcomponent that each line refers to. For
a list of valid subcomponents, see "Valid output type subcomponents" on
page 61. Finally, remember that the token types in your rule definition correspond
directly to the example of data you want to parse. You must enter the index
number of the token type that the line applies to.
Here are the final action item lines you should have:
PERSON = 1 : NAME_DESIG : 1;
PERSON = 1 : PRENAME : 2;
PERSON = 1 : FIRST_NAME : 3;
PERSON = 1 : MID_NAME : 4;
PERSON = 1 : LAST_NAME : 5;
PERSON = 1 : LAST_NAME : 6;
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;
Step 6: Finish your rule and save
To finish your rule, enter end_action after the action item lines. To enable your
changes, save drlrules.dat. The final rule follows.
####################################################
#First name first rule for names with a first name, middle name and 3 last names
#Examples: Juan Carlos Fernandez Torres Perez
nfname15_extralast =
# name designator (ATTN:)
NAMEDESIG? +
# prename (mr.)
PRENAME_ALONE? +
# first name
[NAME_STRONG_FN |
NAME_WEAK_FN ] +
# middle name
[INITIAL | NAME_STRONG_FN | NAME_WEAK_FN] +
# last name
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
# last name
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
# last name
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
# maturity post (Jr.)
MATURPOST? +
# honorary post (phd)
HONPOST*? +
# occupational title
TITLE_ALONE*?;
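action = PERSON;
PERSON = 1 : NAME_DESIG : 1;
PERSON = 1 : PRENAME : 2;
PERSON = 1 : FIRST_NAME : 3;
PERSON = 1 : MID_NAME : 4;
PERSON = 1 : LAST_NAME : 5;
PERSON = 1 : LAST_NAME : 6;
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;
end_action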
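The action section of this rule follows a regular pattern: one line per token type, numbered in the order the token types appear in the rule definition. The following Python sketch generates the action lines listed in Step 5 from an ordered list of subcomponents. It is only an illustration for checking index numbers by hand. The function name action_item_lines and the list SUBCOMPONENTS are hypothetical and are not part of the product.

```python
def action_item_lines(output_type, item_index, subcomponents):
    """Build the action section for a rule with one parsed item.

    subcomponents must be listed in the same order as the token types
    in the rule definition; the token index starts at 1.
    """
    lines = [f"action = {output_type};"]
    for token_index, subcomponent in enumerate(subcomponents, start=1):
        lines.append(f"{output_type} = {item_index} : {subcomponent} : {token_index};")
    lines.append("end_action")
    return lines


# Subcomponents for the three-last-name example, in token order.
SUBCOMPONENTS = [
    "NAME_DESIG", "PRENAME", "FIRST_NAME", "MID_NAME",
    "LAST_NAME", "LAST_NAME", "LAST_NAME",
    "MAT_POST", "OTH_POST", "TITLE",
]

if __name__ == "__main__":
    print("\n".join(action_item_lines("PERSON", 1, SUBCOMPONENTS)))
```

Inserting a subcomponent in the middle of the list renumbers every line below it, which is the manual update described in Step 5.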
Chapter 5:
Check parsing results with QuickParse
About QuickParse
QuickParse is a tool to help you check parsing results. QuickParse lets you
quickly see how data that you input would parse if input through your Data
Cleanse transform. With QuickParse you can manually type in questionable
records or use an input file and shuffle through the entries.
When you make any modifications to the parsing setup or to the user-modifiable
dictionary, you should use QuickParse to make sure that your changes produce
the results you want.
QuickParse's main window
When you start QuickParse, the program initially opens its main window. This
window is where you find out how the Data Cleanse transform would parse
records.
Figure: QuickParse's main window. Callouts:
After you specify the data to be parsed, it shows up in the input area.
Navigation controls let you move through your input database.
The parsed items are displayed in their own list.
Components for the selected parsed item are shown separately.
General information about the selected item is also displayed.
Setup needed. This window doesn't display any data until you set up
QuickParse. For information on setting up QuickParse, see "Get started
with QuickParse" on page 68. For more information on this window,
see "Run QuickParse" on page 70.
Get started with QuickParse

The first thing you need to do to get started is specify which configuration file
you want to use. The configuration file contains all of the parsing option settings
and tells QuickParse which fields you will be using for input.
Set up QuickParse
1. From the QuickParse window's menu bar, select Setup > QuickParse. The
Setup window opens. Here you specify the type of data to be used as input for
QuickParse, whether it's a database or data that you will input manually.
2. Specify the configuration file you want to use by typing in the path or by
browsing for it.
3. As necessary, specify the type of casing, text type, input setup, and greeting
you want to use.
4. Click OK.
Specify manual and/or database input
With QuickParse you can specify the type of data to be used as input by selecting
either Manual Input or Database Input at the Setup window.
Type of input   Description
Manual          If you choose to input your data manually, QuickParse returns you
                to the program's main window, where you can start typing in
                entries.
Database        If you choose to have a database as your input, QuickParse opens a
                window where you can set up your database (see "Set up your input
                database" on page 69).
Set up your input database
If you specify database input at QuickParse's Setup window, the Database Setup
window opens. When you specify a valid input file (ASCII, Delimited, or
dBASE3), the Database fields box displays a list of the fields in the file.
Figure: Database Setup window. Callouts:
Database fields: lists the fields in the file.
DataRight fields: lists the fields turned on in the specified configuration file.
Mapped fields: lists the field mappings.
To map which input field(s) should be put on a line:
1. Click on the fields in each box (Database fields and DataRight fields), and
click Map. The Mapped fields box then shows what fields the input is coming
in on.
Because fields can only be mapped once, the fields in the Database fields box
and DataRight fields box are removed when they're in use.
2. To delete or change a mapping, click Remove. This puts the appropriate
fields back into the lists to make them available to be mapped again.
3. When you have the fields set up the way you want, click OK. QuickParse
takes you back to the main window where you can begin going through the
records.
Run QuickParse
After you set up QuickParse (see "Get started with QuickParse" on page 68), the
application's main window opens.
QuickParse's main window
QuickParse's main window lets you view all the parsed information about any
record in an input file.
Figure: QuickParse's main window. Callouts:
Configuration file. The name of the configuration file you're using is noted
on the top title bar.
Input lines activated in the .cfg file.
Input data.
Arrow buttons for navigation.
Parsed item line.
Shows items the way they were parsed.
What type the item parsed as.
Confidence score.
Type of line the input came in on.
Rule this item hit if addr, name or firm.
Shows all the components of the item selected above.
How fields are mapped.
Data file. If you're using input from a file, the file name, the number of
records, and the number of the record displayed are listed at the bottom.
Navigate your input file
The arrow buttons (in the center on the left, beneath the input) let you navigate
through your input file. You can move forward to the next or ending record, and
backward to the previous or starting record. You can also go directly to a specific
record by typing its record number in the entry box and then pressing the Enter
key.
Enter records in database mode
If you're using an input file, you can still type in entries or modify an entry
to see what that change would do. No changes are ever made to the original
data file. Simply type in the change or addition and click Parse Current. You
can then continue with the file as you were before.
Keep a log file
If you have a record that is not parsing correctly and want to take note of it, you
can save it to a log file. You simply have to set up the log file and then save
entries when you come across them.
Log files can also be extremely helpful to Firstlogic Customer Support when you
call in with questions.
To create a log file:
1. Go to Setup > Logfile. QuickParse opens the following window.
2. Type or browse for the directory you want and type the file name.
3. Specify the type of log file you want created.
4. Click OK to complete the setup.
Each time you start a new session of QuickParse you have to specify a new log
file. You can't append to an existing log file.
To add to a log file
You can add an entry to your log file from QuickParse's main window.
1. When the entry you want recorded is active, click Log. QuickParse opens the
Log Item Information window.
2. Fill out the Correct item type entry, indicating how you wanted the item to
parse. (QuickParse automatically fills out the item text, input, database name,
current item type and input line fields. If the entry came from an input file,
the log file will also include the input record's number.)
3. In the Notes box, enter any additional comments pertaining to this entry.
Note: If you press the Enter key while entering comments in the Notes
box, you'll insert a return character, which shows up in your log file as
another line. You may want to have only one line per log file entry.
4. When the log contains the information you want, click OK.
Change your configuration file
If you want to change the options set in the configuration file you're using,
choose Edit > Config file. QuickParse opens a text box with the configuration
file you specified at setup.
You can change options by commenting and uncommenting items. To
uncomment a line, remove the # to the left of the line. To comment out a line,
insert a # at the very left of the line.
When you work with these options, you have to be very careful. Only one of each
type can be active at a time. For example, to change the AssignPrenames
parameter from Yes to No, you have to not only comment out the Yes parameter,
but also uncomment the No parameter.
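The commenting rule above (exactly one line of each parameter active at a time) can be sketched in Python as follows. This is only an illustration of the convention, assuming each alternative sits on its own line in the form Name = VALUE with an optional leading #. The function name select_option is hypothetical, and the sketch does not reflect how QuickParse itself reads the file.

```python
import re


def select_option(config_text, name, value):
    """Activate the line 'name = value' and comment out its alternatives."""
    pattern = re.compile(r"^#?\s*" + re.escape(name) + r"\s*=\s*(\S+)\s*$")
    result = []
    for line in config_text.splitlines():
        match = pattern.match(line)
        if match is None:
            # Not a line for this parameter: leave it untouched.
            result.append(line)
            continue
        active_form = line.lstrip("#").strip()
        if match.group(1).upper() == value.upper():
            result.append(active_form)        # uncomment the wanted value
        else:
            result.append("#" + active_form)  # comment out every other value
    return "\n".join(result)


if __name__ == "__main__":
    before = "AssignPrenames = YES\n#AssignPrenames = NO"
    print(select_option(before, "AssignPrenames", "NO"))
```

Handling both lines in one pass avoids the two failure cases the text warns about: two active lines, or none.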
Yes                      No
AssignPrenames = YES     #AssignPrenames = YES
#AssignPrenames = NO     AssignPrenames = NO
Manage your sessions
Instead of having to go through all the steps of setup every time you open
QuickParse, you have the option to save the session. Saving a session means
you are saving in your registry the name of the configuration file and other
options on the setup screen to be easily accessed.
If you are using database input, the name and the field mappings are also retained.
Remember, you can't append to a log file, so if you're using a log file you'll have
to specify a new one each time you open QuickParse.
To save a session:
1. At QuickParse's main window, choose File > Save Session. The Save
Session window opens.
2. Enter the name of the session, and click OK.
To open a session:
1. At QuickParse's main window, choose File > Open Session.
2. Choose the name of the session to open, and click OK.
To remove a session:
1. At QuickParse's main window, choose File > Remove Session.
2. Choose the name of the session to remove, and click OK.
Appendix A:
UMD configuration file, umd.cfg
umd.cfg
Rather than type a long command line, you can use the UMD configuration file.
Make a copy of umd.cfg and save it under a different file name. Then edit and use
your copy.
# UMD Show
#
# UMD Build
Dictionary Type (see NOTE) ............. =
Verify Input File only (YES/NO) ........ =
#
# Dictionary Types:
#    parsing
#    generic
#    capital
#
# Output File Types:
#    delimited
#    ascii
#    dbase3
Guidelines for editing the configuration file
In the configuration file, do not edit anything to the left of the equal signs. To
insert comments, prefix them with a pound sign (#). Complete either the UMD
Show section or the UMD Build section, not both. (Exception: For UMD Show,
specify the dictionary at the Source Dictionary parameter.)
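To make the layout rules concrete, here is a minimal Python sketch that reads a file laid out like umd.cfg: comment lines start with #, and each parameter line has a fixed label (padded with dots) to the left of the equal sign and your value to the right. The function name parse_umd_cfg is hypothetical, and the sample text reuses only the two labels visible in the listing above. UMD's own reader may behave differently.

```python
def parse_umd_cfg(text):
    """Return a dict mapping each parameter label to the value after '='."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        label, separator, value = line.partition("=")
        if not separator:
            continue  # no equal sign, so not a parameter line
        # Drop the dot leaders and spaces that pad the label.
        settings[label.rstrip(" .")] = value.strip()
    return settings


if __name__ == "__main__":
    sample = (
        "# UMD Build\n"
        "Dictionary Type (see NOTE) ............. = parsing\n"
        "Verify Input File only (YES/NO) ........ = YES\n"
    )
    for label, value in parse_umd_cfg(sample).items():
        print(f"{label!r}: {value!r}")
```

Because the label is used as the key, editing anything to the left of the equal sign would make the parameter unrecognizable, which is why the guidelines tell you to leave that side alone.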
Command line
When you run UMD, include the configuration file as a parameter on the UMD
command line:
Platform    Command line
UNIX        umd -cfg cfg_file.cfg
Windows     umd /cfg cfg_file.cfg
Output File Name, Output File Type
These parameters are for UMD Show mode only. If you want to record query
results in an output file, enter the path and file name. Specify a file type of ASCII,
dBASE3, or Delimited.
Note: If you plan to use the output file as a transaction file, we recommend the
following: If you plan to use a database program or spreadsheet program to edit
the file, create a dBASE3 or ASCII file. If you plan to use a text editor or word
processor to edit the file, create a delimited file.
Dictionary Type
Enter the type of dictionary you want to modify. Possible dictionary types are
Parsing, Capital, and Generic (search-and-replace).
Source Dictionary
If you are building a custom dictionary or table, enter the path and file name of
the source dictionary:
If you are creating a parsing dictionary, specify our parsing dictionary
parsing.dct as the source dictionary.
If you are creating a capitalization dictionary or search-and-replace table, you
should usually leave this blank.
If you are querying an existing dictionary (UMD Show), specify the path and file
name of the dictionary you want to query.
Transaction File Name
Type the path and file name of the transaction file containing your entries.
Target Dictionary
Type the path and file name of the custom dictionary you want to create. If the file
already exists, UMD overwrites the existing file (see footnote 5). If you do not
specify a target, UMD uses the source dictionary as the target.
Do not overwrite any of our base dictionaries. Instead, give your custom
dictionary a separate name. Each time you install a software update, we
overwrite our base dictionaries. If you use our file names for your
dictionaries, your custom dictionaries may be overwritten.
Verify Input File Only
If you set this option to Yes, UMD checks all the entries in the transaction file but
does not actually produce the target dictionary. This is handy if you want to verify
during the day and run the build process during the night.
If you set this option to No, UMD checks the entries in the transaction file. If no
verification errors occur, UMD builds the target dictionary.
Error Message Log File
We recommend that you specify an error log file. UMD will write any error or
warning messages to the log file so you can review them later.
If you leave this parameter blank, UMD sends error and warning messages to the
screen (standard output). If any messages scroll off the screen, you will not be
able to retrieve them.
Work Directory
By default, UMD places its temporary work files in the current directory. If you
would like to use some other location, specify a path.
To estimate the space required for work files, use this formula:
Work space = 4 x (size of transaction file + size of source dictionary)
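As a worked example of the formula, the Python sketch below computes the estimate from two file sizes in bytes. The function names are hypothetical and the file names in the usage lines are placeholders.

```python
import os


def estimate_work_space(transaction_bytes, source_bytes):
    """Work space = 4 x (size of transaction file + size of source dictionary)."""
    return 4 * (transaction_bytes + source_bytes)


def estimate_work_space_for_files(transaction_path, source_path):
    """Same estimate, reading the two sizes from disk."""
    return estimate_work_space(
        os.path.getsize(transaction_path),
        os.path.getsize(source_path),
    )


if __name__ == "__main__":
    # A 1 MB transaction file and a 5 MB source dictionary need about 24 MB.
    print(estimate_work_space(1_000_000, 5_000_000))
```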
5. Before overwriting an existing dictionary, UMD makes a backup copy of the existing file. For example, if the
dictionary is named custom.dct, UMD creates a backup file named custom.001. The next time, UMD creates a backup
named custom.002, and so on up to custom.999.
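The backup naming in the footnote (the dictionary's extension replaced by a three-digit sequence number from 001 to 999) can be illustrated with a short Python sketch. The function name backup_name is hypothetical. How UMD picks the next number is not described here, so the sequence number is passed in explicitly.

```python
from pathlib import PurePath


def backup_name(dictionary_name, sequence):
    """Return the backup file name for a dictionary, e.g. custom.dct -> custom.001."""
    if not 1 <= sequence <= 999:
        raise ValueError("backup sequence numbers run from 001 to 999")
    return str(PurePath(dictionary_name).with_suffix(f".{sequence:03d}"))


if __name__ == "__main__":
    print(backup_name("custom.dct", 1))
    print(backup_name("custom.dct", 2))
```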
Appendix B:
UMD command line
You can use one of three command lines with UMD:

UMD Show, for querying an existing dictionary or table
UMD Build, for verifying and building your user-modifiable dictionary
UMD Config, for using the UMD configuration file.
UMD Show
You can query an existing dictionary or table by using the UMD Show command
line.
Platform    Command line
UNIX        umd -s dct_file.dct [-o out_file] [-d db_type]
Windows     umd /s dct_file.dct [/o out_file] [/d db_type]

Parameter        Description
s dct_file.dct   Path and file name of the dictionary to query.
o out_file       Path and file name of the output file. If you save a query, UMD
                 writes it to this file. If the file already exists, UMD appends to the
                 end of the file.
                 Note: You can edit the output file and use it as a transaction file.
d db_type        Database type for the output file. Choose one: dBASE3, ASCII
                 (default), or Delimited (see note 1).
1. If you plan to use the output file as a transaction file, we recommend the following: If you plan to use a database
program or spreadsheet program to edit the file, create a dBASE3 or ASCII file. If you plan to use a text editor or word
processor to edit the file, create a delimited file.
UMD Build
If you prefer not to use the configuration file, you can place all the UMD Build
parameters on the command line.
Platform    Command line
UNIX        umd dct_type -i trans [-s source] [-t target] [-e err_log] [-p work] [-v]
Windows     umd dct_type /i trans [/s source] [/t target] [/e err_log] [/p work] [/v]

Parameter   Description
dct_type    Dictionary type: Parsing or Capital.
i trans     Path and file name of the transaction file containing your custom
            entries.
s source    Path and file name of the source dictionary to use as a base for your
            custom dictionary.
t target    Path and file name of the custom dictionary to create. If the file already
            exists, UMD will overwrite it (see note 1). If you do not specify a target,
            UMD uses the source dictionary as the target.
e err_log   Log file for validation warnings and errors. We recommend that you
            include this parameter.
p work      Path and directory to use for temporary storage of work files. To
            estimate space requirements, use this formula:
            Work space = 4 x (size of transaction file + size of source dictionary)
v           Verify only. If you include this option, UMD checks all the entries in
            the transaction file but does not actually produce the target dictionary.
1. Before overwriting an existing dictionary, UMD makes a backup copy of the existing file. For example, if the
dictionary is named custom.dct, UMD creates a backup file named custom.001. The next time, UMD creates a backup
named custom.002, and so on up to custom.999.
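If you script your builds, the UMD Build command line can be assembled from its parts. The Python sketch below uses only the switches listed in the table above and swaps the switch prefix by platform. The function name umd_build_command and the file names in the usage lines are hypothetical, and the sketch only builds the argument list. It does not run UMD.

```python
def umd_build_command(platform, dct_type, trans, source=None, target=None,
                      err_log=None, work=None, verify_only=False):
    """Return the UMD Build command line as a list of arguments."""
    if platform == "unix":
        prefix = "-"
    elif platform == "windows":
        prefix = "/"
    else:
        raise ValueError("platform must be 'unix' or 'windows'")

    # dct_type and the transaction file are required; the rest are optional.
    args = ["umd", dct_type, f"{prefix}i", trans]
    optional = (("s", source), ("t", target), ("e", err_log), ("p", work))
    for switch, value in optional:
        if value is not None:
            args += [f"{prefix}{switch}", value]
    if verify_only:
        args.append(f"{prefix}v")
    return args


if __name__ == "__main__":
    command = umd_build_command(
        "unix", "Parsing", "my_parse.trn",
        source="parsing.dct", target="custom.dct",
        err_log="umd_errors.log", verify_only=True,
    )
    print(" ".join(command))
```

Running once with verify_only=True and again without it mirrors the verify-then-build workflow described for the Verify Input File Only parameter.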
UMD Config
Rather than type the UMD Show or UMD Build command line, you can specify
file names and options in the UMD configuration file (see UMD configuration
file, umd.cfg on page 75).
To run UMD with the configuration file, use the following command:
Platform    Command line
UNIX        umd -cfg cfg_file.cfg
Windows     umd /cfg cfg_file.cfg
Appendix C:
Information codes and standard-type codes
Information codes
If you're creating a parsing transaction entry, type the appropriate information
codes (or info codes) in the Info field. Put one space (no punctuation) between
codes.
If you're using UMD Show to query a parsing dictionary, these are the codes
shown in the Info Codes field.
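If you generate transaction entries with a script, a small check can catch formatting slips before UMD verification does. The Python sketch below joins codes with single spaces, rejects punctuation, and applies the PRENAME and PREGEN notes from the table of information codes. The function name format_info_field is hypothetical, and the check is an illustration only. It does not replace UMD's own verification.

```python
PREGEN_CODES = {"PREGEN1", "PREGEN2", "PREGEN3", "PREGEN4", "PREGEN5"}


def format_info_field(codes):
    """Return the Info field text: codes separated by one space, no punctuation."""
    for code in codes:
        # Codes consist of letters, digits, and underscores only.
        if not code.replace("_", "").isalnum():
            raise ValueError(f"unexpected character in code {code!r}")
    present = set(codes)
    has_pregen = bool(present & PREGEN_CODES)
    if "PRENAME" in present and not has_pregen:
        raise ValueError("PRENAME needs one of the PREGEN codes")
    if has_pregen and not present & {"PRENAME", "PRENAME_ALONE"}:
        raise ValueError("a PREGEN code needs PRENAME or PRENAME_ALONE")
    return " ".join(codes)


if __name__ == "__main__":
    print(format_info_field(["PRENAME", "PREGEN1"]))
```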
Information code  Description
DIRECTIONAL       Refers to the part of the address that gives directional
                  information for delivery, such as N, S, N.E
FIRMDESIG         Indicates that a firm is to follow.
FIRMINIT          When used in a firm name, likely to be the first word in the
                  firm name.
FIRMLOC           A location within a firm (usually used for internal mail
                  delivery).
                  For example: Department, Mailstop, Room, Building.
FIRMMISC          A word used in firm names.
FIRMNAME          This code is used for firm names that may be parsed incorrectly.
                  For example, Hewlett Packard could be incorrectly parsed as a
                  personal name, so Hewlett, Packard, and Hewlett Packard are
                  all listed as Firm Name words.
FIRMNAME_ALONE    A firm name that can stand on its own, for exam
FIRMTERM          Likely to be the last word in a firm name.
                  For example: Inc, Corp, Ltd, and so on.
HONPOST           A postname that signifies certification, academic degree, or
                  affiliation.
                  For example: CPA, PhD, or USNR.
INITIAL           A middle initial, such as C or J.
MATURPOST         A maturity postname such as Jr or Sr.
NAMEDESIG         A name designator such as Attn or c/o.
NAMEGEN1-5        The gender of the name.
                  NAMEGEN1  94 to 100 percent chance the person is a
                            man (e.g., Robert).
                  NAMEGEN2  70 to 93 percent chance the person is a man
                            (e.g., Adrian).
                  NAMEGEN3  The name does not reliably indicate gender,
                            or is a last name.
                  NAMEGEN4  ... woman (e.g., Lynn).
                  NAMEGEN5  ... woman (e.g., Anne).
NAMESPEC          A word that may appear in a name line, such as Family,
                  Resident, Occupant.
NUMBER            A number word, such as One, First, or 1st.
PHRASE_WRD        A word that is part of a phrase.
                  For example, the dictionary contains an entry for the phrase
                  VP Mktg. Each word in the phrase (VP and Mktg) is
                  marked as a PHRASE_WRD.
POST_OFFICE       The name of a SCF, ASF, BMC, or ADC.
PREGEN1-5         The gender of a prename.
                  PREGEN1  Masculine. For example, Mr or Senor.
                  PREGEN3  Neutral. For example, Dr or Capt.
                  PREGEN5  Feminine. For example, Ms, Mrs, or Senora.
                  Note: The entry must also include the PRENAME or
                  PRENAME_ALONE code.
PREFIRST          A first-name prefix.
PRELAST           A last-name prefix, such as Van Allen or O'Connor.
PRENAME           A prename, such as Mr, Ms, Senor, Senora, Dr, Capt.
                  Note: The entry must also include one of the PREGEN
                  codes.
PRENAME_ALONE     A prename, such as Mr. or Ms. without the PREGEN code.
PRIVATE_ADDR      Private mailbox, pmb.
REGION            A geographical word such as North, Western, Minnesota, or
                  NY.
TITLE             A word used in a job title, such as Software or Engineer.
TITLE_INIT        A word used at the beginning of the title, such as Vice or
                  Associate.
TITLE_TERM        A word used at the end of the title, such as Engineer or
                  Manager.
TITLE_ALONE       A word that can stand as a single title, such as Accountant or
                  Attorney.
SEC_ADDR          Secondary address, such as an apartment, building, or suite.
STATE             A U.S. abbreviation, such as WI, MN, or NY.
SUFFIX            A suffix, such as Jr, II, or III.
MIL_ADDR          Part of a domestic military address, such as psc.
MIL_LAST          Part of an overseas military address preceding the two-character
                  "state" abbreviation, such as APO, FPO.
MIL_STATE         Part of an overseas military address indicating the state, such as
                  AE, AP, AA.
NAME_STRONG_FN, NAME_WEAK_FN, NAME_AMBIGUOUS, NAME_WEAK_LN, NAME_STRONG_LN
                  Used for a name whose position (first or last) can be qualified.
                  NAME_STRONG_FN  Most likely a first name (e.g., Michael).
                  NAME_WEAK_FN    Possibly a first name (e.g., Corey).
                  NAME_AMBIGUOUS  Indeterminate (not listed).
                  NAME_WEAK_LN    Possibly a last name (e.g., Hunter).
                  NAME_STRONG_LN  Most likely a last name (e.g., McMichaels).
RR_HC_ADDR        Highway Contract Rural Routes
CONNECTOR         The word, character, or symbol between other words.
                  For example: "and" and "&".
ZIP               A ZIP Code
ZIP4              A ZIP+4 Code
Standard-type codes
If you're creating a parsing transaction entry, type the appropriate standard-type
codes in the Stdtype field. Put one space (no comma) between codes.
If you're using UMD Show to query a parsing dictionary, these are the codes
shown next to each Standard.
In this table, we use the terminology "Primary" and "Secondary." If you are
using UMD Show, the Primary is the word you queried and the Secondary
is the Standard.
Standard-type code  Description
ADDRESS_STD         If the Primary is parsed as an address, use this Secondary as
                    the standardized form.
ALL_TEXT_TYPES      If a text standard (STD) is not indicated for a particular type
                    of data, use this standard as the default. Usually used with
                    NUMBER and REGION words.
                    For example, if a word is parsed as a firm name but the
                    dictionary does not list a FIRM_STD for the word, then use the
                    ALL_TEXT_TYPES standard as the text standard.
                    Note: The ALL_TEXT_TYPES is used as a default text standard
                    (STD) only. It is not used as a default match standard or
                    acronym.
FIRM_ACR            If the Primary is parsed as a firm name, use this Secondary as
                    the acronym.
FIRM_MTC            If the Primary is parsed as a firm name, use this Secondary as
                    the match standard.
FIRM_STD            If the Primary is parsed as a firm name, use this Secondary as
                    the standardized form.
FIRMLOC_ACR         If the Primary is parsed as a firm location, use this Secondary
                    as the acronym.
FIRMLOC_MTC         If the Primary is parsed as a firm location, use this Secondary
                    as the match standard.
FIRMLOC_STD         If the Primary is parsed as a firm location, use this Secondary
                    as the standardized form.
HONPOST_ACR         If the Primary is parsed as an honorary postname, use this
                    Secondary as the acronym.
HONPOST_MTC         If the Primary is parsed as an honorary postname, use this
                    Secondary as the match standard.
HONPOST_STD         If the Primary is parsed as an honorary postname, use this
                    Secondary as the standardized form.
LAST_LINE_STD       If the Primary is parsed as a last line, use this Secondary as
                    the standardized form.
MATURPOST_MTC       If the Primary is parsed as a maturity postname, use this
                    Secondary as the match standard.
MATURPOST_STD       If the Primary is parsed as a maturity postname, use this
                    Secondary as the standardized form.
NAME_MTC            If the Primary is parsed as a name, use this Secondary as the
                    match standard.
NAMEDESIG_ACR       If the Primary is parsed as a name designator, use this
                    Secondary as the acronym.
NAMEDESIG_MTC       If the Primary is parsed as a name designator, use this
                    Secondary as the match standard.
NAMEDESIG_STD       If the Primary is parsed as a name designator, use this
                    Secondary as the standardized form.
NAMESPEC_ACR        If the Primary is parsed as a name-special component, use this
                    Secondary as the acronym.
NAMESPEC_MTC        If the Primary is parsed as a name-special component, use this
                    Secondary as the match standard.
NAMESPEC_STD        If the Primary is parsed as a name-special component, use this
                    Secondary as the standardized form.
PRELAST_MTC         If the Primary is parsed as a last-name prefix, use this
                    Secondary as the match standard.
PRELAST_STD         If the Primary is parsed as a last-name prefix, use this
                    Secondary as the standardized form.
PRENAME_ACR         If the Primary is parsed as a prename, use this Secondary as
                    the acronym.
PRENAME_MTC         If the Primary is parsed as a prename, use this Secondary as
                    the match standard.
PRENAME_STD         If the Primary is parsed as a prename, use this Secondary as
                    the standardized form.
TITLE_ACR           If the Primary is parsed as a title, use this Secondary as the
                    acronym.
TITLE_MTC           If the Primary is parsed as a title, use this Secondary as the
                    match standard.
TITLE_STD           If the Primary is parsed as a title, use this Secondary as the
                    standardized form.
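The ALL_TEXT_TYPES rule in the table is a fallback that applies to text standards only. The Python sketch below models one dictionary entry as a mapping from standard-type code to Secondary and shows the lookup order. The function names and the sample entry are hypothetical, and this is an illustration of the rule as described above, not the parser's actual implementation.

```python
def text_standard(standards, component):
    """Return the text standard for a component such as 'FIRM' or 'TITLE'.

    Falls back to ALL_TEXT_TYPES when no <component>_STD is listed.
    """
    specific = standards.get(f"{component}_STD")
    if specific is not None:
        return specific
    return standards.get("ALL_TEXT_TYPES")


def match_standard(standards, component):
    """Return the match standard; ALL_TEXT_TYPES is never used as a default here."""
    return standards.get(f"{component}_MTC")


if __name__ == "__main__":
    # Hypothetical entry: a default text standard plus a title-specific one.
    entry = {"ALL_TEXT_TYPES": "First", "TITLE_STD": "1st"}
    print(text_standard(entry, "TITLE"))   # title-specific standard
    print(text_standard(entry, "FIRM"))    # falls back to the default
    print(match_standard(entry, "FIRM"))   # no match standard listed
```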
Index
acronyms
how parser generates, 25
adding
firm that looks like a personal name, 22
multiple-word firm name, 20
new word, 17
title phrase, 18
defining, 39
definition section, 37
dictionary
creating custom for parsing, 7
dictionary type
parsing, 7, 15
drludpm.dat, 5, 35
build UMD (command line), 77

building
custom parsing dictionary, 15
e-mail address
Firstlogic, 2
C
capitalization
in parsing dictionary, 17
capitalization dictionary
build customized, 32
create transaction file, 30
create your own, 30
definition, 29
querying, 29, 33
update customized, 33
capitalization transaction entries, 31
capitalization transaction file
creating, 30
codes
information, 79
standard-type, 79
command line
build UMD, 77
config UMD, 78
query UMD, 77
comments, 38
configuration file
change QuickParse's, 72
configure UMD (command line), 78
contact information
Firstlogic, 2
copyright statement, 2
creating a parsing transaction file, 12
custom capitalization dictionary
building, 32
updating, 33
custom dictionary
creating for parsing, 7
maintaining, 16
updating, 16
custom parsing dictionary
building, 15
customized parsing dictionary, 15
customizing DataRight, 5
F
firm
adding when it looks like a personal name, 22
adding when multiple word, 20
firm name that looks like a personal name
querying, 11
Firstlogic contact information, 2
G
generating acronyms via parser, 25
I
Information codes, 79
information codes
modifying, 23
input database
set up QuickParse's, 69
L
legal notices, 2
log file
maintain in QuickParse, 71
M
macro, 38
macro name, 38
match standards
how standards work, 26
rules, 26
spelling and punctuation, 27
working with, 26
modifying
information codes, 23
standards and standard-types, 24
multiple-word firm name
adding, 20
querying, 10
N
new word
adding, 17
O
operators
using in regular expressions, 40
P
parser
generating acronyms, 25
parsing dictionary
building custom, 15
creating custom, 7
definition, 7
parsing transaction file
creating, 12
pattern file, 36
personal name
really a firm name, 22
phone number
Firstlogic, 2
punctuation
in match standards, 27
Q
query UMD (command line), 77
QuickParse, 67
run, 70
set up, 68
set up input database, 69
R
regular expression, 38
rule file, 5
rule name, 38
rule section, 37
rules
defining, 48
S
session of QuickParse, 72
open, 73
remove, 73
save, 73
spelling
in match standards, 27
standards
modifying, 24
standard-type codes, 79
standard-types
modifying, 24
subcomponents, 39, 47
defining, 47
T
title phrase
adding, 18
querying, 9
trademarks, 2
transaction database
creating, 12, 30
transaction entries
parsing, 13
transaction file
creating for parsing, 12
placing entries in, 13
putting entries in, 31
transaction file for capitalization
creating, 30
transaction files
supporting files, 12
U
UMD, 5
UMD Build (command line), 32
UMD Build command line, 77
UMD Config command line, 78
UMD Show command line, 77
user-defined data
retrieve from DataRight, 49
submitting to DataRight, 49
definition section, 38
rules section, 38
user-defined pattern example, 37
user-defined pattern file
saving after modifying, 48
V
verification
when building custom dictionary, 15
W
web site
Firstlogic, 2
word
querying, 9
