Informatica Data Quality User Guide
Version 8.6.2
March 2009
Copyright (c) 1998-2008 Informatica Corporation. All rights reserved.
This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and
disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form,
by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and international
Patents and other Patents Pending.
Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in
DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as
applicable.
The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in
writing.
Informatica, PowerCenter, PowerExchange, Informatica B2B Data Exchange, Informatica B2B Data Transformation, Informatica Data Quality, Informatica Data Explorer,
Informatica Identity Resolution and Matching, Informatica On Demand, PowerMart, PowerBridge, PowerConnect, PowerChannel, PowerPartner, PowerAnalyzer,
PowerCenter Connect and PowerPlug are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All
other company and product names may be trade names or trademarks of their respective owners.
Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright Melissa Data Corporation. All rights
reserved. Copyright MySQL AB. All rights reserved. Copyright Platon Data Technology GmbH. All rights reserved. Copyright Seaview Software. All rights reserved.
Copyright Sun Microsystems. All rights reserved. Copyright Oracle Corporation. All rights reserved.
This product includes software developed by the Apache Software Foundation (http://www.apache.org/), software developed by l2fprod.com (http://common.l2fprod.com) and
other software which is licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This product includes software which was developed by the JFreeChart project (http://www.jfree.org/freechart/), software developed by the JDIC project
(https://jdic.dev.java.net/) and other software which is licensed under the GNU Lesser General Public License Agreement, which may be found at
http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, as-is, without warranty of any kind, either express or implied, including but not limited to the implied
warranties of merchantability and fitness for a particular purpose.
The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine,
and Vanderbilt University, Copyright (c) 1993-2006, all rights reserved.
This product includes ICU software which is copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permissions and limitations
regarding this software are subject to terms available at http://www-306.ibm.com/software/globalization/icu/license.jsp.
This product includes software which is licensed under the MIT License, which may be found at http://www.opensource.org/licenses/mit-license.html.
This product includes software which is licensed under the Eclipse Public License, which may be found at http://www.eclipse.org/org/documents/epl-v10.html.
Tcl is copyrighted by the Regents of the University of California, Sun Microsystems, Inc., Scriptics Corporation and other parties. The authors hereby grant permission to use,
copy, modify, distribute, and license this software and its documentation for any purpose.
This product includes software developed by the JDOM Project (http://www.jdom.org/). Copyright 2000-2004 Jason Hunter and Brett McLaughlin. All rights reserved.
This product includes software which is licensed under the Open LDAP Public License, which may be found at http://www.openldap.org/software/release/license.html.
Portions of this software use the Swede product developed by Seaview Software (www.seaviewsoft.com).
This Software may be protected by U.S. and international Patents and Patents Pending.
DISCLAIMER: Informatica Corporation provides this documentation as is without warranty of any kind, either express or implied, including, but not limited to, the implied
warranties of non-infringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this product or documentation is error free.
The information provided in this product or documentation may include technical inaccuracies or typographical errors.
Part Number: IDQ-USG-86200-0008
Table of Contents
Preface
  Informatica Resources
Chapter 1: Informatica Data Quality Features and Functionality
  Overview
  Data Quality Plans
  Project Manager and File Manager
  Publishing Plans to Data Quality Server
  Exporting and Importing Plans
  Running Plans: Local and Remote Execution
  Plan Resources and Plan Execution
  Version Control
  Working with Multiple Instances of a Plan
  Organizing the Workbench User Interface
Chapter 2: Data Source Components
  Overview
  CSV Source
  Database Source
  Fixed Width Source
  Realtime Source
  SAP Source
  CSV Match Source
  CSV Dual Match Source
  Database Match Source
  Group Source
  Dual Group Source
Chapter 3: Data Target Components
  Overview
  CSV Target
  Fixed Width Target
  Report Target
  CSV Merge Target
  CSV Match Target
  Match Key Target
  Group Target
  Database Target
  Database Report Target
  SAP Target
  Realtime Target
Chapter 4: Frequency Components
  Overview
  Count
  Sum
  Aggregation
  MinAvgMax
  Range Counter
  Missing Values
Chapter 5: Analysis Components
  Overview
  Character Labeller
  Token Labeller
Chapter 6: Transformation Components
  Overview
  Search Replace
  Word Manager
  Merge
  To Upper
  Rule Based Analyzer
  Scripting
Chapter 7: Parsing Components
  Overview
  Parser
  Splitter
  Token Parser
  Profile Standardizer
  Context Parser
Chapter 8: Key Field Generator Components
  Overview
  Normalization
  Soundex
  Nysiis
Chapter 9: Matching Components
  Overview
  Similarity
  Edit Distance
  Jaro Distance
  Hamming Distance
  Bigram
  Mixed Field Matcher
  Weight Based Analyzer
Chapter 10: Identity Matching Components
  Overview
  Identity Group Target
  CSV Identity Group Source
  DB Identity Group Source
  Identity Match
  CSV Identity Match Target
Chapter 11: Address Validation Components
  Overview
  Global AV
  Formatted Address Outputs
Chapter 12: Dictionary Management
  Overview
  Dictionary Manager
  Updating Dictionary Files
  Creating a Dictionary
Chapter 13: Report Viewer
  Overview
  Viewing Data in the Report Viewer
  Standard View and Dashboard View
  Viewing Plan Data
  Report Viewer Parameters and Settings
  Tracking Changes in Data Quality
  Importing Report Files and Working with Groups
Chapter 14: Deploying Plans for Runtime Execution
  Overview
  Deploying Runtime Plans
  Running a Plan
  Command Line Arguments
  Performance
  Multi-Threading and Multi-Processing
  Security
Appendix A: Global AV: Match Status and Match Code Information
  Overview
  Countries Processed by QAS
  Countries Processed by Melissa Data
  Countries Processed by Address Doctor
Appendix B: Global AV Output Descriptions
  Overview
  Global AV Output Fields By Country
  Output Field Definitions
Appendix C: Rule Based Analyzer Rule Statements
  Overview
  Functional Operators
Appendix D: Search/Replace Operations and Noise Removal
  Noise Removal
Appendix E: Matching Formulas
  Matching Formulas
Appendix F: SQL Scripts
  Overview
  Creating a MySQL Table
  Use of MAX Function
  Nested Groups and Counts
Appendix G: ODBC Data Source Administrator
  Using the ODBC Data Source Administrator
Appendix H: Character Encodings and Unicode
  Character Encodings and Unicode
Appendix I: Data Quality Workbench Toolbar
  Data Quality Workbench Toolbar
Appendix J: Output Options in the CSV Match Target
  Overview
  Configuring the Outputs for Identified Matches
Appendix K: Informatica Data Quality Naming Conventions
  Overview
Index
Preface
Welcome to Informatica Data Quality, the latest-generation data quality management system from Informatica
Corporation. Informatica Data Quality will empower your organization to solve its data quality problems and
realize real, sustainable data quality improvements.
The high-level objectives for this guide are to describe Data Quality functionality in the following areas:
How to build data quality plans using the data sources, data targets, and operational components available in
the Workbench user interface.
How to manage your data quality projects, plans, and associated resource files.
How to use dictionaries and reference data content.
Its intended audience includes third-party software developers and systems administrators who are installing
Data Quality within their IT infrastructure, business users who wish to use Data Quality and learn more about
its operations, and PowerCenter users who work with Data Quality transformations and mapplets.
This guide builds on the material contained in the Data Quality Getting Started Guide. Before reading this
guide, Data Quality users should read the Getting Started Guide to familiarize themselves with data quality
concepts.
Note: The Informatica Data Quality Integration for PowerCenter is not documented in this guide. For more
information on the Data Quality Integration, see the Data Quality Integration for PowerCenter Guide.
Informatica Resources
Informatica Customer Portal
As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com.
The site contains product information, user group information, newsletters, access to the Informatica customer
support case management system (ATLAS), the Informatica How-To Library, the Informatica Knowledge Base,
Informatica Documentation Center, and access to the Informatica user community.
Informatica Documentation
The Informatica Documentation team takes every effort to create accurate, usable documentation. If you have
questions, comments, or ideas about this documentation, contact the Informatica Documentation team
through email at infa_documentation@informatica.com. We will use your feedback to improve our
documentation. Let us know if we can contact you regarding your comments.
The Documentation team updates documentation as needed. To get the latest documentation for your product,
navigate to the Informatica Documentation Center from http://my.informatica.com.
Informatica Web Site
You can access the Informatica corporate web site at http://www.informatica.com. The site contains
information about Informatica, its background, upcoming events, and sales offices. You will also find product
and partner information. The services area of the site includes important information about technical support,
training and education, and implementation services.
Informatica How-To Library
As an Informatica customer, you can access the Informatica How-To Library at http://my.informatica.com. The
How-To Library is a collection of resources to help you learn more about Informatica products and features. It
includes articles and interactive demonstrations that provide solutions to common problems, compare features
and behaviors, and guide you through performing specific real-world tasks.
Informatica Knowledge Base
As an Informatica customer, you can access the Informatica Knowledge Base at http://my.informatica.com. Use
the Knowledge Base to search for documented solutions to known technical issues about Informatica products.
You can also find answers to frequently asked questions, technical white papers, and technical tips.
Informatica Global Customer Support
There are many ways to access Informatica Global Customer Support. You can contact a Customer Support
Center through telephone, email, or the WebSupport Service.
Use the following email addresses to contact Informatica Global Customer Support:
support@informatica.com for technical inquiries
support_admin@informatica.com for general customer service requests
WebSupport requires a user name and password. You can request a user name and password at
http://my.informatica.com.
Use the following telephone numbers to contact Informatica Global Customer Support:

North America / South America
Informatica Corporation Headquarters
100 Cardinal Way
Redwood City, California 94063
United States
Toll Free: +1 877 463 2435
Standard Rate:
Brazil: +55 11 3523 7761
Mexico: +52 55 1168 9763
United States: +1 650 385 5800

Europe / Middle East / Africa
Informatica Software Ltd.
6 Waltham Park
Waltham Road, White Waltham
Maidenhead, Berkshire SL6 3TN
United Kingdom
Toll Free: 00 800 4632 4357
Standard Rate:
Belgium: +32 15 281 702
France: +33 1 41 38 92 26
Germany: +49 1805 702 702
Netherlands: +31 306 022 797
Spain and Portugal: +34 93 480 3760
United Kingdom: +44 1628 511 445

Asia / Australia
Informatica Business Solutions Pvt. Ltd.
Diamond District
Tower B, 3rd Floor
150 Airport Road
Bangalore 560 008
India
Toll Free:
Australia: 1 800 151 830
Singapore: 001 800 4632 4357
Standard Rate:
India: +91 80 4112 5738
CHAPTER 1
Informatica Data Quality Features and Functionality
This chapter includes the following topics:
Overview
Data Quality Plans
Project Manager and File Manager
Publishing Plans to Data Quality Server
Running Plans: Local and Remote Execution
Plan Resources and Plan Execution
Version Control
Working with Multiple Instances of a Plan
Organizing the Workbench User Interface
Overview
This chapter discusses the project management, file management, and plan management options available
through Data Quality, including the capabilities of Data Quality Workbench in conjunction with Data Quality
Server. If you are running Data Quality Workbench in stand-alone or client-only mode, some functionality
might not be available to you.
Note: For more information on the components that make up the Informatica Data Quality suite, see the
Informatica Data Quality Installation Guide and the Getting Started with Data Quality Guide.
Data Quality Plans
Informatica Data Quality analyzes and enhances your source data through processes called plans that you
create in its Workbench application. A data quality plan is a self-contained and executable set of data analysis or
data enhancement steps consisting of one or more of the types of components listed in Table 1-1.
A plan must contain at least one data source and data target. It can use any number of operational components.
A plan that writes data directly from one file or database to another does not require operational components.
Figure 1-1 shows the components in a plan arranged in the Data Quality Workbench user interface:
The arrows indicate the direction of the data flow through the plan, from data source, through operational
components, to data target.
Note: You can move components in the workspace. Arrows are not foolproof indicators of the precise progress of
data in the plan.
Each operational component in Workbench performs a different type of analysis or enhancement task on your
data. Configure an operational component to execute on a subset of the data that it receives or to filter the data
that it makes available to other components in the component chain.
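The composition rules above can be illustrated with a small sketch. It is not Data Quality code: the Plan class, the keep_if and to_upper helpers, and the sample rows are all hypothetical names chosen for illustration. It only models the stated rules, namely that a plan needs a data source and a data target, that operational components are optional, and that an operational component can filter the data it passes down the chain.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional

Row = dict  # one record flowing through the plan (hypothetical representation)


@dataclass
class Plan:
    """Hypothetical model of a plan: source -> operational components -> target."""
    source: Optional[Callable[[], Iterable[Row]]] = None
    target: Optional[Callable[[Iterable[Row]], None]] = None
    operations: List[Callable[[Iterator[Row]], Iterator[Row]]] = field(default_factory=list)

    def validate(self) -> None:
        # A plan needs a data source and a data target; operations are optional.
        if self.source is None or self.target is None:
            raise ValueError("a plan requires a data source and a data target")

    def run(self) -> None:
        self.validate()
        rows = iter(self.source())
        # Each operational component receives the output of the previous one.
        for operation in self.operations:
            rows = operation(rows)
        self.target(rows)


def keep_if(predicate: Callable[[Row], bool]):
    """Build a component that filters the rows it makes available downstream."""
    def operation(rows: Iterator[Row]) -> Iterator[Row]:
        return (row for row in rows if predicate(row))
    return operation


def to_upper(field_name: str):
    """Build a component that upper-cases one field of every row it receives."""
    def operation(rows: Iterator[Row]) -> Iterator[Row]:
        for row in rows:
            yield {**row, field_name: row[field_name].upper()}
    return operation


collected: List[Row] = []
plan = Plan(
    source=lambda: [{"city": "dublin"}, {"city": ""}, {"city": "cork"}],
    operations=[keep_if(lambda row: bool(row["city"])), to_upper("city")],
    target=collected.extend,
)
plan.run()
print(collected)  # [{'city': 'DUBLIN'}, {'city': 'CORK'}]
```

A plan built with an empty operations list simply copies rows from source to target, which mirrors the case of writing data directly from one file or database to another.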
Many plans make use of text- or table-based reference dictionaries. Informatica provides a set of reference
dictionary files with its Content Installer. You can add dictionaries to several components in Workbench, and
you can define dictionaries in live tables within a database, ensuring that reference tables stay current.
You can edit and define your own dictionary files through the Dictionary Manager. Dictionary files are stored as
text files (.DIC files) in a Dictionaries folder in the Informatica Data Quality directory.
Note: Data Quality dictionaries install through the Content Installer, a separate installer within the Informatica
Data Quality installation. The Content Installer also installs any reference data and processing engine updates
that you receive from Informatica.
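Because dictionary files are plain .DIC text files kept in a Dictionaries folder, they can be inventoried with ordinary file-system tools. The sketch below is an illustration only: the list_dictionary_files function is a hypothetical helper, the install directory is whatever path your installation uses, and the example file names are invented. It looks only at the top level of the folder and makes no assumption about the contents of a .DIC file.

```python
import tempfile
from pathlib import Path
from typing import List


def list_dictionary_files(install_dir: str) -> List[Path]:
    """Return the .DIC files directly inside <install_dir>/Dictionaries, sorted."""
    folder = Path(install_dir) / "Dictionaries"
    if not folder.is_dir():
        return []
    return sorted(
        path for path in folder.iterdir()
        # Match the extension case-insensitively (.DIC or .dic).
        if path.is_file() and path.suffix.lower() == ".dic"
    )


# Demonstration against a temporary directory standing in for the install directory.
with tempfile.TemporaryDirectory() as tmp:
    dictionaries = Path(tmp) / "Dictionaries"
    dictionaries.mkdir()
    (dictionaries / "example_a.DIC").touch()   # hypothetical file names
    (dictionaries / "example_b.dic").touch()
    (dictionaries / "notes.txt").touch()       # ignored: not a dictionary file
    names = [path.name for path in list_dictionary_files(tmp)]

print(names)  # ['example_a.DIC', 'example_b.dic']
```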
Project Manager and File Manager
Workbench stores plans in the Data Quality repository and reads reference data from the file system. It provides
separate browsers to view the contents of the repository and the file system.
Table 1-1. Data Quality Plan Components
Component      Required/Optional   Description
Data source    Required            Provides input data for the plan.
Data target    Required            Collects data output from the plan.
Operational    Optional            Performs the data analysis or data enhancement actions on the data
                                   it receives. Most plans contain multiple operational components.
Figure 1-1. Plan Components in the Data Quality Workspace
Project Manager. Lists the plans and project folders in the local Data Quality repository and any available
repositories on a Data Quality service domain. Allows you to organize plans in folders, publish plans from
the local repository to a service domain repository, export plans to PowerCenter repositories, and run plans.
File Manager. Allows you to access and move files within the local file system and across the service domain
file system. With the File Manager, you can access any file type stored on a server.
In stand-alone installations of Data Quality Workbench, the File Manager and Project Manager provide access
to the local system and local repository only.
To view the Project Manager:
In Informatica Data Quality Workbench, click the Projects tab.
To view the File Manager:
In Informatica Data Quality Workbench, click the Files tab.
Working with the File Manager
The File Manager provides visibility to a Data Quality service domain in the following way:
The names of the servers configured in the domain appear under the service domain name.
The servers are host to the client user spaces and a shared file space for all users. These user spaces contain
the dictionary files and other resource files for plans stored in the service domain repository.
The server hosts a Dictionaries folder that all service domain repository plans can read from. This folder is
created by the Data Quality installer and populated by the Content Installer.
The local computer structure also appears.
To work with files within the File Manager, right-click a file or folder and select the required operation from the
shortcut menu that appears. The permitted operations are as follows:
(Create) New Folder
Rename
Delete
Cut
Copy
Paste
Refresh
Open Externally
Security
The following procedure illustrates how to use the File Manager.
Note: You cannot copy files from another system, such as Windows Explorer, into File Manager folders.
To copy local files to the service domain with the File Manager:
1. Under the File Manager tab, browse the local folder structure and locate the required file.
2. Right-click the file name and select Copy from the context menu that appears.
3. On the service domain, expand the folders of the server to which you will copy the file and locate the
destination folder.
4. Right-click the folder name and select Paste from the context menu that appears.
4 Chapter 1: Informatica Data Quality Features and Functionality
Publishing Plans to Data Quality Server
Publishing is the process of copying plans from a Workbench repository to a Data Quality Server repository.
Publishing deploys plans in a networked environment, allowing domain users with appropriate permissions to
access and execute the plans. Administrators set user permissions in the Data Quality Administration Console.
A published plan contains version control information that references the owner of the original plan, allowing
the genealogy of plans to be traced across repositories.
To publish a plan from the local repository to a domain repository:
1. Right-click the plan(s) you want to publish.
2. Select Copy from the context menu.
3. Browse the domain repository and locate the folder where you would like to publish the plan(s).
4. Right-click the folder and select Paste from the context menu.
5. Copy all necessary plan resources to the server file system, ensuring that you recreate the folder path
structures used in the source Workbench plan. For more information on placing resources in the correct
locations, see Implications for Plan Design on page 8.
Note: When plans are published, the latest base version of the plan is used. Any changes saved since this version
are not published. For more information about plan version control, see Version Control and Plan
Publication on page 10.
Exporting and Importing Plans
Use Data Quality Workbench to export and import plans to and from your local repository. Export plans
directly into the PowerCenter repository as mapplets, or export them as files that can be imported by other Data
Quality users.
The following export and import options are available:
Export plans directly into the PowerCenter repository as mapplets. Use this option to run Data Quality
plans natively within PowerCenter.
Export plans in XML format. XML plans can be used by the runtime version of Data Quality as part of
command batch jobs or scheduled processes.
Back up plans to Data Quality PLN files for storage.
Import plans from PLN or XML formats. Informatica recommends importing from PLN files in order to
preserve the layout of the original plan.
Exported and imported plans do not contain plan version histories.
Exporting Plans to PowerCenter
Use Workbench to export Data Quality plan metadata directly to a PowerCenter repository or to an XML file
that you can later import to a PowerCenter repository.
To export plans to a PowerCenter repository:
1. Right-click the plan(s) you want to export.
2. Select Export > PowerCenter Mapplet > To PowerCenter Repository.
3. Enter your connection details in the Connect to PowerCenter Repository dialog box. Ensure you select the
correct PowerCenter repository version.
4. Choose a destination repository folder for the exported plans.
To export plans to an XML file:
1. Right-click the plan(s) you want to export.
2. Select Export > PowerCenter Mapplet > To XML File.
3. Enter a path and name for the XML file in the Export to Mapplet XML File dialog box. You can use this
dialog box to create a new file.
4. Verify that the code page identified under Select Codepage is suitable. Choose a different code page if
necessary.
5. Click Export.
Note: When you import the mapplet XML to the PowerCenter repository and view it in PowerCenter Mapplet
Designer, the mapplet is read-only. To edit the mapplet, disconnect and reconnect to the repository folder that
contains the mapplet.
Exporting Plans for Runtime Use
Export plans as XML files for use during runtime execution. Runtime execution uses a command-line version of
the Data Quality engine to run plans as part of a scheduled or batch process. For more information on runtime
execution, see Deploying Plans for Runtime Execution on page 123.
Note: Do not import a runtime XML plan file to PowerCenter. For information on creating an XML plan file
that you can import to PowerCenter, see page 5.
To export a plan for runtime use:
1. Right-click on the plan(s) you want to export.
2. Select Export > IDQ Runtime Plan(s) (.xml).
3. Choose a destination folder for the XML plans, and click Select.
4. In the Export a Plan to XML dialog box, choose the operating system on which the plan will run and select
OK. If the exported plans contain file-based sources or targets, you can perform the following actions in
this dialog box:
Change the paths for the sources or targets.
Select OK to All to use the same paths for all file-based sources or targets.
5. Copy the exported XML file to the computer that will run the plans.
6. Copy all necessary source and reference files to the computer that will run the plans, ensuring that they are
placed in the proper locations. For more information, see Plan Resources and Plan Execution on page 7.
Backing Up Plans
Create backup copies of your plans in PLN format. Do not create XML copies of plans for backup purposes.
PLN files retain the original onscreen appearance of the plans.
To back up your plans:
1. Right-click the plan(s) you want to back up.
2. Select Export > Workbench Plan(s) (.pln).
3. Choose a destination folder, and click Select.
4. If reference files are required for the exported plans, back up these files to ensure that the backup plan is
fully functional.
Importing Plans
Informatica recommends using PLN files as the source for your plan imports. While you can import XML
plans, these plans separate all component instances into individual components. This greatly increases the visual
complexity of many plans in the Workbench user interface. Export plans as XML files for runtime execution.
To import plans:
1. Right-click the destination project or folder for the imported plan.
2. Select Import > Workbench Plan(s) (.pln).
3. Choose a file, and click Select.
4. If source and reference files are required for the imported plans, verify that these files are available to Data
Quality Workbench.
Running Plans: Local and Remote Execution
The plan execution process in Data Quality Workbench differs slightly for client-only license users and users in
client-server environments. Client-only license users define and run plans locally. Full Informatica Data Quality
users can select any available plan in the service domain and run the plan on any available server. Any machine
on the service domain can run a plan if it is host to an Execution service, the Informatica Data Quality service
that executes the plan.
Before you run a plan, make sure all necessary resources, such as the data source files and any required reference
data, are present on the computer that runs the plan and in locations recognized by Data Quality.
When you run a plan locally through your local Workbench, this is automatically the case, unless you have
moved any resources between design-time and execution. When you run a plan on a remote server, you must
ensure that the necessary resources are present in the correct locations on the server that runs the plan.
In remote execution scenarios, it is possible for the Execution service and domain repository to reside on
separate servers. The server that runs the plan is the server on which the Execution service is present.
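As a rough aid, the resource check described above can be scripted and run on the computer that will execute the plan. The sketch below only tests whether listed files exist. The file names are hypothetical placeholders, and the script does not read resource locations from a plan.

```python
from pathlib import Path
from typing import Iterable, List


def missing_resources(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist as regular files on this computer."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical resource list for one plan; replace with the files your plan uses.
required = [
    "example_resources/source_data.csv",
    "example_resources/reference_values.dic",
]

absent = missing_resources(required)
if absent:
    print("Copy these files before running the plan:", absent)
else:
    print("All listed resources are present.")
```

Run the check on the server that hosts the Execution service, because that is the computer that needs the files.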
Running a Data Quality Plan
Use the following procedure to run data quality plans in Workbench.
To run a data quality plan in Workbench:
1. Ensure the required plan is selected in the workspace.
2. Click the Run Plan toolbar button.
A dialog box opens with the plan name in its title bar.
3. Click Run.
The plan executes.
If you are connected to a Data Quality service domain, you can also select a remote Data Quality computer
on which to run the plan. That is, you can specify the Execution service that will run the plan. You can run
a plan from any repository available on the service domain. For example, you can open a plan from the
domain repository on Server 1 and run the plan on Server 2.
The Run Plan dialog features a progress bar that states the percentage of the data processed as the plan
executes. You can click the Stop button at any time to end plan execution and view the results so far.
This dialog box also has a menu that allows you to select the percentage of data to use in the plan. The
default setting is 100 percent. You can select a smaller percentage if you want to test that a plan will run as
anticipated. This can be useful if you have designed a complex plan that will take time to execute.
Reporting Options
As well as generating file-based and table-based output, Data Quality Workbench offers graphical reporting
options. These include a proprietary format that lets you view high-level and fine-grained plan results,
create scorecards, and export data to file. For more information, see Report Viewer on page 113.
Plan Resources and Plan Execution
Before you run a plan, check that all relevant files are available to the computer that runs it.
When you run a plan locally, the source data and reference data files are set when you configure the
components. Unless you move the data between designing and running the plan, the locations are understood
when you run the plan.
When you run a plan on a remote computer, the Data Quality Server reads the plan, identifies the original path
to each resource, and replaces each path with a corresponding path on the server. The server substitutes the
Windows drive letter with your file folder in the Server host folder structure. Therefore, you must ensure that
the source data and reference data files are available to the Server in locations that the Server expects.
Note: If you have used third-party data in the plan, ensure that the third-party data is installed in a location
accessible to the Execution service that runs the plan.
The following sections describe how Data Quality handles resource files in cases of remote plan execution.
Data Source Files
Data Quality Server recognizes a specific set of folders as valid resource file locations. If a plan refers to a source
file stored in the following location on the Workbench computer:
C:\Myfiles\File.txt
A Data Quality Server on Windows looks for the file here:
C:\Program Files\Informatica Data Quality\users\user.name\Files\Myfiles
A Data Quality Server on UNIX installed at /home/Informatica/DataQuality/ looks for the file here:
/home/Informatica/DataQuality/users/user.name/Files/Myfiles
For further information, see Implications for Plan Design on page 8.
Note: If you have published a file for runtime execution and your source file is located in a non-standard
location, you can provide a parameter file with the runtime command that maps the original location to the
required location.
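The path substitution described in this section can be approximated as follows. This is a sketch of the documented folder layout, not the server's actual implementation. The function name and parameters are hypothetical, and the function returns the full file path rather than only the folder.

```python
from pathlib import PurePosixPath, PureWindowsPath


def server_source_path(client_path: str, server_root: str, user: str,
                       unix: bool = False) -> str:
    """Map a Workbench (Windows) file path to the server layout described above.

    The drive letter is dropped and the remaining folders are placed under
    <server_root>/users/<user>/Files.
    """
    client = PureWindowsPath(client_path)
    # parts[0] is the drive anchor (for example 'C:\\'); keep everything after it.
    relative_parts = client.parts[1:] if client.anchor else client.parts
    flavour = PurePosixPath if unix else PureWindowsPath
    return str(flavour(server_root, "users", user, "Files", *relative_parts))


windows_target = server_source_path(
    r"C:\Myfiles\File.txt", r"C:\Program Files\Informatica Data Quality", "user.name")
unix_target = server_source_path(
    r"C:\Myfiles\File.txt", "/home/Informatica/DataQuality", "user.name", unix=True)

print(windows_target)
print(unix_target)
```

Pure path classes are used so that the mapping can be computed on any operating system without touching the file system.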
Dictionary Files
Data Quality looks for dictionary files differently from source files.
The installation processes for Data Quality Workbench and Server create an empty Dictionaries folder under
the top-level Informatica Data Quality folder. This folder is populated with dictionary files by the Content
Installer.
By default, the Dictionaries folder is created at the following location on Windows systems:
C:\Program Files\Informatica Data Quality\Dictionaries
and at the following location on UNIX systems:
/home/Informatica/DataQuality/Dictionaries
Data Quality Server also creates a separate dictionary folder for each Data Quality user that connects into the
service domain. The folder is created when the client user first opens the File Manager or first attempts to run a
plan remotely.
A remotely-run plan first looks for dictionaries in the client user's Dictionaries folder. If this folder does not
contain the required dictionaries, the plan looks in the Dictionaries folder created during installation.
Therefore, when you run a plan on the server, you do not need to copy dictionary files to your user dictionary
folder on the server if those dictionaries already exist in the server's dictionary folder.
By default, user dictionary folders are created in the following server locations:
UNIX: /home/Informatica/DataQuality/users/user.name/Dictionaries
Windows: C:\Program Files\Informatica Data Quality\users\user.name\Dictionaries
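The lookup order for dictionaries in a remotely-run plan can be illustrated with a short sketch. The function name and the dictionary file name are hypothetical. The code shows only the search order described above, using temporary folders in place of a real installation.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_dictionary(name: str, user_dir: Path, shared_dir: Path) -> Optional[Path]:
    """Return the first existing copy of a dictionary file, user folder first."""
    for folder in (user_dir, shared_dir):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    return None  # not found in either folder


with tempfile.TemporaryDirectory() as root:
    # Stand-ins for the user folder and the folder created during installation.
    user_dir = Path(root, "users", "user.name", "Dictionaries")
    shared_dir = Path(root, "Dictionaries")
    user_dir.mkdir(parents=True)
    shared_dir.mkdir()

    # Only the shared folder holds the file, so the lookup falls back to it.
    (shared_dir / "example.DIC").write_text("placeholder\n")
    found = find_dictionary("example.DIC", user_dir, shared_dir)
    print(found.parent == shared_dir)
```

If a file with the same name exists in both folders, the copy in the user folder is returned, which matches the order given above.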
Cross-Platform Plan File Conventions
Data Quality Server handles the translation of client-to-server file paths and Windows-to-UNIX file paths
seamlessly. When a plan is opened on a Windows system, Data Quality ensures that all paths are in a Windows
format, with folders separated by back slashes. When a plan is opened on a UNIX system, Data Quality renders
all paths in UNIX format with folders separated by forward slashes. The transformations and file paths are
case-sensitive and case-preserving.
Implications for Plan Design
When you design a plan in Data Quality Workbench, you should ensure that the folders you create for file
resources can map efficiently to the server folder structure.
For example, a plan runs in Workbench and reads a source file from the following location:
C:\Program Files\Informatica Data Quality\Sources
When this plan runs on a remote Windows machine, Data Quality Server looks for the source file in the
following location:
C:\Program Files\Informatica Data Quality\users\user.name\Files\Program Files\Informatica Data Quality\Sources
The folder path Program Files\Informatica Data Quality is repeated here. In this case, good plan design suggests
the creation of folders under C:\ that can be recreated efficiently on the server.
Version Control
Data Quality's version control features enable you to save multiple versions of a plan, to view the plan version
history, and to edit and run historical versions of the plan.
As well as the most recently-saved version of a plan, Data Quality stores any earlier versions that have been
flagged for retention in the repository. This allows you to save versions of a plan at meaningful points in its
development and to revert to earlier versions of the plan if necessary.
For the purposes of version control, each Data Quality plan has a latest version and one or more base versions.
Latest version. The most recently-saved state of a plan.
Base versions. Earlier versions that have been preserved in the repository.
When you save a plan for the first time, you automatically create a base version. If you do not create another
base version, the plan version history shows details for that base version and the latest version only.
Note the following:
A base version cannot be overwritten. If you are working in a base version and save your changes, the newly-
saved state becomes the latest version.
Version control does not keep every saved state of a plan. It is possible to open, edit, and save a plan
multiple times without adding base versions to the version history.
Version control applies to plans only. Version control does not apply to projects or to the external resources
that a plan may require to run successfully.
Version history is reset when you copy or publish a plan. Version information does not move with a plan
when it is copied within a repository, as this operation effectively creates a new plan. When a plan is
published, it retains the version details of the base version published from the Workbench repository: the
base version number on the client computer, the creation date and time of that base version, the user who
created it, and the comment added by that user. For more information, see Version Control and Plan
Publication on page 10.
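The relationship between the latest version and base versions can be modeled in a few lines. This is an illustrative sketch under stated assumptions: the class names are hypothetical, base version numbers are assumed to start at 1 and increase by one, and the comment attached to the automatic first base version is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a base version cannot be overwritten
class BaseVersion:
    number: int
    content: str
    comment: str


@dataclass
class PlanHistory:
    """Hypothetical model of one plan's version history (illustration only)."""
    latest: Optional[str] = None
    base_versions: List[BaseVersion] = field(default_factory=list)

    def save(self, content: str) -> None:
        """An ordinary save updates the latest version only."""
        first_save = self.latest is None
        self.latest = content
        if first_save:
            # The first save automatically creates a base version.
            self._add_base("first save (placeholder comment)")

    def save_as_base(self, comment: str) -> int:
        """Preserve the current state as a new base version; a comment is required."""
        if self.latest is None:
            raise ValueError("nothing has been saved yet")
        if not comment.strip():
            raise ValueError("a comment is required")
        return self._add_base(comment)

    def _add_base(self, comment: str) -> int:
        number = len(self.base_versions) + 1  # assumed numbering scheme
        self.base_versions.append(BaseVersion(number, self.latest, comment))
        return number


history = PlanHistory()
history.save("draft 1")       # creates base version 1 automatically
history.save("draft 2")       # updates the latest version only
history.save("draft 3")       # still no new base version
new_number = history.save_as_base("added a parsing step")
print(new_number, len(history.base_versions), history.latest)
```

The two intermediate saves leave no trace in the history, which illustrates why version control does not keep every saved state of a plan.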
Version Control Commands
You can perform all plan activities in Data Quality without interacting with the version control features.
However, all plans in the repository are assigned a version history that you can access through a shortcut menu.
When you right-click a plan name and select Version Control, a submenu opens with the following options:
History. Opens the History Viewer dialog box, which provides file properties for the latest and base versions
of the plan.
Get Latest Version. Opens the last-saved version of the plan or, if the plan is open, restores the onscreen plan
to its last-saved version.
Save Plan as Base Version. Saves the current state of the plan as a new base version. You must enter a
comment describing your changes when you save a new version of the plan.
Viewing Version History
The History Viewer dialog box lists the plan versions maintained in the repository, with the latest version at the
top of the list.
It lists the latest and base versions of the plan, showing the version number, creation date and time, author (the
user who saved the plan), and the comment provided by the author when the version was created.
The Comment for Version pane shows the full text of the comment entered for the version.
Figure 1-2 shows the History Viewer dialog box:
Figure 1-2. History Viewer
Tracking Plans Across the Service Domain
The History Viewer can be useful to service domain users who want to track the progress of a plan through the
enterprise. As a plan retains the version details of its meaningful iterations, the History Viewer facilitates an
audit trail that can assist collaboration between plan designers and the users who deploy the plans.
Opening Plans with Version Control
When you double-click a plan in the Project Manager, you retrieve its latest saved version. You can also open
the latest version of a plan through the version control menus by right-clicking a plan name and selecting
Version Control > Get Latest Version.
The Get Latest Version option also allows you to revert to the latest saved version while working with a plan. If
your plan has unsaved changes when you select Get Latest Version, Data Quality prompts you to confirm the
command, since reverting to the latest version will undo your changes.
Use the following procedure to open a base version of the plan.
To open a base version of a plan:
1. In the Project Manager, right-click a plan name and select Version Control > History.
2. In the History Viewer dialog box, select the required base version and click Open Selected Version.
Saving, Deleting, and Renaming Plans
Version control is sensitive to general plan operations. By default, any save command will update the latest plan
version.
When you save a plan for the first time, you automatically create a base version. When you create a subsequent
base version, the latest version is automatically updated.
When you rename a plan, the name change is propagated through all base versions of the plan.
When you delete a plan, you delete all versions. It is not possible to delete a specific base version of a plan.
To create a base version:
1. In the Project Manager, right-click the name of the plan and select Version Control > Save Plan as Base
Version.
2. In the Confirm Base Version Creation dialog box, type a comment explaining the operation.
You will not be allowed to proceed without typing a comment in this dialog box.
3. Click Set As Base Version.
Version Control and Plan Publication
Data Quality treats version control differently for publication and local repository copy/move operations.
Publication preserves a plan's most recent base version information. Local repository copy/move operations do
not.
Consider a plan published from the local repository to the domain. Publishing the plan sends its most recent
base version, with that version information, to the domain repository. Version information copied with the
published version includes the version number of the published base version on the client, the user who created
the base version on the client, a date-time stamp for the creation of that version, and the comments added when
the version was created. In this way, a plan on the domain is traceable back to its point of origin.
The domain repository also initiates its own version history for the plan. When a plan is first published, the
domain repository assigns it a base version number of 1 while also retaining the client-side version data for the
published version. If a client user subsequently publishes the plan a second time, the domain repository
increments its base version number while again retaining the client-side version data.
For example, you have published base version 5 of a plan from your Workbench repository to the domain
repository. The domain repository creates base version number 1. After working locally on the plan, you publish
base version number 8 from your Workbench repository to the domain, creating a new base version number in
the domain repository.
Table 1-2 illustrates the changes in version details:
Note:
Publication copies/moves the most recent base version, which may not be the latest saved version.
When a plan is copied within the client repository, only the latest saved version is copied/moved. All base
versions are discarded.
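The publication behavior described in this section can be sketched as a small model. The class and field names are hypothetical, and the author, timestamp, and comment values are sample data. The sketch shows only that the domain repository keeps its own base version counter while storing the client-side details with each published version.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClientVersionInfo:
    """Client-side details that travel with a published base version."""
    number: int
    author: str
    created: str
    comment: str


class DomainRepository:
    """Hypothetical domain repository that numbers publications per plan."""

    def __init__(self) -> None:
        self._published: Dict[str, List[ClientVersionInfo]] = {}

    def publish(self, plan_name: str, client_info: ClientVersionInfo) -> int:
        """Store the client details and return the new domain base version number."""
        versions = self._published.setdefault(plan_name, [])
        versions.append(client_info)
        return len(versions)

    def origin(self, plan_name: str, domain_version: int) -> ClientVersionInfo:
        """Trace a domain base version back to the client version it came from."""
        return self._published[plan_name][domain_version - 1]


domain = DomainRepository()
first = domain.publish(
    "cleanse_names", ClientVersionInfo(5, "user.name", "sample date 1", "first release"))
second = domain.publish(
    "cleanse_names", ClientVersionInfo(8, "user.name", "sample date 2", "revised rules"))

print(first, second, domain.origin("cleanse_names", second).number)
```

Client base versions 5 and 8 become domain base versions 1 and 2, and each domain version can still be traced to its client-side origin.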
Working with Multiple Instances of a Plan
Data Quality is designed to be flexible. To enable teamwork between plan designers, it does not apply any locks
to an open plan. Though it is possible for users on different systems to work on a plan concurrently, this is not
recommended.
The following section describes plan behavior in the event that different instances of Data Quality Workbench
are working with the same plan.
When you save a plan, Data Quality checks the repository to determine if there have been any updates to the
plan since its last save event. If it finds such an update, the system prompts you to confirm that you want
to overwrite the saved plan. This updates the latest version in the repository. Any changes made by the other
user will be lost.
When you save a plan as a base version, Data Quality checks for any updates to the list of base versions for
that plan. If it finds such an update, the system notifies you that a new base version will be created with a
version number incremented from the version most recently created by the other user.
Updating a base version also overwrites the latest saved version in the repository. Data Quality performs two
checks in this case: to establish if the latest version has been updated and to establish if a more recent base
version has been created. When you create a base version in this case, you are asked to accept the changes to
both versions of the plan. If you click No in either case, the plan will not be saved and the base version not
created.
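The save check described above resembles an optimistic concurrency check, which can be sketched as follows. The revision counter is an assumption made for illustration, because this guide does not state how Data Quality detects an intervening update. The class, method, and plan names are hypothetical.

```python
from typing import Callable, Dict, Tuple


class Repository:
    """Hypothetical repository that tracks a revision counter per plan."""

    def __init__(self) -> None:
        self._content: Dict[str, str] = {}
        self._revision: Dict[str, int] = {}

    def open(self, plan: str) -> Tuple[str, int]:
        """Return the latest saved content and the revision it was read at."""
        return self._content.get(plan, ""), self._revision.get(plan, 0)

    def save(self, plan: str, content: str, opened_at: int,
             confirm_overwrite: Callable[[], bool]) -> bool:
        """Save the plan; ask for confirmation if someone else saved in the meantime."""
        current = self._revision.get(plan, 0)
        if current != opened_at and not confirm_overwrite():
            return False  # the user declined, so nothing is written
        self._content[plan] = content
        self._revision[plan] = current + 1
        return True


repo = Repository()
repo.save("cleanse_names", "original", 0, lambda: True)

_, rev_a = repo.open("cleanse_names")  # user A opens the plan
_, rev_b = repo.open("cleanse_names")  # user B opens the same plan

repo.save("cleanse_names", "edited by A", rev_a, lambda: True)
declined = repo.save("cleanse_names", "edited by B", rev_b, lambda: False)
accepted = repo.save("cleanse_names", "edited by B", rev_b, lambda: True)

print(declined, accepted, repo.open("cleanse_names")[0])
```

When user B confirms the overwrite, the changes saved by user A are lost, which is why concurrent editing is not recommended.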
Organizing the Workbench User Interface
You can organize the components on the plan workspace in any manner you choose. The Data Quality
Workbench user interface provides menu options that allow you to organize your plan components in a
meaningful way:
The component icons are connected by directional lines in the workspace. These lines indicate the directions
in which data flows within the plan. However, the directional lines do not provide a foolproof indicator of
whether one component precedes another in plan operations. The relative positions of the icons in the
workspace do not affect the running of the plan.
Table 1-2. Version Data Updated During Plan Publication
                  Client Repository   Domain Repository
Version Number    5                   1
Version Number    8                   2
Another method of keeping track of the component dependencies in a plan is to assign components to one
or more layers. Layers let you show or hide component icons onscreen. You can create a layer through the
Plan Layer Manager, available from the Tools menu.
To assign a component to a layer, right-click it and select Assign To Layer from the context menu. To view
only the components in a single layer, select View > Plan Layers.
To view a snapshot of the current source data in the plan, open the Source Viewer (F6). This window
appears in the workspace and displays the first 250 rows of the source data currently in use.
The plan components can make use of reference dictionary files to determine the validity of data values.
These dictionaries are visible through the Workbench Dictionary Manager (F8).
You can read or add notes to a plan by opening the Plan Notes window (F11). This window is a free-text tool
that allows you to comment on any aspect of the plan.
Workbench Naming Conventions
When you design or edit plans that will be shared with other users, it is good practice to name your Workbench
elements in an agreed and consistent manner.
You and your team should agree on a clear and consistent set of naming conventions for projects, folders, plans,
configurable components, component elements, and dictionaries.
For a comprehensive guide to developing a naming system for these elements, see Informatica Data Quality
Naming Conventions on page 165.
CHAPTER 2
Data Source Components
This chapter includes the following topics:
Overview, 13
CSV Source, 13
Database Source, 14
Fixed Width Source, 16
Realtime Source, 16
SAP Source, 17
CSV Match Source, 19
CSV Dual Match Source, 20
Database Match Source, 20
Group Source, 21
Dual Group Source, 22
Overview
Source components are used to specify the location of the input data files for a plan. This chapter describes all
source components in Data Quality except the CSV Identity Group Source and the DB Identity Group Source.
For information on the configuration of these two components, see page 89.
CSV Source
The CSV Source component connects to files with data organized in a delimited format, such as
comma delimited (CSV), to provide source data for a plan. When configuring this component you
specify the location of the delimited file, the type of delimiter used, and other options as described
below.
Configuration
The CSV Source configuration dialog box contains the following editable fields:
Source File. Displays the name of the file to which the component connects.
Select. Click this button to browse to the source file.
When you click Select, the Select a CSV File as a Source dialog box opens. This dialog box provides an
option to identify the character encoding associated with the dataset. For more information, see Character
Encodings and Unicode on page 159.
Field Delimiter. Select a field delimiter appropriate to the source data from this menu. The default option is
comma. If headings for the column source data contain this delimiter, you must use a text qualifier to
preserve the data structure.
Text Qualifier. Select a qualifier appropriate to the source data from this menu. A text qualifier should
enclose any delimiter value in your data that you do not want to use as a field delimiter. The default option
is the double quote (").
First Line of File is the Header. Use this option to designate the first line of data in the source file as a
header and thus distinguish it from the rest of the dataset.
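The roles of the field delimiter, text qualifier, and header option can be demonstrated outside Workbench with Python's standard csv module. This sketch illustrates the general delimited-file concepts only, not the CSV Source component itself. The function name and sample data are hypothetical.

```python
import csv
import io
from typing import List, Tuple

# Sample data: the name values contain the delimiter, so they are enclosed
# in the text qualifier (double quote).
SAMPLE = 'id,name\n1,"Smith, John"\n2,"Doe, Jane"\n'


def read_delimited(text: str, delimiter: str = ",", qualifier: str = '"',
                   has_header: bool = True) -> Tuple[List[str], List[List[str]]]:
    """Parse delimited text and return (header, data rows)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=qualifier)
    rows = list(reader)
    if has_header and rows:
        return rows[0], rows[1:]  # first line is the header
    return [], rows               # every line is data


header, rows = read_delimited(SAMPLE)
print(header)
print(rows)
```

Because the qualifier encloses each name, the comma inside it is read as data rather than as a field boundary, so each row keeps two fields.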
Database Source
The Database Source component connects directly to a database to provide source data for a plan.
When configuring a Database Source, you identify the required database type, connect to a database
available to Data Quality, and configure the tables and columns on the database to produce a source
dataset for your plan.
Configuration
The component dialog box displays configuration options across four tabs: Connect To Database, Before,
During, and After.
The connection is defined on the Connect To Database tab. The Before tab settings create the database table
that will be populated with the source data for the plan. The During options define the data that is used in the
plan, i.e. by selecting and joining columns from the available databases and adding the data to the table defined
in the Before tab. The After tab updates the table configured on the previous tabs and determines the state of
the data as it will be used by other plan components.
Note: The Before, During, and After tabs work in the same fashion for all database types except Microsoft SQL
Server. If you are using the DB Source to read from a Microsoft SQL Server database, bear the following items
in mind:
When composing a Where query on the During tab with text containing Unicode data, the text must be
preceded with the letter N, for example N'unicode data'.
When using INSERT statements on the Before or After tab, all values written to columns that hold
Unicode data must be preceded with the letter N.
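If you generate such SQL text with a script, a small helper can add the N prefix consistently. The sketch below is illustrative: the function names and the column name are hypothetical. It builds SQL Server Unicode string literals by prefixing N and doubling any embedded single quotes.

```python
def unicode_literal(value: str) -> str:
    """Return a SQL Server Unicode string literal: N'...' with single quotes doubled."""
    return "N'" + value.replace("'", "''") + "'"


def where_equals(column: str, value: str) -> str:
    """Build a simple condition; the column name is assumed to be a trusted identifier."""
    return f"{column} = {unicode_literal(value)}"


# Hypothetical column name and values, for illustration only.
print(where_equals("city", "Zürich"))
print(unicode_literal("O'Brien"))
```

The resulting text can be pasted into the Where condition or an INSERT statement as described above.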
Connect To Database Tab
When connecting to a database source, first identify the database type.
The Database Type menu provides five options: Staging, IBM DB2, Oracle, Microsoft SQL Server, and ODBC
(connection to an ODBC-compliant database).
Staging is the default option. It refers to the local database used by Data Quality. The remaining Database
Information and Login Information fields are disabled for this option. That is, you can connect to the local
repository without setting any other options on this page.
When you connect to IBM DB2, Microsoft SQL Server, or ODBC-compliant databases, you must provide a
Data Source Name (DSN) for the database and you might be prompted to provide a valid username and
password combination. The DSN field identifies the database on the network.
When you connect to an Oracle database, you must provide the System Identifier (SID) that refers to the
Oracle instance.
The Encoding menu lists the available character encodings that can be applied to the data read into the plan.
For more information, see Character Encodings and Unicode on page 159.
The Login Information area contains Username and Password fields. Use these fields when access permissions
have been applied to the database in question. Data Quality does not require this information by default.
Click Connect to establish the connection.
Before Tab
The Before tab has a Database pane and SQL Script pane.
The Database pane displays the available databases and tables in a folder hierarchy. Browse the hierarchy to
locate the data source tables and columns and write the SQL script that defines the table in the SQL Script
pane. Clicking on a folder or column in the left pane transposes its name to the right pane to aid accuracy in
scripting.
The following sample script creates an elementary table called Names:
    drop table if exists names;   # overwrites any existing names table
    create table names
    (
        id int,                   # id field populated by integers
        name varchar(255)         # name field entries up to 255 chars
    );
Click Execute to run the script and create the table. You must click Execute before proceeding to the During
tab.
Click Stop On Error if you want the system to stop the script operation and display an error message if the
execution encounters a problem.
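To try a Before tab script outside Workbench, the same table definition can be run against a scratch database. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the staging database. That substitution is an assumption made for illustration, as is the function name `run_before_script`. The comments use the standard `--` syntax because SQLite does not accept `#` comments.

```python
import sqlite3

# Same structure as the sample script, using standard "--" comments.
BEFORE_SCRIPT = """
DROP TABLE IF EXISTS names;   -- remove any existing names table
CREATE TABLE names
(
    id   INT,                 -- id field populated by integers
    name VARCHAR(255)         -- name field entries up to 255 chars
);
"""


def run_before_script(conn: sqlite3.Connection, script: str) -> None:
    """Run the whole script; an error stops execution and is raised."""
    conn.executescript(script)


conn = sqlite3.connect(":memory:")
run_before_script(conn, BEFORE_SCRIPT)
run_before_script(conn, BEFORE_SCRIPT)  # safe to rerun: the table is dropped first

columns = [row[1] for row in conn.execute("PRAGMA table_info(names)")]
print(columns)  # ['id', 'name']
```

Raising on the first failing statement resembles the Stop On Error behavior described above, although the exact Workbench behavior may differ.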
During Tab
The During tab allows you to browse database tables and filter the columns to provide source data for your
plan. You can also apply conditions to tables and join columns from multiple tables. The tab shows five
columns:
Database. Like the Before tab, the Database column displays the database structure as a folder hierarchy of
tables and columns.
Select. Provides check boxes for the columns in the explored tables. Check a column check box under Select
to add its data to the dataset.
Join. Lets you select columns from multiple tables for join operations so their data is added to the dataset.
Where and Text. These columns allow you to specify the conditions for data inclusion, both for the columns
identified in the Select column and the columns to be joined. Note the following:
To activate the editable fields in the Where and Text columns, click in the column. Use the fields in the
Where column to access conditional statements. You can enter text in the Text column for each database
column.
You can use the Where statement builder to specify the join condition to join two databases using two
Database Source components. Select a database table in the Join column by checking its check box. A new
Join column, such as Join1, appears to its right.
The During tab also contains the following options:
Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. These options are cleared by default.
Expert mode. Use to view and edit the underlying SQL query statements, and to create advanced select
statements. This option is cleared by default.
Preview. Use the Preview option to view the dataset as defined by the configured settings in this dialog box.
The Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
Validate. Use the Validate option to verify that the SQL query is valid. This option allows you to
periodically test validity as you are constructing an SQL query.
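The Select, Join, Where, and trim settings on the During tab correspond to the parts of a SQL SELECT statement. The following sketch shows the kind of query these settings describe, run against an in-memory SQLite database. The table names, column names, and data are hypothetical, and the SQL that Data Quality actually generates (visible in Expert mode) may be structured differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (id INT, name VARCHAR(255));
CREATE TABLE cities (id INT, city VARCHAR(255));
INSERT INTO names VALUES (1, '  Smith '), (2, 'Jones'), (3, 'Wilson');
INSERT INTO cities VALUES (1, 'New York'), (2, 'Boston'), (3, 'New York');
""")

# Select: names.name and cities.city. Join: on the id columns.
# Where: city equals 'New York'. Both trim options are enabled.
query = """
SELECT TRIM(n.name) AS name, TRIM(c.city) AS city
FROM names AS n
JOIN cities AS c ON n.id = c.id
WHERE c.city = 'New York'
ORDER BY n.id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Smith', 'New York'), ('Wilson', 'New York')]
```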
After Tab
The After tab completes the process of generating the plan dataset. The Before tab runs SQL scripts on the
database prior to its configuration. The After tab permits SQL scripts to run on the configured dataset. Like the
Before tab, the After tab displays Database and SQL Script panes.
You can browse the configured tables and columns in the left pane and write the SQL script to run on data in
the right pane.
For more information and examples, see SQL Scripts on page 155.
Fixed Width Source
Use this component to specify a fixed-width file as the data source for your plan. This component
allows you to edit column names, widths, and data types.
Configuration
The Fixed Width Source configuration dialog box contains the following features:
Source File. Displays the name of the file to which the source component connects.
Select. Click this button to browse to the source file.
When you click Select, the Select a Fixed Width File as a Source dialog box opens. You can create a new file
by typing a name in the File Name field of this dialog. In this dialog box, you can identify the character
encoding associated with the dataset. For more information, see Character Encodings and Unicode on
page 159.
Fixed Width columns. The columns in this group allow you to enter the name, width, and datatype for each
field in the file.
Remove Trailing Spaces. Use this option to remove trailing spaces (extra spaces at the end of data values)
from the dataset used in the plan.
Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
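A fixed-width record is split purely by character position, so the column names and widths you enter determine how each line is read. The sketch below shows the idea in Python. The column specification, the sample record, and the function name `parse_fixed_width` are hypothetical and do not reflect the internal implementation of the component.

```python
from typing import Dict, List, Tuple

# Hypothetical column specification: (name, width) pairs.
COLUMNS: List[Tuple[str, int]] = [("id", 4), ("name", 10), ("city", 8)]


def parse_fixed_width(line: str, columns: List[Tuple[str, int]],
                      remove_trailing_spaces: bool = True) -> Dict[str, str]:
    """Slice one record into named fields using the column widths."""
    record = {}
    start = 0
    for name, width in columns:
        value = line[start:start + width]
        if remove_trailing_spaces:
            value = value.rstrip(" ")
        record[name] = value
        start += width
    return record


# Build a sample record: 4 + 10 + 8 characters.
line = "0001" + "Smith".ljust(10) + "New York"
print(parse_fixed_width(line, COLUMNS))
# {'id': '0001', 'name': 'Smith', 'city': 'New York'}
```

With `remove_trailing_spaces=False`, the name field keeps its padding and is returned as `'Smith     '`.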
Realtime Source
The Realtime Source allows you to develop plans that accept input in real time from live data entry or
other applications. To configure this component, define the input fields that will pass data to the plan.
Configuration
The Realtime Source configuration dialog box includes an Inputs column and an Input Type column and, when
first added to a plan, a single, undefined row.
To add or delete rows in the table, right-click in the dialog box and use the context menu. The Delete
option deletes the highlighted row.
The following columns display:
Inputs. Double-click a field in this column to edit the input name. Click OK to apply your changes before
moving from the field.
Input Type. Click a field in this column to view options for defining the input data type. The options are
String or Float.
For example, you may want to design a simple real-time plan to test the validity of a data code. The data code is
valid within an organization if it contains the correct year (for example, 2005 in Figure 2-1). You can write a
rule in the Rule Based Analyzer to check if any given input string contains this value. When you test the plan in
Workbench, an input dialog box like the following appears:
Type the year (or any value) in the Value field and click OK to return a result. In a real-time scenario, data
inputs are checked without any direct user activity.
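The logic of the data code check described above can be expressed in a few lines. The sketch below is written in Python for illustration only: the Rule Based Analyzer has its own rule syntax, which is not shown here, and the function name and sample codes are hypothetical.

```python
def code_contains_year(data_code: str, year: str = "2005") -> bool:
    """Return True if the input string contains the expected year."""
    return year in data_code


print(code_contains_year("INV-2005-0042"))  # True
print(code_contains_year("INV-2004-0042"))  # False
```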
SAP Source
Note: This component is no longer installed with Data Quality. You can run plans created in earlier
releases that contain this component, and you can export such plans as mapplets for use in
PowerCenter.
The SAP Source allows you to use an SAP database as the data source in a plan. To obtain the data, the SAP
Source connects to an SAP system and uses a BAPI (Business API) function to read data from the SAP database.
In the SAP Source component configuration dialog box you can identify the SAP system and set the input and
output parameters of the function. Set the input parameters to filter the database for the data relevant to your
plan. Set the output parameters to specify the data to be used in the plan.
Data Quality SAP connectivity is licensed separately from other Workbench components. If your license does
not include SAP connectivity, contact Informatica Global Customer Support. Similarly, the SAP Source
requires a valid connection to the SAP System and a corresponding SAP license for the SAP System.
Configuration
The configuration dialog box for the SAP Source displays its options on two tabs:
Connection
SAP System
Connection Tab
The Connection tab displays the following options:
Host. The name or IP address of the SAP host computer.
Client Number. Identifies an SAP client that you are authorized to use.
An SAP system can have multiple clients, each identified by a three-digit client number.
Figure 2-1. Realtime Source: Data Setup Dialog Box
System Number. A two-digit number that identifies the application server to which you want to connect.
SAP allows multiple application server instances to run against a database.
Encoding. Character encodings that can be applied to the data read into the plan. Data Quality handles all
data read over an ODBC connection as Unicode, regardless of the selection in this field. For more
information, see Character Encodings and Unicode on page 159.
Username and Password. SAP username and password to identify you to the SAP system.
SAP System Tab
After entering the required information on the Connection tab, click Connect to open the SAP System tab.
The SAP application areas available on the connected system are listed on the left. On the right are options
for defining the input and output parameters to be used in the function call to the SAP database.
You can explore the SAP application areas to reveal the business objects defined for each area and the functions
that can be configured for each business object. The icons associated with each level are color-coded:
application area icons are yellow, business object icons are green, and function icons are red.
Your first task is to explore the available objects and select the function you want to run. Then, you can define
the function using the Import and Export tab options.
Import Tab
On the Import tab, you can set the input parameters of the function that retrieves data from the SAP database.
With this tab selected, two columns display:
Name. Lists the input parameters available for the function.
Value. Use to filter parameter output. To enter a filter, click in the Value column for the parameter and
enter a filter string.
Note that there are three types of parameters. Configure the values on the Import tab based on the parameter
type:
Scalar parameter. A single name-value pair of the type described above, such as the name Town with the
value Chicago.
Structure parameter. A group of one or more scalar parameters, such as a multi-line address group. A
structure can have multiple rows but has a single column of values, for example:

ADDRESS
AddressLine1    781 Fifth Avenue
AddressLine2    New York
AddressLine3    NY
AddressLine4    10022

Table parameter. Contains one or more rows of data described by one or more columns. For example, each
name below has multiple values:

CUSTOMERS
Name      AddressLine1       AddressLine2    AddressLine3
Smith     Fifth Avenue       New York        NY 10022
Jones     Park Avenue        New York        NY 10128
Wilson    Columbus Avenue    New York        NY 10025
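The three parameter types differ only in shape, which a small data model can make concrete. The sketch below represents the examples above as plain Python structures. The variable names and the helper `column` are hypothetical, and no real BAPI call is made.

```python
from typing import Dict, List

# Scalar parameter: one name-value pair.
scalar: Dict[str, str] = {"Town": "Chicago"}

# Structure parameter: several named values in a single column of values.
address: Dict[str, str] = {
    "AddressLine1": "781 Fifth Avenue",
    "AddressLine2": "New York",
    "AddressLine3": "NY",
    "AddressLine4": "10022",
}

# Table parameter: several rows, each described by the same columns.
customers: List[Dict[str, str]] = [
    {"Name": "Smith", "AddressLine1": "Fifth Avenue",
     "AddressLine2": "New York", "AddressLine3": "NY 10022"},
    {"Name": "Jones", "AddressLine1": "Park Avenue",
     "AddressLine2": "New York", "AddressLine3": "NY 10128"},
    {"Name": "Wilson", "AddressLine1": "Columbus Avenue",
     "AddressLine2": "New York", "AddressLine3": "NY 10025"},
]


def column(table: List[Dict[str, str]], name: str) -> List[str]:
    """Return the values of one column of a table parameter."""
    return [row[name] for row in table]


print(column(customers, "Name"))  # ['Smith', 'Jones', 'Wilson']
```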
Export Tab
The Export tab displays output parameters that correspond to the settings on the Import tab. The export
parameters determine the data values that are exported from the SAP database for use as source data in your
data quality plan.
The export parameters that appear are specific to the function being used:
Value. To select a parameter for data export to your plan, use the Value check box of the parameter.
Depending on the parameter type, you might need to select individual data elements for export.
Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. These options are cleared by default.
Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
Click OK in the configuration dialog box to save your changes.
CSV Match Source
The CSV Match Source compares the records in a single source file to identify duplicates. The source
file must be delimited. This component makes use of a CSV file in a similar manner to the CSV
Source component, then selects data for a matching operation. To match between two delimited
source files, use the CSV Dual Match Source component. For more information, see CSV Dual
Match Source on page 20.
When the CSV Match Source has been configured, two versions of each field in the source dataset will be
visible to the matching components. To distinguish between them, _1 and _2 are appended to the field
names.
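The _1 and _2 suffixes can be pictured as two copies of each record placed side by side so that matching components can compare them. The sketch below is one way to model that pairing in Python. How Data Quality pairs records internally is not described in this guide, so the function `pair_records` and its use of every distinct pair are assumptions.

```python
from itertools import combinations
from typing import Dict, Iterable, Iterator


def pair_records(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield every pair of distinct records as one row with _1/_2 fields."""
    for first, second in combinations(list(records), 2):
        row = {f"{key}_1": value for key, value in first.items()}
        row.update({f"{key}_2": value for key, value in second.items()})
        yield row


records = [{"name": "Smith"}, {"name": "Smyth"}, {"name": "Jones"}]
pairs = list(pair_records(records))
print(len(pairs))  # 3
print(pairs[0])    # {'name_1': 'Smith', 'name_2': 'Smyth'}
```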
The CSV Match Source is one of two components that enable the generation of match cluster information by
the CSV Match Target. The other source component is the Group Source. If you want to use the CSV Match
Target Identified Matches option to generate match cluster information, you must use CSV Match Source or
Group Source in the plan.
Configuration
The configuration dialog box contains the following fields:
Source File. Displays the name of the file to which the source component connects.
Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see Character Encodings and Unicode on page 159.
Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings
for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.
Text Qualifier. Select the text qualifier used in the source file. A text qualifier should enclose any delimiter
value in your data that you do not want to use as a field delimiter. The default option is the double
quotation mark (").
First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the dataset.
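The interaction between the field delimiter, the text qualifier, and the header option can be seen by parsing a small delimited sample. The sketch below uses Python's standard `csv` module, and the sample content is hypothetical. It illustrates the general convention rather than the parser used by Data Quality.

```python
import csv
import io

# Hypothetical file content: the name in the first data row contains the delimiter.
sample = 'id,name\n1,"Smith, John"\n2,Jones\n'

reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"')
header = next(reader)  # equivalent of First Line of File is the Header
rows = list(reader)

print(header)  # ['id', 'name']
print(rows)    # [['1', 'Smith, John'], ['2', 'Jones']]
```

Without the text qualifier, the first data row would be split into three fields instead of two.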
CSV Dual Match Source
This component allows you to match data from two delimited source files. The functionality of the
component is similar to that of the CSV Match Source, except the Dual Match Source compares data
across two files.
Configuration
The CSV Dual Match Source configuration dialog box displays a set of options in two areas: Source 1 and
Source 2. Each area provides identical settings for selecting and configuring a dataset. The settings in each area
are identical to those in the configuration dialog for the CSV Match Source:
Source File. Displays the name of the file to which the source component connects.
Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see Character Encodings and Unicode on page 159.
Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings
for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.
Text Qualifier. Select the text qualifier used in the source file. A text qualifier should enclose any delimiter
value in your data that you do not want to use as a field delimiter. The default option is the double
quotation mark (").
First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the dataset.
Note: If the CSV Dual Match Source component is being used for Match-and-Append operations, the reference
file appears in the Source 2 area.
Database Match Source
The Database Match Source component lets you explore the Data Quality repository to select tables
and columns for use in a matching plan. To configure this component you connect to the Data
Quality repository and configure the dataset.
The Database Match Source provides a single-component alternative for plans that use two Database Source
components to match data across a single table.
Configuration
The Database Match Source configuration dialog box includes two tabs: Connect to Database and Match
Selection. The Connect To Database tab options are identical to the Connect to Database tab on the Database
Source configuration dialog box, as described in Database Source on page 14.
Connect to Database Tab
The Database Match Source connects to the Data Quality staging database.
Click Connect to effect the connection and open the Match Selection tab. The remaining options on this tab
are disabled.
Match Selection Tab
The options on this tab allow you to explore the database tables defined in the repository and select the
columns to provide data for the matching plan:
Database. Displays the repository structure as a folder hierarchy of tables and columns.
Select. Provides check boxes for the columns in the explored tables. Check Select for a column to add its data
to the dataset.
Unique ID. Use to identify the data column to provide the unique ID for the dataset. The dataset can have
one unique ID only.
Group Key. The fields that the matching plan searches for common values. Select one or more group keys.
Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. These options are cleared by default.
Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
Note: Configuring a column for Unique ID or Group Key automatically checks the Select option to add the
column to the dataset. However, clearing either option does not automatically remove the column from the
dataset. Clear the Select option to remove a column from the dataset.
Group Source
The Group Source component defines the input data for a plan by reading the set of group files
created by a Group Target in another plan. When you configure the Group Source to connect to the
set of group files, the Group Source uses the dataset underlying these files as the source for the plan,
providing the data to the operational components on a group-by-group basis.
Grouped data is chiefly used in matching plans, although it can be used in other types of plans.
Groups are produced by the Group Target component. The Group Target creates a set of delimited text files in
a proprietary format and saves the files in a user-defined directory. The files use the extension SSG. When
configuring the Group Source, you need to specify the host directory for the grouped files.
Groups are created in the Group Target component by defining one or more key grouping fields for the dataset.
All records with common values in the key grouping fields will be associated with a single group.
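Grouping by key fields can be sketched as a dictionary keyed on the grouping values. The example below is illustrative only: it does not read or write SSG files, and the field names, sample records, and helper names are hypothetical. It also counts record pairs to show how grouping reduces the number of comparisons a matching operation needs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_records(records: List[Dict[str, str]],
                  key_fields: List[str]) -> Dict[Tuple[str, ...], List[Dict[str, str]]]:
    """Associate records that share values in all key grouping fields."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record)
    return dict(groups)


def pair_count(n: int) -> int:
    """Number of distinct record pairs among n records."""
    return n * (n - 1) // 2


records = [
    {"name": "Smith", "zip": "10022"},
    {"name": "Smyth", "zip": "10022"},
    {"name": "Jones", "zip": "10128"},
    {"name": "Jonse", "zip": "10128"},
]
groups = group_records(records, ["zip"])

print(sorted(groups))                                    # [('10022',), ('10128',)]
print(pair_count(len(records)))                          # 6 pairs without grouping
print(sum(pair_count(len(g)) for g in groups.values()))  # 2 pairs with grouping
```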
The Group Source is one of two components that enable the generation of match cluster information by a CSV
Match Target. The other source component is the CSV Match Source. If you want to use the CSV Match
Target Identified Matches option to generate match cluster information, you must use Group Source or CSV
Match Source in the plan.
You can use the Dual Group Source to group data from two data sources. For more information, see Dual
Group Source on page 22.
Configuration
The Group Source configuration dialog box contains the following features:
Select Directories pane. Identifies the directory or directories containing the grouped data you want to use.
To add a directory, right-click in the pane and click Add from the menu.
Select a Source Group Directory dialog box. Appears after you add a directory. Use to select a folder to act
as the source directory. Be sure to select a folder, not a file.
Column Headers pane. Displays the headings for each data column in the group highlighted in the Select
Directories pane. This pane has no editable options.
Note the following:
Group files do not contain data from the underlying dataset, and group creation does not edit the
underlying dataset in any way. Groups are a way to identify data records with common values so these
records can be processed together in matching operations. Matching operations can be performed on
grouped data at significantly higher speeds than on non-grouped data.
The column names in the Column Headers pane are appended with _1 or _2. The columns are derived
from the source dataset in the plan that generated the SSG files. Each column in the dataset is duplicated so
that its data values can be matched.
Dual Group Source
The Dual Group Source allows you to perform matching operations on grouped data from two
different data sources. It uses the SSG files defined for two datasets as input.
Configuration
The Dual Group Source configuration dialog box contains the same elements as the Group Source component.
However, the Dual Group Source dialog box displays two instances of each pane.
The Dual Group Source configuration dialog box contains the following features:
Select Directories pane. Identifies the directory or directories containing the grouped data you want to use.
To add a directory, right-click in the pane and click Add from the menu.
Select a Source Group Directory dialog box. Appears after you add a directory. Use to select a folder to act
as the source directory. Be sure to select a folder, not a file.
Column Headers pane. Displays the headings for each data column in the group highlighted in the Select
Directories pane. This pane has no editable options.
For more information about using grouped data in plans, see Group Source on page 21.
CHAPTER 3
Data Target Components
This chapter includes the following topics:
Overview, 23
CSV Target, 23
Fixed Width Target, 24
Report Target, 25
CSV Merge Target, 26
CSV Match Target, 27
Match Key Target, 29
Group Target, 31
Database Target, 32
Database Report Target, 34
SAP Target, 35
Realtime Target, 36
Overview
Just as you configure source components to specify input data for your data quality plan, you configure target
components to specify plan output. Targets are designed to accept data derived from the source and operational
components of a plan.
This chapter describes all target components in Data Quality except the Identity Group Target and the CSV
Identity Match Target. For information on the configuration of these two components, see page 89.
CSV Target
The CSV Target component defines a delimited file, such as a comma-separated file, as the output
format for your data quality plan.
The component allows you to do the following:
Specify the fields included in the output file, including any combination of data source fields and fields
generated within the plan.
Specify the position of each field in the output file.
Enter a condition to filter data written to the output file.
Configure the plan to create new output files or append data to an existing file.
Configuration
The CSV Target configuration dialog box contains the following options:
Target File. Identifies the output file for the data target.
Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File Name field. You can also
identify the character encoding associated with the dataset. For more information, see Character Encodings
and Unicode on page 159.
Overwrite file? When checked, this option specifies that the plan overwrites the target file every time it runs
(in cases where the target file name and path are unchanged for successive executions of the plan). When
cleared, this option specifies that the plan writes its output to the end of the existing target file each time it
runs. In this case, the target file grows in size each time the plan is run. This box is checked by default.
Condition. Use to apply a condition-based filter, in the form of an IF statement, to the data processed by the
target. Use the filter to limit the records written to the output file.
Specify a condition by selecting a single input data field, an operator, and a condition value.
Inputs. This pane lists the field types available to the target, typically, the data derived from the operational
components of the plan and the source dataset. Beside each field type is a check box. Use the check box to
add a field to the target output.
Outputs. This pane shows the fields that have been selected from Inputs for inclusion in the data output. To
change the order of the output fields, use the Up and Down arrows.
Launch Viewer. If there is a program associated with the file type, use this option to launch a database table
view of the target output automatically when the plan is executed.
First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the rest of the dataset.
Field Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is a
comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to
preserve the data structure.
Text Qualifier. Select a qualifier appropriate to the data from this menu. A text qualifier should enclose any
delimiter value in your data that you do not want to use as a field delimiter. The default option is the double
quotation mark (").
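The combined effect of the Outputs order, the Condition filter, and the Overwrite file option can be sketched with Python's `csv` module. The function `write_csv_target`, its parameters, and the sample records are hypothetical. The sketch mirrors the behavior described above and is not the component's actual implementation.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, List, Optional


def write_csv_target(path: str, records: List[Dict[str, object]],
                     outputs: List[str],
                     condition: Optional[Callable[[Dict[str, object]], bool]] = None,
                     overwrite: bool = True) -> None:
    """Write selected fields in the given order, filtered by a condition."""
    mode = "w" if overwrite else "a"  # append when Overwrite file is cleared
    with open(path, mode, newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=",", quotechar='"')
        for record in records:
            if condition is None or condition(record):
                writer.writerow([record[name] for name in outputs])


records = [
    {"id": 1, "name": "Smith, John", "score": 0.95},
    {"id": 2, "name": "Jones", "score": 0.40},
]
path = os.path.join(tempfile.mkdtemp(), "target.csv")

# First run: overwrite the file and keep only records with score >= 0.9.
write_csv_target(path, records, ["name", "id"],
                 condition=lambda r: r["score"] >= 0.9)
# Second run: append all records to the existing file.
write_csv_target(path, records, ["name", "id"], overwrite=False)

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)  # ['"Smith, John",1', '"Smith, John",1', 'Jones,2']
```

The name that contains a comma is enclosed in the text qualifier automatically, which preserves the two-field structure of each row.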
Fixed Width Target
The Fixed Width Target component generates plan output in a fixed-width file format.
The component allows you to do the following:
Specify the fields included in the output file, including any combination of data source fields and
fields generated within the plan.
Specify the position of each field in the output file.
Specify the length of each fixed width column.
Enter a condition to filter data written to the output file.
Configure the plan to create new output files or append data to an existing file.
Configuration
The Fixed Width Target configuration dialog box contains the following features:
Target File. Identifies the output file for the data target.
Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File Name field. You can also
identify the character encoding associated with the dataset. For more information, see Character Encodings
and Unicode on page 159.
Condition. Use to apply a condition-based filter, in the form of an IF statement, to the data processed by the
target. Use the filter to limit the records written to the output file.
Specify a condition by selecting a single input data field, an operator, and a condition value.
Overwrite File. Use to overwrite the target file with successive executions of the plan. This option is checked
by default. Clearing this option keeps the selected target file from being overwritten, making it read-only.
Inputs. This pane lists the field types available to the target, typically, the data derived from the operational
components of the plan and the source dataset. Beside each field type is a check box. Use the check box to
add a field to the target output.
Outputs. Lists the name, width, and type of each selected input. The values in the cells of the Width column
determine the width as a number of characters for the associated columns of output data.
If the data values are longer than the width specified, the data will be truncated in the output file.
The default data type is String. Valid types are String, Number, and Date.
Launch Viewer. If there is a program associated with the file type, use this option to launch a database table
view of the target output automatically when the plan is executed.
First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the rest of the dataset.
Note that the Fixed Width Source does not use a header record. Clear this option if you intend to use the
fixed-width target output file as a source in another plan.
Launch Specification Viewer. Use this option to open the fixed-width specification file, which specifies the
field names and widths defined for the target output file.
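The effect of the Width setting, including truncation of longer values, can be sketched as follows. The function name, the sample record, and the choice to left-align values and pad them with spaces are assumptions made for illustration. The guide states only that values longer than the specified width are truncated.

```python
from typing import Dict, List, Tuple


def format_fixed_width(record: Dict[str, object],
                       outputs: List[Tuple[str, int]]) -> str:
    """Render one record as a fixed-width line from (name, width) pairs."""
    parts = []
    for name, width in outputs:
        text = str(record[name])
        # Truncate values longer than the width; pad shorter ones with spaces.
        parts.append(text[:width].ljust(width))
    return "".join(parts)


outputs = [("id", 4), ("name", 8)]
line = format_fixed_width({"id": 7, "name": "Wilson-Smythe"}, outputs)
print(repr(line))  # '7   Wilson-S'
```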
Report Target
The Report Target generates an easy-to-read report file that displays plan output data. The report files
can be opened in other applications, including web browsers and spreadsheets.
You can create three types of report files: HTML, CSV (delimited flat file), and SSR (a proprietary
Informatica Data Quality format). SSR reports can be viewed as dashboards in the Data Quality Report Viewer.
For more information, see Report Viewer on page 113.
When you use Report Target, you need to use a frequency component, such as Count, before Report Target.
The data fields counted in the Report Target are determined in the frequency component preceding it in the
plan.
Note: The Report Target does not read outputs from the Aggregation component.
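The relationship between a frequency component and a report can be sketched as a value count followed by a simple CSV rendering. The example below uses Python's `collections.Counter` as a stand-in for the Count component. The sample values and the two-column layout are hypothetical and do not represent the actual Report Target output format.

```python
import csv
import io
from collections import Counter

# Hypothetical values of one data field in the plan.
values = ["NY", "NY", "MA", "NY", "CA", "MA"]
counts = Counter(values)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["value", "count"])
for value, count in counts.most_common():
    writer.writerow([value, count])

report = buffer.getvalue()
print(report)
```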
Configuration
The Report Target configuration dialog box contains the following features:
Report File. Identifies the output file for the data target.
Select. Use to browse to the output file for the data target. When you click Select, the Select a Report as a
Target dialog box opens. You can create a new file by typing a name in the File Name field of this dialog. By
default, files of the type specified by the Report Transform options display.
Report Transform. Determine the output file type.
Check the Standard option to enable the file type selection menu. The options are HTML, CSV, and SSR.
The HTML option activates the Include Chart menu, which allows you to add a pie chart, bar chart, or
line chart to the report.
Check the Custom option to write the target output to a customized HTML report template and to
generate graphical reports. Click Select beside the Custom text field to browse to a template file.
Launch Report on Completion. Use to launch the report file automatically when the plan is executed.
CSV Merge Target
The CSV Merge Target merges columns from two sources to a single target file. It can be used in
matching plans that compare a dataset against a reference dataset. The component operates as follows:
The target lists data fields available from the other components in the plan as inputs. Select the input
fields to write as outputs to the target.
The inputs defined as Source 1 are automatically written to the resulting merged target.
The inputs defined as Source 2 constitute reference data. Data values from Source 2 are appended to the
merged target where good matches are found with Source 1 data, as determined by the Match Input Field
and Match Threshold settings.
Note: When more than one positive match is identified, the match with the highest score is appended.
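The merge behavior described above can be sketched as follows: every Source 1 row is written, and the Source 2 values are appended only from the highest-scoring candidate that meets the threshold. The function `merge_best_match`, the shape of the `candidates` mapping, and the sample data are hypothetical. In a real plan the scores come from the matching components.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def merge_best_match(source1_rows: List[Row],
                     candidates: Dict[object, List[Tuple[float, Row]]],
                     threshold: float = 0.9) -> List[Row]:
    """Append the best Source 2 match (score >= threshold) to each Source 1 row."""
    merged = []
    for row in source1_rows:
        out = dict(row)  # Source 1 fields are always written
        good = [c for c in candidates.get(row["id"], []) if c[0] >= threshold]
        if good:
            _, reference = max(good, key=lambda c: c[0])
            out.update(reference)  # append values from the best match only
        merged.append(out)
    return merged


source1 = [{"id": 1, "name": "Smith"}, {"id": 2, "name": "Jones"}]
candidates = {
    1: [(0.92, {"ref_name": "Smyth"}), (0.97, {"ref_name": "Smith J"})],
    2: [(0.50, {"ref_name": "Johns"})],
}
merged = merge_best_match(source1, candidates)
print(merged)
# [{'id': 1, 'name': 'Smith', 'ref_name': 'Smith J'}, {'id': 2, 'name': 'Jones'}]
```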
Configuration
The CSV Merge Target configuration dialog box contains the following features:
Target File. Identifies the output file for the merged data.
Select. Use to browse to the output file for the data target.
When you click Select, the Select a CSV file as a Target dialog box opens. You can create a new file by typing
a name in the File Name field of this dialog.
Inputs. Lists the potential input fields for the target. Input fields can be added to the Source 1 or Source 2
output panes so their data can be considered for inclusion in plan output. Add an input column to either
pane by right-clicking a field name in the Inputs pane and selecting Add to Source 1 List or Add to Source 2
List.
Launch Match File. Use to open the output file automatically when the plan is run.
Match Threshold. Filters the columns in the Source 2 Outputs pane according to their scores in the key
matching field, as defined for the target on the Match Input Field. Records in these columns with match
scores below this value are not included in the merged output. The default value is 0.9.
Match Input Field. Lists the key matching fields defined by the plan components. Use this menu to select
the field on which to base the matching calculation. The Match Threshold applies to this calculation.
Use First Line as Header. Use this option to designate the first line of data in the source file as heading text
and distinguish it from the rest of the dataset.
CSV Separator: Delimiter. Select a field delimiter appropriate to the data from this menu. The default
option is comma (,). If headings for the column source data contain this delimiter, you must use a text
qualifier to preserve the data structure.
CSV Separator: Qualifier. Select a qualifier appropriate to the data from this menu. A text qualifier should
enclose any delimiter value in your data that you do not want to use as a field delimiter. The default is the
double quotation mark (").
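The interaction between the delimiter and the text qualifier can be illustrated with Python's standard csv module. This is a generic sketch of delimited-file behavior, not Data Quality code; the sample row is invented for the example.

```python
import csv
import io

# One value contains a comma, which is also the field delimiter.
rows = [["Name", "Address"], ["John Smith", "2440 Camino Ramon, San Ramon"]]

buffer = io.StringIO()
writer = csv.writer(
    buffer,
    delimiter=",",            # field delimiter
    quotechar='"',            # text qualifier
    quoting=csv.QUOTE_MINIMAL,
    lineterminator="\n",
)
writer.writerows(rows)
text = buffer.getvalue()

# Reading with the same delimiter and qualifier restores the two columns.
parsed = list(csv.reader(io.StringIO(text), delimiter=",", quotechar='"'))
```

The writer wraps only the address in the qualifier, so the embedded comma is not treated as a column boundary when the file is read back.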
CSV Match Target
The CSV Match Target creates a delimited output file containing data generated by a matching plan.
The component can generate two types of output: an HTML match report displaying match clusters
and corresponding match scores, and a CSV file containing data values that meet or exceed the match
threshold score. This match file can be used as input for the consolidation process.
The principal steps in configuring the CSV Match Target are:
Select the data fields whose data matches you want to include in the target output. Include at least one
matching component output field.
Select the match input field to which you want to apply the match threshold. This field and the match
threshold value constitute a filter for the plan output data.
Select the types of output you want the target to generate. The target can generate an HTML report or a
CSV file in one of two formats.
For more information about formatting CSV outputs, see Output Options in the CSV Match Target on
page 163.
The input fields listed in the CSV Match Target configuration dialog box are numbered by appending _1 and
_2 to the field names. When you match data fields from a single source file, _1 and _2 are appended to
the field names. When you match data fields in two data sources, _1 is appended to the fields in
one source and _2 is appended to the fields in the other source.
Configuration
The CSV Match Target configuration dialog box contains the following options:
Target File. Identifies the CSV output file for the data target.
Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File Name field.
Inputs. Lists the data fields that can be included in the target output. Check a field to include it in the plan
output calculations. You must select at least one output from a matching component.
Outputs. Lists the fields selected in the Inputs pane. Use the Up and Down arrows to change the order of
the output fields, that is, the order in which you want them to appear in the plan output.
Use First Line as Header. Check to designate the first line of data in the source file as heading text and so
distinguish it from the dataset.
Launch Viewer. Use to open the output files automatically when the plan executes.
Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma (,).
If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the
data structure.
Qualifier. Select a qualifier appropriate to the data from this menu. A text qualifier should enclose any
delimiter value in your data that you do not want to use as a field delimiter. The default is the double
quotation mark (").
Create HTML Match Report. Use to generate an HTML report displaying the match clusters found by the
plan. This option is checked by default.
Note: An HTML match report can only be generated for plans that use a Group Source or CSV Match
Source. If your plan does not include one of these two sources, an error message appears. If you are running
a CSV Match target plan created in an earlier version of Workbench, check the source configuration to make
sure that the plan continues to run successfully.
Match Output Type (Matched Pairs/Identified Matches). These options determine how the CSV report file
displays the matches found by the plan.
Use the Matched Pairs option to list matching values together in the file output. For example, if the strings
John Smith and John Smyth are identified as a matched pair, both these strings will be written to a single
row along with the match score:
John Smith    John Smyth    0.9
Use the Identified Matches option to append the match cluster ID and the number of records per cluster to
records identified as matches by the plan. For example, in a plan that matches the four input records John
Smith, Bill Brown, Mary Murphy, and John Smyth, the Identified Matches option appends the
following columns to the target file and populates the columns as follows:
Name           Cluster ID    Records Per Cluster
John Smith     1             2
Bill Brown     2             1
Mary Murphy    3             1
John Smyth     1             2
Here, John Smith and John Smyth share a common Cluster ID, indicating that they satisfy the plan's
matching criteria.
Also note the following points about the Identified Matches option:
The Identified Matches option requires inputs from a CSV Match Source or a Group Source. If you add
inputs from other sources to the CSV Match Target and select the Identified Matches option, the plan
registers an error.
Clustering does not group matching records in the output file. The data input order corresponds to the
data output order.
The columns listed in the Outputs pane must be organized by data source, with an equal number of
columns for records from each data source. The match score column must appear after the record
columns. Figure 3-1 illustrates the correct order.
Figure 3-1. CSV Match Target Outputs Pane, Showing Column Order for Identified Matches
If you select the Identified Matches option, match score values do not appear in the file output for this
Target, even if you select a match score in the Outputs pane. This is because Identified Matches causes
data rows to be written one by one, and any given data row can have multiple rows associated with it.
For more information about formatting outputs, see Output Options in the CSV Match Target on
page 163.
Field. Lists the output fields defined by the matching components in the plan. Use this menu to select the
field from which the CSV Match Target reads the match score. The match threshold values set in this dialog
box apply to the match scores achieved in this field.
Thresholds fields (Lower and Upper). Filter the data record values written as plan output according to the
record scores in the match input field (see Field menu above).
Enter a lower and upper limit for the match scores in these fields, between 0 and 1. Data from records whose
scores fall outside this range will not be included in the output. The default values are 0.9 for Lower and 1.0
for Upper. The Lower field is not designed to calculate matches with a value of 1.
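The Identified Matches output and the Lower and Upper thresholds can be sketched as follows. This is an illustrative reading of the description above, not the component's algorithm: the function name identified_matches, the pair format, and the use of a union-find structure are assumptions, and the sketch assumes each record value is unique.

```python
def identified_matches(records, scored_pairs, lower=0.9, upper=1.0):
    """Return (record, cluster_id, records_per_cluster) in input order.

    records      -- list of unique record values
    scored_pairs -- hypothetical list of (left, right, score) tuples
    lower, upper -- score range to accept, mirroring the 0.9 and 1.0 defaults
    """
    parent = {name: name for name in records}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for left, right, score in scored_pairs:
        # Pairs whose score falls outside the range are ignored.
        if lower <= score <= upper:
            parent[find(right)] = find(left)

    cluster_ids, sizes = {}, {}
    for name in records:
        root = find(name)
        # Cluster IDs are numbered in order of first appearance.
        cluster_ids.setdefault(root, len(cluster_ids) + 1)
        sizes[root] = sizes.get(root, 0) + 1

    return [(name, cluster_ids[find(name)], sizes[find(name)]) for name in records]


records = ["John Smith", "Bill Brown", "Mary Murphy", "John Smyth"]
pairs = [("John Smith", "John Smyth", 0.9), ("Bill Brown", "Mary Murphy", 0.2)]
result = identified_matches(records, pairs)
```

For this input the sketch reproduces the example table: John Smith and John Smyth share cluster 1 with two records, and the output order follows the input order.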
Match Key Target
The Match Key Target component is commonly used in consolidation plans. It allows you to append
match plan output data directly to the source database. This eliminates the need to write match data to
a new target table. With the Match Key Target, matching and consolidation information is written
and held in database tables. The outputs of this component are CSV and HTML reports.
Data may be written by the Match Key Target if the following criteria are met in the source table
structure:
The source table contains a column that can be used by the Match Key Target to uniquely identify a record.
This column must be a primary key that is unique, non-null, and an auto-incrementing sequence.
The source table contains a column in which the system stores the match score for each matching record.
This field must be of datatype Float.
The source table contains a column in which the match key is recorded. This key identifies the consolidated
records within a cluster.
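A table that meets these three criteria might look like the following sketch. It uses Python's built-in sqlite3 module as a stand-in database so that the example is self-contained; the table name customer and the column names id, match_key, and match_score are hypothetical, and the column types for your own database may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique, non-null record ID
        name        TEXT NOT NULL,
        match_key   INTEGER,                            -- ID of the cluster's master record
        match_score REAL                                -- floating-point match score
    )
    """
)
conn.executemany(
    "INSERT INTO customer (name) VALUES (?)",
    [("John Smith",), ("Bill Brown",), ("John Smyth",)],
)

# Hypothetical match result: record 3 matches master record 1 with a score of 0.9.
conn.execute(
    "UPDATE customer SET match_key = ?, match_score = ? WHERE id = ?",
    (1, 0.9, 3),
)
conn.commit()

rows = conn.execute(
    "SELECT id, name, match_key, match_score FROM customer ORDER BY id"
).fetchall()
```

The update statement stands in for what the target writes when the plan runs: the match key and match score columns are filled in on the existing row rather than written to a new table.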
Configuration
The configuration options in the Match Key Target configuration dialog box are arranged on three tabs:
Database, Match Details, and Outputs.
Database Tab
The Database Type menu lists a static option, Staging, representing the Data Quality repository. The remaining
fields are disabled.
Click the Connect button to access the database data. This opens the Match Details tab.
Match Details Tab
The options on this tab are arranged in three areas:
Table Details. The Table Details area contains the Table Names menu. This menu lists the database tables
available to the target. Use this menu to select the table to which the target will write the output data.
Column Details. These menu options relate to the table identified under Table Details, whereas the Inputs
menu options list all columns in the database tables available according to the Database tab settings.
The Column Details area contains three fields:
UniqueID. Select the column that contains the unique ID (primary key) of this table.
Match Key. Select the column to record the match key. The match key is the primary key of the master
record in a match cluster.
Match Score. Select the column to store the match score between each record and its master.
If the table does not already have a column created to hold the match key and match score, the table
structure must be altered to generate these fields. The match key and match score are populated when the
matching plan is run.
Inputs. This area contains two fields: Unique ID - Input 1 and Unique ID - Input 2. Select the columns on
which to base the matching operations.
Outputs Tab
The options on this tab let you configure an HTML match report and CSV match file to display the data output
from the target. The match report presents the matches in clusters, and the match file presents a single row for
each matched pair.
The creation of a report or file is optional. Also, fields selected under Match Table Column Selection and
Ordering appear in the match report and match file.
The Outputs tab displays the following areas:
Match Report. This area contains the following options.
Create Report. Check to create a match report when the plan is executed.
Select. Click to browse to the report file. When you click Select, the Select a HTML file for the Report
dialog box opens. You can create a new file by typing a name in the File name field.
Launch Viewer. Enabled when Create Report is checked. When selected, the report opens
automatically when the plan runs.
Clusters Per Page. Determines how many match clusters appear on each page in the report.
Match Table Column Selection and Ordering. This area shows two panes. The left pane lists the columns
available on the table selected on the Match Details tab. The right pane lists the columns to appear in the
report or match file. To add a column to the right pane, click its check box in the left pane.
Match Input. The match report presents each match cluster along with the selected input fields from related
match sources and the field selected from the Match Input menu. The Match Input selection and the
primary key of the source data appear as default fields on this report.
The Match Input menu lists the key fields defined by the matching components in the plan. The field you
select, in conjunction with its match threshold score, determines the records to be included in the target
output.
Likewise, the range of values you set in the Match Threshold fields is applied to the Match Input key field.
Matching records whose scores fall outside this range are not included in the output. You can set lower
and upper values between 0 and 1. The default values are 0.75 and 1.0.
Match File. Like the match report, the match file contains records that contain matches within the match
threshold for the field selected from the Match Input menu. The file contains the columns selected in the
Match Table Column Selection and Ordering area. Match File has the following options:
Create File. Check to create a match file when the plan is executed.
Select. Click to browse to the match file. When you click Select, the Select a CSV File as a Target dialog
box opens. You can create a new file by typing a name in the File name field of this dialog.
Launch Viewer. Enabled when the Create File box is checked. When selected, the file opens automatically
when the plan runs.
Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma
(,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve
the data structure.
Qualifier. Select a qualifier appropriate to the data from this menu. A text qualifier should enclose any
delimiter value in your data that you do not want to use as a field delimiter. The default option is the
double quotation mark (").
Note: It is good practice to run a plan populating an audit trail table with the unique IDs of each matching
record for every match created. When the data is consolidated, duplicate records are removed from the source
table.
Group Target
The Group Target component creates groups: a series of files in a Data Quality-proprietary format that
organize plan data according to key data fields that you configure.
Grouping involves assigning records to groups based on similar or identical values in one or more fields and
performing matching operations on the records assigned to each group.
Group Target output files can be used by a Group Source or Dual Group Source to organize the data inputs to
a matching plan.
Grouping large datasets is a useful precursor to running a matching plan. Performing matching operations
on grouped data improves performance with minimal loss of matching accuracy.
Grouped data is stored in local directories as a set of delimited files with the extension SSG. Set up groups by
defining one or more group key fields for the dataset. All records with a common value in the defined key fields
are written to a single group file.
Note: Group files are organized separately from the original dataset and do not modify the original dataset in any
way. A large number of SSG files can be created in the group directories, depending on the number of records
with common data in the key fields.
Configuration
The Group Target configuration dialog box contains the following options:
Directory. The location and name of the directory in which the groups are created. This field is not editable.
Select. Click to open the Select the Group Directory dialog box and browse to the required directory. To
select a directory, highlight it in the main window and click Select. Select a directory, not a file.
Outputs. This pane lists the columns available in the dataset. Check the column name to include its data in
the plan output. The columns you select are added to the Grouping Fields pane.
Tip: Right-click in this pane to display a Select All option.
Grouping Fields. Select a group key. The group files created in the group directory are based on the key you
select.
Maximum Group Size. The maximum number of records assigned to a group file. If the Group Target
reaches this limit when writing to a group file, it creates another file for the group. The default value is zero,
which means no limit.
Note: Matching operations are performed within group files. This is standard behavior for matching
operations on grouped records. Although a reduction in group size can lead to faster processing times, it can
also impact the accuracy of match results.
Maximum Files Per Group. The maximum number of group files written to a given folder on disk. The
default value is 5000. When this number is exceeded, the Group Target creates one or more sub-folders to
house the remaining files. If this value is set to zero, no limit is imposed and files are written to a single
folder.
Ignore Empty Group Field Values. Use to avoid the creation of a group based on records with null values in
a group key field.
Note: The group files you create are overwritten if you run a plan again without changing the target
configuration details. To preserve a set of group files, select a new group directory before you run the plan
again.
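The grouping options above can be sketched in Python as follows. This is a conceptual illustration only: it builds in-memory lists rather than SSG files, and the function name build_groups and its parameters are hypothetical stand-ins for the Grouping Fields, Maximum Group Size, and Ignore Empty Group Field Values settings.

```python
from collections import defaultdict


def build_groups(records, key_field, max_group_size=0, ignore_empty=True):
    """Return a list of (key, members) tuples, one per simulated group file."""
    by_key = defaultdict(list)
    for record in records:
        key = record.get(key_field)
        # Optionally skip records with no value in the group key field.
        if ignore_empty and (key is None or key == ""):
            continue
        by_key[key].append(record)

    group_files = []
    for key, members in by_key.items():
        if max_group_size <= 0:
            # Zero means no limit: one file per key value.
            group_files.append((key, members))
            continue
        # Otherwise start a new file each time the size limit is reached.
        for start in range(0, len(members), max_group_size):
            group_files.append((key, members[start:start + max_group_size]))
    return group_files


records = [
    {"name": "A", "zip": "94583"},
    {"name": "B", "zip": "94583"},
    {"name": "C", "zip": "94520"},
    {"name": "D", "zip": "94583"},
    {"name": "E", "zip": ""},
]
limited = build_groups(records, "zip", max_group_size=2)
unlimited = build_groups(records, "zip")
```

With a maximum group size of 2, the three records that share the key 94583 are split across two simulated files, which is why a smaller group size can separate records that would otherwise be compared with each other.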
Database Target
The Database Target (or DB Target) component allows you to write plan output to a database. Data
produced by the plan can update selected tables in the database or can be inserted in new or existing
tables.
In addition to its own repository, Data Quality connects to Oracle, IBM DB2, and Microsoft SQL Server
databases and also supports ODBC connections. A single plan can write to multiple databases using multiple
Database Targets.
The Database Target can write the data records processed by the plan to the database, or it can write data from
the Aggregation component detailing the frequency of occurrence of data values.
Configuration
The Database Target configuration dialog box contains four tabs:
Connect To Database
Before
During
After
The connection is defined on the Connect To Database tab.
Connect To Database Tab
This tab contains three areas: Database Information, Login Information, and Target Format.
You must identify the target database in the Database Information fields.
Database Type. This menu provides five options: Staging (the local repository), IBM DB2, Oracle,
Microsoft SQL Server, and ODBC (as a connection to an ODBC-compliant database).
Note: When you select Oracle, you are prompted for an Oracle database system identifier. If you select another
database type, you are prompted for a data source name.
DSN. Data Source Name. Identifies the database on the network. This is required for all database
connection types except Oracle.
SID. System Identifier. Identifies the instance of the Oracle database.
Encoding. Lists the available character encodings that can be applied to the data output. Data Quality
handles all data read over an ODBC connection as Unicode, regardless of the selection in this field. For more
information, see Character Encodings and Unicode on page 159.
Login Information. Contains username and password text fields. You must provide your login when access
permissions have been applied to the database.
Connect. Click to establish the connection.
You must also set the target format.
Select Normal Mode to write the plan data to the database.
Select Aggregation Mode to write data summarizing the frequency of occurrence of data values, as tabulated
by the Aggregation component, to the database. When you select this option, select the Aggregation
component from which the target will read the data.
Note: When you select Normal Mode, the outputs from all components except the Aggregation component
are available to the target. When you select Aggregation Mode, only the outputs from the Aggregation
component are available.
Before Tab
The Before tab contains a Database pane and a SQL Script pane. This tab is typically used in the Database Target
to create new tables in the selected database. You can also create Pre-INSERT and Pre-UPDATE statements.
Click Execute to implement the SQL script. Click Execute before proceeding to the During tab.
Check the Stop On Error check box to stop the script operation and open a message box if the execution
encounters invalid SQL in the script.
During Tab
The During tab enables you to browse the database tables and filter the columns that will constitute the data
written to the database. Use this tab to create INSERT and UPDATE statements. You can also apply conditions
to tables and join columns from multiple tables. The During tab includes five columns: Database, Insert,
Update, Where, and Text.
Figure 3-2 displays the Database Target During tab:
Figure 3-2. Database Target, During Tab
Note:
As on the Before tab, the Database column displays the database structure as a hierarchy of tables and
columns.
To write to a column in a database table, select the required Data Quality output from the corresponding list
in the Insert or Update column.
Use Stop On Error to stop the script operation and open a message box if the execution encounters
invalid SQL in the script.
Use Roll Back on Error to commit data to the database at the end of the batch operation. If this box is cleared,
data is committed to the database at the end of each transaction.
Use Expert Mode to view and edit the underlying SQL query. Expert Mode is typically used to create more
advanced statements.
Any changes made in Expert Mode are lost if you clear this box and return to standard mode.
Click the Condition option to create a condition-based filter, in the form of an IF statement, for the data
processed by the target. Use the filter to limit the records written to the output file.
In Aggregation Mode, only outputs from the Aggregation component are available. You can use Expert Mode to
perform additional calculations on aggregates.
If you are using the DB Target to write to a Microsoft SQL Server database, bear the following items in
mind:
When composing a Where query on the During tab with Text containing Unicode data, the text must be
preceded with the letter N, for example N'unicode data'.
When using INSERT statements on the Before or After tab, all columns that need to be populated with
Unicode data must be preceded with the letter N.
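The difference between committing at the end of the batch and committing after each transaction, as described for Roll Back on Error, can be illustrated with a short sketch. It uses Python's built-in sqlite3 module as a stand-in database; the person table, the write_rows function, and its roll_back_on_error parameter are hypothetical and only mimic the option's described effect.

```python
import sqlite3


def write_rows(conn, rows, roll_back_on_error=True):
    """Insert rows; return True on success and False if any insert fails."""
    try:
        for row in rows:
            conn.execute("INSERT INTO person (id, surname) VALUES (?, ?)", row)
            if not roll_back_on_error:
                conn.commit()  # commit each row as its own transaction
        conn.commit()          # batch mode: a single commit at the end
    except sqlite3.Error:
        conn.rollback()        # discards only work that is not yet committed
        return False
    return True


def new_database():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, surname TEXT NOT NULL)")
    conn.commit()
    return conn


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]


# The third row repeats a primary key, so its insert fails.
batch = [(1, "Smith"), (2, "Jones"), (1, "Yeung")]

batch_conn = new_database()
batch_ok = write_rows(batch_conn, batch, roll_back_on_error=True)

per_row_conn = new_database()
per_row_ok = write_rows(per_row_conn, batch, roll_back_on_error=False)
```

In batch mode the failure leaves the table empty, whereas with a commit after every row the two rows written before the failure remain in the table.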
After Tab
Use the After tab options to write post-insert or update SQL statements for a table. Use this tab to configure
primary keys and indexes for tables.
The After tab completes the process of defining the target output. The Before tab runs SQL scripts on the data
prior to its configuration. The After tab runs SQL scripts on the configured dataset. Its Database and SQL
Script panes are identical to those of the Before tab. You can browse configured tables and columns in the
database and write the SQL script to run on selected data.
For more information about SQL scripts, see SQL Scripts on page 155.
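The division of work between the three script tabs can be sketched as a before, during, and after sequence. The example below uses Python's built-in sqlite3 module as a stand-in database; the table name customer_out, the index name, and the run_target function are hypothetical, and the SQL dialect of your target database may differ.

```python
import sqlite3

# Before: create the table that will receive the plan output.
BEFORE_SQL = "CREATE TABLE IF NOT EXISTS customer_out (id INTEGER, surname TEXT);"
# During: write one row per record produced by the plan.
DURING_SQL = "INSERT INTO customer_out (id, surname) VALUES (?, ?)"
# After: add an index once the data is loaded.
AFTER_SQL = "CREATE UNIQUE INDEX IF NOT EXISTS idx_customer_out_id ON customer_out (id);"


def run_target(conn, plan_output):
    conn.executescript(BEFORE_SQL)
    conn.executemany(DURING_SQL, plan_output)
    conn.commit()
    conn.executescript(AFTER_SQL)


conn = sqlite3.connect(":memory:")
run_target(conn, [(1, "Smith"), (2, "Jones")])

index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
loaded = conn.execute("SELECT COUNT(*) FROM customer_out").fetchone()[0]
```

Creating the index after the load, rather than before it, follows the suggestion above that the After tab is the place to configure keys and indexes.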
Database Report Target
The Database Report Target component generates report data for a plan and inserts this data into the
Data Quality repository. Like the Report Target, the Database Report Target accepts input from frequency
components.
The Database Report Target also makes Data Quality report data accessible to external applications through an
ODBC connection. You can analyze and present the results of a data quality plan through a range of analytical
software tools, including Microsoft Excel and Crystal Reports.
Note: Unlike the Report Target component, the Database Report Target does not produce a formatted report on
the data. Instead, it writes report data to local Data Quality MySQL database tables. The tables can then be
made available to other applications through ODBC.
The MySQL database tables that store the Data Quality report data are located in the Data Quality repository
and are named repository.t_athanor_report (master record) and repository.t_athanor_report_detail (detail record).
Configuration
The Database Report Target configuration dialog box contains the following:
Connection Details Area. Because the Database Report Target always writes data to the Data Quality
repository, the connection options shown in this area are static.
Parameters Area. This area contains the following fields:
Report Name. Enter a report name. The report data is saved in the repository under this name.
Maintain Reports. When this box is checked, a new record containing the report data is inserted in the
MySQL database tables each time the plan executes. Each instance of the report is identified on the
MySQL table by a unique report ID and timestamp. When this box is cleared, the record containing the
report data is updated with the latest report data each time the plan is executed.
Technical Requirements
A MySQL ODBC Driver is required when importing data from the MySQL database to an external
application. This is available to download from http://www.mysql.com.
Maintenance
To ensure reasonable table size, it might be necessary to remove historical data from the database tables that
store report data. When deleting a record from these tables, ensure that the record in question is deleted from
both the master and detail tables to avoid creating orphaned records.
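One way to avoid orphaned records is to delete from both tables inside a single transaction. The sketch below shows the idea using Python's built-in sqlite3 module as a stand-in for the MySQL repository. The table names come from this guide, but the column names (report_id, report_name, item, item_count) are assumptions made for illustration; check the actual table definitions before running anything similar.

```python
import sqlite3


def delete_report(conn, report_id):
    """Delete one report from the detail and master tables together."""
    with conn:  # commits if both statements succeed, rolls back otherwise
        conn.execute(
            "DELETE FROM t_athanor_report_detail WHERE report_id = ?", (report_id,)
        )
        conn.execute(
            "DELETE FROM t_athanor_report WHERE report_id = ?", (report_id,)
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t_athanor_report (report_id INTEGER PRIMARY KEY, report_name TEXT);
    CREATE TABLE t_athanor_report_detail (report_id INTEGER, item TEXT, item_count INTEGER);
    INSERT INTO t_athanor_report VALUES (1, 'old run'), (2, 'latest run');
    INSERT INTO t_athanor_report_detail VALUES (1, 'San Ramon', 8), (2, 'San Ramon', 9);
    """
)

delete_report(conn, 1)

# Count detail rows whose master record no longer exists.
orphans = conn.execute(
    """
    SELECT COUNT(*)
    FROM t_athanor_report_detail AS d
    LEFT JOIN t_athanor_report AS m ON m.report_id = d.report_id
    WHERE m.report_id IS NULL
    """
).fetchone()[0]
remaining = conn.execute("SELECT COUNT(*) FROM t_athanor_report").fetchone()[0]
```

The final query is a simple check that can be run after any cleanup to confirm that no detail rows were left behind.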
SAP Target
Note: This component is no longer installed with Data Quality. You can run plans created in earlier
releases that contain this component, and you can export such plans as mapplets for use in
PowerCenter.
The SAP Target allows you to write plan output to an SAP database. This component complements the SAP
Source component, which allows you to obtain data from the SAP database for use as source data in a plan.
There are three basic steps to configuring the target to write data to the SAP database:
1. Define a connection between Data Quality and the target SAP system.
2. Browse the list of BAPI functions on the SAP system and select the function associated with the data.
3. Configure one or more parameters on the function to be populated with data from the Data Quality plan.
Perform these steps using options on the SAP Target configuration dialog box.
Configuration
The configuration dialog box for the SAP Target displays its options on two tabs:
Connection. Use the Connection tab options to establish the connection to the SAP system.
SAP System. When connected, use the SAP System tab options to locate the appropriate BAPI and link its
parameters to the output columns in your plan.
Connection Tab
The Connection tab contains the following options:
Host. The name or IP address of the SAP host computer.
Client Number. Identifies the SAP client that you are authorized to use. An SAP system can have multiple
clients, each of which is identifiable by the three-digit client number.
System Number. SAP allows multiple application server instances to run against a database. The system
number is a two-digit number that identifies the application server to which you want to connect.
Encoding. This menu lists the available character encodings that can be applied to the data read by the
target. Data Quality handles all data read over an ODBC connection as Unicode, regardless of the selection
in this field. For more information, see Character Encodings and Unicode on page 159.
Username and Password. These fields identify you to the SAP system.
Clicking Connect opens the SAP System tab.
SAP System Tab
This tab is divided into two panes. The left pane lists the SAP application areas and functions available on the
connected system, and the right pane lists the parameters defined on the highlighted function.
You can explore the application area pane as an alphabetical list or as a hierarchy that groups areas together
according to user-defined criteria. The areas can be expanded to reveal the business objects defined for each area
and the functions configured for each business object. Application areas are read from the SAP system.
The icons associated with each level in the left pane are color-coded: application area icons are yellow, business
object icons are green, and function icons are red.
Explore the available objects and select the function you want to use to write to the SAP database. Then,
configure one or more of the function parameters to receive data from one or more plan output columns.
As demonstrated for the SAP Source configuration dialog, there are three parameter types:
Scalar. A single name-value pair, such as the name Town with the value Chicago.
Structure. A group of one or more scalar parameters, like a multi-line address group. A structure may have
multiple rows but has a single column of values.
Table. Contains one or more rows of data described by one or more columns.
Note: The SAP Target treats each field in a parameter as a scalar parameter, regardless of whether it is a single-
field scalar parameter or a multi-field table.
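The three parameter types, and the way each field is ultimately handled as a scalar, can be pictured with a small Python sketch. The data shapes and the path notation used here (for example Address.City) are purely illustrative assumptions and are not SAP or Data Quality syntax.

```python
def flatten_parameter(name, value):
    """Return a list of (field_path, value) pairs, one per scalar field."""
    if isinstance(value, dict):
        # Structure: a group of named scalar fields.
        return [(f"{name}.{field}", v) for field, v in value.items()]
    if isinstance(value, list):
        # Table: one or more rows, each described by one or more columns.
        return [
            (f"{name}[{i}].{field}", v)
            for i, row in enumerate(value)
            for field, v in row.items()
        ]
    # Scalar: a single name-value pair.
    return [(name, value)]


scalar = flatten_parameter("Town", "Chicago")
structure = flatten_parameter("Address", {"Street": "1200 Concord Ave", "City": "Concord"})
table = flatten_parameter("Contacts", [{"Name": "John Smith"}, {"Name": "Bill Brown"}])
```

Whatever the parameter type, the result is a flat list of individual fields, each of which could be linked to one plan output column.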
To configure a parameter:
1. Examine the parameter and identify the fields to which you want to add data.
2. Double-click the Value field of the parameter:
If you select a scalar parameter, this opens the Edit Scalar Parameter dialog box.
If you select a structure or table parameter, this opens the Edit Structure Parameter or Edit Table Parameter
dialog box in which constituent scalar values can be configured. Double-clicking a value in these dialogs
opens the Edit Scalar Parameter dialog box.
3. In the Edit Scalar Parameter dialog box, click the Down arrow by the Value field to see a list of available
output columns.
You can also enter a column name.
4. Select a column, and click OK.
5. Repeat these steps for all required parameters.
Realtime Target
The Realtime Target enables you to develop plans to process output data in real time and deliver data
to another application. With this component, you can define a set of columns that determine the data
sources for a plan executed by the Data Quality engine in a real-time environment.
You can develop, run, and test the plan using the Workbench user interface.
When the Data Quality engine executes a real-time plan, the records passed to the application contain all fields
selected as outputs from the Realtime Target. When configuring the Realtime Target, select only the data fields that
your application needs.
Configuration
The Realtime Target configuration dialog box displays a single pane that lists all available data fields. Select the
required fields individually, or right-click within the selection pane to Select All.
CHAPTER 4
Frequency Components
This chapter includes the following topics:
Overview
Count
Sum
Aggregation
MinAvgMax
Range Counter
Missing Values
Overview
Data Quality provides five components that determine the frequencies of values within selected data fields.
These components allow you to determine the frequencies of all values, specific values, and defined ranges of
values within data fields.
Frequency Analyzer components are essential in plans that use the Report Target or Database Report Target to
create plan output. Report Target and Database Report Target can only accept inputs from frequency
components.
Data Quality provides the following frequency components:
Count
Aggregation
MinAvgMax
Range Counter
Missing Values
Count
The Count component determines the number of unique values in a column and calculates the
frequency of occurrence of each value. Count is a frequency component and therefore can provide data
input to the Report Target and Database Report Target.
For example, consider the addresses listed in Table 4-1:
Table 4-1. Count Component: Sample Address List
Address1                       Address2      Address3      State  Zip
2440 Camino Ramon              San Ramon     Contra Costa  CA     94583-4296
2306 Shoreline Loop # 132      San Ramon     Contra Costa  CA     94583
2050 Shoreline Loop            San Ramon     Contra Costa  CA     94583-5502
1200 Concord Ave               Concord       Contra Costa  CA     94520-4915
1350 Montego                   Walnut Creek  Contra Costa  CA     94598-2822
1200 Montego                   Walnut Creek  Contra Costa  CA     94598-2820
108 Summerwood Pl              Concord       Contra Costa  CA     94518-2718
305 Reflections Cir Apt 27     San Ramon     Contra Costa  CA     94583-5204
101 Ygnacio Valley Rd Ste 300  Walnut Creek  Contra Costa  CA     94596-4061
2245 Via De Mercados           Concord       Contra Costa  CA     94520-4919
2000 Crow Canyon Pl Ste 206    San Ramon     Contra Costa  CA     94583-4633
2000 Crow Canyon Pl Ste 420    San Ramon     Contra Costa  CA     94583-1367
2000 Crow Canyon Pl Ste 260    San Ramon     Contra Costa  CA     94583-1384
2400 Camino Ramon Ste 100      San Ramon     Contra Costa  CA     94583-4287
Applying Count to the Address2 column results in the following data:
San Ramon     8
Concord       3
Walnut Creek  3
When the Count component output is read by a Report Target, and the plan output viewed in the Report
Viewer, you can drill-down on any item heading to view underlying data values.
Configuration
The Count configuration dialog box displays its settings on two tabs:
Inputs
Parameters
Inputs Tab
The Inputs tab lists the data columns available to the Count component from other components in the plan.
Select a column to add it to the Report Target.
Parameters Tab
The Parameters tab allows you to select and filter the data values that are counted by the component and passed
to the Report Target. It also lets you edit the output names for each counted column. The tab lists the columns
selected on the Inputs tab. For each column, three fields are displayed: Min Count, Max Cases, and Output
Name.
Min Count. Specifies the minimum number of times a value must occur in a column before being listed in
the report output. For example, if a SURNAME column is selected on the Inputs tab, and the Min Count
value for SURNAME is 5, then a given surname must appear at least five times in the column to appear on
the list of surnames in the generated report. If the surname appears fewer than five times, its occurrences are
added to the Filtered total on the report.
Max Cases. The Max Cases field specifies a stopping point for the count operation by setting an upper limit
on the number of different values the component lists in the report. When this limit is reached, the number
of uncounted records is included in the Others column of the report.
Output Name. The name of each column sent to the target component. You can edit the name in each field.
Example
The following data sample contains eight different surnames in eleven records. A Min Count value of 2 returns
all surnames that occur more than once, Smith and Jones. A Max Cases of 7 continues counting until finding
seven different names, so the eighth name, Yeung, is added to the Others figure on the report.
    SURNAME
 1  Smith
 2  Jones
 3  Adams
 4  Jones
 5  Smith
 6  Brady
 7  Baldwin
 8  Smith
 9  Chase
10  Powell
11  Yeung
The Max Cases setting takes precedence over the Min Count setting. Max Cases determines the number of data
buckets available in the output. The Max Cases limit can be reached without identifying all the values that
meet or exceed the Min Count setting. For this reason, note the percentage of values represented by the Others
total.
For example, with the same settings but data ordered differently, as shown below, the most common name
would not be listed on the report:
    SURNAME
 1  Powell
 2  Jones
 3  Adams
 4  Jones
 5  Chase
 6  Brady
 7  Baldwin
 8  Yeung
 9  Smith
10  Smith
11  Smith
In this case, the Max Cases setting of 7 does not reach the eighth surname, Smith, which in fact is the most
common name in the dataset.
The Parameters options allow you to tune the performance of the plan in a number of ways.
For example, you require the fifty most common surnames in a dataset of one million records. Assuming the
surnames are spread randomly throughout the dataset, applying a Max Cases figure in excess of fifty should
return the most common surnames without counting all rows.
There is no limit to the number that can be applied for Max Cases. However, when the total number of
different counts is greater than 20,000, plan performance may slow. When the number of counts is below
20,000, all values being counted are held in memory. If the number exceeds 20,000, all counts above this
number are held in the database as the count operations are carried out.
The following examples demonstrate how the two parameters can be used:
To check for non-unique values in a field that should contain only unique values. Set the Min Count value
to 2. The report identifies all non-unique values, those that occur more than once.
The Max Cases field should be set to the number of records in the dataset. This ensures that sufficient
counts are performed so that even if the last two rows in the table are the only two with duplicate values,
they are identified.
To count the frequency of values in a column where a finite number of different values are possible. In this
case, set Min Count to 1 and Max Cases to any value greater than the maximum number of possible values.
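The interaction between Min Count and Max Cases described in this section can be sketched in Python. This is an illustrative model only, not the Data Quality implementation: the function name count_values and its return shape are hypothetical, and the sketch assumes that values are bucketed in the order they are encountered and that the Min Count filter is applied after all rows are read.

```python
def count_values(values, min_count=1, max_cases=None):
    """Tally values in encounter order, modelling Min Count and Max Cases.

    Returns (listed, filtered, others):
      listed   - values whose count meets min_count
      filtered - occurrences of bucketed values that fall below min_count
      others   - records whose value arrived after max_cases buckets were full
    """
    buckets = {}   # value -> occurrences, in first-seen order
    others = 0
    for value in values:
        if value in buckets:
            buckets[value] += 1
        elif max_cases is None or len(buckets) < max_cases:
            buckets[value] = 1
        else:
            others += 1   # no bucket left for a new distinct value
    listed = {v: n for v, n in buckets.items() if n >= min_count}
    filtered = sum(n for n in buckets.values() if n < min_count)
    return listed, filtered, others


# The two surname samples from the Example section.
sample_a = ["Smith", "Jones", "Adams", "Jones", "Smith", "Brady",
            "Baldwin", "Smith", "Chase", "Powell", "Yeung"]
sample_b = ["Powell", "Jones", "Adams", "Jones", "Chase", "Brady",
            "Baldwin", "Yeung", "Smith", "Smith", "Smith"]

print(count_values(sample_a, min_count=2, max_cases=7))
print(count_values(sample_b, min_count=2, max_cases=7))
```

With the second ordering, all three Smith records fall into the Others total, which is why the Others percentage is worth checking before trusting the listed values.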
Sum
The Sum component calculates sums for the numeric values in each selected column. This component
classifies numeric values as positive, negative, invalid, or filtered, and provides count and sum totals
for each of these classes.
Use outputs from the Sum component as inputs for the Report Target and DB Report Target.
Note: The Sum component processes positive and negative numbers, for example 10 and -10. Do not prefix a
positive number with a + symbol. The Sum component will treat numbers entered in other formats (for
example, (10) or 10) as invalid values.
Configuration
The Sum configuration dialog box contains the following:
Inputs tab
Parameters tab
Inputs Tab
The Inputs tab lists the data columns available to the component from other components in the plan. Check
the column name to assign it as an input.
Parameters Tab
Use the options on the Parameters tab to set a minimum value for inclusion in the Positive category for each
input column.
Positive numeric values that are less than or equal to the Min value for a column are classified as filtered. The
default Min value is 0.
You can also use the Parameters tab to rename the column outputs for the Sum component.
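The classification rules for the Sum component can be modelled with a short Python sketch. It is a rough approximation under stated assumptions, not the product's logic: the function name sum_column is hypothetical, the accepted number format (optional leading minus, digits, optional decimal part) is inferred from the note above, and zero is assumed to fall into the filtered class under the default Min value of 0.

```python
import re
from decimal import Decimal

# Assumed accepted format: optional leading minus, digits, optional decimal part.
VALID_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def sum_column(values, min_value=Decimal("0")):
    """Return {class: [count, sum]} for positive, negative, filtered, and invalid values."""
    totals = {name: [0, Decimal("0")]
              for name in ("positive", "negative", "filtered", "invalid")}
    for raw in values:
        text = raw.strip()
        if not VALID_NUMBER.fullmatch(text):
            totals["invalid"][0] += 1   # counted, but there is nothing to add to a sum
            continue
        number = Decimal(text)
        if number < 0:
            name = "negative"
        elif number <= min_value:
            name = "filtered"           # assumption: zero is handled like a small positive value
        else:
            name = "positive"
        totals[name][0] += 1
        totals[name][1] += number
    return totals


result = sum_column(["10", "-10", "+5", "(10)", "0", "2.5"])
for name, (count, total) in result.items():
    print(name, count, total)
```

Raising min_value moves small positive values from the positive class into the filtered class without affecting the negative or invalid totals.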
Aggregation
The Aggregation component provides a number of methods to calculate the frequency of occurrence of
data values both in a single column and across multiple columns. It can create detailed metrics that
demonstrate value frequencies across a dataset without writing the data in a temporary staging area or
using SQL.
The capabilities of the Aggregation component include the following:
It tabulates the quantities of records that contain common values in a selected field. The Count component
also performs this operation.
It can tabulate the quantities of records that share a set of common values across multiple fields.
It can calculate a sum of the numerical values in a given column.
It can apply conditional rules to the data in selected columns so that additional counts are performed for
values that satisfy the conditions. Sum calculations do not use conditions.
The Aggregation component delivers outputs directly to a Database Target. Its outputs are not compatible with
other components.
Note: Set the Database Target to Aggregation Mode to enable it to read the Aggregation outputs.
Configuration
The Aggregation configuration dialog box displays its settings on three tabs:
Inputs
Parameters
Outputs
Inputs Tab
The Inputs tab lists the data columns available to the component from other components in the plan. Select one
or more columns for configuration on the Parameters tab.
Note: When you select one or more columns on this tab, the Aggregation component performs an aggregate count operation
on all data from these columns. This output appears as the Count field on the Outputs tab. You do not need to
configure other parameters to create this output, and you cannot clear this output in the Aggregation
component.
Parameters Tab
The Parameters tab allows you to select and filter the data values that are counted by the component and passed
to the Database Target. The tab contains an upper area that lists the columns selected on the Inputs tab and a
lower area that lets you define conditions to apply to the inputs.
Beside the input names in the upper area are two columns: Group and Sum.
Check the Group option for one or more input columns to generate totals for each pattern of values that
occurs across those columns. See Calculating in Groups.
Check the Sum option for one or more input columns to calculate a total for the numerical values in those
columns. See Calculating Sums.
The Parameters tab also contains a Conditional Counts area. This allows you to filter the data to which a count
calculation is applied.
Define a conditional count by selecting an input field and operators from the Conditional Count area and
clicking Add. To delete a condition, select it in the lower area and click Delete.
You can define conditional counts for individual columns, and you can add multiple conditional counts on
this tab.
Calculating in Groups
Table 4-2 provides sample bank account data that illustrates how group calculations work.
Table 4-2. Sample Input Data for Aggregation Component
NAME              CITY      STATE  BALANCE
John Smith        Brooklyn  NY     36541.64
Mary Jones        Brooklyn  NY     6345.87
Estelle Franklin  Brooklyn  NY     354.12
Brian Franklin    New York  NY     -650.01
Tina Brooks       New York  NY     3515.21
Charles Cowell    New York  NY     216.87
Marian Hodges     New York  NY     32.81
Kate Lee          Albany    NY     354.21
Albert Chung      Albany    NY     15498.32
Gillian Ross      Buffalo   NY     244.66
Figure 4-1 illustrates a sample configuration for the Aggregation component based on this data:
Figure 4-1. Aggregation Component Dialog Box, Parameters Tab
In Figure 4-1, the Group options for CITY and STATE are checked. Thus the component will aggregate data
patterns across both columns and send the following totals to a Database Target:
Brooklyn  NY  3
New York  NY  4
Albany    NY  2
Buffalo   NY  1
Calculating Sums
In Figure 4-1, the Sum option is checked for the BALANCE column. Thus the component will calculate the
sum of all values in this column, which is $62,453.70.
Sum calculations ignore all non-numeric data.
Conditional Counts
The Conditional Counts area lets you define a condition with Argument, Operator, and Value variables. A
condition acts as a filter for count calculations in the selected column.
Argument. The input column whose data will be filtered.
Operator. A mathematical operator applied to the argument data.
Value. The filter value.
Figure 4-1 contains a condition that will count the quantity of negative values in the BALANCE column, which
equates to the quantity of overdrawn accounts. You cannot define conditions for Sum calculations.
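The group, sum, and conditional count calculations described above can be reproduced on the Table 4-2 data with a few lines of Python. This is a sketch for checking the figures, not a description of how the Aggregation component works internally; the variable names are illustrative.

```python
from collections import Counter
from decimal import Decimal

# Rows from Table 4-2: (NAME, CITY, STATE, BALANCE).
ROWS = [
    ("John Smith", "Brooklyn", "NY", "36541.64"),
    ("Mary Jones", "Brooklyn", "NY", "6345.87"),
    ("Estelle Franklin", "Brooklyn", "NY", "354.12"),
    ("Brian Franklin", "New York", "NY", "-650.01"),
    ("Tina Brooks", "New York", "NY", "3515.21"),
    ("Charles Cowell", "New York", "NY", "216.87"),
    ("Marian Hodges", "New York", "NY", "32.81"),
    ("Kate Lee", "Albany", "NY", "354.21"),
    ("Albert Chung", "Albany", "NY", "15498.32"),
    ("Gillian Ross", "Buffalo", "NY", "244.66"),
]

# Group: one total per distinct (CITY, STATE) pattern.
group_counts = Counter((city, state) for _name, city, state, _bal in ROWS)

# Sum: total of the BALANCE column, using Decimal to avoid float rounding.
balance_sum = sum((Decimal(bal) for _name, _city, _state, bal in ROWS), Decimal("0"))

# Conditional count: Argument BALANCE, Operator <, Value 0.
overdrawn = sum(1 for _name, _city, _state, bal in ROWS if Decimal(bal) < 0)

print(dict(group_counts))
print(balance_sum)
print(overdrawn)
```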
Outputs Tab
This tab lists the outputs that are written to the Database Target. You can edit the output names.
Figure 4-2 shows the outputs for the Parameters set in the previous example.
Figure 4-2. Aggregation Component, Outputs Tab
CITY and STATE. The quantities of common values in these fields are calculated in group fashion. Group
calculations are not prefixed.
Count. This output is created when a column is selected on the Inputs tab. It sends a count of all value
quantities in all columns selected on the Inputs tab to the Database Target.
(Sum)BALANCE. All numbers in the BALANCE column are added together and the sum is sent to the
Database Target.
(Where)BALANCE<0. The quantity of negative balances is sent to the Database Target.
MinAvgMax
This component returns the minimum, maximum, and average data values for selected columns.
The MinAvgMax component recognizes only data in the Float datatype that originates as output from the Rule
Based Analyzer.
Configuration
The MinAvgMax configuration dialog box displays an Inputs tab with a single pane beneath listing the columns
you can use. Only numeric fields appear in the Inputs tab.
The calculations for the selected columns are sent to the Report Target.
Range Counter
The Range Counter calculates the frequency and distribution of numerical data in selected fields. It
does so by counting the numbers of values between user-defined intervals in the data.
To configure the Range Counter, select a data column and an interval, or a series of custom intervals,
to apply to the data. You can define multiple such instances within the component.
Configuration
The Range Counter configuration dialog box contains the following:
Components pane
Inputs tab
Parameters tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance by working with the options on the Inputs
and Parameters tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data columns available to the component from the other components in the plan.
Check the column name to assign it to the highlighted instance in the Components pane.
Parameters Tab
The options on the Parameters tab determine how the range of data is represented in the report. The parameters
divide the data into meaningful subsets. While the Count component counts the overall number of data values
in a given column, the Range Counter divides the column data into subsets and counts the data values in each
subset.
The parameters are organized in two areas, Select Range Type and Select Intervals. The Select Range Type area
provides two options:
Linear Numeric Range. Select to apply a uniform interval to the data column associated with the
highlighted instance.
When you select this option, the Select Intervals area displays a single Interval Value field. The value you
enter determines the size of the subsets in which the reported data is organized.
Variable Numeric Range. Select to apply custom intervals to the data column associated with the
highlighted instance. When you select this option, the Select Intervals area displays. When you first
configure the component, this area shows a single row with three fields: Label, Start, and End. It also shows
an All check box. You can add as many rows as you need. Each row defines an interval, and each interval can
be a different size.
Label field. Allows you to enter a descriptive label for the data row that appears in the report.
Start and End fields. Allow you to set the interval boundaries for the ranges displayed in the report.
Add button. Adds a row beneath the existing rows.
Remove button. Deletes the selected row. To delete a row from the report, check its box and click Remove.
To delete all rows, check the All option and click Remove.
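The difference between the two range types can be illustrated with a small Python sketch. It is a model under assumptions rather than the component's actual behavior: the function names are hypothetical, the sample ages are invented for illustration, and each interval is assumed to include its Start value and exclude its End value, which the text above does not specify.

```python
from collections import Counter


def linear_ranges(values, interval):
    """Count values in uniform subsets of the given size (Linear Numeric Range)."""
    counts = Counter()
    for value in values:
        start = value // interval * interval
        counts[(start, start + interval)] += 1
    return dict(sorted(counts.items()))


def variable_ranges(values, intervals):
    """Count values in custom (label, start, end) rows (Variable Numeric Range)."""
    counts = {label: 0 for label, _start, _end in intervals}
    for value in values:
        for label, start, end in intervals:
            if start <= value < end:   # assumed: start inclusive, end exclusive
                counts[label] += 1
                break
    return counts


ages = [3, 17, 18, 25, 40, 41, 67]   # hypothetical sample data
print(linear_ranges(ages, 20))
print(variable_ranges(ages, [("Minor", 0, 18), ("Adult", 18, 65), ("Senior", 65, 120)]))
```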
Missing Values
The Missing Values component searches for specific values in an input field and determines the
frequency of the values within the field. Use it to search for known bad or absent data values.
The Report Target creates a table listing the searched-for values and the number of times they occur in
the related column.
Configuration
The Missing Values configuration dialog box contains an upper pane that lists the data columns available to the
component, and a Missing Values pane to specify the data values you want to find.
To configure the component, highlight and select a data column in the upper pane. Next, right-click in the
Missing Values pane and select Add Value or Add Null Value from the context menu.
When you select Add Value, a message appears. Double-click the text as prompted and type a value on the edit
line. The value you provide will be assigned to the highlighted column. To save your changes, press Enter before
moving from the edit line. You can assign multiple values to a single column.
Note: You can select all columns in the upper pane with a context menu option. However, values are assigned
only to the highlighted column.
Selecting Add Null Value adds the text Null Value to the pane and instructs Data Quality to search for null
values in the selected column.
To delete a value from the Missing Values pane, select Delete Value from the context menu.
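The search that the Missing Values component performs can be pictured with the following Python sketch. It is illustrative only: the function name count_missing and the sample column are hypothetical, None stands in for a database null, and exact string matching is an assumption.

```python
def count_missing(column, search_values, include_null=False):
    """Count occurrences of each searched-for value, and optionally of nulls."""
    counts = {value: 0 for value in search_values}
    null_count = 0
    for cell in column:
        if cell is None:
            if include_null:        # corresponds to Add Null Value
                null_count += 1
        elif cell in counts:        # corresponds to a value added with Add Value
            counts[cell] += 1
    return counts, null_count


# Hypothetical column containing several representations of missing data.
column = ["N/A", "Smith", None, "Unknown", "N/A", "", None]
print(count_missing(column, ["N/A", "Unknown", ""], include_null=True))
```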
CHAPTER 5
Analysis Components
This chapter includes the following topics:
Overview
Character Labeller
Token Labeller
Overview
Analysis components are used to identify data quality problems within individual fields in a dataset. The
analysis components identify features within free-text or non-numeric fields. The frequency of these features
can then be counted using the Count component and included in the plan report. The features can also be used
directly in cleansing and standardization routines.
Data Quality provides the following analysis components:
Character Labeller
Token Labeller
Character Labeller
The Character Labeller creates a character-by-character profile of data values in a data field. The
component categorizes some or all characters in the input fields according to character type. The
character types recognized by the component are:
Alpha. An alphabetic character. The default label is c.
Digit. A numeric character. The default label is n.
Symbol. A symbol, such as a period. The default label is s.
Space. Any space between data elements. The default label is _.
You can configure the component to identify all instances of one or more of these types in the input data. The
Character Labeller searches each field in the dataset for the character types you specify and writes a new column
containing codified representations of where your selections occur.
For example, the Character Labeller labels the string 01/01/2008 as nn/nn/nnnn with the Digit type
selected. It labels the same string as nnsnnsnnnn with the Digit and Symbol types selected.
You can change the labels assigned to the character types. You can also define custom labels that represent a
single character value or a set of character values.
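The character-by-character labelling described above can be approximated in Python as follows. This is a minimal sketch, not the Character Labeller's implementation: the function name label_characters is hypothetical, and the mapping of characters to types relies on Python's own string tests.

```python
DEFAULT_LABELS = {"alpha": "c", "digit": "n", "symbol": "s", "space": "_"}


def label_characters(text, selected_types, labels=DEFAULT_LABELS):
    """Replace each character of a selected type with its label; keep the rest as-is."""
    output = []
    for ch in text:
        if ch.isalpha():
            kind = "alpha"
        elif ch.isdigit():
            kind = "digit"
        elif ch.isspace():
            kind = "space"
        else:
            kind = "symbol"
        output.append(labels[kind] if kind in selected_types else ch)
    return "".join(output)


print(label_characters("01/01/2008", {"digit"}))            # nn/nn/nnnn
print(label_characters("01/01/2008", {"digit", "symbol"}))  # nnsnnsnnnn
```

Passing a different labels mapping models the option to change the label assigned to a character type.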
Configuration
The Character Labeller configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Filters tab
Dictionaries tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. Use the Components pane
to define an instance of the component for use in the plan.
When first opened, this pane lists a single unconfigured instance. Configure this instance by working with the
options on the tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can remove an existing
instance by highlighting it and selecting Delete from the context menu.
Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column name to assign the column to the instance highlighted in the Components pane. You can assign a single
input to each instance.
Parameters Tab
The Parameters tab options are organized in two areas:
Standard Symbols. This area lists the standard symbols that can be applied to input data. To filter the input
fields for a character type, check its check box. If you clear a box, the underlying data for that character type
is returned.
You can select multiple character types for each instance of the component. You can also edit the symbols
returned for the character types. Table 5-1 lists the default symbols for each character type:
Table 5-1. Character Type Default Symbols
Character Type  Default Symbol
Alpha           c
Digit           n
Space           _ (underscore)
Symbol          s
Substring. This area provides options for returning the underlying data characters instead of the character
symbols for data in a field. It returns underlying characters based on their positions in the field.
For the data fields on the selected component instance, you can determine how many underlying characters
to return and where in the field to locate them.
Check Use Position to activate these settings.
Start Position. Determines the starting location in the field for this operation. For example, with a setting
of 3, the Character Labeller returns underlying data starting at the third character in the string.
Length. Determines the number of underlying characters to be returned, starting with the character
identified by the Start Position setting. For example, in a Date field with values in the mm/dd/yyyy
format, a Start Position of 7 and a Length of 4 returns the underlying year values for this field. You must
enter a value in this field to activate the substring settings.
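The Start Position and Length settings can be sketched in Python as shown below. The sketch rests on assumptions that the text does not confirm: positions are taken to be 1-based, characters outside the substring window are assumed to be labelled as usual, and the function name label_with_substring is hypothetical.

```python
def label_with_substring(text, start_position, length):
    """Label every character except those in the 1-based substring window."""
    first = start_position - 1          # convert to a 0-based index
    last = first + length
    output = []
    for index, ch in enumerate(text):
        if first <= index < last:
            output.append(ch)           # underlying character is returned unchanged
        elif ch.isalpha():
            output.append("c")
        elif ch.isdigit():
            output.append("n")
        elif ch.isspace():
            output.append("_")
        else:
            output.append("s")
    return "".join(output)


# mm/dd/yyyy value: Start Position 7 and Length 4 expose the year.
print(label_with_substring("12/25/2008", 7, 4))
```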
Filters Tab
The Filters options allow you to define filters for the input data on a component instance. You can use one or
more characters to define a filter. When the Character Labeller encounters the filter string in the input data, it
returns the underlying data characters rather than the character type symbol.
For example, in a numeric field containing quantities, such as the number of transactions in an account, you
might define a filter of 0 (zero) as it is impossible that a customer would have zero transactions. In such a case,
non-zero values will be reported by the Digit symbol while values of zero will be reported by the zero digit.
To create a filter, right-click in the Filters pane and select Add from the context menu. This opens the Filter
Setup dialog box. Type the required string in the Filter Text field and set the Enable Substring options if
required. If you do not select Enable Substring, the filter will apply to all characters in the field.
Check Use Position to activate the substring settings.
The Start Position option determines the starting location in the field for the filter operation.
The Length option determines the number of underlying characters to be returned, starting with the
character identified by the Start Position setting. You must enter a value in this field to activate the
substring settings.
The Case Sensitive option applies the filter text in a case-sensitive manner, that is, the filter will only
recognize alphabet characters in the same case (upper or lower) as the characters in the Filter Text field.
The Transform all filtered text to upper case option changes the case of filtered characters to upper case.
This option does not affect the operation of the Case Sensitive option. Transform all filtered text to upper case
operates on text that has already passed the Case Sensitive option, if the latter option is selected.
Dictionaries Tab
This tab allows you to apply dictionaries to the input data for the highlighted component instance. A dictionary
acts as another type of filter for the input data. Any character string that appears in the dictionary is
filtered, and a user-defined character is returned for it.
For example, you can apply a dictionary of state names to a customer address file, having first removed the
name of your home state. Using this dictionary, you can set the Character Labeller to replace any values in the
state field with an easily recognizable value such as X. This may assist a business that charges different postal
rates for out of state customers.
To add a dictionary, right-click in the Dictionaries pane and select Add from the context menu. The Dictionary
Setup dialog box opens. In this dialog, click the Select button to browse to a dictionary, and type a single filter
character in the Format Text field. The Character Labeller uses one character only.
Note: You must set the Enable Substring options on this tab if you select a dictionary. You cannot apply a
dictionary to all characters in a field.
Check the Use Position option to activate the substring settings.
The Start Position field determines the starting location in the field for the dictionary filter operation.
The Length field determines the number of underlying characters to be filtered, starting with the character
identified by the Start Position setting.
Note: The Character Labeller applies dictionaries to the dataset in the order they are listed under the
Dictionaries tab for a highlighted component. You can adjust the dictionary order using the Up/Down arrows.
Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.
Token Labeller
The Token Labeller analyzes the format of the data values within a field and categorizes each value
according to a list of standard or user-defined tokens.
The Token Labeller component defines nine standard tokens:
Word (alphabetic)
Number (numeric)
Code (alphanumeric mix)
Initial (single alphabetic character)
Init Set (multiple alphabetic characters)
Symbol (punctuation or other symbols)
Dictionary
Word Symbol (mix of alphabet and symbols)
Code Symbol (mix of alpha-numeric tokens and symbols)
The Token Labeller searches the dataset for the tokens you specify and returns a profile detailing how these
tokens occur in the dataset.
Table 5-2 shows a sample Customer_Name data extract:
Table 5-2. Sample Customer_Name Data Extract
Customer_Name
Mr Matthew Evans
Jason R Taylor
Amanda Parker
Heather Gray
Scott Campbell
Robert Chad Griffin
Ms Megan Adams
Antonio Reed
D M Jenkins
Mrs L Perry
Table 5-3 displays a data profile itemizing the occurrences of tokens in the data extract:
Table 5-3. Profile of Tokens
Data Values                   Quantity  Percent
firstname surname             4         40
nameprefix firstname surname  2         20
nameprefix initial surname    1         10
firstname initial surname     1         10
initial initial surname       1         10
firstname firstname surname   1         10
You can define additional token types for the Token Labeller. Customized tokens are called filters in the Token
Labeller configuration dialog box.
Configuration
The Token Labeller configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Filters tab
Dictionaries tab
Outputs tab
Components Pane
The Components pane shows the instances of the component that are available to the plan. When first opened,
this pane lists a single unconfigured instance. Configure this instance by working with the options on the tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance from this pane by selecting Delete from the context menu.
Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column name to assign the column to the instance highlighted in the Components pane. You can assign a single
input to each instance.
Parameters Tab
The Parameters tab options are organized in three areas:
Tokens. Lists the standard tokens that can be applied to input fields. To filter the input fields for a token
type, select the token. You can select multiple tokens for each instance of the component. If you clear a
selected token, the underlying data for that token type is returned.
Case Sensitive. Lists the standard tokens that can be rendered in upper or lower case, except Number and
Symbol. To generate case-sensitive output for a token type, select the token.
Case-sensitive output means that the token appearance in the analysis output will mirror the case of the
related characters in the source data. For example, with case sensitivity applied, the name Lyndon B Johnson
is rendered as Word INIT Word. With case sensitivity inactive, the name is rendered as word init word.
Lookup. Check to apply case sensitivity to any dictionaries specified on the Dictionaries tab.
Delimiters area. Provides a list of the punctuation symbols used to delimit data entries in a flat file. As with
the Tokens area, select a symbol if you want to use it as a delimiter between data fields. Any punctuation
marks or symbols not selected are considered part of the dataset.
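The effect of the Case Sensitive option can be modelled with a short Python sketch. It covers only a subset of the standard tokens (Word, Initial, Number, Code, Symbol) and is built on assumptions: the function names are hypothetical, tokens are assumed to be separated by spaces, and the rule that a label mirrors the case pattern of its source token is inferred from the Lyndon B Johnson example.

```python
def token_type(token):
    """Classify a token using a simplified subset of the standard token types."""
    if token.isalpha():
        return "init" if len(token) == 1 else "word"
    if token.isdigit():
        return "number"
    if token.isalnum():
        return "code"
    return "symbol"


def render(label, token, case_sensitive):
    """Mirror the case pattern of the source token when case sensitivity is on."""
    if not case_sensitive or label in ("number", "symbol"):
        return label
    if token.isupper():
        return label.upper()
    if token[0].isupper():
        return label.capitalize()
    return label


def label_tokens(value, case_sensitive=False):
    return " ".join(render(token_type(t), t, case_sensitive) for t in value.split())


print(label_tokens("Lyndon B Johnson", case_sensitive=True))
print(label_tokens("Lyndon B Johnson", case_sensitive=False))
```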
Filters Tab
The Filters options allow you to define and edit custom token types for a component instance and to specify the
data values to correspond to those types.
For example, data might contain fields of null or system-default data with their null status represented in
multiple ways, such as Null, Missing, N/A, or Other. The Filters tab allows you to create a token type, such as
Null and assign one or more data values to it. When the Token Labeller encounters that value, it identifies it
as the token you have created. In effect, a filter type with multiple values assigned to it is a form of reference
dictionary.
To create a filter:
1. Right-click in the Filters pane and select Add from the context menu.
This opens the Filter Setup dialog box.
2. In the Format Text field, enter a filter type, that is, a token type.
3. Type a data value in the Filter Text field.
When the Token Labeller encounters the Filter Text value, it generates the Format Text custom token type.
You can add multiple filters with different Filter Text entries and a common Format Text entry.
The context menu also provides options to edit and delete filters from a component instance.
Note: Filters defined on this tab are not governed by the Parameters tab options. They are always applied to the
input data for the component instance with which they were created.
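A filter can be thought of as a lookup from Filter Text to Format Text. The Python sketch below illustrates the Null example from this section; the names FILTERS and apply_filters are hypothetical, and exact, case-sensitive matching is an assumption.

```python
# Several Filter Text entries sharing one Format Text entry (the custom token type).
FILTERS = {
    "Null": "Null",
    "Missing": "Null",
    "N/A": "Null",
    "Other": "Null",
}


def apply_filters(value, filters=FILTERS):
    """Return the custom token type for a matching value, or None if no filter applies."""
    return filters.get(value)


for value in ["Missing", "N/A", "Smith"]:
    print(value, "->", apply_filters(value))
```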
Dictionaries Tab
This tab allows you to use one or more reference dictionaries as token identifiers. The Token Labeller assigns
dictionary entries to a single token type.
For example, you add a US_CITY dictionary to an instance of the component and assign the token type CITY
to it. Now any value in the dataset that matches a dictionary value will be recognized as the token type CITY by
the Token Labeller.
To add a dictionary:
1. Right-click in the Dictionaries pane and select Add from the context menu.
This opens the Dictionary Setup dialog box.
2. In this dialog, click Select and browse to a dictionary.
3. In the Format Text field, type a name for the dictionary value type, that is, a token type.
In the Dictionary Setup dialog box, the Inclusive and Priority options determine how the Token Labeller treats
the data values it recognizes in a dictionary:
Inclusive. When selected, the Token Labeller assigns the Format Text label to every data value it finds in the
dictionary for the highlighted instance. If this box is cleared, the Token Labeller assigns the Format Text
label to all data values that are not listed in the dictionary for the highlighted instance. This option is useful
for identifying invalid or non-dictionary matches.
Priority. Determines how the Token Labeller treats strings that contain a dictionary entry. If this box is checked,
the Token Labeller treats the entire contents of a field as a single entity and labels it as a dictionary match. If
this box is cleared, the Token Labeller treats the matching string as a dictionary match and labels the rest of
the field separately.
For example, a company name column contains a field with the string Informatica Corporation. A Corporate
Suffix dictionary is applied to this column, so the Token Labeller identifies any string containing Ltd, Inc,
Corp, LLP, or any other standard corporate suffix.
When you check Priority for the Corporate Suffix dictionary, the Token Labeller treats the string
Informatica Corporation as a single entity and returns a corresponding value: companyname. If you clear this
option, the Token Labeller returns two values for this string: word companyname.
Note: The Token Labeller applies dictionaries to the dataset in the order they are listed under the Dictionaries
tab. You can adjust the dictionary order using the Up/Down arrows.
When multiple dictionaries have been assigned to a component instance and a data value appears in more than
one such dictionary, the Token Labeller applies the token defined for the first dictionary in which it finds the
value.
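The dictionary order rule and the Inclusive option can be sketched in Python as follows. This is a simplified model, not the Token Labeller's logic: the function name, the token names, and the dictionary contents are hypothetical, whole-value matching is assumed, and the Priority option is not modelled.

```python
def label_with_dictionaries(value, dictionaries):
    """Apply dictionaries in listed order; the first one that yields a label wins.

    Each dictionary is (format_text, entries, inclusive). An inclusive dictionary
    labels values it contains; a non-inclusive one labels values it does not contain.
    """
    for format_text, entries, inclusive in dictionaries:
        if (value in entries) == inclusive:
            return format_text
    return None


# Hypothetical dictionaries; "Concord" appears in both.
ordered = [
    ("CITY", {"Concord", "San Ramon"}, True),
    ("PLACE", {"Concord", "CA"}, True),
]
print(label_with_dictionaries("Concord", ordered))   # first dictionary wins
print(label_with_dictionaries("CA", ordered))

# Inclusive cleared: flag values that are not in the dictionary.
not_state = [("INVALID_STATE", {"CA", "NY"}, False)]
print(label_with_dictionaries("TX", not_state))
```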
Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to edit it. To save your edits, press Enter before removing focus
from the field.
You can save the data output from a Token Labeller instance as metadata with the following procedure.
To save data output from a Token Labeller:
1. In the Meta Data area of the output pane, check Store.
This activates the Metadata and Profile menu fields.
2. Type the metadata and profile names in these two fields or select from existing names.
3. Click OK.
There is no need to create metadata more than once. After metadata has been created for a component instance,
you can clear the Store option so metadata is not recreated each time the plan runs. Recreate metadata only
when the plan input dataset changes.
CHAPTER 6
Transformation Components
This chapter includes the following topics:
Overview
Search Replace
Word Manager
Merge
To Upper
Rule Based Analyzer
Scripting
Overview
Data Quality transformation components allow you to adjust source data. They are typically used in
standardization plans.
Data Quality provides the following transformation components:
Search Replace
Word Manager
Merge
To Upper
Rule Based Analyzer
Scripting
Note: Transformation components create new fields for altered data. The original data remains untouched.
Search Replace
Use this component to standardize data. Like the Word Manager, the Search Replace component can
be used to remove unwanted values from a group. While the Word Manager uses dictionaries, the
Search Replace component makes use of user-defined values.
You can use the Search Replace component in the following ways:
Search for a user-defined data string and remove it from the dataset.
Search for a user-defined data string and replace it with another string.
Insert a user-defined data string at the start or end of a field.
Configuration
The Search Replace configuration dialog box contains the following areas:
Components pane
Inputs tab
Actions tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data columns available to the component instance highlighted in the top pane. Select a
field by highlighting it and clicking its check box. You can select a single column for each highlighted instance.
Actions Tab
The Actions tab lists the search and replace operations defined for the highlighted component instance. To add
an action, right-click in the pane and select Add from the context menu. This opens the Action Setup dialog
box:
The dialog box provides three options (Replace, Remove, and Insert) and a grid of text fields where you can
type one or more strings to be replaced or removed. Below this grid is a field where you can type any values that
you want to add to data. At the bottom of the dialog box are three buttons that determine where in each input
field the search and replace operation should be conducted.
The settings in this dialog box depend on the type of action you require. If you select Replace, all fields remain
available, so you can search for one or more strings and replace them with another string. If you select Remove,
the With field is disabled. If you select Insert, the search grid and the Anywhere option are disabled.
The search grid has twelve input fields by default. To add more fields, right-click in the grid and select Add
from the context menu. Likewise you can right-click and select Delete from the context menu to remove a row
from the grid. The highlighted row will be removed.
Figure 6-1. Action Setup Dialog Box
When you have finished working in this dialog box, click OK to save your action. To edit previously created
actions, right-click an action and choose Edit from the context menu.
If your Search Replace component contains multiple actions, you can change the order in which these actions
are performed. Select an action and click the arrows to move it up or down in the list.
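The effect of an ordered list of actions can be sketched in a few lines. The Python below is a simplified model, not the component's implementation: the action structure, the apply_action and run_actions names, and the sample strings are hypothetical, and the sketch always searches anywhere in the field rather than modelling the position buttons.

```python
# Illustrative model of Replace, Remove, and Insert actions applied in list order.
def apply_action(value, action):
    kind = action["type"]
    if kind == "replace":
        # Replace every listed search string with the single "with" string.
        for search in action["search"]:
            value = value.replace(search, action["with"])
    elif kind == "remove":
        # Remove every listed search string from the value.
        for search in action["search"]:
            value = value.replace(search, "")
    elif kind == "insert":
        # Insert a string at the start or end of the field.
        if action["where"] == "start":
            value = action["with"] + value
        else:
            value = value + action["with"]
    return value


def run_actions(value, actions):
    """Apply each action in turn; later actions see the output of earlier ones."""
    for action in actions:
        value = apply_action(value, action)
    return value


actions = [
    {"type": "replace", "search": ["Rd.", "Rd"], "with": "Road"},
    {"type": "remove", "search": ["#"]},
    {"type": "insert", "where": "start", "with": "UK: "},
]
result = run_actions("#3 Trebovir Rd", actions)
print(result)  # UK: 3 Trebovir Road
```

Because each action works on the result of the previous one, changing the order of the list can change the final value.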
Outputs Tab
The Outputs tab lists the names of the data outputs for the highlighted component instance as they appear in
other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.
Word Manager
The Word Manager applies one or more reference sources (data dictionaries) to an input dataset. You
can use it to determine and improve the usability of the dataset.
The Word Manager is used for three main tasks:
Determining the accuracy or inaccuracy of data in a column based on a reference source.
Removing terms from a data column.
Replacing terms in a data column.
Principally the Word Manager is used for data enhancement operations.
For example, by comparing an address data column containing European city names with a reference dictionary
of city names, you can evaluate the accuracy of data in this column.
If the dictionary includes variant spellings of city names, you can use the Word Manager to standardize spelling
by creating a new output column based on the dictionary entries.
You can check for original data entries that are not recognized by the dictionary. The Word Manager provides
an option to return only those values that are not recognized by the dictionary. The output column contains
only non-standard data. You can then subject that data to further evaluation.
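The two uses described above, standardizing to dictionary entries and returning only unrecognized values, can be sketched as follows. This Python is an illustration under stated assumptions: the dictionary contents and function names are hypothetical, and writing an empty string for recognized values in the second output is an assumption about how the column is populated.

```python
# Hypothetical city dictionary: each variant spelling maps to a standard form.
city_dictionary = {
    "Munich": "Munich",
    "Muenchen": "Munich",
    "Rome": "Rome",
    "Roma": "Rome",
}


def standardize(values, dictionary):
    """New column: the dictionary form where recognized, else the original value."""
    return [dictionary.get(value, value) for value in values]


def unrecognized_only(values, dictionary):
    """New column: only values the dictionary does not recognize (others blank)."""
    return ["" if value in dictionary else value for value in values]


cities = ["Roma", "Muenchen", "Pariss"]
print(standardize(cities, city_dictionary))        # ['Rome', 'Munich', 'Pariss']
print(unrecognized_only(cities, city_dictionary))  # ['', '', 'Pariss']
```

In both cases the input list is left unchanged, which mirrors the way transformation components write altered data to new fields.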
Configuration
The Word Manager configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Dictionaries tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column to assign that column to the instance highlighted in the Components pane. You can assign a single
input to each instance.
Parameters Tab
The Parameters tab displays two groups of editable options:
Dictionary Lookup (Case Sensitive). Applies to any dictionaries you specify for the data on the Dictionaries
tab. Check this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner.
Delimiters. Displays a list of delimiting characters. Check the delimiters applicable to your source dataset.
If your input data includes multi-domain fields, you must indicate the delimiters in use in the dataset so that
the Word Manager can distinguish between the words in the field and apply the transformative rules you
define.
Dictionaries Tab
This tab allows you to use one or more reference dictionaries to analyze or improve input data.
To add a dictionary, right-click in the Dictionaries pane and select Add from the context menu. This opens the
Dictionary Setup dialog box. In this dialog, click Select to browse to a dictionary.
The Remove Dictionary Matches option ensures that only input data values that are not recognized by the
dictionary are returned in the output column.
Dictionaries are applied to the input data in the order listed in the Dictionaries pane. You can change this order
with the Up and Down arrows.
Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.
Merge
The Merge component combines the data values from multiple input fields to form a single output
field. This component is common in standardization and analysis plans. For example, you can
combine Customer_Firstname and Customer_Surname fields to create a new field called
Customer_Name. You set the order in which the input values are merged. For example, you can create
a Customer_Name field in which surname precedes firstname or firstname precedes surname.
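As a rough illustration of the merge operation, the Python sketch below joins field values in a chosen order with a chosen join character. The merge_fields function and the sample record are hypothetical and are not part of Data Quality.

```python
# Illustrative only: merge several input fields into one output value.
def merge_fields(record, ordered_fields, join_character=" "):
    """Join the named fields in the given order using the join character."""
    return join_character.join(record[field] for field in ordered_fields)


record = {"Customer_Firstname": "Ada", "Customer_Surname": "Lovelace"}

first_then_last = merge_fields(record, ["Customer_Firstname", "Customer_Surname"])
last_then_first = merge_fields(record, ["Customer_Surname", "Customer_Firstname"], ",")
print(first_then_last)  # Ada Lovelace
print(last_then_first)  # Lovelace,Ada
```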
Configuration
The Merge configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data fields available for assignment to the highlighted component. Select a field by
highlighting it and clicking its check box. Select at least two fields on this tab.
Note: The order in which you check the boxes determines the order in which the columns are merged. If, in the
example above, you check the Customer_Surname field before the Customer_Firstname field, the merged
output lists the surname before the first name. The default name given to the output for the instance lists the
field whose box was checked first.
Parameters Tab
This tab displays the output order of the selected inputs and the join character used to merge them. To change
the output order, select an input and click the arrows to move it up or down in the list.
In the Select Join Character dropdown, choose the character to place between the merged items. Table 6-1 lists
the available characters:
Table 6-1. Available Join Characters for the Merge Component
Available Characters: Space, Double Quote, Comma, Full Stop, Semi-Colon, Single Quote, Underscore, Tab,
Dash, Pipe, Forward Slash, At Symbol (@), NONE
Outputs Tab
This tab lists the names of the configured outputs as they appear in any other components connected to the
Merge component. Double-click a name to render it editable. To save your edits, press Enter before removing
focus from the field.
To Upper
The To Upper component provides several ways to alter the case of a dataset. The component provides
pre-set methods to transform case and also allows you to use dictionaries when determining which
strings to transform.
To Upper is often used to create data uniformity before matching, standardization, or analysis operations.
Configuration
The To Upper configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Delimiters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the fields available for assignment to the highlighted instance. Select a field by highlighting
it and clicking its check box. You can add multiple fields to a single component instance. Each input field has
its own output field.
Parameters Tab
On this tab, the Case Transform area allows you to select the transformation method for the case of the data,
and the Options area provides additional options for using a dictionary and for handling input data that is
already in uppercase.
The methods for transforming case are as follows:
Uppercase. Converts all letters to uppercase.
Lowercase. Converts all letters to lowercase.
Toggle Case. Converts each lowercase letter to uppercase and vice versa.
Title Case. Capitalizes the first letter in each sub-string.
Sentence Case. Capitalizes the first letter of the field data string.
No transform. No case transformation is applied. This option is generally used with the Capitalize option.
The Options area provides the following options:
Capitalize Using Dictionary Entries. Use this option if you want to use a reference dictionary to identify
data strings for capitalization. Click Select to browse to a dictionary. Data strings recognized in the
dictionary are returned in the case style of their respective dictionary entries.
Leave UPPERCASE Words as Found. Use this option to override the Capitalize option if the input data
string is already in upper case.
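The case methods and options above can be approximated in a short sketch. The Python below is illustrative only: the function names are hypothetical, words are split on spaces only, Title Case and Sentence Case are assumed to lowercase the remaining letters, and the two options are assumed to apply word by word.

```python
# Illustrative model of the case methods and options; not the component's code.
def title_case(text):
    """Capitalize the first letter of each space-delimited sub-string."""
    return " ".join(w[:1].upper() + w[1:].lower() for w in text.split(" "))


def sentence_case(text):
    """Capitalize only the first letter of the whole field value."""
    return text[:1].upper() + text[1:].lower()


CASE_METHODS = {
    "uppercase": str.upper,
    "lowercase": str.lower,
    "toggle": str.swapcase,
    "title": title_case,
    "sentence": sentence_case,
    "none": lambda text: text,
}


def transform_case(text, method, dictionary=(), keep_uppercase=False):
    """Apply a case method, then the dictionary and uppercase options per word."""
    transformed = CASE_METHODS[method](text)
    lookup = {entry.lower(): entry for entry in dictionary}
    words = []
    for original, word in zip(text.split(" "), transformed.split(" ")):
        if keep_uppercase and original.isupper():
            words.append(original)  # leave UPPERCASE words as found
        else:
            # Recognized words take the case style of their dictionary entry.
            words.append(lookup.get(word.lower(), word))
    return " ".join(words)


print(transform_case("informatica data quality", "title"))
print(transform_case("mcdonald and IBM ltd", "title", ["McDonald"], True))
```

The second call shows the options working together: the dictionary supplies the mixed-case form McDonald, and IBM is left as it was entered.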
Delimiters Tab
When the input dataset consists of multi-domain fields, you might need to specify the delimiting symbol used
in the fields. The Delimiters tab lists the delimiters recognized by the component:
Table 6-2. Available Delimiters for the To Upper Component
Available Characters: Space, Double Quote, Comma, Full Stop, Semi-Colon, Single Quote, Underscore, Tab,
Dash, Pipe, Forward Slash, At Symbol (@)
Check the delimiters you want the component to recognize. You can use multiple delimiters.
Outputs Tab
This tab lists the names of the configured outputs as they appear in other components connected to the To
Upper component. Double-click a name to render it editable. To save your edits, press Enter before removing
focus from the field.
Rule Based Analyzer
The Rule Based Analyzer allows you to define and apply one or more business rules to selected input
data. It requires no previous knowledge of scripting or coding.
You can define two types of rules in this component: Condition and Assignment. Define a conditional
rule using IF-THEN-ELSE logic. Define an assignment by assigning a value to an output.
Configuration
When opened, the Rule Based Analyzer configuration dialog box displays any rules defined for the component.
Rule names appear in the Description column. The Status field indicates whether the plan can run the rule as
currently defined. A red icon in this field indicates that the rule has not been properly configured.
To add a rule, right-click in this pane and select Add Condition or Add Assignment from the context menu.
When you add a rule, default text appears in the Description field. Double-click in the field to edit the default
text. To configure the rule, right-click in this field and select Edit from the context menu.
Selecting Edit for a condition rule opens the Standard Rule dialog box. Selecting Edit for an assignment rule
opens the Set Rule dialog box.
Defining a Conditional Rule
The Standard Rule dialog box lists the IF, THEN, and ELSE statements defined for the component. You can
add multiple sets of statements. To edit a statement, right-click it and select Edit from the context menu.
Editing a statement involves working with a Rule Wizard to define the criteria for the statement.
When you enter multiple statements in the IF pane, those statements have an AND relationship.
The condition outputs are identified in the lower half of the Standard Rule dialog box. You can define multiple
outputs and assign a THEN or ELSE statement to any one of them.
Defining an Assignment Rule
The Set Rule dialog box provides fewer options than the Standard Rule dialog box. In place of the If, Then, and
Else panes, it has a single SET pane that lists the assignment settings defined for the rule. To edit a SET
statement, right-click its name and select Edit from the context menu.
As with conditional rules, editing a SET statement involves working with a Rule Wizard to define the criteria
for the statement. Similarly, you can define multiple potential outputs in the lower half of the dialog box and
assign the SET statement to any one of them.
The conditional rule logic is a superset of assignment rule logic. If you add another THEN or ELSE statement
to a conditional rule, the Standard Rule dialog box indicates that you are adding another assignment statement.
Expert Mode
The rule wizards allow you to write condition and assignment rules even if you have no knowledge of
programming. However, these rules retain their underlying code and syntax. To view and edit the underlying
code, use the Expert Mode option in the Standard and Set Rule dialog boxes. The code below is taken from a
conditional rule defined to check the validity of data values, Input1, by comparing them with a reference
dataset, Input2:
IF (Input1 = Input2) THEN
Output1 := "INVALID"
ELSE
Output1 := "VALID"
ENDIF
Use Expert Mode to construct more complex rules than are possible in the rule wizard, such as nested IF
statements.
Click the Validate button to validate the syntax of a rule.
Click OK to save your work. Informatica Data Quality displays an error message if the rule is invalid.
You can save an invalid or incomplete rule in Expert Mode. Complete or repair the rule before running the
plan.
Clearing the Expert Mode option before saving your work restores the dialog box defaults and discards any
changes you have made in the Scripts window.
For a list of keywords and expressions usable in Expert Mode, see Rule Based Analyzer Rule Statements on
page 147.
Example: CONTAINS Function
Use the CONTAINS function to create a rule that determines if a given string contains a user-defined value.
This function is useful when checking if data entry strings contain predicted data, for example, checking the
validity of a product code at the point of data entry.
The syntax for creating such a CONTAINS rule in Expert Mode is as follows:
Output1 := CONTAINS (Input2, Input1)
Where Input1 is the input string and Input2 is the string to be located.
The function returns an integer indicating the position in the input string of the first character of the located
value. If the value is present in multiple positions in the string, the function returns the first position in which
it occurs. If the value is not present, the function returns 0.
The CONTAINS function is case-sensitive.
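The return values described for CONTAINS can be mimicked in Python for reference. This is a sketch of the described behavior, not the function's implementation: the contains name is hypothetical, and counting positions from 1 is an assumption inferred from the fact that 0 means the value is not present.

```python
# Illustrative only: mirrors the argument order CONTAINS(value to locate, input string).
def contains(search_value, input_string):
    """Return the position of the first occurrence (counted from 1), or 0 if absent."""
    # str.find is case-sensitive and returns -1 when nothing is found,
    # so adding 1 yields 0 for "not present".
    return input_string.find(search_value) + 1


print(contains("AB", "XXABAB"))  # 3
print(contains("ab", "XXABAB"))  # 0
```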
Example: DATECONVERT Function
Use the DATECONVERT function to create a rule that converts a date to a different format. For example, a
plan might use a rule that converts a date from typical UK format (DD/MM/YYYY) to U.S. format
(MM/DD/YYYY). The syntax for such a rule is:
Output1 := DATECONVERT(Input1,"DD/MM/YYYY","MM/DD/YYYY")
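For comparison, the same conversion can be sketched in Python. The date_convert and to_strptime names are hypothetical, and the mapping of the DD, MM, and YYYY tokens to Python format codes is an assumption made for this sketch; it does not describe how DATECONVERT works internally.

```python
from datetime import datetime

# Assumed mapping from the format tokens used above to Python format codes.
FORMAT_TOKENS = {"DD": "%d", "MM": "%m", "YYYY": "%Y"}


def to_strptime(date_format):
    """Translate a format such as DD/MM/YYYY into a Python format string."""
    for token, code in FORMAT_TOKENS.items():
        date_format = date_format.replace(token, code)
    return date_format


def date_convert(value, source_format, target_format):
    """Parse value in the source format and write it in the target format."""
    parsed = datetime.strptime(value, to_strptime(source_format))
    return parsed.strftime(to_strptime(target_format))


print(date_convert("25/12/2008", "DD/MM/YYYY", "MM/DD/YYYY"))  # 12/25/2008
```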
Date Functions
Date functions only accept numerical dates and do not accept leading or trailing spaces. Use a slash to separate
date elements in input strings. The Rule Based Analyzer processes all Gregorian dates.
When a two-digit year value is entered, Data Quality uses the following rules to determine the century:
If the two-digit year value is less than ten, the year is treated as twenty-first century. Therefore, the Rule
Based Analyzer handles the year digits 00-09 as 2000-2009.
If the two-digit year value is ten or more, the year is treated as twentieth century. Therefore, the Rule Based
Analyzer handles the year digits 10-99 as 1910-1999.
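The century rules above amount to a simple threshold test, sketched here in Python. The expand_two_digit_year function is a hypothetical name used for illustration.

```python
# Illustrative only: apply the two-digit year rules described above.
def expand_two_digit_year(two_digit_year):
    """Return the four-digit year for a two-digit year value (0 to 99)."""
    if not 0 <= two_digit_year <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    if two_digit_year < 10:
        return 2000 + two_digit_year  # 00-09 are treated as 2000-2009
    return 1900 + two_digit_year      # 10-99 are treated as 1910-1999


print(expand_two_digit_year(9))   # 2009
print(expand_two_digit_year(10))  # 1910
```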
Treatment of Locale Numbers
All numerical inputs and outputs in the Rule Based Analyzer are interpreted in a locale-specific format. For
example, when using a French locale setting, the Rule Based Analyzer accepts inputs and generates outputs
that use the comma as a decimal separator.
If you want to use numbers in a format that differs from the default setting, place them in double quotation
marks, as shown in the second point below:
Generic format: 1.65
Locale format: 1,65
Error Handling
When invalid parameters are passed into Rule Based Analyzer functions, the error is logged and the plan
continues execution. For example, if a numeric value is incorrectly passed to a Date Compare function, Data
Quality executes the plan, but the Rule Based Analyzer output appears in the output file as Invalid Value.
When conditional statements contain incorrect syntax, Data Quality produces an error message and the plan
fails.
Scripting
The Scripting component provides greater flexibility than the Rule Based Analyzer to build
customized rules and processes into a data quality plan.
Note: The Scripting component allows you to write scripts using Tool Command Language (TCL). As
such, the component requires some knowledge of this language.
For a standard dataset and for standard rules, the Rule Based Analyzer is typically adequate. Informatica
recommends the Scripting component only for rules of a complexity that the Rule Based Analyzer cannot
handle.
Configuration
The Scripting configuration dialog box contains the following areas:
Inputs
Script
Outputs
It does not have a Components pane and does not permit multiple instances to be defined for a single
component.
Inputs. Allows you to identify the data columns that constitute the input data for the component. These
fields list the input fields available to the component. Click a field to access a menu and choose a column.
The columns you select are numbered in the Input Index fields.
Script. Provides a workspace for writing the TCL script that can make use of the inputs defined above.
The Save and Load options allow you to save the script to a file and to load a pre-saved script from file.
These options act only on the TCL script written in the Script pane. They do not save or load other
settings in the dialog box.
Outputs. Displays the output name for the generated data as it appears to other components. Double-click a
name to render it editable. To save your edits, press Enter before removing focus from the field.
The Output Type field allows you to change the output data type. Two types are available: String and Float.
For more information about the range of functionality within the Scripting component, contact Informatica
Global Customer Support.
CHAPTER 7
Parsing Components
This chapter includes the following topics:
Overview, 65
Parser, 65
Splitter, 66
Token Parser, 67
Profile Standardizer, 70
Context Parser, 72
Overview
The parsing components allow you to extract relevant data from a field and separate extracted data into a
standardized format.
Data Quality provides the following parsing components:
Parser
Splitter
Token Parser
Profile Standardizer
Context Parser
Parser
Informatica partners use the Parser component to implement customized parsing plug-ins. Parsing
plug-ins read specified input strings and create one or more new custom values from the words or
characters in the string.
Developers implement this component using the Global Component SDK. For more information, see the
Global Component SDK Guide.
Splitter
The Splitter component parses data values in a text field into new fields by comparing source data with
one or more reference datasets. Each instance of the Splitter parses a single data column.
Configure the Splitter by:
Selecting a data input, that is, a column in the dataset already configured in the plan.
Identifying another data column to use as a reference dataset.
Optionally, defining output field variables or identifying a dictionary for use as a filter on parsed data.
You can use the Splitter with or without a dictionary. The method you choose depends on the composition of
your dataset and the available dictionaries.
Parsing Data Without a Dictionary
You want to parse a column of names by gender and your dataset already contains a Gender column, so you do
not need a dictionary. First, select the source data column, such as the First_name field, and then select the
Gender column for reference purposes.
Next, identify the variables you want the Splitter to match against the reference data. The variables should
match the possible values in the reference field, in this case MALE and FEMALE.
The Splitter component creates output fields based on the defined variables. Each value in the First_name field
identified as MALE in the reference data is written to a corresponding new MALE data field, and each source
value defined as FEMALE is written to a new FEMALE field. By default, the Splitter also creates an Overflow
field to capture any source data that cannot be identified by the reference column.
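The variable-based split can be pictured with the following Python sketch. It is a simplified model rather than the Splitter's implementation: the split_by_variables function and the sample data are hypothetical, and leaving non-matching output fields blank for each record is an assumption.

```python
# Illustrative only: split a source column using values in a reference column.
def split_by_variables(source_values, reference_values, variables):
    """Write each source value to the output named by its reference value."""
    outputs = {name: [] for name in variables}
    outputs["Overflow"] = []
    for source, reference in zip(source_values, reference_values):
        # Values whose reference matches no variable go to the Overflow field.
        target = reference if reference in variables else "Overflow"
        for name in outputs:
            outputs[name].append(source if name == target else "")
    return outputs


first_names = ["John", "Mary", "Alex"]
genders = ["MALE", "FEMALE", "UNKNOWN"]
columns = split_by_variables(first_names, genders, ["MALE", "FEMALE"])
print(columns)
```

The variables are compared with the Gender values, never with the names themselves, so a name is routed purely by what its reference value says.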
Parsing Data with a Dictionary
You want to parse a column of account names based on their residence in the United States. Instead of adding
variables for the names and possible abbreviations of every state, you can use a dictionary.
First, select a source data column, such as the Surname field, then select an appropriate address column,
such as State or Zip, for reference purposes.
Next, identify an appropriate dictionary, in this case a dictionary of all valid U.S. zip codes. The entries in this dictionary are
compared with the reference column data. By default, the Splitter creates an output field for source data
recognized by the dictionary and an overflow field for values not recognized. In this way, the Splitter produces
two columns, one each for U.S. and non-U.S. account names.
Note the following:
You can use multiple dictionaries and multiple variables.
Dictionaries and variables are not mutually exclusive. You can use either or both with an instance of the
Splitter. Each has its own output column.
The variables or the dictionaries you select are compared with the reference dataset, not the source dataset.
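A dictionary-based split follows the same pattern, with a set of dictionary entries taking the place of the variables. The Python below is a hedged sketch: the split_by_dictionary function, the sample zip codes, and the case-insensitive default are assumptions made for illustration.

```python
# Illustrative only: split a source column by looking up reference values in a dictionary.
def split_by_dictionary(source_values, reference_values, dictionary, case_sensitive=False):
    """Return (recognized, overflow) columns based on the reference values."""
    if case_sensitive:
        entries = set(dictionary)
    else:
        entries = {entry.lower() for entry in dictionary}
    recognized, overflow = [], []
    for source, reference in zip(source_values, reference_values):
        key = reference if case_sensitive else reference.lower()
        # The dictionary is compared with the reference value, not the source value.
        if key in entries:
            recognized.append(source)
            overflow.append("")
        else:
            recognized.append("")
            overflow.append(source)
    return recognized, overflow


surnames = ["Smith", "Dupont", "Jones"]
zip_codes = ["10001", "75008", "94105"]
us_zip_dictionary = ["10001", "94105"]  # hypothetical dictionary excerpt
us_accounts, other_accounts = split_by_dictionary(surnames, zip_codes, us_zip_dictionary)
print(us_accounts, other_accounts)
```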
Configuration
The Splitter configuration dialog box contains two menus for identifying the input and reference data fields,
and two panes that you can populate using context menus:
Source Input menu. Use to identify the data column to be parsed.
Reference Input menu. Use to identify the data column with which the defined variables or dictionaries will be
compared.
Lookup (Case Sensitive) option. Use if you want the Splitter to apply case sensitivity when comparing a
dictionary with the reference data.
To add a dictionary or variable, right-click in the pane beneath the Lookup option and select Add Dictionary or
Add Value from the context menu.
The Splitter creates an output column for each entry in the upper pane and lists them in the Outputs pane. Edit
an output column name or overflow output field name by double-clicking it.
Token Parser
The Token Parser is designed to parse free-text fields that contain multiple tokens. It parses each token
to a separate field. The component identifies each value in the field by data type and writes each value
to a user-defined output field.
For example, a single free-text address field such as 3 Trebovir Rd, London, SW1 can be parsed to the
following output fields:
House Number: 3
Street Name: Trebovir
Address Suffix: Road
City: London
Postcode: SW1
The Token Parser searches an input field for the data types defined on the Outputs tab of the configuration
dialog box. When it finds a type specified for the first defined output, it writes that data to the associated
output field. It then searches the field for the type defined in the second output. When a specified data type is
not found, the corresponding output is left blank.
The parsing operation passes through each field only once. The parsing operation does not reset to the start of
the field when a data value is recognized.
The Token Parser uses the same set of generic data types as in the Token Labeller component:
Word
Code
Number
It also allows you to define data types by dictionary.
Configuration
The Token Parser configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Dictionaries tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select a single field for each component instance.
Parameters Tab
The Parameters tab displays the following editable options:
Delimiters. The Delimiters area displays a list of delimiting characters. Select the delimiters applicable to
your source dataset.
Reverse Enabled. Use to read data inputs associated with the highlighted instance from right to left, instead
of the default direction of left to right. This option enables you to parse data based on the final values in a
field, such as postcode.
Overflow Reverse Enabled. When selected, overflow data from a reverse-enabled parsing operation is
written to the Overflow output in reverse, right to left. Enabled when you use the Reverse Enabled option,
this option is selected by default. If you clear this option, overflow output for the parsed data is written left
to right.
Dictionary Lookup (Case Sensitive). Applies to any dictionaries specified for the data on the Dictionaries
tab. Use this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner. When this option is checked, the dictionary will only recognize tokens in the same case as the
dictionary labels.
Note: This option does not enable or disable dictionary lookup. It only determines the case sensitivity of the
lookup.
Multiple Dictionary Outputs. Determines whether the component creates a single output column for the
dictionary or dictionaries applied to the instance, or whether a separate output column is created for each
dictionary. This option is selected by default.
Multiple Dictionary Operations
When you enable the Multiple Dictionary Outputs option, an output column is created for each dictionary
applied to the instance. The input is parsed by the first selected dictionary, and the first match found is written
to the dictionary output field.
If a match is found, the next dictionary is invoked, and this dictionary searches for a match within the
remaining non-parsed tokens. It does not search the tokens already searched by the former dictionary. If no
match is found, the dictionary output field is left blank and the process begins again by invoking the next
dictionary. This process continues for all dictionaries applied to the instance.
When the Multiple Dictionary Outputs option is cleared, a single output field is created. All dictionaries are
searched in the order in which they are listed on the Dictionaries tab, but only the first term identified is
written to the output column. The remaining non-parsed terms are passed to the text, number, and code
outputs, or alternatively to the overflow column.
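The difference between the two settings can be sketched as follows. This Python reflects one possible reading of the behavior described above and is not the component's implementation: the parse_with_dictionaries function, the dictionary names, and the sample tokens are hypothetical, and the sketch simply removes each matched token from the list that later dictionaries search.

```python
# Illustrative only: one reading of the Multiple Dictionary Outputs behavior.
def first_match(tokens, entries):
    """Return the first token found in the dictionary entries, or None."""
    return next((token for token in tokens if token in entries), None)


def parse_with_dictionaries(tokens, dictionaries, multiple_outputs=True):
    """Return (dictionary outputs, remaining non-parsed tokens)."""
    remaining = list(tokens)
    outputs = {}
    if multiple_outputs:
        # One output per dictionary; each takes its first match, or stays blank.
        for name, entries in dictionaries:
            match = first_match(remaining, entries)
            outputs[name] = match or ""
            if match is not None:
                remaining.remove(match)
    else:
        # A single output; only the first term identified by any dictionary is written.
        outputs["Dictionary"] = ""
        for _name, entries in dictionaries:
            match = first_match(remaining, entries)
            if match is not None:
                outputs["Dictionary"] = match
                remaining.remove(match)
                break
    return outputs, remaining


tokens = ["Dr", "John", "Smith", "London"]
dictionaries = [("Title", {"Dr", "Mr"}), ("City", {"London", "Paris"})]
multi = parse_with_dictionaries(tokens, dictionaries, multiple_outputs=True)
single = parse_with_dictionaries(tokens, dictionaries, multiple_outputs=False)
print(multi)
print(single)
```

In the single-output case, London is not written to the dictionary output and remains available to the other output types or the overflow column.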
Dictionaries Tab
The options on this tab allow you to apply a Data Quality dictionary to the input strings so that any input data
that matches a dictionary entry will be returned as a dictionary output. You can configure each dictionary to
write the input token unchanged to the dictionary output column or to standardize the input token to the
dictionary version of the token.
To add a dictionary to the instance highlighted in the Components pane, right-click in the pane beneath the
Dictionaries tab and select Add from the context menu. This opens the Dictionary Setup dialog box. Click the
Select button in this dialog to browse to the required dictionary.
The Dictionary Setup dialog box contains a Dictionary Standardization option. Check this option to return the
dictionary version of the token. When cleared, this option returns the token as it appears in the input string.
Outputs Tab
The Outputs tab options define the output columns into which the input data values are parsed. Figure 7-1
shows the Outputs tab of the Token Parser:
The Token Parser can create up to five types of output column:
Code. Any value that mixes alphabetical and numerical data. Right-click in the Add Code Outputs field to
create a code output column.
Number. Any purely numerical value identified in the input data. Right-click in the Add Number Outputs field
to create a number output column.
Text. Any purely alphabetical value identified in the input data. Right-click in the Add Text Outputs field to
create a text output column.
Dictionary. Lists the columns defined on the Dictionaries tab. You cannot add or delete dictionary outputs
from the Outputs tab.
Overflow. A single column to which any non-parsed data is written. This field is created by the component and
cannot be deleted from the component.
The Token Parser creates its outputs as follows:
First, the component applies any user-set dictionaries to the input data. Any tokens recognized by the
dictionaries are written to the columns specified in the Dictionary Outputs field.
Next, the component looks for output columns defined for code, number, and text tokens, in that order. If it
finds such columns, it writes any recognized tokens to the respective columns.
You can create multiple output columns for a Token type. For example, if your input data is composed of
records containing three address fields, create three text outputs. If your input data contains a telephone
number and a five-digit zip code, create two number outputs.
The component attempts to populate the first output column of each token type and then moves down the
columns listed for that type. If the component cannot find an appropriate column for a token, it writes that
token to the overflow column.
Figure 7-1. Token Parser, Outputs Tab
Note: The parsing operation passes through each input record once only. The parsing operation does not reset to
the start of the record when a data value is recognized.
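The routing order described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the component's implementation: the function names, the whitespace tokenization, the dictionary represented as a plain set, and the column names are all hypothetical.

```python
def classify(token):
    """Return the generic token type, following the Outputs tab definitions."""
    if token.isdigit():
        return "number"   # purely numerical
    if token.isalpha():
        return "text"     # purely alphabetical
    if token.isalnum():
        return "code"     # mixes alphabetical and numerical data
    return None           # anything else is left for the overflow column


def parse_record(record, dictionary, columns):
    """Route tokens to output columns in a single pass over the record.

    columns maps a token type ("dictionary", "code", "number", "text")
    to an ordered list of output column names (hypothetical names).
    """
    output = {name: "" for names in columns.values() for name in names}
    output["overflow"] = ""
    used = {}  # how many columns of each type are already populated

    for token in record.split():  # assumption: whitespace-delimited tokens
        # Dictionary matches take priority over the generic token types.
        kind = "dictionary" if token in dictionary else classify(token)
        names = columns.get(kind, [])
        index = used.get(kind, 0)
        if index < len(names):
            output[names[index]] = token
            used[kind] = index + 1
        else:
            # No free column of the right type: write to the overflow column.
            output["overflow"] = (output["overflow"] + " " + token).strip()
    return output
```

With one column per type, the record 12 Main Street Dublin D04X and a dictionary containing Dublin, the sketch writes Street to the overflow column because the only text column is already populated by Main.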
Profile Standardizer
The Profile Standardizer uses the output data from a Token Labeller as input data in a parsing
operation. The Profile Standardizer parses input data to a number of output fields based on a data
structure that you define.
A Profile Standardizer parses one or more inputs from a single Token Labeller. To parse output from another
Token Labeller, use another Profile Standardizer.
Configuration
The Profile Standardizer configuration dialog box enables you to define a multi-field data structure for the
tokens recognized by the Token Labeller. Figure 7-2 displays the Profile Standardizer configuration dialog box:
Using the Profile Standardizer, you can create new data columns into which one or more tokens are parsed. You
can create a rule for each combination of tokens, so that each underlying value is written to a new field.
For example, a Customer Account dataset includes a single Name field for customer names, including first and
middle names, surnames, and initials. The Token Labeller recognizes the types of tokens present in the Name
field data. The Profile Standardizer accepts the Token Labeller output and lists the various combinations of
tokens in the Name field. The Profile Standardizer can create new columns for first names, middle names, and
surnames.
Figure 7-2 shows a Profile Standardizer in mid-configuration. You do not have to create rules for every
combination of tokens.
Figure 7-2. Profile Standardizer Configuration Dialog Box
In Figure 7-2, the rule applied to line 3, word word, sends the first token to a new first name field and the
second token to a surname field. Similarly, the combination word word word on line 5 corresponds to a
customer first name, middle name, and surname, and the rule is defined accordingly. Depending on the dataset,
there can be an element of trial and error to maximizing the output of the Profile Standardizer. The rules might
require tuning to reach your target level of parsing quality.
When you define a rule for a token combination, its row changes appearance.
Components pane. Lists the instances defined for the Profile Standardizer. When first opened, this pane lists
a single instance. You can add multiple instances as long as they are linked to the same Token Labeller.
Inputs pane. Lists the Token Labeller outputs available to the highlighted component instance. Select an
input by highlighting it and clicking its check box. You can select a single input.
The Metadata and Profile menus let you identify the metadata associated with the Token Labeller output. A
single Token Labeller can store multiple metadata and profile combinations. Selecting a new metadata-profile
combination in the Profile Standardizer can provide a new range of input options.
Save any changes you have made in the component before changing the current metadata or profile.
When the input, metadata, and profile are selected for the current instance, the Profiles column is populated
with the profiles created by the Token Labeller. You can now define the target columns for each set of tokens.
Right-click anywhere in the Profiles pane to add, insert, delete or rename columns from a context menu. When
you add a column, it appears to the right of existing columns.
Applying Rules to Profiles
After you create the new columns that you need, you can define the rules that determine how input data values
are parsed to new fields.
You do not have to define rules for every token profile. Defining a small number of rules can often parse a large
percentage of input data. You can subsequently add or edit rules to reach your target levels for parsing quality.
As with other parsing components, the Profile Standardizer creates an Overflow column automatically for all
data that is not parsed by the defined rules.
To apply rules to profiles:
1. Click a field in a user-defined column to open the Edit Profile Rule dialog box.
This displays the tokens available for insertion to that field, that is, the tokens in the Name input field for
that record. Tokens are listed in order of their occurrence in the source field, from top to bottom.
2. Select a token to send all values corresponding to that token to the new field.
3. Define a rule for a field and click Apply.
The Edit Profile Rule dialog box automatically moves to the next field in the row and displays its token
options.
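The effect of profile rules can be illustrated with a short Python sketch. It is an illustration only, assuming that each rule maps a token profile to one target column per token position. The rule table, the column names, and the function name are hypothetical and do not reflect how the component stores its rules.

```python
# Hypothetical rule table: token profile -> target column for each token position.
RULES = {
    ("word", "word"): ["first_name", "surname"],
    ("word", "word", "word"): ["first_name", "middle_name", "surname"],
}


def apply_profile_rules(tokens, profile, rules=RULES):
    """Write each token to the column named by its profile rule.

    Records whose profile has no rule are written to the overflow column.
    """
    targets = rules.get(tuple(profile))
    if targets is None:
        return {"overflow": " ".join(tokens)}
    return {column: token for token, column in zip(tokens, targets)}
```

For example, the tokens John and Smith with the profile word word are written to first_name and surname, while a record with a profile that has no rule is written whole to the overflow column.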
Reusing Profile Data
Configured Profile Standardizer instances are saved with the metadata and profile from which the Profile
Standardizer drew the input token information. The metadata and profile appear in menus in the dialog box.
Any rules you save with a Profile Standardizer can be accessed by other instances of the component in the plan,
or in any other plans that access the same metadata repository.
Changing or deleting the Token Labeller can affect the input to the Profile Standardizer, but does not affect the
rules already created for a profile. Changing the inputs selected in the Inputs window of the Profile Standardizer
does not affect the rules already saved in the component. These rules remain in the table for any other inputs
selected in the component.
When a component is saved with a particular profile and rules, and a new profile is introduced and assigned
parsing rules, the rules from the previously-selected profile are appended to the end of the new table. The rules
from the previous profile are displayed by a light grey font on a dark grey background.
Changing the Number of Displayed Profiles
The number of profiles displayed within the Profile Standardizer is set by default to 500 rows. You can change
the maximum number of rows by editing the config.xml file located in your Data Quality installation folder, by
default: C:\Program Files\Informatica Data Quality\config.xml.
The value is configured in the MetadataProfiles element:
<MetadataProfiles>500</MetadataProfiles>
Note: Restart Data Quality Workbench for the changes to take effect.
Context Parser
Like the Token Parser, the Context Parser is designed to parse free-text fields containing multiple
tokens into multiple single-token fields. Context Parser operations are based on the values and the
relative positions of the tokens.
The high-level steps in configuring the Context Parser are as follows:
1. Select an input data column for each instance.
2. Specify the delimiters to use when parsing input data.
3. Configure the output columns where individual tokens will be parsed:
Determine the number of tokens you expect in the output data.
Add an output field for each of these tokens.
Define a token type for each output you add.
The output columns can contain one or more data values, which can be of the following types:
Word
Number
Code
Symbol
Init
Dictionary (listed in a specified dictionary)
By using a combination of positional hierarchy, generic token types, and dictionary-determined data, you can
achieve highly effective parsing results even in very noisy datasets.
Configuration
The Context Parser configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select a single field for each component instance.
Parameters Tab
The Parameters tab displays the following editable options:
Delimiters. The Delimiters area displays a list of delimiting characters. Select the delimiters applicable to
your source dataset.
Reverse Enabled. Use to read data inputs associated with the highlighted instance from right to left, instead
of the default direction of left to right. This option enables you to parse data based on the final values in a
field, such as postcode.
Dictionary Lookup (Case Sensitive). Applies to any dictionaries specified for the data on the Outputs tab.
Use this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner.
This option does not enable or disable dictionary lookup. It only determines the case sensitivity of the
lookup.
Outputs Tab
This tab displays the user-defined output columns for the highlighted component instance. With no outputs
defined, this area is empty. Right-click below the tab and select Add Output to add an output column.
Each output is defined by two fields. The output name appears in an editable upper field. The lower field lists
the types of data values to be parsed to the field. You can set the output field to accept any of six data value
types, and you can organize these types in any order.
The input data is parsed according to the order in which the outputs are listed on this tab, and within each
output column, by the order in which the data types are listed. You can change the order of the output columns
by right-clicking an output name and selecting Move Up or Move Down from the context menu.
Note the following:
The Context Parser performs a single sweep of each input field. As a result, the Context Parser works best for
structured data. For less-structured data, the Profile Standardizer may be more appropriate.
For example, you add an output of type NUMBER, and below it add an output of type WORD. When
parsing 12 Main Street, the Context Parser locates 12, then Main. If you reverse the output types, the
Context Parser locates Main but skips the number 12.
You can configure an output to accept more than one token by adding multiple token types to the output or
by selecting the Toggle Merge option.
Right-click a data type and select Toggle Merge from the context menu to place multiple values of that type
in a single output field if they occur consecutively within the input field. For example, right-clicking a
WORD data type and selecting Toggle Merge returns consecutive words, starting with the first word in the
field.
An overflow output is created automatically for any input values that have not been handled by the
component.
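The single-sweep behavior and the Toggle Merge option can be sketched in Python as follows. This is one possible reading of the behavior described above, not the component's implementation. The function names, the whitespace delimiter, and the reduced set of token types (NUMBER, WORD, CODE, and a catch-all OTHER) are assumptions made for the example.

```python
def token_type(token):
    """Classify a token into a reduced set of types (assumption)."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        return "WORD"
    return "CODE" if token.isalnum() else "OTHER"


def context_parse(text, outputs):
    """Parse text in one left-to-right sweep.

    outputs is an ordered list of (name, wanted_type, merge) tuples.
    The sweep never moves backwards: tokens passed over while looking
    for the wanted type are written to the overflow output.
    """
    tokens = text.split()  # assumption: a space is the only delimiter
    result = {name: "" for name, _, _ in outputs}
    overflow = []
    pos = 0
    for name, wanted, merge in outputs:
        while pos < len(tokens) and token_type(tokens[pos]) != wanted:
            overflow.append(tokens[pos])
            pos += 1
        if pos == len(tokens):
            break
        found = [tokens[pos]]
        pos += 1
        # Toggle Merge: also take consecutive tokens of the same type.
        while merge and pos < len(tokens) and token_type(tokens[pos]) == wanted:
            found.append(tokens[pos])
            pos += 1
        result[name] = " ".join(found)
    overflow.extend(tokens[pos:])
    result["overflow"] = " ".join(overflow)
    return result
```

With a NUMBER output followed by a WORD output, parsing 12 Main Street yields 12 and Main, and Street goes to the overflow. With the outputs reversed, the sketch finds Main and sends 12 to the overflow, because the sweep has already moved past it.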
CHAPTER 8
Key Field Generator Components
This chapter includes the following topics:
Overview
Normalization
Soundex
Nysiis
Overview
Key Field Generator components group data in preparation for the matching process. With these components,
you can create the keys by which the data is grouped. When you group data, you enhance the efficiency of the
matching process.
Data Quality provides the following key field generator components:
Normalization
Soundex
Nysiis
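The efficiency gain from grouping can be illustrated with a small Python sketch. It assumes that only records sharing a key are compared with each other. The key function used in the usage note (the first letter of the value) is a hypothetical stand-in for a key generated by one of these components.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_func):
    """Return the record pairs that share a group key.

    Only records within the same group are paired, so the number of
    comparisons is usually far lower than comparing every record
    with every other record.
    """
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record)
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return pairs
```

For the five values Smith, Smyth, Ford, Forde, and Burton, comparing every value with every other value requires ten comparisons. Grouping on the first letter leaves two candidate pairs: Smith with Smyth, and Ford with Forde.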
Normalization
Informatica partners use the normalization component to implement customized normalization plug-
ins. Normalization plug-ins read input values and write standardized versions of those values.
Developers implement this component using the Global Component SDK. For more information, see
the Global Component SDK Guide.
Soundex
The Soundex component recognizes phonetic matches between alphabetic strings. It analyzes the
phonetic components of a word and assigns a value to the string based on the phonetic characteristics
of the initial characters in the string. Because it can identify matches between words based on an analysis of how
the words sound rather than how they are spelled, Soundex allows for spelling errors at the point of data entry.
Use Soundex to generate a phonetic key for grouping similar records before matching. Soundex can be applied
to any free-text field.
For every field analyzed, Soundex generates a code beginning with the first letter in the word and followed by a
series of numbers representing successive consonants. Generally, similar-sounding consonants are assigned the
same code. The Soundex depth, the number of alphanumeric characters returned, is set to 3 by default. This
means the Soundex code consists of the first letter in the string and two numbers representing the next two
distinct-sounding consonants. You can change the Soundex depth.
Configuration
The Soundex configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Soundex component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select multiple inputs for each instance in the Components
pane, but all inputs share a common Soundex depth.
Parameters Tab
The Parameters tab allows you to set the number of alphanumeric characters Soundex returns, called the depth.
The default depth is 3, with an alphabetic character representing the first letter in the word, and two numbers
representing the next two distinct-sounding consonants.
Increasing the depth means increasing the number of digits generated to represent additional letters in the
word. The depth setting applies to the highlighted instance in the upper pane.
The following table illustrates different Soundex depth codes:
Surname Value    Soundex Value - Depth 3    Soundex Value - Depth 4
Broderick        B63                        B636
Smith            S53                        S530
Ford             F63                        F630
Burton           B63                        B635
Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to edit it. To save your edits, press Enter before removing focus
from the field.
Deriving Soundex Depth Codes
The Soundex depth code consists of the first letter of the string in a given field, followed by a series of numbers
that represent some or all of the remaining letters in the string. The component skips all vowels and similar
letters:
a, e, i, o, u, h, w, y
It adds numbers for other letters as shown in Table 8-1.
The following general rules apply:
If two or more consecutive letters have the same code number, they are coded together, allowing Soundex to
skip to the next distinct consonant sound. This rule applies in all cases, including the first and second letters
of the word.
For example:
Gutierrez is coded G362: G, 3 = T, 6 = both Rs, 2 = Z
Pfister is coded P236: P, (F skipped for having the same code as P), 2 = S, 3 = T, 6 = R
If there are insufficient letters for the Soundex depth, the remaining numbers in the code appear as zero.
For example, if the depth is set to 5 and the word in question has three letters, Soundex completes the code
with zeros.
Whether two letters separated by a skipped letter count as consecutive depends on the letter that separates them:
If a vowel separates two consonants that have the same Soundex code, the consonant to the right of the
vowel is coded.
For example:
Tymczak is coded T522 (T, 5 = M, 2 = C, Z skipped, 2 = K). Because A separates Z and K,
the K is coded.
If H or W separates two consonants that have the same Soundex code, the consonant to the left of the
H or W is coded and the consonant to the right is ignored.
For example:
Ashcraft is coded A261 (A, 2 = S, C ignored, 6 = R, 1 = F). It is not coded A226.
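The rules above can be collected into a short Python sketch. It is an illustration that reproduces the examples in this section, not the component's source code. The function name, the depth argument, and the handling of non-alphabetic characters (they are dropped) are assumptions. The letter-to-number mapping is taken from Table 8-1.

```python
# Letter-to-number mapping from Table 8-1.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(value, depth=3):
    """Return a Soundex code with `depth` alphanumeric characters."""
    letters = [c for c in value.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")  # the first letter also starts a run
    for letter in letters[1:]:
        if letter in "HW":
            continue          # H and W do not break a run of equal codes
        digit = CODES.get(letter, "")
        if digit == "":
            previous = ""     # a vowel (or Y) breaks the run
            continue
        if digit != previous:
            code += digit
        previous = digit
    # Pad with zeros when the word has too few coded letters, then cut to depth.
    return (code + "0" * depth)[:depth]
```

With a depth of 4, the sketch returns G362 for Gutierrez, P236 for Pfister, T522 for Tymczak, and A261 for Ashcraft. With a depth of 3, Broderick and Burton both return B63, so both values receive the same group key.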
Table 8-1. Soundex Depth Codes
Code    Letters
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R
Nysiis
The Nysiis component converts the values of an input string to their phonetic equivalent. Nysiis uses a
phonetic encoding algorithm created for the New York State Identification and Intelligence System.
Unlike the Soundex component, Nysiis does not create a code to represent the string. Instead, it reconstitutes
the spelling of the string based on its phonetic characteristics. While Soundex focuses on similarities in spelling
at the start of matched strings, Nysiis looks for overall similarities between strings.
Configuration
The Nysiis configuration dialog box consists of the following areas:
Inputs tab
Outputs tab
Inputs Tab
The Inputs tab lists the input columns available to the component. To select an input, check its check box. You
can access a Select All option in the context menu by right-clicking in the dialog box. You can create a single
instance of Nysiis for each component.
Outputs Tab
This tab lists the names of the data outputs as they appear in other components in the plan. Double-click a
name to render it editable. To save your edits, press Enter before removing focus from the field.
The following table shows examples of Name-to-Nysiis value conversions:
Surname Value    Nysiis Value
Adams            Adan
Adames           Adan
Adems            Adan
Barnes           Barn
Barns            Barn
Bearns           Barn
CHAPTER 9
Matching Components
This chapter includes the following topics:
Overview
Similarity
Edit Distance
Jaro Distance
Hamming Distance
Bigram
Mixed Field Matcher
Weight Based Analyzer
Overview
Data Quality provides matching components that are explicitly designed to determine the degrees of similarity
between given data values. Each matching component applies a different algorithm to its data input, and each is
suited to a different type of data quality problem:
Identity Match. Performs matching operations on input data at an identity level.
Note: For information on the configuration of this component, see page 89.
Similarity. Implements custom plug-ins to calculate the type and degree of similarity between two strings.
Edit Distance. Calculates the edit distance between two strings.
Jaro Distance. Calculates the difference between two strings using a variation of the Jaro-Winkler
algorithm.
Hamming Distance. Calculates the number of positions in which characters differ between two strings.
Bigram. Calculates the occurrence of matching pairs between two strings.
Mixed Field Matcher. Compares multiple fields between two strings based on selected match calculations.
Weight Based Analyzer. Calculates an aggregate match score based on the output scores from other
matching components. You can define weights for the output scores from the other matching components.
Note: Distance components are case-sensitive.
Matching components calculate numerical scores representing the similarity or dissimilarity between pairs of
data values, generating a match score between 0 and 1. The higher the score, the greater the degree of similarity
between the two strings based on the match component criteria.
For information about the formulas used to calculate match scores, see Matching Formulas on page 153.
Similarity
Informatica partners use the Similarity component to implement customized similarity plug-ins.
Similarity plug-ins read a pair of input values and compute the type and degree of identity between the
two values, expressing this identity as a numerical value.
Developers implement this component using the Global Component SDK. For more information, see the
Global Component SDK Guide.
Edit Distance
The Edit Distance component derives a match score for two data values by calculating the minimum
cost of transforming one string to another by inserting, deleting, or replacing characters.
The result of this calculation is the edit distance, from which the component derives its match score.
The higher the match score, the greater the similarity between the two strings.
This component is ideal for matching fields containing a single word or a short text string such as a name or
short address field. You can use it to compare corresponding fields across two records or to compare different
fields within the same record.
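The minimum-cost calculation can be sketched in Python with the standard dynamic-programming approach. This is an illustration under stated assumptions: every insertion, deletion, and replacement is given a cost of 1, and the conversion of the distance to a score between 0 and 1 (dividing by the length of the longer string) is a common convention rather than the formula documented for the component. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            replace_cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + replace_cost))
        previous = current
    return previous[-1]


def edit_score(a, b):
    """Convert the distance to a 0-1 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Under these assumptions, the street names College St. and Collage St are two edits apart: one replacement and one missing period.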
For example, an edit distance calculation is performed on two street names:
College St.    Collage St
The component calculates the cost of transforming the a in Collage to an e and inserting a period after St.
Configuration
The Edit Distance configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Edit Distance component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
Parameters Tab
The Parameters tab allows you to set the output score assigned to a matched pair when one or both fields are
empty or contain null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
Outputs Tab
This tab lists the names of the configured outputs as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Jaro Distance
Like the Edit Distance component, the Jaro Distance component calculates the general similarity
between two data values. However, the Jaro Distance component reduces the match score when a pair
of values do not share a common prefix.
Like other Data Quality matching components, the higher the match score, the greater the similarity between
the strings.
The component uses a variation of the Jaro-Winkler algorithm. The algorithm penalizes the match if the first
four characters in each string are not identical. The default penalty is 0.2.
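The prefix penalty can be illustrated with the following Python sketch. It computes the standard Jaro similarity and then subtracts the penalty when the first four characters differ. The exact variation used by the component is not specified here, so the function names, the clamping of the result at 0, and the combination of the two steps are assumptions.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings (0 to 1)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_with_penalty(s1, s2, penalty=0.2, prefix_length=4):
    """Subtract the penalty when the leading characters are not identical."""
    score = jaro(s1, s2)
    if s1[:prefix_length] != s2[:prefix_length]:
        score -= penalty
    return max(score, 0.0)
```

In this sketch, MARTIN and MARTINEZ share the prefix MART and keep their Jaro score, while MARTHA and MARHTA differ in the fourth character and lose 0.2 from their score.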
Configuration
The Jaro Distance configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Jaro Distance component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
The Penalty field determines the value subtracted from the match score if the first four characters of both
strings are not identical. The default setting is 0.2.
The Case Sensitive check box, when checked, specifies that the matching calculation will consider the case of
the characters when determining the identity between them. This box is cleared by default.
Outputs Tab
This tab lists the names of the configured outputs as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Hamming Distance
The Hamming Distance component derives a match score by calculating the number of positions in
which characters differ for a pair of data strings. Use the Hamming Distance component when the
position of the data characters is a critical factor, as in numeric or code fields such as telephone
numbers, zip codes, dates, and product codes.
By default, the Hamming Distance component reads data from left to right. You can reverse this setting.
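The position-by-position comparison, including the right-to-left option, can be sketched in Python as follows. This is an illustration and not the component's implementation. The function names are hypothetical. Counting every unpaired position in the longer string as a difference, and dividing by the length of the longer string to obtain a score between 0 and 1, are assumptions.

```python
def hamming_differences(a, b, reverse=False):
    """Count the positions in which the characters of a and b differ."""
    if reverse:
        # Read both values from right to left.
        a, b = a[::-1], b[::-1]
    differing = sum(1 for char_a, char_b in zip(a, b) if char_a != char_b)
    # Assumption: positions without a counterpart count as differences.
    return differing + abs(len(a) - len(b))


def hamming_score(a, b, reverse=False):
    """Convert the difference count to a 0-1 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - hamming_differences(a, b, reverse) / longest
```

Reading from right to left helps when the significant characters are at the end of the value. In this sketch, 5551234 and 1234 differ in all seven positions when read from left to right, but in only three when read from right to left.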
Configuration
The Hamming Distance configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Hamming Distance component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
This tab also displays the Reverse Hamming option. Use this option to configure the Hamming Distance
component to read data from right to left instead of the default, left to right.
Outputs Tab
This tab lists the names of the configured outputs as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Bigram
The Bigram component matches data based on the occurrence of consecutive characters in both data
strings in a matching pair, looking for pairs of consecutive characters that are common to both strings.
The greater the number of common identical pairs between the strings, the higher the match score.
This component is useful in the comparison of long text strings, such as free format address lines or lines of user
comments.
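The pair-matching idea can be sketched in Python as follows. The sketch counts the character pairs that the two strings have in common and divides twice that count by the total number of pairs in both strings. This normalization is an assumption chosen because it produces a score between 0 and 1, and the function names are hypothetical.

```python
from collections import Counter


def bigrams(value):
    """Return the pairs of consecutive characters in a string."""
    return [value[i:i + 2] for i in range(len(value) - 1)]


def bigram_score(a, b):
    """Share of all pairs that have a matching pair in the other string."""
    pairs_a, pairs_b = bigrams(a), bigrams(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0
    # Counter intersection keeps each common pair once per shared occurrence.
    common = sum((Counter(pairs_a) & Counter(pairs_b)).values())
    return 2 * common / total
```

For the names Damien and Darren, the pairs Da and en are common to both strings, so four of the ten pairs are matched and the sketch returns 0.4.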
For example, the Bigram component analyzes the following two names:
Damien    Darren
The bigram pairs for the two inputs are as follows:
Da, am, mi, ie, en
Da, ar, rr, re, en
The two inputs produce ten pairs in total. The pairs Da and en occur in both inputs, so four of the ten pairs
have a match in the other string. Therefore, the Bigram Distance between these strings is 4/10, or 0.4.
Configuration
The Bigram configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Bigram component to another.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
Outputs Tab
This tab lists the names of the configured outputs as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Mixed Field Matcher
While the distance matching components compare one pair of data values at a time, the Mixed Field
Matcher compares multiple fields in different match calculations.
The Mixed Field Matcher component identifies matches in a dataset where data values of the same or
similar types appear across multiple fields, such as freeform address fields where address elements like the
apartment number, city, or zip code can exist in different fields for different records.
The component provides several mechanisms for fine-tuning the match score computation, so you can give
different priorities to matches or near-matches of different types and levels of approximation.
To configure this component, select two groups of data fields to be matched and identify the matching
algorithm to apply to the data. You can also activate and tune priority levels for incorrect or approximate
matches. However, Informatica recommends using the default settings for these parameters.
Note: Matching operations in this component can incur a significant performance overhead and may take longer
to execute than operations in other matching components.
Configuration
The Mixed Field Matcher configuration dialog box contains the following areas:
Inputs tab
Parameters tab
Output tabs
Inputs Tab
The Inputs tab allows you to view available data fields and select the sets of input fields to be compared. To
compare data, assign fields to Input Group A and Input Group B.
Note: Groups A and B must contain the same number of fields.
The Inputs pane lists the data fields available to the component. To add a data field to either input group, right-
click it and select Add to Group A or Add to Group B from the context menu. The data fields you select display
in the input group panes.
To remove a field from either pane, right-click it and select the Remove context menu option.
Use Ctrl-A to select all fields in these panes. Select multiple fields using Shift-click or Ctrl-click.
Parameters Tab
The Parameters tab options allow you to fine-tune the component matching operations. The tab organizes its
parameters in three areas:
General. This area contains the following options:
Relative Position Factor. When the Mixed Field Matcher compares two fields from different record sets,
the relative position within each record of each field affects the strength of the match. For example, when
the Mixed Field Matcher matches a pair of fields in two records, it considers the match stronger when the
two fields are in the same column. If the same two fields appear in different columns, it considers them
a relatively inferior match.
You can set Relative Position Factor to Off, Low, Medium, or High. Medium is the default.
Matching Order Factor. This setting is concerned with the relative order of the best matches between the
input record sets. For example, when matching two fields in the record sets representing Firstname and
Surname, the Mixed Field Matcher matches John Smith with Joan Smith better than with Smith Joan even
though the individual fields match with the same score.
You can set Matching Order Factor to Off, Low, Medium, or High. Medium is the default.
Empty Input Fields Factor. This setting calculates the number of empty fields in a record as a proportion
of the total number of input fields. A high proportion of empty fields lowers the match score for fields in
the record.
You can set Empty Input Fields Factor to Off, Low, Medium, or High. Medium is the default.
Different Input Sizes Factor. This property compares the numbers of empty or null fields found in a pair
of records. When two records have different numbers of empty or null fields, this difference is
incorporated into the final matching score.
You can set Different Input Sizes Factor to Off, Low, Medium, or High. Medium is the default.
Field Match. This area contains the following options:
Match Method. This menu identifies the overall key for the matching operations. The default setting is
LCS (Longest Common Subsequence). This setting considers the length of any common character strings
in a pair of input fields and adds a factor based on the longest such string to the final score.
The default setting does not require input from another matching component in the plan. The other
settings in this menu provide for scores from other matching components.
Single Null Match Value. This setting applies if one of the two compared fields is empty. The default
setting is 0.5.
Both Null Match Value. This setting applies if both fields are empty. The default setting is 0.5.
Advanced Area. In most situations there is no need to change the advanced settings for this component. For
more information about these settings, consult Informatica Global Customer Support.
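The following Python sketch illustrates how a longest-common-subsequence comparison and the two null match values could combine into a field score. It is an illustration only: the function names are hypothetical, and dividing by the longer field length is an assumption, not the formula that the Mixed Field Matcher uses internally.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def field_match_score(a, b, single_null=0.5, both_null=0.5):
    """Hypothetical field score: null match values first, then an LCS ratio."""
    a = (a or "").strip()
    b = (b or "").strip()
    if not a and not b:
        return both_null      # Both Null Match Value
    if not a or not b:
        return single_null    # Single Null Match Value
    # Assumption: normalize the LCS length by the longer of the two fields.
    return lcs_length(a.upper(), b.upper()) / max(len(a), len(b))


print(field_match_score("John", "Joan"))
print(field_match_score("", "Smith"))
```

In this sketch, raising or lowering the two null values changes how strongly empty fields pull the score toward a match or a mismatch.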
Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
Weight Based Analyzer
The Weight Based Analyzer takes the results from two or more matching operations and calculates a
single match score. The component accepts data from any matching component and allows you to
assign weights to their match scores so the overall score for a field pair can reflect the priorities of the
data.
You can define more than one instance in the Weight Based Analyzer. This allows you to configure each
component with different combinations of input fields and different weights as required.
You can use the Weight Based Analyzer to calculate overall matching scores for the plan. For effective matching,
assign higher weightings to the more important fields.
Configuration
The Weight Based Analyzer configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.
You must select at least two matching components on this tab.
Parameters Tab
This tab displays the matching components selected on the Inputs tab. Each matching component has a text
field in which you can edit the weight defined for it. The higher the value in a text field, the greater the
contribution of that component's match score to the overall match score.
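As an illustration of how weights can shape a combined score, the sketch below computes a weighted average of per-component match scores. The function name, the field names, and the normalization by the sum of the weights are assumptions made for this example. The guide does not state the exact formula that the Weight Based Analyzer applies.

```python
def weighted_match_score(scores, weights):
    """Combine per-component match scores into one score using a weighted average."""
    if len(scores) < 2:
        raise ValueError("expected scores from at least two matching components")
    if scores.keys() != weights.keys():
        raise ValueError("every score needs a weight, and every weight a score")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical inputs: surname is weighted three times as heavily as city.
scores = {"surname": 0.9, "city": 0.5}
weights = {"surname": 3, "city": 1}
print(weighted_match_score(scores, weights))
```

With equal weights the same scores would average to 0.7. Weighting the surname more heavily moves the overall score toward the surname result.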
Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.
CHAPTER 10
Identity Matching Components
This chapter includes the following topics:
Overview, 89
Identity Group Target, 91
CSV Identity Group Source, 92
DB Identity Group Source, 94
Identity Match, 96
CSV Identity Match Target, 98
Overview
An identity is a set of data values within a record that collectively provide enough information to identify an
individual or entity. Data Quality installs with sources, targets, and matching components that can analyze
identity information across record values and return likely duplicates.
An identity matching process involves two plans. The first plan reads a source dataset and generates an index of
key values representing the permutations of identity information in that dataset. The second plan performs the
match analysis by comparing the index information with the source dataset to establish if any identities occur
more than once in that dataset.
Use the same source data in both plans.
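The two-plan flow can be pictured with the conceptual Python sketch below. The first function plays the role of the key generation plan, and the second plays the role of the match plan that reads the same source against the index. The key-building rule here (one key per uppercase word) is a toy stand-in chosen for this example. It is not the algorithm in an Informatica population file, and all function names are hypothetical.

```python
from collections import defaultdict


def toy_keys(value):
    """Toy key builder: one key per uppercase word, so word order does not matter."""
    return {token.upper() for token in value.split()}


def build_key_index(records, column):
    """Plan 1: read the source and index each record under every key it produces."""
    index = defaultdict(set)
    for row_id, record in enumerate(records):
        for key in toy_keys(record[column]):
            index[key].add(row_id)
    return index


def candidate_pairs(records, column, index):
    """Plan 2: read the same source and look up other records that share a key."""
    pairs = set()
    for row_id, record in enumerate(records):
        for key in toy_keys(record[column]):
            for other_id in index.get(key, ()):
                if other_id != row_id:
                    pairs.add((min(row_id, other_id), max(row_id, other_id)))
    return pairs


records = [{"name": "John Smith"}, {"name": "Smith John"}, {"name": "Mary Jones"}]
index = build_key_index(records, "name")
print(candidate_pairs(records, "name", index))
```

The pairs returned by the second step are only candidates. In a real plan, a match component scores each candidate pair before the target writes the results.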
Population Files
Identity matching uses population files to create index keys and perform match analyses. A population file
contains key-building algorithms, search strategies, and matching schemas that enable duplicate analysis of
identity information. Population files can allow for multiple languages and character sets within the source
data.
Informatica provides proprietary population files for use in Data Quality and PowerCenter. Before you begin,
ensure you have a suitable population file for your source data installed on your computer.
Combining Identity And Non-Identity Components In A Plan
Data Quality installs with five components that are dedicated to analyzing identity data. You can create identity
matching plans that use only these components, or you can substitute one or more of these components with
other matching components.
Although you can replace identity components with other components, bear in mind that you will obtain
different match results on identity information if you do so. The identity match components provide you with
a means to analyze the identity information in your datasets using specialist identity match algorithms that
deliver a higher quality of duplicate analysis for identity data.
Table 10-1 describes the identity components and when they are required for use.
Based on the data in Table 10-1, the following combinations of identity and non-identity components are valid
in a data quality plan:
A key generation plan must use a CSV or DB Source and an Identity Group Target.
This plan does not require other components.
An identity match plan must use a CSV or DB Identity Group Source and a CSV Identity Match Target or
CSV Match Target.
To generate output in Identified Matches mode, use a CSV Identity Match Target.
To generate output in Matched Pairs mode, use a CSV Match Target.
Use a match component in an identity match plan. This can be an identity or non-identity match
component.
Note: For more information on Matched Pairs and Identified Matches, see page 28.
Identity Matching Process Flow
Create two plans that include the steps below.
1. Create a data quality plan and add a database or file-based source component. Connect to your source
dataset. Do not use an identity source component in this plan.
2. Add an Identity Group Target to the plan, and select the input columns from which Data Quality will
build the key index. When you configure this target, specify a folder location for the index key files.
3. Create a second plan in Workbench, and add a CSV Identity Group Source or DB Identity Group Source
to the plan. Connect to the source you selected in the first plan, and select the index location specified in
the Identity Group Target of that plan. This enables the plan to match the source dataset against the key
index created for that dataset.
4. Add a match component to the plan. If you have selected an Identity Match component, select a
population file to run on the data. You can select a different matching component at this stage, but its
match analysis will not be tailored to identity data.
5. Add a CSV Match Target or CSV Identity Match Target. When configuring this component, select an
equal number of _1 and _2 columns.
Note: The CSV Identity Match Target performs final duplicate analyses that the CSV Match Target does
not perform. For more information, see page 98.
Table 10-1. Identity Components Installed with Data Quality
Component Name | Description | Required Or Optional In Identity Plans
CSV Identity Group Source | Performs identity matching on file sources using keys created by the Identity Group Target. | Required in an identity matching plan. Reads source data against identity index keys.
DB Identity Group Source | Performs identity matching on database sources using keys created by the Identity Group Target. | Required in an identity matching plan. Reads source data against identity index keys.
CSV Identity Match Target | Writes the results of an identity matching operation to file. Uses match scores from a matching component to determine which records may be duplicates of each other. | Optional in an identity matching plan - can replace with a CSV Match Target. When using a CSV Match Target in an identity match plan, select the Matched Pairs option.
Identity Group Target | Creates an index of key values that another plan can use in identity matching operations. | Required in a key generation plan. Creates an index of identity key values for use by a CSV Identity Group Source or DB Identity Group Source.
Identity Match | Compares identity data from multiple fields against each other using matching criteria defined in population files. Generates match scores for each comparison. | Optional in an identity matching plan - can replace with another Data Quality matching component. Note that other matching components cannot use Informatica population files to evaluate identities.
Identity Group Target
The Identity Group Target generates key values for the input data it accepts in a data quality plan. It
stores these keys and the input data in an index within the Data Quality folder structure. The CSV
Identity Group Source and the DB Identity Group Source read the key values in this index when run
in an identity matching plan.
All identity matching operations require two plans to run consecutively. The first plan must contain an Identity
Group Target. The second plan must contain either a CSV Identity Group Source or DB Identity Group
Source. These source components search the data for the keys defined by the Identity Group Target in the first
plan.
Identity Group components require population files that install through the Content Installer. Contact
Informatica to purchase and download population files. For information on installing population files, consult
the Informatica Data Quality Installation Guide.
Configuration
The Identity Group Target configuration dialog box contains the following options:
Input. This pane lists the potential input columns available to the target. Use the check box next to each
column to add that column to the key index. At least one input column should contain person name,
organization, or address data, as the Identity Group Target uses these data types for key generation.
The index you create in this target is used later in the identity matching process. Ensure that you can access
the list of columns you selected in the Identity Group Target when you configure a CSV Identity Match
Target in a later plan.
Tip: Right-click in the input pane to display a Select All option.
Outputs. This pane contains the columns that you select for addition to the index. The outputs appear as
you select input columns.
Population. Populations contain key-building algorithms that are customized for countries and languages.
Select the population that most closely matches the origin of the input data.
Key Type. The standard populations provided by Informatica can generate keys for three types of index data:
person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you
wish to use in key generation.
Key Level. The Key Level determines the number and variety of keys generated by the Identity Group
Target. The three key levels are Limited, Standard, and Extended. The following table describes the features
of each Key Level:
Key Level | Disk Space Usage | Matching Success | Intended Use
Limited | Low | Finds likely matches; does not find all probable matches | Non-critical searches on systems with limited disk space
Standard | High | Overcomes most variations in word order, missing words, and extra words | Most search applications
Extended | Very high | Finds most possible matches, regardless of word order variation and concatenation | High-risk or mission-critical search applications
Input Column. The input column specifies the source data that the Identity Group Target uses for key
generation. Choose an input column that contains the type of data specified in the Key Type field.
The order of individual strings in the selected input column should match the normal string order used in
the population Key Type you selected. For example, in English-speaking countries the normal string order
for person names is as follows:
First Name + Middle Name(s) + Family Name(s)
Key Index Folder. Specifies the folder that contains the key index. Data Quality creates key index folders in
an Identity folder in its folder structure. The default location for the folder is
C:\Program Files\Informatica Data Quality\Identity
When you enter a Key Index Folder name, you create a folder of that name under \Identity in the Data
Quality folder structure. You can enter a folder path in this field. For example, a Key Index Folder value of
MyIndexes\862 will create a folder at the following location:
C:\Program Files\Informatica Data Quality\Identity\MyIndexes\862
Update Index. This option determines how the component handles the creation of a key index in the event
that index files are present in the key index folder. It is provided so that Data Quality does not need to
recreate a key index if a plan is re-run on a dataset to which a small number of records have been added.
If this option is cleared, the component will generate a new key index when the plan runs and overwrite the
index files at the key index folder location. If this option is checked, the component will retain any current
files and save new index information to the current files.
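The Key Index Folder and Update Index options described above can be pictured with the short Python sketch below. It shows how a Key Index Folder value resolves under the default Identity folder, and how the Update Index choice decides between appending to and replacing existing index content. The function names and the list-based model of index files are hypothetical simplifications, not the product's file format.

```python
from pathlib import PureWindowsPath

# Default Identity folder location stated in this guide.
IDENTITY_ROOT = PureWindowsPath(r"C:\Program Files\Informatica Data Quality\Identity")


def key_index_location(folder_value: str) -> PureWindowsPath:
    """Resolve a Key Index Folder value relative to the Identity folder."""
    return IDENTITY_ROOT / PureWindowsPath(folder_value)


def apply_update_index(existing_keys, new_keys, update_index):
    """Model of the Update Index option: keep and extend, or overwrite."""
    if update_index:
        return list(existing_keys) + list(new_keys)  # option checked: retain current content
    return list(new_keys)                            # option cleared: new index replaces old


print(key_index_location(r"MyIndexes\862"))
print(apply_update_index(["key1"], ["key2"], update_index=True))
```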
Handling Generic and Null Values in the Input Column
The population that you use during the key index generation process contains a list of values that Data Quality
determines to be noise: that is, generic, null, or otherwise non-informational values. If the key generation
process encounters such a value in the Input Column field, it omits the record containing that value from the
index.
The key index generation process determines the following types of value to be noise:
Words or strings that may form part of an identity string but that do not have any meaning as the sole value
in a field. Examples include prepositions, corporate suffixes, and name fragments such as AND, THE,
LIMITED and LTD, LA and LE, DELLA.
Null entries.
Placeholders or flags that indicate the status of a record but that do not contain any identifying information,
such as EMPLOYEE, DECEASED, RETIRED, UNKNOWN.
If your dataset contains noise values, consider standardizing your data to replace the values with other terms
before generating the key index. For more information on the words that Data Quality treats as noise in the key
index generation process, contact Informatica Global Customer Support.
Note: These examples are illustrative and do not represent a complete list. The values defined as noise in the
key index building process depend on the population used and also depend on the type of field selected as the
Input Column. For example, a noise word in a Name field may not be a noise word in an Address field.
Data Quality updates its Workbench log file when a generic or null value is found and a data row is omitted
from the key index. Locate the Workbench log file in the logs folder of your Data Quality installation.
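To find rows that the key generation process is likely to omit, you can pre-screen the Input Column before you run the plan. The Python sketch below is one way to do this. The noise list contains only the example values quoted in this section. The real list depends on the population and the field type, so treat the set and the function names as assumptions.

```python
# Example noise values taken from this section; not a complete list.
NOISE_VALUES = {
    "AND", "THE", "LIMITED", "LTD", "LA", "LE", "DELLA",
    "EMPLOYEE", "DECEASED", "RETIRED", "UNKNOWN",
}


def is_noise(value):
    """True if the value is null, empty, or a noise word that is the sole value in the field."""
    if value is None:
        return True
    cleaned = value.strip().upper()
    return cleaned == "" or cleaned in NOISE_VALUES


def split_for_indexing(records, input_column):
    """Separate records that would be indexed from records that would be omitted."""
    kept, omitted = [], []
    for record in records:
        (omitted if is_noise(record.get(input_column)) else kept).append(record)
    return kept, omitted


records = [
    {"name": "John Smith"},
    {"name": "UNKNOWN"},
    {"name": ""},
    {"name": "The Smith Company"},
]
kept, omitted = split_for_indexing(records, "name")
print(len(kept), len(omitted))
```

A value such as "The Smith Company" is kept because the noise word is not the sole value in the field.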
CSV Identity Group Source
The CSV Identity Group Source performs identity matching on delimited file sources using keys
created by the Identity Group Target.
To use the CSV Identity Group Source, you must first run a plan containing an Identity Group Target. The
Identity Group Target stores keys in an index within the Data Quality folder structure. The CSV Identity
Group Source compares its input data against the key data in this index.
If you have generated a key index in a previous version of Data Quality, you must recreate the index before
running an identity matching plan that reads that index in the current version of Data Quality. Use the Identity
Group Target to recreate the index. Delete the key index folder before recreating the index.
In both the CSV Identity Group Source and the Identity Group Target, you must select the same Population
and Key Type, and you must ensure that the Input Column in both components contains the same type of data.
Additionally, both components must take the same number of columns as input.
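A simple pre-flight check can catch mismatched settings between the two plans before you run them. The Python sketch below compares the three settings named above. The class, the function, and the example population and key type names are hypothetical. Data Quality does not expose plan settings in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityGroupSettings:
    population: str
    key_type: str
    column_count: int


def settings_mismatches(target, source):
    """List the settings that differ between the Identity Group Target and the source."""
    problems = []
    if target.population != source.population:
        problems.append("Population")
    if target.key_type != source.key_type:
        problems.append("Key Type")
    if target.column_count != source.column_count:
        problems.append("number of input columns")
    return problems


# Hypothetical settings noted down from the two plans.
target = IdentityGroupSettings("USA", "Person Name", 5)
source = IdentityGroupSettings("USA", "Organization", 5)
print(settings_mismatches(target, source))
```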
Note: Identity Group components require population files that install through the Content Installer. Contact
Informatica to purchase and download population files. For information on installing population files, consult
the Informatica Data Quality Installation Guide.
Configuration
The configuration dialog box contains the following fields:
Source File. Displays the name of the file to which the source component connects.
Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see Character Encodings and Unicode on page 159.
Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings
for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.
Text Qualifier. Select the text qualifier used in the source file. A text qualifier should enclose any delimiter
value in your data that you do not want to use as a field delimiter. The default option is the double
quotation mark (").
First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the dataset.
Population. Populations contain key-building algorithms that are customized for specific countries and
languages. Select the population used in the Identity Group Target of the key generation plan.
Key Type. The standard populations provided by Informatica can generate keys for three types of index data:
person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you
wish to use in key generation.
Search Level. Select the Search Level that fits your matching needs. Each level uses a different balance of
search quality and search speed. The following table describes the search speed and matching criteria for
each Search Level.
Search Level | Search Speed | Matching Criteria | Description
Narrow | Fastest | Nearly exact | This Search Level performs the fastest and most exact matches. For example, using a Narrow Search Level for person name matching returns exact matches and name abbreviation matches (initials).
Typical | Fast | Strict | This Search Level performs fast searches with strict matching criteria. For example, using a Typical Search Level for person name matching returns data with name abbreviation matches and some potential errors (e.g., incorrect initials).
Exhaustive | Average | Loose | This Search Level performs average speed searches with loose matching criteria. For example, using an Exhaustive Search Level for person name matching returns matches that may represent substantial spelling errors.
Extreme | Slow | Very Loose | This Search Level performs slow searches with very loose matching criteria. For example, using an Extreme Search Level for person name matching may return matches with a very wide variety of spelling errors.
Input Column. The input column specifies the source data column that the CSV Identity Group Source
uses as a group key for matching. Choose an input column that contains the type of data specified in the Key
Type field.
The order of individual strings in the selected input column should match the normal string order used in
the population Key Type you selected. For example, in English-speaking countries the normal string order
for person names is as follows:
First Name + Middle Name(s) + Family Name(s)
Key Index Folder. Specifies the folder containing the key index that the plan will match against the input
data. Enter the folder specified in the Identity Group Target that created the index data you want to read.
You can enter a folder path in this field. For example, a Key Index Folder value of MyIndexes\862 will read a
folder at the following location:
C:\Program Files\Informatica Data Quality\Identity\MyIndexes\862
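The Field Delimiter, Text Qualifier, and First Line of File is the Header options described earlier in this list follow common delimited-file conventions. The Python sketch below uses the standard csv module to show how a text qualifier protects delimiter characters inside headings and values. It is an analogy using the default settings, not a description of how the CSV Identity Group Source parses files. The function name and the sample data are hypothetical.

```python
import csv
import io


def read_delimited(text, delimiter=",", qualifier='"', first_line_is_header=True):
    """Parse delimited text and optionally split off the first line as the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=qualifier))
    if first_line_is_header and rows:
        return rows[0], rows[1:]
    return [], rows


# The heading "City, State" and both values contain the delimiter,
# so each is enclosed in the text qualifier.
sample = 'Name,"City, State"\n"Smith, John","Austin, TX"\n'
header, data = read_delimited(sample)
print(header)
print(data)
```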
Note: By default, the CSV Identity Group Source can cache 16777216 bytes, or 16 MB, of data when
retrieving data from the key index. You can increase the cache size in the Informatica Data Quality
config.xml file as follows:
Open config.xml from the Data Quality root folder.
Locate the <CSVIdentityGroupSource> parameter, and increase the <CacheSize> parameter within it
from 16777216 to an acceptable value. For assistance in changing your cache size, contact Informatica
Global Customer Support.
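The Python sketch below shows the kind of change that the note describes, applied to an in-memory XML sample. The element nesting in the sample is an assumption based only on the two element names given above. The real config.xml may be structured differently, so inspect the file and back it up before you edit it.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the two element names come from this guide.
SAMPLE_CONFIG = """<Config>
  <CSVIdentityGroupSource>
    <CacheSize>16777216</CacheSize>
  </CSVIdentityGroupSource>
</Config>"""


def set_cache_size(xml_text, new_size_bytes):
    """Return the XML text with the CSVIdentityGroupSource CacheSize value replaced."""
    root = ET.fromstring(xml_text)
    node = root.find(".//CSVIdentityGroupSource/CacheSize")
    if node is None:
        raise ValueError("CacheSize element not found under CSVIdentityGroupSource")
    node.text = str(new_size_bytes)
    return ET.tostring(root, encoding="unicode")


# Example: raise the cache from 16 MB to 32 MB.
updated = set_cache_size(SAMPLE_CONFIG, 32 * 1024 * 1024)
print(updated)
```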
DB Identity Group Source
The DB Identity Group Source performs identity matching on database sources using keys created by
the Identity Group Target.
To use the DB Identity Group Source, you must first run a plan containing an Identity Group Target.
The Identity Group Target stores keys in an index within the Data Quality folder structure. The DB
Identity Group Source compares its input data against the key data in this index.
If you have generated a key index in a previous version of Data Quality, you must recreate the index before
running an identity matching plan that reads that index in the current version of Data Quality. Use the Identity
Group Target to recreate the index. Delete the key index folder before recreating the index.
In both the DB Identity Group Source and the Identity Group Target, you must select the same Population and
Key Type, and you must ensure that the Input Column in both components contains the same type of data.
Additionally, both components must take the same number of columns as input.
Note: Identity Group components require population files that install through the Content Installer. Informatica
provides these files separately from Data Quality. Contact Informatica to purchase and download population
files. For information on installing population files, consult the Informatica Data Quality Installation Guide.
Configuration
The DB Identity Group Source configuration dialog box includes two tabs: Connect to Database and Match
Selection.
Connect to Database Tab
The Connect To Database tab options are identical to the Connect to Database tab on the Database Source
configuration dialog box. For more information about the Connect to Database tab options, see Database
Source on page 14.
Click Connect to make the connection and open the Match Selection tab.
Match Selection Tab
The options on this tab allow you to explore database tables and select the columns to provide data for the
matching plan:
Database. Displays the database structure as a folder hierarchy of tables and columns.
Select. Provides check boxes for the columns in the explored tables. Check Select for a column to add its data
to the dataset.
Input Column. Specifies the source data that the DB Identity Group Source uses for matching. Select a
single input column. Choose an input column that contains the type of data specified in the Key Type field.
The order of individual strings in the selected input column should match the normal string order used in
the population Key Type you selected. For example, in English-speaking countries the normal string order
for person names is as follows:
First Name + Middle Name(s) + Family Name(s)
Group Key. The fields that the matching plan searches for common values. Select one or more group keys.
Note: Do not select the same column as the Input Column and Group Key. The selections must be
different. Both are mandatory.
Population. Populations contain key-building algorithms that are customized for specific countries and
languages. Select the population used in the Identity Group Target of the key generation plan.
Key Type. The standard populations provided by Informatica can generate keys for three types of index data:
person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you
wish to use in key generation.
Search Level. Select the Search Level that fits your matching needs. Each level uses a different balance of
search quality and search speed. The following table describes the search speed and matching criteria for
each Search Level.
Search Level | Search Speed | Matching Criteria | Description
Narrow | Fastest | Nearly exact | This Search Level performs the fastest and most exact matches. For example, using a Narrow Search Level for person name matching returns exact matches and name abbreviation matches (initials).
Typical | Fast | Strict | This Search Level performs fast searches with strict matching criteria. For example, using a Typical Search Level for person name matching returns data with name abbreviation matches and some potential errors (e.g., incorrect initials).
Exhaustive | Average | Loose | This Search Level performs average speed searches with loose matching criteria. For example, using an Exhaustive Search Level for person name matching returns matches that may represent substantial spelling errors.
Extreme | Slow | Very Loose | This Search Level performs slow searches with very loose matching criteria. For example, using an Extreme Search Level for person name matching may return matches that contain a very wide variety of spelling errors.
Key Index Folder. Specifies the folder containing the key index that the plan will match against the input
data. Enter the folder specified in the Identity Group Target that created the index data you want to read.
You can enter a folder path in this field. For example, a Key Index Folder value of MyIndexes\862 will read a
folder at the following location:
C:\Program Files\Informatica Data Quality\Identity\MyIndexes\862
Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. These options are cleared by default.
Stop on Error. Select this option if you want to stop script operation and display an error message if the
execution encounters a problem.
Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
Note: Configuring a column as the Input Column or a Group Key automatically checks the Select option to add the
column to the dataset. However, clearing either option does not automatically remove the column from the dataset.
Clear the Select option to remove a column from the dataset.
Identity Match
The Identity Match component performs matching operations on input data at an identity level. An
identity is a set of fields providing name and address information for a person or organization. The
component treats one or more input fields as a defined identity and performs matching analysis
between the identities it locates in the input data.
The component analyzes records regardless of the character sets in which they are stored. Use this component to
identify similar or duplicate identities across datasets that may use several different language locales or character
encodings.
Informatica uses population files to describe key-building algorithms, search strategies, and matching schemes
that are customized by country and language. These customized settings improve match accuracy for data
sourced from those countries and languages.
There are three main steps to configuring the Identity Match component:
Select a population in the upper menu in the configuration dialog box.
Select the type of identity to analyze in the lower menu of this dialog box. Table 10-2 lists the type of
identity you can analyze. The fields available will depend on the population selected.
Select the data fields you want to analyze and apply them to the template fields for your chosen identity
type. The fields available will depend on the population selected.
Configuration
The Identity Match configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options below this pane and on the
Inputs, Parameters, and Outputs tabs.
Below the Components pane are two drop-down menus:
Use the upper drop-down menu to select the population that you will apply to the data. Select the Identity
Match Country option for a single locale or region, or select the Identity Match - Multiple Populations
option.
Use the lower menu to specify the type of identity data that the component will match. For example, the
Contact option relates to the names and addresses of members of organizations. The option you select here
determines the fields that are displayed on the Inputs tab. Each population selected in the upper menu has
its own set of information types. Table 10-2 lists the types of identity you can analyze.
Inputs Tab
The Inputs tab allows you to configure the data input fields. The Input Fields Mapping Area contains two
columns:
The left-hand column lists the field names. The names displayed depend on the population selected in the
Components pane. Mandatory input fields are highlighted in the column.
The right-hand column lists the available inputs for the selected input field. Select an option from each
drop-down list to map an available input to the selected input field.
Note: If you have selected the Identity Match - Multiple Populations option in the upper drop-down menu
beneath the Components pane, the Population field name is displayed and highlighted as mandatory in the
left-hand column. Select a population field on the right-hand column.
Note: For all field names (except for the Population field name) you must select values for the field name in
pairs. For example, when using field names PERSON_NAME1 and PERSON_NAME2 you must select
values for both field names in the right-hand column. This enables the component to match input fields
against each other.
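The pairing rule in the note above can be checked mechanically before you run a plan. The Python sketch below reports any field name for which only one of the two numbered fields has a value. PERSON_NAME1 and PERSON_NAME2 come from the note. The ADDRESS fields, the column names, and the function name are hypothetical examples.

```python
import re


def unpaired_fields(mapping):
    """Return base names (such as PERSON_NAME) where only one of the 1/2 fields is mapped."""
    sides = {}
    for field, source in mapping.items():
        if field == "Population" or not source:
            continue  # the Population field is exempt; unmapped fields count as missing
        match = re.fullmatch(r"(.+?)([12])", field)
        if match:
            sides.setdefault(match.group(1), set()).add(match.group(2))
    return sorted(base for base, seen in sides.items() if seen != {"1", "2"})


# Hypothetical mapping of template field names to input columns.
mapping = {
    "PERSON_NAME1": "name_a",
    "PERSON_NAME2": "name_b",
    "ADDRESS1": "addr_a",
    "ADDRESS2": None,
    "Population": "pop_col",
}
print(unpaired_fields(mapping))
```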
Parameters Tab
The Parameters tab contains the following options:
Default Population. Sets the default population if the multiple populations option has been selected in the
Components pane.
When you opt to match data from several populations, the Identity Match component looks to the specified
population first and then to the other configured populations.
Match Level. Sets the match level to one of the following:
Typical. Accepts reasonable matches. This is the default selection if no other match level is specified. The
Accept Limit is 89 and the Reject Limit is 70.
Conservative. Accepts only close matches. The Accept Limit is 90 and the Reject Limit is 80.
Loose. Accepts matches with a high degree of variation. The Accept Limit is 75 and the Reject Limit is 50.
Table 10-2. Identity Type
Option            Description
Wide_Contact      Matches person name at organization name
Contact           Matches person name at organization name and address
Individual        Matches person with either name id or birth date
Resident          Matches person name at address
Address           Matches address
Organization      Matches organization name
Division          Matches organization name at address
Household         Matches family name at address
Person_Name       Matches person name
Fields            For general use for any one or combination of fields
Corp_Entity       Matches company name
Family            Matches family name at either address or phone number
Wide_Household    Matches family name or phone number at address
Stop on Error. Check this option to stop the plan when it cannot locate up-to-date population data, for
example when the population data is absent. When this option is cleared, the plan runs as normal and writes
a status code to the output column.
Advanced Matching. The Overriding Match Control Field allows you to override the population settings by
providing a dialog in which you enter a query. The query syntax specifies the Identity Match options to be
used.
Note: For more information on the query syntax, refer to the Informatica Identity Systems Naming Server
documentation.
Outputs Tab
This tab lists the possible output fields for the data associated with the instance highlighted in the Components
pane. The tab shows two output fields:
Identity Match Score. The score ranges from 0 (no similarity) to 1 (perfect match) and is reported to two
decimal places.
Identity Match Decision. Accept, Reject, Undecided, or Processed. The decisions returned are based on a
combination of the Match Score and the Match Level specified on the Parameters tab (Typical,
Conservative, or Loose).
Double-click a field name to render it editable. To save your edits, press Enter before removing focus from the
field.
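The guide does not spell out how the score and the Match Level combine to produce a decision, so the following is only a sketch of one plausible rule. It assumes that the Accept and Reject Limits are percentages of the 0 to 1 score, that a score at or above the Accept Limit is accepted, that a score below the Reject Limit is rejected, and that anything in between is Undecided. The Processed decision is not modeled. The function and table names are hypothetical.

```python
# Accept and Reject Limits per Match Level, as listed on the Parameters tab.
MATCH_LEVELS = {
    "Typical": (89, 70),
    "Conservative": (90, 80),
    "Loose": (75, 50),
}


def match_decision(score, level="Typical"):
    """Classify a 0-1 match score (assumed rule, not the engine's own logic)."""
    accept_limit, reject_limit = MATCH_LEVELS[level]
    scaled = round(score * 100)  # assumption: the limits are percentages
    if scaled >= accept_limit:
        return "Accept"
    if scaled < reject_limit:
        return "Reject"
    return "Undecided"
```

Under these assumptions a score of 0.80 is Undecided at the Typical level but accepted at the Loose level.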
CSV Identity Match Target
The CSV Identity Match Target creates one or more output files that list the duplicate identities
found by the plan. The target lists the duplicates in clusters. A cluster is a set of two or more identities
that the plan determines are similar to or duplicates of each other.
Use this component in a plan with a DB or CSV Identity Group Source. Do not use the CSV Identity Match
Target with non-identity sources.
Note: The CSV Identity Match Target does not write full record information to its output files. It writes only
values from the fields selected for its key index.
Before writing the output files from the plan, the CSV Identity Match Target performs final matching
operations on the source data. The matching component upstream in the plan compares the input records with
identity data in the key index and calculates a match score for each record/index pair. The CSV Identity Match
Target performs identity matching on all record/index pairs whose scores meet or exceed the value in the Match
Threshold field. It writes input record information for pairs of records in each cluster to a CSV file and
optionally creates an HTML report on the cluster data.
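The effect of a match threshold on clustering can be illustrated outside the product. The sketch below is an assumption about how scored pairs could be grouped, not the target's actual algorithm: it keeps pairs whose score meets or exceeds the threshold and groups records transitively with a union-find structure. The record identifiers and the function name are hypothetical.

```python
def cluster_pairs(scored_pairs, threshold):
    """Group record ids into clusters from pairs scoring >= threshold.

    scored_pairs: iterable of (record_a, record_b, score) tuples.
    Returns a sorted list of clusters, each a sorted list of record ids.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(members) for members in clusters.values())
```

Records that appear only in pairs below the threshold never enter a cluster, which matches the definition of a cluster as a set of two or more similar identities.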
Best Practice in Identity Matching
When designing a plan with the CSV Identity Match Target, consider the following factors:
To obtain optimal results from your identity matching plan, use an Identity Match component and a CSV
Identity Match Target. This provides you with the best opportunity to capture the duplicate identities in
your data.
When selecting fields in the Inputs pane of the CSV Identity Match Target, select the same fields that you
selected in the Identity Group Target when generating the key index. The target does not insist that you
select fields with the same names as those used to create the index, but it does require that you select the
same number of Input fields and additionally select a match score field from an upstream matching
component.
For example, if you selected five fields when creating the key index, select five Input fields from the source
data in the CSV Identity Match Target plus a match score field.
The CSV Identity Match Target writes a record to a cluster only if the record appears in the key index
database file.
The CSV Identity Match Target output corresponds to the output from the CSV Match Target in Identified
Matches mode. For more information on Identified Matches mode, see page 28. If you use a CSV Match
Target instead of a CSV Identity Match Target in an identity plan, select Matched pairs mode.
Configuration
The CSV Identity Match Target configuration dialog box contains the following options:
Target File. Identifies the CSV output file for the data target.
Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File name field.
Inputs. Lists the data fields that can be included in the target output. Check a field to include it in the plan
output calculations. Select the same number of _1 and _2 fields, and select a match score field.
Outputs. Lists the fields selected in the Inputs pane. Use the Up and Down arrows to change the order of
the output fields. This determines the order in which the fields will be matched against the fields in the key
index.
Use First Line as Header. Check to designate the first line of data in the source file as heading text and so
distinguish it from the dataset.
Launch Viewer. Use to open the output files when the plan executes.
Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma (,).
If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the
data structure.
Qualifier. Select a qualifier appropriate to the data from this menu. A text qualifier should enclose any
delimiter value in your data that you do not want to use as a field delimiter. The default is the double
quotation mark (").
Create HTML Match Report. Use to generate an HTML report displaying the match clusters found by the
plan. This option is checked by default.
The HTML report returns information on the best-matching record pairs only. It does not return range
calculations or other data on best matches.
Key Index Folder. The name of the folder containing the identity key index read by the target. The folder
selected in this field must match the Key Index Folder selected by the CSV or DB Identity Group Source in
the plan.
You can enter a folder path in this field. For example, a Key Index Folder value of MyIndexes\862 will read a
folder at the following location:
C:\Program Files\Informatica Data Quality\Identity\MyIndexes\862
Field. Lists the output fields defined by the matching components in the plan. Use this menu to select the
field from which the CSV Identity Match Target reads the match score.
Match Threshold. Filters the data record values written as plan output according to the record scores in the
match input field (see Field above).
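The interaction between the Delimiter and Qualifier options can be seen in any CSV tool. The sketch below uses Python's standard csv module rather than Data Quality itself, with made-up sample data, to show how a double quotation mark qualifier protects a comma that appears inside a field value.

```python
import csv
import io

# Sample rows: the name value contains the delimiter character.
rows = [
    ["Name", "Address", "Score"],
    ["Smith, John", "12 Main St", "0.93"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quotechar='"',
                    quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()

# Reading the text back with the same delimiter and qualifier
# recovers the original three fields per row.
parsed = list(csv.reader(io.StringIO(text), delimiter=",", quotechar='"'))
```

Only the value that contains the delimiter is enclosed in the qualifier, so the written file keeps three fields in every row.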
CHAPTER 11
Address Validation Components
This chapter includes the following topics:
Overview, 101
Global AV, 102
Formatted Address Outputs, 104
Overview
Data Quality installs with address validation engines that compare address data inputs against reference datasets
of postal address information. Data Quality also accepts address validation engines developed as plug-ins in
accordance with the requirements of Data Quality Global Component SDK. Data Quality installs a single
address validation component to handle these validation engines, called the Global AV. It also supports plans
that contain deprecated address validation components from earlier versions of Data Quality.
Note: The Global AV matches input address data against reference datasets of postal addresses. Before you can
use the Global AV, you must install reference data for the countries you are interested in. Contact your
Informatica account representative for information on address validation country subscriptions.
The Global AV and the installed validation engines deliver the following functionality:
They validate the accuracy and deliverability of addresses against the address reference data available for the
country in question. Some countries provide complete address information and can enrich an address with
new information, for example in the United States providing a nine-digit zip code in place of a five-digit
code. Other countries provide last-line address information only, that is, information on city, province, or
post code (information commonly found on the last line on the envelope).
Where possible, they correct errors in addresses and complete partial address records.
They add postally-relevant information to the address that may not appear in the data source or on the
envelope. For example, they can report on whether an address is a physical address or a commercial
mailbox location. This capability varies by country.
They provide detailed status reports on the validity of each input address, describing its deliverable status
and the nature of any errors or ambiguities it contains.
In addition to returning individual fields that contain postal address and other value-added information,
they can provide output addresses in an envelope-ready format.
The Global AV provides the user interface to all address validation engines, including engines that are
developed with the Global Component SDK. Data Quality no longer installs a separate operational component
for each installed address validation engine.
This installation of Data Quality supports plans that contain address validation components installed with
earlier product versions. The supported components are the Address Validator, the International AV, and the
North America AV. You cannot create new instances of these components.
Installing Validation Components and Reference Data
The Data Quality Content Installer installs the reference datasets for address validation. You download address
reference datasets on a country-by-country basis from Informatica.
You can also use the Content Installer to install updates to the validation engines. For more information,
consult the Informatica Data Quality Installation Guide.
Global AV
The Global AV component provides access to address validation functionality in Data Quality. It
provides a means of validating addresses from anywhere in the world through a single component.
The Global AV compares your input data records to reference databases of postally valid address
information to quantify, verify, and enhance the quality and deliverability of your address records. It provides
access to all address validation engines installed with or linked to Data Quality.
Configuration
The Global AV configuration dialog box contains the following areas:
Components pane
Inputs tab
Parameters tab
Outputs tab
Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
Inputs Tab
The Inputs tab lists all available data columns. Select a column to add it to the instance highlighted in the
Components pane.
You can select multiple address columns for the component instance. In general, the more columns you provide,
the greater the opportunity for the Global AV to locate the correct address in its reference data. However,
incorrect input data does not enhance the matching operation.
Parameters Tab
The Parameters tab options allow you to perform the following operations:
Set the principal country database to use when validating input data.
Define the address structure for the input address string by mapping input fields to field parameters used by
the Global AV.
Add CASS/DPV or Geocoding information to the outputs (country data permitting).
The options displayed on this tab change according to the country database option (single or multiple) that you
select:
Validating data from one country. Check the Select Single Country option, and then select the required
country from the Select Default Country menu.
Note: Do not select a Single Country database in the Global AV unless the input data relates exclusively to
the country you specify.
Validating data from several countries. Check the Select Multiple Countries option, and then select a
country from the Select Default Country menu.
The process for validating addresses from several countries works as follows:
The Global AV first looks for a populated country code field in the address. This must be a three-letter
ISO country code.
If it finds a country code for a country on its menu, the component compares the address against the
installed reference database for that country. If it does not find an address match in the country
database, the component applies the address to the default country database.
You do not have to set a country in the Select Default Country menu. If you set the NONE option in this
menu, the component will search all input addresses for a country code and attempt to validate the
addresses accordingly. If it does not find a country code for the address, the component will not perform a
validation check for that address.
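The multiple-country process described above can be summarized as an ordered list of reference databases to try for each address. This is a simplified sketch under stated assumptions, not the component's implementation: the function name is hypothetical, reference databases are identified by their three-letter ISO codes, and an address without a usable country code is assumed to go straight to the default database when one is set.

```python
def databases_to_try(country_code, installed, default=None):
    """Return the reference databases to consult, in order.

    country_code: three-letter ISO code read from the address ("" if absent).
    installed: set of ISO codes with reference data installed.
    default: code chosen in Select Default Country, or None for NONE.
    """
    code = (country_code or "").strip().upper()
    order = []
    if code in installed:
        order.append(code)  # the address's own country comes first
    if default and default not in order:
        order.append(default)  # fall back to the default country database
    return order  # an empty list means no validation check is performed
```

With the NONE option selected, an address that carries no country code yields an empty list, which corresponds to the case where the component performs no validation check.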
The Services Required area contains options relating to the enrichment of address information with Geocoding
and DPV information and to the handling of the plan in cases of critical reference data errors.
Geocoding. Check this option to return latitude and longitude coordinates for each input address.
Informatica enables Geocoding for United States addresses.
The Geocoding option is also available when you choose the Select Multiple Countries option, but in such
cases it only returns data from reference databases containing Geocoding data.
CASS/DPV (Delivery Point Validation). Check this option to return a two-digit Delivery Point Code for
the address. CASS (Coding Accuracy Support System) is a United States Postal Service means of certifying
the accuracy of address validation by software. A delivery point code is a two-digit code that can uniquely
represent, along with the nine-digit zip code, any mailbox address. The full delivery code, including the zip
and DPV information, is typically added to sorted mail as a bar code.
This option is available for the United States only. This option is also available when you choose the Select
Multiple Countries option, but it only returns data from a country database containing U.S. DPV data.
Stop on Error. When this option is checked, the plan will stop running if it finds that the reference data is
absent, or expired, or lacks a current license. When this option is cleared, the plan will run as normal and
write an appropriate status code to the output columns if a reference data error arises.
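The relationship between the nine-digit zip code and the two-digit Delivery Point Code can be shown with simple string handling. The following sketch is illustrative only: the function name and the sample values are hypothetical, and the result is simply the eleven digits joined together, without any bar code encoding.

```python
def full_delivery_code(zip9, delivery_point):
    """Join a nine-digit zip code and a two-digit Delivery Point Code."""
    digits = zip9.replace("-", "")
    if not (len(digits) == 9 and digits.isdigit()):
        raise ValueError("expected a nine-digit zip code")
    if not (len(delivery_point) == 2 and delivery_point.isdigit()):
        raise ValueError("expected a two-digit delivery point code")
    return digits + delivery_point
```

For example, the made-up inputs "91523-0001" and "00" produce the eleven-digit string "91523000100".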
The Input Mappings area contains a Parameters column and an Input Fields column.
The Parameters column contains a set of address field names that you can map to the fields selected on the
Inputs tab. Use these parameters to define the address format that the Global AV will apply to the data it
sends to the validation engine. The Parameters options shown depend on the country database selected.
The Input Fields column contains a menu for each field in the Parameters column. Each menu lists the
fields selected on the Inputs tab.
Use these menus to build the address that the component will send to the validation engines. Map each field
name you require from the Parameters column to a unique field name under Input Fields. You must map an
input field to the Addressline1 parameter. You must also map an input field to the Country parameter if you
choose the Select Multiple Countries option.
Note: You can map the Dependent Locality parameter to both Dependent Locality and Urbanization
information in your input address. The Outputs tab maintains separate Dependent Locality and
Urbanization fields, however, in order to support plans that are configured to write outputs to these fields
individually.
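The mapping rules above can be expressed as a small validation routine. This is a sketch that assumes the mappings are held in a dictionary from parameter name to input field name. Only Addressline1 and Country are parameter names taken from this guide. The function name and any other parameter or field names are hypothetical.

```python
def check_input_mappings(mappings, multiple_countries=False):
    """Return a list of problems found in a parameter-to-input-field map."""
    problems = []
    used = [field for field in mappings.values() if field]
    if len(used) != len(set(used)):
        problems.append("each parameter needs a unique input field")
    if not mappings.get("Addressline1"):
        problems.append("Addressline1 must be mapped")
    if multiple_countries and not mappings.get("Country"):
        problems.append("Country must be mapped in Select Multiple Countries mode")
    return problems
```

An empty list indicates that the mapping satisfies the three rules stated in this section.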
Outputs Tab
This tab lists the output fields available for the instance highlighted in the Components pane. There are several
types of output field:
Match status and match code fields that describe the level of validation achieved for an address.
Individual address elements such as street name, building number, suite or apartment number, city, Zip or
postcode, country code.
Formatted address fields that provide envelope-ready address lines in the manner expected by the postal
carrier of the country in question. For more information on formatted address lines, see page 104.
Postally-relevant information in areas such as CASS/DPV and Geocoding where available.
The CASS/DPV options are enabled if a current set of United States reference data is installed on your
system. Geocoding options are enabled if the appropriate reference data for the United States, United
Kingdom, or Australia is installed.
Check the fields you want to use as outputs from the component.
Reading Output Data
The type of information returned in each output field depends on the country dataset used to validate the input
address. Also, not all output options are available for all countries.
For a detailed description of the outputs enabled for your country and the types of information each output can
contain, see page 141.
Reading Match Status and Match Code Data
The Match Status and Match Code outputs that appear at the top of the Outputs pane provide information on
the quality of the match found between the input address and the reference data. The meanings of the
descriptions in these fields depend on the country dataset used to validate the input address.
The options are:
Match Status. Describes the type of match found for each input address.
Match Code. Describes the success of the match found for each address.
For detailed information on the match status and match code outputs for your addresses, see page 131.
Note: These outputs do not provide usable envelope data. You cannot clear these options.
The Global AV provides access to the processing capabilities of the address validation engines installed with
Data Quality and also to any third-party address validation engines installed as plug-ins. The output fields for
the Global AV, including the match status and match code outputs, are based on the analyses performed by
these engines.
Formatted Address Outputs
In addition to analyzing and enhancing input address elements, the Global AV can assemble validated address
outputs in an envelope-ready format. The component uses the validated input data to build the formatted
addresses, eliminating the need to manually parse address values from multiple fields into standardized formats.
The Global AV engines create formatted addresses on a record-by-record basis, so that each address is created in
the envelope format expected by the postal carrier in its country.
Because standard address formats differ from country to country, the formatted address lines are named
generically in the Global AV. The component provides ten output lines that accept all types of address
information. It also provides eight lines that focus on street and locality information. These outputs are named
as follows:
Formatted_Address_Line_1, Formatted_Address_Line_2... Formatted_Address_Line_10
Address1, Address2, Address3, Address4
Locality_Line_1, Locality_Line_2, Locality_Line_3, Locality_Line_4
The Global AV populates these outputs with information that may already be written to other output lines.
They are also populated dynamically by the Global AV. For example, if an address contains a company name,
the Global AV writes this name to Formatted_Address_Line_1 and writes the street name to
Formatted_Address_Line_2. If an address does not contain a company name, the Global AV writes the street
name to Formatted_Address_Line_1.
Note: The Address1-4 and Locality_Line_1-4 outputs focus on information of their own type only. For example,
Address1 cannot accept a company name. For more information on these output types, see page 106.
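The dynamic population of the generic lines can be pictured as filling the lines in order with whichever address elements are present. The sketch below is a simplification under stated assumptions, not the validation engine's logic: it considers only a company name, a street line, and a last line, and the function name is hypothetical.

```python
def formatted_address_lines(company, street, last_line):
    """Assign the address elements that are present to consecutive lines."""
    present = [value for value in (company, street, last_line) if value]
    return {
        f"Formatted_Address_Line_{number}": value
        for number, value in enumerate(present, start=1)
    }
```

When the company name is empty, the street name moves up to Formatted_Address_Line_1, as described above.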
Formatted Address Example: United States
The Global AV uses up to four lines to create an address in the standard USPS format. Table 11-1 shows how
the Global AV builds the formatted address:
The address format shown in Table 11-1 is a business address. It does not include personal name information.
You can select this information separately when configuring the plan outputs.
Note: You cannot change the output values that the component writes to the formatted address fields. The
selections are determined in the underlying validation engines.
Writing Formatted Addresses To Target Components
Formatted addresses answer a particular business need. If you do not need envelope-ready address information,
do not select the formatted address options in the Global AV or in your plan target components. If you select
these outputs, you must have a strategy for using the information when it leaves the data quality plan. You
should consider the structure of the file or database table that will contain the formatted addresses.
Address Formatting And Invalid Addresses
When your dataset contains only valid addresses, you can follow the strategies above with no difficulty. When
your dataset produces mixed validation results, you must decide how to handle the addresses that the Global AV
identifies as invalid or partially valid.
If the Global AV cannot validate an input address, it writes the original input values to the formatted address
fields, unless it determines that the address relates to one of the following countries, in which case it leaves the
formatted address fields empty:
Australia, Denmark, France, Luxembourg, Netherlands, Singapore, United Kingdom
The Global AV determines the country of origin by reading the ISO country code field or from the default
database selection on the Parameters tab.
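The handling of addresses that cannot be validated can be sketched as follows. The function name and arguments are hypothetical, and the country set uses the three-letter ISO codes for the countries listed above. The sketch is a reading of the rule in this section, not the component's code.

```python
# ISO 3166-1 alpha-3 codes for Australia, Denmark, France, Luxembourg,
# Netherlands, Singapore, and United Kingdom.
EMPTY_ON_FAILURE = {"AUS", "DNK", "FRA", "LUX", "NLD", "SGP", "GBR"}


def formatted_output(validated, input_lines, validated_lines, country_code):
    """Return the values written to the formatted address fields."""
    if validated:
        return list(validated_lines)
    if country_code.upper() in EMPTY_ON_FAILURE:
        return []  # formatted fields are left empty for these countries
    return list(input_lines)  # original input values are passed through
```

A downstream process can therefore see three outcomes for a formatted address: validated lines, the original input, or empty fields.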
Table 11-1. Standard United States Business Address Format
Global AV Output            Description
Formatted_Address_Line_1    Company or organization name
Formatted_Address_Line_2    Urbanization (if applicable, for example in Puerto Rican addresses)
Formatted_Address_Line_3    Street address, including Suite/Suite Range fields
Formatted_Address_Line_4    City, State, Zip code
Address Lines and Locality Lines
The Address1 to Address4 and Locality_Line_1 to Locality_Line_4 lines provide an alternative means of
creating formatted addresses. These options function in a similar way to the Formatted_Address_Line_[n]
outputs, but they accept a more focused set of address information. The Address1 to Address4 lines accept street
information, and Locality_Line_1 to Locality_Line_4 accept locality information.
Table 11-2 describes the options you can select to format a sample address in the United States.
Note: In this example, only Formatted_Address_Line_1 carries company information.
Overseas Territories and Database Settings
The Global AV validates addresses for several principalities and overseas territories. Table 11-3 lists the
territories of France, the United Kingdom, and Australia for which the Global AV can validate addresses.
The requirements for selecting an address database and using a three-letter ISO country code described on
page 102 also apply to these territories and principalities. For example, the Global AV will apply a Gibraltar
address to an up-to-date UK reference database if it contains a GBR country code, or if you select United
Kingdom in Single Country mode.
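When input data names a territory rather than a parent country, a lookup table can supply the ISO code that the Global AV expects. The sketch below is one way to hold the territory list from Table 11-3 in code. The dictionary and function names are hypothetical.

```python
# Territory -> three-letter ISO code of the reference database to use.
TERRITORY_ISO = {
    "Guadeloupe": "FRA",
    "Martinique": "FRA",
    "Mayotte": "FRA",
    "Monaco": "FRA",
    "Reunion": "FRA",
    "Wallis and Futuna": "FRA",
    "Gibraltar": "GBR",
    "Pitcairn Is.": "GBR",
    "Saint Helena": "GBR",
    "Christmas Is.": "AUS",
}


def iso_code_for(territory):
    """Return the ISO code to use for a territory, or None if not listed."""
    return TERRITORY_ISO.get(territory)
```

A Gibraltar address, for example, would be tagged with GBR before it is passed to the component.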
Enhancing Address Validation Engine Performance
You can edit the configuration files associated with the Melissa Data and Address Doctor engines to improve
data processing speed and to log messages warning of data expiry. For more information, see the Data Quality
Installation Guide.
Table 11-2. Formatted Address Output Comparison, Global AV
Envelope Address        Formatted_Address_Line_[n]    Address[n]    Locality_Line_[n]
The Tonight Show        Formatted_Address_Line_1      -             -
3000 W. Alameda Ave.    Formatted_Address_Line_2      Address1      -
Burbank, CA 91523       Formatted_Address_Line_3      -             Locality_Line_1
Table 11-3. ISO Country Codes and Overseas Territories
Country           Territory                                                              Use this ISO Code
France            Guadeloupe, Martinique, Mayotte, Monaco, Reunion, Wallis and Futuna    FRA
United Kingdom    Gibraltar, Pitcairn Is., Saint Helena                                  GBR
Australia         Christmas Is.                                                          AUS
CHAPTER 12
Dictionary Management
This chapter includes the following topics:
Overview, 107
Dictionary Manager, 108
Updating Dictionary Files, 108
Creating a Dictionary, 110
Overview
Informatica Data Quality plans can use the following types of reference data:
Dictionary files. Plain-text files provided by Informatica and saved in the DIC file format. These files are
usable in many Workbench components and are installed by the Content Installer.
Database dictionaries. User-created reference datasets stored in database tables. These tables can be updated
dynamically when the underlying data is updated. Informatica does not provide these dictionaries.
Database dictionaries are a convenient way to use data that has been created for other purposes. By making
use of a dynamic connection, data quality plans can always point to the current version of a database
dictionary.
Third-party reference data. File-based and database reference datasets originating from third party sources
and offered by Data Quality as additional product options. Required for address validation components.
The Content Installer installs these datasets.
This chapter describes the DIC files provided by Informatica and the process to create a dictionary. For more
information about third-party reference data, contact Informatica Global Support.
Dictionary Files
Dictionary files provide an authoritative reference source for many areas in which common terminology is used,
including postal address terms, city names, units of measurement, personal salutations, telephone area codes,
and company names. Many Data Quality components provide options for comparing or updating input data
against dictionary data. These dictionaries are editable, and you can also define your own dictionaries.
A dictionary file is essentially a text file saved in a proprietary (.DIC) format. Each file contains one or more
label entries with one or more item entries for each label. The label represents the correct or standard form of a
word or term. The item values for each label represent a range of variant or alternative spellings. Any operation
that updates your dataset from a dictionary does so by locating an item entry and returning its corresponding
label.
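The label and item relationship amounts to a lookup from each variant spelling to its standard form. The sketch below models this with an in-memory dictionary and does not read the DIC file format. The sample entries and function names are hypothetical, and the case-insensitive comparison is an assumption.

```python
def build_lookup(entries):
    """Map every item spelling to its label (case-insensitive by assumption).

    entries: dict of label -> list of item spellings for that label.
    """
    lookup = {}
    for label, items in entries.items():
        for item in items:
            lookup[item.casefold()] = label
    return lookup


def standardize(value, lookup):
    """Return the label for a recognized item, or the value unchanged."""
    return lookup.get(value.strip().casefold(), value)


# Hypothetical sample entries in the style of a postal terms dictionary.
SAMPLE = {
    "Street": ["Street", "St", "Str"],
    "Avenue": ["Avenue", "Ave", "Av"],
}
```

A value that matches no item is returned as it was entered, which mirrors an update operation that only changes recognized terms.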
Data Quality reads dictionary files from the Dictionaries folder created at install time. The Data Quality
installer does not add dictionaries to this folder. Dictionaries are added by the Content Installer.
When you run a local plan, Data Quality Workbench looks for any dictionaries cited in the plan in the
Dictionaries folder of your Workbench installation. When you run a plan across the service domain, Data
Quality Server looks in the local Dictionaries folder and also in your Dictionaries folder on the service
domain. For more information, see Dictionary Files on page 7.
Note: The dictionary folders read by Data Quality are set during product installation. Their locations can be
changed later if necessary. For information on changing these locations, contact Informatica Global Customer
Support.
Dictionary Manager
The Dictionary Manager is an applet within Workbench that allows you to view and manage the contents of the
local Dictionaries folder. To open the Dictionary Manager in Workbench, press F8.
When you open the Dictionary Manager for the first time after running the Content Installer, it appears populated
with multiple folders. Figure 12-1 displays the Dictionary Manager window:
Note: The Content Installer overwrites any files with the same names that it finds in the Dictionaries folders. If
you have created, renamed, or moved any dictionaries since install and wish to rerun the Content Installer, back
up these files first.
Updating Dictionary Files
A dictionary file is organized as a table with a column of definitive spellings for the terms in the dictionary and
one or more columns for matching or acceptable variant spellings. Each dictionary term has entries in at least
two fields:
Label field. Represents the spelling that will be written back to the plan.
Figure 12-1. Dictionary Manager
Item fields. Represent the forms of spelling that are recognized as a match for the Label in the input data.
The first item field always contains the same spelling as the Label field, that is, it matches the formally
correct or approved spelling of the term.
You can create or update a dictionary in the following ways:
Add or delete an item. Add or delete variant spellings for an existing dictionary term.
Add or delete a label and its related items. Add or delete a definition from the dictionary.
Create a new dictionary file. See page 110.
Before deleting data from a dictionary, be sure that doing so is appropriate for all plans that reference the
dictionary.
Note: You should back up or rename any dictionary you edit. If you rename a dictionary that is used by a plan,
you must edit the plan components to recognize the new dictionary name. If you edit a dictionary but do not
change its name, you do not need to update the plan configuration.
Adding New Items
You can add new spellings to existing definitions. For example, the Numeric Patterns dictionary contains
character patterns for many types of personal data, such as Social Security numbers, telephone numbers, and zip
codes. You can add a variant pattern for one of these data types.
In Figure 12-2, a pattern for a U.S. area code and telephone number has been added to the Item4 field. This
pattern divides the numbers with blank spaces, indicated by an underscore:
To add new spellings to a term in the dictionary:
1. Open the dictionary in the Dictionary Manager and locate the row containing the term.
2. Type the new spelling in the first empty cell on the row.
Adding New Labels
You can add new terms to a dictionary and define the related spellings. Dictionary labels do not need to be in
alphabetical order.
The decision to add terms to a dictionary depends on the purposes of the plans that will use it. You might not
want to recognize all possible variations in a data value.
To add a new term to a dictionary:
1. Open the dictionary and type the formal spelling in the first empty Label field and the Item1 field. These
two fields must be identical. You might need to scroll the dictionary contents to reach an empty row.
2. In the adjacent Item fields, type any variant spellings you want to include in the dictionary. Start in the
Item2 column.
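The rule that the Label and Item1 fields must be identical is easy to verify before a row is typed or pasted into a dictionary. The following sketch treats a dictionary row as a list of strings in the order Label, Item1, Item2, and so on. The function name and the sample rows are hypothetical.

```python
def check_row(row):
    """Return a list of problems for a row given as [Label, Item1, Item2, ...]."""
    if len(row) < 2:
        return ["a term needs a Label field and an Item1 field"]
    problems = []
    label, first_item = row[0], row[1]
    if first_item != label:
        problems.append("Item1 must be identical to the Label")
    return problems
```

Variant spellings from Item2 onward are not checked, because they are expected to differ from the Label.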
Figure 12-2. Numeric Patterns Dictionary
Creating a Dictionary
You can create text dictionaries or database dictionaries.
To create a text dictionary:
1. Open the Dictionary Manager and select the folder where you want to create the new dictionary.
2. Right-click in the right pane of the Dictionary Manager and click New Dictionary > Text.
An empty dictionary worksheet displays.
3. Type or copy a list of values into the Label and Item columns of the dictionary.
4. Close the dictionary and click Yes to save the dictionary.
The dictionary appears in the folder with the name New Dictionary.
5. To rename the dictionary, right-click the dictionary name and select Rename.
6. Type a new name for the dictionary.
The newly-created dictionary can be viewed in the Dictionary Manager and can be found in the Dictionaries
folder of your Data Quality installation.
Note: You can add a correctly-formatted text file with the extension DIC to folders in the Dictionaries folder
structure. The file will be visible in the Dictionary Manager.
To create a database dictionary:
1. Open the Dictionary Manager and select the folder where you want to create the new dictionary.
2. Right-click in the right pane of the Dictionary Manager and click New Dictionary > Database.
The Select Two Columns for Dictionary dialog box opens.
3. Complete the enabled fields under the Connect To Database tab.
Fields differ based on the database type you select.
The default database setting is Staging. It refers to the local database used by Data Quality. You can select
any valid connection.
When you connect to IBM DB2, Microsoft SQL Server, or ODBC-compliant databases, you must
provide a DSN (Data Source Name) for the database. You might be prompted to provide a valid login.
The DSN field identifies the database on the network.
When you connect to an Oracle database, you must provide the SID (System Identifier) for the Oracle
instance.
You might be prompted for login information if you select a non-default database type.
You can identify the character encoding associated with the data in the dictionary. For more
information, see Character Encodings and Unicode on page 159.
4. Click Connect.
The During tab displays.
5. Under this tab, select the two columns to use for the Label and Item1 values in the dictionary, and click
OK.
Creating Dictionary Files with the Report Viewer
The Data Quality Report Viewer allows you to create dictionary files from the output of a data quality plan.
To create or append to a dictionary file using the Report Viewer, your plan should write its output to a Report
Target. A Report Target creates output files in a proprietary SSR file format that allows plan data to display
graphically and in Data Quality dashboards.
The Report Target accepts data only from a frequency component, such as a Count component. The Count
component counts the occurrences of data values in a selected column. You can drill down into the summary
calculations for each column in the Report Viewer to locate the raw data for a dictionary file. When you drill
down into data, you can select a data column and add it to an existing dictionary or create a new dictionary.
Figure 12-3 illustrates how you can drill down through report data, right-click a column, and save the
column data as a dictionary file. This file becomes populated with Label and Item1 entries corresponding to the
column data:
In this case, the dictionary will contain a list of serial numbers from customer records that include invalid zip
codes. You can now create plans to check customer databases against these serial numbers.
For more information about the Report Target, see page 25. For more information about the Report Viewer, see
page 113.
To create or append to a dictionary file using the Report Viewer:
1. Open the Report Viewer. Open the SSR file that references the plan data to be added to the dictionary.
You can open an SSR file in two ways:
In Workbench, run a Data Quality plan with a Report Target, ensuring that the Report Target has been
configured to launch the Report Viewer on plan execution.
In the Report Viewer, click File > Open and browse to the SSR file for the report in question.
2. With the report open in standard view, right-click the row for the relevant data instance and select Open.
A spreadsheet opens, showing all data rows for the instance you have selected.
3. If you want to save the full contents of a column to a dictionary file, right-click in the column and click
Edit > Select Column.
The entire column is highlighted.
-or-
If you want to save a selection from a column to a dictionary file, Shift-click to select the required values.
4. Right-click the highlighted values and select Export To > Dictionary File.
The Select Dictionary Name dialog box opens.
5. Browse to a location in the Informatica Data Quality Dictionaries folder structure.
6. If you want to create a new dictionary, type a new dictionary name.
Figure 12-3. Creating a Dictionary File with the Report Viewer
-or-
If you want to append to or replace a dictionary, select a dictionary name.
You will be prompted to append to or overwrite the current data for the dictionary.
7. Click OK.
CHAPTER 13
Report Viewer
This chapter includes the following topics:
Overview, 113
Viewing Data in the Report Viewer, 113
Standard View and Dashboard View, 115
Viewing Plan Data, 118
Report Viewer Parameters and Settings, 119
Tracking Changes in Data Quality, 120
Importing Report Files and Working with Groups, 121
Overview
This chapter describes the Data Quality Workbench Report Viewer. The Report Viewer allows you to perform
the following tasks:
Display plan results in graphical and numerical formats in a dedicated viewing application.
View drill-down analysis of the raw data underlying the plan results.
Create data quality dashboards that can be exported in spreadsheet and HTML form for business users and
other interested parties.
Save key subsets of plan data to file for use as reference dictionaries.
The Report Viewer is particularly suited to displaying data quality dashboards, which explore the quality of
a dataset according to criteria set by the business.
You can use the Report Viewer to view the SSR report files that are created by plans containing a Report Target.
Viewing Data in the Report Viewer
You can open and read data in the Report Viewer.
Opening the Report Viewer
The Report Viewer can be activated in three ways:
Configure the Report Target to generate a report in Standard/SSR report format, select Launch Report on
Completion, and then execute the plan.
Open the Report Viewer from the Data Quality Workbench program group via the Windows Start menu.
You can use the Report Viewer's File menu to open a report file.
Click the Report Viewer toolbar button in the Data Quality Workbench user interface.
Reading Report Data
The Report Viewer can display data for all items selected in the frequency components of the plan. Data items
typically have many kinds of data associated with them.
When you select a data item in the Count component, you add the number of times each value occurs to the
report.
For example, a plan might contain a business rule defined in a Rule Based Analyzer that tests the accuracy of the
currency type associated with data records. In this case, the Rule Based Analyzer creates a new data column
whose fields may read Valid Currency or Invalid Currency.
The Report Viewer might also show the number of empty fields and values excluded from calculations
depending on the parameters of the preceding operational component, such as the number of values classified as
Others by the Count component. For this reason, it is important to understand how frequency components are
configured. A large number of Others values can indicate that the Count component needs to be reconfigured.
Types of Graph
In standard mode, you can choose from two graphing options for a data item from the View menu:
Pie Chart
Bar Chart
Beneath each chart type, the data for the item is tabulated. The No Graph option omits both chart types.
When you open the Report Viewer, the right pane displays data for one item at a time. You can select an All
Reports option through the View menu that displays all items in scrollable form in the right pane.
The View menu also lets you set the orientation of the bars in the chart to horizontal or vertical. The legend for
the charted item appears below the chart, providing precise metrics for the quantity and percentage of the
charted data.
Figure 13-1. Report Viewer, Standard View
Standard View and Dashboard View
You can view data in the Report Viewer in two modes:
Standard view
Dashboard view
Standard View
When first opened, the Report Viewer opens in Standard view, presenting its information in two panes. The left
pane lists the source fields selected in the frequency components in the plan. The right pane displays the
following information:
A bar chart or pie chart for each item in the left pane.
The numbers of records that satisfy or do not satisfy the quality criterion for each item and the percentage of
data in the item that each number represents.
Any changes you make to the view settings for the report are stored to a master settings file for the Report
Viewer. For example, if you leave the standard mode by selecting Dashboard view, the report data displays in
dashboard mode the next time the SSR file is opened.
Dashboard View
Dashboards illustrate the ongoing progress of the dataset towards data quality business targets. When you
activate the dashboard, the standard view is collapsed, and the items are presented in a series of bar charts that
can be arranged in data quality categories.
Dashboards can display the following information:
The percentage of records that satisfy the data quality criterion underlying each item.
The data quality target set by the business for each item.
Horizontal bars charting the percentage of good quality records in each item with each bar color-coded to
indicate whether the data meets or misses its target.
An icon that indicates whether the data quality in the item is improving over time.
The percentage of records in each item that satisfied the respective data quality criteria in previous
executions of the plan.
Select View > Dashboard from the main menu to toggle between standard and dashboard modes.
Setting Data Quality Targets in the Dashboard
The fields in the Target column for each data item are editable. You can activate the cursor in each field and
type a percentage target value for it.
When a data item meets its target, that is, when the percentage in the Passed field meets or exceeds the percentage in
the Target field, the horizontal bar for that item turns green.
When the Passed percentage is lower than the Target percentage, the horizontal bar turns red. The exception is when
the shortfall is within the threshold set in the Settings dialog box, in which case the bar turns orange.
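The color rule can be summarized in a short Python sketch. It assumes the default threshold of 5 percent from the Report Viewer Settings dialog box and treats a shortfall exactly equal to the threshold as within it. The guide does not say whether that boundary is inclusive. The function name is hypothetical.

```python
def bar_color(passed_pct, target_pct, threshold_pct=5.0):
    """Return the dashboard bar color for one data item (illustrative only)."""
    if passed_pct >= target_pct:
        return "green"   # target met or exceeded
    if target_pct - passed_pct <= threshold_pct:
        return "orange"  # target missed, but within the configured threshold
    return "red"         # target missed by more than the threshold


colors = [bar_color(95, 90), bar_color(87, 90), bar_color(80, 90)]
```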
Modifying Dashboard Calculation Parameters
In addition to setting the weight associated with an item and its target percentage, you can add or remove data
elements from the data quality percentage calculation for that item. This allows you to display the data quality
compliance percentages for constituent elements within the data item.
To view and edit the list of data elements for a data item, right-click the item and select Configure Items. This
opens a configuration dialog box that lists the data elements associated with the item and shows which ones are
applied to the passed percentage calculation.
Check an element to add it to the calculation. To remove an element, clear its checkbox. Select at least one
element.
Note: Item configuration changes made in the dashboard are not applicable to the charts and statistics in
standard mode.
Dashboard Categories
In dashboard mode, you can create categories and assign data items to them. You typically create categories to
display items with common data quality criteria. Figure 13-2 on page 116 shows categories for Accuracy,
Completeness, Conformity, and Consistency and also the default New Items category.
Categories are managed through the Dashboard Categories dialog box. This dialog box provides options to add
new categories, edit category names, and move categories higher or lower in the dashboard report.
To open this dialog box, right-click any data item on the dashboard and select Configure Categories:
Creating a Category
Use the following procedure to create categories.
To create a category:
1. Open the Dashboard Categories dialog box and click Add.
The Category Name dialog box opens.
Figure 13-2. Report Viewer, Showing Dashboard Categories
2. Type a name in this dialog and click OK.
3. Click Close in the Dashboard Categories dialog box.
Assigning Items
All dashboards contain a single category when first created, named after the plan. All data items reside in this
category before you assign them to other categories.
Data Quality Workbench creates a new category for each new plan/group added to the report.
To assign a data item to a category:
1. On the dashboard, highlight the item name.
2. Right-click the category and select Move to from the context menu.
This displays a list of available categories.
3. Without leaving the context menu, select a new category for the item.
Note: A dashboard displays all items available to the Report Target. Items cannot be hidden or deleted from
the dashboard.
Moving Rows within Categories
You can move a row of data within a dashboard category.
To move a data row within a dashboard category:
Hold the Alt key and drag the row to a different location in the category.
Deleting a Category
You can delete categories from a dashboard. A category that contains a data item cannot be deleted from the
dashboard. Assign the data item to a different category before deleting the category.
To remove a category from the dashboard:
Highlight the category in the Dashboard Categories dialog box and click Remove.
Assigning Weights to Data Items
Each category on the dashboard has a weighted average, the average pass percentage across all items in the
category calculated based on the weight assigned to each item.
By default, all items have an equal weight of 1.0. You might change this value based on the business importance
of the item within the category or the relative number of data records represented by the category. A higher
number reflects higher relevance for that item. A lower number reflects lower importance. Setting the number
to 0 removes the item from the calculation of the average pass rate for the category.
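As a worked illustration, the following Python sketch computes a category average as a standard weighted mean. The guide does not state the exact formula the Report Viewer uses, so treat this as an assumption. The function name is hypothetical.

```python
def weighted_average(items):
    """items is a list of (pass_percentage, weight) pairs for one category."""
    total_weight = sum(weight for _, weight in items)
    if total_weight == 0:
        return None  # no item contributes to the average
    return sum(pct * weight for pct, weight in items) / total_weight


equal = weighted_average([(90, 1.0), (70, 1.0)])     # default weights of 1.0
weighted = weighted_average([(90, 2.0), (60, 1.0)])  # first item counts double
excluded = weighted_average([(90, 1.0), (10, 0)])    # weight 0 drops the item
```

With equal weights, the result is the plain average. Raising one weight pulls the category figure toward that item, and a weight of 0 leaves the item out entirely.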
To review and edit the weight assigned to an item:
1. Highlight the first row in its category, right-click, and select Configure Items.
This opens the Weighted Average Configuration dialog box, which lists the items in the category and the
current weight for each one.
Note: The first row in each category is named Weighted Average by default. This name can be changed in
the Weighted Average Configuration dialog box. However, the first row always provides the weighted
average pass rate for the category and appears in bold type. The configuration dialog box name is static
regardless of the item name displayed in the first row.
2. Enter new weights as necessary.
Viewing Plan Data
You can use the Report Viewer to drill down into the underlying plan data, including the source data, in tabular
form. From the drill-down table, you can filter the data to pinpoint different data values and copy all or part of
the dataset to a CSV file or clipboard.
In standard mode, you can double-click any chart element in the right pane to open a new window that displays
the records matching the properties of that element. You can also right-click any highlighted element in the
legend and select Open.
Dashboards provide another means to view the underlying data.
To view the records that do not satisfy the quality criteria for that item:
Right-click a highlighted data item in dashboard mode and select View Exceptions.
Note: When you drill down to data within the Report Viewer, you refresh the view of the underlying plan data,
displaying the current state of the dataset. If the data has changed since the plan was last run in Workbench,
these changes are available to the Report Viewer. This does not alter the SSR file or the plan.
Drill-down mode can display either the columns in plan source data or all columns used in the plan. The latter
includes both source data columns and columns created in the plan. Configure this setting in the Report Viewer
Settings dialog box.
Exporting and Filtering Data in Drill-Down Mode
In drill-down mode, you can export data to a CSV file or to a dictionary (.DIC) file.
To export data to a dictionary file:
1. Right-click the data values you want to export and click Export To > Dictionary.
The Select Dictionary Name dialog box displays.
2. You can append the data to the dictionary or overwrite existing data by selecting an existing dictionary file.
-or-
You can enter a new name in the File name field to create a new Data Quality Workbench dictionary with
values for Label and Item1.
3. Save the dictionary in a location recognized by the Dictionary Manager.
To export data to a CSV file:
1. Right-click the data values you want to export and click Export To > CSV File.
The Select CSV File Name dialog box displays.
2. You can overwrite data in an existing file.
-or-
You can enter a new name in the File name field to create a new CSV file.
You can use the context menu to filter the data that displays and focus on a subset of data. The drill-down
context menu provides the following options:
Edit > Select Column. Selects all values in the column.
Edit > Select All. Selects all values in the table.
Edit > Copy. Copies the highlighted cells to the Windows clipboard. You can use Ctrl or Shift-click to
highlight cells across multiple rows and columns, and then copy their contents to the clipboard.
Export to > Dictionary. Copies the highlighted cells to a reference dictionary (.DIC) file.
For more information about creating dictionaries using the Report Viewer, see Creating Dictionary Files
with the Report Viewer on page 110.
Export to > CSV File. Copies the highlighted cells to a CSV File.
Filter > Filter by Selection. Hides all records that do not contain the value in the highlighted cell.
Filter > Remove Filters. Removes the filter applied and restores the data table.
Filter > Auto Filter. Adds a new cell at the top of every column in the table. Each cell provides a menu of
every data value in the column. You can select a value from any cell to filter the table for records containing
the same value in the same column.
You can use multiple cells in a filter, resulting in data that fulfills all filter requirements. Select Unfilter to
clear these filters.
Find. Opens a dialog box that permits searches of selected columns or the entire table.
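The way several filter cells combine can be pictured with a short Python sketch. It models the drill-down table as a list of records and keeps only the rows that match every selected value. The column names and sample data are hypothetical, and this is not how the Report Viewer is implemented.

```python
def filter_rows(rows, criteria):
    """Keep the rows whose values match every column/value pair in criteria."""
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]


records = [
    {"Country": "US", "Currency": "USD", "Status": "Valid Currency"},
    {"Country": "US", "Currency": "EUR", "Status": "Invalid Currency"},
    {"Country": "IE", "Currency": "EUR", "Status": "Valid Currency"},
]

# Two filter cells in use: both conditions must hold for a row to remain visible.
visible = filter_rows(records, {"Country": "US", "Status": "Invalid Currency"})
```

Passing an empty set of criteria returns every row, which corresponds to removing the filters.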
Report Viewer Parameters and Settings
Bear the following points in mind:
The Report Viewer displays report files. The SSR files displayed in the Report Viewer are written or
updated only when the plan is executed using the Workbench Run Plan command. You cannot edit or save
report files using the Report Viewer.
The Report Viewer stores settings in a master report settings file. Some display settings are stored
automatically, such as the display mode and report charts display. Other settings can be set as properties. The
Report Viewer does not store report settings in the SSR file.
Some key report settings cannot be restored if they are changed in the Report Viewer. If you delete the
dashboard history, for example, you cannot restore it, even if you run the plan again or have a backup SSR
file. There is no Undo function in the Report Viewer.
Editing Report Viewer Settings
Several settings and display parameters relating to all viewed reports can be set manually.
The following settings are available in the Report Viewer Settings dialog box. Click File > Preferences to access
this dialog box.
Limit pages to [n] records. Sets the number of records displayed when you drill down to the data records
underlying the plan. The default value is 500.
Limit record retrieval to [n] records. Sets the number of records retrieved in a drill-down operation. This
setting is useful when you want a snapshot of the plan data and do not need to run the entire plan. The
default value is 2000.
Limit column autosizing to [n] characters. This value sets the default column width. Any field that is not
wide enough to display all characters in a string displays an arrow indicator. The default value is 30
characters.
Limit Pie chart to [n] slices. This value sets the number of slices that display in report pie charts. Any data
values that do not fall into the number of slices set by this field are aggregated into a single slice.
The default value is 10 slices, displaying a maximum of nine slices that refer to data elements and a tenth
slice for the remaining elements.
Use this setting to keep pie charts easy to read. It is also a useful method of grouping data elements for
drill-down purposes.
Limit Bar chart to [n] bars. This value sets the number of bars that display in report bar charts. Any data
values that do not fall into the number of bars set by this field are aggregated into a single bar.
The default value is 10 bars, displaying a maximum of nine bars that refer to data elements and a tenth bar
for the remaining elements.
As is the case with pie charts, you can use this setting to group data elements for drill-down purposes.
Show orange bar when within [n] percent of target. This setting relates to dashboards. It provides a visual
cue to indicate when a data quality level approaches its data quality target. The default setting is 5 percent.
Show component columns. Use this option to show all data columns available in the plan in drill-down
view. This option is cleared by default, displaying only source data columns for drill-down.
Report template. Displays the path to the XSL template on which the standard report view is based.
Dashboard template. Displays the path to the XSL template on which the dashboard view is based.
Dashboard history template. Displays the path to the template for the dashboard history graph.
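The aggregation behind the Limit Pie chart and Limit Bar chart settings can be sketched in Python as follows. The sketch assumes that the largest values keep their own slice or bar, which the guide does not state explicitly. The function name and the label Remaining are hypothetical.

```python
def limit_slices(counts, max_slices=10):
    """Return (value, count) pairs, folding the excess into one final entry."""
    ordered = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    if len(ordered) <= max_slices:
        return ordered
    # Keep max_slices - 1 individual entries and aggregate everything else.
    kept = ordered[:max_slices - 1]
    remainder = sum(count for _, count in ordered[max_slices - 1:])
    return kept + [("Remaining", remainder)]


zip_counts = {"10001": 5, "94105": 4, "60601": 3, "73301": 2, "02108": 1}
slices = limit_slices(zip_counts, max_slices=3)
```

With the default limit of 10, this yields at most nine individual entries plus one aggregate entry, matching the behavior described for both chart types.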
Hiding Data Elements in Standard View
In addition to limiting the bar chart and pie chart segments displayed through the Settings dialog box, you can
hide data elements through the legend displayed in standard mode.
To hide data elements:
Right-click the element and click Hide.
The item is removed from the legend and from any chart above it.
To restore hidden data elements:
Right-click the legend and click Unhide.
The resulting dialog box will list all hidden items. You can choose one or more of these to restore.
Note: In dashboard view, the Report Viewer stores drill-down settings across successive Report Viewer sessions
and successive plan executions. However, in standard view, hidden data settings are not stored.
Tracking Changes in Data Quality
A dashboard is particularly useful for tracking changes in the data quality levels of the dataset, data item by data
item. It provides two means to do so:
Historical percentages
Historical trend graphs
Historical Percentages
A dashboard can show the changes in the percentage data quality achieved by a data item over time. The Report
Viewer remembers the data quality percentages from the most recent dashboard view on each day that the
report is opened. That is, the Report Viewer remembers one set of percentages a day. These percentages appear
on the right of the dashboard.
Historical Trend Graphs
At a high level, an arrow in the left-most column on the dashboard will indicate whether the data quality for an
item has improved or declined since the base point date. (No arrow means there has been no change.)
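The history behavior described in this section can be modeled with a small Python sketch: one percentage is kept per day, later views on the same day replace earlier ones, and the trend compares the latest value with the base point. The class name is hypothetical, and using the first recorded day as the default base point is an assumption made for illustration.

```python
from datetime import date


class ItemHistory:
    """Keeps one pass percentage per calendar day for a single data item."""

    def __init__(self):
        self.by_day = {}
        self.base_point = None

    def record(self, day, passed_pct):
        # The most recent dashboard view on a given day replaces earlier ones.
        self.by_day[day] = passed_pct
        if self.base_point is None:
            self.base_point = day  # assumed default: first recorded day

    def trend(self):
        """Return 'up', 'down', or None when there is no change or no data."""
        if self.base_point is None:
            return None
        latest = self.by_day[max(self.by_day)]
        base = self.by_day[self.base_point]
        if latest > base:
            return "up"
        if latest < base:
            return "down"
        return None


history = ItemHistory()
history.record(date(2009, 3, 1), 80.0)
history.record(date(2009, 3, 2), 82.0)
history.record(date(2009, 3, 2), 85.0)  # same day: replaces 82.0
```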
For a more detailed view, highlight the item name, right-click on the dashboard, and select View History... from
the context menu. This opens a line graph plotting the progress in data quality for the item over time.
Viewing the Line Graph
The line graph displays percentage values on its vertical axis and date values on its horizontal axis.
Right-clicking in the graph area provides access to the following options:
Copy. Use to copy the chart image to the clipboard.
Set as base point. Use to set the selected percentage as the baseline for the graph. In a graph with multiple
data points, a pair of dotted X-Y lines identify the selected percentage.
Clear history before point. Use to clear all history before this date. When you select this option, you are
asked if you want to clear the history for all other items on the dashboard. The default option is Yes. Click
No if you want to clear the history for this item only. Click Cancel to cancel the operation.
Note: The Clear command deletes the earlier graph history and the associated historical data on the dashboard
itself. Once deleted, this information cannot be restored.
Importing Report Files and Working with Groups
You can combine data from multiple report files into a single view in the Report Viewer by using the Import
command. This command identifies an SSR file and imports its data into an open report.
When you import a report, you create a group comprising data from the imported report and the report
previously open in the Report Viewer. A group is a collection of settings saved to the master report settings file
that points to multiple SSR files and defines how they display.
The group does not store report data or edit the SSR files.
Creating a Group
Use the following procedure to create groups.
To import data from a report file and create a group:
1. Select File > Import... from the main menu.
The Import Report dialog box opens.
2. Browse to the location of the SSR file and click OK.
When you identify the relevant file, a new dialog prompts you to type a group name for the combined
report data.
Managing Groups
Use the following procedure to view or delete groups.
To view the groups available to the Report Viewer:
1. Click File > Groups to open the Manage Groups dialog box.
2. To view a group, highlight its name and click Open.
3. To delete a group, highlight it and click Delete.
Clicking the Close button closes this dialog box.
You cannot delete the currently open group.
Groups and Dashboards
Groups are useful for aggregating and displaying the data analyses of several plans. This can provide a wide-
angle view of the quality of the business data, particularly when scorecards are built for the group.
You can define a dashboard for a group as you do for a single report. With group dashboards, you can define
one or more categories containing key items from multiple reports.
Note: You cannot toggle between a dashboard for a single report file and for a group. When you view the
dashboard for a group, the Report Viewer drops the dashboard for the originally-opened report file and displays
dashboards for available groups for the remaining Report Viewer session. To return to the earlier report file, you
need to open the file again.
CHAPTER 14
Deploying Plans for Runtime Execution
This chapter includes the following topics:
Overview, 123
Deploying Runtime Plans, 123
Running a Plan, 124
Command Line Arguments, 126
Performance, 127
Multi-Threading and Multi-Processing, 128
Security, 129
Overview
Data Quality supports the deployment of plans for runtime execution, that is, for execution as part of a
scheduled or batch process. Plans created in Data Quality Workbench can be published from one Data Quality
repository to another. The execution of the plans is then managed from the command line. You can deploy
plans on Windows and UNIX platforms.
Note: In earlier versions of Informatica Data Quality, the capability to deploy plans for scheduled or batch
execution was delivered through a separate application called Data Quality Runtime. In this version, Runtime
functionality has been incorporated into Data Quality Server. This chapter describes the runtime plans.
For information about the prerequisites and system requirements for runtime functionality, see the Informatica
Data Quality Installation Guide.
Deploying Runtime Plans
Plans deployed for batch or scheduled execution can be run from one of two locations:
Directly from the Data Quality repository (enterprise installs only).
As an XML file from the local file system.
The local or remote Data Quality repository is identified in the config.xml file on the machine that runs the
plan.
Data Quality Workbench users in a service domain can use the Project Manager and File Manager to publish
plans and move file resources to a remote Data Quality repository for deployment. All plans published to the
repository are available for execution by Informatica Data Quality as long as the paths to all relevant data and
dictionary files are valid for the plan. You can identify the paths and filenames using parameter files. For more
information, see The -c Option on page 126.
You can convert plans to XML files from the Workbench interface and deploy the plan files and other resource
files. For example, you can transfer files to another computer using FTP.
Note: When executing a runtime plan, Data Quality looks in the default Dictionaries folder for plan
dictionaries. However, you can specify data source files anywhere on the Runtime host as long as their
locations are specified in a parameter file associated with the plan. For this reason, Data Quality Workbench
allows you to specify the source and target file locations when you save a plan as XML.
Use runtime plans in environments where the data repository is updated periodically from one or more
low-quality source systems and you need to cleanse and run reports on data periodically.
On Windows, the executable file for implementing runtime functionality is Athanor-RT.exe, located in the bin
folder of the Data Quality Server installation.
On UNIX and Linux the executable file is a script located in the bin folder of the Data Quality Server
installation, named athanor-rt. This script calls the Athanor-RT executable file using a suitable environment.
Note: Do not run the Athanor-RT executable directly on non-Windows platforms.
Running a Plan
Data Quality can execute a plan as an XML file from the file system or from the Data Quality repository.
The -f flag specifies that athanor-rt should read a plan from an XML file in the local file system. The -p flag
specifies that the plan should be read from the repository identified in the local config.xml file. For example, the
following code runs myplan.xml from the home/Informatica/DataQuality/plans folder:
athanor-rt -f home/Informatica/DataQuality/plans/myplan.xml
The following code runs myplan from the Folder1 folder in the Project1 project in the repository:
athanor-rt -p project1/folder1/myplan
Note the following:
You can use the -c option to have Data Quality read plan variables and source file locations from a
parameter file. This allows you to reuse a plan without having to edit the plan for each scenario. For more
information, see Command Line Arguments on page 126.
Parameter files are also important elements in plan execution. Use -c to specify a parameter file that identifies
the locations of the data source files.
As Data Quality executes plans, it logs messages to the screen, to the local log file, and to the Event Log
on Windows platforms or syslog on UNIX platforms as configured in the config.xml file.
Version Control
Data Quality Server provides version control for plans stored in the repository. The -p option allows you to
identify a base version of a plan for runtime execution.
For example, the following code runs base version 3 of myplan:
athanor-rt -p project1/folder1/myplan:3
Scheduling Operations
Data Quality can run plans in batch mode automatically, by means of a scheduling application, or manually, by
an operator. For example, when an overnight batch schedule updates a database from a series of data feeds, you
can call the Data Quality engine to check the feeds for data quality problems. You can call the command line
application with a scheduler such as Windows Task Scheduler or UNIX Cron.
Windows Scheduling
The following steps describe how to schedule a plan on a Windows computer:
1. Create a batch file QualityReport.bat and add the desired command, for example:
"C:\Program Files\IDQ\bin\Athanor-RT.exe" -f C:\Plans\QualityReport.xml
2. Run the batch file to ensure that it works as expected.
Run the file with the user profile of its intended user.
3. Add a new task.
Open the Scheduled Tasks window from the Windows Control Panel. Right-click in the window and click
New > Scheduled Task from the shortcut menu, and name the task.
4. Open the property sheet for this task and edit its settings as follows:
On the Task tab:
Type the local path to the batch file in the Run field, such as C:\Plans\QualityReport.bat.
Type the path to the Data Quality installation in the Start In field, such as C:\Program Files\IDQ.
Select the user profile that will run the plan. Remember to confirm that the file will run correctly for
that user.
On the Schedule tab, specify when you want to run the task.
Review the Settings tab fields. The default settings on this tab are sufficient for most tasks.
5. Click OK and, if prompted, enter a username and password.
The task is now under the control of the Windows Task Scheduler.
6. To add pre- or post-task operations, add steps to the batch file or add new tasks to the Scheduler.
You can use any scheduler with the ability to run command line tools.
Note: If the Windows Scheduler cannot find the specified file, check for spaces in the paths provided in step 4
above. Check the path by running the file from the command line. If spaces are present, surround the path with
double quotation marks, as follows:
"C:\My Tasks\QualityReport.bat"
The batch file returns the error code of the last command executed.
UNIX Scheduling
The following steps illustrate the scheduling of plan Profile.xml on a Solaris machine using the cron scheduler:
1. Create a shell script called QualityReport.sh and add the run command, for example:
$ home/athanor/bin/athanor-rt -f $HOME/Plans/Profile.xml
2. Run the shell script to make sure it performs as expected.
3. Create a new scheduled task using the crontab -e shell command.
The following task runs QualityReport.sh and logs standard and error messages to /tmp/QualityReport.log:
0 02 * * * sh -f /export/home/athanor/QualityReport.sh > /tmp/QualityReport.log 2>&1
You can use any scheduler that has the ability to run command line tools. For more information on using cron
and crontab, see the man crontab and man cron commands or contact your system administrator.
Command Line Arguments
Typing athanor-rt -? at the command prompt displays the following output:
Usage: .\Athanor-RT.exe [ -f <XML plan filename> | -p <project name>[/<folder name> ...
]/<plan name>[:<version id>] ]
[ Options ]
Specify a plan:
-f <XML plan filename> Run the plan contained in the runtime plan XML file
-p <Repository plan> Run the plan from the repository specified by the path
Options:
-c f Use the parameter file f to override values in the XML plan
-i n Display progress information every n records
-? Display this usage screen
-h Display this usage screen
For more information about options -f and -p, see Running a Plan on page 124.
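The usage text above can be turned into a small helper that assembles an argument list for a scheduler or wrapper script. The following is a minimal sketch only: the function name build_athanor_args and its parameters are hypothetical, and it assumes that each option and its value can be passed as separate arguments, as the usage text suggests.

```python
from typing import List, Optional


def build_athanor_args(
    executable: str = "athanor-rt",
    plan_file: Optional[str] = None,
    repo_plan: Optional[str] = None,
    version: Optional[int] = None,
    param_file: Optional[str] = None,
    progress_every: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list from the options shown in the usage text."""
    # Exactly one of -f (XML file) or -p (repository path) selects the plan.
    if (plan_file is None) == (repo_plan is None):
        raise ValueError("specify exactly one of plan_file or repo_plan")
    if version is not None and repo_plan is None:
        raise ValueError("a version id only applies to a repository plan")

    args = [executable]
    if plan_file is not None:
        args += ["-f", plan_file]
    else:
        # A repository plan may carry an optional ":<version id>" suffix.
        target = repo_plan if version is None else f"{repo_plan}:{version}"
        args += ["-p", target]
    if param_file is not None:
        args += ["-c", param_file]
    if progress_every is not None:
        if progress_every <= 0:
            raise ValueError("progress_every must be a positive integer")
        args += ["-i", str(progress_every)]
    return args
```

For example, build_athanor_args(repo_plan="project1/folder1/myplan", version=3) yields the arguments for the versioned repository plan shown earlier. The resulting list can be passed to a process launcher of your choice.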
The -c Option
Data Quality supports the use of parameter files that can facilitate the deployment of a plan in one or more
environments. The parameter file is passed to the Data Quality engine using the -c option.
The parameter file defines the environment-specific values to be used when the plan is executed. For example, a
mapping between the original location of a source file and its new location can be defined in the parameter file:
C:\Program Files\IDQ\DevData\Source.csv=
C:\Program Files\IDQ\users\user.name\Files\ProdData\Source.csv
Such mappings are platform-independent, that is, a Windows path can be mapped to a UNIX path, and vice
versa.
You can export or publish a plan and notify an administrator who applies the parameter file. Alternatively, you
can prepare the parameter file before exporting or publishing the plan.
To make best use of the -c option, establish a standard convention to indicate the kind of information files
contain. Take care when defining mappings in the parameter file. For example, the mapping word=book will
replace all instances of word in the XML file, including tags such as <password>, which can result in an invalid
plan.
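The risk described above can be checked before a parameter file is deployed. The sketch below simulates plain textual replacement, which is an assumption based on the word=book example and not a description of the engine's actual implementation. The function names are hypothetical.

```python
import re
from typing import Dict, List


def apply_mappings(plan_xml: str, mappings: Dict[str, str]) -> str:
    """Naively replace every occurrence of each key (simulation only)."""
    for old, new in mappings.items():
        plan_xml = plan_xml.replace(old, new)
    return plan_xml


def keys_touching_tags(plan_xml: str, mappings: Dict[str, str]) -> List[str]:
    """Return the mapping keys that also occur inside an XML tag in the plan."""
    tags = re.findall(r"<[^>]*>", plan_xml)
    return [key for key in mappings if any(key in tag for tag in tags)]


plan = "<plan><password>word</password></plan>"
# The key "word" also matches inside the <password> tag name.
print(apply_mappings(plan, {"word": "book"}))
print(keys_touching_tags(plan, {"word": "book", "PasswordHolder": "secret"}))
```

A check such as keys_touching_tags can flag a mapping key that is too generic, which supports the advice to adopt a distinctive naming convention for placeholders.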
Encryption
Often the details in a parameter file, such as passwords and database connection details, are sensitive. To
maintain security, an administrator can encrypt the parameter file by passing it to the Athanor-Encode utility.
This generates an encrypted file with the extension .enc appended to the original parameter file name.
This file can only be read by Data Quality or by Informatica Global Customer Support. You can edit the
parameter file in a secure environment and place the encrypted version in the production environment.
Passwords
You can apply the parameter file in encrypted or plain text mode. In plain text mode, when you edit the
password tag, the parameter will be applied each time the plan is run.
When you want to replace encrypted passwords at execution time, you must edit the XML plan and replace the
encrypted password with a placeholder. For example, the following line:
<Password EncryptionLevel='1'>W3uC+PY/kzcAUw==</Password>
should be replaced with a non-encrypted placeholder that can be easily communicated and defined in
production parameter files, for example:
<Password>PasswordHolder</Password>
In a parameter file, the password can now be substituted using the following mapping:
PasswordHolder=user.name
Shared Database Details
A plan may be designed for use with two databases that share common connection details and then be run in
production against two different databases. In such a case, Data Quality cannot distinguish between the two.
You must edit the original plan so that it refers to the production databases, or add placeholders for the
production databases before moving plans to a different domain. Alternatively, as best practice, it may be worth
developing the convention of using distinct database details and accounts for each database when a plan is in
design.
The -i Option
Use the -i option for checking system performance and establishing the reasons why a plan is behaving in a
certain way.
For example, plan n reads a CSV source, changes two fields within the dataset to uppercase, and writes
the data to a CSV target. Its input fields are as follows:
CUSTOMER_KEY, FIRST_NAME, LAST_NAME, ADDRESS_LINE_1, ADDRESS_LINE_2... ADDRESS_LINE_6
Running the plan and specifying -i x on the command line, where x is a positive integer, produces the output
shown below whenever x records (plus 1 for the initial record) are processed:
Time in long seconds 1063104892
Local time Tue Sep 09 11:54:52 2003
[0] DataSource Progress = 0
[1] DataSource Num Records = 9975
[2] DataSource Num Comparisons = 4
[3] Similarity Record ID = 4
[4] CUSTOMER_KEY = 12321
[5] FIRST_NAME = Edward
[6] LAST_NAME = Oconnell
[7] ADDRESS_LINE_1 = Clorane
[8] ADDRESS_LINE_2 = Kiloimo
[9] ADDRESS_LINE_3 = Co Limerick
[10] ADDRESS_LINE_4 =
[11] ADDRESS_LINE_5 =
[12] ADDRESS_LINE_6 =
[13] To Upper 2(FIRST_NAME) = EDWARD
[14] To Upper 2(LAST_NAME) = OCONNELL
Each row corresponds to a memory location in the engine. The time in long seconds is useful for checking the
performance of the engine. For most tasks, every set of x records should be processed in the same amount of
time. If this is not the case, a performance bottleneck exists.
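The timing check described above can be automated by comparing the "Time in long seconds" values of consecutive progress blocks. This is a minimal sketch that assumes the progress output has been captured to text and that each block starts with a line in the format shown above. The function names and the median-based threshold are illustrative choices, not part of the product.

```python
import re
from typing import List

TIME_LINE = re.compile(r"^Time in long seconds\s+(\d+)\s*$", re.MULTILINE)


def progress_intervals(log_text: str) -> List[int]:
    """Return the seconds elapsed between consecutive progress blocks."""
    stamps = [int(m.group(1)) for m in TIME_LINE.finditer(log_text)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


def slow_intervals(intervals: List[int], factor: float = 2.0) -> List[int]:
    """Return indexes of intervals longer than factor times the median."""
    if not intervals:
        return []
    ordered = sorted(intervals)
    median = ordered[len(ordered) // 2]
    return [i for i, gap in enumerate(intervals) if gap > factor * median]
```

An interval flagged by slow_intervals marks a set of x records that took noticeably longer than the others, which is a reasonable place to start looking for a bottleneck.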
Performance
The time it takes for a plan to execute depends on several factors. Some are related to Data Quality, and some
are related to the environment in which the plan is executed.
In general, plan execution time includes time for the following:
1. Reading data from a data source.
2. Executing the business rules defined in the plan.
3. Writing data to a data target or report.
Reading and writing data depends on the speeds at which the Data Quality engine can read from and write to a
data source or data target. With a slow-performing database source, the engine may spend more time waiting
for data than processing it. Similarly, a slow-performing file target means that Data Quality may spend more
time waiting for data to be written.
As a rule, database sources should be as close as possible to the Data Quality instance that executes the plan.
For example, a plan using a database source will run much faster if the database is located on the same local
network than if the database is located at a remote site.
Similarly, when the Data Quality process is constrained by system resources such as CPU or available memory,
it spends more time processing. When a plan consumes a large percentage of the CPU, it will probably execute
faster on a higher-performance CPU.
Reading and Writing
Tuning database or file system access to reduce the time spent accessing data sources and targets allows Data
Quality to concentrate on processing records.
Processing
Increasing the CPU speed means that records can be processed more quickly.
The MySQL database underlying the Data Quality repository or staging area can also be tuned.
Maintenance and Housekeeping
Note the following about plan and product failures:
Plan failure. athanor-rt reports an error code of 1 if a plan fails to execute. The calling process can opt to fail
or run again depending on the error code returned.
Product failure. In the unlikely event that Data Quality crashes, you can facilitate crash diagnosis by
performing a stack traceback and sending the results to Informatica Global Customer Support. For
information about this operation, contact your systems administrator.
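A calling process can act on the error code as described above. The sketch below retries a command until it returns 0 or a limit is reached. The function name, the retry limit, and the injectable runner parameter are hypothetical. Only the convention that a failed plan returns a non-zero code comes from this guide.

```python
import subprocess
from typing import Callable, Sequence


def run_plan_with_retry(
    argv: Sequence[str],
    max_attempts: int = 2,
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run the command until it returns 0 or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    code = 1
    for _ in range(max_attempts):
        # By default the runner launches the command and returns its exit code.
        code = runner(argv)
        if code == 0:
            break
    return code
```

The function returns the last exit code, so the caller can still decide to fail the surrounding batch job when every attempt fails.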
Multi-Threading and Multi-Processing
Data Quality applications are multi-threaded and therefore suited for multiple CPU environments. Multi-
threading allows an application to make use of multiple CPUs to improve throughput. On a single CPU, multi-
threading also allows an application to make use of a CPU while a slow input or output operation takes place.
However, multi-threading is not the only way to improve throughput. Multi-processing can split a problem
between multiple computing devices or multiple CPUs on a single device.
With multi-processing, you can decide how the best possible throughput can be achieved by dividing a problem
into several different jobs. Each job then executes and solves a part of the overall problem. There are two
major differences between this approach and multi-threading:
Jobs can run on multiple devices and can provide greater computational power than any single device can
offer.
You might be able to accelerate processing beyond speeds possible with a generic threading approach.
Multi-processing and multi-threading provide complementary approaches to increasing throughput.
With Data Quality installed on a single machine, you can execute multiple processes concurrently, each process
applying the same Data Quality plan to different parts of an overall dataset, and thus achieve greater
throughput efficiency.
For example when matching large datasets, you might have six processes running on a four-CPU system, with
each process tackling a different cluster of records. Each Data Quality process executes against only those
clusters assigned to it.
The processing requirements of each cluster increase exponentially with the number of records in the cluster.
Typically one process is assigned only a few very large clusters while other processes are assigned a large number
of small clusters. Each process performs the same amount of work and each contributes to the overall operation.
A similar approach applies to the standardization of records. In this case, each Data Quality process executes on
a subset of the data. As the time taken to process the overall dataset increases linearly with the number of
records, it is a simple task to distribute the processing load across multiple Data Quality processes executing on
one or more CPUs within one or more computing hosts.
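The distribution of clusters across processes described above can be sketched as a greedy load-balancing step. This is an illustration only and not the product's scheduling logic: the function name is hypothetical, and the default cost model (the square of the cluster size) is an assumption standing in for whatever growth rate applies to your matching plan.

```python
import heapq
from typing import Callable, Dict, List


def assign_clusters(
    cluster_sizes: Dict[str, int],
    process_count: int,
    cost: Callable[[int], float] = lambda n: float(n * n),
) -> List[List[str]]:
    """Greedily give each cluster to the process with the least estimated work."""
    if process_count < 1:
        raise ValueError("process_count must be at least 1")
    # Heap entries are (estimated load, process index).
    heap = [(0.0, i) for i in range(process_count)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(process_count)]
    # Place the most expensive clusters first so small ones can even out the loads.
    ordered = sorted(cluster_sizes.items(), key=lambda kv: cost(kv[1]), reverse=True)
    for name, size in ordered:
        load, index = heapq.heappop(heap)
        assignment[index].append(name)
        heapq.heappush(heap, (load + cost(size), index))
    return assignment
```

With one large cluster and several small ones, this approach tends to leave one process with the large cluster alone while the others share the small clusters, which mirrors the pattern described above.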
Security
Note the following security-related details:
To avoid storing potentially sensitive passwords in plain text, Data Quality can encrypt plan and parameter
file passwords.
The Data Quality installer on UNIX prevents the product from being installed by any user with root
privileges. On UNIX, Data Quality requires no special user privileges, other than write access to /tmp.
Consequently, a system administrator can restrict and control access to the product in the same manner as
access to any other user-level application.
The Data Quality staging area is configured by default to permit access to the underlying MySQL database
to local users only. Extending access privileges requires the explicit granting of access to other users.
APPENDIX A
Global AV: Match Status and Match
Code Information
The appendix includes the following topics:
Overview, 131
Countries Processed by QAS, 132
Countries Processed by Melissa Data, 137
Countries Processed by Address Doctor, 139
Overview
This appendix describes the range of Match Status and Match Code output values that the Global AV
component can return during address validation. The Global AV returns these values along with address line
data and other data that can assist in postal delivery.
The Match Status and Match Code outputs do not contain information relevant to postal delivery. They
contain information on the success or otherwise of attempts by the Global AV to validate an input address
against address reference data.
The Global AV uses specialist address processing engines that install with Data Quality to validate address data.
The range of Match Status and Match Code values returned by your address validation plans depends on the
engines that process your data, and the Global AV uses different engines to process data for different countries.
Use the following table to determine which engines process address data in your plans.
If Your Address Is From This Country...    The Global AV Uses This Engine
Australia                                  QAS (See page 132)
Canada                                     Melissa Data (See page 137)
Denmark                                    QAS (See page 132)
France                                     QAS (See page 132)
Luxembourg                                 QAS (See page 132)
Netherlands                                QAS (See page 132)
Singapore                                  QAS (See page 132)
United Kingdom                             QAS (See page 132)
United States                              Melissa Data (See page 137)
Other countries                            Address Doctor (See page 139)
The Global AV performs two checks to determine the country of origin for an address and thus the engine that
will process it:
It analyzes the address against the default country dataset as specified on its Parameters tab.
If it cannot validate the address against the default country, it looks for a valid ISO country code and
analyzes the address against the reference data for that country, if such data is installed.
Countries Processed by QAS
Data Quality installs the QAS Batch API and uses it to process data from the following countries: Australia,
Denmark, France, Luxembourg, Netherlands, Singapore, United Kingdom.
Match Status Information
The Match Status field contains a text summary of the results of the validation process for the record.
Table A-1 lists the possible return values:
Table A-1. Match Status Values For QAS Reference Data
Validated     Good Match          Partial Match      Tentative Match
Poor Match    Multiple Matches    Foreign Address    Unmatched
Note: The Global AV returns the same match status options for data processed by the Melissa Data and QAS
engines.
Match Code Information
The QAS Batch API writes an alphanumeric match code representing the level of success achieved for each
address it analyzed against the country reference data. The format of the code is:
R933000000000000000000000000
The elements of the code are explained below.
The first four characters of the match code provide the following information:
Table A-2. Match Code Descriptions
Match Code Type           Position    Description
Match Success             First       A single, upper case letter representing the level of success achieved in
                                      matching the address.
Match Confidence Level    Second      A single number representing how good the validation tools consider the
                                      match to be.
Postal Code Action        Third       A single number indicating the changes that were made to the postal code.
Address Action            Fourth      A single number indicating the changes that were made to the address.
For example, in the following code, the first four characters [R933] represent the match success, match
confidence level, postal code action, and address action:
R933000000000000000000000000
The twenty-four subsequent characters can be read as three eight-character groups of generic information bits,
country information bits, and supplemental country information bits. This appendix describes the first four
characters and the generic information bits in the code.
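The layout described above can be sketched as a small parser that splits a match code into its four leading characters and three eight-character groups. The class and function names are hypothetical, and the sketch assumes that every match code is exactly 28 characters long, as in the example.

```python
from typing import NamedTuple


class MatchCode(NamedTuple):
    success: str            # first character: match success letter
    confidence: int         # second character: match confidence level
    postal_action: int      # third character: postal code action
    address_action: int     # fourth character: address action
    generic_bits: str       # eight-character generic information bits
    country_bits: str       # eight-character country information bits
    supplemental_bits: str  # eight-character supplemental country information bits


def parse_match_code(code: str) -> MatchCode:
    """Split a 28-character match code into its documented parts."""
    if len(code) != 28:
        raise ValueError("expected 4 leading characters plus 24 information characters")
    digits = code[1:4]
    if not (code[0].isalpha() and code[0].isupper()):
        raise ValueError("the first character must be an upper case letter")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("characters two to four must be digits")
    return MatchCode(
        code[0], int(code[1]), int(code[2]), int(code[3]),
        code[4:12], code[12:20], code[20:28],
    )
```

For the example code, parse_match_code returns success letter R, confidence 9, postal code action 3, address action 3, and three groups of eight zeros.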
Match Success
The letter at the beginning of the match code indicates how successfully the validation tools have been able to
match the input address to an address in the selected databases.
At a high level, the values of the match success letter are split into two ranges. Codes A through D indicate
that the input address was not processed. Codes K through R indicate that the input address was processed.
When the validation tools return a Q or R, along with a match confidence level of 9, you can be sure they have
found the right match.
Table A-3 describes all match success codes:
Table A-3. Match Success Codes
Code  Description
      Returned when...
A     Unprocessed.
      Results cannot be returned for the input address. This reflects an internal processing issue that
      should not occur during normal usage.
B     Blank.
      Validation tools find no data in the input address or find too insignificant an amount of data to
      return an address.
C     Country not available.
      Your input address contains a country name for which the appropriate country database is not
      installed.
D     Unidentified country.
      No default country has been configured and validation tools are unable to determine the country of
      origin for the record.
K     No address or postcode can be derived.
      Validation tools cannot find data matching the input address. This might occur if the input address
      does not contain a country name and does not match anything in the default country database.
      For example, if you process the following address against the UK country database, the validation
      tools return K because (i) they cannot find any matching street names and (ii) they have no other
      information such as a locality or postal code to search on:
      42 Durlston Square
L     Postcode found, but no address can be derived.
      Validation tools derive a valid postal code from the input address but no address information.
M     Multiple addresses found, but no postcode.
      The input address matches more than one address in the database. For example, the following address
      finds four matches in the UK country database:
      146 High Street, Cambridge
N     Multiple addresses found with postcode.
      Validation tools find more than one matching address within a postal code. This is most likely to
      occur where a postal code covers large areas, such as in Australia.
      For example, the following Australian address has two possible matches, as it exists in the
      localities of Kingsholme and Ormeau:
      25 Cliff Barrons Rd, QLD, 4208
O     Partial address found, but no postcode.
      Validation tools find a partial address to match your input, but they cannot return a full postal
      code, because the partial address is covered by more than one postal code. This might occur if the
      input address has a missing or invalid property number. The validation tools cannot determine the
      correct property number, and return as much of the address as they can.
      For example, in the following UK address, number 70 does not exist:
      70 Glebe Road, Long Ashton, Bristol
      Because no postal code is included in the input address, the validation tools do not know which of
      two possible postal codes to return:
      Glebe Road, Long Ashton, Bristol
P     Partial address found with postcode.
      Validation tools have found a partial address that matches your input, including a postal code.
      Either the input postal code was valid, or the validation tools have found a single postal code for
      the partial address.
      For example, for the following Australian address, the validation tools are able to add a postal
      code and state code, although the lack of property number prevents the return of a full address:
      Robertson St, Sherwood
Q     Full address found, but no postcode.
      Validation tools find a full address that matches your input data, but cannot find a full postal
      code to go with it. This can happen when a country database, such as the Irish database, does not
      include postal codes for every address.
R     Full address and postcode found.
      Validation tools make a full match, either by verifying a correct input address or by locating a
      full address from partial input data. The following examples all return R matches:
      14 Carnaby St, London
      Grimmstr 5, 79848 Bonndorf
      Sintelweg 10, 9364 Nuis
      19 Meyer Place, Melbourne, Victoria 3000
      However, an R match only signifies that a full address and postal code have been returned. It does
      not necessarily mean that the address is the one you want. You can gauge the likelihood of a correct
      match with the match confidence level.
Match Confidence Level
The first number in the match code, the second character in the code, indicates the confidence regarding a
particular match. There are three levels of confidence:
Low. Essential matching rules were not satisfied.
Intermediate. Less important rules were not satisfied, or another check failed (for example, the input address
is not in the order expected).
High. Matching rules were satisfied.
Because the completeness of the returned address is determined by the match success letter, the validation tools
can return an R match with low confidence, indicating that although they have found a complete and correct
address, they are not sure that it is the same address as the input.
Table A-4 describes match confidence level codes:
Table A-4. Match Confidence Level Codes and Descriptions
Code  Description
0     Low confidence.
5     Intermediate confidence.
9     High confidence.
Low Confidence: 0
The validation tools set the confidence level to 0 when they find a match that differs considerably from the
input address. Take this Australian address:
Music St, Carmilla, QLD, 4739
The validation tools return the nearest guess:
Lot 10, Music St, Carmilla, QLD, 4739
Because this is a full address, it is given an R match success letter. However, since the input address did not
contain any premises information, the validation tools cannot be confident that this is the right match for the
input data.
Intermediate Confidence: 5
The validation tools return a confidence level of 5 when a correct match is likely. This might occur if the input
address is slightly inaccurate. Consider this UK input address:
3 Marine Terrace, Abardeen, AB11 7SF
In this example, the town name should be Aberdeen. However, the validation tools are able to find the correct
address. Only the misspelling prevents a full-confidence match.
High Confidence: 9
The validation tools return a 9 when they are sure that the output address matches the input data. This happens
when an input address is fully accurate, or when incomplete address data is sufficiently detailed to append the
remaining address details.
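The match success letter and the confidence level are usually read together. The sketch below combines the two according to the rules in this appendix: letters A through D mean the address was not processed, letters K through R mean it was processed, and a Q or R with a confidence level of 9 indicates a reliable match. The function name and the returned keys are hypothetical.

```python
from typing import Dict

NOT_PROCESSED = "ABCD"
PROCESSED = "KLMNOPQR"
CONFIDENCE_LABELS = {0: "low", 5: "intermediate", 9: "high"}


def summarize_match(code: str) -> Dict[str, object]:
    """Interpret the first two characters of a match code."""
    if len(code) < 2:
        raise ValueError("a match code needs at least two characters")
    letter, digit = code[0], code[1]
    if letter not in NOT_PROCESSED + PROCESSED:
        raise ValueError(f"unknown match success letter: {letter!r}")
    if digit not in ("0", "5", "9"):
        raise ValueError(f"unknown match confidence level: {digit!r}")
    level = int(digit)
    return {
        "processed": letter in PROCESSED,
        "confidence": CONFIDENCE_LABELS[level],
        # Q or R together with confidence 9 is described as a sure match.
        "reliable": letter in "QR" and level == 9,
    }
```

A downstream step could use the reliable flag to route records that need manual review, such as an R match returned with low confidence.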
Postal Code Action Indicator
The second number in the match code, the third character in the code, indicates the action performed by the
validation tools on the postal code. Table A-5 describes the four possible values for this number:
Table A-5. Postal Code Action Indicator Codes and Descriptions
Code  Description
3     The existing postal code has been corrected.
2     A postal code has been added.
1     The existing postal code was already correct.
0     No action was taken.
Address Action Indicator
The third number in the match code, the fourth character in the code, indicates the action performed by the
validation tools on the address. Table A-6 describes possible values for this number:
Table A-6. Address Action Indicator Codes and Descriptions
Code  Description
3     All or part of an address was returned. The quantity of the address is indicated by the match
      success letter.
2     The matched address was enhanced with additional information.
0     No action was taken because the supplied address was not matched.
Generic Information Bits
The eight-character hexadecimal information bits that are returned with a match code indicate why a match has
a reduced confidence level. These bits can be ORed together when more than one applies. For example, if the
information bits 05000000 are returned, this means that extra numbers were found in the address (01000000)
and no place element was found (04000000).
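The ORing of information bits can be reversed with a bitwise AND against each single-bit mask. The following sketch decodes an eight-character value into its individual bits. The function name is hypothetical, and the lookup holds only a few of the descriptions from Table A-7 for illustration. Bits missing from the lookup are reported by their mask value.

```python
from typing import List

# A small, illustrative subset of the generic information bits.
GENERIC_BITS = {
    0x01000000: "Extra numbers were found in the address",
    0x04000000: "No place element was found in the address",
    0x00100000: "One or more essential matching rules were not satisfied",
    0x00200000: "A timeout has occurred and the address has not been matched",
}


def decode_generic_bits(bits: str) -> List[str]:
    """List the meaning of each bit set in an eight-character hexadecimal value."""
    if len(bits) != 8:
        raise ValueError("expected eight hexadecimal characters")
    value = int(bits, 16)
    found = []
    # Test each of the 32 single-bit masks, from the highest to the lowest.
    for shift in range(31, -1, -1):
        mask = 1 << shift
        if value & mask:
            found.append(GENERIC_BITS.get(mask, f"{mask:08X} (not listed in this sketch)"))
    return found
```

For the value 05000000, the function reports the 04000000 bit first and the 01000000 bit second, because it scans from the highest bit down.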
Table A-7 describes the generic information bits:
Table A-7. Information Bit Descriptions
Information Bit  Description
10000000 The elements in the input address were not in the expected order. For example, in the following address, the postal
code should appear at the end of the address:
7 Old Town, SW4 0JT, London
20000000 Preferred matching rules were not satisfied, so the match is marked with intermediate confidence at best.
40000000 Close matching rules were not satisfied, so the match is marked with intermediate confidence at best.
01000000 Extra numbers were found in the address. For example, with the following address:
Flat 2, 12 10, Abbeville Road, London
A full match is achieved with:
Flat 2, 10, Abbeville Road, London, SW4 9NJ
But the additional number 12 might reduce the confidence level to intermediate.
02000000 Additional text between a number and the expected adjacent component has been found, such as extra text between a
property number and a street name. The confidence level is reduced to intermediate.
04000000 No place element, such as a locality in Australia, was found in the address, so the confidence level might be reduced.
08000000 An item associated with a number is missing. For example, the following British address should include the street name
Ash Gardens after the building number:
4, South Marston, Swindon, SN3 4XX
00100000 One or more essential matching rules were not satisfied, so the match confidence is reduced to low.
00200000 A timeout has occurred and the address has not been matched.
00400000 The input address was a superset of the address in the country database. For example, the following input address
contains more information than the official version, which does not contain the Village Arcade element:
Village Arcade, 5 Hillcrest Road, Pennant Hill, NSW, 2120
00800000 A leading number was unused in the input address. For example, L 5 is not found in the official address:
L 5, 2/6 The Bollard, Corlette, NSW
00010000 There was ambiguity in the supplied range in the input address. For example, the following address has an ambiguous
range because there is a 26, 28 and 30 Delhi Street and the input address cannot be matched to a specific property:
26-30 Delhi Street, Adelaide SA 5000
00020000 A street descriptor has been added or has changed. For example, the correct descriptor Street is returned instead of
Road in the following address:
10 Railway Road, Serviceton, Vic
00040000 Additional text in the input address was too significant to ignore. For example, the following French address contains the
unmatched significant information, CEDEX 11:
18 Boulevard Voltaire, 75011 Paris CEDEX 11
This returns an intermediate confidence level.
00080000 There was an error in the input street name that the validation tools have amended.
00001000 There was an error in an input place name. This has been corrected by the validation tools.
00002000 The validation tools have added or changed a premises number or range, such as a building number in Australia data
where a single number matched to a range, or organization names in French data.
00008000 A name was used to secure an address match.
00000100 The address line is too small to contain each address element. Increase the size of the address lines to avoid truncating
address elements.
00000200 Entire address elements are unable to fit on the address line. Increase the size of the address lines to ensure all
address elements are visible.
00000400 The validation tools failed to generate one or more non-address items. It is likely that the DataPlus set could not be
opened.
00000800 When in enhanced cleaning mode, the validation tools cannot fill the unmatched address elements back into the
database. To resolve this, increase the size of the address lines or add additional lines.
00000010 To produce an enhanced address, PAF address elements are moved to the right or downwards to allow unmatched
elements to be incorporated.
00000020 The validation tools have determined that the supplied address has been significantly cleaned. This can include spelling
corrections, changes in capitalization, or the reformatting of the input address elements. Quotes and spaces are ignored
during the validation process.
00000040 Key input address elements were judged correct as supplied, but the format of the output address might have been
changed. For example, address elements may have been expanded or abbreviated, or capitalization changed.
00000080 If you defined InputLineCount and the input line count does not match the number of lines defined in the input search
string, this bit will be set. This bit does not affect match confidence.
00000001 Strict matching rules were not satisfied, so the match is marked with intermediate confidence at best.
00000002 The validation tools have found a premises-level partial address match.
00000004 The validation tools have found a street-level partial address match.
00000008 The validation tools have found a place-level partial address match.
Note: The validation engine returns two further sets of eight-character hexadecimal information bits: Country
information bits and Extended country information bits. This guide does not describe these information bits.
Countries Processed by Melissa Data
Data Quality installs the Melissa Data processing engine and uses it to process data from the United States and
Canada.
Match Status Information
The Match Status field contains a text summary of the results of the validation process for the record.
Table A-8 lists the possible return values:
Table A-8. Match Status Values For Melissa Data Reference Data
Validated     Good Match          Partial Match      Tentative Match
Poor Match    Multiple Matches    Foreign Address    Unmatched
Note: The Global AV returns the same match status options for data processed by the Melissa Data and QAS
engines.
Match Code Information
The Match Code field represents the level of validation achieved for the input address. The output values for
this field cover both successful and unsuccessful attempts to validate the address against the reference data.
Table A-9 describes the possible outputs for this field:
Table A-9. Match Code Values For Melissa Data Reference Data
Code | Error String (Optional) | Description
Empty | No Error | Address is correct.
C | Canadian Postal Code | A Canadian Postal Code was passed to the engine, and either the PathToCanadianData property was not set or the current license string only permits the address object to validate U.S. addresses.
2 | Address2 Coded | The engine could not validate the contents of the Address property and the contents of the Address2 property were validated instead.
6 | | A Canadian address was fully validated.
7 | | (U.S. only.) There were multiple matches for the address but they were all in the same Zip Code and carrier route. The Zip Code and carrier route returned is correct, but lacks the final four-digit extension.
9 | | (U.S. only.) The address was fully validated.
D | Demo Mode | In Demonstration mode you can verify Nevada addresses only.
E | Expired Database | The current files have expired.
F | Non-Canadian Postal Code | A non-Canadian Postal Code was passed to the engine, and only the PathToCanadianData property (and not the PathToUSFiles) was set.
M | Multiple Matches | More than one record matches the address, and there is not enough information available in the input address to break the tie between multiple records. Passing information such as city/municipality names or urbanization names can help reduce the number of multiple match errors.
N | No Street Data for ZIP/Postal Code | The Zip/Postal Code exists but no streets begin with the same letter in that Zip/Postal Code.
R | Address out of Range | The address was found in the reference data, but the street number in the input address was not between the low and high range of the post office database.
S | Invalid Suite (Canadian Addresses Only) | The suite was missing or not correct. Canadian addresses cannot be coded to default site addresses as they can be in the U.S.
T | Component Mismatch | Either the directionals or the suffix field did not match the post office database, and there was more than one choice for correcting the address. For example, if the given address was 100 Main St and the only addresses found were 100 E Main St and 100 Main Ave, the error code T would be returned because it is unclear whether to add the directional E or change the suffix to Ave.
U | Unknown Street | An exact street name match could not be found, and phonetically matching the street name resulted in either no matches or matches to more than one street name.
W | Early Warning System | This address has been identified in the Early Warning System (EWS) data file, and should be included in the next national database update.
X | Non-Deliverable Address | The physical location exists, but there are no homes on this street. One reason might be railroad tracks or rivers running alongside this street, as they would prevent construction of homes in this location.
Z | ZIP/Postal Code Error | The Zip/Postal Code does not exist and could not be determined by the city/municipality and state/province.
Countries Processed by Address Doctor
Data Quality installs the Address Doctor processing engine and uses it to process data from several countries. If your plan contains addresses from countries other than those covered by the QAS or Melissa Data engines, the Global AV attempts to validate them using the Address Doctor engine.
The Address Doctor engine also returns ElementMatchStatus and ElementResultStatus information.
Match Status Information
The Match Status field contains a text summary of the results of the validation process for the record.
Table A-10 lists the possible return values:
Table A-10. Match Status Values For Address Doctor Reference Data
Validated
Corrected
Good deliverability
Unmatched
Not processed
Match Code Information
The Match Code field represents the level of validation achieved for the input address. The output values for this field cover both successful and unsuccessful attempts to validate the address against the reference data.
Table A-11 describes the possible outputs for this field:
Table A-11. Match Code Values For Address Doctor Reference Data
Code Description
V Data correct on input (validated)
C Data corrected by Address Doctor
P3 Data cannot be corrected, but very likely to be deliverable
P2 Data cannot be corrected, but fair chance that the address is deliverable
P1 Data cannot be corrected and unlikely to be deliverable
Q3 Suggestions are available
Q2 Suggested address is not complete (enter more information)
Q1 No suggestions are available. A query against the database was performed but yielded no results
N5 Insufficient information to generate suggestions. No query against the database was performed.
N4 Validation method not yet called (after parsing operation)
N3 No validation performed because country not unlocked
N2 No validation performed because reference database not found or not available
N1 No validation performed because country not recognized
Element Match Status and Element Result Status Information
In addition to Match Status and Match Code outputs, the Address Doctor engine generates ElementMatchStatus and ElementResultStatus outputs. An ElementMatchStatus or ElementResultStatus field contains an eight-digit string in which each digit represents a unique address element. For example, the first digit represents postal code information and the second digit represents locality information. At each position in the string, the digit returned indicates the quality of validation obtained for that address element.
The positions in the string represent the following address elements:
Position Description
1 Postal Code
2 Locality
3 Province
4 Street
5 House Number
6 PO Box
7 Building
8 Organization
ElementMatchStatus
The numeric values at each position in the ElementMatchStatus string indicate the quality of match found for the address elements in the input address.
A good match does not mean that the element is correct in the context of the overall address. For example, if the postal code and city name both exist in a given country, they will both return a code of 4, indicating a perfect match. However, the postal code could relate to a different city. Thus an address can have an element match status consisting only of 4s and be incorrect.
The possible values for ElementMatchStatus are:
Value Description
0 Element empty on input.
1 No match. Element was checked against reference data but could not be found.
2 Element was not validated, either because no reference data is available or the element is too incomplete, such as a single letter in locality field.
3 Match with error. For example: Boton to Boston.
4 Perfect match.
ElementResultStatus
The numeric values at each position in the ElementResultStatus string indicate if and how the validated output fields have been changed from the input fields. The possible values are:
Value Description
0 Element not provided.
1 Element not checked. Using parsed input for output.
2 Element not checked. Using parsed and standardized input for output. For example, Street replaced by ST.
3 Element was corrected using postal reference data.
4 Element was correct but was changed to a synonym. For example, Munich replaced by München.
5 Element was correct but was standardized. For example, Maryland replaced by MD.
6 Element was correct and is unchanged.
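The eight-digit status strings described above can be unpacked position by position. The following Python sketch is illustrative only and is not part of Data Quality: the function name `decode_element_match_status` and the shortened value labels are assumptions, while the position order and value meanings are taken from the tables above.

```python
# Illustrative sketch only; names are hypothetical, not a Data Quality API.
# Position order follows the Position table (1 = Postal Code ... 8 = Organization).
ELEMENTS = (
    "Postal Code", "Locality", "Province", "Street",
    "House Number", "PO Box", "Building", "Organization",
)

# Shortened labels for the ElementMatchStatus values 0 to 4.
MATCH_VALUES = {
    "0": "Element empty on input",
    "1": "No match",
    "2": "Not validated",
    "3": "Match with error",
    "4": "Perfect match",
}


def decode_element_match_status(status):
    """Map each digit of an ElementMatchStatus string to its address element."""
    if len(status) != len(ELEMENTS) or not status.isdigit():
        raise ValueError("expected an eight-digit string")
    return {
        element: MATCH_VALUES.get(digit, "Unknown value")
        for element, digit in zip(ELEMENTS, status)
    }


decoded = decode_element_match_status("44341000")
for element, meaning in decoded.items():
    print(f"{element}: {meaning}")
```

The same approach applies to ElementResultStatus if the value labels are swapped for the 0 to 6 values in its table. As noted above, a string of all 4s does not by itself prove that the address as a whole is correct.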
Appendix B
Global AV Output Descriptions
This appendix includes the following topics:
Overview
Global AV Output Fields By Country
Output Field Definitions
Overview
This appendix describes the output columns that you can select for the countries whose input addresses you can
validate with the Global AV component. The appendix contains three tables.
Table B-1 lists the abbreviations used for each country in this appendix.
Table B-2 lists the fields that the Global AV populates for each country. The Global AV does not populate every field for every country.
Table B-3 defines the output fields.
Country Abbreviations
Table B-1 lists the abbreviations for the countries for which the Global AV writes output values:
Table B-1. Global AV Country and Abbreviation List
Country Name Abbreviation
Australia AUS
Canada CAN
Denmark DNK
France FRA
Luxembourg LUX
The Netherlands NLD
Singapore SGP
United Kingdom GBR
United States US
All Other Countries Other
Global AV Output Fields By Country
Table B-2 lists the output fields available on the Global AV component and specifies the fields that the
component can populate for supported countries.
Table B-2. Global AV Output Fields - Country Coverage
Output Field US CAN AUS NLD DNK FRA GBR SGP LUX Other
Match Status Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Match Code Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Address1 Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Address2 Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Address3 Yes Yes Yes Yes Yes Yes Yes Yes
Address4 Yes Yes
Locality_Line1...4 Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Formatted_Address_Lines1...10 Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Organization Yes Yes Yes Yes Yes
Building Yes Yes Yes Yes Yes Yes
Building2 Yes Yes Yes Yes Yes
Sub Building Yes Yes Yes Yes Yes Yes Yes
Sub Building2 Yes Yes Yes Yes Yes
House Number Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Street Name Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
City Abbreviation Yes Yes
Locality/City Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Additional Locality Yes
Dependent Locality Yes Yes
Dependant Thoroughfare Yes
Thoroughfare Yes
Double Dependant Locality Yes
Province/State Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Postal Code/Zipcode Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Zip Plus 4 Yes
PO Box Yes Yes Yes Yes Yes Yes Yes Yes
Country Name Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Country Code ISO 3 Letter Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
Carrier Route Yes
Delivery Point Code Yes Yes
Delivery Point Check Digit Yes
County FIPS Yes
Address Type Code Yes
Address Type String Yes Yes
Urbanization Yes
Congressional District Yes
Private Mailbox Yes
Time Zone Code Yes
Time Zone Yes
MSA Yes
PMSA Yes
Suite Status Code Yes
EWS Flag Yes
Zip Type
Parsed Pre-Direction Yes Yes
Parsed Suffix Yes Yes
Parsed Post-Direction Yes Yes
Parsed Suite Name Yes Yes
Parsed Suite Range Yes Yes
Parsed Private Mailbox Name Yes
Parsed Private Mailbox Number Yes
LACS Yes
LACS Link Indicator Yes
LACS Link Return Code Yes
Element Match Status Yes
Element Result Status Yes
Delivery Point Identifier Yes
Post Office Name Yes
Sub-Region/County Yes
Residue Yes
Output Field Definitions
Table B-3 describes the output fields populated by the Global AV.
Table B-3. Global AV Output Field Definitions
Output Field Description
Match Status A text summary of the results of the validation process for the record. For more information, see page 131.
Match Code Code representing the level of success achieved for each address it analyzed against the country reference data. Can be alphabetic, numerical, or alphanumeric. For more information, see page 131.
Address1...Address4 Street address information. These fields enable you to build a formatted address in conjunction with the Locality_Line[n] fields.
Locality_Line1...Locality_Line4 Last line information. These fields enable you to build a formatted address in conjunction with the Address[n] fields. Only Locality_Line1 is used here.
Formatted_Address_Lines1...10 Standard address label lines, as dictated by the postal service standard (up to 5 lines are populated).
Organization Company or organization name. This field is typically unpopulated.
Building Building name
Building2 Building 2 name
Sub Building Sub-building information, for example allotment number, flat or unit name, building level
Sub Building2 Sub-building2 information.
House Number Provided as Primary Range, for example 100 from 100 Main St.
Street Name Street name with street indicator, for example Main Street. Global AV does not return a street
indicator for US and Canada addresses.
City Abbreviation Official thirteen-letter city name abbreviation associated with the input address. US and Canada
addresses only.
Locality/City Locality name (including name from unvalidated addresses).
Additional Locality Locality sort code, for example '3' in addition to 'Praha' in 'Praha 3'.
Dependent Locality Town within Post Town area. This field can contain Urbanization information for Puerto Rico (US)
addresses.
Dependant Thoroughfare Additional street name information.
Thoroughfare Street name information.
Double Dependant Locality Town within Dependent Locality.
Province/State Province or state name.
Postal Code/Zipcode Post code or Zip code.
Zip Plus 4 Nine-digit Zip code (five-plus-four digits).
PO Box Post Office Box identifier.
Country Name Country name
Country Code ISO 3 Letter Three-character ISO country code.
Carrier Route A carrier route is a group of addresses assigned a common code by the United States Postal
Service. The carrier route code contains nine characters - a five-digit Zip Code, a single letter for
the carrier route type, and three digits for the carrier route code. Typically, a carrier route relates
to where a particular mail carrier delivers.
Delivery Point Code A delivery point code is a two-digit value from 00 to 99 assigned to every address. Combined with the nine-digit Zip code, the delivery point code provides a unique identifier for every deliverable address served by the United States Postal Service.
Delivery Point Check Digit Check digit enabling the engine to determine if the Delivery Point information contains an error.
County FIPS County code defined by the Federal Information Processing Standard.
Address Type Code Single-character code indicating the address type, such as F (Firm or company address), G
(General Delivery address), H (Highrise or business complex), P (PO Box address), R (Rural
route address), or S (Street or residential address).
Address Type String Descriptive string related to the Address Type Code.
Urbanization Area, sector, or residential development within a geographic area.
Congressional District Congressional District that contains the address.
Private Mailbox Flag indicating that the address is a private mail box associated with a Commercial Mail
Receiving Agency.
Time Zone Code Numerical code indicating the time zone associated with the input address. Does not account for
Daylight Saving Time. For example: 0 (Military Time - APO or FPO), 4 (Atlantic Time), 5 (Eastern
Time), 6 (Central Time), 7 (Mountain Time), 8 (Pacific Time), 9 (Alaska Time), 10 (Hawaii Time),
11 (Samoa Time), 12 (Marshall Is. Time), 14 (Guam Time), 15 (Palau Time).
Time Zone Text description of the time zone code.
MSA MSA number. A Metropolitan Statistical Area consists of one or more counties forming a large
population with adjacent communities and having a high degree of social and economic
integration.
PMSA Primary Metropolitan Statistical Area - an MSA region with a population of more than one
million people.
Suite Status Code Status code indicating the level of success with which the engine validated the suite/apartment
number of an input address. Returned values are M (A suite number is required but is missing
from the input record), R (A suite number is present but is either not required or is out of range for
the given street address), V (The suite field was verified), X (The suite field was not coded).
EWS Flag Flag indicating that the address was found in the Early Warning System database.
Zip Type Zip code type. Returned values are P (Zip code used for PO Boxes), U (Unique: A code assigned
to an organization or government institution), M (Military: A ZIP Code assigned to an APO/FPO),
and Empty/no code returned (a standard Zip code).
Parsed Pre-Direction For example, S, SE, SW.
Parsed Suffix For example, St, Ave.
Parsed Post-Direction For example, S, SE, SW.
Parsed Suite Name For example, #, APT, FL, TRLR, UNIT.
Parsed Suite Range Secondary range, e.g. 23 from Suite 23.
Parsed Private Mailbox Name Name of a private mail box associated with a Commercial Mail Receiving Agency.
Parsed Private Mailbox Number Number of a private mail box associated with a Commercial Mail Receiving Agency.
LACS One-character string indicating if the input address has been converted to a city-style street
address format. Returns L if the address has been converted, and otherwise is blank.
LACS Link Indicator One-character string indicating if the input address has been converted to a city-style street
address format. Returns L if the address has been converted, and otherwise is blank.
LACS Link Return Code Code indicating the degree to which the submitted address was matched to the LACSLink data
and if the address was updated. Returned values are A (LACS Record Match - The input record
matched to a record in the master file, and a new address could be furnished), 00 (No Match - a
new address could not be furnished), 14 (The input record matched to a record in the master file,
but the new address could not be converted to a deliverable address), 92 (The input record
matched to a master file record, but the input address had a secondary number and the master
file record did not. The record is a ZIP + 4 street level or highrise match.)
Element Match Status Eight-digit string indicating the match status of each address component.
Element Result Status Eight-digit string indicating if and how each address component has been modified during the
validation process.
Delivery Point Identifier Eight-digit number that uniquely identifies a delivery point.
Post Office Name Name of the post office.
Sub-Region/County Subregion or county information.
Residue Information not parsed by the engine.
Appendix C
Rule Based Analyzer Rule Statements
This appendix includes the following topics:
Overview
Functional Operators
Overview
When working with the Rule Based Analyzer, note the following points:
1. The rules are defined in a rule block.
2. Rule blocks contain a sequence of IF statements and assignment statements.
3. IF statements have the following form:
// Primary condition
IF <boolean expression>
    THEN <Rule Block>
// Optional arbitrary number of elseifs
ELSEIF <boolean expression>
    THEN <Rule Block>
// Optional else
ELSE <Rule Block>
ENDIF
The definition of a rule block allows for IF statements to be nested. Each IF statement must be closed by
the ENDIF keyword.
Examples of IF statements:
IF input1 = "" // Testing if input 1 is empty
    THEN output1:= "Empty Input"
ENDIF

IF (input1 < 100) and (input2 < 100)
    THEN output1:= 0
ELSEIF input1 > 100
    THEN output1:= input1
ELSEIF input2 > 100
    THEN output1:= input2
ELSE output1:= 100
ENDIF
4. You can add single-line text comments to logical expressions. A comment starts with two forward slashes (//).
5. Assignment statements have the following form:
OUTPUTX:= <expression>
(Where X ranges from 1 to the maximum output number.)
For example:
output1:= input1 * 123.5
6. Every expression has a type that is a Boolean, an integer, a floating point value, or a string. Expressions can
be simple constant values, inputs, outputs, or operations. For example:
123 // Integer
"123" // String
123.5 // Float
Input1 // Input 1 type and value
Output3 // Output 3 type and value
100 + 2 // Integer addition operation
7. Operations are composed of operators and their arguments.
Table C-1 lists operators you can use when building a rule:
Table C-1. Operators
Operator Types | Operators
Prefix operators that take Boolean arguments | NOT
Infix operators that take Boolean arguments | AND, OR, XOR (Exclusive or =)
Prefix operators that take numerical arguments (integer or float) | - (Negative)
Infix operators that take numerical arguments (integer or float) | = (Equal), <> (Not equal), < (Less than), <= (Less than or equal to), > (Greater than), >= (Greater than or equal to), - (Minus), + (Plus), * (Multiply), / (Divide), % (Modulo), ^ (Power)
Operators that take String arguments | = (Equal), <> (Not equal), & (Concatenate)
Functional Operators
The Rule Based Analyzer accepts several functional operators in rules. You can apply them in the Rule wizard and in Expert Mode. The operators ISNUMBER and ISDATE appear as options in IF statements only.
Use the following rules and guidelines when you use functional operators:
Operators that expect float arguments attempt to convert string arguments to floating point numbers where possible.
The string concatenate operator [&] converts arguments to strings.
Operators display an error message if an automatic conversion between types fails.
The Rule Based Analyzer accepts all Gregorian dates.
Date functions do not accept leading or trailing spaces.
Table C-2 describes the functional operators you can use when building a rule:
Table C-2. Functional Operators
Functional Operator | Returns | Description
ISNUMBER (expression e) | Boolean | Returns true if the expression can be evaluated as a number.
ISDATE (expression e) | Boolean | Returns true if the expression can be evaluated as a date. Dates must be in the DD/MM/YYYY format.
TOINT (expression e) | Integer | Converts an expression to an integer.
TOFLOAT (expression e) | Float | Converts an expression to a floating point value.
TOSTRING (expression e) | String | Converts an expression to a string.
STRLEN (string s) | Integer | Returns the number of characters in s.
LEFTSTR (string s, integer n) | String | Returns the leftmost n characters of the input string s. If n is greater than the length of s, then s is returned.
RIGHTSTR (string s, integer n) | String | Returns the rightmost n characters of the input string s. If n is greater than the length of s, then s is returned.
SUBSTR (string s, integer startPos, integer size) | String | Returns a substring of s, starting at the position specified by startPos and with length specified by size.
DATECOMPARE (string s1, string s2, dateformat) | Integer | Returns the number of days between s1 and s2. You must define the date format, such as DD/MM/YYYY. For example, DateCompare (2003/03/04, 2002/03/04, YYYY/MM/DD) returns the number of days between 4th March 2003 and 4th March 2002.
DATECONVERT (string s, dateformat1, dateformat2) | String | Converts the date from one specified format to another. You must define the date formats, such as DD/MM/YYYY. See also Example, page 62.
MONTHCOMPARE (string s1, string s2, dateformat) | Integer | Returns the number of months between s1 and s2. You must define the date format, such as DD/MM/YYYY. For example, MonthCompare (2003/03/04, 2002/03/04, YYYY/MM/DD) returns the number of months between 4th March 2003 and 4th March 2002.
TIMECOMPARE (string s1, string s2) | Integer | Returns the number of seconds between s1 and s2. Both s1 and s2 must be in hh:mm:ss format. For example, TimeCompare(13:35:27, 13:34:28) returns the integer value 59.
CHAR (integer i) | String | Returns a string containing the character with the specified ASCII code value.
CODE (string s) | Integer | Returns the ASCII code value for the first character of the specified string.
MAX (integer i1, integer i2) | Integer | Returns the maximum value of the two arguments.
MAX (float f1, float f2) | Float | Returns the maximum value of the two arguments.
MIN (integer i1, integer i2) | Integer | Returns the minimum value of the two arguments.
MIN (float f1, float f2) | Float | Returns the minimum value of the two arguments.
ABS (integer i1) | Integer | Returns the absolute value of the argument.
ABS (float f1) | Float | Returns the absolute value of the argument.
CURDATE (DD/MM/YYYY) | String | Returns the current date in DD/MM/YYYY format. Can also delimit date by [-], such as DD-MM-YYYY.
CURTIME () | String | Returns the current time in the hh:mm:ss format.
LTRIM (string s) | String | Returns the string created by trimming any white spaces from the start of string s.
RTRIM (string s) | String | Returns the string created by trimming any blank spaces from the end of string s.
TRIM (string s) | String | Returns the string that is created by trimming any white spaces from the start and end of string s.
CONTAINS (string s2, string s1) | Integer | Searches for string s2 in string s1. Returns the position of the first character of s2 in s1. Case-sensitive. For more information on the CONTAINS function, see page 62.
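To make the behavior of a few of these operators concrete, the following Python sketch imitates DATECOMPARE, TIMECOMPARE, LEFTSTR, and RIGHTSTR. It is not Rule Based Analyzer code. The function names are hypothetical, the translation of date format tokens is an assumption, and the sign convention (first argument minus second) is inferred from the TimeCompare example, which returns 59.

```python
from datetime import datetime

# Assumed translation of the guide's date tokens into strptime directives.
_TOKENS = {"YYYY": "%Y", "MM": "%m", "DD": "%d"}


def _to_strptime(dateformat):
    """Turn a format such as YYYY/MM/DD into %Y/%m/%d."""
    for token, directive in _TOKENS.items():
        dateformat = dateformat.replace(token, directive)
    return dateformat


def date_compare(s1, s2, dateformat):
    """Days between s1 and s2 (assumed sign: s1 minus s2)."""
    fmt = _to_strptime(dateformat)
    return (datetime.strptime(s1, fmt) - datetime.strptime(s2, fmt)).days


def time_compare(s1, s2):
    """Seconds between two hh:mm:ss strings (assumed sign: s1 minus s2)."""
    delta = datetime.strptime(s1, "%H:%M:%S") - datetime.strptime(s2, "%H:%M:%S")
    return int(delta.total_seconds())


def left_str(s, n):
    """Leftmost n characters; the whole string if n exceeds its length."""
    return s[:max(n, 0)]


def right_str(s, n):
    """Rightmost n characters; the whole string if n exceeds its length."""
    return s[-n:] if n > 0 else ""


print(date_compare("2003/03/04", "2002/03/04", "YYYY/MM/DD"))
print(time_compare("13:35:27", "13:34:28"))
print(left_str("Informatica", 4), right_str("Informatica", 20))
```

How the real operators treat negative lengths or reversed argument order is not stated in Table C-2, so those cases are handled here by assumption only.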
Appendix D
Search/Replace Operations and Noise Removal
This appendix includes the following topic:
Noise Removal
Noise Removal
This appendix describes noise removal, that is, removing extraneous characters from data strings. Noise removal can make data records more legible and facilitate matching operations.
When you run an analysis plan, identify any symbols, spaces, and unexpected characters in the source data fields so that you can remove or replace them with a Search Replace component.
Table D-1 lists some typical removal and replacement selections in the Search Replace component:
Table D-1. Standard Noise Removal and Replacement Operations
Data Element Action
. Replace with a single space.
, Replace with a single space.
- Replace with a single space.
/ Replace with a single space.
\ Replace with a single space.
; Replace with a single space.
Double Spaces Replace with a single space.
Blank space Remove at start.
ATTN: Remove at start.
C/O Remove at start.
C\O Remove at start.
Blank space Remove at end.
Remove.
Remove.
' Remove.
' Remove.
( Remove.
! Remove.
` Remove.
# Remove.
: Remove.
{ Remove.
} Remove.
[ Remove.
] Remove.
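The operations in Table D-1 can be pictured as a small text-cleaning routine. The Python sketch below is an illustration outside Data Quality, not the behavior of the Search Replace component itself. The function name `remove_noise` and the order of operations are assumptions, and the sketch covers only the data elements that the table names explicitly.

```python
import re

# Characters that Table D-1 replaces with a single space.
REPLACE_WITH_SPACE = ".,-/\\;"
# Characters that Table D-1 removes outright.
REMOVE = "'(!`#:{}[]"
# Tokens that Table D-1 removes at the start of a value.
LEADING_TOKENS = ("ATTN:", "C/O", "C\\O")


def remove_noise(value):
    """Apply the Table D-1 style clean-up to one string (assumed order)."""
    text = value.strip()  # blank space removed at start and end

    # Remove leading tokens first, before ':' and '/' are themselves treated as noise.
    found = True
    while found:
        found = False
        for token in LEADING_TOKENS:
            if text.startswith(token):
                text = text[len(token):].lstrip()
                found = True

    # Replace or delete individual characters in one pass.
    table = str.maketrans(REPLACE_WITH_SPACE, " " * len(REPLACE_WITH_SPACE), REMOVE)
    text = text.translate(table)

    # Collapse runs of spaces into a single space, then trim again.
    return re.sub(r" {2,}", " ", text).strip()


print(remove_noise("ATTN: John  O'Neil, [Dept. #4]"))
print(remove_noise("C/O Acme-Widgets; Ltd."))
```

Removing the leading tokens before the single characters matters in this sketch: if the colon or slash were stripped first, ATTN: and C/O would no longer match.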
Appendix E
Matching Formulas
This appendix includes the following topic:
Matching Formulas
Matching Formulas
Given an input set of N records, the following number of comparisons is required without grouping:
If the records are grouped into m groups (G1...Gm being the number of records in groups 1...m) and comparisons only occur between records in the same group, the following number of comparisons is required:
In the worst case, this means that grouping leads to a reduction of comparisons, where Gmax is the size of the
biggest group:
In practice, a greater reduction is expected since it is unlikely that every group is the same size.
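As a sketch, assuming that each unordered pair of records is compared exactly once, the three quantities described above can be written as follows. The symbol C for a number of comparisons is introduced here for illustration, and the bound in the last line is one derivation consistent with the text rather than a confirmed restatement of it.

```latex
% Without grouping: every unordered pair among N records.
C_{\text{all}} = \binom{N}{2} = \frac{N(N-1)}{2}

% With m groups of sizes G_1, \dots, G_m, where G_1 + \dots + G_m = N:
C_{\text{grouped}} = \sum_{i=1}^{m} \frac{G_i (G_i - 1)}{2}

% Bound using the largest group, since G_i - 1 \le G_{\max} - 1 for every i:
C_{\text{grouped}} \le \frac{G_{\max} - 1}{2} \sum_{i=1}^{m} G_i = \frac{N (G_{\max} - 1)}{2}
```

Under these assumptions, 10,000 records split into groups of at most 100 records need at most 10,000 x 99 / 2 = 495,000 comparisons, against 10,000 x 9,999 / 2 = 49,995,000 comparisons without grouping. The bound is reached only when every group has the maximum size.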
Appendix F
SQL Scripts
This appendix includes the following topics:
Overview
Creating a MySQL Table
Use of MAX Function
Nested Groups and Counts
Overview
Data Quality is installed with a MySQL database system to which data files can be migrated
and in which queries can be developed. Although SQL scripts are not required in the majority
of cases when designing and running plans, there are cases in which SQL scripts can provide
efficient solutions to particular data problems.
The Database Source and Database Target component configuration dialog boxes allow you to develop SQL
scripts. The sections below describe some useful SQL scripts and the particular issues that they address.
Creating a MySQL Table
Use the following steps to create a MySQL table:
1. Using a Database Target component, create the database table to which you want to migrate a data file. In
the Before pane, type the following:
drop table if exists table_name; # delete table if it already exists
create table table_name # create table with following fields
(
  TableID int primary key,
  FieldA varchar(20), # use descriptive names for fields
  FieldB varchar(20),
  FieldC varchar(20),
  FieldD float,
  FieldE int
);
2. In the During pane, insert the data from the source file into the new table.
Select Expert Mode to see the SQL scripting equivalent of the tab settings.
3. In the After pane, you should create an index, especially when dealing with large datasets. Use the following
script:
Create index index_name on table_name(FieldE);
Use of MAX Function
The MAX function works best on numeric data.
You can use the following steps to use the MAX function to identify the most recent transaction for each
customer:
1. Convert each date to YYYYMMDD format and store it as a numeric data field.
With this step in place, you can add the following SQL scripts to the Database Source configuration dialog
box to identify the most recent transaction for each customer.
2. Type the following in the Before tab:
Drop table if exists tmp; # delete the temporary table if it already exists
CREATE table tmp # create a temporary table
(cust_ref varchar(20),
numdate bigint);
INSERT INTO tmp
SELECT
transtable.cust_ref,
MAX(transtable.numdate)
FROM transtable
GROUP BY transtable.cust_ref;
CREATE index tmp_trans_index on tmp(cust_ref, numdate);
3. Type the following in the During tab:
SELECT transtable.cust_ref, transtable.numdate, <any other fields>
FROM transtable, tmp
WHERE transtable.cust_ref = tmp.cust_ref
AND transtable.numdate = tmp.numdate
4. Type the following in the After tab:
Drop table tmp;
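The Before, During, and After steps above can be tried outside Data Quality. The following Python sketch uses the standard-library sqlite3 module as a stand-in for the MySQL database, which is an assumption made only so the example runs anywhere. The sample rows and the amount column are invented for illustration, and the grouping is by customer reference, which is what a per-customer maximum requires.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")

# Invented sample data: dates are stored as YYYYMMDD numbers, as in step 1.
conn.executescript("""
CREATE TABLE transtable (cust_ref TEXT, numdate INTEGER, amount REAL);
INSERT INTO transtable VALUES
  ('C001', 20080115, 10.0),
  ('C001', 20090302, 25.5),
  ('C002', 20071231, 7.25);

-- "Before" step: latest date per customer in a temporary table.
CREATE TABLE tmp (cust_ref TEXT, numdate INTEGER);
INSERT INTO tmp
  SELECT cust_ref, MAX(numdate) FROM transtable GROUP BY cust_ref;
CREATE INDEX tmp_trans_index ON tmp (cust_ref, numdate);
""")

# "During" step: join back to pick up the full row of each latest transaction.
latest = conn.execute("""
SELECT transtable.cust_ref, transtable.numdate, transtable.amount
FROM transtable, tmp
WHERE transtable.cust_ref = tmp.cust_ref
  AND transtable.numdate = tmp.numdate
ORDER BY transtable.cust_ref
""").fetchall()

# "After" step: remove the temporary table.
conn.execute("DROP TABLE tmp")

print(latest)
```

Storing dates as YYYYMMDD numbers is what makes MAX pick the most recent date, because numeric order and chronological order coincide in that format.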
Nested Groups and Counts
You can use the following steps to count the number of customers in your dataset by town and country:
1. In the During pane, select the data fields required for the report.
For this example, assume each unique record represents a single customer and that each record contains the
following fields of information: Country and Town.
2. Check the Expert Mode option.
3. Edit the resulting script so that it reads as follows:
SELECT table_name.Country, COUNT(table_name.Country), table_name.Town, COUNT(table_name.Town)
FROM table_name
GROUP BY table_name.Country, table_name.Town
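To see what grouping by country and town produces, the following Python sketch runs an equivalent query with the standard-library sqlite3 module, used here as a stand-in for MySQL (an assumption). The customers table and its rows are invented for illustration. It uses COUNT(*) per group, and a second query shows the per-country totals for comparison.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (Country TEXT, Town TEXT);
INSERT INTO customers VALUES
  ('Ireland', 'Dublin'),
  ('Ireland', 'Dublin'),
  ('Ireland', 'Cork'),
  ('France', 'Paris');
""")

# One row per (country, town) pair with the number of customers in that town.
by_town = conn.execute("""
SELECT Country, Town, COUNT(*)
FROM customers
GROUP BY Country, Town
ORDER BY Country, Town
""").fetchall()

# Country-level totals need their own grouping.
by_country = conn.execute("""
SELECT Country, COUNT(*)
FROM customers
GROUP BY Country
ORDER BY Country
""").fetchall()

print(by_town)
print(by_country)
```

Within a single GROUP BY on country and town, a count of the country column and a count of the town column return the same number for each row, so the country totals have to come from a separate grouping such as the second query.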
Appendix G
ODBC Data Source Administrator
This appendix includes the following topic:
Using the ODBC Data Source Administrator
Using the ODBC Data Source Administrator
Use the Microsoft ODBC Data Source Administrator when connecting to databases with
ODBC. When the Database Source is configured to connect using ODBC, it requires a Data
Source Name.
Note: The following procedure is written for Windows XP users. Details may differ slightly for other versions of Windows.
To create a Data Source Name that is recognized by ODBC:
1. Open the Administrative Tools window.
2. Double-click Data Sources (ODBC).
The ODBC Data Source Administrator dialog box opens.
3. In this dialog box, select the System DSN tab and click Add.
The Create New Data Source dialog box prompts you to select the driver for which you want to set up a data source.
4. Select the appropriate driver for the database that you want to connect to.
You might need to install a driver if you cannot locate one in the list.
When you have successfully identified the driver, a setup dialog box opens for the database driver you have
selected.
5. Type a name for the data source in the Data Source Name field.
6. Click Select and browse to select the appropriate database for the new data source.
7. Click OK to exit the dialog boxes and return to Data Quality Workbench.
8. Under the Connect to Database tab of the Database Source configuration dialog box, type the newly-
created Data Source Name in the relevant field and click Connect.
You should now see the data tables of the database that you associated with the data source name. You can drill
down into the tables and select fields as required.
Note the following:
You can apply Data Quality components directly to data retrieved by ODBC and write the results to local files. Alternatively, you can migrate the data retrieved by ODBC into a local Data Quality MySQL data table. This approach may prove useful if you are retrieving a large data set across a network that is prone to heavy traffic.
When connecting to Microsoft Access databases, you might find that no tables or data fields are available for
viewing after you establish an ODBC connection. This can occur if Access table names or field names
include spaces. Most database vendors do not accept spaces in table names or field names.
This naming convention is an accepted industry standard. To view data in this instance, you must remove all
spaces from the Microsoft Access table names and field names.
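Before you rename anything in Access, it can help to list which names are affected. The following Python sketch assumes you already have the table and field names as plain strings (how you obtain them is outside this example). The function names and sample names are hypothetical.

```python
def names_with_spaces(names):
    """Return the names that contain at least one space character."""
    return [name for name in names if " " in name]

def remove_spaces(name):
    """Return the name with every space character removed."""
    return name.replace(" ", "")

# Hypothetical Access table names.
tables = ["Customer Address", "Orders", "Order Line Items"]
for name in names_with_spaces(tables):
    print(name, "->", remove_spaces(name))
```

The sketch only proposes new names. You still rename the tables and fields in Microsoft Access itself.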
APPENDIX H
Character Encodings and Unicode
This appendix includes the following topic:
Character Encodings and Unicode, 159
Character Encodings and Unicode
Informatica Data Quality is Unicode-compliant. Several components allow you to specify the character
encodings to be applied to the data on which they operate. The character encoding options are generally
available in the Encodings menu on the configuration dialog box for the component.
Entries on this menu include the default encoding for the current system based on the current locale, the
standard UTF encodings (UTF-8 and UTF-16 little endian and big endian), and an option to choose other
encodings not listed in the menu by default.
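The choice of encoding matters because the same character is stored as different bytes in each encoding. The following Python sketch is independent of Data Quality and uses only Python's built-in codecs. It shows the bytes produced for one accented character in the three standard UTF encodings named above.

```python
def encode_all(text):
    """Encode text in UTF-8, UTF-16 little endian, and UTF-16 big endian.

    Returns a dict that maps each encoding name to the encoded bytes as hex.
    """
    encodings = ["utf-8", "utf-16-le", "utf-16-be"]
    return {name: text.encode(name).hex() for name in encodings}

# "\u00e9" is the letter e with an acute accent.
for name, hex_bytes in encode_all("\u00e9").items():
    print(name, hex_bytes)
# utf-8 c3a9
# utf-16-le e900
# utf-16-be 00e9
```

If a component reads a file with a different encoding from the one used to write it, these byte differences appear as garbled characters.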
Encodings recently selected but not defined by the default selections are added to a history of previously-
selected encodings. Only those encodings not available by default are added to the history. The history is
limited to three entries.
Choosing a Non-Default Encoding
Click Choose on the menu to open a new dialog box listing the available encodings as defined in the
localeEncoding.csv file.
This dialog box lists the following:
Base languages
Encodings available for versions of the base language
Countries associated with each version
ISO number of each version
The list can be expanded and collapsed to aid list navigation. Highlight a language or dialect and click OK to
select it for any data on which the component will operate.
Note that you select an encoding of the language rather than the base language, and that in some cases the
versions are distinguished by operating system rather than region.
Note: Data Quality handles all data read over an ODBC connection as Unicode, regardless of the selection in
this field.
APPENDIX I
Data Quality Workbench Toolbar
This appendix includes the following topic:
Data Quality Workbench Toolbar, 161
Data Quality Workbench Toolbar
Figure I-1 lists the names of Data Quality Workbench toolbar icons:
Figure I-1. Data Quality Workbench Toolbar
New Project
New Plan
Save Plan
Run Plan
Refresh
Undo
Redo
Cut Component
Copy Component
Paste Component
Configure Component
Delete Component
Show Source Viewer
Show Project Manager
Show Plan Notes
Import Workbench Plan
Export Workbench Plan
Import Realtime Plan
Export Realtime Plan
Import Runtime Plan
Export Runtime Plan
Open Report Viewer
Open Dictionary Manager
View Plan Layers
Tile Windows
Cascade Windows
Open Help Topics
APPENDIX J
Output Options in the CSV Match Target
This appendix includes the following topics:
Overview, 163
Configuring the Outputs for Identified Matches, 164
Overview
Significant changes have been made to the CSV Match Target component in this version of
Data Quality. The CSV Match Target component:
Can generate a CSV file in two formats.
Provides improved HTML reporting.
Employs a new algorithm to generate match clusters.
New Output Formats
The CSV Match Target provides two output formats:
Identified Matches. Provides similar results to the HTML report output. In this format, the target
reconstructs the original source file and appends a cluster ID and the number of records in each cluster to
the record. As a result, the number of rows in the target output file should be the same as the number of
input rows. Any record for which a match was not found will have its own unique cluster ID and a cluster
size of 1.
Matched Pairs. Delivers each matching pair that meets or exceeds the match threshold set in the target.
(This corresponds to the target output in version 3.0 of the product.)
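To make the relationship between the two formats concrete, the following Python sketch derives Identified Matches style rows from a list of matched pairs. It is not the algorithm that the CSV Match Target uses, which this guide does not describe. It assumes that matches are transitive (if A matches B and B matches C, all three share a cluster) and uses a simple union-find structure. The function name, record IDs, and cluster numbering are hypothetical.

```python
def identified_matches(record_ids, matched_pairs):
    """Return one (record_id, cluster_id, cluster_size) row per input record."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    # Merge the clusters of every matched pair.
    for left, right in matched_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # Count the records in each cluster.
    sizes = {}
    for rid in record_ids:
        root = find(rid)
        sizes[root] = sizes.get(root, 0) + 1

    # Number clusters in order of first appearance and emit one row per record.
    cluster_ids = {}
    rows = []
    for rid in record_ids:
        root = find(rid)
        cluster_id = cluster_ids.setdefault(root, len(cluster_ids) + 1)
        rows.append((rid, cluster_id, sizes[root]))
    return rows

# Hypothetical records A-D, where A matches B and B matches C.
for row in identified_matches(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]):
    print(row)
# ('A', 1, 3)
# ('B', 1, 3)
# ('C', 1, 3)
# ('D', 2, 1)
```

The output has one row per input record, and the unmatched record D receives its own cluster ID with a cluster size of 1, as described for the Identified Matches format.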
HTML Report
The HTML Report format displays the unique records in each cluster, with the best match identified and
the score against that match.
Usage
The CSV Match Target calculates clusters only when configured to do so. Select the Identified Matches or
HTML Report option to activate cluster generation.
You can also disable HTML report generation.
Clustering
The clustering algorithm assigns all records identified as matches to a cluster. The algorithm runs while the plan
runs and stores temporary data in memory.
In large datasets, a high number of matches can consume a large amount of memory. Grouping the data
keeps group sizes within recommended limits and avoids unnecessary matching operations.
Informatica recommends a maximum of 5,000 records per group.
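A rough calculation shows why group size matters. The following Python sketch assumes that every record in a group is compared with every other record in the same group, which gives n(n-1)/2 comparisons for a group of n records. That assumption and the sample figures are illustrative only and may not reflect how a given matching plan behaves.

```python
def pair_count(n):
    """Number of distinct record pairs in a group of n records."""
    return n * (n - 1) // 2

def grouped_pair_count(group_sizes):
    """Total number of record pairs when records are compared only within groups."""
    return sum(pair_count(size) for size in group_sizes)

# Hypothetical data set of 100,000 records.
print(pair_count(100_000))                # 4999950000 pairs with no grouping
print(grouped_pair_count([5_000] * 20))   # 249950000 pairs in 20 groups of 5,000
```

Under these assumptions, grouping reduces the number of comparisons by a factor of about 20.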
Sources
The CSV Match Target can calculate record clusters when used with the CSV Match Source or Group Source.
When you use the CSV Match Target with other sources and select the Identified Matches option, the plan does
not run. If you select HTML Report, the plan runs, but the HTML page indicates that the
report cannot be created.
Configuring the Outputs for Identified Matches
When you select the Identified Matches output format, you must review the order of the output columns in the
Outputs pane.
The columns in the Outputs pane must be organized by data source, with an equal number of columns for
records from each data source. The match score column must appear after the record columns. The logic is as
follows:
Data reaches the CSV Match Target as two input records side by side. For example, records with Name and
Address fields reach the Target in the following format, followed by the match score:
Name_1,Address_1,Name_2,Address_2
When you select the Identified Matches format, the Target reconstructs the original input records. The
previous example would be reconstructed as follows:
Name_1,Address_1
Name_2,Address_2
You must order the output columns in the Outputs pane so the columns from the first record are listed in
order, followed by the columns in the second record, followed by the columns for the match scores. The
Outputs pane for the previous example should look like this:
Name_1
Address_1
Name_2
Address_2
MatchScore
Figure 3-1 on page 28 illustrates a well-ordered Outputs pane for the Identified Matches option.
Use the Up and Down arrows to order columns.
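The reason for this column order can be sketched in a few lines of Python. The example assumes a single match score column at the end of each row and an equal number of columns per record. The function name and sample values are hypothetical and do not represent the Target's internal code.

```python
def split_pair_row(row, columns_per_record):
    """Split one side-by-side row into (first_record, second_record, match_score)."""
    expected = 2 * columns_per_record + 1
    if len(row) != expected:
        raise ValueError(f"expected {expected} columns, got {len(row)}")
    first = row[:columns_per_record]
    second = row[columns_per_record:2 * columns_per_record]
    score = row[-1]
    return first, second, score

# Columns ordered as Name_1, Address_1, Name_2, Address_2, MatchScore.
row = ["Ann Lee", "1 Main St", "Anne Lee", "1 Main Street", "0.92"]
first, second, score = split_pair_row(row, 2)
print(first)   # ['Ann Lee', '1 Main St']
print(second)  # ['Anne Lee', '1 Main Street']
print(score)   # 0.92
```

If the columns were interleaved, for example Name_1, Name_2, Address_1, Address_2, a positional split like this one would mix fields from the two records.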
APPENDIX K
Informatica Data Quality Naming Conventions
This appendix includes the following topics:
Overview, 165
Overview
This appendix describes a recommended naming system for Data Quality project elements. You
and your team should agree on a clear and consistent set of naming conventions for the elements
you create in Workbench. Your exact approach to naming conventions will depend on your
organization's needs.
The elements to consider are:
Projects. Create a project under the local repository (My Repository) in Workbench Project Manager. You
cannot rename a Data Quality repository.
Folders. Create a folder under a project in Workbench Project Manager. Folders can be nested in projects.
Plans. Create a plan at folder or project level in Workbench Project Manager.
Configurable components. Select a component from the Component Palette and add it to an open plan.
Component instances. Open a component onscreen to view or edit an instance. A component comprises
one or more instances.
Component outputs. Open a component onscreen to view or edit its outputs. A component creates one or
more output columns based on the rules applied to its inputs.
Dictionaries. Open Workbench Dictionary Manager or the local file system to view dictionary (.DIC) files.
No element can share a name with another element at the same node in the Project Manager. For example, you
cannot define two folders named MyFolder in the same project.
You can copy an element at its current location. In such cases, Workbench prefixes its name with Copy of. For
example, you can make a copy of MyFolder and create a new folder named Copy of MyFolder by default in the
same project. If the new name is longer than permitted, Workbench truncates the name.
Naming Projects
Workbench creates a project with the default name New Project.
Project naming should be clear and consistent within a repository. Follow these guidelines:
Limit project names to 22 characters. The repository imposes a limit of 30 characters. Limiting project
names to 22 characters allows Workbench to prefix Copy of to a copied project without truncating
characters.
Include enough descriptive information in the project name for an unfamiliar user to grasp the general
purpose of the plans in the project.
If plans within the project will operate on a single data source, incorporate the data source name in the
project name.
Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions.
They allow the PowerCenter repository to import the project without changing its name.
If you use company codes or abbreviations in the project name, ensure they are consistent and well
documented.
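The 22-character guideline follows from simple arithmetic: the Copy of prefix and the space after it take 8 characters, and 22 + 8 = 30. The following Python sketch checks whether a name still fits within a repository limit after it is copied. The exact prefix string, including the trailing space, is an assumption based on the Copy of MyFolder example, and the function name is hypothetical.

```python
COPY_PREFIX = "Copy of "  # assumed prefix: 7 characters plus one trailing space

def fits_after_copy(name, repository_limit):
    """Return True if the copied element's name stays within the repository limit."""
    return len(COPY_PREFIX + name) <= repository_limit

print(len(COPY_PREFIX))                   # 8
print(fits_after_copy("a" * 22, 30))      # True: 8 + 22 = 30
print(fits_after_copy("a" * 23, 30))      # False: 8 + 23 = 31
```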
Naming Folders
Workbench creates four folders by default beneath a new project. The folders are named Consolidation,
Matching, Profiling, and Standardization and are listed alphabetically. These names relate to four common
types of data quality plan. You can rename, delete, and create folders to suit your business and project
objectives.
Naming guidelines for folders:
Limit folder names to 42 characters. The repository imposes a limit of 50 characters. Limiting folder names
to 42 characters allows Workbench to prefix Copy of to a copied folder without truncating characters.
Include enough descriptive information in the folder name for an unfamiliar user to grasp the purpose of the
plans in the folder.
Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions.
They allow the PowerCenter repository to import the folder without changing its name.
If you use company codes or abbreviations in the folder name, ensure they are consistent and well
documented.
Naming Plans
When you create a new plan, Workbench prompts you to select one of four generic plan types as the plan name:
Analysis, Consolidation, Matching, or Standardization. These names relate to the default folder names.
Workbench provides them as an aid to project design.
These default names in no way determine or constrain plan functionality. You can add a new plan to any folder
regardless of the plan name or folder name.
Note: Take particular care when naming plans, particularly if you will export the plan to a PowerCenter
repository. Be as clear and descriptive as possible. Data quality operations are defined and implemented at plan
level. Although you can see a plan's folder and project parentage in Workbench, these elements may not be
evident in the PowerCenter repository.
Naming guidelines for plans:
Include the plan's purpose or primary functionality in the plan name.
If you will use the plan in a PowerCenter mapping or mapplet, prefix the plan name with dq_. This
conforms to PowerCenter naming conventions. PowerCenter applies a lowercase prefix to all elements in its
repository. For data quality plans, this is an optional but recommended step.
Limit plan names to 42 characters. The repository imposes a limit of 50 characters. Limiting plan names to
42 characters allows Workbench to prefix Copy of to a copied plan without truncating characters.
Include enough descriptive information in the plan name for an unfamiliar user to grasp the purpose of the
plan.
Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions.
They allow the PowerCenter repository to import the plan without changing its name.
If you use company codes or abbreviations in the plan name, ensure they are consistent and well
documented.
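If your team wants to check names automatically, the guidelines for projects, folders, and plans can be expressed as a small validation routine. The following Python sketch is one possible approach and not a Workbench feature. It treats letters as unaccented Latin letters, which is an assumption, and the function name and sample names are hypothetical.

```python
import re

# Recommended limits from this appendix, which leave room for the Copy of prefix.
RECOMMENDED_LIMITS = {"project": 22, "folder": 42, "plan": 42}

def check_name(kind, name):
    """Return a list of guideline problems for a project, folder, or plan name."""
    problems = []
    limit = RECOMMENDED_LIMITS[kind]
    if len(name) > limit:
        problems.append(f"longer than {limit} characters")
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        problems.append("contains characters other than letters, numbers, and underscores")
    if kind == "plan" and not name.startswith("dq_"):
        problems.append("missing the optional dq_ prefix")
    return problems

print(check_name("plan", "dq_customer_address_std"))  # []
print(check_name("folder", "My Folder"))
print(check_name("project", "x" * 23))
```

An empty list means the name meets the guidelines that the sketch checks. Descriptive quality and consistent abbreviations still need human review.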
Naming Components
When you add a component to a plan, its default name appears underneath its icon in the plan workspace. Edit
this name to provide a description of the component's role in the plan. Prefix your new name with an
abbreviation of the component's original name to make the plan more legible onscreen.
If the component type abbreviation itself is not sufficient to identify what the component does, include an
identifier for the function of the component in its name.
Table K-1 lists prefixes you can use when renaming your components:
Table K-1. Component Names and Prefixes
Component                     Prefix
Address Validator             av_
Aggregation                   ag_
Bigram                        bg_
Character Labeller            cl_
Context Parser                cp_
Count                         co_
Edit Distance                 ed_
Global AV                     av_
Hamming Distance              hd_
International AV              iav_
Jaro Distance                 jd_
Merge                         mg_
MinAvgMax                     mam_
Missing Values                mv_
Mixed Field Matcher           mfm_
North America AV              nav_
Nysiis                        nys_
Profile Standardizer          ps_
Range Counter                 rc_
Rule Based Analyzer           rba_
Scripting                     sc_
Search Replace                sr_
Soundex                       sx_
Splitter                      spl_
To Upper                      tu_
Token Labeller                tl_
Token Parser                  tp_
Weight Based Analyzer         wba_
Word Manager                  wm_
SOURCES/TARGETS
CSV Dual Source               csv_m_
CSV Match Source              csv_d_
CSV Merge Target              csv_merge_
CSV Source/Target             csv_
DB Match Source               db_m_
DB Report Target              db_r_
DB Source/Target              db_
Dual Group Source             dgs_
Fixed Width Source/Target     fws_
Group Source/Target           grp_
Match Key Target              mks_
Realtime Source/Target        rs_
Report Target                 rep_
SAP Source/Target             sap_
In addition, consider these naming guidelines for components:
Keep component names short where possible. You may wish to reuse component names in field names, and
your database may impose a limit on field length.
Include the name of the input field or the field type.
Use letters, numbers, and underscores in your name. Do not use spaces.
If you use company codes or abbreviations in the component name, ensure they are consistent and well
documented.
Naming Fields
Careful field naming is essential when designing data quality plans. Plans can become complex and
contain many components.
Data Quality requires that every component output field name is unique in the plan. Output field names do
not persist from component to component.
Data Quality does not have the data lineage feature of PowerCenter, so the field name is the clearest indicator of
the source of a data element when a plan is examined by a third party.
Naming guidelines for fields:
Prefix each output field name with an abbreviation of its component name. For a list of usable abbreviations,
see Table K-1.
Use upper and lower case consistently.
Do not rename output fields in target components unless necessary, as there is no convenient way to
determine the origin of a renamed output field.
If you use company codes or abbreviations in the field name, ensure they are consistent and well
documented.
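The prefix guideline and the uniqueness requirement can be combined in a short check. The following Python sketch builds output field names from a component prefix in Table K-1 and reports any name that appears more than once in a plan. It compares names exactly, including case, which is an assumption about how Data Quality compares field names. The function names and sample fields are hypothetical.

```python
def prefixed_field(component_prefix, field):
    """Build an output field name from a component prefix and a field name."""
    return f"{component_prefix}{field}"

def duplicate_fields(field_names):
    """Return each field name that occurs more than once, in order of first repeat."""
    seen = set()
    duplicates = []
    for name in field_names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates

# Hypothetical plan: a To Upper component followed by a Search Replace component.
fields = [
    prefixed_field("tu_", "Name"),
    prefixed_field("sr_", "Name"),
    prefixed_field("tu_", "Name"),  # accidental reuse of an existing output name
]
print(fields)                    # ['tu_Name', 'sr_Name', 'tu_Name']
print(duplicate_fields(fields))  # ['tu_Name']
```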
Naming Dictionary Files
Dictionaries may be given any name suitable for the operating system on which they will be used.
Naming guidelines for dictionary files:
Limit dictionary names to characters permitted by the operating system. If a dictionary is to be used on both
Windows and UNIX, do not use spaces.
If you modify a dictionary file from Informatica, rename or move it to a new folder before using it in a plan.
In this way, you will not overwrite your modifications if you perform a Content update.
If you use company codes or abbreviations in the dictionary name, ensure they are consistent and well
documented.
INDEX
A
address action indicator codes
Address Validator component 135
Address Validator component
address action indicator codes 135
match confidence level codes 134
match success codes 133
postal code action indicator codes 135
Aggregation component
configuring 41
B
Bigram component
configuring 83
C
-c option
command line argument 126
shared database details 127
categories
creating dashboard 116
dashboard 116
deleting 117
moving rows 117
character encoding
configuring 159
Character Labeller component
configuring 47
characters
removing extraneous 151
clustering
CSV Match Source algorithm 164
command line arguments
-c option 126
encrypting parameter files 126
-i option 127
overview 126
Components
Address Validation Components
Global AV 102
Analysis Components
Character Labeller 47
Token Labeller 50
Frequency Components
Aggregation 41
Count 37
MinAvgMax 43
Missing Values 45
Range Counter 44
Sum 40
Key Field Generator Components
Normalization 75
Nysiis 77
Soundex 75
Matching Components
Bigram 83
Edit Distance 80
Hamming Distance 82
Identity Match 96
Jaro Distance 81
Mixed Field Matcher 84
Similarity 80
Weight Based Analyzer 85
Parsing Components
Context Parser 72
Parser 65
Profile Standardizer 70
Splitter 66
Token Parser 67
Source Components
CSV 13
CSV Dual Match 20
CSV Identity Group 92
CSV Match 19
Database 14
Database Match 20
DB Identity Group 94
Dual Group 22
Fixed Width 16
Group 21
Realtime 16
SAP 17
Target Components
CSV 23
CSV Identity Match 98
CSV Match 27
CSV Merge 26
Database 32
Database Report 34
Fixed Width 24
Group 31
Identity Group 91
Match Key 29
Realtime 36
Report 25
SAP 35
Transformation Components
Merge 58
Rule Based Analyzer 61
Scripting 63
Search Replace 55
To Upper 59
Word Manager 57
Context Parser component
configuring 72
Count component
configuring 37
CSV Dual Match Source component
configuring 20
CSV Identity Group Source component
configuring 92
CSV Identity Match Target component
configuring 98
CSV Match Source component
configuring 19
CSV Match Target component
configuring 27
Identified Matches option 27, 164
Matched Pairs option 27
output options 163
sources for calculating clusters 164
CSV Merge Target component
configuring 26
CSV Source component
configuring 13
CSV Target component
configuring 23
D
dashboard view
Report Viewer 115
dashboards
categories 116
creating categories 116
creating groups 121
modifying calculation parameters 115
setting Data Quality targets 115
tracking changes 120
tracking historical percentages 120
tracking historical trends 120
data
viewing plan 118
data elements
hiding 120
data matching
formulae 153
Data Quality staging area
default permissions 129
data sources
creating ODBC 157
database dictionaries
creating 110
description 107
Database Match Source component
configuring 20
Database Report Target component
configuring 34
Database Source component
configuring 14
Database Target component
configuring 32
databases
shared details 127
DB Identity Group Source component
configuring 94
deploying
runtime plans 123
deploying plans
using the command line 126
dictionaries
adding spellings 109
creating 110
overview 107
updating files 108
Dictionary Manager
overview 108
Dual Group Source component
configuring 22
E
Edit Distance component
configuring 80
element status codes
International AV component 139
elementmatchstatus
International AV component 140
elementresultstatus
International AV component 140
encodings
configuring 159
encrypting
parameter files 126
encryption
for password protection 129
executing
plans 6
F
File Manager
description 2
Fixed Width Source component
configuring 16
Fixed Width Target component
configuring 24
functional operators
in rules 148
G
Global AV component
configuring 102
match status and match code outputs 131
overseas territories and database settings 106
Group Source component
configuring 21
Group Target component
configuring 31
groups
creating 121
creating dashboards 121
managing 121
nested in scripts 156
H
Hamming Distance component
configuring 82
hiding
data elements 120
HTML
CSV Match Target component report format 163
I
-i option
command line argument 127
Identified Matches option
configuring output 164
CSV Match Target component 163
Identity Group Target component
configuring 91
generic values in the Input Column 92
Identity Match component
configuring 96
populations 96
Identity matching defined 89
International AV component
element status codes 139
elementmatchstatus 140
elementresultstatus 140
items
assigning 117
J
Jaro Distance component
configuring 81
L
line graphs
viewing 120
M
match confidence level codes
Address Validator component 134
Match Key Target component
configuring 29
match success codes
Address Validator component 133
Matched Pairs option
CSV Match Target component 163
MAX function
in scripts 156
Merge component
configuring 58
MinAvgMax component
configuring 43
Missing Values component
configuring 45
Mixed Field Matcher component
configuring 84
multi-processing
overview 128
multi-threading
overview 128
MySQL tables
creating 155
N
nested groups
in scripts 156
noise
removal 151
Normalization component
configuring 75
North America AV component
status codes 137
Nysiis component
configuring 77
O
ODBC
creating data sources 157
ODBC Data Source Administrator
creating a DSN 157
P
parameter files
encrypting 126
passwords 126
Parser component
configuring 65
passwords
parameter files 126
percentages
tracking historical 120
performance
checking with command line argument 127
tuning 127
plans
executing 6
overview 2
performance tuning 127
version control 8
population file 89
postal code action indicator codes
Address Validator component 135
Profile Standardizer component
configuring 70
Project Manager
description 2
R
Range Counter component
configuring 44
Realtime Source component
configuring 16
Realtime Target component
configuring 36
removing
extra characters 151
Report Target component
configuring 25
Report Viewer
assigning weights to data items 117
creating dictionary files 110
creating groups 121
dashboard view 115
Data Quality targets on the dashboards 115
editing settings 119
exporting data 118
filtering data 118
importing report files 121
managing groups 121
parameters and settings 119
standard view 115
tracking changes 120
viewing plan data 118
working with groups 121
Rule Based Analyzer
rule statements 147
Rule Based Analyzer component
configuring 61
rules
functional operators 148
runtime execution
plans 123
runtime plans
deploying 123
S
SAP Source component
configuring 17
SAP Target component
configuring 35
scheduling
operations 125
Scripting component
configuring 63
Search Replace component
configuring 55
security
encrypting parameter files 126
tips 129
Similarity component
configuring 80
Soundex component
configuring 75
sources
calculating clusters with CSV Match Target 164
Splitter component
configuring 66
SQL scripts
samples 155
standard dictionaries
creating text 110
description 107
standard view
Report Viewer 115
status codes
North America AV component 137
Sum component
configuring 40
system performance
checking with command line argument 127
T
tables
creating MySQL 155
terms
adding new to dictionaries 109
adding spellings to dictionaries 109
third-party reference data
description 107
To Upper component
configuring 59
Token Labeller component
configuring 50
Token Parser component
configuring 67, 68
multiple dictionary operations 68
toolbar
icons 161
trends
tracking historical 120
U
Unicode
compliance 159
UNIX installation
root privileges 129
V
version control
plan publication 10
plans 8
tracking plans 10
views
Report Viewer 115
W
Weight Based Analyzer component
configuring 85
weights
assigning to data items 117
Word Manager component
configuring 57
NOTICES
This Informatica product (the Software) includes certain drivers (the DataDirect Drivers) from DataDirect Technologies, an operating company of Progress Software Corporation (DataDirect)
which are subject to the following terms and conditions:
1. THE DATADIRECT DRIVERS ARE PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN
ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY,
NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.