
Informatica UK User Group Meeting

Date: 2nd July 2008
Venue: London

Identity Resolution Mike Pataky Snr Technical Consultant

Identity Resolution Michael Pataky Snr Technical Consultant

Identity Resolution Mr M C Pataky Snr Technical Consultant

Identity Resolution Mike Chuck Pataky Snr Technical Consultant

Reality

Father calls me William, Sister calls me Will, Mother calls me Willie, But the fellers call me Bill!
Eugene Field (Poet)

Copyright 2008 Informatica Corporation. All rights reserved. Unauthorised distribution or copying is prohibited.

A Thought
Are these the same people?

Sean Mac Gabhann     Jon Smith
12 May 49            5 Dec 1949
British              British


Word variation, error and misspelling are unavoidable


Variation or Error      Example
Nicknames and aliases   Chris: Christine, Christopher, Tina
Sequence errors         Mark Douglas or Douglas Mark; Moi Lim Chung or Chung Moi Lim
Concatenated names      Ann Marie, AnnMarie
Abbreviations           Ave/Avenue, Mfg/Manufacturing
Truncation              International Business Mach


Word variation, error and misspelling are unavoidable


Variation or Error            Example
Transposed characters         Michael, Micheal
Unreliable data entry         Peter Jackson deceased; James Phillips DL293384
Involuntary correction        Graeme / Graham, MacDonald / McDonald
Inaccurate dates              12/10/1915, 21/10/1951, 10121951, 00001951
Transliteration differences   Sid Ali Ahmd Al Gamdi, Saud Ali Abdullah AlGamdi
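As a rough illustration of why exact comparison fails on these classes of error, the sketch below measures an edit distance that counts an adjacent swap as a single change (optimal string alignment). It is a generic textbook technique, not the product's algorithm, and the function name is an assumption made for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # adjacent transposition, e.g. "ae" typed as "ea"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("Michael", "Micheal"))     # 1: one transposed pair
print(osa_distance("Ann Marie", "AnnMarie"))  # 1: one missing space
print(osa_distance("Mark Douglas", "Douglas Mark"))  # large: word order changed
```

Character-level distance copes with the transposition and the concatenation, but not with the sequence error, which is one reason a single technique is unlikely to be enough.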


Transliteration Realities

Transliteration does not make the problem of identity search and matching go away; it just adds to the complexity.
The ideal solution captures the identity data in both local source and transliterated form.
Together, and with algorithms that address the individual characteristics of each form, the opportunity for success is multiplied even further.
[Diagram: one Arabic identity reaching Roman script by three routes]
Arabic identity, transliterated into French by an Algerian speaker, then from French into English by an English speaker
Arabic identity, transliterated into Hebrew by an Israeli, then from Hebrew into Roman by a Hebrew speaker
Arabic identity, transliterated into English by an English speaker

Do three versions match?

In high-risk systems, this multi-algorithmic approach is critical.


Informatica Identity Resolution


Identity Systems acquired May 2008

When, Where, Who
Founded in 1986 as Search Software America (SSA)
Located in Old Greenwich, CT (HQ), London, Bangalore, Sydney and Canberra
500+ multi-national customers
Partnerships: Siperian, Purisma (D&B) and Oracle (Siebel)


Identity Resolution in Border Control and Immigration


Example Customers:
UK Border Authority
Australia Immigration
Immigration Canada
NZ Immigration
Israel Border Police

KEY BUSINESS IMPERATIVE
Identify persona non grata
Check visa applicants against known terrorists and undesirables
Identify threats and prevent entry at border posts
Manage case lifecycle for immigration benefits

THE CHALLENGE
Identity data in national watch lists is incomplete, spelled incorrectly, and drawn from various languages, countries and cultures
Romanization and transliteration introduce more error and variation
Cost of missed match

IDENTITY ADVANTAGE
Embeddability
Transaction latency and throughput
Ability to deal with incomplete or partial data
Ability to deal with entity data from anywhere in the world

RESULTS/BENEFITS
Improved performance and better accuracy compared to in-house solutions
Lower false positives help resource/staff management
Lower TCO: COTS, ongoing maintenance
Reduced reliance on internal resources


Existing Techniques Versus Our Solution

Existing Techniques                     Critical Weaknesses
SQL and other DBMS exact searches       Can only find exactly equal or truncated data
Soundex, Wild Card commands             Miss many matches and find many false matches; perform very poorly
Query languages and OLAP                Depend on SQL
Matchcodes                              Fail when any element is in error
Text Retrieval and WEB search engines   Not optimized for identity data
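To make the Soundex weakness concrete, here is a minimal sketch of the standard American Soundex code. The Robert/Rupert pair is an illustrative assumption added here; Bill/William and Graeme/Graham come from the earlier slides.

```python
# Standard Soundex letter groups; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


# Missed match: a nickname and its full form get different codes.
print(soundex("Bill"), soundex("William"))    # B400 W450
# False match: two different names share one code.
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
# Where it does help: a spelling variant of the same name.
print(soundex("Graeme"), soundex("Graham"))   # G650 G650
```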


Hybrid Approach

Deterministic, Heuristic, Empirical, Probabilistic, Phonetic, Linguistic

No single algorithm is capable of compensating for all the classes of error and variation present in identity data.


Our Solution
Smart indexing
Overcomes spelling and phonetic errors, missing or out-of-order words, and other errors and variation; handles transliteration and multi-country data

Flexible Search Strategies
Balance performance and comprehensiveness of search

Matching algorithms
Mimic a human expert's ability to determine a match based on numerous attributes


Our 3 Step Approach


At setup time :
1. Index the data in the original database

At search time :
2. Search the data for the required candidates
3. Verify the Match using additional data elements


Step 1 : Indexing the data


Name : Andrew Jackson Smith

Andrew Jackson Smith
Jackson Smith Andrew
Smith Andrew Jackson

A J S
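The word orders shown for this name are its cyclic rotations, so one way to picture the indexing step is the sketch below. The helper name is an assumption for illustration; the slides do not say how the product generates its index entries.

```python
def rotations(name: str) -> list[str]:
    """Return every cyclic rotation of the words in a name."""
    words = name.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]


for variant in rotations("Andrew Jackson Smith"):
    print(variant)
# Andrew Jackson Smith
# Jackson Smith Andrew
# Smith Andrew Jackson
```

Indexing each rotation means a search that leads with any one of the words can still land next to the record.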


Step 1 : Indexing the data


Name : Bill Jackson Smith
Q : Could this be Mr W J Smith ?

Bill Jackson Smith
William Jackson Smith
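A small sketch of the idea on this slide: expand known nicknames to their formal forms before indexing, so that "Bill" can also be reached through "William" and therefore through the initial "W". The nickname table and both helper names are hypothetical; a real system would rely on far larger, curated rule sets.

```python
from itertools import product

# Hypothetical, deliberately tiny nickname table (nickname -> formal forms).
NICKNAMES = {
    "bill": {"william"},
    "will": {"william"},
    "willie": {"william"},
    "andy": {"andrew"},
    "mike": {"michael"},
}


def name_variants(name: str) -> list[str]:
    """Expand each word into itself plus any formal forms it may stand for."""
    options = []
    for word in name.split():
        forms = {word.title()}
        forms |= {formal.title() for formal in NICKNAMES.get(word.lower(), ())}
        options.append(sorted(forms))
    return [" ".join(combo) for combo in product(*options)]


def abbreviate_given_names(name: str) -> str:
    """Reduce every word except the last to its initial."""
    words = name.split()
    return " ".join([w[0].upper() for w in words[:-1]] + [words[-1]])


variants = name_variants("Bill Jackson Smith")
print(variants)  # ['Bill Jackson Smith', 'William Jackson Smith']
print({abbreviate_given_names(v) for v in variants})  # includes 'W J Smith'
```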


Step 1 : Indexing the data


Key     Name (Compressed)      Other Data (Compressed)
KDM/>   Andrew Jackson Smith   ABC123+
WETK/   Jackson Smith Andrew   ABC123+
YFNO$   Smith Andrew Jackson   ABC123+

ID       Name                   Address                          DOB
ABC123   Andrew Jackson Smith   9 Headley Road Woodley Reading   12/05/75
ADE938   Andrew Smythe          29 Headley Road Reading          05/07/49
ARF073   Andrew Smith           12 High Street Bracknell         05/12/75


Step 1 : Indexing the data


Key     Name (Compressed)      Other Data (Compressed)
KDM/>   Andrew Jackson Smith   ABC123+
WETK/   Jackson Smith Andrew   ABC123+
YFNO$   Smith Andrew Jackson   ABC123+
KDM$E   Andrew Smythe          ADE938+
YFN$<   Smythe Andrew          ADE938+

ID       Name                   Address                          DOB
ABC123   Andrew Jackson Smith   9 Headley Road Woodley Reading   12/05/75
ADE938   Andrew Smythe          29 Headley Road Reading          05/07/49
ARF073   Andrew Smith           12 High Street Bracknell         05/12/75


Step 1 : Indexing the data


Key     Name (Compressed)      Other Data (Compressed)
KDM/>   Andrew Jackson Smith   ABC123+
WETK/   Jackson Smith Andrew   ABC123+
YFNO$   Smith Andrew Jackson   ABC123+
KDM$E   Andrew Smythe          ADE938+
YFN$<   Smythe Andrew          ADE938+
KDM$<   Andrew Smith           ARF073+
YFN$<   Smith Andrew           ARF073+

ID       Name                   Address                          DOB
ABC123   Andrew Jackson Smith   9 Headley Road Woodley Reading   12/05/75
ADE938   Andrew Smythe          29 Headley Road Reading          05/07/49
ARF073   Andrew Smith           12 High Street Bracknell         05/12/75


Step 1 : Indexing the data


Key     Name (Compressed)      Other Data (Compressed)
KDM/>   Andrew Jackson Smith   ABC123+
KDM$<   Andrew Smith           ARF073+
KDM$E   Andrew Smythe          ADE938+
WETK/   Jackson Smith Andrew   ABC123+
YFNO$   Smith Andrew Jackson   ABC123+
YFN$<   Smythe Andrew          ADE938+
YFN$<   Smith Andrew           ARF073+

Address Index

ID       Name                   Address                          DOB
ABC123   Andrew Jackson Smith   9 Headley Road Woodley Reading   12/05/75
ADE938   Andrew Smythe          29 Headley Road Reading          05/07/49
ARF073   Andrew Smith           12 High Street Bracknell         05/12/75
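The build-up across these slides can be approximated as: generate the word-order variants of each name, derive a key for each variant, and store the rows sorted by key. The sketch below follows that shape only. The real keys (KDM/>, YFN$< and so on) come from a proprietary algorithm that the slides do not describe, so `toy_key` is a crude stand-in invented for this example, as are the other names.

```python
from typing import NamedTuple


class IndexRow(NamedTuple):
    key: str
    name: str
    record_id: str


VOWELS = set("AEIOUY")


def toy_key(name: str, length: int = 5) -> str:
    """Stand-in key (NOT the product's): keep each word's first letter,
    drop later vowels, join the words and truncate."""
    parts = []
    for word in name.upper().split():
        parts.append(word[0] + "".join(c for c in word[1:] if c not in VOWELS))
    return "".join(parts)[:length]


def rotations(name: str) -> list[str]:
    words = name.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]


def build_index(records: dict[str, str]) -> list[IndexRow]:
    rows = [IndexRow(toy_key(variant), variant, record_id)
            for record_id, name in records.items()
            for variant in rotations(name)]
    return sorted(rows)  # sorted by key, so similar names sit together


records = {
    "ABC123": "Andrew Jackson Smith",
    "ADE938": "Andrew Smythe",
    "ARF073": "Andrew Smith",
}
for row in build_index(records):
    print(row.key, row.name, row.record_id)
```

With this stand-in, the three "Andrew ..." entries share one key and the "Smith Andrew" and "Smythe Andrew" entries share another, which mirrors how the sorted index on the slide groups likely candidates side by side.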


Our 3 Step Approach


At setup time :
1. Index the data in the original database

At search time :
2. Search the data for the required candidates


Step 2 : The Search


Search : Andy J Smith

From Key : KDM$$
To Key   : KDMZZ

Database Index
Key     Name (Compressed)      Data (Compressed)
KDM/>   Andrew Jackson Smith   ABC123+
KDM$<   Andrew Smith           ARF073+
KDM$E   Andrew Smythe          ADE938+

Records returned from the database to our software
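The search step amounts to a range scan over the sorted index between the From Key and the To Key. A minimal sketch using the keys from the slides follows. How the range KDM$$ to KDMZZ is derived from "Andy J Smith" is not stated in the deck, so it is simply taken as given, and plain ASCII string ordering is assumed, which may differ from the product's own collation.

```python
from bisect import bisect_left, bisect_right

# Index rows as shown on the slides: (key, name, record id).
index = sorted([
    ("KDM/>", "Andrew Jackson Smith", "ABC123"),
    ("KDM$<", "Andrew Smith", "ARF073"),
    ("KDM$E", "Andrew Smythe", "ADE938"),
    ("WETK/", "Jackson Smith Andrew", "ABC123"),
    ("YFNO$", "Smith Andrew Jackson", "ABC123"),
    ("YFN$<", "Smythe Andrew", "ADE938"),
    ("YFN$<", "Smith Andrew", "ARF073"),
])
keys = [key for key, _, _ in index]


def range_scan(from_key: str, to_key: str):
    """Return every index row whose key lies in [from_key, to_key]."""
    lo = bisect_left(keys, from_key)
    hi = bisect_right(keys, to_key)
    return index[lo:hi]


candidates = range_scan("KDM$$", "KDMZZ")
for key, name, record_id in candidates:
    print(key, name, record_id)
```

Only the three KDM rows come back, so the more expensive scoring step has a short candidate list to work on instead of the whole table.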


Our 3 Step Approach


At setup time :
1. Index the data in the original database

At search time :
2. Search the data for the required candidates
3. Verify the Match using additional data elements


Step 3 : Scoring
The scoring step is carried out using the fields chosen by the user, e.g.:
Name : Andy J Smith
Address : 9 Hedley Rd Reading
DOB : 12 May 1975
(Choosing weights suitable for finding the same Resident)

Score   ID       Name                   Address                          DOB
97      ABC123   Andrew Jackson Smith   9 Headley Road Woodley Reading   12/05/75
90      ADE938   Andrew Smythe          29 Headley Road Reading          05/07/49
54      ARF073   Andrew Smith           12 High Street Bracknell         05/12/75
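To show the general shape of field-weighted scoring, here is a sketch that combines a name score, an address score and a date-of-birth score into one 0 to 100 figure. The weights, the use of `difflib.SequenceMatcher` and the exact-match date rule are all assumptions made for illustration. The product's matching algorithms are not described in the slides, and this sketch will not reproduce the 97 / 90 / 54 scores shown.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class Person:
    record_id: str
    name: str
    address: str
    dob: date


# Hypothetical weights; a real deployment would tune these per purpose.
WEIGHTS = {"name": 0.3, "address": 0.2, "dob": 0.5}


def similarity(a: str, b: str) -> float:
    """Generic character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(query: Person, candidate: Person) -> int:
    parts = {
        "name": similarity(query.name, candidate.name),
        "address": similarity(query.address, candidate.address),
        "dob": 1.0 if query.dob == candidate.dob else 0.0,
    }
    return round(100 * sum(WEIGHTS[f] * parts[f] for f in WEIGHTS))


query = Person("", "Andy J Smith", "9 Hedley Rd Reading", date(1975, 5, 12))
candidates = [
    Person("ABC123", "Andrew Jackson Smith", "9 Headley Road Woodley Reading", date(1975, 5, 12)),
    Person("ADE938", "Andrew Smythe", "29 Headley Road Reading", date(1949, 7, 5)),
    Person("ARF073", "Andrew Smith", "12 High Street Bracknell", date(1975, 12, 5)),
]

ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
for person in ranked:
    print(score(query, person), person.record_id, person.name)
```

Even this crude version ranks ABC123 first, because the date of birth agrees and both text fields are close. Changing the weights changes what counts as "the same Resident", which is the point the slide makes.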


User Results
The required data is found quickly
Results are scored and ranked

Search : Andy J Smith, 9 Hedley Rd Reading, 12 May 1975

97   Andrew Jackson Smith   9 Headley Road Woodley Reading   12/05/75
90   Andrew Smythe          29 Headley Road Reading          05/07/49


Our Solutions: The Complete View

Identity Resolution Software


IIR - Informatica Identity Resolution
To add identity search, matching and duplicate discovery to DB2, ORACLE, Sybase and SQL Server without change to existing code or tables

SSA-NAME3
SDK for on-line and batch name search and matching applications Core technology for other IIR products


Handling Identity Data


Ability to match all types of Identity Data
People Names and Aliases
Company Names
Addresses
Dates
Telephone Numbers
ID Numbers

But also :
Car registrations
Music Titles
..

Supported non-Latin sources


Arabic (cp1256)
Chinese Simplified (cp936)
Chinese Traditional (cp950)
Cyrillic (cp1251)
Greek (cp928)
Hebrew (cp1255)
Japanese (cp932)
Korean (cp949)
Thai (cp874)
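For readers preparing source files, the sketch below shows one way to decode bytes held in these code pages into Unicode text using Python's standard codecs. It is only an illustration of handling such data and says nothing about how the product reads it. The mapping and function names are assumptions, and Greek (cp928) is left out because Python's standard library has no codec registered under that name.

```python
import codecs

# Script -> Python codec name for the code pages listed above (cp928 omitted).
SOURCE_ENCODINGS = {
    "Arabic": "cp1256",
    "Chinese Simplified": "cp936",
    "Chinese Traditional": "cp950",
    "Cyrillic": "cp1251",
    "Hebrew": "cp1255",
    "Japanese": "cp932",
    "Korean": "cp949",
    "Thai": "cp874",
}


def to_unicode(raw: bytes, script: str) -> str:
    """Decode bytes stored in the code page associated with a script."""
    return raw.decode(SOURCE_ENCODINGS[script])


# Confirm every codec name resolves in this Python installation.
for script, encoding in SOURCE_ENCODINGS.items():
    codecs.lookup(encoding)

# Round trip: a Cyrillic surname stored as cp1251 bytes.
raw = "Иванов".encode("cp1251")
print(to_unicode(raw, "Cyrillic"))  # Иванов
```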


Tuning
We have 20+ years' experience of tuning to overcome an enormous variety of data and business issues
But YOU know YOUR data better than anyone
SO
Here is our knowledge, and some tools to help you tune that knowledge to fit your data


Tuning Tools
Included is the ability for :
Users to add their knowledge
Administrators to add, change and delete rules
Business rules to be customised
Field weightings to be modified
Thresholds to be set


The Product: Informatica Identity Resolution

Informatica Identity Resolution


Ability to build across multiple data sources
No limit to the number of sources

Input data can come from
MS SQL
Oracle
DB2
Sybase ASE
Flat File (Delimited or Fixed width)
A variety of other sources with ODBC connection


Informatica Identity Resolution


Speed and Scale
Able to perform high volume searches quickly against very large databases

Simple to deploy
No change to existing databases and tables
Complex coding/development not necessary


Informatica Identity Resolution


Built-in facilities including end-user screens for :
Searching (local and web clients)
Duplicate discovery within data
Relating of external files against the database
Creation of Known Match Tables (for use with Visualising tools)

APIs to link into existing applications


Central Customer View


Jonathan Smith

John Smith

Jonathon Smith

Smith Jon

Mr J Smith

Johnny Smythe

Dr J A Smith

Andy John Smith


Checking Lists

Government Black List Or Compliance List

Database


Summary & Close

Summary
Our Identity Resolution Software is used :
To overcome the unavoidable error and variation in identity data
24/7, batch and online
Cost of missing a match is high
Varied character sets and Transliteration
Large volumes of data
  11 TB of data
  2 million real time searches per day
  1 million batch transactions per hour

Minimum installation costs


A Final Thought
Are these the same people?
Could your existing software tell you?

Sean Mac Gabhann     Jon Smith
12 May 49            5 Dec 1949
British              British


Questions ?

