Data Cleaning: Problems and Current Approaches

Abstract
We classify data quality problems that are addressed by data cleaning and provide an overview of the main
solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and
should be addressed together with schema-related data transformations. In data warehouses, data cleaning is
a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
1 Introduction
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and
inconsistencies from data in order to improve the quality of data. Data quality problems are present in single
data collections, such as files and databases, e.g., due to misspellings during data entry, missing information
or other invalid data. When multiple data sources need to be integrated, e.g., in data warehouses, federated
database systems or global web-based information systems, the need for data cleaning increases
significantly. This is because the sources often contain redundant data in different representations. In order to
provide access to accurate and consistent data, consolidation of different data representations and elimination
of duplicate information become necessary.
[Figure 1. Steps of building a data warehouse: the ETL process. Data flows from the operational sources
through extraction, transformation and loading (extraction, integration, aggregation) via the data staging
area into the data warehouse. Schema steps: schema extraction and translation, schema matching and
integration, schema implementation. Instance steps: instance extraction and transformation, instance
matching and integration, filtering, aggregation. Supporting services: scheduling, logging, monitoring,
recovery, backup. Metadata labels: (2) translation rules, (3) instance characteristics (real metadata),
(4) mappings between source and target schema, (5) filtering and aggregation rules. Legend: metadata flow,
data flow.]
Data warehouses [6][16] require and provide extensive support for data cleaning. They load and
continuously refresh huge amounts of data from a variety of sources so the probability that some of the
sources contain dirty data is high. Furthermore, data warehouses are used for decision making, so that the
correctness of their data is vital to avoid wrong conclusions. For instance, duplicated or missing information
will produce incorrect or misleading statistics (garbage in, garbage out). Due to the wide range of possible
data inconsistencies and the sheer data volume, data cleaning is considered to be one of the biggest problems
in data warehousing. During the so-called ETL process (extraction, transformation, loading), illustrated in
Fig. 1, further data transformations deal with schema/data translation and integration, and with filtering and
aggregating data to be stored in the warehouse. As indicated in Fig. 1, all data cleaning is typically
performed in a separate data staging area before loading the transformed data into the warehouse. A large
number of tools of varying functionality is available to support these tasks, but often a significant portion of
the cleaning and transformation work has to be done manually or by low-level programs that are difficult to
write and maintain.
* This work was performed while on leave at Microsoft Research, Redmond, WA.
Federated database systems and web-based information systems face data transformation steps similar to
those of data warehouses. In particular, there is typically a wrapper per data source for extraction and a
mediator for integration [32][31]. So far, these systems provide only limited support for data cleaning,
focusing instead on data transformations for schema translation and schema integration. Data is not
preintegrated as for data warehouses but needs to be extracted from multiple sources, transformed and
combined during query runtime. The corresponding communication and processing delays can be significant,
making it difficult to achieve acceptable response times. The effort needed for data cleaning during
extraction and integration will further increase response times but is mandatory to achieve useful query
results.
A data cleaning approach should satisfy several requirements. First of all, it should detect and remove all
major errors and inconsistencies both in individual data sources and when integrating multiple sources. The
approach should be supported by tools to limit manual inspection and programming effort and be extensible
to easily cover additional sources. Furthermore, data cleaning should not be performed in isolation but
together with schema-related data transformations based on comprehensive metadata. Mapping functions for
data cleaning and other data transformations should be specified in a declarative way and be reusable for
other data sources as well as for query processing. Especially for data warehouses, a workflow infrastructure
should be supported to execute all data transformation steps for multiple sources and large data sets in a
reliable and efficient way.
While a huge body of research deals with schema translation and schema integration, data cleaning has
received only little attention in the research community. A number of authors focussed on the problem of
duplicate identification and elimination, e.g., [11][12][15][19][22][23]. Some research groups concentrate on
general problems that are not limited to but relevant to data cleaning, such as special data mining approaches
[30][29], and data transformations based on schema matching [1][21]. More recently, several research efforts
propose and investigate a more comprehensive and uniform treatment of data cleaning covering several
transformation phases, specific operators and their implementation [11][19][25].
In this paper we provide an overview of the problems to be addressed by data cleaning and their solution. In
the next section we present a classification of the problems. Section 3 discusses the main cleaning
approaches used in available tools and the research literature. Section 4 gives an overview of commercial
tools for data cleaning, including ETL tools. Section 5 is the conclusion.
2 Data cleaning problems

This section classifies the major data quality problems to be solved by data cleaning and data transformation.
As we will see, these problems are closely related and should thus be treated in a uniform way. Data
transformations [26] are needed to support any changes in the structure, representation or content of data.
These transformations become necessary in many situations, e.g., to deal with schema evolution, migrating a
legacy system to a new information system, or when multiple data sources are to be integrated.
As shown in Fig. 2 we roughly distinguish between single-source and multi-source problems and between
schema- and instance-related problems. Schema-level problems of course are also reflected in the instances;
they can be addressed at the schema level by an improved schema design (schema evolution), schema
translation and schema integration. Instance-level problems, on the other hand, refer to errors and
inconsistencies in the actual data contents which are not visible at the schema level. They are the primary
focus of data cleaning. Fig. 2 also indicates some typical problems for the various cases. While not shown in
Fig. 2, the single-source problems occur (with increased likelihood) in the multi-source case, too, besides
specific multi-source problems.
Data Quality Problems
  Single-Source Problems
    Schema Level (lack of integrity constraints, poor schema design):
      uniqueness, referential integrity
    Instance Level (data entry errors):
      misspellings, redundancy/duplicates, contradictory values
  Multi-Source Problems
    Schema Level (heterogeneous data models and schema designs):
      naming conflicts, structural conflicts
    Instance Level (overlapping, contradicting and inconsistent data):
      inconsistent aggregating, inconsistent timing
Figure 2. Classification of data quality problems in data sources
2.1 Single-source problems

The data quality of a source largely depends on the degree to which it is governed by schema and integrity
constraints controlling permissible data values. For sources without schema, such as files, there are few
restrictions on what data can be entered and stored, giving rise to a high probability of errors and
inconsistencies. Database systems, on the other hand, enforce restrictions of a specific data model (e.g., the
relational approach requires simple attribute values, referential integrity, etc.) as well as application-specific
integrity constraints. Schema-related data quality problems thus occur because of the lack of appropriate
model-specific or application-specific integrity constraints, e.g., due to data model limitations or poor
schema design, or because only a few integrity constraints were defined to limit the overhead for integrity
control. Instance-specific problems relate to errors and inconsistencies that cannot be prevented at the
schema level (e.g., misspellings).
Scope | Problem | Dirty Data | Reasons/Remarks
Attribute | Illegal values | bdate=30.13.70 | values outside of domain range
Record | Violated attribute dependencies | age=22, bdate=12.02.70 | age = (current date - birth date) should hold
Record type | Uniqueness violation | emp1=(name=John Smith, SSN=123456); emp2=(name=Peter Miller, SSN=123456) | uniqueness for SSN (social security number) violated
Source | Referential integrity violation | emp=(name=John Smith, deptno=127) | referenced department (127) not defined
Table 1. Examples for single-source problems at schema level (violated integrity constraints)
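The four violations of Table 1 can be detected mechanically once the intended constraints are known. The
following Python sketch is illustrative only and not part of the original text: the function names
(parse_bdate, age_on, duplicate_keys, dangling_refs), the DD.MM.YY date format and the record layout as
dictionaries are assumptions made for this example.

```python
from datetime import date, datetime

def parse_bdate(text):
    """Attribute scope: parse a DD.MM.YY date, return None for an illegal value."""
    try:
        return datetime.strptime(text, "%d.%m.%y").date()
    except ValueError:
        return None

def age_on(bdate, today):
    """Record scope: the age implied by a birth date on a given reference date."""
    years = today.year - bdate.year
    if (today.month, today.day) < (bdate.month, bdate.day):
        years -= 1  # birthday has not yet occurred in the reference year
    return years

def duplicate_keys(records, key):
    """Record type scope: values of a supposedly unique attribute that occur more than once."""
    seen, duplicates = set(), set()
    for record in records:
        value = record[key]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates

def dangling_refs(records, foreign_key, valid_keys):
    """Source scope: records whose foreign key references an undefined target."""
    return [r for r in records if r[foreign_key] not in valid_keys]
```

With the values of Table 1, parse_bdate("30.13.70") yields None, and comparing age_on(...) with the stored
age exposes the violated dependency; the reference date has to be supplied by the caller.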
For both schema- and instance-level problems we can distinguish different problem scopes: attribute (field),
record, record type and source; examples for the various cases are shown in Tables 1 and 2. Note that
uniqueness constraints specified at the schema level do not prevent duplicated instances, e.g., if information
on the same real world entity is entered twice with different attribute values (see example in Table 2).
Scope | Problem | Dirty Data | Reasons/Remarks
Attribute | Missing values | phone=9999999999 | unavailable values during data entry (dummy values or null)
Attribute | Misspellings | city=Liipzig | usually typos, phonetic errors
Attribute | Cryptic values, abbreviations | experience=B; occupation=DB Prog. |
Attribute | Embedded values | name=J. Smith 12.02.70 New York | multiple values entered in one attribute (e.g. in a free-form field)
Attribute | Misfielded values | city=Germany |
Record | Violated attribute dependencies | city=Redmond, zip=77777 | city and zip code should correspond
Record type | Word transpositions | name1=J. Smith, name2=Miller P. | usually in a free-form field
Record type | Duplicated records | emp1=(name=John Smith,...); emp2=(name=J. Smith,...) | same employee represented twice due to some data entry errors
Record type | Contradicting records | emp1=(name=John Smith, bdate=12.02.70); emp2=(name=John Smith, bdate=12.12.70) | the same real world entity is described by different values
Source | Wrong references | emp=(name=John Smith, deptno=17) | referenced department (17) is defined but wrong
Table 2. Examples for single-source problems at instance level
Given that cleaning data sources is an expensive process, preventing dirty data from being entered is
obviously an important step to reduce the cleaning problem. This requires an appropriate design of the
database schema and integrity constraints as well as of data entry applications. Also, the discovery of data
cleaning rules during warehouse design can suggest improvements to the constraints enforced by existing
schemas.
2.2 Multi-source problems

The problems present in single sources are aggravated when multiple sources need to be integrated. Each
source may contain dirty data and the data in the sources may be represented differently, overlap or
contradict. This is because the sources are typically developed, deployed and maintained independently to
serve specific needs. This results in a large degree of heterogeneity w.r.t. data management systems, data
models, schema designs and the actual data.
At the schema level, data model and schema design differences are to be addressed by the steps of schema
translation and schema integration, respectively. The main problems w.r.t. schema design are naming and
structural conflicts [2][24][17]. Naming conflicts arise when the same name is used for different objects
(homonyms) or different names are used for the same object (synonyms). Structural conflicts occur in many
variations and refer to different representations of the same object in different sources, e.g., attribute vs. table
representation, different component structure, different data types, different integrity constraints, etc.
In addition to schema-level conflicts, many conflicts appear only at the instance level (data conflicts). All
problems from the single-source case can occur with different representations in different sources (e.g.,
duplicated records, contradicting records). Furthermore, even when there are the same attribute names and
data types, there may be different value representations (e.g., for marital status) or different interpretation of
the values (e.g., measurement units Dollar vs. Euro) across sources. Moreover, information in the sources
may be provided at different aggregation levels (e.g., sales per product vs. sales per product group) or refer
to different points in time (e.g., current sales as of yesterday for source 1 vs. as of last week for source 2).
A main problem for cleaning data from multiple sources is to identify overlapping data, in particular
matching records referring to the same real world entity (e.g., customer). This problem is also referred to as
the object identity problem [11], duplicate elimination or the merge/purge problem [15]. Frequently, the
information is only partially redundant and the sources may complement each other by providing additional
information about an entity. Thus duplicate information should be purged out and complementing
information should be consolidated and merged in order to achieve a consistent view of real world entities.
Customer (source 1)
CID | Name | Street | City | Sex
11 | Kristen Smith | 2 Hurley Pl | South Fork, MN 48503 | 0
24 | Christian Smith | Hurley St 2 | S Fork MN | 1

Client (source 2)
Cno | LastName | FirstName | Gender | Address | Phone/Fax
24 | Smith | Christoph | M | 23 Harley St, Chicago IL, 60633-2394 | 333-222-6542 / 333-222-6599
493 | Smith | Kris L. | F | 2 Hurley Place, South Fork MN, 48503-5998 | 444-555-6666

Customers (integrated target with cleaned data)
No | LName | FName | Gender | Street | City | State | ZIP | Phone | Fax | CID | Cno
1 | Smith | Kristen L. | F | 2 Hurley Place | South Fork | MN | 48503-5998 | 444-555-6666 | | 11 | 493
2 | Smith | Christian | M | 2 Hurley Place | South Fork | MN | 48503-5998 | | | 24 |
3 | Smith | Christoph | M | 23 Harley Street | Chicago | IL | 60633-2394 | 333-222-6542 | 333-222-6599 | | 24

Figure 3. Examples of multi-source problems at schema and instance level
The two sources in the example of Fig. 3 are both in relational format but exhibit schema and data conflicts.
At the schema level, there are name conflicts (synonyms Customer/Client, Cid/Cno, Sex/Gender) and
structural conflicts (different representations for names and addresses). At the instance level, we note that
there are different gender representations (0/1 vs. F/M) and presumably a duplicate record (Kristen
Smith). The latter observation also reveals that while Cid/Cno are both source-specific identifiers, their
contents are not comparable between the sources; different numbers (11/493) may refer to the same person
while different persons can have the same number (24). Solving these problems requires both schema
integration and data cleaning; the third table shows a possible solution. Note that the schema conflicts should
be resolved first to allow data cleaning, in particular detection of duplicates based on a uniform
representation of names and addresses, and matching of the Gender/Sex values.
3 Data cleaning approaches

In general, data cleaning involves several phases:
- Data analysis: In order to detect which kinds of errors and inconsistencies are to be removed, a detailed
  data analysis is required. In addition to a manual inspection of the data or data samples, analysis
  programs should be used to gain metadata about the data properties and detect data quality problems.
- Definition of transformation workflow and mapping rules: Depending on the number of data sources,
  their degree of heterogeneity and the dirtiness of the data, a large number of data transformation and
  cleaning steps may have to be executed. Sometimes, a schema translation is used to map sources to a
  common data model; for data warehouses, typically a relational representation is used. Early data
  cleaning steps can correct single-source instance problems and prepare the data for integration. Later
  steps deal with schema/data integration and cleaning multi-source instance problems, e.g., duplicates.
  For data warehousing, the control and data flow for these transformation and cleaning steps should be
  specified within a workflow that defines the ETL process (Fig. 1).
  The schema-related data transformations as well as the cleaning steps should be specified by a
  declarative query and mapping language as far as possible, to enable automatic generation of the
  transformation code. In addition, it should be possible to invoke user-written cleaning code and
  special-purpose tools during a data transformation workflow. The transformation steps may request user
  feedback on data instances for which they have no built-in cleaning logic.
- Verification: The correctness and effectiveness of a transformation workflow and the transformation
  definitions should be tested and evaluated, e.g., on a sample or copy of the source data, to improve the
  definitions if necessary. Multiple iterations of the analysis, design and verification steps may be needed,
  e.g., since some errors only become apparent after applying some transformations.
- Transformation: Execution of the transformation steps either by running the ETL workflow for loading
  and refreshing a data warehouse or during answering queries on multiple sources.
- Backflow of cleaned data: After (single-source) errors are removed, the cleaned data should also replace
  the dirty data in the original sources in order to give legacy applications the improved data too and to
  avoid redoing the cleaning work for future data extractions. For data warehousing, the cleaned data is
  available from the data staging area (Fig. 1).
The transformation process obviously requires a large amount of metadata, such as schemas, instance-level
data characteristics, transformation mappings, workflow definitions, etc. For consistency, flexibility and ease
of reuse, this metadata should be maintained in a DBMS-based repository [4]. To support data quality,
detailed information about the transformation process is to be recorded, both in the repository and in the
transformed instances, in particular information about the completeness and freshness of source data and
lineage information about the origin of transformed objects and the changes applied to them. For instance, in
Fig. 3, the derived table Customers contains the attributes CID and Cno, allowing one to trace back the
source records.
In the following we describe in more detail possible approaches for data analysis (conflict detection),
transformation definition and conflict resolution. For approaches to schema translation and schema
integration, we refer to the literature as these problems have extensively been studied and described
[2][24][26]. Name conflicts are typically resolved by renaming; structural conflicts require a partial
restructuring and merging of the input schemas.
3.1 Data analysis

Metadata reflected in schemas is typically insufficient to assess the data quality of a source, especially if only
a few integrity constraints are enforced. It is thus important to analyse the actual instances to obtain real
(reengineered) metadata on data characteristics or unusual value patterns. This metadata helps to find data
quality problems. Moreover, it can effectively contribute to identifying attribute correspondences between
source schemas (schema matching), based on which automatic data transformations can be derived [20][9].
There are two related approaches for data analysis, data profiling and data mining. Data profiling focusses
on the instance analysis of individual attributes. It derives information such as the data type, length, value
range, discrete values and their frequency, variance, uniqueness, occurrence of null values, typical string
pattern (e.g., for phone numbers), etc., providing an exact view of various quality aspects of the attribute.
Table 3 shows examples of how this metadata can help to detect data quality problems.
Problems | Metadata | Examples/Heuristics
Illegal values | cardinality | e.g., cardinality(gender) > 2 indicates problem
Illegal values | max, min | max, min should not be outside of permissible range
Illegal values | variance, deviation | variance, deviation of statistical values should not be higher than threshold
Misspellings | attribute values | sorting on values often brings misspelled values next to correct values
Missing values | null values | percentage/number of null values
Missing values | attribute values + default values | presence of default value may indicate real value is missing
Varying value representations | attribute values | comparing attribute value set of a column of one table against that of a column of another table
Duplicates | cardinality + uniqueness | attribute cardinality = # rows should hold
Duplicates | attribute values | sorting values by number of occurrences; more than 1 occurrence indicates duplicates
Table 3. Examples for the use of reengineered metadata to address data quality problems
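A minimal sketch of attribute-level profiling along the lines of Table 3 is given below. It is an
illustration under stated assumptions rather than the method of any particular tool: the function names
profile and string_pattern are hypothetical, missing values are assumed to be represented as None, and the
remaining values of one attribute are assumed to be mutually comparable.

```python
import re
from collections import Counter

def profile(values):
    """Derive simple real metadata for one attribute from its instance values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),        # number of missing values
        "cardinality": len(counts),                 # number of distinct values
        "unique": len(counts) == len(present),      # cardinality = # rows should hold for keys
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "frequencies": counts.most_common(),        # discrete values sorted by occurrence
    }

def string_pattern(value):
    """Abstract a string to its pattern: every digit becomes 9, every letter becomes A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))
```

For a gender attribute, a reported cardinality above 2 would point to varying representations; grouping
values by string_pattern shows which phone numbers or zip codes deviate from the dominant format.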
Data mining helps discover specific data patterns in large data sets, e.g., relationships holding between
several attributes. This is the focus of so-called descriptive data mining models including clustering,
summarization, association discovery and sequence discovery [10]. As shown in [28], integrity constraints
among attributes such as functional dependencies or application-specific business rules can be derived,
which can be used to complete missing values, correct illegal values and identify duplicate records across
data sources. For example, an association rule with high confidence can hint at data quality problems in
instances violating this rule. So a confidence of 99% for rule total = quantity * unit price indicates that 1% of
the records do not comply and may require closer examination.
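The confidence figure in this example can be computed directly once a candidate rule is given. The sketch
below only evaluates a rule; it does not discover rules, which is the actual data mining task. The names
rule_confidence and total_rule, the dictionary record layout and the rounding tolerance are assumptions made
for illustration, and the record list is assumed to be non-empty.

```python
def rule_confidence(records, rule):
    """Return the fraction of records satisfying the rule and the list of violating records."""
    violations = [r for r in records if not rule(r)]
    confidence = 1 - len(violations) / len(records)
    return confidence, violations

def total_rule(record, tolerance=0.005):
    """The rule total = quantity * unit price, allowing for a small rounding tolerance."""
    expected = record["quantity"] * record["unit_price"]
    return abs(record["total"] - expected) <= tolerance
```

The violating records returned alongside the confidence are the candidates for closer examination.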
3.2 Defining data transformations

The data transformation process typically consists of multiple steps where each step may perform schema-
and instance-related transformations (mappings). To allow a data transformation and cleaning system to
generate transformation code and thus to reduce the amount of self-programming it is necessary to specify
the required transformations in an appropriate language, e.g., supported by a graphical user interface.
Various ETL tools (see Section 4) offer this functionality by supporting proprietary rule languages. A more
general and flexible approach is the use of the standard query language SQL to perform the data
transformations and utilize the possibility of application-specific language extensions, in particular
user-defined functions (UDFs) supported in SQL:99 [13][14]. UDFs can be implemented in SQL or a
general-purpose programming language with embedded SQL statements. They allow implementing a wide
range of data transformations and support easy reuse for different transformation and query processing tasks.
Furthermore, their execution by the DBMS can reduce data access cost and thus improve performance.
Finally, UDFs are part of the SQL:99 standard and should (eventually) be portable across many platforms
and DBMSs.
CREATE VIEW Customer2 (LName, FName, Gender, Street, City, State, ZIP, CID) AS
  SELECT LastNameExtract(Name), FirstNameExtract(Name), Sex, Street, CityExtract(City),
         StateExtract(City), ZIPExtract(City), CID
  FROM Customer
Figure 4. Example of transformation step definition
Fig. 4 shows a transformation step specified in SQL:99. The example refers to Fig. 3 and covers part of the
necessary data transformations to be applied to the first source. The transformation defines a view on which
further mappings can be performed. The transformation performs a schema restructuring with additional
attributes in the view obtained by splitting the name and address attributes of the source. The required data
extractions are achieved by UDFs (LastNameExtract, FirstNameExtract, CityExtract, StateExtract and
ZIPExtract). The UDF implementations can contain cleaning logic, e.g., to remove misspellings in city
names or provide missing zip codes.
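To make the role of the UDFs concrete, the sketch below registers Python functions under the names used in
Fig. 4 and runs the SELECT of that figure against an in-memory SQLite database. This is an illustration, not
the SQL:99 mechanism itself: the function bodies, the regular expression for the City field and the helper
transform_customers are assumptions, they contain no cleaning logic, and names are assumed to be non-empty
with the last name written last.

```python
import re
import sqlite3

# Assumed layout of the source City field: "<city>[,] <two-letter state> [<5-digit zip>]".
CITY_RE = re.compile(r"^(?P<city>.*?)[,\s]+(?P<state>[A-Z]{2})(?:\s+(?P<zip>\d{5}))?$")

def _city_part(value, part):
    """Return one component of the City field, or None if it cannot be extracted."""
    match = CITY_RE.match(value.strip())
    return match.group(part) if match else None

# Hypothetical implementations of the UDFs named in Fig. 4.
UDFS = {
    "LastNameExtract": lambda name: name.split()[-1],
    "FirstNameExtract": lambda name: " ".join(name.split()[:-1]),
    "CityExtract": lambda city: _city_part(city, "city"),
    "StateExtract": lambda city: _city_part(city, "state"),
    "ZIPExtract": lambda city: _city_part(city, "zip"),
}

def transform_customers(rows):
    """Load Customer rows (CID, Name, Street, City, Sex) and apply the query of Fig. 4."""
    conn = sqlite3.connect(":memory:")
    for name, func in UDFS.items():
        conn.create_function(name, 1, func)
    conn.execute("CREATE TABLE Customer (CID INTEGER, Name TEXT, Street TEXT, City TEXT, Sex INTEGER)")
    conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT LastNameExtract(Name), FirstNameExtract(Name), Sex, Street, "
        "CityExtract(City), StateExtract(City), ZIPExtract(City), CID "
        "FROM Customer ORDER BY CID"
    ).fetchall()
    conn.close()
    return result
```

Applied to the two Customer rows of Fig. 3, the second row keeps the abbreviated city and has no zip code,
which is exactly the kind of gap that cleaning logic inside the UDFs would have to fill.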
UDFs may still imply a substantial implementation effort and do not support all necessary schema
transformations. In particular, simple and frequently needed functions such as attribute splitting or merging
are not generically supported but often need to be re-implemented in application-specific variations (see
specific extract functions in Fig. 4). More complex schema restructurings (e.g., folding and unfolding of
attributes) are not supported at all. To generically support schema-related transformations, language
extensions such as the SchemaSQL proposal are required [18]. Data cleaning at the instance level can also
benefit from special language extensions such as a Match operator supporting approximate joins (see
below). System support for such powerful operators can greatly simplify the programming effort for data
transformations and improve performance. Some current research efforts on data cleaning are investigating
the usefulness and implementation of such query language extensions [11][25].
3.3 Conflict resolution

A set of transformation steps has to be specified and executed to resolve the various schema- and
instance-level data quality problems that are reflected in the data sources at hand. Several types of
transformations are to be performed on the individual data sources in order to deal with single-source
problems and to prepare for integration with other sources. In addition to a possible schema translation, these
preparatory steps typically include:
- Extracting values from free-form attributes (attribute split): Free-form attributes often capture multiple
  individual values that should be extracted to achieve a more precise representation and support further
  cleaning steps such as instance matching and duplicate elimination. Typical examples are name and
  address fields (Table 2, Fig. 3, Fig. 4). Required transformations in this step are reordering of values
  within a field to deal with word transpositions, and value extraction for attribute splitting.
- Validation and correction: This step examines each source instance for data entry errors and tries to
  correct them automatically as far as possible. Spell checking based on dictionary lookup is useful for
  identifying and correcting misspellings. Furthermore, dictionaries on geographic names and zip codes
  help to correct address data. Attribute dependencies (birth date and age, total price and unit
  price/quantity, city and phone area code, etc.) can be utilized to detect problems and substitute missing
  values or correct wrong values.
- Standardization: To facilitate instance matching and integration, attribute values should be converted to
  a consistent and uniform format. For example, date and time entries should be brought into a specific
  format; names and other string data should be converted to either upper or lower case, etc. Text data may
  be condensed and unified by performing stemming, removing prefixes, suffixes, and stop words.
  Furthermore, abbreviations and encoding schemes should consistently be resolved by consulting special
  synonym dictionaries or applying predefined conversion rules.
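The standardization step can be sketched as a handful of small conversion functions. Everything in the
following example is an assumption made for illustration: the function names, the tiny abbreviation
dictionary, the list of accepted date formats and the gender code table (which follows the 0/F and 1/M
correspondence suggested by Fig. 3). A real synonym dictionary would be far larger and domain-specific.

```python
from datetime import datetime

# Hypothetical conversion tables; real ones would come from synonym dictionaries.
ABBREVIATIONS = {"st": "street", "pl": "place"}
GENDER_CODES = {"0": "F", "1": "M", "f": "F", "m": "M"}

def standardize_text(value):
    """Lower-case a string, drop commas and periods, and expand known abbreviations."""
    tokens = value.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

def standardize_date(value, formats=("%d.%m.%y", "%Y-%m-%d", "%m/%d/%Y")):
    """Convert a date in one of the accepted formats to ISO format, or return None."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def standardize_gender(value):
    """Map source-specific gender encodings to a uniform F/M code, or None if unknown."""
    return GENDER_CODES.get(str(value).strip().lower())
```

After this step, the two street values "2 Hurley Pl" and "2 Hurley Place" from Fig. 3 become identical
strings, which simplifies the subsequent instance matching.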
Dealing with multi-source problems requires restructuring of schemas to achieve a schema integration,
including steps such as splitting, merging, folding and unfolding of attributes and tables. At the instance
level, conflicting representations need to be resolved and overlapping data must be dealt with. The
duplicate elimination task is typically performed after most other transformation and cleaning steps,
especially after having cleaned single-source errors and conflicting representations. It is performed either on
two cleaned sources at a time or on a single already integrated data set. Duplicate elimination requires first
identifying (i.e., matching) similar records concerning the same real world entity. In a second step, similar
records are merged into one record containing all relevant attributes without redundancy. Furthermore,
redundant records are purged. In the following we discuss the key problem of instance matching. More
details on the subject are provided elsewhere in this issue [22].
In the simplest case, there is an identifying attribute or attribute combination per record that can be used for
matching records, e.g., if different sources share the same primary key or if there are other common unique
attributes. Instance matching between different sources is then achieved by a standard equi-join on the
identifying attribute(s). In the case of a single data set, matches can be determined by sorting on the
identifying attribute and checking if neighboring records match. In both cases, efficient implementations can
be achieved even for large data sets. Unfortunately, without common key attributes or in the presence of
dirty data such straightforward approaches are often too restrictive. To determine most or all matches a
fuzzy matching (approximate join) becomes necessary that finds similar records based on a matching rule,
e.g., specified declaratively or implemented by a user-defined function [14][11]. For example, such a rule
could state that person records are likely to correspond if name and portions of the address match. The
degree of similarity between two records, often measured by a numerical value between 0 and 1, usually
depends on application characteristics. For instance, different attributes in a matching rule may contribute
different weight to the overall degree of similarity. For string components (e.g., customer name, company
name) exact matching and fuzzy approaches based on wildcards, character frequency, edit distance,
keyboard distance and phonetic similarity (soundex) are useful [11][15][19]. More complex string matching
approaches also considering abbreviations are presented in [23]. A general approach for matching both string
and text data is the use of common information retrieval metrics. WHIRL is a promising representative of
this category, using the cosine distance in the vector-space model for determining the degree of similarity
between text elements [7].
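One possible form of such a matching rule is a weighted combination of per-attribute string similarities. The
sketch below uses the edit distance, normalized to a value between 0 and 1; it is one simple choice among
the measures listed above, not the rule of any cited system. The function names, the attribute weights and
the idea of comparing against a fixed threshold are assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def string_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

def record_similarity(r1, r2, weights):
    """Weighted average of attribute similarities; weights maps attribute name to weight."""
    total = sum(weights.values())
    return sum(w * string_similarity(r1[attr], r2[attr]) for attr, w in weights.items()) / total
```

Record pairs whose similarity exceeds a chosen threshold would be treated as matches or as match candidates
for user confirmation; suitable weights and thresholds depend on the application.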
Determining matching instances with such an approach is typically a very expensive operation for large data
sets. Calculating the similarity value for any two records implies evaluation of the matching rule on the
cartesian product of the inputs. Furthermore, sorting on the similarity value is needed to determine matching
records covering duplicate information. All records for which the similarity value exceeds a threshold can be
considered as matches, or as match candidates to be confirmed or rejected by the user. In [15] a multi-pass
approach is proposed for instance matching to reduce the overhead. It is based on matching records
independently on different attributes and combining the different match results. Assuming a single input file,
each match pass sorts the records on a specific attribute and only tests nearby records within a certain
window on whether they satisfy a predetermined matching rule. This significantly reduces the number of
match rule evaluations compared to the cartesian product approach. The total set of matches is obtained by
the union of the matching pairs of each pass and their transitive closure.
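The multi-pass idea described above can be sketched as follows. This is a simplified illustration of the
general scheme (sort, compare within a sliding window, union the pairs, take the transitive closure), not a
reproduction of the implementation in [15]. The function names, the window semantics (each record is
compared with the next window - 1 records in sort order) and the use of a union-find structure for the
closure are assumptions.

```python
def sorted_neighborhood_pairs(records, key, window, is_match):
    """One pass: sort record indices by key and test only records within the window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs

def transitive_closure(n, pairs):
    """Group record indices connected by matching pairs (union-find); singletons are omitted."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(c for c in clusters.values() if len(c) > 1)

def multi_pass_match(records, keys, window, is_match):
    """Run one pass per sort key, union the matching pairs and close them transitively."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood_pairs(records, key, window, is_match)
    return transitive_closure(len(records), pairs)
```

With n records and window size w, each pass evaluates the matching rule at most n * (w - 1) times instead
of once for every pair of records.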
4 Tool support
A large variety of tools is available on the market to support data transformation and data cleaning tasks, in
particular for data warehousing (see footnote 1). Some tools concentrate on a specific domain, such as
cleaning name and address data, or a specific cleaning phase, such as data analysis or duplicate elimination.
Due to their restricted domain, specialized tools typically perform very well but must be complemented by
other tools to address the broad spectrum of transformation and cleaning problems. Other tools, e.g., ETL
tools, provide comprehensive transformation and workflow capabilities to cover a large part of the data
transformation and cleaning process. A general problem of ETL tools is their limited interoperability due to
proprietary application programming interfaces (API) and proprietary metadata formats making it difficult to
combine the functionality of several tools [8].
We first discuss tools for data analysis and data reengineering which process instance data to identify data
errors and inconsistencies, and to derive corresponding cleaning transformations. We then present
specialized cleaning tools and ETL tools, respectively.
4.1 Data analysis and reengineering tools

According to our classification in 3.1, data analysis tools can be divided into data profiling and data mining
tools. MIGRATIONARCHITECT (Evoke Software) is one of the few commercial data profiling tools. For each
attribute, it determines the following real metadata: data type, length, cardinality, discrete values and their
percentage, minimum and maximum values, missing values, and uniqueness. MIGRATIONARCHITECT also
assists in developing the target schema for data migration. Data mining tools, such as WIZRULE (WizSoft)
and DATAMININGSUITE (Information Discovery), infer relationships among attributes and their values and
compute a confidence rate indicating the number of qualifying rows. In particular, WIZRULE can reveal three
kinds of rules: mathematical formula, if-then rules, and spelling-based rules indicating misspelled names,
e.g., "value Edinburgh appears 52 times in field Customer; 2 case(s) contain similar value(s)". WIZRULE
also automatically points to the deviations from the set of the discovered rules as suspected errors.
Data reengineering tools, e.g., INTEGRITY (Vality), utilize discovered patterns and rules to specify and
perform cleaning transformations, i.e., they reengineer legacy data. In INTEGRITY, data instances undergo
several analysis steps, such as parsing, data typing, pattern and frequency analysis. The result of these steps
is a tabular representation of field contents, their patterns and frequencies, based on which the pattern for
standardizing data can be selected. For specifying cleaning transformations, INTEGRITY provides a language
including a set of operators for column transformations (e.g., move, split, delete) and row transformation
(e.g., merge, split). INTEGRITY identifies and consolidates records using a statistical matching technique.
Automated weighting factors are used to compute scores for ranking matches based on which the user can
select the real duplicates.
Footnote 1: For comprehensive vendor and tool listings, see commercial websites, e.g., Data Warehouse
Information Center (www.dwinfocenter.org), Data Management Review (www.dmreview.com), Data
Warehousing Institute (www.dwinstitute.com)
4.2 Specialized cleaning tools

Specialized cleaning tools typically deal with a particular domain, mostly name and address data, or
concentrate on duplicate elimination. The transformations are to be provided either in advance in the form of
a rule library or interactively by the user. Alternatively, data transformations can automatically be derived
from schema matching tools such as described in [21].
Special domain cleaning:Namesandaddressesarerecordedinmanysourcesandtypicallyhavehigh
cardinality.Forexample,findingcustomermatchesisveryimportantforcustomerrelationship
management.Anumberofcommercialtools,e.g.,IDCENTRIC(FirstLogic),PUREINTEGRATE(Oracle),
QUICKADDRESS(QASSystems),REUNION(PitneyBowes),andTRILLIUM(TrilliumSoftware),focuson
cleaningthiskindofdata.Theyprovidetechniquessuchasextractingandtransformingnameand
addressinformationintoindividualstandardelements,validatingstreetnames,cities,andzipcodes,in
combinationwithamatchingfacilitybasedonthecleaneddata.Theyincorporateahugelibraryofpre
specifiedrulesdealingwiththeproblemscommonlyfoundinprocessingthisdata.Forexample,
TRILLIUMsextraction(parser)andmatchermodulecontainsover200,000businessrules.Thetoolsalso
providefacilitiestocustomizeorextendtherulelibrarywithuserdefinedrulesforspecificneeds.
Duplicate elimination: Sample tools for duplicate identification and elimination include DATACLEANSER
(EDD), MERGE/PURGE LIBRARY (Sagent/QM Software), MATCHIT (HelpIT Systems), and
MASTERMERGE (Pitney Bowes). Usually, they require the data sources already be cleaned for matching.
Several approaches for matching attribute values are supported; tools such as DATACLEANSER and
MERGE/PURGE LIBRARY also allow user-specified matching rules to be integrated.
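
The two steps described above, standardizing name and address values with a rule library and then matching on the cleaned values, can be illustrated with a minimal Python sketch. It does not reproduce the method of any tool named here. The abbreviation rules, the field weights, the threshold and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rule library: abbreviation -> standard form (illustrative only).
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

# Hypothetical user-specified field weights for the match score.
WEIGHTS = {"name": 0.5, "street": 0.3, "zip": 0.2}


def standardize(value: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for a real field matcher."""
    return SequenceMatcher(None, a, b).ratio()


def match_score(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities on standardized values."""
    return sum(
        weight * similarity(standardize(r1[field]), standardize(r2[field]))
        for field, weight in WEIGHTS.items()
    )


def rank_candidates(records: list, threshold: float = 0.8) -> list:
    """Return (score, i, j) for all record pairs at or above the threshold,
    best score first, so that a user can confirm the real duplicates."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = match_score(records[i], records[j])
            if score >= threshold:
                pairs.append((score, i, j))
    return sorted(pairs, reverse=True)


records = [
    {"name": "John Smith", "street": "12 Main St.", "zip": "04109"},
    {"name": "john smith", "street": "12 Main Street", "zip": "04109"},
    {"name": "Mary Jones", "street": "7 Oak Ave", "zip": "10001"},
]
ranked = rank_candidates(records)
```

The pairwise loop is quadratic in the number of records and is only meant to show the scoring; it says nothing about how the commercial tools organize their comparisons.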
4.3 ETL tools

A large number of commercial tools support the ETL process for data warehouses in a comprehensive way,
e.g., COPYMANAGER (Information Builders), DATASTAGE (Informix/Ardent), EXTRACT (ETI), POWERMART
(Informatica), DECISIONBASE (CA/Platinum), DATA TRANSFORMATION SERVICE (Microsoft), METASUITE
(Minerva/Carleton), SAGENT SOLUTION PLATFORM (Sagent), and WAREHOUSE ADMINISTRATOR (SAS). They
use a repository built on a DBMS to manage all metadata about the data sources, target schemas, mappings,
script programs, etc., in a uniform way. Schemas and data are extracted from operational data sources via
both native file and DBMS gateways as well as standard interfaces such as ODBC and EDA. Data
transformations are defined with an easy-to-use graphical interface. To specify individual mapping steps, a
proprietary rule language and a comprehensive library of predefined conversion functions are typically
provided. The tools also support reusing existing transformation solutions, such as external C/C++ routines, by
providing an interface to integrate them into the internal transformation library. Transformation processing is
carried out either by an engine that interprets the specified transformations at runtime, or by compiled code.
All engine-based tools (e.g., COPYMANAGER, DECISIONBASE, POWERMART, DATASTAGE,
WAREHOUSE ADMINISTRATOR) possess a scheduler and support workflows with complex execution
dependencies among mapping jobs. A workflow may also invoke external tools, e.g., for specialized cleaning
tasks such as name/address cleaning or duplicate elimination.
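
As a rough illustration of a workflow with execution dependencies among mapping jobs, the following Python sketch orders hypothetical jobs so that each runs only after the jobs it depends on. The job names and the dependency graph are invented for this example, and the standard-library `graphlib` module (Python 3.9 or later) stands in for whatever scheduler a given ETL engine actually uses.

```python
from graphlib import TopologicalSorter

# Hypothetical mapping jobs; each job maps to the set of jobs it depends on.
dependencies = {
    "extract_customers": set(),
    "extract_orders": set(),
    "clean_addresses": {"extract_customers"},       # e.g. an external address tool
    "eliminate_duplicates": {"clean_addresses"},    # e.g. an external matching tool
    "load_warehouse": {"eliminate_duplicates", "extract_orders"},
}


def execution_order(deps: dict) -> list:
    """Return one valid order in which the jobs can run.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


def run_workflow(deps: dict, actions: dict) -> list:
    """Run each job's action (if one is registered) in dependency order."""
    executed = []
    for job in execution_order(deps):
        actions.get(job, lambda: None)()
        executed.append(job)
    return executed
```

An entry in `actions` could wrap a call to an external cleaning tool, which corresponds to the case where a workflow invokes specialized name/address cleaning or duplicate elimination.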
ETL tools typically have little built-in data cleaning capabilities but allow the user to specify cleaning
functionality via a proprietary API. There is usually no data analysis support to automatically detect data errors
and inconsistencies. However, users can implement such logic with the metadata maintained and by
determining content characteristics with the help of aggregation functions (sum, count, min, max, median,
variance, deviation, ...). The provided transformation library covers many data transformation and cleaning
needs, such as data type conversions (e.g., date reformatting), string functions (e.g., split, merge, replace,
sub-string search), arithmetic, scientific and statistical functions, etc. Extraction of values from free-form
attributes is not completely automatic but the user has to specify the delimiters separating sub-values.
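
A minimal sketch of how a user might derive content characteristics with aggregation functions, and split a free-form attribute with a user-specified delimiter, is given below in Python. The function names, the sample column and the outlier rule (more than `k` population standard deviations from the mean) are assumptions for illustration, not features of any particular ETL tool.

```python
import statistics


def profile_column(values: list) -> dict:
    """Summarize a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "sum": sum(present),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "variance": statistics.pvariance(present),
        "deviation": statistics.pstdev(present),
    }


def flag_outliers(values: list, k: float = 2.0) -> list:
    """Return values lying more than k standard deviations from the mean."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    deviation = statistics.pstdev(present)
    if deviation == 0:
        return []
    return [v for v in present if abs(v - mean) > k * deviation]


def split_free_form(value: str, delimiter: str) -> list:
    """Split a free-form attribute on a delimiter supplied by the user."""
    return [part.strip() for part in value.split(delimiter)]


# Hypothetical "age" column with one missing and one implausible value.
ages = [34, 29, None, 41, 38, 36, 31, 45, 27, 33, 999]
summary = profile_column(ages)
suspicious = flag_outliers(ages)
parts = split_free_form("Leipzig; 04109; Germany", ";")
```

Mean and standard deviation are themselves distorted by extreme values, so a rule of this kind is only a first filter and the threshold `k` has to be chosen per column.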
The rule languages typically cover if-then and case constructs that help handling exceptions in data values,
such as misspellings, abbreviations, missing or cryptic values, and values outside of range. These problems
can also be addressed by using a table lookup construct and join functionality. Support for instance matching
is typically restricted to the use of the join construct and some simple string matching functions, e.g., exact
or wildcard matching and soundex. However, user-defined field matching functions as well as functions for
correlating field similarities can be programmed and added to the internal transformation library.
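
The constructs mentioned above can be sketched in Python as follows: an if-then rule for missing values, a table lookup for abbreviations, and a join on soundex codes as a simple form of instance matching. The lookup table and function names are hypothetical, and the soundex variant shown is the common American one, which may differ in detail from what a given tool implements.

```python
# Letter -> digit table of American Soundex; vowels, h, w and y have no code.
CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

# Hypothetical lookup table mapping abbreviations to standard values.
COUNTRY_LOOKUP = {"de": "Germany", "germ.": "Germany", "usa": "United States"}


def soundex(name: str) -> str:
    """Return the four-character soundex code of a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            digits.append(code)
        if c not in "hw":  # h and w do not separate letters with equal codes
            previous = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def clean_country(value):
    """If-then rule for missing values, then table lookup for abbreviations."""
    if value is None or not value.strip():
        return None
    key = value.strip().lower()
    return COUNTRY_LOOKUP.get(key, value.strip())


def soundex_join(left: list, right: list) -> list:
    """Join two name lists on equal soundex codes."""
    by_code = {}
    for name in right:
        by_code.setdefault(soundex(name), []).append(name)
    return [(l, r) for l in left for r in by_code.get(soundex(l), [])]
```

For example, `soundex_join(["Smith"], ["Smyth", "Jones"])` pairs `Smith` with `Smyth` only, since both names share the code `S530`.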
5 Conclusions
We provided a classification of data quality problems in data sources differentiating between single- and
multi-source and between schema- and instance-level problems. We further outlined the major steps for data
transformation and data cleaning and emphasized the need to cover schema- and instance-related data
transformations in an integrated way. Furthermore, we provided an overview of commercial data cleaning
tools. While the state-of-the-art in these tools is quite advanced, they do typically cover only part of the
problem and still require substantial manual effort or self-programming. Furthermore, their interoperability is
limited (proprietary APIs and metadata representations).
So far only a little research has appeared on data cleaning, although the large number of tools indicates both
the importance and difficulty of the cleaning problem. We see several topics deserving further research. First
of all, more work is needed on the design and implementation of the best language approach for supporting
both schema and data transformations. For instance, operators such as Match, Merge or Mapping
Composition have either been studied at the instance (data) or schema (metadata) level but may be built on
similar implementation techniques. Data cleaning is not only needed for data warehousing but also for query
processing on heterogeneous data sources, e.g., in web-based information systems. This environment poses
much more restrictive performance constraints for data cleaning that need to be considered in the design of
suitable approaches. Furthermore, data cleaning for semi-structured data, e.g., based on XML, is likely to be
of great importance given the reduced structural constraints and the rapidly increasing amount of XML data.
Acknowledgments
We would like to thank Phil Bernstein, Helena Galhardas and Sunita Sarawagi for helpful comments.
References
[1] Abiteboul, S.; Cluet, S.; Milo, T.; Mogilevsky, P.; Simeon, J.: Tools for Data Translation and Integration. In
[26]:3-8, 1999.
[2] Batini, C.; Lenzerini, M.; Navathe, S.B.: A Comparative Analysis of Methodologies for Database Schema
Integration. In Computing Surveys 18(4):323-364, 1986.
[3] Bernstein, P.A.; Bergstraesser, T.: Metadata Support for Data Transformation Using Microsoft Repository. In
[26]:9-14, 1999.
[4] Bernstein, P.A.; Dayal, U.: An Overview of Repository Technology. Proc. 20th VLDB, 1994.
[5] Bouzeghoub, M.; Fabret, F.; Galhardas, H.; Pereira, J.; Simon, E.; Matulovic, M.: Data Warehouse Refreshment. In
[16]:47-67.
[6] Chaudhuri, S.; Dayal, U.: An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record
26(1), 1997.
[7] Cohen, W.: Integration of Heterogeneous Databases without Common Domains Using Queries Based on
Textual Similarity. Proc. ACM SIGMOD Conf. on Data Management, 1998.
[8] Do, H.H.; Rahm, E.: On Metadata Interoperability in Data Warehouses. Techn. Report, Dept. of Computer
Science, Univ. of Leipzig. http://dol.uni-leipzig.de/pub/2000-13.
[9] Doan, A.H.; Domingos, P.; Levy, A.Y.: Learning Source Description for Data Integration. Proc. 3rd Intl.
Workshop The Web and Databases (WebDB), 2000.
[10] Fayyad, U.: Mining Database: Towards Algorithms for Knowledge Discovery. IEEE Techn. Bulletin Data
Engineering 21(1), 1998.
[11] Galhardas, H.; Florescu, D.; Shasha, D.; Simon, E.: Declaratively cleaning your data using AJAX. In Journees
Bases de Donnees, Oct. 2000. http://caravel.inria.fr/~galharda/BDA.ps.
[12] Galhardas, H.; Florescu, D.; Shasha, D.; Simon, E.: AJAX: An Extensible Data Cleaning Tool. Proc. ACM
SIGMOD Conf., p. 590, 2000.
[13] Haas, L.M.; Miller, R.J.; Niswonger, B.; Tork Roth, M.; Schwarz, P.M.; Wimmers, E.L.: Transforming
Heterogeneous Data with Database Middleware: Beyond Integration. In [26]:31-36, 1999.
[14] Hellerstein, J.M.; Stonebraker, M.; Caccia, R.: Independent, Open Enterprise Data Integration. In [26]:43-49,
1999.
[15] Hernandez, M.A.; Stolfo, S.J.: Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Data
Mining and Knowledge Discovery 2(1):9-37, 1998.
[16] Jarke, M.; Lenzerini, M.; Vassiliou, Y.; Vassiliadis, P.: Fundamentals of Data Warehouses. Springer, 2000.
[17] Kashyap, V.; Sheth, A.P.: Semantic and Schematic Similarities between Database Objects: A Context-Based
Approach. VLDB Journal 5(4):276-304, 1996.
[18] Lakshmanan, L.; Sadri, F.; Subramanian, I.N.: SchemaSQL - A Language for Interoperability in Relational
Multi-Database Systems. Proc. 22nd VLDB, 1996.
[19] Lee, M.L.; Lu, H.; Ling, T.W.; Ko, Y.T.: Cleansing Data for Mining and Warehousing. Proc. 10th Intl. Conf.
Database and Expert Systems Applications (DEXA), 1999.
[20] Li, W.S.; Clifton, C.: SEMINT: A Tool for Identifying Attribute Correspondences in Heterogeneous
Databases Using Neural Networks. In Data and Knowledge Engineering 33(1):49-84, 2000.
[21] Milo, T.; Zohar, S.: Using Schema Matching to Simplify Heterogeneous Data Translation. Proc. 24th VLDB,
1998.
[22] Monge, A.E.: Matching Algorithm within a Duplicate Detection System. IEEE Techn. Bulletin Data Engineering
23(4), 2000 (this issue).
[23] Monge, A.E.; Elkan, P.C.: The Field Matching Problem: Algorithms and Applications. Proc. 2nd Intl. Conf.
Knowledge Discovery and Data Mining (KDD), 1996.
[24] Parent, C.; Spaccapietra, S.: Issues and Approaches of Database Integration. Comm. ACM 41(5):166-178, 1998.
[25] Raman, V.; Hellerstein, J.M.: Potter's Wheel: An Interactive Framework for Data Cleaning. Working Paper, 1999.
http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf.
[26] Rundensteiner, E. (ed.): Special Issue on Data Transformation. IEEE Techn. Bull. Data Engineering 22(1), 1999.
[27] Quass, D.: A Framework for Research in Data Cleaning. Unpublished Manuscript. Brigham Young Univ., 1999.
[28] Sapia, C.; Höfling, G.; Müller, M.; Hausdorf, C.; Stoyan, H.; Grimmer, U.: On Supporting the Data Warehouse
Design by Data Mining Techniques. Proc. GI-Workshop Data Mining and Data Warehousing, 1999.
[29] Savasere, A.; Omiecinski, E.; Navathe, S.: An Efficient Algorithm for Mining Association Rules in Large
Databases. Proc. 21st VLDB, 1995.
[30] Srikant, R.; Agrawal, R.: Mining Generalized Association Rules. Proc. 21st VLDB Conf., 1995.
[31] Tork Roth, M.; Schwarz, P.M.: Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. Proc.
23rd VLDB, 1997.
[32] Wiederhold, G.: Mediators in the Architecture of Future Information Systems. Computer 25(3):38-49, 1992.