Practical Machine Learning Tools and Techniques
Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Input: Concepts, instances, attributes
Terminology
What's a concept?
  Classification, association, clustering, numeric prediction
What's in an example?
  Relations, flat files, recursion
What's in an attribute?
  Nominal, ordinal, interval, ratio
Preparing the input
  ARFF, attributes, missing values, getting to know data
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)

Terminology
Components of the input:
  Concepts: kinds of things that can be learned
    Aim: intelligible and operational concept description
  Instances: the individual, independent examples of a concept
    Note: more complicated forms of input are possible
  Attributes: measuring aspects of an instance
    We will focus on nominal and numeric ones

What's a concept?
Styles of learning:
  Classification learning: predicting a discrete class
  Association learning: detecting associations between features
  Clustering: grouping similar instances into clusters
  Numeric prediction: predicting a numeric quantity
Concept: thing to be learned
Concept description: output of learning scheme

Classification learning
Example problems: weather data, contact lenses, irises, labor negotiations
Classification learning is supervised
  Scheme is provided with actual outcome
Outcome is called the class of the example
Measure success on fresh data for which class labels are known (test data)
In practice success is often measured subjectively

Association learning
Can be applied if no class is specified and any kind of structure is considered interesting
Difference to classification learning:
  Can predict any attribute's value, not just the class, and more than one attribute's value at a time
  Hence: far more association rules than classification rules
  Thus: constraints are necessary
    Minimum coverage and minimum accuracy

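The coverage and accuracy constraints can be made concrete with a small sketch. It assumes the usual definitions (coverage: the number of instances a rule predicts correctly; accuracy: that number as a proportion of the instances the rule applies to). The toy rows and the helper names `matches` and `coverage_and_accuracy` are illustrative and not part of the slides.

```python
# Toy nominal weather rows (illustrative only, not the full dataset).
instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]


def matches(instance, conditions):
    """True if the instance has every attribute=value pair in conditions."""
    return all(instance.get(attr) == value for attr, value in conditions.items())


def coverage_and_accuracy(instances, antecedent, consequent):
    """Return (coverage, accuracy) of the rule 'if antecedent then consequent'."""
    applies = [x for x in instances if matches(x, antecedent)]
    correct = [x for x in applies if matches(x, consequent)]
    coverage = len(correct)
    accuracy = coverage / len(applies) if applies else 0.0
    return coverage, accuracy


# Rule: if windy = false then play = yes
coverage, accuracy = coverage_and_accuracy(
    instances, {"windy": "false"}, {"play": "yes"}
)
```

A rule miner would keep only rules whose coverage and accuracy both exceed user-chosen minimums, which is what keeps the number of reported rules manageable.
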
Clustering
Finding groups of items that are similar
Clustering is unsupervised
  The class of an example is not known
Success often measured subjectively

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica

Numeric prediction
Variant of classification learning where class is numeric (also called regression)
Learning is supervised
  Scheme is being provided with target value
Measure success on test data

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False
Sunny     Hot          High      True
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40

What's in an example?
Instance: specific type of example
  Thing to be classified, associated, or clustered
  Individual, independent example of target concept
  Characterized by a predetermined set of attributes
Input to learning scheme: set of instances/dataset
  Represented as a single relation/flat file
Rather restricted form of input
  No relationships between objects
Most common form in practical data mining

A family tree
[Figure: family tree. Peter (M) = Peggy (F), with children Steven (M), Graham (M), and Pam (F); Grace (F) = Ray (M), with children Ian (M), Pippa (F), and Brian (M); Pam (F) = Ian (M), with children Anna (F) and Nikki (F).]

Family tree represented as a table

Name    Gender  Parent1  Parent2
Peter   Male    ?        ?
Peggy   Female  ?        ?
Steven  Male    Peter    Peggy
Graham  Male    Peter    Peggy
Pam     Female  Peter    Peggy
Ian     Male    Grace    Ray
Pippa   Female  Grace    Ray
Brian   Male    Grace    Ray
Anna    Female  Pam      Ian
Nikki   Female  Pam      Ian

The "sister-of" relation

Excerpt of the full listing of pairs:
First person  Second person  Sister of?
Peter         Peggy          No
Peter         Steven         No
Steven        Peter          No
Steven        Graham         No
Steven        Pam            Yes
Ian           Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes

Positive examples only (closed-world assumption):
First person  Second person  Sister of?
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian         Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes
All the rest                 No

A full representation in one table

First person                      Second person                     Sister of?
Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
Steven  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
Graham  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
Ian     Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Brian   Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Anna    Female  Pam      Ian      Nikki   Female  Pam      Ian      Yes
Nikki   Female  Pam      Ian      Anna    Female  Pam      Ian      Yes
All the rest                                                        No

Generating a flat file
Process of flattening called denormalization
  Several relations are joined together to make one
Possible with any finite set of finite relations
Problematic: relationships without pre-specified number of objects
  Example: concept of nuclear family
Denormalization may produce spurious regularities that reflect structure of database
  Example: "supplier" predicts "supplier address"

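As a rough illustration of denormalization, the sketch below joins the family-tree relation with itself and attaches the sister-of label, producing one flat row per ordered pair of people. The function name `denormalize`, the dictionary layout, and the choice to skip pairing a person with themselves are assumptions made for this example.

```python
# Person relation from the family-tree table: name -> (gender, parent1, parent2).
people = {
    "Peter": ("Male", "?", "?"),
    "Peggy": ("Female", "?", "?"),
    "Steven": ("Male", "Peter", "Peggy"),
    "Graham": ("Male", "Peter", "Peggy"),
    "Pam": ("Female", "Peter", "Peggy"),
    "Ian": ("Male", "Grace", "Ray"),
    "Pippa": ("Female", "Grace", "Ray"),
    "Brian": ("Male", "Grace", "Ray"),
    "Anna": ("Female", "Pam", "Ian"),
    "Nikki": ("Female", "Pam", "Ian"),
}

# Positive sister-of pairs; every other pair is "No" (closed-world assumption).
sister_of = {
    ("Steven", "Pam"), ("Graham", "Pam"), ("Ian", "Pippa"),
    ("Brian", "Pippa"), ("Anna", "Nikki"), ("Nikki", "Anna"),
}


def denormalize(people, positive_pairs):
    """Join the person relation with itself and attach the class label."""
    rows = []
    for first in people:
        for second in people:
            if first == second:
                continue  # assumption: a person is not paired with themselves
            label = "Yes" if (first, second) in positive_pairs else "No"
            rows.append((first, *people[first], second, *people[second], label))
    return rows


flat = denormalize(people, sister_of)
```

Each row of `flat` is an independent instance with nine attributes, so it can be handed to any scheme that expects a single flat file. The cost is redundancy: each person's gender and parents are repeated in every row that mentions them.
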
The "ancestor-of" relation

First person                      Second person                     Ancestor of?
Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
Peter   Male    ?        ?        Steven  Male    Peter    Peggy    Yes
Peter   Male    ?        ?        Pam     Female  Peter    Peggy    Yes
Peter   Male    ?        ?        Anna    Female  Pam      Ian      Yes
Peter   Male    ?        ?        Nikki   Female  Pam      Ian      Yes
Pam     Female  Peter    Peggy    Nikki   Female  Pam      Ian      Yes
Grace   Female  ?        ?        Ian     Male    Grace    Ray      Yes
Grace   Female  ?        ?        Nikki   Female  Pam      Ian      Yes
Other positive examples here                                        Yes
All the rest                                                        No

Recursion
Infinite relations require recursion
  If person1 is a parent of person2
  then person1 is an ancestor of person2
  If person1 is a parent of person2
  and person2 is an ancestor of person3
  then person1 is an ancestor of person3
Appropriate techniques are known as inductive logic programming (e.g. Quinlan's FOIL)
  Problems: (a) noise and (b) computational complexity

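The two rules translate almost directly into a recursive function. This is only a sketch of how the learned concept would be applied, not of how an inductive logic programming system such as FOIL learns it; the `children` mapping is derived from the family-tree table, and the function names are chosen for this example.

```python
# parent -> children, derived from the family-tree table.
children = {
    "Peter": ["Steven", "Graham", "Pam"],
    "Peggy": ["Steven", "Graham", "Pam"],
    "Grace": ["Ian", "Pippa", "Brian"],
    "Ray": ["Ian", "Pippa", "Brian"],
    "Pam": ["Anna", "Nikki"],
    "Ian": ["Anna", "Nikki"],
}


def is_parent(person1, person2):
    return person2 in children.get(person1, [])


def is_ancestor(person1, person2):
    # Rule 1: if person1 is a parent of person2, person1 is an ancestor of person2.
    if is_parent(person1, person2):
        return True
    # Rule 2: if person1 is a parent of some child, and that child is an
    # ancestor of person2, then person1 is an ancestor of person2.
    return any(is_ancestor(child, person2) for child in children.get(person1, []))
```

The recursion terminates because a family tree has no cycles. The same two rules cover a tree of any depth, which is why no finite flat table can replace them.
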
Multi-instance concepts
Each individual example comprises a set of instances
  All instances are described by the same attributes
  One or more instances within an example may be responsible for its classification
Goal of learning is still to produce a concept description
Important real world applications
  e.g. drug activity prediction

What's in an attribute?
Each instance is described by a fixed predefined set of features, its attributes
But: number of attributes may vary in practice
  Possible solution: "irrelevant value" flag
Related problem: existence of an attribute may depend on the value of another one
Possible attribute types (levels of measurement):
  Nominal, ordinal, interval and ratio

Nominal quantities
Values are distinct symbols
  Values themselves serve only as labels or names
  Nominal comes from the Latin word for name
Example: attribute "outlook" from weather data
  Values: "sunny", "overcast", and "rainy"
No relation is implied among nominal values (no ordering or distance measure)
Only equality tests can be performed

Ordinal quantities
Impose order on values
But: no distance between values defined
Example: attribute "temperature" in weather data
  Values: "hot" > "mild" > "cool"
Note: addition and subtraction don't make sense
Example rule: temperature < hot => play = yes
Distinction between nominal and ordinal not always clear (e.g. attribute "outlook")

Interval quantities
Interval quantities are not only ordered but measured in fixed and equal units
Example 1: attribute "temperature" expressed in degrees Fahrenheit
Example 2: attribute "year"
Difference of two values makes sense
Sum or product doesn't make sense
  Zero point is not defined!

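One way to see why differences are meaningful on an interval scale while ratios of raw values are not: convert the same temperatures from Fahrenheit to Celsius and compare. The specific temperatures below are arbitrary values picked for illustration.

```python
def fahrenheit_to_celsius(f):
    return (f - 32.0) * 5.0 / 9.0


# Three arbitrary temperatures in degrees Fahrenheit, and the same in Celsius.
a_f, b_f, c_f = 80.0, 60.0, 40.0
a_c, b_c, c_c = (fahrenheit_to_celsius(t) for t in (a_f, b_f, c_f))

# Comparing two differences gives the same answer in both units.
diff_ratio_f = (a_f - b_f) / (b_f - c_f)
diff_ratio_c = (a_c - b_c) / (b_c - c_c)

# Comparing two raw values does not: the result depends on where zero sits.
value_ratio_f = a_f / c_f
value_ratio_c = a_c / c_c
```

Here both difference ratios come out as 1, while the raw-value ratio is 2 in Fahrenheit and about 6 in Celsius, so a statement like "twice as hot" carries no meaning on either scale.
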
Ratio quantities
Ratio quantities are ones for which the measurement scheme defines a zero point
Example: attribute "distance"
  Distance between an object and itself is zero
Ratio quantities are treated as real numbers
  All mathematical operations are allowed
But: is there an inherently defined zero point?
  Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)

Attribute types used in practice
Most schemes accommodate just two levels of measurement: nominal and ordinal
Nominal attributes are also called "categorical", "enumerated", or "discrete"
  But: "enumerated" and "discrete" imply order
Special case: dichotomy ("boolean" attribute)
Ordinal attributes are called "numeric", or "continuous"
  But: "continuous" implies mathematical continuity

Metadata
Information about the data that encodes background knowledge
Can be used to restrict search space
Examples:
  Dimensional considerations (i.e. expressions must be dimensionally correct)
  Circular orderings (e.g. degrees in compass)
  Partial orderings (e.g. generalization/specialization relations)

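A circular ordering is a case where metadata changes how values should be compared. The sketch below, with a hypothetical helper name `compass_difference`, shows a difference measure for compass bearings that respects the wrap-around at 360 degrees.

```python
def compass_difference(a_deg, b_deg):
    """Smallest angle in degrees between two compass bearings."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)
```

Treating the bearings as ordinary numbers would put 350 and 10 degrees 340 apart; with the circular ordering taken into account they are 20 degrees apart.
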
Preparing the input
Denormalization is not the only issue
Problem: different data sources (e.g. sales department, customer billing department, ...)
  Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
  Data must be assembled, integrated, cleaned up
  "Data warehouse": consistent point of access
External data may be required ("overlay data")
Critical: type and level of data aggregation

The ARFF format
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...

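To show how little structure a dense ARFF file has, here is a minimal reader sketch. It is not Weka's parser: it assumes unquoted values, one instance per line, and only nominal and numeric attributes, and the name `parse_simple_arff` is made up for this example. The attribute declarations in the sample text follow the weather attributes declared elsewhere in these slides.

```python
ARFF_TEXT = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""


def parse_simple_arff(text):
    """Parse a dense ARFF text with unquoted nominal and numeric attributes."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row[name] = None  # "?" marks a missing value
                elif kind == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of permitted values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_simple_arff(ARFF_TEXT)
```

Quoted strings, date attributes, relational attributes, and sparse rows are deliberately left out; a real reader has to handle all of them.
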
Additional attribute types
ARFF supports string attributes:
@attribute description string
  Similar to nominal attributes but list of values is not pre-specified
It also supports date attributes:
@attribute today date
  Uses the ISO-8601 combined date and time format yyyy-MM-ddTHH:mm:ss

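A date value in the ISO-8601 combined format can be read with Python's standard library, as sketched below. The helper name `parse_arff_date` and the sample timestamps are invented for illustration.

```python
from datetime import datetime

# strptime equivalent of the combined date and time pattern.
ISO_8601_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_arff_date(value):
    """Parse one date value, tolerating surrounding whitespace and quotes."""
    return datetime.strptime(value.strip().strip('"'), ISO_8601_FORMAT)


stamp = parse_arff_date("2004-04-03T12:00:00")
later = parse_arff_date('"2004-04-04T00:00:00"')
elapsed_seconds = (later - stamp).total_seconds()
```

Once parsed, dates behave like an interval quantity: differences between two dates are meaningful, which is how a learning scheme would typically use them.
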
Relational attributes
Allow multi-instance problems to be represented in ARFF format
  The value of a relational attribute is a separate set of instances
@attribute bag relational
  @attribute outlook { sunny, overcast, rainy }
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy { true, false }
@end bag
Nested attribute block gives the structure of the referenced instances

Multi-instance ARFF
%
% Multiple instance ARFF file for the weather data
%
@relation weather
@attribute bag_ID { 1, 2, 3, 4, 5, 6, 7 }
@attribute bag relational
  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {true, false}
@end bag
@attribute play? {yes, no}
@data
1, "sunny, 85, 85, false\nsunny, 80, 90, true", no
2, "overcast, 83, 86, false\nrainy, 70, 96, false", yes
...

Sparse data
In some applications most attribute values in a dataset are zero
  E.g.: word counts in a text categorization problem
ARFF supports sparse data
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}
This also works for nominal attributes (where the first value corresponds to "zero")

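The mapping from a dense row to its sparse form is mechanical: list only the non-zero entries, each prefixed by its 0-based attribute index. The sketch below uses a made-up helper name `to_sparse` and quotes values that contain a space; it does not handle the nominal case where the first declared value plays the role of zero.

```python
def to_sparse(dense_row):
    """Render a dense row in sparse notation using 0-based attribute indices."""
    entries = []
    for index, value in enumerate(dense_row):
        if value == 0:
            continue  # zeros are implied and therefore omitted
        if isinstance(value, str) and " " in value:
            text = f'"{value}"'  # quote values that contain a space
        else:
            text = str(value)
        entries.append(f"{index} {text}")
    return "{" + ", ".join(entries) + "}"


row_a = to_sparse([0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"])
row_b = to_sparse([0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"])
```

For word-count data with thousands of attributes and only a handful of non-zero counts per document, this representation is far smaller than the dense one.
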
Attribute types
Interpretation of attribute types in ARFF depends on learning scheme
  Numeric attributes are interpreted as
    ordinal scales if less-than and greater-than are used
    ratio scales if distance calculations are performed (normalization/standardization may be required)
  Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
Integers in some given data file: nominal, ordinal, or ratio scale?

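A sketch of how an instance-based scheme might combine both interpretations in one distance function: numeric differences are scaled by the attribute's observed range (min-max normalization), and nominal attributes contribute 0 or 1. The choice of Euclidean distance, the helper names, and the three sample rows are assumptions for illustration.

```python
import math


def min_max_ranges(instances, numeric_attrs):
    """Observed (min, max) for each numeric attribute."""
    return {
        attr: (min(x[attr] for x in instances), max(x[attr] for x in instances))
        for attr in numeric_attrs
    }


def distance(x, y, numeric_ranges):
    """Euclidean distance over normalized numeric and 0/1 nominal differences."""
    total = 0.0
    for attr in x:
        if attr in numeric_ranges:
            lo, hi = numeric_ranges[attr]
            span = (hi - lo) or 1.0  # avoid division by zero for constant attributes
            d = (x[attr] - y[attr]) / span
        else:
            d = 0.0 if x[attr] == y[attr] else 1.0
        total += d * d
    return math.sqrt(total)


instances = [
    {"outlook": "sunny", "temperature": 85.0, "humidity": 85.0},
    {"outlook": "sunny", "temperature": 80.0, "humidity": 90.0},
    {"outlook": "overcast", "temperature": 83.0, "humidity": 86.0},
]
ranges = min_max_ranges(instances, ["temperature", "humidity"])
d01 = distance(instances[0], instances[1], ranges)
d02 = distance(instances[0], instances[2], ranges)
```

Without the scaling step, an attribute measured in large units would dominate the distance regardless of how informative it is.
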
Nominal vs. ordinal
Attribute "age" nominal
  If age = young and astigmatic = no
  and tear production rate = normal
  then recommendation = soft
  If age = pre-presbyopic and astigmatic = no
  and tear production rate = normal
  then recommendation = soft
Attribute "age" ordinal (e.g. "young" < "pre-presbyopic" < "presbyopic")
  If age ≤ pre-presbyopic and astigmatic = no
  and tear production rate = normal
  then recommendation = soft

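The saving from treating "age" as ordinal can be checked directly. In the sketch below the two nominal rules and the single ordinal rule are written as functions (names chosen for this example); the ordinal version relies on an explicit rank for each value.

```python
AGE_ORDER = ["young", "pre-presbyopic", "presbyopic"]
AGE_RANK = {value: rank for rank, value in enumerate(AGE_ORDER)}


def recommend_nominal(age, astigmatic, tear_rate):
    """Two rules, using equality tests only."""
    if age == "young" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    return None  # no rule fires


def recommend_ordinal(age, astigmatic, tear_rate):
    """One rule, using an order comparison on age."""
    if (AGE_RANK[age] <= AGE_RANK["pre-presbyopic"]
            and astigmatic == "no" and tear_rate == "normal"):
        return "soft"
    return None  # no rule fires
```

Both functions give the same answer for every combination of inputs, but the ordinal form needs one rule instead of two.
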
Missing values
Frequently indicated by out-of-range entries
  Types: unknown, unrecorded, irrelevant
  Reasons:
    malfunctioning equipment
    changes in experimental design
    collation of different datasets
    measurement not possible
Missing value may have significance in itself (e.g. missing test in a medical examination)
  Most schemes assume that is not the case: "missing" may need to be coded as additional value

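Coding "missing" as an additional value amounts to a simple recoding pass before learning. The sketch below uses hypothetical names (`recode_missing`, the `blood_test` attribute) and lets the caller say which sentinel entries count as missing.

```python
MISSING = "missing"


def recode_missing(instances, attribute, sentinels):
    """Return copies of the instances with sentinel entries replaced by 'missing'."""
    recoded = []
    for instance in instances:
        copy = dict(instance)  # leave the original data untouched
        if copy.get(attribute) in sentinels:
            copy[attribute] = MISSING
        recoded.append(copy)
    return recoded


patients = [
    {"blood_test": "positive"},
    {"blood_test": "?"},
    {"blood_test": "negative"},
    {"blood_test": None},
]
recoded = recode_missing(patients, "blood_test", sentinels=("?", None))
```

After recoding, the attribute has three nominal values instead of two, so a scheme can learn from the fact that a test was not carried out.
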
Inaccurate values
Reason: data has not been collected for mining it
Result: errors and omissions that don't affect original purpose of data (e.g. age of customer)
Typographical errors in nominal attributes: values need to be checked for consistency
Typographical and measurement errors in numeric attributes: outliers need to be identified
Errors may be deliberate (e.g. wrong zip codes)
Other problems: duplicates, stale data

Getting to know the data
Simple visualization tools are very useful
  Nominal attributes: histograms (Distribution consistent with background knowledge?)
  Numeric attributes: graphs (Any obvious outliers?)
2-D and 3-D plots show dependencies
Need to consult domain experts
Too much data to inspect? Take a sample!

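Even a text histogram is enough for a first look at a nominal attribute. The sketch below counts values with the standard library and draws a random sample for manual inspection; the value list, including the deliberately misspelled entry, is made up for illustration.

```python
import random
from collections import Counter


def text_histogram(values):
    """One line per distinct value, most frequent first."""
    counts = Counter(values)
    return [
        f"{value:<10}{'#' * count} ({count})"
        for value, count in counts.most_common()
    ]


# Made-up column of nominal values; "sunnny" is a deliberate typo.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast", "sunnny"]
lines = text_histogram(outlook)

# Too much data to inspect? Take a reproducible random sample.
rng = random.Random(0)
sample = rng.sample(outlook, 4)
```

The histogram shows four distinct values where background knowledge says there should be three, which points straight at the typographical error.
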