
Data Mining

Practical Machine Learning Tools and Techniques
Slides for Chapter 2 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Input: Concepts, instances, attributes

Terminology
What's a concept?
  Classification, association, clustering, numeric prediction
What's in an example?
  Relations, flat files, recursion
What's in an attribute?
  Nominal, ordinal, interval, ratio
Preparing the input
  ARFF, attributes, missing values, getting to know data

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)

Terminology

Components of the input:
  Concepts: kinds of things that can be learned
    Aim: intelligible and operational concept description
  Instances: the individual, independent examples of a concept
    Note: more complicated forms of input are possible
  Attributes: measuring aspects of an instance
    We will focus on nominal and numeric ones


What's a concept?

Styles of learning:
  Classification learning: predicting a discrete class
  Association learning: detecting associations between features
  Clustering: grouping similar instances into clusters
  Numeric prediction: predicting a numeric quantity

Concept: thing to be learned
Concept description: output of learning scheme

Classification learning

Example problems: weather data, contact lenses, irises, labor negotiations
Classification learning is supervised
  Scheme is provided with actual outcome
Outcome is called the class of the example
Measure success on fresh data for which class labels are known (test data)
In practice success is often measured subjectively


Association learning

Can be applied if no class is specified and any kind of structure is considered "interesting"
Difference to classification learning:
  Can predict any attribute's value, not just the class, and more than one attribute's value at a time
  Hence: far more association rules than classification rules
  Thus: constraints are necessary
    Minimum coverage and minimum accuracy
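
The two constraints can be made concrete by counting instances. The sketch below is illustrative only: the instances are hypothetical weather-style rows, `matches` and `rule_stats` are made-up helper names, and it assumes the usual definitions, with coverage as the number of instances a rule predicts correctly and accuracy as that count divided by the number of instances the rule applies to.

```python
# Hypothetical weather-style instances (illustrative values only).
instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]

def matches(instance, conditions):
    """True if the instance satisfies every attribute = value condition."""
    return all(instance[attr] == value for attr, value in conditions.items())

def rule_stats(instances, antecedent, consequent):
    """Return (coverage, accuracy) of the rule 'if antecedent then consequent'."""
    applies = [inst for inst in instances if matches(inst, antecedent)]
    correct = [inst for inst in applies if matches(inst, consequent)]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy

# Rule: if windy = false then play = yes
coverage, accuracy = rule_stats(instances, {"windy": "false"}, {"play": "yes"})
print(coverage, accuracy)
```

A rule would be kept only if both numbers reach the chosen minimums. Because the consequent may name any attribute, or several, the same function also scores rules that do not predict the class.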


Clustering

Finding groups of items that are similar
Clustering is unsupervised
  The class of an example is not known
Success often measured subjectively

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica


Numeric prediction

Variant of classification learning where "class" is numeric (also called "regression")
Learning is supervised
  Scheme is being provided with target value
Measure success on test data

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False
Sunny     Hot          High      True
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40


What's in an example?

Instance: specific type of example
  Thing to be classified, associated, or clustered
  Individual, independent example of target concept
  Characterized by a predetermined set of attributes
Input to learning scheme: set of instances/dataset
  Represented as a single relation/flat file
Rather restricted form of input
  No relationships between objects
Most common form in practical data mining


A family tree

Peter (M) = Peggy (F)
  children: Steven (M), Graham (M), Pam (F)
Grace (F) = Ray (M)
  children: Ian (M), Pippa (F), Brian (M)
Pam (F) = Ian (M)
  children: Anna (F), Nikki (F)


Family tree represented as a table

Name    Gender  Parent1  Parent2
Peter   Male    ?        ?
Peggy   Female  ?        ?
Steven  Male    Peter    Peggy
Graham  Male    Peter    Peggy
Pam     Female  Peter    Peggy
Ian     Male    Grace    Ray
Pippa   Female  Grace    Ray
Brian   Male    Grace    Ray
Anna    Female  Pam      Ian
Nikki   Female  Pam      Ian


The "sister-of" relation

First person  Second person  Sister of?
Peter         Peggy          No
Peter         Steven         No
...
Steven        Peter          No
Steven        Graham         No
Steven        Pam            Yes
...
Ian           Pippa          Yes
...
Anna          Nikki          Yes
...
Nikki         Anna           Yes

The same relation, listing only the positive examples:

First person  Second person  Sister of?
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian         Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes
All the rest                 No

Closed-world assumption


A full representation in one table

First person                      Second person                     Sister of?
Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
Steven  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
Graham  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
Ian     Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Brian   Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Anna    Female  Pam      Ian      Nikki   Female  Pam      Ian      Yes
Nikki   Female  Pam      Ian      Anna    Female  Pam      Ian      Yes
All the rest                                                        No

If second person's gender = female
   and first person's parent = second person's parent
   then sister-of = yes
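
As a rough check, the rule can be run against the family table. The sketch below is not part of the original slides: the dictionary layout and the name `sister_of` are illustrative, and it adds two assumptions the rule leaves implicit, namely that nobody is their own sister and that unknown parents ("?") never count as a match.

```python
# Family table from the slides: name -> (gender, parent1, parent2).
# None stands for the unknown parents shown as "?".
people = {
    "Peter": ("male", None, None),
    "Peggy": ("female", None, None),
    "Steven": ("male", "Peter", "Peggy"),
    "Graham": ("male", "Peter", "Peggy"),
    "Pam": ("female", "Peter", "Peggy"),
    "Ian": ("male", "Grace", "Ray"),
    "Pippa": ("female", "Grace", "Ray"),
    "Brian": ("male", "Grace", "Ray"),
    "Anna": ("female", "Pam", "Ian"),
    "Nikki": ("female", "Pam", "Ian"),
}

def sister_of(first, second):
    """True if `second` is a sister of `first` according to the rule."""
    _, *first_parents = people[first]
    gender, *second_parents = people[second]
    # Assumptions: a person is not her own sister; unknown parents never match.
    if first == second or None in first_parents:
        return False
    return gender == "female" and first_parents == second_parents

positives = sorted((a, b) for a in people for b in people if sister_of(a, b))
print(positives)
```

Without the two extra conditions the rule as written would also fire for pairs such as (Pam, Pam), and for (Peter, Peggy) if two unknown parents were treated as equal.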


Generating a flat file

Process of flattening called "denormalization"
  Several relations are joined together to make one
Possible with any finite set of finite relations
Problematic: relationships without pre-specified number of objects
  Example: concept of nuclear family
Denormalization may produce spurious regularities that reflect structure of database
  Example: "supplier" predicts "supplier address"
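
A minimal sketch of the join, using a few rows from the family example. The names `people`, `sister_pairs` and `denormalize` are illustrative, and only a handful of pairs are shown.

```python
# Person relation from the family-tree table: name -> (gender, parent1, parent2).
people = {
    "Steven": ("Male", "Peter", "Peggy"),
    "Pam": ("Female", "Peter", "Peggy"),
    "Anna": ("Female", "Pam", "Ian"),
    "Nikki": ("Female", "Pam", "Ian"),
}

# Pair relation: (first person, second person, sister of?).
sister_pairs = [
    ("Steven", "Pam", "Yes"),
    ("Anna", "Nikki", "Yes"),
    ("Pam", "Steven", "No"),
]

def denormalize(pairs, people):
    """Join each pair with both persons' attributes: one flat row per pair."""
    rows = []
    for first, second, label in pairs:
        rows.append((first, *people[first], second, *people[second], label))
    return rows

flat_rows = denormalize(sister_pairs, people)
for row in flat_rows:
    print(", ".join(row))
```

Each person's attributes are copied into every row that mentions them. That duplication is where spurious regularities can come from: a learner may "discover" that a name predicts a gender, which only reflects how the table was built.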


The "ancestor-of" relation

First person                      Second person                     Ancestor of?
Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
Peter   Male    ?        ?        Steven  Male    Peter    Peggy    Yes
Peter   Male    ?        ?        Pam     Female  Peter    Peggy    Yes
Peter   Male    ?        ?        Anna    Female  Pam      Ian      Yes
Peter   Male    ?        ?        Nikki   Female  Pam      Ian      Yes
Pam     Female  Peter    Peggy    Nikki   Female  Pam      Ian      Yes
Grace   Female  ?        ?        Ian     Male    Grace    Ray      Yes
Grace   Female  ?        ?        Nikki   Female  Pam      Ian      Yes
Other positive examples here                                        Yes
All the rest                                                        No

Recursion

Infinite relations require recursion

If person1 is a parent of person2
   then person1 is an ancestor of person2
If person1 is a parent of person2
   and person2 is an ancestor of person3
   then person1 is an ancestor of person3

Appropriate techniques are known as "inductive logic programming"
  (e.g. Quinlan's FOIL)
  Problems: (a) noise and (b) computational complexity
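
The two rules translate directly into a recursive function. This sketch only evaluates the rules on the family example. It does not learn them, which is what an inductive logic programming system would do. The names `parents`, `is_parent` and `is_ancestor` are illustrative.

```python
# Parent facts from the family-tree table: child -> (parent1, parent2).
parents = {
    "Steven": ("Peter", "Peggy"),
    "Graham": ("Peter", "Peggy"),
    "Pam": ("Peter", "Peggy"),
    "Ian": ("Grace", "Ray"),
    "Pippa": ("Grace", "Ray"),
    "Brian": ("Grace", "Ray"),
    "Anna": ("Pam", "Ian"),
    "Nikki": ("Pam", "Ian"),
}

def is_parent(person1, person2):
    """True if person1 is listed as a parent of person2."""
    return person1 in parents.get(person2, ())

def is_ancestor(person1, person2):
    """First rule: a parent is an ancestor.
    Second rule: a parent of an ancestor is an ancestor."""
    if is_parent(person1, person2):
        return True
    children = [child for child, ps in parents.items() if person1 in ps]
    return any(is_ancestor(child, person2) for child in children)

print(is_ancestor("Peter", "Nikki"), is_ancestor("Peter", "Ian"))
```

The recursion terminates here because a family tree has no cycles. The depth is not fixed in advance, which is why no finite flat table of attribute tests can express the relation in general.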


Multi-instance concepts

Each individual example comprises a set of instances
  All instances are described by the same attributes
  One or more instances within an example may be responsible for its classification
Goal of learning is still to produce a concept description
Important real-world applications
  e.g. drug activity prediction

What's in an attribute?

Each instance is described by a fixed predefined set of features, its "attributes"
But: number of attributes may vary in practice
  Possible solution: "irrelevant value" flag
Related problem: existence of an attribute may depend on the value of another one
Possible attribute types ("levels of measurement"):
  Nominal, ordinal, interval and ratio


Nominal quantities

Values are distinct symbols
  Values themselves serve only as labels or names
  Nominal comes from the Latin word for name
Example: attribute "outlook" from weather data
  Values: "sunny", "overcast", and "rainy"
No relation is implied among nominal values (no ordering or distance measure)
Only equality tests can be performed


Ordinal quantities

Impose order on values
But: no distance between values defined
Example: attribute "temperature" in weather data
  Values: "hot" > "mild" > "cool"
Note: addition and subtraction don't make sense
Example rule: temperature < hot ⇒ play = yes
Distinction between nominal and ordinal not always clear (e.g. attribute "outlook")
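
One way to support order comparisons without implying distances is to map each value to its rank and compare ranks only. A small sketch, with illustrative names (`TEMPERATURE_ORDER`, `less_than`, `predict_play`):

```python
# Ordering of the ordinal attribute "temperature": cool < mild < hot.
TEMPERATURE_ORDER = ["cool", "mild", "hot"]
RANK = {value: i for i, value in enumerate(TEMPERATURE_ORDER)}

def less_than(a, b):
    """Order comparison on ordinal values; only the order of ranks is meaningful."""
    return RANK[a] < RANK[b]

def predict_play(temperature):
    """Example rule: temperature < hot => play = yes.
    Returns None when the rule does not fire."""
    return "yes" if less_than(temperature, "hot") else None

print([predict_play(t) for t in TEMPERATURE_ORDER])
```

The ranks 0, 1, 2 are an implementation device. Subtracting them would suggest that "mild" is exactly halfway between "cool" and "hot", which the ordinal scale does not claim.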


Interval quantities

Interval quantities are not only ordered but measured in fixed and equal units
Example 1: attribute "temperature" expressed in degrees Fahrenheit
Example 2: attribute "year"
Difference of two values makes sense
Sum or product doesn't make sense
  Zero point is not defined!
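
A short numeric illustration of why differences are meaningful on an interval scale while ratios are not. The two temperatures are arbitrary example values. Converting to Celsius only changes the unit and the zero point.

```python
def fahrenheit_to_celsius(f):
    """Standard conversion: shift the zero point, then rescale the unit."""
    return (f - 32) * 5 / 9

a_f, b_f = 80.0, 40.0
a_c, b_c = fahrenheit_to_celsius(a_f), fahrenheit_to_celsius(b_f)

# Differences survive the change of scale: they differ only by the factor 5/9.
print(a_f - b_f, a_c - b_c)

# Ratios do not: "twice as hot" in Fahrenheit is not "twice as hot" in Celsius.
print(a_f / b_f, a_c / b_c)
```

The ratio changes because each scale places zero at a different, arbitrary point. The difference changes only by the constant unit factor.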


Ratio quantities

Ratio quantities are ones for which the measurement scheme defines a zero point
Example: attribute "distance"
  Distance between an object and itself is zero
Ratio quantities are treated as real numbers
  All mathematical operations are allowed
But: is there an "inherently" defined zero point?
  Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)


Attribute types used in practice

Most schemes accommodate just two levels of measurement: nominal and ordinal
Nominal attributes are also called "categorical", "enumerated", or "discrete"
  But: "enumerated" and "discrete" imply order
Special case: dichotomy ("boolean" attribute)
Ordinal attributes are called "numeric", or "continuous"
  But: "continuous" implies mathematical continuity


Metadata

Information about the data that encodes background knowledge
Can be used to restrict search space
Examples:
  Dimensional considerations (i.e. expressions must be dimensionally correct)
  Circular orderings (e.g. degrees in compass)
  Partial orderings (e.g. generalization/specialization relations)


Preparing the input

Denormalization is not the only issue
Problem: different data sources (e.g. sales department, customer billing department, ...)
  Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
  Data must be assembled, integrated, cleaned up
  "Data warehouse": consistent point of access
External data may be required ("overlay data")
Critical: type and level of data aggregation


The ARFF format

%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
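
To show how little structure the format needs, here is a minimal reader for files like the one above. It is a sketch under simplifying assumptions, not a full ARFF parser: `parse_arff` is a made-up name, and quoted values, sparse rows, and string, date or relational attributes are not handled.

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows).

    Simplifying assumptions: no quoted values, no sparse rows, and no
    string, date or relational attributes.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

WEATHER = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

relation, attributes, rows = parse_arff(WEATHER)
print(relation, len(attributes), rows[0])
```

Values are returned as strings. Converting numeric columns is left to the caller, since the attribute declarations say which columns are numeric.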


Additional attribute types

ARFF supports string attributes:
  @attribute description string
  Similar to nominal attributes but list of values is not pre-specified

It also supports date attributes:
  @attribute today date
  Uses the ISO-8601 combined date and time format yyyy-MM-ddTHH:mm:ss
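
For readers handling such values outside an ARFF-aware tool, the combined date and time form maps onto a `strptime` pattern in Python's standard library. A small sketch: `parse_arff_date` is an illustrative name and the timestamp is a made-up example value.

```python
from datetime import datetime

# strptime pattern corresponding to yyyy-MM-ddTHH:mm:ss
ISO_PATTERN = "%Y-%m-%dT%H:%M:%S"

def parse_arff_date(value):
    """Parse a date value written in the combined ISO-8601 date and time form."""
    return datetime.strptime(value, ISO_PATTERN)

stamp = parse_arff_date("2004-04-03T12:00:00")  # hypothetical example value
print(stamp.year, stamp.month, stamp.day, stamp.hour)
```

Once parsed, dates behave like an interval quantity: differences between two timestamps are meaningful, sums are not.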

Relational attributes

Allow multi-instance problems to be represented in ARFF format
  The value of a relational attribute is a separate set of instances

@attribute bag relational
  @attribute outlook { sunny, overcast, rainy }
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy { true, false }
@end bag

Nested attribute block gives the structure of the referenced instances

Multi-instance ARFF

%
% Multiple instance ARFF file for the weather data
%
@relation weather

@attribute bag_ID { 1, 2, 3, 4, 5, 6, 7 }
@attribute bag relational
  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {true, false}
@end bag
@attribute play? {yes, no}

@data
1, "sunny, 85, 85, false\nsunny, 80, 90, true", no
2, "overcast, 83, 86, false\nrainy, 70, 96, false", yes
...

Sparse data

In some applications most attribute values in a dataset are zero
  E.g.: word counts in a text categorization problem
ARFF supports sparse data

Dense form:
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"

Sparse form:
{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}

This also works for nominal attributes (where the first value corresponds to "zero")
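
The conversion from the dense to the sparse form is mechanical: keep only the non-zero entries and prefix each with its attribute index, counting from 0. A sketch with an illustrative function name (`dense_to_sparse`). The class labels are quoted here on the assumption that values containing a space must be quoted.

```python
def dense_to_sparse(values, zero="0"):
    """Return the sparse ARFF form of a dense row: {index value, ...},
    listing only the entries that differ from `zero` (indices start at 0)."""
    entries = [f"{i} {v}" for i, v in enumerate(values) if v != zero]
    return "{" + ", ".join(entries) + "}"

row_a = ["0", "26", "0", "0", "0", "0", "63", "0", "0", "0", '"class A"']
row_b = ["0", "0", "0", "42", "0", "0", "0", "0", "0", "0", '"class B"']

print(dense_to_sparse(row_a))  # {1 26, 6 63, 10 "class A"}
print(dense_to_sparse(row_b))  # {3 42, 10 "class B"}
```

Omitted entries are read back as zero, not as missing. A genuinely missing value therefore still has to be written out explicitly.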

Attribute types

Interpretation of attribute types in ARFF depends on learning scheme
  Numeric attributes are interpreted as
    ordinal scales if less-than and greater-than are used
    ratio scales if distance calculations are performed (normalization/standardization may be required)
  Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
Integers in some given data file: nominal, ordinal, or ratio scale?
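
A sketch of how an instance-based scheme might combine the two conventions: numeric differences are rescaled by the attribute's range so that no attribute dominates, and nominal values contribute 0 or 1. The function names and the minimum/maximum ranges below are assumptions made for illustration.

```python
import math

def attribute_distance(a, b, lo=None, hi=None):
    """Distance between two values of one attribute.

    Numeric values (with a known range lo..hi) are rescaled to [0, 1]
    before differencing; nominal values give 0 if equal, 1 otherwise.
    """
    if lo is not None and hi is not None:
        return abs(a - b) / (hi - lo) if hi > lo else 0.0
    return 0.0 if a == b else 1.0

def instance_distance(x, y, ranges):
    """Euclidean distance over mixed nominal/numeric attributes.
    ranges[i] is (min, max) for numeric attribute i, or None for nominal."""
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        d = attribute_distance(a, b, *(r or (None, None)))
        total += d * d
    return math.sqrt(total)

# Attributes: outlook (nominal), temperature, humidity (numeric), windy (nominal).
ranges = [None, (64, 85), (65, 96), None]  # assumed min/max, for illustration
x = ("sunny", 85, 85, "false")
y = ("overcast", 83, 86, "false")
print(round(instance_distance(x, y, ranges), 3))
```

Without the rescaling step, an attribute measured in large units would swamp the 0/1 contributions of the nominal attributes.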


Nominal vs. ordinal

Attribute "age" nominal

If age = young and astigmatic = no
   and tear production rate = normal
   then recommendation = soft
If age = pre-presbyopic and astigmatic = no
   and tear production rate = normal
   then recommendation = soft

Attribute "age" ordinal
(e.g. "young" < "pre-presbyopic" < "presbyopic")

If age ≤ pre-presbyopic and astigmatic = no
   and tear production rate = normal
   then recommendation = soft


Missing values

Frequently indicated by out-of-range entries
  Types: unknown, unrecorded, irrelevant
  Reasons:
    malfunctioning equipment
    changes in experimental design
    collation of different datasets
    measurement not possible
Missing value may have significance in itself (e.g. missing test in a medical examination)
  Most schemes assume that is not the case: "missing" may need to be coded as additional value
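
Coding "missing" as an additional value amounts to a simple substitution before learning. The sketch below assumes rows held as lists of strings, with `?` as the missing-value marker (the marker ARFF uses). `code_missing` and the sample rows are hypothetical.

```python
MISSING_MARKER = "?"  # ARFF's marker for a missing value

def code_missing(rows, column, label="missing"):
    """Return copies of the rows with missing entries in one nominal column
    replaced by an explicit extra value, so a scheme can learn from them."""
    return [
        [label if i == column and v == MISSING_MARKER else v
         for i, v in enumerate(row)]
        for row in rows
    ]

# Hypothetical rows: outlook, windy, play
rows = [["sunny", "false", "no"], ["?", "true", "no"], ["rainy", "?", "yes"]]
coded = code_missing(rows, column=0)
print(coded)
```

The attribute's declared value list would need the extra value as well. Whether the recoding helps depends on whether absence really carries information in the domain.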


Inaccurate values

Reason: data has not been collected for mining it
Result: errors and omissions that don't affect original purpose of data (e.g. age of customer)
Typographical errors in nominal attributes ⇒ values need to be checked for consistency
Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified
Errors may be deliberate (e.g. wrong zip codes)
Other problems: duplicates, stale data


Getting to know the data

Simple visualization tools are very useful
  Nominal attributes: histograms (Distribution consistent with background knowledge?)
  Numeric attributes: graphs (Any obvious outliers?)
2-D and 3-D plots show dependencies
Need to consult domain experts
Too much data to inspect? Take a sample!
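
Even without plotting, a few lines of code give a first look at the data. The sketch below uses only the standard library. The helper names and the two small columns, which contain a deliberate typo and a deliberate outlier, are hypothetical.

```python
import random
from collections import Counter

def nominal_histogram(values):
    """Count how often each nominal value occurs; rare values may be typos."""
    return Counter(values)

def numeric_summary(values):
    """Minimum, middle element and maximum: a quick check for obvious outliers."""
    ordered = sorted(values)
    return ordered[0], ordered[len(ordered) // 2], ordered[-1]

def take_sample(rows, k, seed=0):
    """Random sample of k rows for manual inspection (seeded for repeatability)."""
    return random.Random(seed).sample(rows, min(k, len(rows)))

outlook = ["sunny", "sunny", "overcast", "rainy", "rainny"]  # hypothetical, with a typo
humidity = [85, 90, 86, 96, 860]                             # hypothetical, with an outlier

print(nominal_histogram(outlook))
print(numeric_summary(humidity))
print(take_sample(list(zip(outlook, humidity)), 2))
```

The counts expose the one-off value "rainny", and the maximum of 860 stands far from the middle value. Whether either is really an error is a question for a domain expert.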
