K-Nearest Neighbor Classification
All the slides were adapted from:
1- Intro. to Data Mining by Tan et al.
2- Dr. Ibrahim Albluwi
3- Dr. Noureddin Sadawi
Is it a Duck?
• If it quacks like a duck and walks like a duck, and looks like a
duck, then most probably, it is a duck!
[Figure: the test record is compared with all the records.]
• Lazy Learners:
– Do not build any model: Zero training time.
– Delay “thinking” to classification time.
– Most time is spent on classification.
• Eager Learners:
– Spend most of the time on building the model prior to
classification.
– Classification is quick since the model is ready.
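The lazy-learner idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `LazyNearestNeighbor`, the parameter `k`, and the toy "duck"/"goose" records are all assumptions made up for the example. Note that `fit` only stores the records, while `predict` does all the comparison work.

```python
import math
from collections import Counter


class LazyNearestNeighbor:
    """Minimal k-NN sketch (hypothetical class): 'training' only stores the records."""

    def __init__(self, k=1):
        self.k = k
        self.records = []
        self.labels = []

    def fit(self, records, labels):
        # No model is built; the data is simply memorized (zero training time).
        self.records = list(records)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # All the "thinking" happens here: compare the query with every stored record.
        ranked = sorted(
            zip(self.records, self.labels),
            key=lambda pair: math.dist(pair[0], query),
        )
        nearest_labels = [label for _, label in ranked[: self.k]]
        # Majority vote among the k nearest records.
        return Counter(nearest_labels).most_common(1)[0][0]


# Made-up toy data, for illustration only.
clf = LazyNearestNeighbor(k=3).fit(
    [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)],
    ["duck", "duck", "duck", "goose", "goose"],
)
prediction = clf.predict((1.1, 1.0))
print(prediction)  # duck
```

An eager learner would instead do its expensive work inside `fit` and keep `predict` cheap.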
Proximity Measures
Definitions:
• Similarity: A numerical measure of how much two data objects are
alike.
• Numeric Attributes:
– Manhattan Distance, Euclidean Distance, etc.
• Euclidean Distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$
• Manhattan Distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$
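As a rough sketch, the two distance measures can be written as plain Python functions over records with p numeric attributes. The function names and the sample points are assumptions chosen for illustration.

```python
import math


def euclidean_distance(x_i, x_j):
    # Square root of the sum of squared attribute differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


def manhattan_distance(x_i, x_j):
    # Sum of absolute attribute differences.
    return sum(abs(a - b) for a, b in zip(x_i, x_j))


# Illustrative points (not from the slides).
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```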
Euclidean Distance
Example:
            Age   Income   Height   Weight
Record 1    45    2000     1.6      80
Record 2    32    1200     1.75     75
Normalization:
age1 = 45/140 = 0.32 age2 = 32/140 = 0.23
Income1 = 2000/5000 = 0.4 Income2 = 1200/5000 = 0.24
Height1 = 1.6/2.1 = 0.76 Height2 = 1.75/2.1 = 0.83
Weight1 = 80/150 = 0.53 Weight2 = 75/150 = 0.5
Euclidean Distance =
$$\sqrt{(0.32 - 0.23)^2 + (0.40 - 0.24)^2 + (0.76 - 0.83)^2 + (0.53 - 0.50)^2} \approx 0.20$$
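The worked example can be checked with a short script. The record values and the divisors (140, 5000, 2.1, 150) are taken from the example; the dictionary layout and the `normalize` helper are assumptions made for this sketch. The unnormalized distance is included only to show how the attribute with the largest range (income) would otherwise dominate.

```python
import math

# Values from the example; the divisors are the ones used in the normalization step.
record_1 = {"age": 45, "income": 2000, "height": 1.6, "weight": 80}
record_2 = {"age": 32, "income": 1200, "height": 1.75, "weight": 75}
scale = {"age": 140, "income": 5000, "height": 2.1, "weight": 150}


def normalize(record):
    # Divide each attribute by its scaling constant.
    return {name: value / scale[name] for name, value in record.items()}


n1, n2 = normalize(record_1), normalize(record_2)

# Euclidean distance on the normalized attributes.
distance = math.sqrt(sum((n1[name] - n2[name]) ** 2 for name in scale))

# Euclidean distance on the raw attributes, for comparison.
raw_distance = math.sqrt(sum((record_1[name] - record_2[name]) ** 2 for name in scale))

print(round(distance, 2))      # 0.2
print(round(raw_distance, 2))  # 800.12
```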
[Figure: Voronoi diagram of the records.]
• In a Voronoi diagram, all points in a cell are closer to the record in that cell than to any record in the other cells.
• To classify a record: see which cell it falls in and assign it the class of the record in that cell.
• NN-classifiers can learn complex patterns that are difficult for decision trees.
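Finding the Voronoi cell that a query falls in is the same as finding its single nearest record, so the diagram never has to be built explicitly. A minimal sketch, assuming Euclidean distance and a handful of made-up labelled records (the coordinates, class labels, and the name `classify_1nn` are illustrative only):

```python
import math

# Hypothetical labelled records: (coordinates, class). Values are made up.
training = [((1.0, 1.0), "+"), ((4.0, 1.0), "x"), ((2.5, 4.0), "+")]


def classify_1nn(query):
    # The nearest record is the one whose Voronoi cell contains the query.
    _, nearest_label = min(training, key=lambda rec: math.dist(rec[0], query))
    return nearest_label


print(classify_1nn((1.5, 1.2)))  # +
print(classify_1nn((3.8, 0.5)))  # x
```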
Notes
• When to use NN-Classification?
– If there are fewer than 20 attributes.
[Curse of Dimensionality: In higher dimensions, intuition fails,
distance measures become less meaningful, and computation
becomes expensive.]
– If the application can afford a long classification time.
– If there is a lot of training data.
• Advantages of NN-Classification:
– Quick training time.
– Can learn complex patterns.
– Can be used for regression (numeric class attributes).
• Disadvantages of NN-Classification:
– Slow at query time.
– Easily fooled by irrelevant attributes [Feature subset selection is
very important].
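The note that NN methods can be used for regression can be illustrated with a small sketch. The slides do not say how the numeric prediction is formed; averaging the target values of the k nearest records is one common choice and is an assumption here, as are the function name `knn_regress` and the toy data.

```python
import math


def knn_regress(records, targets, query, k=3):
    # Rank the stored records by Euclidean distance to the query.
    ranked = sorted(zip(records, targets), key=lambda pair: math.dist(pair[0], query))
    nearest_targets = [target for _, target in ranked[:k]]
    # Assumed aggregation: the plain mean of the k nearest numeric targets.
    return sum(nearest_targets) / len(nearest_targets)


# Made-up one-attribute data, for illustration only.
records = [(1.0,), (2.0,), (3.0,), (10.0,)]
targets = [10.0, 20.0, 30.0, 100.0]
estimate = knn_regress(records, targets, (2.1,), k=3)
print(estimate)  # 20.0
```

Every query scans all stored records, which is the slow query time listed among the disadvantages.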