Greg DeCroix
Spring 2019
Class 18
K Nearest Neighbors Classification
Three-step process
• Use the logistic regression algorithm to fit the data to the equation
ln( p / (1 − p) ) = β0 + β1X1 + β2X2 + ⋯ + βnXn
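The fitted equation gives the log-odds, which can be inverted to recover the probability p. A minimal sketch, using made-up coefficient values (the betas here are purely illustrative, not fitted to any data in these notes):

```python
import math

# Hypothetical fitted coefficients (beta_0, beta_1, beta_2) -- illustrative only
beta = [-1.5, 0.8, 0.3]

def predict_prob(x1, x2):
    """Invert the logit: from log-odds ln(p/(1-p)) back to probability p."""
    log_odds = beta[0] + beta[1] * x1 + beta[2] * x2
    return 1 / (1 + math.exp(-log_odds))

p = predict_prob(2.0, 1.0)  # always a value strictly between 0 and 1
```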
• Distances (raw data):
Distance(H1,H3) = 2000
Distance(H2,H3) = 1900.004
σj = stdev_i(x_ij)          Standard deviation for variable j, across all items
z_ij = (x_ij − μj) / σj     Normalized value for the jth variable of item i
• Distances (after normalization):
Distance(H1,H3) = 0.725
Distance(H2,H3) = 1.573
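The normalization and distance steps can be sketched as follows. The house data behind the distances above is not shown in these notes, so the values here are made up; they only reproduce the qualitative effect that a large-scale variable (square footage) dominates the raw distance until each variable is put on a z-score scale:

```python
import math
from statistics import mean, stdev

# Hypothetical house data: [square_feet, bedrooms] -- values are invented,
# chosen so that square footage dominates the raw (unnormalized) distance.
houses = {
    "H1": [3000.0, 4.0],
    "H2": [1100.0, 3.0],
    "H3": [1000.0, 2.0],
}

def normalize(data):
    """z_ij = (x_ij - mu_j) / sigma_j for each variable j, across all items i."""
    cols = list(zip(*data.values()))
    mus = [mean(c) for c in cols]
    sigmas = [stdev(c) for c in cols]
    return {k: [(x - m) / s for x, m, s in zip(v, mus, sigmas)]
            for k, v in data.items()}

def distance(a, b):
    """Euclidean distance between two items."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

z = normalize(houses)
d_raw = distance(houses["H1"], houses["H3"])   # dominated by square feet
d_norm = distance(z["H1"], z["H3"])            # variables on equal footing
```

After normalization, the 1-bedroom gap between H1 and H3 carries real weight in the distance instead of being drowned out by the square-footage scale.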
General principles
• Small k
- Catches local features of data that may be relevant
- Can tend to overfit – fit to “noise”
• Large k
- Smooths out responses and reduces risk of overfitting
- May miss local features of data
o If k = total number of observations in training set, all new observations will be assigned to the category with the largest number of members
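The small-k versus large-k trade-off can be seen in a minimal from-scratch example (the 1-D training data is hypothetical). With k = 1 the prediction follows the nearest local points; with k equal to the whole training set, every new observation gets the majority class, exactly as the sub-bullet above describes:

```python
from collections import Counter

# Tiny hypothetical 1-D training set: (feature value, class label).
# Majority class overall is "A" (3 of 5), but "B" points cluster near 5.
train = [(1.0, "A"), (1.2, "A"), (1.1, "A"), (5.0, "B"), (5.2, "B")]

def knn_predict(x, k):
    """Majority vote among the k nearest training points (1-D, illustrative)."""
    nearest = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

small_k = knn_predict(5.1, k=1)           # local: nearest neighbor is a "B"
full_k = knn_predict(5.1, k=len(train))   # k = N: always the majority class "A"
```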
Possible approaches
• Minimize total error %
- But some types of errors might be more important than others
- Weighted error measure?
• More “holistic” evaluation of confusion matrix
• Suppose k = 5
• Weight closer items higher?
- Sometimes use 1/distance as the weight
- Also reduces chance of ties
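The 1/distance weighting can be sketched as below. The five neighbor distances are hypothetical (chosen so that an unweighted majority vote and a weighted vote disagree, which shows why the weighting matters):

```python
from collections import defaultdict

# Hypothetical k = 5 nearest neighbors of a new point: (distance, label).
# "B" wins an unweighted majority vote 3-2, but the "A" items are closer.
neighbors = [(0.5, "A"), (0.6, "A"), (1.0, "B"), (1.1, "B"), (1.2, "B")]

def weighted_vote(neighbors):
    """Weight each neighbor's vote by 1/distance, so closer items count more."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / dist
    return max(scores, key=scores.get)

winner = weighted_vote(neighbors)  # "A" outscores "B" despite the 3-2 count
```

Because the weights are continuous, exact ties in the vote also become far less likely, matching the last sub-bullet above.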
DeCroix -- OTM 714 Wisconsin School of Business
Extension to More Categories