
OTM 714 - Supply Chain Analytics

Greg DeCroix
Spring 2019

Class 18
K Nearest Neighbors Classification

“Not your best work, Devin – not your best.”


- Jason Sudeikis
Saturday Night Live, Pandora Internet Radio skit, featuring Bruno Mars

© 2019 Gregory A. DeCroix - All Rights Reserved


Classification Application
Pandora Internet Radio
• Measures songs on hundreds of variables
- Acid rock qualities
- Accordion playing
- Acousti-Lectric sonority, etc.
• Pays musicians to rate songs
• Listener chooses a song → Pandora measures distance to each
song in collection
- Selects a song that is nearby and plays it next
• If listener “likes” new song, it is placed in the “like” group;
If “dislike”, it is placed in “dislike”
• Pandora recommends songs that are close to “likes” and far from
“dislikes”
Sources: www.Pandora.com; Wikipedia’s article on the Music Genome Project.

DeCroix -- OTM 714 Wisconsin School of Business


Recall Logistic Regression

Three-step process
• Use logistic regression algorithm to fit the data to the equation
ln(p / (1 − p)) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ

• “Invert” the regression equation to obtain the probability p given a particular set of independent variables
p = e^(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ) / (1 + e^(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ))

• Use some cutoff value to assign the observation to a category


- E.g., suppose p = Prob(good credit risk)
- If p > 0.5, assign customer to “good credit risk” category
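The three steps above can be sketched in Python. The coefficients, variable values, and category names below are made up for illustration, not from a real fit:

```python
import math

# Step 1 (assumed done): hypothetical fitted coefficients (beta0, beta1, beta2)
BETA = [-1.0, 0.8, 0.05]

def logistic_probability(beta, x):
    """Step 2: invert the fitted equation to get p = e^score / (1 + e^score)."""
    score = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(score) / (1 + math.exp(score))

def classify(p, cutoff=0.5):
    """Step 3: assign the observation to a category using a cutoff on p."""
    return "good credit risk" if p > cutoff else "bad credit risk"

p = logistic_probability(BETA, [2.0, 10.0])   # score = -1.0 + 1.6 + 0.5 = 1.1
label = classify(p)                           # p ~ 0.75 > 0.5
```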



K Nearest Neighbors
• Also a classification method
- Training data set contains collection of variables about each
observation
- Also contains known classification for those observations
- These data points are used to “learn” method for assigning new
observations to the classification categories
• Conceptually somewhat simpler
- Does not rely on specific functional form (e.g., logit)
- Doesn’t statistically fit parameters used to create a score
• Can be applied with any number of classification categories



K Nearest Neighbors
• Start with training data with known classifications

• Choose user-specified parameter k


K Nearest Neighbors
• For each new observation, compute distance to each
observation in the training data



Distance Measure

Consider two items


Item i variable values: (xᵢ₁, xᵢ₂, …, xᵢₚ)
Item j variable values: (xⱼ₁, xⱼ₂, …, xⱼₚ)

Most commonly used distance between items


• Euclidean: dᵢⱼ = √[(xᵢ₁ − xⱼ₁)² + (xᵢ₂ − xⱼ₂)² + ⋯ + (xᵢₚ − xⱼₚ)²]
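This formula transcribes directly to Python for any number of variables:

```python
import math

def euclidean_distance(x_i, x_j):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

d = euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0))  # sqrt(9 + 16 + 0) = 5.0
```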



Key Requirement – Numerical X Variables

Distance measure requires X variables to be numbers


• Consider Charles Book Club example
Gender Money Recency Frequency FirstPurch ChildBks
F 172 6 1 6 0
F 49 36 1 36 0
F 198 2 1 2 0
M 191 12 4 34 0
M 393 24 8 68 1

• What is the distance between M and F?


• Replace with this:
Female Money Recency Frequency FirstPurch ChildBks
1 172 6 1 6 0
1 49 36 1 36 0
1 198 2 1 2 0
0 191 12 4 34 0
0 393 24 8 68 1
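The recoding above amounts to replacing the categorical Gender column with a 0/1 dummy variable, which can be sketched in plain Python using the example rows:

```python
# Rows from the Charles Book Club example: Gender is categorical,
# so replace it with a numeric Female indicator (F -> 1, M -> 0)
rows = [
    ("F", 172, 6, 1, 6, 0),
    ("F", 49, 36, 1, 36, 0),
    ("F", 198, 2, 1, 2, 0),
    ("M", 191, 12, 4, 34, 0),
    ("M", 393, 24, 8, 68, 1),
]
numeric_rows = [(1 if gender == "F" else 0, *rest) for gender, *rest in rows]
```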



K Nearest Neighbors
• Identify the k nearest neighbors, and let them vote!

[Figure: the same new observation assigned with k = 1, k = 3, and k = 5; the winning category can change with k]


Normalizing Data
It is usually important to normalize data first
• Consider a data set with just two variables
• Consider simple training set with just two data points, k = 1
              Children   Income    Own Home?
Household 1   2          $80,000   Yes
Household 2   6          $83,900   No        ← nearest
Household 3   2          $82,000   No        ← new data point to classify

• Distances:
Distance(H1,H3) = 2000
Distance(H2,H3) = 1900.004



How to Normalize

Standard normalization technique


• Subtract mean, divide by standard deviation (yields a z-score)
• For each variable 𝑥𝑥𝑖𝑖𝑖𝑖 representing the jth variable for item i:
    μⱼ = (1/n) Σᵢ₌₁ⁿ xᵢⱼ          Mean value for variable j, across all items

    σⱼ = stdevᵢ(xᵢⱼ)              Standard deviation for variable j, across all items

    zᵢⱼ = (xᵢⱼ − μⱼ) / σⱼ         Normalized value for the jth variable of item i
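The z-score computation for one variable can be sketched with Python's statistics module (note that `stdev` is the sample standard deviation, which is what the mini-example below uses):

```python
import statistics

def z_scores(column):
    """Normalize one variable: subtract its mean, divide by its standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)  # sample standard deviation
    return [(x - mu) / sigma for x in column]

children = z_scores([2, 6])  # mean 4, stdev 2.83 -> approximately [-0.707, 0.707]
```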



Normalizing Our Mini-Example

Means and standard deviations (over training set) are:


μ_children = 4, σ_children = 2.83, μ_income = $81,950, σ_income = $2,757.72

• The normalized data become:


              Children   Income   Own Home?
Household 1   -0.707     -0.707   Yes      ← nearest
Household 2   0.707      0.707    No
Household 3   -0.707     0.018    Yes      ← new data point, now assigned “Yes”

• Distances:
Distance(H1,H3) = 0.725
Distance(H2,H3) = 1.573
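The whole mini-example can be reproduced in a few lines, normalizing each variable with the training set's mean and sample standard deviation and then comparing distances:

```python
import math
import statistics

h1, h2 = (2, 80_000), (6, 83_900)   # training points
h3 = (2, 82_000)                    # new data point to classify
train = [h1, h2]

def normalize(point, train):
    """z-score each variable of point using the training set's mean and stdev."""
    columns = list(zip(*train))
    return tuple(
        (x - statistics.mean(col)) / statistics.stdev(col)
        for x, col in zip(point, columns)
    )

z1, z2, z3 = (normalize(h, train) for h in (h1, h2, h3))
d13 = math.dist(z1, z3)  # ~0.725 -> Household 1 is now the nearest neighbor
d23 = math.dist(z2, z3)  # ~1.573
```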



Assessing Performance

Similar to logistic regression


• Confusion matrix
• If only 2 categories
- Sensitivity, specificity, etc.
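In the 2-category case, these measures come directly from the confusion-matrix counts. A sketch with made-up actual/predicted labels:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, TN, FP, FN) counts for a two-category classifier."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
tp, tn, fp, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)  # share of actual positives caught
specificity = tn / (tn + fp)  # share of actual negatives caught
```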



Choosing k

General principles
• Small k
- Catches local features of data that may be relevant
- Can tend to overfit – fit to “noise”
• Large k
- Smooths out responses and reduces risk of overfitting
- May miss local features of data
o If k = total number of observations in training set, all new observations will be
assigned to the category with the largest number of members



Choosing k

Possible approaches
• Minimize total error %
- But some types of errors might be more important than others
- Weighted error measure?
• More “holistic” evaluation of confusion matrix
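The “minimize total error %” idea can be sketched by scoring candidate values of k on a held-out validation set (the validation-set setup and all data below are illustrative assumptions, not from the slides):

```python
import math
from collections import Counter

def knn_classify(training, point, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(training, key=lambda obs: math.dist(obs[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.9, 1.3), "A"),
            ((3.0, 3.0), "B"), ((3.2, 2.8), "B"), ((2.9, 3.1), "B")]
validation = [((1.1, 1.1), "A"), ((3.1, 3.0), "B"), ((2.0, 2.0), "A")]

def error_rate(k):
    """Total error % on the validation set for a given k."""
    wrong = sum(knn_classify(training, x, k) != y for x, y in validation)
    return wrong / len(validation)

best_k = min(range(1, 6), key=error_rate)  # smallest k with the lowest error %
```

A weighted error measure would only require replacing the `!=` count in `error_rate` with per-error costs.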



Other Issues

Are all variables important?


• All variables that are included have equal influence
• Perhaps screen some variables
- Consult domain experts
- Cross-validation approaches using variable weighting



Other Issues

Should all observations get the same vote?

• Suppose k = 5
• Weight closer items higher?
- Sometimes use 1/distance as the weight
- Also reduces chance of ties
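Weighting votes by 1/distance can be sketched as follows; the one-dimensional data are made up so that the weighted and unweighted votes disagree:

```python
import math
from collections import defaultdict

def weighted_knn(training, point, k):
    """Vote among the k nearest neighbors, weighting each vote by 1/distance."""
    nearest = sorted(training, key=lambda obs: math.dist(obs[0], point))[:k]
    weights = defaultdict(float)
    for features, label in nearest:
        d = math.dist(features, point)
        weights[label] += 1.0 / d if d > 0 else float("inf")
    return max(weights, key=weights.get)

# Three "B" neighbors slightly farther away than two very close "A" neighbors
training = [((0.0,), "A"), ((0.2,), "A"),
            ((1.0,), "B"), ((1.1,), "B"), ((1.2,), "B")]
label = weighted_knn(training, (0.1,), k=5)  # plain vote: "B"; weighted vote: "A"
```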
Extension to More Categories

Can easily be extended to more than 2 categories


• Whichever category has most elements (out of k)
closest to the new observation gets chosen

