
OTM 714 - Supply Chain Analytics

Greg DeCroix
Spring 2019

Class 18
K Nearest Neighbors Classification

“Not your best work, Devin – not your best.”


- Jason Sudeikis
Saturday Night Live, Pandora Internet Radio skit, featuring Bruno Mars

© 2019 Gregory A. DeCroix - All Rights Reserved


Classification Application
Pandora Internet Radio
• Measures songs on hundreds of variables
- Acid rock qualities
- Accordion playing
- Acousti-Lectric sonority, etc.
• Pays musicians to rate songs
• Listener chooses a song → Pandora measures distance to each
song in collection
- Selects a song that is nearby and plays it next
• If listener “likes” new song, it is placed in the “like” group;
If “dislike”, it is placed in “dislike”
• Pandora recommends songs that are close to “likes” and far from
“dislikes”
Sources: www.Pandora.com; Wikipedia’s article on the Music Genome Project.

DeCroix -- OTM 714 Wisconsin School of Business


Recall Logistic Regression

Three-step process
• Use logistic regression algorithm to fit the data to the equation
ln(p / (1 − p)) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ

• “Invert” the regression equation to obtain the probability p given a particular set of independent variables
p = e^(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ) / (1 + e^(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ))

• Use some cutoff value to assign the observation to a category


- E.g., suppose p = Prob(good credit risk)
- If p > 0.5, assign customer to “good credit risk” category
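The three steps above can be sketched in Python. The coefficients, variable values, and category names below are made up for illustration, not from a real fit:

```python
import math

# Step 1 (assumed done): hypothetical fitted coefficients (beta0, beta1, beta2)
BETA = [-1.0, 0.8, 0.05]

def logistic_probability(beta, x):
    """Step 2: invert the fitted equation to get p = e^score / (1 + e^score)."""
    score = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(score) / (1 + math.exp(score))

def classify(p, cutoff=0.5):
    """Step 3: assign the observation to a category using a cutoff on p."""
    return "good credit risk" if p > cutoff else "bad credit risk"

p = logistic_probability(BETA, [2.0, 10.0])   # score = -1.0 + 1.6 + 0.5 = 1.1
label = classify(p)                           # p ~ 0.75 > 0.5
```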



K Nearest Neighbors
• Also a classification method
- Training data set contains collection of variables about each
observation
- Also contains known classification for those observations
- These data points are used to “learn” method for assigning new
observations to the classification categories
• Conceptually somewhat simpler
- Does not rely on specific functional form (e.g., logit)
- Doesn’t statistically fit parameters used to create a score
• Can be applied with any number of classification categories



K Nearest Neighbors
• Start with training data with known classifications

• Choose user-specified parameter k


K Nearest Neighbors
• For each new observation, compute distance to each
observation in the training data



Distance Measure

Consider two items


Item i variable values: (xᵢ₁, xᵢ₂, …, xᵢₚ)
Item j variable values: (xⱼ₁, xⱼ₂, …, xⱼₚ)

Most commonly used distance between items


• Euclidean: dᵢⱼ = √[(xᵢ₁ − xⱼ₁)² + (xᵢ₂ − xⱼ₂)² + ⋯ + (xᵢₚ − xⱼₚ)²]
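This formula transcribes directly to Python for any number of variables:

```python
import math

def euclidean_distance(x_i, x_j):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

d = euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0))  # sqrt(9 + 16 + 0) = 5.0
```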



Key Requirement – Numerical X Variables

Distance measure requires X variables to be numbers


• Consider Charles Book Club example
Gender Money Recency Frequency FirstPurch ChildBks
F 172 6 1 6 0
F 49 36 1 36 0
F 198 2 1 2 0
M 191 12 4 34 0
M 393 24 8 68 1

• What is the distance between M and F?


• Replace with this:
Female Money Recency Frequency FirstPurch ChildBks
1 172 6 1 6 0
1 49 36 1 36 0
1 198 2 1 2 0
0 191 12 4 34 0
0 393 24 8 68 1
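The recoding above amounts to replacing the categorical Gender column with a 0/1 dummy variable, which can be sketched in plain Python using the example rows:

```python
# Rows from the Charles Book Club example: Gender is categorical,
# so replace it with a numeric Female indicator (F -> 1, M -> 0)
rows = [
    ("F", 172, 6, 1, 6, 0),
    ("F", 49, 36, 1, 36, 0),
    ("F", 198, 2, 1, 2, 0),
    ("M", 191, 12, 4, 34, 0),
    ("M", 393, 24, 8, 68, 1),
]
numeric_rows = [(1 if gender == "F" else 0, *rest) for gender, *rest in rows]
```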



K Nearest Neighbors
• Identify the k nearest neighbors, and let them vote!

[Figure: the same new observation assigned with k = 1, k = 3, and k = 5; the winning category can change with k]


Normalizing Data
It is usually important to normalize data first
• Consider a data set with just two variables
• Consider simple training set with just two data points, k = 1
              Children   Income    Own Home?
Household 1   2          $80,000   Yes
Household 2   6          $83,900   No        ← nearest
Household 3   2          $82,000   No        ← new data point to classify

• Distances:
Distance(H1,H3) = 2000
Distance(H2,H3) = 1900.004



How to Normalize

Standard normalization technique


• Subtract mean, divide by standard deviation (yields a z-score)
• For each variable 𝑥𝑥𝑖𝑖𝑖𝑖 representing the jth variable for item i:
    μⱼ = (1/n) Σᵢ₌₁ⁿ xᵢⱼ          Mean value for variable j, across all items

    σⱼ = stdevᵢ(xᵢⱼ)              Standard deviation for variable j, across all items

    zᵢⱼ = (xᵢⱼ − μⱼ) / σⱼ         Normalized value for the jth variable of item i
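The z-score computation for one variable can be sketched with Python's statistics module (note that `stdev` is the sample standard deviation, which is what the mini-example below uses):

```python
import statistics

def z_scores(column):
    """Normalize one variable: subtract its mean, divide by its standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)  # sample standard deviation
    return [(x - mu) / sigma for x in column]

children = z_scores([2, 6])  # mean 4, stdev 2.83 -> approximately [-0.707, 0.707]
```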



Normalizing Our Mini-Example

Means and standard deviations (over training set) are:


μ_children = 4, σ_children = 2.83, μ_income = $81,950, σ_income = $2,757.72

• The normalized data become:


              Children   Income   Own Home?
Household 1   -0.707     -0.707   Yes      ← nearest
Household 2   0.707      0.707    No
Household 3   -0.707     0.018    Yes      ← new data point, now assigned “Yes”

• Distances:
Distance(H1,H3) = 0.725
Distance(H2,H3) = 1.573
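The whole mini-example can be reproduced in a few lines, normalizing each variable with the training set's mean and sample standard deviation and then comparing distances:

```python
import math
import statistics

h1, h2 = (2, 80_000), (6, 83_900)   # training points
h3 = (2, 82_000)                    # new data point to classify
train = [h1, h2]

def normalize(point, train):
    """z-score each variable of point using the training set's mean and stdev."""
    columns = list(zip(*train))
    return tuple(
        (x - statistics.mean(col)) / statistics.stdev(col)
        for x, col in zip(point, columns)
    )

z1, z2, z3 = (normalize(h, train) for h in (h1, h2, h3))
d13 = math.dist(z1, z3)  # ~0.725 -> Household 1 is now the nearest neighbor
d23 = math.dist(z2, z3)  # ~1.573
```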



Assessing Performance

Similar to logistic regression


• Confusion matrix
• If only 2 categories
- Sensitivity, specificity, etc.
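In the 2-category case, these measures come directly from the confusion-matrix counts. A sketch with made-up actual/predicted labels:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, TN, FP, FN) counts for a two-category classifier."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
tp, tn, fp, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)  # share of actual positives caught
specificity = tn / (tn + fp)  # share of actual negatives caught
```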



Choosing k

General principles
• Small k
- Catches local features of data that may be relevant
- Can tend to overfit – fit to “noise”
• Large k
- Smooths out responses and reduces risk of overfitting
- May miss local features of data
o If k = total number of observations in training set, all new observations will be
assigned to the category with the largest number of members



Choosing k

Possible approaches
• Minimize total error %
- But some types of errors might be more important than others
- Weighted error measure?
• More “holistic” evaluation of confusion matrix
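The “minimize total error %” idea can be sketched by scoring candidate values of k on a held-out validation set (the validation-set setup and all data below are illustrative assumptions, not from the slides):

```python
import math
from collections import Counter

def knn_classify(training, point, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(training, key=lambda obs: math.dist(obs[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.9, 1.3), "A"),
            ((3.0, 3.0), "B"), ((3.2, 2.8), "B"), ((2.9, 3.1), "B")]
validation = [((1.1, 1.1), "A"), ((3.1, 3.0), "B"), ((2.0, 2.0), "A")]

def error_rate(k):
    """Total error % on the validation set for a given k."""
    wrong = sum(knn_classify(training, x, k) != y for x, y in validation)
    return wrong / len(validation)

best_k = min(range(1, 6), key=error_rate)  # smallest k with the lowest error %
```

A weighted error measure would only require replacing the `!=` count in `error_rate` with per-error costs.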



Other Issues

Are all variables important?


• All variables that are included have equal influence
• Perhaps screen some variables
- Consult domain experts
- Cross-validation approaches using variable weighting



Other Issues

Should all observations get the same vote?

• Suppose k = 5
• Weight closer items higher?
- Sometimes use 1/distance as the weight
- Also reduces chance of ties
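Weighting votes by 1/distance can be sketched as follows; the one-dimensional data are made up so that the weighted and unweighted votes disagree:

```python
import math
from collections import defaultdict

def weighted_knn(training, point, k):
    """Vote among the k nearest neighbors, weighting each vote by 1/distance."""
    nearest = sorted(training, key=lambda obs: math.dist(obs[0], point))[:k]
    weights = defaultdict(float)
    for features, label in nearest:
        d = math.dist(features, point)
        weights[label] += 1.0 / d if d > 0 else float("inf")
    return max(weights, key=weights.get)

# Three "B" neighbors slightly farther away than two very close "A" neighbors
training = [((0.0,), "A"), ((0.2,), "A"),
            ((1.0,), "B"), ((1.1,), "B"), ((1.2,), "B")]
label = weighted_knn(training, (0.1,), k=5)  # plain vote: "B"; weighted vote: "A"
```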
Extension to More Categories

Can easily be extended to more than 2 categories


• Whichever category has most elements (out of k)
closest to the new observation gets chosen

