
Practical Machine Learning
Verena Kaynig-Fittkau (vkaynig@seas.Harvard.edu)

"We are drowning in information and starving for knowledge."
John Naisbitt
Machine Learning
Analyze training data
Make predictions for new unseen data: supervised learning
Find patterns: unsupervised learning

Machine Learning
Supervised Learning: SVM, Decision Tree, Boosting, Random Forest
Unsupervised Learning: K-means, mean shift
Supervised Learning
[Figure: data points with features x1, x2 and class labels, divided by a separating hyperplane]
Features are important
[Figure: classifying an unknown fruit by roundness vs. weight]
Features are important
[Figure: shape and color as alternative features]
Google's Self-Driving Car
Car Features:
Laser scan, intensity model, elevation model
Lane model
Camera vision, 2D stationary map
So just measure everything?
More features = better classification?

Practical issues:
Data volume, computation overhead

Theoretical issues:
Generalization performance
Curse of dimensionality
Supervised Learning
[Figure repeated: labeled data points with features x1, x2, divided by a separating hyperplane]
Perceptron
x: data point
y: label
w: weight vector
b: bias
[Diagram: inputs x1, x2, x3 multiplied by weights w1, w2, w3, plus bias b, thresholded to output +1 or -1]
The XOR Problem
[Figure: XOR-labeled points are not linearly separable]
Support Vector Machine
Widely used for all sorts of classification problems
www.clopinet.com/isabelle/Projects/SVM/applist.html

Some people say it is the best off-the-shelf classifier out there
Maximum Margin Classification
[Figure: two separating hyperplanes; the one with the maximum margin generalizes better]
What about outliers?
ξ: slack variables
[Figure: slack variables allow points to violate the margin]
XOR problem revisited
[Figure: mapping the data into a higher-dimensional space]
Did we add information to make the problem separable?

Polynomial Kernel in 3D
Quadratic Kernel
[Figure: the XOR data becomes separable under the quadratic kernel map]
Kernel Functions

Polynomial: K(x, y) = (x·y + c)^d

Radial basis function (RBF): K(x, y) = exp(−γ ‖x − y‖²)
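These two kernels can be written directly in NumPy (a sketch, not from the slides; c, d, and gamma are the usual hyperparameters):

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x.y + c)^d; d=2 is the quadratic kernel."""
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-gamma * np.dot(diff, diff))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(polynomial_kernel(x, y))  # (0 + 1)^2 = 1.0
print(rbf_kernel(x, x))         # identical points -> 1.0
```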
Kernel Trick for SVMs
Arbitrarily many dimensions
Little computational cost
Maximal margin helps with curse of dimensionality

SVM Applet

http://www.ml.inf.ethz.ch/education/lectures_and_seminars/annex_estat/Classifier/JSupportVectorApplet.html
Tips and Tricks
SVMs are not scale invariant
Check if your library normalizes by default
Normalize your data:
mean: 0, std: 1
or map to [0,1] or [-1,1]
Normalize the test set in the same way!
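A minimal NumPy sketch of that last point: the mean and std come from the training set only, and the same transform is then applied to the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Compute statistics on the training set only...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ...and apply the SAME transform to both sets.
X_train_n = (X_train - mean) / std
X_test_n = (X_test - mean) / std

print(X_train_n.mean(axis=0).round(6))  # ~0 per feature
print(X_train_n.std(axis=0).round(6))   # ~1 per feature
```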

Tips and Tricks
RBF kernel is a good default
For parameters, try exponential sequences
Read:
Chih-Wei Hsu et al., "A Practical Guide to Support Vector Classification", Bioinformatics (2010)
Parameter Tuning
Given a classification task:
Which kernel?
Which kernel parameter values?
Which value for C?

Try different combinations and take the best.
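A sketch of such a grid search over exponential sequences for C and gamma (the `cv_score` function here is a hypothetical stand-in for a real cross-validation score):

```python
import itertools

# Exponential parameter grids, as the tips above suggest.
C_grid = [2.0 ** k for k in range(-5, 16, 2)]      # 2^-5 ... 2^15
gamma_grid = [2.0 ** k for k in range(-15, 4, 2)]  # 2^-15 ... 2^3

def cv_score(C, gamma):
    """Hypothetical stand-in: pretend the optimum is C=8, gamma=0.125."""
    return -(abs(C - 8.0) + abs(gamma - 0.125))

# Try every combination and take the best.
best = max(itertools.product(C_grid, gamma_grid),
           key=lambda params: cv_score(*params))
print(best)  # (8.0, 0.125)
```

In practice one would refine the grid around the best coarse-grid point, since the score surface is usually smooth in log-parameter space.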
Grid Search

Zang et al., "Identification of heparin samples that contain impurities or contaminants by chemometric pattern recognition analysis of proton NMR spectral data", Anal Bioanal Chem (2011)
Multi Class
One vs. All

Train n classifiers for n classes
Take the classification with the greatest positive margin
Slow training
Multi Class
One vs. One

Train n(n-1)/2 classifiers
Take the majority vote
Fast training
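The counting and the vote can be sketched in a few lines of plain Python (the pairwise decisions here are made up for illustration):

```python
from collections import Counter
from itertools import combinations

classes = ["cat", "dog", "bird", "fish"]

# One-vs-one trains one classifier per unordered pair: n(n-1)/2 of them.
pairs = list(combinations(classes, 2))
print(len(pairs))  # 6 for n = 4

# Hypothetical winners of the six pairwise classifiers for one test point:
votes = ["cat", "cat", "bird", "dog", "bird", "cat"]
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)  # "cat" wins the majority vote
```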
Decision Tree
[Example tree: queries such as "after 10 pm?", "got electricity?", "got new dvd?" lead to the actions: call friend, read book, watch tv, play computer]
Decision Trees
Fast training
Fast prediction
Easy to understand
Easy to interpret
Decision Tree - Idea
[Figure: recursive axis-aligned splits partition the input space into regions A-E]
Bishop, Pattern Recognition and Machine Learning, Springer (2006)
Decision Tree - Prediction
[Figure: a new point is routed down the tree to one of the regions A-E]
Decision Tree - Training
Learn the tree structure:
which feature to query
which threshold to choose
Node Purity
[Figure: class counts at each node of the example tree; good splits produce pure child nodes]
When to Stop
node contains only one class
node contains less than x data points
max depth is reached
node purity is sufficient
you start to overfit => cross-validation
Decision Trees - Disadvantages
Sensitive to small changes in the data
Overfitting
Only axis-aligned splits
Decision Trees vs SVM

Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
Wisdom of Crowds
"The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and can be harnessed by voting."
James Surowiecki

http://socialmedia4srm.wordpress.com/
Ensemble Methods
A single decision tree does not perform well
But, it is super fast
What if we learn multiple trees?

For multiple trees we need even more data!
Bootstrap
Resampling method from statistics
Useful to get error bars on estimates

Take N data points
Draw N times with replacement

Get an estimate from each bootstrapped sample


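A NumPy sketch of the recipe above, using the bootstrap to put an error bar on a sample mean:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=200)  # N data points

# Draw N times with replacement, B times over; the spread of the
# per-sample estimates serves as an error bar on the statistic.
B = 1000
means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                  for _ in range(B)])

print(means.mean().round(2))  # close to the plain sample mean
print(means.std().round(3))   # ~ std(data) / sqrt(N), the standard error
```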
Bagging
Bootstrap aggregating

Sample with replacement from your data set
Learn a classifier for each bootstrap sample
Average the results
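As a sketch of those three steps (not from the slides), here is bagging of simple decision stumps on a toy 1-D problem, with a majority vote as the average:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(x, y):
    """Pick the threshold/sign minimizing training error."""
    best = (np.inf, 0.0, 1)
    for t in x:
        for sign in (1, -1):
            pred = np.where(x > t, sign, -sign)
            err = np.mean(pred != y)
            if err < best[0]:
                best = (err, t, sign)
    return best[1], best[2]

def stump_predict(x, t, sign):
    return np.where(x > t, sign, -sign)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-1, -1, -1, 1, 1, 1])

# Bagging: fit one stump per bootstrap sample...
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
    stumps.append(fit_stump(x[idx], y[idx]))

# ...then average the results by majority vote.
votes = np.sum([stump_predict(x, t, s) for t, s in stumps], axis=0)
prediction = np.sign(votes)
print(prediction)
```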
Bagging Example
[Figure: decision boundaries of the individual classifiers and their average in the x1-x2 plane]
Bagging
Reduces overfitting (variance)
Normally uses one type of classifier
Decision trees are popular
Easy to parallelize
Boosting
Also an ensemble method like Bagging
But:
weak learners evolve over time
votes are weighted

Better than Bagging for many applications
Very popular method
Boosting
"Boosting is one of the most powerful learning ideas introduced in the last twenty years."

Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
Adaboost
[Figure: decision boundary evolving over boosting iterations in the x1-x2 plane]
AdaBoost
Initialize weights for data points
For each iteration:
Fit classifier to training data
Compute weighted classification error
Compute weight for classifier from the error
Update weights for data points
Final classifier is a weighted sum of all single classifiers
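The loop above can be sketched in NumPy with decision stumps as the weak learners (a toy implementation on separable 1-D data, not the slides' own code; the error clipping is a standard trick to avoid division by zero):

```python
import numpy as np

def fit_weighted_stump(x, y, w):
    """Best threshold/sign stump under data-point weights w."""
    best = (np.inf, 0.0, 1)
    for t in np.unique(x):
        for sign in (1, -1):
            pred = np.where(x > t, sign, -sign)
            err = np.sum(w[pred != y])       # weighted classification error
            if err < best[0]:
                best = (err, t, sign)
    return best

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1])

n = len(x)
w = np.ones(n) / n                           # initialize weights
alphas, stumps = [], []
for _ in range(5):
    err, t, sign = fit_weighted_stump(x, y, w)
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)    # classifier weight from its error
    pred = np.where(x > t, sign, -sign)
    w = w * np.exp(-alpha * y * pred)        # up-weight misclassified points
    w = w / w.sum()
    alphas.append(alpha)
    stumps.append((t, sign))

# Final classifier: weighted sum of the single stumps.
F = sum(a * np.where(x > t, s, -s) for a, (t, s) in zip(alphas, stumps))
print(np.sign(F))  # [-1. -1.  1.  1.]
```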
AdaBoost
[Figure: boosting rounds, from Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)]

AdaBoost
Introduced by Freund and Schapire in 1995
Worked great, nobody understood why!

Then five years later (Friedman et al. 2000):
AdaBoost minimizes an exponential loss function.

There still are open questions.
Random Forest
Builds upon the idea of bagging
Each tree is built from a bootstrap sample
Node splits are calculated from random feature subsets

http://www.andrewbuntine.com/articles/about/fun
Random Forest
All trees are fully grown
No pruning

Two parameters:
Number of trees
Number of features
Random Forest Error Rate
Error depends on:
Correlation between trees (higher is worse)
Strength of single trees (higher is better)

Increasing the number of features for each split:
Increases correlation
Increases strength of single trees
Out of Bag Error
Each tree is trained on a bootstrapped sample
About 1/3 of the data points are not used for training

Predict the unseen points with each tree
Measure the error
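The "about 1/3" figure comes from the probability that a point is never drawn in N draws with replacement, (1 − 1/N)^N → 1/e ≈ 0.368. A quick simulation (a sketch, not from the slides) confirms it:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000

# Fraction of points left out of ("out of bag" for) a bootstrap sample.
fractions = []
for _ in range(100):
    idx = rng.integers(0, N, size=N)        # one bootstrap sample
    oob = np.setdiff1d(np.arange(N), idx)   # points never drawn
    fractions.append(len(oob) / N)

print(round(np.mean(fractions), 3))  # ~0.368, i.e. about 1/3
```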
Out of Bag Error
[Diagram: each bootstrap sample trains one tree; the unused (out-of-bag) data points are used to test it]
Out of Bag Error
Very similar to cross-validation
Measured during training
Can be too optimistic
Variable Importance
Again use the out of bag samples
Predict the classes for these samples
Randomly permute the values of one feature
Predict the classes again
Measure the decrease in accuracy
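A sketch of that permutation measurement on toy data (the `model_predict` function stands in for a hypothetical fitted forest; here the label depends only on feature 0, so only feature 0 should show an accuracy drop):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the label depends only on feature 0.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def model_predict(X):
    """Hypothetical fitted model: thresholds feature 0."""
    return (X[:, 0] > 0).astype(int)

baseline = np.mean(model_predict(X) == y)

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # permute one feature's values
    importances.append(baseline - np.mean(model_predict(Xp) == y))

print(importances)  # large drop for feature 0, none for feature 1
```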
Tempting Scenario
Run random forest with all features
Reduce the number of features based on importance weights
Run again with the reduced feature set and report the out of bag error

This does not measure test performance!
Unbalanced Classes
The Problem: one class has far fewer data points than the other

Oversample: replicate data points of the minority class
Subsample: use only part of the majority class

Subsample for each tree!


Random Forest Subsampling
[Diagram: draw a balanced subsample for each tree, then train on it]
Random Forest
Similar to Bagging
Easy to parallelize
Packaged with some neat functions:
Out of bag error
Feature importance measure
Proximity estimation
Cascade Classifier
Ensemble methods are strong
But prediction is slow
Solution: Make prediction faster

Idea: Build a cascade
Cascade Classifier

http://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework
Viola Jones Face Detection

http://cvdazzle.com/
Viola Jones Face Detection
Takes long to train
Prediction in real time!

Widely used today
Summary
SVMs
Decision Trees
Bootstrap, Bagging, Boosting
Random Forest
Cascade Classifier
Further Reading
Error Measures
True positive (tp)
True negative (tn)
False positive (fp)
False negative (fn)

            predicted
             1    -1
true    1    tp    fn
       -1    fp    tn
TPR and FPR
True Positive Rate: TPR = tp / (tp + fn)
False Positive Rate: FPR = fp / (fp + tn)
Precision Recall
Recall: tp / (tp + fn)
Precision: tp / (tp + fp)
Precision Recall Curve
[Figure: precision plotted against recall, both ranging from 0 to 1]
Comparison

J. Davis & M. Goadrich, "The Relationship Between Precision-Recall and ROC Curves", ICML (2006)
F-measure
Weighted average of precision and recall:
F_β = (1 + β²) · precision · recall / (β² · precision + recall)

Usual case: β = 1
Increasing β allocates more weight to recall
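A small sketch of the three measures in plain Python, with made-up confusion-matrix counts:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean; larger beta emphasizes recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p = precision(tp=8, fp=2)   # 0.8
r = recall(tp=8, fn=4)      # 2/3
print(round(f_beta(p, r), 3))          # F1 = 16/22
print(round(f_beta(p, r, beta=2), 3))  # F2, weighting the (lower) recall more
```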
Clustering Evaluation Criteria
Based on expert knowledge
Debatable for real data
Hidden unknown structures could be present
Do we even want to just reproduce known structure?
Rand Index
Percentage of correct classifications
Compare pairs of elements:
tp: pair in the same cluster in both solutions
tn: pair in different clusters in both solutions
fp, fn: the two solutions disagree on the pair

Rand = (tp + tn) / (tp + tn + fp + fn)
fp and fn are equally weighted
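The pairwise comparison can be sketched in a few lines of plain Python (labels are arbitrary cluster ids; only co-membership of pairs matters):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of element pairs on which two clusterings agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)   # tp (both same) or tn (both different)
        total += 1
    return agree / total

truth = [0, 0, 1, 1, 2]
found = [1, 1, 0, 0, 0]
print(rand_index(truth, truth))  # identical clusterings -> 1.0
print(rand_index(truth, found))  # disagrees only on pairs involving element 4
```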


Stability
What is the right number of clusters?
What makes a good clustering solution?

Clustering should generalize!
Gini Impurity
Example:
4 red, 3 green, 3 blue data points

random sample:
red: 4/10, green: 3/10, blue: 3/10

misclassification (true class × wrong prediction):
red: 4/10 * (3/10 + 3/10)
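The example works out directly in code (a sketch of the standard Gini impurity, summing each class's misclassification term):

```python
def gini(counts):
    """Gini impurity: expected misclassification rate when predicting
    a label at random according to the class frequencies."""
    n = sum(counts)
    p = [c / n for c in counts]
    return sum(pi * (1 - pi) for pi in p)

# 4 red, 3 green, 3 blue data points:
print(round(gini([4, 3, 3]), 2))  # 0.66

# Contribution of the red class alone, as on the slide:
print(round(0.4 * (0.3 + 0.3), 2))  # 4/10 * (3/10 + 3/10) = 0.24
```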
Gini Impurity
Number of classes: K
Number of data points: N
Number of data points of class i: N_i

G = Σ_i (N_i / N) · (1 − N_i / N)
(probability of the true class times probability of a wrong prediction)
Gini Impurity

Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
Node Purity Gain
Compare:
Gini impurity of the parent node A
Gini impurity of the child nodes B and C
Pseudocode
Check for base cases
For each attribute a
Calculate the gain from splitting on a
Let a_best be the attribute with the highest gain
Create a decision node that splits on a_best
Repeat on the sub-nodes

http://en.wikipedia.org/wiki/C4.5_algorithm
