
Practical Machine Learning
Verena Kaynig-Fittkau (vkaynig@seas.Harvard.edu)

"We are drowning in information and starving for knowledge."
John Naisbitt
Machine Learning
Analyze training data
Make predictions for new unseen data: supervised learning
Find patterns: unsupervised learning

Machine Learning
Supervised Learning: SVM, Decision Tree, Boosting, Random Forest
Unsupervised Learning: K-means, mean shift
Supervised Learning
[Figure: data points with features x1, x2 and class labels, divided by a separating hyperplane]
Features are important
[Figure: classifying an unknown fruit by roundness vs. weight]
Features are important
[Figure: shape and color as alternative features]
Google's Self-Driving Car
Car Features:
Laser scan, intensity model, elevation model
Lane model
Camera vision, 2D stationary map
So just measure everything?
More features = better classification?

Practical issues:
Data volume, computation overhead

Theoretical issues:
Generalization performance
Curse of dimensionality
Supervised Learning
[Figure repeated: labeled data points with features x1, x2, divided by a separating hyperplane]
Perceptron
x: data point
y: label
w: weight vector
b: bias
[Diagram: inputs x1, x2, x3 multiplied by weights w1, w2, w3, plus bias b, thresholded to output +1 or -1]
The XOR Problem
[Figure: XOR-labeled points are not linearly separable]
Support Vector Machine
Widely used for all sorts of classification problems
www.clopinet.com/isabelle/Projects/SVM/applist.html

Some people say it is the best off-the-shelf classifier out there
Maximum Margin Classification
[Figure: two separating hyperplanes; the one with the maximum margin generalizes better]
What about outliers?
ξ: slack variables
[Figure: slack variables allow points to violate the margin]
XOR problem revisited
[Figure: mapping the data into a higher-dimensional space]
Did we add information to make the problem separable?

Polynomial Kernel in 3D
Quadratic Kernel
[Figure: the XOR data becomes separable under the quadratic kernel map]
Kernel Functions

Polynomial: K(x, y) = (x·y + c)^d

Radial basis function (RBF): K(x, y) = exp(−γ ‖x − y‖²)
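These two kernels can be written directly in NumPy (a sketch, not from the slides; c, d, and gamma are the usual hyperparameters):

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x.y + c)^d; d=2 is the quadratic kernel."""
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-gamma * np.dot(diff, diff))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(polynomial_kernel(x, y))  # (0 + 1)^2 = 1.0
print(rbf_kernel(x, x))         # identical points -> 1.0
```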
Kernel Trick for SVMs
Arbitrarily many dimensions
Little computational cost
Maximal margin helps with curse of dimensionality

SVM Applet

http://www.ml.inf.ethz.ch/education/lectures_and_seminars/annex_estat/Classifier/JSupportVectorApplet.html
Tips and Tricks
SVMs are not scale invariant
Check if your library normalizes by default
Normalize your data:
mean: 0, std: 1
or map to [0,1] or [-1,1]
Normalize the test set in the same way!
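A minimal NumPy sketch of that last point: the mean and std come from the training set only, and the same transform is then applied to the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Compute statistics on the training set only...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ...and apply the SAME transform to both sets.
X_train_n = (X_train - mean) / std
X_test_n = (X_test - mean) / std

print(X_train_n.mean(axis=0).round(6))  # ~0 per feature
print(X_train_n.std(axis=0).round(6))   # ~1 per feature
```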

Tips and Tricks
RBF kernel is a good default
For parameters, try exponential sequences
Read:
Chih-Wei Hsu et al., "A Practical Guide to Support Vector Classification", Bioinformatics (2010)
Parameter Tuning
Given a classification task:
Which kernel?
Which kernel parameter values?
Which value for C?

Try different combinations and take the best.
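A sketch of such a grid search over exponential sequences for C and gamma (the `cv_score` function here is a hypothetical stand-in for a real cross-validation score):

```python
import itertools

# Exponential parameter grids, as the tips above suggest.
C_grid = [2.0 ** k for k in range(-5, 16, 2)]      # 2^-5 ... 2^15
gamma_grid = [2.0 ** k for k in range(-15, 4, 2)]  # 2^-15 ... 2^3

def cv_score(C, gamma):
    """Hypothetical stand-in: pretend the optimum is C=8, gamma=0.125."""
    return -(abs(C - 8.0) + abs(gamma - 0.125))

# Try every combination and take the best.
best = max(itertools.product(C_grid, gamma_grid),
           key=lambda params: cv_score(*params))
print(best)  # (8.0, 0.125)
```

In practice one would refine the grid around the best coarse-grid point, since the score surface is usually smooth in log-parameter space.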
Grid Search

Zang et al., "Identification of heparin samples that contain impurities or contaminants by chemometric pattern recognition analysis of proton NMR spectral data", Anal Bioanal Chem (2011)
Multi Class
One vs. All

Train n classifiers for n classes
Take the classification with the greatest positive margin
Slow training
Multi Class
One vs. One

Train n(n-1)/2 classifiers
Take the majority vote
Fast training
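The counting and the vote can be sketched in a few lines of plain Python (the pairwise decisions here are made up for illustration):

```python
from collections import Counter
from itertools import combinations

classes = ["cat", "dog", "bird", "fish"]

# One-vs-one trains one classifier per unordered pair: n(n-1)/2 of them.
pairs = list(combinations(classes, 2))
print(len(pairs))  # 6 for n = 4

# Hypothetical winners of the six pairwise classifiers for one test point:
votes = ["cat", "cat", "bird", "dog", "bird", "cat"]
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)  # "cat" wins the majority vote
```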
Decision Tree
[Example tree: queries such as "after 10 pm?", "got electricity?", "got new dvd?" lead to the actions: call friend, read book, watch tv, play computer]
Decision Trees
Fast training
Fast prediction
Easy to understand
Easy to interpret
Decision Tree - Idea
[Figure: recursive axis-aligned splits partition the input space into regions A-E]
Bishop, Pattern Recognition and Machine Learning, Springer (2006)
Decision Tree - Prediction
[Figure: a new point is routed down the tree to one of the regions A-E]
Decision Tree - Training
Learn the tree structure:
which feature to query
which threshold to choose
Node Purity
[Figure: class counts at each node of the example tree; good splits produce pure child nodes]
When to Stop
node contains only one class
node contains less than x data points
max depth is reached
node purity is sufficient
you start to overfit => cross-validation
Decision Trees - Disadvantages
Sensitive to small changes in the data
Overfitting
Only axis-aligned splits
Decision Trees vs SVM

Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
Wisdom of Crowds
"The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and can be harnessed by voting."
James Surowiecki

http://socialmedia4srm.wordpress.com/
Ensemble Methods
A single decision tree does not perform well
But, it is super fast
What if we learn multiple trees?

For multiple trees we need even more data!
Bootstrap
Resampling method from statistics
Useful to get error bars on estimates

Take N data points
Draw N times with replacement

Get an estimate from each bootstrapped sample


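A NumPy sketch of the recipe above, using the bootstrap to put an error bar on a sample mean:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=200)  # N data points

# Draw N times with replacement, B times over; the spread of the
# per-sample estimates serves as an error bar on the statistic.
B = 1000
means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                  for _ in range(B)])

print(means.mean().round(2))  # close to the plain sample mean
print(means.std().round(3))   # ~ std(data) / sqrt(N), the standard error
```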
Bagging
Bootstrap aggregating

Sample with replacement from your data set
Learn a classifier for each bootstrap sample
Average the results
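As a sketch of those three steps (not from the slides), here is bagging of simple decision stumps on a toy 1-D problem, with a majority vote as the average:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(x, y):
    """Pick the threshold/sign minimizing training error."""
    best = (np.inf, 0.0, 1)
    for t in x:
        for sign in (1, -1):
            pred = np.where(x > t, sign, -sign)
            err = np.mean(pred != y)
            if err < best[0]:
                best = (err, t, sign)
    return best[1], best[2]

def stump_predict(x, t, sign):
    return np.where(x > t, sign, -sign)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-1, -1, -1, 1, 1, 1])

# Bagging: fit one stump per bootstrap sample...
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
    stumps.append(fit_stump(x[idx], y[idx]))

# ...then average the results by majority vote.
votes = np.sum([stump_predict(x, t, s) for t, s in stumps], axis=0)
prediction = np.sign(votes)
print(prediction)
```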
Bagging Example
[Figure: decision boundaries of the individual classifiers and their average in the x1-x2 plane]
Bagging
Reduces overfitting (variance)
Normally uses one type of classifier
Decision trees are popular
Easy to parallelize
Boosting
Also an ensemble method like Bagging
But:
weak learners evolve over time
votes are weighted

Better than Bagging for many applications
Very popular method
Boosting
"Boosting is one of the most powerful learning ideas introduced in the last twenty years."

Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
Adaboost
[Figure: decision boundary evolving over boosting iterations in the x1-x2 plane]
AdaBoost
Initialize weights for data points
For each iteration:
Fit classifier to training data
Compute weighted classification error
Compute weight for classifier from the error
Update weights for data points
Final classifier is a weighted sum of all single classifiers
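The loop above can be sketched in NumPy with decision stumps as the weak learners (a toy implementation on separable 1-D data, not the slides' own code; the error clipping is a standard trick to avoid division by zero):

```python
import numpy as np

def fit_weighted_stump(x, y, w):
    """Best threshold/sign stump under data-point weights w."""
    best = (np.inf, 0.0, 1)
    for t in np.unique(x):
        for sign in (1, -1):
            pred = np.where(x > t, sign, -sign)
            err = np.sum(w[pred != y])       # weighted classification error
            if err < best[0]:
                best = (err, t, sign)
    return best

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1])

n = len(x)
w = np.ones(n) / n                           # initialize weights
alphas, stumps = [], []
for _ in range(5):
    err, t, sign = fit_weighted_stump(x, y, w)
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)    # classifier weight from its error
    pred = np.where(x > t, sign, -sign)
    w = w * np.exp(-alpha * y * pred)        # up-weight misclassified points
    w = w / w.sum()
    alphas.append(alpha)
    stumps.append((t, sign))

# Final classifier: weighted sum of the single stumps.
F = sum(a * np.where(x > t, s, -s) for a, (t, s) in zip(alphas, stumps))
print(np.sign(F))  # [-1. -1.  1.  1.]
```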
AdaBoost
[Figure: boosting rounds, from Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)]

AdaBoost
Introduced by Freund and Schapire in 1995
Worked great, nobody understood why!

Then five years later (Friedman et al. 2000):
AdaBoost minimizes an exponential loss function.

There still are open questions.
Random Forest
Builds upon the idea of bagging
Each tree is built from a bootstrap sample
Node splits are calculated from random feature subsets

http://www.andrewbuntine.com/articles/about/fun
Random Forest
All trees are fully grown
No pruning

Two parameters:
Number of trees
Number of features
Random Forest Error Rate
Error depends on:
Correlation between trees (higher is worse)
Strength of single trees (higher is better)

Increasing the number of features for each split:
Increases correlation
Increases strength of single trees
Out of Bag Error
Each tree is trained on a bootstrapped sample
About 1/3 of the data points are not used for training

Predict the unseen points with each tree
Measure the error
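The "about 1/3" figure comes from the probability that a point is never drawn in N draws with replacement, (1 − 1/N)^N → 1/e ≈ 0.368. A quick simulation (a sketch, not from the slides) confirms it:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000

# Fraction of points left out of ("out of bag" for) a bootstrap sample.
fractions = []
for _ in range(100):
    idx = rng.integers(0, N, size=N)        # one bootstrap sample
    oob = np.setdiff1d(np.arange(N), idx)   # points never drawn
    fractions.append(len(oob) / N)

print(round(np.mean(fractions), 3))  # ~0.368, i.e. about 1/3
```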
Out of Bag Error
[Diagram: each bootstrap sample trains one tree; the unused (out-of-bag) data points are used to test it]
Out of Bag Error
Very similar to cross-validation
Measured during training
Can be too optimistic
Variable Importance
Again use the out of bag samples
Predict the classes for these samples
Randomly permute the values of one feature
Predict the classes again
Measure the decrease in accuracy
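A sketch of that permutation measurement on toy data (the `model_predict` function stands in for a hypothetical fitted forest; here the label depends only on feature 0, so only feature 0 should show an accuracy drop):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the label depends only on feature 0.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def model_predict(X):
    """Hypothetical fitted model: thresholds feature 0."""
    return (X[:, 0] > 0).astype(int)

baseline = np.mean(model_predict(X) == y)

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # permute one feature's values
    importances.append(baseline - np.mean(model_predict(Xp) == y))

print(importances)  # large drop for feature 0, none for feature 1
```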
Tempting Scenario
Run random forest with all features
Reduce the number of features based on importance weights
Run again with the reduced feature set and report the out of bag error

This does not measure test performance!
Unbalanced Classes
The Problem: one class has far fewer data points than the other

Oversample: replicate data points of the minority class
Subsample: use only part of the majority class

Subsample for each tree!


Random Forest Subsampling
[Diagram: draw a balanced subsample for each tree, then train on it]
Random Forest
Similar to Bagging
Easy to parallelize
Packaged with some neat functions:
Out of bag error
Feature importance measure
Proximity estimation
Cascade Classifier
Ensemble methods are strong
But prediction is slow
Solution: Make prediction faster

Idea: Build a cascade
Cascade Classifier

http://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework
Viola Jones Face Detection

http://cvdazzle.com/
Viola Jones Face Detection
Takes long to train
Prediction in real time!

Widely used today
Summary
SVMs
Decision Trees
Bootstrap, Bagging, Boosting
Random Forest
Cascade Classifier
Further Reading
Error Measures
True positive (tp)
True negative (tn)
False positive (fp)
False negative (fn)

            predicted
             1    -1
true    1    tp    fn
       -1    fp    tn
TPR and FPR
True Positive Rate: TPR = tp / (tp + fn)
False Positive Rate: FPR = fp / (fp + tn)
Precision Recall
Recall: tp / (tp + fn)
Precision: tp / (tp + fp)
Precision Recall Curve
[Figure: precision plotted against recall, both ranging from 0 to 1]
Comparison

J. Davis & M. Goadrich, "The Relationship Between Precision-Recall and ROC Curves", ICML (2006)
F-measure
Weighted average of precision and recall:
F_β = (1 + β²) · precision · recall / (β² · precision + recall)

Usual case: β = 1
Increasing β allocates more weight to recall
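A small sketch of the three measures in plain Python, with made-up confusion-matrix counts:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean; larger beta emphasizes recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p = precision(tp=8, fp=2)   # 0.8
r = recall(tp=8, fn=4)      # 2/3
print(round(f_beta(p, r), 3))          # F1 = 16/22
print(round(f_beta(p, r, beta=2), 3))  # F2, weighting the (lower) recall more
```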
Clustering Evaluation Criteria
Based on expert knowledge
Debatable for real data
Hidden unknown structures could be present
Do we even want to just reproduce known structure?
Rand Index
Percentage of correct classifications
Compare pairs of elements:
tp: pair in the same cluster in both solutions
tn: pair in different clusters in both solutions
fp, fn: the two solutions disagree on the pair

Rand = (tp + tn) / (tp + tn + fp + fn)
fp and fn are equally weighted
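The pairwise comparison can be sketched in a few lines of plain Python (labels are arbitrary cluster ids; only co-membership of pairs matters):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of element pairs on which two clusterings agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)   # tp (both same) or tn (both different)
        total += 1
    return agree / total

truth = [0, 0, 1, 1, 2]
found = [1, 1, 0, 0, 0]
print(rand_index(truth, truth))  # identical clusterings -> 1.0
print(rand_index(truth, found))  # disagrees only on pairs involving element 4
```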


Stability
What is the right number of clusters?
What makes a good clustering solution?

Clustering should generalize!
Gini Impurity
Example:
4 red, 3 green, 3 blue data points

random sample:
red: 4/10, green: 3/10, blue: 3/10

misclassification (true class × wrong prediction):
red: 4/10 * (3/10 + 3/10)
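The example works out directly in code (a sketch of the standard Gini impurity, summing each class's misclassification term):

```python
def gini(counts):
    """Gini impurity: expected misclassification rate when predicting
    a label at random according to the class frequencies."""
    n = sum(counts)
    p = [c / n for c in counts]
    return sum(pi * (1 - pi) for pi in p)

# 4 red, 3 green, 3 blue data points:
print(round(gini([4, 3, 3]), 2))  # 0.66

# Contribution of the red class alone, as on the slide:
print(round(0.4 * (0.3 + 0.3), 2))  # 4/10 * (3/10 + 3/10) = 0.24
```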
Gini Impurity
Number of classes: K
Number of data points: N
Number of data points of class i: N_i

G = Σ_i (N_i / N) · (1 − N_i / N)
(probability of the true class times probability of a wrong prediction)
Gini Impurity

Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
Node Purity Gain
Compare:
Gini impurity of the parent node A
Gini impurity of the child nodes B and C
Pseudocode
Check for base cases
For each attribute a
Calculate the gain from splitting on a
Let a_best be the attribute with the highest gain
Create a decision node that splits on a_best
Repeat on the sub-nodes

http://en.wikipedia.org/wiki/C4.5_algorithm
