Heikki Huttunen
University Lecturer, Dr.Eng.
Heikki.Huttunen@tut.fi
• Motivation
• Why machine learning?
• Machine learning basics
• Why Python?
• Review of widely used classifiers and their implementation in Python
• Examples
• Data-driven design of a classifier for ovarian cancer detection
• ICANN MEG Mind Reading Challenge 2011
• IEEE MLSP Birds 2013 Competition
• DecMeg 2014 Decoding the Human Brain Competition
• Hands-on session (TC407)
• Slides available at http://www.cs.tut.fi/~hehu/MMLP.pdf
Introduction
Python
• However, due to licensing issues and heavy development of Python, scientific Python started to gain its user base.
• Python's strength is in its versatility and huge community.
• There are two versions: Python 2.7 and 3.4. We'll use the former, as it's better supported by most packages interesting to us.
Alternatives to Python in Science
Python vs. R
• R is the #1 workhorse for statistics and data analysis.
• Popularity: http://www.kdnuggets.com/polls/2015/r-vs-python.html
• R is great for specific data analysis and visualization needs.
• However, Python interfaces to other domains ranging from deep neural networks (Caffe, Theano) and image analysis (OpenCV) to even a full-blown web server (Django).
Python vs. Matlab
• Matlab is the #1 workhorse for linear algebra.
• Matlab is a professionally maintained product.
• Some of Matlab's toolboxes are great (Image Processing tb); some are obsolete (Neural Network tb).
• New versions twice a year; the amount of novelty varies.
• Matlab is expensive for non-educational users.
Essential Modules
• numpy
• sklearn
• matplotlib
Classification
Samples: $x_1, x_2, \ldots, x_N \in \mathbb{R}^P$
Class labels: $y_1, y_2, \ldots, y_N \in \{1, 2, \ldots, C\}$
Classifier: $F(x): \mathbb{R}^P \mapsto \{1, 2, \ldots, C\}$
• Now, the task is to find the function F that maps the samples most accurately to their corresponding labels.
• For example: find the function F that minimizes the number of erroneous predictions, i.e., the cases in which $F(x_k) \neq y_k$.
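In sklearn terms, F is an estimator fit on (X, y); a minimal sketch of this setup (the synthetic data and the choice of LogisticRegression are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2),                 # class 1 samples in R^2
               rng.randn(50, 2) + [3, 3]])       # class 2 samples, shifted
y = np.array([1] * 50 + [2] * 50)                # labels in {1, 2}

F = LogisticRegression().fit(X, y)               # learn F: R^P -> {1, ..., C}
errors = np.sum(F.predict(X) != y)               # cases where F(x_k) != y_k
print("Erroneous predictions: %d / %d" % (errors, len(y)))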
Classification Example
[Figure: a linearly separable two-class dataset; the query point is classified as RED]
• The expression $w^T x = \sum_k w_k x_k$ essentially transforms the multidimensional data $x$ to a real number, which is then compared to a threshold $b$.
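A direct numpy rendering of this decision rule (the values of w, b, and x below are made up for illustration):

import numpy as np

w = np.array([1.0, -2.0])            # weight vector
b = 0.5                              # threshold
x = np.array([3.0, 1.0])             # sample to classify

score = np.dot(w, x)                 # w^T x = sum_k w_k x_k, a real number
label = "RED" if score > b else "BLUE"
print(score, label)                  # 1.0 -> RED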
Flavors of Linear Classifiers
[Figure: three panels showing the decision boundaries of the three linear classifiers on the same data]
Details
• The LDA:
• The oldest of the three: Fisher, 1935.
• "Find the projection that maximizes class separation", i.e., pull the classes as far from each other as possible.
• Closed form solution, fast to train.
• The SVM:
• Vapnik and Chervonenkis, 1963.
• "Maximize the margin between classes".
• Slowest of the three, performs well with sparse high-dimensional data.
• Logistic Regression (a.k.a. Generalized Linear Model):
• History traces back to the 1700s, but the proper formulation and an efficient training algorithm are due to Nelder and Wedderburn in 1972.
• Statistical algorithm: "Maximize the likelihood of the data".
• Also outputs class probabilities. Has been extended to automatic feature selection.
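All three flavors are available in sklearn; a minimal sketch that fits each on the same toy data (module paths follow recent sklearn versions; the data is an assumption):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 2])
y = np.array([0] * 50 + [1] * 50)

for clf in (LinearDiscriminantAnalysis(),   # closed form, fast to train
            SVC(kernel="linear"),           # maximizes the margin
            LogisticRegression()):          # maximizes the likelihood
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))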
The Class Probabilities of Logistic Regression
[Figure: logistic regression decision boundary; the query point is classified as RED with probability p = 60.6 %]
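The probability shown in the figure is the kind of value predict_proba returns; a small sketch with made-up one-dimensional data:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.6]]))          # predicted class label
print(clf.predict_proba([[1.6]]))    # probability of each class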
Nonlinear SVM
[Figure: support vector machine with the kernel trick; the query point is classified as BLUE]
[Figure: relative importance of attributes (feature importances): horizontal axis 26.8 %, vertical axis 73.2 %]
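Both nonlinear models shown here are one-liners in sklearn; a sketch on synthetic data (shapes and parameters are assumptions):

import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # circular class boundary

svm = SVC(kernel="rbf").fit(X, y)       # kernel trick: nonlinear decision surface
print(svm.score(X, y))

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)          # relative importance of the two attributes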
Generalization
• The important thing is how well the classifier works with
unseen data.
• An overfitted classifier memorizes the training data and does
not generalize.
[Figure: 1-NN decision regions. Training data: Acc = 100.0 %; test data: Acc = 88.2 %]
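The 1-NN behavior in the figure is easy to reproduce; a sketch with synthetic data (the split and the data are assumptions):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation before 0.18

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 1.5])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("Training accuracy:", knn.score(X_tr, y_tr))  # 1.0: memorized the data
print("Test accuracy:    ", knn.score(X_te, y_te))  # lower: generalization gap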
Overfitting
[Figure: an example time series, and the fitted model coefficients: the $\ell_2$ penalty gives 24 nonzeros, the $\ell_1$ penalty gives 8 nonzeros (coefficient order 0-25)]
• Generalization is also related to overfitting.
• A zero coefficient for a linear operator is equivalent to discarding the corresponding feature altogether.
• The plots illustrate the model coefficients without regularization, with traditional regularization, and with sparse regularization.
• The importance of sparsity is twofold: the model can be used for feature selection, but it often also generalizes better.
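The effect of the penalty type on sparsity can be seen with Ridge ($\ell_2$) and Lasso ($\ell_1$); a sketch where only 5 of 26 features truly matter (penalty strengths and data are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 26)
w_true = np.zeros(26)
w_true[:5] = 3.0                                  # only 5 informative features
y = X.dot(w_true) + 0.1 * rng.randn(50)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    nnz = np.sum(np.abs(model.coef_) > 1e-6)
    print(type(model).__name__, "nonzero coefficients:", nnz)

Ridge keeps essentially all 26 coefficients nonzero, while Lasso drives most of the uninformative ones exactly to zero.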
Example: Classifier Design for Ovarian Cancer Detection
[Figure: a serum proteomic spectrum after a preprocessing step (feature index 0-4000)]
Conrads, Thomas P., et al., "High-resolution serum proteomic features for ovarian cancer detection," Endocrine-Related Cancer (2004).
Reading the Data
# Note: sklearn.cross_validation was renamed sklearn.model_selection in 0.18.
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(clf, X, y)  # one score per cross-validation fold
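For context, a self-contained version of the snippet above (the classifier and data are placeholders; the modern module path is used):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 4), rng.randn(50, 4) + 1])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression()
scores = cross_val_score(clf, X, y)   # one accuracy score per CV fold
print(scores, scores.mean())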
• The four most popular deep learning toolboxes are (in order of decreasing popularity):
• Caffe: http://caffe.berkeleyvision.org/ from Berkeley
• Torch: http://torch.ch/ from NYU (used by Facebook)
• pylearn2: http://deeplearning.net/ from U. Montreal
• Minerva: https://github.com/dmlc/minerva from NYU and Peking Univ.
• All except Torch can be used via a Python front end (Torch uses the Lua language).
Applications
[1] H. Huttunen et al., "Regularized logistic regression for mind reading with parallel validation," Proc. ICANN/PASCAL2 Challenge: MEG Mind-Reading, Aalto University publication series, Espoo, June 2011.
[2] H. Huttunen et al., "MEG Mind Reading: Strategies for Feature Selection," Proc. of the Federated Computer Science Event 2012, Helsinki, May 2012.
[3] H. Huttunen et al., "Mind Reading with Regularized Multinomial Logistic Regression," Machine Vision and Applications, Aug. 2013.
ICANN MEG Mind Reading Challenge Data
[Figure: two example MEG channel signals (amplitude scale ×10⁻¹¹, 200 time points each)]
Where Are the Selected Features Located?
[Figure: three histograms of selection counts vs. atom index (0-50)]
DecMeg2014 Competition
• The task was to predict whether the test person was shown a face, based on the MEG recording.
• Sequences were 1.5 seconds long.
• Approximately 600 data points from each subject.
[Figure: MEG recordings, sensor number vs. time (0.00-1.50 s); the upper panel is a trial where a face was shown]
Our Model
[Diagram: a bank of logistic regression (LR) classifiers over the sensors, whose outputs feed a random forest (RF) that makes the final decision]
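A minimal sketch of this kind of two-stage model (not the authors' exact pipeline; the data shapes and parameters are assumptions): per-channel logistic regressions produce probabilities, which a random forest combines into the final decision.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n_trials, n_channels, n_times = 200, 8, 31       # assumed shapes
X = rng.randn(n_trials, n_channels, n_times)     # stand-in MEG data
y = rng.randint(0, 2, n_trials)                  # face / no-face labels

# Stage 1: one LR per channel, trained on that channel's time series.
lrs = [LogisticRegression().fit(X[:, c, :], y) for c in range(n_channels)]
# A real pipeline would use out-of-fold predictions here to avoid overfitting.
P = np.column_stack([m.predict_proba(X[:, c, :])[:, 1]
                     for c, m in enumerate(lrs)])

# Stage 2: a random forest makes the final decision from the LR outputs.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(P, y)
print(rf.score(P, y))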
Feature Importances
[Figure: random forest feature importance (about 0.01-0.06) across the 31 time points]