
Notes on Machine Learning

Tobia Tesan
May 14, 2016

Contents

1 Introduction
  1.1 The case for Machine Learning
  1.2 Applications of Machine Learning
  1.3 Tasks in Machine Learning

2 Supervised learning
  2.1 Unsupervised Learning
  2.2 Supervised Learning
  2.3 Examples of supervised learning
  2.4 VC Dimension and Shattering

3 Concept Learning
  3.1 Concepts
    3.1.1 Partial ordering of the hypothesis space
  3.2 Find-S
  3.3 Candidate-Elimination

4 Decision Tree Learning
  4.1 The ID3 Learning Algorithm
    4.1.1 Entropy
    4.1.2 Algorithm
    4.1.3 Issues with attribute selection
  4.2 Special Cases in Decision Tree Learning
  4.3 Overfitting and Pruning
    4.3.1 Reduced Error Pruning
    4.3.2 Rule post-pruning

5 Neural Networks
  5.1 Network structure
  5.2 The perceptron
    5.2.1 The Perceptron and functions
    5.2.2 Learning in the Perceptron
    5.2.3 Gradient descent and the delta rule
  5.3 Multilayer Networks
    5.3.1 The sigmoid
    5.3.2 The Back-Propagation algorithm

6 Pipelines for Supervised Learning
  6.1 Data preprocessing
    6.1.1 Feature Mapping and Normalization
  6.2 Model selection and hyperparameter optimization
  6.3 Cross Validation and Hold-out

7 Support Vector Machines
  7.1 Structural Risk Minimization
  7.2 Support Vector Machines Themselves
    7.2.1 Linearly Separable Case
    7.2.2 Linearly inseparable case
    7.2.3 The Kernel Trick

8 Bayesian Learning
    8.0.4 Brute-Force Bayes Concept Learning
    8.0.5 Minimum Description Length Principle
    8.0.6 Bayes Optimal Classifier
    8.0.7 Gibbs classifier
  8.1 Naive Bayes

9 Clustering
  9.1 Clustering and Objective Functions
  9.2 Criteria for Clustering
  9.3 Clustering Algorithms
    9.3.1 k-Means clustering
  9.4 Hierarchical Clustering
    9.4.1 Hierarchical Agglomerative Clustering (HAC)

10 Feature Selection
  10.1 Feature Selection
  10.2 Feature Extraction

11 Recommendation Systems
  11.1 Feedback in a RS
  11.2 Approaches to RS
  11.3 Tasks in RS
    11.3.1 Evaluating Performance in Recommendation Systems

1 Introduction

1.1 The case for Machine Learning

Machine Learning turns data into knowledge when an algorithmic approach is
infeasible. This can be the case whenever [Aio15]:

- It is impossible to give a precise formalization of the problem
- Data is polluted by noise or non-determinism
- An exact solution is highly complex or inefficient
- There is a lack of machine-readable knowledge about the problem at hand

Conversely, Machine Learning is favored whenever the system is required to
[Aio15]:

- Adapt to the operating environment (automatic personalization)
- Improve its performance with respect to a particular task
- Discover regularity and new information from empirical data

1.2 Applications of Machine Learning

Among the applications of machine learning are:

- Face recognition
- Named Entity Recognition
- Document classification
- Automated opponent profiling in games
- Bioinformatics (several applications)
- Speech and handwriting recognition

1.3 Tasks in Machine Learning

- Binary classification
- Multiclass classification
- Regression
- Class and instance ranking
- Novelty detection
- Clustering
- Basket analysis
- Reinforcement learning

2 Supervised learning

Definition 2.1 (Supervised learning). Supervised learning is the machine
learning task of inferring a function f : X → Y from labeled training data
(x_i, y_i). [Mit97]

Regression and classification are examples of supervised learning. [MRT12]

Definition 2.2 (Oracle). An oracle - a human expert or nature itself -
classifies arbitrary instances x_i. It is not necessarily deterministic. It
can be characterized as the function f to approximate. [Aio15]

2.1 Unsupervised Learning

Definition 2.3 (Unsupervised learning). In unsupervised learning, only the
inputs x_i are available; the aim is to find regularities in the input.
[Alp10]

One method for density estimation is called clustering.

2.2 Supervised Learning

In supervised learning, we have a:

Definition 2.4 (Training set). A set of pairs E = {(x_i, y_i)}

Definition 2.5 (Hypothesis Space). H, from which we select h ∈ H.

The sample error of a hypothesis with respect to some sample S of instances
drawn from X is the fraction of S that it misclassifies. The true error of a
hypothesis is the probability that it will misclassify a single randomly
drawn instance from the distribution D.
The fundamental assumption about h is the following:

The inductive learning hypothesis. Any hypothesis found to approximate the
target function well over a sufficiently large set of training examples will
also approximate the target function well over other unobserved examples.
[Mit97]

Definition 2.6 (Sample error (or empirical error)). The sample error
(denoted error_S(h)) of hypothesis h with respect to target function f and
data sample S is

    error_S(h) = (1/n) Σ_{x∈S} δ(f(x), h(x))

where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1
if f(x) ≠ h(x), and 0 otherwise. [Mit97]
Definition 2.7 (True error (ideal error)). The true error (denoted
error_D(h)) of hypothesis h with respect to target function f and
distribution D is the probability that h will misclassify an instance drawn
at random according to D:

    error_D(h) = Pr_{x∼D}[f(x) ≠ h(x)]

[Mit97]

Lemma 2.1. Whenever f is non-deterministic, error_D > 0. [Aio15]


Definition 2.8 (Inductive bias). Consider a concept learning algorithm L for
the set of instances X. Let c be an arbitrary concept defined over X, and let
D_c = {⟨x, c(x)⟩} be an arbitrary set of training examples of c.
Let L(x_i, D_c) denote the classification assigned to the instance x_i by L
after training on the data D_c. The inductive bias of L is any minimal set
of assertions B such that for any target concept c and corresponding training
examples D_c

    (∀x_i ∈ X)[(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)]

Inductive bias is necessary to learning. An unbiased learner would be
unable to generalize beyond the observed examples [Mit97] - i.e., it wouldn't
be able to do anything but recognize a previously seen example.
The following contribute to the inductive bias [Aio15]:

- How examples are represented
- Model (definition of H)
- Target function (search in H)

2.3 Examples of supervised learning

Example 2.1 (Hyperplanes in R²). The instance space is made up of all points
in the plane R²:

    X = {y : y ∈ R²}

The hypothesis space consists of the dichotomies induced by every possible
hyperplane in R²:

    H = {f_(w,b)(y) : f_(w,b)(y) = sign(w · y − b), w ∈ R², b ∈ R}

Example 2.2 (Discs in R²). In this case, the instance space is again made up
of all points in the plane R²:

    Y = {y : y ∈ R²}

The hypothesis space consists of the dichotomies induced by every possible
disc centered in the origin:

    H = {f_b(y) : f_b(y) = sign(‖y‖² − b), b ∈ R}

Example 2.3 (Rectangles in R²). In this case, the instance space is again
made up of all points in the plane R²:

    Y = {y : y ∈ R²}

The hypothesis space consists of the dichotomies induced by every possible
rectangle:

    H = {f_θ(y) : f_θ(y) = (p₁ ≤ y₁ ≤ e₁) ∧ (p₂ ≤ y₂ ≤ e₂), θ = {p₁, p₂, e₁, e₂}}

2.4 VC Dimension and Shattering

The complexity of the hypothesis space can be measured by its VC-Dimension.

Definition 2.9 (Shattering). A set of instances S ⊆ X is shattered by
hypothesis space H if and only if for every dichotomy of S there exists some
hypothesis in H consistent with this dichotomy [Mit97]:

    ∀S′ ⊆ S : ∃h ∈ H s.t. ∀x ∈ S, h(x) = 1 ⇔ x ∈ S′

[Aio15]

Definition 2.10 (VC-Dimension). The maximum number of points that can be
shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H, is
denoted as VC(H), and measures the capacity of H. [Alp10]

    VC(H) = max{|S| : S ⊆ X, S is shattered by H}

and VC(H) = ∞ if arbitrarily large finite sets can be shattered. [Aio15]
Example 2.4 (Hyperplanes in R).

Upper bound on the VC-Dimension of binary classification problems. Let a
binary classification problem have:

- Training set S = {(x_i, y_i)}_{i=1,...,N}
- Hypothesis space H = {h_α(x)}
- An algorithm L : S × H → H : L(S, H) = h_{α*}(x) = argmin_{h∈H} error_S(h(x))

It is then possible to obtain an upper bound, holding with probability
P = (1 − δ) for an arbitrarily small δ, on the true error (2.7):

    error_D(h_{α*}(x)) ≤ error_S(h_{α*}(x)) + g(N, VC(H), δ)

where the term g(N, VC(H), δ) is called the VC-Confidence and is independent
of L. It is inversely proportional to |S|. For a fixed N, it is monotonically
increasing in VC(H). [Aio15]

Theorem 2.1 (Upper bound on VC Dimension for finite H).

    VC(H) ≤ log₂(|H|)

Proof. Suppose that VC(H) = d. Then H will require 2^d distinct hypotheses
to shatter d instances.
Hence, 2^d ≤ |H| and d = VC(H) ≤ log₂|H|. [Mit97]

3 Concept Learning

3.1 Concepts

We call concept learning the task of inferring a concept from training
examples of its input and output. [Mit97]

Definition 3.1 (Concept). We call a concept a boolean-valued function
defined over the instance space X [Aio15]:

    c : X → B

Definition 3.2 (Concept Example). An example of concept c over an instance
space X is a pair (x, c(x)) with x ∈ X and c : X → B [Aio15].

Definition 3.3 (Instance satisfaction). Let h : X → B. We say h satisfies
x ∈ X if h(x) = true.

Definition 3.4 (Hypothesis consistency). Let h : X → B and (x, c(x)) an
example of c. We say h is consistent with the example if h(x) = c(x). We say
h is consistent with a set D if h(x) = c(x) ∀⟨x, c(x)⟩ ∈ D.

3.1.1 Partial ordering of the hypothesis space

Definition 3.5 (Ordering on hypotheses). We say h_i is more general or
equivalent to h_j and write h_i ≥_g h_j if

    (∀x ∈ X)[(h_j(x) = 1) → (h_i(x) = 1)]

Lemma 3.1.

    ∀h_i, h_j : h_i ≥_g (h_i ∧ h_j)

Lemma 3.2. It can be that

    (h_i ≥_g (h_i ∧ h_j)) ∧ (h_j ≥_g (h_i ∧ h_j)) with h_i ≱_g h_j, h_j ≱_g h_i

i.e., h_i and h_j are incomparable.

3.2 Find-S

The Find-S algorithm illustrates one way in which the ≥_g partial ordering
can be used to organize the search for an acceptable hypothesis.
At each stage the hypothesis is the most specific hypothesis consistent with
the training examples observed up to this point (hence the name Find-S).

Soundness of Find-S. Every time h is generalized to some h′ s.t. h′ ≥_g h,
each previously observed positive example is still satisfied [Aio15]: by
definition of ≥_g,

    ∀x ∈ X : (h(x) = 1) → (h′(x) = 1)

If the target concept is in H, all negative examples are trivially satisfied
by the resulting h by virtue of it being the most specific.

Find-S
1  h ← min(H, ≥_g)  // h gets the most specific hypothesis in H
2  for each x : c(x) = true  // for each positive training instance x
3    for each attribute constraint a_i in h
4      if a_i is satisfied by x
5        do nothing
6      else a_i ← (the next more general constraint that is satisfied by x)
7  return h

Figure 3.1: Find-S algorithm [Mit97]
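
For concreteness, here is a minimal Python sketch of Find-S for conjunctive
hypotheses over attribute-value instances. The constraint encoding (None for
the maximally specific "empty" constraint, '?' for "any value") and the toy
data are our own illustrative assumptions, not part of [Mit97].

# A minimal sketch of Find-S for conjunctive hypotheses.
# None stands for the maximally specific "empty" constraint and '?'
# accepts any value; both encodings are illustrative assumptions.

def find_s(examples):
    """examples: list of (instance, label) pairs, where instance is a
    tuple of attribute values and label is True for positive examples."""
    n = len(examples[0][0])
    h = [None] * n                      # the most specific hypothesis in H
    for x, positive in examples:
        if not positive:                # Find-S ignores negative examples
            continue
        for i, a in enumerate(x):
            if h[i] is None:            # first positive example seen
                h[i] = a
            elif h[i] != a:             # minimally generalize the constraint
                h[i] = '?'
    return h

# Toy run:
D = [(('Sunny', 'Warm', 'Normal'), True),
     (('Sunny', 'Warm', 'High'), True),
     (('Rainy', 'Cold', 'High'), False)]
print(find_s(D))                        # ['Sunny', 'Warm', '?']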


Issues with Find-S
Find-S is prone to error in presence of noise, since it does not take into
account negative examples. [Mit97]

DR
AF
T

Find-S cannot tell when it has converged and if h is indeed the correct
target concept - i.e. the only h H consistent with the data.
There is no reason to favor the most specific hypotesis. [Mit97]

3.3 Candidate-Elimination

Candidate-Elimination addresses the concerns in Paragraph 3.2 by outputting
a description of the set of all hypotheses consistent with the training
examples.

Definition 3.6 (Version Space). The version space, denoted VS_{H,D} (some
authors prefer Tr to denote the training set), with respect to hypothesis
space H and training examples D, is the subset of hypotheses from H
consistent with the training examples in D [Mit97]:

    VS_{H,D} ≡ {h ∈ H : Consistent(h, D)}

This set is characterized as the set of all hypotheses that have not been
eliminated as a result of being in conflict with observed data.
The idea behind Candidate-Elimination is to remove from H the over-specific
hypotheses using the positive examples and the under-specific hypotheses
using the negative examples [Aio15].

Remarks on Candidate-Elimination and VS_{H,D}. |VS_{H,D}| can be infinite
(if H is), but S and G can have a finite representation.

Lemma 3.3. |VS_{H,D}| tends to decrease in general with |D|, as a
consequence of additional constraints. [Aio15]

Lemma 3.4. Assuming c ∈ H, the smaller |VS_{H,D}|, the higher the
probability that h = c for a randomly selected h ∈ VS_{H,D}.

Candidate-Elimination
1   G ← set of the most general hypotheses
2   S ← set of the most specific hypotheses
3   for each d = ⟨x, c(x)⟩ ∈ D
4     if c(x) = 1  // positive example
5       G ← G \ {h : ¬Consistent(h, d)}
6       for each s ∈ S : ¬Consistent(s, d)
7         S ← S \ {s}
8         S ← S ∪ {all minimal generalizations h of s s.t.
9           Consistent(h, d) ∧ ∃g ∈ G : g ≥_g h}
10        S ← S \ {s : ∃s′ ∈ S : s ≥_g s′, s ≠ s′}
11
12    else  // negative example
13      S ← S \ {h : ¬Consistent(h, d)}
14      for each g ∈ G : ¬Consistent(g, d)
15        G ← G \ {g}
16        G ← G ∪ {all minimal specializations h of g s.t.
17          Consistent(h, d) ∧ ∃s ∈ S : h ≥_g s}
18        G ← G \ {g : ∃g′ ∈ G : g′ ≥_g g, g′ ≠ g}
19
20  return G, S

Figure 3.2: Candidate-Elimination algorithm [Aio15]

[Decision tree: the root tests Outlook. Outlook = Sunny leads to a test on
Humidity (High → No, Normal → Yes); Outlook = Overcast leads directly to
Yes; Outlook = Rain leads to a test on Wind (Strong → No, Weak → Yes).]

Figure 4.1: Decision tree, from [Mit97]

4 Decision Tree Learning

The case for decision trees. Decision trees are often used in practice.
They are generally very efficient and are especially suited for:

1. Instances of the form ⟨attribute, value⟩
2. Target functions with discrete (≥ 2) values
3. Concepts represented by boolean disjunctions
4. Training sets polluted with noise or missing values [Aio15]

Notice how 1. and 2. imply that the functions that can be learned by decision
trees are a superset of the boolean concepts of Section 3.
Decision trees have the added benefit of being easily human-readable [Mit97].
A decision tree looks like Figure 4.1. Every internal node represents a test
on an attribute; every outgoing edge represents one of the possible values
for the attribute; every leaf represents a final classification.

Boolean functions as decision trees

Lemma 4.1. Every boolean function can be represented as a decision tree.
Every path from the root to a leaf encodes a conjunction of attribute values;
n paths to the same classification encode a disjunction of n conjunctions.
[Aio15]

Example 4.1. The decision tree in Figure 4.1 represents

    (Outlook = Sunny ∧ Humidity = Normal)
    ∨ (Outlook = Overcast)
    ∨ (Outlook = Rain ∧ Wind = Weak)

4.1 The ID3 Learning Algorithm

4.1.1 Entropy

ID3 depends on the notions of entropy and information gain.

Definition 4.1 (Entropy). The entropy of a set S is defined as:

    Entropy(S) ≡ − Σ_{i=1}^{c} p_i log₂ p_i

or, in the two-class case,

    Entropy(S) ≡ −p⊕ log₂(p⊕) − p⊖ log₂(p⊖)

where p_i is the proportion of S belonging to class i.

Lemma 4.2. If the target attribute can take on c possible values,

    Entropy(S) ≤ log₂ c

Entropy can be used to measure the impurity in a collection of training
examples [Aio15]; it is possible to use it to define a measure of the
effectiveness of an attribute in classifying the training data:

Definition 4.2 (Information Gain). The information gain of an attribute A
relative to a collection of examples S is defined as:

    Gain(S, A) ≡ Entropy(S) − Σ_{v∈Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A and S_v is
the subset of S for which attribute A has value v [Mit97].
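
As a quick illustration, Definitions 4.1 and 4.2 translate directly into
Python (the function and variable names are ours):

# A small sketch of Entropy(S) and Gain(S, A) as defined above.
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = - sum_i p_i log2 p_i over the class proportions
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attr):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    n = len(labels)
    g = entropy(labels)
    for v in set(x[attr] for x in examples):
        sub = [y for x, y in zip(examples, labels) if x[attr] == v]
        g -= (len(sub) / n) * entropy(sub)
    return g

# Sanity check: a perfectly balanced binary sample has entropy 1.
print(entropy(['+', '+', '-', '-']))    # 1.0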
4.1.2 Algorithm

The algorithm is presented in Figure 4.2.


4.1.3 Issues with attribute selection

There is a natural bias in the information gain measure that favors
attributes with many values over those with few values. [Mit97]
As an extreme example, date of birth could sometimes be a perfect predictor
of the final grade on an exam. Thus, it would be selected as the decision
attribute for the root node of the tree and lead to a broad tree of depth one
- it has a very high information gain relative to the training examples,
despite being a very poor predictor of the target function over unseen
instances. [Mit97]

Gain Ratio. One alternative measure that has been used successfully is Gain
Ratio, which penalizes attributes such as date by incorporating a term,
called split information, that is sensitive to how broadly and uniformly the
attribute splits the data [Mit97]:

ID3(D, A)
1   Create Root
2   if ∀d ∈ D : f(d) = ⊕
3     return (Root, ⊕)
4   if ∀d ∈ D : f(d) = ⊖
5     return (Root, ⊖)
6   if A = ∅
7     return (Root, most common label in D)
8   A_split ← A_i ∈ A s.t. A_i = arg max Gain(D, A_i)
9   // Split on the attribute yielding maximum information gain
10  DecisionAttribute(Root) ← A_split
11  for each value v_j of A_split
12    Add a new tree branch below Root, corresponding to A_split = v_j
13    D_{v_j} ← {d ∈ D : A_split(d) = v_j}  // the subset of examples that have value v_j for A_split
14    if D_{v_j} = ∅
15      Add a leaf node with label = most common value of the target attribute in D
16    else
17      Add the subtree ID3(D_{v_j}, A \ {A_split})
18  return Root

Figure 4.2: ID3 Algorithm
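
The pseudocode above condenses into a recursive Python sketch, reusing
gain() from Section 4.1.1. Instances are dicts mapping attribute names to
values, and the nested-dict tree encoding is our own illustrative choice;
note that, iterating only over observed values, the empty-D_{v_j} branch
never arises here.

# A compact recursive sketch of ID3, reusing gain() from above.
# Instances are dicts attribute -> value; the returned tree is encoded
# as nested dicts {(attribute, value): subtree-or-leaf-label}.
from collections import Counter

def id3(examples, labels, attrs):
    if len(set(labels)) == 1:           # all examples share one label
        return labels[0]
    if not attrs:                       # no attributes left to split on
        return Counter(labels).most_common(1)[0][0]
    a = max(attrs, key=lambda b: gain(examples, labels, b))
    tree = {}
    for v in set(x[a] for x in examples):
        idx = [i for i, x in enumerate(examples) if x[a] == v]
        tree[(a, v)] = id3([examples[i] for i in idx],
                           [labels[i] for i in idx],
                           [b for b in attrs if b != a])
    return tree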

Definition 4.3 (Split Information).

    SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|S_i| / |S|) log₂ (|S_i| / |S|)

Gain Ratio is thus defined as:

Definition 4.4 (Gain Ratio).

    GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

Note that SplitInformation is actually the entropy of S with respect to the
values of attribute A.
Practical application of Gain Ratio. The problem opposite to the one in
4.1.3 can arise through application of Gain Ratio: significant attributes
with a slightly higher number of values can be neglected.
The strategy used in practice works as follows [Aio15]:

1. Compute Gain_{A_i} for each attribute A_i
2. Compute the average E_{A′∈A}(Gain_{A′})
3. Select only those attributes A_i s.t. Gain_{A_i} > E_{A′∈A}(Gain_{A′})
4. Among these, select A_i such that A_i = arg max GainRatio(S, A_i)

Alternative metrics. Among alternative metrics are:

- Variance impurity:  p⊕ p⊖
- Weighted (Gini) impurity:  Σ_{c,c′ ∈ Classes, c ≠ c′} λ_{c,c′} p_c p_{c′}
- Misclassification impurity:  1 − max_{c∈Classes} p_c

4.2 Special Cases in Decision Tree Learning

Continuous-valued attributes. It is possible to handle continuous-valued
attributes by generating a dynamic boolean attribute A_c s.t.:

    A_c = true if A < c, false otherwise

Selection of c. It is possible to choose the value of c that yields maximum
Gain: it has been shown that such a c is the midpoint between two values of A
from two instances with a different target. [Aio15]

Costs. It is possible to take account of costs by defining metrics
accordingly.

Missing values. In practical applications, some examples can be missing an
attribute. Among the strategies for coping are:

- Using the most frequent value of A in D
- Using the most frequent value of A in D among examples with the same target
- A probability-based approach

4.3 Overfitting and Pruning

Definition 4.5 (Overfitting). Given a hypothesis space H, a hypothesis h ∈ H
is said to overfit the training data if there exists some alternative
hypothesis h′ ∈ H such that h has smaller error than h′ over the training
examples, but h′ has a smaller error than h over the entire distribution of
instances.

Overfitting the training data is an important issue in decision tree
learning. Because the training examples are only a sample of all possible
instances, it is possible to add branches to the tree that improve
performance on the training examples while decreasing performance on other
instances outside this set.
Methods for post-pruning the decision tree are therefore important to avoid
overfitting in decision tree learning (and other inductive inference methods
that employ a preference bias) [Mit97].

Avoiding overfitting in decision tree learning. There are two main classes
of approaches to avoid overfitting:

1. Approaches that stop growing the tree earlier, before it reaches the
   point where it perfectly classifies the training data
2. Approaches that allow the tree to overfit the data, and then post-prune
   the tree.

The second class of approaches has been found more successful in practice
[Mit97].
In both approaches it is necessary to decide the optimal tree height. Among
the approaches that can be used for this purpose are:

- The training and validation set approach: using a separate set of examples
  to evaluate the utility of post-pruning.
- Applying a statistical test to estimate whether expanding or pruning a
  particular node is likely to produce an improvement.
- The minimum description length principle, which uses an explicit measure
  of complexity.

4.3.1 Reduced Error Pruning

One approach to the actual pruning is Reduced Error Pruning.


Reduced-Error-Pruning(T, D)
1  (Trs, Vs) ← D  // Divide the data set in a training and a validation set
2  T′, T″ ← T
3  while perf_Vs(T″) ≥ perf_Vs(T′)  // While performance on Vs does not decrease
4    T′ ← T″
5    for each non-leaf node n
6      perf[n] ← perf_Vs(Prune(T′, n))
7      // try pruning each non-leaf node and measure
8      // the resulting tree's performance on the validation set
9    T″ ← Prune(T′, n) s.t. n = arg max perf[n]
10   // Prune the node resulting in the highest gain
11 return T″

Prune(T, n)
1  return the tree where the sub-tree with root n is replaced by
   a leaf with the most common label in the example set for n.
4.3.2 Rule post-pruning

The fundamental idea in rule post-pruning is turning a decision tree into a
rule set, then pruning the rules and using the (pruned) rule set for
classification [Aio15].
Usually rule post-pruning yields better performance than both the starting
tree and Reduced-Error-Pruning.
One additional benefit is that the resulting rule set is easily
human-readable.
The procedure is as follows [Aio15]:

1. For every path(r, l_i) from root r to leaf l_i a rule R_i is generated, of
   the form

       if (A_{i0} = v_{i0}) ∧ (A_{i1} = v_{i1}) ∧ ... ∧ (A_{ik} = v_{ik}) then label_i

2. Then, for every individual R_i, pruning is done by:
   - Estimating the performance yielded by R_i alone
   - Removing preconditions (A_{ij} = v_{ij}) that lead to an increase in
     performance
3. Finally, the pruned R_i are sorted by their performance in decreasing
   order.

When a new instance is to be classified, the sorted rules {R_0 ... R_k} are
applied in decreasing order. The first matching rule is used to classify the
instance. If there is no match, the most common class in the training set is
used.

5 Neural Networks

That of artificial neural networks - or ANNs - is an inherently
interdisciplinary field, with contributions coming from biology, computer
science, mathematics, statistics, economics, physics. There are two main
reasons for studying ANNs:

1. Reproducing a model of the human (or animal) brain in order to understand
   it.
2. Understanding the fundamental principles behind its functions and using
   them in practical applications (think building airworthy planes as
   opposed to building models of avian organisms).

Machine learning is interested in the latter viewpoint. There are several
ANN-based models used in Machine Learning, with applications in supervised
learning, unsupervised learning and associative memories.
These models mainly differ in:

- Network topology
- Function computed by the individual neuron
- Learning algorithm
- Treatment of training data

The case for ANNs. Artificial neural networks are especially suited for
problems where:

1. Input is discrete or continuous-valued
2. Output is discrete (classification) or continuous (regression)
3. Output is vectorial
4. Data is prone to noise
5. The form of the target function is unknown
6. The final solution does not need to be in a human-readable form - i.e.
   so-called black-box problems. [Spe08]

Therefore, ANNs are found in the fields of:

- Speech recognition
- Image recognition
- Predictions in finance
- Industrial process control

The perspective on the human brain. The human brain is made up of around
10^10 strongly interconnected neurons, each having a response time of around
10^-3 seconds.
Since a human takes approximately 0.1 seconds to recognize and interpret a
picture, it must be the case that the brain works in a massively parallel
fashion.

[TODO: Missing figure]

Figure 5.1: Structure of a feed-forward neural network

[TODO: Missing figure]

Figure 5.2: Structure of a Perceptron

5.1 Network structure

An artificial neural network is a system of interconnected units (neurons)
that is able to compute non-linear functions.
In an artificial neural network the individual units are divided into input
units, output units and hidden units.
Hidden units (if any) represent the inner variables that encode, as a result
of the training process, the correlation between input values and output
values [Spe08].
Feed-forward networks are directed, acyclic graphs of such units (as opposed
to recurrent networks, which admit cycles).

5.2 The perceptron

The most basic example of an artificial neuron is the perceptron. A
perceptron has continuous inputs x_1 ... x_n, calculates a linear combination
of these inputs and then outputs 1 if the result is greater than some
threshold [Mit97]:

    o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0
                      -1 otherwise

or

    o(x) = σ(w · x)

where σ = sgn.
Learning a perceptron involves choosing values for w. The space H of
hypotheses is therefore the set of all possible real-valued weight vectors:

    H = {w : w ∈ R^(n+1)}

The perceptron can be seen as a hyperplane decision surface that outputs 1
for instances lying on one side of the hyperplane and -1 for instances lying
on the other side:

    H = {f_(w,b)(y) : f_(w,b)(y) = sgn(w · y + b), w, y ∈ R^n, b ∈ R}

or

    H = {f_w′(y′) : f_w′(y′) = sgn(w′ · y′), w′, y′ ∈ R^(n+1)}

with

    w′ = [b, w]^T,  y′ = [1, y]^T

5.2.1 The Perceptron and functions

The (single) perceptron can represent the OR, AND and NOT Boolean operators.
In fact it can represent any m-of-n function. However:

Lemma 5.1. A non-linearly separable function cannot be represented by an
individual perceptron.

One such function is the XOR function.

Definition 5.1 (Linearly separable). A function

    f : R^n → {0, 1}

is said to be linearly separable if there is a hyperplane in R^n that
separates the combinations x s.t. f(x) = 1 from those s.t. f(x) = 0. [Spe08]
5.2.2 Learning in the Perceptron

A naive learning algorithm is presented in Figure 5.3.
However, this algorithm has stability issues. Customarily,

    w ← w + (t − out) x

is replaced by

    w ← w + η (t − out) x

with η > 0 and preferably η < 1.

Learn(D = {(x, t)}) with t ∈ {0, 1}
1  w ← 0
2  repeat
3    (x, t) ← randomly extracted example ∈ D
4    out ← sgn(w · x)
5    w ← w + η (t − out) x
6  until out = t

Figure 5.3: (Naive) learning for a single perceptron
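
A runnable Python sketch of the rule in Figure 5.3, with two simplifications
of ours: labels are in {-1, +1} and we loop for a fixed number of random
draws instead of the until-clause:

# A sketch of perceptron learning with the eta-scaled update rule.
# Labels t in {-1, +1} and a fixed number of iterations are our
# simplifications of the pseudocode above.
import random

def sgn(z):
    return 1 if z > 0 else -1

def perceptron(D, eta=0.1, iters=200):
    n = len(D[0][0])
    w = [0.0] * (n + 1)                     # w[0] is the bias weight w_0
    for _ in range(iters):
        x, t = random.choice(D)             # randomly extracted example
        xa = [1.0] + list(x)                # augmented input [1, x]
        out = sgn(sum(wi * xi for wi, xi in zip(w, xa)))
        w = [wi + eta * (t - out) * xi for wi, xi in zip(w, xa)]
    return w

# Learn the (linearly separable) AND function:
D = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
w = perceptron(D)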


5.2.3 Gradient descent and the delta rule

Although the perceptron rule finds a successful weight vector when the
training examples are linearly separable, it can fail to converge if the
examples are not linearly separable.
A second training rule, called the delta rule, is designed to overcome this
difficulty. If the training examples are not linearly separable, the delta
rule converges toward a best-fit approximation to the target concept.
The key idea behind the delta rule is to use gradient descent to search the
hypothesis space of possible weight vectors to find the weights that best
fit the training examples. [Mit97]
Consider a modified continuous perceptron [Spe08] such that:

    out(x) = Σ_{i=1}^{n} w_i x_i = w · x

Let us now define a measure of the error attributed to a specific weight
vector:

    E[w] = (1/2) Σ_{(x_i, t_i) ∈ D} (t_i − out(x_i))²

i.e. half the sum of squared deviations [Spe08].
We want to minimize E[w].
Now, the function E[w] yields the following gradient:

    ∇E[w] ≡ [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]

    Δw = −η ∇E[w]

    Δw_i = −η ∂E/∂w_i

Since the gradient specifies the direction of steepest increase of E, the
training rule for gradient descent is

    w ← w + Δw

Gradient-Descent(D, η)
1  w_i ← random  // let w_i assume random, small values
2  while E[w] > ε
3    Δw_i ← 0
4    for each (x, t) ∈ D
5      out_x ← output(x)
6      for each w_i
7        Δw_i ← Δw_i + η (t − out_x) x_i
8    for each w_i
9      w_i ← w_i + Δw_i

Figure 5.4: Gradient-Descent [Spe08][Mit97]

where

    Δw = −η ∇E[w]

i.e.

    Δw = η Σ_{d∈D} (t_d − o_d) x_d

Because the error surface contains only a single global minimum, this
algorithm will converge to a weight vector with minimum error, regardless of
whether the training examples are linearly separable, given that a
sufficiently small learning rate η is used. If η is too large, the gradient
descent search runs the risk of overstepping the minimum in the error
surface rather than settling into it. For this reason, one common
modification to the algorithm is to gradually reduce the value of η as the
number of gradient descent steps grows. [Mit97]
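
Figure 5.4 translates almost line by line into Python; the toy data set,
stopping threshold and learning rate below are our own choices:

# Batch gradient descent with the delta rule, for a linear unit
# out(x) = w . x.

def delta_rule(D, eta=0.05, eps=1e-4, max_iters=10000):
    n = len(D[0][0])
    w = [0.0] * n
    for _ in range(max_iters):
        out = [sum(wi * xi for wi, xi in zip(w, x)) for x, _ in D]
        error = 0.5 * sum((t - o) ** 2 for (_, t), o in zip(D, out))
        if error < eps:                       # E[w] small enough: stop
            break
        delta = [0.0] * n
        for (x, t), o in zip(D, out):
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]   # accumulate over all of D
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# Toy target: t = 2*x1 - x2.
D = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
print(delta_rule(D))                          # approx [2.0, -1.0]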

5.3 Multilayer Networks

5.3.1 The sigmoid

Multiple layers of cascaded linear units still produce only linear
functions, and we prefer networks capable of representing highly nonlinear
functions [Mit97]. A unit is needed such that:

1. Its output is a nonlinear function of its inputs
2. Its output is differentiable

The sigmoid unit is one such unit. Its output is (again):

    out = σ(w · x)

where, this time:

    σ(y) = 1 / (1 + e^(−y))

The sigmoid function has an easily expressed derivative:

[TODO: Missing figure]

Figure 5.5: Structure of a sigmoid artificial neuron

    dσ(y)/dy = σ(y) (1 − σ(y))
5.3.2 The Back-Propagation algorithm

The Backpropagation algorithm learns the weights for a multilayer network,
given a network with a fixed set of units and interconnections. It employs
gradient descent to attempt to minimize the squared error between the
network output values and the target values for these outputs [Mit97].
Since we are considering a network of neurons with multiple output units, we
redefine E to sum the errors over all of the outputs:

    E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²

Local minima. The new E(w) allows for local minima, so there is no
guarantee of convergence to a global minimum.

Lemma 5.2. The hypothesis space of an Artificial Neural Network is given by
its topology.

Back-Propagation
1  w_i ← random  // let w_i assume random, small values
2  while E[w] > ε
3    for each (x, t) ∈ D
4      out_x ← output(x)
5      for each output unit k
6        δ_k ← o_k (1 − o_k)(t_k − o_k)
7      for each hidden unit j
8        δ_j ← o_j (1 − o_j) Σ_{k∈outputs} w_{k,j} δ_k
9      for each w_{s,q}
10       w_{s,q} ← w_{s,q} + Δw_{s,q}

where

    Δw_{s,q} = η δ_s x_q  if s is a hidden unit
    Δw_{s,q} = η δ_s y_q  if s is an output unit

Figure 5.6: Back-Propagation
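
A compact numpy sketch of the loop in Figure 5.6 for a single-hidden-layer
sigmoid network trained on XOR; the network shape, bias handling and all
constants are our own illustrative choices. Consistently with the remark on
local minima, an unlucky run may occasionally get stuck.

# One-hidden-layer sigmoid network trained with the delta terms above:
# delta_k = o_k(1-o_k)(t_k-o_k) at the output, back-propagated to the
# hidden layer. Shapes, data and learning rate are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(eta=0.5, epochs=20000, n_hidden=3):
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets
    Xb = np.hstack([X, np.ones((4, 1))])              # constant-1 bias input
    W1 = rng.normal(0, 0.5, (3, n_hidden))            # input  -> hidden
    W2 = rng.normal(0, 0.5, (n_hidden + 1, 1))        # hidden -> output
    for _ in range(epochs):
        H = sigmoid(Xb @ W1)                          # hidden activations
        Hb = np.hstack([H, np.ones((4, 1))])          # bias for output layer
        O = sigmoid(Hb @ W2)                          # network outputs
        delta_out = O * (1 - O) * (T - O)             # output deltas
        delta_hid = H * (1 - H) * (delta_out @ W2[:-1].T)  # hidden deltas
        W2 += eta * Hb.T @ delta_out                  # weight updates
        W1 += eta * Xb.T @ delta_hid
    return W1, W2

W1, W2 = train_xor()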

6 Pipelines for Supervised Learning

Learning does not happen in a vacuum - it is instead carried out on
real-world data. Therefore, it typically happens in a pipeline with
additional steps:

1. First, we thoroughly analyze the learning problem at hand
2. Secondly, we gather, analyze and preprocess data
3. We study correlations between variables
4. We do feature selection/weighting/normalization
5. We choose a predictor and a model

In short, we usually seek to build a pipeline with additional steps which
guarantee maximum effectiveness.

6.1 Data preprocessing

Among the data types we deal with are:

- Vectors
- Strings
- Sets and bags (which include frequency as well)
- Multidimensional arrays
- Graphs; trees

We also deal with composite structures. At a higher level, we have features.
6.1.1 Feature Mapping and Normalization

Taxonomy of features. Features may be subdivided as follows; each can be
mapped differently.

- Categorical
  - Nominal (unordered): e.g. a car's color or brand
  - Ordinal: e.g. military rank
- Quantitative
  - Intervals: e.g. a movie rating, 1 to 5 stars
  - Continuous: e.g. weight

Mapping categorical features. Categorical features are best mapped to a
vector with as many components as the possible values.

Example 6.1. Let

    Brand = c_{1..3} = {c_1 = FIAT, c_2 = Toyota, c_3 = Ford}
    Color = c_{4..6} = {c_4 = White, c_5 = Black, c_6 = Red}
    Configuration = c_{7..8} = {c_7 = Hatchback, c_8 = Sports Car}

Then:

    (Toyota, Red, Hatchback) ↦ [0, 1, 0, 0, 0, 1, 1, 0]
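
This "one-hot" mapping is easy to implement directly; a small sketch using
the category lists of Example 6.1:

# One-hot mapping of categorical features, as in Example 6.1.
DOMAINS = [('FIAT', 'Toyota', 'Ford'),
           ('White', 'Black', 'Red'),
           ('Hatchback', 'Sports Car')]

def one_hot(instance):
    vec = []
    for value, domain in zip(instance, DOMAINS):
        vec += [1 if value == v else 0 for v in domain]
    return vec

print(one_hot(('Toyota', 'Red', 'Hatchback')))
# [0, 1, 0, 0, 0, 1, 1, 0]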

Mapping continuous features. Continuous features are harder than
categorical ones. Typically, transforms are applied to features to make them
easier to compare.
The two main transforms are feature centering and feature standardization or
feature rescaling [Aio15].

Feature centering.

    f(x) = x − x̄

Feature normalization. Feature standardization makes the values of each
feature in the data have zero mean (by subtracting the mean in the
numerator) and unit variance. This method is widely used for normalization
in many machine learning algorithms (e.g., SVMs, Logistic Regression, Neural
Networks).
The general method of calculation is to determine the distribution mean and
standard deviation for each feature. Next we subtract the mean from each
feature. Then we divide the values (mean is already subtracted) of each
feature by its standard deviation:

    x′ = (x − x̄) / σ
Feature rescaling. Feature scaling is a method used to standardize the
range of independent variables or features of data.
The simplest method is rescaling the range of features to [0, 1] or [−1, 1].
Selecting the target range depends on the nature of the data. The general
formula is given as:

    x′ = (x − min(x)) / (max(x) − min(x))

where x is an original value and x′ is the normalized value. For example,
suppose that we have the students' weight data, and the students' weights
span [160 pounds, 200 pounds].
To rescale this data, we first subtract 160 from each student's weight and
divide the result by 40 (the difference between the maximum and minimum
weights).
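
The three transforms side by side, as a small Python sketch on the weight
example above:

# Centering, standardization and rescaling of a single feature.
from statistics import mean, pstdev

weights = [160, 170, 180, 200]                 # students' weights (pounds)

centered = [x - mean(weights) for x in weights]
standardized = [(x - mean(weights)) / pstdev(weights) for x in weights]
rescaled = [(x - min(weights)) / (max(weights) - min(weights))
            for x in weights]

print(rescaled)                                # [0.0, 0.25, 0.5, 1.0]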

6.2 Model selection and hyperparameter optimization

Model selection is the phase when the best values for hyperparameters are
selected for the task at hand.
Hyperparameter optimization contrasts with actual learning problems, which
are also often cast as optimization problems, but optimize a loss function
on the training set alone.²
In effect, learning algorithms learn parameters that model/reconstruct their
inputs well, while hyperparameter optimization is meant to ensure the model
does not overfit its data by tuning, e.g., regularization.

Bias and Variance.³ In statistics and machine learning, the bias-variance
tradeoff (or dilemma) is the problem of simultaneously minimizing two
sources of error that prevent supervised learning algorithms from
generalizing beyond their training set:

- The bias is error from erroneous assumptions in the learning algorithm.
  High bias can cause an algorithm to miss the relevant relations between
  features and target outputs (underfitting).
- The variance is error from sensitivity to small fluctuations in the
  training set. High variance can cause overfitting: modeling the random
  noise in the training data, rather than the intended outputs.

Definition 6.1 (Bias). Suppose we have a statistical model, parameterized by
a real number θ, giving rise to a probability distribution for observed
data, P_θ(x) = P(x | θ), and a statistic θ̂ which serves as an estimator of θ
based on any observed data x. That is, we assume that our data follows some
unknown distribution P_θ(x) = P(x | θ) (where θ is a fixed constant that is
part of this distribution, but is unknown), and then we construct some
estimator θ̂ that maps observed data to values that we hope are close to θ.
Then the bias of this estimator (relative to the parameter θ) is defined to
be:

    Bias_θ[θ̂] = E_{x|θ}[θ̂] − θ = E_{x|θ}[θ̂ − θ]

where E_{x|θ} denotes expected value over the distribution P_θ(x) = P(x | θ),
i.e. averaging over all possible observations x. The second equation follows
since θ is measurable with respect to the conditional distribution P(x | θ).
An estimator is said to be unbiased if its bias is equal to zero for all
values of parameter θ.

6.3 Cross Validation and Hold-out

Most of the time, the learner is parametric. These parameters should be
optimized by testing which values of the parameters yield the best
effectiveness.
It is possible to show that the evaluation performed in the hold-out
procedure below gives an unbiased estimate of the error performed by a
classifier learnt with the same parameters and with a training set of
cardinality |Tr| − |Va| < |Tr|.

² From http://web.archive.org/web/20160113235232/https://en.wikipedia.org/wiki/Hyperparameter_optimization
³ From http://web.archive.org/web/20160116105049/https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

Hold-Out
1  Let Va ⊂ Tr
2  for all p in hyperparameter range
3    l_p ← Learn(Tr \ Va)
4    Measure the performance of l_p on Va

K-fold Cross Validation. An alternative approach to model selection (and
evaluation) is the K-fold cross-validation method: k different classifiers
h_1, ..., h_k are built by partitioning the initial corpus Tr into k
disjoint sets Va_1, ..., Va_k and then iteratively applying the hold-out
approach on the k pairs ((Tr_i = Tr \ Va_i), Va_i).
Effectiveness is then obtained by individually computing the effectiveness
of each h_i as in hold-out and then computing the average
h̄ = (1/k) Σ_{i=1}^{k} h_i.

Leave-one-out. The special case k = |Tr| of k-fold cross validation is
called leave-one-out cross validation.

Variation of k. When k is larger, larger training sets (and smaller
validation sets) are yielded; consequently, bias is smaller and variance is
higher. When k is smaller, greater bias and smaller variance are had.
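
A minimal sketch of the k-fold loop; learn() and evaluate() are stand-in
names for an arbitrary learner and performance measure (ours, not from the
text):

# A minimal k-fold cross-validation loop.

def k_fold_cv(Tr, k, learn, evaluate):
    folds = [Tr[i::k] for i in range(k)]          # k disjoint validation sets
    scores = []
    for i in range(k):
        Va_i = folds[i]
        Tr_i = [x for j, f in enumerate(folds) if j != i for x in f]
        h_i = learn(Tr_i)                         # hold-out on (Tr_i, Va_i)
        scores.append(evaluate(h_i, Va_i))
    return sum(scores) / k                        # average effectiveness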


7 Support Vector Machines

7.1 Structural Risk Minimization

The idea behind Structural Risk Minimization is to find solutions that both
minimize the empirical risk (or empirical error) and have low VC dimension.
[Beb16]

Empirical error. We recall:

    R_emp = (1/n) Σ_{i=1}^{n} [z_i − g(x_i)]²

VC Dimension and Capacity. To guarantee good generalization performance,
the capacity (i.e., complexity) of the learned functions must be controlled.
Functions with high capacity are more complicated (i.e., have many degrees
of freedom). [Beb16]
It can be shown that, with probability (1 − δ):

    err_true ≤ err_train + √( [VC (log(2n/VC) + 1) − log(δ/4)] / n )

Vapnik has shown that maximizing the margin of separation (i.e., the empty
space between classes) is equivalent to minimizing the VC dimension.
The optimal hyperplane is the one giving the largest margin of separation
between the classes.
The margin is defined by the distance of the nearest training samples from
the hyperplane. We refer to these samples as support vectors. Intuitively
speaking, these are the most difficult samples to classify. [Beb16]

7.2 Support Vector Machines Themselves

SVMs are primarily two-class classifiers, but can be extended to multiple
classes.
The SVM performs structural risk minimization to achieve good generalization
performance. The optimization criterion is the margin of separation between
classes. Training is equivalent to solving a quadratic programming problem
with linear constraints. [Beb16]
There are two broad cases of SVMs: the simpler, linearly separable one and
the linearly inseparable one.
7.2.1 Linearly Separable Case

Consider a hyperplane dividing a space into two half-spaces Ω₁ and Ω₂:

    w · x + b = 0

Consider the function

    g(x) = w · x + b

and a labelling

    y_i = +1 if g(x_i) > 0
    y_i = −1 otherwise

i.e. y_i = 1 if x_i belongs to the half-space Ω₁, −1 if x_i belongs to Ω₂
[Beb16].
The distance of some x from such a plane is derived from g(x) = w · x + b:
we can express x as

    x = x_p + r (w / ‖w‖)

where x_p is the normal projection of x on the hyperplane and r is the
algebraic distance. Therefore:

    r = g(x) / ‖w‖ = (w · x + b) / ‖w‖

The hyperplane does not change when its normal vector is scaled. Therefore,
to constrain the length of w for uniqueness, we impose:

    r ‖w‖ = 1

For the optimal hyperplane, the margin to one of the closest positive
examples is equal to that to one of the closest negative examples, so the
margin is defined as

    ρ = 2 / ‖w‖

Also, if the examples are linearly separable with some margin r we have:

    y_i g(x_i) / ‖w‖ ≥ r    i = 1, . . . , n

Finding the optimal hyperplane means maximizing the margin, which equates to
minimizing ‖w‖.
Remember that for linearly separable examples the empirical (sample) error
is 0 for all separating hyperplanes. In an SRM perspective, to minimize the
bound on the ideal error we thus pick the hyperplane with the smallest
VC-Dimension. [Spe08]

Theorem 7.1. Let R be the diameter of the smallest sphere containing all
training examples. The VC-dimension h of the set of optimal hyperplanes
described by w · x + b = 0 is bounded:

    h ≤ min(⌈R² / ρ²⌉, m) + 1

where m is the dimensionality of the training data.

Therefore we want to minimize the c in

    {w · x + b : ‖w‖² ≤ c}

Quadratic formulation. With n linearly separable examples {(x_i, y_i)}_1^n,
finding the optimal hyperplane is thus equivalent to solving the following
quadratic optimization problem:

    min_{w,b} (1/2) ‖w‖²

with

    ∀i ∈ {1, . . . , n} : y_i (w · x_i + b) − 1 ≥ 0

Using the Kuhn-Tucker theorem it can be shown that this is equivalent to

    max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} y_i y_j α_i α_j x_i · x_j

with

    ∀i ∈ {1, . . . , n} : α_i ≥ 0,    Σ_{i=1}^{n} y_i α_i = 0

By Kuhn-Tucker the optimum is such that:

    α*_i [y_i (w* · x_i + b*) − 1] = 0    i = 1, . . . , n

Those vectors x_s such that

    α*_s > 0

are called support vectors.
We observe now that for any support vector x_s it must be the case that

    y_s (w* · x_s + b*) = 1

and for any positive example (y_s = 1):

    b* = 1 − w* · x_s
7.2.2 Linearly inseparable case

In the linearly inseparable case, we are forced to allow some constraints to
be broken. [Spe08]

A naive approach. We introduce slack variables ξ_i ≥ 0, i = 1, . . . , n,
one for each constraint, and we rewrite:

    y_i (w · x_i + b) ≥ 1 − ξ_i

We then have to modify the cost function to account for the slack variables:

    (1/2) ‖w‖² + C Σ_{i=1}^{n} ξ_i

C > 0 controls the tradeoff between the complexity of the hypothesis space
and the number of linearly inseparable examples.
The dual problem is

    max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} y_i y_j α_i α_j x_i · x_j

with

    ∀i ∈ {1, . . . , n} : 0 ≤ α_i ≤ C,    Σ_{i=1}^{n} y_i α_i = 0

Mapping to Higher Dimensions. The naive solution does not usually work
well. [Spe08]
A hyperplane is still a hyperplane, and it can only impose a dichotomy on
the instance space.
We will instead use a two-step strategy as follows:

1. We map the input space to a feature space with much higher dimension
2. We naively compute the optimal hyperplane (as seen in 7.2.1) in the
   feature space

Soundness of the two-step strategy. Step 2 is trivially justified by the
fact that the optimal hyperplane minimizes the VC-dimension and thus
improves generalization.

Theorem 7.2 (Cover's Theorem). A complex pattern-classification problem,
cast in a high-dimensional space nonlinearly, is more likely to be linearly
separable than in a low-dimensional space, provided that the space is not
densely populated. [?]

By Cover's theorem, Step 1 is justified.
In particular, Step 1 equates to considering a non-linear transformation
Φ(·) : R^m → X^M for the original {(x_i, y_i)}_i^n, where M >> m.

The strategy in practice. We can assume that every new coordinate in the
feature space X^M is generated by a non-linear function φ_j.
Therefore

    Φ(x) = [φ_1(x), . . . , φ_M(x)]

Step 2 requires finding an optimal hyperplane in the M-dimensional feature
space F^M. This is accomplished by:

    Σ_{j=1}^{M} w_j φ_j(x) + b = 0

Imposing w_0 = b and φ_0(x) = 1:

    Σ_{j=0}^{M} w_j φ_j(x) = w · Φ(x) = 0

Let

    w = Σ_{k=1}^{n} y_k α_k Φ(x_k)

7.2.3 The Kernel Trick

[Spe08] Observe first that for

    w = Σ_{k=1}^{n} y_k α_k Φ(x_k)

the hyperplane equation becomes

    Σ_{k=1}^{n} y_k α_k Φ(x_k) · Φ(x) = 0

The term Φ(x_k) · Φ(x) is the scalar product between the vectors induced by
the k-th learning instance and the input vector x.
Suppose there exists a symmetrical function K(·, ·) such that

    K(x_k, x) = Φ(x_k) · Φ(x) = Σ_{j=0}^{M} φ_j(x_k) φ_j(x) = K(x, x_k)

The hyperplane equation could then be rewritten as

    Σ_{k=1}^{n} y_k α_k K(x_k, x) = 0

This is much simpler because, while it is an equation for the hyperplane in
the feature space, the actual transformation to the feature space does not
need to be computed.
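
A small numeric check of the kernel trick: for the polynomial kernel
K(x, z) = (x · z)², the kernel value equals the dot product of the explicit
degree-2 feature maps (a standard identity; the code and data are our own
illustration):

# Check that K(x, z) = (x . z)^2 equals Phi(x) . Phi(z) for the explicit
# degree-2 feature map Phi(x) = [x1^2, x2^2, sqrt(2) x1 x2].
from math import sqrt

def K(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    return [x[0] ** 2, x[1] ** 2, sqrt(2) * x[0] * x[1]]

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = K(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, rhs)     # both 16.0: the feature map never has to be computed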

Preconditions for Kernel Functions.

Theorem 7.3 (Mercer's Theorem). Let K(x, x′) be a continuous, symmetrical
kernel function in the closed interval a ≤ x ≤ b, and likewise for x′.
K(x, x′) can be expanded to

    K(x, x′) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x′)

with λ_i > 0 if and only if

    ∫_a^b ∫_a^b K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0

for all ψ(·) such that

    ∫_a^b ψ²(x) dx < ∞

The expansion then also converges absolutely and uniformly.
A kernel function is thus a scalar product in a feature space defined by
some nonlinear transform.
Note that the feature space can be infinite-dimensional and that the kernel
must be positive, because λ_i > 0.

8 Bayesian Learning

Bayesian learning methods are relevant to our study of machine learning for
two different reasons [Mit97]:

1. Bayesian learning algorithms that calculate explicit probabilities for
   hypotheses, such as the naive Bayes classifier, are among the most
   practical approaches to certain types of learning problems.
2. Bayesian methods provide a useful conceptual framework [Aio15] and a
   standard for comparison against other algorithms.

Features of Bayesian learning include:

- Bayesian methods can accommodate hypotheses that make probabilistic
  predictions
- Prior knowledge can be combined with observed data to determine the final
  probability of a hypothesis

Downsides of Bayesian learning include:

- They typically require initial knowledge of many probabilities. When these
  probabilities are not known in advance they are often estimated based on
  background knowledge, previously available data, and assumptions about the
  form of the underlying distributions.
- Significant computational cost is required to determine the Bayes optimal
  hypothesis in the general case (linear in the number of candidate
  hypotheses). In certain specialized situations, this computational cost
  can be significantly reduced. [Mit97]

Theorem 8.1 (Bayes Theorem).

    P(h|D) = P(D|h) P(h) / P(D)

where:

- P(h) is the a priori probability of h
- P(D) is the a priori probability of training data D

Generally, we want to select the most probable h given the training data D,
i.e. the maximum a posteriori hypothesis:

    h_MAP ≡ arg max_{h∈H} P(h|D)
          = arg max_{h∈H} P(D|h) P(h) / P(D)
          = arg max_{h∈H} P(D|h) P(h)

In some cases, we will assume that every hypothesis in H is equally probable
a priori; assuming P(h_i) = P(h_j), we can further simplify and choose the
maximum likelihood hypothesis:

    h_ML ≡ arg max_{h∈H} P(D|h)

8.0.4 Brute-Force Bayes Concept Learning

We can design a straightforward concept learning algorithm to output the
maximum a posteriori hypothesis, based on Bayes theorem, as follows [Mit97]:

Brute-Force-Bayes
1  for each h_i ∈ H
2    P_i ← P(h_i|D)
3  return h_MAP = h_j s.t. P_j = max P_{1...n}

This algorithm may require significant computation and is thus impractical;
still, it is of theoretical interest as a benchmark.
In order to specify a learning problem for the Brute-Force algorithm we must
specify what values are to be used for P(h) and for P(D|h) (as we shall see,
P(D) will be determined once we choose the other two). [Mit97]
Assume as well that:

- The training data D is noise free
- The target concept c is contained in the hypothesis space H
- We have no a priori reason to believe that any hypothesis is more probable
  than any other.

Therefore, it is reasonable to assign the same prior probability to every
hypothesis h in H. Furthermore, because we assume the target concept is
contained in H, we should require that these prior probabilities sum to 1.
[Mit97] So we choose

    P(h) = 1/|H|    ∀h ∈ H

Since we assume noise-free training data, the probability of observing
classification d_i given h is just 1 if d_i = h(x_i) and 0 otherwise:

    P(D|h) = 1 if d_i = h(x_i) ∀d_i ∈ D, 0 otherwise

Then:

    P(h|D) = 1/|VS_{H,D}| if h is consistent with D, 0 otherwise

8.0.5 Minimum Description Length Principle

The Minimum Description Length principle is motivated by interpreting the
definition of h_MAP in the light of basic concepts from information theory.
Recall that

    h_MAP = arg max_{h∈H} P(D|h) P(h)

By monotonicity of log, this can be rewritten as the dual min problem

    h_MAP = arg min_{h} −log₂ P(D|h) − log₂ P(h)

Consider the problem of designing a code to transmit messages drawn at
random, where the probability of encountering message i is p_i.
Shannon and Weaver showed that the optimal code that minimizes the expected
message length assigns −log₂ p_i bits to encode message i.
Therefore, we can rewrite

    h_MAP = arg min_{h} L_{C_H}(h) + L_{C_{D|h}}(D|h)

where L_{C_H} and L_{C_{D|h}} are the lengths of the optimal encodings for H
and for D given h, respectively.
The Minimum Description Length (MDL) principle recommends choosing the
hypothesis that minimizes the sum of these two description lengths. Of
course, to apply this principle in practice we must choose specific
encodings or representations appropriate for the given learning task.
Assuming we use the codes C₁ and C₂ to represent the hypothesis and the data
given the hypothesis, we can state the MDL principle as:

Definition 8.1 (MDL principle). Choose h_MDL s.t.

    h_MDL = arg min_{h} L_{C₁}(h) + L_{C₂}(D|h)

The above analysis shows that if we choose C₁ to be the optimal encoding of
hypotheses C_H, and if we choose C₂ to be the optimal encoding C_{D|h}, then

    h_MDL = h_MAP

Intuitively, we can think of the MDL principle as recommending the shortest
method for re-encoding the training data, where we count both the size of
the hypothesis and any additional cost of encoding the data given this
hypothesis.
8.0.6 Bayes Optimal Classifier

So far we have considered the question "what is the most probable hypothesis
given the training data?" In fact, the question that is often of most
significance is the closely related question "what is the most probable
classification of the new instance given the training data?"
In general, the most probable classification of the new instance is obtained
not by h_MAP alone, but instead by combining the predictions of all
hypotheses, weighted by their posterior probabilities.

Definition 8.2 (Bayes Optimal Classification).

    arg max_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)
8.0.7 Gibbs classifier

Although the Bayes optimal classifier obtains the best performance that can
be achieved from the given training data, it can be quite costly to apply.
The expense is due to the fact that it computes the posterior probability
for every hypothesis in H and then combines the predictions of each
hypothesis to classify each new instance.
An alternative, less optimal method is the Gibbs algorithm, defined as
follows:

1. Choose a hypothesis h from H at random, according to the posterior
   probability distribution over H.
2. Use h to predict the classification of the next instance x.

8.1 Naive Bayes

Recall that

    v_MAP = arg max_{v_j∈V} P(v_j|a_1, a_2, . . . , a_n)

We can use Bayes theorem to rewrite this expression as

    v_MAP = arg max_{v_j∈V} P(a_1, a_2, . . . , a_n|v_j) P(v_j) / P(a_1, a_2, . . . , a_n)
          = arg max_{v_j∈V} P(a_1, a_2, . . . , a_n|v_j) P(v_j)

The key idea behind the naive Bayes classifier is the simplifying assumption
that the attribute values are conditionally independent given the target
value, and therefore:

    P(a_1, a_2, . . . , a_n|v_j) = Π_{i=1}^{n} P(a_i|v_j)

So we can rewrite arg max_{v_j∈V} P(a_1, a_2, . . . , a_n|v_j) P(v_j)
accordingly and obtain [Mit97]:

Definition 8.3 (Naive Bayes classifier).

    v_NB = arg max_{v_j∈V} P(v_j) Π_{i=1}^{n} P(a_i|v_j)
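
A small sketch of Definition 8.3, with all probabilities estimated by simple
frequency counts (maximum likelihood estimates; smoothing is deliberately
omitted, and the function names are ours):

# Naive Bayes: v_NB = argmax_v P(v) * prod_i P(a_i | v), with all
# probabilities estimated by frequency counts (no smoothing).
from collections import Counter, defaultdict

def train_nb(examples, labels):
    prior = Counter(labels)
    cond = defaultdict(Counter)        # (i, v) -> counts of values of a_i
    for x, v in zip(examples, labels):
        for i, a in enumerate(x):
            cond[(i, v)][a] += 1
    n = len(labels)

    def classify(x):
        def score(v):
            p = prior[v] / n           # P(v_j)
            for i, a in enumerate(x):
                p *= cond[(i, v)][a] / prior[v]   # P(a_i | v_j)
            return p
        return max(prior, key=score)

    return classify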

9 Clustering

Clustering algorithms group a set of documents into subsets or clusters.
The algorithm's goal is to create clusters that are coherent internally, but
clearly different from each other.
In other words, documents within a cluster should be as similar as possible,
and documents in one cluster should be as dissimilar as possible from
documents in other clusters.
Clustering is the commonest form of unsupervised learning - i.e. learning
from raw data, where a classification of examples is not given. It is the
process of grouping a set of objects into groups of similar objects.

Definition 9.1 (The Clustering Problem). The clustering problem can be
defined as follows. Given:

- A set of examples D = {d_1, . . . , d_n}
- A distance metric
- A partitioning criterion
- A desired number of clusters K

compute an assignment function γ : D → {1, . . . , K} s.t.

- None of the clusters is empty
- γ satisfies the partitioning criterion w.r.t. the similarity measure

Issues with clustering are mainly of representation or of cardinality of the
cluster set. If the data is represented as vector sets, it poses the problem
of normalization. A notion of similarity or distance is needed for the
chosen representation. [Aio15]
The number of clusters can be fixed a priori or completely data driven.
[Aio15] The threat lingers of having trivial clusters, either too large or
too small - at the extremes, the set of all possible objects and sets of
cardinality exactly one.

9.1 Clustering and Objective Functions

Often, the goal of a clustering algorithm is to optimize an objective
function. In these cases, clustering is a search (optimization) problem.
[Aio15]

Theorem 9.1. For K clusters and n objects, K^n / K! total possible
partitions (clusterings) are possible.

Proof 9.1.1. Partitioning n objects into K clusters is analogous to putting
the n objects in a row and putting on each a label from the set
{l_1, l_2, . . . , l_K}, which yields K^n different possibilities. However,
in clustering the bags are unordered; the only thing that matters is that
objects are in the same cluster. K clusters can be ordered in K! ways, so we
divide by K!.

Most partitioning algorithms start from a guess and then refine the
partition.

Non-determinism of flat clustering. Many local minima in the objective
function imply that different starting points may lead to very different
(and non-optimal) final partitions.
9.2

Criteria for Clustering

We can think of two broad kinds of criterions for evaluating clustering [Aio15]:
1. Internal criteria
2. External criteria
Definition 9.2 (Internal Criterion). An internal criterion for the quality of a
clustering is one that relies on the measured quality of a clustering that depends
on both the document representation and the similarity measure used.
A good clustering will produce high quality clusters in which
1. The intra-class (that is, intra-cluster) similarity is high

DR
AF
T

2. The inter-class similarity is low

But good scores on an internal criterion do not necessarily translate into


good effectiveness in an application. An alternative to internal criteria is direct
evaluation in the application of interest. [MRS08]
An external criteria assesses a clustering with respect to some ground truth.
Assume we can use a set of classes in an evaluation benchmark or gold standard.
We can then compute an external criterion that evaluates how well the clustering
matches the gold standard classes.
Definition 9.3 (External Criterion). In external criteria, quality is measured
by its ability to discover some or all of the hidden patterns or latent classes in
gold standard data.
Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, 1 , . . . , k with ni members. One simple external
measure is:
Definition 9.4 (Purity). The ratio between the number of documents of the dominant class in cluster $\omega_i$ and the size of cluster $\omega_i$:
$$\mathrm{purity}(\omega_i) = \frac{1}{|\omega_i|} \max_j |\omega_i \cap c_j|$$
where $c_j$ ranges over the gold standard classes.
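For example (illustrative numbers, not from the source): a cluster of 6 documents of which 4 belong to class A and 2 to class B has purity 4/6 ≈ 0.67; a purity of 1 means the cluster contains documents of a single class.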

Other external criteria are the entropy of classes in clusters (or the mutual information between classes and clusters).
The Rand Index The Rand index or Rand measure (named after William
M. Rand) is a measure of the similarity between two data clusterings.
Definition 9.5 (Rand Index). Given a set of n elements $S = \{o_1, \ldots, o_n\}$ and two partitions of S to compare, $X = \{X_1, \ldots, X_r\}$, a partition of S into r subsets, and $Y = \{Y_1, \ldots, Y_s\}$, a partition of S into s subsets, define the following:


1. a, the number of pairs of elements in S that are in the same set in X and
in the same set in Y
2. b, the number of pairs of elements in S that are in different sets in X and
in different sets in Y
3. c, the number of pairs of elements in S that are in the same set in X and
in different sets in Y
4. d, the number of pairs of elements in S that are in different sets in X and
in the same set in Y
The Rand index, R, is:
$$R = \frac{a+b}{a+b+c+d} = \frac{a+b}{\binom{n}{2}}$$
Intuitively, a + b can be considered as the number of agreements between X and Y, and c + d as the number of disagreements between X and Y.⁴
The Rand Index can be used to compare the ground truth clustering with the clustering produced by our algorithms. In this particular case, if we let X be the clustering to evaluate and Y the ground truth, the a term would be the true positives and b the true negatives.
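As a minimal Python sketch of these pairwise counts (the function name and the example labelings are ours, for illustration only):

    from itertools import combinations

    def rand_index(x, y):
        """Rand index of two flat clusterings given as label sequences,
        where x[i] and y[i] are the cluster labels of object o_i."""
        a = b = c = d = 0
        for i, j in combinations(range(len(x)), 2):
            same_x = x[i] == x[j]
            same_y = y[i] == y[j]
            if same_x and same_y:
                a += 1      # together in both partitions
            elif not same_x and not same_y:
                b += 1      # apart in both partitions
            elif same_x:
                c += 1      # together in X, apart in Y
            else:
                d += 1      # apart in X, together in Y
        return (a + b) / (a + b + c + d)

    print(rand_index([0, 0, 1, 1], [1, 1, 0, 2]))  # 5/6 = 0.8333...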

9.3 Clustering Algorithms

There are two main families of clustering algorithms:


1. Partitional algorithms

2. Hierarchical algorithms

Partitional algorithms Examples of partitional algorithms are K-means clustering and model-based clustering.
Partitional algorithms usually start with a random (partial) partitioning of n documents into a set of K clusters and refine it iteratively to optimize the chosen partitioning criterion.
Effective heuristic methods are used in the K-means and K-medoids algorithms.
9.3.1 k-Means clustering

k-Means clustering assumes documents are real-valued vectors.
Its partitioning criterion is as follows: for each cluster, minimize the average distance between the documents and the center of the cluster.
Definition 9.6 (Centroid). The definition of clusters is based on centroids, i.e. the center of gravity (or mean) of the points in a cluster c:
$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$
4 http://web.archive.org/web/20151223080751/https://en.wikipedia.org/wiki/Rand_
index


1. Place K points into the space represented by the objects that are being
clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K
centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces
a separation of the objects into groups from which the metric to be minimized can be calculated.
Figure 9.1: k-Means algorithm
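The four steps above translate almost line-for-line into NumPy; the following is a minimal sketch (the random initialization, the Euclidean distance, and the convergence test are common choices assumed here, not prescribed by the text):

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        """k-Means on an (n, d) array X of real-valued document vectors."""
        rng = np.random.default_rng(seed)
        # Step 1: place K initial centroids on randomly chosen objects.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each object to the group with the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean of its assigned points
            # (assumes no cluster ends up empty; robust code would re-seed it).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Step 4: stop when the centroids no longer move.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids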

9.4 Hierarchical Clustering

Hierarchical clustering, as opposed to flat clustering, outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.[MRS08]
Hierarchical clustering does not require us to prespecify the number of clusters; it is thus especially suited for problems where determining the number K of clusters is part of the problem.
Families of clustering algorithms There are two big families of hierarchical clustering algorithms:
1. Bottom up, or agglomerative: an agglomerative clustering algorithm
starts with N groups, each initially containing one training instance, merging similar groups to form larger groups, until there is a single one.
2. Top down, or divisive: a divisive clustering algorithm goes in the other direction, starting with a single group and dividing large groups into smaller
groups, until each group contains a single instance.
Dendrograms The results of hierarchical clustering are usually presented in a
dendrogram. Each merge is represented by a horizontal line. The y-coordinate
of the horizontal line is the similarity of the two clusters that were merged,
where documents are viewed as singleton clusters. We call this similarity the
combination similarity of the merged cluster.
Reducing a dendrogram to flat clustering It is possible to cut the
dendrogram at a desired level: each connected component, then, forms a cluster
in a flat clustering.
9.4.1 Hierarchical Agglomerative Clustering (HAC)

Consider the problem of constructing a tree-based hierarchical taxonomy, called a dendrogram, from a set of documents. One approach is the recursive application of a partitional clustering algorithm.
The y-axis of the dendrogram represents the combination similarities, i.e. the similarities of the clusters merged by the horizontal lines at a particular y.

A fundamental assumption in HAC is that the merge operation is monotonic, i.e. if $s_1, \ldots, s_{k-1}$ are the successive combination similarities, then $s_1 \geq s_2 \geq \ldots \geq s_{k-1}$ must hold.
Clustering algorithms and measures of similarity There are several more or less naive agglomerative algorithms; the basic idea is that at each iteration of an agglomerative algorithm, we choose the two closest groups and merge them.
Among the variants are:
Single-link clustering In single-link clustering or single-linkage clustering, the similarity of two clusters is the (cosine-)similarity of their most similar members. This single-link merge criterion is local: we pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the clusters and the clusters' overall structure are not taken into account. [Aio15][MRS08]

Complete-link In complete-link clustering or complete-linkage clustering, the similarity of two clusters is the similarity of their least (cosine-)similar members. This is equivalent to choosing the cluster pair whose merge has the smallest diameter. This complete-link merge criterion is non-local; the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but also causes sensitivity to outliers.
Centroid: In centroid clustering, the similarity of two clusters is defined as the (inner-product) similarity of their centroids:
$$\text{Centr-Simil}(\omega_i, \omega_j) = \vec{\mu}(\omega_i) \cdot \vec{\mu}(\omega_j)$$
[MRS08]

Average-link: also known as Group Average Agglomerative Clustering or GAAC, evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents.
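As a sketch, these merge criteria map directly onto SciPy's hierarchical clustering routines (note that SciPy works with distances rather than cosine similarities, and the toy data here is ours):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).normal(size=(20, 2))  # toy document vectors

    # One call per merge criterion discussed above.
    Z_single   = linkage(X, method="single")
    Z_complete = linkage(X, method="complete")
    Z_centroid = linkage(X, method="centroid")
    Z_average  = linkage(X, method="average")    # GAAC

    # Cutting the dendrogram at a desired level yields a flat clustering.
    labels = fcluster(Z_average, t=3, criterion="maxclust")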


10 Feature Selection

All features in learning should ideally be relevant to the classification/prediction problem[Aio15].
First, fewer features mean models that are more compact and need a smaller number of examples to obtain acceptable results.
Models that use few features are more easily represented and understood by
humans.[Aio15]
Feature selection is especially relevant in several fields:
- Computational biology, where very few examples (a few dozen samples) and very many features (genes) are found
- Face recognition: which ones are the most important features in a face?
- Health studies, where data is generally expensive and/or gathered with invasive procedures
- Financial engineering and risk management: very many features
- Text classification: very many features

With this in mind, there are two approaches to achieve this[Aio15]:


1. Feature Selection

2. Feature Extraction

10.1 Feature Selection

Feature selection consists, as the name suggests, in selecting a subset of the attributes occurring in the raw data.
Feature Selection Methods

Among feature selection methods are the following[Aio15]:

Filter methods In filter methods, the general characteristics of the training set are considered. Feature selection with filter methods is a pre-processing
step independent of the prediction algorithm.
Wrapper methods Features are selected according to their predictive
capabilities, typically with a hold-out set.
Embedded methods Feature selection is part of the training step.
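As an illustration of the filter approach described above, the sketch below scores each feature independently of any downstream predictor; the correlation-based score and all names are our own illustrative choices:

    import numpy as np

    def filter_select(X, y, k):
        """Keep the k features of X most correlated (in absolute value)
        with the target y -- a pre-processing step that never consults
        the prediction algorithm."""
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        # Per-feature absolute Pearson correlation with the target
        # (the small constant guards against constant features).
        scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
        keep = np.argsort(scores)[-k:]
        return X[:, keep], keep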

Advantages of Feature Selection Methods They result in the removal of irrelevant or redundant features, and in better interpretability of the predictive model.[Aio15]
Feature Selection methods are favored for applications where interpretability is more relevant than accuracy.


10.2 Feature Extraction

Feature Extraction methods derive completely new features - for example, as combinations of existing ones.
The most important feature extraction method is called Principal Component Analysis (PCA) and consists in extracting a set of linearly uncorrelated features (also called principal components).
The principal components are usually far fewer in number than the original features.
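A minimal NumPy sketch of PCA via the singular value decomposition (one standard formulation among several; the names are illustrative):

    import numpy as np

    def pca(X, n_components):
        """Project X (n samples x d features) onto its first principal
        components, i.e. linearly uncorrelated directions of maximal
        variance."""
        Xc = X - X.mean(axis=0)              # PCA assumes centered data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_components].T      # coordinates in the new feature space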

Advantages of Feature Extraction Typically, extraction methods are more powerful and yield more discriminative features.
The higher resulting accuracy makes them favored for applications where accuracy is more relevant than interpretability.[Aio15]


11 Recommendation Systems

Recommendation systems are a subclass of information filtering systems that seek to predict the rating or preference that a user would give to an item.

11.1 Feedback in a RS

There are two broad kinds of feedback that can be had in a recommendation
system[Aio15]:
Explicit Feedback Explicit feedback can take the shape of an ordering or a preference expressed by the user over two or more items, or a per-item rating expressed by the user.
Implicit Feedback Consider for example the list of items bought by the user, his/her connection network, or the time the user spends on a web page.

11.2 Approaches to RS

There are, again, two broad families of approaches to recommendation systems:


Content Based Content-based systems recommend the items that are most similar to those towards which the user has already explicitly or implicitly shown interest.
CB systems are favored when little history is available (the so-called cold-start problem).
Collaborative Filtering Collaborative filtering systems recommend the items that are most similar to those liked by the user's neighbors (users similar to the user)[Aio15].
CFRS then requires defining:
1. A measure of Item-Item similarity
2. A measure of User-User similarity
In CFRS no knowledge about users and items is involved, except for the user-item interaction behaviour.
CF systems are favored when the interaction patterns themselves are the most informative signal available[Aio15].
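For instance, a minimal sketch of a User-User similarity computed from the rating matrix (cosine over the users' rating vectors; all names here are illustrative):

    import numpy as np

    def user_similarity(R):
        """Cosine similarity between all pairs of users, where R is the
        (n_users x n_items) rating matrix with 0 for missing entries."""
        norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
        Rn = R / norms
        return Rn @ Rn.T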
Hybrid Methods Hybrid methods exist that use CB to address the cold-start problem and switch to CF once a significant history has become available.


11.3 Tasks in RS

Commonly, RS is associated with the presence of a rating matrix R where $R_{i,j}$ is the rating given by the i-th user to the j-th item.
Such a matrix is usually very sparse, i.e. it has many missing entries.
Common tasks in RS are[Aio15]:
Rate prediction: predict the missing entries
TOP-N Recommendation: predict the N items with the highest rating
for a given user
Collaborative Filtering approaches
Rate Prediction Predicting ratings is a regression problem.
A Collaborative Filtering approach to rate prediction is matrix factorization: a representation for users and items is learned so that their scalar product approximates the existing ratings.
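A sketch of this idea trained with stochastic gradient descent on the squared error of the observed entries (the learning rate, regularization, and initialization below are illustrative choices of ours, not from the source):

    import numpy as np

    def matrix_factorization(ratings, n_factors=10, lr=0.01, reg=0.1, epochs=50, seed=0):
        """ratings: list of (user, item, rating) triples, i.e. the observed
        entries of the sparse matrix R.  Learns factors P (users) and Q (items)
        so that P[u] . Q[i] approximates R[u, i]."""
        rng = np.random.default_rng(seed)
        n_users = 1 + max(u for u, _, _ in ratings)
        n_items = 1 + max(i for _, i, _ in ratings)
        P = rng.normal(scale=0.1, size=(n_users, n_factors))
        Q = rng.normal(scale=0.1, size=(n_items, n_factors))
        for _ in range(epochs):
            for u, i, r in ratings:
                err = r - P[u] @ Q[i]
                # Gradient step on the regularized squared error.
                P[u] += lr * (err * Q[i] - reg * P[u])
                Q[i] += lr * (err * P[u] - reg * Q[i])
        return P, Q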
TOP-N recommendation TOP-N is essentially a ranking problem.
Matrix Factorization is carried out on preferences - the problem is how to
treat missing data.
11.3.1 Evaluating Performance in Recommendation Systems

Evaluating rate prediction systems A common metric is the root mean square error over the test ratings $R_{te}$:
$$RMSE = \sqrt{\frac{1}{|R_{te}|} \sum_{(u,i) \in R_{te}} (r_{ui} - \hat{r}_{ui})^2}$$
where $\hat{r}_{ui}$ denotes the predicted rating.
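Continuing the factorization sketch above (our illustrative names), the metric is a one-liner:

    import numpy as np

    def rmse(P, Q, test_ratings):
        """RMSE of the factor model on held-out (user, item, rating) triples."""
        errors = [(r - P[u] @ Q[i]) ** 2 for u, i, r in test_ratings]
        return np.sqrt(np.mean(errors))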

Evaluating Top-N recommendation systems There are several metrics for evaluating the performance of recommendation systems, such as AUC (Area Under the ROC Curve), prec@n, et cetera.


References
[Aio15] Fabio Aiolli. Lecture notes in machine learning. http://www.math.unipd.it/~aiolli/corsi/1516/aa, October-December 2015.
[Alp10] E. Alpaydin. Introduction to Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2010.
[Beb16] George Bebis. Lecture notes in pattern recognition, January 2016.
[Mit97] T.M. Mitchell. Machine Learning. McGraw-Hill International Editions. McGraw-Hill, 1997.
[MRS08] C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[MRT12] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning series. MIT Press, 2012.
[Spe08] Alessandro Sperduti. Lecture notes in intelligent systems. http://www.math.unipd.it/~sperduti/SI08/, 2008.


Index
≥g, 7
ANN, see Neural Network
Artificial Neural Network, see Neural Network
Bayes Optimal Classifier, 35
Bayesian Learning, 33
Black box problem, 16
Brute Force Bayes, 34
Candidate-Elimination, 8
Centroid, 39
Classification, 4
Clustering, 4, 37
Clustering, evaluation criteria, 38
Clustering, possible permutations, 37
Collaborative Filtering, 44
Concept Learning, 7
Cover's Theorem, 30
Cross-Validation, 25
Decision Trees, 10
Dendrograms, 40
Entropy, 11
Feature Centering, 24
Feature Extraction, 43
Feature Mapping, 23
Feature Normalization, 24
Feature Selection, 42
Find-S, 7
Gain Ratio, 11
Gibbs classifier, 35
Hyperparameter Optimization, 25
ID3, 11
Inductive bias, 5
Inductive learning hypothesis, 4
Information Gain, 11
k-Means Clustering, 39
Kernel Functions, 31
Kernel Trick, 31
MDL, see Minimum Description Length Principle
Mercer's Theorem, 32
Minimum Description Length Principle, 34
Model selection, 25
Naive Bayes classifier, 36
Neural Network, 16
Ordering of the hypothesis space, 7
Overfitting, 13
Perceptron, 17, 18
Pipeline, 23
Rand Index, 38
Recommendation Systems, 44
Reduced Error Pruning, 14
Regression, 4
Rule post-pruning, 14
Sample Error, 4
Shattering, 6
Split Information, 11
Structural Risk Minimization, 27
Supervised Learning, 4
Support Vector Machine, 27
SVM, see Support Vector Machine
True error, 4
VC Dimension, 6
Version Space, 8
