Tobia Tesan
May 14, 2016
Contents

1 Introduction
1.1 The case for Machine Learning
1.2 Applications of Machine Learning
1.3 Tasks in Machine Learning

2 Supervised learning
2.1 Unsupervised Learning
2.2 Supervised Learning
2.3 Examples of supervised learning
2.4 VC Dimension and Shattering

3 Concept Learning
3.1 Concepts
3.1.1 Partial ordering of the hypothesis space
3.2 Find-S
3.3 Candidate-Elimination

4 Decision Trees

5 Neural Networks
5.1 Network structure
5.2 The perceptron
5.2.1 The Perceptron and functions
5.2.2 Learning in the Perceptron
5.2.3 Gradient descent and the delta rule
5.3 Multilayer Networks
5.3.1 The sigmoid
5.3.2 The Back-Propagation algorithm

6 Data Preprocessing and Model Selection

7 Support Vector Machines

8 Bayesian Learning
8.0.4 Brute-Force Bayes Concept Learning
8.0.5 Minimum Description Length Principle
8.0.6 Bayes Optimal Classifier
8.0.7 Gibbs classifier
8.1 Naive Bayes

9 Clustering
9.1 Clustering and Objective Functions
9.2 Criteria for Clustering
9.3 Clustering Algorithms
9.3.1 k-Means clustering
9.4 Hierarchical Clustering
9.4.1 Hierarchical Agglomerative Clustering (HAC)

10 Feature Selection
10.1 Feature Selection
10.2 Feature Extraction

11 Recommendation Systems
11.1 Feedback in a RS
11.2 Approaches to RS
11.3 Tasks in RS
11.3.1 Evaluating Performance in Recommendation Systems
1 Introduction

1.1 The case for Machine Learning

1.2 Applications of Machine Learning

1.3 Tasks in Machine Learning

Binary classification
Multiclass classification
Regression
Class and instance ranking
Novelty Detection
Clustering
Basket Analysis
Reinforcement Learning
2 Supervised learning

Definition 2.1 (Supervised learning). Supervised learning is the machine learning task of inferring a function $f : X \to Y$ from labeled training data $(x_i, y_i)$. [Mit97]

Regression and classification are examples of supervised learning. [MRT12]

Definition 2.2 (Oracle). An oracle - a human expert or nature itself - classifies arbitrary instances $x_i$. It is not necessarily deterministic. It can be characterized as the function $f$ to approximate. [Aio15]
2.1 Unsupervised Learning

2.2 Supervised Learning

The error of a hypothesis $h$ on a sample $S$ of size $n$ is measured as

$$\frac{1}{n} \sum_{x \in S} \ell(f(x), h(x))$$

Model (definition of $H$)
Target function (search in $H$)
2.3 Examples of supervised learning

2.4 VC Dimension and Shattering

[Aio15]

Example 2.4 (Hyperplanes in $\mathbb{R}$).

Training set: $S = \{(x_i, y_i)\}_{i=1,\dots,N}$
Hypothesis space: $H = \{h(x)\}$
An algorithm: $L : S \times H \to H$ with $L(S, H) = h^*(x) = \arg\min_{h \in H} \mathrm{error}_S(h(x))$

It is then possible to obtain an upper bound on the true error (2.7), holding with probability $P = (1 - \delta)$ for an arbitrarily small $\delta$:

$$\mathrm{error}_D(h^*(x)) = \mathrm{error}_S(h^*(x)) + \underbrace{g(N, VC(H), \delta)}_{VC\text{-}Confidence}$$

$g(N, VC(H), \delta)$ is called VC-Confidence and is independent of $L$. It is inversely proportional to $|S|$. For a fixed $N$, it is monotonic in $VC(H)$. [Aio15]

Theorem 2.1 (Upper bound on VC Dimension for finite $H$).

$$VC(H) \leq \log_2(|H|)$$

Proof. Suppose that $VC(H) = d$. Then $H$ must contain $2^d$ distinct hypotheses to shatter $d$ instances. Hence $2^d \leq |H|$ and $d = VC(H) \leq \log_2 |H|$. [Mit97]
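For instance (numbers chosen purely for illustration): a hypothesis space with $|H| = 2^{10} = 1024$ distinct hypotheses can shatter at most 10 instances, since shattering $d$ instances requires $2^d$ distinct labelings and a finite $H$ can realize at most $|H|$ of them.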
3 Concept Learning

3.1 Concepts

We call concept learning the task of inferring a concept from training examples of its input and output. [Mit97]

Definition 3.1 (Concept). We call a concept a boolean-valued function defined over the instance space $X$ [Aio15]:

$$c : X \to \mathbb{B}$$

Definition 3.2 (Concept Example). An example of concept $c$ over an instance space $X$ is a pair $(x, c(x))$ with $x \in X$ and $c : X \to \mathbb{B}$ [Aio15].

Definition 3.3 (Instance satisfaction). Let $h : X \to \mathbb{B}$. We say $h$ satisfies $x \in X$ if $h(x) = \mathrm{true}$.
3.1.1 Partial ordering of the hypothesis space

Definition 3.4 (Hypothesis consistency). Let $h : X \to \mathbb{B}$ and $(x, c(x))$ an example of $c$. We say $h$ is consistent with the example if $h(x) = c(x)$. We say $h$ is consistent with a set $D$ if $h(x) = c(x) \;\; \forall \langle x, c(x) \rangle \in D$.

Definition 3.5 (Ordering on hypotheses). We say $h_i$ is more general or equivalent to $h_j$ and write $h_i \geq_g h_j$ if

$$(\forall x \in X)[(h_j(x) = 1) \to (h_i(x) = 1)]$$

Lemma 3.1.

$$\forall h_i, h_j : h_i \geq_g (h_i \wedge h_j)$$

Lemma 3.2.
3.2 Find-S

The Find-S algorithm illustrates one way in which the $\geq_g$ partial ordering can be used to organize the search for an acceptable hypothesis. At each stage, the hypothesis is the most specific hypothesis consistent with the training examples observed up to that point (hence the name Find-S).

Soundness of Find-S. Every time $h$ is generalized to some $h'$ s.t. $h' \geq_g h$, each previously observed positive example remains satisfied [Aio15]: by definition of $\geq_g$,

$$\forall x \in X : (h(x) = 1) \to (h'(x) = 1)$$

If the target concept is in $H$, all negative examples are trivially satisfied by the resulting $h$ by virtue of it being the most specific.
Find-S
1  h ← min(H, ≥_g)  // h gets the most specific hypothesis in H
2  for each x : c(x) = true  // for each positive training instance x
3      for each attribute constraint a_i in h
4          if a_i is satisfied by x
5              do nothing
6          else a_i ← (the next more general constraint that is satisfied by x)
7  return h
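To make the procedure concrete, here is a minimal Python sketch of Find-S for conjunctive hypotheses over categorical attributes. The representation is an assumption of this sketch, not part of the notes: None stands for the maximally specific hypothesis that rejects everything, and '?' for an attribute constraint that accepts any value.

def find_s(examples):
    # examples: list of (attribute_tuple, label) pairs
    h = None                    # most specific hypothesis: rejects everything
    for x, positive in examples:
        if not positive:        # Find-S ignores negative examples
            continue
        if h is None:           # first positive example: adopt it verbatim
            h = list(x)
        else:
            for i, xi in enumerate(x):
                if h[i] != xi:  # minimally generalize the violated constraint
                    h[i] = '?'
    return h

examples = [(('Sunny', 'Warm', 'Normal'), True),
            (('Sunny', 'Warm', 'High'), True),
            (('Rainy', 'Cold', 'High'), False)]
print(find_s(examples))  # ['Sunny', 'Warm', '?']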
Find-S cannot tell when it has converged, i.e. whether $h$ is indeed the correct target concept - the only $h \in H$ consistent with the data. Moreover, there is no reason to favor the most specific hypothesis. [Mit97]
3.3 Candidate-Elimination

Candidate-Elimination addresses the concerns in Paragraph 3.2 by outputting a description of the set of all hypotheses consistent with the training examples.

Definition 3.6 (Version Space). The version space, denoted $VS_{H,D}$, with respect to hypothesis space $H$ and training examples $D$ is the subset of hypotheses from $H$ consistent with the training examples in $D$ [Mit97]:

$$VS_{H,D} \equiv \{h \in H : \mathrm{Consistent}(h, D)\}$$

This set is characterized as the set of all hypotheses that have not been eliminated as a result of being in conflict with observed data.

The idea behind Candidate-Elimination is to remove from $H$ the over-specific hypotheses using the positive examples and the under-specific hypotheses using the negative examples [Aio15].

Remarks on Candidate-Elimination and $VS_{H,D}$. $|VS_{H,D}|$ can be infinite (if $H$ is), but $S$ and $G$ can have a finite representation.

Lemma 3.3. $|VS_{H,D}|$ tends to decrease with $|D|$, as a consequence of additional constraints. [Aio15]

Lemma 3.4. Assuming $c \in H$, the smaller $|VS_{H,D}|$, the higher the probability that $h = c$ for a randomly selected $h \in VS_{H,D}$.
Candidate-Elimination
1   G ← set of the most general hypotheses
2   S ← set of the most specific hypotheses
3   for each d = ⟨x, c(x)⟩ ∈ D
4       if c(x) = 1  // positive example
5           G ← G \ {h : ¬Consistent(h, d)}
6           for each s ∈ S : ¬Consistent(s, d)
7               S ← S \ s
8               S ← S ∪ all minimal generalizations h of s s.t.
9                   Consistent(h, d) ∧ ∃g ∈ G : g ≥_g h
10              S ← S \ {s : ∃s′ ∈ S : s >_g s′}
11
12      else  // negative example
13          S ← S \ {h : ¬Consistent(h, d)}
14          for each g ∈ G : ¬Consistent(g, d)
15              G ← G \ g
16              G ← G ∪ all minimal specializations h of g s.t.
17                  Consistent(h, d) ∧ ∃s ∈ S : h ≥_g s
18          G ← G \ {g : ∃g′ ∈ G : g′ >_g g}
19
20  return G, S
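Since $VS_{H,D}$ is finite for the conjunctive hypothesis space, the version space of Definition 3.6 can also be computed by brute-force enumeration, which is a useful cross-check on small problems. A sketch under the same '?'-based conjunction representation assumed in the Find-S snippet above:

from itertools import product

def satisfies(h, x):
    return all(a == '?' or a == xi for a, xi in zip(h, x))

def consistent(h, x, label):
    return satisfies(h, x) == label

def version_space(examples, domains):
    # enumerate every conjunction of per-attribute constraints:
    # '?' (any value) or one fixed value per attribute
    candidates = product(*[['?'] + list(d) for d in domains])
    return [h for h in candidates
            if all(consistent(h, x, l) for x, l in examples)]

examples = [(('Sunny', 'Warm', 'Normal'), True),
            (('Sunny', 'Warm', 'High'), True),
            (('Rainy', 'Cold', 'High'), False)]
domains = [('Sunny', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High')]
print(version_space(examples, domains))
# three consistent hypotheses, bounded by S = (Sunny, Warm, ?)
# and G = {(Sunny, ?, ?), (?, Warm, ?)}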
4 Decision Trees

[Figure: example decision tree. The root tests Outlook; the Sunny branch tests Humidity (High → No, Normal → Yes), Overcast → Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).]

The case for decision trees. Decision trees are often used in practice. They are generally very efficient and are especially suited for

1. Instances of the form ⟨attribute, value⟩
4.1

4.1.1

Entropy is defined as

$$\mathrm{Entropy}(S) \equiv \sum_{i=1}^{c} -p_i \log_2 p_i$$

or, for a boolean classification,

$$\mathrm{Entropy}(S) \equiv -p_\oplus \log_2(p_\oplus) - p_\ominus \log_2(p_\ominus)$$

where $p_c$ is the proportion of $S$ belonging to some class $c$.

Lemma 4.2. If the target attribute can take on $c$ possible values,

$$\mathrm{Entropy}(S) \leq \log_2 c$$

$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \mathrm{Entropy}(S_v)$$

where $\mathrm{Values}(A)$ is the set of all possible values for attribute $A$ and $S_v$ is the subset of $S$ for which attribute $A$ has value $v$ [Mit97].
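A short Python sketch of the two measures just defined; the data layout (rows as dicts of attribute values, a parallel list of labels) is a choice of this sketch.

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the class proportions of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        sv = [l for r, l in zip(rows, labels) if r[attr] == v]
        remainder += len(sv) / n * entropy(sv)
    return entropy(labels) - remainder

labels = ['yes'] * 9 + ['no'] * 5
print(round(entropy(labels), 3))  # 0.94, the classic 9+/5- value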
4.1.2 Algorithm

There is a natural bias in the information gain measure that favors attributes with many values over those with few. [Mit97]

As an extreme example, date of birth could sometimes be a perfect predictor of the final grade on an exam. It would then be selected as the decision attribute for the root node and yield a broad tree of depth one: it has a very high information gain relative to the training examples, despite being a very poor predictor of the target function over unseen instances. [Mit97]

Gain Ratio. One alternative measure that has been used successfully is Gain Ratio, which penalizes attributes such as date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data [Mit97]:
ID3(D, A)
1   create Root
2   if ∀d ∈ D : f(d) = +
3       return (Root, +)
4   if ∀d ∈ D : f(d) = −
5       return (Root, −)
6   if A = ∅
7       return (Root, most common label in D)
8   A_split ← A_i ∈ A s.t. A_i = arg max Gain(D, A_i)
9   // split on the attribute with maximum information gain
10  DecisionAttribute(Root) ← A_split
11  for each value v_j of A_split
12      add a new tree branch below Root, corresponding to A_split = v_j
13      D_{v_j} ← {d ∈ D : A_split(d) = v_j}  // the subset of examples that have value v_j for A_split
14      if D_{v_j} = ∅
15          add a leaf node with label = most common value of the target attribute in D
16      else
17          add the subtree ID3(D_{v_j}, A \ {A_split})
18  return Root

Figure 4.2: ID3 Algorithm
$$\mathrm{SplitInformation}(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

$$\mathrm{GainRatio}(S, A) \equiv \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$
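Continuing the sketch above: SplitInformation is just the entropy of $S$ with respect to the values of $A$ itself, so GainRatio follows in a few lines (reusing entropy and gain from the previous snippet).

def split_information(rows, attr):
    # entropy of S with respect to the values of attribute A itself
    return entropy([r[attr] for r in rows])

def gain_ratio(rows, labels, attr):
    return gain(rows, labels, attr) / split_information(rows, attr)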
Gini impurity:

$$\sum_{c, c' \in \mathrm{Classes},\, c \neq c'} p_c\, p_{c'}$$

Misclassification impurity:

$$1 - \max_{c \in \mathrm{classes}} p_c$$

4.2
4.3
4.3.1
The minimum description length principle uses an explicit measure of complexity (see Section 8.0.5).

Reduced Error Pruning

Rule post-pruning

1. For every path $(r, l_i)$ from root $r$ to leaf $l_i$, a rule $R_i$ is generated of the form: if $(A_{i0} = v_{i0}) \wedge (A_{i1} = v_{i1}) \wedge \dots \wedge (A_{ik} = v_{ik})$ then $\mathrm{label}_i$

2. Then, for every individual $R_i$, pruning is done by:
   estimating the performance yielded by $R_i$ alone
   removing preconditions $(A_{ij} = v_{ij})$ that lead to an increase in performance

3. Finally, the pruned $R_i$ are sorted by their performance in decreasing order.
5 Neural Networks
The case for ANNs. Artificial neural networks are especially suited for problems where

1. Input is discrete or continuous-valued
[TODO: Missing figure]

Figure 5.1: Structure of a feed-forward neural network
[TODO: Missing figure]

5.1 Network structure

5.2 The perceptron
The perceptron computes

$$\mathrm{out} = \sigma(\vec{w} \cdot \vec{x}) \quad \text{where } \sigma = \mathrm{sgn}$$

Learning a perceptron involves choosing values for $\vec{w}$. The space $H$ of hypotheses is therefore the set of all possible real-valued weight vectors:

$$H = \{\vec{w} : \vec{w} \in \mathbb{R}^{n+1}\}$$

The perceptron can be seen as a hyperplane decision surface that outputs a $1$ for instances lying on one side of the hyperplane and a $-1$ for instances lying on the other side:

$$H = \{f_{\vec{w},b}(\vec{y}) : f_{\vec{w},b}(\vec{y}) = \mathrm{sgn}(\vec{w} \cdot \vec{y} + b),\; \vec{w}, \vec{y} \in \mathbb{R}^n, b \in \mathbb{R}\}$$

or, folding the bias into the weight vector, with

$$\vec{w}\,' = [b, \vec{w}]^t, \quad \vec{y}\,' = [1, \vec{y}]^t$$
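A two-line numpy illustration of the decision surface just described; the weights and input are arbitrary numbers chosen for the example.

import numpy as np

def perceptron_output(w, b, y):
    # f_{w,b}(y) = sgn(w . y + b); with w' = [b, w], y' = [1, y]
    # the bias folds into the weight vector
    return np.sign(np.dot(w, y) + b)

w, b = np.array([1.0, -2.0]), 0.5
print(perceptron_output(w, b, np.array([3.0, 1.0])))  # 1.0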
5.2.1 The Perceptron and functions

The (single) Perceptron can represent the OR, AND and NOT Boolean operators. In fact, it can represent any m-of-n function. However,

Lemma 5.1. Any non-linearly separable function cannot be represented by an individual Perceptron.

One such function is the XOR function.
5.2.3 Gradient descent and the delta rule

Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable. A second training rule, called the delta rule, is designed to overcome this difficulty: if the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.

The key idea behind the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples. [Mit97]
Consider a modified continuous perceptron [Spe08] such that:

$$\mathrm{out}(\vec{x}) = \sum_{i=1}^{n} w_i x_i = \vec{w} \cdot \vec{x}$$

and measure the training error as

$$E[\vec{w}] \equiv \frac{1}{|D|} \sum_{(\vec{x}_i, t_i) \in D} (t_i - \mathrm{out}(\vec{x}_i))^2$$

Since the gradient $\frac{\partial E}{\partial w_i}$ specifies the direction of steepest increase of $E$, the training rule for gradient descent is

$$\vec{w} \leftarrow \vec{w} + \Delta\vec{w}$$
Gradient-Descent(D, η)
1  w_i ← random  // let w_i assume random, small values
2  while E[w] > ε
3      Δw_i ← 0
4      for each (x, t) ∈ D
5          out_x ← output(x)
6          for each w_i
7              Δw_i ← Δw_i + η(t − out_x)x_i
8      for each w_i
9          w_i ← w_i + Δw_i

Figure 5.4: Gradient-Descent [Spe08][Mit97]
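A minimal numpy sketch of Figure 5.4 on a linear unit; the learning rate, stopping threshold, 1/|D| averaging (matching the $E[\vec{w}]$ above) and the toy target $2x + 1$ are all choices of this sketch.

import numpy as np

def gradient_descent(X, t, eta=0.1, eps=1e-4, max_iters=50000):
    """Batch delta rule on out(x) = w . x.
    X: (n_samples, n_features), with a leading column of 1s as bias."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random weights
    for _ in range(max_iters):
        error = t - X @ w                  # (t_d - o_d) for every example
        if np.mean(error ** 2) < eps:      # E[w] below threshold: stop
            break
        w += eta * X.T @ error / len(X)    # Delta w_i, averaged over D
    return w

X = np.c_[np.ones(20), np.linspace(-1, 1, 20)]
t = 2 * X[:, 1] + 1
print(np.round(gradient_descent(X, t), 2))  # close to [1. 2.]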
where

$$\Delta\vec{w} = -\eta \nabla E[\vec{w}]$$

i.e.

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$$
Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, given that a sufficiently small learning rate $\eta$ is used. If $\eta$ is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of $\eta$ as the number of gradient descent steps grows. [Mit97]
5.3 Multilayer Networks

5.3.1 The sigmoid

Multiple layers of cascaded linear units still produce only linear functions, and we prefer networks capable of representing highly nonlinear functions [Mit97]. A unit is needed such that

1. Its output is a nonlinear function of its inputs
2. Its output is differentiable

The sigmoid unit is one such unit. Its output is (again):

$$\mathrm{out} = \sigma(\vec{w} \cdot \vec{x})$$

where, this time:

$$\sigma(y) = \frac{1}{1 + e^{-y}}$$

The sigmoid function has an easily expressed derivative:
[TODO: Missing figure]

Figure 5.5: Structure of a sigmoid artificial neuron

$$\frac{d\sigma(y)}{dy} = \sigma(y)(1 - \sigma(y))$$
5.3.2 The Back-Propagation algorithm

For a network with multiple output units, the error is summed over all of them:

$$E \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in \mathrm{outputs}} (t_{kd} - o_{kd})^2$$
Back-Propagation
1  w_i ← random  // let w_i assume random, small values
2  while E[w] > ε
3      for each (x, t) ∈ D
4          out_x ← output(x)
5          for each output unit k
6              δ_k ← o_k(1 − o_k)(t_k − o_k)
7          for each hidden unit j
8              δ_j ← o_j(1 − o_j) Σ_{k ∈ outputs} w_{k,j} δ_k
9          for each w_{s,q}
10             w_{s,q} ← w_{s,q} + Δw_{s,q}

where

$$\Delta w_{s,q} = \begin{cases} \eta\, \delta_s x_q & \text{if } s \text{ is a hidden unit} \\ \eta\, \delta_s y_q & \text{if } s \text{ is an output unit} \end{cases}$$
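A compact numpy sketch of the listing above for a single hidden layer of sigmoid units, trained in batch mode on XOR (the function a single perceptron cannot represent, Lemma 5.1). The layer sizes, learning rate, epoch count and bias handling are choices of this sketch, not prescribed by the notes.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop(X, T, n_hidden=3, eta=0.5, epochs=10000, seed=1):
    rng = np.random.default_rng(seed)
    Xb = np.c_[X, np.ones(len(X))]                    # input bias unit
    W1 = rng.normal(scale=0.5, size=(Xb.shape[1], n_hidden))
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, T.shape[1]))
    for _ in range(epochs):
        H = np.c_[sigmoid(Xb @ W1), np.ones(len(X))]  # hidden bias unit
        O = sigmoid(H @ W2)
        delta_k = O * (1 - O) * (T - O)               # output units (line 6)
        delta_j = H[:, :-1] * (1 - H[:, :-1]) * (delta_k @ W2[:-1].T)  # line 8
        W2 += eta * H.T @ delta_k                     # Delta w = eta * delta * input
        W1 += eta * Xb.T @ delta_j
    return W1, W2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = backprop(X, T)
out = sigmoid(np.c_[sigmoid(np.c_[X, np.ones(4)] @ W1), np.ones(4)] @ W2)
print(np.round(out.ravel(), 2))  # typically close to [0 1 1 0]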
6 Data Preprocessing and Model Selection

6.1 Data preprocessing

Vectors

Strings

Mapping continuous features. Continuous features are harder to handle than categorical features. Typically, transforms are applied to features to make them easier to compare. The two main transforms are feature centering and feature standardization (or feature rescaling) [Aio15].

Feature centering:

$$f(x) = x - \bar{x}$$
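A numpy illustration of the two transforms; dividing by the per-feature standard deviation is the usual reading of "feature standardization" and is an assumption of this sketch.

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0]])
centered = X - X.mean(axis=0)            # feature centering: x - x_bar
standardized = centered / X.std(axis=0)  # rescale each feature to unit variance
print(centered.mean(axis=0))             # ~[0. 0.]
print(standardized.std(axis=0))          # [1. 1.]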
6.2 Model Selection

Model Selection is the phase when the best values for the hyperparameters are selected for the task at hand. Hyperparameter optimization contrasts with actual learning problems, which are also often cast as optimization problems, but optimize a loss function on the training set alone.²

In effect, learning algorithms learn parameters that model/reconstruct their inputs well, while hyperparameter optimization ensures the model does not overfit its data by tuning, e.g., regularization.

Bias and Variance.³ In statistics and machine learning, the bias-variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set: the bias (error from erroneous assumptions in the learning algorithm) and the variance (error from sensitivity to small fluctuations in the training set).
6.3

Most of the time, the learner is parametric. These parameters should be optimized by testing which values yield the best effectiveness. It is possible to show that the evaluation performed in Step 2 gives an unbiased estimate of the error of a classifier learnt with the same parameters and with a training set of cardinality $|Tr| - |Va| < |Tr|$.
² From http://web.archive.org/web/20160113235232/https://en.wikipedia.org/wiki/Hyperparameter_optimization
³ From http://web.archive.org/web/20160116105049/https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
Hold-Out
1  let V a ⊂ T r
2  for all p in the hyperparameter range
3      l_p ← Learn(T r \ V a)
4      measure the performance of l_p on V a
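A sketch of the Hold-Out loop in Python; the split fraction, the accuracy metric and the `learn(X, y, p)` interface (a function returning a predictor) are choices of this sketch.

import numpy as np

def hold_out(X, y, learn, grid, frac=0.2, seed=0):
    """For each hyperparameter value p: learn on Tr \\ Va, score on Va."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_va = int(frac * len(X))
    va, tr = idx[:n_va], idx[n_va:]         # Va subset of Tr, rest for learning
    scores = {}
    for p in grid:
        predict = learn(X[tr], y[tr], p)    # learner returns a predictor
        scores[p] = np.mean(predict(X[va]) == y[va])  # accuracy on Va
    return max(scores, key=scores.get), scores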
7 Support Vector Machines

7.1 Structural Risk Minimization

The idea behind Structural Risk Minimization is to find solutions that both minimize the empirical risk (or empirical error) and have low VC dimension. [Beb16]

Empirical error. We recall:

$$R_{emp} = \frac{1}{n} \sum_{i=1}^{n} [z_i - g(\vec{x}_i)]^2$$
VC Dimension and Capacity. To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled. Functions with high capacity are more complicated (i.e., have many degrees of freedom). [Beb16]

It can be shown that, with probability $(1 - \delta)$:

$$\mathrm{err}_{true} \leq \mathrm{err}_{train} + \sqrt{\frac{VC\left(\log\frac{2n}{VC} + 1\right) - \log\frac{\delta}{4}}{n}}$$

Vapnik has shown that maximizing the margin of separation (i.e., the empty space between classes) is equivalent to minimizing the VC dimension. The optimal hyperplane is the one giving the largest margin of separation between the classes.

The margin is defined by the distance of the nearest training samples from the hyperplane; we refer to these samples as support vectors. Intuitively speaking, these are the most difficult samples to classify. [Beb16]
7.2
Consider a training set $\{(\vec{x}_i, y_i)\}$ and a labelling

$$y_i = \begin{cases} +1 & g(\vec{x}_i) > 0 \\ -1 & \text{otherwise} \end{cases}$$

The distance of a point from the hyperplane is measured along the unit normal $\frac{\vec{w}}{\|\vec{w}\|}$:

$$r = \frac{g(\vec{x})}{\|\vec{w}\|}$$

The hyperplane does not change when its normal vector is scaled. Therefore, to constrain the length of $\vec{w}$ for uniqueness, we impose:

$$r\|\vec{w}\| = 1$$

For the optimal hyperplane, the margin to one of the closest positive examples is equal to that to one of the closest negative examples, so the margin is defined as

$$\rho = \frac{2}{\|\vec{w}\|}$$

Also, if the examples are linearly separable with some margin $r$, we have:

$$\frac{y_i\, g(\vec{x}_i)}{\|\vec{w}\|} \geq r \quad i = 1, \dots, n$$

The VC dimension of such margin hyperplanes is bounded by

$$VC \leq \min\left(\left\lceil \frac{R^2}{\rho^2} \right\rceil, m\right) + 1$$
The dual problem is

$$\max \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j\, (\vec{x}_i \cdot \vec{x}_j)$$

subject to

$$\forall i \in \{1, \dots, n\} : \alpha_i \geq 0 \qquad \sum_{i=1}^{n} y_i \alpha_i = 0$$

The support vectors are the examples $\vec{x}_i$ with $\alpha_i > 0$, $i = 1, \dots, n$.

In the linearly inseparable case, we are forced to allow some constraints to be broken. [Spe08]
The primal problem becomes the minimization of

$$\frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$

where $C > 0$ controls the tradeoff between the complexity of the hypothesis space and the number of linearly inseparable examples.

The dual problem is

$$\max \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j\, (\vec{x}_i \cdot \vec{x}_j)$$

with

$$\forall i \in \{1, \dots, n\} : 0 \leq \alpha_i \leq C \qquad \sum_{i=1}^{n} y_i \alpha_i = 0$$
Mapping to Higher Dimensions. The naive solution does not usually work well [Spe08]: a hyperplane is still a hyperplane, and it can only impose a dichotomy on the instance space. We will instead use a two-step strategy as follows:

1. We map the input space to a feature space with much higher dimension
2. We naively compute the optimal hyperplane (as seen above) in the feature space

Soundness of the two-step strategy. Step 2 is trivially justified by the fact that the optimal hyperplane minimizes the VC dimension and thus improves generalization.

Theorem 7.2 (Cover's Theorem). A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.

By Cover's theorem, Step 1 is justified. In particular, Step 1 equates to considering a non-linear transformation $\phi(\cdot) : \mathbb{R}^m \to X^M$ for the original $\{(\vec{x}_i, y_i)\}_i^n$, where $M \gg m$.
The strategy in practice. We can assume that every new coordinate in the feature space $X^M$ is generated by a non-linear function $\phi_j$. The hyperplane in the feature space is therefore

$$\sum_{j=1}^{M} w_j \phi_j(\vec{x}) + b = 0$$

or, with $\phi_0(\vec{x}) \equiv 1$ and $w_0 = b$,

$$\sum_{j=0}^{M} w_j \phi_j(\vec{x}) = \vec{w} \cdot \vec{\phi}(\vec{x}) = 0$$

Let

$$\vec{w} = \sum_{k=1}^{n} y_k \alpha_k \vec{\phi}(\vec{x}_k)$$
7.2.3

[Spe08] Observe first that for

$$\vec{w} = \sum_{k=1}^{n} y_k \alpha_k \vec{\phi}(\vec{x}_k)$$

the hyperplane becomes

$$\sum_{k=1}^{n} y_k \alpha_k\, \vec{\phi}(\vec{x}_k) \cdot \vec{\phi}(\vec{x}) = 0$$

The term $\vec{\phi}(\vec{x}_k) \cdot \vec{\phi}(\vec{x})$ is the scalar product between the vectors induced by the $k$-th learning instance and the input vector $\vec{x}$. Suppose there exists a symmetrical function $K(\cdot, \cdot)$ such that

$$K(\vec{x}_k, \vec{x}) = \vec{\phi}(\vec{x}_k) \cdot \vec{\phi}(\vec{x}) = \sum_{j=0}^{M} \phi_j(\vec{x}_k) \phi_j(\vec{x})$$

Then the hyperplane can be written as

$$\sum_{k=1}^{n} y_k \alpha_k K(\vec{x}_k, \vec{x}) = 0$$

This is much simpler because, while it is one equation for the hyperplane in the feature space, the actual transformation to the feature space does not need to be computed.
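The resulting decision function can thus be evaluated using kernel values alone, never computing $\phi$ explicitly. A sketch, with the Gaussian (RBF) kernel as an example choice of $K$ and an explicit bias term $b$ added, both assumptions of this sketch:

import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def decision(x, support_X, support_y, alphas, b=0.0, K=rbf_kernel):
    """sign of sum_k y_k alpha_k K(x_k, x) + b: only kernel evaluations needed."""
    s = sum(yk * ak * K(xk, x)
            for xk, yk, ak in zip(support_X, support_y, alphas))
    return np.sign(s + b)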
Mercer's condition guarantees that such a function corresponds to a scalar product in some feature space: $K$ admits an expansion

$$K(\vec{x}, \vec{x}') = \sum_{i=1}^{\infty} \lambda_i \phi_i(\vec{x}) \phi_i(\vec{x}')$$

with non-negative $\lambda_i$ if and only if

$$\int_{\vec{a}}^{\vec{b}} \int_{\vec{a}}^{\vec{b}} K(\vec{x}, \vec{x}')\, \psi(\vec{x})\, \psi(\vec{x}')\, d\vec{x}\, d\vec{x}' \geq 0$$

for every $\psi$ such that

$$\int_{\vec{a}}^{\vec{b}} \psi^2(\vec{x})\, d\vec{x} < \infty$$
8 Bayesian Learning

Bayesian learning methods are relevant to our study of machine learning for two different reasons [Mit97]:

1. Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.
2. Bayesian methods provide a useful conceptual framework [Aio15] and a standard for comparison against other algorithms.

Features of Bayesian learning include:

Bayesian methods can accommodate hypotheses that make probabilistic predictions
Prior knowledge can be combined with observed data to determine the final probability of a hypothesis
Bayes' theorem states:

$$P(h|D) = \frac{P(D|h)P(h)}{P(D)}$$

where:

$P(h)$ is the a priori probability of $h$
$P(D)$ is the a priori probability of training data $D$
Generally, we want to select the most probable $h$ given the training data $D$, i.e. the maximum a posteriori hypothesis:

$$h_{MAP} \equiv \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)P(h)$$

In some cases, we will assume that every hypothesis in $H$ is equally probable a priori; assuming $P(h_i) = P(h_j)$ we can further simplify and choose the maximum likelihood hypothesis:

$$h_{ML} \equiv \arg\max_{h \in H} P(D|h)$$
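A toy numeric illustration of how the prior can make $h_{MAP}$ and $h_{ML}$ disagree; all the numbers are made up for the example.

priors = {'h1': 0.8, 'h2': 0.1, 'h3': 0.1}       # P(h)
likelihoods = {'h1': 0.2, 'h2': 0.6, 'h3': 0.7}  # P(D|h)

h_ml = max(likelihoods, key=likelihoods.get)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_ml, h_map)  # ML picks h3 (0.7), but h1's strong prior makes it the MAP choice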
8.0.4 Brute-Force Bayes Concept Learning

We can design a straightforward concept learning algorithm to output the maximum a posteriori hypothesis, based on Bayes' theorem, as follows [Mit97]:

Brute-Force-Bayes
1  for each h_i ∈ H
2      P_i ← P(h_i|D)
3  return h_MAP = h_j s.t. P_j = max P_{1...n}

This algorithm may require significant computation and is thus impractical; still, it is of theoretical interest as a benchmark. In order to specify a learning problem for the Brute-Force algorithm, we must specify what values are to be used for $P(h)$ and for $P(D|h)$ (as we shall see, $P(D)$ will be determined once we choose the other two). [Mit97]
Assume as well that:

The training data $D$ is noise free
Every hypothesis is a priori equally likely: $P(h) = \frac{1}{|H|} \;\; \forall h \in H$

Since we assume noise-free training data, the probability of observing classification $d_i$ given $h$ is just 1 if $d_i = h(x_i)$ and 0 otherwise:

$$P(D|h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \; \forall d_i \in D \\ 0 & \text{otherwise} \end{cases}$$

Then:

$$P(h|D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$$

8.0.5 Minimum Description Length Principle
Rewriting $h_{MAP}$ in terms of code lengths,

$$h_{MAP} = \arg\min_{h \in H} L_{C_H}(h) + L_{C_{D|h}}(D|h)$$

where $L_{C_H}$ and $L_{C_{D|h}}$ are the lengths of the optimal encodings for $H$ and for $D$ given $h$, respectively.

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths. Of course, to apply this principle in practice we must choose specific encodings or representations appropriate for the given learning task. Assuming we use the codes $C_1$ and $C_2$ to represent the hypothesis and the data given the hypothesis, we can state the MDL principle as

$$h_{MDL} \equiv \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D|h)$$
8.0.6 Bayes Optimal Classifier

So far we have considered the question "what is the most probable hypothesis given the training data?" In fact, the question that is often of most significance is the closely related one: "what is the most probable classification of the new instance given the training data?" In general, the most probable classification of the new instance is obtained not by $h_{MAP}$ alone, but by combining the predictions of all hypotheses, weighted by their posterior probabilities.

Definition 8.2 (Bayes Optimal Classification).

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i) P(h_i|D)$$

8.0.7 Gibbs classifier
Although the Bayes optimal classifier obtains the best performance that can be achieved from the given training data, it can be quite costly to apply: it computes the posterior probability for every hypothesis in $H$ and then combines the predictions of each hypothesis to classify each new instance.

An alternative, less optimal method is the Gibbs algorithm, defined as follows:

1. Choose a hypothesis $h$ from $H$ at random, according to the posterior probability distribution over $H$
2. Use $h$ to predict the classification of the next instance
8.1 Naive Bayes

Recall that

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j|a_1, a_2, \dots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \dots, a_n|v_j) P(v_j)}{P(a_1, a_2, \dots, a_n)}$$

The key idea behind the naive Bayes classifier is the simplifying assumption that the attribute values are conditionally independent given the target value, and therefore:

$$P(a_1, a_2, \dots, a_n|v_j) = \prod_{i=1}^{n} P(a_i|v_j)$$

So we can rewrite $\arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n|v_j) P(v_j)$ accordingly and obtain [Mit97]:
Definition 8.3 (Naive Bayes classifier).
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i|v_j)$$
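A minimal Python naive Bayes classifier estimating $P(v_j)$ and $P(a_i|v_j)$ by relative frequencies; note this sketch applies no smoothing, which real implementations would add to handle unseen attribute values.

from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(v) and P(a_i | v) from frequency counts."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)          # (class, attr index) -> value counts
    for row, v in zip(rows, labels):
        for i, a in enumerate(row):
            cond[(v, i)][a] += 1
    priors = {v: c / len(labels) for v, c in class_counts.items()}
    return priors, cond, class_counts

def predict_nb(x, priors, cond, class_counts):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    def score(v):
        p = priors[v]
        for i, a in enumerate(x):
            p *= cond[(v, i)][a] / class_counts[v]
        return p
    return max(priors, key=score)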
9 Clustering

Clustering requires:

A distance metric
A partitioning criterion

9.1 Clustering and Objective Functions

The number of possible partitions of $n$ items into $K$ clusters is of the order of

$$\frac{K^n}{K!}$$
9.2 Criteria for Clustering

We can think of two broad kinds of criteria for evaluating a clustering [Aio15]:

1. Internal criteria
2. External criteria

Definition 9.2 (Internal Criterion). An internal criterion for the quality of a clustering is one that relies on the measured quality of a clustering that depends on both the document representation and the similarity measure used. A good clustering will produce high quality clusters in which

1. The intra-class (that is, intra-cluster) similarity is high
2. The inter-class (that is, inter-cluster) similarity is low
One external criterion is the purity of a cluster $\pi_i$ with respect to a class $c$:

$$\frac{|\pi_i \cap c|}{|\pi_i|}$$

Other external criteria are the entropy of classes in clusters (or the mutual information between classes and clusters).
The Rand Index. The Rand index or Rand measure (named after William M. Rand) is a measure of the similarity between two data clusterings.⁴

Definition 9.5 (Rand Index). Given a set of $n$ elements $S = \{o_1, \dots, o_n\}$ and two partitions of $S$ to compare, $X = \{X_1, \dots, X_r\}$, a partition of $S$ into $r$ subsets, and $Y = \{Y_1, \dots, Y_s\}$, a partition of $S$ into $s$ subsets, define the following:

1. $a$, the number of pairs of elements in $S$ that are in the same set in $X$ and in the same set in $Y$
2. $b$, the number of pairs of elements in $S$ that are in different sets in $X$ and in different sets in $Y$
3. $c$, the number of pairs of elements in $S$ that are in the same set in $X$ and in different sets in $Y$
4. $d$, the number of pairs of elements in $S$ that are in different sets in $X$ and in the same set in $Y$

The Rand index, $R$, is:

$$R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}$$
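A direct Python transcription of the definition, counting agreeing pairs; the two example labelings are arbitrary.

from itertools import combinations

def rand_index(X_labels, Y_labels):
    """a: pairs together in both clusterings; b: pairs apart in both."""
    a = b = 0
    pairs = list(combinations(range(len(X_labels)), 2))
    for i, j in pairs:
        same_x = X_labels[i] == X_labels[j]
        same_y = Y_labels[i] == Y_labels[j]
        a += same_x and same_y
        b += (not same_x) and (not same_y)
    return (a + b) / len(pairs)            # = (a + b) / C(n, 2)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 0.8333...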
9.3 Clustering Algorithms

1. Flat algorithms
2. Hierarchical algorithms

9.3.1 k-Means clustering

Each cluster $c$ is represented by its centroid:

$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$

⁴ http://web.archive.org/web/20151223080751/https://en.wikipedia.org/wiki/Rand_index
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Figure 9.1: k-Means algorithm
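A numpy sketch of the four steps of Figure 9.1 with Euclidean distance; the initialization scheme and toy data are choices of this sketch, and there is no handling of empty clusters.

import numpy as np

def k_means(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)                             # step 2
        new = np.array([X[assign == j].mean(axis=0) for j in range(k)])  # step 3
        if np.allclose(new, centroids):                       # step 4: stable
            break
        centroids = new
    return centroids, assign

X = np.array([[0.0, 0], [0, 1], [10, 10], [10, 11]])
print(k_means(X, 2)[0])  # two centroids, near (0, 0.5) and (10, 10.5)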
9.4 Hierarchical Clustering

Complete-link. In complete-link clustering or complete-linkage clustering, the similarity of two clusters is the similarity of their least (cosine-)similar members. This is equivalent to choosing the cluster pair whose merge has the smallest diameter. This complete-link merge criterion is non-local: the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but also causes sensitivity to outliers.

Centroid. In centroid clustering, the similarity of two clusters is defined as the similarity of their centroids:

$$\mathrm{Centr\text{-}Simil}(\omega_i, \omega_j) = \vec{\mu}(\omega_i) \cdot \vec{\mu}(\omega_j)$$

[MRS08]
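Complete-link agglomerative clustering is available off the shelf in SciPy; a short sketch, where the data and the choice to cut the dendrogram into two clusters are arbitrary.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0], [0, 1], [10, 10], [10, 11]])
Z = linkage(X, method='complete')                # merge by the least similar pair
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
print(labels)                                    # e.g. [1 1 2 2]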
10 Feature Selection

1. Feature Selection
2. Feature Extraction

10.1 Feature Selection

Filter methods. In filter methods, the general characteristics of the training set are considered. Feature selection with filter methods is a pre-processing step independent of the prediction algorithm.

Wrapper methods. Features are selected according to their predictive capabilities, typically with a hold-out set.

Embedded methods

Advantages of Feature Selection Methods. They result in the removal of irrelevant or redundant features and in better interpretability of the predictive model. [Aio15] Feature Selection methods are favored for applications where interpretability is more relevant than accuracy.
10.2 Feature Extraction
11 Recommendation Systems

11.1 Feedback in a RS

There are two broad kinds of feedback that can be had in a recommendation system [Aio15]:

Explicit Feedback. Explicit feedback can take the shape of an ordering or a preference expressed by the user between 2 or among n items, or a per-item rating expressed by the user.

Implicit Feedback. Consider for example the list of items bought by the user, his/her connection network, or the time the user spends on a web page.
11.2 Approaches to RS
11.3 Tasks in RS
References

[Aio15]

[Alp10] E. Alpaydin. Introduction to Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2010.

[Beb16]

[Mit97] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[MRS08] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[MRT12] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2012.

[Spe08] Alessandro Sperduti. Lecture notes in intelligent systems. http://www.math.unipd.it/~sperduti/SI08/, 2008.