Machine Learning
Hamid Beigy
Fall 1393
Hamid Beigy (Sharif University of Technology) — Classification based on Bayes decision theory — Fall 1393
1 Introduction
Outline
1 Introduction
2 Bayes decision theory
Minimizing the classification error probability
Minimizing the average risk
Discriminant function and decision surface
Bayesian classifiers for Normally distributed classes
Minimum distance classifier
Bayesian classifiers for independent binary features
3 Supervised learning of the Bayesian classifiers
Parametric methods for density estimation
Maximum likelihood parameter estimation
Bayesian estimation
Maximum a posteriori estimation
Mixture models for density estimation
Nonparametric methods for density estimation
Histogram estimator
Naive estimator
Kernel estimator
k−Nearest neighbor estimator
k−Nearest neighbor classifier
Introduction
In classification, the goal is to find a mapping from inputs X to outputs t, given a labeled set of input-output pairs.
Introduction (cont.)
Bayes theorem: p(Ck|x) = p(x|Ck) p(Ck) / p(x)
Bayes decision theory
Given a classification task of M classes, C1, C2, . . . , CM, and an input vector x, we can form M conditional probabilities

p(Ck|x), ∀k = 1, 2, . . . , M

Without loss of generality, consider the two-class classification problem. From the Bayes theorem, we have

p(Ck|x) = p(x|Ck) p(Ck) / p(x)
The coloured region may produce errors. Assuming equal priors p(C1) = p(C2) = 1/2, the probability of error equals

Pe = p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
   = (1/2) ∫_{R1} p(x|C2) dx + (1/2) ∫_{R2} p(x|C1) dx
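As a sanity check on this formula, the error integral can be evaluated numerically. The following sketch uses a hypothetical pair of one-dimensional Gaussian class conditionals with equal priors; the means, variance, grid range, and resolution are illustrative choices, not values from the slides.

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian pdf N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_error(mu1, mu2, var, n=200_000, lo=-10.0, hi=10.0):
    """Pe = (1/2) int_{R1} p(x|C2) dx + (1/2) int_{R2} p(x|C1) dx,
    where R1 = {x : p(x|C1) > p(x|C2)} under equal priors (midpoint rule)."""
    h = (hi - lo) / n
    pe = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p1, p2 = gauss(x, mu1, var), gauss(x, mu2, var)
        # on each region the density being integrated is the smaller one
        pe += 0.5 * min(p1, p2) * h
    return pe

# Two unit-variance classes at 0 and 2: the Bayes error is Phi(-1) ≈ 0.1587.
print(round(bayes_error(0.0, 2.0, 1.0), 3))  # prints 0.159
```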
Minimizing the classification error probability
Pe = p(x ∈ R2, C1) + p(x ∈ R1, C2)
   = p(x ∈ R2|C1) p(C1) + p(x ∈ R1|C2) p(C2)
   = p(C1) ∫_{R2} p(x|C1) dx + p(C2) ∫_{R1} p(x|C2) dx

Since R1 ∪ R2 covers all the feature space, from the definition of the probability density function we have

p(C1) = ∫_{R1} p(C1|x) p(x) dx + ∫_{R2} p(C1|x) p(x) dx
Minimizing the classification error probability (cont.)
Then R2 becomes the region where the reverse is true, i.e. it is the region of the space in which

p(C1|x) − p(C2|x) < 0

This completes the proof of the theorem.
For a classification task with M classes, x is assigned to class Ck if p(Ck|x) > p(Cj|x) for all j ≠ k.
Minimizing the average risk
The goal is to partition the feature space so that the average risk is minimized:

r = ∑_{k=1}^{M} rk p(Ck)
  = ∑_{i=1}^{M} ∫_{Ri} ( ∑_{k=1}^{M} λki p(x|Ck) p(Ck) ) dx
Minimizing the average risk (cont.)
The average risk is equal to

r = ∑_{i=1}^{M} ∫_{Ri} ( ∑_{k=1}^{M} λki p(x|Ck) p(Ck) ) dx

When λki = 1 (for k ≠ i), minimizing the average risk is equivalent to minimizing the classification error probability.
In the two-class case, we have

r1 = λ11 p(x|C1) p(C1) + λ21 p(x|C2) p(C2)
r2 = λ12 p(x|C1) p(C1) + λ22 p(x|C2) p(C2)
Minimizing the average risk (cont.)
In other words,

x ∈ C1 (C2) if p(x|C1)/p(x|C2) > (<) [p(C2)/p(C1)] · [(λ21 − λ22)/(λ12 − λ11)]

Assume that the loss matrix has the form

Λ = ( 0    λ12 )
    ( λ21  0   )

Then we have

x ∈ C1 (C2) if p(x|C1)/p(x|C2) > (<) [p(C2)/p(C1)] · (λ21/λ12)

When p(C1) = p(C2) = 1/2, we have

x ∈ C1 (C2) if p(x|C1) > (<) (λ21/λ12) p(x|C2)

If λ21 > λ12, then x is assigned to C2 if

p(x|C2) > (λ12/λ21) p(x|C1)

That is, p(x|C1) is multiplied by a factor less than 1, and the effect is the movement of the threshold to the left of x0.
Minimizing the average risk (example)
In a two-class problem with a single feature x, the distributions of the two classes are

p(x|C1) = (1/√π) exp(−x²)
p(x|C2) = (1/√π) exp(−(x − 1)²)

The threshold x0 = (1 − ln 2)/2 < 1/2 is the solution of the resulting threshold equation.
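The threshold can be verified numerically. One detail is implicit in the slide: x0 = (1 − ln 2)/2 is the solution of p(x|C1) = 2·p(x|C2), i.e. it presumes a loss ratio λ21/λ12 = 2 with equal priors; that ratio is an assumption inferred here, flagged as such.

```python
import math

def p1(x):  # p(x|C1)
    return math.exp(-x * x) / math.sqrt(math.pi)

def p2(x):  # p(x|C2)
    return math.exp(-(x - 1) ** 2) / math.sqrt(math.pi)

def f(x):
    """Threshold equation p(x|C1) - (lam21/lam12) p(x|C2), ratio assumed 2."""
    return p1(x) - 2.0 * p2(x)

lo, hi = -1.0, 1.0            # f(-1) > 0 > f(1): a root is bracketed
for _ in range(60):           # plain bisection
    mid = 0.5 * (lo + hi)
    if f(lo) * f(mid) <= 0:
        hi = mid
    else:
        lo = mid
x0 = 0.5 * (lo + hi)

print(round(x0, 6), round((1 - math.log(2)) / 2, 6))  # both 0.153426
```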
Discriminant function for Normally distributed classes
For Normally distributed classes, we have the following quadratic form classifier:

gi(x) = −(1/2) x^T Σi^{-1} x + (1/2) x^T Σi^{-1} µi − (1/2) µi^T Σi^{-1} µi + (1/2) µi^T Σi^{-1} x + wi0

Assume

Σi = ( σi²  0   )
     ( 0    σi² )

Thus we have

gi(x) = −(1/(2σi²)) (x1² + x2²) + (1/σi²) (µi1 x1 + µi2 x2) − (1/(2σi²)) (µi1² + µi2²) + wi0
Discriminant function for Normally distributed classes (cont.)
The discriminant functions for the optimal classifier when the involved pdfs are N(µi, Σi) have the following form:

gi(x) = −(1/2) (x − µi)^T Σi^{-1} (x − µi) + wi0
wi0 = −(D/2) ln(2π) − (1/2) ln |Σi| + ln p(Ci)

By expanding the above equation, we obtain the following quadratic form:

gi(x) = −(1/2) x^T Σi^{-1} x + (1/2) x^T Σi^{-1} µi − (1/2) µi^T Σi^{-1} µi + (1/2) µi^T Σi^{-1} x + wi0
Based on the above equations, we distinguish three distinct cases:
Σi = σ²I, where σ² is a scalar and I is the identity matrix;
Σi = Σ, i.e. all classes have equal covariance matrices;
Σi is arbitrary.
Discriminant function for Normally distributed classes (when Σi = σ²I)
The discriminant functions for the optimal classifier when the involved pdfs are N(µi, Σi) have the following form:

gi(x) = −(1/2) (x − µi)^T Σi^{-1} (x − µi) + wi0

By replacing Σi = σ²I in the above equation, we obtain

gi(x) = −(1/(2σ²)) (x − µi)^T (x − µi) + wi0
      = −||x − µi||² / (2σ²) + wi0
      = −(1/(2σ²)) (x^T x − 2 µi^T x + µi^T µi) + wi0

The term x^T x and other class-independent constants are equal for all classes, so they can be dropped:

gi(x) = (1/σ²) µi^T x − (1/(2σ²)) µi^T µi + wi0

This implies that the decision surface is a hyperplane passing through the point x0.
For any x on the decision hyperplane, the vector (x − x0) also lies on the hyperplane, and hence (µi − µj) is orthogonal to the decision hyperplane.
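A minimal sketch of this classifier in code, with hypothetical class means, variance, and priors. With equal priors the rule reduces to assigning x to the nearest mean, which is why it is called the minimum distance classifier.

```python
import math

def g_i(x, mu, sigma2, prior):
    """g_i(x) = mu_i^T x / sigma^2 - mu_i^T mu_i / (2 sigma^2) + ln p(C_i)."""
    dot = sum(m * v for m, v in zip(mu, x))
    norm2 = sum(m * m for m in mu)
    return dot / sigma2 - norm2 / (2 * sigma2) + math.log(prior)

mus = [(0.0, 0.0), (4.0, 4.0)]   # hypothetical class means
priors = [0.5, 0.5]              # equal priors -> pure minimum-distance rule
sigma2 = 1.0

def classify_mindist(x):
    scores = [g_i(x, mu, sigma2, p) for mu, p in zip(mus, priors)]
    return scores.index(max(scores))

# (1, 1) is nearer to (0, 0); (3, 3) is nearer to (4, 4).
print(classify_mindist((1.0, 1.0)), classify_mindist((3.0, 3.0)))  # prints 0 1
```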
Discriminant function for Normally distributed classes (Σi = σ²I) (cont.)
When p(Ci) = p(Cj), then x0 = (1/2)(µi + µj) and the hyperplane passes through the average of µi and µj.
When p(Ci) < p(Cj), the hyperplane is located closer to µi.
When p(Ci) > p(Cj), the hyperplane is located closer to µj.
If σ² is small with respect to ||µi − µj||, the location of the hyperplane is insensitive to the values of p(Ci) and p(Cj).
Discriminant function for Normally distributed classes (Σi = Σ)
The discriminant functions for the optimal classifier when the involved pdfs are N(µi, Σi) have the following form:

gi(x) = −(1/2) (x − µi)^T Σi^{-1} (x − µi) + wi0

By replacing Σi = Σ in the above equation, we obtain

gi(x) = −(1/2) (x − µi)^T Σ^{-1} (x − µi) + wi0
      = −(1/2) x^T Σ^{-1} x + (1/2) x^T Σ^{-1} µi − (1/2) µi^T Σ^{-1} µi + (1/2) µi^T Σ^{-1} x + wi0
      = −(1/2) x^T Σ^{-1} x + µi^T Σ^{-1} x − (1/2) µi^T Σ^{-1} µi + wi0

The term x^T Σ^{-1} x and other class-independent constants are equal for all classes and can be dropped. This gives

gi(x) = µi^T Σ^{-1} x − (1/2) µi^T Σ^{-1} µi + ln p(Ci)

This is a linear discriminant function

g′i(x) = wi^T x + w′i0

with the following parameters:

wi = Σ^{-1} µi
w′i0 = −(1/2) µi^T Σ^{-1} µi + ln p(Ci)
Discriminant function for Normally distributed classes (Σi = Σ) (cont.)
The decision hyperplane is no longer orthogonal to the vector (µi − µj) but to its linear transformation Σ^{-1}(µi − µj).
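The shared-covariance discriminant can be sketched in a few lines of code. The covariance matrix, means, and priors below are hypothetical; the helper functions implement the 2×2 linear algebra by hand to keep the sketch self-contained.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def g_lda(x, mu, Sinv, prior):
    """g_i(x) = mu_i^T Sigma^{-1} x - (1/2) mu_i^T Sigma^{-1} mu_i + ln p(C_i)."""
    w = matvec(Sinv, mu)                      # w_i = Sigma^{-1} mu_i
    return dot(w, x) - 0.5 * dot(w, mu) + math.log(prior)

Sigma = [[2.0, 0.5], [0.5, 1.0]]              # hypothetical shared covariance
Sinv = inv2(Sigma)
mu1, mu2 = [0.0, 0.0], [3.0, 3.0]

def classify_lda(x):
    return 1 if g_lda(x, mu1, Sinv, 0.5) >= g_lda(x, mu2, Sinv, 0.5) else 2

print(classify_lda([0.5, 0.5]), classify_lda([2.5, 2.5]))  # prints 1 2
```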
Discriminant function for Normally distributed classes (arbitrary Σi)
The discriminant functions for the optimal classifier when the involved pdfs are N(µi, Σi) have the following form:

gi(x) = −(1/2) (x − µi)^T Σi^{-1} (x − µi) + ln p(Ci) − (D/2) ln(2π) − (1/2) ln |Σi|

The discriminant functions cannot be simplified much further. Only the constant term (D/2) ln(2π) can be dropped.
The discriminant functions are not linear but quadratic.
They have much more complicated decision regions than the linear classifiers of the two previous cases.
The decision surfaces are also quadratic, and the decision regions do not even have to be connected sets.
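A one-dimensional sketch makes the last point concrete. Take two hypothetical classes with the same mean but different variances: C1 = N(0, 1) and C2 = N(0, 4) with equal priors. The narrow class wins near the origin, and the wide class wins in both tails, so the region assigned to C2 is disconnected.

```python
import math

def g_qda(x, mu, var, prior):
    """1-D quadratic discriminant (the shared -(1/2) ln(2 pi) term dropped):
    g_i(x) = -(x - mu_i)^2 / (2 var_i) - (1/2) ln var_i + ln p(C_i)."""
    return -(x - mu) ** 2 / (2 * var) - 0.5 * math.log(var) + math.log(prior)

def classify_qda(x):
    return 1 if g_qda(x, 0.0, 1.0, 0.5) >= g_qda(x, 0.0, 4.0, 0.5) else 2

# C1 wins near the origin, C2 wins in BOTH tails, so the region
# assigned to C2 is a disconnected union of two rays.
print(classify_qda(0.0), classify_qda(2.0), classify_qda(-2.0))  # prints 1 2 2
```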
Bayesian classifiers for independent binary features
qij (for j = 1, 2, . . . , D) are the parameters of the class-conditional density of class Ci.
The discriminant function is

gi(x) = ln p(Ci|x) = ln p(x|Ci) p(Ci) = ln [ ∏_{j=1}^{D} qij^{xj} (1 − qij)^{1−xj} ] p(Ci)
      = ∑_{j=1}^{D} [ xj ln qij + (1 − xj) ln(1 − qij) ] + ln p(Ci)
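This discriminant is a sum of per-feature log terms plus the log prior. Below is a minimal sketch for two classes; the Bernoulli parameters qij, the priors, and the test vector are hypothetical values chosen for illustration.

```python
import math

def g_binary(x, q, prior):
    """g_i(x) = sum_j [x_j ln q_ij + (1 - x_j) ln(1 - q_ij)] + ln p(C_i)."""
    s = math.log(prior)
    for xj, qj in zip(x, q):
        s += xj * math.log(qj) + (1 - xj) * math.log(1 - qj)
    return s

# Hypothetical Bernoulli parameters for D = 3 binary features:
# C1 tends to switch bits on, C2 tends to leave them off.
q1 = [0.9, 0.8, 0.7]
q2 = [0.1, 0.2, 0.3]

x = [1, 1, 0]
label = 1 if g_binary(x, q1, 0.5) >= g_binary(x, q2, 0.5) else 2
print(label)  # prints 1
```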
Supervised learning of the Bayesian classifiers
We assumed that the class-conditional pdfs p(x|Ci) and the prior probabilities p(Ci) were known. In practice, this is never the case, and we study supervised learning of the class-conditional pdfs.
For supervised learning we need training samples. The training set contains feature vectors from each class, and we re-arrange the training samples based on their classes.
We assumed that the probability density functions are known. In most cases, these probability density functions are not known, and the underlying pdf must be estimated from the available data.
There are various ways to estimate the probability density functions.
If we know the type of the pdf, we can estimate its parameters, such as the mean and variance, from the available data. These methods are known as parametric methods.
In the estimative approach to parametric density estimation, we use an estimate of the parameter θj in the parametric density:

p(x|Cj) = p(x|θ̂j)
Parametric methods for density estimation
In parametric methods, we assume that the sample is drawn from some known distribution (for example, Gaussian). But the parameters of this distribution are not known, and our goal is to estimate these parameters from the data.
The main advantage of the parametric methods is that the model is defined by a small number of parameters, and once these parameters are estimated, the whole distribution is known.
The following methods are usually used to estimate the parameters of the distribution:
maximum likelihood estimation
Bayesian estimation
maximum a posteriori probability estimation
maximum entropy estimation
mixture models
Maximum likelihood parameter estimation
Consider an M-class problem with feature vectors distributed according to p(x|Ci) (for i = 1, 2, . . . , M).
We assume that p(x|Ci) belongs to some family of parametric distributions. For example, we assume that p(x|Ci) is a normal density with unknown parameters θi = (µi, Σi).
To show the dependence on θi, we denote p(x|Ci) = p(x|Ci; θi). The class Ci defines the parametric family, and the parameter vector θi defines the member of that parametric family.
The parametric families do not need to be the same for all classes.
Our goal is to estimate the unknown parameters using a set of known feature vectors in each class.
If we assume that data from one class do not affect the parameter estimation of the others, we can formulate the problem independently of classes, simplify our notation to p(x; θ), and solve the problem for each class independently.
Let X = {x1, x2, . . . , xN} be random samples drawn from the pdf p(x; θ). We form the joint pdf p(X; θ).
Assuming statistical independence between the different samples, we have

p(X; θ) = p(x1, x2, . . . , xN; θ) = ∏_{k=1}^{N} p(xk; θ)
Maximum likelihood parameter estimation (cont.)
p(X; θ) is a function of θ and is known as the likelihood function.
The maximum likelihood (ML) method estimates θ so that the likelihood function takes its maximum value, that is,

θ̂ML = argmax_θ ∏_{k=1}^{N} p(xk; θ)

A necessary condition that θ̂ML must satisfy in order to be a maximum is that the gradient of the likelihood function with respect to θ be zero:

∂[ ∏_{k=1}^{N} p(xk; θ) ] / ∂θ = 0

It is more convenient to work with the logarithm of the likelihood function than with the likelihood function itself. Hence, we define the log-likelihood function as

LL(θ) = ln ∏_{k=1}^{N} p(xk; θ) = ∑_{k=1}^{N} ln p(xk; θ)
Maximum likelihood parameter estimation (cont.)
In order to find θ̂ML, it must satisfy

∂LL(θ)/∂θ = ∑_{k=1}^{N} ∂ ln p(xk; θ)/∂θ = ∑_{k=1}^{N} [1/p(xk; θ)] ∂p(xk; θ)/∂θ = 0

(Figure: the single-unknown-parameter case.)
Maximum likelihood estimation for normal distribution
Let x1, x2, . . . , xN be vectors sampled from a normal distribution with known covariance matrix and unknown mean, that is,

p(x; µ) = N(µ, Σ) = [1 / ((2π)^{D/2} |Σ|^{1/2})] exp( −(1/2) (x − µ)^T Σ^{-1} (x − µ) )

Obtain the ML estimate of the unknown mean vector.
For N available samples, we have

LL(µ) = ln ∏_{k=1}^{N} p(xk; µ) = −(N/2) ln[(2π)^D |Σ|] − (1/2) ∑_{k=1}^{N} (xk − µ)^T Σ^{-1} (xk − µ)

Taking the gradient with respect to µ and setting it to zero, we obtain

∂LL(µ)/∂µ = ∑_{k=1}^{N} Σ^{-1} (xk − µ) = 0

or

µ̂ML = (1/N) ∑_{k=1}^{N} xk

That is, the ML estimate of the mean, for Gaussian densities, is the sample average.
Maximum likelihood estimation for normal distribution (cont.)
Assume x1, x2, . . . , xN have been generated by a one-dimensional Gaussian pdf of known mean µ but unknown variance, that is,

p(x; σ²) = [1/(σ√(2π))] exp( −(x − µ)²/(2σ²) )

Obtain the ML estimate of the unknown variance.
For N available samples, we have

LL(σ²) = ln ∏_{k=1}^{N} p(xk; σ²) = −(N/2) ln(2πσ²) − (1/(2σ²)) ∑_{k=1}^{N} (xk − µ)²

Taking the derivative with respect to σ² and equating it to zero, we obtain

dLL(σ²)/dσ² = −N/(2σ²) + (1/(2σ⁴)) ∑_{k=1}^{N} (xk − µ)² = 0

Solving for σ² gives

σ̂²ML = (1/N) ∑_{k=1}^{N} (xk − µ)²
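The estimate obtained by setting the derivative to zero can be checked with a quick simulation. The true mean, variance, sample size, and seed below are illustrative values, not from the slides.

```python
import math
import random

# With mu known, setting dLL/d(sigma^2) = 0 gives
# sigma^2_ML = (1/N) sum_k (x_k - mu)^2.
random.seed(0)
mu, sigma2 = 2.0, 9.0
xs = [random.gauss(mu, math.sqrt(sigma2)) for _ in range(100_000)]

var_ml = sum((x - mu) ** 2 for x in xs) / len(xs)
print(var_ml)  # close to the true variance 9.0
```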
Evaluating an estimator: bias and variance
If biasθ(θ̂) = 0 for all values of θ, then we say that θ̂ is an unbiased estimator.
Var(µ̂) = Var( ∑_k xk / N ) = (1/N²) ∑_k Var[xk] = Nσ²/N² = σ²/N

E[xk²] = σ² + µ²
E[µ̂²] = σ²/N + µ²

Replacing back, we obtain

E[σ̂²] = [N(σ² + µ²) − N(σ²/N + µ²)] / N = [(N − 1)/N] σ² ≠ σ²

If θ0 is the true value of the unknown parameter in p(x; θ), it can be shown that under generally valid conditions the following are true:
The ML estimate is asymptotically unbiased, that is,

lim_{N→∞} E[θ̂ML] = θ0

The ML estimate is asymptotically efficient; that is, it achieves the lowest possible variance of any estimate (the Cramer-Rao lower bound).
The pdf of the ML estimate, as N → ∞, approaches the Gaussian distribution with mean θ0.
In summary, the ML estimator is asymptotically unbiased, normally distributed, and has the minimum possible variance. However, all of these properties are valid only for large values of N.
If N is small, little can be said about the ML estimates in general.
Bayesian estimation
Sometimes, before looking at a sample, we (or experts of the application) may have some prior information on the possible value range that a parameter θ may take. This information is quite useful and should be used, especially when the sample is small.
The prior information does not tell us exactly what the parameter value is (otherwise we would not need the sample), and we model this uncertainty by viewing θ as a random variable and defining a prior density p(θ) for it.
For example, we may be told that θ is approximately normal and, with 90 percent confidence, lies between 5 and 9, symmetrically around 7.
The prior density p(θ) tells us the likely values that θ may take before looking at the sample. This is combined with what the sample data tell us, p(X|θ), using the Bayes rule, to get the posterior density of θ, which tells us the likely θ values after looking at the sample:

p(θ|X) = p(X|θ) p(θ) / p(X) = p(X|θ) p(θ) / ∫ p(X|θ′) p(θ′) dθ′
Bayesian estimation (cont.)
p(x|θ, X) = p(x|θ), because once we know θ, the sufficient statistics, we know everything about the distribution.
Evaluating the integrals may be quite difficult, except in cases where the posterior has a nice form.
When the full integration is not feasible, we reduce it to a single point.
If we can assume that p(θ|X) has a narrow peak around its mode, then using the maximum a posteriori (MAP) estimate will make the calculation easier.
If p(θ|X) is known, then p(x|X) is the average of p(x|θ) with respect to θ, that is,

p(x|X) = E_θ[p(x|θ)] = ∫ p(x|θ) p(θ|X) dθ
Bayesian estimation (example)
Let p(x|µ) be a univariate Gaussian N(µ, σ²) with unknown mean µ, which is itself assumed to follow a Gaussian N(µ0, σ0²). From the previous slide, we have

p(µ|X) = p(X|µ) p(µ) / p(X) = (1/α) ∏_{k=1}^{N} p(xk|µ) p(µ)

where p(X), being a constant, is denoted by α, or

p(µ|X) = (1/α) ∏_{k=1}^{N} [1/(√(2π)σ)] exp( −(xk − µ)²/(2σ²) ) · [1/(√(2π)σ0)] exp( −(µ − µ0)²/(2σ0²) )

When N samples are given, p(µ|X) turns out to be a Gaussian (show it), that is,

p(µ|X) = [1/(√(2π)σN)] exp( −(µ − µN)²/(2σN²) )

µN = (Nσ0² x̄N + σ² µ0) / (Nσ0² + σ²)
σN² = σ² σ0² / (Nσ0² + σ²)

By some algebraic simplification, we obtain the following Gaussian pdf:

p(x|X) = [1/√(2π(σ² + σN²))] exp( −(x − µN)²/(2(σ² + σN²)) )
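The posterior-update formulas for µN and σN² can be sketched directly in code. The prior, likelihood variance, and data below are hypothetical values chosen to show the typical behavior: the posterior mean moves toward the sample mean and the posterior variance shrinks as 1/N.

```python
def posterior_params(xs, sigma2, mu0, sigma02):
    """mu_N = (N s0^2 xbar + s^2 mu0) / (N s0^2 + s^2),
    sigma_N^2 = s^2 s0^2 / (N s0^2 + s^2)."""
    N = len(xs)
    xbar = sum(xs) / N
    denom = N * sigma02 + sigma2
    return (N * sigma02 * xbar + sigma2 * mu0) / denom, sigma2 * sigma02 / denom

# Hypothetical numbers: prior N(0, 1), likelihood variance 1, and 100
# observations all equal to 4.
mu_N, sigma_N2 = posterior_params([4.0] * 100, sigma2=1.0, mu0=0.0, sigma02=1.0)
print(round(mu_N, 2), round(sigma_N2, 4))  # prints 3.96 0.0099
```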
Maximum a posteriori estimation
In the maximum likelihood estimate, we considered θ as an unknown parameter.
In maximum a posteriori estimation, we consider θ as a random vector, and we estimate its value based on the sample X.
From the Bayes theorem, we have

p(θ|X) = p(X|θ) p(θ) / p(X)

The maximum a posteriori (MAP) estimate θ̂MAP is defined as the point where p(θ|X) becomes maximum.
A necessary condition that θ̂MAP must satisfy in order to be a maximum is that its gradient with respect to θ be zero:

∂p(θ|X)/∂θ = 0    or equivalently    ∂[p(X|θ) p(θ)]/∂θ = 0

The difference between the ML and MAP estimates lies in the involvement of p(θ) in the MAP estimate.
If p(θ) is uniform, both estimates yield identical results.
Maximum a posteriori estimation (example)
For Σ = σ²I, we obtain

µ̂MAP = [ µ0 + (σµ²/σ²) ∑_{k=1}^{N} xk ] / [ 1 + (σµ²/σ²) N ]

When σµ²/σ² ≫ 1, then µ̂MAP ≈ µ̂ML.
Mixture models for density estimation
An alternative way to model an unknown density function p(x) is via a linear combination of M density functions in the form

p(x) = ∑_{m=1}^{M} πm p(x|m)

where

∑_{m=1}^{M} πm = 1,   ∫ p(x|m) dx = 1

This modeling implicitly assumes that each point x may be drawn from any of the M model distributions, with probability πm (for m = 1, 2, . . . , M).
It can be shown that this modeling can closely approximate any continuous density function, for a sufficient number of mixture components M and appropriate model parameters.
Mixture models (cont.)
The first step of the procedure involves the choice of the set of density components p(x|m) in the parametric form p(x|m, θ). Thus we have

p(x; θ) = ∑_{m=1}^{M} πm p(x|θm)

The second step is the computation of the unknown parameters θ1, . . . , θM and π1, π2, . . . , πM based on the set of available training data.
The parameter set is defined as θ = {π1, π2, . . . , πM, θ1, θ2, . . . , θM}, with ∑_i πi = 1.
Given data X = {x1, x2, . . . , xN}, and assuming the mixture model

p(x; θ) = ∑_{m=1}^{M} πm p(x|θm)

we want to estimate the parameters.
In order to estimate each πm, we can count how many points of X come from each of the M components and then normalize by N:

π̂m = Nm / N

Each Nm can be obtained from Nm = ∑_{n=1}^{N} zmn, where zmn = 1 if the nth point was drawn from component m, and zmn = 0 otherwise.
Mixture models (cont.)
The difficulty is that we do not know zmn. This is a major difficulty: because the variables zmn are hidden or latent, our ML estimates cannot follow in the straightforward manner we had anticipated.
The problem is that we assumed knowledge of the values of the indicator variables zmn.
Mixture models (cont.)
We need the joint likelihood of the data X = {x1, x2, . . . , xN} and the indicator variables Z = {z1, z2, . . . , zM}, where each zm = {zm1, zm2, . . . , zmN}.
Given θ = {θ1, θ2, . . . , θM}, we can marginalize over all possible component allocations:

p(X|θ) = ∑_Z p(X, Z|θ)

The summation is over all possible values that Z may take on.
Then log p(X|θ) equals

log p(X|θ) = log ∑_Z p(X, Z|θ) = log ∑_Z p(Z|X) [p(X, Z|θ) / p(Z|X)]

Since each xn is drawn i.i.d. from one of the m distributions exclusively, the summation over all Z reduces to a summation over all n and m, i.e. the log-likelihood (LL) bound equals (derive it)

∑_Z p(Z|X) log [p(X, Z|θ)/p(Z|X)] = ∑_{m,n} p(m|xn) log [p(xn|θm) p(m) / p(m|xn)]
  = ∑_{m=1}^{M} ∑_{n=1}^{N} p(m|xn) log p(xn|θm) p(m) − ∑_{m=1}^{M} ∑_{n=1}^{N} p(m|xn) log p(m|xn)

p(m|xn) is the probability that zmn = 1, and p(m) is the probability that zmn = 1 for any n.
The Expectation-Maximization (EM) algorithm is a general-purpose method to maximize the likelihood of the complete data (X and Z) so as to obtain estimates of the component parameters θm.
Before performing the Maximization step, we require the Expected values of the set of latent variables zmn.
Once we have obtained the Expected values of the latent variables, we perform the Maximization step to obtain the current parameter estimates.
Mixture models (cont.)
This EM interleaving is continued until some convergence criterion is achieved.
Taking derivatives of the LL bound with respect to p(m|xn) gives

∂LL/∂p(m|xn) = log p(xn|θm) p(m) − log p(m|xn) − 1

Setting this to zero, we see that p(m|xn) ∝ p(xn|θm) p(m), and normalizing appropriately yields the distribution

p(m|xn) = p(xn|θm) p(m) / ∑_{m′=1}^{M} p(xn|θm′) p(m′)

You should now be able to see that this is the posterior distribution over the mixture components m which generated xn, i.e. the expected value of the binary variable zmn.
Having maximized the bound with respect to the Expected values of the indicator variables, we need to maximize the bound with respect to the parameter values.
The only terms in LL which depend on the component parameters are

∑_{m=1}^{M} ∑_{n=1}^{N} p(m|xn) log p(xn|θm) p(m)

We also see that we have replaced perfect knowledge of the allocation variables with our current estimates of the posteriors p(m|xn).
We also need an estimate for p(m); taking derivatives, we observe that

p(m) ∝ ∑_{n=1}^{N} p(m|xn)
Expectation maximization (cont.)
In the Maximization step, using the expected values of zmn, i.e. p(m|xn),

µ̂m = ∑_{n=1}^{N} p(m|xn) xn / ∑_{n=1}^{N} p(m|xn)

Σ̂m = ∑_{n=1}^{N} p(m|xn) (xn − µ̂m)(xn − µ̂m)^T / ∑_{n=1}^{N} p(m|xn)

p(m) = (1/N) ∑_{n=1}^{N} p(m|xn)
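The E-step (responsibilities) and M-step (the weighted averages above) can be sketched for a one-dimensional two-component mixture. The data, initial parameters, seed, and iteration count below are hypothetical choices for illustration.

```python
import math
import random

def normpdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, pis, iters=50):
    """EM for a 1-D Gaussian mixture.
    E-step: responsibilities p(m|x_n) proportional to pi_m N(x_n; mu_m, var_m).
    M-step: responsibility-weighted mean, variance, and mixing weights."""
    N, M = len(xs), len(mus)
    for _ in range(iters):
        # E-step: r[n][m] = p(m|x_n)
        r = []
        for x in xs:
            w = [pis[m] * normpdf(x, mus[m], variances[m]) for m in range(M)]
            s = sum(w)
            r.append([wi / s for wi in w])
        # M-step
        for m in range(M):
            Nm = sum(r[n][m] for n in range(N))
            mus[m] = sum(r[n][m] * xs[n] for n in range(N)) / Nm
            variances[m] = sum(r[n][m] * (xs[n] - mus[m]) ** 2
                               for n in range(N)) / Nm
            pis[m] = Nm / N
    return mus, variances, pis

# Hypothetical data: two well-separated components at 0 and 6.
random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(300)] + \
     [random.gauss(6.0, 1.0) for _ in range(300)]
mus, variances, pis = em_gmm_1d(xs, [1.0, 5.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 1) for m in sorted(mus)])  # means recovered near 0 and 6
```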
Nonparametric methods for density estimation
In parametric methods, we assume that the sample is drawn from some known distribution (for example, Gaussian). But the parameters of this distribution are not known, and our goal is to estimate these parameters from the data.
The main advantage of the parametric methods is that the model is defined by a small number of parameters, and once these parameters are estimated, the whole distribution is known.
The methods used to estimate the parameters of the distribution include maximum likelihood estimation and Bayesian estimation.
Why nonparametric methods for density estimation?
Common parametric forms do not always fit the densities encountered in practice.
Most of the classical parametric densities are unimodal, whereas many practical problems involve multi-modal densities.
Nonparametric methods can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.
In nonparametric estimation, all we assume is that similar inputs have similar outputs. This is a reasonable assumption because the world is smooth, and functions, whether they are densities, discriminants, or regression functions, change slowly.
Nonparametric methods for density estimation (cont.)
Let X = {x_1, x_2, \ldots, x_N} be random samples drawn i.i.d. (independently and
identically distributed) from a probability density function p(x).
The probability P_R that a vector x falls in a region R is given by

P_R = \int_{x \in R} p(x)\, dx

The probability that exactly k of the N samples fall in R is given by the binomial law:

P_R^{(k)} = \binom{N}{k} P_R^k (1 - P_R)^{N-k}
The expected value of k is E[k] = N P_R, so the maximum likelihood estimate of P_R is k/N.
If p(x) is continuous and R is small enough that p(x) does not vary significantly over
it, then for all x \in R we can approximate P_R by

P_R = \int_{x' \in R} p(x')\, dx' \approx p(x) V

where V is the volume of R.
The density function can then be estimated as

p(x) \approx \frac{k/N}{V}
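As a concrete illustration, the estimate p(x) ≈ (k/N)/V can be sketched in one dimension, where the region R is an interval of length h centered at x (so V = h); the function name and sample values below are purely illustrative.

```python
# Minimal 1-D sketch of p_hat(x) = (k/N)/V, where R = [x - h/2, x + h/2]
# and the "volume" V of the region is just the interval length h.
def region_density_estimate(samples, x, h):
    """Estimate p(x) from the fraction of samples falling in R."""
    k = sum(1 for s in samples if abs(s - x) <= h / 2)  # samples inside R
    N = len(samples)
    V = h
    return (k / N) / V

samples = [0.1, 0.2, 0.25, 0.5, 0.9]
# In exact arithmetic p_hat(0.2) = (3/5)/0.2 = 3, since 3 samples fall in [0.1, 0.3].
```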
Nonparametric methods for density estimation (example)
Let R(x) = \{x' \mid x' \le x\}.
The nonparametric estimator for the cumulative distribution function P(x) at point x is
the proportion of sample points that are less than or equal to x:

\hat{P}(x) = \frac{|R(x)|}{N}

The nonparametric estimate of the density function can then be calculated as

\hat{p}(x) = \frac{1}{h} \cdot \frac{|R(x+h)| - |R(x)|}{N}

where h is the length of the interval; instances that fall in this interval are assumed
to be close enough to x.
Different heuristics are used to determine which instances are close, and this choice
affects the estimate.
Histogram estimator
The oldest and most popular method is the histogram, where the input space is divided
into equal-sized intervals called bins.
Given an origin x_0 and a bin width h, the mth bin is the interval
[x_0 + mh, x_0 + (m+1)h) for positive and negative integers m, and the estimate is

\hat{p}(x) = \frac{|R_m(x)|}{Nh}

where R_m(x) is the bin containing x and |R_m(x)| is the number of samples in it.
In constructing the histogram, we have to choose both an origin and a bin width.
The choice of origin affects the estimate near the boundaries of bins, but it is mainly
the bin width that has an effect on the estimate:
When bins are small, the estimate is spiky.
When bins become larger, the estimate becomes smoother.
The estimate is 0 if no instance falls in a bin, and there are discontinuities at bin
boundaries.
One advantage of the histogram is that once the bin estimates are calculated and stored,
we do not need to retain the training set.
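As a minimal sketch (with illustrative data, not from the slides), the histogram estimate can be computed by locating the bin that contains the query point:

```python
import math

# Histogram estimator: bins [x0 + m*h, x0 + (m+1)*h) for integer m.
def histogram_estimate(samples, x, x0, h):
    """p_hat(x) = (number of samples in the bin containing x) / (N * h)."""
    m = math.floor((x - x0) / h)      # index of the bin containing x
    lo, hi = x0 + m * h, x0 + (m + 1) * h
    count = sum(1 for s in samples if lo <= s < hi)
    return count / (len(samples) * h)

samples = [0.2, 0.4, 0.6, 1.2, 1.7, 2.4]
# With x0 = 0 and h = 1, the bin [0, 1) holds 3 of 6 samples: p_hat(0.5) = 0.5
```

Note that the estimate is a step function of x, and changing x0 shifts every bin, which is exactly the origin dependence discussed above.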
Naive estimator
A generalization of the histogram, called the naive estimator, addresses the choice
of bin locations.
The main idea behind this method is to use the estimation point itself to determine the
bin location, thereby eliminating it as an extra parameter. Thus the naive estimator
frees us from setting an origin.
Given a bin width h, the bin R(x) is the interval [x - h/2, x + h/2), and the
estimate is

\hat{p}(x) = \frac{|R(x)|}{Nh}

This equals the histogram estimate where x is always at the center of a bin of size h.
The estimator can also be written as

\hat{p}(x) = \frac{1}{Nh} \sum_{k=1}^{N} w\left(\frac{x - x_k}{h}\right)

with the weight (window) function

w(u) = \begin{cases} 1 & \text{if } |u| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}
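The window-function form above translates directly into code; this is a minimal sketch with illustrative sample values:

```python
# Naive estimator: p_hat(x) = (1/(N*h)) * sum_k w((x - x_k)/h),
# where w(u) = 1 if |u| <= 1/2 and 0 otherwise.
def w(u):
    return 1.0 if abs(u) <= 0.5 else 0.0

def naive_estimate(samples, x, h):
    return sum(w((x - xk) / h) for xk in samples) / (len(samples) * h)

samples = [0.1, 0.2, 0.25, 0.5, 0.9]
# The bin [x - h/2, x + h/2) is centered on the query point, so no origin is needed.
```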
Properties of Histogram (cont.)
The histogram density model has some drawbacks.
If the process that generated the data is multi-modal, this aspect of the distribution
can never be captured by a unimodal distribution such as the Gaussian.
A histogram density model depends on the choice of the origin x_0, though this is
typically much less significant than the value of h.
A histogram density model may have discontinuities that are due to the bin edges rather
than to properties of the data.
A major limitation of the histogram approach is its poor scalability with dimensionality.
One of the difficulties with the kernel approach is that the parameter h is fixed for all
kernels:
A large value of h may lead to over-smoothing.
Reducing the value of h may lead to noisy estimates.
The optimal choice of h may depend on location within the data space.
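The kernel estimator itself is not spelled out in this excerpt; for reference, a minimal sketch replaces the hard window w(u) of the naive estimator with a smooth kernel. The Gaussian kernel used here is an assumed (though standard) choice:

```python
import math

# Kernel estimator: p_hat(x) = (1/(N*h)) * sum_k K((x - x_k)/h),
# here with the Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi).
def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kernel_estimate(samples, x, h):
    return sum(gaussian_kernel((x - xk) / h) for xk in samples) / (len(samples) * h)

# Every sample now contributes a smooth bump, so the estimate has no
# discontinuities; h still controls the degree of smoothing, as noted above.
```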
k−Nearest neighbor estimator (cont.)
Instead of fixing h and determining the value of k from the data, we fix the value of k
and use the data to find an appropriate value of h.
To do this, we consider a small sphere centered on the point x at which we wish to
estimate the density p(x) and allow the radius of the sphere to grow until it contains
precisely k data points. Then

\hat{p}(x) = \frac{k}{NV}

where V is the volume of the resulting sphere.
The value of k determines the degree of smoothing, and there is an optimum choice for k
that is neither too large nor too small.
Note that the model produced by k-nearest neighbors is not a true density model, because
its integral over all space diverges.
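In one dimension the sphere is an interval, so a minimal sketch (with illustrative data) grows the radius to the k-th nearest sample and uses V = 2r:

```python
# k-NN density estimate: p_hat(x) = k / (N * V), where V is the volume
# of the smallest sphere around x containing exactly k samples.
def knn_density_estimate(samples, x, k):
    dists = sorted(abs(s - x) for s in samples)
    r = dists[k - 1]          # radius reaching the k-th nearest sample
    V = 2 * r                 # the 1-D "sphere" is the interval [x - r, x + r]
    return k / (len(samples) * V)

samples = [0.0, 0.1, 0.3, 0.7, 1.0]
# At x = 0.2 the 2nd nearest sample is at distance 0.1, so V = 0.2 and
# p_hat = 2 / (5 * 0.2) = 2 in exact arithmetic.
```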
k−Nearest neighbor classifier
In the k-nearest neighbor classifier, we apply the k-nearest neighbor density estimation
technique to each class separately and then make use of Bayes theorem.
Suppose that we have a data set with N_i points in class C_i and N points in total, so
that \sum_i N_i = N.
To classify a new point x, we draw a sphere centered on x containing precisely k points
irrespective of their class.
Suppose this sphere has volume V and contains k_i points from class C_i.
An estimate of the density associated with each class is

p(x|C_i) = \frac{k_i}{N_i V}

The unconditional density is given by

p(x) = \frac{k}{NV}

The class priors are

p(C_i) = \frac{N_i}{N}

Combining the above equations using Bayes theorem results in

p(C_i|x) = \frac{p(x|C_i)\, p(C_i)}{p(x)} = \frac{\frac{k_i}{N_i V} \cdot \frac{N_i}{N}}{\frac{k}{NV}} = \frac{k_i}{k}
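Since the posterior reduces to k_i/k, classification amounts to a majority vote among the k nearest training points. A minimal 1-D sketch, with illustrative data:

```python
# k-NN classifier: find the k nearest labeled points; the class with the
# largest count k_i maximizes the posterior estimate k_i / k.
def knn_classify(data, x, k):
    """data: list of (value, label) pairs; returns (label, posterior estimate)."""
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    counts = {}
    for _, label in nearest:
        counts[label] = counts.get(label, 0) + 1
    best = max(counts, key=counts.get)
    return best, counts[best] / k

data = [(0.1, 'A'), (0.2, 'A'), (0.3, 'B'), (0.9, 'B'), (1.0, 'B')]
# The 3 nearest points to x = 0.15 are two 'A's and one 'B',
# so the vote is 'A' with posterior estimate 2/3.
```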
k−Nearest neighbor classifier (example)
The parameter k controls the degree of smoothing: small k produces many small regions
for each class, while large k leads to fewer, larger regions.
k−Nearest neighbor classifier (cont.)
The nearest neighbors are determined with respect to a distance measure, commonly the
Minkowski (L_p) distance:

L_p(x, y) = \left( \sum_{i=1}^{D} |x_i - y_i|^p \right)^{1/p}
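A direct implementation of the L_p distance (p = 2 gives the Euclidean distance, p = 1 the Manhattan distance):

```python
# Minkowski (L_p) distance between two D-dimensional points.
def minkowski(x, y, p):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

# minkowski([0, 0], [3, 4], 2) is the Euclidean distance 5.0;
# with p = 1 the same pair is 7.0 apart.
```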
Conclusions
Both the k-nearest neighbor method and the kernel density estimator require the entire
training data set to be stored, leading to expensive computation if the data set is large.
This effect can be offset, at the expense of some additional one-off computation, by
constructing tree-based search structures such as KD-trees, which allow (approximate)
nearest neighbors to be found efficiently without an exhaustive search of the data set.
These nonparametric methods are still severely limited.
On the other hand, simple parametric models are very restricted in terms of the forms of
distribution that they can represent.
We therefore need density models that are very flexible and yet whose complexity can be
controlled independently of the size of the training set; we shall see in subsequent
chapters how to achieve this.
Naive Bayes classifier
Bayesian classifiers estimate posterior probabilities based on the likelihood, the prior,
and the evidence.
These classifiers first estimate p(x|C_i) and p(C_i) and then classify the given instance.
How much training data will be required to obtain reliable estimates of these
distributions?
Consider the number of parameters that must be estimated when C = 2 and x is a vector
of D boolean features.
In this case, we need to estimate 2(2^D - 1) parameters for the class-conditional
distributions p(x|C_i).
Naive Bayes classifier (cont.)
Given the intractable sample complexity of learning Bayesian classifiers, we must look for
ways to reduce this complexity.
The Naive Bayes classifier does this by making a conditional independence assumption
that dramatically reduces the number of parameters to be estimated when modelling
p(x|C_j), from our original 2(2^D - 1) to just 2D.
The Naive Bayes algorithm is a classification algorithm based on Bayes rule that assumes
the features x_1, x_2, \ldots, x_D are all conditionally independent of one another given
the class label. Thus we have

p(x_1, x_2, \ldots, x_D | C_j) = \prod_{i=1}^{D} p(x_i | C_j)

Note that when C and the x_i are boolean variables, we need only 2D parameters to define
p(x_i = x_{ik} | C_j) for the necessary i, j, and k.
Naive Bayes classifier (cont.)
We derive the Naive Bayes algorithm assuming, in general, that C is any discrete-valued
variable and the features x_1, x_2, \ldots, x_D are any discrete or real-valued features.
Our goal is to train a classifier that outputs the probability distribution over possible
values of C for each new instance x that we ask it to classify.
The probability that C takes on its kth possible value is

p(C_k | x_1, x_2, \ldots, x_D) = \frac{p(C_k)\, p(x_1, x_2, \ldots, x_D | C_k)}{\sum_j p(C_j)\, p(x_1, x_2, \ldots, x_D | C_j)}
Naive Bayes for discrete-valued inputs
When the D input features x_i each take on J possible discrete values and C is a discrete
variable taking on M possible values, our learning task is to estimate two sets of
parameters:

\theta_{ijk} = p(x_i = x'_{ij} | C = C_k), where x'_{ij} denotes the jth possible value of feature x_i
\pi_k = p(C = C_k)

We can estimate these parameters using either ML estimates or Bayesian/MAP estimates.
The maximum likelihood estimate of \theta_{ijk} is

\hat{\theta}_{ijk} = \frac{|\{x_i = x'_{ij} \wedge C = C_k\}|}{|C_k|}

This maximum likelihood estimate sometimes results in \theta estimates of zero, if the
data does not happen to contain any training examples satisfying the condition in the
numerator. To avoid this, it is common to use a smoothed estimate:

\hat{\theta}_{ijk} = \frac{|\{x_i = x'_{ij} \wedge C = C_k\}| + l}{|C_k| + lJ}

The value of l determines the strength of this smoothing.
The maximum likelihood and smoothed estimates for \pi_k are

\hat{\pi}_k = \frac{|C_k|}{N} \quad \text{and} \quad \hat{\pi}_k = \frac{|C_k| + l}{N + lM}
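The smoothed estimate above can be sketched directly; the toy data set and the variable names (examples, labels, l, J) are illustrative, not from the slides:

```python
# Smoothed ML estimate for discrete Naive Bayes:
# theta_ijk = (#{x_i = v and C = c} + l) / (#{C = c} + l*J).
def estimate_theta(examples, labels, i, v, c, l, J):
    """Smoothed estimate of p(x_i = v | C = c); l = 0 gives the plain MLE."""
    num = sum(1 for x, t in zip(examples, labels) if x[i] == v and t == c) + l
    den = sum(1 for t in labels if t == c) + l * J
    return num / den

examples = [(1, 0), (1, 1), (0, 1), (0, 0)]   # two boolean features
labels = ['pos', 'pos', 'neg', 'neg']
# With l = 1 and J = 2: p(x_0 = 1 | pos) = (2 + 1) / (2 + 2) = 0.75,
# and no estimate can come out exactly zero.
```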
Naive Bayes for continuous inputs
When features are continuous, we must choose some other way to represent the
distributions p(x_i|C_k).
One common approach is to assume that for each possible class C_k, the distribution of
each feature x_i is Gaussian, defined by a mean and variance specific to x_i and C_k:

\mu_{ik} = E[x_i | C_k]
\sigma_{ik}^2 = E[(x_i - \mu_{ik})^2 | C_k]

In order to train such a Naive Bayes classifier, we must therefore estimate the mean and
variance of each of these distributions.
We must also estimate the prior on C:

\pi_k = p(C = C_k)

We can use either maximum likelihood estimates (MLE) or maximum a posteriori (MAP)
estimates for these parameters.
The maximum likelihood estimator for \mu_{ik} is

\hat{\mu}_{ik} = \frac{\sum_j x_{ij}\, \delta(t_j = C_k)}{\sum_j \delta(t_j = C_k)}

The maximum likelihood estimator for \sigma_{ik}^2 is

\hat{\sigma}_{ik}^2 = \frac{\sum_j (x_{ij} - \hat{\mu}_{ik})^2\, \delta(t_j = C_k)}{\sum_j \delta(t_j = C_k)}
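A minimal sketch of these two ML estimators for a single continuous feature; the toy values are illustrative:

```python
# Per-class ML estimates for Gaussian Naive Bayes: the mean and variance
# of the feature values x_j restricted to the examples with label t_j = c.
def gaussian_ml_estimates(xs, ts, c):
    """Return (mu_hat, sigma2_hat) computed over the examples of class c."""
    vals = [x for x, t in zip(xs, ts) if t == c]
    n = len(vals)
    mu = sum(vals) / n
    sigma2 = sum((v - mu) ** 2 for v in vals) / n
    return mu, sigma2

xs = [1.0, 2.0, 3.0, 10.0]
ts = ['a', 'a', 'a', 'b']
# For class 'a': mu_hat = 2.0 and sigma2_hat = (1 + 0 + 1) / 3 = 2/3.
```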