
Classification based on Bayes decision theory

Machine Learning

Hamid Beigy

Sharif University of Technology

Fall 1393

1 Introduction

2 Bayes decision theory


Minimizing the classification error probability
Minimizing the average risk
Discriminant function and decision surface
Bayesian classifiers for Normally distributed classes
Minimum distance classifier
Bayesian classifiers for independent binary features

3 Supervised learning of the Bayesian classifiers


Parametric methods for density estimation
Maximum likelihood parameter estimation
Bayesian estimation
Maximum a posteriori estimation
Mixture models for density estimation
Nonparametric methods for density estimation
Histogram estimator
Naive estimator
Kernel estimator
k−Nearest neighbor estimator
k−Nearest neighbor classifier

4 Naive Bayes classifier

Introduction

In classification, the goal is to find a mapping from inputs X to outputs t given a labeled
set of input-output pairs

S = {(x1 , t1 ), (x2 , t2 ), . . . , (xN , tN )}.

S is called the training set.


In the simplest setting, each training input x is a D−dimensional vector of numbers.
Each component of x is called a feature, attribute, or variable, and x is called the feature vector.
The goal is to find a mapping from inputs X to outputs t, where t ∈ {1, 2, . . . , C } with C
being the number of classes.
When C = 2, the problem is called binary classification. In this case, we often assume
that t ∈ {−1, +1} or t ∈ {0, 1}.
When C > 2, the problem is called multi-class classification.

Introduction (cont.)

Bayes theorem

p(Ck|X) = p(X|Ck) p(Ck) / p(X) = p(X|Ck) p(Ck) / Σ_j p(X|Cj) p(Cj)

p(Ck) is called the prior of Ck.
p(X|Ck) is called the likelihood of the data.
p(Ck|X) is called the posterior probability.
Since p(X) is the same for all classes, we can write p(Ck|X) ∝ p(X|Ck) p(Ck) (a small numerical sketch follows at the end of this list).

Approaches for building a classifier:
Generative approach: first build a joint model of the form p(x, Ck) and then condition on x to derive p(Ck|x).
Discriminative approach: model p(Ck|x) directly.
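As a small illustration of the Bayes rule above (not part of the original slides), the sketch below computes posteriors by normalizing prior × likelihood; the priors and likelihood values are made-up numbers.

import numpy as np

# Hypothetical priors p(C_k) and likelihoods p(x|C_k) for three classes at a fixed x.
priors = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([0.10, 0.40, 0.25])    # assumed values of p(x|C_k)

joint = likelihoods * priors                  # p(x|C_k) p(C_k)
posteriors = joint / joint.sum()              # divide by p(x) = sum_k p(x|C_k) p(C_k)

print(posteriors, posteriors.argmax())        # class with the largest posterior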

Bayes decision theory
Given a classification task of M classes, C1 , C2 , . . . , CM , and an input vector x, we can
form M conditional probabilities
p(Ck |x) ∀k = 1, 2, . . . , M
Without loss of generality, consider the two-class classification problem. From the Bayes theorem, we have

p(Ck|x) = p(x|Ck) p(Ck) / p(x)

The Bayes classification rule is


if p(C1 |x) > p(C2 |x) then x is classified to C1
if p(C1 |x) < p(C2 |x) then x is classified to C2
if p(C1 |x) = p(C2 |x) then x is classified to either C1 or C2
Since p(x) is the same for all classes, it can be removed. Hence

p(x|C1) p(C1) ≶ p(x|C2) p(C2)

If p(C1) = p(C2) = 1/2, then we have

p(x|C1) ≶ p(x|C2)
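A minimal sketch of this two-class rule, assuming univariate Gaussian class-conditional densities and equal priors; the means, variance, and test points below are illustrative, not from the slides.

import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def classify(x, mu1=0.0, mu2=1.0, sigma=1.0, p1=0.5, p2=0.5):
    # decide C1 if p(x|C1)p(C1) > p(x|C2)p(C2), otherwise C2
    return 1 if gauss(x, mu1, sigma) * p1 > gauss(x, mu2, sigma) * p2 else 2

print([classify(x) for x in (-0.5, 0.3, 0.7, 1.5)])   # -> [1, 1, 2, 2]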
Bayes decision theory
If p(C1) = p(C2) = 1/2, the decision is made by comparing p(x|C1) and p(x|C2) directly (the original slide illustrates the two densities and the decision regions R1 and R2 in a figure; the coloured region of that figure is where errors occur).
The probability of error equals

P_e = p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
    = (1/2) ∫_{R1} p(x|C2) dx + (1/2) ∫_{R2} p(x|C1) dx

Minimizing the classification error probability (cont.)
We now show that the Bayesian classifier is optimal with respect to minimizing the classification error probability.
Let R1 (R2) be the region of the feature space in which we decide in favor of C1 (C2). An error is made if x ∈ R1 although it belongs to C2, or if x ∈ R2 although it belongs to C1. That is,

P_e = p(x ∈ R2, C1) + p(x ∈ R1, C2)
    = p(x ∈ R2|C1) p(C1) + p(x ∈ R1|C2) p(C2)
    = p(C1) ∫_{R2} p(x|C1) dx + p(C2) ∫_{R1} p(x|C2) dx

Since R1 ∪ R2 covers all the feature space, from the definition of probability density
function, we have
p(C1) = ∫_{R1} p(C1|x) p(x) dx + ∫_{R2} p(C1|x) p(x) dx

By combining these two equations, we obtain

P_e = p(C1) − ∫_{R1} [p(C1|x) − p(C2|x)] p(x) dx

Minimizing the classification error probability (cont.)

The probability of error equals

P_e = p(C1) − ∫_{R1} [p(C1|x) − p(C2|x)] p(x) dx

The probability of error is minimized if R1 is the region of the space in which

[p(C1 |x) − p(C2 |x)] > 0

Then R2 becomes the region where the reverse is true, i.e. is the region of the space in
which
[p(C1 |x) − p(C2 |x)] < 0
This completes the proof of the Theorem.
For a classification task with M classes, x is assigned to class Ck according to the rule

assign x to Ck if p(Ck|x) > p(Cj|x) ∀j ≠ k

Show that this rule also minimizes the classification error probability for a classification task with M classes.


Minimizing the average risk
The classification error probability is not always the best criterion to be adopted for
minimization. Why?
This is because it assigns the same importance to all errors.
In some applications such as IDS, patient classification, and spam filtering some wrong
decisions may have more serious implications than others.
In some cases, it is more appropriate to assign a penalty term to weight each error.
In such cases, we try to minimize the following risk:

r = λ12 p(C1) ∫_{R2} p(x|C1) dx + λ21 p(C2) ∫_{R1} p(x|C2) dx

In general, the risk associated with class Ck is defined as

rk = Σ_{i=1}^{M} λki ∫_{Ri} p(x|Ck) dx

The goal is to partition the feature space so that the average risk is minimized.
r = Σ_{k=1}^{M} rk p(Ck)
  = Σ_{i=1}^{M} ∫_{Ri} ( Σ_{k=1}^{M} λki p(x|Ck) p(Ck) ) dx

Minimizing the average risk (cont.)
The average risk is equal to

r = Σ_{i=1}^{M} ∫_{Ri} ( Σ_{k=1}^{M} λki p(x|Ck) p(Ck) ) dx

This is achieved if each integral is minimized separately, so that

x ∈ Ri if ri ≡ Σ_{k=1}^{M} λki p(x|Ck) p(Ck) < rj ≡ Σ_{k=1}^{M} λkj p(x|Ck) p(Ck) ∀j ≠ i

When λki = 1 for k ≠ i and λkk = 0, minimizing the average risk is equivalent to minimizing the classification error probability.
In the two-class case, we have
r1 = λ11 p(x|C1 )p(C1 ) + λ21 p(x|C2 )p(C2 )
r2 = λ12 p(x|C1 )p(C1 ) + λ22 p(x|C2 )p(C2 )

We assign x to C1 if r1 < r2 , that is


(λ21 − λ22 ) p(x|C2 )p(C2 ) < (λ12 − λ11 ) p(x|C1 )p(C1 )

Minimizing the average risk (cont.)

In other words,

x ∈ C1 (C2) if p(x|C1)/p(x|C2) > (<) [p(C2)/p(C1)] · (λ21 − λ22)/(λ12 − λ11)
Assume that the loss matrix has the form

Λ = [ 0  λ12 ; λ21  0 ]

Then, we have

x ∈ C1 (C2) if p(x|C1)/p(x|C2) > (<) [p(C2)/p(C1)] · (λ21/λ12)
When p(C1) = p(C2) = 1/2, we have

x ∈ C1 (C2) if p(x|C1) > (<) (λ21/λ12) p(x|C2)
If λ21 > λ12, then x is assigned to C2 if

p(x|C2) > (λ12/λ21) p(x|C1)

That is, p(x|C1) is multiplied by a factor less than 1, and the effect is a movement of the threshold to the left of x0.
Minimizing the average risk (example)
In a two-class problem with a single feature x, the distributions of the two classes are

p(x|C1) = (1/√π) exp(−x²)
p(x|C2) = (1/√π) exp(−(x − 1)²)

The prior probabilities of the two classes are p(C1) = p(C2) = 1/2.


Compute x0 for the minimum error probability classifier. x0 is the solution of

(1/√π) exp(−x0²) = (1/√π) exp(−(x0 − 1)²)

x0 = 1/2 is the solution of the above equation.


If the following loss matrix is given, compute x0 for the minimum average risk classifier.

Λ = [ 0  0.5 ; 1.0  0 ]
x0 must satisfy the following equation:

(1/√π) exp(−x0²) = (2/√π) exp(−(x0 − 1)²)

x0 = (1 − ln 2)/2 < 1/2 is the solution of the above equation.
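A quick numerical check of both thresholds (a sketch, not from the slides); it solves p(x|C1) = p(x|C2) and p(x|C1) = (λ21/λ12) p(x|C2) by root finding with SciPy.

import numpy as np
from scipy.optimize import brentq

p1 = lambda x: np.exp(-x**2) / np.sqrt(np.pi)          # p(x|C1)
p2 = lambda x: np.exp(-(x - 1)**2) / np.sqrt(np.pi)    # p(x|C2)

# Minimum-error threshold: p(x|C1) = p(x|C2)
x0_err = brentq(lambda x: p1(x) - p2(x), -5, 5)

# Minimum-risk threshold with lambda_12 = 0.5, lambda_21 = 1.0:
# p(x|C1) = (lambda_21 / lambda_12) p(x|C2) = 2 p(x|C2)
x0_risk = brentq(lambda x: p1(x) - 2 * p2(x), -5, 5)

print(x0_err, x0_risk)    # ~0.5 and ~(1 - ln 2)/2 ≈ 0.153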

Discriminant function and decision surface
As discussed, minimizing either the risk or the error probability is equivalent to
partitioning the feature space into M regions for M classes.
If two regions Ri and Rj happen to be contiguous, they are separated by a decision surface in the multidimensional feature space.
For the minimum error probability case, this surface is described by the equation p(Ci|x) − p(Cj|x) = 0.
On one side of the surface this difference is positive, and on the other side it is negative.
Sometimes, instead of working directly with probabilities (or risks), it is more convenient
to work with an equivalent function of them such as
gi (x) = f (p(Ci |x))
f (.) is a monotonically increasing function. (why?)
Function gi (x) is known as a discriminant function.
Now, the decision test is stated as: classify x in Ci if gi(x) > gj(x) ∀j ≠ i.
The decision surfaces, separating contiguous regions, are described by gij(x) = gi(x) − gj(x) = 0 for all i, j = 1, 2, . . . , M with j ≠ i.

Discriminant function for Normally distributed classes
The one-dimensional Gaussian distribution with mean µ and variance σ² is given by

p(x) = N(µ, σ²) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))

The D-dimensional Gaussian distribution with mean µ and covariance matrix Σ is

p(x) = N(µ, Σ) = (1/((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)^T Σ^{-1}(x − µ))
What is the optimal classifier when the involved pdfs are N (µ, Σ)?
Because of the exponential form of the involved densities, it is preferable to work with the
following discriminant functions.
gi(x) = ln[p(x|Ci) p(Ci)] = ln p(x|Ci) + ln p(Ci)
or

gi(x) = −(1/2)(x − µi)^T Σi^{-1}(x − µi) + wi0
wi0  = −(D/2) ln(2π) − (1/2) ln|Σi| + ln p(Ci)

By expanding the above equation, we obtain the following quadratic form:

gi(x) = −(1/2) x^T Σi^{-1} x + (1/2) x^T Σi^{-1} µi − (1/2) µi^T Σi^{-1} µi + (1/2) µi^T Σi^{-1} x + wi0
Discriminant function for Normally distributed classes(example)

For Normally distributed classes, we have the following quadratic classifier:

gi(x) = −(1/2) x^T Σi^{-1} x + (1/2) x^T Σi^{-1} µi − (1/2) µi^T Σi^{-1} µi + (1/2) µi^T Σi^{-1} x + wi0

Assume

Σi = [ σi²  0 ; 0  σi² ]

Thus we have

gi(x) = −(1/(2σi²)) (x1² + x2²) + (1/σi²) (µi1 x1 + µi2 x2) − (1/(2σi²)) (µi1² + µi2²) + wi0

Obviously the associated decision curves gi (x) − gj (x) = 0 are quadratics.


In this case the Bayesian classifier is a quadratic classifier, i.e. the partition of the feature
space is performed via quadratic decision surfaces.
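A small sketch of the general quadratic discriminant for Gaussian classes; the means, covariances, priors, and test point below are illustrative assumptions, not values from the slides.

import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    # g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln p(C_i) (constant dropped)
    d = x - mu
    Sinv = np.linalg.inv(Sigma)
    return -0.5 * d @ Sinv @ d - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)

# Two illustrative classes with different covariances -> quadratic decision surface.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 1.5])
g = [quadratic_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print(int(np.argmax(g)))   # index of the predicted class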

Discriminant function for Normally distributed classes (cont.)

The discriminant functions for the optimal classifier when the involved pdfs are N(µi, Σi) have the following form:

gi(x) = −(1/2)(x − µi)^T Σi^{-1}(x − µi) + wi0
wi0  = −(D/2) ln(2π) − (1/2) ln|Σi| + ln p(Ci)

By expanding the above equation, we obtain the following quadratic form:

gi(x) = −(1/2) x^T Σi^{-1} x + (1/2) x^T Σi^{-1} µi − (1/2) µi^T Σi^{-1} µi + (1/2) µi^T Σi^{-1} x + wi0
Based on the above equations, we distinguish three distinct cases:
When Σi = σ 2 I , where σ 2 is a scalar and I is the identity matrix;
Σi = Σ, i.e. all classes have equal covariance matrices;
Σi is arbitrary.

Discriminant function for Normally distributed classes (when Σi = σ 2 I )
The discriminant functions for the optimal classifier when the involved pdfs are N(µi, Σi) have the following form:

gi(x) = −(1/2)(x − µi)^T Σi^{-1}(x − µi) + wi0

By replacing Σi = σ²I in the above equation, we obtain

gi(x) = −(1/2)(x − µi)^T (σ² I)^{-1}(x − µi) + wi0
      = −||x − µi||²/(2σ²) + wi0
      = −(1/(2σ²)) (x^T x − 2 µi^T x + µi^T µi) + wi0

The term x^T x and the other class-independent constants are equal for all classes, so they can be dropped:

gi(x) = (1/σ²) (µi^T x − (1/2) µi^T µi) + wi0

This is a linear discriminant function with

wi   = µi / σ²
w'i0 = −µi^T µi/(2σ²) + ln p(Ci)

Discriminant function for Normally distributed classes (Σi = σ 2 I )(cont.)

For this case, the discriminant functions are equal to

gi(x) = (1/σ²) µi^T x + w'i0

The corresponding hyperplanes can be written as

gij(x) = gi(x) − gj(x) = (1/σ²) µi^T x + w'i0 − (1/σ²) µj^T x − w'j0
       = (1/σ²) (µi − µj)^T x + w'i0 − w'j0
       = w^T (x − x0) = 0

w  = µi − µj
x0 = (1/2)(µi + µj) − σ² ln[p(Ci)/p(Cj)] · (µi − µj)/||µi − µj||²

This implies that the decision surface is a hyperplane passing through the point x0 .
For any x on the decision hyperplane, vector (x − x0 ) also lies on the hyperplane and
hence (µi − µj ) is orthogonal to the decision hyperplane.
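A minimal sketch computing w and x0 for the Σi = σ²I case; the means, variance, and priors below are assumed values.

import numpy as np

def hyperplane(mu_i, mu_j, sigma2, p_i, p_j):
    # w = mu_i - mu_j;  x0 = (mu_i + mu_j)/2 - sigma^2 ln(p_i/p_j) (mu_i - mu_j)/||mu_i - mu_j||^2
    w = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - sigma2 * np.log(p_i / p_j) * w / np.dot(w, w)
    return w, x0

w, x0 = hyperplane(np.array([0.0, 0.0]), np.array([3.0, 0.0]), sigma2=1.0, p_i=0.7, p_j=0.3)
print(w, x0)   # with unequal priors, x0 shifts from the midpoint toward the mean of the less probable class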

Discriminant function for Normally distributed classes( Σi = σ 2 I )(cont.)
When p(Ci) = p(Cj), then x0 = (1/2)(µi + µj) and the hyperplane passes through the average of µi and µj.
When p(Ci) < p(Cj), the hyperplane is located closer to µi.
When p(Ci) > p(Cj), the hyperplane is located closer to µj.

If σ 2 is small with respect to ||µi − µj ||, the location of the hyperplane is insensitive to
the values of p(Ci ) and p(Cj ).

Discriminant function for Normally distributed classes (Σi = Σ)
The discriminant functions for the optimal classifier when the involved pdfs are N(µi, Σi) have the following form:

gi(x) = −(1/2)(x − µi)^T Σi^{-1}(x − µi) + wi0

By replacing Σi = Σ in the above equation, we obtain

gi(x) = −(1/2)(x − µi)^T Σ^{-1}(x − µi) + wi0
      = −(1/2) x^T Σ^{-1} x + (1/2) x^T Σ^{-1} µi − (1/2) µi^T Σ^{-1} µi + (1/2) µi^T Σ^{-1} x + wi0
      = −(1/2) x^T Σ^{-1} x + µi^T Σ^{-1} x − (1/2) µi^T Σ^{-1} µi + wi0

The term x^T Σ^{-1} x and the other constants are equal for all classes and can be dropped. This gives

gi(x) = (1/2) (2 µi^T Σ^{-1} x − µi^T Σ^{-1} µi) + ln p(Ci)

This is a linear discriminant function,

g'i(x) = wi^T x + w'i0

with the following parameters:

wi   = Σ^{-1} µi
w'i0 = −(1/2) µi^T Σ^{-1} µi + ln p(Ci)
Discriminant function for Normally distributed classes ( Σi = Σ) (Cont.)

For this case, the discriminant functions are equal to

gi(x) = (1/2) (2 µi^T Σ^{-1} x − µi^T Σ^{-1} µi) + ln p(Ci)

The corresponding hyperplanes can be written as

gij(x) = gi(x) − gj(x) = w^T (x − x0) = 0

w  = Σ^{-1} (µi − µj)
x0 = (1/2)(µi + µj) − ln[p(Ci)/p(Cj)] · (µi − µj)/||µi − µj||²_{Σ^{-1}}
   = (1/2)(µi + µj) − ln[p(Ci)/p(Cj)] · (µi − µj)/((µi − µj)^T Σ^{-1}(µi − µj))

The decision hyperplane is no longer orthogonal to the vector (µi − µj), but to its linear transformation Σ^{-1}(µi − µj).

Discriminant function for Normally distributed classes (arbitrary Σi )
The discriminant functions for the optimal classifier when the involved pdfs are N(µi, Σi) have the following form:

gi(x) = −(1/2)(x − µi)^T Σi^{-1}(x − µi) + ln p(Ci) − (D/2) ln(2π) − (1/2) ln|Σi|

The discriminant functions cannot be simplified much further; only the constant term (D/2) ln(2π) can be dropped.
Discriminant functions are not linear but quadratic.
They have much more complicated decision regions than the linear classifiers of the two
previous cases.
Now the decision surfaces are also quadratic, and the decision regions do not even have to be connected sets.


Minimum distance classifier
Assume equal priors p(Ci) = p(Cj) and the same covariance matrix Σ for all classes; then gi(x) reduces to

gi(x) = −(1/2)(x − µi)^T Σ^{-1}(x − µi)

For a diagonal covariance matrix (Σ = σ²I), maximizing gi(x) implies minimizing the Euclidean distance

d = ||x − µi||

Feature vectors are assigned to classes according to their Euclidean distance from the respective mean points.
For a non-diagonal covariance matrix, maximizing gi(x) is equivalent to minimizing the Mahalanobis distance (the Σ^{-1}-norm)

dm = ((x − µi)^T Σ^{-1}(x − µi))^{1/2}
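A small sketch of the minimum distance classifier under both metrics; the means, covariance matrix, and test point are assumed values.

import numpy as np

def min_distance_classify(x, means, Sigma=None):
    # Assign x to the class whose mean is closest: Euclidean if Sigma is None, else Mahalanobis.
    if Sigma is None:
        d = [np.linalg.norm(x - mu) for mu in means]
    else:
        Sinv = np.linalg.inv(Sigma)
        d = [np.sqrt((x - mu) @ Sinv @ (x - mu)) for mu in means]
    return int(np.argmin(d))

means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
Sigma = np.array([[3.0, 0.0], [0.0, 0.5]])
x = np.array([2.5, 1.0])
print(min_distance_classify(x, means), min_distance_classify(x, means, Sigma))  # predicted class under each metric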


Bayesian classifiers for independent binary features
We consider the case where the features are binary-valued and independent and the class-conditional density of each feature is a Bernoulli distribution. This yields

p(x|Ci) = Π_{j=1}^{D} qij^{xj} (1 − qij)^{(1−xj)}

qij (for j = 1, 2, . . . , D) are the parameters of the class-conditional density of class Ci.
The discriminant function is

gi(x) = ln[p(x|Ci) p(Ci)] = ln[ Π_{j=1}^{D} qij^{xj} (1 − qij)^{(1−xj)} p(Ci) ]
      = Σ_{j=1}^{D} [ xj ln qij + (1 − xj) ln(1 − qij) ] + ln p(Ci)

These are linear discriminant functions,

gi(x) = wi^T x + wi0 = Σ_{j=1}^{D} wij xj + wi0

with

wij = ln qij − ln(1 − qij)
wi0 = Σ_{j=1}^{D} ln(1 − qij) + ln p(Ci)
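A minimal sketch of this discriminant for independent binary features; the parameters qij and the priors below are illustrative assumptions.

import numpy as np

def bernoulli_discriminant(x, q, prior):
    # g_i(x) = sum_j [x_j ln q_ij + (1 - x_j) ln(1 - q_ij)] + ln p(C_i) for a binary vector x
    q = np.asarray(q, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.sum(x * np.log(q) + (1 - x) * np.log(1 - q)) + np.log(prior)

# Illustrative q_ij (probability that feature j equals 1 in class i) and equal priors.
q = [[0.8, 0.1, 0.6], [0.3, 0.7, 0.5]]
x = [1, 0, 1]
scores = [bernoulli_discriminant(x, qi, 0.5) for qi in q]
print(int(np.argmax(scores)))   # -> 0 here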
Supervised learning of the Bayesian classifiers

We assumed that the class conditional pdfs p(x|Ci ) and the prior probabilities p(Ci ) were
known. In practice, this is never the case and we study supervised learning of class
conditional pdfs.
For supervised learning we need training samples. In the training set there are feature
vectors from each class and we re-arrange training samples based on their classes.

Si = {(xi1 , ti1 ) , (xi2 , ti2 ) , . . . , (xiNi , tiNi )}

Ni is the number of training samples from the class Ci .


We assume that the training samples in the sets Si are occurrences of independent random variables.
The training data may be collected in two distinct ways. These are meaningful when we
need to learn the prior probabilities.
In mixture sampling, a set of objects is randomly selected, their feature vectors are extracted, and they are then hand-labeled with the most appropriate classes. The prior probability of each class is estimated as

p(Ci) = Ni / N

In separate sampling, the training data for each class are collected separately. In this case, the prior probabilities cannot be deduced and it is most reasonable to assume that they are known (if they are unknown, we usually assume that the prior probabilities of all classes are equal).
Supervised learning of the Bayesian classifiers (Cont.)

We assumed that the probability density functions are known. In most cases, these
probability density functions are not known and the underlying pdf will be estimated from
the available data.
There are various ways to estimate the probability density functions.
If we know the type of the pdf, we can estimate its parameters, such as the mean and variance, from the available data. These methods are known as parametric methods.
In the estimative approach to parametric density estimation, we use an estimate of the
parameter θj in the parametric density.

p(x|Cj ) = p(x|θ̂j )

θ̂j is an estimate of the parameter θj based on the data samples.


In the Bayesian/predictive approach, we assume that we don’t know the true value of
parameter θj . This approach treats θj as an unknown random variable.
In many cases, we may not have the information about the type of the pdf, but we may
know certain statistical parameters such as the mean and the variance. These methods
are known as nonparametric methods.


Parametric methods for density estimation

In parametric methods, we assume that the sample is drawn from some known distribution (for example, Gaussian), but the parameters of this distribution are not known and our goal is to estimate them from the data.
The main advantage of parametric methods is that the model is defined up to a small number of parameters; once these parameters are estimated, the whole distribution is known.
The following methods are usually used to estimate the parameters of the distribution:
maximum likelihood estimation
Bayesian estimation
Maximum a posteriori probability estimation
Maximum entropy estimation
Mixture Models

Maximum likelihood parameter estimation

Consider an M−class problem with feature vectors distributed according to p(x|Ci ) (for
i = 1, 2, . . . , M).
We assume that p(x|Ci ) belongs to some family of parametric distributions. For example,
we assume that p(x|Ci ) is a normal density with unknown parameters θi = (µi , Σi ).
To show the dependence on θi , we denote p(x|Ci ) = p(x|Ci ; θi ). The class Ci defines the
parametric family, and the parameter vector θi defines the member of that parametric
family.
The parametric families do not need to be the same for all classes.
Our goal is to estimate the unknown parameters using a set of known feature vectors in each class.
If we assume that data from one class do not affect the parameter estimation of the others, we can formulate the problem independently of the classes, simplify our notation to p(x; θ), and then solve the problem for each class separately.
Let X = {x1, x2, . . . , xN} be random samples drawn from the pdf p(x; θ). We form the joint pdf p(X; θ).
Assuming statistical independence between the different samples, we have

p(X; θ) = p(x1, x2, . . . , xN; θ) = Π_{k=1}^{N} p(xk; θ)

Maximum likelihood parameter estimation (cont.)
p(X; θ) is a function of θ and is known as the likelihood function.
The maximum likelihood (ML) method estimates θ so that the likelihood function takes its maximum value, that is,

θ̂_ML = argmax_θ Π_{k=1}^{N} p(xk; θ)

A necessary condition that θ̂_ML must satisfy in order to be a maximum is that the gradient of the likelihood function with respect to θ be zero:

∂[ Π_{k=1}^{N} p(xk; θ) ] / ∂θ = 0
It is more convenient to work with the logarithm of the likelihood function than with the likelihood function itself. Hence, we define the log-likelihood function as

LL(θ) = ln Π_{k=1}^{N} p(xk; θ) = Σ_{k=1}^{N} ln p(xk; θ)

Maximum likelihood parameter estimation (cont.)
In order to find θ̂_ML, it must satisfy

∂LL(θ)/∂θ = Σ_{k=1}^{N} ∂ ln p(xk; θ)/∂θ = Σ_{k=1}^{N} (1/p(xk; θ)) ∂p(xk; θ)/∂θ = 0

(The original slide illustrates the single-unknown-parameter case with a figure.)

Maximum likelihood estimation for normal distribution
Let x1, x2, . . . , xN be vectors sampled from a normal distribution with known covariance matrix and unknown mean, that is,

p(x; µ) = N(µ, Σ) = (1/((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)^T Σ^{-1}(x − µ))

Obtain the ML estimate of the unknown mean vector.
For the N available samples, we have

LL(µ) = Σ_{k=1}^{N} ln p(xk; µ) = −(N/2) ln[(2π)^D |Σ|] − (1/2) Σ_{k=1}^{N} (xk − µ)^T Σ^{-1}(xk − µ)

Taking the gradient with respect to µ, we obtain

∂LL(µ)/∂µ = Σ_{k=1}^{N} Σ^{-1}(xk − µ) = 0

or

µ̂_ML = (1/N) Σ_{k=1}^{N} xk

That is, the ML estimate of the mean, for Gaussian densities, is the sample average.
Maximum likelihood estimation for normal distribution
Assume x1, x2, . . . , xN have been generated by a one-dimensional Gaussian pdf of known mean µ but unknown variance, that is,

p(x; σ²) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))

Obtain the ML estimate of the unknown variance.
For the N available samples, we have

LL(σ²) = Σ_{k=1}^{N} ln p(xk; σ²) = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{k=1}^{N} (xk − µ)²

Taking the derivative of the above with respect to σ² and equating it to zero, we obtain

dLL(σ²)/dσ² = −N/(2σ²) + (1/(2σ⁴)) Σ_{k=1}^{N} (xk − µ)² = 0

Solving the above equation with respect to σ² results in

σ̂²_ML = (1/N) Σ_{k=1}^{N} (xk − µ)²
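A small numerical sketch (assuming a synthetic one-dimensional Gaussian sample) of the two ML estimates above; in practice µ is replaced by its own ML estimate when it is unknown.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic 1-D sample (assumed setup)

mu_ml = x.mean()                    # mu_ML = (1/N) sum_k x_k
var_ml = ((x - mu_ml) ** 2).mean()  # sigma^2_ML with the plug-in mean; biased by factor (N-1)/N

print(mu_ml, var_ml)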

Evaluating an estimator

Let x be a sample from a pdf with parameter θ, and θ̂ be an estimator of θ.


To evaluate the quality of this estimator, we can measure how much it differs from θ, that is, (θ̂ − θ)².
But since θ̂ is a random variable (it depends on the sample), we need to average this over possible samples and consider r(θ̂, θ), the mean square error of the estimator θ̂, defined as

r(θ̂, θ) = E[(θ̂ − θ)²]
The bias of an estimator is given as

biasθ (θ̂) = E [θ̂] − θ

If biasθ (θ̂) = 0 for all values of θ, then we say that θ̂ is an unbiased estimator.

Evaluating an estimator: bias and variance (cont.)
If biasθ (θ̂) = 0 for all values of θ, then we say that θ̂ is an unbiased estimator.

Example (Sample average)


Assume that N samples xk are drawn from some density with mean µ. The sample average µ̂ is an unbiased estimator of the mean µ because

E[µ̂] = E[(Σ_k xk)/N] = (1/N) Σ_k E[xk] = Nµ/N = µ

An estimator θ̂ is a consistent estimator if

lim_{N→∞} Var(θ̂) = 0

Example (Sample average)


Assume that N samples xk are drawn from some density with mean µ. The sample average µ̂ is a consistent estimator of the mean µ because

Var(µ̂) = Var[(Σ_k xk)/N] = (1/N²) Σ_k Var[xk] = Nσ²/N² = σ²/N

As N gets larger, µ̂ deviates less from µ.


Evaluating an estimator (cont.)
Example (Sample variance)
Assume that N samples xk are drawn from some density with variance σ². The sample variance σ̂² is a biased estimator of the variance σ² because

σ̂²   = (1/N) Σ_k (xk − µ̂)² = (Σ_k xk² − N µ̂²)/N
E[σ̂²] = (Σ_k E[xk²] − N E[µ̂²])/N

Given that E[x²] = Var(x) + E[x]², we can write

E[xk²] = σ² + µ²
E[µ̂²]  = σ²/N + µ²

Replacing back, we obtain

E[σ̂²] = (N(σ² + µ²) − N(σ²/N + µ²))/N = ((N − 1)/N) σ² ≠ σ²

This is an example of an asymptotically unbiased estimator whose bias goes to 0 as N goes to ∞.
Properties of maximum likelihood estimation

If θ0 is the true value of the unknown parameter in p(x; θ), it can be shown that under
generally valid conditions the following are true
The ML estimate is asymptotically unbiased, that is,

lim_{N→∞} E[θ̂_ML] = θ0

The ML estimate is asymptotically consistent, that is, it satisfies

lim_{N→∞} Prob[ ||θ̂_ML − θ0||² ≤ ε ] = 1

for arbitrarily small ε. A stronger condition for consistency is also true:

lim_{N→∞} E[ ||θ̂_ML − θ0||² ] = 0

The ML estimate is asymptotically efficient; that is, it achieves the lowest possible variance that any estimator can achieve (the Cramer-Rao lower bound).
The pdf of the ML estimate as N → ∞ approaches the Gaussian distribution with mean θ0 .
In summary, the ML estimator is (asymptotically) unbiased, normally distributed, and has the minimum possible variance. However, all these properties are valid only for large values of N.
If N is small, little can be said about the ML estimates in general.

Bayesian estimation

Sometimes, before looking at a sample, we (or experts of the application) may have some
prior information on the possible value range that a parameter, θ, may take. This
information is quite useful and should be used, especially when the sample is small.
The prior information does not tell us exactly what the parameter value is (otherwise we
would not need the sample), and we model this uncertainty by viewing θ as a random
variable and by defining a prior density for it, p(θ).
For example, we are told that θ is approximately normal and with 90 percent confidence,
θ lies between 5 and 9, symmetrically around 7.
The prior density p(θ) tells the likely values that θ may take before looking at the sample.
This is combined with what the sample data tell us, p(X|θ), using the Bayes rule to obtain the posterior density of θ, which describes the likely θ values after looking at the sample.

p(θ|X) = p(X|θ) p(θ) / p(X) = p(X|θ) p(θ) / ∫ p(X|θ′) p(θ′) dθ′

Bayesian estimation (cont.)

For estimating the density at x, we have

p(x|X) = ∫ p(x, θ|X) dθ = ∫ p(x|θ, X) p(θ|X) dθ = ∫ p(x|θ) p(θ|X) dθ

p(x|θ, X) = p(x|θ), because once we know θ, the sufficient statistic, we know everything about the distribution.
Evaluating the integrals may be quite difficult, except in case where the posterior has a
nice form.
When the full integration is not feasible, we reduce it to a single point.
If we can assume that p(θ|X) has a narrow peak around its mode, then using the maximum a posteriori (MAP) estimate will make the calculation easier.
If p(θ|X) is known, then p(x|X) is the average of p(x|θ) with respect to θ, that is

p(x|X ) = Eθ [p(x|θ)]

Bayesian estimation (example)
Let p(x|µ) be a univariate Gaussian N(µ, σ²) with unknown mean µ, which is itself assumed to follow a Gaussian N(µ0, σ0²). From the previous slide, we have

p(µ|X) = p(X|µ) p(µ) / p(X) = (1/α) Π_{k=1}^{N} p(xk|µ) p(µ)

p(X) is a constant denoted by α, or

p(µ|X) = (1/α) Π_{k=1}^{N} [ (1/(√(2π)σ)) exp(−(xk − µ)²/(2σ²)) ] · (1/(√(2π)σ0)) exp(−(µ − µ0)²/(2σ0²))

When N samples are given, p(µ|X) turns out to be a Gaussian (show it), that is,

p(µ|X) = (1/(√(2π)σN)) exp(−(µ − µN)²/(2σN²))

µN  = (N σ0² x̄N + σ² µ0) / (N σ0² + σ²)
σN² = σ² σ0² / (N σ0² + σ²)

By some algebraic simplification, we obtain the following Gaussian pdf:

p(x|X) = (1/√(2π(σ² + σN²))) exp(−(1/2)(x − µN)²/(σ² + σN²))
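A minimal sketch of this Bayesian update for the mean, assuming σ² is known and using made-up prior parameters and synthetic data.

import numpy as np

def posterior_mean_params(x, sigma2, mu0, sigma0_2):
    # Posterior N(mu_N, sigma_N^2) for the mean of N(mu, sigma^2) with prior N(mu0, sigma0^2).
    N, xbar = len(x), np.mean(x)
    mu_N = (N * sigma0_2 * xbar + sigma2 * mu0) / (N * sigma0_2 + sigma2)
    sigma_N2 = sigma2 * sigma0_2 / (N * sigma0_2 + sigma2)
    return mu_N, sigma_N2

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=20)                 # data with sigma^2 = 1 (assumed known)
mu_N, sigma_N2 = posterior_mean_params(x, 1.0, mu0=0.0, sigma0_2=4.0)
print(mu_N, sigma_N2)                             # the posterior concentrates near the sample mean as N grows
# Predictive density: p(x|X) = N(mu_N, sigma^2 + sigma_N^2)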
Maximum a posteriori estimation
In the maximum likelihood estimate, we considered θ as an unknown parameter.
In maximum a posteriori estimation, we consider θ as a random vector, and we will
estimate its value based on sample X .
From the Bayes theorem, we have

p(θ|X) = p(X|θ) p(θ) / p(X)

The maximum a posteriori (MAP) estimate θ̂_MAP is defined as the point where p(θ|X) becomes maximum.
A necessary condition that θ̂_MAP must satisfy in order to be a maximum is that its gradient with respect to θ be zero:

∂p(θ|X)/∂θ = 0

or

∂[p(X|θ) p(θ)]/∂θ = 0
The difference between ML and MAP estimates lies in the involvement of p(θ) in the
MAP.
If p(θ) is uniform, then both estimates yield identical results.
Maximum a posteriori estimation (example)

Let x1 , x2 , . . . , xN be vectors drawn from a normal distribution with known covariance


matrix and unknown mean, that is
 
1 1 T −1
p(xk ; µ) = exp − (xk − µ) σ (xk − µ)
(2π)D/2 |Σ|D/2 2
Assume that the unknown mean vector µ is known to be normally distributed as
1 ||µ − µ0 ||2
 
1
p(µ) = exp −
(2π)D/2 σµD 2 σµ2

The MAP estimate is given by the solution of


N
!
∂ Y
ln p(xk |µ)p(µ) = 0
∂µ
k=1

For Σ = σ 2 I , we obtain
2
σµ PN
µ0 + σ2 k=1 xk
µ̂MAP = 2
σµ
1+ σ2
N
2
σµ
When σ2
 1, then µ̂MAP ≈ µ̂ML .
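A small sketch comparing µ̂_MAP with µ̂_ML for the Σ = σ²I case above, using assumed values of σ², σµ², µ0, and synthetic data.

import numpy as np

rng = np.random.default_rng(2)
sigma2, sigma_mu2 = 1.0, 0.5           # known noise variance and prior variance (assumed)
mu0 = np.zeros(2)                      # prior mean (assumed)
x = rng.normal([2.0, -1.0], np.sqrt(sigma2), size=(50, 2))   # N = 50 two-dimensional samples

mu_ml = x.mean(axis=0)
ratio = sigma_mu2 / sigma2
mu_map = (mu0 + ratio * x.sum(axis=0)) / (1 + ratio * x.shape[0])

print(mu_ml, mu_map)   # with N = 50 and this prior, the two estimates nearly coincide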
Mixture models for density estimation

An alternative way to model an unknown density function p(x) is via a linear combination of M density functions in the form

p(x) = Σ_{m=1}^{M} πm p(x|m)

where

Σ_{m=1}^{M} πm = 1   and   ∫ p(x|m) dx = 1

This modeling implicitly assumes that each point x may be drawn from any of the M model distributions with probability πm (for m = 1, 2, . . . , M).
It can be shown that this modeling can closely approximate any continuous density function for a sufficient number of mixture components M and appropriate model parameters.

Mixture models(cont.)
The first step of the procedure involves the choice of the set of density components p(x|m) in the parametric form p(x|m, θ). Thus we have

p(x; θ) = Σ_{m=1}^{M} πm p(x|θm)

The second step is the computation of the unknown parameters θ1, . . . , θM and π1, π2, . . . , πM based on the set of available training data.
The parameter set is defined as θ = {π1, π2, . . . , πM, θ1, θ2, . . . , θM}, with Σ_m πm = 1.
Given data X = {x1, x2, . . . , xN}, and assuming the mixture model

p(x; θ) = Σ_{m=1}^{M} πm p(x|θm)

we want to estimate the parameters.
In order to estimate each πm, we can count how many points of X come from each of the M components and then normalize by N:

π̂m = Nm / N

Each Nm can be obtained from Nm = Σ_{n=1}^{N} zmn, where

zmn = 1 if the nth point was drawn from component m, and 0 otherwise.
Mixture models(cont.)

What about the specific parameters θ̂m of each of the components?


We need to obtain the estimates θ̂m which maximize the likelihood of the data points
which were drawn from component m under the parametric form p(x|θm ).
If the mixture components were Gaussian, then the maximum likelihood estimate for the component mean vectors would be

µ̂m = Σ_{n=1}^{N} zmn xn / Σ_{n=1}^{N} zmn

The estimate of the covariance matrix for each component would be

Σ̂m = ( Σ_{n=1}^{N} zmn (xn − µ̂m)(xn − µ̂m)^T ) / Σ_{n=1}^{N} zmn

The difficulty is that we do not know zmn. This is a major difficulty: because the variables zmn are hidden (latent), our ML estimates cannot follow in the straightforward manner we had anticipated.
The problem is that we assumed knowledge of the values of the indicator variables zmn.

Mixture models(cont.)
We need the joint likelihood of the data X = {x1, x2, . . . , xN} and the indicator variables Z = {z1, z2, . . . , zM}, where each zm = {zm1, zm2, . . . , zmN}.
Given θ = {θ1, θ2, . . . , θM}, we can marginalize over all possible component allocations:

p(X|θ) = Σ_Z p(X, Z|θ)
The summation is over all possible values which Z may take on.
Then log p(X|θ) equals

log p(X|θ) = log Σ_Z p(X, Z|θ) = log Σ_Z p(Z|X) [ p(X, Z|θ) / p(Z|X) ]

Using the inequality log E[x] ≥ E[log x], we can write

log Σ_Z p(Z|X) [p(X, Z|θ)/p(Z|X)] ≥ Σ_Z p(Z|X) log [p(X, Z|θ)/p(Z|X)]
                                   = Σ_Z p(Z|X) log p(X, Z|θ) − Σ_Z p(Z|X) log p(Z|X)
Mixture models(cont.)

Since the xn are drawn i.i.d., each from exactly one of the M component distributions, the summation over all Z reduces to a summation over all n and m; i.e., the log-likelihood bound (LL) equals (derive it):

Σ_Z p(Z|X) log [p(X, Z|θ)/p(Z|X)] = Σ_{m,n} p(m|xn) log [ p(xn|θm) p(m) / p(m|xn) ]
    = Σ_{m=1}^{M} Σ_{n=1}^{N} p(m|xn) log p(xn|θm) p(m) − Σ_{m=1}^{M} Σ_{n=1}^{N} p(m|xn) log p(m|xn)

p(m|xn ) is the probability that zmn = 1 and p(m) is the probability that zmn = 1 for any n.
The Expectation Maximization (EM) algorithm is a general-purpose method to maximize the likelihood of the complete data (X and Z) so as to obtain estimates of the component parameters θm.
Before performing the Maximization step, we need to obtain the expected values of the set of latent variables zmn.
Once we have obtained the expected values of the latent variables, we then perform the Maximization step to obtain the current parameter estimates.
Mixture models(cont.)
This EM interleaving is continued until some convergence criterion is achieved.
Taking the derivative of LL with respect to p(m|xn) gives

∂LL/∂p(m|xn) = log p(xn|θm) p(m) − log p(m|xn) − 1

Setting this to zero, we see that p(m|xn) ∝ p(xn|θm) p(m); normalizing appropriately yields a distribution of the form

p(m|xn) = p(xn|θm) p(m) / Σ_{m′=1}^{M} p(xn|θm′) p(m′)

You should now be able to see that this is the posterior distribution over the mixture
components m which generated xn , or the expected value of the binary variable zmn .
Having maximized the bound with respect to the expected values of the indicator variables, we now need to maximize the bound with respect to the parameter values.
The only terms in LL which are dependent on the component parameters are
M X
X N
p(m|xn ) log p(xn |θm )p(m).
m=1 n=1

We maximize the above with respect to each θm .


Expectation maximization
Assume that each p(xn|θm) is a multivariate Gaussian; then, expanding and retaining the terms that depend on the parameters, we obtain

L = −(1/2) Σ_{m=1}^{M} Σ_{n=1}^{N} p(m|xn) ln|Σm|
    −(1/2) Σ_{m=1}^{M} Σ_{n=1}^{N} p(m|xn) (xn − µm)^T Σm^{-1}(xn − µm)
    + Σ_{m=1}^{M} Σ_{n=1}^{N} p(m|xn) log p(m)
Taking derivatives with respect to µm and solving, we have

µ̂m = Σ_{n=1}^{N} p(m|xn) xn / Σ_{n=1}^{N} p(m|xn)

Compare this with the estimator obtained when we have perfect knowledge of the hidden variables zmn, i.e.,

µ̂m = Σ_{n=1}^{N} zmn xn / Σ_{n=1}^{N} zmn
So in the absence of the values zmn , we employ the expected values, or the posterior
probabilities p(m|xn ) which are obtained in the Expectation step.
Expectation maximization (cont.)

The estimator for the covariance matrices is

Σ̂m = Σ_{n=1}^{N} p(m|xn) (xn − µ̂m)(xn − µ̂m)^T / Σ_{n=1}^{N} p(m|xn)

We also see that we have replaced perfect knowledge of the allocation variables with our
current estimates of the posteriors p(m|xn ).
We also need an estimate for p(m); taking derivatives, we observe that

p(m) ∝ Σ_{n=1}^{N} p(m|xn)

Normalizing the above then results in

p(m) = (1/N) Σ_{n=1}^{N} p(m|xn)

Expectation maximization (cont.)

The EM algorithm has two steps: Expectation and Maximization.
In the Expectation step:

p(m|xn) = p(xn|θm) p(m) / Σ_{m′=1}^{M} p(xn|θm′) p(m′)

In the Maximization step:

µ̂m  = Σ_{n=1}^{N} p(m|xn) xn / Σ_{n=1}^{N} p(m|xn)
Σ̂m  = Σ_{n=1}^{N} p(m|xn) (xn − µ̂m)(xn − µ̂m)^T / Σ_{n=1}^{N} p(m|xn)
p(m) = (1/N) Σ_{n=1}^{N} p(m|xn)
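A minimal one-dimensional sketch of these two EM steps for a Gaussian mixture (random initialization, fixed number of iterations, synthetic data); it is an illustration under assumed settings, not a production implementation.

import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, M=2, iters=100, seed=0):
    # EM for a 1-D Gaussian mixture: alternate responsibilities p(m|x_n) and parameter updates.
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, M)                       # initialize means at random data points
    var = np.full(M, x.var())
    pi = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities p(m|x_n) proportional to p(x_n|theta_m) p(m)
        r = np.array([pi[m] * gaussian(x, mu[m], var[m]) for m in range(M)])
        r /= r.sum(axis=0, keepdims=True)
        # M-step: re-estimate mu_m, var_m, and p(m) from the responsibilities
        Nm = r.sum(axis=1)
        mu = (r * x).sum(axis=1) / Nm
        var = (r * (x - mu[:, None]) ** 2).sum(axis=1) / Nm
        pi = Nm / len(x)
    return pi, mu, var

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))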


Nonparametric methods for density estimation

In parametric methods, we assume that the sample is drawn from some known distribution (for example, Gaussian), but the parameters of this distribution are not known and our goal is to estimate them from the data.
The main advantage of parametric methods is that the model is defined up to a small number of parameters; once these parameters are estimated, the whole distribution is known.
The methods used to estimate the parameters of the distribution include maximum likelihood estimation and Bayesian estimation.
Why nonparametric methods for density estimation?
Common parametric forms do not always fit the densities encountered in practice.
Most of the classical parametric densities are unimodal, whereas many practical problems
involve multi-modal densities.
Non-parametric methods can be used with arbitrary distributions and without the assumption
that the forms of the underlying densities are known.
In nonparametric estimation, all we assume is that similar inputs have similar outputs. This
is a reasonable assumption because the world is smooth and functions, whether they are
densities, discriminants, or regression functions, change slowly.

Nonparametric methods for density estimation (cont.)
Assume X = {x1, x2, . . . , xN} are random samples drawn i.i.d. (independently and identically distributed) from the probability density function p(x).
The probability PR that a vector x will fall in a region R is given by

PR = ∫_{x∈R} p(x) dx

The probability that k of the N samples fall in R is given by the binomial law

P_R^{(k)} = C(N, k) PR^k (1 − PR)^{N−k}

The expected value of k is E[k] = N PR, and the MLE for PR equals k/N.
If p(x) is continuous and R is small enough so that p(x) does not vary significantly within it, then for all x ∈ R we can approximate PR by

PR = ∫_{x′∈R} p(x′) dx′ ≈ p(x) V

where V is the volume of R.
Then the density function can be estimated as

p(x) ≈ (k/N) / V

Nonparametric methods for density estimation (example)

Let x be a univariate feature and R(x) be the region given by

R(x) = {x′ | x′ ≤ x}

The nonparametric estimator of the cumulative distribution function P(x) at point x is the proportion of sample points that are less than or equal to x:

P̂(x) = |R(x)| / N

The nonparametric estimate of the density function can be calculated as

p̂(x) = (1/h) [ (|R(x + h)| − |R(x)|) / N ]

where h is the length of the interval; instances x that fall in this interval are assumed to be close enough.
Different heuristics are used to determine the instances that are close and their effects on the estimate.

Histogram estimator
The oldest and most popular method is the histogram, where the input space is divided into equal-sized intervals called bins.
Given an origin x0 and a bin width h, the mth bin, denoted Rm(x), is the interval [x0 + mh, x0 + (m + 1)h) for positive and negative integers m, and the estimate is given by

p̂(x) = |Rm(x)| / (Nh)
In constructing the histogram, we have to choose both an origin and a bin width.
The choice of origin affects the estimate near the boundaries of bins, but it is mainly the bin width that has an effect on the estimate.
When bins are small, the estimate is spiky.
When bins become larger, the estimate becomes smoother.

The estimate is 0 if no instance falls in a bin and there are discontinuities at bin
boundaries.
One advantage of the histogram is that once the bin estimates are calculated and stored,
we do not need to retain the training set.
Naive estimator

A generalization of the histogram method, called the naive estimator, addresses the choice of bin locations.
The main idea behind this method is to use the estimation point to adaptively determine the bin locations, thereby eliminating the origin as an extra parameter. Thus the naive estimator frees us from setting an origin.
Given a bin width h, the bin denoted by R(x) is the interval [x − h/2, x + h/2) and the estimate is given by

p̂(x) = |R(x)| / (Nh)

This equals the histogram estimate where x is always at the center of a bin of size h.
The estimator can also be written as

p̂(x) = (1/(Nh)) Σ_{k=1}^{N} w((x − xk)/h)

where w is a weight function defined as

w(u) = 1 if |u| ≤ 1/2, and 0 otherwise

Properties of Histogram (cont.)
The histogram density model has some drawbacks.
If the process that generated the data is multimodal, this aspect of the distribution can never be captured by unimodal parametric distributions such as the Gaussian.
A histogram density model depends on the choice of the origin x0. This is typically much less significant than the value of h.
A histogram density model may have discontinuities that are due to the bin edges rather than to properties of the data.
A major limitation of the histogram approach is its scalability with dimensionality.

Histogram density model has some advantages.


The histogram density model has the property that once the histogram has been computed, the dataset itself can be discarded, which is an advantage if the dataset is large.

Lessons from the histogram approach to density estimation:
For the estimation of the probability at any point, we should consider the data points that lie within some local neighborhood of that point. For the histogram, this neighborhood is defined by the bins, and there is a natural smoothing parameter (the bin width) describing the locality.
This smoothing parameter should be neither too large nor too small.
Kernel estimator
In order to get a smooth estimate, we use a smooth weight function, called a kernel function (in this context also called a Parzen window):

p̂(x) = (1/(Nh)) Σ_{i=1}^{N} K((x − xi)/h)

K(·) is a kernel (window) function and h is the bandwidth (smoothing parameter).
The most popular kernel function is the Gaussian kernel with mean 0 and variance 1:

K(u) = (1/√(2π)) exp(−u²/2)

The function K(·) determines the shape of the influences and h determines the window width.
The kernel estimator can be generalized to D-dimensional data:

p̂(x) = (1/(N h^D)) Σ_{k=1}^{N} K((x − xk)/h)

K(u) = (1/(2π)^{D/2}) exp(−||u||²/2)

The total number of data points lying in this window (cube) equals (derive it)

k = Σ_{i=1}^{N} K((x − xi)/h)
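A small sketch of the kernel (Parzen) estimator with a Gaussian kernel; the sample, evaluation grid, and bandwidth are assumed values.

import numpy as np

def kde(x_grid, samples, h):
    # Parzen estimate p_hat(x) = (1/Nh) sum_k K((x - x_k)/h) with a Gaussian kernel
    u = (x_grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(4)
samples = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 0.5, 100)])
grid = np.linspace(-4, 7, 5)
print(kde(grid, samples, h=0.5))   # a larger h gives a smoother estimate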
k−Nearest neighbor estimator

One of the difficulties with the kernel approach is that the parameter h is fixed for all
kernels.
Large value of h may lead to over-smoothing.
Reducing value of h may lead to noisy estimates.
The optimal choice of h may be dependent on location within the data space.

k−Nearest neighbor estimator (cont.)

Instead of fixing h and determining the value of k from the data, we fix the value of k
and use the data to find an appropriate value of h.
To do this, we consider a small sphere centered on the point x at which we wish to
estimate the density p(x) and allow the radius of the sphere to grow until it contains
precisely k data points.
p̂(x) = k / (NV)
V is the volume of the resulting sphere.
Value of k determines the degree of smoothing and there is an optimum choice for k that
is neither too large nor too small.
Note that the model produced by the k-nearest-neighbor approach is not a true density model, because its integral over all space diverges.

k−Nearest neighbor classifier
In the k-nearest neighbor classifier, we apply the k-nearest neighbor density estimation technique to each class separately and then make use of the Bayes theorem.
Suppose that we have a data set with Ni points in class Ci and N points in total, so that Σ_i Ni = N.
To classify a new point x, we draw a sphere centered on x containing precisely k points
irrespective of their class.
Suppose this sphere has volume V and contains ki points from class Ci .
An estimate of the density associated with each class equals

p(x|Ci) = ki / (Ni V)

The unconditional density is given by

p(x) = k / (NV)

The class priors equal

p(Ci) = Ni / N

Combining the above equations using the Bayes theorem results in

p(Ci|x) = p(x|Ci) p(Ci) / p(x) = [ki/(Ni V)] · [Ni/N] / [k/(NV)] = ki / k
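A minimal sketch of the k-nearest neighbor classification rule with the Euclidean metric; the training data below are synthetic.

import numpy as np

def knn_classify(x, X_train, t_train, k=3):
    # Assign x to the class most represented among its k nearest training points.
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = t_train[np.argsort(dist)[:k]]
    return np.bincount(nearest).argmax()

rng = np.random.default_rng(5)
X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
t_train = np.array([0] * 30 + [1] * 30)
print(knn_classify(np.array([2.5, 2.0]), X_train, t_train, k=5))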
k−Nearest neighbor classifier (cont.)

In the k-nearest neighbor classifier, the posterior probability of each class equals

p(Ci|x) = p(x|Ci) p(Ci) / p(x) = ki / k

If we wish to minimize the probability of misclassification, this is done by assigning the test point x to the class having the largest posterior probability, corresponding to the largest value of ki/k.
Thus to classify a new point, we identify the k nearest points from the training data set
and then assign the new point to the class having the largest number of representatives
amongst this set.
The particular case of k = 1 is called the nearest-neighbor rule, because a test point is
simply assigned to the same class as the nearest point from the training set.
An interesting property of the nearest-neighbor (k = 1) classifier is that, in the limit
N → ∞, the error rate is never more than twice the minimum achievable error rate of an
optimal classifier.
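A minimal sketch (our own, not from the slides) of the k-nearest neighbor classification rule just described, using the Euclidean distance and a majority vote over the k nearest training points.

import numpy as np
from collections import Counter

def knn_classify(x, X_train, t_train, k):
    # Distances from x to every training point (Euclidean metric).
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest training points.
    nearest = np.argsort(dists)[:k]
    # k_i for each class C_i; return the class with the largest k_i / k.
    votes = Counter(t_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage with two 2-D classes.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
t = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.1, 0.0]), X, t, k=3))   # expected: class 0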

k−Nearest neighbor classifier (example)

The parameter k controls the degree of smoothing: small k produces many small regions for each class, while large k leads to fewer, larger regions.

k−Nearest neighbor classifier (cont.)

The k-nearest neighbor classifier relies on a metric, i.e. a distance function between points.
For all points x, y, and z, a metric d(·, ·) must satisfy the following properties:
Non–negativity: d(x, y ) ≥ 0.
Reflexivity: d(x, y ) = 0 ⇐⇒ x = y .
Symmetry: d(x, y ) = d(y , x).
Triangle inequality : d(x, y ) + d(y , z) ≥ d(x, z).
A general class of metrics for D-dimensional feature vectors is the Minkowski metric (also referred to as the L_p-norm)
L_p(x, y) = (∑_{i=1}^{D} |x_i − y_i|^p)^{1/p}

When p = 1, the metric is called the Manhattan or city-block distance (the L_1-norm).
When p = 2, the metric is the Euclidean distance (the L_2-norm).
When p = ∞, the L_∞-norm is the maximum distance along the individual coordinate axes.

L_∞(x, y) = max_i |x_i − y_i|
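A short sketch of the Minkowski distance, covering the three special cases above; the function name is our own.

import numpy as np

def minkowski(x, y, p):
    # L_p distance: p=1 Manhattan, p=2 Euclidean, p=inf the maximum coordinate difference.
    d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return d.max()
    return (d ** p).sum() ** (1.0 / p)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1), minkowski(x, y, 2), minkowski(x, y, np.inf))   # 7.0 5.0 4.0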

k−Nearest neighbor classifier (cont.)

k-Nearest neighbor (k-NN) is considered a lazy learning algorithm:
It defers data processing until a test example arrives.
It replies to the request by combining its stored data.
It then discards the constructed answer.
Other names for lazy algorithms
Memory–based.
Instance–based.
Example–based.
Case–based.
Experience–based.
This strategy is opposed to an eager learning algorithm which
Analyzes the data and builds a model.
Uses the constructed model to classify the test example.
Read chapter 8 of T. Mitchell's book for other models of instance-based algorithms such as locally weighted regression and case-based reasoning.

Conclusions

Both the k-nearest-neighbor method and the kernel density estimator require the entire training data set to be stored, leading to expensive computation if the data set is large.
This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures such as a KD-tree that allow (approximate) nearest neighbors to be found efficiently without an exhaustive search of the data set (a short sketch follows below).
These nonparametric methods are still severely limited.
On the other hand, simple parametric models are very restricted in terms of the forms of
distribution that they can represent.
We therefore need to find density models that are very flexible and yet for which the
complexity of the models can be controlled independently of the size of the training set,
and we shall see in subsequent chapters how to achieve this.
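As a sketch of the tree-based search mentioned above (our own illustration, assuming SciPy is available), a KD-tree is built once and then answers nearest-neighbor queries without an exhaustive scan; cKDTree here performs exact rather than approximate queries.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
X_train = rng.normal(size=(10000, 5))     # a large 5-dimensional training set

tree = cKDTree(X_train)                   # one-off construction cost
query = rng.normal(size=5)
dists, idx = tree.query(query, k=5)       # distances and indices of the 5 nearest neighbors
print(idx)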

Outline
1 Introduction
2 Bayes decision theory
Minimizing the classification error probability
Minimizing the average risk
Discriminant function and decision surface
Bayesian classifiers for Normally distributed classes
Minimum distance classifier
Bayesian classifiers for independent binary features
3 Supervised learning of the Bayesian classifiers
Parametric methods for density estimation
Maximum likelihood parameter estimation
Bayesian estimation
Maximum a posteriori estimation
Mixture models for density estimation
Nonparametric methods for density estimation
Histogram estimator
Naive estimator
Kernel estimator
k−Nearest neighbor estimator
k−Nearest neighbor classifier

4 Naive Bayes classifier

Naive Bayes classifier

Bayesian classifiers estimate posterior probabilities based on the likelihood, the prior, and the evidence.
These classifiers first estimate p(x|Ci ) and p(Ci ) and then classify the given instance.
How much training data will be required to obtain reliable estimates of these
distributions?
Consider the number of parameters that must be estimated when C = 2 and x is a vector
of D boolean features.
In this case, we need to estimate a set of parameters

θij = p(xi |Cj )

Index i takes on 2^D possible values (one for each possible feature vector), and j takes on 2 possible values.
Therefore, we will need to estimate exactly 2(2^D − 1) such θ_ij parameters.
Unfortunately, this corresponds to two distinct parameters for each of the distinct
instances in the instance space for x.
In order to obtain reliable estimates of each of these parameters, we will need to observe
each of these distinct instances multiple times! This is clearly unrealistic in most practical
learning domains.
For example, if x is a vector containing 30 boolean features, then we will need to estimate 2(2^30 − 1) ≈ 2.1 billion parameters.

Naive Bayes classifier (cont.)

Given the intractable sample complexity for learning Bayesian classifiers, we must look for
ways to reduce this complexity.
The Naive Bayes classifier does this by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modelling p(x_i | C_j), from the original 2(2^D − 1) down to just 2D.

Definition (Conditional Independence)


Given random variables x, y and z, we say x is conditionally independent of y given z, if and
only if the probability distribution governing x is independent of the value of y given z; that is

p(xi |yj , zk ) = p(xi |zk ) ∀i, j, k

The Naive Bayes algorithm is a classification algorithm based on Bayes rule that assumes the features x_1, x_2, . . . , x_D are all conditionally independent of one another given the class label. Thus we have
p(x_1, x_2, . . . , x_D | C_j) = ∏_{i=1}^{D} p(x_i | C_j)

Note that when C and the xi are boolean variables, we need only 2D parameters to define
p(xik |Cj ) for the necessary i, j, and k.
Naive Bayes classifier (cont.)
We derive the Naive Bayes algorithm, assuming in general that C is any discrete-valued
variable, and features x1 , x2 , . . . , xD are any discrete or real-valued features.
Our goal is to train a classifier that will output the probability distribution over possible
values of C , for each new instance x that we ask it to classify.
The probability that C will take on its k-th possible value is, by Bayes rule,
p(C_k | x_1, x_2, . . . , x_D) = p(C_k) p(x_1, x_2, . . . , x_D | C_k) / ∑_j p(C_j) p(x_1, x_2, . . . , x_D | C_j)

Now, assuming the x_i are conditionally independent given the class, we can rewrite this as
p(C_k | x_1, x_2, . . . , x_D) = p(C_k) ∏_i p(x_i | C_k) / ∑_j p(C_j) ∏_i p(x_i | C_j)

The Naive Bayes classification rule is
C = argmax_{C_k} [ p(C_k) ∏_i p(x_i | C_k) ] / [ ∑_j p(C_j) ∏_i p(x_i | C_j) ]

Since the denominator does not depend on C_k, this simplifies to
C = argmax_{C_k} p(C_k) ∏_i p(x_i | C_k)

Naive Bayes for discrete-valued inputs
When the D input features xi each take on J possible discrete values, and C is a discrete
variable taking on M possible values, then our learning task is to estimate two sets of
parameters.
θ_ijk = p(x_i = x_ij | C = C_k)    (feature x_i takes its j-th possible value x_ij)
π_k = p(C = C_k)
We can estimate these parameters using either ML estimates or Bayesian/MAP estimates.
θ̂_ijk = |{x_i = x_ij ∧ C = C_k}| / |C_k|
where |C_k| is the number of training examples in class C_k and the numerator counts those that also have x_i = x_ij.
This maximum likelihood estimate sometimes results in θ estimates of zero, if the data
does not happen to contain any training examples satisfying the condition in the
numerator. To avoid this, it is common to use a smoothed estimate.
θ̂_ijk = (|{x_i = x_ij ∧ C = C_k}| + l) / (|C_k| + l J)
The value of l determines the strength of this smoothing (l = 1 gives Laplace smoothing).
The maximum likelihood estimate for π_k is
π̂_k = |C_k| / N
and the corresponding smoothed estimate is
π̂_k = (|C_k| + l) / (N + l M)
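A hedged Python sketch of discrete Naive Bayes training and prediction using the smoothed estimates above; the function names, the toy data, and the log-space computation are our own choices.

import numpy as np

def train_discrete_nb(X, t, J, M, l=1.0):
    # Smoothed estimates of theta_ijk = p(x_i = j | C_k) and pi_k = p(C_k).
    # X: (N, D) integers in {0, ..., J-1}; t: (N,) class labels in {0, ..., M-1}.
    N, D = X.shape
    theta = np.zeros((D, J, M))
    pi = np.zeros(M)
    for k in range(M):
        Xk = X[t == k]
        pi[k] = (Xk.shape[0] + l) / (N + l * M)
        for i in range(D):
            counts = np.bincount(Xk[:, i], minlength=J)
            theta[i, :, k] = (counts + l) / (Xk.shape[0] + l * J)
    return theta, pi

def predict_discrete_nb(x, theta, pi):
    # argmax_k p(C_k) * prod_i p(x_i | C_k), computed in log space for stability.
    log_post = np.log(pi) + sum(np.log(theta[i, x[i], :]) for i in range(len(x)))
    return int(np.argmax(log_post))

# Toy usage: 2 binary features (J=2), 2 classes (M=2).
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
t = np.array([0, 0, 1, 1])
theta, pi = train_discrete_nb(X, t, J=2, M=2)
print(predict_discrete_nb(np.array([0, 1]), theta, pi))   # expected: class 0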
Naive Bayes for continuous inputs
When features are continuous, we must choose some other way to represent the
distributions p(xi |Ck ).
One common approach is to assume that for each possible Ck , the distribution of each
feature xi is Gaussian defined by mean and variance specific to xi and Ck .
In order to train such a Naive Bayes classifier, we must therefore estimate the mean and
standard deviation of each of these distributions.
μ_ik = E[x_i | C_k]
σ²_ik = E[(x_i − μ_ik)² | C_k]
We must also estimate the prior on C .
πk = p(C = Ck )
We can use either maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates for these parameters.
The maximum likelihood estimator for μ_ik is
μ̂_ik = ∑_j x_ij δ(t_j = C_k) / ∑_j δ(t_j = C_k)
where x_ij is the value of feature x_i in the j-th training example and δ(·) is 1 when its argument is true and 0 otherwise.
The maximum likelihood estimator for σ²_ik is
σ̂²_ik = ∑_j (x_ij − μ̂_ik)² δ(t_j = C_k) / ∑_j δ(t_j = C_k)
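A minimal Gaussian Naive Bayes sketch based on the maximum likelihood estimators above; the names and toy data are ours, and a small constant is added to the variances to avoid division by zero.

import numpy as np

def train_gaussian_nb(X, t, M):
    # ML estimates of the priors pi_k and the per-class, per-feature mu_ik, sigma2_ik.
    N, D = X.shape
    pi = np.zeros(M)
    mu = np.zeros((M, D))
    var = np.zeros((M, D))
    for k in range(M):
        Xk = X[t == k]
        pi[k] = Xk.shape[0] / N
        mu[k] = Xk.mean(axis=0)
        var[k] = Xk.var(axis=0) + 1e-9
    return pi, mu, var

def predict_gaussian_nb(x, pi, mu, var):
    # argmax_k log p(C_k) + sum_i log N(x_i | mu_ik, sigma2_ik)
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return int(np.argmax(np.log(pi) + log_lik))

# Toy usage: two 1-D Gaussian classes centred at -2 and +2.
rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-2, 1, (50, 1)), rng.normal(2, 1, (50, 1))])
t = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
pi, mu, var = train_gaussian_nb(X, t, M=2)
print(predict_gaussian_nb(np.array([1.5]), pi, mu, var))   # expected: class 1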
