Generative Learning Algorithms
By: Shedriko
Preliminary
We have learned in the previous chapter that:
Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron algorithm), are called discriminative learning algorithms.
Preliminary
Here we discuss algorithms that instead try to model p(x|y) and/or p(y). These algorithms are called generative learning algorithms.
For example:
If y indicates whether an example is a dog (y = 0) or an elephant (y = 1), then p(x|y = 0) models the distribution of dogs' features, and p(x|y = 1) models the distribution of elephants' features.
Preliminary
After modeling p(y) (called the class prior) and p(x|y), our algorithm can then use Bayes' rule to derive the posterior distribution on y given x:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

where the denominator is $p(x) = p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)$.
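As a quick numeric sketch of this rule (the prior and likelihood values below are made up for illustration, not from the slides):

```python
# Hypothetical numbers for illustration only.
p_y1 = 0.3                # class prior p(y = 1), e.g. "elephant"
p_y0 = 1.0 - p_y1         # p(y = 0), e.g. "dog"
p_x_given_y1 = 0.05       # likelihood p(x | y = 1) of the observed features
p_x_given_y0 = 0.01       # likelihood p(x | y = 0)

# Bayes' rule: p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x)
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * p_y0
p_y1_given_x = p_x_given_y1 * p_y1 / p_x
print(p_y1_given_x)  # ~0.682, so y = 1 is the more likely label here
```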
GDA (Gaussian Discriminant Analysis)
The multivariate normal distribution
In this model, we'll assume that p(x|y) is distributed according to a multivariate normal distribution.
The multivariate normal distribution in n dimensions (also called the multivariate Gaussian distribution) is parameterized by a mean vector $\mu \in \mathbb{R}^n$ and a covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, where $\Sigma \geq 0$ is symmetric and positive semi-definite.
Also written $\mathcal{N}(\mu, \Sigma)$, its density is given by:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
GDA (Gaussian)
$|\Sigma|$ denotes the determinant of the matrix $\Sigma$.
For a random variable X distributed $\mathcal{N}(\mu, \Sigma)$, the mean is given by:

$$\mathrm{E}[X] = \int_x x\, p(x; \mu, \Sigma)\, dx = \mu$$

The covariance of a vector-valued random variable Z is defined as $\mathrm{Cov}(Z) = \mathrm{E}[(Z - \mathrm{E}[Z])(Z - \mathrm{E}[Z])^T]$, or equivalently $\mathrm{Cov}(Z) = \mathrm{E}[Z Z^T] - (\mathrm{E}[Z])(\mathrm{E}[Z])^T$.
If $X \sim \mathcal{N}(\mu, \Sigma)$, then $\mathrm{Cov}(X) = \Sigma$.
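A minimal sketch of this density in code (assuming NumPy and SciPy are available); the hand-written formula is checked against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, sigma):
    """Evaluate the multivariate normal density p(x; mu, sigma) directly."""
    n = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, -0.2])

print(gaussian_density(x, mu, sigma))         # manual formula
print(multivariate_normal(mu, sigma).pdf(x))  # same value from SciPy
```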
GDA (Gaussian)
The left-most figure shows a Gaussian with zero mean and covariance $\Sigma = I$ (the standard normal density). The middle figure shows the density of a Gaussian with zero mean and $\Sigma = 0.6I$, and the right-most figure shows one with $\Sigma = 2I$.
We see that as $\Sigma$ becomes larger, the Gaussian becomes more 'spread out'.
GDA (Gaussian)
The left-most figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in $\Sigma$, the density becomes more 'compressed' towards the 45° line (given by $x_1 = x_2$), as seen in the contours of those three densities.
GDA (Gaussian)
Here's the last set of examples, generated by varying $\Sigma$:
GDA (Gaussian)
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes 'compressed' again, but in the opposite direction.
As we vary the parameters, the contours will more generally form ellipses (the rightmost figure).
GDA (Gaussian)
Fixing $\Sigma = I$, by varying $\mu$ we can also move the mean of the density around; the corresponding plots show the same bell shape translated to different means.
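A sketch of how figures like these can be reproduced (assuming NumPy, SciPy and Matplotlib; the specific Σ and μ values are the ones discussed above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Covariances/means discussed above: identity, shrunk, grown, off-diagonal, shifted
settings = [
    ("Sigma = I",     np.eye(2),                          [0.0, 0.0]),
    ("Sigma = 0.6 I", 0.6 * np.eye(2),                    [0.0, 0.0]),
    ("Sigma = 2 I",   2.0 * np.eye(2),                    [0.0, 0.0]),
    ("off-diag 0.8",  np.array([[1.0, 0.8], [0.8, 1.0]]), [0.0, 0.0]),
    ("shifted mean",  np.eye(2),                          [1.0, -1.5]),
]

xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack([xs, ys])  # shape (200, 200, 2): one (x1, x2) point per cell

fig, axes = plt.subplots(1, len(settings), figsize=(15, 3))
for ax, (title, sigma, mu) in zip(axes, settings):
    density = multivariate_normal(mu, sigma).pdf(grid)
    ax.contour(xs, ys, density)  # contours are circles/ellipses
    ax.set_title(title)
plt.show()
```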
GDA (Gaussian)
The GDA model
When we have a classification problem in which the input features x are continuous-valued, we can use the GDA model, which models p(x|y) using a multivariate normal distribution:

$$y \sim \mathrm{Bernoulli}(\phi)$$
$$x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$$
$$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$$

GDA (Gaussian)
Writing out the distributions:

$$p(y) = \phi^y (1 - \phi)^{1-y}$$
$$p(x \mid y = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$$
$$p(x \mid y = 1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$$

The parameters are $\phi$, $\Sigma$, $\mu_0$ and $\mu_1$ (note that the two classes share the same covariance matrix $\Sigma$).
GDA (Gaussian)
The log-likelihood of the data is

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)$$

By maximizing $\ell$ with respect to the parameters, we find the maximum likelihood estimates:

$$\phi = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = 1\}$$
$$\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}} \qquad \mu_1 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}$$

and

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^T$$
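A minimal NumPy sketch of these estimators (the function names and synthetic data are my own, for illustration):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum likelihood estimates for GDA with shared covariance."""
    m, n = X.shape
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Each example contributes (x - mu_y)(x - mu_y)^T to the shared Sigma
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    sigma = centered.T @ centered / m
    return phi, mu0, mu1, sigma

def predict_gda(X, phi, mu0, mu1, sigma):
    """Pick the class with the larger posterior p(y|x) via Bayes' rule."""
    inv = np.linalg.inv(sigma)
    def log_gauss(X, mu):
        d = X - mu
        return -0.5 * np.sum(d @ inv * d, axis=1)  # log density up to a shared constant
    score1 = log_gauss(X, mu1) + np.log(phi)
    score0 = log_gauss(X, mu0) + np.log(1 - phi)
    return (score1 > score0).astype(int)

# Synthetic two-Gaussian data (illustrative only)
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], np.eye(2), 100)
X1 = rng.multivariate_normal([2, 2], np.eye(2), 100)
X = np.vstack([X0, X1]); y = np.array([0] * 100 + [1] * 100)

params = fit_gda(X, y)
print("accuracy:", np.mean(predict_gda(X, *params) == y))
```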
GDA (Gaussian)
Discussion: GDA and logistic regression
If we view the quantity $p(y = 1 \mid x; \phi, \mu_0, \mu_1, \Sigma)$ as a function of x, it can be expressed in the form

$$p(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$$

where $\theta$ is an appropriate function of $\phi, \Sigma, \mu_0, \mu_1$ (using the convention of an extra intercept feature $x_0 = 1$). This is exactly the form that logistic regression, a discriminative algorithm, uses to model $p(y = 1 \mid x)$.
GDA makes the stronger modeling assumption; when $p(x \mid y)$ really is (close to) Gaussian, GDA is more data-efficient, while logistic regression is more robust to incorrect modeling assumptions.
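As a sketch of why this holds (the algebra below is the standard expansion of the posterior under the shared-covariance assumption, not taken from the slides): the quadratic terms $x^T \Sigma^{-1} x$ cancel between the two classes, leaving a linear function of x,

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\,\phi}{p(x \mid y = 1)\,\phi + p(x \mid y = 0)(1 - \phi)} = \frac{1}{1 + \exp\!\left(-(\theta^T x + \theta_0)\right)}$$

with

$$\theta = \Sigma^{-1}(\mu_1 - \mu_0), \qquad \theta_0 = -\tfrac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \log\frac{\phi}{1 - \phi}$$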
NB (Naive Bayes)
In GDA, the feature vectors x were continuous; Naive Bayes addresses the case where the $x_j$ are discrete-valued.
The running example is an email spam filter: an email is represented by a binary feature vector whose length equals the number of words in the dictionary, with $x_j = 1$ if the j-th dictionary word appears in the email and $x_j = 0$ otherwise.
To model p(x|y), we have to make a very strong assumption.
Assume: the $x_j$'s are conditionally independent given y (this is called the NB assumption, and the resulting algorithm the NB classifier).
For instance: if I tell you y = 1 (spam email), then knowledge of $x_{2087}$ (whether the word "buy" appears in the message) says nothing about $x_{39831}$ (whether the word "price" appears in the message). This can be written:

$$p(x_{2087} \mid y) = p(x_{2087} \mid y, x_{39831})$$

i.e. $x_{2087}$ and $x_{39831}$ are conditionally independent given y.
NB (Naive Bayes)
We now have:

$$p(x_1, \ldots, x_n \mid y) = p(x_1 \mid y)\, p(x_2 \mid y, x_1) \cdots = \prod_{j=1}^{n} p(x_j \mid y)$$

The first equality follows from the usual chain rule of probability, and the second from the NB assumption.
NB (Naive Bayes)
The model is parameterized by $\phi_{j|y=1} = p(x_j = 1 \mid y = 1)$, $\phi_{j|y=0} = p(x_j = 1 \mid y = 0)$, and $\phi_y = p(y = 1)$.
Given a training set $\{(x^{(i)}, y^{(i)});\, i = 1, \ldots, m\}$, the joint likelihood of the data is

$$L(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^{m} p(x^{(i)}, y^{(i)})$$
NB (Naive Bayes)
Maximizing the joint likelihood gives the maximum likelihood estimates:

$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}} \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}} \qquad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}$$

To make a prediction on a new example x, we compute

$$p(y = 1 \mid x) = \frac{\left(\prod_j p(x_j \mid y = 1)\right) p(y = 1)}{\left(\prod_j p(x_j \mid y = 1)\right) p(y = 1) + \left(\prod_j p(x_j \mid y = 0)\right) p(y = 0)}$$

and pick the class with the higher posterior.
Laplace smoothing
When the original continuous-valued attributes are not well-modeled by a multivariate normal distribution, discretizing the features and using NB (instead of GDA) will often result in a better classifier.
A simple change will also make the NB algorithm work much better, especially for text classification.
For example: you have never received an email containing the word "NIPS" before, and it is the 35000th word in the dictionary. Your NB spam filter has therefore picked its maximum likelihood estimates of the parameters $\phi_{35000|y}$ to be:

$$\phi_{35000|y=1} = 0 \qquad \phi_{35000|y=0} = 0$$

(because no training email, spam or non-spam, has ever contained it).
NB (Naive Bayes)
The filter then computes the indeterminate ratio 0/0 for the posterior, because each of the terms $\prod_j p(x_j \mid y)$ includes the factor $p(x_{35000} \mid y) = 0$.
More generally, consider estimating the mean of a multinomial random variable z taking values in $\{1, \ldots, k\}$, parameterized by $\phi_j = p(z = j)$. Given m independent observations, the maximum likelihood estimates are

$$\phi_j = \frac{1}{m} \sum_{i=1}^{m} 1\{z^{(i)} = j\}$$

The $\phi_j$ above might end up as zero (which was the problem). To avoid this, we use Laplace smoothing:

$$\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} + 1}{m + k}$$
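A small worked check (the numbers are mine): with m = 3 observations z = 1, 1, 2 over k = 3 outcomes, the smoothed estimates are

$$\phi_1 = \frac{2 + 1}{3 + 3} = \frac{1}{2}, \qquad \phi_2 = \frac{1 + 1}{3 + 3} = \frac{1}{3}, \qquad \phi_3 = \frac{0 + 1}{3 + 3} = \frac{1}{6}$$

They still sum to 1, and the unseen outcome z = 3 no longer gets probability zero.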
NB (Naive Bayes)
Returning to the NB classifier, applying Laplace smoothing to the earlier estimates (the $x_j$ are binary, so k = 2) gives:

$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\} + 2} \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} + 2}$$
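A compact NumPy sketch of this smoothed Bernoulli NB (function names and the toy corpus are mine; log probabilities are used to avoid underflow in the product over words):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Laplace-smoothed MLE for binary-feature NB. X: (m, n) 0/1 matrix."""
    phi_y = np.mean(y == 1)
    # Smoothed per-word probabilities: (word count in class + 1) / (class size + 2)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y0, phi_j_y1

def predict_bernoulli_nb(X, phi_y, phi_j_y0, phi_j_y1):
    """Compare log posteriors; the product over words becomes a sum of logs."""
    log1 = X @ np.log(phi_j_y1) + (1 - X) @ np.log(1 - phi_j_y1) + np.log(phi_y)
    log0 = X @ np.log(phi_j_y0) + (1 - X) @ np.log(1 - phi_j_y0) + np.log(1 - phi_y)
    return (log1 > log0).astype(int)

# Toy corpus: 4 emails over a 3-word dictionary ["buy", "price", "hello"]
X = np.array([[1, 1, 0],   # spam
              [1, 0, 0],   # spam
              [0, 0, 1],   # non-spam
              [0, 1, 1]])  # non-spam
y = np.array([1, 1, 0, 0])

params = fit_bernoulli_nb(X, y)
print(predict_bernoulli_nb(np.array([[1, 1, 0]]), *params))  # -> [1] (spam-like)
```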
Event models for text classification
For text classification as presented so far, NB uses what is called the multi-variate Bernoulli event model.
In this model, we first randomly determine whether the message is spam or non-spam (according to the class prior p(y)).
Then, the sender runs through the dictionary, deciding independently whether to include each word j in the email, according to the probabilities $p(x_j = 1 \mid y) = \phi_{j|y}$.
Thus, the probability of a message is given by $p(y) \prod_{j=1}^{n} p(x_j \mid y)$.
NB (Naive Bayes)
Here is a different model, a better one for text, called the multinomial event model (MEM).
We let $x_i$ denote the identity of the i-th word in the email, taking values in $\{1, \ldots, |V|\}$, where |V| is the size of our vocabulary.
For example: an email that starts with "A NIPS …" has
$x_1 = 1$ ("a" is the first word in the dictionary)
$x_2 = 35000$ (if "nips" is the 35000th word in the dictionary)
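A small sketch contrasting the two representations (the toy vocabulary and variable names are mine):

```python
# Toy vocabulary mapping words to dictionary indices (1-based, as in the slides)
vocab = {"a": 1, "buy": 2, "nips": 3, "price": 4}

email = ["a", "nips"]

# Multi-variate Bernoulli event model: one binary entry per dictionary word
bernoulli_x = [1 if word in email else 0 for word in vocab]   # [1, 0, 1, 0]

# Multinomial event model: one index per word position in the email
multinomial_x = [vocab[word] for word in email]               # [1, 3]

print(bernoulli_x, multinomial_x)
```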
NB (Naive Bayes)
In the MEM, we assume an email is generated via a random process in which spam/non-spam is first determined (according to p(y), as before).
Then, the sender composes the email by drawing $x_1, x_2, x_3, \ldots$ from some multinomial distribution over words ($p(x_i \mid y)$). The overall probability of a message is given by

$$p(y) \prod_{i=1}^{n} p(x_i \mid y)$$

(this looks like the earlier formula, but the terms $p(x_i \mid y)$ are now a multinomial, not a Bernoulli distribution).
The parameters: $\phi_y = p(y = 1)$, $\phi_{k|y=1} = p(x_j = k \mid y = 1)$ (for any j), and $\phi_{k|y=0} = p(x_j = k \mid y = 0)$.
NB (Naive Bayes)
Maximum likelihood
Given a training set in which the i-th email has $n_i$ words, maximizing the likelihood and applying Laplace smoothing gives:

$$\phi_{k|y=1} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i + |V|}$$
$$\phi_{k|y=0} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n_i + |V|}$$
$$\phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}$$
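A minimal sketch of these multinomial estimates in code (the data and names are mine; emails are lists of 0-based word indices for convenience):

```python
import numpy as np

def fit_mem(emails, y, vocab_size):
    """Laplace-smoothed multinomial event model. emails: lists of word indices."""
    counts = np.zeros((2, vocab_size))  # per-class word counts
    totals = np.zeros(2)                # per-class total word counts
    for email, label in zip(emails, y):
        for word in email:
            counts[label, word] += 1
        totals[label] += len(email)
    phi_k = (counts + 1) / (totals[:, None] + vocab_size)  # Laplace smoothing
    phi_y = np.mean(np.array(y) == 1)
    return phi_k, phi_y

def predict_mem(email, phi_k, phi_y):
    """Compare log posteriors of the two classes for one email."""
    log1 = np.log(phi_y) + sum(np.log(phi_k[1, w]) for w in email)
    log0 = np.log(1 - phi_y) + sum(np.log(phi_k[0, w]) for w in email)
    return int(log1 > log0)

# Toy data over a 4-word vocabulary; word 2 ("nips") appears only in spam
emails = [[1, 2], [2, 2, 3], [0, 3], [0, 0, 3]]
y = [1, 1, 0, 0]

phi_k, phi_y = fit_mem(emails, y, vocab_size=4)
print(predict_mem([2, 3], phi_k, phi_y))  # -> 1 (word 2 is spam-indicative)
```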
Thank you.