
Generative Algorithms

By: Shedriko

Outline & Content


Preliminary
GDA (Gaussian Discriminant Analysis)
NB (Naive Bayes)

Preliminary
We learned in the previous chapter that:
Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron algorithm), are called discriminative learning algorithms.

Preliminary

Here, we'll talk about algorithms that instead try to model p(x|y) and/or p(y). These algorithms are called generative learning algorithms.
For example:
If y indicates whether an example is a dog (y = 0) or an elephant (y = 1), then p(x|y = 0) models the distribution of dogs' features, and p(x|y = 1) models the distribution of elephants' features.

Preliminary
After modeling p(y) (called the class prior) and p(x|y), our algorithm can then use Bayes' rule to derive the posterior distribution on y given x:

p(y|x) = p(x|y) p(y) / p(x)

Here the denominator is given by:

p(x) = p(x|y = 1) p(y = 1) + p(x|y = 0) p(y = 0)

If we are calculating p(y|x) in order to make a prediction, we don't need to calculate the denominator, since

arg max_y p(y|x) = arg max_y [ p(x|y) p(y) / p(x) ] = arg max_y p(x|y) p(y)
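Below is a minimal numeric sketch of this Bayes' rule computation in Python; the likelihood and prior values are made-up numbers chosen only for illustration, not taken from the slides.

# Minimal sketch of Bayes' rule for a two-class generative model.
# The probability values below are hypothetical, for illustration only.

def posterior_y1(px_given_y1, px_given_y0, prior_y1):
    """Return p(y = 1 | x) via Bayes' rule."""
    prior_y0 = 1.0 - prior_y1
    px = px_given_y1 * prior_y1 + px_given_y0 * prior_y0   # the denominator p(x)
    return px_given_y1 * prior_y1 / px

p1, p0, phi = 0.02, 0.005, 0.3     # p(x|y=1), p(x|y=0), p(y=1) at some fixed x
print(posterior_y1(p1, p0, phi))   # full posterior p(y=1|x)

# For prediction, comparing the numerators p(x|y) p(y) is enough:
prediction = 1 if p1 * phi > p0 * (1 - phi) else 0
print(prediction)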

GDA (Gaussian)
The multivariate normal distribution
In this model, we'll assume that p(x|y) is distributed according to a multivariate normal distribution.
The multivariate normal distribution in n dimensions (also called the multivariate Gaussian distribution) is parameterized by a mean vector μ ∈ R^n and a covariance matrix Σ ∈ R^(n×n), where Σ is symmetric and positive semi-definite.
Also written N(μ, Σ), its density is given by:

p(x; μ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −1/2 (x − μ)^T Σ^(−1) (x − μ) )
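As a sanity check on the formula, here is a short Python sketch that evaluates the density directly and compares it against scipy.stats.multivariate_normal (assuming NumPy and SciPy are available); the mean, covariance, and query point are arbitrary example values.

import numpy as np
from scipy.stats import multivariate_normal   # used only as a cross-check

def gaussian_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])    # example covariance from the slides
x = np.array([0.5, -0.3])

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree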

GDA (Gaussian)
|Σ| denotes the determinant of the matrix Σ.
For a random variable X distributed N(μ, Σ), the mean is given by:

E[X] = ∫ x p(x; μ, Σ) dx = μ

The covariance of a vector-valued random variable Z is defined as Cov(Z) = E[(Z − E[Z])(Z − E[Z])^T], or equivalently Cov(Z) = E[Z Z^T] − (E[Z])(E[Z])^T.
If X ~ N(μ, Σ), then Cov(X) = Σ.

GDA (Gaussian)

Examples of what the density of a Gaussian distribution looks like:

The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero-vector) and covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution.

GDA (Gaussian)
The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6 I.
The right-most figure shows one with Σ = 2 I.
We see that as Σ becomes larger, the Gaussian becomes more spread-out.

GDA (Gaussian)

Here are some more examples:

The figures above show Gaussians with mean 0 and with covariance matrices, respectively,

Σ = [1 0; 0 1],   Σ = [1 0.5; 0.5 1],   Σ = [1 0.8; 0.8 1]

GDA (Gaussian)
The left-most figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more compressed towards the 45° line (given by x1 = x2), as seen in the contours of those three densities below.

GDA (Gaussian)
Here is the last set of examples, generated by varying Σ:

The plots above used, respectively,

Σ = [1 -0.5; -0.5 1],   Σ = [1 -0.8; -0.8 1],   Σ = [3 0.8; 0.8 1]

GDA (Gaussian)
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density becomes compressed again, but in the opposite direction.
As we vary the parameters, the contours will more generally form ellipses (the rightmost figure).

GDA (Gaussian)

Fixing Σ = I, by varying μ we can also move the mean of the density around.

The figures above were generated using Σ = I and, respectively, μ = [1; 0], μ = [-0.5; 0], μ = [-1; -1.5].
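A quick sampling sketch (assuming NumPy) reproduces these effects numerically: the empirical covariance of the samples tracks Σ, and changing μ only translates the point cloud. The covariance and mean values below echo the examples discussed above.

import numpy as np

rng = np.random.default_rng(0)

# Draw samples from 2-D Gaussians with the covariances discussed above and
# check the empirical covariance against Sigma.
for Sigma in (np.eye(2), 0.6 * np.eye(2), 2 * np.eye(2),
              np.array([[1.0, 0.8], [0.8, 1.0]])):
    samples = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=20000)
    print(np.round(np.cov(samples.T), 2))    # should be close to Sigma

# Shifting the mean just translates the cloud of samples:
shifted = rng.multivariate_normal(mean=[-1.0, -1.5], cov=np.eye(2), size=20000)
print(np.round(shifted.mean(axis=0), 2))     # close to [-1, -1.5]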

GDA (Gaussian)
The GDA model

When we have a classification problem in which the input features x are continuous-valued random variables, we can use the GDA model.
It models p(x|y) using a multivariate normal distribution. The model is:

y ~ Bernoulli(φ)
x | y = 0 ~ N(μ0, Σ)
x | y = 1 ~ N(μ1, Σ)

GDA (Gaussian)

Writing out the distributions:

p(y) = φ^y (1 − φ)^(1−y)
p(x | y = 0) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −1/2 (x − μ0)^T Σ^(−1) (x − μ0) )
p(x | y = 1) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −1/2 (x − μ1)^T Σ^(−1) (x − μ1) )

The parameters of the model are φ, Σ, μ0 and μ1 (there are two different mean vectors μ0 and μ1, but only one shared covariance matrix Σ).
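A small sketch of this generative process itself (assuming NumPy); the parameter values are hypothetical, picked just to make the sampling concrete.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GDA parameters for 2-D features (chosen arbitrarily for the sketch).
phi = 0.4                                    # class prior p(y = 1)
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])   # shared covariance matrix

def sample_gda(m):
    """Sample m (x, y) pairs from the generative process above."""
    y = rng.binomial(1, phi, size=m)                       # y ~ Bernoulli(phi)
    means = np.where(y[:, None] == 1, mu1, mu0)            # pick mu_y per example
    x = np.array([rng.multivariate_normal(mu, Sigma) for mu in means])
    return x, y

X, y = sample_gda(500)
print(X.shape, y.mean())    # rough check: fraction of y = 1 should be near phi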

GDA (Gaussian)

The log-likelihood of the data is given by:

ℓ(φ, μ0, μ1, Σ) = log Π_{i=1}^{m} p(x^(i), y^(i); φ, μ0, μ1, Σ)
               = log Π_{i=1}^{m} p(x^(i) | y^(i); μ0, μ1, Σ) p(y^(i); φ)

By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimates:

φ  = (1/m) · sum_i 1{y^(i) = 1}
μ0 = ( sum_i 1{y^(i) = 0} x^(i) ) / ( sum_i 1{y^(i) = 0} )
μ1 = ( sum_i 1{y^(i) = 1} x^(i) ) / ( sum_i 1{y^(i) = 1} )
Σ  = (1/m) · sum_i (x^(i) − μ_{y^(i)}) (x^(i) − μ_{y^(i)})^T
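These closed-form estimates translate almost directly into code. A minimal sketch, assuming a NumPy feature matrix X of shape (m, n) and 0/1 labels y:

import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA (a sketch of the formulas above)."""
    m = len(y)
    phi = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    mus = np.where(y[:, None] == 1, mu1, mu0)   # mu_{y^(i)} for each example
    diff = X - mus
    Sigma = diff.T @ diff / m                   # shared covariance estimate
    return phi, mu0, mu1, Sigma

# Usage with the synthetic data sampled in the previous sketch:
# phi_hat, mu0_hat, mu1_hat, Sigma_hat = fit_gda(X, y)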

GDA (Gaussian)

Example of two Gaussians with different mean vectors μ0 and μ1 but a single shared covariance matrix Σ.
Also shown in the figure is the straight line giving the decision boundary, at which p(y = 1|x) = 0.5.
On one side of the boundary we'll predict y = 1, and on the other side y = 0.
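A prediction sketch matching this rule (assuming NumPy and parameters fitted as above): since the denominator p(x) is shared, comparing the class joints log p(x|y) + log p(y) is equivalent to thresholding p(y = 1|x) at 0.5.

import numpy as np

def gda_log_joint(x, mu, Sigma, prior):
    """log p(x|y) + log p(y) for one class; the shared p(x) term is dropped
    because it does not affect the arg max."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    log_px = -0.5 * (diff @ np.linalg.solve(Sigma, diff) + logdet
                     + len(x) * np.log(2 * np.pi))
    return log_px + np.log(prior)

def gda_predict(x, phi, mu0, mu1, Sigma):
    """Predict 1 iff p(y=1|x) > 0.5, i.e. iff the class-1 joint is larger."""
    return int(gda_log_joint(x, mu1, Sigma, phi)
               > gda_log_joint(x, mu0, Sigma, 1 - phi))

# e.g. gda_predict(np.array([1.5, 1.0]), phi_hat, mu0_hat, mu1_hat, Sigma_hat)
# with the parameters from the fitting sketch above.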

GDA (Gaussian)
Discussion: GDA and logistic regression

The GDA model has an interesting relationship to logistic regression.

If we view the quantity p(y = 1 | x; φ, μ0, μ1, Σ) as a function of x, we'll find that it can be expressed in the form

p(y = 1 | x; φ, μ0, μ1, Σ) = 1 / (1 + exp(−θ^T x))

where θ is some appropriate function of φ, Σ, μ0, μ1.
(This is exactly the form that logistic regression, a discriminative algorithm, uses to model p(y = 1|x).)
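Here is a brief sketch of why this happens, with constant terms eventually folded into an intercept; this is an outline of the algebra, not the slides' own derivation.

\begin{aligned}
p(y=1 \mid x)
  &= \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)}
   = \frac{1}{1 + \exp\!\left(-\left[\log\frac{\phi}{1-\phi}
       + \log\frac{p(x \mid y=1)}{p(x \mid y=0)}\right]\right)}, \\
\log\frac{p(x \mid y=1)}{p(x \mid y=0)}
  &= (\mu_1 - \mu_0)^{\top}\Sigma^{-1}x
     - \tfrac{1}{2}\left(\mu_1^{\top}\Sigma^{-1}\mu_1 - \mu_0^{\top}\Sigma^{-1}\mu_0\right).
\end{aligned}

Because Σ is shared between the two classes, the quadratic terms in x cancel, leaving an expression linear in x, which is exactly the logistic form above.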

GDA (Gaussian)

Which one is better?

GDA:
- stronger modeling assumptions
- more data efficient (requires less training data to learn well)
Logistic regression:
- weaker assumptions, and therefore more robust to deviations from the modeling assumptions
- when the data is indeed non-Gaussian (and the dataset is large), it does better than GDA
For the last reason, logistic regression is used more often in practice than GDA.

NB (Naive Bayes)

In GDA, the feature vector x was continuous; in Naive Bayes the xi's are discrete-valued.
For example:
Classifying whether an email is unsolicited commercial (spam) email or non-spam email.
We begin our spam filter by specifying the features xi used to represent an email.
We'll represent an email via a feature vector whose length is equal to the number of words in the dictionary; specifically, if an email contains the i-th word of the dictionary, then we set xi = 1, otherwise we set xi = 0.
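A minimal sketch of this representation in Python; the five-word vocabulary is made up purely for illustration (a real filter uses a dictionary with tens of thousands of words).

# Binary (multi-variate Bernoulli) feature representation of an email.
vocabulary = ["a", "buy", "nips", "price", "the"]   # toy vocabulary

def email_to_features(text):
    """Return x with x[i] = 1 iff the i-th vocabulary word appears in the email."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

print(email_to_features("Buy now the best price"))   # -> [0, 1, 0, 1, 1]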

NB (Naive Bayes)

For instance, the feature vector of an email might look like x = [1, 0, 0, ..., 1, ..., 0]^T, with one entry per dictionary word.
The set of words encoded into the feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.
Then, we want to build a generative model, i.e. to model p(x|y).
If, say, we have a vocabulary of 50,000 words, then x ∈ {0, 1}^50000 (x is a 50,000-dimensional vector of 0s and 1s). If we modeled x explicitly with a multinomial distribution over all 2^50000 possible outcomes, we would end up with a (2^50000 − 1)-dimensional parameter vector (far too many parameters).

NB (Naive Bayes)

To model p(x|y), we will therefore make a very strong assumption.
Assume that the xi's are conditionally independent given y (this is called the NB assumption, and the resulting algorithm the NB classifier).
For instance: if I tell you y = 1 (spam email), then x2087 (whether the word "buy" appears in the message) and x39831 (whether the word "price" appears in the message) are conditionally independent given y, which can be written:

p(x2087 | y) = p(x2087 | y, x39831)

i.e. x2087 and x39831 are conditionally independent given y.

NB (Naive Bayes)

We now have:

p(x1, ..., x50000 | y) = p(x1|y) p(x2|y, x1) · · · p(x50000|y, x1, ..., x49999)
                       = Π_{i=1}^{n} p(xi|y)

First equality: the usual chain rule of probability.
Second equality: the NB assumption (an extremely strong assumption, but it works well on many problems).

NB (Naive Bayes)

Our model is parameterized by

φ_{j|y=1} = p(xj = 1 | y = 1),   φ_{j|y=0} = p(xj = 1 | y = 0),   and   φ_y = p(y = 1)

As usual, given a training set {(x^(i), y^(i)); i = 1, ..., m}, the joint likelihood of the data is:

L(φ_y, φ_{j|y=0}, φ_{j|y=1}) = Π_{i=1}^{m} p(x^(i), y^(i))

Maximizing this with respect to φ_y, φ_{j|y=0} and φ_{j|y=1} gives the maximum likelihood estimates:

φ_{j|y=1} = sum_i 1{x_j^(i) = 1 ∧ y^(i) = 1} / sum_i 1{y^(i) = 1}
φ_{j|y=0} = sum_i 1{x_j^(i) = 1 ∧ y^(i) = 0} / sum_i 1{y^(i) = 0}
φ_y       = sum_i 1{y^(i) = 1} / m

(The ∧ symbol means "and"; for example, the numerator of φ_{j|y=1} counts the spam emails in which word j appears.)
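The estimates are simply per-class word frequencies, as in this sketch (assuming NumPy; the tiny training set at the bottom is made up for illustration):

import numpy as np

def fit_naive_bayes(X, y):
    """Unsmoothed maximum likelihood estimates for the Bernoulli NB model above.

    X: (m, n) array of 0/1 word-occurrence features, y: (m,) array of 0/1 labels.
    Returns (phi_y, phi_j_given_y0, phi_j_given_y1).
    """
    phi_y = y.mean()
    phi_j_y1 = X[y == 1].mean(axis=0)   # fraction of spam emails containing word j
    phi_j_y0 = X[y == 0].mean(axis=0)   # fraction of non-spam emails containing word j
    return phi_y, phi_j_y0, phi_j_y1

# Tiny made-up training set (4 emails, 3 vocabulary words):
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
print(fit_naive_bayes(X, y))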

NB (Naive Bayes)

To make a prediction on a new example with features x, we compute:

p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x)
             = ( Π_i p(xi | y = 1) ) p(y = 1)
               / [ ( Π_i p(xi | y = 1) ) p(y = 1) + ( Π_i p(xi | y = 0) ) p(y = 0) ]

and pick whichever class has the higher posterior probability (see the sketch after this slide).
So far we have used NB with binary-valued features xi; for features xi taking values in {1, 2, ..., ki}, we simply model p(xi|y) as a multinomial rather than a Bernoulli distribution.
For example: if the input is the living area of a house (a continuous value), we can discretize it as follows:

Living area (sq. feet):  < 400 | 400-800 | 800-1200 | 1200-1600 | > 1600
xi:                          1 |       2 |        3 |         4 |      5
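A prediction sketch matching the formula above (assuming NumPy and the parameters from the previous fitting sketch). A practical implementation would sum log-probabilities rather than multiply thousands of small terms, to avoid numerical underflow.

import numpy as np

def nb_posterior_y1(x, phi_y, phi_j_y0, phi_j_y1):
    """p(y = 1 | x) for the Bernoulli NB model, given fitted parameters."""
    x = np.asarray(x)
    px_y1 = np.prod(np.where(x == 1, phi_j_y1, 1 - phi_j_y1))   # Π_i p(xi|y=1)
    px_y0 = np.prod(np.where(x == 1, phi_j_y0, 1 - phi_j_y0))   # Π_i p(xi|y=0)
    joint1 = px_y1 * phi_y
    joint0 = px_y0 * (1 - phi_y)
    return joint1 / (joint1 + joint0)

# e.g. with the parameters returned by fit_naive_bayes above:
# phi_y, phi_j_y0, phi_j_y1 = fit_naive_bayes(X, y)
# print(nb_posterior_y1([1, 0, 1], phi_y, phi_j_y0, phi_j_y1))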

NB (Naive Bayes)
Laplace smoothing
When the original continuous-valued attributes are not well-modeled by a multivariate normal distribution, discretizing the features and using NB (instead of GDA) will often result in a better classifier.
There is also a simple change that makes the NB algorithm work much better, especially for text classification.
For example: suppose you have never received an email containing the word "NIPS" before, and that it is the 35000th word in the dictionary. Your NB spam filter will therefore have picked its maximum likelihood estimates of the parameters φ_{35000|y} to be:

φ_{35000|y=1} = sum_i 1{x_35000^(i) = 1 ∧ y^(i) = 1} / sum_i 1{y^(i) = 1} = 0
φ_{35000|y=0} = sum_i 1{x_35000^(i) = 1 ∧ y^(i) = 0} / sum_i 1{y^(i) = 0} = 0

NB (Naive Bayes)

i.e. because it has never seen "nips" in any training example (either spam or non-spam), when trying to decide whether a message containing "nips" is spam, the filter calculates the class posterior probabilities and obtains:

p(y = 1 | x) = ( Π_i p(xi | y = 1) ) p(y = 1)
               / [ ( Π_i p(xi | y = 1) ) p(y = 1) + ( Π_i p(xi | y = 0) ) p(y = 0) ] = 0 / 0

NB (Naive Bayes)
The result is 0/0 because each of the terms Π_i p(xi|y) includes the factor p(x35000|y) = 0.
More generally, to estimate the mean of a multinomial random variable z taking values in {1, ..., k}, parameterized by φ_j = p(z = j) and given m independent observations, the maximum likelihood estimates are:

φ_j = sum_i 1{z^(i) = j} / m

Some of the φ_j's above might end up as zero (which was exactly the problem); to avoid this, we use Laplace smoothing, which replaces the estimate with:

φ_j = ( sum_i 1{z^(i) = j} + 1 ) / ( m + k )

NB (Naive Bayes)

Returning to our NB classifier, with Laplace smoothing (to avoid the zero estimates) we therefore use:

φ_{j|y=1} = ( sum_i 1{x_j^(i) = 1 ∧ y^(i) = 1} + 1 ) / ( sum_i 1{y^(i) = 1} + 2 )
φ_{j|y=0} = ( sum_i 1{x_j^(i) = 1 ∧ y^(i) = 0} + 1 ) / ( sum_i 1{y^(i) = 0} + 2 )
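In code, the only change from the earlier unsmoothed fit is the +1 and +2 terms (k = 2 because each xj is binary); a sketch assuming NumPy:

import numpy as np

def fit_naive_bayes_smoothed(X, y):
    """Laplace-smoothed estimates for the Bernoulli NB model, matching the
    formulas above."""
    phi_y = y.mean()
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y0, phi_j_y1

# With smoothing, a word never seen in training gets probability
# 1 / (count + 2) rather than 0, so the 0/0 posterior above can no longer occur.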

NB (Naive Bayes)
Event models for text classification
The NB model presented so far uses the multi-variate Bernoulli event model.
In this model, it is first randomly determined whether an email is spam or non-spam (according to the class prior p(y)).
Then the sender runs through the dictionary, deciding independently for each word i whether to include it in the email, according to the probabilities p(xi = 1|y) = φ_{i|y}.
Thus the probability of a message is given by:

p(y) Π_{i=1}^{n} p(xi|y)

NB (Naive Bayes)
Here is a different model, usually better for text, called the multinomial event model (MEM).
We let xi denote the identity of the i-th word in the email, taking values in {1, ..., |V|}, where |V| is the size of our vocabulary.
For example: an email that starts with "A NIPS ..." would have
x1 = 1 ("a" is the first word in the dictionary)
x2 = 35000 (if "nips" is the 35000th word in the dictionary)

NB (Naive Bayes)
In the MEM, we assume an email is generated via a random process in which spam/non-spam is first determined (according to p(y), as before).
Then the sender generates x1, x2, x3, ... by drawing each word independently from some multinomial distribution over words (p(xi|y)).
The overall probability of a message with d words is given by:

p(y) Π_{i=1}^{d} p(xi|y)

(this formula looks like the one before, but now each xi|y is a multinomial over the vocabulary, not a Bernoulli distribution).
The parameters are:

φ_y = p(y),   φ_{k|y=1} = p(xj = k | y = 1)   (for any j),   and   φ_{k|y=0} = p(xj = k | y = 0)

NB (Naive Bayes)

Given a training set {(x^(i), y^(i)); i = 1, ..., m}, where x^(i) = (x1^(i), ..., x_{d_i}^(i)) and d_i is the number of words in the i-th email, the likelihood of the data is given by:

L(φ_y, φ_{k|y=0}, φ_{k|y=1}) = Π_{i=1}^{m} ( Π_{j=1}^{d_i} p(xj^(i) | y^(i)) ) p(y^(i))

Maximizing this yields the maximum likelihood estimates:

φ_{k|y=1} = sum_i sum_j 1{xj^(i) = k ∧ y^(i) = 1} / sum_i 1{y^(i) = 1} d_i
φ_{k|y=0} = sum_i sum_j 1{xj^(i) = k ∧ y^(i) = 0} / sum_i 1{y^(i) = 0} d_i
φ_y       = sum_i 1{y^(i) = 1} / m

NB (Naive Bayes)

Applying Laplace smoothing (adding 1 to the numerators and |V| to the denominators) gives the maximum likelihood estimates:

φ_{k|y=1} = ( sum_i sum_j 1{xj^(i) = k ∧ y^(i) = 1} + 1 ) / ( sum_i 1{y^(i) = 1} d_i + |V| )
φ_{k|y=0} = ( sum_i sum_j 1{xj^(i) = k ∧ y^(i) = 0} + 1 ) / ( sum_i 1{y^(i) = 0} d_i + |V| )
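A compact sketch of these smoothed multinomial-event-model estimates (assuming NumPy; here emails are lists of 0-based word indices and the toy corpus is made up for illustration):

import numpy as np
from collections import Counter

def fit_multinomial_event_model(emails, labels, vocab_size):
    """Laplace-smoothed MLE for the multinomial event model sketched above.

    emails: list of lists of word indices (each in 0..vocab_size-1),
    labels: list of 0/1 spam labels.
    """
    phi_y = np.mean(labels)
    phi_k = {}
    for c in (0, 1):
        counts = Counter()
        total_words = 0
        for words, y in zip(emails, labels):
            if y == c:
                counts.update(words)
                total_words += len(words)
        # add 1 to every word count and |V| to the denominator (Laplace smoothing)
        phi_k[c] = np.array([(counts[k] + 1) / (total_words + vocab_size)
                             for k in range(vocab_size)])
    return phi_y, phi_k[0], phi_k[1]

# Tiny made-up corpus over a 5-word vocabulary (indices 0..4):
emails = [[1, 3, 3], [0, 1], [4, 2], [2, 2, 4]]
labels = [1, 1, 0, 0]
print(fit_multinomial_event_model(emails, labels, vocab_size=5))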

Thank you.
