
Generative Algorithms

By: Shedriko

Outline & Content


Preliminary
GDA (Gaussian Discriminant Analysis)
NB (Naive Bayes)

Preliminary
We learned in the previous chapter that:
Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron algorithm), are called discriminative learning algorithms.

Preliminary

Here, we'll talk about algorithms that instead try to model p(x|y) and/or p(y). These algorithms are called generative learning algorithms.
For example:
If y indicates whether an example is a dog (y = 0) or an elephant (y = 1), then p(x|y = 0) models the distribution of dogs' features, and p(x|y = 1) models the distribution of elephants' features.

Preliminary
After modeling p(y) (called the class prior) and p(x|y), our algorithm can then use Bayes' rule to derive the posterior distribution on y given x:

p(y|x) = p(x|y) p(y) / p(x)

Here the denominator is given by:

p(x) = p(x|y = 1) p(y = 1) + p(x|y = 0) p(y = 0)

If we are calculating p(y|x) in order to make a prediction, we don't need to calculate the denominator, since

arg max_y p(y|x) = arg max_y [ p(x|y) p(y) / p(x) ] = arg max_y p(x|y) p(y)
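Below is a minimal numeric sketch of this Bayes' rule computation in Python; the likelihood and prior values are made-up numbers chosen only for illustration, not taken from the slides.

# Minimal sketch of Bayes' rule for a two-class generative model.
# The probability values below are hypothetical, for illustration only.

def posterior_y1(px_given_y1, px_given_y0, prior_y1):
    """Return p(y = 1 | x) via Bayes' rule."""
    prior_y0 = 1.0 - prior_y1
    px = px_given_y1 * prior_y1 + px_given_y0 * prior_y0   # the denominator p(x)
    return px_given_y1 * prior_y1 / px

p1, p0, phi = 0.02, 0.005, 0.3     # p(x|y=1), p(x|y=0), p(y=1) at some fixed x
print(posterior_y1(p1, p0, phi))   # full posterior p(y=1|x)

# For prediction, comparing the numerators p(x|y) p(y) is enough:
prediction = 1 if p1 * phi > p0 * (1 - phi) else 0
print(prediction)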

GDA (Gaussian)
The multivariate normal distribution
In this model, we'll assume that p(x|y) is distributed according to a multivariate normal distribution.
The multivariate normal distribution in n dimensions (also called the multivariate Gaussian distribution) is parameterized by a mean vector μ ∈ R^n and a covariance matrix Σ ∈ R^(n×n), where Σ is symmetric and positive semi-definite.
Also written N(μ, Σ), its density is given by:

p(x; μ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −1/2 (x − μ)^T Σ^(−1) (x − μ) )
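As a sanity check on the formula, here is a short Python sketch that evaluates the density directly and compares it against scipy.stats.multivariate_normal (assuming NumPy and SciPy are available); the mean, covariance, and query point are arbitrary example values.

import numpy as np
from scipy.stats import multivariate_normal   # used only as a cross-check

def gaussian_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])    # example covariance from the slides
x = np.array([0.5, -0.3])

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree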

GDA (Gaussian)
|Σ| denotes the determinant of the matrix Σ.
For a random variable X distributed N(μ, Σ), the mean is given by:

E[X] = ∫ x p(x; μ, Σ) dx = μ

The covariance of a vector-valued random variable Z is defined as Cov(Z) = E[(Z − E[Z])(Z − E[Z])^T], or equivalently Cov(Z) = E[Z Z^T] − (E[Z])(E[Z])^T.
If X ~ N(μ, Σ), then Cov(X) = Σ.

GDA (Gaussian)

Examples of what the density of a Gaussian distribution looks like:

The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero-vector) and covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution.

GDA (Gaussian)
The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6 I.
The right-most figure shows one with Σ = 2 I.
We see that as Σ becomes larger, the Gaussian becomes more spread-out.

GDA (Gaussian)

Here are some more examples:

The figures above show Gaussians with mean 0 and with covariance matrices, respectively,

Σ = [1 0; 0 1],   Σ = [1 0.5; 0.5 1],   Σ = [1 0.8; 0.8 1]

GDA (Gaussian)
The left-most figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more compressed towards the 45° line (given by x1 = x2), as seen in the contours of those three densities below.

GDA (Gaussian)
Here is the last set of examples, generated by varying Σ:

The plots above used, respectively,

Σ = [1 -0.5; -0.5 1],   Σ = [1 -0.8; -0.8 1],   Σ = [3 0.8; 0.8 1]

GDA (Gaussian)
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density becomes compressed again, but in the opposite direction.
As we vary the parameters, the contours will more generally form ellipses (the rightmost figure).

GDA (Gaussian)

Fixing Σ = I, by varying μ we can also move the mean of the density around.

The figures above were generated using Σ = I and, respectively, μ = [1; 0], μ = [-0.5; 0], μ = [-1; -1.5].
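A quick sampling sketch (assuming NumPy) reproduces these effects numerically: the empirical covariance of the samples tracks Σ, and changing μ only translates the point cloud. The covariance and mean values below echo the examples discussed above.

import numpy as np

rng = np.random.default_rng(0)

# Draw samples from 2-D Gaussians with the covariances discussed above and
# check the empirical covariance against Sigma.
for Sigma in (np.eye(2), 0.6 * np.eye(2), 2 * np.eye(2),
              np.array([[1.0, 0.8], [0.8, 1.0]])):
    samples = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=20000)
    print(np.round(np.cov(samples.T), 2))    # should be close to Sigma

# Shifting the mean just translates the cloud of samples:
shifted = rng.multivariate_normal(mean=[-1.0, -1.5], cov=np.eye(2), size=20000)
print(np.round(shifted.mean(axis=0), 2))     # close to [-1, -1.5]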

GDA (Gaussian)
The GDA model

When we have a classification problem in which the input features x are continuous-valued random variables, we can use the GDA model.
It models p(x|y) using a multivariate normal distribution. The model is:

y ~ Bernoulli(φ)
x | y = 0 ~ N(μ0, Σ)
x | y = 1 ~ N(μ1, Σ)

GDA (Gaussian)

Writing out the distributions:

p(y) = φ^y (1 − φ)^(1−y)
p(x | y = 0) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −1/2 (x − μ0)^T Σ^(−1) (x − μ0) )
p(x | y = 1) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −1/2 (x − μ1)^T Σ^(−1) (x − μ1) )

The parameters of the model are φ, Σ, μ0 and μ1 (there are two different mean vectors μ0 and μ1, but only one shared covariance matrix Σ).
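A small sketch of this generative process itself (assuming NumPy); the parameter values are hypothetical, picked just to make the sampling concrete.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GDA parameters for 2-D features (chosen arbitrarily for the sketch).
phi = 0.4                                    # class prior p(y = 1)
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])   # shared covariance matrix

def sample_gda(m):
    """Sample m (x, y) pairs from the generative process above."""
    y = rng.binomial(1, phi, size=m)                       # y ~ Bernoulli(phi)
    means = np.where(y[:, None] == 1, mu1, mu0)            # pick mu_y per example
    x = np.array([rng.multivariate_normal(mu, Sigma) for mu in means])
    return x, y

X, y = sample_gda(500)
print(X.shape, y.mean())    # rough check: fraction of y = 1 should be near phi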

GDA (Gaussian)

The log-likelihood of the data is given by:

ℓ(φ, μ0, μ1, Σ) = log Π_{i=1}^{m} p(x^(i), y^(i); φ, μ0, μ1, Σ)
               = log Π_{i=1}^{m} p(x^(i) | y^(i); μ0, μ1, Σ) p(y^(i); φ)

By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimates:

φ  = (1/m) · sum_i 1{y^(i) = 1}
μ0 = ( sum_i 1{y^(i) = 0} x^(i) ) / ( sum_i 1{y^(i) = 0} )
μ1 = ( sum_i 1{y^(i) = 1} x^(i) ) / ( sum_i 1{y^(i) = 1} )
Σ  = (1/m) · sum_i (x^(i) − μ_{y^(i)}) (x^(i) − μ_{y^(i)})^T
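These closed-form estimates translate almost directly into code. A minimal sketch, assuming a NumPy feature matrix X of shape (m, n) and 0/1 labels y:

import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA (a sketch of the formulas above)."""
    m = len(y)
    phi = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    mus = np.where(y[:, None] == 1, mu1, mu0)   # mu_{y^(i)} for each example
    diff = X - mus
    Sigma = diff.T @ diff / m                   # shared covariance estimate
    return phi, mu0, mu1, Sigma

# Usage with the synthetic data sampled in the previous sketch:
# phi_hat, mu0_hat, mu1_hat, Sigma_hat = fit_gda(X, y)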

GDA (Gaussian)

Example of two Gaussians with different mean vectors μ0 and μ1 but a single shared covariance matrix Σ.
Also shown in the figure is the straight line giving the decision boundary, at which p(y = 1|x) = 0.5.
On one side of the boundary we'll predict y = 1, and on the other side y = 0.
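A prediction sketch matching this rule (assuming NumPy and parameters fitted as above): since the denominator p(x) is shared, comparing the class joints log p(x|y) + log p(y) is equivalent to thresholding p(y = 1|x) at 0.5.

import numpy as np

def gda_log_joint(x, mu, Sigma, prior):
    """log p(x|y) + log p(y) for one class; the shared p(x) term is dropped
    because it does not affect the arg max."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    log_px = -0.5 * (diff @ np.linalg.solve(Sigma, diff) + logdet
                     + len(x) * np.log(2 * np.pi))
    return log_px + np.log(prior)

def gda_predict(x, phi, mu0, mu1, Sigma):
    """Predict 1 iff p(y=1|x) > 0.5, i.e. iff the class-1 joint is larger."""
    return int(gda_log_joint(x, mu1, Sigma, phi)
               > gda_log_joint(x, mu0, Sigma, 1 - phi))

# e.g. gda_predict(np.array([1.5, 1.0]), phi_hat, mu0_hat, mu1_hat, Sigma_hat)
# with the parameters from the fitting sketch above.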

GDA (Gaussian)
Discussion: GDA and logistic regression

The GDA model has an interesting relationship to logistic regression.

If we view the quantity p(y = 1 | x; φ, μ0, μ1, Σ) as a function of x, we'll find that it can be expressed in the form

p(y = 1 | x; φ, μ0, μ1, Σ) = 1 / (1 + exp(−θ^T x))

where θ is some appropriate function of φ, Σ, μ0, μ1.
(This is exactly the form that logistic regression, a discriminative algorithm, uses to model p(y = 1|x).)
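Here is a brief sketch of why this happens, with constant terms eventually folded into an intercept; this is an outline of the algebra, not the slides' own derivation.

\begin{aligned}
p(y=1 \mid x)
  &= \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)}
   = \frac{1}{1 + \exp\!\left(-\left[\log\frac{\phi}{1-\phi}
       + \log\frac{p(x \mid y=1)}{p(x \mid y=0)}\right]\right)}, \\
\log\frac{p(x \mid y=1)}{p(x \mid y=0)}
  &= (\mu_1 - \mu_0)^{\top}\Sigma^{-1}x
     - \tfrac{1}{2}\left(\mu_1^{\top}\Sigma^{-1}\mu_1 - \mu_0^{\top}\Sigma^{-1}\mu_0\right).
\end{aligned}

Because Σ is shared between the two classes, the quadratic terms in x cancel, leaving an expression linear in x, which is exactly the logistic form above.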

GDA (Gaussian)

Which one is better?

GDA:
- stronger modeling assumptions
- more data efficient (requires less training data to learn well)
Logistic regression:
- weaker assumptions, and therefore more robust to deviations from the modeling assumptions
- when the data is indeed non-Gaussian (and the dataset is large), it does better than GDA
For the last reason, logistic regression is used more often in practice than GDA.

NB (Naive Bayes)

In GDA, the feature vector x was continuous; in Naive Bayes the xi's are discrete-valued.
For example:
Classifying whether an email is unsolicited commercial (spam) email or non-spam email.
We begin our spam filter by specifying the features xi used to represent an email.
We'll represent an email via a feature vector whose length is equal to the number of words in the dictionary; specifically, if an email contains the i-th word of the dictionary, then we set xi = 1, otherwise we set xi = 0.
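A minimal sketch of this representation in Python; the five-word vocabulary is made up purely for illustration (a real filter uses a dictionary with tens of thousands of words).

# Binary (multi-variate Bernoulli) feature representation of an email.
vocabulary = ["a", "buy", "nips", "price", "the"]   # toy vocabulary

def email_to_features(text):
    """Return x with x[i] = 1 iff the i-th vocabulary word appears in the email."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

print(email_to_features("Buy now the best price"))   # -> [0, 1, 0, 1, 1]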

NB (Naive Bayes)

For instance, the feature vector of an email might look like x = [1, 0, 0, ..., 1, ..., 0]^T, with one entry per dictionary word.
The set of words encoded into the feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.
Then, we want to build a generative model, i.e. to model p(x|y).
If, say, we have a vocabulary of 50,000 words, then x ∈ {0, 1}^50000 (x is a 50,000-dimensional vector of 0s and 1s). If we modeled x explicitly with a multinomial distribution over all 2^50000 possible outcomes, we would end up with a (2^50000 − 1)-dimensional parameter vector (far too many parameters).

NB (Naive Bayes)

To model p(x|y), we will therefore make a very strong assumption.
Assume that the xi's are conditionally independent given y (this is called the NB assumption, and the resulting algorithm the NB classifier).
For instance: if I tell you y = 1 (spam email), then x2087 (whether the word "buy" appears in the message) and x39831 (whether the word "price" appears in the message) are conditionally independent given y, which can be written:

p(x2087 | y) = p(x2087 | y, x39831)

i.e. x2087 and x39831 are conditionally independent given y.

NB (Naive Bayes)

We now have:

p(x1, ..., x50000 | y) = p(x1|y) p(x2|y, x1) · · · p(x50000|y, x1, ..., x49999)
                       = Π_{i=1}^{n} p(xi|y)

First equality: the usual chain rule of probability.
Second equality: the NB assumption (an extremely strong assumption, but it works well on many problems).

NB (Naive Bayes)

Our model is parameterized by

φ_{j|y=1} = p(xj = 1 | y = 1),   φ_{j|y=0} = p(xj = 1 | y = 0),   and   φ_y = p(y = 1)

As usual, given a training set {(x^(i), y^(i)); i = 1, ..., m}, the joint likelihood of the data is:

L(φ_y, φ_{j|y=0}, φ_{j|y=1}) = Π_{i=1}^{m} p(x^(i), y^(i))

Maximizing this with respect to φ_y, φ_{j|y=0} and φ_{j|y=1} gives the maximum likelihood estimates:

φ_{j|y=1} = sum_i 1{x_j^(i) = 1 ∧ y^(i) = 1} / sum_i 1{y^(i) = 1}
φ_{j|y=0} = sum_i 1{x_j^(i) = 1 ∧ y^(i) = 0} / sum_i 1{y^(i) = 0}
φ_y       = sum_i 1{y^(i) = 1} / m

(The ∧ symbol means "and"; for example, the numerator of φ_{j|y=1} counts the spam emails in which word j appears.)
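The estimates are simply per-class word frequencies, as in this sketch (assuming NumPy; the tiny training set at the bottom is made up for illustration):

import numpy as np

def fit_naive_bayes(X, y):
    """Unsmoothed maximum likelihood estimates for the Bernoulli NB model above.

    X: (m, n) array of 0/1 word-occurrence features, y: (m,) array of 0/1 labels.
    Returns (phi_y, phi_j_given_y0, phi_j_given_y1).
    """
    phi_y = y.mean()
    phi_j_y1 = X[y == 1].mean(axis=0)   # fraction of spam emails containing word j
    phi_j_y0 = X[y == 0].mean(axis=0)   # fraction of non-spam emails containing word j
    return phi_y, phi_j_y0, phi_j_y1

# Tiny made-up training set (4 emails, 3 vocabulary words):
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
print(fit_naive_bayes(X, y))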

NB (Naive Bayes)

To make a prediction on a new example with features x, we compute:

p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x)
             = ( Π_i p(xi | y = 1) ) p(y = 1)
               / [ ( Π_i p(xi | y = 1) ) p(y = 1) + ( Π_i p(xi | y = 0) ) p(y = 0) ]

and pick whichever class has the higher posterior probability (see the sketch after this slide).
So far we have used NB with binary-valued features xi; for features xi taking values in {1, 2, ..., ki}, we simply model p(xi|y) as a multinomial rather than a Bernoulli distribution.
For example: if the input is the living area of a house (a continuous value), we can discretize it as follows:

Living area (sq. feet):  < 400 | 400-800 | 800-1200 | 1200-1600 | > 1600
xi:                          1 |       2 |        3 |         4 |      5
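A prediction sketch matching the formula above (assuming NumPy and the parameters from the previous fitting sketch). A practical implementation would sum log-probabilities rather than multiply thousands of small terms, to avoid numerical underflow.

import numpy as np

def nb_posterior_y1(x, phi_y, phi_j_y0, phi_j_y1):
    """p(y = 1 | x) for the Bernoulli NB model, given fitted parameters."""
    x = np.asarray(x)
    px_y1 = np.prod(np.where(x == 1, phi_j_y1, 1 - phi_j_y1))   # Π_i p(xi|y=1)
    px_y0 = np.prod(np.where(x == 1, phi_j_y0, 1 - phi_j_y0))   # Π_i p(xi|y=0)
    joint1 = px_y1 * phi_y
    joint0 = px_y0 * (1 - phi_y)
    return joint1 / (joint1 + joint0)

# e.g. with the parameters returned by fit_naive_bayes above:
# phi_y, phi_j_y0, phi_j_y1 = fit_naive_bayes(X, y)
# print(nb_posterior_y1([1, 0, 1], phi_y, phi_j_y0, phi_j_y1))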

NB (Naive Bayes)
Laplace smoothing
When the original continuous-valued attributes are not well-modeled by a multivariate normal distribution, discretizing the features and using NB (instead of GDA) will often result in a better classifier.
There is also a simple change that makes the NB algorithm work much better, especially for text classification.
For example: suppose you have never received an email containing the word "NIPS" before, and that it is the 35000th word in the dictionary. Your NB spam filter will therefore have picked its maximum likelihood estimates of the parameters φ_{35000|y} to be:

φ_{35000|y=1} = sum_i 1{x_35000^(i) = 1 ∧ y^(i) = 1} / sum_i 1{y^(i) = 1} = 0
φ_{35000|y=0} = sum_i 1{x_35000^(i) = 1 ∧ y^(i) = 0} / sum_i 1{y^(i) = 0} = 0

NB (Naive Bayes)

i.e. because it has never seen "nips" in any training example (either spam or non-spam), when trying to decide whether a message containing "nips" is spam, the filter calculates the class posterior probabilities and obtains:

p(y = 1 | x) = ( Π_i p(xi | y = 1) ) p(y = 1)
               / [ ( Π_i p(xi | y = 1) ) p(y = 1) + ( Π_i p(xi | y = 0) ) p(y = 0) ] = 0 / 0

NB (Naive Bayes)
The result is 0/0 because each of the terms Π_i p(xi|y) includes the factor p(x35000|y) = 0.
More generally, to estimate the mean of a multinomial random variable z taking values in {1, ..., k}, parameterized by φ_j = p(z = j) and given m independent observations, the maximum likelihood estimates are:

φ_j = sum_i 1{z^(i) = j} / m

Some of the φ_j's above might end up as zero (which was exactly the problem); to avoid this, we use Laplace smoothing, which replaces the estimate with:

φ_j = ( sum_i 1{z^(i) = j} + 1 ) / ( m + k )

NB (Naive Bayes)

Returning to our NB classifier, with Laplace smoothing (to avoid the zero estimates) we therefore use:

φ_{j|y=1} = ( sum_i 1{x_j^(i) = 1 ∧ y^(i) = 1} + 1 ) / ( sum_i 1{y^(i) = 1} + 2 )
φ_{j|y=0} = ( sum_i 1{x_j^(i) = 1 ∧ y^(i) = 0} + 1 ) / ( sum_i 1{y^(i) = 0} + 2 )
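In code, the only change from the earlier unsmoothed fit is the +1 and +2 terms (k = 2 because each xj is binary); a sketch assuming NumPy:

import numpy as np

def fit_naive_bayes_smoothed(X, y):
    """Laplace-smoothed estimates for the Bernoulli NB model, matching the
    formulas above."""
    phi_y = y.mean()
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y0, phi_j_y1

# With smoothing, a word never seen in training gets probability
# 1 / (count + 2) rather than 0, so the 0/0 posterior above can no longer occur.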

NB (Naive Bayes)
Event models for text classification
The NB model presented so far uses the multi-variate Bernoulli event model.
In this model, it is first randomly determined whether an email is spam or non-spam (according to the class prior p(y)).
Then the sender runs through the dictionary, deciding independently for each word i whether to include it in the email, according to the probabilities p(xi = 1|y) = φ_{i|y}.
Thus the probability of a message is given by:

p(y) Π_{i=1}^{n} p(xi|y)

NB (Naive Bayes)
Here is a different model, usually better for text, called the multinomial event model (MEM).
We let xi denote the identity of the i-th word in the email, taking values in {1, ..., |V|}, where |V| is the size of our vocabulary.
For example: an email that starts with "A NIPS ..." would have
x1 = 1 ("a" is the first word in the dictionary)
x2 = 35000 (if "nips" is the 35000th word in the dictionary)

NB (Naive Bayes)
In the MEM, we assume an email is generated via a random process in which spam/non-spam is first determined (according to p(y), as before).
Then the sender generates x1, x2, x3, ... by drawing each word independently from some multinomial distribution over words (p(xi|y)).
The overall probability of a message with d words is given by:

p(y) Π_{i=1}^{d} p(xi|y)

(this formula looks like the one before, but now each xi|y is a multinomial over the vocabulary, not a Bernoulli distribution).
The parameters are:

φ_y = p(y),   φ_{k|y=1} = p(xj = k | y = 1)   (for any j),   and   φ_{k|y=0} = p(xj = k | y = 0)

NB (Naive Bayes)

Given a training set {(x^(i), y^(i)); i = 1, ..., m}, where x^(i) = (x1^(i), ..., x_{d_i}^(i)) and d_i is the number of words in the i-th email, the likelihood of the data is given by:

L(φ_y, φ_{k|y=0}, φ_{k|y=1}) = Π_{i=1}^{m} ( Π_{j=1}^{d_i} p(xj^(i) | y^(i)) ) p(y^(i))

Maximizing this yields the maximum likelihood estimates:

φ_{k|y=1} = sum_i sum_j 1{xj^(i) = k ∧ y^(i) = 1} / sum_i 1{y^(i) = 1} d_i
φ_{k|y=0} = sum_i sum_j 1{xj^(i) = k ∧ y^(i) = 0} / sum_i 1{y^(i) = 0} d_i
φ_y       = sum_i 1{y^(i) = 1} / m

NB (Naive Bayes)

Applying Laplace smoothing (adding 1 to the numerators and |V| to the denominators) gives the maximum likelihood estimates:

φ_{k|y=1} = ( sum_i sum_j 1{xj^(i) = k ∧ y^(i) = 1} + 1 ) / ( sum_i 1{y^(i) = 1} d_i + |V| )
φ_{k|y=0} = ( sum_i sum_j 1{xj^(i) = k ∧ y^(i) = 0} + 1 ) / ( sum_i 1{y^(i) = 0} d_i + |V| )
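A compact sketch of these smoothed multinomial-event-model estimates (assuming NumPy; here emails are lists of 0-based word indices and the toy corpus is made up for illustration):

import numpy as np
from collections import Counter

def fit_multinomial_event_model(emails, labels, vocab_size):
    """Laplace-smoothed MLE for the multinomial event model sketched above.

    emails: list of lists of word indices (each in 0..vocab_size-1),
    labels: list of 0/1 spam labels.
    """
    phi_y = np.mean(labels)
    phi_k = {}
    for c in (0, 1):
        counts = Counter()
        total_words = 0
        for words, y in zip(emails, labels):
            if y == c:
                counts.update(words)
                total_words += len(words)
        # add 1 to every word count and |V| to the denominator (Laplace smoothing)
        phi_k[c] = np.array([(counts[k] + 1) / (total_words + vocab_size)
                             for k in range(vocab_size)])
    return phi_y, phi_k[0], phi_k[1]

# Tiny made-up corpus over a 5-word vocabulary (indices 0..4):
emails = [[1, 3, 3], [0, 1], [4, 2], [2, 2, 4]]
labels = [1, 1, 0, 0]
print(fit_multinomial_event_model(emails, labels, vocab_size=5))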

Thank you.
