
Logistic Regression

Some slides adapted from Dan Jurafsky and Brendan O’Connor


Naïve Bayes Recap
• Bag of words (order independent)

• Features are assumed independent given class

P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdots P(x_n \mid c)

Q: Is this really true?


The problem with assuming conditional independence
• Correlated features -> double counting evidence
  – Parameters are estimated independently

• This can hurt classifier accuracy and calibration


Logistic Regression

• (Log) Linear Model – similar to Naïve Bayes

• Doesn’t assume features are independent

• Correlated features don’t “double count”


What are “Features”?
• A feature function, f
– Input: Document, D (a string)
– Output: Feature Vector, X
What are “Features”?

  f(d) = \begin{pmatrix} \text{count(“boring”)} \\ \text{count(“not boring”)} \\ \text{length of document} \\ \text{author of document} \\ \vdots \end{pmatrix}

What are “Features”?
• Doesn’t have to be just “bag of words”
Feature Templates
• Typically “feature templates” are used to
generate many features at once

• For each word:


– ${w}_count
– ${w}_lowercase
– ${w}_with_NOT_before_count
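
As a rough illustration (not from the slides), here is a minimal Python sketch of how these templates might be expanded into concrete features; the tokenization and the exact meaning of ${w}_lowercase are assumptions:

from collections import Counter

def extract_features(doc):
    """Expand simple feature templates into a feature dict (illustrative sketch)."""
    tokens = doc.split()
    features = Counter()
    for i, tok in enumerate(tokens):
        w = tok.lower()
        features[f"{w}_count"] += 1                      # ${w}_count
        if tok.islower():
            features[f"{w}_lowercase"] += 1              # ${w}_lowercase (assumed: count of lowercase occurrences)
        if i > 0 and tokens[i - 1].lower() == "not":
            features[f"{w}_with_NOT_before_count"] += 1  # ${w}_with_NOT_before_count
    return dict(features)

print(extract_features("This movie was not boring"))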
Logistic Regression: Example
• Compute features:

  f(d_i) = x_i = \begin{pmatrix} \text{count(“nigerian”)} \\ \text{count(“prince”)} \\ \text{count(“nigerian prince”)} \end{pmatrix}

• Assume we are given some weights:

  w = \begin{pmatrix} -1.0 \\ -1.0 \\ 4.0 \end{pmatrix}
Logistic Regression: Example
• Compute Features
• We are given some weights
• Compute the dot product:

  z = \sum_{i=0}^{|x|} w_i x_i
Logistic Regression: Example
• Compute the dot product:
  z = \sum_{i=0}^{|x|} w_i x_i

• Compute the logistic function:

  P(\text{spam} \mid x) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}
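
A worked version of this example as a minimal Python sketch; the feature counts are assumed (made up), since the slide does not give a specific document:

import math

# Feature vector for a hypothetical document (assumed counts, illustration only)
x = [1.0, 1.0, 1.0]    # count("nigerian"), count("prince"), count("nigerian prince")
w = [-1.0, -1.0, 4.0]  # weights from the earlier slide

# Dot product: z = sum_i w_i * x_i
z = sum(wi * xi for wi, xi in zip(w, x))   # -1 - 1 + 4 = 2.0

# Logistic function: P(spam|x) = 1 / (1 + e^{-z})
p_spam = 1.0 / (1.0 + math.exp(-z))
print(z, p_spam)                           # 2.0, ~0.88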
The Logistic function

  P(\text{spam} \mid x) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}
The Dot Product

  z = \sum_{i=0}^{|x|} w_i x_i

• Intuition: weighted sum of features


• All Linear models have this form
Naïve Bayes as a log-linear model

• Q: what are the features?

• Q: what are the weights?


Naïve Bayes as a Log-Linear Model
  P(\text{spam} \mid D) \propto P(\text{spam}) \prod_{w \in D} P(w \mid \text{spam})

  P(\text{spam} \mid D) \propto P(\text{spam}) \prod_{w \in \text{Vocab}} P(w \mid \text{spam})^{x_w}

  \log P(\text{spam} \mid D) \propto \log P(\text{spam}) + \sum_{w \in \text{Vocab}} x_w \cdot \log P(w \mid \text{spam})
Naïve Bayes as a Log-Linear Model
  \log P(\text{spam} \mid D) \propto \log P(\text{spam}) + \sum_{w \in \text{Vocab}} x_w \cdot \log P(w \mid \text{spam})

In both Naïve Bayes and Logistic Regression we compute the dot product!
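
As a minimal illustration of this equivalence (the probabilities below are made up), the Naïve Bayes log-score is just a dot product between word counts and log conditional probabilities:

import math

# Made-up Naive Bayes parameters (illustration only)
log_prior_spam = math.log(0.3)
log_p_word_given_spam = {"nigerian": math.log(0.02), "prince": math.log(0.01), "hello": math.log(0.05)}

# Word counts x_w for a hypothetical document
x = {"nigerian": 2, "prince": 1, "hello": 1}

# NB log-score = log prior + dot product of counts and log probabilities
score = log_prior_spam + sum(x[w] * log_p_word_given_spam[w] for w in x)
print(score)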
NB vs. LR
• Both compute the dot product

• NB: the dot product is a sum of log probabilities

• LR: the dot product is passed through the logistic function


NB vs. LR:
Parameter Learning
• Naïve Bayes:
– Learn conditional probabilities independently by
counting

• Logistic Regression:
– Learn weights jointly
LR: Learning Weights

• Given: a set of feature vectors and labels

• Goal: learn the weights


Learning Weights
Features (one document per row)                    Labels

  X = \begin{pmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{d1} & x_{d2} & x_{d3} & \cdots & x_{dn} \end{pmatrix} \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{pmatrix}
Q: what parameters should we
choose?
• What is the right value for the weights?

• Maximum Likelihood Principle:


– Pick the parameters that maximize the probability
of the data
Maximum Likelihood Estimation
  w_{MLE} = \arg\max_w \log P(y_1, \ldots, y_d \mid x_1, \ldots, x_d; w)

          = \arg\max_w \sum_i \log P(y_i \mid x_i; w)

          = \arg\max_w \sum_i \log \begin{cases} p_i, & \text{if } y_i = 1 \\ 1 - p_i, & \text{if } y_i = 0 \end{cases}

          = \arg\max_w \sum_i \log \, p_i^{I(y_i = 1)} (1 - p_i)^{I(y_i = 0)}
Maximum Likelihood Estimation
  = \arg\max_w \sum_i \log \, p_i^{I(y_i = 1)} (1 - p_i)^{I(y_i = 0)}

  = \arg\max_w \sum_i y_i \log p_i + (1 - y_i) \log(1 - p_i)
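
A minimal sketch of this objective in code, assuming p_i = σ(w · x_i) as defined earlier (function names are illustrative):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, X, y):
    """Training-set log-likelihood: sum_i y_i*log(p_i) + (1-y_i)*log(1-p_i)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total += y_i * math.log(p_i) + (1 - y_i) * math.log(1 - p_i)
    return total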
Maximum Likelihood Estimation
• Unfortunately, there is no closed-form solution
  – (unlike Naïve Bayes, where there was one)
• Solution:
– Iteratively climb the log-likelihood surface through
the derivatives for each weight
• Luckily, the derivatives turn out to be nice
Gradient ascent
Loop while not converged:
  For all features j, compute and add derivatives:

    w_j^{new} = w_j^{old} + \eta \frac{\partial}{\partial w_j} L(w)

L(w): training-set log-likelihood

\left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n} \right): gradient vector
Gradient ascent

[Figure: gradient ascent on the log-likelihood surface, plotted over weights w1 and w2]
LR Gradient

  \frac{\partial L}{\partial w_j} = \sum_i (y_i - p_i) \, x_{ij}
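
Combining this gradient with the update rule from the gradient-ascent slide, a minimal batch training sketch (the learning rate, iteration count, and toy data are illustrative assumptions):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(X, y, eta=0.1, iterations=100):
    """Batch gradient ascent on the log-likelihood for binary logistic regression."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(iterations):
        # dL/dw_j = sum_i (y_i - p_i) * x_ij
        grad = [0.0] * n_features
        for x_i, y_i in zip(X, y):
            p_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
            for j in range(n_features):
                grad[j] += (y_i - p_i) * x_i[j]
        # w_j^new = w_j^old + eta * dL/dw_j
        w = [w_j + eta * g_j for w_j, g_j in zip(w, grad)]
    return w

# Tiny made-up dataset: 2 features, labels in {0, 1}
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]
print(train_lr(X, y))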
Logistic Regression: Pros and Cons
• Doesn’t assume conditional independence of
features
– Better calibrated probabilities
– Can handle highly correlated overlapping features

• NB is faster to train, less likely to overfit


NB & LR
• Both are linear models:

  z = \sum_{i=0}^{|x|} w_i x_i

• Training is different:
– NB: weights are trained independently
– LR: weights trained jointly
Perceptron Algorithm
• Very similar to logistic regression
• Not exactly computing gradients
Perceptron Algorithm
• The algorithm is very similar to logistic regression
• Not exactly computing gradients

Initialize weight vector w = 0

Loop for K iterations
  Loop for all training examples x_i
    if sign(w · x_i) != y_i
      w += (y_i - sign(w · x_i)) * x_i
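
A runnable sketch of this algorithm, assuming labels y_i in {-1, +1}; the toy dataset is made up for illustration:

def sign(z):
    return 1 if z >= 0 else -1

def train_perceptron(X, y, K=10):
    """Perceptron: update w only when the current prediction is wrong."""
    w = [0.0] * len(X[0])
    for _ in range(K):
        for x_i, y_i in zip(X, y):
            pred = sign(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
            if pred != y_i:
                # w += (y_i - sign(w · x_i)) * x_i
                w = [w_j + (y_i - pred) * x_j for w_j, x_j in zip(w, x_i)]
    return w

# Toy linearly separable data (labels in {-1, +1})
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
print(train_perceptron(X, y))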
Differences between LR and
Perceptron
• Online learning vs. Batch

• Perceptron doesn’t always make updates


MultiClass Classification
• Q: what if we have more than 2 categories?
– Sentiment: Positive, Negative, Neutral
– Document topics: Sports, Politics, Business,
Entertainment, …
• Could train a separate logistic regression model for each category...
• It’s pretty clear what to do with Naïve Bayes.
Log-Linear Models

  P(y \mid x) \propto e^{w \cdot f(d, y)}

  P(y \mid x) = \frac{1}{Z(w)} e^{w \cdot f(d, y)}
MultiClass Logistic Regression

  P(y \mid x) \propto e^{w \cdot f(d, y)}

  P(y \mid x) = \frac{1}{Z(w)} e^{w \cdot f(d, y)}

  P(y \mid x) = \frac{e^{w \cdot f(d, y)}}{\sum_{y' \in Y} e^{w \cdot f(d, y')}}
MultiClass Logistic Regression
• Binary logistic regression:
  – We have one feature vector that matches the size of the vocabulary

• Multiclass in practice:
  – One weight vector for each category: w_pos, w_neg, w_neut
  – Can represent this in practice with one giant weight vector and repeated features for each category


Q: How to compute posterior class
probabilities for multiclass?

  P(y = j \mid x_i) = \frac{e^{w_j \cdot x_i}}{\sum_k e^{w_k \cdot x_i}}
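
A minimal sketch of this softmax computation, assuming one weight vector per class; the example weights and features are made up:

import math

def class_posteriors(weight_vectors, x):
    """Softmax over per-class scores: P(y = j | x) = exp(w_j · x) / sum_k exp(w_k · x)."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weight_vectors]
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

# Three classes (e.g. pos, neg, neut) over two features; numbers are illustrative
W = [[2.0, -1.0], [-1.0, 1.0], [0.0, 0.0]]
x = [1.0, 1.0]
print(class_posteriors(W, x))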
Maximum Likelihood Estimation
  w_{MLE} = \arg\max_w \log P(y_1, \ldots, y_n \mid x_1, \ldots, x_n; w)

          = \arg\max_w \sum_i \log P(y_i \mid x_i; w)

          = \arg\max_w \sum_i \log \frac{e^{w \cdot f(x_i, y_i)}}{\sum_{y' \in Y} e^{w \cdot f(x_i, y')}}
Multiclass LR Gradient

  \frac{\partial L}{\partial w_j} = \sum_{i=1}^{D} f_j(y_i, d_i) - \sum_{i=1}^{D} \sum_{y \in Y} f_j(y, d_i) \, P(y \mid d_i)
MAP-based learning
(perceptron)

  \frac{\partial L}{\partial w_j} = \sum_{i=1}^{D} f_j(y_i, d_i) - \sum_{i=1}^{D} f_j\!\left(\arg\max_{y \in Y} P(y \mid d_i), \, d_i\right)
Online Learning (perceptron)

• Rather than making a full pass through the data, compute the gradient and update parameters after each training example.
MultiClass Perceptron Algorithm
Initialize weight vector w = 0

Loop for K iterations
  Loop for all training examples x_i
    y_pred = argmax_y w_y · x_i
    if y_pred != y_i
      w_{y_i} += x_i      (boost the true class)
      w_{y_pred} -= x_i   (penalize the predicted class)
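
A runnable sketch of the multiclass perceptron update above; the toy three-class dataset is made up for illustration:

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_multiclass_perceptron(X, y, n_classes, K=10):
    """One weight vector per class; on a mistake, boost the gold class and penalize the prediction."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(K):
        for x_i, y_i in zip(X, y):
            y_pred = max(range(n_classes), key=lambda c: dot(W[c], x_i))
            if y_pred != y_i:
                W[y_i] = [w + x for w, x in zip(W[y_i], x_i)]        # w_{y_i} += x_i
                W[y_pred] = [w - x for w, x in zip(W[y_pred], x_i)]  # w_{y_pred} -= x_i
    return W

# Toy 3-class data over two features
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
y = [0, 0, 1, 1, 2]
print(train_multiclass_perceptron(X, y, n_classes=3))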
Q: what if there are only 2 categories?

  P(y = j \mid x_i) = \frac{e^{w_j \cdot x_i}}{\sum_k e^{w_k \cdot x_i}}
Q: what if there are only 2 categories?

  P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x + w_1 \cdot x - w_1 \cdot x} + e^{w_1 \cdot x}}
Q: what if there are only 2 categories?

  P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x - w_1 \cdot x} \, e^{w_1 \cdot x} + e^{w_1 \cdot x}}
Q: what if there are only 2 categories?

  P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_1 \cdot x} \left( e^{w_0 \cdot x - w_1 \cdot x} + 1 \right)}
Q: what if there are only 2 categories?

  P(y = 1 \mid x) = \frac{1}{e^{w_0 \cdot x - w_1 \cdot x} + 1}
Q: what if there are only 2 categories?

  P(y = 1 \mid x) = \frac{1}{e^{-w' \cdot x} + 1}, \quad \text{where } w' = w_1 - w_0
Regularization
• Combating overfitting

• Intuition: don’t let the weights get very large


  w_{MLE} = \arg\max_w \log P(y_1, \ldots, y_d \mid x_1, \ldots, x_d; w)

  \arg\max_w \log P(y_1, \ldots, y_d \mid x_1, \ldots, x_d; w) - \lambda \sum_{i=1}^{V} w_i^2
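
A minimal sketch of how this penalty changes the gradient-ascent update, assuming the λ Σ w_i² form above (so each weight's gradient gains a -2λw_j term); the λ and η values are illustrative:

def regularized_update(w, grad, eta=0.1, lam=0.01):
    """One gradient-ascent step on the penalized objective L(w) - lam * sum_j w_j^2."""
    # d/dw_j [L(w) - lam * w_j^2] = dL/dw_j - 2 * lam * w_j
    return [w_j + eta * (g_j - 2 * lam * w_j) for w_j, g_j in zip(w, grad)]

# Example: weights and a gradient computed elsewhere (values are made up)
w = [0.5, -1.2, 3.0]
grad = [0.1, 0.4, -0.2]
print(regularized_update(w, grad))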
Regularization in the Perceptron
Algorithm
• Can’t directly include regularization in the gradient

• Instead: limit the # of iterations

• Parameter averaging
