[Figure: example document categories: Academic Documents, Professional Documents, Culture Documents]
Logistic Regression: Example
• Compute the dot product:
$$z = \sum_{i=0}^{|x|} w_i x_i$$
• Compute the logistic function:
$$P(\text{spam} \mid x) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}$$
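To make the two steps concrete, here is a minimal Python sketch; the weights and feature values are made-up, not from the slides:

import math

def predict_spam_probability(w, x):
    # Step 1: the dot product z = sum_i w_i * x_i (x[0] = 1 acts as the bias)
    z = sum(wi * xi for wi, xi in zip(w, x))
    # Step 2: the logistic function P(spam|x) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

w = [-1.0, 2.0, 0.5]   # illustrative learned weights
x = [1.0, 1.0, 3.0]    # x[0] = 1 is the bias feature
print(predict_spam_probability(w, x))  # ~0.92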
The Logistic function
$$P(\text{spam} \mid x) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}$$
The Dot Product
$$z = \sum_{i=0}^{|x|} w_i x_i$$
Naïve Bayes as a Log-Linear Model
$$\log P(\text{spam} \mid D) \propto \log P(\text{spam}) + \sum_{w \in \text{Vocab}} x_w \cdot \log P(w \mid \text{spam})$$
• Logistic Regression:
– Learn weights jointly
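To make the log-linear view of naïve Bayes concrete, here is a minimal sketch of the scoring formula above; the vocabulary, probabilities, and counts are illustrative assumptions:

import math

def nb_log_score(log_prior, log_word_probs, counts):
    # log P(spam|D) is proportional to log P(spam) + sum_w x_w * log P(w|spam)
    return log_prior + sum(counts[w] * log_word_probs[w] for w in counts)

log_prior = math.log(0.4)                                        # log P(spam)
log_word_probs = {"free": math.log(0.05), "hi": math.log(0.01)}  # log P(w|spam)
counts = {"free": 3, "hi": 1}                                    # x_w: count of w in D
print(nb_log_score(log_prior, log_word_probs, counts))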
LR: Learning Weights
$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & & & & \vdots \\ x_{d1} & x_{d2} & x_{d3} & \dots & x_{dn} \end{pmatrix} \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{pmatrix}$$
Each row of $X$ is a document $x_i$ with $n$ features; $y_i$ is its label.
Q: what parameters should we choose?
• What is the right value for the weights?
$$P(y_i \mid x_i; w) = \begin{cases} p_i, & \text{if } y_i = 1 \\ 1 - p_i, & \text{if } y_i = 0 \end{cases}$$
$$w_{\text{MLE}} = \arg\max_w \sum_i \log \, p_i^{I(y_i = 1)} (1 - p_i)^{I(y_i = 0)}$$
Maximum Likelihood Estimation
$$w_{\text{MLE}} = \arg\max_w \sum_i \log \, p_i^{I(y_i = 1)} (1 - p_i)^{I(y_i = 0)}$$
$$= \arg\max_w \sum_i \, y_i \log p_i + (1 - y_i) \log(1 - p_i)$$
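A minimal sketch of this final form of the objective, with made-up predicted probabilities p_i:

import math

def log_likelihood(ps, ys):
    # sum_i y_i * log p_i + (1 - y_i) * log(1 - p_i)
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(ps, ys))

# p_i = the model's P(y_i = 1 | x_i) for three training documents
print(log_likelihood([0.9, 0.2, 0.7], [1, 0, 1]))  # ~ -0.69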
Maximum Likelihood Estimation
• Unfortunately, there is no closed-form solution
– (unlike naïve Bayes, where there was one)
• Solution:
– Iteratively climb the log-likelihood surface by following the derivative with respect to each weight
• Luckily, the derivatives turn out to be nice
Gradient ascent
While not converged:
  For all features $j$, compute and add derivatives:
$$w_j^{\text{new}} = w_j^{\text{old}} + \eta \, \frac{\partial}{\partial w_j} L(w)$$
$L(w)$: training-set log-likelihood
$\left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots, \frac{\partial L}{\partial w_n} \right)$: gradient vector
Gradient ascent
[Figure: log-likelihood surface plotted over weights $w_1$ and $w_2$]
LR Gradient
$$\frac{\partial L}{\partial w_j} = \sum_i (y_i - p_i) \, x_{ij}$$
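Putting the update rule and this gradient together, a minimal batch gradient-ascent sketch; the toy data and learning rate are assumptions:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(X, y, eta=0.1, iters=1000):
    n = len(X[0])
    w = [0.0] * n
    for _ in range(iters):
        # p_i = P(y_i = 1 | x_i) under the current weights
        ps = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
        # dL/dw_j = sum_i (y_i - p_i) * x_ij; step uphill with rate eta
        for j in range(n):
            w[j] += eta * sum((yi - pi) * xi[j]
                              for xi, yi, pi in zip(X, y, ps))
    return w

# Toy data: x[0] = 1 is the bias feature
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [0, 0, 1, 1]
print(train_lr(X, y))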
Logistic Regression: Pros and Cons
• Doesn’t assume conditional independence of features
– Better calibrated probabilities
– Can handle highly correlated, overlapping features
• Training is different:
– NB: weights are trained independently
– LR: weights trained jointly
Perceptron Algorithm
• Very similar to logistic regression
• Not exactly computing gradients
MultiClass Logistic Regression
$$P(y \mid x) \propto e^{w \cdot f(d,y)}$$
$$P(y \mid x) = \frac{1}{Z(w)} \, e^{w \cdot f(d,y)}$$
$$P(y \mid x) = \frac{e^{w \cdot f(d,y)}}{\sum_{y' \in Y} e^{w \cdot f(d,y')}}$$
MultiClass Logistic Regression
• Binary logistic regression:
– We have one feature vector that matches the size of the vocabulary
• Multiclass in practice:
– One weight vector for each category
– Can represent this in practice with one giant weight vector and repeated features for each category
$$P(y = j \mid x_i) = \frac{e^{w_j \cdot x_i}}{\sum_k e^{w_k \cdot x_i}}$$
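A minimal sketch of this softmax computation, with one weight vector per class; the weights and features are illustrative:

import math

def softmax_prob(W, x):
    # Scores w_k · x for each class k
    scores = [sum(wj * xj for wj, xj in zip(w_k, x)) for w_k in W]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    # P(y = j | x) = exp(w_j · x) / sum_k exp(w_k · x)
    return [e / Z for e in exps]

W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.5]]  # three classes, two features
x = [1.0, 2.0]
print(softmax_prob(W, x))  # three probabilities summing to 1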
Maximum Likelihood Estimation
$$w_{\text{MLE}} = \arg\max_w \log P(y_1, \dots, y_n \mid x_1, \dots, x_n; w) = \arg\max_w \sum_i \log P(y_i \mid x_i; w)$$
$$\frac{\partial L}{\partial w_j} = \sum_{i=1}^{D} f_j(y_i, d_i) - \sum_{i=1}^{D} \sum_{y \in Y} f_j(y, d_i) \, P(y \mid d_i)$$
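This "observed minus expected feature counts" form translates directly into code. In this sketch, f(j, y, d) and prob(y, d) are hypothetical caller-supplied functions for the feature value and the model probability:

def gradient_j(j, docs, labels, classes, f, prob):
    # Observed count of feature j over the training data
    observed = sum(f(j, yi, di) for di, yi in zip(docs, labels))
    # Expected count of feature j under the current model
    expected = sum(f(j, y, di) * prob(y, di)
                   for di in docs for y in classes)
    return observed - expected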
MAP-based learning (perceptron)
$$\frac{\partial L}{\partial w_j} = \sum_{i=1}^{D} f_j(y_i, d_i) - \sum_{i=1}^{D} f_j\left(\arg\max_{y \in Y} P(y \mid d_i), \; d_i\right)$$
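Replacing the expectation with the single most probable label gives the familiar perceptron update. A minimal sketch, where f(y, d) is a hypothetical function returning the feature vector for a (label, document) pair:

def perceptron_step(w, d, y_true, classes, f):
    # Predict the highest-scoring label under the current weights
    y_hat = max(classes,
                key=lambda y: sum(wj * fj for wj, fj in zip(w, f(y, d))))
    if y_hat != y_true:
        # Move toward the true label's features, away from the prediction's
        w = [wj + ft - fp
             for wj, ft, fp in zip(w, f(y_true, d), f(y_hat, d))]
    return w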
Online Learning (perceptron)
$$P(y = j \mid x_i) = \frac{e^{w_j \cdot x_i}}{\sum_k e^{w_k \cdot x_i}}$$
Q: what if there are only 2 categories?
$$P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x + w_1 \cdot x - w_1 \cdot x} + e^{w_1 \cdot x}}$$
$$= \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x - w_1 \cdot x} \, e^{w_1 \cdot x} + e^{w_1 \cdot x}}$$
$$= \frac{e^{w_1 \cdot x}}{e^{w_1 \cdot x} \left( e^{w_0 \cdot x - w_1 \cdot x} + 1 \right)}$$
$$= \frac{1}{e^{w_0 \cdot x - w_1 \cdot x} + 1}$$
$$= \frac{1}{e^{-w' \cdot x} + 1}, \quad \text{where } w' = w_1 - w_0$$
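A quick numerical check of this reduction, with arbitrary illustrative weights: the two-class softmax and the sigmoid with w' = w1 - w0 should agree:

import math

dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))

w0, w1 = [0.3, -0.2], [0.1, 0.6]
x = [1.0, 2.0]

# Two-class softmax
p_softmax = math.exp(dot(w1, x)) / (math.exp(dot(w0, x)) + math.exp(dot(w1, x)))

# Binary logistic regression with the collapsed weights w' = w1 - w0
w_prime = [a - b for a, b in zip(w1, w0)]
p_sigmoid = 1.0 / (1.0 + math.exp(-dot(w_prime, x)))

print(p_softmax, p_sigmoid)  # equal up to floating-point rounding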
Regularization
• Combating overfitting
$$\arg\max_w \, \log P(y_1, \dots, y_d \mid x_1, \dots, x_d; w) \; - \; \lambda \sum_{i=1}^{V} w_i^2$$
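In gradient form, the L2 penalty simply adds -2·λ·w_i to each derivative. A minimal sketch of one penalized update step, where eta and lam are assumed hyperparameters:

def regularized_update(w, grads, eta, lam):
    # Ascend L(w) - lam * sum_i w_i^2: the penalty shrinks weights toward zero
    return [wi + eta * (gi - 2.0 * lam * wi) for wi, gi in zip(w, grads)]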
Regularization in the Perceptron Algorithm
• Can't directly include regularization in the gradient
• Instead, control overfitting with:
– # of iterations (early stopping)
– Parameter averaging
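A minimal sketch of parameter averaging: return the mean of the weight vector over every update rather than the final weights. Here update(w, example) stands in for one perceptron step, such as perceptron_step above:

def averaged_perceptron(train, n_features, epochs, update):
    w = [0.0] * n_features
    running = [0.0] * n_features
    steps = 0
    for _ in range(epochs):
        for example in train:
            w = update(w, example)
            # Accumulate the weights after every update
            running = [r + wi for r, wi in zip(running, w)]
            steps += 1
    # The averaged weights are less sensitive to late, noisy updates
    return [r / steps for r in running]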