
6.867: Recitation Handout (Week 4)

October 15, 2015

Ridge Regression

1. (based on HTF Ex. 3.6 in the on-line version, with some notation changed)
Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior $w \sim N(0, \tau^2 I)$ and Gaussian sampling model $y \sim N(Xw, \sigma^2 I)$. Find the relationship between the regularization parameter $\lambda$ in the ridge formula and the variances $\tau^2$ and $\sigma^2$. Assume that the data are centered as described in the previous problem, so that we don't need a bias term.
Solution: By Bayes' rule, $p(w \mid y) \propto p(y \mid w)\, p(w)$, hence in the log domain:
$$\log p(w \mid y) = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^n \bigl(y^{(i)} - w \cdot x^{(i)}\bigr)^2 - \frac{1}{2\tau^2}\sum_{j=1}^d w_j^2 = \text{const} - \frac{1}{2\sigma^2}\left[\sum_{i=1}^n \bigl(y^{(i)} - w \cdot x^{(i)}\bigr)^2 + \frac{\sigma^2}{\tau^2}\sum_{j=1}^d w_j^2\right]$$
where we have accumulated constant terms that do not depend on $w$. The mode (and mean) of the posterior Gaussian distribution is given by the parameters that maximize the log probability above, which is equivalent to minimizing the expression in the square brackets on the RHS. The latter expression is equivalent to the ridge regression objective for $\lambda = \sigma^2/\tau^2$, so the estimates are the same.
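As a quick numerical sanity check of this equivalence, the sketch below (assuming NumPy; the data are synthetic and the values of $\sigma$ and $\tau$ are illustrative) compares the ridge solution with $\lambda = \sigma^2/\tau^2$ to the Gaussian posterior mean.

import numpy as np

# Check that the ridge estimate (X^T X + lambda I)^{-1} X^T y with
# lambda = sigma^2 / tau^2 equals the posterior mean of w.
rng = np.random.default_rng(0)
n, d = 50, 3
sigma, tau = 0.5, 2.0              # noise std and prior std (illustrative)
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)                # center the features, as in the problem
w_true = rng.normal(scale=tau, size=d)
y = X @ w_true + rng.normal(scale=sigma, size=n)
y -= y.mean()                      # center the targets as well

lam = sigma**2 / tau**2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Posterior: Sigma_post = (X^T X / sigma^2 + I / tau^2)^{-1},
#            mean       = Sigma_post X^T y / sigma^2
Sigma_post = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / tau**2)
post_mean = Sigma_post @ X.T @ y / sigma**2

print(np.allclose(ridge, post_mean))   # True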
2. (Bishop 3.4)
Consider a linear model of the form
$$y(x, w) = w_0 + \sum_{i=1}^D w_i x_i \qquad (3.105)$$
together with a sum-of-squares error function of the form
$$\mathrm{Err}_D(w) = \frac{1}{2} \sum_{n=1}^N \bigl\{ y(x^{(n)}, w) - t_n \bigr\}^2 \qquad (3.106)$$

where $t_n$ is the true value for $x^{(n)}$. Now suppose that Gaussian noise $\epsilon_i$ with zero mean and variance $\sigma^2$ is added independently to each of the input variables $x_i$. By making use of $E[\epsilon_i] = 0$ and $E[\epsilon_i \epsilon_j] = \delta_{ij} \sigma^2$ (where $\delta_{ij} = 1$ when $i = j$ and 0 otherwise), show that minimizing $\mathrm{Err}_D$ averaged over the noise distribution is equivalent to minimizing the sum-of-squares error for noise-free input variables with the addition of a weight-decay regularization term, in which the bias parameter $w_0$ is omitted from the regularizer.
The set-up is illustrated below, where the gray points are the original training examples and
the black ones have their x values perturbed.

Solution:
Let
$$\tilde{y}_n = w_0 + \sum_{i=1}^D w_i \bigl(x_i^{(n)} + \epsilon_i\bigr) = y_n + \sum_{i=1}^D w_i \epsilon_i$$
where $y_n = y(x^{(n)}, w)$ (the prediction from the current model for the $n$th data point) and $\epsilon_i \sim N(0, \sigma^2)$. Note that we have used (3.105). Using (3.106), we define
$$\widetilde{\mathrm{Err}} = \frac{1}{2} \sum_{n=1}^N (\tilde{y}_n - t_n)^2 = \frac{1}{2} \sum_{n=1}^N \bigl(\tilde{y}_n^2 - 2 \tilde{y}_n t_n + t_n^2\bigr) = \frac{1}{2} \sum_{n=1}^N \left( y_n^2 + 2 y_n \sum_{i=1}^D w_i \epsilon_i + \Bigl(\sum_{i=1}^D w_i \epsilon_i\Bigr)^{\!2} - 2 t_n y_n - 2 t_n \sum_{i=1}^D w_i \epsilon_i + t_n^2 \right)$$

If we now take the expectation of $\widetilde{\mathrm{Err}}$ under the distribution of $\epsilon_i$, we see that the second and fifth terms disappear, since $E[\epsilon_i] = 0$ (it's a zero-mean normal), while for the third term we get
$$E\left[ \Bigl( \sum_{i=1}^D w_i \epsilon_i \Bigr)^{\!2} \right] = \sum_{i=1}^D w_i^2 \sigma^2$$
since the $\epsilon_i$ are all independent with variance $\sigma^2$. From this and (3.106) we get
$$E\bigl[\widetilde{\mathrm{Err}}\bigr] = \mathrm{Err}_D + \frac{N}{2} \sum_{i=1}^D w_i^2 \sigma^2$$
where the regularization coefficient $\lambda = N\sigma^2$. Compare to Bishop 3.27.
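A small Monte Carlo sketch of this result (assuming NumPy; the data, targets, and weights are synthetic and held fixed): averaging the perturbed-input error over many noise draws should approach the noise-free error plus $\frac{N}{2}\sigma^2 \sum_i w_i^2$.

import numpy as np

# Average the sum-of-squares error over input noise and compare with
# Err_D + (N/2) * sigma^2 * sum_i w_i^2  (the bias w_0 does not appear).
rng = np.random.default_rng(1)
N, D, sigma = 200, 4, 0.3
X = rng.normal(size=(N, D))
t = rng.normal(size=N)
w0, w = 0.7, rng.normal(size=D)        # arbitrary fixed model parameters

def sse(X_in):
    return 0.5 * np.sum((w0 + X_in @ w - t) ** 2)

err_clean = sse(X)
trials = 5000
err_noisy = np.mean([sse(X + rng.normal(scale=sigma, size=X.shape))
                     for _ in range(trials)])

print(err_noisy)
print(err_clean + 0.5 * N * sigma**2 * np.sum(w**2))   # should be close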


3. (HTF Ex. 3.12 in the on-line version, with some notation changed)
Show that the ridge regression estimates can be obtained by ordinary least squares regression on an augmented data set. First we center the data, computing a new matrix $C$ where $c_j^{(i)} = x_j^{(i)} - \bar{x}_j$; that is, we subtract the average value of each feature $j$, $\bar{x}_j$, from each of the values for feature $j$ in the data.
We augment the centered matrix $C$ with $d$ additional rows $\sqrt{\lambda}\, I$, and augment $y$ with $d$ zeros. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients toward zero. This is related to the idea of hints due to Abu-Mostafa (1995), where model constraints are implemented by adding artificial data examples that satisfy them.
So, the data matrix will look like
$$\begin{pmatrix}
x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\
x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\
\cdots & \cdots & \cdots & \cdots \\
x_1^{(n)} & x_2^{(n)} & \cdots & x_d^{(n)} \\
\sqrt{\lambda} & 0 & \cdots & 0 \\
0 & \sqrt{\lambda} & \cdots & 0 \\
\cdots & \cdots & \cdots & \cdots \\
0 & 0 & \cdots & \sqrt{\lambda}
\end{pmatrix}$$

Solution: The centering step is necessary so that the bias parameter $w_0$ is not regularized (there is generally no reason to believe the data has mean zero, and it can be removed by such a centering step anyway). If we center both $X$ and $y$ by subtracting their averages, resulting in $C$ and $z$, and augment them as suggested, ordinary least squares (without a bias term) minimizes the following objective with respect to $w$:
$$\sum_{i=1}^{n+d} \bigl(z^{(i)} - w \cdot c^{(i)}\bigr)^2 = \sum_{i=1}^{n} \bigl((y^{(i)} - \bar{y}) - w \cdot (x^{(i)} - \bar{x})\bigr)^2 + \sum_{j=1}^{d} \bigl(0 - \sqrt{\lambda}\, w_j\bigr)^2 = \sum_{i=1}^{n} \bigl((y^{(i)} - w \cdot x^{(i)}) - (\bar{y} - w \cdot \bar{x})\bigr)^2 + \lambda \sum_{j=1}^{d} w_j^2$$
The final expression is similar to the ridge regression objective. In fact, if we recall that in ordinary least squares $w_0 = \frac{1}{n}\sum_{i=1}^n (y^{(i)} - w \cdot x^{(i)}) = \bar{y} - w \cdot \bar{x}$, we see that the bias term essentially falls out from the centered matrices. If we replace $(\bar{y} - w \cdot \bar{x})$ within the squared summand with $w_0$, then the objective from this approach is equivalent to that of ridge regression, and the resulting $w$ will be the same.
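The equivalence is easy to check numerically; the sketch below (assuming NumPy, with synthetic data and an arbitrary $\lambda$) fits ordinary least squares on the augmented, centered data and compares the result to the closed-form ridge solution.

import numpy as np

# OLS on the augmented, centered data should reproduce the ridge solution
# (C^T C + lambda I)^{-1} C^T z.
rng = np.random.default_rng(2)
n, d, lam = 40, 3, 2.5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

C = X - X.mean(axis=0)                         # centered features
z = y - y.mean()                               # centered targets
C_aug = np.vstack([C, np.sqrt(lam) * np.eye(d)])
z_aug = np.concatenate([z, np.zeros(d)])

w_aug, *_ = np.linalg.lstsq(C_aug, z_aug, rcond=None)
w_ridge = np.linalg.solve(C.T @ C + lam * np.eye(d), C.T @ z)

print(np.allclose(w_aug, w_ridge))   # True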

One parameter, two estimators

In this problem, we're going to explore the bias-variance trade-off in a very simple setting. We have a set of unidimensional data, $x^{(1)}, \ldots, x^{(n)}$, drawn from the positive reals. Consider a simple model for its distribution (in a later problem we will consider a slightly different model):
Model 1: The data are drawn from a uniform distribution on the interval $[0, b]$. This model has a single positive real parameter $b$, such that $0 < b$.
We are interested in estimates of the mean of the distribution.
1. What's the mean of the Model 1 distribution?
Solution: The model density is $\frac{1}{b}$ (over $[0, b]$), giving a mean of $\frac{b}{2}$.

Let's start by considering the situation in which the data were, in fact, drawn from an instance of the model under consideration: a uniform distribution on $[0, b]$ (for model 1).
In model 1, the ML estimator for $b$ is $b_{ml} = \max_i x^{(i)}$. The likelihood of the data is:
$$L(b_{ml}) = \prod_{i=1}^{n} \begin{cases} \frac{1}{b_{ml}} & \text{if } x^{(i)} \le b_{ml} \\ 0 & \text{otherwise} \end{cases}$$
We can see that if $b_{ml} < x^{(i)}$ for any $x^{(i)}$, then the likelihood of the whole data set must be 0. So, we should pick $b_{ml}$ to be as small as possible subject to the constraint that $b_{ml} \ge x^{(i)}$, which means $b_{ml} = \max_i x^{(i)}$.
To understand the properties of this estimator we have to start by deriving its PDF. The minimum and maximum of a data set are also known as its first and $n$th order statistics, and are sometimes written $x_{[1]}$ and $x_{[n]}$ (we're using square brackets to distinguish these from our notation for samples in a data set).
In model 1, we just need to consider the distribution of $b_{ml}$. Generally speaking, the pdf of the maximum of a set of data drawn from pdf $f$, with cdf $F$, is:
$$f_{b_{ml}}(x) = n F(x)^{n-1} f(x) \qquad (1)$$
The idea is that, if $x$ is the maximum, then the $n - 1$ other data values will have to be less than $x$, and the probability of that is $F(x)^{n-1}$; then one value will have to equal $x$, the probability (density) of which is $f(x)$. We multiply by $n$ because there are $n$ different ways to choose the data value that could be the maximum.
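Equation (1) can be checked by simulation. The sketch below (assuming NumPy; the values of $b$ and $n$ are illustrative) compares the empirical cdf of simulated maxima to the analytic cdf $F(x)^n = (x/b)^n$ that Equation (1) integrates to for uniform data.

import numpy as np

# Empirical vs. analytic cdf of the maximum of n Uniform[0, b] samples.
rng = np.random.default_rng(3)
b, n, trials = 2.0, 5, 200_000
maxima = rng.uniform(0, b, size=(trials, n)).max(axis=1)

for x in [0.5, 1.0, 1.5, 1.9]:
    print(x, np.mean(maxima <= x), (x / b) ** n)   # the two should agree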


2. (a) What is the maximum likelihood estimate of the mean, $\mu_{ml}$, of the distribution?
Solution: Given the MLE $b_{ml}$ of $b$, which is $x_{[n]}$, the maximum of the data set, the MLE of the mean is $\mu_{ml} = \frac{b_{ml}}{2} = \frac{x_{[n]}}{2}$ (from our expression for the mean in part 1).
(b) What is $f_{b_{ml}}$ for this particular case where the data are drawn uniformly from 0 to $b$?
Solution: $f(x) = \frac{1}{b}$ and $F(x) = \frac{x}{b}$, hence $f_{b_{ml}}(x) = n \frac{x^{n-1}}{b^n}$ over $[0, b]$, and is zero otherwise.

(c) Write an expression for the expected value of $\mu_{ml}$, as an integral.
Solution: The pdf of the max of $n$ data points was given in Equation 1 above. Given that the max value is $x$, the estimated mean is $\frac{x}{2}$ from Q1. Hence:
$$E[\mu_{ml}] = \int_0^b \frac{x}{2}\, f_{b_{ml}}(x)\, dx = \int_0^b \frac{x}{2}\, n \frac{x^{n-1}}{b^n}\, dx = \frac{b}{2}\frac{n}{n+1}$$
In fact, there's a nice closed form expression, which you can use in the following questions:
$$E[\mu_{ml}] = \frac{b}{2}\frac{n}{n+1}.$$
(d) What is the squared bias of $\mu_{ml}$? Is this estimator unbiased? Is it asymptotically unbiased? (Reminder: $\mathrm{bias}^2(\mu_{ml}) = (E_D[\mu_{ml}] - \mu)^2$.)
Solution: Using the given equations,
$$\mathrm{bias}^2(\mu_{ml}) = (E[\mu_{ml}] - \mu)^2 = \left( \frac{b}{2}\frac{n}{n+1} - \frac{b}{2} \right)^{\!2} = \frac{b^2}{4(n+1)^2}$$
Because the $\mathrm{bias}^2$ is not zero, the estimator is biased. However, since $\mathrm{bias}^2 \to 0$ as $n \to \infty$, it is asymptotically unbiased.
(e) Write an expression for the variance of $\mu_{ml}$, as an integral.
Solution:
$$V[\mu_{ml}] = E\bigl[\mu_{ml}^2\bigr] - \bigl(E[\mu_{ml}]\bigr)^2 = \int_0^b \left(\frac{x}{2}\right)^{\!2} f_{b_{ml}}(x)\, dx - \bigl(E[\mu_{ml}]\bigr)^2 = \int_0^b \frac{x^2}{4}\, n \frac{x^{n-1}}{b^n}\, dx - \left(\frac{b}{2}\frac{n}{n+1}\right)^{\!2} = \frac{b^2}{4}\frac{n}{(n+1)^2(n+2)}$$
The closed form for the variance is
$$V[\mu_{ml}] = \frac{b^2}{4}\frac{n}{(n+1)^2(n+2)}.$$

(f) What is the mean squared error of $\mu_{ml}$? (Reminder: $\mathrm{MSE}(\mu_{ml}) = \mathrm{bias}^2(\mu_{ml}) + \mathrm{var}(\mu_{ml})$.)
Solution:
$$\mathrm{MSE}(\mu_{ml}) = \mathrm{bias}^2(\mu_{ml}) + \mathrm{var}(\mu_{ml}) = \frac{b^2}{4(n+1)^2} + \frac{b^2}{4}\frac{n}{(n+1)^2(n+2)} = \frac{b^2}{2(n+1)(n+2)}$$
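The closed forms for the bias, variance, and MSE above can be verified by simulation; a sketch follows (assuming NumPy, with $b = 1$ and $n = 10$ as illustrative choices).

import numpy as np

# Monte Carlo check of bias^2, variance, and MSE of mu_ml = max_i x^(i) / 2.
rng = np.random.default_rng(4)
b, n, trials = 1.0, 10, 400_000
mu = b / 2
mu_ml = rng.uniform(0, b, size=(trials, n)).max(axis=1) / 2

print("bias^2:", (mu_ml.mean() - mu) ** 2, b**2 / (4 * (n + 1) ** 2))
print("var:   ", mu_ml.var(), b**2 * n / (4 * (n + 1) ** 2 * (n + 2)))
print("MSE:   ", np.mean((mu_ml - mu) ** 2), b**2 / (2 * (n + 1) * (n + 2)))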

(g) So far, we have been considering the error of the estimator, comparing the estimated value
of the mean with its actual value. We will often want to use the estimator to make predictions, and so we might be interested in the expected error of a prediction.
Assume the loss function for your predictions is $L(g, a) = (g - a)^2$. Given an estimate $\mu_{ml}$ of the mean of the distribution, what value should you predict?
What is the expected loss (risk) of this prediction? Take into account both the error due to inaccuracies in estimating the mean as well as the error due to noise in the generation of the actual value.
Solution: Let $\mu$ be the actual mean, and $\mu_{ml}$ be the estimated mean. Note that we have $\mu \ge \mu_{ml}$. To minimize the risk of the prediction $g$, we solve
$$g^* = \arg\min_g \int (g - \mu)^2 f(\mu \mid \mu_{ml})\, d\mu \qquad (2)$$
$$= \arg\min_g \int (g - \mu)^2 f(\mu_{ml} \mid \mu)\, d\mu \quad \text{(assuming a uniform prior on } \mu\text{)} \qquad (3)$$
$$= \arg\min_g \int_{\mu_{ml}}^{\infty} (g - \mu)^2\, n \frac{\mu_{ml}^{n-1}}{\mu^{n}}\, d\mu \qquad (4)$$
$$= \arg\min_g\; n \mu_{ml}^{n-1} \int_{\mu_{ml}}^{\infty} \bigl( g^2 \mu^{-n} - 2 g \mu^{-n+1} + \mu^{-n+2} \bigr)\, d\mu \qquad (5)$$
$$= \arg\min_g\; n \mu_{ml}^{n-1} \left( \frac{g^2 \mu_{ml}^{-n+1}}{n-1} - \frac{2 g \mu_{ml}^{-n+2}}{n-2} + \frac{\mu_{ml}^{-n+3}}{n-3} \right) \qquad (6)$$
$$= \arg\min_g \left( \frac{g^2}{n-1} - \frac{2 g \mu_{ml}}{n-2} + \frac{\mu_{ml}^2}{n-3} \right) \qquad (7)$$
Setting the derivative with respect to $g$ to zero gives
$$g^* = \frac{n-1}{n-2}\, \mu_{ml}. \qquad (8)$$
One thing to notice is that, as $n \to \infty$, $g^* \to \mu_{ml}$: with an infinite number of samples the prediction converges to the estimated mean, which itself converges to the true mean.


3. We might consider something other than the MLE for Model 1 (labeled $o$ for "other"). Consider the estimator
$$\mu_o = \frac{x_{[n]}\,(n+1)}{2n},$$
where $x_{[n]}$ is the maximum of the data set.
(a) Write an expression for the expected value of this version of $\mu_o$ as an integral. Then solve the integral.
Solution:
$$E[\mu_o] = \int_0^b \frac{x(n+1)}{2n}\, f_{b_{ml}}(x)\, dx = \int_0^b \frac{x(n+1)}{2n}\, n \frac{x^{n-1}}{b^n}\, dx = \frac{b}{2}$$
where the pdf of the maximum, $f_{b_{ml}}$, again assumes the data are drawn uniformly from 0 to $b$.


(b) What is the squared bias of this estimator $\mu_o$? Is this estimator unbiased? Is it asymptotically unbiased?
Solution:
$$\mathrm{bias}^2(\mu_o) = (E[\mu_o] - \mu)^2 = \left( \frac{b}{2} - \frac{b}{2} \right)^{\!2} = 0$$
The estimator is unbiased (and asymptotically unbiased).


(c) Write an expression for the variance of $\mu_o$ as an integral.
Solution:
$$V[\mu_o] = \int_0^b \left( \frac{x(n+1)}{2n} \right)^{\!2} f_{b_{ml}}(x)\, dx - \bigl(E[\mu_o]\bigr)^2 = \int_0^b \frac{x^2 (n+1)^2}{4n^2}\, n \frac{x^{n-1}}{b^n}\, dx - \frac{b^2}{4} = \frac{b^2}{4n(n+2)}$$
The closed form for the variance is
$$V[\mu_o] = \frac{b^2}{4n(n+2)}.$$

(d) What is the mean squared error of this version of $\mu_o$?
Solution: Since the bias is zero, the MSE is the same as the variance $V[\mu_o]$.
(e) What are the relative advantages of the estimator from the previous question and this one?
Solution: The unbiased estimator $\mu_o$ is the better choice here: even though its variance is higher than that of $\mu_{ml}$, its MSE is smaller for every $n > 1$ (the two MSEs coincide at $n = 1$).
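A short sketch (plain NumPy; $b = 1$ is an arbitrary choice) that evaluates the two closed-form MSEs side by side:

import numpy as np

# MSE(mu_ml) = b^2 / (2(n+1)(n+2))  vs  MSE(mu_o) = b^2 / (4n(n+2)).
b = 1.0
for n in np.array([1, 2, 5, 10, 50]):
    mse_ml = b**2 / (2 * (n + 1) * (n + 2))
    mse_o = b**2 / (4 * n * (n + 2))
    print(n, mse_ml, mse_o, mse_o <= mse_ml)   # equal at n = 1, smaller after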

One problem, two models

In this problem, we're going to continue exploring the bias-variance trade-off in a very simple
setting. We have a set of unidimensional data, x(1) , . . . , x(n) , drawn from the positive reals. We
will consider two different models for its distribution:
Model 1: The data are drawn from a uniform distribution on the interval [0, b]. This model
has a single positive real parameter b, such that 0 < b.
Model 2: The data are drawn from a uniform distribution on the interval [a, b]. This model
has two positive real parameters, a and b, such that 0 < a < b.
We are interested in comparing estimates of the mean of the distribution, derived from each of
these two models.

3.1 Using Model 2

1. What's the mean of the Model 2 distribution?
Solution: The model density is $\frac{1}{b-a}$ (over $[a, b]$), giving a mean of $\frac{a+b}{2}$.

2. Let's consider the situation in which the data were, in fact, drawn from an instance of the model under consideration: either a uniform distribution on $[0, b]$ (for model 1) or a uniform distribution on $[a, b]$ (for model 2).
In model 1, the ML estimator for $b$ is $b_{ml} = \max_i x^{(i)}$. The likelihood of the data is:
$$L(b_{ml}) = \prod_{i=1}^{n} \begin{cases} \frac{1}{b_{ml}} & \text{if } x^{(i)} \le b_{ml} \\ 0 & \text{otherwise} \end{cases}$$
We can see that if $b_{ml} < x^{(i)}$ for any $x^{(i)}$, then the likelihood of the whole data set must be 0. So, we should pick $b_{ml}$ to be as small as possible subject to the constraint that $b_{ml} \ge x^{(i)}\ \forall i$, which means $b_{ml} = \max_i x^{(i)}$.
By a similar argument, in model 2 the ML estimator for $b$ remains the same and the ML estimator for $a$ is $a_{ml} = \min_i x^{(i)}$. To understand the properties of these estimators we have to start by deriving their PDFs. The minimum and maximum of a data set are also known as its first and $n$th order statistics, and are sometimes written $x_{[1]}$ and $x_{[n]}$.
We started our analysis of Model 1 in question 2. Now, let's do the same thing, but for the MLE for model 2. We have to start by thinking about the joint distribution of the MLEs $a_{ml}$ and $b_{ml}$.


Generally speaking, the joint pdf of the minimum and the maximum of a set of data drawn from pdf $f$, with cdf $F$, is
$$f_{a_{ml}, b_{ml}}(x, y) = n(n-1)\bigl(F(y) - F(x)\bigr)^{n-2} f(x)\, f(y).$$
Explain in words why this makes sense.

Solution: The argument for $f_{a_{ml}, b_{ml}}(x, y)$ is similar to the one for $f_{b_{ml}}(x)$. However, we have to choose a minimum value $x$ in addition to the maximum value $y$, and ensure all other values fall between $x$ and $y$. First, we factor in the probability (density) of $x$ and $y$, giving the final $f(x) f(y)$ terms. The other $n - 2$ data points must all be between $x$ and $y$, which is true with probability $(F(y) - F(x))^{n-2}$. Finally, there are $n(n-1)$ different ways of choosing the maximum and minimum points. (Note that the ordering of these two points matters, so the multiplicative factor is not $\binom{n}{2} = \frac{n(n-1)}{2}$.)
3. What is $f_{a_{ml}, b_{ml}}$ in the particular case where the data are drawn uniformly from $a$ to $b$?
Solution: $f(x) = \frac{1}{b-a}$ and $F(x) = \frac{x-a}{b-a}$, hence $f_{a_{ml}, b_{ml}}(x, y) = n(n-1)\frac{(y-x)^{n-2}}{(b-a)^n}$ for $a \le x \le y \le b$, and is zero otherwise.

4. Write an expression for the expected value of $\mu_{ml}$ in terms of an integral.
Here's what it should integrate to:
$$E[\mu_{ml}] = \frac{a+b}{2}.$$
Solution: Given that $x$ and $y$ are the min and max values, for Model 2 the MLE of the mean is now $\frac{x+y}{2}$. Hence:
$$E[\mu_{ml}] = \iint \frac{x+y}{2}\, f_{a_{ml}, b_{ml}}(x, y)\, dx\, dy = \int_a^b \int_a^y \frac{x+y}{2}\, n(n-1)\frac{(y-x)^{n-2}}{(b-a)^n}\, dx\, dy = \frac{a+b}{2}$$

5. What is the squared bias of $\mu_{ml}$? Is this estimator unbiased? Is it asymptotically unbiased?
Solution: $\mathrm{bias}^2(\mu_{ml}) = (E[\mu_{ml}] - \mu)^2 = \left( \frac{a+b}{2} - \frac{a+b}{2} \right)^{\!2} = 0$. The estimator is unbiased (and asymptotically unbiased).


6. Write an expression for the variance of $\mu_{ml}$ in terms of an integral.
The closed form for the variance is
$$V[\mu_{ml}] = \frac{(b-a)^2}{2(n+1)(n+2)}.$$
Solution:
$$V[\mu_{ml}] = \iint \left( \frac{x+y}{2} \right)^{\!2} f_{a_{ml}, b_{ml}}(x, y)\, dx\, dy - \bigl(E[\mu_{ml}]\bigr)^2 = \int_a^b \int_a^y \frac{(x+y)^2}{4}\, n(n-1)\frac{(y-x)^{n-2}}{(b-a)^n}\, dx\, dy - \frac{(a+b)^2}{4} = \frac{(b-a)^2}{2(n+1)(n+2)}$$

7. What is the mean squared error of $\mu_{ml}$?
Solution: Since the bias is zero, the MSE is the same as the variance $V[\mu_{ml}]$.
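A Monte Carlo sketch of these Model 2 results (assuming NumPy; the values of $a$, $b$, and $n$ are illustrative): the estimator $(\min + \max)/2$ should come out unbiased with variance $(b-a)^2 / (2(n+1)(n+2))$.

import numpy as np

rng = np.random.default_rng(5)
a, b, n, trials = 0.5, 2.0, 8, 400_000
x = rng.uniform(a, b, size=(trials, n))
mu_hat = (x.min(axis=1) + x.max(axis=1)) / 2

print("mean:", mu_hat.mean(), (a + b) / 2)
print("var: ", mu_hat.var(), (b - a) ** 2 / (2 * (n + 1) * (n + 2)))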

3.2 Comparing Models

What if we have data that is actually drawn from the interval $[0, 1]$? Both models seem like reasonable choices.
1. Show plots that compare the bias, variance, and MSE of each of the estimators we've considered on that data, as a function of $n$. (Use the formulas above; don't do it by actually generating data.) Write a paragraph in English explaining your results. What estimator would you use?
Solution: The plots in Figure 1 compare the bias, variance, and MSE of the models. Note
that both Model 1 unbiased (blue) and Model 2 (black) have zero bias in Figure 1(a). Also,
for a = 0, the MSE for Model 1 MLE (red) and Model 2 are the same, so they overlap in
Figure 1(c) (red under black). We already know that the Model 1 unbiased estimator (blue)
has lower error than the Model 1 MLE (red). Since the MSE for Model 2 and Model 1 MLE
are the same, we conclude that the Model 1 unbiased estimator is superior for data from
[0, 1] due to its lower variance.
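One way to produce these plots directly from the closed forms (no data generated) is sketched below, assuming NumPy and Matplotlib; the range of $n$ is arbitrary, and the colors follow Figure 1.

import numpy as np
import matplotlib.pyplot as plt

a, b = 0.0, 1.0
n = np.arange(1, 51)

bias2_ml = b**2 / (4 * (n + 1) ** 2)                 # model 1 MLE
var_ml = b**2 * n / (4 * (n + 1) ** 2 * (n + 2))
bias2_o = np.zeros_like(n, dtype=float)              # model 1 unbiased
var_o = b**2 / (4 * n * (n + 2))
bias2_m2 = np.zeros_like(n, dtype=float)             # model 2
var_m2 = (b - a) ** 2 / (2 * (n + 1) * (n + 2))

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
panels = [(bias2_ml, bias2_o, bias2_m2),
          (var_ml, var_o, var_m2),
          (bias2_ml + var_ml, bias2_o + var_o, bias2_m2 + var_m2)]
for ax, (red, blue, black), title in zip(axes, panels,
                                         ["squared bias", "variance", "MSE"]):
    ax.plot(n, red, "r", n, blue, "b", n, black, "k")
    ax.set_xlabel("n")
    ax.set_title(title)
plt.show()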

2. Now, what if we have data that is actually drawn from the interval $[.1, 1]$? It seems like model 2 is the only reasonable choice. But is it?
We already know the bias, variance, and MSE for model 2 in this case. But what about the MLE and unbiased estimators for model 1? Let's characterize the general behavior when we use the estimator $\mu_o = x_{[n]}(n+1)/(2n)$ on data drawn from an interval $[a, b]$.


Figure 1: (a) squared bias, (b) variance, (c) MSE. Red = model 1 MLE, blue = model 1 unbiased, black = model 2. (In (a), black overlaps blue; in (c), black overlaps red.)


Write an expression for the expected value of $\mu_o$ in terms of an integral.
Solution:
$$E[\mu_o] = \iint \frac{y(n+1)}{2n}\, f_{a_{ml}, b_{ml}}(x, y)\, dx\, dy = \int_a^b \int_a^y \frac{y(n+1)}{2n}\, n(n-1)\frac{(y-x)^{n-2}}{(b-a)^n}\, dx\, dy = \frac{a + bn}{2n}$$
3. The closed form expression is
$$E[\mu_o] = \frac{a + bn}{2n}.$$

Explain in English why this answer makes sense.

Solution: For small $n$ (and in particular for $n = 1$), since the maximum value in fact cannot be less than $a$, a high value of $a$ means that the initial maximum values will be higher, and hence the estimated mean is higher. Ultimately, the estimator only depends on the maximum of the data, and, as we saw earlier, when that maximum is close to $b$ the expected value of the estimator is close to $\frac{b}{2}$. The expression above tends to $\frac{b}{2}$ as $n \to \infty$, since with many data points it is likely that their maximum is close to $b$.
4. What is the squared bias of this $\mu_o$? Explain in English why your answer makes sense. Consider how it behaves as $a$ increases, and how it behaves as $n$ increases.
Solution: $\mathrm{bias}^2(\mu_o) = (E[\mu_o] - \mu)^2 = \left( \frac{a+bn}{2n} - \frac{a+b}{2} \right)^{\!2} = \frac{a^2 (n-1)^2}{4n^2}$. We already know it's unbiased if $a = 0$; as $a$ increases, model 1 is an increasingly bad (inaccurate) model for the data. Furthermore, for fixed $a$, the bias increases as a function of $n$, because the expected answer gets closer to $\frac{b}{2}$ (and farther from the true $\frac{a+b}{2}$).


5. Write an expression for the variance of this $\mu_o$ in terms of an integral.
The closed form for the variance is
$$V[\mu_o] = \frac{(b-a)^2}{4n(n+2)}.$$
To save you some tedious algebra, we'll tell you that the mean squared error of this $\mu_o$ is (apologies for the ugliness; let us know if you find a beautiful rewrite)
$$\frac{b^2 n - 2abn + a^2(2 - 2n + n^3)}{4n^2(n+2)}.$$
Solution:
$$V[\mu_o] = \iint \left( \frac{y(n+1)}{2n} \right)^{\!2} f_{a_{ml}, b_{ml}}(x, y)\, dx\, dy - \bigl(E[\mu_o]\bigr)^2 = \int_a^b \int_a^y \frac{y^2(n+1)^2}{4n^2}\, n(n-1)\frac{(y-x)^{n-2}}{(b-a)^n}\, dx\, dy - \frac{(a+bn)^2}{4n^2} = \frac{(b-a)^2}{4n(n+2)}$$

6. Show plots that compare the bias, variance, and MSE of this estimator with the regular model
2 estimator on data drawn from [0.1, 1], as a function of n. Are there circumstances in which it
would be better to use this estimator? If so, what are they and why? If not, why not?

Solution: The MSE plots in Figure 2 cross over at around $n = 8$. That is, for $n \le 8$ the model 1 estimator is better, and beyond that the model 2 estimator should be used. Although model 1 has lower variance than model 2, the bias in using model 1 takes over for larger $n$.

Figure 2: (a) squared bias, (b) variance, (c) MSE. Blue = model 1, black = model 2. Data from $[.1, 1]$.
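The crossover can be located directly from the two closed-form MSEs; a short sketch (plain NumPy, with $[a, b] = [0.1, 1]$) follows.

import numpy as np

a, b = 0.1, 1.0
n = np.arange(1, 31)
mse1 = (b**2 * n - 2 * a * b * n + a**2 * (2 - 2 * n + n**3)) / (4 * n**2 * (n + 2))
mse2 = (b - a) ** 2 / (2 * (n + 1) * (n + 2))
print(n[mse1 < mse2])   # n for which the model 1 estimator has smaller MSE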


7. Show plots of MSE of both estimators, as a function of n on data drawn from [.01, 1] and on
data drawn from [.2, 1]. How do things change? Explain why this makes sense.

Solution: See Figure 3. For data from [.01, 1], model 1 is a very good approximation.
Although the model 1 estimator is still biased, because a is very small, the effect of the bias
is much smaller, and model 1 is superior for a larger range of n due to its lower variance.
In contrast, for data from [.2, 1], model 1 is less accurate compared to its application in
Figure 2. The model 1 estimator is more biased and is inferior for n > 3.

Figure 3: MSE as a function of $n$ for data from (a) $[.01, 1]$ and (b) $[.2, 1]$. Blue = model 1, black = model 2.

Regression with class targets

One way to approach a classification problem is to use OLS regression, with a $y$ value of $+1$ for positive examples and $-1$ for negative examples.
1. Is it possible to construct an example data set with $X = \mathbb{R}^2$ that is separable by a line through the origin, such that, if the data is used as a training set for linear least-squares regression with class targets, the resulting classifier will mis-classify some of the training points?
2. Is there a different way of selecting the regression targets that would allow linear regression
to find a separator on your data set? Would that strategy work on all data sets?
3. Is it possible for logistic regression to find a solution that classifies all of the examples correctly? Do you have to change the targets in some way?

Solution:
Some old answers follow. Need to be cleaned up.


[Figure: plot of the example data set and the least-squares fit described below.]

Consider the following data set:


X = [-1 0; -0.1 0; 0.1 0; 10 1; 10 -1]
y = [-1; -1; 1; 1; 1]
The estimated parameters for the data are (visualized above):
theta = [0.1351; 0]
b = -0.3133
X * theta + b = [-0.4483; -0.3268; -0.2998; 1.0374; 1.0374]
As usual, each row of X is a data point. These five points are separable by a line through the origin, in particular by the line $x_1 = 0$. However, the third point is mis-classified (the sign of its prediction is negative).
The above example uses the following intuition. The objective of linear least-squares regression is, as the name suggests, to minimize the squared difference between the target ($y_i$) and prediction ($\theta \cdot x_i + b$) values. The quadratic objective penalizes outliers very heavily, much more than closer values. This means that it is willing to make many more small errors than to make a large one. The last two data points in the above example are far out on the positive $x_1$-axis. Because regression attempts (and succeeds) at giving the outliers a $+1$ prediction, the third data point, which is much closer to the negative examples, is misclassified.
(Technically we could make this work with a slightly simpler data set with only one large point on the right, but if all points were aligned on the axes, $X^\top X$ would be singular and cause problems for least-squares.)
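The example is easy to re-derive; the sketch below (assuming NumPy, with np.linalg.lstsq for the fit) refits the five points with an intercept and checks the signs of the predictions.

import numpy as np

X = np.array([[-1, 0], [-0.1, 0], [0.1, 0], [10, 1], [10, -1]], dtype=float)
y = np.array([-1, -1, 1, 1, 1], dtype=float)

A = np.hstack([X, np.ones((len(X), 1))])     # append a column for the bias
theta_b, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ theta_b

print(theta_b)                 # roughly [0.135, 0, -0.313]
print(pred)                    # the third prediction is negative
print(np.sign(pred) == y)      # the third training point is misclassified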


[Figure: plot of the one-dimensional example described below.]

Above is an even simpler example, in one dimension, with X on the X axis and Y on the Y
axis.
Next part
Note: Many people appear to confuse the targets and the objective. The objective of an optimization problem is the cost function; for least-squares this is the squared error $(y_i - (\theta \cdot x_i + b))^2$. The targets are synonymous with labels, i.e., the targets are the $y_i$ values. We originally assumed the $y_i$ had to be $\pm 1$; here we ask if relaxing this assumption (so we can set $y_i$ to any real number) allows you to find the correct separator.
If we knew the separator initially, we could devise values for the targets such that the regression fits perfectly. Suppose we knew the optimal separator $\theta, b$. Recall that the prediction is normally $\hat{y}_i = \theta \cdot x_i + b$. If we simply set $y_i = \theta \cdot x_i + b$, then $\theta, b$ is an optimal solution to the regression problem since it gives zero error. (It is not necessarily unique, e.g., if the system is underdetermined, but we will not consider such degeneracies.) This strategy is not practical, however, since it assumes that we already have the answer.
Some of you, perhaps based on the misconception detailed above, gave schemes that tried to make the objective function more robust. Since they are probably more practical than our given solution, we will accept those answers as well.

Logistic Regression

Bishop 4.14
Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector $w$ whose decision boundary $w^\top \phi(x) = 0$ separates the classes and then taking the magnitude of $w$ to infinity.
Solution:
Using Bishop's notation, the data set is a set of pairs $(\phi_n, t_n)$, $n = 1, \ldots, N$, with the feature vector $\phi_n = \phi(x^{(n)})$ and the target value for that feature vector, $t_n$.
If the data set is linearly separable, any decision boundary separating the two classes will have the property
$$w^\top \phi_n > 0 \quad \text{if } t_n = 1, \qquad w^\top \phi_n < 0 \quad \text{otherwise.}$$

The likelihood function can be written
$$p(\mathbf{t} \mid w) = \prod_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n} \qquad (4.89)$$
where $\mathbf{t} = (t_1, \ldots, t_N)^\top$ and $y_n = \sigma(w^\top \phi_n)$. We can define an error function as the negative log of the likelihood:
$$\mathrm{Err}(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^N \bigl\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \bigr\} \qquad (4.90)$$
Moreover, from (4.90) we see that the negative log-likelihood will be minimized (i.e., the likelihood maximized) when $y_n = \sigma(w^\top \phi_n) = t_n$ for all $n$. This will be the case when the sigmoid function $\sigma(\cdot)$ is saturated, which occurs when its argument, $w^\top \phi$, goes to $\pm\infty$, i.e., when the magnitude of $w$ goes to infinity.
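This behaviour is easy to observe numerically. The sketch below (assuming NumPy; the 1-D separable data set, the step size, and the iteration count are arbitrary choices) runs plain gradient ascent on the log-likelihood: the weight norm keeps growing while the negative log-likelihood keeps shrinking toward zero.

import numpy as np

# Columns: feature value and a constant 1 acting as the bias feature.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])          # linearly separable labels

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)
for step in range(1, 50001):
    y = sigmoid(X @ w)
    w += 0.5 * X.T @ (t - y)                # gradient of the log-likelihood
    if step % 10000 == 0:
        nll = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
        print(step, np.linalg.norm(w), nll) # ||w|| grows, nll -> 0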

Softmax

Another possible approach to classification is to use a generalized version of the logistic model. Let $x = [x_1, x_2, \ldots, x_d]$ be an input vector, and suppose we would like to classify into $k$ classes; that is, the output $y$ can take a value in $1, \ldots, k$. The softmax generalization of the logistic model uses $k(d+1)$ parameters $\theta = (\theta_{ij})$, $i = 1, \ldots, k$, $j = 0, \ldots, d$, which define the following $k$ intermediate values:
$$z_1 = \theta_{10} + \sum_j \theta_{1j} x_j, \qquad \ldots, \qquad z_i = \theta_{i0} + \sum_j \theta_{ij} x_j, \qquad \ldots, \qquad z_k = \theta_{k0} + \sum_j \theta_{kj} x_j$$
The classification probabilities under the softmax model are:
$$\Pr(y = i \mid x; \theta) = \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}$$


1. Show that when $k = 2$ the softmax model reduces to the logistic model. That is, show how both give rise to the same classification probabilities $\Pr(y \mid x)$. Do this by constructing an explicit transformation between the parameters: for any given set of $2(d+1)$ softmax parameters, show an equivalent set of $(d+1)$ logistic parameters.
Solution: The posterior of a logistic model with weights $\theta'$ is
$$P(Y = 1 \mid x; \theta') = \frac{1}{1 + e^{-z'}}$$
where $z' = \theta'_0 + \sum_j \theta'_j x_j$. The posterior of the softmax model when $k = 2$ is
$$P(Y = 1 \mid x; \theta) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}$$
Equating the two:
$$\frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-z'}}$$
$$e^{z_1} + e^{z_2} = e^{z_1}\bigl(1 + e^{-z'}\bigr) = e^{z_1} + e^{z_1 - z'}$$
$$e^{z_1 - z'} = e^{z_2}$$
$$z' = z_1 - z_2$$
If $\theta'_j = \theta_{1j} - \theta_{2j}$ for each $j$, the softmax model reduces to the logistic model.
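A quick numerical check of this reduction (assuming NumPy; the parameters and the input are random):

import numpy as np

rng = np.random.default_rng(6)
d = 4
theta = rng.normal(size=(2, d + 1))      # row i = class i weights, column 0 = bias
x = rng.normal(size=d)

z = theta[:, 0] + theta[:, 1:] @ x       # z_1, z_2
p_softmax = np.exp(z[0]) / np.exp(z).sum()

theta_prime = theta[0] - theta[1]        # theta'_j = theta_1j - theta_2j
z_prime = theta_prime[0] + theta_prime[1:] @ x
p_logistic = 1.0 / (1.0 + np.exp(-z_prime))

print(p_softmax, p_logistic)             # identical up to rounding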

2. Which of the decision regions from question 2 can represent decision boundaries for a softmax
model?

Solution: Only linear decision boundaries are possible.

3. Show that the softmax model, for any k, can always be represented by a Gaussian mixture
model. What type of Gaussian mixture models are equivalent to softmax models?

Solution: Consider a softmax model with $k$ classes and weights $\theta_{ij}$, and denote by $\theta_i$ a $d$-element vector with components $(\theta_i)_j = \theta_{ij}$ for $1 \le j \le d$; then the softmax posterior is given by
$$P(y \mid x) = \frac{e^{\theta_y^\top x + \theta_{y0}}}{\sum_i e^{\theta_i^\top x + \theta_{i0}}}$$
We would like to find a Gaussian mixture model $(\pi_i, \mu_i, \Sigma_i)_{i=1..k}$ that yields the same posterior. The posterior of a $k$-component Gaussian mixture model (GMM) is given by
$$P(y \mid x; (\pi_i, \mu_i, \Sigma_i)_{i=1..k}) = \frac{\pi_y |\Sigma_y|^{-\frac{1}{2}} e^{-\frac{1}{2}(x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y)}}{\sum_i \pi_i |\Sigma_i|^{-\frac{1}{2}} e^{-\frac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)}} = \frac{\pi_y |\Sigma_y|^{-\frac{1}{2}} e^{-\frac{1}{2}(x^\top \Sigma_y^{-1} x - x^\top \Sigma_y^{-1} \mu_y - \mu_y^\top \Sigma_y^{-1} x + \mu_y^\top \Sigma_y^{-1} \mu_y)}}{\sum_i \pi_i |\Sigma_i|^{-\frac{1}{2}} e^{-\frac{1}{2}(x^\top \Sigma_i^{-1} x - x^\top \Sigma_i^{-1} \mu_i - \mu_i^\top \Sigma_i^{-1} x + \mu_i^\top \Sigma_i^{-1} \mu_i)}}$$
Since covariance matrices are symmetric,
$$= \frac{\pi_y |\Sigma_y|^{-\frac{1}{2}} e^{-\frac{1}{2}(x^\top \Sigma_y^{-1} x - 2\mu_y^\top \Sigma_y^{-1} x + \mu_y^\top \Sigma_y^{-1} \mu_y)}}{\sum_i \pi_i |\Sigma_i|^{-\frac{1}{2}} e^{-\frac{1}{2}(x^\top \Sigma_i^{-1} x - 2\mu_i^\top \Sigma_i^{-1} x + \mu_i^\top \Sigma_i^{-1} \mu_i)}}$$
The exponents here are quadratic in $x$, but in the softmax posterior the exponents are linear in $x$. In order to make them equal, we need to choose identical covariance matrices to cancel out the quadratic terms, as follows:
$$= \frac{\pi_y |\Sigma|^{-\frac{1}{2}} e^{-\frac{1}{2}(x^\top \Sigma^{-1} x - 2\mu_y^\top \Sigma^{-1} x + \mu_y^\top \Sigma^{-1} \mu_y)}}{\sum_i \pi_i |\Sigma|^{-\frac{1}{2}} e^{-\frac{1}{2}(x^\top \Sigma^{-1} x - 2\mu_i^\top \Sigma^{-1} x + \mu_i^\top \Sigma^{-1} \mu_i)}} = \frac{\pi_y e^{-\frac{1}{2}(-2\mu_y^\top \Sigma^{-1} x + \mu_y^\top \Sigma^{-1} \mu_y)}}{\sum_i \pi_i e^{-\frac{1}{2}(-2\mu_i^\top \Sigma^{-1} x + \mu_i^\top \Sigma^{-1} \mu_i)}} = \frac{e^{\mu_y^\top \Sigma^{-1} x - \frac{1}{2}\mu_y^\top \Sigma^{-1} \mu_y + \log \pi_y}}{\sum_i e^{\mu_i^\top \Sigma^{-1} x - \frac{1}{2}\mu_i^\top \Sigma^{-1} \mu_i + \log \pi_i}}$$
To make the linear coefficients of $x$ equal to the weights in the softmax posterior, we need to set the means and covariances such that
$$\mu_i^\top \Sigma^{-1} = \theta_i^\top \quad \text{for all } i,$$
where $\Sigma$ must be invertible. To satisfy this condition, let $\Sigma = I$ and $\mu_i = \theta_i$. The GMM posterior becomes
$$P(y \mid x; (\pi_i, \mu_i = \theta_i, \Sigma_i = I)_{i=1..k}) = \frac{e^{\theta_y^\top x - \frac{1}{2}\theta_y^\top \theta_y + \log \pi_y}}{\sum_i e^{\theta_i^\top x - \frac{1}{2}\theta_i^\top \theta_i + \log \pi_i}}$$
Note that if we set $\log \pi_i - \frac{1}{2}\theta_i^\top \theta_i = \theta_{i0}$, we would obtain exactly the same representation as the softmax posterior. However, this might result in priors that do not sum to one. One way to solve this is to multiply both the numerator and the denominator by a constant $Z$:
$$P(y \mid x; (\pi_i, \mu_i = \theta_i, \Sigma_i = I)_{i=1..k}) = \frac{Z \pi_y e^{\theta_y^\top x - \frac{1}{2}\theta_y^\top \theta_y}}{\sum_i Z \pi_i e^{\theta_i^\top x - \frac{1}{2}\theta_i^\top \theta_i}}$$
Then by setting $\pi_i = e^{\frac{1}{2}\theta_i^\top \theta_i + \theta_{i0}} / Z$ and $Z = \sum_i e^{\frac{1}{2}\theta_i^\top \theta_i + \theta_{i0}}$, we convert the softmax weights to a Gaussian mixture model where
$$\pi_i = \frac{e^{\frac{1}{2}\theta_i^\top \theta_i + \theta_{i0}}}{\sum_i e^{\frac{1}{2}\theta_i^\top \theta_i + \theta_{i0}}}, \qquad \mu_i = \theta_i, \qquad \Sigma_i = I$$
Remember that the covariance matrix must be (1) invertible and (2) identical for all Gaussian components. Therefore, the softmax model can only be reduced to a special case of Gaussian mixture models.
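A numerical sketch of the construction (assuming NumPy; the softmax weights and the query point are random, and the Gaussian densities are computed directly from their formula): the posterior of the GMM with $\Sigma_i = I$, $\mu_i = \theta_i$, and the priors above should match the softmax posterior.

import numpy as np

rng = np.random.default_rng(7)
k, d = 3, 2
theta = rng.normal(size=(k, d))
theta0 = rng.normal(size=k)
x = rng.normal(size=d)

# softmax posterior
z = theta @ x + theta0
p_softmax = np.exp(z) / np.exp(z).sum()

# equivalent GMM: priors ~ exp(theta_i.theta_i/2 + theta_i0), means theta_i, Sigma = I
pri = np.exp(0.5 * np.sum(theta**2, axis=1) + theta0)
pri /= pri.sum()
lik = np.exp(-0.5 * np.sum((x - theta) ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
p_gmm = pri * lik / np.sum(pri * lik)

print(p_softmax)
print(p_gmm)                             # identical up to rounding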
4. A stochastic gradient ascent learning rule for softmax is given by:
$$\theta_{ij} \leftarrow \theta_{ij} + \frac{\partial}{\partial \theta_{ij}} \sum_t \log \Pr(y_t \mid x_t; \theta),$$
where $(x_t, y_t)$ are the training examples. We would like to rewrite this rule as a delta rule. In a delta rule the update is specified as a function of the difference between the target and the prediction. In our case, our target for each example will actually be a vector $\mathbf{y}_t = (y_{t1}, \ldots, y_{tk})$ where $y_{ti} = 1$ if $y_t = i$ and 0 otherwise.
Our prediction will be a corresponding vector of probabilities:
$$\hat{\mathbf{y}}_t = \bigl(\Pr(y = 1 \mid x_t; \theta), \ldots, \Pr(y = k \mid x_t; \theta)\bigr)$$
Calculate the derivative above, and rewrite the update rule as a function of $\mathbf{y} - \hat{\mathbf{y}}$.

Solution: Sometimes it is easier to calculate derivatives in log-scale. We have
$$\log P(y = i) = z_i - \log \sum_l e^{z_l}, \qquad \frac{\partial z_i}{\partial \theta_{ij}} = x_j$$
There are two cases, depending on whether the observed class is $i$ or not:
$$\frac{\partial \log P(y = i)}{\partial \theta_{ij}} = 1 \cdot x_j - \frac{e^{z_i}}{\sum_l e^{z_l}}\, x_j = y_i x_j - \hat{y}_i x_j$$
$$\frac{\partial \log P(y = k \ne i)}{\partial \theta_{ij}} = 0 \cdot x_j - \frac{e^{z_i}}{\sum_l e^{z_l}}\, x_j = y_i x_j - \hat{y}_i x_j$$
Combining them, componentwise
$$\frac{\partial \log P(y_t \mid x_t)}{\partial \theta_{ij}} = y_{ti}\, x_{tj} - \hat{y}_{ti}\, x_{tj},$$
i.e., in matrix form the gradient is the outer product $(\mathbf{y}_t - \hat{\mathbf{y}}_t)\, x_t^\top$. Therefore, the update rule is
$$\theta \leftarrow \theta + \sum_t (\mathbf{y}_t - \hat{\mathbf{y}}_t)\, x_t^\top$$