Ridge Regression
1. (based on HTF Ex. 3.6 in on-line version, with some notation changed)
Show that the ridge regression estimate is the mean (and mode) of the posterior distribution,
under a Gaussian prior w ~ N(0, τ² I) and Gaussian sampling model y ~ N(Xw, σ² I). Find the
relationship between the regularization parameter λ in the ridge formula and the variances
τ² and σ². Assume that the data are centered as described in the previous problem, so that we
don't need a bias term.
Solution: By Bayes' rule, p(w|y) ∝ p(y|w) p(w), hence in the log domain:

  log p(w|y) = const − 1/(2σ²) Σ_{i=1}^n (y^(i) − w·x^(i))² − 1/(2τ²) Σ_{j=1}^d w_j²

             = const − 1/(2σ²) [ Σ_{i=1}^n (y^(i) − w·x^(i))² + (σ²/τ²) Σ_{j=1}^d w_j² ]

where we have accumulated constant terms that do not depend on w. The mode (and
mean) of the posterior Gaussian distribution is given by the parameters that maximize
the log probability above, which is equivalent to minimizing the expression in the square
brackets on the RHS. The latter expression is equivalent to the ridge regression objective
for λ = σ²/τ², so the estimates are the same.
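This equivalence is easy to check numerically. Below is a small sketch (not part of the original solution; the data sizes and the variances σ² and τ² are arbitrary choices) comparing the ridge estimate with the posterior mean for λ = σ²/τ²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma2, tau2 = 0.5, 2.0                  # noise and prior variances (arbitrary)
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                      # centered data, so no bias term
y = X @ rng.standard_normal(d) + np.sqrt(sigma2) * rng.standard_normal(n)
y -= y.mean()

lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# posterior mean under w ~ N(0, tau2 I), y ~ N(Xw, sigma2 I)
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(d) / tau2, X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))   # True
```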
2. (Bishop 3.4)
Consider a linear model of the form

  y(x, w) = w_0 + Σ_{i=1}^D w_i x_i    (3.105)

together with a sum-of-squares error function of the form

  Err_D = (1/2) Σ_{n=1}^N ( y(x^(n), w) − t_n )²    (3.106)
MIT 6.867
Fall 2014
where t_n is the true value for x^(n). Now suppose that Gaussian noise ε_i with zero mean
and variance σ² is added independently to each of the input variables x_i. By making use
of E[ε_i] = 0 and E[ε_i ε_j] = δ_ij σ² (where δ_ij = 1 when i = j and 0 otherwise), show that
minimizing Err_D averaged over the noise distribution is equivalent to minimizing the sum-of-squares
error for noise-free input variables with the addition of a weight-decay regularization
term, in which the bias parameter w_0 is omitted from the regularizer.
The set-up is illustrated below, where the gray points are the original training examples and
the black ones have their x values perturbed.
Solution:
Let

  ŷ_n = w_0 + Σ_{i=1}^D w_i (x_i^(n) + ε_i)
      = y_n + Σ_{i=1}^D w_i ε_i

where y_n = y(x^(n), w) (the prediction from the current model for the nth data point) and
ε_i ~ N(0, σ²). Note that we have used (3.105). Using (3.106), we define
  Êrr = (1/2) Σ_{n=1}^N (ŷ_n − t_n)²
      = (1/2) Σ_{n=1}^N ( ŷ_n² − 2 ŷ_n t_n + t_n² )
      = (1/2) Σ_{n=1}^N [ y_n² + 2 y_n Σ_{i=1}^D w_i ε_i + ( Σ_{i=1}^D w_i ε_i )²
                          − 2 t_n y_n − 2 t_n Σ_{i=1}^D w_i ε_i + t_n² ]

The cross terms are linear in ε_i, so they vanish in expectation, and

  E[ ( Σ_{i=1}^D w_i ε_i )² ] = Σ_{i=1}^D w_i² σ²
MIT 6.867
Fall 2014
since the ε_i are all independent with variance σ². From this and (3.106) we get:

  E[Êrr] = Err_D + (N/2) σ² Σ_{i=1}^D w_i²

which is the sum-of-squares error on noise-free inputs plus a weight-decay term that omits w_0.
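A quick Monte Carlo sketch of this identity (with arbitrary synthetic data, not part of the original solution): average the perturbed-input error over many fresh noise draws and compare against Err_D plus the weight-decay term:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, sigma = 200, 4, 0.3                # arbitrary sizes and noise level
X = rng.standard_normal((N, D))
w0, w = 0.5, rng.standard_normal(D)
t = rng.standard_normal(N)               # arbitrary targets

y = w0 + X @ w                           # noise-free predictions
err_D = 0.5 * np.sum((y - t) ** 2)

trials, total = 5000, 0.0
for _ in range(trials):
    eps = sigma * rng.standard_normal((N, D))   # input noise, fresh per example
    y_noisy = w0 + (X + eps) @ w
    total += 0.5 * np.sum((y_noisy - t) ** 2)
mc = total / trials

predicted = err_D + 0.5 * N * sigma**2 * np.sum(w ** 2)
print(abs(mc / predicted - 1))           # small (Monte Carlo error)
```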
(x_j − x̄_j); that is, we subtract the average value of each feature j, x̄_j, from each of the
values for feature j in the data.
We augment the centered matrix C with d additional rows √λ I, and augment y with d zeros.
By introducing artificial data having response value zero, the fitting procedure is forced to
shrink the coefficients toward zero. This is related to the idea of hints due to Abu-Mostafa
(1995), where model constraints are implemented by adding artificial data examples that satisfy them.
So, the data matrix will look like

  x_1^(1)  x_2^(1)  ...  x_d^(1)
  x_1^(2)  x_2^(2)  ...  x_d^(2)
  ..............................
  x_1^(n)  x_2^(n)  ...  x_d^(n)
  √λ       0        ...  0
  0        √λ       ...  0
  ..............................
  0        0        ...  √λ
Solution: The centering step is necessary so that the bias parameter w_0 is not regularized
(there is generally no reason to believe the data has mean zero, and it can be removed by
such a centering step anyway). If we center both X and y by subtracting their averages,
resulting in C and z, and augment them as suggested, ordinary least squares (without
bias) minimizes the following objective with respect to w:

  Σ_{i=1}^{n+d} (z^(i) − w·c^(i))² = Σ_{i=1}^n ( (y^(i) − ȳ) − w·(x^(i) − x̄) )² + Σ_{j=1}^d (0 − √λ w_j)²

                                   = Σ_{i=1}^n ( (y^(i) − w·x^(i)) − (ȳ − w·x̄) )² + λ Σ_{j=1}^d w_j²
MIT 6.867
Fall 2014
The final expression is similar to the ridge regression objective. In fact, if we recall that
the bias term in ordinary least squares is w_0 = (1/n) Σ_{i=1}^n (y^(i) − w·x^(i)) = ȳ − w·x̄,
we see that the bias term essentially falls out from the centered matrices. If we replace (ȳ − w·x̄)
within the squared summand with w_0, then the objective from this approach is equivalent to that of
ridge regression, and the resulting w will be the same.
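The equivalence can be verified numerically; a minimal sketch (with made-up data and an arbitrary λ, not from the original text): plain least squares on the augmented, centered data recovers the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 40, 3, 0.7                       # arbitrary sizes and lambda
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

C = X - X.mean(axis=0)                       # center features
z = y - y.mean()                             # center responses

# augment with sqrt(lam) * I rows and d zero responses
C_aug = np.vstack([C, np.sqrt(lam) * np.eye(d)])
z_aug = np.concatenate([z, np.zeros(d)])

w_aug, *_ = np.linalg.lstsq(C_aug, z_aug, rcond=None)   # plain OLS on augmented data
w_ridge = np.linalg.solve(C.T @ C + lam * np.eye(d), C.T @ z)

print(np.allclose(w_aug, w_ridge))   # True
```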
In this problem, we're going to explore the bias-variance trade-off in a very simple setting. We
have a set of unidimensional data, x^(1), ..., x^(n), drawn from the positive reals. Consider a simple
model for its distribution (in a later problem we will consider a slightly different model):
Model 1: The data are drawn from a uniform distribution on the interval [0, b]. This model
has a single positive real parameter b, such that 0 < b.
We are interested in estimates of the mean of the distribution.
1. What's the mean of the Model 1 distribution?
Solution: The model density is 1/b on [0, b], so the mean is b/2.
Let's start by considering the situation in which the data were, in fact, drawn from an instance
of the model under consideration: a uniform distribution on [0, b] (for model 1).
In model 1, the ML estimator for b is b_ml = max_i x^(i). The likelihood of the data is:

  L(b_ml) = Π_{i=1}^n { 1/b_ml   if x^(i) ≤ b_ml
                      { 0        otherwise

We can see that if b_ml < x^(i) for any x^(i), then the likelihood of the whole data set must be
0. So, we should pick b_ml to be as small as possible subject to the constraint that b_ml ≥ x^(i),
which means b_ml = max_i x^(i).
To understand the properties of this estimator we have to start by deriving its PDF. The
minimum and maximum of a data set are also known as its first and nth order statistics, and are
sometimes written x^[1] and x^[n] (we're using square brackets to distinguish these from our
notation for samples in a data set).
In model 1, we just need to consider the distribution of b_ml. Generally speaking, the pdf of the
maximum of a set of data drawn from pdf f, with cdf F, is:

  f_{b_ml}(x) = n F(x)^{n−1} f(x)    (1)

The idea is that, if x is the maximum, then n − 1 of the other data values will have to be less
than x, and the probability of that is F(x)^{n−1}, and then one value will have to equal x, the
probability (density) of which is f(x). We multiply by n because there are n different ways to choose
the data value that could be the maximum.
2. (a) What is the maximum likelihood estimate of the mean, μ_ml, of the distribution?
Solution: Given the MLE b_ml of b, which is x^[n], the maximum of the data set, the
MLE of the mean is μ_ml = b_ml/2 = x^[n]/2 (from our expression of the mean in part 1).
(b) What is f_{b_ml} for this particular case where the data are drawn uniformly from 0 to b?
Solution: f(x) = 1/b and F(x) = x/b, hence f_{b_ml}(x) = n x^{n−1}/b^n over [0, b], and is zero otherwise.
(c) What is the expected value of μ_ml?
Solution:

  E[μ_ml] = ∫_0^b (x/2) f_{b_ml}(x) dx = ∫_0^b (x/2) · n x^{n−1}/b^n dx = b n / (2(n+1))

In fact, there's a nice closed form expression, which you can use in the following questions:

  E[μ_ml] = (b/2) · n/(n+1).
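A simulation makes the closed form easy to believe (the values of b and n below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
b, n, trials = 2.0, 5, 200000
x = rng.uniform(0.0, b, size=(trials, n))   # many data sets of size n
mu_ml = x.max(axis=1) / 2.0                 # MLE of the mean per data set

print(mu_ml.mean())   # ~ (b/2) * n/(n+1) = 0.8333...
```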
(d) What is the squared bias of μ_ml? Is this estimator unbiased? Is it asymptotically unbiased?
(Reminder: bias²(μ_ml) = (E_D[μ_ml] − μ)².)
Solution: Using the given equations,

  bias²(μ_ml) = ( (b/2)·n/(n+1) − b/2 )² = b² / (4(n+1)²)

The estimator is biased for any finite n, but asymptotically unbiased, since the bias goes to 0 as n → ∞.
(e) What is the variance of μ_ml?
Solution:

  var(μ_ml) = (b²/4) · n / ((n+1)²(n+2)).
(f) What is the mean squared error of μ_ml? (Reminder: MSE(μ_ml) = bias²(μ_ml) + var(μ_ml).)
Solution:

  MSE(μ_ml) = bias²(μ_ml) + var(μ_ml) = b²/(4(n+1)²) + (b²/4) · n/((n+1)²(n+2)) = b² / (2(n+1)(n+2))
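The MSE formula can be checked the same way by simulation (arbitrary b and n):

```python
import numpy as np

rng = np.random.default_rng(4)
b, n, trials = 2.0, 5, 400000
mu = b / 2.0                                  # true mean
mu_ml = rng.uniform(0.0, b, size=(trials, n)).max(axis=1) / 2.0

mse_mc = np.mean((mu_ml - mu) ** 2)
mse_formula = b**2 / (2 * (n + 1) * (n + 2))
print(mse_mc, mse_formula)   # both ~ 0.0476
```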
(g) So far, we have been considering the error of the estimator, comparing the estimated value
of the mean with its actual value. We will often want to use the estimator to make predictions, and so we might be interested in the expected error of a prediction.
Assume the loss function for your predictions is L(g, a) = (g − a)². Given the estimate μ_ml, what scaled prediction g = c μ_ml minimizes the expected loss on a new draw a from the distribution?
Solution: Averaging the loss over the distribution of μ_ml and of an independent new draw a ~ U[0, b] (the cross term factors because μ_ml and a are independent),

  E[(c μ_ml − a)²] = c² E[μ_ml²] − 2c E[μ_ml] E[a] + E[a²]
                   = c² (b²/4)·n/(n+2) − 2c (b²/4)·n/(n+1) + b²/3

Setting the derivative with respect to c to zero gives c = (n+2)/(n+1), so the loss-minimizing prediction inflates the MLE slightly:

  ĝ = ((n+2)/(n+1)) μ_ml.
3. We might consider something other than the MLE for Model 1 (labeled o for "other"). Consider
the estimator

  μ_o = x^[n] (n+1) / (2n),

where x^[n] is the maximum of the data set.
(a) Write an expression for the expected value of this version of μ_o as an integral. Then solve
the integral.
Solution:

  E[μ_o] = ∫_0^b ( x(n+1)/(2n) ) f_{b_ml}(x) dx = ∫_0^b ( x(n+1)/(2n) ) · n x^{n−1}/b^n dx = b/2

so

  bias²(μ_o) = (E[μ_o] − μ)² = (b/2 − b/2)² = 0.
(b) What is the variance of μ_o?
Solution:

  var(μ_o) = ∫_0^b ( x(n+1)/(2n) )² · n x^{n−1}/b^n dx − (b/2)²
           = (n+1)²/(4n²) · n b²/(n+2) − b²/4
           = b² / (4n(n+2)).
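A quick simulation confirms that μ_o is unbiased with the stated variance (arbitrary b and n):

```python
import numpy as np

rng = np.random.default_rng(5)
b, n, trials = 2.0, 5, 400000
xmax = rng.uniform(0.0, b, size=(trials, n)).max(axis=1)
mu_o = xmax * (n + 1) / (2 * n)

print(mu_o.mean())   # ~ b/2 = 1.0 (unbiased)
print(mu_o.var())    # ~ b^2/(4n(n+2)) = 0.0286...
```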
Solution: The unbiased estimator is strictly better: it has smaller MSE for all n > 1, even
though its variance is higher.
In this problem, we're going to continue exploring the bias-variance trade-off in a very simple
setting. We have a set of unidimensional data, x^(1), ..., x^(n), drawn from the positive reals. We
will consider two different models for its distribution:
Model 1: The data are drawn from a uniform distribution on the interval [0, b]. This model
has a single positive real parameter b, such that 0 < b.
Model 2: The data are drawn from a uniform distribution on the interval [a, b]. This model
has two positive real parameters, a and b, such that 0 < a < b.
We are interested in comparing estimates of the mean of the distribution, derived from each of
these two models.
3.1
Using Model 2
1. What's the mean of the Model 2 distribution?
Solution: The model density is 1/(b − a) on [a, b], so the mean is (a + b)/2.
2. Let's consider the situation in which the data were, in fact, drawn from an instance of the
model under consideration: either a uniform distribution on [0, b] (for model 1) or a uniform
distribution on [a, b] (for model 2).
In model 1, the ML estimator for b is b_ml = max_i x^(i). The likelihood of the data is:

  L(b_ml) = Π_{i=1}^n { 1/b_ml   if x^(i) ≤ b_ml
                      { 0        otherwise

We can see that if b_ml < x^(i) for any x^(i), then the likelihood of the whole data set must be 0.
So, we should pick b_ml to be as small as possible subject to the constraint that b_ml ≥ x^(i) for all i,
which means b_ml = max_i x^(i).
By a similar argument, in model 2 the ML estimator for b remains the same, and the ML estimator for a is a_ml = min_i x^(i). To understand the properties of these estimators we have to
start by deriving their PDFs. The minimum and maximum of a data set are also known as
its first and nth order statistics, and are sometimes written x^[1] and x^[n].
We started our analysis of Model 1 in question 2. Now, let's do the same thing, but for the MLE
for model 2. We have to start by thinking about the joint distribution of the MLEs a_ml and b_ml.
Generally speaking, the joint pdf of the minimum and the maximum of a set of data drawn
from pdf f, with cdf F, is

  f_{a_ml,b_ml}(x, y) = n(n−1) (F(y) − F(x))^{n−2} f(x) f(y).

Explain in words why this makes sense.
Solution: The argument for f_{a_ml,b_ml}(x, y) is similar to the one for f_{b_ml}(x). However, we
have to choose a minimum value x in addition to the maximum value y and ensure all
other values fall between x and y. First, we factor in the probability (density) of x and y,
giving the final f(x)f(y) terms. The other n − 2 data points must all be between x and
y, which is true with probability (F(y) − F(x))^{n−2}. Finally, there are n(n−1) different
ways of choosing the maximum and minimum points. (Note that the ordering of these
two points matters, so the multiplicative factor is not C(n,2) = n(n−1)/2.)
3. What is f_{a_ml,b_ml} in the particular case where the data are drawn uniformly from a to b?
Solution: f(x) = 1/(b−a) and F(x) = (x−a)/(b−a), hence

  f_{a_ml,b_ml}(x, y) = n(n−1) (y−x)^{n−2} / (b−a)^n   for a ≤ x ≤ y ≤ b,

and is zero otherwise.
4. What is the expected value of the Model 2 MLE of the mean, (a+b)/2? Is it unbiased?
Solution: Given that x and y are the min and max values for Model 2, the MLE of the mean is now
(x+y)/2. Hence:
  E[μ_ml] = ∫∫ ((x+y)/2) f_{a_ml,b_ml}(x, y) dx dy
          = ∫_a^b ∫_a^y ((x+y)/2) · n(n−1)(y−x)^{n−2}/(b−a)^n dx dy
          = (a+b)/2

so the estimator is unbiased: bias²(μ_ml) = ((a+b)/2 − (a+b)/2)² = 0.
The variance of this estimator is

  V[μ_ml] = (b − a)² / (2(n+1)(n+2)).

Solution:

  V[μ_ml] = ∫∫ ((x+y)/2)² f_{a_ml,b_ml}(x, y) dx dy − ((a+b)/2)²
          = ∫_a^b ∫_a^y ((x+y)²/4) · n(n−1)(y−x)^{n−2}/(b−a)^n dx dy − (a+b)²/4
          = (b − a)² / (2(n+1)(n+2))
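Simulating the midrange estimator for Model 2 confirms both the unbiasedness and the variance formula (a, b, and n below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
a, b, n, trials = 1.0, 3.0, 6, 400000
x = rng.uniform(a, b, size=(trials, n))
mu_ml = (x.min(axis=1) + x.max(axis=1)) / 2.0   # midrange estimator

print(mu_ml.mean())   # ~ (a+b)/2 = 2.0
print(mu_ml.var())    # ~ (b-a)^2 / (2(n+1)(n+2)) = 0.0357...
```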
3.2
Comparing Models
What if we have data that is actually drawn from the interval [0, 1]? Both models seem like reasonable choices.
1. Show plots that compare the bias, variance, and MSE of each of the estimators we've considered on that data, as a function of n. (Use the formulas above; don't do it by actually generating data.) Write a paragraph in English explaining your results. What estimator would you
use?
Solution: The plots in Figure 1 compare the bias, variance, and MSE of the models. Note
that both Model 1 unbiased (blue) and Model 2 (black) have zero bias in Figure 1(a). Also,
for a = 0, the MSE for Model 1 MLE (red) and Model 2 are the same, so they overlap in
Figure 1(c) (red under black). We already know that the Model 1 unbiased estimator (blue)
has lower error than the Model 1 MLE (red). Since the MSE for Model 2 and Model 1 MLE
are the same, we conclude that the Model 1 unbiased estimator is superior for data from
[0, 1] due to its lower variance.
2. Now, what if we have data that is actually drawn from the interval [.1, 1]? It seems like model
2 is the only reasonable choice. But is it?
We already know the bias, variance, and MSE for model 2 in this case. But what about the
MLE and unbiased estimators for model 1? Let's characterize the general behavior when we
use the estimator μ_ml = x^[n] (n+1)/(2n) on data drawn from an interval [a, b].
The expected value of this estimator on data drawn from [a, b] is

  E[μ_ml] = (a + bn) / (2n).
Solution: For small n (and in particular for n = 1), since the maximum value in fact
cannot be less than a, a high value of a means that initial maximum values will be higher,
and hence the estimated mean is higher. Ultimately, the estimator only depends on the
maximum of the data, and as we saw earlier its expected value tends toward b/2. The expression
above tends to this as n → ∞, since with many data points, it is likely that their maximum
is close to b.
4. What is the squared bias of this μ_ml? Explain in English why your answer makes sense. Consider how it behaves as a increases, and how it behaves as n increases.
Solution:

  bias²(μ_ml) = (E[μ_ml] − μ)² = ( (a+bn)/(2n) − (a+b)/2 )² = a²(n−1)² / (4n²)

We already know it's unbiased if a = 0; as a increases, this is an increasingly bad (inaccurate) model. Furthermore, for fixed a, the bias increases as a function of n, because the expected answer gets
closer to b/2 (and farther from the true (a+b)/2).
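Simulating the Model 1 estimator on data actually drawn from [a, b] shows both the shifted expectation and the bias formula (a = 0.1, b = 1, n = 10 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
a, b, n, trials = 0.1, 1.0, 10, 300000
xmax = rng.uniform(a, b, size=(trials, n)).max(axis=1)
mu = xmax * (n + 1) / (2 * n)

print(mu.mean())   # ~ (a + b*n)/(2n) = 0.505
bias2 = (mu.mean() - (a + b) / 2) ** 2
print(bias2)       # ~ a^2 (n-1)^2 / (4 n^2) = 0.002025
```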
The variance of this estimator is

  var(μ_ml) = (b − a)² / (4n(n+2)).
To save you some tedious algebra, we'll tell you that the mean squared error of this μ_ml is
(apologies for the ugliness; let us know if you find a beautiful rewrite)

  MSE(μ_ml) = ( b²n − 2abn + a²(2 − 2n + n³) ) / ( 4n²(n+2) ).
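One can check algebraically (or numerically, as below) that the given closed form is exactly bias² + variance:

```python
# check: the closed-form MSE equals squared bias plus variance
def bias2(a, b, n):
    return a**2 * (n - 1)**2 / (4 * n**2)

def var(a, b, n):
    return (b - a)**2 / (4 * n * (n + 2))

def mse(a, b, n):
    return (b**2*n - 2*a*b*n + a**2*(2 - 2*n + n**3)) / (4 * n**2 * (n + 2))

checks = [abs(bias2(a, b, n) + var(a, b, n) - mse(a, b, n)) < 1e-12
          for a, b, n in [(0.1, 1.0, 3), (0.2, 1.0, 8), (0.5, 2.0, 20)]]
print(all(checks))   # True
```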
Solution:

  V[μ_ml] = ∫∫ ( y(n+1)/(2n) )² f_{a_ml,b_ml}(x, y) dx dy − ( (a+bn)/(2n) )²
          = ∫_a^b ∫_a^y ( y²(n+1)²/(4n²) ) · n(n−1)(y−x)^{n−2}/(b−a)^n dx dy − (a+bn)²/(4n²)
          = (b − a)² / (4n(n+2))
6. Show plots that compare the bias, variance, and MSE of this estimator with the regular model
2 estimator on data drawn from [0.1, 1], as a function of n. Are there circumstances in which it
would be better to use this estimator? If so, what are they and why? If not, why not?
Solution: The MSE plots in Figure 2 cross over at around n = 8. That is, for n ≤ 8, using
the model 1 estimator is better, and beyond that the model 2 estimator should be used.
Although model 1 has lower variance than model 2, the bias in using model 1 takes over
for larger n.
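The crossover can be confirmed directly from the two closed-form MSEs (a sketch: mse_model1 is the Model 1 estimator applied to [a, b] data, mse_model2 the Model 2 midrange estimator):

```python
def mse_model1(a, b, n):   # Model 1 estimator x_max*(n+1)/(2n) on [a, b] data
    return (b**2*n - 2*a*b*n + a**2*(2 - 2*n + n**3)) / (4*n**2*(n + 2))

def mse_model2(a, b, n):   # Model 2 midrange estimator (unbiased)
    return (b - a)**2 / (2*(n + 1)*(n + 2))

a, b = 0.1, 1.0
print(mse_model1(a, b, 8) < mse_model2(a, b, 8))   # True: model 1 better at n = 8
print(mse_model1(a, b, 9) > mse_model2(a, b, 9))   # True: model 2 better from n = 9
```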
Figure 2 ((a) bias, (b) variance, (c) MSE): blue = model 1, black = model 2. Data from [.1, 1].
7. Show plots of MSE of both estimators, as a function of n on data drawn from [.01, 1] and on
data drawn from [.2, 1]. How do things change? Explain why this makes sense.
Solution: See Figure 3. For data from [.01, 1], model 1 is a very good approximation.
Although the model 1 estimator is still biased, because a is very small, the effect of the bias
is much smaller, and model 1 is superior for a larger range of n due to its lower variance.
In contrast, for data from [.2, 1], model 1 is less accurate compared to its application in
Figure 2. The model 1 estimator is more biased and is inferior for n > 3.
Figure 3: MSE plots for data drawn from (a) [.01, 1] and (b) [.2, 1].
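The same comparison works for the other two intervals (a sketch using the closed-form MSEs):

```python
def mse_model1(a, b, n):   # Model 1 estimator x_max*(n+1)/(2n) on [a, b] data
    return (b**2*n - 2*a*b*n + a**2*(2 - 2*n + n**3)) / (4*n**2*(n + 2))

def mse_model2(a, b, n):   # Model 2 midrange estimator (unbiased)
    return (b - a)**2 / (2*(n + 1)*(n + 2))

# [.01, 1]: model 1 stays better over a much larger range of n
print(mse_model1(0.01, 1.0, 50) < mse_model2(0.01, 1.0, 50))    # True
print(mse_model1(0.01, 1.0, 150) > mse_model2(0.01, 1.0, 150))  # True

# [.2, 1]: model 1 is better only for very small n
print(mse_model1(0.2, 1.0, 3) < mse_model2(0.2, 1.0, 3))        # True
print(mse_model1(0.2, 1.0, 4) > mse_model2(0.2, 1.0, 4))        # True
```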
One way to approach a classification problem is to use OLS regression, with a y value of +1 for
positive examples and −1 for negative examples.
1. Is it possible to construct an example data set with X = R² that is separable by a line through
the origin, such that, if the data is used as a training set for linear least-squares regression with
class targets, the resulting classifier will misclassify some of the training points?
2. Is there a different way of selecting the regression targets that would allow linear regression
to find a separator on your data set? Would that strategy work on all data sets?
3. Is it possible for logistic regression to find a solution that classifies all of the examples correctly? Do you have to change the targets in some way?
Solution:
Some old answers follow. Need to be cleaned up.
Above is an even simpler example, in one dimension, with X on the horizontal axis and Y on the
vertical axis.
Next part
Note: Many people appear to confuse the targets and the objective. The objective of an optimization problem is the cost function; for least squares this is the squared error Σ_i (y_i − (θ·x_i + b))². The targets are synonymous with labels, i.e., the targets are the y_i values. We originally
assumed the y_i had to be ±1; here we ask if relaxing this assumption (so we can set y_i to any
real number) allows you to find the correct separator.
If we knew the separator initially, we could devise values for the targets such that the regression fits perfectly. Suppose we knew the optimal separator θ, b. Recall that the prediction
is normally ŷ_i = θ·x_i + b. If we simply set y_i = θ·x_i + b, then θ, b is an optimal solution
to the regression problem since it gives zero error. (It is not necessarily unique, e.g., if the
system is underdetermined, but we will not consider such degeneracies.) This strategy is not
practical, however, since it assumes that we already have the answer.
Some of you, perhaps based on the misconception detailed above, gave schemes that tried
to make the objective function more robust. Since they are probably more practical than our
given solution, we will accept those answers as well.
Logistic Regression
Bishop 4.14
Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector w whose decision boundary w^T φ(x) = 0 separates
the classes and then taking the magnitude of w to infinity.
Solution:
Using Bishop's notation, the data set is a set of pairs (φ_n, t_n), n = 1, ..., N, with the feature
vector φ_n = φ(x^(n)) and the target value t_n for that feature vector.
If the data set is linearly separable, any decision boundary separating the two classes will
have the property

  w^T φ_n > 0   if t_n = 1,
  w^T φ_n < 0   otherwise.
The likelihood is

  p(t | w) = Π_{n=1}^N y_n^{t_n} (1 − y_n)^{1−t_n}    (4.89)

where t = (t_1, ..., t_N)^T and y_n = σ(w^T φ_n). We can define an error function as the negative
log of the likelihood:

  Err(w) = −ln p(t | w) = −Σ_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) }    (4.90)
Moreover, from (4.90) we see that the negative log-likelihood will be minimized (i.e., the
likelihood maximized) when y_n = σ(w^T φ_n) = t_n for all n. This will be the case when the
sigmoid function σ(·) is saturated, which occurs when its argument w^T φ goes to ±∞, i.e.,
when the magnitude of w goes to infinity.
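This can be illustrated numerically: on a separable data set, scaling a correct separating w upward strictly increases the log-likelihood toward 0 (the 1-D data below is an arbitrary example, not from the original text):

```python
import numpy as np

# Separable 1-D data through the origin: phi > 0 labeled t = 1, else t = 0.
# (Arbitrary illustrative data.)
phi = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
t = (phi > 0).astype(float)

def loglik(w):
    # logistic log-likelihood; the margin w*phi*(2t-1) is positive for
    # correctly classified points, and log sigma(m) = -log(1 + e^-m)
    margins = w * phi * (2 * t - 1)
    return -np.sum(np.logaddexp(0.0, -margins))

lls = [loglik(s) for s in (1.0, 10.0, 100.0)]
print(lls)   # strictly increasing toward 0 as ||w|| grows
```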
Softmax
Another possible approach to classification is to use a generalized version of the logistic model. Let
x = [x_1, x_2, ..., x_d] be an input vector, and suppose we would like to classify into k classes; that
is, the output y can take a value in 1, ..., k. The softmax generalization of the logistic model uses
k(d+1) parameters θ = (θ_ij), i = 1, ..., k, j = 0, ..., d, which define the following k intermediate
values:

  z_1 = θ_10 + Σ_{j=1}^d θ_1j x_j
  ...
  z_i = θ_i0 + Σ_{j=1}^d θ_ij x_j
  ...
  z_k = θ_k0 + Σ_{j=1}^d θ_kj x_j

and the classification probabilities

  Pr(y = i | x) = e^{z_i} / Σ_{j=1}^k e^{z_j}.
MIT 6.867
Fall 2014
17
1. Show that when k = 2 the softmax model reduces to the logistic model. That is, show how
both give rise to the same classification probabilities Pr(y | x). Do this by constructing an explicit transformation between the parameters: for any given set of 2(d+1) softmax parameters,
show an equivalent set of (d+1) logistic parameters.
Solution: The posterior of a logistic model with weights θ' is

  P(Y = 1 | x; θ') = 1 / (1 + e^{−z'})

where z' = θ'_0 + Σ_j θ'_j x_j. For the softmax model with k = 2,

  P(Y = 1 | x; θ) = e^{z_1} / (e^{z_1} + e^{z_2}) = 1 / (1 + e^{z_2 − z_1})

so the two agree when z' = z_1 − z_2. If θ'_j = θ_1j − θ_2j for each j (including j = 0), the softmax model reduces to the logistic model.
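A numeric check of this reduction (random parameters for illustration; the offsets are folded in by prepending a constant-1 feature):

```python
import numpy as np

rng = np.random.default_rng(8)
d = 3
theta = rng.standard_normal((2, d + 1))    # row i = [theta_i0, theta_i1, ..., theta_id]
x = rng.standard_normal(d)
xa = np.concatenate([[1.0], x])            # prepend constant 1 for the offset term

z = theta @ xa
p_softmax = np.exp(z[0]) / np.exp(z).sum()              # softmax Pr(y = 1 | x)

theta_prime = theta[0] - theta[1]                       # logistic weights
p_logistic = 1.0 / (1.0 + np.exp(-(theta_prime @ xa)))  # logistic Pr(y = 1 | x)

print(np.isclose(p_softmax, p_logistic))   # True
```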
2. Which of the decision regions from question 2 can represent decision boundaries for a softmax
model?
3. Show that the softmax model, for any k, can always be represented by a Gaussian mixture
model. What type of Gaussian mixture models are equivalent to softmax models?
Solution: Consider a softmax model with k classes and weights θ_ij, and denote by θ_i the
d-element vector with components (θ_i)_j = θ_ij for 1 ≤ j ≤ d. The softmax posterior is
given by

  P(y | x) = e^{θ_y^T x + θ_y0} / Σ_i e^{θ_i^T x + θ_i0}
We would like to find a Gaussian mixture model (π_i, μ_i, Σ_i)_{i=1..k} that yields the same
posterior. The GMM posterior is

  P(y | x; (π_i, μ_i, Σ_i)) = π_y |Σ_y|^{−1/2} e^{−(1/2)(x−μ_y)^T Σ_y^{−1} (x−μ_y)}
                              / Σ_i π_i |Σ_i|^{−1/2} e^{−(1/2)(x−μ_i)^T Σ_i^{−1} (x−μ_i)}

Expanding the quadratic form,

  = π_y |Σ_y|^{−1/2} e^{−(1/2)(x^T Σ_y^{−1} x − x^T Σ_y^{−1} μ_y − μ_y^T Σ_y^{−1} x + μ_y^T Σ_y^{−1} μ_y)}
    / Σ_i π_i |Σ_i|^{−1/2} e^{−(1/2)(x^T Σ_i^{−1} x − x^T Σ_i^{−1} μ_i − μ_i^T Σ_i^{−1} x + μ_i^T Σ_i^{−1} μ_i)}

Since covariance matrices are symmetric,

  = π_y |Σ_y|^{−1/2} e^{−(1/2)(x^T Σ_y^{−1} x − 2 μ_y^T Σ_y^{−1} x + μ_y^T Σ_y^{−1} μ_y)}
    / Σ_i π_i |Σ_i|^{−1/2} e^{−(1/2)(x^T Σ_i^{−1} x − 2 μ_i^T Σ_i^{−1} x + μ_i^T Σ_i^{−1} μ_i)}

The exponents are quadratic in x, but in the softmax posterior the exponents are linear in
x. In order to make them equal, we need to choose identical covariance matrices to cancel
the quadratic terms:

  = π_y |Σ|^{−1/2} e^{−(1/2)(x^T Σ^{−1} x − 2 μ_y^T Σ^{−1} x + μ_y^T Σ^{−1} μ_y)}
    / Σ_i π_i |Σ|^{−1/2} e^{−(1/2)(x^T Σ^{−1} x − 2 μ_i^T Σ^{−1} x + μ_i^T Σ^{−1} μ_i)}

  = π_y e^{−(1/2)(−2 μ_y^T Σ^{−1} x + μ_y^T Σ^{−1} μ_y)} / Σ_i π_i e^{−(1/2)(−2 μ_i^T Σ^{−1} x + μ_i^T Σ^{−1} μ_i)}

  = e^{μ_y^T Σ^{−1} x − (1/2) μ_y^T Σ^{−1} μ_y + log π_y} / Σ_i e^{μ_i^T Σ^{−1} x − (1/2) μ_i^T Σ^{−1} μ_i + log π_i}
To make the linear coefficients of x equal to the weights in the softmax posterior, we need
to set the means and covariances such that

  μ_i^T Σ^{−1} = θ_i^T   for all i,

where Σ must be invertible. To satisfy this condition, let Σ = I and μ_i = θ_i. The GMM
posterior becomes

  P(y | x; (π_i, μ_i = θ_i, Σ_i = I)) = e^{θ_y^T x − (1/2) θ_y^T θ_y + log π_y}
                                        / Σ_i e^{θ_i^T x − (1/2) θ_i^T θ_i + log π_i}

Note that if we set log π_i − (1/2) θ_i^T θ_i = θ_i0, we would obtain exactly the same representation
as the softmax posterior. However, this might result in negative priors or priors that do not
sum to one. One way to solve this is to multiply both the numerator and denominator
by a constant Z:

  P(y | x; (π_i, μ_i = θ_i, Σ_i = I)) = Z π_y e^{θ_y^T x − (1/2) θ_y^T θ_y}
                                        / Σ_i Z π_i e^{θ_i^T x − (1/2) θ_i^T θ_i}
Then by setting π_i = e^{(1/2) θ_i^T θ_i + θ_i0} / Z with Z = Σ_i e^{(1/2) θ_i^T θ_i + θ_i0},
we obtain a Gaussian mixture model where

  π_i = e^{(1/2) θ_i^T θ_i + θ_i0} / Σ_i e^{(1/2) θ_i^T θ_i + θ_i0}
  μ_i = θ_i
  Σ_i = I

Remember that the covariance matrix must be (1) invertible and (2) identical for all Gaussian components. Therefore, the softmax model can only be reduced to a special case of
Gaussian mixture models.
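The construction can be verified numerically: building the GMM with Σ_i = I, μ_i = θ_i, and the priors above reproduces the softmax posterior exactly (random θ for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
k, d = 4, 2
theta = rng.standard_normal((k, d))    # theta_i vectors
theta0 = rng.standard_normal(k)        # theta_i0 offsets

# priors pi_i proportional to exp(theta_i . theta_i / 2 + theta_i0)
log_pi = 0.5 * np.sum(theta**2, axis=1) + theta0
pi = np.exp(log_pi - log_pi.max())
pi /= pi.sum()

x = rng.standard_normal(d)

# GMM posterior with Sigma_i = I and mu_i = theta_i
log_num = np.log(pi) - 0.5 * np.sum((x - theta)**2, axis=1)
post_gmm = np.exp(log_num - log_num.max())
post_gmm /= post_gmm.sum()

# softmax posterior
z = theta @ x + theta0
post_soft = np.exp(z - z.max())
post_soft /= post_soft.sum()

print(np.allclose(post_gmm, post_soft))   # True
```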
4. A stochastic gradient ascent learning rule for softmax is given by:

  θ_ij ← θ_ij + α Σ_t ∂/∂θ_ij log Pr(y_t | x_t; θ)
where (x_t, y_t) are the training examples. We would like to rewrite this rule as a delta rule. In
a delta rule the update is specified as a function of the difference between the target and the
prediction. In our case, our target for each example will actually be a vector y_t = (y_t1, ..., y_tk)
where y_ti = 1 if y_t = i and 0 otherwise.
Our prediction will be a corresponding vector of probabilities:

  ŷ_t = ( Pr(y = 1 | x_t; θ), ..., Pr(y = k | x_t; θ) )

Calculate the derivative above, and rewrite the update rule as a function of y − ŷ.
Solution: First,

  ∂z_i/∂θ_ij = x_j

There are two cases. If the true class is i (so y_i = 1):

  ∂ log P(y = i)/∂θ_ij = 1·x_j − (e^{z_i}/Σ_l e^{z_l}) x_j = y_i x_j − ŷ_i x_j

If the true class is some k ≠ i (so y_i = 0):

  ∂ log P(y = k ≠ i)/∂θ_ij = 0·x_j − (e^{z_i}/Σ_l e^{z_l}) x_j = y_i x_j − ŷ_i x_j

Combining them in vector form,

  ∂ log P(y_t | x_t)/∂θ_ij = (y_ti − ŷ_ti) x_j,   i.e.,   ∇_θ log P(y_t | x_t) = (y_t − ŷ_t) x_t^T
So the update rule is

  θ ← θ + α Σ_t (y_t − ŷ_t) x_t^T.
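A finite-difference sketch confirms the delta rule: the gradient of log Pr(y_t | x_t; θ) with respect to the full matrix θ is the outer product (y_t − ŷ_t) x_t^T (random parameters for illustration):

```python
import numpy as np

rng = np.random.default_rng(10)
k, d = 3, 4
theta = rng.standard_normal((k, d + 1))
x = np.concatenate([[1.0], rng.standard_normal(d)])  # constant-1 input for the offset
label = 1
y = np.zeros(k)
y[label] = 1.0                                       # one-hot target

def log_prob(th):
    z = th @ x
    z = z - z.max()                                  # for numerical stability
    return z[label] - np.log(np.exp(z).sum())

z = theta @ x
yhat = np.exp(z - z.max())
yhat /= yhat.sum()                                   # predicted probabilities
grad_analytic = np.outer(y - yhat, x)                # the delta-rule gradient

# central finite differences
eps = 1e-6
grad_fd = np.zeros_like(theta)
for i in range(k):
    for j in range(d + 1):
        tp = theta.copy(); tp[i, j] += eps
        tm = theta.copy(); tm[i, j] -= eps
        grad_fd[i, j] = (log_prob(tp) - log_prob(tm)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))   # True
```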