$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon.$$
In praise of linear models!
Why consider alternatives to least squares?
Three classes of methods
Subset Selection
Best subset and stepwise model selection procedures
Example: Credit data set

[Figure: For the Credit data, residual sum of squares (left) and $R^2$ (right), plotted against the number of predictors.]
Stepwise Selection
Forward Stepwise Selection
In Detail
More on Forward Stepwise Selection
Credit data example
Backward Stepwise Selection
Backward Stepwise Selection: details
More on Backward Stepwise Selection
Choosing the Optimal Model
Estimating test error: two approaches
$C_p$, AIC, BIC, and Adjusted $R^2$
Credit data example

[Figure: For the Credit data, $C_p$ (left), BIC (center), and adjusted $R^2$ (right), plotted against the number of predictors.]
Now for some details

• Mallows' $C_p$:
$$C_p = \frac{1}{n}\left(\text{RSS} + 2d\hat{\sigma}^2\right),$$
where $d$ is the total number of parameters used and $\hat{\sigma}^2$ is an estimate of the variance of the error associated with each response measurement.
• The AIC criterion is defined for a large class of models fit by maximum likelihood:
$$\text{AIC} = -2\log L + 2d,$$
where $L$ is the maximized value of the likelihood function for the estimated model.
• The BIC criterion is
$$\text{BIC} = \frac{1}{n}\left(\text{RSS} + \log(n)\,d\hat{\sigma}^2\right).$$
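As a quick illustration, here is a minimal sketch (ours, not from the slides; it assumes numpy and a least squares fit already in hand) of how $C_p$ and BIC could be computed from the formulas above:

```python
import numpy as np

def cp_bic(y, y_hat, d, sigma2_hat):
    """Mallows' C_p and BIC for a least squares fit.

    y          : response vector
    y_hat      : fitted values
    d          : total number of parameters used
    sigma2_hat : estimate of the error variance
    """
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    return cp, bic
```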
Adjusted $R^2$

• For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as
$$\text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n - d - 1)}{\text{TSS}/(n - 1)},$$
where TSS is the total sum of squares.
• Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error.
• Maximizing the adjusted $R^2$ is equivalent to minimizing $\text{RSS}/(n - d - 1)$. While RSS always decreases as the number of variables in the model increases, $\text{RSS}/(n - d - 1)$ may increase or decrease, due to the presence of $d$ in the denominator.
• Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays a price for the inclusion of unnecessary variables in the model. See the Credit data example figure above.
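A corresponding sketch (again ours, with hypothetical argument names) of the adjusted $R^2$ computation:

```python
import numpy as np

def adjusted_r2(y, y_hat, d):
    """Adjusted R^2 for a least squares model with d variables."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```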
Validation and Cross-Validation

• Each of the procedures returns a sequence of models $M_k$ indexed by model size $k = 0, 1, 2, \ldots$. Our job here is to select $\hat{k}$. Once selected, we will return the model $M_{\hat{k}}$.
• We compute the validation set error or the cross-validation error for each model $M_k$ under consideration, and then select the $k$ for which the resulting estimated test error is smallest (a sketch follows this list).
• This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that it provides a direct estimate of the test error, and doesn't require an estimate of the error variance $\sigma^2$.
• It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$.
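A minimal sketch of this selection step (ours; it assumes scikit-learn and that the candidate models $M_k$ are given as lists of column indices, e.g. produced by forward stepwise selection):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def select_model_size(X, y, subsets, cv=10):
    """Pick the model size by K-fold cross-validation.

    subsets : list whose k-th entry holds the column indices of the
              (k+1)-variable model, e.g. from forward stepwise selection.
    """
    cv_errors = []
    for cols in subsets:
        scores = cross_val_score(LinearRegression(), X[:, list(cols)], y,
                                 scoring="neg_mean_squared_error", cv=cv)
        cv_errors.append(-scores.mean())      # mean CV MSE for this model
    k_hat = int(np.argmin(cv_errors))         # index of the best model
    return k_hat, np.array(cv_errors)
```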
Credit data example

[Figure: For the Credit data, square root of BIC (left), validation set error (center), and cross-validation error (right), plotted against the number of predictors.]
Details of Previous Figure
• The validation errors were calculated by randomly selecting
three-quarters of the observations as the training set, and
the remainder as the validation set.
• The cross-validation errors were computed using k = 10
folds. In this case, the validation and cross-validation
methods both result in a six-variable model.
• However, all three approaches suggest that the four-, five-,
and six-variable models are roughly equivalent in terms of
their test errors.
• In this setting, we can select a model using the one-standard-error rule: we first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve (see the sketch below). What is the rationale for this?
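A minimal sketch of the rule (ours; it assumes the CV errors and their standard errors have already been computed for model sizes $1, \ldots, p$):

```python
import numpy as np

def one_standard_error_rule(cv_errors, cv_ses):
    """Smallest model whose CV error is within one standard error
    of the minimum of the CV curve."""
    cv_errors, cv_ses = np.asarray(cv_errors), np.asarray(cv_ses)
    best = np.argmin(cv_errors)
    threshold = cv_errors[best] + cv_ses[best]
    # Sizes are in increasing order, so the first model under the
    # threshold is the smallest acceptable one (sizes start at 1).
    return int(np.argmax(cv_errors <= threshold)) + 1
```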
Shrinkage Methods
Ridge regression

• Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize
$$\text{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2.$$
• In contrast, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize
$$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2,$$
where $\lambda \geq 0$ is a tuning parameter, to be determined separately.
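A minimal fitting sketch (ours, on synthetic data with scikit-learn; note that `Ridge` calls the tuning parameter `alpha` rather than $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # synthetic predictors
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)       # standardize first (see below)
for lam in [0.01, 1.0, 100.0]:
    fit = Ridge(alpha=lam).fit(X_std, y)
    print(lam, np.round(fit.coef_, 2))          # coefficients shrink as lam grows
```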
Credit data example

[Figure: Standardized ridge regression coefficients for the Credit data (Income, Limit, Rating, Student), plotted against $\lambda$ (left) and $\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2$ (right).]
Details of Previous Figure
Ridge regression: scaling of predictors

• The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same.
• In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
• Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}.$$
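In code, this standardization is a one-liner (a sketch, assuming numpy; note the formula scales each predictor but does not center it):

```python
import numpy as np

def standardize(X):
    """Divide each column by its standard deviation (1/n convention),
    matching the formula above; predictors are scaled, not centered."""
    return X / X.std(axis=0)
```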
Why Does Ridge Regression Improve Over Least Squares?

The Bias-Variance tradeoff

[Figure: Mean squared error for ridge regression, plotted against $\lambda$ (left) and $\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2$ (right), illustrating the bias-variance tradeoff.]
The Lasso: continued
Example: Credit dataset

[Figure: Standardized lasso coefficients for the Credit data (Income, Limit, Rating, Student), plotted against $\lambda$ (left) and $\|\hat{\beta}_\lambda^L\|_1 / \|\hat{\beta}\|_1$ (right).]
The Variable Selection Property of the Lasso

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?

One can show that the lasso and ridge regression coefficient estimates solve the problems
$$\underset{\beta}{\text{minimize}} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq s$$
and
$$\underset{\beta}{\text{minimize}} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \leq s,$$
respectively.
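A quick demonstration of the sparsity (ours, on synthetic data with scikit-learn; `Lasso`'s `alpha` plays the role of the penalty weight):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                  # 10 predictors, 2 relevant
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=200)

fit = Lasso(alpha=0.5).fit(X, y)
print(np.round(fit.coef_, 2))                   # most entries are exactly 0.0
print("selected:", np.flatnonzero(fit.coef_))   # indices of nonzero coefficients
```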
The Lasso Picture
Comparing the Lasso and Ridge Regression

[Figure: For two simulated settings, mean squared error plotted against $\lambda$ (left) and against $R^2$ on the training data (right).]
Selecting the Tuning Parameter for Ridge Regression and the Lasso

• As for subset selection, for ridge regression and the lasso we require a method to determine which of the models under consideration is best.
• That is, we require a method for selecting a value of the tuning parameter $\lambda$, or equivalently, of the constraint $s$.
• Cross-validation provides a simple way to tackle this problem: we choose a grid of $\lambda$ values, and compute the cross-validation error rate for each value of $\lambda$.
• We then select the tuning parameter value for which the cross-validation error is smallest.
• Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter. (A sketch of this recipe follows the list.)
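A sketch of the recipe (ours, on synthetic data; scikit-learn's `LassoCV` bundles the grid search, the cross-validation, and the final refit, and calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))                  # synthetic data, as before
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=200)

lambdas = np.logspace(-3, 2, 100)               # grid of candidate lambdas
fit = LassoCV(alphas=lambdas, cv=10).fit(X, y)  # CV over the grid + final refit
print("selected lambda:", fit.alpha_)
```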
Credit data example

[Figure: For the Credit data, cross-validation error (left) and standardized coefficients (right), plotted against $\lambda$.]
Simulated data example

[Figure: For simulated data, cross-validation error (left) and standardized coefficients (right), plotted against $\|\hat{\beta}_\lambda^L\|_1 / \|\hat{\beta}\|_1$.]
Dimension Reduction Methods: details

• Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ linear combinations of our original $p$ predictors. That is,
$$Z_m = \sum_{j=1}^{p} \phi_{mj} X_j \tag{1}$$
for some constants $\phi_{m1}, \ldots, \phi_{mp}$.
• We can then fit the linear regression model
$$y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \ldots, n, \tag{2}$$
using least squares.
• Notice that
$$\sum_{m=1}^{M} \theta_m z_{im} = \sum_{m=1}^{M} \theta_m \sum_{j=1}^{p} \phi_{mj} x_{ij} = \sum_{j=1}^{p} \sum_{m=1}^{M} \theta_m \phi_{mj} x_{ij} = \sum_{j=1}^{p} \beta_j x_{ij},$$
where
$$\beta_j = \sum_{m=1}^{M} \theta_m \phi_{mj}. \tag{3}$$
Hence dimension reduction fits a linear model in the original predictors, but with the coefficients constrained to take the form (3); a numerical check follows.
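A small numerical check of identity (3) (ours, with randomly chosen $\phi$ and $\theta$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, M = 50, 6, 2
X = rng.normal(size=(n, p))
phi = rng.normal(size=(M, p))      # phi[m, j] = phi_mj
theta = rng.normal(size=M)         # theta_m

Z = X @ phi.T                      # z_im = sum_j phi_mj * x_ij, as in (1)
beta = phi.T @ theta               # beta_j = sum_m theta_m * phi_mj, as in (3)
print(np.allclose(Z @ theta, X @ beta))   # True: the two fits coincide
```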
Principal Components Regression
Pictures of PCA

[Figure: Ad Spending plotted against Population.]
Pictures of PCA: continued

[Figure: Plots of the first principal component scores $z_{i1}$ versus pop and ad. The relationships are strong.]
Pictures of PCA: continued

[Figure: Ad Spending and Population plotted against the second principal component scores $z_{i2}$.]
Application to Principal Components Regression

[Figure: Squared bias, variance, and test MSE, plotted against the number of components.]

PCR was applied to two simulated data sets. The black, green, and purple lines correspond to squared bias, variance, and test mean squared error, respectively. Left: the simulated data from the earlier bias-variance tradeoff figure. Right: the simulated data from the lasso and ridge comparison figure.
Choosing the number of directions $M$

[Figure: For the Credit data, standardized coefficients (Income, Limit, Rating, Student; left) and cross-validation MSE (right), plotted against the number of principal components.]
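A sketch of this selection (ours, on synthetic data; a scikit-learn pipeline of `PCA` plus `LinearRegression` plays the role of PCR, with $M$ chosen by cross-validation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))                  # synthetic predictors
y = X[:, :3].sum(axis=1) + rng.normal(size=100)

cv_mse = []
for M in range(1, X.shape[1] + 1):              # candidate numbers of directions
    pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
    scores = cross_val_score(pcr, X, y, scoring="neg_mean_squared_error", cv=10)
    cv_mse.append(-scores.mean())
print("best M:", int(np.argmin(cv_mse)) + 1)    # M with the smallest CV MSE
```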
Partial Least Squares
Partial Least Squares: continued
Details of Partial Least Squares
Summary