
Which Is Better, Stepwise Regression or Best Subsets Regression?



Stepwise regression and best subsets regression are both automatic tools that help you
identify useful predictors during the exploratory stages of model building for linear
regression. These two procedures use different methods and present you with different
output.
An obvious question arises. Does one procedure pick the true model more often than the
other? I'll tackle that question in this post.

First, a quick refresher about the two procedures and their different results:

Stepwise regression presents you with a single model constructed using the p-values of the predictor variables

Best subsets regression assesses all possible models and displays a subset along with
their adjusted R-squared and Mallows' Cp values
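
To make the best subsets idea concrete, here is a minimal sketch in plain Python, not what any statistics package actually runs. It fits every non-empty subset of the candidates by ordinary least squares and reports the adjusted R-squared and Mallows' Cp for each one. The function names (`ols_sse`, `score_all_subsets`) and the simulated data are my own assumptions for illustration, and Cp is computed with the usual formula SSE_p / MSE_full - n + 2p, where p counts the intercept.

```python
import itertools
import random


def ols_sse(columns, y):
    """Residual sum of squares from an OLS fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    a = [[sum(row[r] * row[c] for row in design) for c in range(p)]
         + [sum(row[r] * yi for row, yi in zip(design, y))] for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    beta = [a[r][p] / a[r][r] for r in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def score_all_subsets(columns, y):
    """Return (subset, adjusted R-squared, Mallows' Cp) for every non-empty subset."""
    n, k = len(y), len(columns)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    mse_full = ols_sse(columns, y) / (n - k - 1)
    scores = []
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(k), size):
            sse = ols_sse([columns[j] for j in subset], y)
            p = size + 1  # parameters, counting the intercept
            adj_r2 = 1 - (sse / (n - p)) / (sst / (n - 1))
            cp = sse / mse_full - n + 2 * p
            scores.append((subset, adj_r2, cp))
    return scores


# Made-up data: 4 candidates, of which only x0 and x1 affect the response.
rng = random.Random(0)
n = 200
x = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [x[0][i] + x[1][i] + rng.gauss(0, 1) for i in range(n)]

scores = score_all_subsets(x, y)
best_by_cp = min(scores, key=lambda s: s[2])
best_by_adj_r2 = max(scores, key=lambda s: s[1])
print("lowest Cp:", best_by_cp)
print("highest adjusted R-squared:", best_by_adj_r2)
```

Note that the full model always has Cp equal to its number of parameters, so Cp is only informative for the smaller subsets. Real best subsets output typically shows only the top few models of each size rather than the whole list.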

The key benefit of the stepwise procedure is the simplicity of the single model. Best subsets
does not pick a final model for you but it does present you with multiple models and
information to help you choose the final model. For more details, read this post where
I compare stepwise regression to best subsets regression and present examples using both
analyses.

Determining the Better Model Selection Method


A study by Olejnik, Mills, and Keselman* compares how often stepwise regression, best
subsets regression using the lowest Mallows' Cp, and best subsets using the highest
adjusted R-squared select the true model.
The authors assessed 32 conditions that differed by the number of candidate variables,
number of authentic variables, sample size, and level of multicollinearity. For each
condition, the authors created 1,000 computer-generated datasets and analyzed them with
both stepwise and best subsets to determine how often each procedure selected the
correct model.
And the winner is... stepwise regression! Congratulations! Well, sort of, as we'll see.
Best subsets regression using the lowest Mallows' Cp is a very close second. The overall
difference between Mallows' Cp and stepwise selection is less than 3%. The adjusted
R-squared performed much more poorly than either stepwise or Mallows' Cp.
However, before we pop open the champagne to celebrate stepwise regression's victory,
there's a huge caveat to reveal.
Stepwise selection usually did not identify the correct model. Gasp!

Digging into the Results


Let's look at the results more closely to see how well stepwise selection performs and what
affects its performance. I'll only cover stepwise selection, but the results for Mallows' Cp are
essentially tied and follow the same patterns. I'll give my thoughts on the matter at the end.
In the results below, stepwise regression identifies the correct model if it selects all of the
authentic predictors and excludes all of the noise predictors.
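
That definition of "correct" is strict: the selected set has to match the authentic set exactly. As a small sketch of how you might tally it across simulated datasets (the function names and the example selections are hypothetical, not taken from the study):

```python
def is_correct_model(selected, authentic):
    """True only if every authentic predictor is selected and no noise predictor is."""
    return set(selected) == set(authentic)


def percent_correct(selections, authentic):
    """Share of runs, in percent, in which the selected model is exactly right."""
    hits = sum(is_correct_model(s, authentic) for s in selections)
    return 100.0 * hits / len(selections)


# Hypothetical results from four simulated datasets where x1 and x2 are authentic.
authentic = {"x1", "x2"}
selections = [
    {"x1", "x2"},        # correct
    {"x1"},              # missed an authentic predictor
    {"x1", "x2", "x3"},  # included a noise predictor
    {"x2", "x1"},        # correct (order does not matter)
]
print(percent_correct(selections, authentic))  # 50.0
```

A model that finds all the real predictors but keeps one noise variable counts as a miss here, which is part of why the percentages in this post look so low.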
Best case scenario
In the study, stepwise regression performs the best when there are four candidate variables,
three of which are authentic; there is zero correlation between the predictors; and there is
an extra-large sample size of 500 observations. For this case, the stepwise procedure selects
the correct model 84% of the time. Unfortunately, this is not a realistic scenario and the
accuracy diminishes from here.
Number of candidate predictors and number of authentic predictors
The study looks at scenarios where there are either 4 or 8 candidate predictors. It is harder
to choose the correct model when there are more candidates simply because there are
more possible models to choose from. The same pattern holds true for the number of
authentic predictors.
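
To get a feel for how quickly the number of possible models grows, here is a quick sketch that enumerates every non-empty subset of the candidates. The helper name `candidate_models` and the predictor names are just for illustration.

```python
from itertools import combinations


def candidate_models(predictors):
    """List every non-empty subset of the candidate predictors."""
    return [subset
            for size in range(1, len(predictors) + 1)
            for subset in combinations(predictors, size)]


for k in (4, 8):
    names = [f"x{i}" for i in range(1, k + 1)]
    models = candidate_models(names)
    # The count matches the closed form 2**k - 1.
    print(k, "candidates ->", len(models), "possible models")
```

Going from 4 to 8 candidates takes you from 15 to 255 possible models, so there are many more wrong answers available.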
The table below shows the results for models with no multicollinearity and a good sample
size (100-120 observations). Notice the decrease in the percent correct as both the number
of candidates and number of authentic predictors increase.

Candidate predictors    Authentic predictors    % Correct model
                                                62.7
                                                54.3
                                                34.4
                                                31.3
                                                12.7
                                                1.1

Multicollinearity
The study varies multicollinearity to determine how correlated predictors affect the ability
of stepwise regression to choose the correct model. When predictors are correlated, it's
harder to determine the individual effect each one has on the response variable. The study
set the correlation between predictors to 0, 0.2, and 0.6.
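
If you want to experiment with correlated predictors yourself, one simple way to generate them is with a shared latent factor. This is a sketch under my own assumptions. The post does not describe how the study generated its data, and the function names here are hypothetical.

```python
import math
import random


def correlated_predictors(n, k, rho, rng):
    """Draw n rows of k standard-normal predictors with pairwise correlation rho.

    Each predictor mixes one shared factor with its own independent noise,
    which gives every pair the same correlation (requires 0 <= rho < 1).
    """
    rows = []
    for _ in range(n):
        shared = rng.gauss(0, 1)
        rows.append([math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
                     for _ in range(k)])
    return rows


def sample_correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(1)
for rho in (0.0, 0.2, 0.6):
    rows = correlated_predictors(20000, 4, rho, rng)
    x1 = [row[0] for row in rows]
    x2 = [row[1] for row in rows]
    print(rho, round(sample_correlation(x1, x2), 2))
```

With a large number of rows, the sample correlation between any two columns should land close to the target value.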
The table below shows the results for models with a good sample size (100-120
observations). As correlation increases, the percent correct decreases.

Candidate predictors    Authentic predictors    Correlation    % Correct model
                                                0.0            54.3
                                                0.2            43.1
                                                0.6            15.7
                                                0.0            12.7
                                                0.2            1.0
                                                0.6            0.4

Sample size
The study uses two sample sizes to see how that influences the ability to select the correct
model. The size of the smaller samples is calculated to achieve 0.80 power, which amounts
to 100-120 observations. These sample sizes are consistent with good practices and can be
considered a good sample size.
The very large sample size is 500 observations and it is 5 times the size that you need to
achieve the benchmark power of 0.80.
The table below shows that a very large sample size improves the ability of stepwise
regression to choose the correct model. When choosing your sample size, you may want to
consider a larger sample than what the power and sample size calculations suggest in order
to improve the variable selection process.

Candidate predictors    Authentic predictors    Correlation    % Correct (good sample size)    % Correct (very large sample)
4                                               0.0            54.3                            72.1
                                                0.2            43.1                            72.9
                                                0.6            15.7                            69.2
                                                0.0            12.7                            53.9
                                                0.2            1.0                             39.5
                                                0.6            0.4                             1.8

Closing Thoughts
Stepwise regression generally can't pick the true model. This is true even with the small
number of candidate predictors that this study looks at. In the real world, researchers often
have many more candidates, which lowers the chances even further.
Reality is complex and we should not expect that an automated algorithm can figure it out
for us. After all, the stepwise algorithm follows simple rules and it knows nothing about the
underlying process or subject area. However, stepwise regression can get you to the right
ballpark. At a glance, you'll have a rough idea of what is going on in your data.
It's up to you to get from the rough idea to the correct model. To do this, you'll need to use
your expertise, theory, and common sense rather than relying solely on simplistic model
selection rules.
For tips about how to do this, read my post Four Tips on How to Perform a Regression
Analysis that Avoids Common Problems.
If you're learning about regression, read my regression tutorial!
*Stephen Olejnik, Jamie Mills, and Harvey Keselman, "Using Wherry's Adjusted R2 and
Mallow's Cp for Model Selection from All Possible Regressions," The Journal of Experimental
Education, 2000, 68(4), 365-380.
