
Green // Statistics

The Mechanics of Multiple Regression

One of the most important concepts in statistics is the idea of “controlling” for a variable.
This lecture is designed to give you a feel for what “controls” are and how they are
implemented in the context of multiple regression.

Let’s begin by considering an example. In the weeks leading up to the November 2003
election, a group called ACORN sought to bolster support for a ballot proposition in
Kansas City. The measure authorized a rise in sales tax in order to fend off cuts to public
transportation. ACORN canvassed voters in a predominantly black section of Kansas
City, targeting registered voters who had voted in at least one of the five most recent
elections. The campaign consisted primarily of door-to-door canvassing conducted
during the final two weeks before Election Day.

I was asked to evaluate the effectiveness of this campaign. ACORN identified 28 precincts of potential interest to their campaign; I randomly assigned 14 to the treatment group and 14 to the control group. After the election, voter turnout records were gathered, and voting rates among those living in the treatment and control precincts were calculated. The data may be found at

Kansas City Dataset

The data may be modeled in a few different ways. The simplest model describes the
voter turnout rate (Y) as a linear function of the experimental treatment (X) plus a
disturbance term:

Y = a + bX + U.

Here is an “individual value plot” of the data. Note that all of the X values are either 0
(control) or 1 (treatment), but the plot scatters them a bit in order to make the individual
values easier to see.
[Figure: Individual value plot of VOTE03 vs TREATMEN (0.00 = control, 1.00 = treatment).]

Using regression, we obtain the following results:

Regression Analysis: VOTE03 versus TREATMEN

The regression equation is
VOTE03 = 0.289 + 0.0355 TREATMEN

Predictor      Coef  SE Coef      T      P
Constant    0.28884  0.01778  16.24  0.000
TREATMEN    0.03554  0.02515   1.41  0.169

S = 0.0665291   R-Sq = 7.1%   R-Sq(adj) = 3.6%

The critical numbers here are .036, which suggests that the expected rate of turnout increases by 3.6 percentage points as we move from control to treatment, and .025, which conveys the uncertainty surrounding this experimental effect. The p-value of .169 tells us that there is a 16.9% chance of observing a treatment effect at least this large in absolute value even if the true experimental effect were zero. Ordinarily, we would use a 1-tailed test here, because one would suppose that canvassing would increase turnout; in that case, the one-tailed p-value is approximately .09. For what it's worth, even the one-tailed result falls short of the conventional .05 threshold for statistical significance.

(Note that it is just a coincidence that the estimated treatment effect of .036 coincides
with the adjusted R-squared of 3.6%. Why, speaking of R-squared, is it not of central
concern as we interpret these regression statistics?)
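For those working outside Minitab, the same bivariate regression can be run in a few lines of Python. This is only a sketch: the CSV file name is hypothetical, and the column names VOTE03 and TREATMEN are assumed to match the variables in the output above.

# A minimal sketch, assuming one row per precinct in a hypothetical CSV file
# with columns VOTE03 (2003 turnout rate) and TREATMEN (0 = control, 1 = treatment).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("kansas_city_precincts.csv")   # hypothetical file name

fit = smf.ols("VOTE03 ~ TREATMEN", data=df).fit()
print(fit.summary())   # the TREATMEN coefficient and its standard error should match the output above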
How can we make this analysis more precise? One answer is to gather more data.
Another is to control for other predictors of voter turnout that are not consequences of the
treatment. (We’ll see why we don’t want to control for consequences of the treatment in
next week’s lectures.) Fortunately, we happen to have just such a predictor at hand. The
Kansas City voter file contains extensive information about the past voter turnout of
every voter. I calculated the average voting rate over several elections from 1998
through the summer of 2003. Since these votes occurred before the experiment, we need
not be concerned that they represent consequences of the treatment. Let’s call the past
vote average Z and control for it in our revised regression model:

Y = a + bX + cZ + U.

In terms of sheer Minitab mechanics, this model is easy to estimate.

Regression Analysis: VOTE03 versus TREATMEN, VOTEAVG

The regression equation is
VOTE03 = - 0.310 + 0.0452 TREATMEN + 1.17 VOTEAVG

Predictor      Coef  SE Coef      T      P
Constant   -0.31046  0.08199  -3.79  0.001
TREATMEN    0.04518  0.01446   3.12  0.004
VOTEAVG      1.1723   0.1591   7.37  0.000

S = 0.0381032   R-Sq = 70.7%   R-Sq(adj) = 68.4%
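The earlier Python sketch extends to the two-predictor model in the obvious way; as before, the file and column names are assumptions rather than part of the original analysis.

# A minimal sketch of the covariate-adjusted model, using the same hypothetical file.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("kansas_city_precincts.csv")   # hypothetical file name

fit2 = smf.ols("VOTE03 ~ TREATMEN + VOTEAVG", data=df).fit()
print(fit2.params)   # expect roughly 0.045 for TREATMEN and 1.17 for VOTEAVG
print(fit2.bse)      # expect the TREATMEN standard error to fall to roughly 0.014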

Take a close look at these regression results, and compare them to the results presented above. The estimated treatment effect is somewhat larger than before. This pattern is specific to this example, not a general feature of multiple regression. You should not expect coefficients to grow when control variables are added to a regression equation, especially when analyzing experimental data (Why?). In this case, the estimated treatment effect grows from .036 to .045, but it could have gone the other way. It just happens to be the case that the randomly assigned treatment was more likely to go to precincts with below-average VOTEAVG scores. In the plot below, we see that the correlation between TREATMEN and VOTEAVG is slightly negative. Thus, the first regression understates the positive influence of the treatment, because it ignores the fact that the treated precincts had slightly lower voting propensities before the experiment got underway.
[Figure: Individual value plot of VOTEAVG vs TREATMEN (0.00 = control, 1.00 = treatment).]

The next thing to notice about the multiple regression output is that the standard error of
the estimated treatment effect has dropped from .025 to .014. That may not sound like
much, but remember that a drop of this magnitude is tantamount to a dramatic expansion
in sample size. In order to reduce the standard error by a factor of 1.78, one would have
to increase the sample size by a factor of 3.19.
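To see where the 3.19 comes from, recall that the standard error of an estimated effect shrinks roughly in proportion to the square root of the sample size, so matching this drop by collecting more data alone would require

$$
\frac{.025}{.014} \approx 1.78,
\qquad
\left(\frac{.025}{.014}\right)^{2} \approx 3.19 .
$$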

Why the decline in the standard error? A big part of the answer has to do with "s", the estimated standard deviation of the disturbances. The first regression estimated s to be .067; the second regression brought the standard deviation of the unobservables down to .038.

Was there a cost to adding a control variable, or "covariate"? The answer is yes. Any time we add a covariate, we are penalized in two ways. Here is the formula for the estimated standard error of the slope b when Y is regressed on X alone, which you should compare to the formula for the standard error of b when Y is regressed on X controlling for Z.

$$
\sqrt{\frac{\frac{1}{n-k}\sum e_i^{2}}{(n-1)\,\mathrm{Var}(X)}}
\;=\; \frac{s}{\sqrt{(n-1)\,\mathrm{Var}(X)}}
\;=\; \frac{.0665291}{\sqrt{(27)(.2593)}}
\;=\; .02514
$$

= estimated standard error of b when Y is regressed on X.

$$
\sqrt{\frac{\frac{1}{n-k}\sum e_i^{2}}{(n-1)\,\mathrm{Var}(X)\,(1-R_{XZ}^{2})}}
\;=\; \frac{s}{\sqrt{(n-1)\,\mathrm{Var}(X)\,(1-R_{XZ}^{2})}}
\;=\; \frac{.0381032}{\sqrt{(27)(.2593)(1-.0081)}}
\;=\; .01446
$$

= estimated standard error of b when Y is regressed on X, controlling for Z.

What are the two penalties? First, the number of degrees of freedom (n-k) decreases as
we add more variables. In this formula, n is the number of observations and k is the
number of parameters that are being estimated. All other things being equal, a decline in
the number of degrees of freedom tends to make the numerator bigger, which in turn
makes the standard error bigger. Second, correlation between the independent variables
makes the denominator smaller, which in turn makes the standard error bigger. Notice
that when this correlation is zero, the two standard errors have the same denominators.
When the correlation is 1, the standard error in the second formula becomes infinite.
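As a quick numerical check, a few lines of Python reproduce both standard errors by plugging the quantities reported above (s from each regression, Var(X) for the treatment dummy, and the R-squared from regressing TREATMEN on VOTEAVG) into the two formulas. This is only a sketch of that arithmetic, not part of the original analysis.

import math

n = 28            # precincts
var_x = 0.2593    # sample variance of the 0/1 treatment dummy, as used in the formulas above
r2_xz = 0.0081    # R-squared from regressing TREATMEN on VOTEAVG

# Bivariate regression: s = 0.0665291, no collinearity penalty.
se_simple = 0.0665291 / math.sqrt((n - 1) * var_x)
print(round(se_simple, 5))    # 0.02514

# Multiple regression: s = 0.0381032, with the (1 - R^2_XZ) term in the denominator.
se_multiple = 0.0381032 / math.sqrt((n - 1) * var_x * (1 - r2_xz))
print(round(se_multiple, 5))  # 0.01446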

Let’s try to get a feel for what multiple regression is actually doing. Here is the
scatterplot of VOTE03 with VOTEAVG, with different markers for the treatment and
control groups. Imagine that we were to pass two parallel regression lines through these
data, one for the red points and another for the black points. These parallel lines would
have a slope of 1.17. The vertical distance between the two lines would be .045; this
vertical distance reveals the apparent effect of the experimental treatment. Because the treatment variable in this example is a dummy variable, it can be thought of as a variable that generates a shift in the intercept.
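Written out with the estimates from the multiple regression above, the two parallel lines are

$$
\widehat{VOTE03} = -0.310 + 1.17\,VOTEAVG \ \text{(control)},
\qquad
\widehat{VOTE03} = (-0.310 + 0.045) + 1.17\,VOTEAVG \ \text{(treatment)}.
$$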
[Figure: Scatterplot of VOTE03 vs VOTEAVG, with separate markers for TREATMEN = 0.00 and TREATMEN = 1.00.]

The predicted values from the multiple regression are depicted in the following graph.
Notice that the points in each experimental group are arrayed along (invisible) parallel
regression lines.

[Figure: Scatterplot of FITS1 (fitted values from the multiple regression) vs VOTEAVG, with separate markers for TREATMEN = 0.00 and TREATMEN = 1.00.]
How does multiple regression know how to space the two parallel lines? How does the
computer choose which coefficients to attach to each independent variable? As in the
case of regression of Y on a single variable, least squares regression selects the
coefficients that minimize the sum of squared residuals. Fortunately, this algorithm is
easy to implement algebraically.

In order to see the mechanics at work, break multiple regression down into a series of bivariate regressions. In order to estimate the experimental effect (i.e., the coefficient on the TREATMEN variable), we perform the following operations:

1. Regress TREATMEN on VOTEAVG.
2. Calculate the residuals from this regression. Residuals are computed as the actual values of TREATMEN minus the predicted values of TREATMEN.
3. Regress VOTE03 on the residuals calculated in Step 2 to obtain the multiple regression estimate.

Let’s give it a try.

Step 1:

Regression Analysis: TREATMEN versus VOTEAVG

The regression equation is
TREATMEN = 1.00 - 1.00 VOTEAVG

Predictor      Coef  SE Coef      T      P
Constant      1.005    1.094   0.92  0.367
VOTEAVG      -0.996    2.149  -0.46  0.647

S = 0.516746   R-Sq = 0.8%   R-Sq(adj) = 0.0%

Step 2:

When performing this regression, ask Minitab to save residuals under the “storage”
option. (Or you can do this manually by plugging in the values from the regression
equation.) In this case, Minitab stores the residuals as RESI1.

Step 3.

Regression Analysis: VOTE03 versus RESI1

The regression equation is
VOTE03 = 0.307 + 0.0452 RESI1

Predictor      Coef  SE Coef      T      P
Constant    0.30661  0.01228  24.97  0.000
RESI1       0.04518  0.02466   1.83  0.078

S = 0.0649703   R-Sq = 11.4%   R-Sq(adj) = 8.0%

How does our pseudo-regression procedure compare to the real thing?

Regression Analysis: VOTE03 versus VOTEAVG, TREATMEN

The regression equation is
VOTE03 = - 0.310 + 1.17 VOTEAVG + 0.0452 TREATMEN

Predictor      Coef  SE Coef      T      P
Constant   -0.31046  0.08199  -3.79  0.001
VOTEAVG      1.1723   0.1591   7.37  0.000
TREATMEN    0.04518  0.01446   3.12  0.004

S = 0.0381032   R-Sq = 70.7%   R-Sq(adj) = 68.4%

Notice that the slope coefficient for the TREATMEN variable is the same as the slope for the RESI1 variable. Success. (On the other hand, the standard errors and t-ratios from this two-step procedure do not match the multiple regression output, so beware. Eventually, you'll learn how to calculate all of the regression output, but so far you only know how to generate the slopes.)

What is the underlying theory behind multiple regression? The key idea is to isolate the
component of the TREATMEN variable that is uncorrelated with VOTEAVG. This aim
is accomplished by performing a bivariate regression and calculating residuals. Those
residuals represent the component of TREATMEN that is not predicted by VOTEAVG.
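To see the whole three-step procedure in one place, here is a sketch in Python. Because the Kansas City file itself is not reproduced here, the snippet uses simulated precinct-level data as a stand-in; the particular numbers will differ from those above, but the slope on the residualized treatment matches the multiple regression coefficient exactly.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-in for the precinct data: a 0/1 treatment, a past-vote average,
# and a turnout outcome built from both plus noise.
n = 28
treatmen = rng.permutation([0.0] * 14 + [1.0] * 14)
voteavg = rng.normal(0.5, 0.05, n)
vote03 = -0.3 + 0.045 * treatmen + 1.2 * voteavg + rng.normal(0, 0.04, n)

# Step 1: regress TREATMEN on VOTEAVG.
step1 = sm.OLS(treatmen, sm.add_constant(voteavg)).fit()

# Step 2: residuals = actual TREATMEN minus predicted TREATMEN.
resi1 = treatmen - step1.fittedvalues

# Step 3: regress VOTE03 on the residuals; the slope is the multiple regression estimate.
step3 = sm.OLS(vote03, sm.add_constant(resi1)).fit()

# The full multiple regression, for comparison.
full = sm.OLS(vote03, sm.add_constant(np.column_stack([treatmen, voteavg]))).fit()

print(step3.params[1], full.params[1])   # the two slopes agree (up to floating-point rounding)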

Can you now figure out how to calculate the multiple regression slope for VOTEAVG?
