
MULTIPLE REGRESSION PART 2

Topics Outline
Running Multiple Regression and Interpreting the Results
Validation of the Fit
Running Multiple Regression and Interpreting the Results
Example 1
Overhead Costs at Bendrix
The Bendrix Company manufactures various types of parts for automobiles. The manager of the
factory wants to get a better understanding of overhead costs and has tracked total overhead costs
for the past 36 months. To help explain these, he has also collected data on two variables that are
related to the amount of work done at the factory (see Overhead_Costs.xlsx):
MachHrs: number of machine hours used during the month
ProdRuns: the number of separate production runs during the month
Our earlier analysis of the two candidates for explanatory variables (see Overhead_Costs_Finished.xlsx)
indicated that both variables are related to Overhead. Therefore, it makes sense to try including
both in the regression equation. With any luck, the linear fit should improve.
(a) Use StatTools and Overhead_Costs.xlsx to run a regression for Overhead costs as a linear
function of MachHrs (machine hours) and ProdRuns (production runs).
(Use Overhead_Costs_MultipleRegression_Finished.xlsx as a reference.)
To obtain the regression output, select Regression from the StatTools Regression and
Classification dropdown list and fill out the resulting dialog box as shown below.


The main regression output is:


Summary         Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                0.9308        0.8664      0.8583               4108.993

ANOVA Table     Degrees of    Sum of         Mean of
                Freedom       Squares        Squares        F-Ratio     p-Value
Explained       2             3614020661     1807010330     107.0261    < 0.0001
Unexplained     33            557166199.1    16883824.22

Regression      Coefficient   Standard    t-Value    p-Value     Confidence Interval 95%
Table                         Error                              Lower        Upper
Constant        3996.678      6603.651    0.6052     0.5492      -9438.551    17431.907
MachHrs         43.536        3.589       12.1289    < 0.0001    36.234       50.839
ProdRuns        883.618       82.251      10.7429    < 0.0001    716.276      1050.960

(b) What is the equation of the regression model?

y = α + β₁x₁ + β₂x₂ + ε
(c) What is the equation of the true regression surface?

μy = α + β₁x₁ + β₂x₂
Geometrically it is an equation of a plane in three-dimensional space. We refer to this plane
as the plane of means.
(d) What is the equation of the fitted surface?

ŷ = a + b₁x₁ + b₂x₂
From the regression output, the equation of the fitted surface is
Predicted Overhead = 3997 + 43.54MachHrs + 883.62ProdRuns
Geometrically it is an equation of a plane in three-dimensional space. We refer to this plane as
the least squares plane.
(e) Interpret the regression coefficients.
The Bendrix manager can interpret the intercept, a = $3,997, as the fixed component of overhead;
that is, the overhead cost when MachHrs = 0 and ProdRuns = 0.
The slope terms involving MachHrs and ProdRuns are the variable components of overhead.
If the number of production runs is held constant, the overhead cost is expected to increase
by b1 = $43.54 for each extra machine hour.
If the number of machine hours is held constant, the overhead cost is expected to increase by
b2 = $883.62 for each extra production run.
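As a quick check of these interpretations, here is a short Python sketch (not part of the StatTools output; the inputs of 1500 machine hours and 40 production runs are hypothetical) that evaluates the fitted equation and confirms the marginal effect of one extra machine hour:

```python
# Sketch: evaluating the fitted least squares plane from part (d)
# Predicted Overhead = 3996.678 + 43.536*MachHrs + 883.618*ProdRuns

def predicted_overhead(mach_hrs, prod_runs):
    """Point prediction of Overhead from the fitted regression equation."""
    return 3996.678 + 43.536 * mach_hrs + 883.618 * prod_runs

# Holding ProdRuns fixed, one extra machine hour raises the prediction by b1
base = predicted_overhead(1500, 40)
one_more_hour = predicted_overhead(1501, 40)
print(one_more_hour - base)  # about 43.536
```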

(f) From our previous analysis, the regression equations with single explanatory variables are
Predicted Overhead = 48621 + 34.7MachHrs
and
Predicted Overhead = 75606 + 655.1ProdRuns
Compare these equations with the multiple regression equation.
The coefficient of MachHrs has increased from 34.7 to 43.5 and the coefficient of
ProdRuns has increased from 655.1 to 883.6. Also, the intercept is now lower than either
intercept in the single-variable equations.
The reasoning is that when MachHrs is the only variable in the equation, ProdRuns is not
being held constant (it is being ignored), so in effect the coefficient 34.7 of MachHrs
indicates the combined effect of MachHrs and the omitted ProdRuns on Overhead. But when both
variables are included, the coefficient 43.5 of MachHrs indicates the effect of MachHrs only,
holding ProdRuns constant. Because the coefficients of MachHrs in the two equations have
different meanings, it is not surprising that they result in different numerical estimates.
Note: In general, it is difficult to guess the changes that will occur when more explanatory
variables are included in the equation, but it is likely that changes will occur.
The estimated coefficient of any explanatory variable typically depends on which other
explanatory variables are included in the equation.
(g) Interpret the standard error of estimate se.
Recall that se is essentially the standard deviation of residuals and estimates the true
unknown standard deviation of the error term in the model.
The standard error of estimate se is a measure of the typical prediction error when the multiple
regression equation is used to predict the response variable.
In this example, se = $4,109. Assuming that the errors are approximately normally distributed
and using the 68-95-99.7 empirical rule, we can conclude that about two-thirds of the
predictions should be within one standard error, or $4,109, of the actual overhead cost.
(h) Compare the standard error of estimate with the standard errors from the single-variable
equations for Overhead.
Comparing the standard error se = $4,109 with the standard errors from the single-variable
equations for Overhead, $8,585 and $9,457, shows that the multiple regression equation
provides predictions that are more than twice as accurate as the single-variable
equations, a big improvement.
(i) Interpret the coefficient of determination r 2 .
The r 2 value is the percentage of variation of the response variable explained by the combined
set of explanatory variables.
MachHrs and ProdRuns combine to explain 86.6% of the variation in Overhead.
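The same 86.6% figure can be recovered from the ANOVA table, since r² is the ratio of the explained sum of squares to the total sum of squares. A minimal sketch:

```python
# Sketch: r-squared from the ANOVA sums of squares in the regression output
ss_explained = 3614020661
ss_unexplained = 557166199.1
r_squared = ss_explained / (ss_explained + ss_unexplained)
print(round(r_squared, 4))  # 0.8664
```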

(j) Compare r 2 coefficients for the multiple and single regression outputs.
The r 2 of 86.6% for the multiple regression is a big improvement over the single-variable
equations that were able to explain only 39.9% and 27.1% of the variation in Overhead.
Remarkably, the combination of the two explanatory variables explains a larger percentage
than the sum of their individual effects.
Note: This is an admittedly unusual case where the total is greater than the sum of its parts.
This is not common, but this example shows that it is possible. Evidently, these two explanatory
variables "line up" just right to predict overhead quite well.
(k) What is the correlation r?
The square root of r² (the Multiple R in the regression output) is the correlation between the
fitted values ŷ and the observed values y of the response variable. For the Bendrix data the
correlation between them is r = √0.866 = 0.931, quite high.
(l) Show a graph illustrating the correlation r.
A graphical indication of the high correlation can be seen in the plot of fitted values ŷ
versus observed values y of Overhead:
[Scatterplot of Fit vs Overhead: fitted Overhead (vertical axis) versus observed Overhead
(horizontal axis), both ranging from 60,000 to 120,000]

(m) Is collinearity a concern for this model?

[Scatterplot of MachHrs vs ProdRuns: MachHrs (1000 to 1800) versus ProdRuns (10 to 60)]

Correlation Table    MachHrs    ProdRuns    Overhead
MachHrs              1.000
ProdRuns             -0.229     1.000
Overhead             0.632      0.521       1.000

The scatterplot of MachHrs and ProdRuns does not suggest a high correlation between these
two variables. In fact, the correlation of -0.229 is pretty low: its absolute value, 0.229, is
well below 0.7; it is also lower than 0.632, which is the largest of the correlations between
Overhead and either MachHrs or ProdRuns.
The variance inflation factor

VIF = 1 / (1 - (-0.229)²) = 1.055

is small as well (smaller than 10 and also smaller than the more conservative cutoff of 5).
Hence, collinearity is not present in this model.
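The VIF arithmetic is easy to verify; the sketch below simply restates the two-variable formula in Python:

```python
# Sketch: variance inflation factor for two explanatory variables,
# computed from their pairwise correlation
r = -0.229  # correlation between MachHrs and ProdRuns
vif = 1 / (1 - r ** 2)
print(round(vif, 3))  # 1.055
```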
(n) What inferences can be made about the regression coefficients b1 and b2 ?
Recall that each coefficient b represents a point estimate of the true, but unobservable,
population parameter β, based on this particular sample. The corresponding SE indicates the
accuracy of this point estimate.
For example, the point estimate of β₁, the effect on Overhead of a one-unit increase in
MachHrs (when ProdRuns is held constant), is 43.536 and the standard error is 3.589.
From the regression output, you can be 95% confident that the true β₁ lies in the interval from
36.234 to 50.839.
The 95% confidence interval for β₂ is from 716.276 to 1050.960.
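These endpoints can be reproduced by hand as b ± t*·SE, where t* is the 0.975 quantile of the t distribution with n - 3 = 33 degrees of freedom (about 2.0345, taken here from a t table rather than computed). The result differs from the printed interval only in the last decimal because the output rounds the coefficient:

```python
# Sketch: 95% confidence interval for beta_1 as b1 +/- t_crit * SE
b1, se1 = 43.536, 3.589
t_crit = 2.0345  # approximate 0.975 t quantile with 33 degrees of freedom
lower = b1 - t_crit * se1
upper = b1 + t_crit * se1
print(round(lower, 2), round(upper, 2))  # close to 36.234 and 50.839 above
```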
(o) Perform tests for the significance of the regression coefficients.
Recall that the value of the test statistic for the individual regression coefficient is the ratio of
the estimated coefficient to its standard error:
t = b / SE
Therefore, it indicates how many standard errors the regression coefficient is from zero.
For example, the t-value for MachHrs is about 12.13, so the regression coefficient of
MachHrs, 43.536, is more than 12 of its standard errors to the right of zero. Similarly,
the coefficient of ProdRuns is more than 10 of its standard errors to the right of zero.
To decide whether a particular explanatory variable belongs in the regression equation,
the following test is performed:
H₀: β = 0
Hₐ: β ≠ 0
For MachHrs, the value of the test statistic is 12.13, and the associated P-value is less than
0.0001. This means that there is virtually no probability beyond the observed t-value.
In words, you are still not exactly sure of the true slope coefficient β₁ of MachHrs, but you
are virtually sure it is not zero. The same can be said for the true slope β₂ of ProdRuns.
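The ratios themselves are easy to recompute from the printed output values (the tiny differences from the printed t-values arise because the output rounds the coefficients):

```python
# Sketch: t statistics as coefficient / standard error
t_mach = 43.536 / 3.589     # printed output: 12.1289
t_prod = 883.618 / 82.251   # printed output: 10.7429
print(round(t_mach, 2), round(t_prod, 2))
```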

(p) How good is the overall fit?


From the regression output, the value of the F statistic is 107.0261 and the corresponding
P-value is practically zero. This means that the regression equation provides a good fit.
Note: Even if the F test gives an extremely significant result, there is no guarantee that the
regression equation provides a good enough fit for practical uses. For example, the Bendrix
manager is trying to understand the causes of variation in overhead costs. This manager already
knows that machine hours and production runs are related positively to overhead costs.
What he really wants is a set of explanatory variables that yields a high r 2 and a low se.
The low P-value in the ANOVA table does not guarantee these. All it guarantees is that
MachHrs and ProdRuns are of some help in explaining variations in Overhead.
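The F-ratio itself is just the ratio of the two mean squares in the ANOVA table, which a one-line sketch confirms:

```python
# Sketch: F-ratio as Mean Square Explained / Mean Square Unexplained
ms_explained = 1807010330
ms_unexplained = 16883824.22
f_ratio = ms_explained / ms_unexplained
print(round(f_ratio, 4))  # 107.0261
```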
(q) Are the regression assumptions satisfied?
The plots of the residuals versus ŷ, x₁, and x₂ show random scatter without any patterns,
clumping, or excessive increase/decrease in their variation around the horizontal zero line.
Therefore, the linearity, independence, and equal-spread regression assumptions are satisfied.
[Residual plots: Residuals versus Predicted Overhead, Residuals versus MachHrs, and
Residuals versus ProdRuns]

Note: These data were collected over time. Therefore, to make a sound judgment about the
independence of the residuals, we should also inspect the time series plot of the residuals:


[Time Series of Residuals: residuals plotted against Month]

This plot shows signs of so-called lag 1 autocorrelation. To assess the severity of this
condition, we need to perform a test which will be studied later when we explore topics in time
series analysis. For now, consider that the independence assumption is not violated if the plots of
residuals versus ŷ, x₁, and x₂ look random and do not show systematic patterns.
The histogram of the residuals is single-peaked with no apparent outliers. There is a left skew
(skewness = -0.64) which is mild enough not to undermine the least squares results.
This is confirmed also by the inspection of the normal probability plot. Except for the mild
left skewness (as indicated by the slight upward and then leveled off curving), the points are
pretty closely located around a 45 o line. Thus, the normality assumption is not seriously violated.
[Histogram of Residuals and Q-Q Normal Plot of Residuals (Standardized Q-Value versus
Z-Value)]

(r) Suppose Bendrix expects the values of MachHrs and ProdRuns for the next three months to be
1430, 1560, 1520, and 35, 45, 40, respectively. What are the point predictions and 95%
prediction intervals for Overhead for these three months?
First set up a second data set with the following column headings:


Enter the values for Month, MachHrs, and ProdRuns. The last three columns can be blank or have
values, but when regression is run with the prediction options, they will be filled in or overwritten.
Define the entire region A1:F4 as a new StatTools data set named Data for Prediction.
Then use StatTools as shown below:

The Overhead values in column D are the point predictions for the next three months,
and the LowerLimit95 and UpperLimit95 values in columns E and F indicate the 95%
prediction intervals.
You can see from the wide prediction intervals how much uncertainty remains. The reason is
the relatively large standard error of estimate, se = 4108.993. Contrary to what you might
expect, this is not a sample size problem. That is, a larger sample size would probably not
produce a smaller value of se. The whole problem is that MachHrs and ProdRuns are not
perfectly correlated with Overhead. The only way to decrease se and get more accurate
predictions is to find other explanatory variables that are more closely related to Overhead.
Note: StatTools provides prediction intervals for individual values, but it does not provide
confidence intervals for the mean of y, given a set of xs.
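The point predictions in column D can also be checked against the fitted equation. The sketch below uses the rounded coefficients, so its results will differ slightly from StatTools, which carries full precision:

```python
# Sketch: point predictions for the next three months from the fitted equation
def predicted_overhead(mach_hrs, prod_runs):
    return 3996.678 + 43.536 * mach_hrs + 883.618 * prod_runs

# (MachHrs, ProdRuns) for the next three months, from part (r)
months = [(1430, 35), (1560, 45), (1520, 40)]
for mach, runs in months:
    print(round(predicted_overhead(mach, runs), 2))
```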


Validation of the Fit


Now suppose that this data set is from one of Bendrix's two plants. The company would like
to predict overhead costs for the second plant by using data on machine hours and production
runs at the first plant (see Overhead_Costs_Validation.xlsx; the regression output for the first
plant is on the Regression tab).
How well does the regression from the first plant fit data from the other plant?
Use the following steps to perform this validation:
1. Copy the results from the original regression to the ranges B5:D5 and B9:B10 in the
Validation Data spreadsheet.
2. Calculate the fitted values.
The fitted values are now the predicted values of overhead for the second plant, based on the
original regression equation. Find these by substituting the new values of MachHrs and
ProdRuns into the original equation. Specifically, enter the formula
=$B$5+SUMPRODUCT($C$5:$D$5,B13:C13)
in cell E13 and copy it down.
(You can also use the simpler formula =$B$5+$C$5*B13+$D$5*C13)
3. Calculate the residuals (prediction errors for the second plant) by entering the formula
=D13-E13
in cell F13 and copying it down.
4. Calculate the coefficient of determination by entering the formula
=CORREL(E13:E48,D13:D48)^2
in cell C9.
5. Calculate the standard error of estimate.
The se value is essentially the square root of the average squared residual, but it uses the
denominator n - 3 (when there are two explanatory variables) rather than n - 1. Therefore, enter the formula
=SQRT(SUMSQ(F13:F48)/33)
in cell C10.
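Steps 4 and 5 can also be sketched outside Excel. The function below takes actual and fitted values and returns the validation r² (the squared correlation) and se with denominator n - k - 1; the sample numbers are made up for illustration, not the Bendrix data:

```python
import math

def validation_stats(actual, fitted, k=2):
    """Validation r-squared (squared correlation) and standard error of
    estimate with denominator n - k - 1, for k explanatory variables."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_f = sum(fitted) / n
    cov = sum((a - mean_a) * (f - mean_f) for a, f in zip(actual, fitted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_f = sum((f - mean_f) ** 2 for f in fitted)
    r2 = cov ** 2 / (var_a * var_f)
    se = math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, fitted)) / (n - k - 1))
    return r2, se

# Hypothetical actual vs. fitted values (6 observations, illustration only)
actual = [100.0, 110.0, 95.0, 120.0, 105.0, 115.0]
fitted = [102.0, 108.0, 97.0, 118.0, 104.0, 117.0]
r2, se = validation_stats(actual, fitted)
print(round(r2, 3), round(se, 3))
```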
The results here are typical: validation results are usually not as good as the original results.
The value of r2 has decreased from 86.6% to 77.3%, and the value of se has increased from
$4,109 to $5,257. Nevertheless, Bendrix might conclude that the original regression equation
is adequate for making future predictions at either plant.

