Example of Using SPSS to Generate a Simple Regression Analysis

Given the desire of a retail chain's management team to develop a strategy for
forecasting annual sales, the following data from a random sample of existing stores
have been gathered:

STORE   SQUARE FOOTAGE   ANNUAL SALES ($)

  1         1726.00           3681.00
  2         1642.00           3895.00
  3         2816.00           6653.00
  4         5555.00           9543.00
  5         1292.00           3418.00
  6         2208.00           5563.00
  7         1313.00           3660.00
  8         1102.00           2694.00
  9         3151.00           5468.00
 10         1516.00           2898.00
 11         5161.00          10674.00
 12         4567.00           7585.00
 13         5841.00          11760.00
 14         3008.00           4085.00
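For readers following along outside SPSS, the same sample can be held in two parallel lists. A minimal Python sketch (the variable names are mine, not SPSS's):

```python
# Square footage and annual sales ($) for the 14 sampled stores,
# listed in store order (1-14).
sqft = [1726, 1642, 2816, 5555, 1292, 2208, 1313, 1102,
        3151, 1516, 5161, 4567, 5841, 3008]
sales = [3681, 3895, 6653, 9543, 3418, 5563, 3660, 2694,
         5468, 2898, 10674, 7585, 11760, 4085]

n = len(sqft)                    # sample size: 14 stores
mean_sqft = sum(sqft) / n        # mean square footage
mean_sales = sum(sales) / n      # mean annual sales
```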

We can enter the data into SPSS by typing it directly into the data editor, or by cutting
and pasting:
Next, by clicking on ‘Variable View’, we can apply variable and value labels where
appropriate:

Assuming, for now, that any relationship between the two variables is linear in
nature, we can generate a simple scatterplot (or scatter diagram) for the data. The
appropriate command sequence yields the following (editable) scatterplot:

Regression Analysis for Site Selection


Simple Scatterplot of Data
[Scatterplot: Sales Revenue of Store (y-axis, 0 to 14000) versus Square Footage of Store (x-axis, 0 to 7000)]

We can generate a simple straight-line equation from the output produced when using the
Enter method in regression, which yields:

Variables Entered/Removed(b)

Model   Variables Entered            Variables Removed   Method
1       Square Footage of Store(a)   .                   Enter

a. All requested variables entered.
b. Dependent Variable: Sales Revenue of Store
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .954(a)   .910       .902                936.8500

a. Predictors: (Constant), Square Footage of Store

ANOVA(b)

Model           Sum of Squares   df   Mean Square   F         Sig.
1  Regression   1.06E+08         1    106208119.7   121.009   .000(a)
   Residual     10532255         12   877687.937
   Total        1.17E+08         13

a. Predictors: (Constant), Square Footage of Store
b. Dependent Variable: Sales Revenue of Store

(In the ANOVA table above, the Regression, Residual, and Total sums of squares are SSR, SSE, and SST respectively; in the Coefficients table below, the Constant is b0 and the Square Footage coefficient is b1.)
Coefficients(a)

                             Unstandardized Coefficients   Standardized                       95% Confidence Interval for B
Model                        B          Std. Error         Coefficients (Beta)   t        Sig.   Lower Bound   Upper Bound
1  (Constant)                901.247    513.023                                  1.757    .104   -216.534      2019.027
   Square Footage of Store   1.686      .153               .954                  11.000   .000   1.352         2.020

a. Dependent Variable: Sales Revenue of Store

So then Ŷi = 901.247 + 1.686Xi (noting that no direct interpretation of the Y
intercept at 0 square footage is possible; the intercept instead represents the portion
of the annual sales varying due to factors other than store size)

and where

SST (total sum of squares) = SSR (regression sum of squares) + SSE (error sum of squares)

SST = sum of the squared differences between each observed value of Y and Y-bar

SSR = sum of the squared differences between each predicted value of Y and Y-bar

SSE = sum of the squared differences between each observed value of Y and its predicted value

Coefficient of Determination = SSR/SST = 0.91 (sample)

Standard Error of the Estimate = SYX = SQRT{ SSE / (n - 2) } = 936.85
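These quantities can be reproduced outside SPSS directly from the raw data. A minimal sketch in Python with NumPy (variable names are mine, not SPSS output labels):

```python
import numpy as np

# Raw sample data: square footage (x) and annual sales in $ (y)
x = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313, 1102,
              3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
y = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660, 2694,
              5468, 2898, 10674, 7585, 11760, 4085], dtype=float)
n = len(x)

# Least-squares slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x                     # predicted values
sse = np.sum((y - y_hat) ** 2)          # error sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares

r_sq = ssr / sst                        # coefficient of determination
syx = np.sqrt(sse / (n - 2))            # standard error of the estimate
```

Running this reproduces the SPSS figures (b1 ≈ 1.686, b0 ≈ 901.25, r² ≈ .910, SYX ≈ 936.85) and confirms the identity SST = SSR + SSE.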


Testing the General Assumptions of Regression and Residual Analysis

1. Normality of Error - similar to the t-test and ANOVA, regression is robust to
departures from normality of the errors around the regression line. This assumption is
often tested by simply plotting the Standardized Residuals (each residual divided by
its standard error) on a histogram with a superimposed normal distribution, or on a
normal probability plot. SPSS allows us to perform both functions automatically
(while, incidentally, saving the residual values in the original data file if this option is
toggled):
[Histogram of the Regression Standardized Residuals with superimposed normal curve (Std. Dev = .96, Mean = 0.00, N = 14) and Normal P-P Plot of Regression Standardized Residual; Dependent Variable: Sales Revenue of Store]

Of course, the assessment of normality by visually scanning the data leaves some
statisticians unsettled; so I usually add an appropriate test of normality conducted on the
data:

Variable        n    A-D     p-value
Stand._Resid.   14   0.348   0.503
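A comparable check can be scripted in Python. Note that `scipy.stats.anderson` returns the Anderson-Darling statistic together with critical values rather than a p-value (the p-value above came from a different package), so the conclusion is drawn by comparison. A sketch, refitting the line to obtain the residuals:

```python
import numpy as np
from scipy import stats

x = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313, 1102,
              3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
y = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660, 2694,
              5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

# Refit the least-squares line and form the residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# The A-D statistic is unchanged by standardizing (it standardizes
# internally), so the raw residuals can be tested directly.
result = stats.anderson(resid, dist='norm')
# result.significance_level lists 15%, 10%, 5%, 2.5%, 1%; for these
# data the statistic falls below even the 15% critical value, so
# normality is not rejected.
```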
2. Homoscedasticity - the assumption that the variability of the data around the regression
line is constant for all values of X. In other words, the error must be independent of
X. Generally, this assumption may be tested by plotting the X values against the
raw residuals for Y. In SPSS, this must be done by producing a scatterplot from the
saved variables:

Saving the residuals automatically adds them to the data file; then, simply produce the requisite scatterplot as before:

[Scatterplot: Unstandardized Residual (y-axis, -2000 to 2000) versus Square Footage of Store (x-axis, 1000 to 6000)]

Notice how there is no 'fanning' pattern to the data, implying homoscedasticity.


Other authors, including those who wrote the SPSS routine, choose to plot the X values
against the Studentized Residuals (Standardized Residuals Adjusted for their distance
from the average X value) rather than the Unstandardized (raw) Residuals. SPSS will
generate this plot automatically (select this under the ‘Plots’ panel):

Scatterplot of Studentized Residuals and Square Footage (X)
[Scatterplot: Studentized Residual (y-axis, -2.5 to 1.5) versus Square Footage of Store (x-axis, 1000 to 6000)]

Note the equivalence of results between the two plots. Statistically speaking, the correlation
between the X values and the residuals may be inferred to be 0.00. We can verify this using
the correlation utility in SPSS, which tests the null hypothesis that the Pearson rho for the
population is equal to 0.00:
Correlations

                                               Square Footage   Unstandardized   Studentized
                                               of Store         Residual         Residual
Square Footage of Store   Pearson Correlation  1.000            .000             .015
                          Sig. (2-tailed)      .                1.000            .959
                          N                    14               14               14
Unstandardized Residual   Pearson Correlation  .000             1.000            .999**
                          Sig. (2-tailed)      1.000            .                .000
                          N                    14               14               14
Studentized Residual      Pearson Correlation  .015             .999**           1.000
                          Sig. (2-tailed)      .959             .000             .
                          N                    14               14               14

**. Correlation is significant at the 0.01 level (2-tailed).
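The zero correlation between X and the raw residuals is not a coincidence of this sample; the least-squares normal equations force the residuals to be orthogonal to X. A quick check in Python (NumPy; names are mine):

```python
import numpy as np

x = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313, 1102,
              3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
y = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660, 2694,
              5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# Zero up to floating-point rounding, matching the .000 in the table
r_x_resid = np.corrcoef(x, resid)[0, 1]
```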

It should be noted that the distribution of the data also suggests that an assumption of
linearity is reasonable at this point.

3. Independence of the Errors - assumes that no autocorrelation is present. This is
generally evaluated by plotting the residuals in the order or sequence in which the original
data were collected. This approach, when meaningful, uses the Durbin-Watson
statistic and its associated tables of critical values. SPSS can generate this value
when requested as part of the Model Summary:

Model Summary(b)

                                                                    Change Statistics
Model   R         R Square   Adjusted   Std. Error of   R Square   F Change   df1   df2   Sig. F    Durbin-
                             R Square   the Estimate    Change                            Change    Watson
1       .954(a)   .910       .902       936.8500        .910       121.009    1     12    .000      2.446

a. Predictors: (Constant), Square Footage of Store
b. Dependent Variable: Sales Revenue of Store
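The Durbin-Watson statistic is simple to recompute from the residuals taken in collection order: DW = Σ(e_t − e_{t−1})² / Σ e_t², with values near 2 indicating little autocorrelation. A sketch:

```python
import numpy as np

# Raw sample data in collection (store) order
x = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313, 1102,
              3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
y = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660, 2694,
              5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)   # residuals in collection order

# Squared successive differences over squared residuals
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
# SPSS reports 2.446 for these data
```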

A number of other statistics regarding Residual Analysis are also available in SPSS:
Residuals Statistics(a)

                                    Minimum     Maximum     Mean        Std. Deviation   N
Predicted Value                     2759.3672   10749.96    5826.9286   2858.2959        14
Std. Predicted Value                -1.073      1.722       .000        1.000            14
Standard Error of Predicted Value   250.7362    512.8126    345.3026    81.3831          14
Adjusted Predicted Value            2771.8208   10518.55    5804.4373   2830.7178        14
Residual                            -1888.14    1070.6108   -3.25E-13   900.0964         14
Std. Residual                       -2.015      1.143       .000        .961             14
Stud. Residual                      -2.092      1.288       .011        1.035            14
Deleted Residual                    -2033.82    1442.1392   22.4913     1049.3911        14
Stud. Deleted Residual              -2.512      1.329       -.014       1.111            14
Mahal. Distance                     .003        2.967       .929        .901             14
Cook's Distance                     .001        .355        .086        .103             14
Centered Leverage Value             .000        .228        .071        .069             14

a. Dependent Variable: Sales Revenue of Store
Inferences About the Model and Interval Estimates

We can determine the presence of a significant relationship between X and Y by testing
whether the observed slope differs significantly from 0, the hypothesized slope of the
regression line if no relationship existed. This can be done with a t-test, which divides
the observed slope by the standard error of the slope (supplied by SPSS):

Coefficients(a)

                             Unstandardized Coefficients   Standardized                       95% Confidence Interval for B
Model                        B          Std. Error         Coefficients (Beta)   t        Sig.   Lower Bound   Upper Bound
1  (Constant)                901.247    513.023                                  1.757    .104   -216.534      2019.027
   Square Footage of Store   1.686      .153               .954                  11.000   .000   1.352         2.020

a. Dependent Variable: Sales Revenue of Store

or with an ANOVA model, which provides identical results:

ANOVA(b)

Model           Sum of Squares   df   Mean Square   F         Sig.
1  Regression   1.06E+08         1    106208119.7   121.009   .000(a)
   Residual     10532255         12   877687.937
   Total        1.17E+08         13

a. Predictors: (Constant), Square Footage of Store
b. Dependent Variable: Sales Revenue of Store

noting that t², as expected, equals F, and the p-values are therefore equal. Note that SPSS
also provides the confidence interval associated with the slope.
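The equivalence t² = F can be read from the tables (11.000² = 121.0 ≈ 121.009, the gap being rounding) or confirmed at full precision. A sketch, using the textbook formula SYX/√SSX for the standard error of the slope:

```python
import numpy as np

x = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313, 1102,
              3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
y = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660, 2694,
              5468, 2898, 10674, 7585, 11760, 4085], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)

syx = np.sqrt(sse / (n - 2))                         # std. error of estimate
se_b1 = syx / np.sqrt(np.sum((x - x.mean()) ** 2))   # std. error of slope

t = b1 / se_b1              # t statistic for the slope (vs. 0)
F = ssr / (sse / (n - 2))   # ANOVA F statistic, df = (1, n - 2)
# At full precision t**2 equals F exactly; SPSS shows them rounded
# as 11.000 and 121.009.
```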

Finally, SPSS allows you to calculate and store both Confidence and Prediction Limits
for the observed data. After you generate the scatterplot, left double-click on the chart
to open the chart editor, then click on 'Fit Options' to request the limits:
Scatterplot of Data Including Confidence & Prediction Limits
[Scatterplot: Sales Revenue of Store versus Square Footage of Store with fitted line and 95% confidence and prediction bands; Rsq = 0.9098]

LCL / UCL = lower and upper confidence limits for mean sales at each X; LPL / UPL = lower and upper prediction limits for an individual store (95%, the SPSS default):

LCL          UCL          LPL          UPL


3135.52558 4487.50548 1661.27256 5961.75850
2976.95430 4362.80609 1514.25297 5825.50741
5102.73145 6196.07384 3536.24581 7762.55948
9232.70820 11302.74446 7979.09247 12556.36019
2309.22155 3850.24435 897.92860 5261.53731
4028.95209 5219.51308 2497.98206 6750.48311
2349.56701 3880.71656 935.07592 5295.20765
1942.80866 3575.92595 560.87909 4957.85553
5663.35086 6765.16486 4100.00127 8328.51446
2737.79303 4177.06134 1293.06683 5621.78754
8677.59067 10529.18763 7362.03125 11844.74705
7827.42925 9376.22071 6418.64584 10785.00412
9632.63839 11867.28348 8422.94738 13076.97449
5426.83323 6519.44789 3860.07783 8086.20329
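The stored limits follow the textbook formulas: for store i, the 95% confidence interval for mean sales is Ŷi ± t·SYX·√hi and the prediction interval for an individual store is Ŷi ± t·SYX·√(1 + hi), where hi = 1/n + (Xi − X̄)²/SSX and t is the upper-2.5% t value with n − 2 = 12 df (about 2.1788). A sketch checking store 1 against the first row above:

```python
import numpy as np

x = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313, 1102,
              3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
y = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660, 2694,
              5468, 2898, 10674, 7585, 11760, 4085], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

syx = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # std. error of estimate
ssx = np.sum((x - x.mean()) ** 2)

t_crit = 2.178813  # t(.975, df = 12), from tables

# Leverage and limits for store 1 (x = 1726)
h1 = 1.0 / n + (x[0] - x.mean()) ** 2 / ssx
ci_half = t_crit * syx * np.sqrt(h1)         # confidence half-width
pi_half = t_crit * syx * np.sqrt(1.0 + h1)   # prediction half-width

lcl, ucl = y_hat[0] - ci_half, y_hat[0] + ci_half
lpl, upl = y_hat[0] - pi_half, y_hat[0] + pi_half
# Matches the first row of the saved limits to rounding:
# 3135.5, 4487.5, 1661.3, 5961.8
```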