2007 A. Karpinski
Σ eᵢ = 0

Var(e) = Σ(eᵢ − ē)² / (n − 2) = Σ eᵢ² / (n − 2) = SSE / (n − 2) = MSE

Standardized residual: ZRESIDᵢ = eᵢ / √MSE, where eᵢ = Yᵢ − Ŷᵢ
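The quantities above can be computed directly. The sketch below uses made-up data (nothing from the notes' datasets) and ordinary least squares with an intercept:

```python
import numpy as np

# Illustrative data (not from the notes): simple regression with an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

X = np.column_stack([np.ones_like(x), x])   # design matrix
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimates (b0, b1)
e = y - X @ b                               # residuals e_i = Y_i - Yhat_i
n = len(y)

sse = np.sum(e ** 2)
mse = sse / (n - 2)                         # estimate of Var(e), with n - 2 df
zresid = e / np.sqrt(mse)                   # standardized residuals
```

Because the model includes an intercept, the residuals sum to zero, and dividing by √MSE puts them on a common scale.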
[Table: standardized (ZRESID) and studentized (SRESID) residuals for the example data]
[Figure: scatterplot of the data and plot of studentized residuals]
[Figure: scatterplot of Y vs. x1 and studentized residuals vs. x1]
[Figure: scatterplot of Y vs. x2 and studentized residuals vs. x2]
Coefficients (a. Dependent Variable: Y)

                 B      Std. Error   Beta    t        Sig.   Zero-order  Partial  Part
(Constant)       .232   .013                 18.195   .000
X2               .887   .032         .941    27.458   .000   .941        .941     .941

Model Summary (a. Predictors: (Constant), X2)

R       R Square   Adjusted R Square   Std. Error of the Estimate
.941a   .885       .884                .07585
Yet from the residual plot, we know that this linear model is incorrect
and does not fit the data well
Despite the level of significance and the large percentage of the
variance accounted for, we should not report this erroneous model
Detecting the omission of an important variable by looking at the residuals is
very difficult!
[Figure: examples of GOOD and BAD residual-plot patterns]
[Figure: studentized residuals vs. predicted values for Example #1, a homoscedastic model]
o The band of residuals is constant across the entire length of the observed
predicted values
Example #2: A heteroscedastic model (n = 100)
GRAPH /SCATTERPLOT(BIVAR)=sresid WITH pred.
[Figure: studentized residuals vs. predicted values for the heteroscedastic model]
o In this case, the heteroscedasticity is also apparent from the X-Y
scatterplot. But in general, violations of the variance assumption are
easier to spot in the residual plots
GRAPH /SCATTERPLOT(BIVAR)=y WITH x.
[Figure: scatterplot of Y vs. X]
Model Summary (a. Predictors: (Constant), X)

R       R Square   Adjusted R Square   Std. Error of the Estimate
.303a   .092       .083                4.11810

Coefficients (a. Dependent Variable: Y)

                 B       Std. Error   Beta    t        Sig.   Zero-order  Partial  Part
(Constant)       8.650   .649                 13.336   .000
X                -.222   .070         -.303   -3.153   .002   -.303       -.303    -.303
Descriptives (Studentized Residual)

                       Statistic    Std. Error
Mean                   .0002928     .10048947
5% Trimmed Mean        -.0129241
Median                 -.0584096
Variance               1.010
Std. Deviation         1.004895
Minimum                -2.22678
Maximum                2.57839
Range                  4.80518
Interquartile Range    1.2754269
Skewness               .211         .241
Kurtosis               -.182        .478
[Figure: boxplot of the studentized residuals, N = 100]

Tests of Normality (Shapiro-Wilk)

                        Statistic   df    Sig.
Studentized Residual    .990        100   .648

[Figure: histogram and normal P-P plot of the studentized residuals, Dependent Variable: Y]
o The histogram and P-P plot are as good as they get. There are no
problems with the normality assumption.
[Figure: histogram and normal P-P plot of the standardized residuals, Dependent Variable: Y]
EXAMINE VARIABLES=zresid1
/PLOT BOXPLOT HISTOGRAM NPPLOT
/STATISTICS DESCRIPTIVES.
Descriptives (Standardized Residual)

                       Statistic    Std. Error
Mean                   .0000000     .09949367
5% Trimmed Mean        .1054551
Median                 .2856435
Variance               .990
Std. Deviation         .99493668
Minimum                -4.06265
Maximum                1.13547
Range                  5.19812
Interquartile Range    1.1329148
Skewness               -1.769       .241
Kurtosis               3.747        .478
Tests of Normality (Shapiro-Wilk)

                        Statistic   df    Sig.
Standardized Residual   .835        100   .000
# of observations:   50   100   200   500   1000
o Many people use |eᵢ| > 2 as a check for outliers, but this criterion results
in too many observations being identified as outliers. In large samples,
we expect a large number of observations to have residuals greater than 2
in absolute value
o A more reasonable cut-off for outliers is |eᵢ| > 2.5 or even |eᵢ| > 3
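The expected number of flagged cases follows directly from the standard normal tail probability. The sketch below (no data from the notes; just the normal distribution) computes the expected counts for the sample sizes listed above:

```python
import math

def expected_flagged(n, cutoff):
    """Expected number of |standardized residual| > cutoff if residuals are N(0, 1)."""
    tail = 1 - 0.5 * (1 + math.erf(cutoff / math.sqrt(2)))  # P(Z > cutoff)
    return n * 2 * tail                                      # two-tailed

for n in (50, 100, 200, 500, 1000):
    print(n, round(expected_flagged(n, 2.0), 1), round(expected_flagged(n, 3.0), 2))
```

With a cutoff of 2, roughly 4.6% of perfectly well-behaved cases get flagged, which is why 2.5 or 3 is the more reasonable criterion.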
[Figure: scatterplot with four extreme observations labeled #1 through #4]
o #1 is a Y outlier
o #2 is an X outlier
o #3 and #4 are outliers for both X and Y
When we examine extreme observations, we want to know:
o Is it an outlier? (i.e., Does it differ from the rest of the observed data?)
o Is it an influential observation?
(i.e., Does it have an impact on the regression equation?)
Clearly, each of the values highlighted on the graph is an outlier, but how
will each influence estimation of the regression line?
o Outlier #1
Influence on the intercept:
Influence on the slope:
o Outlier #2
Influence on the intercept:
Influence on the slope:
o Outlier #3
Influence on the intercept:
Influence on the slope:
o Outlier #4
Influence on the intercept:
Influence on the slope:
Not all outliers are equally influential. It is not enough to identify outliers;
we must also consider the influence each may have (particularly on the
estimation of the slope)
Methods of identifying outliers and influential points:
o Examination of the studentized residuals
o A scatterplot of studentized residuals with X
o Examination of the studentized deletion residuals
o Examination of leverage values
o Examination of Cook's distance (Cook's D)

Studentized deletion residual: tᵢ = dᵢ / s(dᵢ), where dᵢ is the residual for
observation i from a model fit with observation i deleted
Leverage values
o It can be shown (proof omitted) that the predicted value for the ith
observation can be written as a linear combination of the observed Y
values
Ŷᵢ = hᵢ₁Y₁ + hᵢ₂Y₂ + ... + hᵢᵢYᵢ + ... + hᵢₙYₙ

The average leverage value is p/n, where p is the number of regression
parameters. A common rule of thumb flags observations with hᵢ > 2p/n as
high-leverage points.
Dᵢ = (eᵢ² / (p · MSE)) · (hᵢ / (1 − hᵢ)²)
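Both leverage and Cook's D can be computed from the hat matrix H = X(X′X)⁻¹X′. A sketch with made-up data (nothing here is from the notes' dataset; the last point is a deliberate X outlier):

```python
import numpy as np

# Made-up data; the last point is a deliberate X outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8, 19.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape                                   # p = number of parameters (here 2)

H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
h = np.diag(H)                                   # leverage values; they sum to p
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b                                    # residuals
mse = np.sum(e ** 2) / (n - p)

cooks_d = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)   # Cook's D for each case
```

The X outlier receives by far the largest leverage, and because the hᵢ sum to p, their average is p/n, which is what gives the 2p/n rule of thumb its baseline.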
[Figure: baseline scatterplot with extreme observations #1 through #4 highlighted]
R² = .767        rXY = .876
o Examining residuals
Casewise Diagnostics (a. Dependent Variable: Y)

Case   Stud. Residual     Y    Predicted Value   Residual
  1         .277         .30        .2870          .0130
  2        1.805         .45        .3676          .0824
  3         .786         .39        .3539          .0361
  4       -1.518         .33        .3978         -.0678
  5        1.170         .40        .3461          .0539
  6        -.977         .33        .3744         -.0444
  7         .131         .33        .3239          .0061
  8         .414         .42        .4015          .0185
  9         .805         .44        .4042          .0358
 10         .712         .30        .2668          .0332
 11        -.905         .23        .2723         -.0423
 12        -.956         .31        .3539         -.0439
 13       -1.116         .25        .3021         -.0521
 14        1.491         .39        .3207          .0693
 15         .196         .22        .2110          .0090
 16       -1.155         .14        .1927         -.0527
 17         .155         .17        .1631          .0069
 18       -1.226         .15        .2063         -.0563
 19        1.272         .20        .1440          .0560
 20        -.035         .19        .1916         -.0016
 21         .235         .16        .1496          .0104
 22         .866         .26        .2200          .0400
 23         .547         .21        .1851          .0249
 24       -1.427         .17        .2363         -.0663
 25       -1.465         .26        .3280         -.0680

[Figure: studentized residuals plotted against X]
REGRESSION
/DEPENDENT y
/METHOD=ENTER x
/SAVE COOK (cook) LEVER (level) SDRESID (sdresid).
List var = ID cook level sdresid.
ID      COOK     LEVEL    SDRESID
 1     .00162   .00029     .27175
 2     .15078   .04468    1.90591
 3     .02389   .03175     .77952
 4     .15820   .08079   -1.56462
 5     .04787   .02541    1.17947
 6     .04820   .05181    -.97548
 7     .00046   .01122     .12792
 8     .01235   .08593     .40647
 9     .04834   .08969     .79907
10     .01084   .00102     .70417
11     .01722   .00035    -.90124
12     .03537   .03179    -.95448
13     .02788   .00283   -1.12251
14     .05806   .00963    1.53438
15     .00139   .02770     .19142
16     .06141   .04435   -1.16348
17     .00162   .07952     .15130
18     .05796   .03156   -1.24063
19     .14004   .10758    1.29016
20     .00006   .04544    -.03471
21     .00443   .09888     .22963
22     .02431   .02094     .86075
23     .01523   .05234     .53871
24     .05487   .01111   -1.46223
25     .06055   .01339   -1.50504
o Critical values
Cook's D:
Leverage:
Studentized deletion residuals:
R² = .654        rXY = .809        (Slope is unchanged)
o Examining residuals
Casewise Diagnostics (a. Dependent Variable: Y)

Case   Stud. Residual     Y    Predicted Value   Residual
  1         .087         .30        .2947          .0053
  2        1.268         .45        .3753          .0747
  3         .479         .39        .3616          .0284
  4       -1.309         .33        .4055         -.0755
  5         .777         .40        .3538          .0462
  6        -.888         .33        .3821         -.0521
  7        -.027         .33        .3316         -.0016
  8         .187         .42        .4092          .0108
  9         .490         .44        .4119          .0281
 10         .424         .30        .2744          .0256
 11        -.829         .23        .2800         -.0500
 12        -.871         .31        .3616         -.0516
 13        -.993         .25        .3098         -.0598
 14        1.027         .39        .3284          .0616
 15         .022         .22        .2187          .0013
 16       -1.025         .14        .2004         -.0604
 17        -.013         .17        .1708         -.0008
 18       -1.080         .15        .2140         -.0640
 19         .850         .20        .1517          .0483
 20        -.158         .19        .1993         -.0093
 21         .047         .16        .1573          .0027
 22         .542         .26        .2277          .0323
 23         .293         .21        .1928          .0172
 24       -1.234         .17        .2440         -.0740
 25       -1.264         .26        .3357         -.0757
 26        3.189         .48        .2877          .1923

[Figure: studentized residuals plotted against X]
REGRESSION
/DEPENDENT y
/METHOD=ENTER x
/SAVE COOK (cook) LEVER (level) SDRESID (sdresid).
List var = ID cook level sdresid.
ID      COOK     LEVEL    SDRESID
 1     .00015   .00029     .08550
 2     .07291   .04468    1.28518
 3     .00868   .03175     .47159
 4     .11600   .08079   -1.32978
 5     .02059   .02541     .77023
 6     .03909   .05181    -.88363
 7     .00002   .01122    -.02646
 8     .00249   .08593     .18329
 9     .01765   .08969     .48210
10     .00370   .00102     .41665
11     .01386   .00035    -.82313
12     .02864   .03179    -.86611
13     .02122   .00283    -.99226
14     .02665   .00963    1.02828
15     .00002   .02770     .02160
16     .04745   .04435   -1.02631
17     .00001   .07952    -.01313
18     .04390   .03156   -1.08375
19     .06178   .10758     .84490
20     .00115   .04544    -.15495
21     .00018   .09888     .04601
22     .00926   .02094     .53355
23     .00428   .05234     .28711
24     .03972   .01111   -1.24846
25     .04367   .01339   -1.28045
26     .20343   .00000    4.11310
o Critical values
Cook's D:
Leverage:
Studentized deletion residuals:
o Observation #26
Has a large residual and deletion residual
Has an acceptable Cook's D and leverage
Ŷ = .00443 + .0000384 X        R² = .564        rXY = .762
o Examining residuals
Casewise Diagnostics (a. Dependent Variable: Y)

Case   Stud. Residual     Y    Predicted Value   Residual
  1         .126         .30        .2923          .0077
  2        1.610         .45        .3534          .0966
  3         .779         .39        .3430          .0470
  4        -.784         .33        .3763         -.0463
  5        1.040         .40        .3371          .0629
  6        -.477         .33        .3585         -.0285
  7         .160         .33        .3203          .0097
  8         .695         .42        .3791          .0409
  9        1.002         .44        .3811          .0589
 10         .376         .30        .2769          .0231
 11        -.833         .23        .2811         -.0511
 12        -.547         .31        .3430         -.0330
 13        -.877         .25        .3037         -.0537
 14        1.184         .39        .3178          .0722
 15        -.241         .22        .2347         -.0147
 16       -1.336         .14        .2208         -.0808
 17        -.475         .17        .1983         -.0283
 18       -1.336         .15        .2311         -.0811
 19         .273         .20        .1839          .0161
 20        -.496         .19        .2200         -.0300
 21        -.475         .16        .1881         -.0281
 22         .304         .26        .2415          .0185
 23        -.084         .21        .2151         -.0051
 24       -1.371         .17        .2538         -.0838
 25       -1.040         .26        .3233         -.0633
 27        3.261         .28        .1059          .1741

[Figure: studentized residuals plotted against X]
REGRESSION
/DEPENDENT y
/METHOD=ENTER x
/SAVE COOK (cook) LEVER (level) SDRESID (sdresid).
List var = ID cook level sdresid.
ID      COOK     LEVEL    SDRESID
 1     .00033   .00116     .12296
 2     .11241   .04134    1.66884
 3     .02246   .03043     .77271
 4     .03788   .07116    -.77795
 5     .03662   .02499    1.04164
 6     .01065   .04729    -.46875
 7     .00068   .01244     .15652
 8     .03099   .07536     .68703
 9     .06647   .07842    1.00235
10     .00284   .00007     .36942
11     .01389   .00001    -.82777
12     .01107   .03046    -.53874
13     .01720   .00431    -.87309
14     .03644   .01097    1.19431
15     .00167   .01578    -.23622
16     .06241   .02692   -1.35925
17     .01110   .05117    -.46709
18     .05370   .01833   -1.35897
19     .00458   .07090     .26774
20     .00870   .02765    -.48791
21     .01299   .06476    -.46726
22     .00242   .01138     .29784
23     .00027   .03236    -.08220
24     .04294   .00525   -1.39763
25     .03021   .01441   -1.04229
27    1.97793   .23268    4.27763
o Critical values
Cook's D:
Leverage:
Studentized deletion residuals:
o Observation #27
Has a large residual, deletion residual, Cook's D, and leverage
R² = .811        rXY = .900
Casewise Diagnostics (a. Dependent Variable: Y)

Case     Y    Predicted Value   Residual
  1     .30        .2870          .0130
  2     .45        .3677          .0823
  3     .39        .3539          .0361
  4     .33        .3979         -.0679
  5     .40        .3461          .0539
  6     .33        .3744         -.0444
  7     .33        .3239          .0061
  8     .42        .4016          .0184
  9     .44        .4042          .0358
 10     .30        .2667          .0333
 11     .23        .2723         -.0423
 12     .31        .3539         -.0439
 13     .25        .3021         -.0521
 14     .39        .3207          .0693
 15     .22        .2110          .0090
 16     .14        .1926         -.0526
 17     .17        .1630          .0070
 18     .15        .2063         -.0563
 19     .20        .1439          .0561
 20     .19        .1916         -.0016
 21     .16        .1496          .0104
 22     .26        .2200          .0400
 23     .21        .1851          .0249
 24     .17        .2363         -.0663
 25     .26        .3280         -.0680
 28     .05        .0473         -.0003

[Figure: studentized residuals plotted against X]
REGRESSION
/DEPENDENT y
/METHOD=ENTER x
/SAVE COOK (cook) LEVER (level) SDRESID (sdresid).
List var = ID cook level sdresid.
ID      COOK     LEVEL    SDRESID
 1     .00166   .00114     .27793
 2     .14731   .04166    1.94253
 3     .02385   .03063     .79550
 4     .14727   .07179   -1.59008
 5     .04836   .02514    1.20442
 6     .04665   .04767    -.99473
 7     .00048   .01248     .13066
 8     .01137   .07603     .41213
 9     .04440   .07913     .81044
10     .01059   .00008     .71940
11     .01705   .00001    -.92025
12     .03536   .03067    -.97485
13     .02903   .00430   -1.14763
14     .06035   .01100    1.56861
15     .00114   .01611     .19514
16     .04801   .02743   -1.17608
17     .00122   .05206     .15340
18     .04678   .01870   -1.25721
19     .10074   .07207    1.29087
20     .00004   .02818    -.03419
21     .00325   .06584     .23152
22     .02042   .01164     .87563
23     .01179   .03296     .54551
24     .04835   .00539   -1.48820
25     .06259   .01447   -1.53863
28     .00001   .22342    -.00664
o Critical values
Cook's D:
Leverage:
Studentized deletion residuals:
o Observation #28
Has large leverage
Has an acceptable residual, deletion residual, and Cook's D
R² = .162        rXY = .403
o Examining residuals
Casewise Diagnostics (a. Dependent Variable: Y)

Case   Stud. Residual     Y    Predicted Value   Residual
  1         .319         .30        .2699          .0301
  2        1.580         .45        .3022          .1478
  3         .994         .39        .2967          .0933
  4         .169         .33        .3143          .0157
  5        1.132         .40        .2936          .1064
  6         .269         .33        .3049          .0251
  7         .480         .33        .2847          .0453
  8        1.128         .42        .3158          .1042
  9        1.334         .44        .3169          .1231
 10         .405         .30        .2617          .0383
 11        -.359         .23        .2639         -.0339
 12         .142         .31        .2967          .0133
 13        -.274         .25        .2759         -.0259
 14        1.129         .39        .2834          .1066
 15        -.207         .22        .2393         -.0193
 16        -.992         .14        .2320         -.0920
 17        -.547         .17        .2201         -.0501
 18        -.938         .15        .2375         -.0875
 19        -.137         .20        .2124         -.0124
 20        -.448         .19        .2315         -.0415
 21        -.602         .16        .2147         -.0547
 22         .182         .26        .2429          .0171
 23        -.205         .21        .2289         -.0189
 24        -.845         .17        .2495         -.0795
 25        -.279         .26        .2863         -.0263
 29       -4.287         .07        .3944         -.3244

[Figure: studentized residuals plotted against X]
REGRESSION
/DEPENDENT y
/METHOD=ENTER x
/SAVE COOK (cook) LEVER (level) SDRESID (sdresid).
List var = ID cook level sdresid.
ID      COOK     LEVEL    SDRESID
 1     .00204   .00010     .31270
 2     .07917   .02119    1.63404
 3     .02744   .01416     .99364
 4     .00125   .04155     .16580
 5     .03319   .01081    1.13871
 6     .00245   .02513     .26350
 7     .00508   .00375     .47244
 8     .05754   .04450    1.13456
 9     .08286   .04667    1.35774
10     .00350   .00241     .39819
11     .00268   .00148    -.35230
12     .00056   .01418     .13866
13     .00151   .00036    -.26839
14     .02759   .00302    1.13603
15     .00144   .02454    -.20276
16     .04004   .03687    -.99116
17     .01675   .06205    -.53933
18     .03104   .02744    -.93564
19     .00129   .08174    -.13450
20     .00827   .03766    -.44047
21     .02337   .07567    -.59412
22     .00102   .01940     .17853
23     .00185   .04266    -.20071
24     .01885   .01163    -.84022
25     .00175   .00476    -.27318
29    5.74625   .34626   -8.67294
o Critical values
Cook's D:
Leverage:
Studentized deletion residuals:
o Observation #29
Has a large residual, deletion residual, Cook's D, and leverage
[Figure: scatterplot with extreme observations #1 through #4 highlighted]
                                                        Problematic?
Obs        Regression Equation          rXY     eᵢ    dᵢ    Dᵢ    hᵢ
Baseline   Ŷ = .104 + .0000506 X        .876    No    No    No    No
#1         Ŷ = .097 + .0000506 X        .809    Yes   Yes   No    No
#2         Ŷ = .004 + .0000384 X        .762    Yes   Yes   Yes   Yes
#3         Ŷ = .105 + .0000506 X        .900    No    No    No    Yes
#4         Ŷ = .113 + .0000203 X        .403    Yes   Yes   Yes   Yes

(eᵢ = studentized residual, dᵢ = studentized deletion residual, Dᵢ = Cook's D, hᵢ = leverage)
o When we minimize the residuals, solve for the parameters, and compute
p-values, we need the residuals to have equal variance across the range of
X values. If this equal-variance assumption is violated, the OLS estimates
are no longer efficient and their standard errors (and hence the p-values)
will be incorrect.
o In OLS regression, each observation is treated equally. But if some
observations are more precise than others (i.e., they have smaller
variance), it makes sense to give them more weight than the less precise
values.
o In weighted least squares regression, each observation is weighted by the
inverse of its variance:

wᵢ = 1 / σᵢ²
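A minimal weighted least squares sketch, with made-up data and assumed variances (none of this is from the notes' examples): each case gets weight wᵢ = 1/σᵢ², and the estimates solve the weighted normal equations (X′WX)b = X′Wy.

```python
import numpy as np

# Made-up data; the last two observations are assumed to be less precise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.3, 2.8, 4.2, 5.1])
var = np.array([0.1, 0.1, 0.1, 1.0, 1.0])        # assumed residual variances

w = 1.0 / var                                    # w_i = 1 / sigma_i^2
X = np.column_stack([np.ones_like(x), x])
W = np.diag(w)

b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted normal equations
b_ols = np.linalg.solve(X.T @ X, X.T @ y)           # unweighted, for comparison
```

An equivalent way to fit the same model is ordinary least squares on the data premultiplied by √wᵢ, which is how many packages implement WLS internally.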
Robust regression
o When assumptions are violated and/or outliers are present, OLS
regression parameters may be biased. Robust regression refers to a family
of approaches that estimate regression parameters using techniques that
are robust to violations of the OLS assumptions
o Least absolute residual (LAR) regression estimates regression parameters
by minimizing the sum of the absolute deviations of the Y observations
from the regression line:

Σ |eᵢ| = Σ |Yᵢ − Ŷᵢ| = Σ |Yᵢ − b₀ − b₁Xᵢ|
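One common way to compute a LAR fit is iteratively reweighted least squares, approximating the absolute-value loss with weights 1/|eᵢ|. This is a sketch under made-up data (a line with one gross Y outlier), not the notes' method:

```python
import numpy as np

# Made-up data on the line Y = 1 + 2X, with one gross Y outlier.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x
y[3] = 60.0

X = np.column_stack([np.ones_like(x), x])
b = np.linalg.solve(X.T @ X, X.T @ y)            # start from the OLS solution
for _ in range(200):
    e = y - X @ b
    w = 1.0 / np.maximum(np.abs(e), 1e-8)        # downweight cases with large |e_i|
    XW = X * w[:, None]                          # row-weighted design matrix
    b = np.linalg.solve(X.T @ XW, XW.T @ y)      # weighted least squares step
```

The LAR slope stays essentially at the uncontaminated value despite the outlier, whereas the initial OLS fit is pulled toward it.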
Non-parametric regression
o Non-parametric regression techniques do not estimate model parameters.
For a regression analysis, we are usually interested in estimating a slope,
thus non-parametric methods are of limited utility from an inferential
perspective.
o In general, non-parametric techniques can be used to explore the shape of
the regression line. These methods provide a smoothed curve that fits the
data, but do not provide an equation for this line or allow for inference on
this line.
o Previously, we examined the following data and concluded that the
relationship between X2 and Y was non-linear (see p. 14-8)
[Figure: scatterplot of Y vs. x2]
o For example, if w = 3, then you take the three smallest X values, average
the Y values associated with these points, and plot that point. Next, take
the 2nd, 3rd, and 4th smallest X values and repeat the process . . .
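The procedure just described can be sketched in a few lines (the data are a tiny illustration, not from the notes; w is the window width):

```python
import numpy as np

def moving_average_smooth(x, y, w):
    """Order the points by x, then average each run of w consecutive (x, y) pairs."""
    order = np.argsort(x)
    xs, ys = np.asarray(x, float)[order], np.asarray(y, float)[order]
    sx = np.array([xs[i:i + w].mean() for i in range(len(xs) - w + 1)])
    sy = np.array([ys[i:i + w].mean() for i in range(len(ys) - w + 1)])
    return sx, sy

# With w = 3, the first smoothed point averages the three smallest
# x values and their associated y values, exactly as described above.
x = np.array([0.3, 0.1, 0.5, 0.2, 0.4])
y = np.array([3.0, 1.0, 5.0, 2.0, 4.0])
sx, sy = moving_average_smooth(x, y, w=3)
```

Plotting (sx, sy) gives the smoothed curve; larger w gives a smoother but flatter curve.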
o An example of the method of moving averages, comparing different w
values:
[Figure: moving-average smoothers of Y on X2 with smoothing intervals w = 5, 9, 15, and 25]
If the window is too small, it does not smooth the data enough; if the
window is too large, it can smooth too much and you lose the shape of
the data.
In this case, w = 15 looks about right.
[Figure: smoothed fit of Y vs. x2]
[Figures: an example of a Good Fit (Y vs. X) and a Poor Fit (Y vs. x2)]
Ln Transformation
[Figure: data before and after the transformation]
o Try X′ = ln(X) or X′ = √X
Non Linear Relationship #2: X² Transformation
[Figure: data before and after the transformation]
o Try X′ = X² or X′ = exp(X)
1/X Transformation
[Figure: data before and after the transformation]
o Try X′ = 1/X or X′ = exp(1/X)
Transforming Y
o If the data are non-normal and/or heteroscedastic, a transformation of
Y may be useful
o It can be very difficult to determine the most appropriate transformation
of Y to fix the data
o One popular class of transformations is the family of power
transformations
Y′ = Y^λ

λ = 2      Y′ = Y²
λ = 1      Y′ = Y
λ = .5     Y′ = √Y
λ = 0      Y′ = ln(Y)   (by definition)
λ = −.5    Y′ = 1/√Y
λ = −1     Y′ = 1/Y
λ = −2     Y′ = 1/Y²
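The ladder can be written as a single function; λ = 0 is defined to be the log transformation because Y^λ degenerates to a constant there. A sketch with illustrative values:

```python
import numpy as np

def power_transform(y, lam):
    """Family of power transformations: Y' = Y**lam, with lam = 0 meaning ln(Y)."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return y ** lam

y = np.array([1.0, 4.0, 9.0])
roots = power_transform(y, 0.5)      # square roots
logs = power_transform(y, 0)         # natural logs
recip = power_transform(y, -1)       # reciprocals
```

Moving down the ladder (smaller λ) pulls in a long right tail; moving up (larger λ) pulls in a long left tail.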
Transformations: An example
o Let's examine the relationship between the number of credits taken in a
minor and interest in taking further coursework in that discipline.
o A university collects data on 100 students
X = Number of credits completed in the minor
Y = Interest in taking another course in the minor discipline
o First, let's plot the data
[Figures: Y vs. number of credits with the OLS regression line and a loess curve; Y vs. sqrtX; Y vs. lnx]
REGRESSION
/DEPENDENT ybr
/METHOD=ENTER lnx
/RESIDUALS HIST(SRESID) NORM(SRESID)
/SAVE RESID (resid1) ZRESID (zresid1) SRESID (sresid1) pred (pred1).
[Figures: histogram of the studentized residuals (Mean = −2.31E−4, Std. Dev. = 1.005, N = 100), normal P-P plot, and studentized residuals vs. predicted values]
Model Summary (predictor: lnx)

R       R Square   Adjusted R Square   Std. Error of the Estimate
.813a   .661       .658                2.7779456

Coefficients

                 B       Std. Error   Beta    t        Sig.
(Constant)       1.337   1.244                1.075    .285
lnx              8.335   .603         .813    13.828   .000

Model Summary (predictor: Number of credits)

R       R Square   Adjusted R Square   Std. Error of the Estimate
.749a   .561       .556                3.1625661

Coefficients

                      B       Std. Error   Beta    t        Sig.
(Constant)            8.718   .897                 9.721    .000
Number of credits     1.149   .103         .749    11.187   .000