
V. Regression Diagnostics
Regression analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
What are its other basic
assumptions? They all concern
the residuals (e):
(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.
(3) The correlation between the errors
associated with any two different y
observations is 0: i.e. the errors are
uncorrelated, so the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of
e is normal.
The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
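In symbols, the four assumptions can be written as below. The notation is an assumption added here for illustration & does not appear in the original notes: e_i is the error for observation i, x_i its explanatory values, & sigma^2 the common error variance.

```latex
\begin{align*}
\text{(1)}\quad & E(e_i \mid x_i) = 0 \\
\text{(2)}\quad & \operatorname{Var}(e_i \mid x_i) = \sigma^2 \\
\text{(3)}\quad & \operatorname{Cov}(e_i, e_j \mid x_i, x_j) = 0 \quad \text{for } i \neq j \\
\text{(4)}\quad & e_i \mid x_i \sim N(0, \sigma^2)
\end{align*}
```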
To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.
What are the implications of these
assumptions?
Assumption 1: ensures that the regression
coefficients are unbiased.
Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
Assumption 4: ensures the validity of
confidence intervals & p-values.
Assumption 4 is by far the least important:
even if the distribution of a regression model's
residuals departs from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?
Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.
Violations of assumption 3 occur as a
result of clustered observations or time-
series observations: the errors are not
independent from one observation to
another but rather are correlated.
E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or time-
series observations.
In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
What matters is if the violations surpass
some critical threshold.
Regression diagnostics: procedures to
detect violations of the linear model's
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.
Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

See King et al. for applications of the
logic of regression diagnostics to
qualitative social science research as
well.
Keep in mind that the linear model does
not assume that the distribution of a
variable's observations is normal.
Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
While it's important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of the x's with y.
Let's turn our attention to regression
diagnostics.
For the sake of presenting the
material, we'll examine & respond to
the diagnostic tests step by step.
In real life, though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.
Model Specification
Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.
If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.

We could then either under- or
overestimate the y/x relationship, &
conclude incorrectly that a coefficient is
insignificant or significant.
If this is a problem, perhaps the
outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from miles per gallon
to gallons per mile).
Or perhaps, e.g., wage needs to be
transformed to log(wage).
And/or maybe not OLS but another kind
of regression (e.g., quantile regression)
should be used.
Let's begin by exploring the
variables we'll use.
. use WAGE1, clear
. hist wage, norm
. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage

Note that ladder wage doesn't
suggest a transformation, but log
wage is common for right skewness.
. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))

While a log transformation makes wage's
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
Let's inspect its profile:
. list id wage lwage educ exper tenure female nonwhite if id==24
It's a white female with 12 years of
education, but earning a very low wage.
We don't know whether the wage is an error, so we'll
keep an eye on id=24 for possible problems.
Let's examine the independent variables:
. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2
. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. ladder exper
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper
. gen exper2=exper^2
. su exper exper2
. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.
Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure
. sparl lwage tenure, quad
Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.
. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure
(ltenure is the log-transformed version of tenure.)
. twoway qfitci lwage tenure
. twoway qfitci lwage ltenure
What other options could be explored for
educ, exper, & tenure?
The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)
. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
Although nonwhite tests insignificant, it might
become significant in the model.
Let's first test the model omitting the
transformed versions of the variables:
. reg wage educ exper tenure female nonwhite
A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).
Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
Thus we may incorrectly detect or fail to
detect y/x relationships.
In Stata we use estat ovtest (Ramsey's
regression specification error test, RESET) to
check for this kind of misspecification.
estat ovtest adds powers of the model's
fitted values to the regression & tests them
jointly. Strictly speaking, then, it detects
neglected nonlinearity in the included
predictors (functional-form misspecification)
rather than omitted variables in general.
We want it to test insignificant, so that we
fail to reject the null hypothesis that the
model has no omitted variables.
. reg wage educ exper tenure female nonwhite
. estat ovtest

Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
F(3, 517) = 9.37
Prob > F = 0.0000
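To see what the test is doing, it can be reproduced by hand. The sketch below is an illustration under stated assumptions, not the notes' own analysis: it uses the nlsw88 data shipped with Stata as a stand-in for WAGE1, & the names z, z2, z3, z4 & F_builtin are made up for the example. The fitted values are standardized before being raised to powers, which keeps the regression numerically stable without changing the test.

```stata
* Stand-in data shipped with Stata (an assumption; not the WAGE1 data)
sysuse nlsw88, clear
generate lwage = ln(wage)

* Built-in RESET
regress lwage grade ttl_exp tenure
estat ovtest
scalar F_builtin = r(F)

* By hand: standardize the fitted values, add powers 2-4, test them jointly
predict double yhat if e(sample), xb
summarize yhat
generate double z  = (yhat - r(mean)) / r(sd)
generate double z2 = z^2
generate double z3 = z^3
generate double z4 = z^4
regress lwage grade ttl_exp tenure z2 z3 z4
test z2 z3 z4
display "built-in F = " %9.3f F_builtin "   by-hand F = " %9.3f r(F)
```

The two F statistics should agree, which makes it plain that a significant result points to curvature the model has not captured.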

The model fails. Let's add the
transformed variables.
. reg lwage educ educ2 exper exper2 ltenure female nonwhite
. estat ovtest

Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
F(3, 515) = 2.11
Prob > F = 0.0984

The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.
estat ovtest is a decisive hurdle to clear.
Even so, passing any diagnostic test
by no means guarantees that we've
specified the best possible model,
statistically or substantively.
It could be, again, that other models
would be better in terms of statistical fit
and/or substance.
Another way to test the model's functional
specification is via linktest: it tests whether y
is properly specified or not.
linktest's _hatsq must test insignificant:
we want to fail to reject the null hypothesis
that y is specified correctly.
. linktest
------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------
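The table above is simply a regression of y on the model's linear prediction & its square. A minimal by-hand sketch follows, again assuming the nlsw88 data shipped with Stata as a stand-in for WAGE1; the variable names xb_hat & xb_hatsq are hypothetical.

```stata
* Stand-in data shipped with Stata (an assumption; not the WAGE1 data)
sysuse nlsw88, clear
generate lwage = ln(wage)

* Built-in version
regress lwage grade ttl_exp tenure
linktest

* By hand: linear prediction, its square, then regress y on both
quietly regress lwage grade ttl_exp tenure
predict double xb_hat if e(sample), xb
generate double xb_hatsq = xb_hat^2
regress lwage xb_hat xb_hatsq
test xb_hatsq
```

If the squared prediction adds explanatory power (a small p-value on xb_hatsq), the original specification is missing some curvature.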

The model fails. Let's try another
model that uses categorized predictors.
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. estat ovtest
. linktest
Results: estat ovtest, p = .12; linktest, p = .15.
We'll stick with this model unless other
diagnostics indicate problems. Again, too
bad the data don't include age.
Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we've necessarily specified the
best possible model, either
statistically or substantively.
It merely means that the model has
passed some minimal statistical threshold
of data fitting.
At this stage it's helpful to plot the model's
residual versus fitted values to obtain a
graphic perspective on the model's fit &
problems.
. rvfplot, yline(0) ml(id)
[Figure: residual-versus-fitted plot. Y-axis: residuals (ticks at -2, -1, 0, 1); x-axis: fitted values (ticks at 1, 1.5, 2, 2.5); points labeled by id. Observation id=24 is the lowest point, near -2.]
Problems of heteroscedasticity?
So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
Let's examine other assumptions.
The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
Next, though, let's examine a potential
model problem, multicollinearity, which in
fact does not violate any of the regression
assumptions.
Multicollinearity
Multicollinearity: high correlations
between the explanatory variables.
Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
But, like small sample or subsample size, it
does inflate standard errors.
It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.
Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren't reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.
(3) Very large standard errors.
(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson's paradox, etc.).
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.
The following indicators are more reliable
because they better gauge the array of
joint effects within a model:
(6) Post-model estimation (Stata command vif):
VIF>10. VIF measures the inflation in a coefficient's
variance due to multicollinearity.
The square root of VIF shows the factor by which
an explanatory variable's standard error is inflated
due to multicollinearity.
(7) Post-model estimation (Stata command vif):
Tolerance<.10. Tolerance is the reciprocal of VIF; it
measures the share of an explanatory variable's variance
that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable Stata command
collin): Condition Number>15 or especially >30.

. vif
    Variable |       VIF       1/VIF
-------------+----------------------
       exper |     15.62    0.064013
      exper2 |     14.65    0.068269
         hsc |      1.88    0.532923
  _Itencat_3 |      1.83    0.545275
        ccol |      1.70    0.589431
        scol |      1.69    0.591201
  _Itencat_2 |      1.50    0.666210
  _Itencat_1 |      1.38    0.726064
      female |      1.07    0.934088
    nonwhite |      1.03    0.971242
-------------+----------------------
    Mean VIF |      4.23

The seemingly troublesome scores for exper &
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.
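Each VIF is 1/(1 - R-squared) from regressing one explanatory variable on all the others, & tolerance is 1 - R-squared. For exper in the table above, the square root of 15.62 is about 3.95, so its standard error is roughly four times what it would be without the collinearity. The sketch below reproduces one VIF by hand; it assumes the nlsw88 data shipped with Stata as a stand-in for WAGE1, & the names insample & vif_hand are made up for the example.

```stata
* Stand-in data shipped with Stata (an assumption; not the WAGE1 data)
sysuse nlsw88, clear
generate lwage = ln(wage)
generate ttl_exp2 = ttl_exp^2

regress lwage grade ttl_exp ttl_exp2 tenure
generate byte insample = e(sample)
estat vif

* Auxiliary regression: one predictor on the others, same sample
quietly regress ttl_exp grade ttl_exp2 tenure if insample
scalar vif_hand = 1 / (1 - e(r2))
display "VIF = " %6.2f vif_hand ///
        "   tolerance = " %6.4f 1 - e(r2) ///
        "   sqrt(VIF) = " %6.2f sqrt(vif_hand)
```

The displayed VIF should match the ttl_exp row reported by estat vif.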
What would we do if there were a
problem of multicollinearity?
Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.
Center the offending explanatory variables (see
Mendenhall/Sincich), perhaps using the downloadable
Stata command center.
Eliminate variables, but this might cause
specification errors (see, e.g., ovtest).
Collect additional data, if you have a big bank
account & plenty of time!
Group relevant variables into sets of variables
(e.g., an index): how might we do this?
Learn how to do principal components analysis
or principal factor analysis (see Hamilton).
Or do ridge regression (see Mendenhall/
Sincich).
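As an illustration of the centering remedy, the sketch below centers a predictor before squaring it & compares the VIFs. It is a sketch under assumptions: the nlsw88 data shipped with Stata stand in for WAGE1, plain generate is used instead of the center command, & the names c_exp, c_exp2 & r2_raw are hypothetical.

```stata
* Stand-in data shipped with Stata (an assumption; not the WAGE1 data)
sysuse nlsw88, clear
generate lwage = ln(wage)

* Raw quadratic: the linear & squared terms are highly correlated
generate ttl_exp2 = ttl_exp^2
quietly regress lwage grade ttl_exp ttl_exp2 tenure
scalar r2_raw = e(r2)
estat vif

* Centered quadratic: subtract the estimation-sample mean before squaring
summarize ttl_exp if e(sample), meanonly
generate c_exp  = ttl_exp - r(mean)
generate c_exp2 = c_exp^2
quietly regress lwage grade c_exp c_exp2 tenure
estat vif
display "R-squared raw = " %8.6f r2_raw "   centered = " %8.6f e(r2)
```

Centering only re-expresses the same quadratic, so the model's fit is unchanged; what typically changes is that the VIFs for the linear & squared terms drop.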
Let's skip to Assumption 4: that the residuals
are normally distributed.
While this is by far the least important
assumption, examining the residuals at this
stage (as we began to do with rvfplot) can tip
us off to other problems.
Normal Distribution of Residuals
This is necessary for confidence intervals
& p-values to be accurate.
But this is the least worrisome problem in
general: if the sample is as large as 100-
200, the central limit theorem implies that
confidence intervals & p-values will be
good approximations even when the residuals
are not normally distributed.
The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
We'll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals:
. predict e if e(sample), resid
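The other plots mentioned above can be drawn from the same studentized residuals. A minimal sketch, assuming the nlsw88 data shipped with Stata as a stand-in for WAGE1:

```stata
* Stand-in data shipped with Stata (an assumption; not the WAGE1 data)
sysuse nlsw88, clear
generate lwage = ln(wage)
regress lwage grade ttl_exp tenure
predict rstu if e(sample), rstudent

kdensity rstu, normal      // kernel density with a normal curve overlaid
qnorm rstu                 // normal quantile plot
graph box rstu             // box plot

* How many residuals are 3 or more in absolute value?
count if abs(rstu) >= 3 & !missing(rstu)
```

The !missing() condition matters because Stata treats missing values as larger than any number.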
. hist rstu, norm
[Figure: histogram of the studentized residuals with an overlaid normal curve. Y-axis: density (0 to .5); x-axis: studentized residuals (-4 to 4).]

Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.
estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals (which they really
don't need to pass), plus it leads us into the assessment of
non-constant variance:
. estat imtest
Cameron & Trivedi's decomposition of IM-test
---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      21.27     13    0.0677
            Skewness |       4.25      4    0.3733
            Kurtosis |       2.47      1    0.1160
---------------------+-----------------------------
               Total |      27.99     18    0.0621
---------------------------------------------------

Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let's investigate.
Non-Constant Variance
If the variance changes according to the levels
of the explanatory variables (i.e. the spread of the
residuals is not constant but rather depends on the
values of x), then:
(1) the OLS estimates are not efficient:
alternative approaches such as weighted least
squares would give more precise estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.
In Stata we test for non-constant
variance by means of:
(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there's no heteroscedasticity.
. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1) = 15.76
Prob > chi2 = 0.0001
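The default test can be reproduced by hand: scale the squared residuals by the maximum-likelihood variance estimate (RSS/N), regress them on the fitted values, & take half the model sum of squares. The sketch below assumes the nlsw88 data shipped with Stata as a stand-in for WAGE1; the names uhat, u2, chi2_builtin & chi2_hand are made up for the example.

```stata
* Stand-in data shipped with Stata (an assumption; not the WAGE1 data)
sysuse nlsw88, clear
generate lwage = ln(wage)

regress lwage grade ttl_exp tenure
estat hettest
scalar chi2_builtin = r(chi2)

* By hand: scaled squared residuals regressed on the fitted values
predict double yhat if e(sample), xb
predict double uhat if e(sample), residuals
generate double u2 = uhat^2 / (e(rss) / e(N))
quietly regress u2 yhat
scalar chi2_hand = e(mss) / 2
display "built-in chi2(1) = " %8.2f chi2_builtin ///
        "   by-hand chi2(1) = " %8.2f chi2_hand ///
        "   p = " %6.4f chi2tail(1, chi2_hand)
```

Seen this way, the test asks whether the size of the squared residuals trends with the fitted values.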

There seem to be problems. Let's
inspect the individual predictors.
. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance

---------------------------------------
    Variable |      chi2   df      p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      1.56    1   0.9077 #
      exper2 |      0.10    1   1.0000 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
-------------+-------------------------
simultaneous |     23.68   10   0.0085
---------------------------------------
# Sidak adjusted p-values

It seems that the problem has to do with
tenure.
. estat szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable

---------------------------------------
    Variable |      chi2   df      p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

So, both hettest & szroeter say that the serious
problem is with tenure: the high
category of tenure is to blame.
What measures might be taken to
correct or reduce the problem?
Let's examine the graphs: rvfplot & rvpplot.
. rvfplot, yline(0) ml(id)
[Figure: residual-versus-fitted plot. Y-axis: residuals (ticks at -2, -1, 0, 1); x-axis: fitted values (ticks at 1, 1.5, 2, 2.5); points labeled by id. Observation id=24 is the lowest point, near -2.]
. rvpplot _Itencat_3, yline(0) ml(id)
[Figure: residual-versus-predictor plot. Y-axis: residuals (ticks at -2, -1, 0, 1); x-axis: tencat==3 (ticks from 0 to 1); points labeled by id. Observation id=24 is the lowest point, near -2.]
By the way, note id=24.
What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-
constant variance. I'm guessing that
including the variable age, which the data set
doesn't have, would either solve or reduce
the problem. Why?
Maybe the small number of observations at the high
end of tenure matters as well.
Other options:
Try interactions &/or transformations
based on qladder & ladder.
Categorizing a continuous predictor
(multi-level or binary) may work
(although at the cost of lost information).
We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).
A more complicated option would be to
use weighted least squares regression.
If nothing else works, we could use
robust standard errors. These relax
Assumption 2 to the point that we
wouldn't have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn't work).
Whatever strategies we try, we redo the
diagnostics & compare the new model's
coefficients to the original model's.
The key question: is there a practically
significant difference in the models?
My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.
What can we do?
We can use robust standard errors.
It's quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn't
small.
Doing so relaxes Assumption 2, which
we then no longer need to check.
If we do use robust standard errors, lots of
our routine diagnostic procedures won't work
because their statistical premises don't hold.
A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
Variable | m1 m2_robust
-------------+--------------------------------
hsc | .22241643*** .22241643***
scol | .32030543*** .32030543***
ccol | .68798333*** .68798333***
exper | .02854957*** .02854957***
exper2 | -.00057702*** -.00057702***
_Itencat_1 | -.0027571 -.0027571
_Itencat_2 | .22096745*** .22096745***
_Itencat_3 | .28798112*** .28798112***
female | -.29395956*** -.29395956***
nonwhite | -.06409284 -.06409284
_cons | 1.1567164*** 1.1567164***
-------------+--------------------------------
N | 526 526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001
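There's no difference at all in the
coefficients or their significance stars.
(The coefficients are identical by construction:
robust standard errors change only the standard
errors, & hence the t-statistics & p-values.)
See Allison, who points out that non-
constant variance has to be
pronounced in order to make a
difference.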
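Because robust estimation leaves the coefficients untouched, a side-by-side table of standard errors is often more informative than one of coefficients & stars. The sketch below assumes the nlsw88 data shipped with Stata as a stand-in for WAGE1; the stored-estimate names ols & robust are hypothetical.

```stata
* Stand-in data shipped with Stata (an assumption; not the WAGE1 data)
sysuse nlsw88, clear
generate lwage = ln(wage)

quietly regress lwage grade ttl_exp tenure
estimates store ols

quietly regress lwage grade ttl_exp tenure, vce(robust)
estimates store robust

* Coefficients match across columns; only the standard errors can differ
estimates table ols robust, b(%9.4f) se(%9.4f) stats(N)
```

If the two columns of standard errors are close, non-constant variance is making little practical difference.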
It's a good idea, in any case, to
specify robust standard errors in a
final model.
For now, we won't use
robust standard errors so that
we can explore additional
diagnostics.
Our final model, however,
will use robust standard
errors.
Correlated Errors
In the case of these data there's no need
to worry about correlated errors: the sample
is not a cluster, panel, or time-series sample.
In general there's no straightforward way
to check for correlated errors.
If we suspect correlated errors, we
respond in one or more of the following
three ways:
(1) by using robust standard errors;
(2) if it's a cluster sample, by using Stata's
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
But again, our data aren't based on a cluster
sample.
(3) if it's time-series data, by checking for
serial correlation with Stata's estat bgodfrey
command (the Breusch-Godfrey Lagrange
multiplier test).
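A minimal sketch of options (2) & (3) follows. Everything in it is an illustrative assumption: the nlsw88 & sp500 data sets shipped with Stata stand in for real cluster & time-series samples, industry is used only as a convenient grouping variable (it is not a true sampling cluster), & t is a made-up time index.

```stata
* (2) Cluster-robust standard errors, with a stand-in grouping variable
sysuse nlsw88, clear
generate lwage = ln(wage)
regress lwage grade ttl_exp tenure, vce(cluster industry)

* (3) Breusch-Godfrey test for serial correlation in time-series data
sysuse sp500, clear
generate t = _n            // consecutive index, so the series has no gaps
tsset t
regress change volume
estat bgodfrey, lags(1)
```

With few clusters, as here, cluster-robust standard errors are themselves unreliable, so the first half is only a syntax illustration.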
This model seems to be satisfactory from the
perspective of linear regression's assumptions,
with the exception of a practically insignificant
problem with non-constant variance.
But there's another potential problem:
influential outliers.
Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.
An observation's influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation's y-value
falls from the value the model predicts for it;
leverage: how far the observation on the X-
axis falls from the mean for x.
discrepancy + leverage = influence
Highly influential observations are most
likely to occur in small samples.
Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
Studentized residuals of -3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
Large outliers can affect the equation's constant,
reduce its fit, & increase its standard errors, but
they don't influence the regression coefficients.
Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.
Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
Whereas studentized residuals & hat
values each measure potential influence,
Cook's Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
Cook's Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model's overall
fit.
DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
DFBETAs, then, provide the most direct measure
of the influence of observations on slope
coefficients.
A DFBETA of 1 means that the observation shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more in absolute value, or of
at least 2 divided by the square root of n (in large
samples), represent influential outliers.
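The measures & cutoffs described above can be computed after any regression. The sketch below is an illustration under assumptions: it uses the nlsw88 data shipped with Stata as a stand-in for WAGE1, & the variable names rstu, lev, cook, dfit & dfb_tenure are made up for the example.

```stata
* Stand-in data shipped with Stata (an assumption; not the WAGE1 data)
sysuse nlsw88, clear
generate lwage = ln(wage)
regress lwage grade ttl_exp tenure
local k = e(df_m)          // number of explanatory variables
local n = e(N)             // estimation sample size

predict rstu if e(sample), rstudent          // discrepancy
predict lev  if e(sample), leverage          // hat values
predict cook if e(sample), cooksd            // Cook's Distance
predict dfit if e(sample), dfits             // DFITS
predict dfb_tenure if e(sample), dfbeta(tenure)

* Count observations beyond the cutoffs given in the text
count if abs(rstu) >= 3 & !missing(rstu)
count if lev >= 3*`k'/`n' & !missing(lev)
count if cook >= 4/`n' & !missing(cook)
count if abs(dfb_tenure) >= 2/sqrt(`n') & !missing(dfb_tenure)

* Inspect the five observations with the largest Cook's Distance
gsort -cook
list idcode rstu lev cook dfb_tenure in 1/5, noobs
```

In a sample this large the 4/n & 2/sqrt(n) cutoffs flag many cases, so the counts are best read as a screening device rather than a verdict.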
But before we examine these influence
indicators, let's examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)
. lvr2plot, ms(i) ml(id)
[Figure: leverage-versus-squared-residual plot. Y-axis: leverage (ticks from .01 to .06); x-axis: normalized residual squared (ticks from 0 to .04); points labeled by id.]

There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.
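The reading of the plot above can be sketched numerically. This is only an illustration on Stata's bundled auto data, not the wage data used in these slides; the variable names id, lev, res, sse & nrsq are hypothetical, & the normalization of the squared residuals (each one's share of the residual sum of squares) is an assumption about what the x-axis shows.

```stata
* Sketch on Stata's bundled auto data, not the wage data analyzed here.
sysuse auto, clear
generate id = _n                          // hypothetical id variable for labels
regress price weight length foreign

lvr2plot, mlabel(id)                      // leverage vs. normalized residual squared

* Numeric version of the same two axes.
predict lev if e(sample), leverage        // hat values
predict res if e(sample), residuals
egen sse = total(res^2)
generate nrsq = res^2 / sse               // assumed normalization: share of SSE

quietly summarize lev
scalar mean_lev = r(mean)
quietly summarize nrsq
scalar mean_nrsq = r(mean)

* Candidates for influence: above average on both axes (top right-hand area).
list id lev nrsq if lev > scalar(mean_lev) & nrsq > scalar(mean_nrsq)
```

Points that are high on only one axis are outliers or leverage points; it is the combination of the two that signals potential influence.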
. avplots: look for simultaneous x/y
extremes.

[Figure: added-variable plots, one panel per predictor, each plotting e( lwage | X ) against e( predictor | X ).]
Panel statistics:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11
. avplot _Itencat_3, mlabel(id)
[Figure: added-variable plot of e( lwage | X ) against e( _Itencat_3 | X ), observations labeled by id. coef = .28798112, se = .0557211, t = 5.17]

There again is id=24. Why isn't it a problem?
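One way to check whether a single observation matters is to refit the model without it & see how far a coefficient moves. A minimal sketch on Stata's bundled auto data (not the wage data); the id variable & the choice of observation 13 are arbitrary, hypothetical stand-ins.

```stata
* Sketch: does dropping one observation move a coefficient?
sysuse auto, clear
generate id = _n                              // hypothetical id variable

regress price weight length foreign
scalar b_all  = _b[weight]
scalar se_all = _se[weight]

regress price weight length foreign if id != 13   // arbitrary observation
scalar b_drop = _b[weight]

* Shift in the coefficient, in units of its full-sample standard error.
display "shift in SE units: " (scalar(b_all) - scalar(b_drop)) / scalar(se_all)
```

This is similar in spirit to dfbeta, though not the same scaling. A point with a large residual can still leave the coefficients almost unchanged if its leverage is low.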


So far, we see no problems with
influential observations.
Next let's numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook's Distance
(d), & dfbeta (DF_).
. predict rstu if e(sample), rstu
. predict h if e(sample), hat
. predict d if e(sample), cooksd
. dfbeta
. su rstu-DF_Ixtenure_3
Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
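To go beyond eyeballing the summary, the diagnostics can be screened against rules of thumb. The sketch below uses Stata's bundled auto data rather than the wage data, & the cutoffs (|rstu| > 2, h > 2k/n, d > 4/n) are common conventions assumed here, not thresholds given in these slides.

```stata
* Sketch: flag observations that exceed conventional cutoffs.
sysuse auto, clear
generate id = _n                          // hypothetical id variable
regress price weight length foreign

predict rstu if e(sample), rstudent       // studentized residuals
predict h    if e(sample), hat            // leverage (hat values)
predict d    if e(sample), cooksd         // Cook's Distance

scalar n_obs = e(N)
scalar k_par = e(df_m) + 1                // slopes plus the constant

* Assumed rules of thumb: |rstu| > 2, h > 2k/n, d > 4/n.
list id rstu h d if e(sample) & (abs(rstu) > 2 | h > 2*scalar(k_par)/scalar(n_obs) | d > 4/scalar(n_obs))
```

An observation flagged on rstu alone is an outlier in y; it deserves a closer look mainly when h or d is flagged as well.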
Let's use d (Cook's Distance) to illustrate
the further analysis of influence diagnostics.
. scatter d id
[Figure: scatter plot of Cook's Distance (d) against id, observations labeled by id. y-axis: 0 to .03; x-axis: id, 0 to 500.]

Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there's no problem of influential
outliers.

If there were a problem, what would we do?


Correct outliers that are coding errors if
possible.
Examine the model's adequacy (see the
sections on model specification &
non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
Other options: try other types of
regression.
Try, e.g., quantile (including median) regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS
web book).
Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
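A quick way to see whether outliers are driving the results is to fit the same model by OLS, median regression & robust regression & compare the estimates side by side. A minimal sketch on Stata's bundled auto data (not the wage data); the stored-estimate names ols, med & rob are hypothetical.

```stata
* Sketch: compare OLS with median (qreg) and robust (rreg) regression.
sysuse auto, clear

regress price weight length foreign
estimates store ols

qreg price weight length foreign          // median regression (default quantile .5)
estimates store med

rreg price weight length foreign          // robust regression
estimates store rob

* Coefficients and standard errors side by side.
estimates table ols med rob, b(%9.3f) se(%9.3f)
```

If the three sets of coefficients tell the same story, outliers are unlikely to be distorting the OLS results; large discrepancies call for a closer look at the observations involved.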
One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
A given outlier, then, could result from
chance alone, as occurs by definition in a
normal distribution.
For a way (which includes a significance
test) to examine if observations are
outliers on more than one quantitative
variable, see the command hadimvo.
lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.
Recall that outliers are most likely to
cause problems in small samples.
In a normal distribution we expect about
5% of the observations to lie roughly 2 or more
standard deviations from the mean, i.e. to look like outliers.
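The 5% figure is the share of a normal distribution lying more than about 1.96 standard deviations from the mean. A small sketch of the arithmetic; the sample size of 526 is an assumption taken from the highest id visible in the plots.

```stata
* Share of a normal distribution beyond +/- 1.96 standard deviations.
display 2 * (1 - normal(1.96))

* Expected count of such observations, assuming n = 526.
display 526 * 2 * (1 - normal(1.96))
```

So in a sample of this size, roughly 26 observations beyond that threshold would be unremarkable.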
Don't over-fit a model to a
sample: remember that there is
sample-to-sample variation.
Let's Wrap It Up

Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust
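Robust standard errors leave the coefficients unchanged & alter only the standard errors (& hence the t-statistics & p-values). A minimal sketch of the comparison on Stata's bundled auto data, not the wage data; the stored-estimate names classical & rob are hypothetical.

```stata
* Sketch: same coefficients, different standard errors.
sysuse auto, clear

regress price weight length foreign
estimates store classical

regress price weight length foreign, robust   // vce(robust) is the newer spelling
estimates store rob

estimates table classical rob, b(%9.3f) se(%9.3f)
```

A large gap between the two sets of standard errors is itself a hint of non-constant variance.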