
22s:152 Applied Linear Regression

Chapter 11 Diagnostics:
Unusual and Influential Data:
Outliers, Leverage, and Influence
--------------------------------------------------

We’ve seen that certain assumptions are made when we model the data.

These assumptions allow us to calculate a test statistic, and to know the distribution of the test statistic under a null hypothesis.

Besides the specified distributional assumptions, there are other things to check for in regression.

In this section, we look for characteristics in the data that can cause problems with our fitted models.

In multiple regression, we often investigate with a scatterplot matrix first, looking for:
  collinearity (highly correlated x’s)
  outliers
  marginal pairwise linearity

These scatterplots can be useful, but they only show pairwise relationships.

After fitting the model, we check for:
  nonlinearity (with residual plot)
  non-constant variance
  non-normality
  multicollinearity (VIFs, yet to come in Ch. 13)
  *outliers and influential observations

Outliers, Leverage, and Influence

• In general, an outlier is any unusual data point. It helps to be more specific...

[Figure: scatterplot of repwt against weight]

• In the above plot there is a strong outlier in the bottom right.

  This is an outlier with respect to the X-distribution (independent variable).

  It is not an outlier with respect to the Y-distribution (dependent variable).

• A strong outlier with respect to the independent variable(s) is said to have high leverage.

• Such a point could have a big impact on the fitted model.

• Here is the fitted line for the full data set:

[Figure: scatterplot of repwt against weight with the fitted line for the full data set]
• Here is the fitted line after removing the high leverage point:

[Figure: scatterplot of repwt against weight with the fitted line after removing the high leverage point]

• This one point had a lot of influence on the fitted line. The β̂0 and β̂1 changed greatly in the re-fit.

• The second picture shows a much better fit to the bulk of the data. But one must be careful about removing outliers. In this case it was a value that was reported incorrectly, and removal was justified.

• A point with high leverage doesn’t HAVE to have a large impact on the fitted line.

• Recall the cigarette data:

[Figure: scatterplot of cig.data$Nic against cig.data$Tar with the fitted line for the full data set]

• Above is the fitted line with the inclusion of the point of high leverage (the top right data point is far outside the distribution of x-values).

• Here is the fitted line after removing the outlier:

[Figure: scatterplot of cig.data$Nic against cig.data$Tar with the fitted line after removing the high leverage point]

• The before and after in this case are quite similar. This leverage point did NOT have a large influence on the fitted model.

• A point with high leverage has the potential to greatly affect the fitted model.

• The blue point below is an outlier, but not with respect to the independent variable (not high leverage).

[Figure: scatterplot of y against x with the fitted line for the full data set]

• Here is the fitted line after removal of the outlier. It did not have a big influence on the fitted line.

[Figure: scatterplot of y against x with the fitted line after removing the outlier]
• What about higher dimensions (more predictors)?

• A point with high leverage will be outside of the ‘cloud’ of observed x-values.

• With 2 predictors X1 and X2 (no response shown)

[Figure: scatterplot of x2 against x1]

The point at the top left has high leverage because it is outside the cloud of (X1, X2) independent values.

Assessing Leverage: Hat Values

• Beyond graphics, we have a quantity called the hat value which is a measure of leverage.

• Leverage measures “potential” influence.

• In regression, the hat value $h_i$ (or $h_{ii}$) is a common measure of leverage.

• A high hat value $h_i$ equates to high leverage.

• What is $h_i$? How is it calculated?

  – First, why is it called a hat value?

  – In matrix notation for OLS, we have:

    $$\hat{Y} = X\hat{\beta} = \underbrace{X (X'X)^{-1} X'}_{H}\, Y$$

    where $H$ is an n × n matrix called the Hat matrix, so that

    $$\hat{Y} = HY$$

  – Thus, $\hat{Y}_j$ is the jth row of $H$ times $Y$. But $H$ is symmetrical, so the book shows it as...

    $$\hat{Y}_j = h_{1j} Y_1 + h_{2j} Y_2 + \cdots + h_{nj} Y_n = \sum_{i=1}^{n} h_{ij} Y_i$$

  – The value $h_{ij}$ captures the contribution of observation $Y_i$ to the fitted value of the jth observation, or $\hat{Y}_j$ (consider $h_{ij}$ a weight).

  – If $h_{ij}$ is large, $Y_i$ CAN have a substantial impact on the jth fitted value.

  – It can be shown that...

    $$h_i = h_{ii} = \sum_{j=1}^{n} h_{ij}^2$$

    and $h_i$ summarizes the potential influence of $Y_i$ on the fit of ALL other observations.

• The hat value for obs i is the ith element in the diagonal of $H$ (thus, $h_{ii}$).

• How do we determine if $h_i$ is large?
  * $1/n \le h_i \le 1$
  * $\sum_{i=1}^{n} h_i = (k + 1)$ (where k is the number of predictors)
  * $\bar{h} = (k + 1)/n$

• We can compare $h_i$ to the average. If it is much larger, then the point has high leverage.

• Often use $2(k + 1)/n$ or $3(k + 1)/n$ as a guide for determining high leverage.
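The hat-matrix properties listed here can be checked numerically. The following is a minimal sketch, assuming the built-in mtcars data set as a stand-in (it is not the data used in these notes) and an arbitrary two-predictor model; the variable names are illustrative only.

```r
# Stand-in data: built-in mtcars (an assumption; not the lecture's data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)     # n x (k+1) design matrix, intercept column included
y <- mtcars$mpg
n <- nrow(X)
k <- ncol(X) - 1           # number of predictors

# Hat matrix H = X (X'X)^{-1} X'
H <- X %*% solve(crossprod(X)) %*% t(X)
h <- diag(H)               # hat values h_i = h_ii

max(abs(H %*% y - fitted(fit)))   # Y-hat = H Y, so this is numerically zero
max(abs(h - hatvalues(fit)))      # diagonal of H agrees with hatvalues()
max(abs(h - rowSums(H^2)))        # h_i = sum_j h_ij^2
sum(h)                            # equals k + 1
range(h)                          # lies within [1/n, 1]
```

Forming H explicitly is only sensible for small n; in practice hatvalues() returns the diagonal without building the full n × n matrix.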
• In SLR, $h_i$ just measures the distance from the mean $\bar{X}$.

  $$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}$$

• In multiple regression, $h_i$ measures distance from the centroid (center of the cloud of data points in the X-space).

• The dependent-variable values are not involved in determining leverage. Leverage is a statement about the X-space.

• Example: Duncan data

  Response: Prestige
    Percent of raters in an opinion study rating occupation as excellent or good in prestige.

  Predictors: Income
    Percent of males in occupation earning $3500 or more in 1950.

    Education
    Percent of males in occupation in 1950 who were high school graduates.

> library(car)    ## makes the Duncan data available
> attach(Duncan)
> head(Duncan)
           type income education prestige
accountant prof     62        86       82
pilot      prof     72        76       83
architect  prof     75        92       90
author     prof     55        90       76
chemist    prof     64        86       90
minister   prof     21        84       87

> nrow(Duncan)
[1] 45

Two Predictors:

> plot(education,income,pch=16,cex=2)
## The next line produces an interactive option for the
## user to identify points on a plot with the mouse.
> identify(education,income,row.names(Duncan))

[Figure: scatterplot of income against education, with ‘RR.engineer’, ‘conductor’, and ‘minister’ labeled]

From this bivariate plot of the two predictors, it looks like the ‘RR.engineer’, ‘conductor’, and ‘minister’ may have high leverage.

Fit the model, get the hat values:

> lm.out=lm(prestige ~ education + income)
> hatvalues(lm.out)
         1          2          3          4 ...
0.05092832 0.05732001 0.06963699 0.06489441 ...

Plot the hat values against the indices 1 to 45, and include thresholds for 2 ∗ (k + 1)/n and 3 ∗ (k + 1)/n:

> plot(hatvalues(lm.out),pch=16,cex=2)
> abline(h=2*3/45,lty=2)
> abline(h=3*3/45,lty=2)
> identify(1:45,hatvalues(lm.out),row.names(Duncan))
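The identify() calls need a mouse. A non-interactive way to list the points above the 2(k + 1)/n and 3(k + 1)/n guides is sketched below, again assuming the built-in mtcars data as a stand-in model rather than the Duncan fit.

```r
# Stand-in model on built-in data (an assumption; substitute your own lm fit).
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)
n <- length(h)
k <- length(coef(fit)) - 1     # number of predictors
h_bar <- (k + 1) / n           # average hat value

# Observations above twice and three times the average hat value
flag2 <- names(h)[h > 2 * h_bar]
flag3 <- names(h)[h > 3 * h_bar]

sort(h[flag2], decreasing = TRUE)   # flagged points, largest leverage first
```

These cutoffs are rough guides, so it is worth looking at the flagged values relative to the rest rather than treating the thresholds as a formal test.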
[Figure: hatvalues(lm.out) against Index, with ‘RR.engineer’, ‘conductor’, and ‘minister’ labeled]

These three points have high leverage (potential to greatly influence the fitted model) using the 2 times and 3 times the average hat value criterion.

To measure influence, we’ll look at another statistic, but first... consider studentized residuals.

Studentized Residuals

• In this section, we’re looking for outliers in the Y-direction for a given combination of independent variables (x-vector).

• Such an observation would have an unusually large residual, and we call it a regression outlier.

• If you have two predictors, the mean structure is a plane, and you’d be looking for observations that fall exceptionally far from the plane compared to the other observations.

• First, we start with standardized residuals.

• Though we assume the errors in our model have constant variance [$\epsilon_i \sim N(0, \sigma^2)$], the estimated errors, or the sample residuals $e_i$, DO NOT have equal variance...

• Observations with high leverage tend to have smaller residuals. This is an artifact of our model fitting. These points can pull the fitted model close to them, giving them a tendency toward smaller residuals.

• $\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$

• We know... $1/n \le h_i \le 1$.

• With this variance, we can form standardized residuals $e'_i$, which all have equal variance:

  $$e'_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}} = \frac{e_i}{S_E\sqrt{1 - h_i}}$$

  but the distribution of $e'_i$ isn’t a t because the numerator and denominator are not independent (independence is part of the theory of a t statistic).

We can get away from the problem by estimating $\sigma^2$ with a sum of squares that does not include the ith residual.

We will use subscript (−i) to indicate quantities calculated without case i...

  – Delete the ith observation, re-fit the model based on the remaining n − 1 observations, and get

    $$\hat{\sigma}^2_{(-i)} = \frac{RSS_{(-i)}}{n - 1 - k - 1} \quad \text{or} \quad S_{E(-i)} = \sqrt{\frac{RSS_{(-i)}}{n - 1 - k - 1}}$$

    where $RSS_{(-i)}$ is the residual sum of squares from that re-fit.

• This gives us the studentized residual

  $$e^*_i = \frac{e_i}{S_{E(-i)}\sqrt{1 - h_i}}$$

• Now, the $\hat{\sigma}$ in the denominator, or $S_{E(-i)}$, is not correlated with the numerator and

  $$e^*_i \sim t_{n-1-k-1}$$
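As a check on the two definitions, the sketch below computes the standardized and studentized residuals by hand and compares them with R’s rstandard() and rstudent(). It assumes the built-in mtcars data as a stand-in model, and it re-fits the model n times purely for illustration.

```r
# Stand-in model on built-in data (an assumption; not the lecture's data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)
e <- residuals(fit)
h <- hatvalues(fit)
n <- length(e)
k <- length(coef(fit)) - 1

# Standardized residual: uses S_E from the full fit
s_full <- sqrt(sum(e^2) / (n - k - 1))
e_std <- e / (s_full * sqrt(1 - h))

# Studentized residual: S_E(-i) comes from the fit without case i
s_del <- sapply(seq_len(n), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  sqrt(sum(residuals(fit_i)^2) / (n - 1 - k - 1))
})
e_stud <- e / (s_del * sqrt(1 - h))

max(abs(e_std - rstandard(fit)))   # numerically zero
max(abs(e_stud - rstudent(fit)))   # numerically zero
```

rstudent() obtains the same values without re-fitting, so the explicit loop is only needed to see where the (−i) quantities come from.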
$$e^*_i \sim t_{n-1-k-1}$$

• We can now use the t-distribution to judge what is a large studentized residual, or how likely we are to get a studentized residual as far away from 0 as the ones we get.

• As a note, if the ith observation has a large residual and we left it in the computation of $S_E$, it may greatly inflate $S_E$, deflating the standardized residual $e'_i$ and making it hard to notice that it was large.

• The standardized residual is also referred to as the internally studentized residual.

• The studentized residual listed here is also referred to as the externally studentized residual.

• Test for outliers by comparing $e^*_i$ to a t distribution with n − k − 2 df and applying a Bonferroni correction (multiply the p-values by the number of residuals).

• Really, we only have to be concerned about the largest studentized residuals. Perhaps take a closer look at those with $|e^*_i| > 2$.
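The Bonferroni outlier test described here can be carried out with base R alone. This is a minimal sketch, assuming the built-in mtcars data as a stand-in model; the two-sided p-value and the cap at 1 are the usual conventions, stated here as assumptions.

```r
# Stand-in model on built-in data (an assumption; not the lecture's data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)

e_stud <- rstudent(fit)
n <- length(e_stud)
k <- length(coef(fit)) - 1
df <- n - k - 2                      # same as n - 1 - k - 1

# Two-sided p-value for each studentized residual
p_unadj <- 2 * pt(abs(e_stud), df = df, lower.tail = FALSE)

# Bonferroni: multiply by the number of residuals, capped at 1
p_bonf <- pmin(1, n * p_unadj)

# Report the largest |e*_i| with both p-values
i_max <- which.max(abs(e_stud))
data.frame(rstudent   = e_stud[i_max],
           unadjusted = p_unadj[i_max],
           bonferroni = p_bonf[i_max])
```

Only the largest absolute studentized residual usually needs to be examined, since every other observation has a larger adjusted p-value.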

• Example: Returning to the Duncan model:

> nrow(Duncan)
[1] 45
> lm.out=lm(prestige ~ education + income)

> sort(rstudent(lm.out))
-2.3970223990 -1.9309187757 -1.7604905300 ...

Can look at adjusted p-values from $t_{41}$, but there is a built-in function in the car library to help with this...

Get the adjusted p-value for the largest $|e^*_i|$:

> outlierTest(lm.out,labels=row.names(Duncan))

max|rstudent| = 3.134519, degrees of freedom = 41,
unadjusted p = 0.003177202, Bonferroni p = 0.1429741

Observation: minister

Since the Bonferroni p = 0.1429741 (that is, 45 × 0.003177202) is larger than 0.05, we would conclude that this model doesn’t have any extreme residuals.

• An outlier may indicate a sample peculiarity or may indicate a data entry error. It may suggest an observation belongs to another ‘population’.

• Fitting a model with and without an observation gives you a feel for the sensitivity of the fitted model to the observation.

• If an observation has a big impact on the fitted model and it’s not justified to remove it, one option is to report both models (with and without), commenting on the differences.

• An analysis called Robust Regression will allow you to leave the observation in, while reducing its impact on the fitted model.

  This method essentially ‘weights’ the observations differently when computing the least squares estimates.
Measuring influence

• We’ve described a point with high leverage as having the potential to greatly influence the fitted model.

• To greatly influence the fitted model, a point has high leverage and its Y-value is not ‘in-line’ with the general trend of the data (has high leverage and looks ‘odd’).

• High leverage + large studentized residual = High influence

• To check influence, we can delete an observation, and see how much the fitted regression coefficients change. A large change suggests high influence.

  Difference $= \hat{\beta}_j - \hat{\beta}_{j(-i)}$

• Fortunately this can be done analytically (and we don’t have to re-fit the model n times).

We will use subscript (−i) to indicate quantities calculated without case i...

• DFBETAS - effect of $Y_i$ on a single estimated coefficient

  $$DFBETAS_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(-i)}}{SE(\hat{\beta}_{j(-i)})}$$

  |DFBETAS| > 1 is considered large in a small or medium sized sample

  |DFBETAS| > $2n^{-1/2}$ is considered large in a big sample
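To see what the analytic shortcut is replacing, the sketch below computes DFBETAS by brute-force deletion and compares the result with R’s dfbetas(). It assumes the built-in mtcars data as a stand-in model. One detail to note: R scales the coefficient change by the deleted-case error estimate $S_{E(-i)}$ times the square root of the jth diagonal element of $(X'X)^{-1}$ from the full data, and the sketch follows that convention.

```r
# Stand-in model on built-in data (an assumption; not the lecture's data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

# Diagonal of (X'X)^{-1} from the full design matrix
xtx_inv_diag <- diag(solve(crossprod(model.matrix(fit))))

# Leave-one-out: re-fit without case i and scale the coefficient change
dfb <- t(sapply(seq_len(n), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  s_del <- summary(fit_i)$sigma            # S_E(-i)
  (coef(fit) - coef(fit_i)) / (s_del * sqrt(xtx_inv_diag))
}))
rownames(dfb) <- rownames(mtcars)

max(abs(dfb - dfbetas(fit)))               # numerically zero

# Cases exceeding the small-sample guide |DFBETAS| > 1 for any slope
rownames(dfb)[apply(abs(dfb[, -1, drop = FALSE]) > 1, 1, any)]
```

The sign convention is full-data estimate minus deleted estimate, so a positive value means the case pulls that coefficient upward.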

• DFFITS - effect of ith case on fitted value for $Y_i$

  $$DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(-i)}}{S_{E(-i)}\sqrt{h_i}} = e^*_i \sqrt{\frac{h_i}{1 - h_i}}$$

  |DFFITS| > 1 is considered large in a small or medium sized sample

  |DFFITS| > $2\sqrt{\frac{k+1}{n}}$ is considered large in a big sample

  Look at DFFITS relative to each other, in other words, look for large values.

• COOKSD - effect on all fitted values

  Based on:
    how far x-values are from the mean of the x’s
    how far $Y_i$ is from the regression line

  $$COOKSD_i = \frac{\sum_j (\hat{Y}_j - \hat{Y}_{j(-i)})^2}{(k + 1) S_E^2} = \frac{e_i^2\, h_i}{S_E^2 (k + 1)(1 - h_i)^2} = \frac{e_i'^2}{k + 1} \cdot \frac{h_i}{1 - h_i}$$

  Here $e'_i$ is the standardized residual, since $S_E$ is computed from the full data.

  So, COOKSD can be high if $h_i$ is very large (close to 1) and $e_i'^2$ is moderate, or if $e_i'^2$ is very large and $h_i$ is moderate, or if they’re both extreme.
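The closed-form expressions for DFFITS and Cook’s distance can be verified against R’s dffits() and cooks.distance(). This is a minimal sketch, assuming the built-in mtcars data as a stand-in model; note that Cook’s distance is built from the standardized (internally studentized) residual, while DFFITS uses the studentized one.

```r
# Stand-in model on built-in data (an assumption; not the lecture's data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
n <- length(h)
k <- length(coef(fit)) - 1

# DFFITS = studentized residual * sqrt(h / (1 - h))
dffits_manual <- rstudent(fit) * sqrt(h / (1 - h))

# Cook's D = standardized residual^2 / (k + 1) * h / (1 - h)
cooks_manual <- rstandard(fit)^2 / (k + 1) * h / (1 - h)

max(abs(dffits_manual - dffits(fit)))          # numerically zero
max(abs(cooks_manual - cooks.distance(fit)))   # numerically zero

# Cases beyond the rough guides quoted in the notes
names(which(abs(dffits_manual) > 2 * sqrt((k + 1) / n)))
names(which(cooks_manual > 4 / (n - k - 1)))
```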
  COOKSD > $\frac{4}{n - k - 1}$: unusually influential case.

  Look at COOKSD relative to each other.

• Example: Returning to the Duncan model:

> influence.measures(lm.out)

Influence measures of
lm(formula = prestige ~ education + income) :

     dfb.1_  dfb.edct  dfb.incm    dffit cov.r   cook.d    hat inf
1 -2.25e-02  0.035944  6.66e-04 0.070398 1.125 1.69e-03 0.0509
2 -2.54e-02 -0.008118  5.09e-02 0.084067 1.131 2.41e-03 0.0573
3 -9.19e-03  0.005619  6.48e-03 0.019768 1.155 1.33e-04 0.0696
4 -4.72e-05  0.000140 -6.02e-05 0.000187 1.150 1.20e-08 0.0649
5 -6.58e-02  0.086777  1.70e-02 0.192261 1.078 1.24e-02 0.0513
.
.
.

You get a DFBETA for each regression coefficient and each observation.

We’re not really interested in the intercept DFBETA though. That said, a bivariate plot of the other DFBETAS may be useful...

> plot( dfbetas(lm.out)[,c(2,3)],pch=16)
> identify(dfbetas(lm.out)[,2],dfbetas(lm.out)[,3],
    row.names(Duncan))

[Figure: DFBETAS for income against DFBETAS for education, with ‘RR.engineer’, ‘conductor’, and ‘minister’ labeled]

The ‘minister’ observation may have a large impact on both regression coefficients (using the > 1 threshold).

The Cook’s distance is perhaps most commonly looked at (effect on all fitted values).

> plot(cooks.distance(lm.out),pch=16,cex=1.5)
> abline(h=4/(45-2-1),lty=2)
> identify(1:45,cooks.distance(lm.out),row.names(Duncan))

[Figure: Cook’s distance against Index, with ‘minister’ and ‘conductor’ labeled]

We can bring together leverage, studentized residuals and Cook’s distance in the following “Bubble plot”:

## Plot Residuals vs. Leverage:
> plot(hatvalues(lm.out),rstudent(lm.out),type="n")
> cook=sqrt(cooks.distance(lm.out))
> points(hatvalues(lm.out),rstudent(lm.out),
    cex=10*cook/max(cook))

## Include diagnostic thresholds:
> abline(h=c(-2,0,2),lty=2)
> abline(v=c(2,3)*3/45,lty=2)

## Overlay names:
> identify(hatvalues(lm.out),rstudent(lm.out),
    row.names(Duncan))
[Figure: bubble plot of rstudent(lm.out) against hatvalues(lm.out), with ‘minister’, ‘RR.engineer’, ‘conductor’, and ‘reporter’ labeled]

• High Leverage: conductor, minister, RR engineer

• Large regression residuals: minister, reporter

• ‘minister’ has the highest influence on the fitted model

It turns out, R will do some work for us in our diagnostic plotting...

> par(mfrow=c(2,2))
> plot(lm.out)

[Figure: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook’s distance contours at 0.5 and 1)]

1. Residuals vs. fitted
   (checking constant variance)

2. Normal QQ plot
   (checking normality)

3. Scale-Location plot
   (similar to 1., but plots the square root of the absolute standardized residuals, $\sqrt{|e'_i|}$, vs. $\hat{Y}_i$)

4. Residuals vs. Leverage
   (checking influence, plus it gives you a contour plot of COOKSD)

Jointly influential data points

• The previous section considered the impact that individual data points can have on the fitted regression coefficients.

• Their impact can be seen with leave-one-out deletion diagnostics. See...
> influence.measures(lm.out)

• But data points can have a joint influence on the fitted model.

• Graphical methods can help identify such subsets.

• For this purpose, we will use the Partial Regression Plot that we saw earlier (in the multiple regression introduction). It is also known as the Added Variable Plot.
Recall an Added Variable Plot for $X_i$

For $X_i$,

1. get the residuals from the regression of Y on all predictors except $X_i$, and plot on the Y axis (what’s left for $X_i$ to explain).
2. get the residuals from the regression of $X_i$ on all other predictors, and plot on the X axis (the non-redundant part of $X_i$ in the model).
3. The plot provides information on adding the $X_i$ variable to the model.

The ‘special’ simple linear regression related to the plot above (i.e. of Residuals(Y ∼ X(−Xi)) on Residuals(Xi ∼ X(−Xi))) gives the multiple regression coefficient for $X_i$.

After we fit the regression to this plot, the residuals are the same as the full model multiple regression residuals.

As this ‘special’ plot is used to fit the multiple regression coefficient, the relationship we see should be linear.

--------------------------------------------------

• In the Duncan data set, the data point ‘minister’ had the most influence when considering leave-one-out diagnostics.

MINISTER DATA POINT:
This data point had a large positive studentized residual... the prestige for the ‘minister’ was higher than expected for the given income and education (seems reasonable for the occupation though).
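The added variable plot construction can be reproduced by hand in a few lines. The sketch below assumes the built-in mtcars data as a stand-in, with wt playing the role of $X_i$ and hp the only other predictor; the object names are illustrative.

```r
# Stand-in model on built-in data (an assumption; not the lecture's data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Step 1: residuals of Y on all predictors except wt
res_y <- residuals(lm(mpg ~ hp, data = mtcars))

# Step 2: residuals of wt on all the other predictors
res_x <- residuals(lm(wt ~ hp, data = mtcars))

# The 'special' simple linear regression behind the plot
av_fit <- lm(res_y ~ res_x)

coef(av_fit)["res_x"]    # slope of the added variable regression
coef(fit)["wt"]          # multiple regression coefficient for wt (same value)

max(abs(residuals(av_fit) - residuals(fit)))   # residuals agree, numerically zero
```

Calling plot(res_x, res_y) followed by abline(av_fit) draws the plot itself; points far from the bulk in the horizontal direction are the ones with leverage on this particular coefficient.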

This data point also had an odd combination of income/education, with a low income for the given amount of education (high leverage, but the combination again seems reasonable).

This led to ‘minister’ being highly influential.

Let’s fit the model with and without this data point...

And let’s see if it is jointly influential by considering added variable plots...

• The fitted regression coefficients with the ‘minister’ data point:

> summary(lm.out)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.06466    4.27194  -1.420    0.163
education    0.54583    0.09825   5.555 1.73e-06 ***
income       0.59873    0.11967   5.003 1.05e-05 ***

• The fitted regression coefficients without the ‘minister’ data point:

> summary(lm(prestige[-6]~education[-6]+income[-6]))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -6.62751    3.88753  -1.705   0.0958 .
education[-6]  0.43303    0.09629   4.497 5.56e-05 ***
income[-6]     0.73155    0.11674   6.266 1.81e-07 ***

The education coefficient is lower without ‘minister’.

The income coefficient is higher without ‘minister’.
(These changes are understandable when we look at the added variable plots.)
• Example: Returning to Duncan model:

> avPlots(lm.out,"education",labels=row.names(Duncan),
    id.method=cooks.distance(lm.out),id.n=2)

[Figure: added variable plot of prestige | others against education | others, with ‘minister’ and ‘conductor’ labeled]

This is the plot used to fit the education multiple regression coefficient.

Here we see how the multiple regression coefficient for education is higher when ‘minister’ is included (influential point).

The ‘conductor’ data point influences the fitted line in the same direction (removal of both at once may show a large change in the education regression coefficient).

> avPlots(lm.out,"income",labels=row.names(Duncan),
    id.method=cooks.distance(lm.out),id.n=2)

[Figure: added variable plot of prestige | others against income | others, with ‘minister’ and ‘conductor’ labeled]

Here we see that ‘minister’ and ‘conductor’ work together to flatten the income multiple regression coefficient. Removal of both at once may show a large change in the income regression coefficient.
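Joint influence can also be checked directly by deleting the suspect points one at a time and then together. The sketch below assumes the built-in mtcars data as a stand-in and simply picks the two cases with the largest Cook’s distance; the helper drop_rows() is hypothetical and defined only for this illustration.

```r
# Stand-in model on built-in data (an assumption; not the lecture's data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two cases with the largest Cook's distance
top2 <- names(sort(cooks.distance(fit), decreasing = TRUE))[1:2]

# Hypothetical helper: re-fit without the named rows, return the coefficients
drop_rows <- function(rows) {
  keep <- !(rownames(mtcars) %in% rows)
  coef(lm(mpg ~ wt + hp, data = mtcars[keep, ]))
}

# Coefficients with all data, without each case, and without both
tab <- rbind(all            = coef(fit),
             without_first  = drop_rows(top2[1]),
             without_second = drop_rows(top2[2]),
             without_both   = drop_rows(top2))
tab
```

If the change in the last row is much larger than either single deletion suggests, the two points are acting jointly, which leave-one-out diagnostics alone would not reveal.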

Removing outliers (careful)

• Removal of an outlier should only be done after careful investigation, and perhaps discussion.

• It shouldn’t be removed if it’s not an error, or if it can’t be shown that the data point belongs to a different ‘population’.

• Duncan Data set:

  When reporting the analysis, one could report both analyses (with and without ‘minister’). Otherwise, if all involved agree that ‘minister’ is not representative of the population at hand, it could be removed.