We’ve seen that there are certain assumptions made when we model the data. These scatterplots can be useful, but they only show pairwise relationships.

Outliers, Leverage, and Influence

• In general, an outlier is any unusual data point. It helps to be more specific...

• A strong outlier with respect to the independent variable(s) is said to have high leverage.

• A strong outlier with respect to the dependent variable falls far from the fitted model.
[Figure: scatterplot of repwt vs. weight (Davis data); one point lies far from the rest of the data.]

[Figure: cig.data$Nic vs. cig.data$Tar with the fitted line; the top-right point has high leverage.]
• Above is the fitted line with the inclusion of the point of high leverage (the top-right data point is far outside the distribution of x-values).

• The second picture shows a much better fit to the bulk of the data. But one must be careful about removing outliers. In this case it was a value that was reported incorrectly, and removal was justified.

• Here is the fitted line after removing the outlier:

• The blue point below is an outlier, but not with respect to the independent variable (not high leverage).
[Figure: cig.data$Nic vs. cig.data$Tar, fitted line after removing the outlier.]

[Figure: y vs. x scatterplot; the blue point is an outlier in y but lies within the range of x-values.]
• Notice that this outlier did not have a big influence on the fitted line.
[Figure: fitted line for y vs. x with the outlier included; the line is barely changed.]
• What about higher dimensions (more predictors)?

• A point with high leverage will be outside of the ‘cloud’ of observed x-values.

• With 2 predictors X1 and X2 (no response shown):

[Figure: scatterplot of X2 vs. X1; the point at the top left has high leverage because it is outside the cloud of (X1, X2) independent values.]

Assessing Leverage: Hat Values

• Beyond graphics, we have a quantity called the hat value, which is a measure of leverage.

• Leverage measures “potential” influence.

  – In matrix notation for OLS, we have:

        Ŷ = X β̂ = X(X′X)⁻¹X′ Y

    where X(X′X)⁻¹X′ is an n × n matrix called the Hat matrix, H, so that

        Ŷ = H Y
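The matrix form above is easy to check numerically. A sketch (not part of the original notes), using simulated data with k = 2 predictors, building H by hand and comparing against R's built-in hatvalues():

```r
## A sketch on simulated data: verify Y-hat = H Y and h_i = H[i,i]
set.seed(1)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)                  # model matrix (intercept + 2 predictors)
H <- X %*% solve(t(X) %*% X) %*% t(X)  # n x n Hat matrix

fit <- lm(y ~ x1 + x2)
all.equal(as.vector(H %*% y), unname(fitted(fit)))  # TRUE: Y-hat = H Y
all.equal(unname(diag(H)), unname(hatvalues(fit)))  # TRUE: h_i = H[i,i]
```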
  – The jth fitted value is a weighted sum of all the observations:

        Ŷj = Σᵢ hij Yi,   i = 1, …, n

  – The value hij captures the contribution of observation Yi to the fitted value of the jth observation, Ŷj (consider hij a weight).

  – If hij is large, Yi CAN have a substantial impact on the jth fitted value.

  – It can be shown that...

        hi = hii = Σⱼ hij²,   j = 1, …, n

    * Σᵢ hi = (k + 1)  (where k is the number of predictors)

    * h̄ = (k + 1)/n

• We can compare hi to the average. If it is much larger, then the point has high leverage.

• Often use 2(k + 1)/n or 3(k + 1)/n as a guide for determining high leverage.

> nrow(Duncan)
[1] 45
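The two properties above, Σᵢ hi = k + 1 and h̄ = (k + 1)/n, can be checked directly. A sketch (simulated data, not the Duncan data):

```r
## A sketch on simulated data: the hat values sum to k + 1,
## so their mean is (k + 1)/n
set.seed(2)
n <- 45; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rnorm(n)
fit <- lm(y ~ x1 + x2)

h <- hatvalues(fit)
sum(h)    # (k + 1) = 3, up to rounding
mean(h)   # (k + 1)/n = 3/45

## rule-of-thumb flag for high leverage
high <- which(h > 2*(k + 1)/n)
```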
[Figure: Duncan data, income vs. education; ‘conductor’ and ‘minister’ stand apart from the cloud of points.]

• If you have two predictors, the mean structure is a plane, and you’d be looking for observations that fall exceptionally far from the rest of the independent variables (the x-vector).

Plot the hat values against the indices 1 to 45, and include thresholds for 2 ∗ (k + 1)/n and 3 ∗ (k + 1)/n:

> plot(hatvalues(lm.out),pch=16,cex=2)
> abline(h=2*3/45,lty=2)
> abline(h=3*3/45,lty=2)
> identify(1:45,hatvalues(lm.out),row.names(Duncan))

[Figure: hatvalues(lm.out) vs. Index with the two dashed thresholds; ‘conductor’ and ‘minister’ are labeled above them.]
• Observations with high leverage tend to have smaller residuals. This is an artifact of our model fitting: these points can pull the fitted model close to them, giving them a tendency toward smaller residuals.

• Var(ei) = σ²(1 − hi)

• We know... 1/n ≤ hi ≤ 1.

We can get away from the problem by estimating σ² with a sum of squares that does not include the ith residual. We will use subscript (−i) to indicate quantities calculated without case i...

  – Delete the ith observation, re-fit the model based on n − 1 observations, and get

        σ̂²(−i) = RSS(−i) / (n − 1 − k − 1)
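The deleted estimate σ̂(−i) above can be obtained by literally re-fitting without case i, and R also stores it for every case in influence(). A sketch (simulated data, not the Duncan data):

```r
## A sketch on simulated data: sigma-hat_(-1) by re-fitting without case 1
set.seed(3)
n <- 30; k <- 1
x <- rnorm(n)
y <- 1 + x + rnorm(n)
fit <- lm(y ~ x)

## delete case 1 and re-fit on n - 1 observations
fit1  <- lm(y[-1] ~ x[-1])
s_del <- sqrt(sum(resid(fit1)^2) / (n - 1 - k - 1))

## R stores the same quantity for every case in influence()$sigma
all.equal(s_del, unname(influence(fit)$sigma[1]))  # TRUE
```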
• Example: Returning to the Duncan model:

> nrow(Duncan)
[1] 45
> lm.out=lm(prestige ~ education + income)
> sort(rstudent(lm.out))
-2.3970223990 -1.9309187757 -1.7604905300 ...

We can look at adjusted p-values from t41, but there is a built-in function in the car library to help with this...

Get the adjusted p-value for the largest |e∗i|:

> outlierTest(lm.out,row.names(Duncan))

• An outlier may indicate a sample peculiarity or may indicate a data entry error. It may suggest an observation belongs to another ‘population’.

• Fitting a model with and without an observation gives you a feel for the sensitivity of the fitted model to the observation.

• If an observation has a big impact on the fitted model and it’s not justified to remove it, one option is to report both models (with and without), commenting on the differences. For each coefficient, the quantity of interest is the difference:

        Difference = β̂j − β̂j(−i)
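The Bonferroni-adjusted p-value that outlierTest() reports can also be computed by hand from rstudent(), since each studentized residual follows a t distribution with n − k − 2 df. A sketch (simulated data, not the Duncan data):

```r
## A sketch on simulated data: Bonferroni-adjusted p-value for the
## largest |e*_i|, done by hand
set.seed(4)
n <- 45; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

e_star <- rstudent(fit)
i_max  <- which.max(abs(e_star))       # case with the largest |e*_i|

## two-sided p-value from t with n - k - 2 df, then adjust for n tests
p_raw  <- 2 * pt(abs(e_star[i_max]), df = n - k - 2, lower.tail = FALSE)
p_bonf <- min(1, n * p_raw)
```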
• DFFITS - effect of the ith case on the fitted value for Yi:

        DFFITSi = (Ŷi − Ŷi(−i)) / (SE(−i) √hi) = e∗i √(hi / (1 − hi))

  |DFFITS| > 1 is considered large in a small or medium sized sample.

  |DFFITS| > 2 √((k + 1)/n) is considered large in a big sample.

  Look at DFFITS relative to each other; in other words, look for large values.

• COOKSD - effect on all fitted values. Based on:

  – hi: how far the x-values are from the mean of the x’s

  – the residual: how far Yi is from the regression line

        COOKSDi = Σⱼ (Ŷj − Ŷj(−i))² / ((k + 1) SE²)
                = ei² hi / ((k + 1) SE² (1 − hi)²)
                = (e∗i² / (k + 1)) · (hi / (1 − hi))
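The algebraic identities above can be checked against R's built-in dffits() and cooks.distance(). A sketch (simulated data, not the Duncan data):

```r
## A sketch on simulated data: DFFITS and Cook's distance identities
set.seed(5)
n <- 30; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

h      <- hatvalues(fit)
e      <- resid(fit)
e_star <- rstudent(fit)          # studentized residuals
s      <- summary(fit)$sigma     # S_E from the full fit

## DFFITS_i = e*_i sqrt(h_i / (1 - h_i))
all.equal(unname(dffits(fit)), unname(e_star * sqrt(h/(1 - h))))  # TRUE

## COOKSD_i = e_i^2 h_i / ((k + 1) S_E^2 (1 - h_i)^2)
D <- e^2 * h / ((k + 1) * s^2 * (1 - h)^2)
all.equal(unname(cooks.distance(fit)), unname(D))                 # TRUE
```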
COOKSD > 4/(n − k − 1) suggests an unusually influential case.

The Cook’s distance is perhaps the most commonly examined (effect on all fitted values).

> plot(cooks.distance(lm.out),pch=16,cex=1.5)
> abline(h=4/(45-2-1),lty=2)
> identify(1:45,cooks.distance(lm.out),row.names(Duncan))

[Figure: cooks.distance(lm.out) vs. Index; ‘minister’ and ‘conductor’ stand out above the threshold.]

> influence.measures(lm.out)

Influence measures of
  lm(formula = prestige ~ education + income) :
(columns: the DFBETA for each coefficient, DFFITS, the covariance ratio, Cook’s distance, and the hat value)

1 -2.25e-02  0.035944  6.66e-04 0.070398 1.125 1.69e-03 0.0509
2 -2.54e-02 -0.008118  5.09e-02 0.084067 1.131 2.41e-03 0.0573
3 -9.19e-03  0.005619  6.48e-03 0.019768 1.155 1.33e-04 0.0696
4 -4.72e-05  0.000140 -6.02e-05 0.000187 1.150 1.20e-08 0.0649
5 -6.58e-02  0.086777  1.70e-02 0.192261 1.078 1.24e-02 0.0513
.
.
.

You get a DFBETA for each regression coefficient and each observation.

> plot(dfbetas(lm.out)[,c(2,3)],pch=16)
> identify(dfbetas(lm.out)[,2],dfbetas(lm.out)[,3],
  row.names(Duncan))

[Figure: dfbetas for income vs. education; ‘RR.engineer’, ‘conductor’, and ‘minister’ are labeled.]

We can bring together leverage, studentized residuals, and Cook’s distance in the following “bubble plot”:

## Plot Residuals vs. Leverage:
> plot(hatvalues(lm.out),rstudent(lm.out),type="n")
> cook=sqrt(cooks.distance(lm.out))
> points(hatvalues(lm.out),rstudent(lm.out),
  cex=10*cook/max(cook))
> abline(h=c(-2,0,2),lty=2)
> abline(v=c(2,3)*3/45,lty=2)

## Overlay names:
> identify(hatvalues(lm.out),rstudent(lm.out),
  row.names(Duncan))
[Figure: bubble plot of rstudent(lm.out) vs. hatvalues(lm.out); ‘minister’, ‘RR.engineer’, ‘conductor’, and ‘reporter’ are labeled.]

It turns out R will do some of this work for us in our diagnostic plotting...

> par(mfrow=c(2,2))
> plot(lm.out)

[Figure: the four default lm diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; observations such as 6, 9, 16, and 17 are flagged.]

• High Leverage: conductor, minister, RR.engineer

• Large regression residuals: minister, reporter
This data point also had an odd combination of income/education, with a low income for the given amount of education (high leverage, but the combination again seems reasonable).

This led to minister being highly influential.

Let’s fit the model with and without this data point...

And let’s see if it is jointly influential by considering added variable plots...

• The fitted regression coefficients with the ‘minister’ data point:

> summary(lm.out)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.06466    4.27194  -1.420    0.163
education    0.54583    0.09825   5.555 1.73e-06 ***
income       0.59873    0.11967   5.003 1.05e-05 ***

• The fitted regression coefficients without the ‘minister’ data point:

> summary(lm(prestige[-6]~education[-6]+income[-6]))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -6.62751    3.88753  -1.705   0.0958 .
education[-6]  0.43303    0.09629   4.497 5.56e-05 ***
income[-6]     0.73155    0.11674   6.266 1.81e-07 ***
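The with/without comparison above is exactly what a DFBETA measures: the change in each coefficient from deleting one case. A sketch (simulated data rather than the Duncan data, dropping case 6 as with ‘minister’) verifying that this change matches R's dfbeta():

```r
## A sketch on simulated data: beta-hat minus beta-hat_(-6)
## equals row 6 of dfbeta()
set.seed(6)
n <- 45
educ  <- rnorm(n); inc <- rnorm(n)
prest <- 1 + 0.5*educ + 0.6*inc + rnorm(n)
fit <- lm(prest ~ educ + inc)

## re-fit without case 6 and take the coefficient differences
fit6  <- lm(prest[-6] ~ educ[-6] + inc[-6])
delta <- coef(fit) - coef(fit6)

all.equal(unname(delta), unname(dfbeta(fit)[6, ]))  # TRUE
```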
• Example: Returning to the Duncan model:

> avPlots(lm.out,"education",labels=row.names(Duncan),
  id.method=cooks.distance(lm.out),id.n=2)

[Figure: added-variable plot of prestige | others vs. education | others; ‘minister’ and ‘conductor’ are labeled.]

This is the plot used to fit the education multiple regression coefficient.

The ‘conductor’ data point influences the fitted line in the same direction (removal of both at once may show a large change in the education regression coefficient).

> avPlots(lm.out,"income",labels=row.names(Duncan),
  id.method=cooks.distance(lm.out),id.n=2)

[Figure: added-variable plot of prestige | others vs. income | others; ‘minister’ and ‘conductor’ are labeled.]