To cite this article: A. H. M. Rahmatullah Imon & Ali S. Hadi (2008) Identification of Multiple
Outliers in Logistic Regression, Communications in Statistics - Theory and Methods, 37:11,
1697-1709, DOI: 10.1080/03610920701826161
Logistic regression modeling has received a great deal of attention in the
literature in recent years. This attention covers all aspects of the logistic
regression model, including the identification of outliers. A variety of methods
for the identification of outliers, such as the standardized Pearson residuals, are
now available in the literature. These methods, however, are successful only if the
data contain a single outlier. In the presence of multiple outliers in the data,
which is often the case in practice, these methods fail to detect the outliers
because of the well-known masking (false negative) and swamping (false positive)
effects. In this
article, we propose a new method for the identification of multiple outliers in logistic
regression. We develop a generalized version of standardized Pearson residuals
based on group deletion and then propose a technique for identifying multiple
outliers. The performance of the proposed method is then investigated through
several examples.
1. Introduction
Diagnostic methods are commonly used in all branches of regression analysis.
In recent years diagnostics have become an essential part of logistic regression
(see Hosmer and Lemeshow, 2000). We often observe that outliers greatly affect the
covariate pattern, and consequently their presence can mislead our interpretation.
So we need to detect such observations and study their impact on the model. In
Sec. 2, we introduce some commonly used diagnostics based on residuals for the
$$Y = \pi(X) + \epsilon \tag{2.1}$$
where
$$\pi(X) = \frac{\exp(Z)}{1 + \exp(Z)}, \tag{2.2}$$
with $Z$ denoting the linear predictor $X\beta$. The $i$th residual is
$$\hat{\epsilon}_i = y_i - \hat{\pi}_i, \qquad i = 1, 2, \ldots, n. \tag{2.3}$$
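As a rough sketch of these quantities in Python (assuming `numpy` and `statsmodels` are available; the simulated data and all variable names here are ours, not from the paper):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data standing in for a real binary-response data set.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = sm.add_constant(x)                      # design matrix with intercept column
y = rng.binomial(1, 1 / (1 + np.exp(-x)))   # Bernoulli response, Eqs. (2.1)-(2.2)

fit = sm.Logit(y, X).fit(disp=0)            # maximum likelihood logistic fit
pi_hat = fit.predict(X)                     # fitted probabilities pi_hat_i
resid = y - pi_hat                          # raw residuals, Eq. (2.3)
```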
In linear regression the hat matrix plays an extremely important role in the analysis.
This is the matrix that provides the fitted values as the projection of the outcome
variable into the covariate space. The linear regression residuals, $Y - \hat{Y}$, are often
expressed in terms of the hat matrix, and this forms the basis of many diagnostics. Using
weighted least squares linear regression as a model, Pregibon (1981) derived a linear
approximation to the fitted values, which yields the hat matrix for logistic
regression:
$$H = V^{1/2} X \left(X^T V X\right)^{-1} X^T V^{1/2}, \tag{2.4}$$
where $V$ is the diagonal matrix of estimated variances $v_i = \hat{\pi}_i(1 - \hat{\pi}_i)$ and
$x_i^T = (1, x_{1i}, x_{2i}, \ldots, x_{pi})$ is the $1 \times k$ vector of observations corresponding to
the $i$th case.
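Continuing the numerical sketch above, the hat matrix of (2.4) and its diagonal leverages might be computed as follows (again, the names are ours):

```python
# Eq. (2.4): H = V^(1/2) X (X^T V X)^(-1) X^T V^(1/2), with V = diag(v_i).
v = pi_hat * (1 - pi_hat)                 # v_i = pi_hat_i (1 - pi_hat_i)
W = np.sqrt(v)[:, None] * X               # V^(1/2) X
H = W @ np.linalg.inv(X.T @ (v[:, None] * X)) @ W.T
h = np.diag(H)                            # leverages h_i used below
```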
In logistic regression, the residuals measure the extent of ill-fitted
factor/covariate patterns. Hence, the observations possessing large residuals are
suspect outliers. But at this stage a natural question comes to our mind: How
big is big? The residuals defined in (2.3) are unscaled, which is why they are not
readily applicable for detecting outliers. Let us now introduce some scaled versions
of the above residuals that are commonly used in diagnostics for the identification
of outliers.
The Pearson residuals are the components of the Pearson chi-square statistic and
can be used to detect ill-fitted factor/covariate patterns. In linear regression a key
assumption is that the error variance does not depend on the conditional mean
$E(y_i \mid x_i)$. However, in logistic regression we have Bernoulli errors, and as a result
the error variance is a function of the conditional mean $E(y_i \mid x_i) = \pi_i$, i.e.,
$$\operatorname{Var}(y_i \mid x_i) = v_i = \hat{\pi}_i \left(1 - \hat{\pi}_i\right). \tag{2.6}$$
The Pearson residual for the $i$th factor/covariate pattern is given by
$$r_i = \frac{y_i - \hat{\pi}_i}{\sqrt{v_i}}, \qquad i = 1, 2, \ldots, n. \tag{2.7}$$
Using the hat matrix, the residuals can be approximated as
$$\hat{\epsilon}_i = y_i - \hat{\pi}_i \approx (1 - h_i)\, y_i \tag{2.8}$$
with
$$V(\hat{\epsilon}_i) = v_i (1 - h_i), \tag{2.9}$$
where $h_i$ is the $i$th diagonal element of $H$,
which suggests that the Pearson residuals do not have variance equal to 1. For this
reason we could use the standardized Pearson residuals defined as
$$r_{si} = \frac{y_i - \hat{\pi}_i}{\sqrt{v_i (1 - h_i)}}, \qquad i = 1, 2, \ldots, n. \tag{2.10}$$
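In the same sketch, the Pearson and standardized Pearson residuals of (2.7) and (2.10) then follow directly from the quantities computed above:

```python
r = (y - pi_hat) / np.sqrt(v)              # Pearson residuals, Eq. (2.7)
rs = (y - pi_hat) / np.sqrt(v * (1 - h))   # standardized Pearson residuals, Eq. (2.10)
```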
For a deletion set $D$ of suspect cases, with $R$ denoting the set of remaining cases,
the deletion residuals are
$$\hat{\epsilon}_i^{(-D)} = y_i - \hat{\pi}_i^{(-D)}, \qquad i = 1, 2, \ldots, n, \tag{3.2}$$
where $\hat{\pi}_i^{(-D)}$ is the fitted probability for case $i$ based on the model estimated
without the cases indexed by $D$.
We also define the respective deletion variances and deletion leverages for the entire
data set as
$$v_i^{(-D)} = \hat{\pi}_i^{(-D)} \left(1 - \hat{\pi}_i^{(-D)}\right) \tag{3.3}$$
$$h_i^{(-D)} = \hat{\pi}_i^{(-D)} \left(1 - \hat{\pi}_i^{(-D)}\right) x_i^T \left(X_R^T V_R X_R\right)^{-1} x_i. \tag{3.4}$$
In linear regression, Hadi and Simonoff (1993) pointed out that the residuals
computed internally (for the observations which are used to fit the model) and the
residuals computed externally (for the observations which are not used in the fitting
of the model) are not measured on the same scale. They suggested using a new set of
scaled residuals
$$t_i^* = \frac{y_i - x_i^T \hat{\beta}^{(-D)}}{\hat{\sigma}_R \sqrt{1 - w_{ii}^{(-D)}}} \quad \text{for } i \in R$$
$$t_i^* = \frac{y_i - x_i^T \hat{\beta}^{(-D)}}{\hat{\sigma}_R \sqrt{1 + w_{ii}^{(-D)}}} \quad \text{for } i \in D \tag{3.5}$$
where $\hat{\sigma}_R$ is the usual scale estimate based on the remaining cases. The residuals
for both the $R$ set and the $D$ set defined in (3.5) are measured on the same scale,
and residuals of this type have wide applications in regression diagnostics (see
Atkinson, 1994; Imon, 2005; Munier, 1999).
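The following sketch illustrates (3.5) in the linear regression setting, continuing the earlier snippets; `X`, `y`, and `D` are generic numpy inputs of our own naming, not code from Hadi and Simonoff (1993):

```python
def hadi_simonoff_residuals(X, y, D):
    """Scaled residuals t_i* of Eq. (3.5) for an OLS fit on the remaining set R."""
    n, p = X.shape
    R = np.setdiff1d(np.arange(n), D)
    beta_R, *_ = np.linalg.lstsq(X[R], y[R], rcond=None)   # fit on R only
    e = y - X @ beta_R                                     # residuals for all n cases
    sigma_R = np.sqrt(np.sum(e[R] ** 2) / (len(R) - p))    # scale estimate from R
    G = np.linalg.inv(X[R].T @ X[R])
    w = np.einsum("ij,jk,ik->i", X, G, X)                  # w_ii = x_i^T G x_i
    t = np.empty(n)
    t[R] = e[R] / (sigma_R * np.sqrt(1 - w[R]))            # internally scaled
    t[D] = e[D] / (sigma_R * np.sqrt(1 + w[D]))            # externally scaled
    return t
```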
Using the above results and a linear-regression-like approximation, we define
the $i$th generalized standardized Pearson residual (GSPR) for logistic
regression as
$$r_{si}^{(-D)} = \frac{y_i - \hat{\pi}_i^{(-D)}}{\sqrt{v_i^{(-D)} \left(1 - h_i^{(-D)}\right)}} \quad \text{for } i \in R$$
$$r_{si}^{(-D)} = \frac{y_i - \hat{\pi}_i^{(-D)}}{\sqrt{v_i^{(-D)} \left(1 + h_i^{(-D)}\right)}} \quad \text{for } i \in D \tag{3.6}$$
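A sketch of (3.6) under the same hypothetical setup (a numpy design matrix `X`, a 0/1 response `y`, and an index array `D` of suspect cases; the function name is ours):

```python
def gspr(X, y, D):
    """Generalized standardized Pearson residuals of Eq. (3.6)."""
    n = len(y)
    R = np.setdiff1d(np.arange(n), D)
    fit_R = sm.Logit(y[R], X[R]).fit(disp=0)        # refit without suspect cases
    pi_D = fit_R.predict(X)                         # pi_i^(-D) for all n cases
    v_D = pi_D * (1 - pi_D)                         # v_i^(-D), Eq. (3.3)
    G = np.linalg.inv(X[R].T @ (v_D[R][:, None] * X[R]))
    h_D = v_D * np.einsum("ij,jk,ik->i", X, G, X)   # h_i^(-D), Eq. (3.4)
    out = np.empty(n)
    out[R] = (y[R] - pi_D[R]) / np.sqrt(v_D[R] * (1 - h_D[R]))
    out[D] = (y[D] - pi_D[D]) / np.sqrt(v_D[D] * (1 + h_D[D]))
    return out
```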
Robust techniques, on the other hand, often behave poorly in compact data sets
and consequently are too prone to declare observations as outliers
when they are not. Cook and Hawkins (1990) note that robust techniques have a
tendency to mark ‘innocent’ observations as outliers: even for a sample of
20 five-dimensional $N(0, 1)$ observations, they found that 6 cases were marked as
outliers by the LMS. They describe this situation as a case of ‘outliers
everywhere’. Thus we notice that neither the existing diagnostic techniques nor
the robust techniques perform entirely satisfactorily in detecting outliers. As a
compromise, we suggest a diagnostic-robust approach in which suspect outliers
(if any) are first identified by recently developed graphical and/or robust
methods; diagnostic tools are then applied to the resulting residuals to put back
into the estimation subset any inliers that were wrongly diagnosed as outliers at
the initial stage.
For a two-variable regression, the scatter plot of Y against X can give an idea
of which observations may disrupt the covariate pattern; such observations can be
considered suspect outliers. For a three-variable regression, the character plot of
Y on the two explanatory variables may give some idea about the suspect outliers.
By virtue of recent visualization techniques (see Hadi, 2006), we may identify
suspect outliers in five dimensions as well, but graphical displays are not suitable
for higher-dimensional cases. Another disadvantage of graphical methods is that
they depend heavily on the experimenter's own interpretation. We prefer a
graphical method when the display gives a very clear view of the suspect outliers.
In general, we prefer using a suitable robust technique such as LMS, LTS, BACON
(the blocked adaptive computationally efficient outlier nominator algorithm
proposed by Billor et al., 2000), or BOFOLS (the best omitted from the ordinary
least squares technique proposed by Davies et al., 2004) to find all suspect outliers
and form the initial deletion set $D_0$. After omitting the set of observations
indexed by $D_0$, we compute the generalized standardized Pearson residuals for
the entire data set. We declare as outliers the observations that satisfy the rule
$$\left| r_{si}^{(-D_0)} \right| > 3. \tag{3.7}$$
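Putting the pieces together, a sketch of the two-stage procedure might look as follows. LMS, LTS, BACON, and BOFOLS are not implemented here, so a crude large-residual nomination stands in for the robust step and the set `D0` below is only illustrative:

```python
# Stage 1: nominate suspect cases (stand-in for a robust nominator).
fit_all = sm.Logit(y, X).fit(disp=0)
pi_all = fit_all.predict(X)
pearson = (y - pi_all) / np.sqrt(pi_all * (1 - pi_all))
D0 = np.argsort(-np.abs(pearson))[:5]          # 5 largest |Pearson residuals|

# Stage 2: confirm with the GSPR rule (3.7); wrongly suspected inliers
# simply fail the rule and can be returned to the estimation subset.
rs_D0 = gspr(X, y, D0)
outliers = np.where(np.abs(rs_D0) > 3)[0]
inliers_back = np.setdiff1d(D0, outliers)
```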
4. Examples
Here we consider a few data sets to investigate the usefulness of our newly proposed
tool for the identification of multiple outliers in logistic regression.
Table 1
Modified Brown data (L.N.I. = lymph node involvement, the binary response; A.P. = level of acid phosphatase)
Index L.N.I. A.P. Index L.N.I. A.P. Index L.N.I. A.P.
1 0 48 20 0 98 39 0 76
2 0 56 21 0 52 40 0 95
3 0 50 22 0 75 41 0 66
4 0 52 23 1 99 42 1 84
5 0 50 24 0 187 43 1 81
6 0 49 25 1 136 44 1 76
7 0 46 26 1 82 45 1 70
8 0 62 27 0 40 46 1 78
9 1 56 28 0 50 47 1 70
10 0 55 29 0 50 48 1 67
11 0 62 30 0 40 49 1 82
12 0 71 31 0 55 50 1 67
13 0 65 32 0 59 51 1 72
14 1 67 33 1 48 52 1 89
15 0 47 34 1 51 53 1 126
16 0 49 35 1 49 54 0 200
17 0 50 36 0 48 55 0 220
18 0 78 37 0 63
19 0 83 38 0 102
Table 2
Outlier diagnostics for modified Brown data
Index PR SPR GSPR Index PR SPR GSPR
1 −0.722 −0.732 −0.505 29 −0.724 −0.734 −0.529
2 −0.732 −0.741 −0.607 30 −0.713 −0.725 −0.420
3 −0.725 −0.734 −0.529 31 −0.731 −0.739 −0.593
4 −0.727 −0.736 −0.554 32 −0.736 −0.744 −0.651
5 −0.725 −0.734 −0.529 33 1.385 1.404 2.054
6 −0.724 −0.733 −0.517 34 1.378 1.395 1.911
7 −0.720 −0.730 −0.482 35 1.382 1.401 2.005
8 −0.739 −0.747 −0.697 36 −0.722 −0.732 −0.505
9 1.366 1.382 1.695 37 −0.741 −0.748 −0.714
10 −0.731 −0.739 −0.593 38 −0.791 −0.802 −1.851
11 −0.739 −0.747 −0.697 39 −0.757 −0.764 −0.973
12 −0.751 −0.758 −0.862 40 −0.782 −0.791 −1.557
13 −0.743 −0.751 −0.748 41 −0.744 −0.752 −0.766
14 1.341 1.354 1.305 42 1.303 1.316 0.885
15 −0.721 −0.731 −0.493 43 1.310 1.322 0.947
16 −0.723 −0.733 −0.517 44 1.321 1.333 1.060
17 −0.725 −0.734 −0.529 45 1.334 1.347 1.217
18 −0.760 −0.767 −1.021 46 1.316 1.329 1.013
19 −0.766 −0.774 −1.155 47 1.334 1.347 1.217
20 −0.786 −0.796 −1.677 48 1.341 1.354 1.305
21 −0.727 −0.736 −0.554 49 1.307 1.320 0.926
22 −0.756 −0.763 −0.950 50 1.341 1.354 1.305
23 1.271 1.287 0.635 51 1.330 1.342 1.162
24 −0.913 −1.015 −12.873 52 1.292 1.306 0.792
25 1.194 1.237 0.267 53 1.214 1.248 0.340
26 1.307 1.320 0.926 54 −0.933 −1.066 −17.558
27 −0.713 −0.725 −0.420 55 −0.965 −1.162 −28.234
28 −0.725 −0.734 −0.529
Figure 2. Index plot of standardized Pearson residuals for modified Brown data.
The generalized standardized Pearson residuals for the modified Brown data are
shown in Table 2. We observe from this table that the GSPR values for the
observations 24, 54, and 55 are unusually large in magnitude, and hence these
observations can be declared as outliers. The advantage of using the generalized
standardized Pearson residuals for
the detection of multiple outliers is also visible in their index plot. We observe from
Fig. 3 that the three outliers are easily identified and they are clearly separated from
the rest of the data.
Figure 3. Index plot of generalized standardized Pearson residuals for modified Brown
data.
Table 3
Modified Finney data (entries shown as 0(1) are responses changed from 0 to 1 to create outliers)
Index Response Volume Rate Index Response Volume Rate
1 1 3.70 0.825 21 0 0.40 2.000
2 1 3.50 1.090 22 0 0.95 1.360
3 1 1.25 2.500 23 0 1.35 1.350
4 1 0.75 1.500 24 0 1.50 1.360
5 1 0.80 3.200 25 1 1.60 1.780
6 1 0.70 3.500 26 0 0.60 1.500
7 0 0.60 0.750 27 1 1.80 1.500
8 0 1.10 1.700 28 0 0.95 1.900
9 0 0.90 0.750 29 1 1.90 0.950
10 0(1) 0.90 0.450 30 0 1.60 0.400
11 0(1) 0.80 0.570 31 1 2.70 0.750
12 0 0.55 2.750 32 0 2.35 0.030
13 0 0.60 3.000 33 0 1.10 1.830
14 1 1.40 2.330 34 1 1.10 2.200
15 1 0.75 3.750 35 1 1.20 2.000
16 1 2.30 1.640 36 1 0.80 3.330
17 1 3.20 1.600 37 0 0.95 1.900
18 1 0.85 1.415 38 0 0.75 1.900
19 0 1.70 1.060 39 1 1.30 1.625
20 1 1.80 1.800
This data set was analyzed extensively by Pregibon (1981). Figure 4 presents
the character plot of the Finney data, where rate is plotted against volume and the
characters corresponding to occurrence and non occurrence are denoted by + and o,
respectively. Looking at the pattern of occurrence and non occurrence in relation to
rate and volume, Pregibon (1981) pointed out that this data set might contain two
Table 4
Outlier diagnostics for modified Finney data
Index PR SPR GSPR Index PR SPR GSPR
1 0.1440 0.1491 0.0 21 −0.5870 −0.6157 −0.0
2 0.1493 0.1543 0.0 22 −0.6827 −0.7032 −0.0
3 0.5629 0.5799 0.0 23 −0.9956 −1.0176 −0.2
4 1.6340 1.6914 587.5 24 −1.1562 −1.1836 −0.6
5 0.5744 0.6074 0.0 25 0.6143 0.6301 0.1
6 0.5301 0.5683 0.0 26 −0.5301 −0.5520 −0.0
7 −0.3414 −0.3570 −0.0 27 0.5978 0.6174 0.1
8 −0.9623 −0.9807 −0.1 28 −0.9373 −0.9577 −0.1
9 −0.4550 −0.4760 −0.0 29 0.7501 0.7854 0.6
10 2.6209 2.7516 44555.1 30 −0.7244 −0.7693 −0.0
11 2.6882 2.8168 56081.2 31 0.3920 0.4223 0.0
12 −1.0523 −1.1087 −0.3 32 −1.1957 −1.3568 −1.1
13 −1.2783 −1.3554 −2.0 33 −1.0385 −1.0578 −0.2
14 0.5387 0.5540 0.0 34 0.7750 0.7921 0.7
15 0.4364 0.4682 0.0 35 0.7919 0.8071 0.9
16 0.3411 0.3559 0.0 36 0.5322 0.5649 0.0
17 0.1475 0.1516 0.0 37 −0.9373 −0.9577 −0.1
18 1.5607 1.6115 386.8 38 −0.7739 −0.7972 −0.0
19 −1.1742 −1.2160 −0.7 39 0.8967 0.9132 2.7
20 0.5013 0.5176 0.0
outliers (cases 4 and 18). Following the same argument, we introduce two more
outliers (cases 10 and 11) by replacing non occurrences with occurrences.
The Pearson and standardized Pearson residuals for all observations of the
modified Finney data are presented in Table 4. We observe from this table that
the two original outliers are completely masked in the presence of the two new outliers. The
Figure 5. Index plot of standardized Pearson residuals for modified Finney data.
Figure 6. Index plot of generalized standardized Pearson residuals for modified Finney
data.
Pearson and standardized Pearson residuals corresponding to all of the outliers are
not that large (less than 3 in absolute value). The index plot of standardized
Pearson residuals shown in Fig. 5 also shows that all four outliers lie safely inside
the cutoff when the commonly used SPRs are used to identify them.
Now we apply our newly proposed detection technique for the identification
of multiple outliers to the modified Finney data. Here the deletion set contains four
observations (cases 4, 10, 11, and 18) as suggested by Fig. 4 and also by the robust
techniques. We reestimate the logistic regression model without the observations
indexed by D and compute the generalized standardized Pearson residuals which
are also shown in Table 4. We observe from this table that the GSPR values for
the observations 4, 10, 11, and 18 are unusually high and hence can be declared as
outliers.
Since the GSPR values for the outliers are excessively large in comparison with
those for the inliers, we present their index plot in Fig. 6 after suitably rescaling
them (on a shifted log scale). This figure shows that the four outliers are easily identified
and they are clearly separated from the rest of the data.
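The paper does not spell out the exact rescaling, but a sign-preserving shifted log transform is one plausible reading; a sketch using `matplotlib`, with `rs_D0` from the earlier snippets:

```python
import matplotlib.pyplot as plt

# Compress huge |GSPR| values so the inliers remain visible in the index plot.
rs_scaled = np.sign(rs_D0) * np.log1p(np.abs(rs_D0))
cut = np.log1p(3)                            # rule (3.7) cutoff on the same scale
plt.scatter(np.arange(len(rs_scaled)), rs_scaled)
plt.axhline(cut, linestyle="--")
plt.axhline(-cut, linestyle="--")
plt.xlabel("Index")
plt.ylabel("rescaled GSPR")
plt.show()
```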
5. Conclusions
In this article, we propose a new method for the identification of multiple outliers
in logistic regression. We introduce group deletion residuals that we call the
generalized standardized Pearson residuals and use them as effective diagnostics
for detecting multiple outliers in logistic regression. The numerical examples clearly
show that the proposed method is very successful in the identification of multiple
outliers when the existing commonly used methods fail to do so.
Acknowledgment
The authors gratefully acknowledge valuable comments and suggestions from the
reviewer.
References
Atkinson, A. C. (1994). Fast very robust methods for the detection of multiple outliers.
J. Amer. Statist. Assoc. 89:1329–1339.
Barnett, V., Lewis, T. (1994). Outliers in Statistical Data. 3rd ed. New York: Wiley.
Billor, N., Hadi, A. S., Velleman, P. F. (2000). BACON: Blocked adaptive
computationally efficient outlier nominator. Computat. Statist. Data Anal. 34:279–298.
Brown, B. W., Jr. (1980). Prediction analysis for binary data. In: Miller, R. G. Jr., Efron, B.,
Brown, B. W. Jr., Moses, L. E., eds. Biostatistics Casebook. New York: Wiley.
Chatterjee, S., Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. New York: Wiley.
Chen, C., Liu, L. M. (1993). Joint estimation of model parameters and outlier effects in time
series. J. Amer. Statist. Assoc. 88:284–297.
Cook, R. D., Hawkins, D. M. (1990). Comment on ‘Unmasking multivariate outliers and
leverage points’ by P. J. Rousseeuw and B. C. van Zomeren. J. Amer. Statist. Assoc.
85:640–644.
Davies, P., Imon, A. H. M. R., Ali, M. M. (2004). A conditional expectation method for
improved residual estimation and outlier identification in linear regression. Int. J.
Statist. Sci. (Special issue in honour of Professor M. S. Haq). 191–208.
Finney, D. J. (1947). The estimation from individual records of the relationship between
dose and quantal response. Biometrika 34:320–334.
Gentleman, J. F., Wilk, M. B. (1975). Detecting outliers in a two-way table: II. Supplementing
the direct analysis of residuals. Biometrics 31:387–410.
Hadi, A. S. (1992). A new measure of overall potential influence in linear regression.
Computat. Statist. Data Anal. 14:1–27.
Hadi, A. S. (2006). On the Visualization of Massive, Hyperdimensional Data. Seminar session,
Department of Statistics, University of Rajshahi, Bangladesh.
Hadi, A. S., Simonoff, J. S. (1993). Procedures for the identification of multiple outliers in
linear models. J. Amer. Statist. Assoc. 88:1264–1272.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., Stahel, W. A. (1986). Robust Statistics:
The Approach Based on Influence Functions. New York: Wiley.
Hosmer, D. W., Lemeshow, S. (2000). Applied Logistic Regression. 2nd ed. New York: Wiley.
Imon, A. H. M. R. (2005). Identifying multiple influential observations in linear regression.
J. Appl. Statist. 32:73–90.
Munier, S. (1999). Multiple outlier detection in logistic regression. Student 3:117–126.
Pregibon, D. (1981). Logistic regression diagnostics. Ann. Statist. 9:977–986.
Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc.
79:871–880.
Rousseeuw, P. J., Leroy, A. M. (1987). Robust Regression and Outlier Detection. New York:
Wiley.
Ryan, T. P. (1997). Modern Regression Methods. New York: Wiley.