
Communications in Statistics - Theory and Methods

ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: https://www.tandfonline.com/loi/lsta20

Identification of Multiple Outliers in Logistic Regression

A. H. M. Rahmatullah Imon & Ali S. Hadi

To cite this article: A. H. M. Rahmatullah Imon & Ali S. Hadi (2008) Identification of Multiple
Outliers in Logistic Regression, Communications in Statistics - Theory and Methods, 37:11,
1697-1709, DOI: 10.1080/03610920701826161

To link to this article: https://doi.org/10.1080/03610920701826161

Published online: 04 Apr 2008.

Communications in Statistics—Theory and Methods, 37: 1697–1709, 2008
Copyright © Taylor & Francis Group, LLC
ISSN: 0361-0926 print/1532-415X online
DOI: 10.1080/03610920701826161

Identification of Multiple Outliers in Logistic Regression

A. H. M. RAHMATULLAH IMON¹ AND ALI S. HADI²

¹Institute for Mathematical Research, University Putra Malaysia, Selangor, Malaysia
²Department of Mathematics, The American University in Cairo, Cairo, Egypt

The use of logistic regression modeling has seen a great deal of attention in the
literature in recent years. This includes all aspects of the logistic regression model
including the identification of outliers. A variety of methods for the identification
of outliers, such as the standardized Pearson residuals, are now available in the
literature. These methods, however, are successful only if the data contain a single
outlier. In the presence of multiple outliers in the data, which is often the case
in practice, these methods fail to detect the outliers. This is due to the well-known
problems of masking (false negative) and swamping (false positive) effects. In this
article, we propose a new method for the identification of multiple outliers in logistic
regression. We develop a generalized version of standardized Pearson residuals
based on group deletion and then propose a technique for identifying multiple
outliers. The performance of the proposed method is then investigated through
several examples.

Keywords  Generalized standardized Pearson residuals; Group deletion; Logistic regression; Masking; Outliers; Pearson residuals.

Mathematics Subject Classification Primary 62J02; Secondary 62J20.

1. Introduction
Diagnostic methods are commonly used in all branches of regression analysis.
In recent years diagnostics have become an essential part of logistic regression
(see Hosmer and Lemeshow, 2000). We often observe that outliers greatly affect the
covariate pattern, and consequently their presence can mislead our interpretation.
So we need to detect such observations and study their impact on the model. In
Sec. 2, we introduce some commonly used diagnostics based on residuals for the

Address correspondence to A. H. M. Rahmatullah Imon, Institute for Mathematical Research, University Putra Malaysia, 43400 Serdang, Selangor, Malaysia; E-mail: imon_ru@yahoo.com


identification of outliers in logistic regression. Although it is generally believed that
the identification of a single outlier is often achieved satisfactorily by the use of
traditional methods, these methods may be ineffective when a group of outliers
is present in the data. We anticipate that residuals based on group deletion may
produce better results in this situation. We introduce group-deleted residuals that
we name the generalized standardized Pearson residuals (GSPR) in Sec. 3 and then
propose a new method for the identification of multiple outliers in logistic regression
using GSPR. The usefulness of this newly proposed method is investigated in Sec. 4
through several well-known examples.

2. Residuals and Leverages in Logistic Regression


Let us consider a multiple logistic regression model

Y = π(X) + ε,                                                       (2.1)

where

π(X) = exp(Z) / (1 + exp(Z)),                                       (2.2)

with Z = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p = Xβ. Here, Y is an n × 1 vector of
responses; we would logically let y_i = 0 if the ith unit does not have the
characteristic and y_i = 1 if the ith unit does possess that characteristic; X is an n × k
matrix containing the data for each case with k = p + 1; β^T = (β_0, β_1, β_2, …, β_p) is
the vector of regression parameters; and ε is an n × 1 vector of unobserved random
errors. The quantity π_i is known as the probability for the ith factor/covariate. The
model given in (2.2) satisfies the important requirement that 0 ≤ π_i ≤ 1 and will be
a satisfactory model in many applications.
In linear regression, the ordinary least squares (OLS) method is commonly used
for estimating parameters, mainly because of tradition and ease of computation.
We can use the OLS method for estimating parameters in logistic regression,
but the assumptions under which the OLS estimators possess their nice and
desirable properties do not hold for the logistic regression model. Mainly for this
reason the maximum likelihood (ML) method, based on the iteratively reweighted least
squares algorithm (see Ryan, 1997), has become more popular among statisticians.
After estimating the model by the ML method, let β̂ denote the vector of estimated
coefficients. Thus, the fitted values for the logistic regression model are π̂(x_i), the
value of the expression in (2.2) computed using β̂.
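For concreteness, a minimal computational sketch of this iteratively reweighted least squares fit is given below in Python. It is an illustration only and not part of the original article; the function name fit_logistic_irls and the numerical tolerances are our own choices. It assumes the design matrix X already carries a leading column of ones and y is the 0/1 response vector.

import numpy as np

def fit_logistic_irls(X, y, n_iter=50, tol=1e-8):
    # Maximum likelihood estimation of the logistic regression coefficients
    # by iteratively reweighted least squares.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))          # pi_i = exp(z_i)/(1 + exp(z_i))
        pi = np.clip(pi, 1e-10, 1.0 - 1e-10)     # guard against fitted 0/1 probabilities
        v = pi * (1.0 - pi)                      # Bernoulli variances v_i
        z = eta + (y - pi) / v                   # working response
        beta_new = np.linalg.solve(X.T @ (v[:, None] * X), X.T @ (v * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta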
In linear regression, the ith residual is defined as the difference between the
observed and fitted values, y_i − ŷ_i. To emphasize the fact that the fitted values
in logistic regression are calculated for each covariate pattern and depend on the
estimated probability for that covariate pattern, we denote the fitted value of the ith
covariate pattern as ŷ_i = π̂_i. Thus, we define the ith residual as

ε̂_i = y_i − π̂_i,    i = 1, 2, …, n.                                 (2.3)

In linear regression the hat matrix plays an extremely important role in the analysis.
This is the matrix that provides the fitted values as the projection of the outcome
variable into the covariate space. The linear regression residuals, Y − Ŷ, are often
expressed in terms of the hat matrix, and this forms the basis of many diagnostics. Using
weighted least squares linear regression as a model, Pregibon (1981) derived a linear
approximation to the fitted values, which yields a hat matrix for logistic regression:

H = V^(1/2) X (X^T V X)^(−1) X^T V^(1/2),                            (2.4)

where V is an n × n diagonal matrix with general element v_i = π̂_i(1 − π̂_i). In linear
regression the diagonal elements of the hat matrix are called the leverage values. Let
the quantity h_i denote the ith diagonal element of the matrix H defined in (2.4). It
is easy to show that

h_i = π̂_i(1 − π̂_i) x_i^T (X^T V X)^(−1) x_i,                        (2.5)

where x_i^T = (1, x_{1i}, x_{2i}, …, x_{pi}) is the 1 × k vector of observations corresponding to
the ith case.
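The leverages in (2.5) are straightforward to compute once the fitted probabilities are available. The following Python sketch evaluates them for every case; it is ours, for illustration, and the helper name logistic_leverages is not from the article.

import numpy as np

def logistic_leverages(X, pi_hat):
    # Diagonal of the logistic regression hat matrix in (2.4):
    # h_i = pi_i (1 - pi_i) x_i' (X' V X)^{-1} x_i.
    v = pi_hat * (1.0 - pi_hat)                       # diagonal elements of V
    XtVX_inv = np.linalg.inv(X.T @ (v[:, None] * X))  # (X' V X)^{-1}
    h = v * np.einsum('ij,jk,ik->i', X, XtVX_inv, X)  # quadratic form for each row x_i
    return h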
In logistic regression, the residuals measure the extent of ill-fitted
factor/covariate patterns. Hence, observations possessing large residuals are
suspect outliers. But at this stage a natural question comes to mind: how
big is big? The residuals defined in (2.3) are unscaled, which is why they are not
readily applicable for detecting outliers. Let us now introduce some scaled versions of
the above residuals that are commonly used in diagnostics for the identification
of outliers.
The Pearson residuals are elements of the Pearson chi-square statistic that can be used
to detect ill-fitted factor/covariate patterns. In linear regression a key assumption
is that the error variance does not depend on the conditional mean E(y_i | x_i) = π_i.
However, in logistic regression we have Bernoulli errors and as a result the error
variance is a function of the conditional mean, i.e.,

Var(y_i | x_i) = v_i = π̂_i(1 − π̂_i).                                (2.6)

The Pearson residual defined for the ith factor/covariate pattern is given by

r_i = (y_i − π̂_i) / √v_i,    i = 1, 2, …, n.                         (2.7)

We call an observation an outlier if its corresponding Pearson residual exceeds
a quantity c in absolute value. Since Pearson residuals are scaled residuals, a
reasonable choice for c could be 3 (see Ryan, 1997), which matches the 3σ
distance rule used in normal theory. But we often find that the cut-off
value 3 identifies too many observations as outliers. Therefore, we may follow Chen
and Liu (1993) and consider c to be a suitably chosen constant between 3 and 5.
If we use the Pregibon (1981) linear regression-like approximation for the
residual for the ith covariate pattern, we observe

ε̂_i = y_i − π̂_i ≈ (1 − h_i) y_i.                                    (2.8)

Hence, the variance of the residual is given by

V(ε̂_i) = v_i (1 − h_i),                                             (2.9)

which suggests that the Pearson residuals do not have variance equal to 1. For this
reason we could use the standardized Pearson residuals defined as

r_si = (y_i − π̂_i) / √(v_i (1 − h_i)),    i = 1, 2, …, n.            (2.10)

We may declare the ith observation as an outlier if |r_si| > c.
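In practice, (2.7) and (2.10) can be computed together. A small Python sketch is given below (again ours, not from the article), reusing the fitted probabilities and the leverages from the sketches above.

import numpy as np

def pearson_residuals(y, pi_hat, h=None, c=3.0):
    # Pearson residuals (2.7); if leverages h are supplied, the standardized
    # Pearson residuals (2.10) are returned instead, together with the indices
    # of the observations flagged by the rule |r| > c.
    v = pi_hat * (1.0 - pi_hat)
    denom = np.sqrt(v) if h is None else np.sqrt(v * (1.0 - h))
    r = (y - pi_hat) / denom
    flagged = np.where(np.abs(r) > c)[0]
    return r, flagged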

3. Identification of Multiple Outliers


A variety of diagnostic tools are now available in the literature (see Hosmer and
Lemeshow, 2000; Ryan, 1997) for the identification of a single outlier in logistic
regression. But in a real life problem there is no guarantee that the data set will
contain just a single outlier. Hampel et al. (1986) claim that a routine data set
typically contains about 1–10% outliers. When multiple outliers are present in a data
set one might think of applying a single case diagnostic successively to identify all of
them. But this problem is not that simple. A group of outliers may distort the fitting
of a model in such a way that outliers may have artificially very small residuals so
that they may appear as inliers. This problem is known as masking in the literature
(see Rousseeuw and Leroy, 1987). The opposite effect of masking is known as
swamping (see Barnett and Lewis, 1994) for which inliers may appear as outliers.
Unfortunately, most of the outlier detection methods suffer from masking and/or
swamping effects of multiple outliers. Therefore, we need detection techniques that
are free from these problems.
Here we introduce a group-deleted version of the residuals and weights that
will be later used to develop effective diagnostics for the identification of multiple
outliers in logistic regression. We assume that d observations among a set of n
observations are omitted before the fitting of the model. Let us denote a set of cases
‘remaining’ in the analysis by R and a set of cases ‘deleted’ by D. Hence, R contains
n − d cases after the d cases in D are deleted. Without loss of generality, assume that
these observations are the last d rows of X, Y, and V, so that

X = (X_R^T, X_D^T)^T,    Y = (Y_R^T, Y_D^T)^T,    V = diag(V_R, V_D).

Let β̂^(−D) be the corresponding vector of estimated coefficients when a group of
observations indexed by D is omitted. Thus, the corresponding fitted values for the
logistic regression model are

π̂_i^(−D) = exp(x_i^T β̂^(−D)) / (1 + exp(x_i^T β̂^(−D))),    i = 1, 2, …, n.   (3.1)

Here, we define the ith deletion residual as

ε̂_i^(−D) = y_i − π̂_i^(−D),    i = 1, 2, …, n.                        (3.2)
Identification of Multiple Outliers in Logistic Regression 1701

We also define the respective deletion variances and deletion leverages for the entire
data set as

v_i^(−D) = π̂_i^(−D) (1 − π̂_i^(−D)),                                  (3.3)

h_i^(−D) = π̂_i^(−D) (1 − π̂_i^(−D)) x_i^T (X_R^T V_R X_R)^(−1) x_i.   (3.4)

In linear regression, Hadi and Simonoff (1993) pointed out that the residuals
computed internally (for the observations that are used to fit the model) and the
residuals computed externally (for the observations that are not used in the fitting
of the model) are not measured on a similar scale. They suggested using a new set of
scaled residuals

t_i^* = (y_i − x_i^T β̂^(−D)) / (σ̂_R √(1 − w_ii^(−D)))    for i ∈ R,

t_i^* = (y_i − x_i^T β̂^(−D)) / (σ̂_R √(1 + w_ii^(−D)))    for i ∈ D,          (3.5)

where σ̂_R is the usual scale estimate based on the remaining cases and w_ii^(−D) is the
corresponding deletion leverage in linear regression. Residuals for
both the R set and the D set defined in (3.5) are measured on a similar scale, and
this type of residual has wide applications in regression diagnostics (see Atkinson,
1994; Imon, 2005; Munier, 1999).
Using the above results and also using a linear regression-like approximation,
we define the ith generalized standardized Pearson residual (GSPR) for logistic
regression as

r_si^(−D) = (y_i − π̂_i^(−D)) / √(v_i^(−D) (1 − h_i^(−D)))    for i ∈ R,

r_si^(−D) = (y_i − π̂_i^(−D)) / √(v_i^(−D) (1 + h_i^(−D)))    for i ∈ D.      (3.6)

We call an observation an outlier if its corresponding GSPR value is excessively large
(say 3 or more) in absolute value.
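A possible implementation of (3.6) is sketched below in Python; it is an illustration only and the helper name gspr is ours. The fitting routine is passed in as an argument, for instance the fit_logistic_irls sketch of Sec. 2; the model is refitted on the R set and the GSPRs are then computed for every observation.

import numpy as np

def gspr(X, y, D, fit_logit):
    # Generalized standardized Pearson residuals (3.6) for a deletion set D,
    # given as an array of 0-based row indices.
    n = X.shape[0]
    in_D = np.zeros(n, dtype=bool)
    in_D[np.asarray(D, dtype=int)] = True
    beta_R = fit_logit(X[~in_D], y[~in_D])             # coefficients estimated without D
    pi = 1.0 / (1.0 + np.exp(-(X @ beta_R)))           # pi_i^(-D) for all n cases
    v = pi * (1.0 - pi)                                # v_i^(-D), Eq. (3.3)
    XtVX_R = X[~in_D].T @ (v[~in_D, None] * X[~in_D])  # X_R' V_R X_R
    h = v * np.einsum('ij,jk,ik->i', X, np.linalg.inv(XtVX_R), X)  # h_i^(-D), Eq. (3.4)
    sign = np.where(in_D, 1.0, -1.0)                   # 1 + h_i for i in D, 1 - h_i for i in R
    return (y - pi) / np.sqrt(v * (1.0 + sign * h))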
Although the expression for the generalized standardized Pearson residuals is
available for any arbitrary set of deleted cases, D, the choice of such a set is the
most important part of the study, since the omission of this group determines the
residuals for both the D set and the R set. For the detection of multiple outliers
our initial deletion set, say D_0, must contain all suspect outliers, because if any
outlier remains in the R set the entire set of GSPRs may be faulty. Because of
the masking and/or swamping effects of multiple outliers the fitted model may be
distorted in such a way that an iterative or jackknife-like procedure will not be at all
helpful. On the other hand, an ‘all possible outlier detection’ method as suggested
by Gentleman and Wilk (1975) for linear regression would be computationally very
expensive for logistic regression. In this situation one might consider a robust fit of
the model such as the least median of squares (LMS) or the least trimmed squares (LTS)
suggested by Rousseeuw (1984). But these robust methods are based on the most
compact data sets and consequently are too prone to declare observations as outliers
when they are not. Cook and Hawkins (1990) note that robust techniques have a
tendency to mark ‘innocent’ observations as outliers. Even considering a sample of
20 five-dimensional N(0, 1) observations, they found that 6 cases were marked as outliers by
the LMS. They describe this situation as a case of ‘outliers everywhere’. Thus, we
notice that the performance of neither the existing diagnostic techniques nor the robust
techniques is entirely satisfactory for outlier detection. As a compromise we suggest a
diagnostic-robust approach in which suspect outliers (if any) are identified first by
recently developed graphical methods and/or robust methods, and diagnostic
tools are then applied to the resulting residuals, putting back into the
estimation subset any inliers that were wrongly diagnosed as outliers at the initial stage.
For a two-variable regression, the scatter plot of Y against X can give an idea
of which observations may disrupt the covariate pattern, and these can be considered
as suspect outliers. For a three-variable regression the character plot of Y on the
two explanatory variables may give some idea about the suspect outliers. By virtue
of recent visualization techniques (see Hadi, 2006), we may identify suspect outliers
in five dimensions as well, but graphical displays are not suitable for higher-
dimensional cases. Another disadvantage of graphical methods is that they depend heavily
on the experimenter’s own interpretation. We prefer using a graphical method
when the display gives a very clear view of the suspect outliers. In general, we
prefer using a suitable robust technique such as LMS, LTS, BACON (the blocked
adaptive computationally efficient outlier nominator algorithm proposed
by Billor et al., 2000), or BOFOLS (the best omitted from the ordinary least squares
technique proposed by Davies et al., 2004) to find all suspect outliers and form the
initial deletion set D_0. After the omission of the set of observations indexed by D_0,
we compute the generalized standardized Pearson residuals for the entire data set.
We declare observations as outliers which satisfy the rule

|r_si^(−D_0)| > 3.                                                   (3.7)

We have already mentioned that, in deleting a group of observations, it is
possible that some observations may be wrongly detected as suspect outliers because
of their association with other unusual cases. Since the GSPR values for both
the R set and the D set are measured on a similar scale, the deletion of innocent
observations should not matter too much in the outlier detection procedure, but
we feel that the deletion of harmless cases does not help to produce a better set of
residuals. Hence, we would like to return observations (if any) to the estimation
subset when we believe that they were wrongly omitted.
If all members belonging to the set D_0 satisfy the rule (3.7), they will be declared
as outliers. Otherwise, we put the observations not satisfying the rule (3.7)
back into the estimation set. We continue revising the deletion set D and recomputing
the GSPRs until all members of the final deletion set, say D_m, individually satisfy
|r_si^(−D_m)| > 3. The members belonging to the set D_m are then finally declared as outliers.
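The whole procedure, starting from an initial deletion set D_0 and returning wrongly omitted cases to the estimation subset until every remaining member of D satisfies rule (3.7), can be written as a short loop. The Python sketch below is our illustration; it relies on the gspr and fit_logistic_irls helpers sketched earlier and is not the authors' own code.

import numpy as np

def detect_outliers(X, y, D0, fit_logit, c=3.0, max_iter=50):
    # Iterative refinement of the deletion set: members of D whose |GSPR| <= c
    # are returned to the estimation subset and the GSPRs are recomputed,
    # until every remaining member of D satisfies rule (3.7).
    D = np.asarray(sorted(D0), dtype=int)
    rs = gspr(X, y, D, fit_logit)
    for _ in range(max_iter):
        keep = np.abs(rs[D]) > c            # which deleted cases still look like outliers
        if keep.all():
            break                           # final deletion set D_m reached
        D = D[keep]                         # put the innocent cases back into R
        rs = gspr(X, y, D, fit_logit)
    return D, rs                            # declared outliers and the GSPRs for all cases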

4. Examples
Here we consider a few data sets to investigate the usefulness of our newly proposed
tool for the identification of multiple outliers in logistic regression.
Identification of Multiple Outliers in Logistic Regression 1703

4.1. Modified Brown Data


We first consider the data set given by Brown (1980). Here the main objective was to
see whether an elevated level of acid phosphatase (A.P.) in the blood serum would
be of value for predicting whether or not prostate cancer patients also had lymph
node involvement (L.N.I.). Ryan (1997) points out that the original data on the 53
patients contain a single outlier (observation number 24). We modify this data set
by adding two more outliers as cases 54 and 55; the modified data set is given in Table 1.
Here the dependent variable is nodal involvement, with 1 denoting the presence
of nodal involvement and 0 indicating the absence of such involvement. The index
plot of the explanatory variable (see Fig. 1) clearly shows that observations 24, 54,
and 55 may severely distort the covariate pattern and hence may be considered as
outliers.
For the modified Brown data, Table 2 gives Pearson and standardized Pearson
residuals for all observations. It is clear from the results given in the table that both
Pearson and standardized Pearson residuals fail to identify the outliers. The index
plot of standardized Pearson residuals as shown in Fig. 2 clearly shows that all three
suspect outliers are masked in these data.
Now we apply our newly proposed detection technique for the identification
of multiple outliers. As Fig. 1 suggests, we form the deletion set D with the cases
24, 54, and 55. It is worth mentioning that all the robust techniques considered in
the previous section also suggest that these three observations are outliers. After
reestimating the logistic regression model without the observations indexed by D, we
refit the entire data set and compute the generalized standardized Pearson residuals

Table 1
Modified Brown data
Index L.N.I. A.P. Index L.N.I. A.P. Index L.N.I. A.P.
1 0 48 20 0 98 39 0 76
2 0 56 21 0 52 40 0 95
3 0 50 22 0 75 41 0 66
4 0 52 23 1 99 42 1 84
5 0 50 24 0 187 43 1 81
6 0 49 25 1 136 44 1 76
7 0 46 26 1 82 45 1 70
8 0 62 27 0 40 46 1 78
9 1 56 28 0 50 47 1 70
10 0 55 29 0 50 48 1 67
11 0 62 30 0 40 49 1 82
12 0 71 31 0 55 50 1 67
13 0 65 32 0 59 51 1 72
14 1 67 33 1 48 52 1 89
15 0 47 34 1 51 53 1 126
16 0 49 35 1 49 54 0 200
17 0 50 36 0 48 55 0 220
18 0 78 37 0 63
19 0 83 38 0 102
1704 Imon and Hadi

Figure 1. Scatter plot of acid phosphatase for modified Brown data.

Table 2
Outlier diagnostics for modified Brown data
Index PR SPR GSPR Index PR SPR GSPR
1 −0.722 −0.732 −0.505    29 −0.724 −0.734 −0.529
2 −0.732 −0.741 −0.607    30 −0.713 −0.725 −0.420
3 −0.725 −0.734 −0.529    31 −0.731 −0.739 −0.593
4 −0.727 −0.736 −0.554    32 −0.736 −0.744 −0.651
5 −0.725 −0.734 −0.529    33 1.385 1.404 2.054
6 −0.724 −0.733 −0.517    34 1.378 1.395 1.911
7 −0.720 −0.730 −0.482    35 1.382 1.401 2.005
8 −0.739 −0.747 −0.697    36 −0.722 −0.732 −0.505
9 1.366 1.382 1.695       37 −0.741 −0.748 −0.714
10 −0.731 −0.739 −0.593   38 −0.791 −0.802 −1.851
11 −0.739 −0.747 −0.697   39 −0.757 −0.764 −0.973
12 −0.751 −0.758 −0.862   40 −0.782 −0.791 −1.557
13 −0.743 −0.751 −0.748   41 −0.744 −0.752 −0.766
14 1.341 1.354 1.305      42 1.303 1.316 0.885
15 −0.721 −0.731 −0.493   43 1.310 1.322 0.947
16 −0.723 −0.733 −0.517   44 1.321 1.333 1.060
17 −0.725 −0.734 −0.529   45 1.334 1.347 1.217
18 −0.760 −0.767 −1.021   46 1.316 1.329 1.013
19 −0.766 −0.774 −1.155   47 1.334 1.347 1.217
20 −0.786 −0.796 −1.677   48 1.341 1.354 1.305
21 −0.727 −0.736 −0.554   49 1.307 1.320 0.926
22 −0.756 −0.763 −0.950   50 1.341 1.354 1.305
23 1.271 1.287 0.635      51 1.330 1.342 1.162
24 −0.913 −1.015 −12.873  52 1.292 1.306 0.792
25 1.194 1.237 0.267      53 1.214 1.248 0.340
26 1.307 1.320 0.926      54 −0.933 −1.066 −17.558
27 −0.713 −0.725 −0.420   55 −0.965 −1.162 −28.234
28 −0.725 −0.734 −0.529
Identification of Multiple Outliers in Logistic Regression 1705

Figure 2. Index plot of standardized Pearson residuals for modified Brown data.

which are shown in Table 2. We observe from this table that the GSPR values for
the observations 24, 54, and 55 are unusually high and hence can be declared as
outliers. The advantage of using the generalized standardized Pearson residuals for
the detection of multiple outliers is also visible in their index plot. We observe from
Fig. 3 that the three outliers are easily identified and they are clearly separated from
the rest of the data.
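To make the steps above reproducible, the following Python snippet applies the sketches from Secs. 2 and 3 to the modified Brown data; the data are keyed in from Table 1 and the variable names are ours. With the deletion set formed from cases 24, 54, and 55, the loop retains all three cases, in agreement with the GSPR column of Table 2.

import numpy as np

# Modified Brown data from Table 1 (index order 1, ..., 55)
ap = np.array([48, 56, 50, 52, 50, 49, 46, 62, 56, 55, 62, 71, 65, 67, 47, 49, 50, 78, 83,
               98, 52, 75, 99, 187, 136, 82, 40, 50, 50, 40, 55, 59, 48, 51, 49, 48, 63, 102,
               76, 95, 66, 84, 81, 76, 70, 78, 70, 67, 82, 67, 72, 89, 126, 200, 220], float)
lni = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
                0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], float)

X = np.column_stack([np.ones_like(ap), ap])      # intercept plus acid phosphatase
D0 = np.array([24, 54, 55]) - 1                  # suspect cases from Fig. 1 (0-based)

D_final, rs = detect_outliers(X, lni, D0, fit_logistic_irls)
print(D_final + 1)                               # all three suspects are retained: 24, 54, 55
print(np.round(rs[D_final], 3))                  # Table 2 reports about -12.9, -17.6, -28.2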

4.2. Modified Finney Data


We now consider another data set given by Finney (1947). The original data (see
Table 3) were obtained to study the effect of the rate and volume of air inspired on a
transient vaso-constriction in the skin of the digits. The nature of the measurement
process was such that only the occurrence and non-occurrence of vaso-constriction
could be reliably measured.

Figure 3. Index plot of generalized standardized Pearson residuals for modified Brown
data.
1706 Imon and Hadi

Table 3
Modified Finney data
Index Response Volume Rate Index Response Volume Rate
1 1 3.70 0.825 21 0 0.40 2.000
2 1 3.50 1.090 22 0 0.95 1.360
3 1 1.25 2.500 23 0 1.35 1.350
4 1 0.75 1.500 24 0 1.50 1.360
5 1 0.80 3.200 25 1 1.60 1.780
6 1 0.70 3.500 26 0 0.60 1.500
7 0 0.60 0.750 27 1 1.80 1.500
8 0 1.10 1.700 28 0 0.95 1.900
9 0 0.90 0.750 29 1 1.90 0.950
10 0(1) 0.90 0.450 30 0 1.60 0.400
11 0(1) 0.80 0.570 31 1 2.70 0.750
12 0 0.55 2.750 32 0 2.35 0.030
13 0 0.60 3.000 33 0 1.10 1.830
14 1 1.40 2.330 34 1 1.10 2.200
15 1 0.75 3.750 35 1 1.20 2.000
16 1 2.30 1.640 36 1 0.80 3.330
17 1 3.20 1.600 37 0 0.95 1.900
18 1 0.85 1.415 38 0 0.75 1.900
19 0 1.70 1.060 39 1 1.30 1.625
20 1 1.80 1.800

This data set was analyzed extensively by Pregibon (1981). Figure 4 presents
the character plot of the Finney data, where rate is plotted against volume and the
characters corresponding to occurrence and non-occurrence are denoted by + and o,
respectively. Looking at the pattern of occurrence and non-occurrence in relation to
rate and volume, Pregibon (1981) pointed out that this data set might contain two

Figure 4. Character plot of occurrence and non-occurrence of modified Finney data.


Identification of Multiple Outliers in Logistic Regression 1707

Table 4
Outlier diagnostics for modified Finney data
Index PR SPR GSPR Index PR SPR GSPR
1 0.1440 0.1491 0.0       21 −0.5870 −0.6157 −0.0
2 0.1493 0.1543 0.0       22 −0.6827 −0.7032 −0.0
3 0.5629 0.5799 0.0       23 −0.9956 −1.0176 −0.2
4 1.6340 1.6914 587.5     24 −1.1562 −1.1836 −0.6
5 0.5744 0.6074 0.0       25 0.6143 0.6301 0.1
6 0.5301 0.5683 0.0       26 −0.5301 −0.5520 −0.0
7 −0.3414 −0.3570 −0.0    27 0.5978 0.6174 0.1
8 −0.9623 −0.9807 −0.1    28 −0.9373 −0.9577 −0.1
9 −0.4550 −0.4760 −0.0    29 0.7501 0.7854 0.6
10 2.6209 2.7516 44555.1  30 −0.7244 −0.7693 −0.0
11 2.6882 2.8168 56081.2  31 0.3920 0.4223 0.0
12 −1.0523 −1.1087 −0.3   32 −1.1957 −1.3568 −1.1
13 −1.2783 −1.3554 −2.0   33 −1.0385 −1.0578 −0.2
14 0.5387 0.55403 0.0     34 0.7750 0.7921 0.7
15 0.4364 0.46816 0.0     35 0.7919 0.8071 0.9
16 0.3411 0.3559 0.0      36 0.5322 0.5649 0.0
17 0.1475 0.1516 0.0      37 −0.9373 −0.9577 −0.1
18 1.5607 1.6115 386.8    38 −0.7739 −0.7972 −0.0
19 −1.1742 −1.2160 −0.7   39 0.8967 0.9132 2.7
20 0.5013 0.5176 0.0

outliers (cases 4 and 18). Following the same argument, we add two more outliers
(cases 10 and 11), for which non-occurrences are replaced by occurrences.
The Pearson and standardized Pearson residuals for all observations of the
modified Finney data are presented in Table 4. We observe from this table that
the two original outliers are totally masked in the presence of the two new outliers. The

Figure 5. Index plot of standardized Pearson residuals for modified Finney data.
1708 Imon and Hadi

Figure 6. Index plot of generalized standardized Pearson residuals for modified Finney
data.

Pearson and standardized Pearson residuals corresponding to all the outliers are not
that big (less than 3 in absolute value). The index plot of standardized Pearson
residuals as shown in Fig. 5 also shows that all four outliers lie safely inside when
the commonly used SPRs are used to identify them.
Now we apply our newly proposed detection technique for the identification
of multiple outliers for the modified Finney data. Here the deletion set contains 4
observations (cases 4, 10, 11, and 18) as suggested by Fig. 4 and also by the robust
techniques. We reestimate the logistic regression model without the observations
indexed by D and compute the generalized standardized Pearson residuals which
are also shown in Table 4. We observe from this table that the GSPR values for
the observations 4, 10, 11, and 18 are unusually high and hence can be declared as
outliers.
Since the GSPR values for the outliers are excessively large in comparison with
those for the inliers, we present their index plot in Fig. 6 after a suitable rescaling
(a shifted log scale). This figure shows that the four outliers are easily identified
and they are clearly separated from the rest of the data.

5. Conclusions
In this article, we propose a new method for the identification of multiple outliers
in logistic regression. We introduce group deletion residuals that we call the
generalized standardized Pearson residuals and use them as effective diagnostics
for detecting multiple outliers in logistic regression. The numerical examples clearly
show that the proposed method is very successful in the identification of multiple
outliers when the existing commonly used methods fail to do so.

Acknowledgment
The authors gratefully acknowledge valuable comments and suggestions from the
reviewer.
Identification of Multiple Outliers in Logistic Regression 1709

References
Atkinson, A. C. (1994). Fast very robust methods for the detection of multiple outliers.
J. Amer. Statist. Assoc. 89:1329–1339.
Barnett, V., Lewis, T. (1994). Outliers in Statistical Data. 3rd ed. New York: Wiley.
Billor, N., Hadi, A. S., Velleman, P. F. (2000). BACON: Blocked adaptive computationally-
efficient outlier nominator. Computat. Statist. Data Anal. 34:279–298.
Brown, B. W., Jr. (1980). Prediction analysis for binary data. In: Miller, R. G. Jr., Efron, B.,
Brown, B. W. Jr., Moses, L. E., eds. Biostatistics Casebook. New York: Wiley.
Chatterjee, S., Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. New York: Wiley.
Chen, C., Liu, L. M. (1993). Joint estimation of model parameters and outlier effects in time
series. J. Amer. Statist. Assoc. 88:284–297.
Cook, R. D., Hawkins, D. M. (1990). Comment on unmasking multivariate outliers and
leverage points by Rousseeuw, P. J., van Zomeren, B. C. J. Amer. Statist. Assoc.
85:640–644.
Davies, P., Imon, A. H. M. R., Ali, M. M. (2004). A conditional expectation method for
improved residual estimation and outlier identification in linear regression. Int. J.
Statist. Sci. (Special issue in honour of Professor M. S. Haq). 191–208.
Finney, D. J. (1947). The estimation from individual records of the relationship between
dose and quantal response. Biometrika 34:320–334.
Gentleman, J. F., Wilk, M. B. (1975). Detecting outliers in a two-way table: II supplementing
the direct analysis of residuals. Biometrics 31:387–410.
Hadi, A. S. (1992). A new measure of overall potential influence in linear regression.
Computat. Statist. Data Anal. 14:1–27.
Hadi, A. S. (2006). On the Visualization of Massive, Hyperdimensional Data. Seminar session,
Department of Statistics, University of Rajshahi, Bangladesh.
Hadi, A. S., Simonoff, J. S. (1993). Procedures for the identification of multiple outliers in
linear models. J. Amer. Statist. Assoc. 88:1264–1272.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., Stahel, W. A. (1986). Robust Statistics:
The Approach Based on Influence Functions. New York: Wiley.
Hosmer, D. W., Lemeshow, S. (2000). Applied Logistic Regression. 2nd ed. New York: Wiley.
Imon, A. H. M. R. (2005). Identifying multiple influential observations in linear regression.
J. Appl. Statist. 32:73–90.
Munier, S. (1999). Multiple outlier detection in logistic regression. Student 3:117–126.
Pregibon, D. (1981). Logistic regression diagnostics. Ann. Statist. 9:977–986.
Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc.
79:871–880.
Rousseeuw, P. J., Leroy, A. (1987). Robust Regression and Outlier Detection. New York:
Wiley.
Ryan, T. P. (1997). Modern Regression Methods. New York: Wiley.
