[Scatter plot: log($) against year, 1965-1995.]
Figure 2.1: Movie earnings by year
this can be hampered by the lack of true replication in the data. A Levene test run on this
data was insignificant (p = .8097), indicating that there are at least no obvious problems with this
assumption. Finally, there's usually not much that you can do about the independence assumption,
other than to come up with a hand-waving argument that different data points can't affect each
other.
These methods of examining the residuals are general enough that they can be used for multiple
regression problems as well (i.e., when there are several predictors in the model).
As I mentioned above, individual observations can also cause problems, either because the
response is abnormal relative to the rest of the data, or else because the predictors are abnormal
relative to the rest of the data. To look at the responses, you can calculate standardized residuals,
which indicate how many standard deviations a given observation lies above or below the regression
line. This is a little more complicated than looking for the largest residuals, since the regression
line itself is less variable for values of the predictors that are in the body of the data (near the
mean value of the predictors) than it is at the extremes of the range of the predictors. In addition,
a standardized residual can be calculated either relative to the regression line calculated from the
entire data set, or relative to a regression line that was calculated after removing the data point
that's under consideration. An examination of the standardized residuals showed that there were
two that were larger than 2 in absolute value. (Hawaii was the leading movie in 1966, earning
$15.5 million, which was low even by the standards of the day. Its standardized residual
was -2.136. Star Wars was the leading movie in 1977, and its gross receipts ($193.5 million at the
time) were surprisingly large relative to the rest of the data. Its standardized residual was 2.145.)
Neither of these standardized residuals is all that extreme, since the probability of getting a Z-value
greater than 2.14 is still greater than .01 for any one of the residuals.
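To make the mechanics concrete, here's a minimal sketch (in Python with numpy, my own choice of illustration language; the original analyses were run in statistics packages) of internally studentized residuals, i.e., residuals standardized using the full-data regression line. The externally studentized version mentioned above would instead refit the line with the point in question removed.

```python
import numpy as np

def standardized_residuals(X, y):
    """Internally studentized residuals for a least squares fit.

    X: predictor column(s); y: responses.  A column of ones is added
    for the intercept.  The denominator uses the leverage h_j, so
    points near the mean of the predictors (small h_j) are scaled
    differently from points at the extremes (large h_j).
    """
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T   # hat matrix
    resid = y - H @ y                          # ordinary residuals
    h = np.diag(H)                             # leverages
    s2 = resid @ resid / (n - Xd.shape[1])     # residual variance
    return resid / np.sqrt(s2 * (1.0 - h))
```

The 1 - h_j term in the denominator is exactly what accounts for the regression line being less variable in the body of the data than at its fringes.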
Finally, an observation can cause problems because its predictors are far away from the bulk of
the data. Abnormal predictors aren't bad per se, but the model can't afford to ignore points like
that completely, and consequently they will have a lot of influence in determining the estimated
parameters. There are two reasons for wanting to identify influential points. First, you may not
feel comfortable about extrapolating the model all the way out to those extreme situations, perhaps
because you're not completely certain that the model is linear everywhere. Second, for such an
extreme point, it's often difficult to tell whether the response is strange, since the model will go out
of its way to fit those points well, regardless of the value of the response. By identifying influential
points, you can give them extra scrutiny to determine critically whether they belong in the model
or not. With only one predictor, such a strange observation could only have an abnormally high
or abnormally low predictor, and it's pretty easy to pick points like this out on the graph. As we'll
see, with more predictors in the model, you may need more help with this. Since the independent
variable in this data set was for the most part consecutive years from 1966 to 1994, there aren't
any extreme points like this. If I had changed the data set to include the highest grossing movie
of 1944 (Sergeant York), for example, then it might be viewed as an extreme point. There are a
number of methods for picking out points like this. One example is the leverage, h_j, which is
defined as

    h_j = x_j (X'X)^{-1} x_j',

where x_j is the (row) vector of predictors for the j-th observation, and X is the matrix of predictors
for all observations. Other closely-related measures of the influence of an observation are Cook's distance,
The Pragmatist's Guide to Statistics: logistic regression 16
[Scatter plot: runs scored (Runs) against runs allowed (Runs against), one point per season.]
Figure 2.2: Runs scored for and against San Francisco Giants
predicted residuals, or a method devised by Andrews and Pregibon³. These methods differ primarily
in whether the measure of influence is standardized, and in whether the observation whose influence
is being assessed is included in the linear model calculations. For a better sense of these influence
diagnostics, I'm going to look at a second example.
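The leverage formula above is easy to compute directly. Here's a minimal sketch in Python/numpy (my own choice of illustration language, not something used in the original analyses):

```python
import numpy as np

def leverage(X):
    """Leverage h_j = x_j (X'X)^{-1} x_j' for each row x_j of the
    n-by-p design matrix X (include a column of ones if the model
    has an intercept).  Equivalently, these are the diagonal entries
    of the hat matrix X (X'X)^{-1} X'."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum('ij,jk,ik->i', X, XtX_inv, X)
```

The leverages always sum to the number of fitted parameters p, so their average is p/n; a common rule of thumb flags observations with h_j more than about twice that average.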
This example deals with the relationship in baseball between runs scored, runs given up, and a
team's success (how many games it wins). The data are for the San Francisco Giants in the years
1958 through 1993. In general, you would expect a team to do better the more runs it scores and
the fewer it allows. There have been a number of people who've studied this question in a
mathematical setting, including Earnshaw Cook⁴ and Bill James⁵. A plot of the total runs scored
by the Giants and by their opponents each year is given in Figure 2.2.
One of the first things you notice is that there's a point way down in the lower left-hand corner
of the graph. This point is for the year 1981, which was the year that the first long baseball strike
occurred. The form of the model will depend to a large extent on how an extreme data point like
³ D.F. Andrews and D. Pregibon, "Finding the Outliers that Matter," Journal of the Royal Statistical Society,
Series B, vol. 40 (1978), pp. 85-93.
⁴ In his book Percentage Baseball, published in 1964 by M.I.T. Press. His approach was based on linear regression
methods.
⁵ Bill James published a series of books entitled Bill James' Baseball Abstract, published from around 1981 to
1989. They dealt in a somewhat mathematical way with a variety of questions that arise in baseball folklore. In my
opinion, his greatest contribution is to plant the notion that many generally accepted "truths" that have become part
of baseball folklore are things that can be investigated quantitatively. His approach to the relationship between runs
and wins was largely ad hoc. His Pythagorean Theorem states that winning percentage should be related to runs via
the equation

    p = (X+)^2 / ((X+)^2 + (X-)^2),

where X+ and X- are the runs scored and allowed.
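As a quick sketch (Python again, my own choice of illustration language), James's Pythagorean formula is:

```python
def pythagorean_pct(runs_for, runs_against):
    """Bill James's 'Pythagorean' winning percentage:
    p = runs_for**2 / (runs_for**2 + runs_against**2)."""
    rf2, ra2 = runs_for ** 2, runs_against ** 2
    return rf2 / (rf2 + ra2)
```

A team that scores and allows runs in equal measure is predicted to play .500 ball, and the prediction rises smoothly as the run differential improves.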
[Two panels plotted against Year: standardized residuals and leverages.]
Figure 2.3: Model predicting wins from runs for/against
that is fit. For example, if you simply fit a linear model for the number of wins as a function of
runs scored (X+) and given up (X-), the least squares equation is

    W = 49.82 + 0.127 X+ - 0.082 X-.
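A model of this form can be fit by ordinary least squares in a few lines. Here's a hedged sketch in Python/numpy (the original computations were done in a statistics package, and the Giants data themselves aren't reproduced here, so the function is shown with synthetic inputs):

```python
import numpy as np

def fit_wins_model(runs_for, runs_against, wins):
    """Least squares fit of  W = b0 + b1*X_plus + b2*X_minus,
    returning the coefficient vector (b0, b1, b2).  With the Giants
    data described in the text, this would give roughly
    W = 49.82 + 0.127*X_plus - 0.082*X_minus."""
    X = np.column_stack([np.ones(len(wins)), runs_for, runs_against])
    coef, *_ = np.linalg.lstsq(X, wins, rcond=None)
    return coef
```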
I'm going to look at three possible models here to see how the leverage and residuals can change
depending on how the model is formulated. The first of these, Figure 2.3, predicts the number of
wins from the numbers of runs the Giants and their opponents scored. Notice that 1981 is seen
both as a point with high leverage (more than twice the leverage of any other point) and as the
point with the largest residual in absolute value (Z = -3.87). That's because, from the way the data
were presented, the model can't tell the difference between a season with fewer games and one in
which the Giants just didn't win much.
If we instead express the dependent variable as the Giants' win percentage, then 1981's data point
is still influential (since leverage is defined in terms of the predictors only and not the response),
but the residual isn't bad at all (Z = 0.22). The worst residual in this model was for the year
1972, when the Giants outscored their opponents but wound up winning under 45% of their games.
The leverage and standardized residuals for this model are presented in Figure 2.4. Note that the
leverages are identical to ones from the previous model.
Finally, if we express the dependent variable as the Giants' win percentage and the independent
variables as the runs per game scored either by or against the Giants, then 1981 isn't particularly
strange, either in terms of its response or in terms of its predictors. In this plot, the largest residual
was again for the 1972 season (Z = -2.91), and the largest leverage was for the 1970 season, in
which the Giants led the league both in runs scored and in runs allowed. The standardized
residuals and leverages for this model are plotted in Figure 2.5.
[Two panels plotted against Year: standardized residuals and leverages.]
Figure 2.4: Model predicting winning percentage from runs for/against
[Two panels plotted against Year: standardized residuals and leverages.]
Figure 2.5: Model predicting winning percentage from runs per game for/against
To recap the most important aspects of these linear regression models: it's important to look
at the overall fit of the model, using a criterion like the Durbin-Watson statistic or even the model's
R² value, but it's also important to look at the impact of individual points on the results of the
regression, based on some influence diagnostics. What's more, whether a point is an outlier,
or whether it's excessively influential, will depend strongly on how the model is formulated.
Similar ideas can be applied to logistic regressions, as we'll see in the next section.
2.3 Logistic Regression Diagnostics
In a logistic regression, there's considerably less that can go wrong with the model, but by the
same token, there's less information to tell you when things are going wrong. If we look back at
the assumptions underlying a linear model (regression or analysis of variance), many of them have
limited relevance for a logistic regression:
The residuals from the model are normally distributed. Of course, the residuals from a logistic
regression aren't normally distributed, since the response is either 0 or 1. (That is, the
distribution of the responses is Bernoulli with parameter p, or at worst binomial(n, p).) The
residual thus will be either 1 - p (with probability p) or -p (with probability 1 - p). What's
important, though, is that the distribution of the residuals is determined by the probability
p of a response, unlike the least squares situation, in which the distribution of the residuals
doesn't depend on the mean at all, but rather on the additional parameter σ².
The residuals have a constant residual variance. This isn't a concern either, since the mean
(p) determines the variance (p(1 - p)).
The residuals are independent. If you recall, this assumption was the one with the hand-
waving justification, and for a logistic regression there's still typically no statistical way to
argue that independence is a reasonable assumption. You have to argue that one individual's
response isn't affected by how the other individuals responded.
The residuals have a mean of zero. This is the assumption that was addressed by a Durbin-
Watson statistic or by a χ² test for goodness of fit in the least squares case. As we saw
last time, there are similar statistics available for a logistic regression, including the Hosmer-
Lemeshow statistic and the C. C. Brown statistic.
The other type of residual analysis we did in our regression examples looked at individual
residuals to determine whether either the predictors or the responses for those cases were aberrant.
Similar statistics are available for a logistic regression; the ones that are easiest to come by are
based on the standardized residuals and on the leverage statistic, which is why I emphasized those
in the least squares examples. I'm going to do a couple of examples of logistic regressions in
which those diagnostics provide useful information. After the examples, I'll talk a little about the
control language that's used to run a logistic regression and to get the diagnostic information.
Unfortunately, no one package gives all the diagnostics you'd like (at least in the form you'd like),
as we'll see.
2.4 Logistic Regression Diagnostics: Patient Survival Example
The first example consists of data on close to 300 emergency room patients who were admitted
for a variety of causes and who ultimately either survived to be discharged from the hospital or
[Scatter plot with fitted logistic curve: Pr{Death} against APSM1, 0 to 16.]
Figure 2.6: Mortality in Emergency Room Patients
else they didn't. There are a number of indices that have been proposed as triage tools (as well
as to evaluate the performance (success) of a hospital) that are based on a patient's condition
and produce an estimate of how likely the patient is to survive. Some of these indices, such as
the ISS (Injury Severity Score), are pretty ad hoc, while others, such as APACHE II, are based on
logistic regression analyses. The problem with many of these indices is that the predictors on
which they're based often include variables that aren't observable on a timely basis. That is, they
may involve laboratory values that aren't available at the time that decisions have to be made
about a patient's treatment. For this reason, there is a need for somewhat cruder indices that are
based on information that can be collected very easily, as soon as a patient enters the hospital. The
data in this example are for such an index that we've discussed before, called APSM1 here, which
is based on some basic vital signs (heart rate, blood pressure, and respiration), as well as a fairly
standard measure of a patient's responsiveness, called the Glasgow Coma Scale (GCS). The index
is integer-valued and ranges from 0 to 16, with the higher numbers corresponding to the patients
whose medical condition is most critical.
The graph above (Figure 2.6) shows a plot of the raw data, along with the predicted probability
of mortality, based on a logistic regression in which APSM1 was the only predictor. The graph
can be a little misleading, since there are many instances in which patients have identical APSM1
values. Thus there are a bunch of points that get plotted on top of each other. What you can see
from this graph is that there's a lot of overlap in APSM1 score between the patients who survived
and those who died, but it's also noteworthy that nobody with a score above 11 survived, whereas
there were quite a few patients who died and who had scores that high.
In the excerpt from the (BMDP) output which follows, you can see that APSM1 is quite
significant as a predictor of mortality (χ² = 58.71, p < .0001). You can also see that the model fits
reasonably, if not spectacularly, well, since both the Hosmer-Lemeshow and Brown goodness of fit
statistics are insignificant (Hosmer: χ²₆ = 11.180, p = .083; Brown: χ²₂ = 1.617, p = .445). As is
often the case, the likelihood ratio (2*O*LN(O/E)) chi-squared statistic isn't terribly enlightening,
since it's based on a contingency table that has 58 degrees of freedom. This is a lot, given that
there were under 300 observations in the data set. In a sense, we were lucky here, since although
APSM1 ranges from 0 to 16, it takes only integer values. If it had been continuously distributed,
then the chi-squared table on which the likelihood ratio chi-squared statistic is based would have
had one cell for each unique combination of predictors, possibly as many as there were observations.
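For reference, the Hosmer-Lemeshow statistic is computed by grouping the cases on their predicted probabilities and comparing observed to expected event counts in each group. Here's a sketch in Python (my own illustration; note that packages differ in how they form the groups, especially with ties like the integer APSM1 scores, so this won't reproduce BMDP's numbers exactly):

```python
import numpy as np

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow chi-square: sort cases by predicted probability,
    split into roughly equal groups, and compare observed to expected
    event counts in each group.  Returns (statistic, df), with df
    conventionally taken as groups - 2."""
    order = np.argsort(p_hat)
    y = np.asarray(y, dtype=float)[order]
    p_hat = np.asarray(p_hat, dtype=float)[order]
    chi2 = 0.0
    for idx in np.array_split(np.arange(len(y)), groups):
        obs = y[idx].sum()            # observed events in group
        exp = p_hat[idx].sum()        # expected events in group
        n_g = len(idx)
        var = exp * (1.0 - exp / n_g) # n_g * pbar * (1 - pbar)
        if var > 0:
            chi2 += (obs - exp) ** 2 / var
    return chi2, groups - 2
```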
STEP NUMBER 1 apsm1 IS ENTERED
---------------
LOG LIKELIHOOD = -101.772
IMPROVEMENT CHI-SQUARE ( 2*(LN(MLR) ) = 92.533 D.F.= 1 P-VALUE= 0.000
GOODNESS OF FIT CHI-SQ (2*O*LN(O/E)) = 59.965 D.F.= 58 P-VALUE= 0.404
GOODNESS OF FIT CHI-SQ (HOSMER-LEMESHOW)= 11.180 D.F.= 6 P-VALUE= 0.083
GOODNESS OF FIT CHI-SQ ( C.C.BROWN ) = 1.617 D.F.= 2 P-VALUE= 0.445
STANDARD 95% C.I. OF EXP(COEF)
TERM COEFFICIENT ERROR COEF/SE EXP(COEF) LOWER-BND UPPER-BND
Based on this portion of the output, you can't get all that good an idea of how the model is
performing. The fact that the p-value for the Hosmer goodness of fit statistic is less than 0.10 is a
little ominous, but it's not clear why this is happening. To fill in the picture, you need to look at
some additional diagnostic information, of which the leverage and standardized residuals are two
examples. Both BMDP and SAS will calculate this information, but since I prefer the way SAS
presents the information, I'll extract a little of it from the analogous SAS output.
Regression Diagnostics
Covariates Pearson Residual
Case (1 unit = 0.66)
Number APSM1 Value -8 -4 0 2 4 6 8
1 6.0000 . | | |
2 2.0000 . | | |
3 0 -0.1893 | * |
4 4.0000 -0.4400 | *| |
[Two panels plotted against APSM1: leverage and standardized residuals.]
Figure 2.7: Diagnostics on LR of Emergency Room Patients
C CBAR
Case (1 unit = 0.01) (1 unit = 0.01)
Number Value 0 2 4 6 8 12 16 Value 0 2 4 6 8 12 16
1 . | | . | |
2 . | | . | |
3 0.000145 |* | 0.000144 |* |
4 0.00102 |* | 0.00102 |* |
5 0.000402 |* | 0.0004 |* |
6 0.000145 |* | 0.000144 |* |
7 0.000145 |* | 0.000144 |* |
8 0.000402 |* | 0.0004 |* |
9 0.00173 |* | 0.00172 |* |
10 0.00173 |* | 0.00172 |* |
DIFDEV DIFCHISQ
Case (1 unit = 0.43) (1 unit = 1.75)
Number Value 0 2 4 6 8 12 16 Value 0 2 4 6 8 12 16
1 . | | . | |
2 . | | . | |
3 0.0705 |* | 0.0360 |* |
4 0.3549 | * | 0.1946 |* |
5 0.1604 |* | 0.0837 |* |
6 0.0705 |* | 0.0360 |* |
7 0.0705 |* | 0.0360 |* |
8 0.1604 |* | 0.0837 |* |
9 0.5189 | * | 0.2968 |* |
10 0.5189 | * | 0.2968 |* |
First off, Figure 2.7 shows the leverage of a data point along with its standardized residual,
both as functions of APSM1. From the plot of standardized residuals, we see
two curves stretching across the page. The upper of these corresponds to the standardized residuals
for patients who ultimately died, while the lower one corresponds to the standardized residuals for
patients who survived. By far, the largest residuals occur for the patients with low APSM1 scores
who for whatever reason died. There are many possible explanations for this, but basically these
people had something wrong with them that wasn't reflected in their APSM1 score. Thus, while
you wouldn't think they would die by looking at their score, something else was going on that
meant they were sicker than they appeared on the surface.
Carrying this observation a little further, one of the assumptions of a logistic model (or a probit
model, for that matter) is that as the predictors get extreme in one direction or the other, the probability
of a response will go to zero or one, depending on the direction. This assumption may not be
reasonable for this type of data, since it may be the case that patients have some probability of
dying regardless of how healthy they look. The mere fact that they're in the hospital should mean
that something is wrong with them. Likewise, it may also be true that even extremely sick patients
may have some probability of surviving. Thus, while the model may fit quite well in the body of
the data (well enough to pass a goodness of fit test), the model may break down when you get to
the observations with the most extreme predictors. The same can be said of a linear regression
model, for which the regression function E(Y|X) may be essentially linear for most of the data, and yet
depart from linearity as you get to the fringes of the data. It's always much easier to tell whether
the model is correctly specified where there's a lot of data than it is at the extremes. That's why
extrapolation is much trickier than interpolation.
A final observation on the standardized residual plot is that the most extreme residuals
all correspond to patients who died when they weren't expected to. It's also possible to get large
residuals in the other direction, but it's not surprising in this case that we didn't, since many more
patients survived than didn't (78% of them). The "healthiest" patient to die had an estimated
probability of death around 3%, while the sickest patient to survive (who had an APSM1 of 11)
had an estimated probability of survival around 22%.
The plot of leverage values may be a little surprising, since in a linear regression, the leverage
increases as a quadratic function of the difference between the predictors for the observation in
question and the mean vector of predictors across all observations. In the lower part of Figure 2.7,
we see that the leverage increases up to an APSM1 score of around 12, and decreases beyond that
point. The reason this happens is that for an observation to have leverage, it must have predictors
that are out of the ordinary, but it must also be variable enough that surprises can occur. In a
logistic regression, the points at either extreme will have predicted probabilities close to either 0 or
1, so their variance (p(1 - p)) will be small.
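That weighting is exactly why the leverage pattern differs from the linear case: in a logistic model, the hat matrix is computed from the design matrix weighted by p(1 - p). A sketch in Python/numpy (my own illustration, not the packages' code):

```python
import numpy as np

def logistic_leverage(X, p_hat):
    """Leverage for a fitted logistic regression: the diagonal of
    H = W^{1/2} X (X'WX)^{-1} X' W^{1/2},  where W = diag(p(1-p)).
    X is the design matrix (with an intercept column); p_hat holds
    the fitted probabilities.  Because w = p(1-p) -> 0 at extreme
    fitted probabilities, leverage can decrease again at the
    extremes of the predictor range."""
    w = p_hat * (1.0 - p_hat)
    Xw = X * np.sqrt(w)[:, None]           # W^{1/2} X
    XtWX_inv = np.linalg.inv(Xw.T @ Xw)    # (X'WX)^{-1}
    return np.einsum('ij,jk,ik->i', Xw, XtWX_inv, Xw)
```

As in the linear case, these leverages sum to the number of fitted parameters.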
The excerpt from the output contains a number of statistics that I haven't discussed, though
they're still aiming to identify either points that have anomalous responses or points that have undue
influence on the outcome of the analysis. Briefly, the additional measures are as follows:
The Deviance Residual is similar to a Pearson residual, in that it measures the difference be-
tween the observed response (i.e., 1 or 0) and the model's predicted probability of a response.
The difference is that the Pearson residual is measured in terms of squared error, while the
Deviance residual is measured on a log scale, defined as sqrt(-2 ln p_j) if subject j re-
sponded, and as -sqrt(-2 ln(1 - p_j)) if subject j didn't respond, where p_j is the predicted
probability that subject j would respond.
DFBETA measures how much one of the model's parameter estimates will change if the j-th
observation is deleted from the sample. It's measured relative to the standard error for that
parameter, so it's expressed in standardized units. Since each model will have one parameter
for the intercept and one for each of k covariates in the model, the number of DFBETA
measures will be equal to the number of degrees of freedom in the model, including one for
the intercept term.
C and CBAR are composite measures that approximate how much the estimated param-
eter vector will change if the j-th observation is deleted. This is measured relative to the
estimated covariance matrix of the parameters, so again this is a standardized measure. C's
approximation is based on the asymptotics of the Pearson χ² measure of fit, while CBAR is
based on the asymptotics of the Deviance (log likelihood) measure of fit.
DIFDEV and DIFCHISQ measure how much the deviance (log likelihood χ²) or Pearson
χ² statistics would change if the j-th observation were deleted. Large values correspond to
observations that aren't being predicted very well by the model.
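The deviance residual for a logistic model takes only a couple of lines to compute (Python again, my own choice of language; the usual sign convention is positive for responders and negative for non-responders):

```python
import numpy as np

def deviance_residuals(y, p_hat):
    """Deviance residuals for a logistic model:
    sqrt(-2 ln p_j) if subject j responded (y = 1),
    -sqrt(-2 ln(1 - p_j)) if subject j didn't (y = 0),
    where p_j is the fitted probability of a response."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return np.where(y == 1,
                    np.sqrt(-2.0 * np.log(p_hat)),
                    -np.sqrt(-2.0 * np.log(1.0 - p_hat)))
```

A subject whose outcome the model predicted with near certainty gets a residual near zero; a responder with a tiny fitted probability gets a large positive residual, like the low-APSM1 patients who died.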
I'm going to look at one more example, which differs primarily in that there are going to be
more predictors included in the model, some of which are categorical.
2.5 Logistic Regression Diagnostics: Resistant Infection Example
It's not at all uncommon for hospitalized patients to develop bacterial infections, either in con-
junction with or peripheral to their disease. There are many antibiotic agents that can be used
to treat these infections, so an infected patient will often receive a series of antibiotic treatments
to try and control their infection. This can get nasty if the bacterial strain involved is one that's
resistant to antibiotic treatment. There's some reason to suspect that if a patient has an initial
infection that's sensitive to treatment with antibiotics, then subsequent infections may be more
likely to be resistant, depending on the type of antibiotics that were used to treat the initial infec-
tion. The data set in this example looks at a number of patients who had an initial infection that
was sensitive and who subsequently developed a second infection. (It was known that the second
infection wasn't just a reappearance of the initial infection, since the data set contains only cases
in which a different bacterial strain was involved in the second infection.) Logistic regression was
used to look at whether the antibiotics that were used in treating the initial infection influenced
whether the second infection was resistant or sensitive to antibiotics. I've taken a few liberties with
the original analysis, sticking in a covariate that wasn't important, and redefining a variable that
originally indicated simply whether or not an aminoglycoside was used so that it indicated which
one was used. I made these changes so that I'd have at least one covariate in the model and at
least one categorical effect that had more than two levels.
SAS and BMDP wound up fitting very similar, though not identical, models. Some relevant
portions of the BMDP output, including the last step in the stepwise procedure, are given below.
resist . . . . . . 25.
sens . . . . . . 86.
VARIABLE STANDARD
NO. N A M E MINIMUM MAXIMUM MEAN DEVIATION SKEWNESS KURTOSIS
3 age 3.0000 77.0000 43.6486 19.5093 0.1232 -1.0894
5 apache 0.0000 30.0000 9.8468 6.9219 0.5329 -0.4116
19 esc 0 85 0
1 26 1
20 cef 0 56 0
1 55 1
37 aglyco 1 82 0 0 0
2 21 1 0 0
3 7 0 1 0
4 1 0 0 1
BMDP chose 3 predictors for inclusion in the model: whether cefazolin was used, whether an
extended spectrum cephalosporin (ESC) was used, and which (or whether an) aminoglycoside was
used. Other terms were listed as possible covariates, including age, Apache score, and interac-
tions between ESC's and either cefazolin or aminoglycosides. (The interest in the interactions was
whether the use of some of the more routine antibiotics made the problems with ESC's worse.)
The overall tests of the model's fit came out reasonably well (p = .944 for the Hosmer statistic
and p = .727 for the Brown statistic), so on that level, there's no reason to question the model. As
I've mentioned previously, BMDP will let you calculate leverage and residual diagnostics and save
them in a BMDP output data set, but since this makes it a little cumbersome to work with them,
I'm going to take this information out of the SAS output again. Since there are a few more points
I want to make about the BMDP output, I'll come back to this.
An extra piece of output that I requested that BMDP print is the estimated covariance matrix
for the parameters of the model. The reason that this might be interesting is that for a significant
categorical predictor, you probably also want to know which of the categories differ significantly.
For example, for the aminoglycoside predictor, we might want to know whether patients who receive
gentamicin are at greater risk of getting a resistant infection than patients who receive tobramicin,
or patients who receive none of the aminoglycosides. I didn't copy the portion of the output that
tells you which of the categories correspond to which treatment (it's in the control language in
Section 2.6). The four categories I defined (in order) were "no aminoglycoside," "gentamicin,"
"tobramicin," and "amikacin." According to the definition of the design variables (cf. p. 33),
a patient who was in the "no aminoglycoside" group would have a value of zero for each of the
design variables, a patient who was in the "gentamicin" group would have a one for the first design
variable and zeroes for the others, and so forth. From this, you can see that the difference in the
logit between a gentamicin patient and one who didn't receive any aminoglycosides would be the
difference between a term equal to 1·C1 + 0·C2 + 0·C3 = C1 in the logit, and one that's equal to
0·C1 + 0·C2 + 0·C3 = 0, respectively, where Ci is the coefficient of the i-th design variable. The
difference between these two expressions is just C1. Thus, to compare these two groups, all you
have to do is look at the estimated coefficient C1, divided by its estimated standard error. This will
give a Z statistic which has an approximate normal distribution with mean zero and variance 1.
If it's larger than approximately 2 in absolute value, then patients who received gentamicin had a
significantly different response, relative to patients who didn't get any aminoglycoside.
It's more complicated to compare gentamicin with tobramicin, since the corresponding coeffi-
cients of the three design variables would be 1·C1 + 0·C2 + 0·C3 and 0·C1 + 1·C2 + 0·C3, respectively.
The difference between the two is C1 - C2, and while it's easy to calculate this difference, it's a little
harder to calculate an estimate of the standard error of this difference. If the difference in question
was Ci - Cj, then you could calculate an approximate Z statistic of the form

    Z = (Ci - Cj) / sqrt(s_ii + s_jj - 2 s_ij),

where s_ij is the (i,j)-th element of the covariance matrix of the coefficients. (It's also possible to get
at this quantity based on the means, standard errors, and correlation matrix among the coefficients,
but this is a more cumbersome calculation.) You can define more complicated contrasts among the
estimated coefficients and construct Z statistics based on them, but I won't go into how to do that
here.
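Given the coefficient vector and its estimated covariance matrix, the pairwise contrast takes just a couple of lines. A sketch in Python (the numbers in the test call are made up for illustration, not taken from the BMDP run):

```python
import numpy as np

def contrast_z(coef, cov, i, j):
    """Approximate Z statistic for the contrast C_i - C_j between two
    design-variable coefficients:
        Z = (C_i - C_j) / sqrt(s_ii + s_jj - 2*s_ij),
    where cov is the estimated covariance matrix of the coefficients."""
    diff = coef[i] - coef[j]
    se = np.sqrt(cov[i, i] + cov[j, j] - 2.0 * cov[i, j])
    return diff / se
```

A value larger than about 2 in absolute value indicates a significant difference between the two categories, just as with the single-coefficient comparison against the baseline group.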
I should comment that BMDP uses a number of different ways of defining the design variables,
and the form that's used by default can differ from one version of the program to another, so you
need to check the way the design variables have been defined. For the record, you can ask BMDP
to use a particular parameterization by specifying the DVAR option as either MARG (marginal), PART
(partial, which is the one that was used here), or ORTH (orthogonal).
One thing that appears all over the place in the BMDP output is the warning messages about
the model failing to converge. The reason for all of these messages is that for one of the design vari-
ables (the last dummy variable for the aminoglycoside effect), the parameter was essentially infinite
(the estimated adjusted odds ratio for this dummy variable was over a million). This happened
because there was only one subject who received amikacin (the last of the aminoglycosides), and
that patient developed a resistant infection, so there was no observed variability in the response to
that antibiotic. In such cases, there's no way to estimate the parameter (which truly is infinite)
to within the tolerance factor that's used to define convergence, so it's actually no surprise that
the model didn't converge. It's noteworthy that SAS and BMDP react differently to this situation.
Both programs recognize that the algorithm has failed to converge, but BMDP gives you the chance
to proceed with the calculations anyway, whereas SAS simply prints out an error message and quits.
For this (and one other) reason, the model that SAS wound up selecting was slightly different.
The model that SAS fit is summarized in the following excerpt from its output:
Step 3. Variable CEF entered:
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 120.424 65.546 .
SC 123.134 76.384 .
-2 LOG L 118.424 57.546 60.878 with 3 DF (p=0.0001)
Score . . 58.428 with 3 DF (p=0.0001)
Score Pr >
Variable Chi-Square Chi-Square
NOTE: No (additional) variables met the 0.05 significance level for entry into
the model.
As in the BMDP analysis, three predictors got chosen for the model, two of which were the same
as before (ESC's and cefazolin). The third predictor was one of the aminoglycosides, but since SAS
didn't know anything about the relationships among the potential predictors (which ones should
enter and/or leave the model simultaneously), it chose only one of the three dummy variables for
the aminoglycoside factor. This could be viewed as either a strength or as a weakness, but I think
that at the very least, it's unfortunate that you never get a 3 degree of freedom test of whether the
aminoglycosides (as a group) make a significant difference. One implication of this is that the form
of SAS's model will often depend on how you decide to parameterize the categorical effects (which
set of dummy variables you use) whereas BMDP's model won't. This failure to recognize which of
the predictors need to be linked with each other is particularly unfortunate if you're trying to keep
track of interaction terms (which in a SAS model, are simply represented by products between the
main effect dummy variables). It's quite possible (as well as likely) that SAS will choose only some
of the dummy variables for a main effect, and a different subset of the dummy variables for the
corresponding interactions. A model like this can be extremely hard to interpret. For this reason,
I find that the SAS stepwise algorithm in PROC LOGISTIC is useful primarily when all effects
are covariates and there aren't any interactions being considered. There are some other ways of
running logistic models in SAS, which is one of the things that I'll discuss next time.
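As a sketch of the bookkeeping this implies (the variable and level names here are illustrative, not SAS's), this is how the dummy variables and their interaction products would be built by hand for a four-level factor crossed with a dichotomous covariate:

```python
def dummies(value, levels):
    """Reference-cell dummy coding: one indicator per non-reference level."""
    return [1 if value == lev else 0 for lev in levels[1:]]

# Hypothetical aminoglycoside factor with levels none/gen/tob/ami ('none' as reference)
levels = ["none", "gen", "tob", "ami"]

def design_row(esc, aglyco):
    d = dummies(aglyco, levels)
    # interaction columns are simply products of the main-effect columns
    inter = [esc * di for di in d]
    return [esc] + d + inter

print(design_row(1, "tob"))  # -> [1, 0, 1, 0, 0, 1, 0]
```

A stepwise procedure that treats these seven columns as unrelated can keep, say, the second interaction product while dropping the corresponding main-effect dummy, which is exactly the hard-to-interpret situation described above.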
The following is an excerpt from the diagnostics on this model. I'm showing only a few of the
cases and only a few of the diagnostic variables, primarily to conserve space.
Covariates Pearson Residual
Case (1 unit = 0.45)
Number ESC CEF TOB Value -8 -4 0 2 4 6 8
41 0 1.0000 0 -0.3034 | *| |
42 1.0000 0 0 -1.0349 | * | |
43 0 0 0 -0.0867 | * |
44 0 0 0 -0.0867 | * |
45 0 0 1.0000 -1.0072 | * | |
46 0 0 0 -0.0867 | * |
From this, you can see that of these ten cases, the one with the worst standardized residual
was case 47, which was a subject who had received two of the "bad" antibiotics, and yet whose
reinfection was sensitive to antibiotic therapy. The large residual reflects the fact that the predicted
probability of sensitivity for that subject's infection was only 7.1%. By contrast, subject 45 had a
residual that was quite reasonable (around −1), but its leverage (hat matrix diagonal) was much
larger than for the other listed points. This was because this was the only subject in the study who
received cefazolin, but not the other two antibiotics that were chosen for the model.
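The connection between a small predicted probability and a large residual can be made concrete. For a single binary observation, the Pearson residual is (y − p̂)/√(p̂(1 − p̂)); a quick sketch using the 7.1% figure quoted above:

```python
import math

def pearson_residual(y, p_hat):
    """Pearson residual for one binary observation: (y - p) / sqrt(p(1-p))."""
    return (y - p_hat) / math.sqrt(p_hat * (1.0 - p_hat))

# A subject whose infection was predicted to be sensitive with probability
# only 0.071, but who turned out sensitive (y = 1), gets a large residual:
r = pearson_residual(1, 0.071)
print(round(r, 2))  # about 3.6
```

The same observation with y = 0 would have had a residual of modest size, which is why surprising responses at extreme predicted probabilities dominate the residual listing.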
I should comment that for both BMDP and SAS, the leverage diagnostics pertain only to the
variables that were selected for the model. This is probably a reasonable choice, since otherwise,
you would have to be concerned with the influence of a case under a lot of different circumstances,
depending on which predictors were in the model. However, this doesn't help you to diagnose
whether one of the variables that wasn't chosen failed to be chosen because of what happened for
a few influential cases. The only way to get this kind of information is to calculate the influence
diagnostics for some additional models. (Those models would have to be chosen explicitly, rather
than using the stepwise algorithms, since of course the stepwise procedure would start by eliminating
the extra variables that you had just put into the model.)
It's possible to get a Hosmer goodness of fit statistic out of PROC LOGISTIC (version 6.07 or
later), but the statistic you get isn't exactly the same as the one from BMDP. The next piece of
output is what SAS prints out for its Hosmer statistic.
Last time, I described the Hosmer statistic as a goodness of fit chi-squared statistic, based on a
ten-cell table, in which the first cell contained the 10% of the cases that had the lowest predicted
Pr{response}, the second cell consisted of the next lowest 10% of the cases, and so forth. The
problem with this is that if there are a bunch of cases with very low or very high estimated
probabilities, then the expected frequencies in some of these cells can be lower than you probably
ought to use in a chi-squared table. SAS's Hosmer statistic went ahead and used all 10 cells anyway.
BMDP's Hosmer statistic used a little more restraint and based the statistic on an 8 cell table
(hence it had 6 degrees of freedom). Because of this, I prefer the way that BMDP does the Hosmer
calculation.
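The deciles-of-risk construction can be sketched as follows; this is the general idea rather than either package's exact implementation (the variance formula and the handling of ties differ in detail between programs):

```python
def hosmer_lemeshow(p_hats, ys, n_groups=10):
    """Chi-squared statistic from grouping cases by sorted predicted probability."""
    pairs = sorted(zip(p_hats, ys))
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        obs = sum(y for _, y in chunk)   # observed positive responses in the group
        exp = sum(p for p, _ in chunk)   # expected positive responses in the group
        var = exp * (1.0 - exp / len(chunk))  # binomial variance of the group total
        if var > 0:
            stat += (obs - exp) ** 2 / var
    return stat
```

Setting n_groups=8 instead of 10 mimics BMDP's more restrained 8-cell version (6 degrees of freedom); with 10 groups, cells at the extremes can have expected counts too small for the chi-squared approximation to be trustworthy.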
2.6 Control language for Logistic Regression, using SAS and BMDP
As I've mentioned in my lecture notes on Repeated Measures Analysis of Variance6, SAS and
BMDP control language are similar in structure, in that there are two main parts to the control
language, the first of which tells you how to read the data correctly, and the second of which tells
you what to do with the data, once you've read it. In BMDP, the "read the data" sections are the
PROBLEM, INPUT, VARIABLE, GROUP, and TRANSFORM paragraphs, while in SAS, all of this is contained
in the DATA step. For a logistic regression in BMDP, the description of what to do with the data
is in the REGRESS and PRINT paragraphs, while in SAS, it's what follows PROC LOGISTIC. I'm going
to list out the two programs that I used to create the example on resistant infections (Section 2.5)
in their entirety, and then discuss the purpose of the various subcommands.
The control language I used for the BMDP analysis of this data was:
/problem title is 'stu cohen ESC data revisited, infants excluded'.
/input variables are 35.
6 The Pragmatist's Guide to Statistics: repeated measures, which is available on the UCD Division of Statistics
gopher in a series of three LaTeX files.
file is 'c:\home\sc4x.dat'.
reclen is 132.
format is free.
/variable names are firstcnt, los, age, sex, apache, apache2, restnt1,
date1, losfps, restnt2, losfpr, restnt3, noabx, ags,
oneceph, thrcph, exceph, noexceph, esc, cef, van, pip,
gen, tob, ami, tmp, imi, surgery, drains, intub,
foley, lines, feedtube, radiol, gilab, los1, aglyco.
add = 2.
missing is 35*-1.
use = age, apache, restnt2, esc, cef, aglyco.
/group codes(restnt2) are 2, 0.
names(restnt2) are resist, sens.
codes(aglyco) are 0, 1, 2, 3.
names(aglyco) are none, gen, tob, ami.
/transform if (firstcnt ne 1) then use = 0.
if (age le 1) then use = 0.
los1 = max(losfps, losfpr).
aglyco = 0.
if (gen = 1) then aglyco = 1.
if (tob = 1) then aglyco = 2.
if (ami = 1) then aglyco = 3.
/regress dependent is restnt2.
categorical are esc, cef, aglyco.
interval are age, apache.
method is mlr.
model is age, apache, esc, cef, esc*cef, aglyco, esc*aglyco,
cef*aglyco.
start = out, out, out, out, out, out, out, out.
move = 2, 2, 2, 2, 2, 2, 2, 2.
/print cova.
/end
The most important aspects of the "data" section of the program are the USE statement in the
VARIABLE paragraph, and the GROUP paragraph. By default, BMDP will use all of the variables in
the model, treating any variables that aren't listed as covariates (interval variables) as categorical
predictors. To limit the number of variables that are potential predictors in the model, you list
them, along with the dependent variable in the USE statement. This is particularly important,
since the analysis will be able to use only cases for which all of the potential predictors have been
measured. It's a bad thing if cases get deleted from the analysis because of missing values for
variables that you weren't going to use anyway.
The importance of the GROUP paragraph is primarily that it can be used to define which logistic
model is being fit. If a dependent variable has responses of 0 and 1, then you could do a logistic
regression to predict either Pr{Y = 0} or Pr{Y = 1}. The difference between the two wouldn't
appear in terms of what predictors were significant; in that sense the models would be identical.
However, each of the model coefficients for one of these models would have the opposite sign from
the corresponding coefficient from the other model. Unless you're sure which of the two models
is being fit, you can misinterpret the results badly. By including a GROUP paragraph and defining
groups for the dependent variable (restnt2 in this case), you can force BMDP to fit a model that
predicts the first of these responses.
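The sign flip is easy to verify numerically: the log odds of Y = 0 are the exact negative of the log odds of Y = 1, so every coefficient in a model for Pr{Y = 0} is the negative of the corresponding coefficient in the model for Pr{Y = 1}. A quick sketch:

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

# If Pr{Y=1} = 0.8, then Pr{Y=0} = 0.2, and the two log odds are negatives
# of each other; since the linear predictor equals the log odds, the fitted
# coefficients (intercept included) all change sign between the two models.
p = 0.8
print(logit(p), logit(1.0 - p))
```

This is why checking which response BMDP (or SAS) is modeling matters so much: the predictor list and p-values look identical either way, but every odds ratio is inverted.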
The REGRESS paragraph has a number of features, and it's best to go over them one at a time.
The dependent statement defines the dependent variable. It should be a dichotomous variable,
and it should reflect the response for a single individual. If the data have been entered
with a set of predictors, followed by the number of subjects who responded positively or negatively,
then there's an alternate way of specifying the dependent variable, by listing variables
that contain the number of successes (SCOUNT), the number of failures (FCOUNT) or the total
number of cases (COUNT). You need only two of these three variables, since you can always
calculate the third one from the other two.
interval and categorical are used to define which of the variables are covariates and which
are categorical predictors. If you forget to mention some of the predictors here, BMDP will
assume they're categorical.
The method statement tells BMDP whether to use the fast and sloppy algorithm (ACE) or the
slower, maximum likelihood method (MLR). The default is to use the fast method.
The model statement tells BMDP which predictors to use in the model. If you're not interested
in any interactions, this statement can be omitted and BMDP will use a model that
(potentially) includes all of the main effects and no interactions.
The start and move statements tell BMDP which terms should be included in the model at
the start of its calculations, and how often they can be moved into or out of the model. The
default choices here are a little complicated, since they depend on whether a model statement
was used or not. If you didn't use a model statement, then the default is to start with no
predictors (except the intercept) in the model and move each of them up to twice. If you did
use a model statement, the default is to include all the terms in the model at the start and
not move them at all. This is a strange default, since it defines a stepwise algorithm whose
default is never to step. To override this, you need the start and move statements.
The print cova statement was used to request that BMDP print out the covariance matrix
for the predicted coefficients. As I've pointed out, this can be very useful if you want to find
which categories of a categorical predictor are responsible for its significance.
Finally, if I had included a save paragraph, the program would have saved a BMDP file
that would contain lots of nifty things, including the influence and residual diagnostics that
I discussed.
The control language that I used to t the analogous model in SAS was:
options ls=80;
data stucohen;
array data{35} firstcnt los age sex apache apache2 restnt1
date1 losfps restnt2 losfpr restnt3 noabx ags
in earlier versions of SAS, the data would have to be sorted first in descending order according to
the response, and then you would have to use the ORDER=DATA option. A nontrivial disadvantage
to this approach is that when the diagnostic information is printed, the cases would be numbered
in their new (sorted) order, rather than in the original order in the data base. This can make it
tough to figure out which of the observations are causing problems for the analysis.
2.7 Differences between BMDPLR and PROC LOGISTIC
I've covered most of these items in going through the examples, but I think it's worth summarizing
them here, for easy reference if nothing else. Each of these packages does a pretty good job of
the analysis, but it's also true that each package provides some information that the other either
doesn't provide or at least doesn't provide in a terribly useful form. It may seem at times that I have
a preference for one package over the other. In fact, I often do, but I don't always prefer the same
package. It's reasonable to say that you can do a better job of this type of analysis using both
packages in conjunction with each other than you can using either package on its own. With those
comments out of the way, I'll go on to my list of some of the more notable differences between the
two.
Handling of categorical predictors. As I mentioned above, PROC LOGISTIC really isn't set
up to handle categorical predictors very well (about as well as a typical regression package is
set up to handle ANOVA problems). In order to include a categorical predictor, you have to create
a series of dummy variables and list all of them in the MODEL statement. At that point, SAS
will pick and choose among the predictors (assuming you're doing a stepwise analysis), rather
than choosing all of them or none of them. Moreover, there are no checks to ensure that
interaction terms are included only in conjunction with the corresponding main effects (or
lower order interactions). BMDP does handle categorical predictors in a reasonable way.
Categorical predictors with many levels. The drawback in BMDP's handling of categorical
variables is that it has an upper limit (equal to ten) on how many categories a categorical
predictor can have. If your predictor has more than ten levels, then there's no straightforward
way to use it in the model, short of defining a series of dummy variables. SAS can do this
too, so on this score, the two packages are about the same.
Goodness of fit statistics. Prior to Version 6.07, SAS gave you a number of criteria that could
be used to choose among competing models, but it didn't give you any indication (other
than the deviance test) of whether any of the models fits the data. Starting in Version 6.07,
it calculates a Hosmer goodness of fit statistic, though not quite in the form you might like.
BMDP gives a Hosmer statistic along with a Brown goodness of fit statistic. These
complement each other, since they look at different aspects of lack of fit. Both packages give
a goodness of fit test that's based on the likelihood ratio, or deviance, chi-squared statistic.
This statistic is of very little use whenever there are (continuous) covariates in the model, and
its presence in the output may give you the false impression that you've done an adequate
check of the model's fit.
Influence diagnostics. Both BMDP and SAS calculate a number of regression diagnostics
(SAS offers a somewhat greater variety of statistics) that measure whether individual points
are abnormal either in their response, in their predictors, or in a combination of the two.
BMDP's diagnostics are output to a BMDP output file, from which it's cumbersome to
extract information. SAS does a better job of presenting these results in an easily usable
form.
Mathematical algorithms. Both BMDPLR and PROC LOGISTIC have two available algorithms
for doing these calculations, one of which is quicker (less computationally intense)
and somewhat approximate, and the other of which is more involved. The use of the more
approximate method can lead to what may seem to be contradictions, in which a variable
outside the model is added to the model based on its (significant) predictive power, but then
is immediately removed for being insignificant. BMDP's slow algorithm is somewhat slower
than SAS's, but it's also a true maximum likelihood solution, rather than an approximation
to one. The fast algorithms may be useful in screening variables, but it's a good idea to do
the final calculations on a model using the slower of the two algorithms.
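For concreteness, here is a minimal sketch of what the "slow" maximum likelihood approach involves: Newton-Raphson iteration on the logistic log likelihood, shown for an intercept and a single covariate. This is the standard ML algorithm, not either package's exact implementation, and the data are hypothetical.

```python
import math

def fit_logistic(xs, ys, steps=25):
    """Newton-Raphson maximum likelihood for logit Pr{Y=1} = b0 + b1*x.
    Returns the estimates (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0            # score (gradient of the log likelihood)
        h00 = h01 = h11 = 0.0    # observed information (negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * delta = score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# hypothetical data with overlap in the responses, so the MLE is finite
b0, b1 = fit_logistic([0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 1, 1])
```

At convergence the score equations are satisfied (the gradient is essentially zero), which is the criterion the packages' tolerance factor is checking; when a true parameter is infinite, as in the amikacin example, this loop never drives the score to zero.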
Handling of models that fail to converge. Working with real data (i.e., reasonable sample
sizes), situations can and will arise in which one or more of the actual model parameters is
infinite. Both SAS and BMDP try to make the model converge to a finite solution,
so they'll recognize this as a problem and tell you that the model has failed to converge.
BMDP's reaction is to tell you that this problem has arisen and ask you if you want to go
ahead anyway. (The correct answer is quite often "yes.") SAS simply prints out an error
message and terminates the procedure. Since it's genuinely impossible to make a model like
this converge (to a finite solution), BMDP's handling of this problem is more reasonable. In
SAS, the only ways of handling the problem are to weaken the convergence criterion (a bad
idea) or to remove the offending data from the analysis.
General data manipulations. One of SAS's strengths relative to BMDP is the data handling
capability of the DATA step and the programming flexibility offered by its macro capabilities.
This can be a nontrivial advantage if you're interested in fitting a variety of models using
the same data. Another advantage SAS has is that it handles the input and manipulation
of character data more easily than BMDP. For the infection data, I first had to recode the
data into numerical form, and then redefine the character labels I wanted once the data was
in BMDP. This is a nuisance that could have been avoided in SAS.
Polychotomous responses. BMDP has a separate program (BMDPPR) that handles similar
models for situations in which the response has more than two levels. (I'll spend some
time discussing these models next time.) SAS has combined the two programs into one,
so that if you specify a dependent variable that has 3 or more response levels, then it will
fit a polychotomous logistic regression. (More often than not, this is called a polytomous
logistic regression, which I don't view as an actual English word, so I'm going to resist this
terminology.) If this isn't what you intended, then you have to define a dichotomous response
before you run PROC LOGISTIC.
2.8 Some comments about R²
I mentioned in the first session that in a logistic regression, there was nothing analogous to the R²
value that you get from a linear regression. To some extent this is an oversimplification, although
it's certainly true that there's not a lot of consensus about what the appropriate statistic is to use,
and the statistics that have been proposed aren't being used all that widely in practice.
There are basically two types of statistics that have been proposed. The first came out of some
work by McFadden7 and it's closely related to the likelihood ratio goodness of fit statistic (the one
I didn't much like). In linear regression, R² can be defined as one minus the proportion of the
variability from a model containing only a constant term that's left unexplained by the model in
question. Another way of thinking of this is that it locates the model in question on a continuum
whose two extremes are an R² of zero (the constant response model) and an R² of one (a model
with a residual sum of squares equal to zero). For a logistic regression, you can define something
similar, along the lines of

    R²_L = (L_M − L_0) / (L_S − L_0),

where L_0, L_S, and L_M are the log likelihoods of the constant model, a saturated model, and
the model in question, respectively. There's no problem with the constant model and the model
in question, but the saturated model could be taken either as a model that contains all of the
available covariates, or else one that predicts a different value for each of the different covariate
combinations. Neither of these definitions is terribly satisfying, since they both depend on which
covariates you're considering for the model. If you added some new covariates, this R² value would
change, and that doesn't seem quite right.
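As a sketch, the R²_L computation is just arithmetic on log likelihoods. Here it's applied to the −2 LOG L values from the SAS output in Section 2.5 (118.424 for the intercept-only model, 57.546 for the fitted model), taking the saturated log likelihood to be zero; that choice of L_S is one of the conventions mentioned above, not a settled definition.

```python
def r2_l(ll_model, ll_constant, ll_saturated=0.0):
    """R²_L = (L_M - L_0) / (L_S - L_0), given the three log likelihoods."""
    return (ll_model - ll_constant) / (ll_saturated - ll_constant)

# SAS reports -2 LOG L, so divide by -2 to recover the log likelihoods
ll0 = -118.424 / 2.0   # intercept-only model
llm = -57.546 / 2.0    # model with the three chosen predictors
print(round(r2_l(llm, ll0), 3))
```

Note that under the other definition of the saturated model (all available covariates), ll_saturated would change whenever the candidate covariate list changed, which is exactly the objection raised above.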
The second type of statistic was originally proposed by Morrison8 and it's based on the predicted
probabilities of a positive response that come out of the model. If the dependent variable Y is
defined to be either 0 or 1, and for individual i, the model's estimate of Pr{Y = 1} is p̂_i, then
Morrison suggested defining his R² measure as

    R² = Σ_i (Y_i − p̂_i)² / Σ_i (Y_i − Ȳ)²,

where Ȳ is the arithmetic average of the Y's. The main problem with this measure is that some
of the Y's are more variable than others (since Var Y = p(1 − p)), and yet this formula gives them
equal weight. Amemiya9 has suggested a generalization of this measure that takes this into account.
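A sketch of both quantities follows; the weighted version is my illustration of the idea of down-weighting the more variable Y's by their binomial variances, not necessarily Amemiya's exact formula, and the data are hypothetical.

```python
def morrison_r2(ys, p_hats):
    """The ratio in the text: sum (Y_i - p̂_i)² / sum (Y_i - Ȳ)²."""
    ybar = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, p_hats))
    sst = sum((y - ybar) ** 2 for y in ys)
    return sse / sst

def weighted_r2(ys, p_hats):
    """Variance-weighted variant (illustrative): each squared error is divided
    by the binomial variance p(1 - p) of the corresponding observation."""
    ybar = sum(ys) / len(ys)
    num = sum((y - p) ** 2 / (p * (1.0 - p)) for y, p in zip(ys, p_hats))
    den = sum((y - ybar) ** 2 / (ybar * (1.0 - ybar)) for y in ys)
    return num / den

ys = [1, 0, 1, 1]
p_hats = [0.9, 0.2, 0.7, 0.8]
print(morrison_r2(ys, p_hats), weighted_r2(ys, p_hats))
```

In the unweighted ratio, an error of 0.1 counts the same whether the predicted probability was 0.5 (where Y is maximally variable) or 0.95 (where it is not); the weighted version corrects for that imbalance.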
I wish I could say that I knew exactly what was meant by R² in the context of a logistic
regression, but unfortunately, each of these measures can be referred to as "R²," as a "pseudo R²,"
or as a "quasi R²."
2.9 Next time...
. . . we'll talk about some other types of analysis that you might be considering as alternatives to
logistic regression. My goal in discussing this is that it will help to clarify logistic regression's
place relative to the other methods, to give a sense of how the alternative methods differ and of
why one method might be preferable over another under a given set of circumstances. I'm not
going to discuss the other methods in all that much detail; not so much that you'd become expert
in how to carry them out, but at least you'll recognize them and how they differ from a logistic
analysis. There are five methods that I'm particularly interested in discussing, namely:
probit analysis,
loglinear models,
polychotomous logistic regression,
discriminant analysis, and
cluster analysis.

7 D. McFadden, "Conditional logit analysis of qualitative choice behavior," in Frontiers in Econometrics, P. Zarembka, ed., MIT Press, 1974.
8 D. Morrison, "Upper bounds for correlations between binary outcomes and probabilistic predictions," Journal of the American Statistical Association, v. 67, pp. 68-70, 1972.
9 T. Amemiya, "Qualitative Response Models: A Survey," Journal of Economic Literature, v. 19, pp. 1483-1536, 1981.
All of these are at least supercially similar to logistic regression in that on some level there are
groups of things involved, but for most of these methods, the similarities don't go a heck of a lot
further than that.