An Introduction to Logistic Regression Analysis and Reporting

Chao-Ying Joanne Peng, Indiana University-Bloomington

Three purposes of this session:


1. Introduces you to basic concepts of logistic regression
LR constitutes a special class of regression methods for research utilizing dichotomous outcomes.

2. Provides you with a set of guidelines on what to expect in an article using logistic regression techniques
What tables, figures, or charts should be included to comprehensively assess the results? And what assumptions should be verified?

3. Offers recommendations for appropriate reporting formats of logistic regression results and for the minimum observation-to-predictor ratio.

Many research problems in education call for the analysis and prediction of a dichotomous outcome, for example, whether a child should be classified as learning disabled (LD), or whether a teenager is prone to engage in risky behaviors. Traditionally, these research questions were addressed by either ordinary least squares (OLS) regression or linear discriminant function analysis.

Both techniques were subsequently found to be less than ideal for handling dichotomous outcomes because of their strict statistical assumptions, i.e., linearity, normality, and continuity for OLS regression, and multivariate normality with equal variances and covariances for discriminant analysis.

As an alternative, logistic regression was proposed in the late 60s and early 70s (Cabrera, 1994). It became routinely available in statistical packages in the early 80s. With the wide availability of sophisticated statistical software installed on high-speed computers, the use of logistic regression is increasing.

Logistic Regression Models


The central mathematical concept that underlies logistic regression is the logit. The simplest example of a logit derives ln(odds) from a 2 × 2 contingency table. Consider an instance in which the distribution of a dichotomous outcome variable (a child from an inner city school recommended for remedial reading classes) is paired with a dichotomous predictor variable (gender).

Table 1. Sample Data for Gender and Recommendation for Remedial Reading

Remedial reading instruction      Gender
recommended                       Boys    Girls    Totals
Yes (coded as 1)                  73      15       88
No (coded as 0)                   23      11       34
Totals                            96      26       122

A chi-square test of association on these data yields χ²(df = 1) = 3.43. Alternatively, one might prefer to assess a boy's odds of being recommended for remedial reading instruction relative to a girl's odds; the result is an odds ratio of 2.33.

\[
\text{odds ratio} = \frac{73/23}{15/11} = \frac{3.17}{1.36} = 2.33
\]

Its natural logarithm [i.e., ln(2.33)] equals 0.85, which would be the regression coefficient of the gender predictor if logistic regression were used to model the two outcomes of a remedial recommendation as a function of gender.
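The arithmetic behind Table 1 can be checked with a short sketch. The variable names are illustrative, and the chi-square is assumed to be the uncorrected Pearson statistic for a 2 × 2 table, which reproduces the reported 3.43. Note that the unrounded odds ratio has a natural log of about 0.845; the 0.85 quoted above comes from taking the log of the rounded value 2.33.

```python
import math

# Counts from Table 1 (recommended yes / no, by gender).
boys_yes, boys_no = 73, 23
girls_yes, girls_no = 15, 11

odds_boys = boys_yes / boys_no            # about 3.17
odds_girls = girls_yes / girls_no         # about 1.36
odds_ratio = odds_boys / odds_girls       # about 2.33
log_odds_ratio = math.log(odds_ratio)     # about 0.845 before rounding

# Pearson chi-square for a 2 x 2 table, without continuity correction
# (assumed here because it matches the reported value).
n = boys_yes + boys_no + girls_yes + girls_no
chi_square = (
    n * (boys_yes * girls_no - boys_no * girls_yes) ** 2
    / ((boys_yes + boys_no) * (girls_yes + girls_no)
       * (boys_yes + girls_yes) * (boys_no + girls_no))
)

print(round(odds_ratio, 2), round(log_odds_ratio, 3), round(chi_square, 2))
```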

The simple logistic model has the form:

\[
\text{logit}(Y) = \text{natural log}(\text{odds}) = \ln\!\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta X, \tag{1}
\]

where π is the probability of the outcome of interest, α is the intercept parameter, β is a regression coefficient (a.k.a. slope parameter), and X is a predictor.

Figure 1. The relationship of a dichotomous outcome variable, Y (1=remedial reading recommended, 0=remedial reading not recommended) with a continuous predictor, READING scores.

For the data in Table 1, the regression coefficient (β) is the logit (= 0.85) previously explained. Taking the antilog of equation (1) on both sides, one derives an equation for the prediction of the probability of the occurrence of the outcome of interest as follows:

\[
\pi = P(Y = \text{outcome of interest} \mid X = x, \text{ a specific value}) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
\]
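The antilog step can be written out explicitly. The following is a worked derivation using the symbols of the simple logistic model (π for the probability, α for the intercept, β for the slope); it adds nothing beyond algebra.

```latex
\begin{align*}
\ln\!\left(\frac{\pi}{1-\pi}\right) &= \alpha + \beta x \\
\frac{\pi}{1-\pi} &= e^{\alpha + \beta x} \\
\pi &= e^{\alpha + \beta x}\,(1-\pi) \\
\pi \left(1 + e^{\alpha + \beta x}\right) &= e^{\alpha + \beta x} \\
\pi &= \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
\end{align*}
```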

Extending the logic of the simple logistic regression to multiple predictors (say X1=reading score and X2=gender), one may construct a complex logistic regression for Y (recommendation for remedial reading programs) as follows:

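\[
\text{logit}(Y) = \ln\!\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta_1 X_1 + \beta_2 X_2 .
\]

Therefore,

\[
\pi = P(Y = \text{outcome of interest} \mid X_1 = x_1, X_2 = x_2) = \frac{e^{\alpha + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\alpha + \beta_1 x_1 + \beta_2 x_2}}
\]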
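A minimal sketch of the two directions of this relationship, for any number of predictors. The function names and the parameter values are made up for illustration and are not the fitted model; the code uses 1/(1 + e^(-z)), which is algebraically the same as e^z/(1 + e^z).

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predicted_probability(alpha, betas, xs):
    """Inverse logit of alpha + sum(beta_i * x_i)."""
    linear = alpha + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical parameters: alpha + 0.5*2 + 0.25*4 = 1 on the logit scale.
p = predicted_probability(alpha=-1.0, betas=[0.5, 0.25], xs=[2.0, 4.0])
print(p, logit(p))
```

Applying `logit` to the returned probability recovers the linear predictor, which is the sense in which the two equations above are equivalent.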

Illustration of Logistic Regression Analysis and Reporting


The hypothetical data consisted of 189 inner city school children's reading scores and gender. Of these children, 59 (31.22%) were recommended for remedial reading classes while 130 (68.78%) were not. A legitimate research hypothesis to test with these data was: the likelihood that an inner city school child is recommended for remedial reading instruction is related to both his/her reading score and gender.

Table 2. Description of a Hypothetical Data Set for Logistic Regression

Remedial reading       Total sample   Gender            Reading scores
instruction            (N)            Boys     Girls    Mean      SD
recommended?                          (n1)     (n2)
Yes                    59             36       23       61.07     13.28
No                     130            57       73       66.65     15.86
Summary                189            93       96       64.91     15.29

Logistic Regression Analysis


The logistic regression analysis was carried out by the LOGISTIC REGRESSION command in SPSS version 13 (SPSS Inc., 2004).

Predicted logit of (REMEDIAL) = 0.534 + (-0.026) × READING + (0.648) × GENDER

Evaluations of the Logistic Regression Model


a) overall model evaluation
b) goodness-of-fit statistics
c) statistical tests of individual predictors
d) validations of predicted probabilities

Overall Model Evaluation

Tests                      χ²        df    p
Likelihood Ratio test      10.019    2     0.007
Score test                 9.518     2     0.009

R2-type Indices
Cox and Snell R squared = .052
Nagelkerke (Max rescaled) R squared = .073

OK?

Goodness-of-fit Test

Test                                     χ²       df    p
Hosmer-Lemeshow Goodness-of-fit test     9.286    8     0.319

OK?
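The reported p values can be checked against the chi-square distribution. The sketch below is an illustration, not SPSS output: it uses the closed-form upper-tail probability that holds for even degrees of freedom, and it takes 9.286 to be the Hosmer-Lemeshow χ² (an assumption, supported by the fact that it reproduces p = 0.319 at df = 8). The function name is hypothetical.

```python
import math

def chi_square_p_value(statistic, df):
    """Upper-tail probability of a chi-square variable (even df only).

    For even df, P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

p_lr = chi_square_p_value(10.019, 2)     # likelihood ratio test
p_score = chi_square_p_value(9.518, 2)   # score test
p_hl = chi_square_p_value(9.286, 8)      # Hosmer-Lemeshow (assumed statistic)
print(round(p_lr, 3), round(p_score, 3), round(p_hl, 3))
```

A small p for the overall tests and a large p for the Hosmer-Lemeshow test point the same way: the model improves on the intercept-only model, and there is no evidence of poor fit.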

Table 3. Logistic Regression Analysis of 189 Children's Referrals for Remedial Reading Programs by SPSS LOGISTIC REGRESSION command (version 12)

Predictor                    β         SE β     Wald's χ² (df=1)    p       e^β (odds ratio)
CONSTANT                     0.534     0.811    0.434               .510    (not applicable)
READING                      -0.026    0.012    4.565               .033    0.974
GENDER (1=boys, 0=girls)     0.648     0.325    3.976               .046    1.911
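The Wald statistic and the odds ratio in Table 3 follow directly from each coefficient and its standard error. The sketch below assumes the usual definitions, Wald χ² = (β / SE)² and odds ratio = e^β, and takes the READING coefficient as negative, which is what its odds ratio of 0.974 implies. Because the inputs are rounded to three decimals, the results only approximate the table; the READING Wald value is the most affected.

```python
import math

# (beta, standard error) as printed in Table 3.
# READING is entered as negative: an assumption consistent with exp(beta) = 0.974.
estimates = {
    "CONSTANT": (0.534, 0.811),
    "READING": (-0.026, 0.012),
    "GENDER": (0.648, 0.325),
}

wald = {}
odds = {}
for name, (beta, se) in estimates.items():
    wald[name] = (beta / se) ** 2   # Wald chi-square with 1 df
    odds[name] = math.exp(beta)     # multiplicative change in odds per unit of X
    print(f"{name:9s} Wald = {wald[name]:6.3f}  exp(beta) = {odds[name]:.3f}")
```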

Table 4. The Observed and the Predicted Frequencies for Remedial Reading Instructions by Logistic Regression with the Cutoff of 0.50

                       Predicted
Observed               Yes      No       Percentage Correct
Yes                    3        56       5.1%
No                     1        129      99.2%
Overall % correct                        69.8%

Note. Sensitivity = 3/(3+56) = 5.1%. Specificity = 129/(1+129) = 99.2%. False Positive = 1/(1+3) = 25.0%. False Negative = 56/(56+129) = 30.3%.
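The rates in the note to Table 4 can be recomputed from the four cell counts. In this sketch the variable names are illustrative; the false positive and false negative rates follow the definitions used in the note, that is, they are taken over the predicted Yes and predicted No columns respectively.

```python
# Cell counts from Table 4 (cutoff 0.50).
true_pos, false_neg = 3, 56     # observed Yes: predicted Yes, predicted No
false_pos, true_neg = 1, 129    # observed No:  predicted Yes, predicted No

sensitivity = true_pos / (true_pos + false_neg)            # observed Yes correctly predicted
specificity = true_neg / (true_neg + false_pos)            # observed No correctly predicted
false_positive_rate = false_pos / (false_pos + true_pos)   # share of predicted Yes that were No
false_negative_rate = false_neg / (false_neg + true_neg)   # share of predicted No that were Yes
total = true_pos + false_neg + false_pos + true_neg
overall_correct = (true_pos + true_neg) / total

for label, value in [
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("False positive", false_positive_rate),
    ("False negative", false_negative_rate),
    ("Overall correct", overall_correct),
]:
    print(f"{label:16s} {100 * value:5.1f}%")
```

The very low sensitivity shows that, at a cutoff of 0.50, the model almost never predicts a referral, even though the overall percentage correct looks acceptable.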

Figure 2. Predicted probability of being referred for remedial reading instructions versus reading scores, plotting symbols A=1 observation, B=2 observations, C=3 observations, etc. (Vertical axis: estimated probability, 0.0 to 0.6. Horizontal axis: reading score, 40 to 140. Separate curves for boys and girls.)

Reporting and Interpreting Logistic Regression Results


In addition to Tables 3, 4 and Figure 2, it is helpful to demonstrate the relationship between the predicted outcome and certain characteristics found in observations.

Table 5. Predicted Probability of Being Referred for Remedial Reading Instructions for 8 Children

Case      READING           GENDER           Intercept    Predicted probability of being      Actual outcome
Number    Beta = -0.026     Beta = 0.648     = 0.534      referred for remedial reading       1=Yes, 0=No
                                                          instructions
1         52.5              Boy              0.5340       0.4530                              1
2         85                Boy              0.5340       0.2618                              0
3         75                Girl             0.5340       0.1941                              1
4         92                Girl             0.5340       0.1250                              0
5         60                Boy              0.5340       0.4051                              -
6         60                Girl             0.5340       0.2627                              -
7         100               Boy              0.5340       0.1934                              -
8         100               Girl             0.5340       0.1115                              -
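The predicted probabilities in Table 5 can be approximated from the fitted model. This sketch assumes the rounded coefficients reported in the deck (intercept 0.534, READING -0.026 with the sign implied by its odds ratio, GENDER 0.648 with boys coded 1); the function name is hypothetical. Seven of the eight rows reproduce to within about 0.002. Case 4 does not: a reading score of 92 gives about 0.135, while the tabled 0.1250 would correspond to a score near 95, so that row may contain a transcription error.

```python
import math

# Rounded coefficients; the negative sign on READING is an assumption
# consistent with the odds ratio of 0.974.
ALPHA, B_READING, B_GENDER = 0.534, -0.026, 0.648

def referral_probability(reading, is_boy):
    """Predicted probability of referral for one child (boys coded 1, girls 0)."""
    linear = ALPHA + B_READING * reading + B_GENDER * (1 if is_boy else 0)
    return 1.0 / (1.0 + math.exp(-linear))

# (reading score, is_boy) for cases 1 to 8, and the tabled probabilities.
cases = [(52.5, True), (85, True), (75, False), (92, False),
         (60, True), (60, False), (100, True), (100, False)]
table_5 = [0.4530, 0.2618, 0.1941, 0.1250, 0.4051, 0.2627, 0.1934, 0.1115]

computed = [referral_probability(reading, is_boy) for reading, is_boy in cases]
for number, (mine, tabled) in enumerate(zip(computed, table_5), start=1):
    print(f"case {number}: computed {mine:.4f}, Table 5 {tabled:.4f}")
```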

Interpretation of Regression Coefficients


For each one-point increase in the reading score, the odds of being recommended for remedial reading programs are multiplied by 0.974 (= e^(-0.026), Table 3), a decrease of about 2.6%. If the increase in the reading score is 10 points, the odds are multiplied by 0.771 [= e^(10 × (-0.026))]. However, when the READING score is held constant, boys are predicted to be referred for remedial reading instructions with greater probability than girls (odds ratio = 1.911, Table 3).
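The reason a k-point increase multiplies the odds by a fixed factor can be shown in one line. This is a worked equation under the two-predictor model, with β₁ = -0.026 taken as the READING coefficient (negative, as its odds ratio implies) and gender held fixed at x₂.

```latex
\[
\frac{\text{odds}(x_1 + k)}{\text{odds}(x_1)}
= \frac{e^{\alpha + \beta_1 (x_1 + k) + \beta_2 x_2}}{e^{\alpha + \beta_1 x_1 + \beta_2 x_2}}
= e^{k \beta_1},
\qquad
e^{1 \times (-0.026)} \approx 0.974,
\quad
e^{10 \times (-0.026)} \approx 0.771 .
\]
```

The factor does not depend on the starting score or on gender, which is why a single odds ratio per predictor is sufficient to describe the effect.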

Guidelines and Recommendations


What Tables, Figures, or Charts Should be Included to Comprehensively Assess the Result?

- the overall evaluation of the logistic model
- goodness-of-fit statistics
- statistical tests of individual predictors
- an assessment of the predicted probabilities

What Assumption Should be Verified?


Logistic regression does not assume that predictor variables follow a multivariate normal distribution with equal covariance matrices. It assumes that the binomial distribution describes the distribution of the errors, which equal the actual Y minus the predicted Y. The binomial distribution is also the assumed distribution for the conditional mean of the dichotomous outcome.

The binomial assumption may be tested by the normal z test (Siegel & Castellan, 1988), or taken to be robust as long as the sample is random, so that observations are independent of each other.

Recommended Reporting Formats of Logistic Regression


In terms of reporting logistic regression results, we recommend presenting

- a complete logistic regression model including the Y-intercept
- odds ratio
- a table such as Table 5 to illustrate the relationship between outcomes and observations with profiles of certain characteristics

Recommended Minimum Observation to Predictor Ratio


The literature has not offered specific rules applicable to logistic regression. Several authors of multivariate statistics recommended a minimum ratio of 10 to 1, with a minimum sample size of 100 or 50 plus a variable number that is a function of the number of predictors.
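As a rough illustration, the 10-to-1 ratio and the sample floor of 100 can be checked for the example data (189 children, 2 predictors). The function name and default thresholds are hypothetical conveniences; the alternative "50 plus a variable number" rule is not coded because its exact form is not given here.

```python
def meets_ten_to_one_rule(n_observations, n_predictors, min_ratio=10, min_sample=100):
    """Check a 10-to-1 observation-to-predictor ratio and a sample floor of 100."""
    if n_predictors < 1:
        raise ValueError("need at least one predictor")
    ratio = n_observations / n_predictors
    return ratio >= min_ratio and n_observations >= min_sample

# The illustration in this session: 189 children, predictors READING and GENDER.
illustration_ok = meets_ten_to_one_rule(189, 2)
print(illustration_ok)
```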

Preview of Next Session
