
Lecture 14: Interpreting logistic regression models

Sandy Eckel
seckel@jhsph.edu

15 May 2008

Logistic regression

 Framework and ideas of logistic regression similar to linear regression
 Still have a systematic and probabilistic part to any model
 Coefficients have a new interpretation, based on log(odds) and log(odds ratios)

Recall from last time: The logit function

 In logistic regression, we are always modelling the outcome log(p/(1-p))
 We define the function:
  $$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)$$
 We often use the name logit for convenience
 In logistic regression, we have the logit on the left-hand side of the equation

Example: Public health graduate students

 323 graduate students in introductory biostatistics took a health survey. Current smoking status was assessed, which we will predict with gender
   Associating demographics with smoking is vital to planning public health programs.
 Information was also collected on age, exercise, and history of smoking; potential confounders of the association between gender and current smoking.
 First we will focus only on the association between gender and current smoking status
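
A minimal Python sketch of the logit function recalled above, together with its inverse. The function names `logit` and `inv_logit` are illustrative choices and do not come from the lecture; `log` is taken to be the natural logarithm.

```python
import math

def logit(p: float) -> float:
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    """Probability that corresponds to a log odds value x."""
    return 1.0 / (1.0 + math.exp(-x))

# A probability of 0.5 means odds of 1, so the log odds are 0.
print(logit(0.5))
# Applying the inverse recovers the original probability.
print(inv_logit(logit(0.2)))
```

The inverse is what later lets us move from the left-hand side of a fitted model back to a probability.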
Coding our two variables for the first example

 Outcome:
   smoking = 1 for current smokers, 0 for current nonsmokers
 Primary predictor:
   gender = 1 for men, 0 for women

Recall: an analogous linear regression model

 In linear regression, if we had only one binary X like gender, we would be predicting two means:
  $$E(Y) = \beta_0 + \beta_1(\text{Gender})$$
   β0 – the mean outcome when X=0
   β0 + β1 – the mean outcome when X=1
   β1 – the difference in mean outcome when X=1 vs. when X=0

Logistic regression model and Results

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Gender}) \;\Rightarrow\; \log\left(\frac{p}{1-p}\right) = -3.1 + 1.0(\text{Gender})$$

Logit estimates                                   Number of obs   =        323
                                                  LR chi2(1)      =       4.46
                                                  Prob > chi2     =     0.0348
Log likelihood = -75.469757                       Pseudo R2       =     0.0287

------------------------------------------------------------------------------
       smoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |    .967966   .4547931     2.13   0.033     .0765879    1.859344
 (Intercept) |  -3.058707   .3235656    -9.45   0.000    -3.692884    -2.42453
------------------------------------------------------------------------------

gender = 1 for men, 0 for women

Logistic Regression: Gender-specific results

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Gender}) \;\Rightarrow\; \ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(\text{Gender})$$

 For women, gender=0:
  $$\ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(0) = -3.1$$
 For men, gender=1:
  $$\ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(1) = -2.1$$
 β1 is the difference between men and women
 β1 is the change in log odds comparing men to women
Logistic Regression Interpretation 1: log(odds) scale

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Gender}) \;\Rightarrow\; \ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(\text{Gender})$$

gender = 1 for men, 0 for women

 β0: the log odds of smoking for women
 β0+β1: the log odds of smoking for men
 β1: the difference in the log odds of smoking for men compared to women

What if we wanted to get the odds interpretation, not the log odds…

 We can start to “untransform” the equations
 Recall: if log(a) = b, then exp(log(a)) = a = $e^{b}$
 For women, X=0: log(odds) = β0+β1(0) = β0
  $$\text{odds of smoking for women} = e^{\beta_0} = e^{-3.1} = 0.05$$
 For men, X=1: log(odds) = β0+β1(1)
  $$\text{odds of smoking for men} = e^{\beta_0+\beta_1} = e^{-3.1+1.0} = e^{-2.1} = 0.12$$
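
The "untransform" step can be sketched in Python as follows. The coefficients are copied from the Stata output for the smoking model; the variable names are illustrative and not from the lecture.

```python
import math

# Coefficients copied from the Stata output (smoke regressed on gender).
b0 = -3.058707   # intercept: log odds of smoking for women (gender = 0)
b1 = 0.967966    # gender: difference in log odds, men vs. women

# Log odds for each group from the fitted equation.
log_odds_women = b0 + b1 * 0
log_odds_men = b0 + b1 * 1

# Exponentiating the log odds gives the odds.
odds_women = math.exp(log_odds_women)
odds_men = math.exp(log_odds_men)

print(f"odds of smoking for women: {odds_women:.2f}")
print(f"odds of smoking for men:   {odds_men:.2f}")
```

Rounded to two decimals, these agree with the 0.05 and 0.12 quoted on the slide, which uses the rounded coefficients -3.1 and 1.0.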

Logistic Regression Interpretation 2: odds scale

 $e^{\beta_0}$: the odds of smoking for women (when X=0)
 $e^{\beta_0+\beta_1}$: the odds of smoking for men (when X=1)
 In the past, we’ve compared two sets of odds by dividing to find the odds ratio (OR)

$$\text{Odds Ratio} = \frac{\text{odds for men}}{\text{odds for women}} = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = \frac{0.12}{0.05} = 2.4$$

Comparing odds

 If we subtract the log odds, mathematically that’s equivalent to dividing inside the log:
   log(a) – log(b) = log(a/b)
 So, if
   $e^{\beta_0+\beta_1} = e^{-3.1+1.0} = e^{-2.1} = 0.12$ is the odds when X=1, and
   $e^{\beta_0} = e^{-3.1} = 0.05$ is the odds when X=0, then
   we want to divide them in order to compare
Logistic Regression Interpretation: the odds ratio

$$\text{Odds Ratio} = \frac{\text{odds for men}}{\text{odds for women}} = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = \frac{0.12}{0.05} = 2.4$$

 The odds of smoking are about 2 ½ times as high for men as for women.
 Based on this study, perhaps smoking cessation programs should be targeted toward men

Useful math – ratios of exponentiated terms

 We can usually simplify an equation like this

$$\text{Odds Ratio} = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{(\beta_0+\beta_1)-(\beta_0)} = e^{\beta_1}$$

because

$$\frac{e^{a}}{e^{b}} = e^{a-b}$$
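
A short Python check of the simplification, using the coefficients from the Stata output for the smoking model. Exponentiating the confidence limits is an added illustration rather than something shown on these slides. Note that the slide's 2.4 comes from dividing the rounded odds (0.12/0.05); the unrounded coefficients give an odds ratio of roughly 2.6, which is closer to the "about 2 ½" in the text.

```python
import math

b0, b1 = -3.058707, 0.967966            # from the Stata output
ci_low, ci_high = 0.0765879, 1.859344   # 95% CI for b1 (log odds ratio scale)

# Odds ratio computed two ways: by dividing odds, and directly from b1.
odds_ratio_by_division = math.exp(b0 + b1) / math.exp(b0)
odds_ratio_direct = math.exp(b1)

# Exponentiating the CI limits for b1 gives a CI on the odds ratio scale.
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

print(f"OR by division: {odds_ratio_by_division:.3f}")
print(f"OR = exp(b1):   {odds_ratio_direct:.3f}")
print(f"95% CI for OR:  ({odds_ratio_ci[0]:.2f}, {odds_ratio_ci[1]:.2f})")
```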

Taking a ratio of odds to get the odds ratio

 $e^{\beta_0}$: the odds when X=0
 $e^{\beta_0+\beta_1}$: the odds when X=1
 $\dfrac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}$: the odds ratio comparing the odds when X=1 vs. X=0

Two interpretations of logistic regression slopes

 β0+β1 = log(odds) (for X=1)
 β1 = difference in log odds
 $e^{\beta_0+\beta_1}$ = odds (for X=1)
 $e^{\beta_1}$ = odds ratio
 But we started with P(Y=1)
 Can we find that?
More useful math – how to get the probability from the odds

 $\text{odds} = \dfrac{\text{probability}}{1 - \text{probability}}$
 $\text{probability} = \dfrac{\text{odds}}{1 + \text{odds}}$
 so $P(Y = 1 \mid X = 1) = \dfrac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$

Finding the probability from the log odds

Find the log odds:
  For X=0: log(odds) = β0
  For X=1: log(odds) = β0 + β1
Find odds:
  For X=0: odds = $e^{\beta_0}$
  For X=1: odds = $e^{\beta_0+\beta_1}$
Transform odds into probability:
  (next slide…)

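Finding the probability from the log odds, cont…

Transform odds into probability:

$$p = \frac{\text{odds}}{1 + \text{odds}}$$

For X = 0:
$$\text{probability} = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$$

For X = 1:
$$\text{probability} = \frac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$$

We could even go one step further

 Relative Risk (RR) = $\dfrac{p_1}{p_2}$
 For X = 1: $P(\text{smoke} \mid \text{male}) = \dfrac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$
  For X = 0: $P(\text{smoke} \mid \text{female}) = \dfrac{e^{\beta_0}}{1 + e^{\beta_0}}$
 Relative Risk for Men vs. Women:
  $$\frac{p_1}{p_2} = \frac{\left(\dfrac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}\right)}{\left(\dfrac{e^{\beta_0}}{1 + e^{\beta_0}}\right)}$$
   no way to simplify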
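
The three steps (log odds, odds, probability) and the relative risk can be sketched in Python for the smoking example. The helper name `prob_from_log_odds` is hypothetical; the coefficients are copied from the Stata output for the smoking model.

```python
import math

b0, b1 = -3.058707, 0.967966   # from the Stata output (smoke on gender)

def prob_from_log_odds(log_odds: float) -> float:
    """Convert log odds to odds, then odds to probability."""
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

p_women = prob_from_log_odds(b0)        # X = 0
p_men = prob_from_log_odds(b0 + b1)     # X = 1

relative_risk = p_men / p_women         # p1 / p2, men vs. women
odds_ratio = math.exp(b1)

print(f"P(smoke | female) = {p_women:.3f}")
print(f"P(smoke | male)   = {p_men:.3f}")
print(f"relative risk     = {relative_risk:.2f}")
print(f"odds ratio        = {odds_ratio:.2f}")
```

Unlike the odds ratio, the relative risk depends on β0 as well as β1, which is why the ratio above does not simplify.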
Remember to consider study design

 We can always calculate the relative risk
 However, the relative risk is not appropriate for case-control studies
   Again, because the investigators decide the number of cases and controls to study
 The odds ratio is appropriate for all study designs

In General

 Logistic regression for a binary outcome
   Left side of equation is log odds
   Can transform the equation to find
     odds
     probability
   Can compare two groups
     difference of log odds ≡ log odds ratio
     odds ratio
     relative risk
 (Almost) everything we learned before applies

Summary: Useful math for logistic regression

 If log(a) = b, then exp(log(a)) = a = $e^{b}$
  X=1: log(odds) = β0+β1(1), so odds for (X = 1) = $e^{\beta_0+\beta_1}$
 log(a) – log(b) = log(a/b)
  so log(odds|X=1) – log(odds|X=0) = log(OR for X=1 vs. X=0)
 $\dfrac{e^{a}}{e^{b}} = e^{a-b}$, so $\dfrac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}$
  Also: $e^{a+b} = e^{a} \times e^{b}$, so $e^{2\beta_1} = e^{\beta_1} \times e^{\beta_1} = \left(e^{\beta_1}\right)^{2}$
 $\text{probability} = \dfrac{\text{odds}}{1 + \text{odds}}$
  so probability for (X = 1) = $\dfrac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$

Another Example

 Regular physical examination is an important preventative public health measure
 We’ll study this outcome using the public health graduate student dataset
   Outcome: No physical exam in the past two years
   Primary predictor: age (centered)
   Secondary predictor and potential confounder: regularly taking a multivitamin
Problem with outcome variable:

 The original “physician visit” variable was meant to be continuous, but it was collected categorically
   time since last physician visit
 Since it is now categorical and we wish to use it as the outcome for a regression model, we will make it binary and use logistic regression

Phys = 1 if over 2 years
       0 if 2 years or less

Length of time since last |
                 check-up |      Freq.     Percent        Cum.
--------------------------+-----------------------------------
     Within the past year |        182       54.17       54.17
Within the past 1-2 years |         72       21.43       75.60
Within the past 2-5 years |         53       15.77       91.37
          5 or more years |         29        8.63      100.00
--------------------------+-----------------------------------
                    Total |        336      100.00

Goals

 Predict Phys (no physician visit within the past two years=1) with centered Age (continuous)
 After adjusting for age, is taking a multivitamin (1=yes) a statistically significant predictor for not regularly visiting a physician?
 Is taking a multivitamin a confounder for the age-physician visit relationship?
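
A small Python sketch of the dichotomization, using the frequencies from the check-up table. The dictionary layout and variable names are illustrative assumptions; the rule (Phys = 1 if the last check-up was over 2 years ago) is the one stated on the slide.

```python
import math

# Frequencies copied from the check-up table.
freq = {
    "Within the past year": 182,
    "Within the past 1-2 years": 72,
    "Within the past 2-5 years": 53,
    "5 or more years": 29,
}

# Phys = 1 if the last check-up was over 2 years ago, 0 otherwise.
over_two_years = {"Within the past 2-5 years", "5 or more years"}

n_phys1 = sum(n for label, n in freq.items() if label in over_two_years)
n_phys0 = sum(n for label, n in freq.items() if label not in over_two_years)

# Overall proportion, odds, and log odds of Phys = 1 in the sample.
p = n_phys1 / (n_phys1 + n_phys0)
sample_odds = n_phys1 / n_phys0
sample_log_odds = math.log(sample_odds)

print(n_phys1, n_phys0)
print(f"p = {p:.3f}, odds = {sample_odds:.3f}, log odds = {sample_log_odds:.3f}")
```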

Results Model 1: Intercept and Age

Note that agec = age-30 (centered age)

Logit estimates                                   Number of obs   =        336
                                                  LR chi2(1)      =       0.00
                                                  Prob > chi2     =     0.9567
Log likelihood = -186.71399                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
     phys_no |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        agec |  -.0009585   .0176509    -0.05   0.957    -.0355536    .0336365
 (Intercept) |  -1.130428   .1270539    -8.90   0.000    -1.379449   -.8814066
------------------------------------------------------------------------------

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) \;\Rightarrow\; \log\left(\frac{p}{1-p}\right) = -1.13 - 0.001(\text{Age} - 30)$$

Model 1: Interpretation of coefficients on log odds scale

 β0: the log odds of not visiting a physician for a 30-year-old
 β1: the difference in the log odds of not visiting a physician for a one year increase in age

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) \;\Rightarrow\; \log\left(\frac{p}{1-p}\right) = -1.13 - 0.001(\text{Age} - 30)$$
Model 1: How did we get the difference in log odds interpretation of β1? Predictions by age

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) \;\Rightarrow\; \log\left(\frac{p}{1-p}\right) = -1.13 - 0.001(\text{Age} - 30)$$

 For a 30-year-old:
  $$\log\left(\frac{p}{1-p}\right) = -1.13 - 0.001(30 - 30) = -1.13$$
 For a 31-year-old:
  $$\log\left(\frac{p}{1-p}\right) = -1.13 - 0.001(31 - 30) = -1.13 - 0.001 = -1.131$$
 β1 is the difference in the log odds associated with a 1 year increase in age

Model 1: Interpretation of β1 (diff log odds = log OR)

 log(a) – log(b) = log(a/b)
   so log(odds|X=31) – log(odds|X=30) = log(OR for X=31 vs. X=30)
   difference of log odds = log odds ratio
 Alternate interpretation for β1:
   The log odds ratio of not visiting a physician associated with a one year increase in age
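
The predictions by age can be reproduced with a short Python sketch. It uses the unrounded Model 1 coefficients from the Stata output; the function name `log_odds` is an illustrative choice.

```python
b0 = -1.130428    # (Intercept): log odds for a 30-year-old
b1 = -0.0009585   # agec: change in log odds per additional year of age

def log_odds(age: float) -> float:
    """Fitted log odds of no physician visit in the past two years."""
    return b0 + b1 * (age - 30)

# The difference between adjacent ages is the slope, whatever the starting age.
diff_30_31 = log_odds(31) - log_odds(30)
diff_45_46 = log_odds(46) - log_odds(45)

print(f"log odds at 30: {log_odds(30):.4f}")
print(f"log odds at 31: {log_odds(31):.4f}")
print(f"difference:     {diff_30_31:.7f}")
```

Because the model is linear on the log odds scale, the same difference is obtained for any pair of ages one year apart.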

Model 1: Interpretation of β1 (OR = ratio of odds) for one year age difference

$$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - 0.001(\text{Age} - 30)}$$

 For a 31-year-old:
  $$\frac{p}{1-p} = e^{-1.13 - 0.001(31-30)} = e^{-1.13 - 0.001} = e^{-1.131} = 0.3227$$
 For a 30-year-old:
  $$\frac{p}{1-p} = e^{-1.13} = 0.3230$$
 $\text{Odds ratio} = \dfrac{0.3227}{0.3230} = 0.999 = \dfrac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}$

Model 1: Interpretation of β1: odds ratio for one year age difference

 $e^{\beta_0}$ is the odds of not visiting a physician for 30-year-olds
 $e^{\beta_0+\beta_1}$ is the odds of not visiting a physician for 31-year-olds
 $e^{\beta_1}$ is the odds ratio of not visiting a physician corresponding to a one year increase in age
Model 1: Interpretation of β1
What is the OR for two year age difference?

$$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - 0.001(\text{Age} - 30)}$$

 For a 32-year-old:
  $$\frac{p}{1-p} = e^{-1.13 - 0.001(32-30)} = e^{-1.13 - 0.001 \times 2} = e^{-1.132} = 0.3224$$
 For a 30-year-old:
  $$\frac{p}{1-p} = e^{-1.13} = 0.3230$$
 $\text{Ratio} = \dfrac{0.3224}{0.3230} = 0.998 = \dfrac{e^{\beta_0+2\beta_1}}{e^{\beta_0}} = e^{2\beta_1} = \left(e^{\beta_1}\right)^{2}$

Model 1: Interpretation of β1
What is the OR for ten year age difference?

$$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - 0.001(\text{Age} - 30)}$$

 For a 40-year-old:
  $$\frac{p}{1-p} = e^{-1.13 - 0.001(40-30)} = e^{-1.13 - 0.01} = e^{-1.14} = 0.3198$$
 For a 30-year-old:
  $$\frac{p}{1-p} = e^{-1.13} = 0.3230$$
 $\text{Ratio} = \dfrac{0.3198}{0.3230} = 0.990 = \dfrac{e^{\beta_0+10\beta_1}}{e^{\beta_0}} = e^{10\beta_1} = \left(e^{\beta_1}\right)^{10}$
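
The pattern for one, two, and ten year differences can be checked with a short Python sketch. It uses the unrounded Model 1 coefficients from the Stata output; the function names `odds` and `odds_ratio` are illustrative.

```python
import math

b0, b1 = -1.130428, -0.0009585   # Model 1 coefficients from the Stata output

def odds(age: float) -> float:
    """Fitted odds of no physician visit in the past two years."""
    return math.exp(b0 + b1 * (age - 30))

def odds_ratio(k: float) -> float:
    """Odds ratio for a k-year age difference: exp(k * b1) = exp(b1) ** k."""
    return math.exp(k * b1)

for k in (1, 2, 10):
    by_division = odds(30 + k) / odds(30)
    print(f"{k:>2}-year difference: OR = {by_division:.3f}")
```

Dividing the two fitted odds and exponentiating k times β1 give the same number, so the intercept never enters the odds ratio.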

Model 1: Interpretation of β1
What is the OR for any age difference?

 $e^{\beta_1}$ is the factor by which the odds of not visiting a physician are multiplied for a one year increase in age

$$(\text{odds for 30-yr-old}) \times \frac{(\text{odds for 31-yr-old})}{(\text{odds for 30-yr-old})} = (\text{odds for 31-yr-old})$$

 $\left(e^{\beta_1}\right)^{10} = e^{10\beta_1}$ is the factor by which the odds of not visiting a physician are multiplied for a ten year increase in age

Model 1: How could we get a Relative Risk?
(if it was appropriate based on our study design)

$$\text{probability of not visiting a physician} = p = \frac{e^{-1.13 - 0.001(\text{Age} - 30)}}{1 + e^{-1.13 - 0.001(\text{Age} - 30)}}$$

 For a 40-year-old:
  $$p = \frac{e^{-1.13 - 0.001(40-30)}}{1 + e^{-1.13 - 0.001(40-30)}} = \frac{e^{-1.13 - 0.01}}{1 + e^{-1.13 - 0.01}} = \frac{e^{-1.14}}{1 + e^{-1.14}} = 0.2423$$
 For a 30-year-old:
  $$p = \frac{e^{-1.13 - 0.001(0)}}{1 + e^{-1.13 - 0.001(0)}} = \frac{e^{-1.13}}{1 + e^{-1.13}} = 0.2442$$
 The relative risk (RR) is
  $$\frac{p_1}{p_2} = \frac{0.2423}{0.2442} = 0.992$$
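
The relative risk calculation can be sketched in Python. To match the slide's arithmetic, this sketch deliberately uses the rounded coefficients -1.13 and -0.001 rather than the full Stata estimates; the function name `prob_no_visit` is hypothetical.

```python
import math

# Rounded Model 1 coefficients, as used in the slide's arithmetic.
b0, b1 = -1.13, -0.001

def prob_no_visit(age: float) -> float:
    """Fitted probability of no physician visit in the past two years."""
    odds = math.exp(b0 + b1 * (age - 30))
    return odds / (1.0 + odds)

p_40 = prob_no_visit(40)
p_30 = prob_no_visit(30)
relative_risk = p_40 / p_30

print(f"p(40) = {p_40:.4f}")
print(f"p(30) = {p_30:.4f}")
print(f"RR    = {relative_risk:.3f}")
```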
Model 1: Probabilities and Relative Risk for 10 year diff

 $\dfrac{e^{\beta_0}}{1 + e^{\beta_0}}$ is the probability of not visiting a physician for 30-year-olds
 $\dfrac{e^{\beta_0 + \beta_1 \times 10}}{1 + e^{\beta_0 + \beta_1 \times 10}}$ is the probability of not visiting a physician for 40-year-olds
 $\dfrac{\left(\dfrac{e^{\beta_0 + \beta_1 \times 10}}{1 + e^{\beta_0 + \beta_1 \times 10}}\right)}{\left(\dfrac{e^{\beta_0}}{1 + e^{\beta_0}}\right)}$ is the relative risk of not visiting a physician for 40-year-olds vs. 30-year-olds

Remember those Goals?

 Predict Phys (no physician visit within the past two years=1) with Age (continuous)
 After adjusting for age, is taking a multivitamin (1=yes) a statistically significant predictor for not regularly visiting a physician?
 Is taking a multivitamin a confounder for the age-physician visit relationship?

Nested models

 Adding a single new variable to the model
   Model 1: $\log\left(\dfrac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30)$
   Model 2: $\log\left(\dfrac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin})$
 H0: the new variable is not needed
  Or, equivalently
  H0: βnew=0 in the population

Logistic regression: Comparing nested models that differ by one variable

 Compare models with p-value or CI
 What method is this?
   The Wald test, a test that applies the CLT, like
     Z test comparing proportions in 2x2 table
     χ² test for independence in 2x2 table
   analogous to the t test for linear regression
Model 2: Results

Logit estimates                                   Number of obs   =        317
                                                  LR chi2(2)      =       7.87
                                                  Prob > chi2     =     0.0195
Log likelihood = -171.80997                       Pseudo R2       =     0.0224

------------------------------------------------------------------------------
     phys_no |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        agec |   .0012855   .0192619     0.07   0.947    -.0364671    .0390381
    multivit |  -.7808889   .2871247    -2.72   0.007    -1.343643   -.2181349
 (Intercept) |  -.8571962    .159519    -5.37   0.000    -1.169848   -.5445446
------------------------------------------------------------------------------

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin})$$
$$\Rightarrow \log\left(\frac{p}{1-p}\right) = -0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})$$

Conclusion from the Wald test

 The p-value for multivitamin is 0.007 (<0.05) and the CI for coefficient multivitamin does not include 0 (CI for OR doesn’t include 1)
 Reject H0
 Conclude that the larger model is better: after adjusting for age, multivitamin use is still an important predictor of physician visits in the population
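
The Wald test quantities for the multivitamin coefficient can be recomputed from the Model 2 estimate and its standard error. This is a sketch using only the Python standard library; the normal-approximation formulas are the usual ones for a z-test, and the critical value 1.959964 is the standard normal 97.5th percentile, entered by hand.

```python
import math

coef = -0.7808889   # multivit coefficient from the Model 2 output
se = 0.2871247      # its standard error

# Wald z statistic: estimate divided by its standard error.
z = coef / se

# Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
p_value = math.erfc(abs(z) / math.sqrt(2))

# 95% CI on the log odds ratio scale, then exponentiated to the OR scale.
z_crit = 1.959964
ci = (coef - z_crit * se, coef + z_crit * se)
or_ci = (math.exp(ci[0]), math.exp(ci[1]))

print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI for beta2: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% CI for OR:    ({or_ci[0]:.2f}, {or_ci[1]:.2f})")
```

The CI for β2 excludes 0 exactly when the CI for the odds ratio excludes 1, since exponentiation preserves order.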

Model 2: Coefficient interpretation on the log odds scale

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin})$$
$$\log\left(\frac{p}{1-p}\right) = -0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})$$

 β0: the log odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
 β1: the log odds ratio of not visiting a physician for a one year increase in age controlling for multivitamin use
 β2: the log odds ratio of not visiting a physician for those who take multivitamins compared with those who do not, adjusting for age

Model 2: Interpretation – odds and odds ratio scale

 exp(β0): the odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin})$$
$$\Rightarrow \log\left(\frac{p}{1-p}\right) = -0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})$$
Model 2: Interpretation – odds and odds ratio scale

 exp(β1): after adjusting for multivitamin use, the odds of not visiting a physician change by a factor of exp(β1)=1.001 for each additional year of age
   additional age is associated with lower frequency of physician visits in these students, but the association is not statistically significant (p>0.05)

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin})$$
$$\Rightarrow \log\left(\frac{p}{1-p}\right) = -0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})$$

Model 2: Interpretation – odds and odds ratio scale

 exp(β2): the odds ratio of not visiting a physician for those who take multivitamins compared with those who do not is exp(β2)=0.46, adjusting for age
   taking multivitamins is associated with regular physician visits (p=0.007)

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin})$$
$$\Rightarrow \log\left(\frac{p}{1-p}\right) = -0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})$$
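
Exponentiating each Model 2 coefficient moves it from the log odds scale to the odds or odds ratio scale. A minimal Python sketch, with the coefficients copied from the Model 2 output and an illustrative dictionary layout:

```python
import math

# Coefficients copied from the Model 2 Stata output.
coefs = {
    "(Intercept)": -0.8571962,   # log odds: 30-year-old, no multivitamin
    "agec": 0.0012855,           # log OR per additional year of age
    "multivit": -0.7808889,      # log OR, multivitamin users vs. non-users
}

# exp(intercept) is an odds; exp(slope) is an odds ratio.
exponentiated = {name: math.exp(b) for name, b in coefs.items()}

for name, value in exponentiated.items():
    print(f"exp({name}) = {value:.3f}")
```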

Goals

 Predict Phys (no physician visit within the past two years=1) with Age (continuous)
 After adjusting for age, is taking a multivitamin (1=yes) a statistically significant predictor for not regularly visiting a physician?
 Is taking a multivitamin a confounder for the age-physician visit relationship?

Was multivitamin use a confounder?

 CI for β1 in model 1: (-0.036, 0.034)
   Estimate for β1 in model 2: 0.001
 CI for exp(β1) in model 1: (exp(-0.036), exp(0.034)) → (0.97, 1.03)
   Estimate for exp(β1) in model 2: exp(0.001) = 1.001
 Estimate from model 2 is in original CI: multivitamin use is not a statistically significant confounder
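
The slide's rule of thumb (does the adjusted estimate fall inside the unadjusted confidence interval?) can be written as a short Python check. The values are copied from the Model 1 and Model 2 outputs; this sketch only mirrors the comparison made on the slide and is not a general-purpose test for confounding.

```python
import math

# 95% CI for the age coefficient in Model 1 (unadjusted).
ci_model1 = (-0.0355536, 0.0336365)
# Age coefficient in Model 2 (adjusted for multivitamin use).
b1_model2 = 0.0012855

# Same comparison on the log odds ratio scale and on the odds ratio scale.
inside_log_scale = ci_model1[0] <= b1_model2 <= ci_model1[1]
or_ci_model1 = (math.exp(ci_model1[0]), math.exp(ci_model1[1]))
inside_or_scale = or_ci_model1[0] <= math.exp(b1_model2) <= or_ci_model1[1]

print(f"adjusted estimate inside Model 1 CI: {inside_log_scale}")
print(f"Model 1 CI for exp(b1): ({or_ci_model1[0]:.2f}, {or_ci_model1[1]:.2f})")
```

Because exponentiation preserves order, the check gives the same answer on either scale.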
Interpretation of lack of confounding result

 The factor by which the odds of irregular physician visits changes for each additional year of age does not change appreciably when we adjust for multivitamin use
 The “slope” is roughly the same before and after adjusting for multivitamin use.

Goals: conclusion 1

 Predict Phys (no physician visit within the past two years=1) with Age (continuous)
   There is no statistically significant effect of age on physician visits in the population

Goals: conclusion 2

 After adjusting for age, is taking a multivitamin (1=yes) a statistically significant predictor for not regularly visiting a physician?
   After adjusting for age, those who regularly take a multivitamin are also more likely to have visited a physician during the past two years (p=0.007)

Goals: conclusion 3

 Is taking a multivitamin a confounder for the age-physician visit relationship?
   The effect of age on physician visit is still nonsignificant after adjusting for multivitamin use and multivitamin use is not a confounder
Summary of Lecture 14

 Logistic regression interpretation
   Intercept – log odds when all X’s are 0
   Slope
     difference in log odds for a 1 unit increase in X, controlling for other X’s
     log odds ratio associated with a 1 unit increase in X, controlling for other X’s
   Transform log odds/log odds ratio to odds/odds ratio scale by exponentiating
     For a continuous X, $e^{\beta}$ is the factor by which the odds changes (or odds ratio) for each unit change of X
   Can also transform from log odds to probability
 Nested models in logistic regression that differ by one variable
   Use the Wald test (z-test) for the new variable
