
Logistic Regression

EPGP 04: Business Research Methods


Multiple Regression
Relationship between a continuous dependent
variable and a set of independent continuous
variables
Regression coefficients β_i

Amount by which y changes when x_i changes by one unit and all the other x's remain constant
Measures association between x_i and y adjusted for all other x's

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i$$
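To make the coefficient interpretation concrete, here is a minimal least-squares sketch on invented data (numpy only; all values are made up):

```python
import numpy as np

# Invented data: y depends linearly on two continuous predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approx [1.0, 2.0, -0.5]; each slope is the change in y per unit change in that x
```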
When to Use Binary Logistic Regression
The dependent variable is dichotomous.
Independent variables may be categorical or
continuous.
If the independents are all continuous and nicely
distributed, we may use discriminant analysis
with a categorical dependent.
If the predictors are all categorical, we may use
logit/loglinear analysis.
Logistic regression
Models the relationship between a set of variables x_i:

dichotomous (yes/no)
categorical (social class, ...)
continuous (age, ...)

and a dichotomous (binary) variable Y


Logistic regression
Age CD Age CD Age CD
22 0 40 0 54 0
23 0 41 1 55 1
24 0 46 0 58 1
27 0 47 0 60 1
28 0 48 0 60 0
30 0 49 1 62 1
30 0 49 0 65 1
32 0 50 1 67 1
33 0 51 0 71 1
35 1 51 1 77 1
38 0 52 0 81 1

Age and coronary heart disease (CD)
How can we analyse these data?
Compare mean age of diseased and non-diseased

Non-diseased: 38.6 years
Diseased: 58.7 years (p<0.0001)

Linear regression?
Scatter plot

[Figure: scatter plot of signs of coronary disease (No/Yes) against AGE in years, 0 to 100]
Logistic regression
Prevalence (%) of CD according to age group

Age group   # in group   # diseased   % diseased
20 - 29     5            0            0
30 - 39     6            1            17
40 - 49     7            2            29
50 - 59     7            4            57
60 - 69     5            4            80
70 - 79     2            2            100
80 - 89     1            1            100



[Figure: bar chart of Diseased % by age group, plotted from the table above]
Logistic function

[Figure: S-shaped logistic curve showing probability of disease (0.0 to 1.0) against x]

$$P(y|x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

$$\ln\left(\frac{P(y|x)}{1 - P(y|x)}\right) = \alpha + \beta x$$

Logistic transformation: the left-hand side above is the logit of P(y|x), where

$$P(y|x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
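A minimal sketch of these two functions in Python (the α and β values are placeholders, not estimates from the data):

```python
import math

# P(y|x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))
def logistic(x, alpha=-5.3, beta=0.11):   # placeholder coefficients
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

# The logit (inverse) maps a probability in (0, 1) back to alpha + beta*x
def logit(p):
    return math.log(p / (1 - p))

p = logistic(50)
print(p)           # probability of disease at x = 50
print(logit(p))    # recovers alpha + beta*50 = 0.2
```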
Advantages of Logit
Properties of a linear regression model
Logit ranges between -∞ and +∞
Probability (P) constrained between 0 and 1

Directly related to the notion of odds of choice
(purchase, participation, etc.); also used
heavily in the health sciences (e.g. disease)

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta x$$

$$\frac{P}{1-P} = e^{\alpha + \beta x}$$
Advantages of Logit
Logistic regression is extensively used for prediction in
marketing/consumer research because of its inherent
advantages over ordinary least squares (OLS) regression.
It enables the researcher to overcome many of the
restrictive assumptions of OLS regression:
Logistic regression does not assume a linear relationship
between the dependent and the independents.
The dependent variable need not be normally distributed.
Many outcomes in marketing are binary (for example, the
response to a marketing stimulus can be buy or no-buy).

Fitting the equation to the data

Linear regression: least squares
Logistic regression: maximum likelihood
Likelihood function
Estimates parameters α and β with the property that the likelihood
(probability) of the observed data is higher than for any other
values
OLS seeks to minimize the sum of squared distances of
the data points to the regression line. MLE seeks to
maximize the log-likelihood (LL), which reflects how likely
it is (the odds) that the observed values of the
dependent may be predicted from the observed values
of the independents. A sketch of such a fit follows below.
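As an illustration, a minimal sketch of such a fit with statsmodels (assuming it is installed), using the age/CD data from the earlier table:

```python
import numpy as np
import statsmodels.api as sm

# Age/CD observations from the earlier table
age = np.array([22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
                40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
                54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81])
cd = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
               0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
               0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1])

X = sm.add_constant(age)           # adds the intercept (alpha) column
result = sm.Logit(cd, X).fit()     # maximises the log-likelihood iteratively
print(result.params)               # MLE estimates of alpha and beta
```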
Maximum likelihood

Iterative computing:
Choice of an arbitrary value for the coefficients (usually 0)
Computation of the log-likelihood
Variation of the coefficient values
Re-iteration until maximisation (plateau); see the sketch below
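A bare-bones sketch of that loop (Newton-Raphson on the log-likelihood, coefficients started at 0, data invented):

```python
import numpy as np

# Invented data: intercept column plus one predictor
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 1])))).astype(float)

beta = np.zeros(2)                        # arbitrary starting value (0)
for _ in range(25):                       # re-iterate until the plateau
    p = 1 / (1 + np.exp(-X @ beta))       # current fitted probabilities
    grad = X.T @ (y - p)                  # gradient of the log-likelihood
    hess = (X * (p * (1 - p))[:, None]).T @ X   # negative Hessian
    step = np.linalg.solve(hess, grad)
    beta += step                          # Newton-Raphson update
    if np.abs(step).max() < 1e-8:         # log-likelihood has plateaued
        break
print(beta)                               # MLE estimates of alpha and beta
```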
Results
Maximum Likelihood Estimates (MLE) for α and β
Estimates of P(y) for a given value of x
Multiple logistic regression
More than one independent variable
Dichotomous, ordinal, nominal, continuous



Interpretation of β_i

Increase in log-odds for a one-unit increase in x_i
with all the other x_i's constant
Measures association between x_i and the log-odds
adjusted for all other x_i's

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i$$
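A quick numeric illustration of the log-odds reading (the coefficient value is hypothetical):

```python
import math

beta_1 = 0.7   # hypothetical fitted coefficient for x_1

# A one-unit increase in x_1 adds beta_1 to the log-odds, i.e. it
# multiplies the odds by exp(beta_1), holding the other x's constant.
print(math.exp(beta_1))   # odds ratio ~ 2.014
```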
Multinomial Logistic Regression
A simple extension of the logit model for when the
dependent variable can take more than two categorical
values: an individual/household (i) can choose between
J categories. A fitting sketch follows the examples below.

Examples aplenty:
means of getting to work: bus, car, train or bicycle
insurance cover taken: none, partial or full
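A minimal multinomial fit with scikit-learn on invented data standing in for the commute-mode example (assuming scikit-learn is available; its lbfgs solver fits a softmax/multinomial model for multiclass targets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features (e.g. distance to work, income) and 4 commute modes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.integers(0, 4, size=200)   # 0=bus, 1=car, 2=train, 3=bicycle

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)
print(model.predict_proba(X[:3]))  # one probability per category; each row sums to 1
```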

Logistic Regression for Classification
Classification is based on the estimated probability
A cut-off has to be decided
A cut-off of 0.5 is reasonable when both 0 and
1 are equally likely and the costs of
misclassification are the same (a minimal sketch
follows below)
A plot of the predicted probabilities helps
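A minimal sketch of applying a cut-off (probabilities and labels invented for illustration):

```python
import numpy as np

# Invented predicted probabilities and true labels
p_hat = np.array([0.92, 0.41, 0.67, 0.08, 0.55])
y_true = np.array([1, 0, 1, 0, 0])

cutoff = 0.5                               # balanced classes, equal costs
y_pred = (p_hat >= cutoff).astype(int)     # classify by thresholding
print(y_pred, (y_pred == y_true).mean())   # predictions and accuracy
```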
Logistic Regression Example
Challenger Space Shuttle Data
Classification Accuracy

Classification Table (a)

                                   Predicted
                              JointStatus      Percentage
Observed                      OK     Leak      Correct
Step 1  JointStatus  OK       17     0         100.0
                     Leak     3      4         57.1
        Overall Percentage                     87.5

a. The cut value is .500
Predicted Probability vs. Temperature

[Figure: predicted probability of a joint leak plotted against joint temperature]
Coefficients

Variables in the Equation (a)

                      B        S.E.    Wald    df   Sig.   Exp(B)
Step 1   JointTemp    -.233    .107    4.763   1    .029   .793
         Constant     15.048   7.278   4.275   1    .039   3431017

a. Variable(s) entered on step 1: JointTemp.
Prob. Values Computation

Constant (α)      15.048
Coefficient (β)   -0.233
JointTemp (x)     31
z = 15.048 - 0.233 × 31 = 7.825
Exp(z)            2502.386111
1 + Exp(z)        2503.386111
P(y|x)            0.999600541
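The same computation in Python, reproducing the slide's numbers:

```python
import math

alpha, beta, x = 15.048, -0.233, 31   # fitted coefficients, launch temperature
z = alpha + beta * x                  # 7.825
p = math.exp(z) / (1 + math.exp(z))   # 0.999600541...
print(z, p)
```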
Logit Model

We can express Z as a linear combination of the independent
variables:

$$Z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

$$P(Y|Z) = \frac{e^Z}{1 + e^Z}$$

$$1 - P(Y|Z) = \frac{1}{1 + e^Z}$$

$$\frac{P(Y|Z)}{1 - P(Y|Z)} = e^Z$$

$$\ln\left(\frac{P(Y|Z)}{1 - P(Y|Z)}\right) = Z$$

For the Challenger data:

Log (Odds) = 15.048 - 0.233 (Joint Temperature)
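Putting the fitted equation to work (31 °F is the launch-day temperature from the slides; 70 °F is an illustrative warm-day value added here for contrast):

```python
import math

def leak_probability(joint_temp):
    log_odds = 15.048 - 0.233 * joint_temp
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(leak_probability(31))   # ~0.9996, near-certain joint problem
print(leak_probability(70))   # ~0.22 on a warmer day
```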
