
Logistic Regression

Dr. Monica
Department of Mathematics
Linear regression applies when the outcome is continuous.
Logistic regression applies when the outcome is binary (0/1, yes/no).
Examples of Binary outcomes:
• Will you get admission to a college or not?
• Should a bank give a person a loan or not?
• Which people are more likely to vote for a particular candidate?
• Which customers are more likely to buy a new product?
There are two outcomes: Yes/No
What is Logistic Regression ?
Logistic regression seeks to:
• Model the probability of an event occurring depending on the values
of the independent variables, which can be categorical or numerical
• Estimate the probability that an event occurs for a randomly selected
observation versus the probability that it does not occur
• Predict the effect of a series of variables on a binary response
variable
• Classify observations by estimating the probability that an
observation is in a particular category
Understanding the process
Data Example: Customers’ subscription to a
magazine
• We have data on 1,000 random customers from a given city. We want
to know what determines their decision to subscribe to a magazine.
Subscribe: Indicates if a customer has subscribed to the magazine.
Age: we will start by examining how age influences the likelihood of
subscription.
A Linear Model

Subscribe = β0 + β1 × age

Suppose we are given the coefficients below:

Subscribe = -1.7 + 0.064 × age
P(subscribe = 1) = p = -1.7 + 0.064 × age
Note:
• The regression coefficients for logistic regression are calculated using
maximum likelihood estimation (MLE), or
• values of the coefficients can be estimated from an ANOVA table or any
regression routine in R.
Problems with the Linear Approach
• Probabilities are bounded: 0 ≤ p ≤ 1.
• The range of age in the data is 20 ≤ age ≤ 55.
• The probability that a 35-year-old person subscribes is:
p = -1.700 + 0.064 × 35 = 0.54
What about people 25 and 45 years of age?
p = -1.700 + 0.064 × 25 = -0.10
p = -1.700 + 0.064 × 45 = 1.18
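The problem is easy to see in code. A minimal sketch, using the slide's linear coefficients, shows the "probability" escaping the unit interval at the ends of the age range:

```python
def linear_p(age):
    """Linear 'probability' model from the slides -- not bounded to [0, 1]."""
    return -1.7 + 0.064 * age

print(round(linear_p(35), 2))  # 0.54  -- looks like a valid probability
print(round(linear_p(25), 2))  # -0.1  -- impossible: below 0
print(round(linear_p(45), 2))  # 1.18  -- impossible: above 1
```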
Linear Model Plot

[Figure: fitted linear model of subscription probability vs. age; within the data range the line falls below 0 and rises above 1]
Fixing the Linear Approach
• We need to somehow constrain p such that 0 ≤ p ≤ 1.
• We know p = f(age), but the linear function didn't work.
• What must f(·) satisfy to always produce reasonable forecasts?
What must our probability function satisfy?
f(·) must satisfy two things:
1. It must always be positive (since p ≥ 0)
2. It must be less than 1 (since p ≤ 1)
Two Steps!
Step 1
It must always be positive (since p ≥ 0), so exponentiate:

p = e^(β0 + β1 × age)

Step 2
It must be less than 1 (since p ≤ 1), so divide by one plus itself:

p = e^(β0 + β1 × age) / (1 + e^(β0 + β1 × age))
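The two steps above can be sketched directly. Reusing the earlier linear coefficients purely for illustration, the transformed function can never leave (0, 1) no matter what age is plugged in:

```python
import math

# Illustrative coefficients (borrowed from the earlier linear model).
b0, b1 = -1.7, 0.064

def logistic_p(age):
    z = b0 + b1 * age                        # linear predictor: any real value
    return math.exp(z) / (1 + math.exp(z))   # squeezed into (0, 1)

# Even wildly out-of-range inputs stay inside the unit interval.
for age in (5, 25, 35, 45, 100):
    assert 0 < logistic_p(age) < 1
```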
Logit Function
The previous equation can be rewritten as

ln(p / (1 - p)) = β0 + β1 × age

This is called the logit function, and it is the form used in logistic
regression. Even though the probability of a customer subscribing is not a
linear function of age, the equation above is linear in age.
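The logit and the logistic transformation from the previous slide are inverses of each other, which a quick numeric round trip confirms:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to any real number."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Map any real number back to a probability in (0, 1)."""
    return math.exp(z) / (1 + math.exp(z))

# Round trip: inv_logit undoes logit (up to floating-point error).
z = logit(0.69)
assert abs(inv_logit(z) - 0.69) < 1e-12
```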
Estimated model
• Suppose the estimated coefficients are β0 = -26.49 and β1 = 0.78

Age | y* = ln(p/(1-p)) = β0 + β1 × age | p = e^(y*)/(1+e^(y*)) | Change
25  | -7.00  | ≈0   |
26  | -6.22  | ≈0   | 0%
35  | 0.813  | 0.69 |
36  | 1.594  | 0.83 | 14%
45  | 8.623  | ≈1   |
46  | 9.404  | ≈1   | 0%
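The table can be reproduced to two decimals. The intercept below is inferred from the table's y* column (an assumption: the slide's exact β0 is not stated cleanly), with β1 = 0.78 as given:

```python
import math

# Reconstructed single-variable model; b0 is inferred from the table.
b0, b1 = -26.49, 0.78

def prob_subscribe(age):
    z = b0 + b1 * age
    return math.exp(z) / (1 + math.exp(z))

print(round(prob_subscribe(35), 2))  # 0.69, matching the table
print(round(prob_subscribe(36), 2))  # 0.83, matching the table
```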
Logistic Model Plot

[Figure: logistic (S-shaped) curve of subscription probability vs. age, bounded between 0 and 1]
Multiple Logistic Regression
• Moving from a regression with a single independent variable (age) to a
multiple regression model with more than one (age and gender) is very
simple.
• Let's add a dummy variable for gender and run:

y* = ln(p / (1 - p)) = β0 + β1 × age + β2 × woman
Results of Multiple Regression
• Estimated model is:
y* = -26.47 + 0.79 × age - 0.56 × woman

Age | Woman | y* = β0 + β1 × age + β2 × woman | p = e^(y*)/(1+e^(y*)) | Change
35  | 1     | 0.529 | 0.629 |
35  | 0     | 1.087 | 0.748 | 11.9%
36  | 1     | 1.317 | 0.789 |
36  | 0     | 1.874 | 0.867 | 7.8%
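A sketch with the dummy variable follows. Note the reported coefficients are rounded, so the computed probabilities land near, not exactly on, the table's values (the slide's y* column was evidently computed from unrounded coefficients):

```python
import math

# Reported (rounded) multiple-regression coefficients from the slide.
b0, b1, b2 = -26.47, 0.79, -0.56

def prob_subscribe(age, woman):
    z = b0 + b1 * age + b2 * woman   # woman is a 0/1 dummy variable
    return math.exp(z) / (1 + math.exp(z))

p_woman = prob_subscribe(35, 1)   # table value: 0.629 (approximately)
p_man = prob_subscribe(35, 0)     # table value: 0.748 (approximately)
```

The negative coefficient on the dummy means that, holding age fixed, the model assigns women a lower subscription probability than men.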
Thank You!
What is logit?
• In logistic regression we don't know p; the goal of logistic regression
is to estimate p from a linear combination of the independent variables.
The estimate of p is p̂.
• To tie the linear combination of variables to a probability, we need a
link function: something that maps the linear combination, which can take
any real value, onto a value between 0 and 1. The natural log of the odds,
the logit, is that link function.

ln(odds) = ln(p / (1 - p)) = logit(p)

or equivalently

logit(p) = ln(p) - ln(1 - p)
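The two forms of the logit above are algebraically identical, as a quick check over a few probabilities confirms:

```python
import math

# ln(p/(1-p)) == ln(p) - ln(1-p), by the quotient rule for logarithms.
for p in (0.1, 0.5, 0.7668, 0.99):
    lhs = math.log(p / (1 - p))
    rhs = math.log(p) - math.log(1 - p)
    assert abs(lhs - rhs) < 1e-12
```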
Odds Ratio
The odds ratio is exactly what it says it is: a ratio of two odds.
Fair coin flip:
P(heads) = 1/2 = 0.5
Odds(heads) = 0.5/0.5 = 1, or 1:1
Loaded coin flip:
P(heads) = 7/10 = 0.7
Odds(heads) = 0.7/0.3 = 2.333
Odds ratio = odds1/odds0 = (0.7/0.3) / (0.5/0.5) = 2.333
Hence the odds of getting heads on the loaded coin are 2.333× greater than
on the fair coin.
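The coin example in code, showing the odds transformation and the ratio of the two odds:

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

fair = odds(0.5)          # 1.0, i.e. 1:1
loaded = odds(0.7)        # 2.333...
ratio = loaded / fair
print(round(ratio, 3))    # 2.333
```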
First-time home buyer
As a first-time home buyer, you are busy organizing your financial
records so you can apply for a home loan. As part of this process you
order a copy of your credit score, which can range from 300 to 850.
Lenders will factor in your credit score when deciding whether or not to
approve you for a home loan. It turns out your score is 720.

While doing your research you find some raw data online showing 1,000
applicants' credit scores and whether or not each application was
approved (yes/no).
Model data

[Figure: scatter plot of Approved (0/1) vs. Credit Score]
Calculating probability at score 720 using the logit
• Suppose the coefficients given are β0 = -9.346 and β1 = 0.014634

y* = logit = ln(p / (1 - p)) = β0 + β1 × score

p = e^(y*) / (1 + e^(y*))

p = e^(-9.346 + 0.014634 × score) / (1 + e^(-9.346 + 0.014634 × score))

At score = 720: p = 0.7668

A one-point increase in score increases the odds of loan approval by
3.337/3.289 = 1.0146 times.
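The whole calculation fits in a few lines. It also shows why the one-point odds ratio is the same everywhere: since odds = e^(y*), the ratio for a one-point increase is exactly e^(β1), regardless of the starting score:

```python
import math

# Slide coefficients for the credit-score model.
b0, b1 = -9.346, 0.014634

def prob_approved(score):
    z = b0 + b1 * score
    return math.exp(z) / (1 + math.exp(z))

def odds(p):
    return p / (1 - p)

p720 = prob_approved(720)
print(round(p720, 4))                              # 0.7668

# One-point odds ratio: numerically equal to e**b1.
ratio = odds(prob_approved(721)) / odds(p720)
print(round(ratio, 4), round(math.exp(b1), 4))
```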
Odds ratio in logistic regression

• The odds ratio for a variable in logistic regression represents how the
odds change with a one-unit increase in that variable, holding all other
variables constant.
• The score variable has an odds ratio of 1.0146:

Score | Probability | Odds  | Odds ratio
720   | 0.7668      | 3.289 |
721   | 0.7694      | 3.337 | 1.0146

• This means a one-point increase in score increases the odds of loan
approval by a factor of 1.0146.
• This holds true for all one-point score intervals.
Odds ratio in logistic regression (contd.)

Score | Probability (p) | Odds = p/(1-p) | OR, +10 | OR, +20 | OR, +30 | OR, +70
600   | 0.36225 | 0.56802 |      |      |      |
610   | 0.39669 | 0.65753 | 1.16 |      |      |
620   | 0.43219 | 0.76115 | 1.16 | 1.34 |      |
630   | 0.46840 | 0.88110 | 1.16 |      | 1.55 |
640   | 0.50494 | 1.01996 | 1.16 | 1.34 |      |
650   | 0.54143 | 1.18069 | 1.16 |      |      |
660   | 0.57748 | 1.36676 | 1.16 | 1.34 | 1.55 |
670   | 0.61273 | 1.58214 | 1.16 |      |      | 2.79
680   | 0.64683 | 1.83147 | 1.16 | 1.34 |      |
690   | 0.67950 | 2.12010 | 1.16 |      | 1.55 |
700   | 0.71050 | 2.45420 | 1.16 | 1.34 |      |

(OR, +k is the odds ratio relative to a score k points lower.)
• A 10-point increase in score increases the odds by 1.16 times
• A 20-point increase in score increases the odds by 1.34 times
• A 30-point increase in score increases the odds by 1.55 times
• A 70-point increase in score increases the odds by 2.79 times
• It is very important to keep probability and odds separate: in this
example a 30-point increase in score increases the odds by a factor of
about 1.55, but the probability at score 630 still remains low, i.e. 0.4684.
Effect of increasing Score

Score increase | Odds ratio | % increase
0   | 1    | 0%
10  | 1.16 | 15.76%
20  | 1.34 | 34.00%
30  | 1.55 | 55.12%
40  | 1.80 | 79.56%
50  | 2.08 | 107.86%
60  | 2.41 | 140.62%
70  | 2.79 | 178.54%
80  | 3.22 | 222.43%
90  | 3.73 | 273.24%
100 | 4.32 | 332.06%

Note: this is the percentage increase in the odds, not the percentage
increase in the probability of being approved.
The odds ratios follow an exponential curve: OR = e^(β1 × Δ) =
e^(0.014634 × Δ), where Δ is the score increase.
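The table above can be regenerated from the formula OR = e^(β1 × Δ) alone:

```python
import math

# Slide coefficient; Delta is the number of points added to the score.
b1 = 0.014634

def odds_ratio(delta):
    return math.exp(b1 * delta)

for delta in (10, 20, 30, 70, 100):
    print(delta, round(odds_ratio(delta), 2))
```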
Estimated regression equation
• The natural logarithm of the odds is a linear function of the
independent variables. Taking the antilog of the logit function lets us
recover the estimated regression equation.

logit(p) = ln(p / (1 - p)) = β0 + β1 × x1

Taking the antilog:

p / (1 - p) = e^(β0 + β1 × x1)
⇒ p = e^(β0 + β1 × x1) × (1 - p)
⇒ p = e^(β0 + β1 × x1) - e^(β0 + β1 × x1) × p
⇒ p + e^(β0 + β1 × x1) × p = e^(β0 + β1 × x1)
⇒ p × (1 + e^(β0 + β1 × x1)) = e^(β0 + β1 × x1)
⇒ p = e^(β0 + β1 × x1) / (1 + e^(β0 + β1 × x1))
Logistic plot

[Figure: scatter plot of Approved vs. Credit Score with the fitted logistic curve]
A note on coefficients
• The regression coefficients for logistic regression are calculated using
maximum likelihood estimation (MLE).
• Discussing MLE for the coefficients is beyond the scope of this topic.
Values of the coefficients can be estimated from an ANOVA table or any
regression routine in R.
Suppose the estimated values are β0 = -9.346 and β1 = 0.014634.
Then the estimated regression equation is

p̂ = e^(β0 + β1 × x1) / (1 + e^(β0 + β1 × x1))

where p̂ is the estimated probability of being approved and x1 is the credit
score.
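To give a flavor of what MLE does, here is a minimal sketch of Newton-Raphson (the usual iteratively reweighted least squares) for a one-predictor logistic model. The data set, seed, and "true" coefficients are synthetic and purely illustrative, not the slides' data:

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Synthetic credit-score data generated from known coefficients
# (chosen near the slide's values for illustration).
random.seed(42)
TRUE_B0, TRUE_B1 = -9.0, 0.015
scores = [random.uniform(300, 850) for _ in range(2000)]
approved = [1 if random.random() < sigmoid(TRUE_B0 + TRUE_B1 * s) else 0
            for s in scores]

# Newton-Raphson on the log-likelihood:
# gradient g = X'(y - p), weights w = p(1 - p) for the 2x2 Hessian X'WX.
b0, b1 = 0.0, 0.0
for _ in range(25):
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(scores, approved):
        p = sigmoid(b0 + b1 * x)
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01       # solve the 2x2 system (X'WX) d = g
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# The estimates should land near the generating coefficients.
print(round(b0, 2), round(b1, 4))
```

With 2,000 observations the fitted slope lands close to the generating value of 0.015; in practice this whole loop is what a call to R's `glm(..., family = binomial)` performs internally.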
Model data
• Remember that your score was 720, so we can now calculate the estimated
probability, and the odds, that you will be approved for a loan based on
this data and model:

p̂ = e^(-9.346 + 0.014634 × 720) / (1 + e^(-9.346 + 0.014634 × 720)) = 3.289 / (1 + 3.289) = 0.7668
Odds = 0.7668 / (1 - 0.7668) = 3.289

Similarly, for a score of 721:
p̂ = 0.7694, odds = 3.337

Odds ratio for a one-point increase in credit score: 3.337/3.289 = 1.0146
