Logistic regression

Maths and Statistics Help Centre

Many statistical tests require the dependent (response) variable to be continuous, so a different set of tests is
needed when the dependent variable is categorical. One of the most commonly used tests for categorical variables
is the Chi-squared test, which looks at whether or not there is a relationship between two categorical variables, but
it makes no allowance for the potential influence of other explanatory variables on that relationship. For
continuous outcome variables, multiple regression can be used for

a) controlling for other explanatory variables when assessing relationships between a dependent variable and
several independent variables
b) predicting outcomes of a dependent variable using a linear combination of explanatory (independent)
variables

The maths:
For multiple regression a model of the following form can be used to predict the value of a response variable y
using the values of a number of explanatory variables:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

where $\beta_0$ is the constant (intercept) and $\beta_1, \dots, \beta_q$ are the co-efficients for the $q$ explanatory variables $x_1, \dots, x_q$.

The regression process finds the co-efficients which minimise the sum of the squared differences between the
observed and expected values of y (the residuals). As the outcome of logistic regression is binary, y needs to be
transformed so that the regression process can be used. The logit transformation gives the following:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

where $p$ is the probability of the event occurring (e.g. a person dies following a heart attack) and $\frac{p}{1-p}$ is the odds of the event.

If probabilities of the event of interest happening for individuals are needed, the logistic regression equation
can be written as:

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}, \quad 0 < p < 1$$
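
The two forms of the equation are inverses of each other. As a minimal sketch (the function names `logit` and `inverse_logit` and the example values are illustrative, not part of the SPSS output), the following converts between a probability and its log-odds and shows that the probability always stays between 0 and 1:

```python
import math

def logit(p):
    """Log-odds ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Probability exp(eta) / (1 + exp(eta)) for a linear predictor eta."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Illustrative values of the linear predictor (not taken from the Titanic output).
for eta in (-3.0, 0.0, 3.0):
    p = inverse_logit(eta)
    print(f"eta = {eta:+.1f}  ->  p = {p:.3f},  logit(p) = {logit(p):+.3f}")
```

A linear predictor of 0 corresponds to a probability of 0.5; large negative or positive values push the probability towards 0 or 1 without ever reaching them.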

Logistic regression serves the same two purposes as multiple regression, but the outcome variable is binary, and it
leads to a model which can predict the probability of an event happening for an individual.

Titanic example: On April 14th 1912, only 705 passengers and crew out of the 2228 on board the Titanic survived
when the ship sank. Information on 1309 of those on board will be used to demonstrate logistic regression. The
data can be downloaded from
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

The key variables of interest are:

Dependent variable: Whether a passenger survived or not (survival is indicated by survived = 1).

Possible explanatory variables: Age, gender (recode so that sex = 1 for females and 0 for males), class (pclass = 1, 2 or
3), number of accompanying parents/ children (parch) and number of accompanying siblings/ spouses (sibsp)

Most of the variables can be investigated using crosstabulations with the dependent variable ‘survived’. Another
reason for the cross tabulation is to identify categories with small frequencies as this can cause problems with the
logistic regression procedure. The number of accompanying parents/ children (parch) and number of accompanying
siblings/ spouses (sibsp) were used to create a new binary variable indicating whether or not the person was
travelling alone or with family (1 = travelling with family, 0 = travelling alone).
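
For readers preparing the data outside SPSS, the recoding described above might be sketched as follows. The three passenger records are made up for illustration, the variable name `family` is hypothetical, and it is assumed that `sex` is stored as the text values "male" and "female":

```python
# Hypothetical passenger records; column names follow titanic3.xls.
passengers = [
    {"sex": "female", "parch": 0, "sibsp": 0},
    {"sex": "male", "parch": 2, "sibsp": 1},
    {"sex": "male", "parch": 0, "sibsp": 0},
]

def recode(row):
    """Return the 0/1 indicators described in the text for one passenger."""
    return {
        # 1 = female, 0 = male
        "sex": 1 if row["sex"] == "female" else 0,
        # 1 = travelling with family, 0 = travelling alone
        "family": 1 if row["parch"] + row["sibsp"] > 0 else 0,
    }

recoded = [recode(row) for row in passengers]
print(recoded)
```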

              Male    Female   1st class   2nd class   3rd class   Travelling   Travelling
                                                                   alone        with family
% surviving   19.1%   72.7%    61.9%       43%         25.5%       30.3%        50.3%

When tested separately, Chi-squared tests concluded that there was evidence of a relationship between survival and
gender, class and whether an individual was travelling alone. Looking at the percentages surviving, it's clear that
women, those in first class and those not travelling alone were much more likely to survive. Logistic regression will
be carried out using these three variables first of all. Stage 1 of the following analysis uses logistic regression to
control for other variables when assessing relationships, and stage 2 will look at producing a good model to predict
from.
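
To illustrate what one of these separate Chi-squared tests computes, here is a rough sketch for survival vs gender. The handout does not give the cell counts, so they are reconstructed here by rounding the survival percentages against the group sizes shown in the Categorical Variables Codings table (843 males, 466 females); this reconstruction is an assumption. The statistic is the plain Pearson Chi-squared without continuity correction, compared with the standard 5% critical value for 1 degree of freedom.

```python
# Group sizes from the Categorical Variables Codings table; survival rates from
# the percentage table. Survivor counts are RECONSTRUCTED by rounding (an assumption).
n_male, n_female = 843, 466
surv_male = round(0.191 * n_male)
surv_female = round(0.727 * n_female)

observed = [
    [surv_male, n_male - surv_male],        # male: survived, died
    [surv_female, n_female - surv_female],  # female: survived, died
]

def chi_squared(table):
    """Pearson chi-squared statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

CRITICAL_5_PERCENT_DF1 = 3.841  # chi-squared critical value, 1 degree of freedom

statistic = chi_squared(observed)
print(f"chi-squared = {statistic:.1f}, "
      f"significant at 5%: {statistic > CRITICAL_5_PERCENT_DF1}")
```

The statistic is far above the critical value, which agrees with the conclusion that survival and gender are related.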

Use Analyze → Regression → Binary Logistic to get the following screen:

Where there are more than two categories, the last category is automatically the reference category. This means
that all the other categories will be compared to the reference in the output, e.g. 1st and 2nd class will be
compared to 3rd class.

Treatment of categorical explanatory variables

When interpreting SPSS output for logistic regression, it is important that binary variables are coded as 0 and 1.
Also, categorical variables with three or more categories need to be recoded as dummy variables with 0/ 1 outcomes
e.g. class needs to appear as two variables 1st/ not 1st with 1 = yes and 2nd/ not 2nd with 1 = yes. Luckily SPSS does
this for you! When adding a categorical variable to the list of covariates, click on the Categorical button and move all
categorical variables to the right hand box. The following table in the output shows the coding of these variables.

Categorical Variables Codings

                                              Parameter coding
                                  Frequency   (1)       (2)
class              1st class      323         1.000     .000
                   2nd class      277         .000      1.000
                   3rd class      709         .000      .000
Travelling alone   Alone          790         1.000
                   with family    519         .000
Gender             male           843         1.000
                   female         466         .000

For class, 3rd class is the reference class so if 1st class = 0 and 2nd class = 0, the person must have been in
3rd class. Females and those not travelling alone are the references for the other groups.
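
The coding in this table can be reproduced by hand. The sketch below is not SPSS's own routine; `dummy_code` is a hypothetical helper that creates one 0/1 indicator for every category except the chosen reference, so the reference category is the one where all indicators are 0.

```python
def dummy_code(value, categories, reference):
    """Return one 0/1 indicator per non-reference category."""
    return {c: 1 if value == c else 0 for c in categories if c != reference}

CLASSES = ["1st class", "2nd class", "3rd class"]
GENDERS = ["male", "female"]

# 3rd class is the reference, so it is coded as 0 on both indicators.
for pclass in CLASSES:
    print(pclass, dummy_code(pclass, CLASSES, reference="3rd class"))

# Female is the reference for gender, leaving a single indicator for male.
for gender in GENDERS:
    print(gender, dummy_code(gender, GENDERS, reference="female"))
```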

Interpretation of the output

The output is split into two sections, block 0 and block 1. Block 0 assesses the usefulness of having a null model,
which is a model with no explanatory variables. The ‘variables in the equation’ table only includes a constant so
each person has the same chance of survival.

The null model is:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 = -0.481, \qquad p = \text{probability of survival} = \frac{\exp(-0.481)}{1 + \exp(-0.481)} = 0.382$$

SPSS calculates the probability of survival for each individual using the model in the current block. If the
probability of survival is 0.5 or more it will predict survival (as survival = 1), and death if the probability is less
than 0.5. As more people died than survived, the probability of survival under the null model is 0.382 for everyone
and therefore everyone is predicted as dying (coded as 0). As 61.8% of people actually died, 61.8% were correctly
classified, so classification from the null model is 61.8% accurate. The addition of explanatory variables should
increase the percentage of correct classification significantly if the model is good.
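
The null-model figures can be checked with a few lines of arithmetic. This is only a sketch of the calculation, not of SPSS's internals; it relies on the fact that a constant-only model predicts the same probability for everyone, and the names `b0`, `CUTOFF` and `accuracy` are illustrative.

```python
import math

b0 = -0.481  # constant of the null model, as reported in the text

# Same predicted probability of survival for every passenger under the null model.
p_survive = math.exp(b0) / (1 + math.exp(b0))

CUTOFF = 0.5
predicted = 1 if p_survive >= CUTOFF else 0  # 1 = survived, 0 = died

# Everyone receives the same prediction, so the share classified correctly is
# the share of people whose actual outcome matches that prediction.
accuracy = p_survive if predicted == 1 else 1 - p_survive

print(f"p(survival) = {p_survive:.3f}, predicted = {predicted}, "
      f"accuracy = {accuracy:.1%}")
```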

Block 0: Beginning Block

Block 1: Method = Enter


Block 1 shows the results after the addition of the explanatory variables selected.

The Omnibus Tests of Model Co-efficients table gives the result of the Likelihood Ratio (LR) test which indicates
whether the inclusion of this block of variables contributes significantly to model fit. A p-value (sig) of less than 0.05
in the Block row means that the block 1 model is a significant improvement on the block 0 model.

In standard regression, the co-efficient of determination (R²) gives an indication of how much variation in y is
explained by the model. This cannot be calculated for logistic regression but the ‘Model Summary’ table gives the
values for two pseudo R² values which try to measure something similar. From the table above, we can conclude
that between 31% and 42.1% of the variation in survival can be explained by the model in block 1. The correct
classification rate has increased by 16.2 percentage points to 78%.

Finally, the ‘Variables in the Equation’ table summarises the importance of the explanatory variables individually
whilst controlling for the other explanatory variables.

The Wald test is similar to the LR test but here it is used to test the hypothesis that each $\beta = 0$. In the sig
column, the p-values are all below 0.05 apart from the test for the variable Alone (p = 0.286). This means that
although the Chi-squared test for Survival vs Alone was significant, once the other variables were controlled for,
there is not a strong enough relationship between that variable and survival. Class is tested as a whole (pclass) and
then 1st and 2nd class are each compared to the reference category, 3rd class. When interpreting the differences,
look at the $\exp(\beta)$ column, which represents the odds ratio for the individual variable. For example, the odds
of survival for those in 1st class were 5.493 times the odds for those in 3rd class. With gender, the odds ratio
compares the odds of a male surviving with those of a female. The odds are a lot lower for men (0.084 times those
of women). For ease of interpretation, calculate the odds of a female surviving relative to a male using
1/0.084 = 11.9. The odds of survival for females were 11.9 times those for males.
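
The odds ratios quoted here are simply the exponentials of the co-efficients in column B. The sketch below recomputes them from the rounded co-efficients given in the full model equation of this handout, so the last decimal place can differ slightly from the SPSS table, which works from unrounded values.

```python
import math

# Co-efficients (B) as quoted in the full model equation of this handout.
coefficients = {
    "1st class": 1.703,
    "2nd class": 0.832,
    "male": -2.474,
    "alone": -0.156,
}

# Odds ratio for each variable: exp(B).
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# Reversing the comparison (female relative to male) inverts the odds ratio.
female_vs_male = 1 / odds_ratios["male"]

for name, ratio in odds_ratios.items():
    print(f"{name:10s} B = {coefficients[name]:+.3f}  exp(B) = {ratio:.3f}")
print(f"female vs male odds ratio = {female_vs_male:.1f}")
```

A negative co-efficient always gives an odds ratio below 1, and a positive co-efficient gives one above 1.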

The co-efficients for the model are contained in the column headed B. A negative value means that the odds of
survival decrease, e.g. for males and those travelling alone.

The full model being tested is:

$$\ln\left(\frac{p}{1-p}\right) = 0.460 + 1.703\,x_{\text{1st class}} + 0.832\,x_{\text{2nd class}} - 2.474\,x_{\text{male}} - 0.156\,x_{\text{alone}}$$

where $x_{\text{1st class}} = 1$ for 1st class, $x_{\text{2nd class}} = 1$ for 2nd class, $x_{\text{male}} = 1$ for men and $x_{\text{alone}} = 1$ for a person travelling alone.
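
As a sketch of how the fitted model could be used for prediction, the function below plugs 0/1 values into the equation and converts the log-odds to a probability. The function name `survival_probability` and the two example passengers are illustrative, and the constant is assumed to be +0.460 as read from the equation.

```python
import math

def survival_probability(first_class, second_class, male, alone):
    """Predicted probability of survival from the full model (all inputs 0 or 1)."""
    log_odds = (0.460                    # constant, assumed positive
                + 1.703 * first_class
                + 0.832 * second_class
                - 2.474 * male
                - 0.156 * alone)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# A 1st class woman travelling with family.
p_woman = survival_probability(first_class=1, second_class=0, male=0, alone=0)
# A 3rd class man travelling alone (3rd class is the reference, so both class terms are 0).
p_man = survival_probability(first_class=0, second_class=0, male=1, alone=1)

print(f"1st class woman with family: {p_woman:.3f}")
print(f"3rd class man alone:         {p_man:.3f}")
```

With the 0.5 cut-off used by SPSS, the first passenger would be classified as surviving and the second as dying.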