
Statistics for Health Research

Entering Multidimensional Space: Multiple Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics

Objectives of session

Recognise the need for multiple regression
Understand methods of selecting variables
Understand strengths and weaknesses of selection methods
Carry out multiple regression in SPSS and interpret the output

Why do we need multiple regression?

Research is rarely as simple as the effect of one variable on one outcome, especially with observational data
Need to assess many factors simultaneously for more realistic models

Consider the fitted plane:

y = a + b1x1 + b2x2

[Figure: 3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age; axes Dependent (y), Explanatory (x1), Explanatory (x2)]

When to use multiple regression modelling (1)

Assess the relationship between two variables while adjusting or allowing for another variable
Sometimes the second variable is considered a nuisance factor
Example: physical activity allowing for age and medications

When to use multiple regression modelling (2)

In an RCT, whenever there is imbalance between arms of the trial at baseline in characteristics of subjects
e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage, and co-morbidity

When to use multiple regression modelling (2, continued)

A special case of this is adjusting for the baseline level of the primary outcome in an RCT
The baseline level is added as a factor in the regression model
This will be covered in the Trials part of the course

When to use multiple regression modelling (3)

With observational data, in order to produce a prognostic equation for future prediction of risk of mortality
e.g. predicting future risk of CHD used 10-year data from the Framingham cohort

When to use multiple regression modelling (4)

With observational data, in order to adjust for possible confounders
e.g. survival in colorectal cancer in those with hypertension, adjusted for age, gender, social deprivation and co-morbidity

Definition of Confounding

A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway

Example of Confounding

[Diagram: Deprivation → Lung Cancer, with Smoking related to both deprivation and lung cancer]

But, also worth adjusting for factors only related to outcome

[Diagram: Deprivation → Lung Cancer, with Exercise related only to lung cancer]

Not worth adjusting for an intermediate factor in a causal pathway

[Diagram: Exercise → Blood viscosity → Stroke]

In a causal pathway each factor is merely a marker of the other factors, i.e. correlated: collinearity

SPSS: Add both baseline LDL and age in the independent box in linear regression
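For readers working outside SPSS, here is a minimal sketch of the same fit in Python with statsmodels; the CSV file and column names are hypothetical stand-ins for the LDL dataset used in these slides.

```python
# Minimal sketch: Min LDL regressed on baseline LDL and age together.
# File name and column names are hypothetical assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_data.csv")  # hypothetical export of LDL Data.sav

model = smf.ols("min_ldl ~ baseline_ldl + age", data=df).fit()
print(model.summary())  # coefficients, 95% CIs, t, p, R-squared
```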

Output from SPSS: linear regression on Age at baseline

Coefficients(a)

                  B       Std. Error   Beta    t        Sig.   95% CI for B     Tolerance   VIF
(Constant)        2.024   .105                 19.340   .000   (1.819, 2.229)
Age at baseline   -.008   .002         -.121   -4.546   .000   (-.011, -.004)   1.000       1.000

a. Dependent Variable: Min LDL achieved

Output from SPSS: linear regression on Baseline LDL

Coefficients(a)

               B      Std. Error   Beta   t        Sig.   95% CI for B
(Constant)     .668   .066                10.091   .000   (.538, .798)
Baseline LDL   .257   .018         .351   13.950   .000   (.221, .293)

a. Dependent Variable: Min LDL achieved

Output: Multiple regression

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .360a   .130       .129                .6753538

a. Predictors: (Constant), Age at baseline, Baseline LDL

R2 now improved to 13%

Coefficients(a)

                  B       Std. Error   Beta    t        Sig.   95% CI for B
(Constant)        1.003   .124                 8.086    .000   (.760, 1.246)
Baseline LDL      .250    .019         .342    13.516   .000   (.214, .286)
Age at baseline   -.005   .002         -.081   -3.187   .001   (-.008, -.002)

a. Dependent Variable: Min LDL achieved

Both variables remain significant INDEPENDENTLY of each other

How do you select which variables to enter the model?

Usually consider which hypotheses you are testing
If there is a main exposure variable, enter it first and assess confounders one at a time
For derivation of a clinical prediction rule (CPR) you want powerful predictors
Also clinically important factors, e.g. cholesterol in CHD prediction
Significance is important, but it is acceptable to keep an important variable without statistical significance

How do you decide what variables to enter in the model?

Correlations? With great difficulty!

[Figure: 3-dimensional scatterplot from SPSS of Time from Surgery in relation to Dukes staging and age]

Approaches to model building

1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above

1) Let Science or Clinical factors guide selection

Baseline LDL cholesterol is an important factor determining LDL outcome, so enter it first
Next allow for age and gender
Add adherence as important?
Add BMI and smoking?

1) Let Science or Clinical factors guide selection

Results in a model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a good model?

1) Let Science or Clinical factors guide selection: Final Model

Note: three variables entered but not statistically significant

1) Let Science or Clinical factors guide selection

Is this the best model?
Should I leave out the non-significant factors (Model 2)?

Model   Adj R2   F from ANOVA   No. of parameters p
1       0.137    37.48
2       0.134    72.021

Adj R2 is lower, F has increased, and the number of parameters is smaller in the 2nd model. Is this better?

Kullback-Leibler Information

Kullback and Leibler (1951) quantified the meaning of information in relation to Fisher's sufficient statistics
Basically, we have reality f and a model g to approximate f
The K-L information is I(f, g)

Kullback-Leibler Information

We want to minimise I(f, g) to obtain the best model over other models
I(f, g) is the information lost, or the distance between reality and a model, so we need to minimise:

I(f, g) = ∫ f(x) log( f(x) / g(x) ) dx
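As an illustration (not from the original slides), the integral can be approximated numerically for two simple densities; the two normal distributions below are arbitrary choices purely for demonstration.

```python
# Numerical illustration of I(f, g) for two normal densities.
# The distributions chosen are arbitrary assumptions for demonstration.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 100001)
f = norm.pdf(x, loc=0, scale=1)      # "reality" f
g = norm.pdf(x, loc=1, scale=1.5)    # approximating model g

# I(f, g) = integral of f(x) * log(f(x) / g(x)) dx, by Riemann sum
dx = x[1] - x[0]
kl = np.sum(f * np.log(f / g)) * dx
print(round(kl, 3))  # ~0.35: information lost using g in place of f
```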

Akaike's Information Criterion

It turns out that the function I(f, g) is related to a very simple measure of goodness-of-fit: Akaike's Information Criterion, or AIC

Selection Criteria

With a large number of factors the type 1 error is inflated, so you are likely to end up with a model containing many variables
Two standard criteria:
1) Akaike's Information Criterion (AIC)
2) Schwarz's Bayesian Information Criterion (BIC)
Both penalise models with a large number of variables, BIC increasingly so as the sample size grows

Akaike's Information Criterion

AIC = -2 × log likelihood + 2 × p

where p = number of parameters and -2 × log likelihood is in the output
Hence AIC penalises models with a large number of variables
Select the model that minimises (-2LL + 2p)
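Outside SPSS, a short sketch of comparing candidate models by AIC might look like the following; it reuses the hypothetical data frame and column names from the earlier snippet, and statsmodels exposes the log likelihood and AIC as attributes of a fitted result.

```python
# Sketch: comparing two candidate models by AIC (smaller is better).
# Column names are hypothetical assumptions, as before.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_data.csv")  # hypothetical export of LDL Data.sav

full = smf.ols(
    "min_ldl ~ baseline_ldl + age + adherence + gender + bmi + smoking",
    data=df).fit()
reduced = smf.ols("min_ldl ~ baseline_ldl + age + adherence", data=df).fit()

for name, m in [("full", full), ("reduced", reduced)]:
    # AIC = -2 * log likelihood + 2 * p; statsmodels computes this as .aic
    print(f"{name}: logLik = {m.llf:.1f}, AIC = {m.aic:.1f}")
```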

Generalized linear models

Unfortunately the standard REGRESSION procedure in SPSS does not give these statistics
Need to use Analyze → Generalized Linear Models…

Generalized linear models

Default is linear
Add Min LDL achieved as dependent, as in REGRESSION in SPSS
Next go to predictors…

Generalized linear models: Predictors

WARNING!
Make sure you add the predictors in the correct box:
Categorical in the FACTORS box
Continuous in the COVARIATES box

Generalized linear models: Model

Add all factors and covariates in the model as main effects

Generalized Linear Models: Parameter Estimates

Note: identical to REGRESSION output

Generalized Linear Models: Goodness-of-fit

Note: the output gives the log likelihood and AIC = 2835
(AIC = -2 × -1409.6 + 2 × 7 = 2835)

A footnote explains that a smaller AIC is better

Let Science or Clinical factors guide selection: Optimal model

The log likelihood is a measure of GOODNESS-OF-FIT
Seek the optimal model that maximises the log likelihood or minimises the AIC

Model                                 Log likelihood   AIC      Change
1 Full Model                          -1409.6          2835.6
2 Non-significant variables removed   -1413.6          2837.2   1.6

1) Let Science or Clinical factors guide selection

Key points:
1. Results demonstrate a significant association with baseline LDL, age and adherence
2. Difficult choices with gender, smoking and BMI
3. AIC only changes by 1.6 when these are removed
4. Generally, changes of 4 or more in AIC are considered important

1) Let Science or Clinical factors guide selection

Key points:
1. Conclude there is little to choose between the models
2. AIC is actually lower with the larger model, and gender and BMI are considered important factors, so keep the larger model, but this has to be justified
3. Model building is manual, logical, transparent and under your control

2) Use automatic selection procedures

These are based on automatic mechanical algorithms, usually related to statistical significance
Common ones are stepwise, forward and backward elimination
Can be selected in SPSS using Method in the dialogue box

2) Use automatic selection procedures (e.g. Stepwise)

Select Method = Stepwise

2) Use automatic selection procedures (e.g. Stepwise)

[Screenshots: 1st step; 2nd step; Final Model]

2) Change in AIC with Stepwise selection

Note: only available from Generalized Linear Models

Step           Log likelihood   AIC      Change in AIC   No. of parameters p
Baseline LDL   -1423.1          2852.2
+ Adherence    -1418.0          2844.1   8.1
+ Age          -1413.6          2837.2   6.9

2) Advantages and disadvantages of stepwise

Advantages
Simple to implement
Gives a parsimonious model
Selection is certainly objective

Disadvantages
Unstable selection: stepwise considers many models that are very similar
The p-value on entry may no longer hold once the procedure is finished, so p-values are exaggerated
Predictions in an external dataset are usually worse for stepwise procedures

2) Automatic procedures: Backward elimination

Backward elimination starts by removing the least significant factor from the full model and has a few advantages over forward selection:
The modeller has to consider the full model and sees results for all factors simultaneously
Correlated factors can remain in the model (in forward methods they may not even enter)
Criteria for removal tend to be more lax in backward selection, so you end up with more parameters (see the sketch below)
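Outside SPSS, a rough sketch of backward elimination driven by AIC might look like the following; note SPSS's own algorithm removes terms by significance, and the data frame and column names remain the hypothetical ones used earlier.

```python
# Rough sketch of backward elimination by AIC: repeatedly drop the
# single predictor whose removal lowers AIC most, stopping when no
# removal improves the fit. Not SPSS's algorithm, which uses p-values.
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df, outcome, predictors):
    current = list(predictors)

    def aic_of(terms):
        return smf.ols(f"{outcome} ~ {' + '.join(terms)}", data=df).fit().aic

    best_aic = aic_of(current)
    while len(current) > 1:
        # AIC for every model with exactly one predictor dropped
        trials = {p: aic_of([q for q in current if q != p]) for p in current}
        drop, aic = min(trials.items(), key=lambda kv: kv[1])
        if aic >= best_aic:
            break  # no single removal improves AIC
        current.remove(drop)
        best_aic = aic
    return current, best_aic

# Hypothetical usage with the exported LDL data:
# df = pd.read_csv("ldl_data.csv")
# kept, aic = backward_eliminate(df, "min_ldl",
#     ["baseline_ldl", "age", "gender", "adherence", "bmi", "smoking"])
```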

2) Use automatic selection procedures (e.g. Backward)

Select Method = Backward

2) Backward elimination in SPSS

[Screenshots: 1st step, Gender removed; 2nd step, BMI removed; Final Model]

Summary of automatic selection

Automatic selection may not give the optimal model (may leave out important factors)
Different methods may give different results (forward vs. backward elimination)
Backward elimination is preferred as it is less stringent
Too easily fitted in SPSS!
Model assessment still requires some thought

3) A mixture of automatic procedures and self selection

Use automatic procedures as a guide
Think about what factors are important
Add important factors
Do not blindly follow statistical significance
Consider AIC

Summary of Model selection

Selection of factors for multiple linear regression models requires some judgement
Automatic procedures are available, but treat the results with caution
They are easily fitted in SPSS
Check AIC or log likelihood for fit

Summary

Multiple regression models are the most used analytical tool in quantitative research
They are easily fitted in SPSS
Model assessment requires some thought
Parsimony is better: remember Occam's Razor

Remember Occam's Razor

"Entia non sunt multiplicanda praeter necessitatem"
"Entities must not be multiplied beyond necessity"

William of Ockham
14th century friar and logician
1288-1347

Summary

After fitting any model, check the assumptions:
Functional form: linearity or not
Check residuals for normality
Check residuals for outliers
All accomplished within SPSS (a sketch of doing the same outside SPSS follows the reference below)
See publications for further info:
Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid-lowering response to statin treatment in diabetes: a GoDARTS study. Pharmacogenetics and Genomics 2008; 18: 279-87.
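A minimal sketch of these residual checks in Python, again assuming the hypothetical file and column names used in the earlier snippets:

```python
# Residual diagnostics for a fitted OLS model: a Q-Q plot for
# normality and a residuals-vs-fitted plot for linearity and outliers.
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_data.csv")  # hypothetical export of LDL Data.sav
model = smf.ols("min_ldl ~ baseline_ldl + age", data=df).fit()

stats.probplot(model.resid, dist="norm", plot=plt)  # Q-Q plot: normality
plt.show()

plt.scatter(model.fittedvalues, model.resid)  # linearity and outliers
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```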

Practical on Multiple Regression

Read in LDL Data.sav

1) Try fitting a multiple regression model for Min LDL achieved using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.
2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?
