Entering Multidimensional Space: Multiple Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Objectives of session
[Figure: axes Dependent (y), Explanatory (x1), Explanatory (x2)]
data in order to
confounders
Definition of Confounding
A confounder is a factor which
is related to both the variable
of interest (explanatory) and
the outcome, but is not an
intermediary in a causal
pathway
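The definition can be illustrated with a minimal sketch on made-up numbers. Here z is a hypothetical confounder, x the variable of interest and y the outcome; all three data lists are assumptions chosen for illustration, not study data. The adjusted slope is obtained by regressing out z from both x and y, which gives the same coefficient for x as a multiple regression of y on x and z.

```python
def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after a simple linear regression on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical data: z is the confounder, x the variable of interest.
z = [0, 0, 1, 1, 2, 2, 3, 3]
x = [zi + d for zi, d in zip(z, [-0.5, 0.5] * 4)]  # x is related to z
y = [2 * zi for zi in z]                            # y depends on z only

crude = slope(x, y)                                  # ignores z
adjusted = slope(residuals(z, x), residuals(z, y))   # adjusts for z
print(f"crude slope of y on x:        {crude:.3f}")
print(f"slope of y on x adjusted for z: {adjusted:.3f}")
```

In this toy setup x has no effect of its own on y, yet the crude slope is clearly positive; the association disappears once z is adjusted for.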
Example of Confounding
[Diagram labels: Lung Cancer, Deprivation, Smoking; Deprivation, Exercise, Blood viscosity, Stroke]
linear
baseline
Coefficients (a)
Model 1 | Unstandardized B | Std. Error | Standardized Beta | t
(Constant) | 2.024 | .105 | | 19.340
Age at baseline | -.008 | .002 | -.121 | -4.546
Output from SPSS linear regression on Baseline LDL
Coefficients (a)
Model 1 | Unstandardized B | Std. Error | Standardized Beta | t
(Constant) | .668 | .066 | | 10.091
Baseline LDL | .257 | .018 | .351 | 13.950
R² now improved to 13%
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .360 (a) | .130 | .129 | .6753538
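The small gap between R Square and Adjusted R Square reflects the usual adjustment for the number of explanatory variables. As a reference sketch, with n the sample size and k the number of explanatory variables (both symbols are introduced here for illustration; n is not stated in the output):

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```

With a large n and a single explanatory variable the fraction is close to 1, so the two values differ only in the third decimal place.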
Coefficients (a)
Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig.
(Constant) | 1.003 | .124 | | 8.086 | .000
Baseline LDL | .250 | .019 | .342 | 13.516 | .000
Age at baseline | -.005 | .002 | -.081 | -3.187 | .001
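The unstandardized B values define the fitted equation: constant plus B times each explanatory variable. A minimal sketch of how a prediction would be computed from the coefficients in the SPSS table above; the patient values (baseline LDL of 3.0, age 60) and the function name are hypothetical, and the dependent variable is whatever outcome the model was fitted to.

```python
# Unstandardized coefficients (B) copied from the SPSS output.
INTERCEPT = 1.003
B_BASELINE_LDL = 0.250
B_AGE = -0.005

def predict(baseline_ldl, age):
    """Predicted value of the dependent variable for one patient."""
    return INTERCEPT + B_BASELINE_LDL * baseline_ldl + B_AGE * age

# Hypothetical patient: baseline LDL of 3.0, aged 60 at baseline.
print(round(predict(3.0, 60), 3))

# Effect of one extra unit of baseline LDL with age held fixed.
print(round(predict(4.0, 60) - predict(3.0, 60), 3))
```

The second line of output equals the B for Baseline LDL, which is the interpretation of a coefficient in multiple regression: the change in the outcome per unit change in one factor, holding the others constant.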
Adj R2 | F from ANOVA | No. of Parameters p
0.137 | 37.48 |
0.134 | 72.021 |
Kullback-Leibler Information
Kullback and Leibler (1951) quantified the meaning of information related to Fisher's sufficient statistics
Basically we have reality f
And a model g to approximate f
So K-L information is I(f,g)
Kullback-Leibler Information
$$I(f,g) = \int f(x)\,\log\!\left(\frac{f(x)}{g(x)}\right)\,dx$$
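For a discrete set of outcomes the integral becomes a sum, which makes the idea easy to try out. The sketch below is an illustration only: the distribution f standing in for "reality" and the two candidate models are made-up numbers, and the function name is hypothetical.

```python
import math

def kl_information(f, g):
    """Discrete K-L information: sum of f(x) * log(f(x) / g(x))."""
    if len(f) != len(g):
        raise ValueError("f and g must cover the same outcomes")
    total = 0.0
    for fx, gx in zip(f, g):
        if fx == 0:
            continue  # the term 0 * log(0 / g) is taken as 0
        if gx == 0:
            return math.inf  # g rules out an outcome that f allows
        total += fx * math.log(fx / gx)
    return total

# Hypothetical "reality" f and two candidate models over three outcomes.
f = [0.5, 0.3, 0.2]
g_close = [0.45, 0.35, 0.20]
g_far = [0.10, 0.10, 0.80]

print(kl_information(f, f))        # a model identical to reality
print(kl_information(f, g_close))  # a good approximation
print(kl_information(f, g_far))    # a poor approximation
```

The information lost is zero when the model equals reality and grows as the model moves away from it, which is why a smaller value indicates a better approximating model.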
Akaike's Information Criterion
It turns out that the function I(f,g) is related to a very simple measure of goodness-of-fit:
Akaike's Information Criterion or AIC
Selection Criteria
With a large number of factors the type 1 error is large, so the model is likely to end up with many variables
Two standard criteria:
1) Akaike's Information Criterion (AIC)
2) Schwarz's Bayesian Information Criterion (BIC)
Both penalise models with a large number of variables; the BIC penalty also grows with sample size
Akaike's Information Criterion
$$\mathrm{AIC} = -2 \times \text{log-likelihood} + 2p$$
Where p = number of parameters and -2*log likelihood is in the output
Hence AIC penalises models with a large number of variables
Select the model that minimises (-2LL+2p)
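The criterion is a one-line calculation once the log-likelihood is known. A minimal sketch: the log-likelihood of -1413.6 is a figure quoted in these slides, while p = 5 is an assumption, chosen because it reproduces the AIC of 2837.2 reported alongside it.

```python
def aic(log_likelihood, p):
    """AIC = -2 * log-likelihood + 2 * p, where p is the parameter count."""
    return -2.0 * log_likelihood + 2.0 * p

# Log-likelihood quoted in the slides; p = 5 is an assumed parameter count.
print(round(aic(-1413.6, 5), 1))
```

Because the penalty term 2p is added, an extra variable is only worthwhile if it raises the log-likelihood by more than 1.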
WARNING!
Footnote explains smaller AIC is better
2 non-significant variables removed
2LL | AIC
-1409.6 | 2835.6
-1413.6 | 2837.2
Change is 1.6
Select Method = Stepwise
[SPSS output annotations: 1st step, 2nd step, Final Model]
Model | Log Likelihood | AIC | Change in AIC | No. of Parameters p
Baseline LDL | -1423.1 | 2852.2 | |
+Adherence | -1418.0 | 2844.1 | 8.1 |
+Age | -1413.6 | 2837.2 | 6.9 |
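The change-in-AIC column can be rechecked from the reported figures, and rearranging AIC = -2LL + 2p gives the parameter count each row implies. This is a sketch of that arithmetic using only the log-likelihoods and AIC values in the stepwise table; the variable names are illustrative.

```python
# (model, log-likelihood, AIC) as reported in the stepwise table.
steps = [
    ("Baseline LDL", -1423.1, 2852.2),
    ("+Adherence", -1418.0, 2844.1),
    ("+Age", -1413.6, 2837.2),
]

summary = []
previous_aic = None
for name, log_lik, aic_value in steps:
    # Rearranging AIC = -2LL + 2p gives the implied parameter count.
    implied_p = (aic_value + 2.0 * log_lik) / 2.0
    drop = None if previous_aic is None else round(previous_aic - aic_value, 1)
    summary.append((name, implied_p, drop))
    previous_aic = aic_value

for name, implied_p, drop in summary:
    print(f"{name:<14} implied p = {implied_p:.2f}  drop in AIC = {drop}")
```

Each added factor lowers the AIC, so by this criterion each step improves the model; the implied p values are approximate because the table figures are rounded.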
2) Advantages and disadvantages of stepwise
Advantages
Simple to implement
Gives a parsimonious model
Selection is certainly objective
Disadvantages
Unstable selection: stepwise considers many models that are very similar
P-value on entry may be smaller than the p-value once the procedure is finished, so significance is exaggerated
Predictions in an external dataset are usually worse for stepwise procedures
2) Automatic procedures: Backward elimination
Backward starts by eliminating the least significant factor from the full model and has a few advantages over forward:
Modeller has to consider the full model and sees results for all factors simultaneously
Correlated factors can remain in the model (in forward methods they may not even enter)
Criteria for removal tend to be more lax in backward so end up with more parameters
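The backward idea can be sketched in a few lines. Note the swap: the slides describe removal by significance as done in SPSS, whereas this sketch removes a factor whenever doing so lowers the AIC. The data, the factor names and every function name are hypothetical, and counting the residual variance as one of the p parameters is an assumption.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def gaussian_aic(data, y, names):
    """AIC of a least-squares fit of y on an intercept plus the named columns."""
    n = len(y)
    rows = [[1.0] + [data[name][i] for name in names] for i in range(n)]
    k = len(names) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    p = k + 1  # regression coefficients plus the residual variance (assumed)
    return -2 * log_lik + 2 * p

def backward_eliminate(data, y, names):
    """Start from the full model; drop one factor at a time while AIC falls."""
    current = list(names)
    best = gaussian_aic(data, y, current)
    while current:
        trials = [(gaussian_aic(data, y, [v for v in current if v != drop]), drop)
                  for drop in current]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break  # no single removal improves the AIC
        current.remove(drop)
        best = trial_aic
    return current, best

# Hypothetical, deterministic data: y depends on ldl and age but not on noise.
n_obs = 40
data = {
    "ldl": [2.0 + 0.05 * i for i in range(n_obs)],
    "age": [40.0 + (7 * i) % 31 for i in range(n_obs)],
    "noise": [float((13 * i) % 7 - 3) for i in range(n_obs)],
}
err = [0.1 * ((17 * i) % 5 - 2) for i in range(n_obs)]
y = [1.0 + 0.5 * l - 0.05 * a + e
     for l, a, e in zip(data["ldl"], data["age"], err)]

full_aic = gaussian_aic(data, y, ["ldl", "age", "noise"])
kept, final_aic = backward_eliminate(data, y, ["ldl", "age", "noise"])
print("full model AIC:", round(full_aic, 2))
print("kept factors:", kept, "final AIC:", round(final_aic, 2))
```

Starting from the full model means every factor is seen alongside all the others before anything is removed, which is the advantage over forward selection noted above.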
Select Method = Backward
2) Backward elimination in SPSS
1st step: Gender removed
2nd step: BMI removed
Final Model
Summary of automatic selection
3) A mixture of automatic procedures and self selection
guide
Think about what factors are
important
Add important factors
Do not blindly follow statistical
significance
Consider AIC
Summary of Model selection
Selection of factors for Multiple Linear
Summary
Multiple regression models are the
thought
William of Ockham
14th century Friar and logician
1288-1347
Summary
After fitting any model check assumptions
Functional form: linearity or not
Check Residuals for normality
Check Residuals for outliers
All accomplished within SPSS
See publications for further info
Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A GoDARTS study. Pharmacogenetics and Genomics, 2008; 18: 279-87.
Practical on Multiple Regression
Read in LDL Data.sav