
Logistic Regression

Michel Tenenhaus

Explanatory methods: a single response Y

Response Y    | Explanatory variables X1, X2, ..., Xk
to explain    | Quantitative                      | Qualitative                       | Mixed
--------------+-----------------------------------+-----------------------------------+----------------------------------
Quantitative  | Multiple regression               | Analysis of variance              | General linear model
Qualitative   | - Logistic regression             | - Logistic regression             | - Logistic regression
              | - Segmentation                    | - Segmentation                    | - Segmentation
              | - Factorial discriminant analysis | - Factorial discriminant analysis | - Factorial discriminant analysis
              | - Bayesian discriminant analysis  |                                   |

Response distribution in the exponential family (binomial, Poisson, normal, gamma, inverse Gaussian, ...): generalized linear model (Proc GENMOD).

Neural networks: optimize prediction for nonlinear models (!!!!)

Course outline
Simple binary logistic regression (chd)
Multiple binary logistic regression, using SPSS and Proc Logistic
- Individual data (bankruptcy, baby)
- Aggregated data (job satisfaction)
Ordinal logistic regression (bordeaux)
- equal slopes
- partially equal slopes (Proc Genmod)
Multinomial logistic regression (bordeaux, alligator)
- using SPSS and Proc Catmod

References
Agresti, A. (1990): Categorical Data Analysis. New York: John Wiley & Sons.
Hosmer, D.W. & Lemeshow, S. (1989): Applied Logistic Regression. New York: John Wiley & Sons.
McCullagh, P. & Nelder, J.A. (1989): Generalized Linear Models. London: Chapman & Hall.
Collett, D. (1999): Modelling Binary Data. London: Chapman & Hall/CRC.
Allison, P. (1999): Logistic Regression Using the SAS System: Theory and Applications. Cary, NC: SAS Institute Inc.
Tenenhaus, M. (2007): Statistique. Dunod.

A. Binary logistic regression

The data
Y = binary response variable
X1, ..., Xk = numeric or binary explanatory variables (indicators of categories)

Simple logistic regression (k = 1)
Multiple logistic regression (k > 1)

I. Simple logistic regression

Dependent variable: Y = 0 / 1
Independent variable: X
Objective: model

$$\pi(x) = \mathrm{Prob}(Y = 1 \mid X = x)$$

The linear model $\pi(x) = \beta_0 + \beta_1 x$ is poorly suited when X is continuous, because it can produce values outside [0, 1].
The logistic model is more natural.

Example: Age and Coronary Heart Disease Status (CHD)

The data
 ID   AGRP   AGE   CHD
  1     1     20     0
  2     1     23     0
  3     1     24     0
  4     1     25     0
  5     1     25     1
 ...
 97     8     64     0
 98     8     64     1
 99     8     65     1
100     8     69     1

[Figure: plot of CHD (0/1) by AGE]

Description of the data grouped by age class

Frequency table of CHD by age class
Age group     n    CHD absent   CHD present   Mean (proportion)
20 - 29      10         9            1              0.10
30 - 34      15        13            2              0.13
35 - 39      12         9            3              0.25
40 - 44      15        10            5              0.33
45 - 49      13         7            6              0.46
50 - 54       8         3            5              0.63
55 - 59      17         4           13              0.76
60 - 69      10         2            8              0.80
Total       100        57           43              0.43

[Figure: proportion of CHD by age class; x-axis AGEGRP, y-axis Proportion (CHD)]

The logistic model

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

or

$$\mathrm{Log}\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x$$

Link function: logit

[Figure: probability of heart disease as a function of age; x-axis AGE, y-axis Prob(Y=1 / X)]

Link functions
Logit function
$$g(p) = \log\left(\frac{p}{1 - p}\right)$$
Normit or probit function
$$g(p) = \Phi^{-1}(p)$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution
Complementary log-log function
$$g(p) = \log(-\log(1 - p))$$
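The three link functions can be compared numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not taken from SPSS or SAS.

```python
import math
from statistics import NormalDist

def logit(p):
    """Logit link: log-odds of p."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Probit (normit) link: inverse CDF of the standard normal distribution."""
    return NormalDist().inv_cdf(p)

def cloglog(p):
    """Complementary log-log link."""
    return math.log(-math.log(1.0 - p))

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Print the three links side by side for a few probabilities.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3), round(probit(p), 3), round(cloglog(p), 3))
```

Logit and probit are symmetric around p = 0.5 (both equal 0 there), whereas the complementary log-log link is not.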

Estimating the parameters of the logistic model

The data
 X     Y
 x1    y1
 ...   ...
 xi    yi
 ...   ...
 xn    yn

yi = 1 if the characteristic is present, 0 otherwise

The model

$$\pi(x_i) = P(Y = 1 \mid X = x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}$$

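Likelihood of the data

Probability of observing the data [(x1, y1), ..., (xi, yi), ..., (xn, yn)]:

$$\prod_{i=1}^{n} \mathrm{Prob}(Y = y_i \mid X = x_i) = \prod_{i=1}^{n} \pi(x_i)^{y_i} (1 - \pi(x_i))^{1 - y_i}$$

$$= \prod_{i=1}^{n} \left(\frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}\right)^{y_i} \left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}\right)^{1 - y_i} = l(\beta_0, \beta_1)$$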

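Log-likelihood

$$L(\beta_0, \beta_1) = \mathrm{Log}(l(\beta_0, \beta_1)) = \mathrm{Log}\left[\prod_{i=1}^{n} \pi(x_i)^{y_i} (1 - \pi(x_i))^{1 - y_i}\right]$$

$$= \sum_{i=1}^{n} \left[ y_i \, \mathrm{Log}\left(\frac{\pi(x_i)}{1 - \pi(x_i)}\right) + \mathrm{Log}(1 - \pi(x_i)) \right]$$

$$= \sum_{i=1}^{n} \left[ y_i (\beta_0 + \beta_1 x_i) - \mathrm{Log}(1 + \exp(\beta_0 + \beta_1 x_i)) \right]$$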
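The equality between the first and last forms of the log-likelihood can be checked numerically. This sketch uses a small hypothetical data set (not the CHD data) and illustrative function names.

```python
import math

def pi(x, b0, b1):
    """Logistic probability pi(x) for a single explanatory variable."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def loglik_bernoulli(xs, ys, b0, b1):
    """Log-likelihood written as sum of y*log(pi) + (1 - y)*log(1 - pi)."""
    return sum(
        y * math.log(pi(x, b0, b1)) + (1 - y) * math.log(1.0 - pi(x, b0, b1))
        for x, y in zip(xs, ys)
    )

def loglik_closed(xs, ys, b0, b1):
    """Same quantity written as sum of y*(b0 + b1*x) - log(1 + exp(b0 + b1*x))."""
    return sum(
        y * (b0 + b1 * x) - math.log(1.0 + math.exp(b0 + b1 * x))
        for x, y in zip(xs, ys)
    )

# Hypothetical toy data (ages and 0/1 responses), for illustration only.
xs = [20, 25, 30, 45, 60, 65]
ys = [0, 0, 1, 0, 1, 1]
print(loglik_bernoulli(xs, ys, -5.309, 0.111))
print(loglik_closed(xs, ys, -5.309, 0.111))
```

The closed form avoids computing log(pi) and log(1 - pi) separately, which is why it is the usual starting point for the maximization.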

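Maximum likelihood estimation

We look for $\hat\beta_0$ and $\hat\beta_1$ maximizing the log-likelihood $L(\beta_0, \beta_1)$.

The matrix

$$V(\hat\beta) = \begin{pmatrix} V(\hat\beta_0) & \mathrm{Cov}(\hat\beta_0, \hat\beta_1) \\ \mathrm{Cov}(\hat\beta_0, \hat\beta_1) & V(\hat\beta_1) \end{pmatrix}$$

is estimated by the matrix

$$\left[ -\frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta'} \right]^{-1}_{\beta = \hat\beta}$$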
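One common way to carry out this maximization is Newton-Raphson, which also yields the estimated covariance matrix as the inverse of minus the Hessian at the solution. The sketch below is an assumption about the numerical method (the slides do not say which algorithm SPSS or SAS uses) and runs on hypothetical toy data.

```python
import math

def score_and_information(xs, ys, b0, b1):
    """Gradient of L and minus its Hessian (information matrix) at (b0, b1)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)

def invert_2x2(h00, h01, h11):
    """Inverse of the symmetric matrix [[h00, h01], [h01, h11]]."""
    det = h00 * h11 - h01 * h01
    return h11 / det, -h01 / det, h00 / det

def fit_simple_logistic(xs, ys, n_iter=25):
    """Newton-Raphson iterations started from (0, 0)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        (g0, g1), info = score_and_information(xs, ys, b0, b1)
        v00, v01, v11 = invert_2x2(*info)
        b0 += v00 * g0 + v01 * g1
        b1 += v01 * g0 + v11 * g1
    # Covariance estimate: inverse information evaluated at the final estimate.
    _, info = score_and_information(xs, ys, b0, b1)
    return (b0, b1), invert_2x2(*info)

# Hypothetical, non-separable toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
(b0_hat, b1_hat), (v00, v01, v11) = fit_simple_logistic(xs, ys)
print(b0_hat, b1_hat, math.sqrt(v00), math.sqrt(v11))
```

The square roots of the diagonal terms are the standard errors reported next to the coefficients in the software output.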

Results

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1           107.353                .254                   .341

Variables in the Equation
Step 1 (a)      B       S.E.     Wald     df   Sig.   Exp(B)
AGE            .111     .024    21.254     1   .000    1.117
Constant     -5.309    1.134    21.935     1   .000     .005
a. Variable(s) entered on step 1: AGE.

LRT test for H0: $\beta_1 = 0$

Omnibus Tests of Model Coefficients
Step 1    Chi-square   df   Sig.
Model       29.310      1   .000

Results

Estimated Covariance Matrix
Variable     Intercept       age
Intercept    1.285173     -0.02668
age         -0.02668       0.000579

Standard error of the constant = $1.285173^{1/2} = 1.134$
Standard error of the slope = $.000579^{1/2} = .024$
Covariance between the constant and the slope = -.02668

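Wald test

The model
$$\pi(x) = P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

Test
H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$

Statistic used
$$\mathrm{Wald} = \frac{\hat\beta_1^2}{s_1^2}$$
where $s_1$ is the estimated standard error of $\hat\beta_1$

Decision to reject H0 at risk $\alpha$
Reject H0 if $\mathrm{Wald} \geq \chi^2_{1-\alpha}(1)$
or if $NS = P(\chi^2(1) \geq \mathrm{Wald}) \leq \alpha$, where NS denotes the significance level (p-value)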
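As a worked check, the Wald statistic for AGE can be recomputed from the rounded slope (.111) and its variance (.000579) given in the CHD output. The chi-square tail probability with 1 degree of freedom is obtained from the complementary error function; the helper name is illustrative.

```python
import math

def chi2_sf_1df(x):
    """P(chi2(1) >= x), using chi2(1) = Z^2 for a standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

b1 = 0.111           # slope for AGE (rounded value from the output)
var_b1 = 0.000579    # variance of the slope from the covariance matrix

wald = b1 ** 2 / var_b1
p_value = chi2_sf_1df(wald)
print(round(wald, 2), p_value)
```

The result differs slightly from the 21.254 printed by SPSS because the slope is used here with only three decimals.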

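LRT (likelihood ratio test)

The model
$$\pi(x) = P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

Test
H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$

Statistic used
$$\Lambda = [-2 L(\mathrm{Const})] - [-2 L(\mathrm{Const}, X)]$$

Decision to reject H0 at risk $\alpha$
Reject H0 if $\Lambda \geq \chi^2_{1-\alpha}(1)$
or if $NS = P(\chi^2(1) \geq \Lambda) \leq \alpha$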
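The LRT chi-square of 29.310 in the CHD output can be reproduced. In the constant-only model the fitted probability is the overall proportion 43/100, so its -2 log-likelihood has a closed form; the -2 log-likelihood of the model with AGE (107.353) is taken from the Model Summary. Variable names are illustrative.

```python
import math

n, n1 = 100, 43                      # sample size and number of CHD cases
p = n1 / n                           # fitted probability in the constant-only model

neg2L_const = -2.0 * (n1 * math.log(p) + (n - n1) * math.log(1.0 - p))
neg2L_model = 107.353                # -2 Log likelihood with AGE, from the output

lrt = neg2L_const - neg2L_model
p_value = math.erfc(math.sqrt(lrt / 2.0))   # P(chi2(1) >= lrt)
print(round(neg2L_const, 3), round(lrt, 3), p_value)
```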

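95% confidence interval for $\pi(x)$

From

$$\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x) = \mathrm{Var}(\hat\beta_0) + x^2 \mathrm{Var}(\hat\beta_1) + 2x \, \mathrm{Cov}(\hat\beta_0, \hat\beta_1)$$

we deduce the confidence interval for

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

$$\left[ \frac{e^{\hat\beta_0 + \hat\beta_1 x - 1.96 \sqrt{\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x)}}}{1 + e^{\hat\beta_0 + \hat\beta_1 x - 1.96 \sqrt{\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x)}}} \; , \; \frac{e^{\hat\beta_0 + \hat\beta_1 x + 1.96 \sqrt{\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x)}}}{1 + e^{\hat\beta_0 + \hat\beta_1 x + 1.96 \sqrt{\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x)}}} \right]$$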
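A short sketch of this computation for AGE = 20, using the rounded CHD coefficients and the estimated covariance matrix from the output. The function name is illustrative; the interval is built on the logit scale and then mapped back to probabilities.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def pi_confidence_interval(x, b0, b1, var_b0, var_b1, cov_b0_b1, z=1.96):
    """Point estimate and confidence limits for pi(x)."""
    eta = b0 + b1 * x
    se = math.sqrt(var_b0 + x ** 2 * var_b1 + 2.0 * x * cov_b0_b1)
    return inv_logit(eta), inv_logit(eta - z * se), inv_logit(eta + z * se)

prob, inf95, sup95 = pi_confidence_interval(
    x=20, b0=-5.309, b1=0.111,
    var_b0=1.285173, var_b1=0.000579, cov_b0_b1=-0.02668,
)
print(round(prob, 2), round(inf95, 2), round(sup95, 2))
```

These values agree, up to rounding, with the first row of the Case Summaries output for AGE = 20 (.04, .01, .14).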

95% confidence interval for $\pi(x)$

Case Summaries (a)
Case    AGE    Computed probability   INF95   SUP95
  1     20.0          .04              .01     .14
  2     23.0          .06              .02     .17
  3     24.0          .07              .02     .18
  4     25.0          .07              .03     .19
  5     25.0          .07              .03     .19
  6     26.0          .08              .03     .20
  7     26.0          .08              .03     .20
  8     28.0          .10              .04     .23
  9     28.0          .10              .04     .23
 10     29.0          .11              .05     .24
 11     30.0          .12              .05     .25
 12     30.0          .12              .05     .25
 13     30.0          .12              .05     .25
 14     30.0          .12              .05     .25
 15     30.0          .12              .05     .25
 16     30.0          .12              .05     .25
 17     32.0          .15              .07     .28
 18     32.0          .15              .07     .28
 19     33.0          .16              .08     .29
 20     33.0          .16              .08     .29
 21     34.0          .18              .09     .31
 22     34.0          .18              .09     .31
 23     34.0          .18              .09     .31
 24     34.0          .18              .09     .31
 25     34.0          .18              .09     .31
 26     35.0          .19              .11     .33
 27     35.0          .19              .11     .33
 28     36.0          .21              .12     .34
 29     36.0          .21              .12     .34
 30     36.0          .21              .12     .34
Total N  30            30               30      30
a. Limited to first 30 cases.

95% confidence interval for $\pi(x)$

[Figure: computed probability (PROBABILITE) with the lower (INF95) and upper (SUP95) 95% limits, plotted against age from 20 to 70]

Comparison of observed and theoretical proportions

Observed proportion: $\sum_{i \in \mathrm{Class}} y_i / n_{\mathrm{Class}}$

Theoretical proportion: $\sum_{i \in \mathrm{Class}} \hat\pi_i / n_{\mathrm{Class}}$

since $E(y_i) = \pi_i$, estimated by $\hat\pi_i$

Report
AGEGRP     N    Heart disease (mean)   Predicted probability (mean)
1         10          .1000                   .0787086
2         15          .1333                   .1484562
3         12          .2500                   .2299070
4         15          .3333                   .3519639
5         13          .4615                   .4824845
6          8          .6250                   .6087623
7         17          .7647                   .7302152
8         10          .8000                   .8391673
Total    100          .4300                   .4300000

Comparison of observed and theoretical proportions

[Figure: observed and theoretical proportions by age class; x-axis age class, y-axis Proportion]

Hosmer & Lemeshow test (goodness-of-fit test)

The data are sorted in increasing order of the probabilities computed from the model, then split into at most 10 groups. Unfortunately, this test has low power.

The chi-square test is used to compare the observed frequencies ($\sum_{i \in \mathrm{Class}} y_i$) with the theoretical frequencies ($\sum_{i \in \mathrm{Class}} \hat\pi_i$).

[Figure: theoretical and observed frequencies by group, groups 1 to 10]

Number of degrees of freedom = number of groups - 2

Hosmer & Lemeshow test

Contingency Table for Hosmer and Lemeshow Test
           Heart disease = chd=no     Heart disease = chd=yes
Step 1     Observed    Expected       Observed    Expected      Total
  1            9         9.213            1          .787         10
  2            9         8.657            1         1.343         10
  3            8         8.095            2         1.905         10
  4            8         8.037            3         2.963         11
  5            7         6.947            4         4.053         11
  6            5         5.322            5         4.678         10
  7            5         4.200            5         5.800         10
  8            3         3.736           10         9.264         13
  9            2         2.134            8         7.866         10
 10            1          .661            4         4.339          5

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1         .890       8   .999
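The chi-square of .890 can be recomputed from the contingency table as the sum of (observed - expected)^2 / expected over the 20 cells. The sketch below copies the observed and expected counts from the table; the variable names are illustrative.

```python
# Observed and expected counts per group, copied from the contingency table.
obs_no = [9, 9, 8, 8, 7, 5, 5, 3, 2, 1]
exp_no = [9.213, 8.657, 8.095, 8.037, 6.947, 5.322, 4.200, 3.736, 2.134, 0.661]
obs_yes = [1, 1, 2, 3, 4, 5, 5, 10, 8, 4]
exp_yes = [0.787, 1.343, 1.905, 2.963, 4.053, 4.678, 5.800, 9.264, 7.866, 4.339]

def chi_square(observed, expected):
    """Pearson chi-square contribution of one column of the table."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

hl_statistic = chi_square(obs_no, exp_no) + chi_square(obs_yes, exp_yes)
degrees_of_freedom = len(obs_no) - 2     # number of groups - 2
print(round(hl_statistic, 3), degrees_of_freedom)
```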

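Measuring the quality of the model

Cox & Snell R²
$$R^2 = 1 - \left[\frac{l(\mathrm{const})}{l(\mathrm{const}, X)}\right]^{2/n}$$
$$\mathrm{Max}\ R^2 = 1 - [l(\mathrm{const})]^{2/n}$$

Nagelkerke adjusted R²
$$R^2_{adj} = \frac{R^2}{R^2_{max}}$$

Pseudo R² (McFadden)
$$\mathrm{Pseudo}\ R^2 = 1 - \frac{-2L(\mathrm{const}, X)}{-2L(\mathrm{const})}$$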
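These three measures can be recomputed for the CHD model from the two -2 log-likelihood values, since $[l(\mathrm{const})/l(\mathrm{const}, X)]^{2/n} = \exp(-\Lambda/n)$ with $\Lambda$ the LRT statistic. The sketch assumes n = 100 with 43 cases and the -2 log-likelihood 107.353 from the output; names are illustrative.

```python
import math

n, n1 = 100, 43
p = n1 / n
neg2L_const = -2.0 * (n1 * math.log(p) + (n - n1) * math.log(1.0 - p))
neg2L_model = 107.353     # from the Model Summary

r2_cox_snell = 1.0 - math.exp(-(neg2L_const - neg2L_model) / n)
r2_max = 1.0 - math.exp(-neg2L_const / n)
r2_nagelkerke = r2_cox_snell / r2_max
pseudo_r2_mcfadden = 1.0 - neg2L_model / neg2L_const

print(round(r2_cox_snell, 3), round(r2_nagelkerke, 3), round(pseudo_r2_mcfadden, 3))
```

The first two values match the .254 and .341 shown in the Model Summary.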

Classification table
Observation i is assigned to the class [Y=1] if $\hat\pi_i \geq c$.

Classification table (c = 0.5)
TABLE OF CHD BY PREDICTS
Frequency    PREDICTS = 0   PREDICTS = 1   Total
CHD = 0           45             12          57
CHD = 1           14             29          43
Total             59             41         100

Sensitivity = 29/43
Specificity = 45/57
False positive rate = 12/41
False negative rate = 14/59
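The four rates follow directly from the cells of the 2x2 table. A minimal sketch, with an illustrative function name; note that the false positive and false negative rates are computed here, as on the slide, relative to the predicted classes.

```python
def classification_rates(tn, fp, fn, tp):
    """Rates from a 2x2 table of observed (rows) by predicted (columns) classes."""
    return {
        "sensitivity": tp / (tp + fn),            # events correctly predicted
        "specificity": tn / (tn + fp),            # non-events correctly predicted
        "false_positive_rate": fp / (fp + tp),    # among predicted events
        "false_negative_rate": fn / (fn + tn),    # among predicted non-events
    }

rates = classification_rates(tn=45, fp=12, fn=14, tp=29)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```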

Objectives
Sensitivity = ability to diagnose the sick among the sick
Specificity = ability to recognize the non-sick among the non-sick
1 - Specificity = risk of diagnosing a non-sick person as sick

Find an acceptable trade-off between high sensitivity and high specificity.

ROC graph (Receiver Operating Characteristic)

Sensitivity: ability to predict an event
Specificity: ability to predict a non-event
ROC graph:
y = Sensitivity(c)
x = 1 - Specificity(c)

[Figure: Coronary Heart Disease ROC curve; x-axis 1 - Specificity, y-axis Sensitivity; the point for c = 0.5 is marked]

The area under the ROC curve measures the predictive power of the variable X. Here this area equals 0.8.

Association coefficients between the computed probabilities and the observed responses

N = total sample size
t = number of pairs with different responses = nb(0)*nb(1)
nc = number of concordant pairs ($y_i < y_j$ and $\hat\pi_i < \hat\pi_j$)
nd = number of discordant pairs ($y_i < y_j$ and $\hat\pi_i > \hat\pi_j$)
t - nc - nd = number of ties ($y_i < y_j$ and $\hat\pi_i = \hat\pi_j$)

Somers' D = (nc - nd) / t
Gamma = (nc - nd) / (nc + nd)
Tau-a = (nc - nd) / (.5N(N-1))
c = (nc + .5(t - nc - nd)) / t

c = area under the ROC curve
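The pair counts and the four coefficients can be sketched as follows. The responses and fitted probabilities are hypothetical, and the function name is illustrative.

```python
def association_measures(ys, ps):
    """Concordance measures between 0/1 responses and fitted probabilities."""
    p0 = [p for y, p in zip(ys, ps) if y == 0]
    p1 = [p for y, p in zip(ys, ps) if y == 1]
    n = len(ys)
    t = len(p0) * len(p1)                                # pairs with different responses
    nc = sum(1 for a in p0 for b in p1 if a < b)         # concordant pairs
    nd = sum(1 for a in p0 for b in p1 if a > b)         # discordant pairs
    ties = t - nc - nd
    return {
        "nc": nc, "nd": nd, "ties": ties,
        "somers_d": (nc - nd) / t,
        "gamma": (nc - nd) / (nc + nd),
        "tau_a": (nc - nd) / (0.5 * n * (n - 1)),
        "c": (nc + 0.5 * ties) / t,
    }

# Hypothetical responses and fitted probabilities.
ys = [0, 0, 1, 0, 1, 1]
ps = [0.1, 0.3, 0.3, 0.4, 0.6, 0.8]
measures = association_measures(ys, ps)
print(measures)
```

With these definitions, Somers' D = 2c - 1, which links it to the area under the ROC curve.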

Residual analysis (individual data)

Pearson residual (Standardized Residual)

$$r_i = \frac{y_i - \hat\pi_i}{\sqrt{\hat\pi_i (1 - \hat\pi_i)}}$$

to be compared with 2 in absolute value

Other statistics for residual analysis

Deviance:
$$D = -2 \log l = \sum_i d_i^2$$

Deviance residual (Deviance)
$$d_i = \mathrm{sign}(y_i - \hat\pi_i) \sqrt{-2 \log(\text{estimated Prob}[Y = y_i \mid X = x_i])}$$
to be compared with 2 in absolute value

Influence of each observation on the deviance (DifDev)
$\Delta_i D$ = D(all obs.) - D(all obs. except obs. i)

Studentized residual:
$$\mathrm{sign}(y_i - \hat\pi_i) \sqrt{\Delta_i D}$$
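The Pearson and deviance residuals for individual data, and the identity D = -2 log l = sum of the squared deviance residuals, can be illustrated on a few hypothetical fitted probabilities. Function names are illustrative.

```python
import math

def pearson_residual(y, p):
    """Pearson (standardized) residual for a 0/1 response with fitted probability p."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Signed square root of -2 log of the fitted probability of the observed response."""
    prob_observed = p if y == 1 else 1.0 - p
    return math.copysign(math.sqrt(-2.0 * math.log(prob_observed)), y - p)

ys = [1, 0, 1, 0]             # hypothetical responses
ps = [0.2, 0.2, 0.9, 0.6]     # hypothetical fitted probabilities

deviance = sum(deviance_residual(y, p) ** 2 for y, p in zip(ys, ps))
minus_2_log_l = -2.0 * sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(ys, ps))
print(pearson_residual(1, 0.2), deviance_residual(1, 0.2), deviance, minus_2_log_l)
```

An event observed where the fitted probability is only 0.2 gives a Pearson residual of 2, the threshold suggested on the slide.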

Residual analysis

[Figure: box plots of the standardized residual, studentized residual and deviance value (N = 100 each)]

II. Multiple logistic regression

Example: bankruptcy prediction
The data
The following ratios are observed on 46 firms:

X1 = Cash flow / Total debt
X2 = Net income / Total assets
X3 = Current assets / Current liabilities
X4 = Current assets / Sales
Y = F if bankrupt, NF otherwise

Two years later, 21 of these firms had gone bankrupt and 25 had remained financially healthy.

The data of the firms

Case Summaries (a)
Case   cash flow /   net income /   current assets /      current assets /   FAILLITE
       total debt    total assets   current liabilities   net sales
  1       -.45          -.41             1.09                 .45              F
  2       -.56          -.31             1.51                 .16              F
  3        .06           .02             1.01                 .40              F
  4       -.07          -.09             1.45                 .26              F
  5       -.10          -.09             1.56                 .67              F
  6       -.14          -.07              .71                 .28              F
  7        .04           .01             1.50                 .71              F
  8       -.07          -.06             1.37                 .40              F
  9        .07          -.01             1.37                 .34              F
 10       -.14          -.14             1.42                 .43              F
 11       -.23          -.30              .33                 .18              F
 12        .07           .02             1.31                 .25              F
 13        .01           .00             2.15                 .70              F
 14       -.28          -.23             1.19                 .66              F
 15        .15           .05             1.88                 .27              F
 16        .37           .11             1.99                 .38              F
 17       -.08          -.08             1.51                 .42              F
 18        .05           .03             1.68                 .95              F
 19        .01           .00             1.26                 .60              F
 20        .12           .11             1.14                 .17              F
 21       -.28          -.27             1.27                 .51              F
 22        .51           .10             2.49                 .54              NF
 23        .08           .02             2.01                 .53              NF
 24        .38           .11             3.27                 .35              NF
 25        .19           .05             2.25                 .33              NF
 26        .32           .07             4.24                 .63              NF
 27        .31           .05             4.45                 .69              NF
 28        .12           .05             2.52                 .69              NF
 29       -.02           .02             2.05                 .35              NF
 30        .22           .08             2.35                 .40              NF
 31        .17           .07             1.80                 .52              NF
 32        .15           .05             2.17                 .55              NF
 33       -.10          -.01             2.50                 .58              NF
 34        .14          -.03              .46                 .26              NF
 35        .14           .07             2.61                 .52              NF
 36        .15           .06             2.23                 .56              NF
 37        .16           .05             2.31                 .20              NF
 38        .29           .06             1.84                 .38              NF
 39        .54           .11             2.33                 .48              NF
 40       -.33          -.09             3.01                 .47              NF
 41        .48           .09             1.24                 .18              NF
 42        .56           .11             4.29                 .44              NF
 43        .20           .08             1.99                 .30              NF
 44        .47           .14             2.92                 .45              NF
 45        .17           .04             2.45                 .14              NF
 46        .58           .04             5.06                 .13              NF
a. Limited to first 100 cases.

Box plots of the financial ratios by bankruptcy status

[Figure: four box plots (Cash flow / Total debt, Net income / Total assets, Current assets / Current liabilities, Current assets / Sales), each comparing non-bankrupt firms (N = 25) with bankrupt firms (N = 21)]

Confidence intervals for the means of the financial ratios by bankruptcy status

[Figure: 95% confidence intervals for the means of X1, X2, X3 and X4, for FAILLITE = No (N = 25) and FAILLITE = Yes (N = 21)]

Simple logistic regressions of Y on the ratios X

Variable   Coefficient $\beta_1$    WALD     NS     Nagelkerke R²
X1              -7.526              9.824   .002        .466
X2             -19.493              8.539   .003        .466
X3              -3.382             11.75    .001        .611
X4                .354               .040   .841        .001

NS < .05: significant predictor

[Figure: PCA of the firms]

[Figure: PCA of the firms (without X4)]

The logistic regression model

The model

$$\pi(x) = P(Y = F \mid X = x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_4 x_4}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_4 x_4}}$$

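Likelihood of the data

Probability of observing the data [(x1, y1), ..., (xi, yi), ..., (xn, yn)]:

$$\prod_{i=1}^{n} \mathrm{Prob}(Y = y_i \mid X = x_i) = \prod_{i=1}^{n} \pi(x_i)^{y_i} (1 - \pi(x_i))^{1 - y_i}$$

$$= \prod_{i=1}^{n} \left(\frac{e^{\sum_j \beta_j x_{ij}}}{1 + e^{\sum_j \beta_j x_{ij}}}\right)^{y_i} \left(1 - \frac{e^{\sum_j \beta_j x_{ij}}}{1 + e^{\sum_j \beta_j x_{ij}}}\right)^{1 - y_i} = l(\beta_0, \beta_1, \dots, \beta_4)$$

where $x_{ij}$ is the value of $X_j$ for firm i, with $x_{i0} = 1$.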

Results

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1            27.443                .543                   .725

Collinearity Statistics
       Tolerance    VIF
X1       0.212     4.725
X2       0.252     3.973
X3       0.635     1.575
X4       0.904     1.106

Variables in the Equation
Step 1 (a)      B        S.E.     Wald    df   Sig.    Exp(B)
X1           -7.138     6.002    1.414     1   .234      .001
X2            3.703    13.670     .073     1   .786    40.581
X3           -3.415     1.204    8.049     1   .005      .033
X4            2.968     3.065     .938     1   .333    19.461
Constant      5.320     2.366    5.053     1   .025   204.283
a. Variable(s) entered on step 1: X1, X2, X3, X4.

Results

Correlations
                               X1        X2        X3        X4
X1   Pearson Correlation       1        .858**    .571**   -.053
     Sig. (2-tailed)           .        .000      .000      .725
X2   Pearson Correlation      .858**     1        .471**    .055
     Sig. (2-tailed)          .000       .        .001      .717
X3   Pearson Correlation      .571**    .471**     1        .154
     Sig. (2-tailed)          .000      .001       .        .306
X4   Pearson Correlation     -.053      .055      .154       1
     Sig. (2-tailed)          .725      .717      .306       .
N = 46 for every pair.
**. Correlation is significant at the 0.01 level (2-tailed).

The estimated model

$$\mathrm{Prob}(Y = F \mid X) = \frac{e^{5.320 - 7.138 X_1 + 3.703 X_2 - 3.415 X_3 + 2.968 X_4}}{1 + e^{5.320 - 7.138 X_1 + 3.703 X_2 - 3.415 X_3 + 2.968 X_4}}$$

Bankruptcy prediction

Classification Table (a)
                            Predicted FAILLITE
Observed SITUATION            NF        F       Percentage Correct
NF                            24        1             96.0
F                              3       18             85.7
Overall Percentage                                    91.3
a. The cut value is .500

Hosmer & Lemeshow test

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1         5.201      7   .636

Contingency Table for Hosmer and Lemeshow Test
           FAILLITE = No              FAILLITE = Yes
Step 1     Observed    Expected       Observed    Expected      Total
  1            5         4.999            0          .001          5
  2            5         4.906            0          .094          5
  3            4         4.613            1          .387          5
  4            5         4.143            0          .857          5
  5            4         3.473            1         1.527          5
  6            1         1.762            4         3.238          5
  7            0          .667            5         4.333          5
  8            1          .340            4         4.660          5
  9            0          .098            6         5.902          6

Backward stepwise logistic regression

Without X2

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1            27.516                .542                   .724

Variables in the Equation
Step 1 (a)      B       S.E.     Wald    df   Sig.    Exp(B)
X1           -5.772    3.005    3.690     1   .055      .003
X3           -3.289    1.085    9.183     1   .002      .037
X4            2.979    3.025     .970     1   .325    19.675
Constant      5.038    2.060    5.983     1   .014   154.193
a. Variable(s) entered on step 1: X1, X3, X4.

Backward stepwise logistic regression

Without X4

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1            28.636                .531                   .709

Variables in the Equation
Step 1 (a)      B       S.E.     Wald    df   Sig.    Exp(B)
X1           -6.556    2.905    5.092     1   .024      .001
X3           -3.019    1.002    9.077     1   .003      .049
Constant      5.940    1.986    8.950     1   .003   379.996
a. Variable(s) entered on step 1: X1, X3.

Map of the firms in the (x1, x3) plane

[Figure: scatter plot of the 46 firms, labelled by case number and marked by FAILLITE status; x-axis cash flow / total debt, y-axis current assets / current liabilities]

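Equation of the boundary line

$$\mathrm{Prob}(Y = F \mid X) = \frac{e^{5.940 - 6.556 X_1 - 3.019 X_3}}{1 + e^{5.940 - 6.556 X_1 - 3.019 X_3}} = 0.5$$

$$\Leftrightarrow \quad 5.940 - 6.556 X_1 - 3.019 X_3 = 0$$

$$\Leftrightarrow \quad X_3 = (5.940 - 6.556 X_1)/3.019$$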
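The boundary can be checked numerically: any point on the line has estimated bankruptcy probability 0.5, and firms on either side are classified accordingly. The sketch uses the coefficients of the two-variable model (X1, X3) from the output; the function names are illustrative.

```python
import math

B0, B1, B3 = 5.940, -6.556, -3.019    # constant, X1 and X3 coefficients from the output

def prob_bankruptcy(x1, x3):
    """Estimated Prob(Y = F | X1, X3)."""
    eta = B0 + B1 * x1 + B3 * x3
    return 1.0 / (1.0 + math.exp(-eta))

def boundary_x3(x1):
    """X3 value on the iso-probability 0.5 line for a given X1."""
    return (5.940 - 6.556 * x1) / 3.019

# Firm 1 (bankrupt) and firm 46 (not bankrupt), taken from the data table.
print(prob_bankruptcy(-0.45, 1.09), prob_bankruptcy(0.58, 5.06))
print(prob_bankruptcy(0.1, boundary_x3(0.1)))
```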

Map of the firms in the (x1, x3) plane with the boundary line from the logistic regression

[Figure: scatter plot of the 46 firms, labelled by case number, with the iso-probability 0.5 line X3 = (5.940 - 6.556X1)/3.019; x-axis cash flow / total debt, y-axis current assets / current liabilities]

[Figure: map of the firms in the (x1, x3) plane with the boundary line and the no-man's land produced by the SVM method]

[Figure: map of the firms in the (x1, x3) plane with the boundary curve and the no-man's land produced by the SVM method]

Example II: low birth weight baby (Hosmer & Lemeshow)
Y = 1 if the baby's weight < 2,500 grams,
  = 0 otherwise
n1 = 59, n0 = 130
Risk factors:
- Age
- LWT (Last Menstrual Period Weight)
- Race (White, Black, Other)
- FTV (Number of First Trimester Physician Visits)
- Smoke (1 = yes, 0 = no)

Results

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1           214.575                .101                   .142

Variables in the Equation
Step 1 (a)      B       S.E.     Wald    df   Sig.   Exp(B)
AGE           -.022     .035     .410     1   .522     .978
LWT           -.012     .006    3.762     1   .052     .988
WHITE         -.941     .418    5.070     1   .024     .390
BLACK          .289     .527     .301     1   .583    1.336
FTV           -.008     .164     .002     1   .962     .992
SMOKE         1.053     .381    7.637     1   .006    2.866
Constant      1.269    1.023    1.539     1   .215    3.558
a. Variable(s) entered on step 1: AGE, LWT, WHITE, BLACK, FTV, SMOKE.

Coefficients (a): Collinearity Statistics
Model 1                                 Tolerance    VIF
AGE                                       .884      1.132
weight last menstrual period              .869      1.150
smoking during pregnancy                  .865      1.156
n physician visits first trimester        .939      1.065
WHITE                                     .686      1.457
BLACK                                     .743      1.346
a. Dependent Variable: low birth weight

No multicollinearity problem.

Validity of the model
Hosmer and Lemeshow test

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1        11.825      8   .159

Contingency Table for Hosmer and Lemeshow Test
           low birth weight =         low birth weight =
           weight > 2500 g            weight < 2500 g
Step 1     Observed    Expected       Observed    Expected      Total
  1           19        17.278            0         1.722         19
  2           17        16.411            2         2.589         19
  3           14        15.454            5         3.546         19
  4           12        13.955            7         5.045         19
  5           16        13.041            3         5.959         19
  6           12        12.599            7         6.401         19
  7            9        12.084           10         6.916         19
  8            9        11.569           10         7.431         19
  9           13        10.715            6         8.285         19
 10            9         6.888            9        11.112         18

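Odds ratio

$$\text{Odds Ratio(Smoke)} = \frac{\mathrm{Prob}(Y = 1 \mid X, \text{Smoke} = \text{yes}) / \mathrm{Prob}(Y = 0 \mid X, \text{Smoke} = \text{yes})}{\mathrm{Prob}(Y = 1 \mid X, \text{Smoke} = \text{no}) / \mathrm{Prob}(Y = 0 \mid X, \text{Smoke} = \text{no})} = \exp(\beta_{\text{Smoke}})$$

For a rare event the odds ratio is close to the relative risk defined by:

$$\text{Relative Risk} = \frac{\mathrm{Prob}(Y = 1 \mid X, \text{Smoke} = \text{yes})}{\mathrm{Prob}(Y = 1 \mid X, \text{Smoke} = \text{no})}$$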
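A small numerical illustration of why the odds ratio approximates the relative risk only for rare events. The probabilities are hypothetical and the function names are illustrative.

```python
def odds_ratio(p_exposed, p_unexposed):
    """Ratio of the odds p / (1 - p) between the two groups."""
    return (p_exposed / (1.0 - p_exposed)) / (p_unexposed / (1.0 - p_unexposed))

def relative_risk(p_exposed, p_unexposed):
    """Ratio of the event probabilities between the two groups."""
    return p_exposed / p_unexposed

# Same relative risk (2.0) in both scenarios; hypothetical probabilities.
for p1, p0 in [(0.02, 0.01), (0.60, 0.30)]:
    print(p1, p0, round(relative_risk(p1, p0), 3), round(odds_ratio(p1, p0), 3))
```

With probabilities of 1% and 2% the odds ratio is about 2.02, close to the relative risk of 2; with 30% and 60% it rises to 3.5.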

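95% confidence interval for the odds ratio

From

$$\mathrm{Var}(\hat\beta_{\text{Smoke}}) = s^2_{\text{Smoke}}$$

we deduce the confidence interval for OR(Smoke):

$$\left[ e^{\hat\beta_{\text{Smoke}} - 1.96 \, s_{\text{Smoke}}} \; , \; e^{\hat\beta_{\text{Smoke}} + 1.96 \, s_{\text{Smoke}}} \right]$$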
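Applied to SMOKE in the low birth weight model (B = 1.053, S.E. = .381 in the output), the interval can be computed as below. The function name is illustrative, and the inputs are the rounded values printed by SPSS.

```python
import math

def odds_ratio_interval(b, se, z=1.96):
    """Odds ratio exp(b) with its confidence limits exp(b -/+ z * se)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

odds_ratio, lower, upper = odds_ratio_interval(b=1.053, se=0.381)
print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```

Up to rounding of the inputs, this agrees with the limits 1.358 and 6.046 reported by SPSS for SMOKE.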

95% confidence interval for the odds ratio

Variables in the Equation
                                    95.0% C.I. for EXP(B)
Step 1 (a)      B      Exp(B)       Lower       Upper
AGE           -.022     .978         .914       1.047
LWT           -.012     .988         .975       1.000
WHITE         -.941     .390         .172        .885
BLACK          .289    1.336         .475       3.755
FTV           -.008     .992         .719       1.369
SMOKE         1.053    2.866        1.358       6.046
Constant      1.269    3.558
a. Variable(s) entered on step 1: AGE, LWT, WHITE, BLACK, FTV, SMOKE.

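Influence of a group of variables

The model
$$\pi(x) = P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}$$

Test
H0: $\beta_{r+1} = \dots = \beta_k = 0$
H1: at least one $\beta_j \neq 0$

Statistics used

1. $\Lambda$ = [-2L(reduced model)] - [-2L(full model)]
   - Proc GENMOD (type 3)
   - Backward LR regression with Removal = 1 in SPSS

2. $\mathrm{Wald} = (\hat\beta_{r+1}, \dots, \hat\beta_k) \left[ \mathrm{Var} \begin{pmatrix} \hat\beta_{r+1} \\ \vdots \\ \hat\beta_k \end{pmatrix} \right]^{-1} \begin{pmatrix} \hat\beta_{r+1} \\ \vdots \\ \hat\beta_k \end{pmatrix}$
   - Proc Logistic
   - Proc Genmod (type 3 and wald)
   - SPSS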

Decision rule
We reject
H0: $\beta_{r+1} = \dots = \beta_k = 0$
at risk $\alpha$ of being wrong if
$\Lambda$ or Wald $\geq \chi^2_{1-\alpha}(k - r)$
or if
NS = Prob($\chi^2(k - r) \geq$ Wald or $\Lambda$) $\leq \alpha$

Test of the Race factor (Wald)

Variables in the Equation
Step 1 (a)      B       S.E.     Wald    df   Sig.   Exp(B)
AGE           -.022     .035     .410     1   .522     .978
LWT           -.012     .006    3.762     1   .052     .988
RACE                            7.784     2   .020
RACE(1)       -.941     .418    5.070     1   .024     .390
RACE(2)        .289     .527     .301     1   .583    1.336
SMOKE         1.053     .381    7.637     1   .006    2.866
FTV           -.008     .164     .002     1   .962     .992
Constant      1.269    1.023    1.539     1   .215    3.558
a. Variable(s) entered on step 1: AGE, LWT, RACE, SMOKE, FTV.

Model without the Race factor:

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1           222.815                .061                   .086

Test of the Race factor (LRT)

Likelihood Ratio Tests
Effect       -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept              214.575 (a)                    .000       0    .
AGE                    214.990                        .415       1   .520
LWT                    218.746                       4.171       1   .041
SMOKE                  222.573                       7.998       1   .005
FTV                    214.577                        .002       1   .963
RACE                   222.815                       8.239       2   .016

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.
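The LRT for the Race factor can be reproduced from the two -2 log-likelihood values in the output (222.815 without Race, 214.575 with Race). For 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2), which the sketch relies on; variable names are illustrative.

```python
import math

neg2L_without_race = 222.815     # reduced model, from the output
neg2L_full = 214.575             # final model, from the output

lrt_race = neg2L_without_race - neg2L_full
p_value = math.exp(-lrt_race / 2.0)      # P(chi2(2) >= x) = exp(-x / 2)
print(round(lrt_race, 3), round(p_value, 3))
```

The small difference from the printed 8.239 comes from the rounding of the two -2 log-likelihoods.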

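Test of the general linear hypothesis

The model
$$\pi(x) = P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}$$

Test
H0: $C(\beta_0, \beta_1, \dots, \beta_k)' = 0$
H1: $C(\beta_0, \beta_1, \dots, \beta_k)' \neq 0$

Statistics used

1. $\Lambda$ = [-2L(H0)] - [-2L(H1)]
   - Proc GENMOD

2. $\mathrm{Wald} = (\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k) \, C' \left[ C \, \mathrm{Var}(\hat\beta) \, C' \right]^{-1} C \, (\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k)'$
   - Proc Logistic
   - Proc Genmod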

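Decision rule
We reject
H0: $C(\beta_0, \beta_1, \dots, \beta_k)' = 0$
at risk $\alpha$ of being wrong if
$\Lambda$ or Wald $\geq \chi^2_{1-\alpha}(\mathrm{rank}(C))$
or if
NS = Prob($\chi^2(\mathrm{rank}(C)) \geq$ Wald or $\Lambda$) $\leq \alpha$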

Backward stepwise logistic regression

We start from the full model.
At each step, we remove the variable with the least significant Wald statistic (largest significance level), provided that its significance level is greater than 10%.

Forward stepwise logistic regression in SAS Proc Logistic

At each step, we select the variable Xj that will have the smallest significance level for the statistic $\chi^2_{\mathrm{Score}}(X_j)$ once entered into the model, provided that the contribution of Xj is significant.
The influence of the variables outside the model is tested globally with the $\chi^2_{\mathrm{Score}}$ statistic (Residual Chi-Square in SAS), but this test has low power.

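Score test for the variable Xj

Model
$$\mathrm{Prob}(Y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_t x_t + \beta_j x_j}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_t x_t + \beta_j x_j}}$$

Test H0: $\beta_j = 0$ vs H1: $\beta_j \neq 0$

Statistic
$$\chi^2_{\mathrm{Score}} = \left[ \frac{\partial L}{\partial \beta} \right]'_{\beta = \hat\beta_{H_0}} \left[ -\frac{\partial^2 L}{\partial \beta \, \partial \beta'} \right]^{-1}_{\beta = \hat\beta_{H_0}} \left[ \frac{\partial L}{\partial \beta} \right]_{\beta = \hat\beta_{H_0}}$$

follows a chi-square distribution with 1 degree of freedom under H0.

L is computed on the model with t+1 variables.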

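Score test for the variables outside the model

Model
$$\mathrm{Prob}(Y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_t x_t + \beta_{t+1} x_{t+1} + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_t x_t + \beta_{t+1} x_{t+1} + \dots + \beta_k x_k}}$$

Test H0: $\beta_{t+1} = \dots = \beta_k = 0$ vs H1: at least one $\beta_j \neq 0$

Statistic
$$\chi^2_{\mathrm{Score}} = \left[ \frac{\partial L}{\partial \beta} \right]'_{\beta = \hat\beta_{H_0}} \left[ -\frac{\partial^2 L}{\partial \beta \, \partial \beta'} \right]^{-1}_{\beta = \hat\beta_{H_0}} \left[ \frac{\partial L}{\partial \beta} \right]_{\beta = \hat\beta_{H_0}}$$

follows a chi-square distribution with k-t degrees of freedom under H0.

L is computed on the model with k variables.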

Multiple logistic regression (aggregated data)

Example: Job satisfaction (Models for Discrete Data, D. Zelterman, Oxford Science Publications, 1999)
9949 employees in craft jobs (manual work) within a company
Response: Satisfied / Dissatisfied
Factors: Sex (1=F, 0=M)
         Race (White=1, Nonwhite=0)
         Age (<35, 35-44, >44)
         Region (Northeast, Mid-Atlantic, Southern, Midwest, Northwest, Southwest, Pacific)
Explain job satisfaction with all the main effects and the interactions.

Job satisfaction (Y/N) by sex (M/F), race, age, and region of residence for employees of a large U.S. corporation

                            White                                  Nonwhite
                  Under 35     35-44      Over 44       Under 35     35-44      Over 44
Region             M     F     M     F     M     F       M     F     M     F     M     F
Northeast     Y   288    60   224    35   337    70      38    19    32    22    21    15
              N   177    57   166    19   172    30      33    35    11    20     8    10
Mid-Atlantic  Y    90    19    96    12   124    17      18    13     7     0     9     1
              N    45    12    42     5    39     2       6     7     2     3     2     1
Southern      Y   226    88   189    44   156    70      45    47    18    13    11     9
              N   128    57   117    34    73    25      31    35     3     7     2     2
Midwest       Y   285   110   225    53   324    60      40    66    19    25    22    11
              N   179    93   141    24   140    47      25    56    11    19     2    12
Northwest     Y   270   176   215    80   269   110      36    25     9    11    16     4
              N   180   151   108    40   136    40      20    16     7     5     3     5
Southwest     Y   252    97   162    47   199    62      69    45    14     8    14     2
              N   126    61    72    27    93    24      27    36     7     4     5     0
Pacific       Y   119    62    66    20    67    25      45    22    15    10     8     6
              N    58    33    20    10    21    10      16    15    10     8     6     2

Using Proc Logistic

data job;
  input sat nsat race age sex region;
  label sat     = 'satisfied with job'
        nsat    = 'dissatisfied'
        race    = '0=non-white, 1=white'
        age     = '3 age groups'
        sex     = '0=M, 1=F'
        region  = '7 regions'
        total   = 'denominator';
  total = sat+nsat;
  propsat = sat/total;
cards;
288 177 1 0 0 0
90 45 1 0 0 1
226 128 1 0 0 2
.
.
.
2 0 0 2 1 5
6 2 0 2 1 6
;

Using Proc Logistic

proc logistic data=job;
  class race age sex region / param=effect;
  model sat/total = race age sex region
                    race*age race*sex race*region
                    age*sex age*region sex*region
        / selection=forward hierarchy=none;
run;

Result of Proc Logistic (options Forward and hierarchy=none)

Type III Analysis of Effects
Effect      DF   Wald Chi-Square   Pr > ChiSq
race         1        0.1007         0.7510
age          2       50.7100         <.0001
sex          1       14.0597         0.0002
region       6       37.7010         <.0001
race*sex     1        7.5641         0.0060
age*sex      2        5.9577         0.0509

Using Proc Logistic with the option Param=effect

Analysis of Maximum Likelihood Estimates
Parameter         DF   Estimate   Standard Error   Chi-Square   Pr > ChiSq
Intercept          1    0.6481        0.0346        350.2297      <.0001
race      0        1   -0.0099        0.0312          0.1007      0.7510
age       0        1   -0.1952        0.0316         38.2459      <.0001
age       1        1   -0.0227        0.0375          0.3675      0.5444
sex       0        1    0.1230        0.0328         14.0597      0.0002
region    0        1   -0.2192        0.0469         21.8470      <.0001
region    1        1    0.2228        0.0820          7.3832      0.0066
region    2        1   -0.0446        0.0527          0.7159      0.3975
region    3        1   -0.1291        0.0462          7.8133      0.0052
region    4        1   -0.0927        0.0472          3.8616      0.0494
region    5        1    0.0704        0.0531          1.7565      0.1851
race*sex  0 0      1    0.0856        0.0311          7.5641      0.0060
age*sex   0 0      1    0.0768        0.0315          5.9428      0.0148
age*sex   1 0      1   -0.0342        0.0375          0.8352      0.3608

Computing and testing the last coefficients

proc logistic data=job;
  class race age sex region / param=effect;
  model sat/total = race age sex region race*sex age*sex;
  contrast 'Age >44' age -1 -1 / estimate=parm;
  contrast 'Pacific' region -1 -1 -1 -1 -1 -1 / estimate=parm;
  contrast 'Age>44,Homme' age*sex -1 -1 / estimate=parm;
run;

Results

Contrast Rows Estimation and Testing Results
Contrast         Estimate   Standard Error   Lower 95% limit   Pr > ChiSq
Age >44           0.2180        0.0375            0.1444         <.0001
Pacific           0.1924        0.0751            0.0453         0.0104
Age>44,Homme     -0.0425        0.0375           -0.1159         0.2565

Using Proc Logistic with the option Param=effect

Logit(Prob(Satisfied)) = 0.65
  + Race:      Non-white -.01 | White +.01
  + Age:       <35 -.20 | 35-44 -.02 | >44 +.22
  + Sex:       Male +.12 | Female -.12
  + Region:    Northeast -.22 | Mid-Atlantic +.22 | Southern -.04 | Midwest -.13 | Northwest -.09 | Southwest +.07 | Pacific +.19
  + Race*Sex:               Male    Female
               Non-white    +.09     -.09
               White        -.09     +.09
  + Age*Sex:                Male    Female
               <35          +.08     -.08
               35-44        -.03     +.03
               >44          -.05     +.05

Building a hierarchical model

proc logistic data=job;
  class race age sex region / param=effect;
  model sat/total = sex region race(sex) age(sex) / scale=none;
  contrast 'Pacific' region -1 -1 -1 -1 -1 -1 / estimate=parm;
  contrast 'Age>44,Homme' age(sex) -1 -1 0 0 / estimate=parm;
  contrast 'Age>44,Femme' age(sex) 0 0 -1 -1 / estimate=parm;
run;

Results

Type III Analysis of Effects
Effect       DF   Wald Chi-Square   Pr > ChiSq
sex           1       14.0597         0.0002
region        6       37.7010         <.0001
race(sex)     2        7.5710         0.0227
age(sex)      4       55.4078         <.0001

Results

Analysis of Maximum Likelihood Estimates
Parameter          DF   Estimate   Standard Error   Chi-Square   Pr > ChiSq
Intercept           1    0.6481        0.0346        350.2297      <.0001
sex        0        1    0.1230        0.0328         14.0597      0.0002
region     0        1   -0.2192        0.0469         21.8470      <.0001
region     1        1    0.2228        0.0820          7.3832      0.0066
region     2        1   -0.0446        0.0527          0.7159      0.3975
region     3        1   -0.1291        0.0462          7.8133      0.0052
region     4        1   -0.0927        0.0472          3.8616      0.0494
region     5        1    0.0704        0.0531          1.7565      0.1851
race(sex)  0 0      1    0.0757        0.0422          3.2230      0.0726
race(sex)  0 1      1   -0.0956        0.0459          4.3244      0.0376
age(sex)   0 0      1   -0.1185        0.0342         11.9881      0.0005
age(sex)   1 0      1   -0.0570        0.0370          2.3683      0.1238
age(sex)   0 1      1   -0.2720        0.0530         26.3735      <.0001
age(sex)   1 1      1    0.0115        0.0652          0.0313      0.8596

Contrast         Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Pacific           0.1924        0.0751            6.5729         0.0104
Age>44,Homme      0.1754        0.0367           22.8477         <.0001
Age>44,Femme      0.2605        0.0654           15.8719         <.0001

Using Proc Logistic with the option Param=effect

Logit(Prob(Satisfied)) = 0.65
  + Sex:        Male +.12 | Female -.12
  + Region:     Northeast -.22 | Mid-Atlantic +.22 | Southern -.04 (ns) | Midwest -.13 | Northwest -.09 | Southwest +.07 (ns) | Pacific +.19
  + Race(Sex), difference between races within each sex:
                           Non-white    White
                Male         +.08        -.08   (ns)
                Female       -.10        +.10
  + Age(Sex), difference between ages within each sex:
                           <35       35-44        >44
                Male       -.12      -.06 (ns)    +.18
                Female     -.27      +.01 (ns)    +.26

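Residual analysis: data aggregated into s groups

$n_i$ = size of group i, i = 1, ..., s (here s = 84)
$y_i$ = number of successes observed in group i
$\hat\pi_i$ = estimated probability of success in group i
$\hat y_i = n_i \hat\pi_i$ = expected number of successes in group i

Pearson residual:
$$r_i = \frac{y_i - n_i \hat\pi_i}{\sqrt{n_i \hat\pi_i (1 - \hat\pi_i)}}$$

Deviance residual:
$$d_i = \mathrm{sign}(y_i - \hat y_i) \sqrt{2 y_i \log\left(\frac{y_i}{\hat y_i}\right) + 2 (n_i - y_i) \log\left(\frac{n_i - y_i}{n_i - \hat y_i}\right)}$$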
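Both residuals can be computed for the first group of the job satisfaction data (288 satisfied out of 465, fitted probability 0.58848 in the Proc Logistic output). The function names are illustrative, and the handling of empty cells (a zero count contributes 0) is a convention assumed here.

```python
import math

def pearson_residual(y, n, p):
    """Pearson residual for y successes out of n with fitted probability p."""
    return (y - n * p) / math.sqrt(n * p * (1.0 - p))

def deviance_residual(y, n, p):
    """Deviance residual for y successes out of n with fitted probability p."""
    y_hat = n * p
    # Assumed convention: a term with a zero count contributes 0.
    t1 = y * math.log(y / y_hat) if y > 0 else 0.0
    t2 = (n - y) * math.log((n - y) / (n - y_hat)) if n - y > 0 else 0.0
    return math.copysign(math.sqrt(max(0.0, 2.0 * (t1 + t2))), y - y_hat)

r = pearson_residual(288, 465, 0.58848)
d = deviance_residual(288, 465, 0.58848)
print(round(r, 4), round(d, 4))
```

These values are close to the reschi (1.35305) and resdev (1.35864) printed for that group; the small gap comes from the rounding of the fitted probability.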

Residual analysis and model validation

proc logistic data=job;
  class race age sex region / param=effect;
  model sat/total = race age sex region race*sex age*sex / scale=none;
  output out=residu
         predicted=predicted
         reschi=reschi
         resdev=resdev;
run;

proc print data=residu;
  var sat total propsat predicted reschi resdev;
run;

Residual analysis: results

Obs   sat   total   propsat   predicted    reschi     resdev
 1    288    465    0.61935    0.58848     1.35305    1.35864
 2     90    135    0.66667    0.68991    -0.58388   -0.58005
 3    226    354    0.63842    0.63003     0.32704    0.32756
 4    285    464    0.61422    0.61011     0.18152    0.18164
 5    270    450    0.60000    0.61875    -0.81897   -0.81651
 6    252    378    0.66667    0.65641     0.41995    0.42097
 7    119    177    0.67232    0.68338    -0.31638   -0.31541
 8     60    117    0.51282    0.53231    -0.42246   -0.42216
 9     19     31    0.61290    0.63909    -0.30364   -0.30214

Model validation

Pearson chi-square:
$$Q_P = \sum_{i=1}^{s} r_i^2$$

Deviance:
$$Q_L = \sum_{i=1}^{s} d_i^2$$

If the model under study is correct, $Q_P$ and $Q_L$ approximately follow a chi-square distribution with [number of groups - number of model parameters] degrees of freedom.

Remarks
The validation tests are valid if there are at least 10 subjects per group.
The deviance $Q_L$ equals

[-2L(model under study)] - [-2L(saturated model)]

where the saturated model is a model that reproduces the data perfectly.

Results

Deviance and Pearson Goodness-of-Fit Statistics
Criterion   DF    Value     Value/DF   Pr > ChiSq
Deviance    70   81.9676     1.1710      0.1552
Pearson     70   79.0760     1.1297      0.2142

Number of events/trials observations: 84

Overdispersion
The Pearson chi-square $Q_P$ and the deviance $Q_L$ are too large if:
- the model is misspecified
- there are outliers
- each group is heterogeneous
The response variable Yi = number of successes in group i then no longer follows a binomial distribution:
- $E(Y_i) = n_i \pi_i$
- $V(Y_i) = \phi \, n_i \pi_i (1 - \pi_i)$

Computing $\hat\phi$
In Proc LOGISTIC:
- Option SCALE = Pearson: $\hat\phi = Q_P / \mathrm{df}$
- Option SCALE = Deviance: $\hat\phi = Q_L / \mathrm{df}$

In Proc GENMOD:
- Option PSCALE or DSCALE
- Scale = $\sqrt{\hat\phi}$ (also true in Proc Logistic)
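For the job satisfaction model, the two estimates of the dispersion parameter follow directly from the goodness-of-fit output (Pearson 79.0760 and deviance 81.9676, both with 70 degrees of freedom). The sketch below also divides one Wald chi-square by the Pearson estimate, as an illustration of the correction described for overdispersed data; variable names are illustrative.

```python
import math

q_pearson, q_deviance, df = 79.0760, 81.9676, 70    # from the goodness-of-fit output

phi_pearson = q_pearson / df        # estimate used with SCALE = Pearson
phi_deviance = q_deviance / df      # estimate used with SCALE = Deviance
scale_pearson = math.sqrt(phi_pearson)

# Illustration: Wald chi-square of race*sex (7.5641) corrected for overdispersion.
wald_corrected = 7.5641 / phi_pearson
print(round(phi_pearson, 4), round(phi_deviance, 4), round(scale_pearson, 4), round(wald_corrected, 3))
```

The two ratios are the Value/DF column of the output (1.1297 and 1.1710).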

LOGISTIC/GENMOD solution for taking overdispersion into account

The binomial response is used to estimate the parameters.
For the tests on the coefficients:
- The Wald and LRT statistics are divided by $\hat\phi$.
- The deviances are divided by $\hat\phi$.
- In GENMOD, the following statistic is used:

$$F = \frac{\mathrm{Dev}(\text{model under } H_0) - \mathrm{Dev}(\text{model under } H_1)}{\hat\phi \, (\mathrm{df}_{H_0} - \mathrm{df}_{H_1})}$$

If there is overdispersion (significant deviance and Pearson chi-square), the uncorrected results are too significant.

B. Ordinal logistic regression

Example: quality of Bordeaux wines
Variables observed over 34 years (1924 - 1957)
TEMPERATURE: sum of the mean daily temperatures
SOLEIL (sun): duration of sunshine
CHALEUR (heat): number of very hot days
PLUIE (rain): rainfall depth
QUALITE DU VIN (wine quality): good, average, mediocre

The data
(Qualité is coded 1 = Bon, 2 = Moyen, 3 = Médiocre)

Obs  Température  Soleil  Chaleur  Pluie  Qualité
  1     3064       1201     10      361      2
  2     3000       1053     11      338      3
  3     3155       1133     19      393      2
  4     3085        970      4      467      3
  5     3245       1258     36      294      1
  6     3267       1386     35      225      1
  7     3080        966     13      417      3
  8     2974       1189     12      488      3
  9     3038       1103     14      677      3
 10     3318       1310     29      427      2
 11     3317       1362     25      326      1
 12     3182       1171     28      326      3
 13     2998       1102      9      349      3
 14     3221       1424     21      382      1
 15     3019       1230     16      275      2
 16     3022       1285      9      303      2
 17     3094       1329     11      339      2
 18     3009       1210     15      536      3
 19     3227       1331     21      414      2
 20     3308       1366     24      282      1
 21     3212       1289     17      302      2
 22     3361       1444     25      253      1
 23     3061       1175     12      261      2
 24     3478       1317     42      259      1
 25     3126       1248     11      315      2
 26     3458       1508     43      286      1
 27     3252       1361     26      346      2
 28     3052       1186     14      443      3
 29     3270       1399     24      306      1
 30     3198       1259     20      367      1
 31     2904       1164      6      311      3
 32     3247       1277     19      375      1
 33     3083       1195      5      441      3
 34     3043       1208     14      371      3

Correlations

                                   Température  Soleil    Chaleur   Pluie
Température  Pearson Correlation   1            .712**    .865**    -.410*
             Sig. (2-tailed)       .            .000      .000      .016
             N                     34           34        34        34
Soleil       Pearson Correlation   .712**       1         .646**    -.473**
             Sig. (2-tailed)       .000         .         .000      .005
             N                     34           34        34        34
Chaleur      Pearson Correlation   .865**       .646**    1         -.401*
             Sig. (2-tailed)       .000         .000      .         .019
             N                     34           34        34        34
Pluie        Pearson Correlation   -.410*       -.473**   -.401*    1
             Sig. (2-tailed)       .016         .005      .019      .
             N                     34           34        34        34

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

VIF
Coefficients(a)

                       Collinearity Statistics
Model 1                Tolerance   VIF
Température            .211        4.733
Soleil                 .451        2.216
Chaleur                .248        4.031
Pluie                  .760        1.316

a. Dependent Variable: Qualité

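Ordinal logistic regression

The variable Y takes m+1 ordered values 1, ..., m, m+1.

I. The equal-slopes model

In Proc LOGISTIC:

$$\mathrm{Prob}(Y \le i \mid x) = \frac{e^{\alpha_i + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\alpha_i + \beta_1 x_1 + \cdots + \beta_k x_k}}$$

for i = 1, ..., m, with $\alpha_1 \le \alpha_2 \le \cdots \le \alpha_m$.

In SPSS:

$$\mathrm{Prob}(Y \le i \mid x) = \frac{e^{\alpha_i - \beta^*_1 x_1 - \cdots - \beta^*_k x_k}}{1 + e^{\alpha_i - \beta^*_1 x_1 - \cdots - \beta^*_k x_k}}$$

The regression coefficients of the x_j in SPSS are the opposite of those in SAS: $\beta^*_j = -\beta_j$.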

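Properties of the model
Equal-slopes model (proportional odds ratio):

$$\frac{\mathrm{Prob}(Y \le i \mid x) \,/\, \mathrm{Prob}(Y > i \mid x)}{\mathrm{Prob}(Y \le i \mid x') \,/\, \mathrm{Prob}(Y > i \mid x')} = \frac{e^{\alpha_i + \beta' x}}{e^{\alpha_i + \beta' x'}} = e^{\beta'(x - x')}$$

is independent of i.
When $\beta_j > 0$, the probability of small values of Y increases with X_j.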
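The proportional odds property can be checked numerically. In the sketch below the thresholds, slopes and covariate vectors are hypothetical values chosen for illustration, and the model is written in the Proc LOGISTIC form (alpha_i plus the linear predictor).

```python
import math

def cum_prob(alpha_i, beta, x):
    """Prob(Y <= i | x) under the equal-slopes model, SAS parameterization."""
    eta = alpha_i + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    return p / (1.0 - p)

# Hypothetical parameters: two thresholds (m = 2) and two covariates (k = 2)
alphas = [-1.0, 0.5]
beta = [0.8, -0.3]
x = [1.0, 2.0]
x_prime = [0.0, 1.0]

# Cumulative odds ratio between x and x' for each threshold i
ratios = [odds(cum_prob(a, beta, x)) / odds(cum_prob(a, beta, x_prime)) for a in alphas]
expected = math.exp(sum(b * (u - v) for b, u, v in zip(beta, x, x_prime)))

print(ratios, expected)
```

Every entry of `ratios` equals `expected` up to floating-point error, whatever the threshold, which is the "independent of i" statement above.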

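Test of the equal-slopes model in SAS

The general model:

$$\mathrm{Prob}(Y \le i \mid x) = \frac{e^{\alpha_i + \beta_{1i} x_1 + \cdots + \beta_{ki} x_k}}{1 + e^{\alpha_i + \beta_{1i} x_1 + \cdots + \beta_{ki} x_k}}$$

for i = 1, ..., m.

Test of H0:
$\beta_{11} = \beta_{12} = \cdots = \beta_{1m}$
$\beta_{21} = \beta_{22} = \cdots = \beta_{2m}$
...
$\beta_{k1} = \beta_{k2} = \cdots = \beta_{km}$

that is, k(m-1) constraints.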

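Statistic used
- $L(\theta)$: log-likelihood of the general model, where θ denotes the vector of all its parameters
- $\hat{\theta}_{H_0}$: estimate of θ for the equal-slopes model

The statistic

$$\chi^2_{Score} = \left[\frac{\partial L}{\partial \theta}\right]'_{\hat{\theta}_{H_0}} \left[-\frac{\partial^2 L}{\partial \theta \, \partial \theta'}\right]^{-1}_{\hat{\theta}_{H_0}} \left[\frac{\partial L}{\partial \theta}\right]_{\hat{\theta}_{H_0}}$$

follows a chi-square distribution with k(m-1) degrees of freedom under the hypothesis H0.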

Decision rule
The hypothesis H0 of an equal-slopes model is rejected, at risk α of being wrong, if

$$\chi^2_{Score} \ge \chi^2_{1-\alpha}[k(m-1)]$$

or if the significance level

$$NS = \mathrm{Prob}\left(\chi^2[k(m-1)] \ge \chi^2_{Score}\right) \le \alpha$$

Agresti's advice: this test is used to validate H0 rather than to reject H0.

SPSS results

Test of Parallel Lines(a)
Model             -2 Log Likelihood   Chi-Square   df   Sig.
Null Hypothesis   26.158
General           22.355              3.803        4    .433

The null hypothesis states that the location parameters (slope coefficients) are the same across response categories.
a. Link function: Logit.

Model Fitting Information
Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   74.647
Final            26.158              48.489       4    .000

Link function: Logit.

Pseudo R-Square
Cox and Snell   .760
Nagelkerke      .855
McFadden        .650

Link function: Logit.
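The figures in these SPSS tables can be re-derived from the three -2 log-likelihood values and the sample size n = 34. The sketch below uses the usual textbook definitions of the pseudo R-squares, which is an assumption about what SPSS computes (the results do agree to three decimals), and a closed-form chi-square tail that is only valid for an even number of degrees of freedom. Helper names are illustrative.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = df // 2
    return math.exp(-x / 2.0) * sum((x / 2.0) ** j / math.factorial(j) for j in range(half))

n = 34
m2ll_null, m2ll_final, m2ll_general = 74.647, 26.158, 22.355

# Test of parallel lines: equal-slopes model against the general model
parallel_chi2 = m2ll_final - m2ll_general
parallel_p = chi2_sf_even_df(parallel_chi2, 4)

# Model fitting information and pseudo R-squares
model_chi2 = m2ll_null - m2ll_final
cox_snell = 1.0 - math.exp(-model_chi2 / n)
nagelkerke = cox_snell / (1.0 - math.exp(-m2ll_null / n))
mcfadden = 1.0 - m2ll_final / m2ll_null

print(f"parallel lines: chi2 = {parallel_chi2:.3f}, p = {parallel_p:.3f}")
print(f"model chi2 = {model_chi2:.3f}")
print(f"Cox-Snell = {cox_snell:.3f}, Nagelkerke = {nagelkerke:.3f}, McFadden = {mcfadden:.3f}")
```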

SPSS results

Full model
Parameter Estimates
                          Estimate     Std. Error   Wald      df   Sig.
Threshold  [QUALITE = 1]  -85.50748    34.92140     5.99549   1    .014
           [QUALITE = 2]  -80.54960    33.96555     5.62405   1    .018
Location   TEMPERAT       -.02427      .01277       3.61247   1    .057
           SOLEIL         -.01379      .00850       2.63346   1    .105
           CHALEUR        .08876       .11929       .55364    1    .457
           PLUIE          .02589       .01235       4.39307   1    .036

Link function: Logit.

Model without Chaleur
Parameter Estimates
                          Estimate     Std. Error   Wald      df   Sig.
Threshold  [QUALITE = 1]  -67.44675    22.89023     8.68204   1    .003
           [QUALITE = 2]  -62.63810    21.78872     8.26445   1    .004
Location   TEMPERAT       -.01717      .00759       5.11905   1    .024
           SOLEIL         -.01499      .00832       3.24843   1    .071
           PLUIE          .02224       .01046       4.52311   1    .033

Link function: Logit.
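To see how these estimates turn into category probabilities, the sketch below applies the model without Chaleur to the first observation of the data (Température 3064, Soleil 1201, Pluie 361). It assumes the SPSS form of the model, in which the linear predictor is subtracted from each threshold; the dictionary keys and helper name are illustrative.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Estimates of the model without Chaleur (SPSS parameter table)
thresholds = [-67.44675, -62.63810]            # [QUALITE = 1], [QUALITE = 2]
location = {"TEMPERAT": -0.01717, "SOLEIL": -0.01499, "PLUIE": 0.02224}

# First observation of the Bordeaux data
obs1 = {"TEMPERAT": 3064, "SOLEIL": 1201, "PLUIE": 361}

eta = sum(location[name] * obs1[name] for name in location)

# Cumulative probabilities Prob(Y <= i), then differences for each category
cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
probs = [cumulative[0]] + [cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))]

for label, p in zip(["Bon", "Moyen", "Médiocre"], probs):
    print(f"{label:9s} {p:.2f}")
```

The three values agree, after rounding, with the estimated cell probabilities reported for case 1 in the prediction table (.01, .48, .51), so Médiocre is the predicted category for that year.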

Prediction of wine quality with the 2nd model

Case Summaries(a)
(P(1), P(2), P(3) = estimated cell probability for response category 1, 2, 3)

Case  Qualité    P(1)   P(2)   P(3)   Predicted Response Category
  1   Moyen      .01    .48    .51    Médiocre
  2   Médiocre   .00    .05    .95    Médiocre
  3   Moyen      .01    .44    .56    Médiocre
  4   Médiocre   .00    .00    1.00   Médiocre
  5   Bon        .64    .35    .00    Bon
  6   Bon        .99    .01    .00    Bon
  7   Médiocre   .00    .01    .99    Médiocre
  8   Médiocre   .00    .01    .99    Médiocre
  9   Médiocre   .00    .00    1.00   Médiocre
 10   Moyen      .42    .57    .01    Moyen
 11   Bon        .94    .06    .00    Bon
 12   Médiocre   .08    .83    .09    Moyen
 13   Médiocre   .00    .08    .92    Médiocre
 14   Bon        .67    .33    .00    Bon
 15   Moyen      .04    .78    .18    Moyen
 16   Moyen      .05    .81    .15    Moyen
 17   Moyen      .13    .82    .05    Moyen
 18   Médiocre   .00    .01    .99    Médiocre
 19   Moyen      .21    .76    .03    Moyen
 20   Bon        .97    .03    .00    Bon
 21   Moyen      .58    .42    .01    Bon
 22   Bon        1.00   .00    .00    Bon
 23   Moyen      .04    .81    .15    Moyen
 24   Bon        1.00   .00    .00    Bon
 25   Moyen      .11    .83    .06    Moyen
 26   Bon        1.00   .00    .00    Bon
 27   Moyen      .75    .25    .00    Bon
 28   Médiocre   .00    .09    .91    Médiocre
 29   Bon        .95    .05    .00    Bon
 30   Bon        .14    .81    .05    Moyen
 31   Médiocre   .00    .09    .90    Médiocre
 32   Bon        .29    .69    .02    Moyen
 33   Médiocre   .00    .17    .83    Médiocre
 34   Médiocre   .00    .36    .63    Médiocre

a. Limited to first 100 cases.

Quality of the prediction
Qualité * Predicted Response Category Crosstabulation
Count

                 Predicted Response Category
Qualité          Bon    Moyen   Médiocre   Total
Bon              9      2       0          11
Moyen            2      7       2          11
Médiocre         0      1       11         12
Total            11     10      13         34
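A short sketch of how the crosstabulation summarizes prediction quality. The counts are those of the table, with the cells that SPSS leaves blank taken as 0 (an assumption consistent with the row and column totals).

```python
labels = ["Bon", "Moyen", "Médiocre"]

# Rows: observed Qualité; columns: predicted category (blank cells taken as 0)
table = [
    [9, 2, 0],
    [2, 7, 2],
    [0, 1, 11],
]

total = sum(sum(row) for row in table)
correct = sum(table[i][i] for i in range(len(labels)))
accuracy = correct / total
per_class = {labels[i]: table[i][i] / sum(table[i]) for i in range(len(labels))}

print(f"correctly classified: {correct}/{total} = {accuracy:.1%}")
for label, rate in per_class.items():
    print(f"{label:9s} {rate:.1%}")
```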

II. The partially equal-slopes model

- The data of each observation are repeated m times.
- The variable Type gives the number i of the repetition.
- The variable Réponse indicates whether [Y ≤ i] is true:

Année   Qualité   Type   Réponse
1926    2         1      0         (Y = 1) false
1926    2         2      1         (Y ≤ 2) true
1927    3         1      0
1927    3         2      0
1928    1         1      1
1928    1         2      1

For Type = 1: Réponse = 1 ⇔ Qualité = 1
For Type = 2: Réponse = 1 ⇔ Qualité ≤ 2
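The data expansion described above can be sketched in a few lines. The function name and tuple layout are illustrative; the three years and their quality codes are the ones shown in the example table.

```python
def expand(records, m):
    """Repeat each (year, quality) record m times, one row per cut point i.

    Each output row is (year, quality, type, response), where response is 1
    when quality <= type and 0 otherwise.
    """
    rows = []
    for year, quality in records:
        for i in range(1, m + 1):
            rows.append((year, quality, i, 1 if quality <= i else 0))
    return rows

records = [(1926, 2), (1927, 3), (1928, 1)]
rows = expand(records, m=2)

for row in rows:
    print(row)
```

With m = 2 the 34 years produce 68 rows, which is the size of the expanded data set that the binary logistic model is then fitted on.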

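The complete model

$$\mathrm{Prob}(\text{Réponse} = 1 \mid \text{Type}, x) = \frac{e^{\alpha_1 T_1 + \alpha_2 T_2 + \beta_1 T + \cdots + \beta_4 P + \beta_5 T_1 T + \cdots + \beta_8 T_1 P}}{1 + e^{\alpha_1 T_1 + \alpha_2 T_2 + \beta_1 T + \cdots + \beta_4 P + \beta_5 T_1 T + \cdots + \beta_8 T_1 P}}$$

- For Type = 1: Réponse = 1 ⇔ Qualité = 1
- For Type = 2: Réponse = 1 ⇔ Qualité ≤ 2
- Hence: Prob(Réponse = 1 | Type = 1, x) = Prob(Qualité = 1 | x)
         Prob(Réponse = 1 | Type = 2, x) = Prob(Qualité ≤ 2 | x)
- T_1, T_2 = indicator variables of the variable Type; T, ..., P = the four explanatory variables (Température, ..., Pluie)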

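The SAS code
/* Complete model fitted on the expanded data set (one row per year and Type) */
proc genmod data=bordeaux2 descending;
  class type annee;
  /* noint: the two Type indicators play the role of the intercepts */
  model reponse = type tempera soleil chaleur pluie
                  type*tempera type*soleil
                  type*chaleur type*pluie
                  / dist=bin link=logit type3 noint;
  /* GEE: the rows of a given year are treated as correlated */
  repeated subject=annee / type=unstr;
run;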

Results, step 1
The GENMOD Procedure
Criteria For Assessing Goodness Of Fit

Criterion            DF   Value      Value/DF
Deviance             58   22.5317    0.3885
Scaled Deviance      58   22.5317    0.3885
Pearson Chi-Square   58   20.4541    0.3527
Scaled Pearson X2    58   20.4541    0.3527
Log Likelihood            -11.2659

Algorithm converged.

Results, step 1
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter          Estimate    Standard Error   95% Confidence Limits    Z       Pr > |Z|
Intercept          0.0000      0.0000           0.0000      0.0000       .       .
type          1    -68.1364    29.7166          -126.380    -9.8929      -2.29   0.0219
type          2    -251.965    82.1239          -412.925    -91.0055     -3.07   0.0022
tempera            0.0948      0.0330           0.0300      0.1596       2.87    0.0041
soleil             0.0079      0.0107           -0.0130     0.0288       0.74    0.4598
chaleur            -0.8727     0.3574           -1.5732     -0.1722      -2.44   0.0146
pluie              -0.1036     0.0437           -0.1893     -0.0179      -2.37   0.0178
tempera*type  1    -0.0755     0.0358           -0.1458     -0.0053      -2.11   0.0351
tempera*type  2    0.0000      0.0000           0.0000      0.0000       .       .
soleil*type   1    0.0013      0.0144           -0.0270     0.0295       0.09    0.9290
soleil*type   2    0.0000      0.0000           0.0000      0.0000       .       .
chaleur*type  1    0.8799      0.3795           0.1360      1.6238       2.32    0.0204
chaleur*type  2    0.0000      0.0000           0.0000      0.0000       .       .
pluie*type    1    0.0852      0.0460           -0.0050     0.1753       1.85    0.0641
pluie*type    2    0.0000      0.0000           0.0000      0.0000       .       .

Results
Score Statistics For Type 3 GEE Analysis

Source          DF   Chi-Square   Pr > ChiSq
type            2    7.08         0.0290
tempera         1    4.94         0.0263
soleil          0    .            .
chaleur         2    0.00         0.9995
pluie           2    0.02         0.9881
tempera*type    2    0.04         0.9799
soleil*type     2    0.27         0.8734
chaleur*type    2    0.00         0.9999
pluie*type      2    0.00         1.0000

The partially equal-slopes model

- The non-significant interactions are eliminated step by step.
- The equal-slopes model is recovered if all the interactions are eliminated.
- This approach allows a likelihood ratio test (LRT) comparing the complete model with the equal-slopes model.

Result of the iterations

Equal-slopes model
Criteria For Assessing Goodness Of Fit

Criterion            DF   Value      Value/DF
Deviance             62   26.2408    0.4232
Scaled Deviance      62   26.2408    0.4232
Pearson Chi-Square   62   26.5218    0.4278
Scaled Pearson X2    62   26.5218    0.4278
Log Likelihood            -13.1204

Algorithm converged.
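The LRT between the complete model and the equal-slopes model can be sketched from the two GENMOD goodness-of-fit tables (deviance 22.5317 on 58 df and 26.2408 on 62 df). The chi-square tail helper is a closed form valid for even degrees of freedom only, and treating the deviance difference as a chi-square on the expanded data is the approach suggested by the slides rather than something verified independently here.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = df // 2
    return math.exp(-x / 2.0) * sum((x / 2.0) ** j / math.factorial(j) for j in range(half))

# Deviance and residual df from the two GENMOD tables
dev_complete, df_complete = 22.5317, 58
dev_equal, df_equal = 26.2408, 62

lrt = dev_equal - dev_complete
lrt_df = df_equal - df_complete
p_value = chi2_sf_even_df(lrt, lrt_df)

print(f"LRT = {lrt:.4f} on {lrt_df} df, p = {p_value:.4f}")
```

The result is close to the SPSS test of parallel lines (chi-square 3.803 on 4 df, Sig. .433), and leads to the same conclusion that the equal-slopes model is not rejected.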

Result of the iterations

Equal-slopes model
Analysis Of Initial Parameter Estimates

Parameter      DF   Estimate    Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept      0    0.0000      0.0000           0.0000      0.0000           .            .
type      1    1    -86.4800    35.0585          -155.193    -17.7666         6.08         0.0136
type      2    1    -81.5119    34.0447          -148.238    -14.7855         5.73         0.0167
tempera        1    0.0245      0.0127           -0.0004     0.0495           3.70         0.0543
soleil         1    0.0140      0.0085           -0.0026     0.0306           2.73         0.0986
chaleur        1    -0.0922     0.1180           -0.3235     0.1391           0.61         0.4348
pluie          1    -0.0259     0.0123           -0.0500     -0.0019          4.46         0.0347

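C. Multinomial logistic regression

The nominal variable Y takes r values.
Model (category r serves as the reference):

$$\mathrm{Prob}(Y = i \mid x) = \frac{e^{\alpha_i + \beta_{i1} x_1 + \cdots + \beta_{ik} x_k}}{1 + \sum_{l=1}^{r-1} e^{\alpha_l + \beta_{l1} x_1 + \cdots + \beta_{lk} x_k}}, \quad i = 1, \ldots, r-1$$

$$\mathrm{Prob}(Y = r \mid x) = \frac{1}{1 + \sum_{l=1}^{r-1} e^{\alpha_l + \beta_{l1} x_1 + \cdots + \beta_{lk} x_k}}$$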

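Application to the Bordeaux wines

The SAS code
/* Multinomial (generalized logit) model for wine quality */
proc catmod data=bordeaux;
  /* direct: treat the four predictors as quantitative variables */
  direct tempera soleil chaleur pluie;
  response logit;
  model qualite = tempera soleil chaleur pluie;
run;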

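Wald test of the influence of a variable X_j

The model:

$$\pi_i(x) = P(Y = i \mid X = x) = \frac{e^{\beta_{i0} + \beta_{i1} x_1 + \cdots + \beta_{ik} x_k}}{1 + \sum_{l=1}^{r-1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lk} x_k}}, \quad i = 1, \ldots, r-1$$

Test:
H0: $\beta_{1j} = \cdots = \beta_{r-1,j} = 0$
H1: at least one $\beta_{ij} \ne 0$

Statistic used:

$$\mathrm{Wald} = (\hat{\beta}_{1j}, \ldots, \hat{\beta}_{r-1,j}) \left[\widehat{\mathrm{Var}}\begin{pmatrix} \hat{\beta}_{1j} \\ \vdots \\ \hat{\beta}_{r-1,j} \end{pmatrix}\right]^{-1} \begin{pmatrix} \hat{\beta}_{1j} \\ \vdots \\ \hat{\beta}_{r-1,j} \end{pmatrix}$$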

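Decision rule
H0: $\beta_{1j} = \cdots = \beta_{r-1,j} = 0$ is rejected, at risk α of being wrong, if

$$\mathrm{Wald} \ge \chi^2_{1-\alpha}(r-1)$$

or if the significance level

$$NS = \mathrm{Prob}\left(\chi^2(r-1) \ge \mathrm{Wald}\right) \le \alpha$$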

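Influence of the k - p variables X_{p+1}, ..., X_k

The model:

$$\pi_i(x) = P(Y = i \mid X = x) = \frac{e^{\beta_{i0} + \beta_{i1} x_1 + \cdots + \beta_{ik} x_k}}{1 + \sum_{l=1}^{r-1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lk} x_k}}, \quad i = 1, \ldots, r-1$$

Test:
H0: $\beta_{i,p+1} = \cdots = \beta_{ik} = 0$, i = 1, ..., r-1
H1: at least one $\beta_{ij} \ne 0$

Statistics used:
1. Λ = [-2 L(reduced model)] - [-2 L(complete model)]
2.

$$\mathrm{Wald} = (\hat{\beta}_{1,p+1}, \ldots, \hat{\beta}_{r-1,k}) \left[\widehat{\mathrm{Var}}\begin{pmatrix} \hat{\beta}_{1,p+1} \\ \vdots \\ \hat{\beta}_{r-1,k} \end{pmatrix}\right]^{-1} \begin{pmatrix} \hat{\beta}_{1,p+1} \\ \vdots \\ \hat{\beta}_{r-1,k} \end{pmatrix}$$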

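Decision rule
H0: $\beta_{1,p+1} = \cdots = \beta_{r-1,k} = 0$ is rejected, at risk α of being wrong, if

$$\Lambda \text{ or Wald} \ge \chi^2_{1-\alpha}[(k-p)(r-1)]$$

or if the significance level

$$NS = \mathrm{Prob}\left(\chi^2[(k-p)(r-1)] \ge \text{Wald or } \Lambda\right) \le \alpha$$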

Application to the Bordeaux wines

Model Fitting Information
Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   74.647
Final            22.227              52.420       8    .000

Pseudo R-Square
Cox and Snell   .786
Nagelkerke      .884
McFadden        .702

Application to the Bordeaux wines

Likelihood Ratio Tests
Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   34.575                               12.348       2    .002
TEMPERAT    29.546                               7.319        2    .026
SOLEIL      22.870                               .642         2    .725
CHALEUR     25.894                               3.667        2    .160
PLUIE       31.242                               9.015        2    .011

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

LRT tests are more accurate than Wald tests: they give a better approximation of the significance level.
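Each line of the likelihood ratio table can be recomputed from the -2 log-likelihood of the final model (22.227, from the Model Fitting Information table). The sketch below relies on the fact that, for 2 degrees of freedom, the chi-square upper tail is exp(-x/2); variable names are illustrative.

```python
import math

m2ll_final = 22.227   # -2 log-likelihood of the final (complete) model

# -2 log-likelihood of each reduced model, copied from the table
reduced = {
    "Intercept": 34.575,
    "TEMPERAT": 29.546,
    "SOLEIL": 22.870,
    "CHALEUR": 25.894,
    "PLUIE": 31.242,
}

results = {}
for effect, m2ll in reduced.items():
    chi2 = m2ll - m2ll_final
    p_value = math.exp(-chi2 / 2.0)   # chi-square tail with 2 degrees of freedom
    results[effect] = (chi2, p_value)
    print(f"{effect:10s} chi2 = {chi2:7.3f}  p = {p_value:.3f}")
```

Differences in the last decimal (for example .643 against the printed .642 for SOLEIL) come from the rounding of the printed -2 log-likelihoods.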

Application to the Bordeaux wines

Parameter Estimates
Qualité   Parameter   B          Std. Error   Wald    df   Sig.   Exp(B)
Bon       Intercept   -313.557   230.325      1.853   1    .173
          TEMPERAT    .113       .096         1.375   1    .241   1.120
          SOLEIL      .015       .024         .370    1    .543   1.015
          CHALEUR     -.874      .934         .876    1    .349   .417
          PLUIE       -.122      .104         1.387   1    .239   .885
Moyen     Intercept   -249.604   225.594      1.224   1    .269
          TEMPERAT    .095       .095         .999    1    .318   1.099
          SOLEIL      .007       .022         .094    1    .759   1.007
          CHALEUR     -.890      .923         .930    1    .335   .411
          PLUIE       -.105      .103         1.040   1    .308   .901

Application to the Bordeaux wines

Pseudo R-Square
Cox and Snell   .782
Nagelkerke      .880
McFadden        .694

Likelihood Ratio Tests
Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   42.197                               19.327       2    .000
TEMPERAT    43.392                               20.522       2    .000
CHALEUR     30.419                               7.550        2    .023
PLUIE       41.634                               18.764       2    .000

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

Application to the Bordeaux wines

Parameter Estimates
Qualité   Parameter   B          Std. Error   Wald    df   Sig.   Exp(B)
Bon       Intercept   -381.353   190.219      4.019   1    .045
          TEMPERAT    .145       .074         3.893   1    .048   1.156
          CHALEUR     -1.161     .738         2.478   1    .115   .313
          PLUIE       -.151      .080         3.577   1    .059   .860
Moyen     Intercept   -308.897   186.257      2.750   1    .097
          TEMPERAT    .121       .072         2.785   1    .095   1.129
          CHALEUR     -1.145     .729         2.471   1    .116   .318
          PLUIE       -.133      .078         2.906   1    .088   .875

Prediction of wine quality

Case Summaries(a)
(P(1), P(2), P(3) = estimated cell probability for response category 1, 2, 3)

Case  Qualité    P(1)   P(2)   P(3)   Predicted Response Category
  1   Moyen      .01    .88    .10    Moyen
  2   Médiocre   .00    .03    .97    Médiocre
  3   Moyen      .01    .19    .79    Médiocre
  4   Médiocre   .00    .07    .93    Médiocre
  5   Bon        .73    .26    .01    Bon
  6   Bon        .94    .06    .00    Bon
  7   Médiocre   .00    .00    1.00   Médiocre
  8   Médiocre   .00    .00    1.00   Médiocre
  9   Médiocre   .00    .00    1.00   Médiocre
 10   Moyen      .63    .34    .03    Bon
 11   Bon        .92    .08    .00    Bon
 12   Médiocre   .20    .50    .30    Moyen
 13   Médiocre   .00    .04    .96    Médiocre
 14   Bon        .30    .69    .00    Moyen
 15   Moyen      .02    .77    .21    Moyen
 16   Moyen      .02    .98    .00    Moyen
 17   Moyen      .05    .95    .00    Moyen
 18   Médiocre   .00    .00    1.00   Médiocre
 19   Moyen      .21    .72    .08    Moyen
 20   Bon        .95    .05    .00    Bon
 21   Moyen      .60    .40    .00    Bon
 22   Bon        .99    .01    .00    Bon
 23   Moyen      .08    .92    .00    Moyen
 24   Bon        1.00   .00    .00    Bon
 25   Moyen      .14    .86    .00    Moyen
 26   Bon        1.00   .00    .00    Bon
 27   Moyen      .62    .38    .00    Bon
 28   Médiocre   .00    .00    1.00   Médiocre
 29   Bon        .84    .16    .00    Bon
 30   Bon        .25    .75    .00    Moyen
 31   Médiocre   .00    .00    1.00   Médiocre
 32   Bon        .49    .51    .00    Moyen
 33   Médiocre   .00    .38    .62    Médiocre
 34   Médiocre   .00    .00    1.00   Médiocre

a. Limited to first 100 cases.

Application to the Bordeaux wines

Classification
                       Predicted
Observed               Bon      Moyen    Médiocre   Percent Correct
Bon                    8        3        0          72.7%
Moyen                  3        7        1          63.6%
Médiocre               0        1        11         91.7%
Overall Percentage     32.4%    32.4%    35.3%      76.5%

Alligators example (Agresti)

Table 1: Primary Food Choice of Alligators, by Lake, Gender, and Size

                                   Primary Food Choice
Lake       Gender   Size    Fish   Invertebrate   Reptile   Bird   Other
Hancock    Male     ≤2.3    7      1              0         0      5
                    >2.3    4      0              0         1      2
           Female   ≤2.3    16     3              2         2      3
                    >2.3    3      0              1         2      3
Oklawaha   Male     ≤2.3    2      2              0         0      1
                    >2.3    13     7              6         0      0
           Female   ≤2.3    3      9              1         0      2
                    >2.3    0      1              0         1      0
Trafford   Male     ≤2.3    3      7              1         0      1
                    >2.3    8      6              6         3      5
           Female   ≤2.3    2      4              1         1      4
                    >2.3    0      1              0         0      0
George     Male     ≤2.3    13     10             0         2      2
                    >2.3    9      0              0         1      2
           Female   ≤2.3    3      9              1         0      1
                    >2.3    8      1              0         0      1

Alligators example
- The sample consisted of 219 alligators captured in four Florida lakes during September 1985.
- The response variable is the primary food type, in volume, found in an alligator's stomach. This variable had five categories: Fish, Invertebrate, Reptile, Bird, Other.
- The invertebrates found in the stomachs were primarily apple snails, aquatic insects, and crayfish.
- The reptiles were primarily turtles (though one stomach contained tags of 23 baby alligators that had been released in the lake during the previous year!).
- The Other category consisted of amphibian, mammal, plant material, stones or other debris, or no food of dominant type.

Alligators example
Likelihood Ratio Tests

Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   146.644(a)                           .000         0    .
LAKE        196.962                              50.318       12   .000
GENDER      148.859                              2.215        4    .696
SIZE        164.244                              17.600       4    .001

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

Alligators example
Likelihood Ratio Tests

Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   95.028(a)                            .000         0    .
LAKE        144.161                              49.133       12   .000
SIZE        116.115                              21.087       4    .000

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

Estimated model
Parameter Estimates

CHOICE   Parameter       B        Std. Error   Wald    df   Sig.
B        Intercept       -.626    .642         .952    1    .329
         [LAKE=G]        1.847    1.317        1.967   1    .161
         [LAKE=H]        1.300    .993         1.712   1    .191
         [LAKE=O]        -1.265   1.233        1.052   1    .305
         [LAKE=T]        0(a)     .            .       0    .
         [SIZE=<=2.3]    -.279    .806         .120    1    .729
         [SIZE=>2.3]     0(a)     .            .       0    .
F        Intercept       .379     .479         .626    1    .429
         [LAKE=G]        2.935    1.116        6.913   1    .009
         [LAKE=H]        1.692    .780         4.703   1    .030
         [LAKE=O]        .476     .634         .564    1    .452
         [LAKE=T]        0(a)     .            .       0    .
         [SIZE=<=2.3]    .351     .580         .367    1    .545
         [SIZE=>2.3]     0(a)     .            .       0    .
I        Intercept       -.048    .505         .009    1    .925
         [LAKE=G]        1.813    1.127        2.590   1    .108
         [LAKE=H]        -1.088   .908         1.434   1    .231
         [LAKE=O]        .292     .641         .207    1    .649
         [LAKE=T]        0(a)     .            .       0    .
         [SIZE=<=2.3]    1.809    .603         9.008   1    .003
         [SIZE=>2.3]     0(a)     .            .       0    .
O        Intercept       -.009    .522         .000    1    .987
         [LAKE=G]        1.419    1.189        1.424   1    .233
         [LAKE=H]        1.002    .830         1.459   1    .227
         [LAKE=O]        -1.034   .840         1.515   1    .218
         [LAKE=T]        0(a)     .            .       0    .
         [SIZE=<=2.3]    .683     .651         1.099   1    .295
         [SIZE=>2.3]     0(a)     .            .       0    .

a. This parameter is set to zero because it is redundant.
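As an illustration of how the estimated model yields predicted probabilities, the sketch below evaluates the multinomial formula for a small alligator (size ≤ 2.3) in Lake Hancock. It assumes that the four coefficient blocks correspond, in order, to the categories B, F, I and O, with R (Reptile) as the reference category; the dictionary layout and function name are illustrative.

```python
import math

# Coefficients per response category: intercept, lake effects (T is the
# redundant level, set to 0) and the effect of SIZE <= 2.3 (SIZE > 2.3 is 0)
coef = {
    "B": {"Intercept": -0.626, "G": 1.847, "H": 1.300, "O": -1.265, "T": 0.0, "small": -0.279},
    "F": {"Intercept": 0.379, "G": 2.935, "H": 1.692, "O": 0.476, "T": 0.0, "small": 0.351},
    "I": {"Intercept": -0.048, "G": 1.813, "H": -1.088, "O": 0.292, "T": 0.0, "small": 1.809},
    "O": {"Intercept": -0.009, "G": 1.419, "H": 1.002, "O": -1.034, "T": 0.0, "small": 0.683},
}

def choice_probabilities(lake, small):
    """Estimated probabilities of B, F, I, O and the reference category R."""
    exp_eta = {
        cat: math.exp(c["Intercept"] + c[lake] + (c["small"] if small else 0.0))
        for cat, c in coef.items()
    }
    denom = 1.0 + sum(exp_eta.values())
    probs = {cat: value / denom for cat, value in exp_eta.items()}
    probs["R"] = 1.0 / denom   # reference category (assumed to be Reptile)
    return probs

hancock_small = choice_probabilities("H", small=True)
for cat, p in hancock_small.items():
    print(f"{cat}: {p:.2f}")
```

The values agree, up to rounding, with the estimated cell probabilities that SPSS reports for Lake Hancock and size ≤ 2.3 (.07, .54, .09, .25, .05).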

Prediction
Case Summaries
(Estimated cell probability for each response category)

Case   LAKE   SIZE     B     F     I     O     R
1      H      <=2.3    .07   .54   .09   .25   .05
2      H      >2.3     .14   .57   .02   .19   .07
3      O      <=2.3    .01   .26   .60   .05   .08
4      O      >2.3     .03   .46   .25   .07   .19
5      T      <=2.3    .04   .18   .52   .17   .09
6      T      >2.3     .11   .30   .19   .20   .20
7      G      <=2.3    .03   .45   .41   .09   .01
8      G      >2.3     .08   .66   .14   .10   .02

H = Hancock, O = Oklawaha, T = Trafford, G = George

Alligators example (2)

SEX  LENGTH  CHOICE     SEX  LENGTH  CHOICE     SEX  LENGTH  CHOICE
M    1.30    I          M    2.03    F          F    1.78    I
M    1.32    F          M    2.03    F          F    1.78    O
M    1.32    F          M    2.31    F          F    1.80    I
M    1.40    F          M    2.36    F          F    1.88    I
M    1.42    I          M    2.46    F          F    2.16    F
M    1.42    F          M    3.25    O          F    2.26    F
M    1.47    I          M    3.28    O          F    2.31    F
M    1.47    F          M    3.33    F          F    2.36    F
M    1.50    I          M    3.56    F          F    2.39    F
M    1.52    I          M    3.58    F          F    2.41    F
M    1.63    I          M    3.66    F          F    2.44    F
M    1.65    O          M    3.68    O          F    2.56    O
M    1.65    O          M    3.71    F          F    2.67    F
M    1.65    I          M    3.89    F          F    2.72    I
M    1.65    F          F    1.24    I          F    2.79    F
M    1.68    F          F    1.30    I          F    2.84    F
M    1.70    I          F    1.45    I
M    1.73    O          F    1.45    O
M    1.78    F          F    1.55    I
M    1.78    O          F    1.60    I
M    1.80    F          F    1.60    I
M    1.85    F          F    1.65    F
M    1.93    I
M    1.93    F
M    1.98    I

Alligators example (2)

The CATMOD Procedure
Maximum likelihood computations converged.

Maximum Likelihood Analysis of Variance
Source             DF   Chi-Square   Pr > ChiSq
Intercept          2    9.84         0.0073
sex                2    2.71         0.2574
length             2    10.28        0.0059
length*sex         2    2.57         0.2767
Likelihood Ratio   94   77.64        0.8890

Alligators example (2)

Likelihood Ratio Tests
Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   92.270(a)                            .000         0    .
LENGTH      110.319                              18.049       2    .000
SEX         95.732                               3.461        2    .177

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

Alligators example (2)

Likelihood Ratio Tests
Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   104.563                              14.247       2    .001
LENGTH      106.681                              16.365       2    .000

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

Parameter Estimates
CHOICE   Parameter   B        Std. Error   Wald    df   Sig.
F        Intercept   .998     1.176        .721    1    .396
         LENGTH      .085     .489         .030    1    .862
I        Intercept   5.181    1.746        8.807   1    .003
         LENGTH      -2.388   .921         6.718   1    .010

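Alligators example (2)

$$\mathrm{Prob}(F) = \frac{e^{.998 + .085\,\text{Length}}}{1 + e^{.998 + .085\,\text{Length}} + e^{5.181 - 2.388\,\text{Length}}}$$

$$\mathrm{Prob}(I) = \frac{e^{5.181 - 2.388\,\text{Length}}}{1 + e^{.998 + .085\,\text{Length}} + e^{5.181 - 2.388\,\text{Length}}}$$

$$\mathrm{Prob}(O) = \frac{1}{1 + e^{.998 + .085\,\text{Length}} + e^{5.181 - 2.388\,\text{Length}}}$$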
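The three formulas can be tabulated over a range of lengths to see how the predicted food choice changes with size. This is a small sketch using the estimated coefficients of the LENGTH-only model; the grid of lengths and the function name are illustrative choices.

```python
import math

def choice_probs(length):
    """Prob(F), Prob(I), Prob(O) for a given length (O is the reference)."""
    exp_f = math.exp(0.998 + 0.085 * length)
    exp_i = math.exp(5.181 - 2.388 * length)
    denom = 1.0 + exp_f + exp_i
    return {"F": exp_f / denom, "I": exp_i / denom, "O": 1.0 / denom}

# Length at which Prob(F) = Prob(I): the two linear predictors are equal
crossover = (5.181 - 0.998) / (2.388 + 0.085)

for length in [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]:
    p = choice_probs(length)
    print(f"Length {length:.1f}: F = {p['F']:.2f}  I = {p['I']:.2f}  O = {p['O']:.2f}")
print(f"Prob(F) = Prob(I) at Length = {crossover:.2f}")
```

Below the crossover length invertebrates are the most likely primary food; above it fish dominate, in line with the strongly negative LENGTH coefficient for category I.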

Alligators example (2)

[Figure: Prob(F), Prob(I) and Prob(O) plotted against Longueur (length), from 1.0 to 4.0; vertical axis: Probabilité (probability), from 0.0 to .8]