
Introduction

Intensity and Nature

Inertia decomposition

Graphical representation

Helps to interpret

Correspondence Analysis
Julie Josse, François Husson, Sébastien Lê
Applied Mathematics Department, Agrocampus Ouest

useR-2008 Dortmund, August 11th 2008


History

Theoretical principles: Fisher (1940)
Correspondence Analysis was actively developed in 1965 ... in Rennes!
J.-P. Benzécri: mathematician and linguist
PhD thesis of his student B. Escofier: Correspondence Analysis
The beginning of the "French school"


CA in the R packages

anacor (de Leeuw and Mair)
ca (Nenadic and Greenacre)
ade4 (Chessel)
vegan (Dixon)
homals (de Leeuw)
FactoMineR (Husson et al.)


Data, examples

Two categorical variables: contingency table
Symmetric role of the rows and the columns

Examples:
examples where a $\chi^2$ test can be applied
text-mining: number of times the word i is in the text j
solutions (acid, bitter, etc.) - answers (acid, bitter, etc.): number of persons who answer j for the stimulus i
perfumes - descriptors: number of times the descriptor j is used to describe the perfume i


Notations

Figure: Data table in CA.



Notations

Figure: Row profile and column profile.



Aim

Rows typology
Columns typology
Relationship between these two typologies

Study the relationship (the correspondence) between the two variables, the gap to independence
Visualize the association between levels


Example
12 perfumes described by 39 words:

                     floral  fruity  strong  soft  light  ...
Angel                     2      11      18     3      1  ...
Aromatics Elixir          2       3      29     2      0  ...
Chanel 5                  5       0      19     3      1  ...
Cinéma                   14      14       3    12      9  ...
Coco Mademoiselle        10      10       6    10      7  ...
...                     ...     ...     ...   ...    ...  ...


Intensity of the relationship


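Chi-square:

$$\chi^2 = \sum_{ij} \frac{(n_{ij} - n_{i.} n_{.j}/n)^2}{n_{i.} n_{.j}/n} = n \sum_{ij} \frac{(f_{ij} - f_{i.} f_{.j})^2}{f_{i.} f_{.j}} = n \Phi^2$$

$\Phi^2$ is the intensity of the relationship

Chi-square test (Pearson): under independence, $\chi^2_{obs} \sim \chi^2_{(I-1)(J-1)}$

$\chi^2_{obs} = 615.8$ (p-value = 1.7e-56): highly significant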
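As a minimal sketch of these formulas in base R, the block below computes $\chi^2$, $\Phi^2$ and the p-value on the 5 x 5 excerpt of the perfume table from the Example slide. It assumes the excerpt's counts read row by row (Angel first), so the result will not match the 615.8 obtained on the full table; all variable names are illustrative.

```r
# Excerpt of the perfume table (assumption: values read row by row)
N <- matrix(c( 2, 11, 18,  3, 1,
               2,  3, 29,  2, 0,
               5,  0, 19,  3, 1,
              14, 14,  3, 12, 9,
              10, 10,  6, 10, 7),
            nrow = 5, byrow = TRUE,
            dimnames = list(
              c("Angel", "Aromatics Elixir", "Chanel 5", "Cinema", "Coco Mademoiselle"),
              c("floral", "fruity", "strong", "soft", "light")))

n <- sum(N)

# Counts expected under independence: n_i. * n_.j / n
expected <- outer(rowSums(N), colSums(N)) / n

# Chi-square statistic written with counts
chi2 <- sum((N - expected)^2 / expected)

# Same quantity written with frequencies f_ij = n_ij / n
P <- N / n
indep <- outer(rowSums(P), colSums(P))
phi2 <- sum((P - indep)^2 / indep)   # intensity of the relationship

# Pearson test with (I - 1)(J - 1) degrees of freedom
df <- (nrow(N) - 1) * (ncol(N) - 1)
p_value <- pchisq(chi2, df, lower.tail = FALSE)

c(chi2 = chi2, phi2 = phi2, n_times_phi2 = n * phi2, p_value = p_value)
```

The last line prints `chi2` and `n * phi2` side by side, which should agree since $\chi^2 = n \Phi^2$.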



Nature of the relationship


Contribution to the chi-square:

$$x_{ij}^2 = \frac{(n_{ij} - n_{i.} n_{.j}/n)^2}{n_{i.} n_{.j}/n}$$

Contribution of each cell, contribution of each row, contribution of each column?

Residuals (positive or negative association):

$$x_{ij} = \frac{n_{ij} - n_{i.} n_{.j}/n}{\sqrt{n_{i.} n_{.j}/n}}$$

CA: visualize the residuals matrix X (the gap to independence)
As usual, the association structure of X is revealed using the SVD
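The residuals matrix and its SVD can be sketched in base R as follows. This is an illustration on the 5 x 5 excerpt of the perfume table (assuming its counts read row by row), not the computation done inside any particular package; variable names are illustrative.

```r
# Excerpt of the perfume table (assumption: values read row by row)
N <- matrix(c( 2, 11, 18,  3, 1,
               2,  3, 29,  2, 0,
               5,  0, 19,  3, 1,
              14, 14,  3, 12, 9,
              10, 10,  6, 10, 7),
            nrow = 5, byrow = TRUE,
            dimnames = list(
              c("Angel", "Aromatics Elixir", "Chanel 5", "Cinema", "Coco Mademoiselle"),
              c("floral", "fruity", "strong", "soft", "light")))

n <- sum(N)
expected <- outer(rowSums(N), colSums(N)) / n

# Residuals x_ij: positive = attraction, negative = repulsion
X <- (N - expected) / sqrt(expected)

# Contributions to the chi-square, by cell, row and column
cell_contrib <- X^2
row_contrib  <- rowSums(cell_contrib)
col_contrib  <- colSums(cell_contrib)
chi2 <- sum(cell_contrib)

# SVD of the residuals matrix: the squared singular values add up to chi2
sv <- svd(X)
chi2_from_svd <- sum(sv$d^2)

round(X, 2)
round(sv$d, 4)
```

The smallest singular value is numerically zero, because the residuals satisfy one linear constraint per margin; this is why at most min(I, J) - 1 dimensions carry information.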

Total inertia
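$$\text{Total inertia} = \Phi^2 = \frac{\chi^2}{n} = \frac{\mathrm{trace}(XX')}{n} = \sum_{ij} \frac{(f_{ij} - f_{i.} f_{.j})^2}{f_{i.} f_{.j}}$$

$$\Phi^2 = \sum_{ij} f_{i.} \, \frac{1}{f_{.j}} \left( \frac{f_{ij}}{f_{i.}} - f_{.j} \right)^2$$

Similarly:

$$\Phi^2 = \frac{\chi^2}{n} = \sum_{ij} f_{.j} \, \frac{1}{f_{i.}} \left( \frac{f_{ij}}{f_{.j}} - f_{i.} \right)^2$$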


Total inertia explained via the profile

For the row profile:

$$\Phi^2 = \sum_i \underbrace{f_{i.}}_{\text{weight of row } i} \; \sum_j \underbrace{\frac{1}{f_{.j}}}_{\text{weight for column } j} \left( \frac{f_{ij}}{f_{i.}} - f_{.j} \right)^2$$

Total inertia = weighted sum of the squared distances of the row profiles to the average profile: the weight of a row profile is its mass $f_{i.}$, and the squared distance is a Euclidean distance in which each squared difference is divided by the corresponding average value $f_{.j}$
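The decomposition via row profiles can be checked numerically. The base R sketch below uses the 5 x 5 excerpt of the perfume table (assuming its counts read row by row) and compares the weighted sum of squared distances with the direct formula for $\Phi^2$; names such as `row_mass` are illustrative.

```r
# Excerpt of the perfume table (assumption: values read row by row)
N <- matrix(c( 2, 11, 18,  3, 1,
               2,  3, 29,  2, 0,
               5,  0, 19,  3, 1,
              14, 14,  3, 12, 9,
              10, 10,  6, 10, 7),
            nrow = 5, byrow = TRUE,
            dimnames = list(
              c("Angel", "Aromatics Elixir", "Chanel 5", "Cinema", "Coco Mademoiselle"),
              c("floral", "fruity", "strong", "soft", "light")))

P <- N / sum(N)            # frequencies f_ij
row_mass <- rowSums(P)     # f_i.
col_mass <- colSums(P)     # f_.j (also the average row profile)

# Row profiles f_ij / f_i. : each row sums to 1
row_profiles <- P / row_mass

# Squared distance of each row profile to the average profile,
# each squared difference being divided by f_.j
diff_to_mean <- sweep(row_profiles, 2, col_mass, "-")
d2_to_mean <- rowSums(sweep(diff_to_mean^2, 2, col_mass, "/"))

# Total inertia as a mass-weighted sum of these squared distances
inertia_profiles <- sum(row_mass * d2_to_mean)

# Direct formula for phi2
indep <- outer(row_mass, col_mass)
phi2 <- sum((P - indep)^2 / indep)

c(inertia_profiles = inertia_profiles, phi2 = phi2)
```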

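$\chi^2$-distance

The row i is a point in $\mathbb{R}^J$ (with the weight $f_{i.}$)

$$d^2_{\chi^2}(i, l) = \sum_j \frac{1}{f_{.j}} \left( \frac{f_{ij}}{f_{i.}} - \frac{f_{lj}}{f_{l.}} \right)^2$$

The column j is a point in $\mathbb{R}^I$ (with the weight $f_{.j}$)

$$d^2_{\chi^2}(j, h) = \sum_i \frac{1}{f_{i.}} \left( \frac{f_{ij}}{f_{.j}} - \frac{f_{ih}}{f_{.h}} \right)^2$$

These $\chi^2$-distances enjoy good properties (distributional equivalence principle)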
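To illustrate the distributional equivalence principle, here is a small base R sketch. The helper `chi2_dist_rows` and the 3 x 3 table `N1` are hypothetical, built so that the second column is exactly twice the first (same column profile). Merging these two columns should leave the $\chi^2$-distances between rows unchanged.

```r
# Squared chi-square distances between the rows of a contingency table
# (hypothetical helper, written for this illustration)
chi2_dist_rows <- function(N) {
  P <- N / sum(N)
  row_profiles <- P / rowSums(P)
  # Dividing column j by sqrt(f_.j) turns the chi-square distance
  # into an ordinary Euclidean distance
  scaled <- sweep(row_profiles, 2, sqrt(colSums(P)), "/")
  as.matrix(dist(scaled))^2
}

# Made-up table: column 2 is proportional to column 1
N1 <- matrix(c(10, 20, 5,
                4,  8, 9,
                6, 12, 2),
             nrow = 3, byrow = TRUE)

# Same table after merging the two proportional columns
N2 <- cbind(N1[, 1] + N1[, 2], N1[, 3])

d_before <- chi2_dist_rows(N1)
d_after  <- chi2_dist_rows(N2)

round(d_before, 4)
round(d_after, 4)
```

With an unweighted Euclidean distance on the profiles, the two distance matrices would generally differ; the $1/f_{.j}$ weighting is what makes the merge harmless.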



CA: different formulations

In factorial analysis such as PCA, we look for the dimensions that best represent the variability between individuals (i.e. the distance to the barycenter); in CA, we look for the dimensions that best represent the gap to independence.

Classical presentation of CA: two weighted PCAs, one on the "row profiles" and one on the "column profiles"

Canonical analysis on the two indicator matrices


Link between the two representations: transition formulae

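$$F_s(i) = \frac{1}{\sqrt{\lambda_s}} \sum_j \frac{f_{ij}}{f_{i.}} \, G_s(j)$$

Row i is at the barycenter of the weighted columns (with a scale $1/\sqrt{\lambda_s}$)

$$G_s(k) = \frac{1}{\sqrt{\lambda_s}} \sum_i \frac{f_{ik}}{f_{.k}} \, F_s(i)$$

Column k is at the barycenter of the weighted rows (with a scale $1/\sqrt{\lambda_s}$)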
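The transition formulae can be verified numerically. The base R sketch below is one possible way to obtain the coordinates, through the SVD of the standardized residuals, and is not necessarily how a given package computes them. It uses the 5 x 5 excerpt of the perfume table (assuming its counts read row by row); `Fs` and `Gs` stand for the row and column coordinates $F_s$ and $G_s$.

```r
# Excerpt of the perfume table (assumption: values read row by row)
N <- matrix(c( 2, 11, 18,  3, 1,
               2,  3, 29,  2, 0,
               5,  0, 19,  3, 1,
              14, 14,  3, 12, 9,
              10, 10,  6, 10, 7),
            nrow = 5, byrow = TRUE,
            dimnames = list(
              c("Angel", "Aromatics Elixir", "Chanel 5", "Cinema", "Coco Mademoiselle"),
              c("floral", "fruity", "strong", "soft", "light")))

P <- N / sum(N)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# Standardized residuals and their SVD
indep <- outer(row_mass, col_mass)
S <- (P - indep) / sqrt(indep)
sv <- svd(S)

K <- min(dim(N)) - 1        # number of non-trivial dimensions
d <- sv$d[1:K]              # sqrt(lambda_s)

# Coordinates of the rows (Fs) and of the columns (Gs)
Fs <- sweep(sv$u[, 1:K], 1, sqrt(row_mass), "/") %*% diag(d)
Gs <- sweep(sv$v[, 1:K], 1, sqrt(col_mass), "/") %*% diag(d)

# Transition formulae: barycenter of the other cloud, rescaled by 1/sqrt(lambda_s)
row_profiles <- P / row_mass          # f_ij / f_i.
col_profiles <- t(P) / col_mass       # f_ij / f_.j, one column profile per row
F_from_G <- (row_profiles %*% Gs) %*% diag(1 / d)
G_from_F <- (col_profiles %*% Fs) %*% diag(1 / d)

max(abs(Fs - F_from_G))
max(abs(Gs - G_from_F))
```

Both maxima should be at rounding-error level, meaning each cloud can be rebuilt from the other.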



Graphical representation
Figure: CA factor map showing the perfumes and a selection of descriptors (e.g. vanilla, sugary, fruity, floral, strong, old, soap). Axes: Dim 1 (60.46%), Dim 2 (21.12%).


Graphical representation
Figure: CA factor map showing the perfumes only. Axes: Dim 1 (60.46%), Dim 2 (21.12%).


Graphical representation
Figure: CA factor map showing the perfumes and a larger set of descriptors (e.g. candy, oriental, musky, amber, heavy, shampoo). Axes: Dim 1 (60.46%), Dim 2 (21.12%).


Graphical representation in CA: remarks

The barycenter represents independence
The distance between levels of the same variable can be interpreted
The representations provided are pseudo-barycentric (dilatation): transition formulae
It is not possible to interpret the distance between levels of the two variables, but ...
... each level is at a weighted barycenter of all the levels of the other variable


Helps to interpret
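Supplementary information can be added (zero weight)!
Percentage of variance for each axis: information brought by the dimension, $\lambda_s / \sum_s \lambda_s$, but it is also interesting to have a look at the eigenvalues themselves.
$\lambda_s$ is always smaller than or equal to 1; the value 1 is obtained for exclusive associations.

library(FactoMineR)
don <- diag(5)           # exclusive associations: each row goes with a single column
a <- CA(don)
a$eig
don <- matrix(1, 5, 5)   # independence: all row profiles are identical
a <- CA(don)
a$eig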



Helps to interpret

The maximum number of axes is $\min(I, J) - 1$

Quality of the representation: $\cos^2$

Contribution: inertia of a point / total inertia (on dimension $s$)

$$\mathrm{Ctr}_s(i) = \frac{f_{i.} \, F_s(i)^2}{\lambda_s}$$

Be careful, extreme points are not necessarily those which contribute the most to the dimension
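As a hedged illustration of these two indicators, the base R sketch below computes contributions and $\cos^2$ by hand for the rows of the 5 x 5 excerpt of the perfume table (assuming its counts read row by row). Coordinates are obtained from the SVD of the standardized residuals, which is one possible route; contributions are expressed here as proportions summing to 1, and all names are illustrative.

```r
# Excerpt of the perfume table (assumption: values read row by row)
N <- matrix(c( 2, 11, 18,  3, 1,
               2,  3, 29,  2, 0,
               5,  0, 19,  3, 1,
              14, 14,  3, 12, 9,
              10, 10,  6, 10, 7),
            nrow = 5, byrow = TRUE,
            dimnames = list(
              c("Angel", "Aromatics Elixir", "Chanel 5", "Cinema", "Coco Mademoiselle"),
              c("floral", "fruity", "strong", "soft", "light")))

P <- N / sum(N)
row_mass <- rowSums(P)
col_mass <- colSums(P)
indep <- outer(row_mass, col_mass)
sv <- svd((P - indep) / sqrt(indep))

K <- min(dim(N)) - 1                 # maximum number of axes
lambda <- sv$d[1:K]^2                # eigenvalues

# Row coordinates F_s(i)
Fs <- sweep(sv$u[, 1:K], 1, sqrt(row_mass), "/") %*% diag(sv$d[1:K])
rownames(Fs) <- rownames(N)

# Contribution of row i to dimension s: f_i. * F_s(i)^2 / lambda_s
contrib <- sweep(row_mass * Fs^2, 2, lambda, "/")

# Quality of representation: share of the squared distance to the
# barycenter that is carried by dimension s
cos2 <- Fs^2 / rowSums(Fs^2)

# The most extreme row on dimension 1 versus the row contributing most to it
most_extreme <- names(which.max(abs(Fs[, 1])))
top_contrib  <- names(which.max(contrib[, 1]))
c(most_extreme = most_extreme, top_contrib = top_contrib)
```

Because the contribution multiplies the squared coordinate by the mass $f_{i.}$, a heavy row close to the origin can outweigh a light row far from it, which is the point of the warning about extreme points.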


Practice

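library(FactoMineR)
# Tab-separated contingency table; the first column holds the perfume names
perfume <- read.table("perfume.txt", header = TRUE, sep = "\t", row.names = 1)
# Columns 16 to 39 are treated as supplementary columns
res.ca <- CA(perfume, col.sup = 16:39)
plot(res.ca, invisible = "row")                 # hide the rows
plot(res.ca, invisible = c("col", "col.sup"))   # hide the columns
res.ca$eig
barplot(res.ca$eig[, 1], main = "Eigenvalues", names.arg = 1:nrow(res.ca$eig))
res.ca$row$coord
res.ca$row$cos2
res.ca$row$contrib
res.ca$col$coord
res.ca$col$cos2
res.ca$col$contrib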
