
Multivariate Data Analysis Using SPSS Lesson 2

LESSON 2:
DISCRIMINANT ANALYSIS


MULTIPLE DISCRIMINANT ANALYSIS (MDA)

In multiple linear regression, the objective is to model one quantitative variable (called the
dependent variable) as a linear combination of other variables (called the independent
variables). The purpose of discriminant analysis is to obtain a model that predicts a single
qualitative variable from one or more independent variables. In most cases the dependent
variable consists of two groups or classifications, such as high versus normal blood pressure,
loan defaulting versus non-defaulting, or use versus non-use of internet banking. The choice
between three candidates, A, B or C, in an election is an example where the dependent
variable consists of more than two groups.

Discriminant analysis derives an equation as a linear combination of the independent variables
that will discriminate best between the groups in the dependent variable. This linear
combination is known as the discriminant function. The weights assigned to each
independent variable are corrected for the interrelationships among all the variables. The
weights are referred to as discriminant coefficients.

The model

Independent variables (quantitative): X1, X2, ..., Xp
Dependent variable (qualitative): Y

The discriminant equation:

$F = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$

where F is a latent variable formed by the linear combination of the independent variables,
$X_1, X_2, \dots, X_p$ are the p independent variables, $\varepsilon$ is the error term and
$\beta_0, \beta_1, \beta_2, \dots, \beta_p$ are the discriminant coefficients.

The objective of discriminant analysis is to test whether the classification of groups in a variable Y
depends on at least one of the Xi's.
In terms of hypotheses, it can be written as:

H0: Y does not depend on any of the Xi's.
Ha: Y depends on at least one of the Xi's.

OR simply, H0: $\beta_i = 0$ for i = 1, 2, ..., p versus Ha: $\beta_i \neq 0$ for at least one i.


Assumptions

The variables X1, X2, ..., Xp are independent of each other.
Groups are mutually exclusive and the group sizes are not grossly different.
The number of independent variables is at most the sample size minus two.
The variance-covariance structure of the independent variables is similar within each
group of the dependent variable.
Errors (residuals) are randomly distributed.
For purposes of significance testing, the independent variables follow a multivariate
normal distribution.

There are several purposes for MDA:

To investigate differences among groups.
To determine the most parsimonious way to distinguish among groups.
To discard variables which are little related to group distinction.
To classify cases into groups.
To test theory by observing whether cases are classified as predicted.

If you have collected a large number of independent variables and want to select a useful
subset for predicting the dependent variable, use Multiple Discriminant Analysis with
selection methods.


Key Concepts and Terms

Discriminant function

The number of functions computed is one less than the number of groups in the dependent
variable. That is, for two groups there is one function, for three groups two functions, and so on.
When there are two functions, the first function maximizes the differences between the
groups in the dependent variable. The second function is orthogonal to the first (uncorrelated
with it) and maximizes the differences between the groups in the dependent variable,
controlling for the first function. Though mathematically different, each discriminant function
is a dimension which differentiates a case into groups in the dependent variable based on its
values on the independent variables. In discriminant analysis, the first function will be the
most powerful in differentiating the groups, and the subsequent functions may or may not
represent additional significant differentiation.

Discriminant Coefficient

The discriminant function coefficients are partial coefficients that reflect the unique
contribution of each variable to the classification of the groups in the dependent variable. A
discriminant score, which is a value of the latent variable, can be obtained for each case by applying
the coefficients to the values of the respective independent variables. The standardized
discriminant coefficients, like beta weights in regression, are used to assess the relative
classifying importance of the independent variables. Structure coefficients are the
correlations between a given independent variable and the discriminant scores. The higher the
value, the stronger the association between the independent variable and the discriminant
function. Looking at all the structure coefficients for a function allows the researcher to
assign a label to the dimension it measures.

Group centroid

Group centroids are the mean discriminant scores for each group in the dependent variable
for each of the discriminant functions. For two groups in the dependent variable there is a
single discriminant function. The centroids are in a unidimensional space, one center for each
group. For three groups in the dependent variable there are two discriminant functions.
Hence, the centroids are in a two-dimensional space. By connecting the centroids, a canonical
plot can be created depicting the discriminant function space.

Eigenvalue

The eigenvalue, also called the characteristic root, is the ratio between the explained and
unexplained variation in a model. For a good model the eigenvalue should be more than one.
In discriminant analysis there is one eigenvalue for each discriminant function. The bigger
the eigenvalue, the stronger the discriminating power of the function. In an analysis with
three groups, the ratio between the two eigenvalues indicates the relative discriminating power of
one discriminant function over the other. For example, if the ratio of the two eigenvalues is
1.6, the first discriminant function accounts for 60% more of the between-group variance for
the three groups in the dependent variable than the second discriminant function.
Relative percentage of a discriminant function is the function's eigenvalue divided by the sum


of all eigenvalues of all discriminant functions in the model. It represents the percent of
discriminating power for the model associated with a given discriminant function. Usually,
the relative percentage of the first function will be high. If the values for the subsequent
functions are small, then a single function is as good as two or more functions in the
classification.
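
The ratio and the relative percentage described above can be sketched in a few lines of Python. The two eigenvalues below are hypothetical, chosen only so that their ratio is 1.6 as in the example in the text; they do not come from any SPSS output in this lesson.

```python
def relative_percentages(eigenvalues):
    """Share (in percent) of the total discriminating power per function."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]

# Hypothetical eigenvalues for a three-group analysis (two functions).
eigenvalues = [0.8, 0.5]

ratio = eigenvalues[0] / eigenvalues[1]
shares = relative_percentages(eigenvalues)

print(f"ratio of eigenvalues = {ratio:.1f}")
print("relative percentages =", [round(s, 1) for s in shares])
```

With these illustrative values the first function carries a little over 60% of the discriminating power, which is why the second function would deserve a closer look before being dropped.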

Canonical correlation

The canonical correlation is a measure of the association between the groups in the dependent
variable and the discriminant function. A high value implies a high level of association
between the two and vice-versa.

Wilks' lambda

In discriminant analysis, Wilks' lambda is used to test the significance of the discriminant
functions. Mathematically, it is one minus the proportion of explained variation (that is, the
proportion of unexplained variation), and its value ranges from 0 to 1. Unlike the F-statistic
in linear regression, a small value of lambda for a function indicates that the function is
significant.
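
As a rough illustration of how the eigenvalue, the canonical correlation and Wilks' lambda are tied together, the sketch below uses the standard identities rc = sqrt(eigenvalue / (1 + eigenvalue)) and lambda = product of 1 / (1 + eigenvalue) over the functions being tested. The eigenvalue 1.158 is the one reported for Example 1 later in this lesson; the function names are made up for this sketch.

```python
import math

def canonical_correlation(eigenvalue):
    """Canonical correlation implied by one function's eigenvalue."""
    return math.sqrt(eigenvalue / (1.0 + eigenvalue))

def wilks_lambda(eigenvalues):
    """Wilks' lambda for a set of functions: product of 1 / (1 + eigenvalue)."""
    lam = 1.0
    for ev in eigenvalues:
        lam *= 1.0 / (1.0 + ev)
    return lam

eigenvalue = 1.158  # Function 1 in Example 1
rc = canonical_correlation(eigenvalue)
lam = wilks_lambda([eigenvalue])

print(f"canonical correlation = {rc:.3f}")
print(f"Wilks' lambda = {lam:.3f}")
```

For a single function, lambda equals one minus the squared canonical correlation, which matches the description of lambda as the unexplained share of variation.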

Classification matrix

The classification matrix is a simple cross tabulation of the observed and predicted
memberships. For a good prediction, the values in the diagonal must be high and the values
off the diagonal must be close to 0.

Box's M

As in other multivariate analyses, Box's M tests the assumption of equality of
variance-covariance matrices across the groups. A large Box's M, indicated by a small p-value,
signals a violation of this assumption. However, when the sample size is big, Box's M is
usually large. In such situations, the natural logarithms of the determinants of the
variance-covariance matrices for the groups are compared.

Sample size

As a rule, the sample size of the smallest group should exceed the number of independent
variables. Though the general agreement is that there should be at least 5 cases for each
independent variable, it is best to model with at least 20 cases for each independent variable.


Example 1: Loan Defaulting

Variables: Age, Income (Annual) and Default (1: Yes, 2: No)


Model: $F = \beta_0 + \beta_1(\text{Age}) + \beta_2(\text{Income}) + \varepsilon$

Customer   Age   Income ($'000)   Default   MAH_1
1          54    72               No        2.44
2          50    72               No        1.75
3          48    56               No        1.32
4          44    72               No        1.51
5          47    64               No        0.74
6          35    64               No        0.97
7          36    56               No        0.05
8          52    56               Yes       2.59
9          23    56               Yes       2.97
10         40    45               Yes       1.37
11         34    56               Yes       0.21
12         26    29               Yes       4.22
13         25    56               Yes       2.23
14         27    56               Yes       1.60
15         29    30               Yes       4.04

Outlier detection

For 2 degrees of freedom and $\alpha = 0.05$, the chi-square threshold value is 5.991.
The largest Mahalanobis distance is D² = 4.22 (customer 12), which is less than 5.991, so there are no major outliers.
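
The outlier check can be reproduced outside SPSS. The sketch below is a plain-Python illustration, assuming that MAH_1 is the squared Mahalanobis distance of each case from the overall mean, computed with the total sample covariance matrix (divisor n - 1). For 2 degrees of freedom the chi-square critical value has the closed form -2 ln(alpha), so no statistics library is needed.

```python
import math

# (age, income in $'000) for customers 1 to 15, in table order.
cases = [(54, 72), (50, 72), (48, 56), (44, 72), (47, 64), (35, 64), (36, 56),
         (52, 56), (23, 56), (40, 45), (34, 56), (26, 29), (25, 56), (27, 56),
         (29, 30)]

n = len(cases)
mean_age = sum(a for a, _ in cases) / n
mean_inc = sum(i for _, i in cases) / n

# Total-sample variances and covariance (divisor n - 1).
s_aa = sum((a - mean_age) ** 2 for a, _ in cases) / (n - 1)
s_ii = sum((i - mean_inc) ** 2 for _, i in cases) / (n - 1)
s_ai = sum((a - mean_age) * (i - mean_inc) for a, i in cases) / (n - 1)
det = s_aa * s_ii - s_ai ** 2

def mahalanobis_sq(age, income):
    """d' S^-1 d with the 2x2 inverse written out explicitly."""
    da, di = age - mean_age, income - mean_inc
    return (s_ii * da * da - 2.0 * s_ai * da * di + s_aa * di * di) / det

d2 = [mahalanobis_sq(a, i) for a, i in cases]

# Chi-square upper-tail critical value; this closed form holds for 2 df only.
threshold = -2.0 * math.log(0.05)
outliers = [k + 1 for k, v in enumerate(d2) if v > threshold]

print("largest D2 =", round(max(d2), 2), "at customer", d2.index(max(d2)) + 1)
print("threshold =", round(threshold, 3), "outliers:", outliers)
```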

Descriptive summary
Group Statistics

Variable   Default   N   Mean    Std. Deviation   Std. Error Mean
Age        Yes       8   32.00    9.769           3.454
           No        7   44.86    7.081           2.676
Income     Yes       8   48.00   12.036           4.255
           No        7   65.14    7.198           2.721

Independent Samples Test

Levene's Test for Equality of Variances
Variable   F       Sig.
Age         .589   .456
Income     2.914   .112

t-test for Equality of Means
Variable   Assumption                    t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Age        Equal variances assumed       -2.878   13       .013              -12.857           4.468                   -22.510        -3.205
Age        Equal variances not assumed   -2.943   12.621   .012              -12.857           4.369                   -22.326        -3.389
Income     Equal variances assumed       -3.281   13       .006              -17.143           5.225                   -28.430        -5.855
Income     Equal variances not assumed   -3.394   11.626   .006              -17.143           5.051                   -28.187        -6.099

(95% CI = 95% confidence interval of the difference.)

Test of equality of means


The p-values for both Age and Income are less than 0.05. Thus, there is a significant
difference in Age and Income between the defaulters and the non-defaulters.
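
To see where the "equal variances assumed" t values come from, here is a minimal sketch that recomputes the pooled two-sample t statistic from the group summaries (mean, standard deviation, N) reported above. The function name is illustrative; small differences in the last decimal are expected because the inputs are rounded.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with a pooled variance; returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_error, df

# Group 1 = defaulters (Yes), group 2 = non-defaulters (No).
t_age, df_age = pooled_t(32.00, 9.769, 8, 44.86, 7.081, 7)
t_income, df_income = pooled_t(48.00, 12.036, 8, 65.14, 7.198, 7)

print(f"Age:    t = {t_age:.3f}, df = {df_age}")
print(f"Income: t = {t_income:.3f}, df = {df_income}")
```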


Scatter diagram

[Figure: scatter plot of Income (vertical axis, 20 to 80) versus Age (horizontal axis, 20 to 60), with markers set by Default (Yes, No).]


To Obtain Discriminant Analysis in SPSS

From the menus choose:

Analyze > Classify > Discriminant...

Grouping Variable: Default (Define Range...)
Independents: Age, Income
Sub-dialogs: Statistics..., Classify..., Save...


Output

Descriptive statistics and test of means


Group Statistics

Default   Variable   Mean    Std. Deviation   Valid N (listwise): Unweighted   Weighted
Yes       Age        32.00    9.769           8                                 8.000
          Income     48.00   12.036           8                                 8.000
No        Age        44.86    7.081           7                                 7.000
          Income     65.14    7.198           7                                 7.000
Total     Age        38.00   10.644           15                               15.000
          Income     56.00   13.153           15                               15.000

Tests of Equality of Group Means

Variable   Wilks' Lambda   F        df1   df2   Sig.
Age        .611             8.281   1     13    .013
Income     .547            10.766   1     13    .006

On average, the age and income of the defaulters are lower than those of the non-defaulters. The
p-values for the tests of equality of means are both less than 0.05.
Thus, both age and income may be important discriminators of the defaulting groups.
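
The univariate Wilks' lambda and F values in this table are two views of the same test. For one variable, lambda is the within-group share of the total sum of squares, so lambda = 1 / (1 + F x df1 / df2); with two groups, F is also the square of the pooled t statistic. The short sketch below checks this against the reported figures (the function name is illustrative).

```python
def univariate_wilks_lambda(f_stat, df1, df2):
    """Wilks' lambda for one variable, recovered from its F statistic."""
    return 1.0 / (1.0 + f_stat * df1 / df2)

lambda_age = univariate_wilks_lambda(8.281, 1, 13)
lambda_income = univariate_wilks_lambda(10.766, 1, 13)

# With two groups, F should be close to the squared pooled t statistic.
f_from_t_age = (-2.878) ** 2
f_from_t_income = (-3.281) ** 2

print(f"Age:    lambda = {lambda_age:.3f}, t squared = {f_from_t_age:.3f}")
print(f"Income: lambda = {lambda_income:.3f}, t squared = {f_from_t_income:.3f}")
```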

Covariance Matrices (a)

Default   Variable   Age       Income
Yes       Age         95.429    21.714
          Income      21.714   144.857
No        Age         50.143    25.524
          Income      25.524    51.810
Total     Age        113.286    80.571
          Income      80.571   173.000
a. The total covariance matrix has 14 degrees of freedom.

The diagonal entries are variances and the off-diagonal entries are covariances.

Determinant for the defaulting group (Default = Yes):
|D| = (95.429 × 144.857) - (21.714)² = 13352.06
ln|D| = 9.499

Box's Test of Equality of Covariance Matrices

Log Determinants
Default                Rank   Log Determinant
Yes                    2      9.499
No                     2      7.574
Pooled within-groups   2      8.860
The ranks and natural logarithms of determinants printed are those of the group covariance matrices.

Test Results
Box's M         3.241
F   Approx.      .899
    df1         3
    df2         121191.4
    Sig.         .441
Tests null hypothesis of equal population covariance matrices.

The p-value for Box's M is more than 0.05. Thus, equality of the variance-covariance matrices can be
assumed. The log determinant values are quite close to each other.
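
The log determinants and Box's M can be recomputed from the two group covariance matrices. The sketch below assumes the usual definition M = (N - g) ln|S_pooled| - sum of (n_i - 1) ln|S_i|, where S_pooled is the within-groups covariance matrix weighted by the group degrees of freedom. It reproduces only M, not the F approximation printed by SPSS.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Group size and covariance matrix (Age, Income) from the output above.
groups = {
    "Yes": (8, [[95.429, 21.714], [21.714, 144.857]]),
    "No": (7, [[50.143, 25.524], [25.524, 51.810]]),
}

total_df = sum(n - 1 for n, _ in groups.values())

# Pooled within-groups covariance matrix, weighted by (n_i - 1).
pooled = [[sum((n - 1) * s[r][c] for n, s in groups.values()) / total_df
           for c in range(2)] for r in range(2)]

log_dets = {name: math.log(det2(s)) for name, (n, s) in groups.items()}
log_det_pooled = math.log(det2(pooled))

box_m = total_df * log_det_pooled - sum(
    (n - 1) * log_dets[name] for name, (n, _) in groups.items())

print({k: round(v, 3) for k, v in log_dets.items()})
print("pooled log determinant =", round(log_det_pooled, 3))
print("Box's M =", round(box_m, 3))
```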

Summary of Canonical Discriminant Functions

Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          1.158 (a)    100.0           100.0          .733
a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .463            9.229        2    .010

There are two groups. Therefore, the number of functions = 1.
The eigenvalue is 1.158 (>1). Canonical correlation, rc = 0.733 (>0.35).
Wilks' lambda = 0.463, p-value = 0.010 (<0.05). Thus, Function 1 explains the variation well.
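
The chi-square value attached to Wilks' lambda can be approximated by hand. The sketch below assumes Bartlett's approximation, chi-square = -(N - 1 - (p + g) / 2) ln(lambda), with N = 15 cases, p = 2 predictors and g = 2 groups. Because the test here has 2 degrees of freedom, the upper-tail p-value has the closed form exp(-chi-square / 2); other degrees of freedom would need a statistics library.

```python
import math

def bartlett_chi_square(wilks_lambda, n_cases, n_predictors, n_groups):
    """Bartlett's chi-square approximation for Wilks' lambda."""
    multiplier = n_cases - 1 - (n_predictors + n_groups) / 2.0
    return -multiplier * math.log(wilks_lambda)

# Lambda recovered from the reported eigenvalue to limit rounding error.
wilks = 1.0 / (1.0 + 1.158)
chi_square = bartlett_chi_square(wilks, n_cases=15, n_predictors=2, n_groups=2)

# Exact upper-tail probability of a chi-square variable with 2 df only.
p_value = math.exp(-chi_square / 2.0)

print(f"chi-square = {chi_square:.3f}, p-value = {p_value:.3f}")
```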


The function

Canonical Discriminant Function Coefficients (unstandardized)
              Function 1
Age           .064
Income        .069
(Constant)    -6.303

Standardized Canonical Discriminant Function Coefficients
              Function 1
Age           .554
Income        .696

Structure Matrix (correlation between Income, Age and F)
              Function 1
Income        .846
Age           .742

F = -6.303 + 0.064(Age) + 0.069(Income)
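
The standardized coefficients are related to the unstandardized ones through the pooled within-group standard deviations. The sketch below assumes the usual relation, standardized = unstandardized x pooled within-group SD, and takes the group variances from the covariance matrices reported earlier. Because the unstandardized coefficients are rounded to three decimals, agreement is only approximate.

```python
import math

def pooled_variance(var1, n1, var2, n2):
    """Within-groups variance pooled over two groups."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

# Group variances for Default = Yes (n = 8) and Default = No (n = 7).
sd_age = math.sqrt(pooled_variance(95.429, 8, 50.143, 7))
sd_income = math.sqrt(pooled_variance(144.857, 8, 51.810, 7))

std_age = 0.064 * sd_age
std_income = 0.069 * sd_income

print(f"standardized Age coefficient    ~ {std_age:.3f}")
print(f"standardized Income coefficient ~ {std_income:.3f}")
```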

Centroids

Functions at Group Centroids
Default   Function 1
Yes       -.937
No        1.071
Unstandardized canonical discriminant functions evaluated at group means.

Classification Function Coefficients
              Default
              Yes        No
Age           .303       .432
Income        .401       .540
(Constant)    -15.170    -27.960
Fisher's linear discriminant functions.

Between -0.937 and 1.071, the midpoint is 0.067.

F:   -0.937 (Defaulters)   |   0.067 (cut-off)   |   1.071 (Non-Defaulters)
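
Fisher's classification functions offer a second route to the same decision: compute one score per group and assign the case to the group with the larger score. The sketch below uses the classification function coefficients reported above; the function names and the five test cases are illustrative.

```python
# Fisher's linear classification functions: (age coef, income coef, constant).
CLASSIFICATION_FUNCTIONS = {
    "Yes": (0.303, 0.401, -15.170),
    "No": (0.432, 0.540, -27.960),
}

def classification_scores(age, income):
    """Score of a case on each group's classification function."""
    return {group: b_age * age + b_inc * income + const
            for group, (b_age, b_inc, const) in CLASSIFICATION_FUNCTIONS.items()}

def classify_by_fisher(age, income):
    """Assign the case to the group with the highest classification score."""
    scores = classification_scores(age, income)
    return max(scores, key=scores.get)

cases = [(30, 40), (40, 40), (30, 60), (40, 60), (50, 60)]
predicted = [classify_by_fisher(age, income) for age, income in cases]

for (age, income), group in zip(cases, predicted):
    print(f"Age = {age}, Income = {income}: Default = {group}")
```

For two groups, comparing the two scores gives the same assignment as comparing F with the midpoint between the centroids.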

Classification results

Classification Results (b, c)

                                   Predicted Group Membership
                         Default   Yes     No      Total
Original          Count  Yes       7       1       8
                         No        1       6       7
                  %      Yes       87.5    12.5    100.0
                         No        14.3    85.7    100.0
Cross-validated   Count  Yes       7       1       8
(a)                      No        1       6       7
                  %      Yes       87.5    12.5    100.0
                         No        14.3    85.7    100.0
a. Cross validation is done only for those cases in the analysis. In
cross validation, each case is classified by the functions derived
from all cases other than that case.
b. 86.7% of original grouped cases correctly classified.
c. 86.7% of cross-validated grouped cases correctly classified.
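
The percentages in the classification table follow directly from the counts. A minimal sketch, using the original (not cross-validated) counts and an illustrative dictionary layout of observed group by predicted group:

```python
# Observed group -> predicted group -> count (original classification).
confusion = {
    "Yes": {"Yes": 7, "No": 1},
    "No": {"Yes": 1, "No": 6},
}

correct = sum(confusion[group][group] for group in confusion)
total = sum(sum(row.values()) for row in confusion.values())
hit_rate = 100.0 * correct / total

# Percentage of each observed group that was classified correctly.
per_group = {group: 100.0 * confusion[group][group] / sum(confusion[group].values())
             for group in confusion}

print(f"overall: {hit_rate:.1f}% correctly classified")
print({group: round(pct, 1) for group, pct in per_group.items()})
```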

Classification:

Age = 30, Income = 40: F = -6.303 + 0.064(30) + 0.069(40) = -1.617 < 0.067 → Yes
Age = 40, Income = 40: F = -6.303 + 0.064(40) + 0.069(40) = -0.975 < 0.067 → Yes
Age = 30, Income = 60: F = -6.303 + 0.064(30) + 0.069(60) = -0.238 < 0.067 → Yes
Age = 40, Income = 60: F = -6.303 + 0.064(40) + 0.069(60) = 0.404 > 0.067 → No
Age = 50, Income = 60: F = -6.303 + 0.064(50) + 0.069(60) = 1.046 > 0.067 → No
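
The cut-off rule used in these worked cases can be sketched as a small function. It uses the rounded coefficients printed in the lesson, so the scores may differ from the values shown above in the second or third decimal (presumably because SPSS works with unrounded coefficients); the resulting classifications are the same. The names below are illustrative.

```python
CUTOFF = 0.067  # midpoint between the group centroids -0.937 and 1.071

def discriminant_score(age, income):
    """Unstandardized discriminant function with the rounded coefficients."""
    return -6.303 + 0.064 * age + 0.069 * income

def classify(age, income):
    """Scores below the cut-off fall on the defaulters' side (Default = Yes)."""
    return "Yes" if discriminant_score(age, income) < CUTOFF else "No"

cases = [(30, 40), (40, 40), (30, 60), (40, 60), (50, 60)]
results = [(age, income, discriminant_score(age, income), classify(age, income))
           for age, income in cases]

for age, income, score, group in results:
    print(f"Age = {age}, Income = {income}: F = {score:.3f} -> Default = {group}")
```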


Example 2:

Data: Lesson 2 Blood Pressure categorized.sav


A researcher wanted to study the relationship of blood pressure (BP) status (Normal, High)
with four other variables: Age, Weight, Body Surface Area (BSA) and Pulse. He recruited
110 adults and recorded their BP, Age, Weight, BSA and Pulse values. He recoded the SBP
values into: Normal (<140) and High (> 140)

Data: Blood Pressure.sav

No.   BP       Age   Weight   BSA    Pulse
1     High     40    74       2.43   66
2     Normal   27    75       2.51   65
3     Normal   36    68       2.25   78
4     Normal   41    71       2.36   71
:     :        :     :        :      :
110   High     44    73       2.42   66

Objective: To identify the significant determinants of BP status among Age, Weight, BSA
and Pulse.

Model: $F = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Weight} + \beta_3\,\text{BSA} + \beta_4\,\text{Pulse} + \varepsilon$

Output

Descriptive statistics and test of means

The p-values for all variables except Pulse are below 0.05.
Thus, Age, Weight and BSA may be significant determinants of BP status.

Box's Test of Equality of Covariance Matrices

The p-value for Box's M is > 0.05.
Equality of the variance-covariance matrices can be assumed.


Summary of Canonical Discriminant Functions

There are two groups. Therefore, the number of functions = 1.
The eigenvalue is 1.021 (>1). Canonical correlation, rc = 0.711 (>0.35).
Wilks' lambda = 0.495, p-value < 0.001. Thus, Function 1 explains the variation well.

The function

(Structure matrix: correlation between the determinants and F.)

The discriminant function: F = -22.368 + 0.195(Age) + 0.352(Weight) - 3.842(BSA) - 0.018(Pulse)

Centroids

Between -0.613 and 1.635, the midpoint is 0.511.

F:   -0.613 (Normal)   |   0.511 (cut-off)   |   1.635 (High)

Classification:

Age = 60, Weight = 90, BSA = 2.0, Pulse = 70:

F = -22.368 + 0.195(60) + 0.352(90) - 3.842(2) - 0.018(70) = 12.03 > 0.511 → High BP


Results from Stepwise Method (Method: Wilks' Lambda)

Summary of Canonical Discriminant Functions

There are two groups. Therefore, the number of functions = 1.
The eigenvalue is 0.981 (<1). Canonical correlation, rc = 0.704 (>0.35).
Wilks' lambda = 0.505, p-value < 0.001. Thus, Function 1 explains the variation well.

The function

F = -23.894 + 0.197 (Age) + 0.226(Weight)

Centroids Classification results

Between -0.601 and 1.603, the midpoint is 0.501.

F:   -0.601 (Normal)   |   0.501 (cut-off)   |   1.603 (High)

Classification:

Age = 60, Weight = 90, BSA = 2.0, Pulse = 70 (only Age and Weight enter the stepwise function):

F = -23.894 + 0.197(60) + 0.226(90) = 8.29 > 0.501 → High BP
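
The full and stepwise models can be compared on the same case with a short sketch. It assumes that the BSA and Pulse terms of the full function are both negative, which is the sign pattern that reproduces the worked value of about 12.03. Scores are computed with the rounded coefficients, so they differ slightly from the printed values; the function names are illustrative.

```python
def full_model_score(age, weight, bsa, pulse):
    """Four-predictor function (BSA and Pulse signs assumed negative)."""
    return -22.368 + 0.195 * age + 0.352 * weight - 3.842 * bsa - 0.018 * pulse

def stepwise_score(age, weight):
    """Two-predictor function retained by the stepwise method."""
    return -23.894 + 0.197 * age + 0.226 * weight

def bp_status(score, cutoff):
    """Scores above the cut-off fall on the High BP side."""
    return "High" if score > cutoff else "Normal"

full = full_model_score(age=60, weight=90, bsa=2.0, pulse=70)
step = stepwise_score(age=60, weight=90)

print(f"full model:     F = {full:.2f} -> {bp_status(full, 0.511)}")
print(f"stepwise model: F = {step:.2f} -> {bp_status(step, 0.501)}")
```

Both models place this case well beyond the High centroid, so dropping BSA and Pulse does not change its classification.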


Example 3: Three groups

Data: Lesson 2 Voters.sav

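In this example, since there are three groups in the dependent variable (candidate), there are
two functions:

$F_1 = \beta_{01} + \beta_{11}\,\text{Age} + \beta_{21}\,\text{Education} + \varepsilon$
$F_2 = \beta_{02} + \beta_{12}\,\text{Age} + \beta_{22}\,\text{Education} + \varepsilon$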

Output

Voters who are younger and have fewer years of education seem to prefer candidate A.
Voters who are older favor either candidate B or candidate C. Among them, those with more years of
education prefer candidate B.

The p-values for both Age and Education are less than 0.05. Perhaps both Age and Education
are significant predictors.

Though Box's M is significant, the log determinant values are quite close.


The eigenvalue and canonical correlation values for the first function are much higher than those for the second
function. It appears that the first function is sufficient to differentiate the choice of candidates.

In the first row (1 through 2) Wilks' lambda is significant, but not in the second row (2). This
means that, over and above the first function, the second function does not contribute much.

The functions are:
F1 = -8.401 + 0.092(Age) + 0.398(Education)
F2 = -0.643 - 0.059(Age) + 0.267(Education)
Based on the standardized coefficients, both are equally important.

The group centroids for the candidates are: A(-1.164, 0.018), B(0.639, 0.102) and C(0.428, -0.196)

In the diagram above, the range on the vertical axis is small. Hence, F2 does not make much difference.
Only F1, the horizontal axis, is important for differentiation.


Based on the information from the table above, the classification is not that good.

Classification:
Age = 50, Education = 15:

F1 = -8.401 + 0.092(50) + 0.398(15) = 2.14

F2 = -0.643 - 0.059(50) + 0.267(15) = 0.43

For (2.14, 0.43), the nearest centroid is that of candidate B
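
With more than one function, a case is assigned to the group whose centroid lies closest in the discriminant space. The sketch below assumes plain Euclidean distance between the pair of scores (F1, F2) and each centroid, with no adjustment for prior probabilities. It uses the rounded coefficients, so the scores differ slightly from (2.14, 0.43); the nearest centroid is the same.

```python
import math

# Group centroids (F1, F2) reported for the three candidates.
CENTROIDS = {"A": (-1.164, 0.018), "B": (0.639, 0.102), "C": (0.428, -0.196)}

def discriminant_scores(age, education):
    """Scores on both functions, using the rounded coefficients."""
    f1 = -8.401 + 0.092 * age + 0.398 * education
    f2 = -0.643 - 0.059 * age + 0.267 * education
    return f1, f2

def nearest_centroid(point):
    """Candidate whose centroid is closest to the point (Euclidean distance)."""
    return min(CENTROIDS, key=lambda name: math.dist(point, CENTROIDS[name]))

point = discriminant_scores(age=50, education=15)
choice = nearest_centroid(point)

print(f"F1 = {point[0]:.2f}, F2 = {point[1]:.2f} -> candidate {choice}")
```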

Results from Stepwise Method

In this example, the stepwise method also gives the same results.
