
Discriminant Analysis

Discriminant analysis is useful for building a predictive model of group membership based on observed characteristics of each case. The procedure generates a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases that have measurements for the predictor variables but have unknown group membership.

Note: The grouping variable can have more than two values. The codes for the grouping variable must be integers, however, and you need to specify their minimum and maximum values. Cases with values outside of these bounds are excluded from the analysis.

Example. On average, people in temperate zone countries consume more calories per day than people in the tropics, and a greater proportion of the people in the temperate zones are city dwellers. A researcher wants to combine this information into a function to determine how well an individual can discriminate between the two groups of countries. The researcher thinks that population size and economic information may also be important. Discriminant analysis allows you to estimate coefficients of the linear discriminant function, which looks like the right side of a multiple linear regression equation. That is, using coefficients a, b, c, and d, the function is:
D = a * climate + b * urban + c * population + d * gross domestic product per capita
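As a quick illustration, the function can be evaluated directly once the coefficients are known. The coefficients in this Python sketch are invented for illustration only; in practice SPSS estimates them from the sample of cases whose group membership is known.

```python
# Hypothetical coefficients (a, b, c, d) -- illustration only; the real
# values are estimated from the training sample.
def discriminant_score(climate, urban, population, gdp_per_capita,
                       a=0.5, b=0.3, c=-0.2, d=0.1):
    """D = a*climate + b*urban + c*population + d*gdp_per_capita."""
    return a * climate + b * urban + c * population + d * gdp_per_capita
```

Countries from the two climate zones should then tend to have different values of D.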

If these variables are useful for discriminating between the two climate zones, the values of D will differ for the temperate and tropic countries. If you use a stepwise variable selection method, you may find that you do not need to include all four variables in the function.

Statistics. For each variable: means, standard deviations, univariate ANOVA. For each analysis: Box's M, within-groups correlation matrix, within-groups covariance matrix, separate-groups covariance matrix, total covariance matrix. For each canonical discriminant function: eigenvalue, percentage of variance, canonical correlation, Wilks' lambda, chi-square. For each step: prior probabilities, Fisher's function coefficients, unstandardized function coefficients, Wilks' lambda for each canonical function.

Data. The grouping variable must have a limited number of distinct categories, coded as integers. Independent variables that are nominal must be recoded to dummy or contrast variables.

Assumptions. Cases should be independent. Predictor variables should have a multivariate normal distribution, and within-group variance-covariance matrices should be equal across groups. Group membership is assumed to be mutually exclusive (that is, no case belongs to more than one group) and collectively exhaustive (that is, all cases are members of a group). The procedure is most effective when group membership is a truly categorical variable; if group membership is based on values of a continuous variable (for example, high IQ versus low IQ), consider using linear regression to take advantage of the richer information that is offered by the continuous variable itself.

Discriminant analysis is used to model the value of a dependent categorical variable based on its relationship to one or more predictors.

The discriminant model has the following assumptions: The predictors are not highly correlated with each other. The mean and variance of a given predictor are not correlated. The correlation between two predictors is constant across groups. The values of each predictor have a normal distribution.

Using Discriminant Analysis to Assess Credit Risk


If you are a loan officer at a bank, you want to be able to identify characteristics that are indicative of people who are likely to default on loans, and you want to use those characteristics to identify good and bad credit risks. Suppose information on 850 past and prospective customers is contained in bankloan.sav . See the topic Sample Files for more information. The first 700 cases are customers who were previously given loans. Use a random sample of these 700 customers to create a discriminant analysis model, setting the remaining customers aside to validate the analysis. Then use the model to classify the 150 prospective customers as good or bad credit risks.

Setting the random seed allows you to replicate the random selection of cases in this analysis.

To set the random seed, from the menus choose:
Transform > Random Number Generators...
Select Set Starting Point.
Select Fixed Value and type 9191972 as the value.
Click OK.

To create the selection variable for validation, from the menus choose:
Transform > Compute Variable...
Type validate in the Target Variable text box.
Type rv.bernoulli(0.7) in the Numeric Expression text box. This sets the values of validate to be randomly generated Bernoulli variates with probability parameter 0.7.
You only intend to use validate with cases that could be used to create the model; that is, previous customers. However, there are 150 cases corresponding to potential customers in the data file. To perform the computation only for previous customers, click If.
Select Include if case satisfies condition.
Type MISSING(default) = 0 as the conditional expression. This ensures that validate is only computed for cases with non-missing values for default; that is, for customers who previously received loans.
Click Continue.
Click OK in the Compute Variable dialog box.

Approximately 70 percent of the customers previously given loans will have a validate value of 1. These customers will be used to create the model. The remaining customers who were previously given loans will be used to validate the model results. These selections generate the following command syntax:
SET SEED 9191972.
IF (MISSING(default) = 0) validate = RV.BERNOULLI(0.7).
EXECUTE.
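The effect of this syntax can be mimicked outside SPSS. The sketch below (plain Python, not SPSS) draws Bernoulli(0.7) flags for the 700 previous customers; the seed value is reused only to make the draw repeatable, and the resulting split will not match SPSS's own random number stream.

```python
import random

random.seed(9191972)  # fixed starting point, analogous to SET SEED

# validate = 1 with probability 0.7 for each of the 700 previous customers;
# the 150 prospective customers are skipped, as MISSING(default) = 0 does.
validate = [1 if random.random() < 0.7 else 0 for _ in range(700)]
training = sum(validate)  # roughly 70% of 700, i.e. about 490 cases
```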

To run the discriminant analysis, from the menus choose: Analyze > Classify > Discriminant... Select Previously defaulted as the grouping variable.

Select Years with current employer, Years at current address, Debt to income ratio (x100), and Credit card debt in thousands as the independent variables.
Select validate as the selection variable.
Select Previously defaulted and click Define Range.
Type 0 as the minimum.
Type 1 as the maximum.
Click Continue.
Select validate and click Value in the Discriminant Analysis dialog box.
Type 1 as the value for the selection variable.
Click Continue.
Click Statistics in the Discriminant Analysis dialog box.
Select Means, Univariate ANOVAs, and Box's M in the Descriptives group.
Select Fisher's and Unstandardized in the Function Coefficients group.
Select Within-groups correlation in the Matrices group.
Click Continue.
Click Classify in the Discriminant Analysis dialog box.
Select Summary table and Leave-one-out classification.
Click Continue.
Click Save in the Discriminant Analysis dialog box.
Select Predicted group membership and Probabilities of group membership.
Click Continue.
Click OK in the Discriminant Analysis dialog box.

These selections generate the following command syntax:
DISCRIMINANT
  /GROUPS=default(0 1)
  /VARIABLES=employ address debtinc creddebt
  /SELECT=validate(1)
  /ANALYSIS ALL
  /SAVE=CLASS PROBS
  /PRIORS EQUAL
  /STATISTICS=MEANS STDDEV UNIVF BOXM COEFF RAW FPAIR TABLE CROSSVALID CORR
  /CLASSIFY=NONMISSING POOLED .

The procedure builds a model for discriminating between the values 0 and 1 on the variable default using the variables employ, address, debtinc, and creddebt. The SELECT subcommand specifies that only cases with a value of 1 on the variable validate are used to build the model. The SAVE subcommand requests predicted group membership and predicted probabilities for each group to be saved to the active dataset. The STATISTICS subcommand requests Box's M test, classification function coefficients, unstandardized discriminant functions, pairwise F ratios, classification results and cross-validated results, and a pooled within-groups correlation matrix in addition to the default output. All other options are set to their default values.

The classification functions are used to assign cases to groups.


There is a separate function for each group. For each case, a classification score is computed for each function. The discriminant model assigns the case to the group whose classification function obtained the highest score. The coefficients for Years with current employer and Years at current address are smaller for the Yes classification function, which means that customers who have lived at the same address and worked at the same company for many years are less likely to default. Similarly, customers with greater debt are more likely to default. For example, consider cases 701 and 703. Case 701 has had the same employer for 16 years, lived at her current address for 13 years, and has debt equal to 10.9% of her income, $540 of which is credit card debt. The discriminant model predicts that there is only about an 8% chance that she will default on the loan, so she is a good credit risk. Case 703 has had the same employer and lived at the same address for fewer years and has greater debts, so the model sees him as a poor credit risk.
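The classification rule itself is simple: evaluate each group's function for the case and assign the case to the group with the larger score. The coefficients in this Python sketch are invented for illustration (the real values appear in the classification function coefficients table), but the signs follow the pattern described above: the stability variables weigh more in the No function, the debt variables in the Yes function.

```python
# Invented Fisher classification function coefficients, one set per group.
functions = {
    "no_default": {"const": -5.0, "employ": 0.40, "address": 0.30,
                   "debtinc": 0.10, "creddebt": 0.20},
    "default":    {"const": -6.0, "employ": 0.25, "address": 0.20,
                   "debtinc": 0.30, "creddebt": 0.90},
}

def classify(case):
    """Score the case with each group's function; assign it to the highest."""
    scores = {group: coef["const"] + sum(coef[v] * case[v] for v in case)
              for group, coef in functions.items()}
    return max(scores, key=scores.get)

# A long-tenured, low-debt customer (similar in spirit to case 701)
# scores higher on the no-default function.
good_risk = classify({"employ": 16, "address": 13,
                      "debtinc": 10.9, "creddebt": 0.54})
```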

Checking Collinearity of Predictors
Pooled Within-Groups Matrices (Correlation)

                                 Years with   Years at    Debt to       Credit card
                                 current      current     income        debt in
                                 employer     address     ratio (x100)  thousands
Years with current employer       1.000        .286        .104          .508
Years at current address           .286       1.000        .140          .290
Debt to income ratio (x100)        .104        .140       1.000          .508
Credit card debt in thousands      .508        .290        .508         1.000

The within-groups correlation matrix shows the correlations between the predictors. The largest correlations occur between Credit card debt in thousands and the other variables, but it is difficult to tell if they are large enough to be a concern. Look for differences between the structure matrix and discriminant function coefficients to be sure.
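One simple screen is to flag any pair of predictors whose pooled within-groups correlation exceeds a cutoff. The sketch below uses the values reported in the matrix and an arbitrary screening cutoff of 0.5, which is a rule of thumb for illustration, not an SPSS criterion.

```python
# Pooled within-groups correlations between the four predictors.
corr = {
    ("employ", "address"): 0.286, ("employ", "debtinc"): 0.104,
    ("employ", "creddebt"): 0.508, ("address", "debtinc"): 0.140,
    ("address", "creddebt"): 0.290, ("debtinc", "creddebt"): 0.508,
}

# Flag pairs at or above an arbitrary screening cutoff of 0.5.
flagged = [pair for pair, r in corr.items() if abs(r) >= 0.5]
```

Both flagged pairs involve creddebt, which is why the text singles out Credit card debt in thousands.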

Checking for Correlation of Group Means and Variances

The group statistics table reveals a potentially more serious problem. For all four predictors, larger group means are associated with larger group standard deviations.

In particular, look at Debt to income ratio (x100) and Credit card debt in thousands, for which the means and standard deviations for the Yes group are considerably higher. In further analysis, you may want to consider using transformed values of these predictors.

Checking Homogeneity of Covariance Matrices


Box's M tests the assumption of equality of covariances across groups. Log determinants are a measure of the variability of the groups. Larger log determinants correspond to more variable groups. Large differences in log determinants indicate groups that have different covariance matrices. Since Box's M is significant, you should request classification based on separate-groups covariance matrices to see whether that gives radically different results. See the section on specifying separate-groups covariance matrices for more information.

Assessing the Contribution of Individual Predictors


There are several tables that assess the contribution of each variable to the model, including the tests of equality of group means, the discriminant function coefficients, and the structure matrix.

Tests of Equality of Group Means


The tests of equality of group means measure each independent variable's potential before the model is created. Each test displays the results of a one-way ANOVA for the independent variable using the grouping variable as the factor. If the significance value is greater than 0.10, the variable probably does not contribute to the model. According to the results in this table, every variable in your discriminant model is significant. Wilks' lambda is another measure of a variable's potential. Smaller values indicate the variable is better at discriminating between groups.

The table suggests that Debt to income ratio (x100) is best, followed by Years with current employer, Credit card debt in thousands, and Years at current address.
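For a single predictor, Wilks' lambda is just the within-groups sum of squares divided by the total sum of squares, so a variable whose groups are well separated gets a small lambda. A minimal sketch with toy data (not the bankloan values):

```python
def wilks_lambda(groups):
    """Univariate Wilks' lambda: within-groups SS / total SS.
    `groups` is one list of observed values per group."""
    pooled = [x for g in groups for x in g]
    grand_mean = sum(pooled) / len(pooled)
    ss_total = sum((x - grand_mean) ** 2 for x in pooled)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return ss_within / ss_total

separated = wilks_lambda([[1, 2, 3], [11, 12, 13]])   # small lambda
overlapping = wilks_lambda([[1, 2, 3], [2, 3, 4]])    # lambda near 1
```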

Standardized Canonical Discriminant Function Coefficients


The standardized coefficients allow you to compare variables measured on different scales. Coefficients with large absolute values correspond to variables with greater discriminating ability. This table downgrades the importance of Debt to income ratio (x100), but the order is otherwise the same.

Structure Matrix
The structure matrix shows the correlation of each predictor variable with the discriminant function. The ordering in the structure matrix is the same as that suggested by the tests of equality of group means and is different from that in the standardized coefficients table. This disagreement is likely due to the collinearity between Years with current employer and Credit card debt in thousands noted in the correlation matrix. Since the structure matrix is unaffected by collinearity, it's safe to say that this collinearity has inflated the importance of Years with current employer and Credit card debt in thousands in the standardized coefficients table. Thus, Debt to income ratio (x100) best discriminates between defaulters and nondefaulters.

Assessing Model Fit


In addition to measures for checking the contribution of individual predictors to your discriminant model, the Discriminant Analysis procedure provides the eigenvalues and Wilks' lambda tables for seeing how well the discriminant model as a whole fits the data.

The eigenvalues table provides information about the relative efficacy of each discriminant function. When there are two groups, the canonical correlation is the most useful measure in the table, and it is equivalent to Pearson's correlation between the discriminant scores and the groups.

Wilks' lambda is a measure of how well each function separates cases into groups. It is equal to the proportion of the total variance in the discriminant scores not explained by differences among the groups. Smaller values of Wilks' lambda indicate greater discriminatory ability of the function. The associated chi-square statistic tests the hypothesis that the means of the functions listed are equal across groups. The small significance value indicates that the discriminant function does better than chance at separating the groups.
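In the two-group case these quantities are simple transforms of a function's eigenvalue e: the canonical correlation is sqrt(e / (1 + e)) and Wilks' lambda is 1 / (1 + e), so lambda equals one minus the squared canonical correlation. A small sketch:

```python
import math

def function_summary(eigenvalue):
    """Canonical correlation and Wilks' lambda for one discriminant
    function in the two-group case."""
    canonical_r = math.sqrt(eigenvalue / (1 + eigenvalue))
    wilks = 1 / (1 + eigenvalue)
    return canonical_r, wilks
```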

Model Validation
The classification table shows the practical results of using the discriminant model. Of the cases used to create the model, 94 of the 124 people who previously defaulted are classified correctly, and 281 of the 375 nondefaulters are classified correctly. Overall, 75.2% of the cases are classified correctly. Classifications based upon the cases used to create the model tend to be too "optimistic" in the sense that their classification rate is inflated. The cross-validated section of the table attempts to correct this by classifying each case while leaving it out of the model calculations; however, this method is generally still more "optimistic" than subset validation.
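The 75.2% figure is just the weighted combination of the two per-group rates:

```python
# Counts from the classification table for the model-building cases.
correct_defaulters, total_defaulters = 94, 124
correct_nondefaulters, total_nondefaulters = 281, 375

overall = ((correct_defaulters + correct_nondefaulters)
           / (total_defaulters + total_nondefaulters))  # 375/499, about 0.752
```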

Subset validation is obtained by classifying past customers who were not used to create the model. These results are shown in the Cases Not Selected section of the table: 77.1 percent of these cases were correctly classified by the model. This suggests that, overall, your model is in fact correct about three out of four times. The 150 ungrouped cases are the prospective customers, and the results here simply give a frequency table of the model-predicted groupings of these customers.

Specifying Separate-Groups Covariance Matrices


Since Box's M is significant, it's worth running a second analysis to see whether using a separate-groups covariance matrix changes the classification. To obtain a classification using a separate-groups covariance matrix, recall the Discriminant Analysis dialog box.
Click Classify.
Select Separate-groups. Note that with separate groups, leave-one-out classification is not available.
Click Continue.
Click OK in the Discriminant Analysis dialog box.

The classification results have not changed much, so it's probably not worth using separate covariance matrices. Box's M can be overly sensitive to large data files, which is likely what happened here.

Adjusting Prior Probabilities


This table displays the prior probabilities for membership in groups. A prior probability is an estimate of the likelihood that a case belongs to a particular group when no other information about it is available. Unless you specified otherwise, it is assumed that a case is equally likely to be a defaulter or nondefaulter. Prior probabilities are used along with the data to determine the classification functions. Adjusting the prior probabilities according to the group sizes can improve the overall classification rate. To obtain a classification using non-uniform priors, recall the Discriminant Analysis dialog box.
Click Classify.
Select Compute from group sizes.
Select Within-groups.
Click Continue.
Click OK in the Discriminant Analysis dialog box.
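With priors computed from group sizes, each group's classification score effectively gains an additive ln(prior) term, which shifts borderline cases toward the larger group. A sketch of that adjustment using the group sizes from this analysis:

```python
import math

# Group sizes among the model-building cases: 375 nondefaulters, 124 defaulters.
priors = {"no_default": 375 / 499, "default": 124 / 499}

def prior_adjustment(group):
    """ln(prior) term added to the group's classification function score."""
    return math.log(priors[group])
```

The nondefault group receives the larger (less negative) adjustment, so borderline cases break in its favor.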

The prior probabilities are now based on the sizes of the groups. A priori, 75.2% of the cases are nondefaulters, so the classification functions will now be weighted more heavily in favor of classifying cases as nondefaulters. The overall classification rate is higher for these classifications than for the ones based on equal priors. Unfortunately, this comes at the cost of misclassifying a greater percentage of defaulters. If you need to be conservative in your lending, then your goal is to identify defaulters, and you'd be better off using equal priors. If you can be more aggressive in your lending, then you can afford to use unequal priors.

Using Discriminant Analysis, you created a model that classifies customers as high or low credit risks. Box's M showed a possible problem with heterogeneity of the covariance matrices, although further investigation revealed this was probably an effect of the size of the data file. The use of unequal priors to take advantage of the fact that nondefaulters outnumber defaulters resulted in a higher overall classification rate but at the cost of missing defaulters.

Using Discriminant Analysis to Classify Telecommunications Customers


A telecommunications provider has segmented its customer base by service usage patterns, categorizing the customers into four groups. If demographic data can be used to predict group membership, you can customize offers for individual prospective customers.

Click Reset to restore the default settings. If the variable list does not display variable labels in file order, right-click anywhere in the variable list and from the context menu choose Display Variable Labels and Sort by File Order.
Select Customer category as the grouping variable.
Select Age in Years through Number of people in household as independent variables.
Select Use stepwise method.
Select Customer category and click Define Range.
Type 1 as the minimum.
Type 4 as the maximum.
Click Continue.
Click Classify in the Discriminant Analysis dialog box.
Select Summary table and Territorial map.
Click Continue.
Click OK in the Discriminant Analysis dialog box.

Stepwise Discriminant Analysis

When you have a lot of predictors, the stepwise method can be useful by automatically selecting the "best" variables to use in the model. The stepwise method starts with a model that doesn't include any of the predictors.

At each step, the predictor with the largest F to Enter value that exceeds the entry criteria (by default, 3.84) is added to the model.

The variables left out of the analysis at the last step all have F to Enter values smaller than 3.84, so no more are added.
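The entry rule can be sketched as a greedy loop. Here `f_to_enter` is a caller-supplied stand-in for the F-to-Enter computation (and the sketch omits the matching F-to-Remove step that the full stepwise procedure also applies):

```python
F_TO_ENTER = 3.84  # default entry criterion

def stepwise_select(predictors, f_to_enter):
    """Repeatedly enter the candidate with the largest F-to-Enter value,
    stopping when no candidate exceeds the criterion."""
    selected, candidates = [], list(predictors)
    while candidates:
        best = max(candidates, key=lambda v: f_to_enter(v, selected))
        if f_to_enter(best, selected) <= F_TO_ENTER:
            break  # no remaining predictor meets the entry criterion
        selected.append(best)
        candidates.remove(best)
    return selected
```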

This table displays statistics for the variables that are in the analysis at each step.

Tolerance is the proportion of a variable's variance not accounted for by other independent variables in the equation. A variable with very low tolerance contributes little information to a model and can cause computational problems. F to Remove values are useful for describing what happens if a variable is removed from the current model (given that the other variables remain). F to Remove for the entering variable is the same as F to Enter at the previous step (shown in the Variables Not in the Analysis table).
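Tolerance is 1 - R², where R² comes from regressing the predictor on the other predictors in the equation. In the special case of a single other predictor, R² is just the squared correlation, so (for example) a within-groups correlation of .508 would leave a tolerance of about .74:

```python
def tolerance_one_other(r):
    """Tolerance when a single other predictor is in the equation:
    1 - (squared correlation with that predictor)."""
    return 1 - r ** 2
```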

A Note of Caution Concerning Stepwise Methods


Stepwise methods are convenient, but have their limitations. Be aware that because stepwise methods select models based solely upon statistical merit, they may choose predictors that have no practical significance. If you have some experience with the data and have expectations about which predictors are important, you should use that knowledge and eschew stepwise methods. If, however, you have many predictors and no idea where to start, running a stepwise analysis and adjusting the selected model is better than no model at all.

Checking Model Fit


Nearly all of the variance explained by the model is due to the first two discriminant functions.

Three functions are fit automatically, but due to its minuscule eigenvalue, you can fairly safely ignore the third. Wilks' lambda agrees that only the first two functions are useful. For each set of functions, this tests the hypothesis that the means of the functions listed are equal across groups. The test of function 3 has a significance value greater than 0.10, so this function contributes little to the model.

Structure Matrix
When there is more than one discriminant function, an asterisk (*) marks each variable's largest absolute correlation with one of the canonical functions. Within each function, these marked variables are then ordered by the size of the correlation.

Level of education is most strongly correlated with the first function, and it is the only variable most strongly correlated with this function. Years with current employer, Age in years, Household income in thousands, Years at current address, Retired, and Gender are most strongly correlated with the second function, although Gender and Retired are more weakly correlated than the others. The other variables mark this function as a "stability" function. Number of people in household and Marital status are most strongly correlated with the third discriminant function, but this is a useless function, so these are nearly useless predictors.

Territorial Map

The territorial map helps you to study the relationships between the groups and the discriminant functions. Combined with the structure matrix results, it gives a graphical interpretation of the relationship between predictors and groups.

The first function, shown on the horizontal axis, separates group 4 (Total service customers) from the others. Since Level of education is strongly positively correlated with the first function, this suggests that your Total service customers are, in general, the most highly educated.

The second function separates groups 1 and 3 (Basic service and Plus service customers). Plus service customers tend to have been working longer and are older than Basic service customers.

E-service customers are not separated well from the others, although the map suggests that they tend to be well educated with a moderate amount of work experience.

In general, the closeness of the group centroids, marked with asterisks (*), to the territorial lines suggests that the separation between all groups is not very strong. Only the first two discriminant functions are plotted, but since the third function was found to be rather insignificant, the territorial map offers a comprehensive view of the discriminant model.

Classification Results
From Wilks' lambda, you know that your model is doing better than guessing, but you need to turn to the classification results to determine how much better.

Given the observed data, the "null" model (that is, one without predictors) would classify all customers into the modal group, Plus service. Thus, the null model would be correct 281/1000 = 28.1% of the time. Your model gets 11.4% more, or 39.5%, of the customers. In particular, your model excels at identifying Total service customers. However, it does an exceptionally poor job of classifying E-service customers. You may need to find another predictor in order to separate these customers.

You have created a discriminant model that classifies customers into one of four predefined "service usage" groups, based on demographic information from each customer. Using the structure matrix and territorial map, you identified which variables are most useful for segmenting your customer base. Lastly, the classification results show that the model does poorly at classifying E-service customers. More research is required to determine another predictor variable that better classifies these customers, but depending on what you are looking to predict, the model may be perfectly adequate for your needs. For example, if you are not concerned with identifying E-service customers, the model may be accurate enough for you. This may be the case if E-service is a loss leader that brings in little profit. If, for example, your highest return on investment comes from Plus service or Total service customers, the model may give you the information you need.
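The comparison with the null model is simple arithmetic on the classification table:

```python
# The modal group (Plus service) holds 281 of the 1,000 customers.
null_accuracy = 281 / 1000          # classify everyone as Plus service: 28.1%
model_accuracy = 0.395              # overall rate from the classification table
improvement = model_accuracy - null_accuracy  # about 11.4 percentage points
```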
