
# The SPSS Sample Problem

To demonstrate multinomial logistic regression, we will work the sample problem for multinomial logistic regression in SPSS Regression Models 10.0, pages 65 - 82. The description of the problem found on page 66 states that the 1996 General Social Survey asked people who they voted for in 1992. Demographic variables from the GSS, such as sex, age, and education, can be used to identify the relationships between voter demographics and voter preference. The data for this problem is: voter.sav.

Slide 1

## Stage One: Define the Research Problem

In this stage, the following issues are addressed:

- Relationship to be analyzed
- Specifying the dependent and independent variables
- Method for including independent variables

## Relationship to be analyzed

The goal of this analysis is to examine the relationship between presidential choice in 1992, sex, age, and education.

Slide 2

## Specifying the dependent and independent variables

The dependent variable is pres92 Vote for Clinton, Bush, Perot. It has three categories: 1 is a vote for Bush, 2 is a vote for Perot, and 3 is a vote for Clinton. SPSS will solve the problem by contrasting votes for Bush to votes for Clinton, and votes for Perot to votes for Clinton. By default, SPSS uses the highest numbered category as the reference category.

## The independent variables which we will use in this analysis are:

- AGE: Age of respondent
- EDUC: Highest year of school completed
- DEGREE: Respondent's highest degree
- SEX: Respondent's sex

## Method for including independent variables

The only method for including variables in multinomial logistic regression in SPSS is direct entry of all variables.

Slide 3

## Stage 2: Develop the Analysis Plan: Sample Size Issues

In this stage, the following issues are addressed:

- Missing data analysis
- Minimum sample size requirement: 15-20 cases per independent variable

## Missing data analysis

Only 2 of the 1847 cases have any missing data. Since the number of cases with missing data is so small, it is unlikely to produce a missing data process that is disruptive to the analysis. We will therefore bypass any missing data analysis.

## Minimum sample size requirement: 15-20 cases per independent variable

The data set has 1845 valid cases and 4 independent variables, a ratio of about 461 to 1, well in excess of the requirement of 15-20 cases per independent variable.

Slide 4

## Stage 2: Develop the Analysis Plan: Measurement Issues

In this stage, the following issues are addressed:

- Incorporating nonmetric data with dummy variables
- Representing curvilinear effects with polynomials
- Representing interaction or moderator effects

## Incorporating Nonmetric Data with Dummy Variables

It is not necessary to create dummy variables for nonmetric data since SPSS will do this automatically when we specify that a variable is a factor in the model.
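To make the coding scheme concrete, here is a small sketch in Python (outside SPSS) of what dummy coding a factor amounts to. The function name and category layout are illustrative; the essential point, consistent with how NOMREG treats factors, is that a five-category variable like DEGREE becomes four 0/1 indicators, with the highest category serving as the reference.

```python
# Illustrative sketch of dummy coding for a factor; not SPSS internals.
# DEGREE has five categories (0-4); the highest (4) is the reference
# category, so only four indicator variables are created.

def dummy_code(value, categories, reference):
    """Return 0/1 indicators for each non-reference category."""
    return {f"degree{c}": int(value == c) for c in categories if c != reference}

# A respondent with a bachelor's degree (DEGREE = 3):
print(dummy_code(3, categories=[0, 1, 2, 3, 4], reference=4))
# {'degree0': 0, 'degree1': 0, 'degree2': 0, 'degree3': 1}
```

A respondent in the reference category (an advanced degree, DEGREE = 4) would score 0 on all four indicators, which is why the reference group needs no variable of its own.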

## Representing Curvilinear Effects with Polynomials

We do not have any evidence of curvilinear effects at this point in the analysis, though the SPSS text for this problem points out that there is a curvilinear relationship between education and voting preference, which led them to create the variable DEGREE, Respondent's Highest Degree. Democrats (i.e. Clinton voters) are favored by both those with little formal education and those who have advanced degrees.

## Representing Interaction or Moderator Effects

We do not have any evidence at this point in the analysis that we should add interaction or moderator variables. The SPSS procedure makes it very easy to add interaction terms.
Multinomial Logistic Regression

Slide 5

## Stage 3: Evaluate Underlying Assumptions

In this stage, the following issues are addressed:

- Nonmetric dependent variable with two or more groups
- Metric or nonmetric independent variables

## Nonmetric dependent variable with two or more groups

The dependent variable, pres92 Vote for Clinton, Bush, Perot, has three categories, satisfying this requirement.

## Metric or nonmetric independent variables

AGE and EDUC, as metric variables, will be entered as covariates in the model. SEX and DEGREE, as nonmetric variables, will be entered as factors.


Slide 6

## Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation

In this stage, the following issue is addressed:

- Compute the logistic regression model

## Compute the logistic regression

The steps to obtain a logistic regression analysis are detailed on the following screens.

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12


Slide 13

## Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Assessing Model Fit

In this stage, the following issues are addressed:

- Significance test of the model log likelihood (change in -2LL)
- Measures analogous to R²: Cox and Snell R² and Nagelkerke R²
- Classification matrices as a measure of model accuracy
- Check for numerical problems
- Presence of outliers

Slide 14

## Significance test of the model log likelihood

The initial log likelihood function (-2 Log Likelihood, or -2LL) is a statistical measure analogous to the total sum of squares in regression. If our independent variables have a relationship to the dependent variable, we will improve our ability to predict the dependent variable accurately, and the log likelihood measure will decrease.

The initial log likelihood value (2718.636) is a measure of a model with no independent variables, i.e. only a constant or intercept. The final log likelihood value (2600.138) is the measure computed after all of the independent variables have been entered into the logistic regression. The difference between these two measures is the model chi-square value (118.497 = 2718.636 - 2600.138) that is tested for statistical significance. This test is analogous to the F-test for R² or change in R² in multiple regression, which tests whether or not the improvement in the model associated with the additional variables is statistically significant.

In this problem the model chi-square value of 118.497 has a significance < 0.0001, so we conclude that there is a significant relationship between the dependent variable and the set of independent variables.
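The chi-square arithmetic above can be checked directly; a quick calculation in Python (outside SPSS) with the two -2LL values from the output:

```python
# Model chi-square is the drop in -2 log likelihood (values from the SPSS output).
initial_2ll = 2718.636   # intercept-only model
final_2ll = 2600.138     # model with all predictors entered

model_chi_square = initial_2ll - final_2ll
print(round(model_chi_square, 3))
# 118.498 (SPSS reports 118.497, computed from the unrounded -2LL values)
```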


Slide 15

## Measures Analogous to R²

The next SPSS outputs indicate the strength of the relationship between the dependent variable and the independent variables, analogous to the R² measures in multiple regression.

The Cox and Snell R² measure operates like R², with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that ranges from 0 to 1. We will rely upon Nagelkerke's measure as indicating the strength of the relationship. Applying our interpretive criteria to the Nagelkerke R², we would characterize the relationship as weak.

Slide 16

## The Classification Matrix as a Measure of Model Accuracy

The classification matrix in multinomial logistic regression serves the same function as the classification matrix in discriminant analysis, i.e. evaluating the accuracy of the model.

If the predicted and actual group memberships are the same, i.e. 1 and 1, 2 and 2, or 3 and 3, then the prediction is accurate for that case. If predicted group membership and actual group membership are different, the model "misses" for that case. The overall percentage of accurate predictions (49.9% in this case) is the measure of model fit I rely on most heavily, because it has a meaning that is readily communicated: the percentage of cases for which our model predicts accurately.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and, if appropriate, the maximum by chance accuracy rate. The proportional by chance accuracy rate is equal to 0.393 (0.358² + 0.150² + 0.492²). A 25% increase over the proportional by chance accuracy rate would equal 0.491; our model accuracy rate of 49.9% meets this criterion. Since one of our groups (voters for Clinton) contains 49.2% of the cases, we should also apply the maximum by chance criterion. A 25% increase over the largest group would equal 0.614; our model accuracy rate of 49.9% fails to meet this criterion. The usefulness of the relationship between the demographic variables and voter preference is therefore questionable.
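The two by-chance criteria can be reproduced with a short calculation in Python (outside SPSS), using the group proportions from the classification output:

```python
# By-chance accuracy criteria, using the group proportions from the output.
p_bush, p_perot, p_clinton = 0.358, 0.150, 0.492

# Proportional by chance: the sum of the squared group proportions.
proportional = p_bush**2 + p_perot**2 + p_clinton**2
print(round(proportional, 3))          # 0.393
print(round(1.25 * proportional, 3))   # 0.491 -- criterion: 25% above chance

# Maximum by chance: 25% above the largest group's proportion.
print(round(1.25 * p_clinton, 3))
# 0.615 (the text's 0.614 reflects the unrounded group proportion)
```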

Slide 17

## Check for Numerical Problems

There are several numerical problems that can occur in logistic regression that are not detected by SPSS or other statistical packages:

- multicollinearity among the independent variables,
- zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and
- "complete separation," whereby the groups of the dependent variable can be perfectly separated by scores on one of the independent variables.

All of these problems produce large standard errors (over 2) for the variables included in the analysis, and very often produce very large B coefficients as well. If we encounter large standard errors for the predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the variables involved to try to identify the source of the problem. None of the standard errors or B coefficients are excessively large, so there is no evidence of a numerical problem with this analysis.
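The screening rule described above is mechanical enough to sketch in a few lines of Python (outside SPSS). The coefficient and standard error values below are illustrative, not taken from the SPSS output:

```python
# Hedged sketch: scan parameter estimates for warning signs of numerical
# problems (standard errors above 2, or implausibly large B coefficients).
# The (B, SE) pairs here are illustrative values, not the actual output.
estimates = {
    "AGE": (0.0001, 0.004),
    "SEX=1": (0.396, 0.137),
    "DEGREE=0": (-0.498, 0.462),
}

flagged = [name for name, (b, se) in estimates.items()
           if se > 2 or abs(b) > 10]
print(flagged)  # [] -> no evidence of a numerical problem
```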
Slide 18

## Presence of outliers
Multinomial logistic regression does not provide any output for detecting outliers. However, if we are concerned with outliers, we can identify outliers on the combination of independent variables by computing Mahalanobis distance in the SPSS regression procedure.

Slide 19

## Stage 5: Interpret the Results

In this section, we address the following issues:

- Identifying the statistically significant predictor variables
- Direction of relationship and contribution to dependent variable

Slide 20

## Identifying the statistically significant predictor variables - 1

There are two outputs related to the statistical significance of individual predictor variables: the Likelihood Ratio Tests and the Parameter Estimates. The Likelihood Ratio Tests indicate the contribution of each variable to the overall relationship between the dependent variable and the set of independent variables. The Parameter Estimates focus on the role of each independent variable in differentiating between the groups specified by the dependent variable. The likelihood ratio tests are hypothesis tests that the variable contributes to the reduction in error measured by the -2 log likelihood statistic. In this model, the variables age, degree, and sex are all significant contributors to explaining differences in voting preference.

Slide 21

## Identifying the statistically significant predictor variables - 2

The two equations in the table of Parameter Estimates are labeled by the group they contrast to the reference group. The first equation is labeled "1 Bush", and the second equation is labeled "2 Perot." The coefficients for each logistic regression equation are found in the column labeled B. The null hypothesis that a coefficient is zero, i.e. that the variable does not change the odds of the dependent variable event, is tested with the Wald statistic, instead of the t-test used for individual B coefficients in the multiple regression equation. The variables that have a statistically significant relationship to distinguishing voters for Bush from voters for Clinton in the first logistic regression equation were DEGREE=3 (bachelor's degree) and SEX=1 (male). The variables that have a statistically significant relationship to distinguishing voters for Perot from voters for Clinton were AGE, DEGREE=2 (junior college degree), and SEX=1 (male).
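For a single coefficient, the Wald statistic is simply the squared ratio of the coefficient to its standard error, referred to a chi-square distribution with 1 degree of freedom. A small sketch in Python (outside SPSS), with illustrative B and SE values rather than numbers from the output:

```python
# The Wald statistic for one coefficient: (B / SE) squared, compared to
# a chi-square distribution with 1 df. B and SE below are illustrative.
def wald(b, se):
    return (b / se) ** 2

w = wald(0.693, 0.20)
print(round(w, 2))
# 12.01 -- well above 3.84, the chi-square(1) critical value at p = .05
```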

Slide 22

## Direction of relationship and contribution to dependent variable - 1

Interpretation of the independent variables is aided by the "Exp (B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows: Having a bachelor's degree rather than an advanced degree increased the likelihood that a voter would choose Bush over Clinton by about 50%. Being a male increased the likelihood that a voter would choose Bush over Clinton by almost 60%.

Slide 23

## Direction of relationship and contribution to dependent variable - 2

Interpretation of the independent variables is aided by the "Exp (B)" column which contains the odds ratio for each independent variable. We can state the relationships as follows: Increases in age made a voter about 3% less likely to choose Perot over Clinton.

Having a junior college degree made a person about 2.3 times more likely to choose Perot over Clinton.
Being a male doubled the likelihood that a voter would choose Perot over Clinton.
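These interpretations come straight from exponentiating the B coefficients. A quick illustration in Python (outside SPSS), using hypothetical B values chosen to match the reported effect sizes rather than the actual coefficients from the output:

```python
import math

# Hypothetical B coefficients chosen to reproduce the reported Exp(B) values.
print(round(math.exp(-0.03), 3))
# 0.97 -> each added year of age: ~3% lower odds of Perot over Clinton
print(round(math.exp(0.833), 2))
# 2.3  -> junior college degree: ~2.3 times the odds of Perot over Clinton
print(round(math.exp(0.693), 2))
# 2.0  -> male: roughly double the odds of Perot over Clinton
```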

Slide 24

## Stage 6: Validate The Model

The SPSS multinomial logistic procedure does not include the ability to select a subset of cases based on the value of a variable, so we cannot use our usual strategy for conducting a validation analysis. We can, however, accomplish the same results with a step-by-step series of syntax commands, as will be shown on the following screens. We cannot run all of the syntax commands at one time because one of the steps requires us to manually type the coefficients from the SPSS output into the syntax file so that we can calculate predicted values for the logistic regression equations.

In order to understand the steps that we will follow, we need to understand how we translate scores on the logistic regression equations into classification in a group. The multinomial logistic regression problem for three groups is solved by contrasting two of the groups with a reference group; in this problem, the reference group is Clinton voters. The classification score for the reference group is 0, just as the code for any reference group for dummy coded variables is 0. The first logistic regression equation computes a score that tests whether or not the subject is more likely a member of the group of Bush voters than a member of the group of Clinton voters. Similarly, the second logistic regression equation tests whether or not the subject is more likely to be a Perot voter than a Clinton voter.

Slide 25

## Stage 6: Validate The Model (continued)

The classification problem, thus, involves the comparison of three scores, one associated with each of the groups. The first score (which we will label g1) is associated with voting for Bush. The second score (which we will label g2) is associated with voting for Perot. The third score (which we will label g3) is associated with voting for Clinton. Calculating g1 and g2 requires substituting each subject's values into the logistic regression equations; g3 is always 0.

The scores g1, g2, and g3 are log estimates of the odds of belonging to each group. To convert the scores into a probability of group membership, we convert each score into its antilog equivalent and divide by the sum of the three antilog equivalents. To estimate group membership, we compare the three probabilities and assign the subject to the group associated with the highest probability.
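The classification rule just described can be sketched compactly in Python (outside SPSS); the function name and the example scores are hypothetical:

```python
import math

# Sketch of the classification rule: convert the three classification
# scores into probabilities (antilog of each score divided by the sum
# of the antilogs) and pick the group with the highest probability.
def classify(g1, g2, g3=0.0):
    exps = [math.exp(g1), math.exp(g2), math.exp(g3)]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)) + 1  # 1 = Bush, 2 = Perot, 3 = Clinton

# Hypothetical scores; g3 (Clinton, the reference group) is fixed at 0.
print(classify(-0.2, -1.1))  # 3 -> Clinton is the most probable group
```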

Slide 26

## Computing the First Validation Analysis

The first step in our validation analysis is to create the split variable.
```
* Compute the split variable for the learning and validation samples.
SET SEED 2000000.
COMPUTE split = uniform(1) > 0.50.
EXECUTE.
```


Slide 27

## Creating the Multinomial Logistic Regression for the First Half of the Data

Next, we run the multinomial logistic regression on the first half of the sample, where split = 0.

```
* Select the cases to include in the first validation analysis.
USE ALL.
COMPUTE filter_$ = (split = 0).
FILTER BY filter_$.
EXECUTE.

* Run the multinomial logistic regression for these cases.
NOMREG pres92 BY degree sex WITH age educ
  /CRITERIA = CIN(95) DELTA(0) MXITER(100) MXSTEP(5) LCONVERGE(0)
    PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /MODEL
  /INTERCEPT = INCLUDE
  /PRINT = CLASSTABLE PARAMETER SUMMARY LRT.
```

Slide 28

## Entering the Logistic Regression Coefficients into SPSS

To compute the classification scores for the logistic regression equations, we need to enter the B coefficients from each equation into SPSS using compute commands. For the first equation's coefficients, we will use the letter A followed by a number; for the second equation's coefficients, the letter B followed by a number. The complete set of compute commands is shown below the graphic.

Slide 29

## Create the coefficients in SPSS

```
* Assign the coefficients from the model just run to variables.
compute A0 = 0.4371960543979.
compute A1 = 0.000141395117344668.
compute A2 = -0.0627104600309503.
compute A3 = -0.498317435855329.
compute A4 = 0.000960703262109129.
compute A5 = 0.100394066914469.
compute A6 = 0.0289917073909467.
compute A7 = 0.395907588574936.
compute B0 = -0.181048831733511.
compute B1 = -0.0230592828938232.
compute B2 = -0.0511669299018998.
compute B3 = -0.548361281578711.
compute B4 = 0.482047826644372.
compute B5 = 0.532843066729832.
compute B6 = 0.492518027246711.
compute B7 = 0.6773170430501.
execute.
```

Slide 30

## Entering the Logistic Regression Equations into SPSS

Before we can enter the logistic regression equations, we need to explicitly create the dummy coded variables which the logistic regression equation created for the variables that we specified were factors.
```
* Create the dummy coded variables which SPSS created.
* Use a logical assignment to code the variables as 0 or 1.
compute degree0 = (degree = 0).
compute degree1 = (degree = 1).
compute degree2 = (degree = 2).
compute degree3 = (degree = 3).
compute degree4 = (degree = 4).
compute sex1 = (sex = 1).
execute.
```

The logistic regression equations can be entered as compute statements. We will also enter the zero value for the third group, g3.
```
compute g1 = A0 + A1 * AGE + A2 * EDUC + A3 * DEGREE0 + A4 * DEGREE1
           + A5 * DEGREE2 + A6 * DEGREE3 + A7 * SEX1.
compute g2 = B0 + B1 * AGE + B2 * EDUC + B3 * DEGREE0 + B4 * DEGREE1
           + B5 * DEGREE2 + B6 * DEGREE3 + B7 * SEX1.
compute g3 = 0.
execute.
```

When these statements are run in SPSS, the scores for g1, g2, and g3 will be added to the dataset.

Slide 31

## Converting Classification Scores into Predicted Group Membership

We convert the three scores into their antilog (odds) equivalents using the EXP function. When we divide each antilog by the sum of the three antilogs, we end up with a probability of membership in each group.

```
* Compute the probabilities of membership in each group.
compute p1 = exp(g1) / (exp(g1) + exp(g2) + exp(g3)).
compute p2 = exp(g2) / (exp(g1) + exp(g2) + exp(g3)).
compute p3 = exp(g3) / (exp(g1) + exp(g2) + exp(g3)).
execute.
```

## The following if statements compare the probabilities to predict group membership.

```
* Translate the probabilities into predicted group membership.
if (p1 > p2 and p1 > p3) predgrp = 1.
if (p2 > p1 and p2 > p3) predgrp = 2.
if (p3 > p1 and p3 > p2) predgrp = 3.
execute.
```

When these statements are run in SPSS, the dataset will have both actual and predicted membership for the first validation sample.

Slide 32

## The Classification Table

To produce a classification table for the validation sample, we change the filter criteria to include cases where split = 1, and create a contingency table of predicted voting versus actual voting.
```
USE ALL.
COMPUTE filter_$ = (split = 1).
FILTER BY filter_$.
EXECUTE.

CROSSTABS
  /TABLES = pres92 BY predgrp
  /FORMAT = AVALUE TABLES
  /CELLS = COUNT TOTAL.
```

These commands produce the following table. The classification accuracy rate is computed by adding the percentages for the cells where predicted voting coincides with actual voting behavior: 6.3% + 42.2% = 48.5%.

## We enter this information in the validation table.

Slide 33

## Computing the Second Validation Analysis

The second validation analysis follows the same series of commands, except that we build the model with the cases where split = 1 and validate the model on cases where split = 0. The results from my calculations have been entered into the validation table below.

| | Full Model | Split = 0 | Split = 1 |
| --- | --- | --- | --- |
| Model Chi-Square | 118.497, p < 0.0001 | 42.610, p < 0.0001 | 92.772, p < 0.0001 |
| Nagelkerke R² | 0.072 | 0.051 | 0.113 |
| Accuracy Rate for Learning Sample | 49.9% | 48.8% | 50.6% |
| Accuracy Rate for Validation Sample | | 48.5% | 46.9% |
| Significant Coefficients (p < 0.05) | Equation 1: DEGREE = 3, SEX = 1. Equation 2: AGE, DEGREE = 2, SEX = 1 | | |

Slide 34

## Generalizability of the Multinomial Logistic Regression Model

We can summarize the results of the validation analyses in the following table.
| | Full Model | Split = 0 | Split = 1 |
| --- | --- | --- | --- |
| Model Chi-Square | 118.497, p < 0.0001 | 42.610, p < 0.0001 | 92.772, p < 0.0001 |
| Nagelkerke R² | 0.072 | 0.051 | 0.113 |
| Accuracy Rate for Learning Sample | 49.9% | 48.8% | 50.6% |
| Accuracy Rate for Validation Sample | | 48.5% | 46.9% |
| Significant Coefficients (p < 0.05) | Equation 1: DEGREE = 3, SEX = 1. Equation 2: AGE, DEGREE = 2, SEX = 1 | | Equation 1: DEGREE = 3, SEX = 1. Equation 2: AGE, DEGREE = 2, SEX = 1 |

From the validation table, we see that the original model is supported by the accuracy rates for the validation analyses. SEX and AGE appear to be the more reliable predictors of voting behavior. However, the relationship is weak and falls short of the classification accuracy criterion for a useful model.

Slide 35