
To demonstrate multinomial logistic regression, we will work the sample problem for multinomial logistic regression in SPSS Regression Models 10.0, pages 65-82. The description of the problem on page 66 states that the 1996 General Social Survey asked people whom they voted for in 1992. Demographic variables from the GSS, such as sex, age, and education, can be used to identify the relationships between voter demographics and voter preference. The data for this problem are in the file voter.sav.

Slide 1

In this stage, the following issues are addressed:

- Relationship to be analyzed
- Specifying the dependent and independent variables
- Method for including independent variables

Relationship to be analyzed

The goal of this analysis is to examine the relationship between presidential choice in 1992, sex, age, and education.

Slide 2

The dependent variable is pres92 'Vote for Clinton, Bush, Perot'. It has three categories: 1 is a vote for Bush, 2 is a vote for Perot, and 3 is a vote for Clinton. SPSS will solve the problem by contrasting votes for Bush to votes for Clinton, and votes for Perot to votes for Clinton. By default, SPSS uses the highest numbered category as the reference category.

The independent variables are:

- AGE 'Age of respondent'
- EDUC 'Highest year of school completed'
- DEGREE 'Respondent's Highest Degree'
- SEX 'Respondent's Sex'

The only method for including variables in multinomial logistic regression in SPSS is direct entry of all variables.

Slide 3

In this stage, the following issues are addressed:

- Missing data analysis
- Minimum sample size requirement: 15-20 cases per independent variable

Only 2 of the 1847 cases have any missing data. Since the number of cases with missing data is so small, it cannot reflect a missing data process that would distort the analysis. We will bypass any missing data analysis.

The data set has 1845 cases and 4 independent variables, a ratio of roughly 461 to 1, well in excess of the requirement of 15-20 cases per independent variable.

Slide 4

In this stage, the following issues are addressed:

- Incorporating nonmetric data with dummy variables
- Representing curvilinear effects with polynomials
- Representing interaction or moderator effects

It is not necessary to create dummy variables for nonmetric data since SPSS will do this automatically when we specify that a variable is a factor in the model.

We do not have any evidence of curvilinear effects at this point in the analysis, though the SPSS text for this problem points out that there is a curvilinear relationship between education and voting preference, which led them to create the variable DEGREE 'Respondent's Highest Degree'. Democrats (i.e. Clinton voters) are favored both by those with little formal education and by those who have advanced degrees.

We do not have any evidence at this point in the analysis that we should add interaction or moderator variables. The SPSS procedure makes it very easy to add interaction terms.

Multinomial Logistic Regression

Slide 5

In this stage, the following issues are addressed:

- Nonmetric dependent variable with two or more groups
- Metric or nonmetric independent variables

The dependent variable pres92 Vote for Clinton, Bush, Perot has three categories.

AGE and EDUC, as metric variables, will be entered as covariates in the model. SEX and DEGREE, as nonmetric variables, will be entered as factors.

Slide 6

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation

In this stage, the following issue is addressed:

- Computing the logistic regression model

The steps to obtain a logistic regression analysis are detailed on the following screens.

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Assessing Model Fit

In this stage, the following issues are addressed:

- Significance test of the model log likelihood (change in -2LL)
- Measures analogous to R²: Cox and Snell R² and Nagelkerke R²
- Classification matrices as a measure of model accuracy
- Check for numerical problems
- Presence of outliers

Slide 14

The Initial Log Likelihood Function (-2 Log Likelihood, or -2LL) is a statistical measure analogous to the total sum of squares in regression. If our independent variables have a relationship to the dependent variable, our ability to predict the dependent variable accurately improves, and the log likelihood measure decreases.

The initial log likelihood value (2718.636) is a measure of a model with no independent variables, i.e. only a constant or intercept. The final log likelihood value (2600.138) is the measure computed after all of the independent variables have been entered into the logistic regression. The difference between these two measures is the model chi-square value (118.497 = 2718.636 - 2600.138, within rounding) that is tested for statistical significance. This test is analogous to the F-test for R² or change in R² in multiple regression, which tests whether or not the improvement in the model associated with the additional variables is statistically significant.

In this problem the model Chi-Square value of 118.497 has a significance < 0.0001, so we conclude that there is a significant relationship between the dependent variable and the set of independent variables.
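The arithmetic behind the model chi-square can be sketched in a few lines of Python. The -2LL values are copied from the SPSS output quoted above; the tiny discrepancy with the reported 118.497 is rounding in the printed log likelihoods:

```python
# Model chi-square = initial -2LL (intercept-only model) minus final -2LL
# (model with all predictors). Values copied from the SPSS output above.
initial_2ll = 2718.636   # -2LL for the intercept-only model
final_2ll = 2600.138     # -2LL after entering all independent variables

model_chi_square = initial_2ll - final_2ll
print(round(model_chi_square, 3))  # 118.498, matching the reported 118.497
                                   # up to rounding of the printed -2LL values
```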

Slide 15

Measures Analogous to R²

The next SPSS outputs indicate the strength of the relationship between the dependent variable and the independent variables, analogous to the R² measures in multiple regression.

The Cox and Snell R² measure operates like R², with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that has a range from 0 to 1. We will rely upon Nagelkerke's measure to indicate the strength of the relationship. If we apply our interpretive criteria to the Nagelkerke R², we would characterize the relationship as weak.

Slide 16

The classification matrix in multinomial logistic regression serves the same function as the classification matrix in discriminant analysis, i.e. evaluating the accuracy of the model.

If the predicted and actual group memberships are the same, i.e. 1 and 1, 2 and 2, or 3 and 3, then the prediction is accurate for that case. If predicted and actual group membership differ, the model "misses" for that case. The overall percentage of accurate predictions (49.9% in this case) is the measure of model fit that I rely on most heavily for this analysis because it has a meaning that is readily communicated, i.e. the percentage of cases for which the model predicts accurately.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and, if appropriate, the maximum by chance accuracy rate. The proportional by chance accuracy rate is equal to 0.393 (0.358^2 + 0.150^2 + 0.492^2). A 25% increase over the proportional by chance accuracy rate would equal 0.491. Our model accuracy rate of 49.9% meets this criterion.

Since one of our groups (voters for Clinton) contains 49.2% of the cases, we should also apply the maximum by chance criterion. A 25% increase over the largest group would equal 0.614. Our model accuracy rate of 49.9% fails to meet this criterion. The usefulness of the relationship between the demographic variables and voter preference is therefore questionable.
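The two by-chance criteria can be checked with a short Python sketch. The group proportions 0.358, 0.150, and 0.492 are taken from the classification output above; the small difference from the reported 0.614 reflects rounding of the largest-group proportion:

```python
# Group proportions from the classification output: Bush, Perot, Clinton.
proportions = [0.358, 0.150, 0.492]

# Proportional by chance accuracy: the sum of the squared group proportions.
proportional_chance = sum(p ** 2 for p in proportions)

# The "25% increase" criteria used in the text.
proportional_criterion = 1.25 * proportional_chance
maximum_criterion = 1.25 * max(proportions)

print(round(proportional_chance, 3))     # 0.393
print(round(proportional_criterion, 3))  # 0.491
print(round(maximum_criterion, 3))       # 0.615 (0.614 in the text, which used
                                         # the unrounded largest-group proportion)

model_accuracy = 0.499
print(model_accuracy > proportional_criterion)  # True  - meets this criterion
print(model_accuracy > maximum_criterion)       # False - fails this criterion
```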

Slide 17

There are several numerical problems that can occur in logistic regression that are not detected by SPSS or other statistical packages: multicollinearity among the independent variables; zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable; and "complete separation," whereby the groups of the dependent variable can be perfectly separated by scores on one of the independent variables. All of these problems produce large standard errors (over 2) for the variables included in the analysis, and very often produce very large B coefficients as well. If we encounter large standard errors for the predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the variables involved to try to identify the source of the problem.

None of the standard errors or B coefficients are excessively large, so there is no evidence of a numerical problem with this analysis.

Slide 18

Presence of outliers

Multinomial logistic regression does not provide any output for detecting outliers. However, if we are concerned with outliers, we can identify outliers on the combination of independent variables by computing Mahalanobis distance in the SPSS regression procedure.

Slide 19

In this section, we address the following issues:

- Identifying the statistically significant predictor variables
- Direction of relationship and contribution to the dependent variable

Slide 20

There are two outputs related to the statistical significance of individual predictor variables: the Likelihood Ratio Tests and the Parameter Estimates. The Likelihood Ratio Tests indicate the contribution of each variable to the overall relationship between the dependent variable and the independent variables. The Parameter Estimates focus on the role of each independent variable in differentiating between the groups specified by the dependent variable. The likelihood ratio tests test the hypothesis that the variable contributes to the reduction in error measured by the -2 log likelihood statistic. In this model, the variables age, degree, and sex are all significant contributors to explaining differences in voting preference.

Slide 21

The two equations in the table of Parameter Estimates are labeled by the group they contrast to the reference group. The first equation is labeled "1 Bush," and the second equation is labeled "2 Perot." The coefficients for each logistic regression equation are found in the column labeled B. The hypothesis that the coefficient is not zero, i.e. that it changes the odds of the dependent variable event, is tested with the Wald statistic, instead of the t-test that was used for the individual B coefficients in multiple regression. The variables that had a statistically significant relationship to distinguishing voters for Bush from voters for Clinton in the first logistic regression equation were DEGREE=3 (bachelor's degree) and SEX=1 (male). The variables that had a statistically significant relationship to distinguishing voters for Perot from voters for Clinton were AGE, DEGREE=2 (junior college degree), and SEX=1 (male).
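The Wald test itself is simple to reproduce. Here is a hedged sketch in Python using made-up values for B and its standard error rather than figures from this output; for 1 degree of freedom, the chi-square p-value can be written with the complementary error function:

```python
import math

def wald_test(b, se):
    """Wald statistic for a single coefficient and its two-sided p-value.

    The statistic (B / SE)^2 follows a chi-square distribution with 1 df
    under the null hypothesis that the coefficient is zero; for 1 df the
    chi-square survival function reduces to erfc(sqrt(w / 2)).
    """
    wald = (b / se) ** 2
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Hypothetical coefficient and standard error, for illustration only.
w, p = wald_test(0.693, 0.200)
print(round(w, 2))   # 12.01
print(p < 0.05)      # True: we would reject the null hypothesis that B = 0
```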

Slide 22

Interpretation of the independent variables is aided by the "Exp(B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows: Having a bachelor's degree rather than an advanced degree increased the likelihood that a voter would choose Bush over Clinton by about 50%. Being a male increased the likelihood that a voter would choose Bush over Clinton by approximately 50% (almost 60%).

Slide 23

Interpretation of the independent variables is aided by the "Exp(B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows: Increases in age made a voter about 3% less likely to choose Perot over Clinton.

Having a junior college degree made a person about 2.3 times more likely to choose Perot over Clinton.

Being a male doubled the likelihood that a voter would choose Perot over Clinton.
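The translation from a B coefficient to these percentage statements is just exponentiation. A small Python illustration with hypothetical coefficients (not the exact values from this output):

```python
import math

def odds_ratio_summary(b):
    """Convert a logistic regression coefficient B into the odds ratio
    Exp(B) and the corresponding percentage change in the odds."""
    exp_b = math.exp(b)
    pct_change = (exp_b - 1.0) * 100.0
    return exp_b, pct_change

# Hypothetical values, for illustration:
# a coefficient of about 0.693 roughly doubles the odds,
print(odds_ratio_summary(0.693))   # Exp(B) close to 2.0, about +100%
# a coefficient of -0.03 lowers the odds by about 3% per unit increase.
print(odds_ratio_summary(-0.03))   # Exp(B) close to 0.97, about -3%
```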

Slide 24

The SPSS multinomial logistic procedure does not include the ability to select a subset of cases based on the value of a variable, so we cannot use our usual strategy for conducting a validation analysis. We can, however, accomplish the same results with a step-by-step series of syntax commands, as will be shown on the following screens. We cannot run all of the syntax commands at one time because one of the steps requires us to manually type the coefficients from the SPSS output into the syntax file so that we can calculate predicted values for the logistic regression equations.

In order to understand the steps that we will follow, we need to understand how we translate scores on the logistic regression equations into classification in a group. The multinomial logistic regression problem for three groups is solved by contrasting two of the groups with a reference group. In this problem, the reference group is Clinton voters. The classification score for the reference group is 0, just as the code for any reference group for dummy-coded variables is 0. The first logistic regression equation computes a score that tests whether the subject is more likely a member of the group of Bush voters than a member of the group of Clinton voters. Similarly, the second logistic regression equation tests whether the subject is more likely to be a Perot voter than a Clinton voter.

Slide 25

The classification problem thus involves the comparison of three scores, one associated with each of the groups. The first score (which we will label g1) is associated with voting for Bush; the second score (g2) with voting for Perot; and the third score (g3) with voting for Clinton. Calculating g1 and g2 requires substituting each subject's values into the logistic regression equations; g3 is always 0. The scores g1, g2, and g3 are estimates of the log odds of belonging to each group. To convert the scores into probabilities of group membership, we convert each score into its antilog equivalent and divide by the sum of the three antilog equivalents. To estimate group membership, we compare the three probabilities and assign the subject to the group associated with the highest probability.
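The conversion from the three scores to probabilities and a predicted group can be sketched in Python (the g1 and g2 values below are hypothetical; g3 is fixed at 0 for the Clinton reference group):

```python
import math

def classify(g1, g2, g3=0.0):
    """Convert the log-odds scores for the three groups into probabilities
    and return (probabilities, predicted group number)."""
    antilogs = [math.exp(g) for g in (g1, g2, g3)]
    total = sum(antilogs)
    probs = [a / total for a in antilogs]
    # Group numbers: 1 = Bush, 2 = Perot, 3 = Clinton (the reference group).
    predicted = probs.index(max(probs)) + 1
    return probs, predicted

# Hypothetical scores for one subject.
probs, group = classify(g1=0.8, g2=-0.4)
print([round(p, 3) for p in probs])  # three probabilities summing to 1
print(group)                         # 1 -> classified as a Bush voter
```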

Slide 26

The first step in our validation analysis is to create the split variable.

* Compute the split variable for the learning and validation samples.
SET SEED 2000000.
COMPUTE split = uniform(1) > 0.50 .
EXECUTE .
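The same idea can be sketched in Python: seed a random number generator and flag roughly half the cases (the seed value mirrors the SPSS syntax but the two generators will not produce the same draws; `uniform(1) > 0.50` becomes a comparison on `random.random()`):

```python
import random

# Mirror of the SPSS logic: seed the generator, then flag each case
# as split = 1 (one half-sample) when a uniform draw exceeds 0.50.
random.seed(2000000)

n_cases = 1845  # number of cases in the data set
split = [1 if random.random() > 0.50 else 0 for _ in range(n_cases)]

# Roughly half the cases land in each half-sample.
print(sum(split))  # close to, but not exactly, n_cases / 2
```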

Slide 27

Creating the Multinomial Logistic Regression for the First Half of the Data

Next, we run the multinomial logistic regression on the first half of the sample, where split = 0.

* Select the cases to include in the first validation analysis.
USE ALL.
COMPUTE filter_$=(split=0).
FILTER BY filter_$.
EXECUTE .

* Run the multinomial logistic regression for these cases.
NOMREG pres92 BY degree sex WITH age educ
  /CRITERIA = CIN(95) DELTA(0) MXITER(100) MXSTEP(5) LCONVERGE(0)
    PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /MODEL
  /INTERCEPT = INCLUDE
  /PRINT = CLASSTABLE PARAMETER SUMMARY LRT .

Slide 28

To compute the classification scores for the logistic regression equations, we need to enter the B coefficients for each equation into SPSS using compute commands. For the first set of coefficients, we will use the letter A followed by a number; for the second set, the letter B followed by a number. The complete set of compute commands appears below the graphic.

Slide 29

* Assign the coefficients from the model just run to variables.
compute A0 = 0.4371960543979.
compute A1 = 0.000141395117344668.
compute A2 = -0.0627104600309503.
compute A3 = -0.498317435855329.
compute A4 = 0.000960703262109129.
compute A5 = 0.100394066914469.
compute A6 = 0.0289917073909467.
compute A7 = 0.395907588574936.
compute B0 = -0.181048831733511.
compute B1 = -0.0230592828938232.
compute B2 = -0.0511669299018998.
compute B3 = -0.548361281578711.
compute B4 = 0.482047826644372.
compute B5 = 0.532843066729832.
compute B6 = 0.492518027246711.
compute B7 = 0.6773170430501.
execute.

Slide 30

Before we can enter the logistic regression equations, we need to explicitly create the dummy coded variables which the logistic regression procedure created internally for the variables that we specified as factors.

* Create the dummy coded variables which SPSS created.
* Use a logical assignment to code the variables as 0 or 1.
compute degree0 = (degree = 0).
compute degree1 = (degree = 1).
compute degree2 = (degree = 2).
compute degree3 = (degree = 3).
compute degree4 = (degree = 4).
compute sex1 = (sex = 1).
execute.

The logistic regression equations can be entered as compute statements. We will also enter the zero value for the third group, g3.

compute g1 = A0 + A1 * AGE + A2 * EDUC + A3 * DEGREE0 + A4 * DEGREE1
   + A5 * DEGREE2 + A6 * DEGREE3 + A7 * SEX1.
compute g2 = B0 + B1 * AGE + B2 * EDUC + B3 * DEGREE0 + B4 * DEGREE1
   + B5 * DEGREE2 + B6 * DEGREE3 + B7 * SEX1.
compute g3 = 0.
execute.

When these statements are run in SPSS, the scores for g1, g2, and g3 will be added to the dataset.

Slide 31

We convert the three scores into odds using the EXP function. When we divide each of the three odds values by their sum, we end up with a probability of membership in each group.

* Compute the probabilities of membership in each group.
compute p1 = exp(g1) / (exp(g1) + exp(g2) + exp(g3)).
compute p2 = exp(g2) / (exp(g1) + exp(g2) + exp(g3)).
compute p3 = exp(g3) / (exp(g1) + exp(g2) + exp(g3)).
execute.

* Translate the probabilities into predicted group membership.
if (p1 > p2 and p1 > p3) predgrp = 1.
if (p2 > p1 and p2 > p3) predgrp = 2.
if (p3 > p1 and p3 > p2) predgrp = 3.
execute.

When these statements are run in SPSS, the dataset will have both actual and predicted membership for the first validation sample.

Slide 32

To produce a classification table for the validation sample, we change the filter criteria to include cases where split = 1, and create a contingency table of predicted voting versus actual voting.

USE ALL.
COMPUTE filter_$=(split=1).
FILTER BY filter_$.
EXECUTE.

CROSSTABS
  /TABLES=pres92 BY predgrp
  /FORMAT= AVALUE TABLES
  /CELLS= COUNT TOTAL .

These commands produce the following table. The classification accuracy rate is computed by adding the percentages for the cells where predicted voting coincides with actual voting behavior: 6.3% + 42.2% = 48.5%.

Slide 33

The second validation analysis follows the same series of commands, except that we build the model with the cases where split = 1 and validate the model on cases where split = 0. The results from my calculations have been entered into the validation table below.

                                      Full Model           Split = 0            Split = 1
Model Chi-Square                      118.497, p < 0.0001  42.610, p < 0.0001   92.772, p < 0.0001
Nagelkerke R²                         0.072                0.051                0.113
Accuracy Rate for Learning Sample     49.9%                48.8%                50.6%
Accuracy Rate for Validation Sample   -                    48.5%                46.9%

Significant Coefficients (p < 0.05):
  Equation 1: DEGREE = 3, SEX = 1
  Equation 2: AGE, DEGREE = 2, SEX = 1

Slide 34

We can summarize the results of the validation analyses in the following table.

                                      Full Model           Split = 0            Split = 1
Model Chi-Square                      118.497, p < 0.0001  42.610, p < 0.0001   92.772, p < 0.0001
Nagelkerke R²                         0.072                0.051                0.113
Accuracy Rate for Learning Sample     49.9%                48.8%                50.6%
Accuracy Rate for Validation Sample   -                    48.5%                46.9%

Significant Coefficients (p < 0.05):
  Equation 1: DEGREE = 3, SEX = 1
  Equation 2: AGE, DEGREE = 2, SEX = 1

From the validation table, we see that the original model is verified by the accuracy rates for the validation analyses. SEX and AGE would appear to be the more reliable predictors of voting behavior. However, the relationship is weak and falls short of the classification accuracy criteria for a useful model.

Slide 35
