
To demonstrate multinomial logistic regression, we will work the sample problem for multinomial logistic regression in SPSS Regression Models 10.0, pages 65-82. The description of the problem on page 66 states that the 1996 General Social Survey asked people whom they voted for in 1992. Demographic variables from the GSS, such as sex, age, and education, can be used to identify the relationships between voter demographics and voter preference. The data for this problem are in the file voter.sav.

Slide 1

In this stage, the following issues are addressed:

- Relationship to be analyzed
- Specifying the dependent and independent variables
- Method for including independent variables

Relationship to be analyzed

The goal of this analysis is to examine the relationship between presidential choice in 1992, sex, age, and education.

Slide 2

The dependent variable is pres92 'Vote for Clinton, Bush, Perot'. It has three categories: 1 is a vote for Bush, 2 is a vote for Perot, and 3 is a vote for Clinton. SPSS will solve the problem by contrasting votes for Bush to votes for Clinton, and votes for Perot to votes for Clinton. By default, SPSS uses the highest numbered category as the reference category.

The independent variables are:

- AGE 'Age of respondent'
- EDUC 'Highest year of school completed'
- DEGREE 'Respondent's Highest Degree'
- SEX 'Respondent's Sex'

The only method for including variables in multinomial logistic regression in SPSS is direct entry of all variables.

Slide 3

In this stage, the following issues are addressed:

- Missing data analysis
- Minimum sample size requirement: 15-20 cases per independent variable

Only 2 of the 1847 cases have any missing data. Since the number of cases with missing data is so small, it cannot reflect a missing data process that would distort the analysis. We will bypass any missing data analysis.

The data set has 1845 cases and 4 independent variables, a ratio of roughly 461 to 1, well in excess of the requirement of 15-20 cases per independent variable.

Slide 4

In this stage, the following issues are addressed:

- Incorporating nonmetric data with dummy variables
- Representing curvilinear effects with polynomials
- Representing interaction or moderator effects

It is not necessary to create dummy variables for nonmetric data since SPSS will do this automatically when we specify that a variable is a factor in the model.

We do not have any evidence of curvilinear effects at this point in the analysis, though the SPSS text for this problem points out that there is a curvilinear relationship between education and voting preference, which led them to create the variable DEGREE 'Respondent's Highest Degree'. Democrats (i.e. Clinton voters) are favored both by those with little formal education and by those who have advanced degrees.

We do not have any evidence at this point in the analysis that we should add interaction or moderator variables. The SPSS procedure makes it very easy to add interaction terms.

Multinomial Logistic Regression

Slide 5

In this stage, the following issues are addressed:

- Nonmetric dependent variable with two or more groups
- Metric or nonmetric independent variables

The dependent variable pres92 Vote for Clinton, Bush, Perot has three categories.

AGE and EDUC, as metric variables, will be entered as covariates in the model. SEX and DEGREE, as nonmetric variables, will be entered as factors.

Slide 6

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation

In this stage, the following issue is addressed:

- Computing the logistic regression model

The steps to obtain a logistic regression analysis are detailed on the following screens.

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Assessing Model Fit

In this stage, the following issues are addressed:

- Significance test of the model log likelihood (change in -2LL)
- Measures analogous to R²: Cox and Snell R² and Nagelkerke R²
- Classification matrices as a measure of model accuracy
- Check for numerical problems
- Presence of outliers

Slide 14

The Initial Log Likelihood Function (-2 Log Likelihood, or -2LL) is a statistical measure analogous to the total sum of squares in regression. If our independent variables have a relationship to the dependent variable, our ability to predict the dependent variable accurately improves, and the log likelihood measure decreases.

The initial log likelihood value (2718.636) is a measure of a model with no independent variables, i.e. only a constant or intercept. The final log likelihood value (2600.138) is the measure computed after all of the independent variables have been entered into the logistic regression. The difference between these two measures is the model chi-square value (118.497 = 2718.636 - 2600.138, within rounding) that is tested for statistical significance. This test is analogous to the F-test for R² or change in R² in multiple regression, which tests whether or not the improvement in the model associated with the additional variables is statistically significant.

In this problem the model Chi-Square value of 118.497 has a significance < 0.0001, so we conclude that there is a significant relationship between the dependent variable and the set of independent variables.
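The arithmetic behind the model chi-square can be sketched in a few lines of Python. The -2LL values are copied from the SPSS output quoted above; the tiny discrepancy with the reported 118.497 is rounding in the printed log likelihoods:

```python
# Model chi-square = initial -2LL (intercept-only model) minus final -2LL
# (model with all predictors). Values copied from the SPSS output above.
initial_2ll = 2718.636   # -2LL for the intercept-only model
final_2ll = 2600.138     # -2LL after entering all independent variables

model_chi_square = initial_2ll - final_2ll
print(round(model_chi_square, 3))  # 118.498, matching the reported 118.497
                                   # up to rounding of the printed -2LL values
```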

Slide 15

Measures Analogous to R²

The next SPSS outputs indicate the strength of the relationship between the dependent variable and the independent variables, analogous to the R² measures in multiple regression.

The Cox and Snell R² measure operates like R², with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that has a range from 0 to 1. We will rely upon Nagelkerke's measure to indicate the strength of the relationship. If we apply our interpretive criteria to the Nagelkerke R², we would characterize the relationship as weak.

Slide 16

The classification matrix in multinomial logistic regression serves the same function as the classification matrix in discriminant analysis, i.e. evaluating the accuracy of the model.

If the predicted and actual group memberships are the same, i.e. 1 and 1, 2 and 2, or 3 and 3, then the prediction is accurate for that case. If predicted and actual group membership differ, the model "misses" for that case. The overall percentage of accurate predictions (49.9% in this case) is the measure of model fit that I rely on most heavily for this analysis because it has a meaning that is readily communicated, i.e. the percentage of cases for which the model predicts accurately.

To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and, if appropriate, the maximum by chance accuracy rate. The proportional by chance accuracy rate is equal to 0.393 (0.358^2 + 0.150^2 + 0.492^2). A 25% increase over the proportional by chance accuracy rate would equal 0.491. Our model accuracy rate of 49.9% meets this criterion.

Since one of our groups (voters for Clinton) contains 49.2% of the cases, we should also apply the maximum by chance criterion. A 25% increase over the largest group would equal 0.614. Our model accuracy rate of 49.9% fails to meet this criterion. The usefulness of the relationship between the demographic variables and voter preference is therefore questionable.
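The two by-chance criteria can be checked with a short Python sketch. The group proportions 0.358, 0.150, and 0.492 are taken from the classification output above; the small difference from the reported 0.614 reflects rounding of the largest-group proportion:

```python
# Group proportions from the classification output: Bush, Perot, Clinton.
proportions = [0.358, 0.150, 0.492]

# Proportional by chance accuracy: the sum of the squared group proportions.
proportional_chance = sum(p ** 2 for p in proportions)

# The "25% increase" criteria used in the text.
proportional_criterion = 1.25 * proportional_chance
maximum_criterion = 1.25 * max(proportions)

print(round(proportional_chance, 3))     # 0.393
print(round(proportional_criterion, 3))  # 0.491
print(round(maximum_criterion, 3))       # 0.615 (0.614 in the text, which used
                                         # the unrounded largest-group proportion)

model_accuracy = 0.499
print(model_accuracy > proportional_criterion)  # True  - meets this criterion
print(model_accuracy > maximum_criterion)       # False - fails this criterion
```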

Slide 17

There are several numerical problems that can occur in logistic regression that are not detected by SPSS or other statistical packages: multicollinearity among the independent variables; zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable; and "complete separation," whereby the groups of the dependent variable can be perfectly separated by scores on one of the independent variables. All of these problems produce large standard errors (over 2) for the variables included in the analysis, and very often produce very large B coefficients as well. If we encounter large standard errors for the predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the variables involved to try to identify the source of the problem.

None of the standard errors or B coefficients are excessively large, so there is no evidence of a numerical problem with this analysis.

Slide 18

Presence of outliers

Multinomial logistic regression does not provide any output for detecting outliers. However, if we are concerned with outliers, we can identify outliers on the combination of independent variables by computing Mahalanobis distance in the SPSS regression procedure.

Slide 19

In this section, we address the following issues:

- Identifying the statistically significant predictor variables
- Direction of relationship and contribution to the dependent variable

Slide 20

There are two outputs related to the statistical significance of individual predictor variables: the Likelihood Ratio Tests and the Parameter Estimates. The Likelihood Ratio Tests indicate the contribution of each variable to the overall relationship between the dependent variable and the independent variables. The Parameter Estimates focus on the role of each independent variable in differentiating between the groups specified by the dependent variable. The likelihood ratio tests test the hypothesis that the variable contributes to the reduction in error measured by the -2 log likelihood statistic. In this model, the variables age, degree, and sex are all significant contributors to explaining differences in voting preference.

Slide 21

The two equations in the table of Parameter Estimates are labeled by the group they contrast to the reference group. The first equation is labeled "1 Bush," and the second equation is labeled "2 Perot." The coefficients for each logistic regression equation are found in the column labeled B. The hypothesis that the coefficient is not zero, i.e. that it changes the odds of the dependent variable event, is tested with the Wald statistic, instead of the t-test that was used for the individual B coefficients in multiple regression. The variables that had a statistically significant relationship to distinguishing voters for Bush from voters for Clinton in the first logistic regression equation were DEGREE=3 (bachelor's degree) and SEX=1 (male). The variables that had a statistically significant relationship to distinguishing voters for Perot from voters for Clinton were AGE, DEGREE=2 (junior college degree), and SEX=1 (male).
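The Wald test itself is simple to reproduce. Here is a hedged sketch in Python using made-up values for B and its standard error rather than figures from this output; for 1 degree of freedom, the chi-square p-value can be written with the complementary error function:

```python
import math

def wald_test(b, se):
    """Wald statistic for a single coefficient and its two-sided p-value.

    The statistic (B / SE)^2 follows a chi-square distribution with 1 df
    under the null hypothesis that the coefficient is zero; for 1 df the
    chi-square survival function reduces to erfc(sqrt(w / 2)).
    """
    wald = (b / se) ** 2
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Hypothetical coefficient and standard error, for illustration only.
w, p = wald_test(0.693, 0.200)
print(round(w, 2))   # 12.01
print(p < 0.05)      # True: we would reject the null hypothesis that B = 0
```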

Slide 22

Interpretation of the independent variables is aided by the "Exp(B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows: Having a bachelor's degree rather than an advanced degree increased the likelihood that a voter would choose Bush over Clinton by about 50%. Being a male increased the likelihood that a voter would choose Bush over Clinton by approximately 50% (almost 60%).

Slide 23

Interpretation of the independent variables is aided by the "Exp(B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows: Increases in age made a voter about 3% less likely to choose Perot over Clinton.

Having a junior college degree made a person about 2.3 times more likely to choose Perot over Clinton.

Being a male doubled the likelihood that a voter would choose Perot over Clinton.
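The translation from a B coefficient to these percentage statements is just exponentiation. A small Python illustration with hypothetical coefficients (not the exact values from this output):

```python
import math

def odds_ratio_summary(b):
    """Convert a logistic regression coefficient B into the odds ratio
    Exp(B) and the corresponding percentage change in the odds."""
    exp_b = math.exp(b)
    pct_change = (exp_b - 1.0) * 100.0
    return exp_b, pct_change

# Hypothetical values, for illustration:
# a coefficient of about 0.693 roughly doubles the odds,
print(odds_ratio_summary(0.693))   # Exp(B) close to 2.0, about +100%
# a coefficient of -0.03 lowers the odds by about 3% per unit increase.
print(odds_ratio_summary(-0.03))   # Exp(B) close to 0.97, about -3%
```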

Slide 24

The SPSS multinomial logistic procedure does not include the ability to select a subset of cases based on the value of a variable, so we cannot use our usual strategy for conducting a validation analysis. We can, however, accomplish the same results with a step-by-step series of syntax commands, as will be shown on the following screens. We cannot run all of the syntax commands at one time because one of the steps requires us to manually type the coefficients from the SPSS output into the syntax file so that we can calculate predicted values for the logistic regression equations.

In order to understand the steps that we will follow, we need to understand how we translate scores on the logistic regression equations into classification in a group. The multinomial logistic regression problem for three groups is solved by contrasting two of the groups with a reference group. In this problem, the reference group is Clinton voters. The classification score for the reference group is 0, just as the code for any reference group for dummy-coded variables is 0. The first logistic regression equation computes a score that tests whether the subject is more likely a member of the group of Bush voters than a member of the group of Clinton voters. Similarly, the second logistic regression equation tests whether the subject is more likely to be a Perot voter than a Clinton voter.

Slide 25

The classification problem thus involves the comparison of three scores, one associated with each of the groups. The first score (which we will label g1) is associated with voting for Bush; the second score (g2) with voting for Perot; and the third score (g3) with voting for Clinton. Calculating g1 and g2 requires substituting each subject's values into the logistic regression equations; g3 is always 0. The scores g1, g2, and g3 are estimates of the log odds of belonging to each group. To convert the scores into probabilities of group membership, we convert each score into its antilog equivalent and divide by the sum of the three antilog equivalents. To estimate group membership, we compare the three probabilities and assign the subject to the group associated with the highest probability.
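The conversion from the three scores to probabilities and a predicted group can be sketched in Python (the g1 and g2 values below are hypothetical; g3 is fixed at 0 for the Clinton reference group):

```python
import math

def classify(g1, g2, g3=0.0):
    """Convert the log-odds scores for the three groups into probabilities
    and return (probabilities, predicted group number)."""
    antilogs = [math.exp(g) for g in (g1, g2, g3)]
    total = sum(antilogs)
    probs = [a / total for a in antilogs]
    # Group numbers: 1 = Bush, 2 = Perot, 3 = Clinton (the reference group).
    predicted = probs.index(max(probs)) + 1
    return probs, predicted

# Hypothetical scores for one subject.
probs, group = classify(g1=0.8, g2=-0.4)
print([round(p, 3) for p in probs])  # three probabilities summing to 1
print(group)                         # 1 -> classified as a Bush voter
```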

Slide 26

The first step in our validation analysis is to create the split variable.

* Compute the split variable for the learning and validation samples.
SET SEED 2000000.
COMPUTE split = uniform(1) > 0.50 .
EXECUTE .
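The same idea can be sketched in Python: seed a random number generator and flag roughly half the cases (the seed value mirrors the SPSS syntax but the two generators will not produce the same draws; `uniform(1) > 0.50` becomes a comparison on `random.random()`):

```python
import random

# Mirror of the SPSS logic: seed the generator, then flag each case
# as split = 1 (one half-sample) when a uniform draw exceeds 0.50.
random.seed(2000000)

n_cases = 1845  # number of cases in the data set
split = [1 if random.random() > 0.50 else 0 for _ in range(n_cases)]

# Roughly half the cases land in each half-sample.
print(sum(split))  # close to, but not exactly, n_cases / 2
```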

Slide 27

Creating the Multinomial Logistic Regression for the First Half of the Data

Next, we run the multinomial logistic regression on the first half of the sample, where split = 0.

* Select the cases to include in the first validation analysis.
USE ALL.
COMPUTE filter_$=(split=0).
FILTER BY filter_$.
EXECUTE .

* Run the multinomial logistic regression for these cases.
NOMREG pres92 BY degree sex WITH age educ
  /CRITERIA = CIN(95) DELTA(0) MXITER(100) MXSTEP(5) LCONVERGE(0)
    PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /MODEL
  /INTERCEPT = INCLUDE
  /PRINT = CLASSTABLE PARAMETER SUMMARY LRT .

Slide 28

To compute the classification scores for the logistic regression equations, we need to enter the B coefficients for each equation into SPSS using compute commands. For the first set of coefficients, we will use the letter A followed by a number; for the second set, the letter B followed by a number. The complete set of compute commands appears below the graphic.

Slide 29

* Assign the coefficients from the model just run to variables.
compute A0 = 0.4371960543979.
compute A1 = 0.000141395117344668.
compute A2 = -0.0627104600309503.
compute A3 = -0.498317435855329.
compute A4 = 0.000960703262109129.
compute A5 = 0.100394066914469.
compute A6 = 0.0289917073909467.
compute A7 = 0.395907588574936.
compute B0 = -0.181048831733511.
compute B1 = -0.0230592828938232.
compute B2 = -0.0511669299018998.
compute B3 = -0.548361281578711.
compute B4 = 0.482047826644372.
compute B5 = 0.532843066729832.
compute B6 = 0.492518027246711.
compute B7 = 0.6773170430501.
execute.

Slide 30

Before we can enter the logistic regression equations, we need to explicitly create the dummy coded variables which the logistic regression procedure created internally for the variables that we specified as factors.

* Create the dummy coded variables which SPSS created.
* Use a logical assignment to code the variables as 0 or 1.
compute degree0 = (degree = 0).
compute degree1 = (degree = 1).
compute degree2 = (degree = 2).
compute degree3 = (degree = 3).
compute degree4 = (degree = 4).
compute sex1 = (sex = 1).
execute.

The logistic regression equations can be entered as compute statements. We will also enter the zero value for the third group, g3.

compute g1 = A0 + A1 * AGE + A2 * EDUC + A3 * DEGREE0 + A4 * DEGREE1
   + A5 * DEGREE2 + A6 * DEGREE3 + A7 * SEX1.
compute g2 = B0 + B1 * AGE + B2 * EDUC + B3 * DEGREE0 + B4 * DEGREE1
   + B5 * DEGREE2 + B6 * DEGREE3 + B7 * SEX1.
compute g3 = 0.
execute.

When these statements are run in SPSS, the scores for g1, g2, and g3 will be added to the dataset.

Slide 31

We convert the three scores into odds using the EXP function. When we divide each of the three odds values by their sum, we end up with a probability of membership in each group.

* Compute the probabilities of membership in each group.
compute p1 = exp(g1) / (exp(g1) + exp(g2) + exp(g3)).
compute p2 = exp(g2) / (exp(g1) + exp(g2) + exp(g3)).
compute p3 = exp(g3) / (exp(g1) + exp(g2) + exp(g3)).
execute.

* Translate the probabilities into predicted group membership.
if (p1 > p2 and p1 > p3) predgrp = 1.
if (p2 > p1 and p2 > p3) predgrp = 2.
if (p3 > p1 and p3 > p2) predgrp = 3.
execute.

When these statements are run in SPSS, the dataset will have both actual and predicted membership for the first validation sample.

Slide 32

To produce a classification table for the validation sample, we change the filter criteria to include cases where split = 1, and create a contingency table of predicted voting versus actual voting.

USE ALL.
COMPUTE filter_$=(split=1).
FILTER BY filter_$.
EXECUTE.

CROSSTABS
  /TABLES=pres92 BY predgrp
  /FORMAT= AVALUE TABLES
  /CELLS= COUNT TOTAL .

These commands produce the following table. The classification accuracy rate is computed by adding the percentages for the cells where predicted voting coincides with actual voting behavior: 6.3% + 42.2% = 48.5%.

Slide 33

The second validation analysis follows the same series of commands, except that we build the model with the cases where split = 1 and validate the model on cases where split = 0. The results from my calculations have been entered into the validation table below.

                                      Full Model           Split = 0            Split = 1
Model Chi-Square                      118.497, p < 0.0001  42.610, p < 0.0001   92.772, p < 0.0001
Nagelkerke R²                         0.072                0.051                0.113
Accuracy Rate for Learning Sample     49.9%                48.8%                50.6%
Accuracy Rate for Validation Sample   -                    48.5%                46.9%

Significant Coefficients (p < 0.05):
  Equation 1: DEGREE = 3, SEX = 1
  Equation 2: AGE, DEGREE = 2, SEX = 1

Slide 34

We can summarize the results of the validation analyses in the following table.

                                      Full Model           Split = 0            Split = 1
Model Chi-Square                      118.497, p < 0.0001  42.610, p < 0.0001   92.772, p < 0.0001
Nagelkerke R²                         0.072                0.051                0.113
Accuracy Rate for Learning Sample     49.9%                48.8%                50.6%
Accuracy Rate for Validation Sample   -                    48.5%                46.9%

Significant Coefficients (p < 0.05):
  Equation 1: DEGREE = 3, SEX = 1
  Equation 2: AGE, DEGREE = 2, SEX = 1

From the validation table, we see that the original model is verified by the accuracy rates for the validation analyses. SEX and AGE would appear to be the more reliable predictors of voting behavior. However, the relationship is weak and falls short of the classification accuracy criteria for a useful model.

Slide 35
