To perform Partial Correlation Analysis, we can use the employeedata.sav file. From this file we can estimate the correlation between current salary (salary) and beginning salary (salbegin), using months since hire (jobtime) and previous experience (prevexp) as control variables. The order of the partial correlation coefficient is determined by the number of control variables.
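The residual-based computation behind a partial correlation can be sketched in Python with NumPy. This is only an illustration, not SPSS's Partial Correlations procedure: the data below are made up so that the second-order partial correlation (two control variables, as in the example above) comes out exactly 1, rather than taken from employeedata.sav.

```python
import numpy as np

def partial_corr(x, y, controls):
    """Correlation between x and y after removing the linear
    effect of the control variables (one array per control)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    Z = np.column_stack([np.ones(len(x))] + list(controls))  # intercept + controls
    # Residuals of x and y after regressing each on the controls
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Toy stand-ins for jobtime and prevexp; both "salary" series share the
# same residual once the controls are removed, so the partial correlation is 1.
jobtime = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
prevexp = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
noise   = np.array([0.5, -0.3, 0.1, -0.4, 0.2])
salary   = 3 * jobtime + prevexp + noise
salbegin = 2 * jobtime - prevexp + noise
r = partial_corr(salary, salbegin, [jobtime, prevexp])
```

With one control variable the same function gives a first-order coefficient; with two, a second-order coefficient, matching the rule stated above.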
REGRESSION
Linear Regression estimates the coefficients of the linear equation. To analyze the relationship between variables we select the file employeedata.sav. We will analyze the relationship between current salary as the dependent variable and education level, previous experience, beginning salary, and months since hire as the independent variables.
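The least-squares fit that this procedure performs can be sketched with NumPy's solver. The numbers here are invented stand-ins (an intercept column plus three predictors playing the role of educ, prevexp, and salbegin), constructed from known coefficients so the fit can be checked; they are not the employeedata.sav values.

```python
import numpy as np

# Design matrix: leading 1s for the intercept, then three predictors
X = np.array([[1.0, 12, 2, 20],
              [1.0, 16, 5, 30],
              [1.0, 14, 3, 25],
              [1.0, 18, 8, 40],
              [1.0, 12, 1, 15]])
true_b = np.array([5.0, 2.0, 0.5, 1.5])   # intercept + three slopes
y = X @ true_b                            # exact linear response, no noise
b, *_ = np.linalg.lstsq(X, y, rcond=None) # least-squares coefficient estimates
```

Because the response was built exactly from true_b with no error term, the estimated coefficients recover those values.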
LOGISTIC REGRESSION
To discuss the procedure for conducting Logistic Regression, we will use the sample file employeedata.sav. The analysis uses minority (coded 0=no, 1=yes) as the dependent variable. The independent variables are years of education (educ), months of previous experience (prevexp), job category (jobcat; 1=clerical, 2=custodial, 3=managerial), and gender (m, f).
Logistic regression is useful for situations in which you want to be able to predict the presence or absence of a characteristic or outcome based on values of a set of predictor variables. It is similar to a linear regression model but is suited to models where the dependent variable is dichotomous. Logistic regression coefficients can be used to estimate odds ratios for each of the independent variables in the model. Logistic regression is applicable to a broader range of research situations than discriminant analysis.
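The link between a logistic coefficient and an odds ratio can be sketched in plain NumPy. This is a minimal gradient-ascent fit on a tiny made-up sample (a centered predictor and a dichotomous outcome), only an illustration of exp(B), not the SPSS estimation routine.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Maximum-likelihood logistic fit via gradient ascent."""
    X = np.column_stack([np.ones(len(y)), X])   # add intercept column
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ b))        # predicted probabilities
        b += lr * X.T @ (y - p) / len(y)        # log-likelihood gradient step
    return b

# Hypothetical centered predictor and 0/1 outcome (overlapping groups,
# so a finite maximum-likelihood solution exists)
x = np.array([-2.0, -1.0, 0.0, 0.0, 1.0, 2.0])
y = np.array([0,     1,   0,   0,   1,   1])
b = fit_logistic(x.reshape(-1, 1), y)
odds_ratio = np.exp(b[1])   # exp(B): multiplicative change in odds per unit of x
```

Since larger x values go with more 1s here, the slope is positive and exp(B) exceeds 1, i.e. each unit increase in the predictor multiplies the odds of the outcome by more than one.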
Statistics. For each analysis: total cases, selected cases, valid cases. For each categorical variable: parameter coding. For each step: variable(s) entered or removed, iteration history, -2 log-likelihood, goodness of fit, Hosmer-Lemeshow goodness-of-fit statistic, model chi-square, improvement chi-square, classification table, correlations between variables, observed groups and predicted probabilities chart, residual chi-square. For each variable in the equation: coefficient (B), standard error of B, Wald statistic, R, estimated odds ratio (exp(B)), confidence interval for exp(B), log-likelihood if term removed from model. For each variable not in the equation: score statistic, R. For each case: observed group, predicted probability, predicted group, residual, standardized residual.
Data. The dependent variable should be dichotomous. Independent variables can be interval level or categorical; if categorical, they should be dummy or indicator coded (there is an option in the procedure to recode categorical variables automatically).

Assumptions. Logistic regression does not rely on distributional assumptions in the same sense that discriminant analysis does. However, your solution may be more stable if your predictors have a multivariate normal distribution. Additionally, as with other forms of regression, multicollinearity among the predictors can lead to biased estimates and inflated standard errors. The procedure is most effective when group membership is a truly categorical variable; if group membership is based on values of a continuous variable (for example, "high IQ" versus "low IQ"), you should consider using linear regression to take advantage of the richer information offered by the continuous variable itself.

Related procedures. Use the Scatterplot procedure to screen your data for multicollinearity. If the assumptions of multivariate normality and equal variance-covariance matrices are met, you may be able to get a quicker solution using the Discriminant Analysis procedure. If all of your predictor variables are categorical, you can also use the Loglinear procedure. If your dependent variable is continuous, use the Linear Regression procedure. You can use the ROC Curve procedure to plot probabilities saved with the Logistic Regression procedure.
To Obtain a Logistic Regression Analysis
Select one dichotomous dependent variable. This variable may be numeric or short string. Select one or more covariates. To include interaction terms, select all of the variables involved in the interaction and then select >a*b>.
To enter variables in groups (blocks), select the covariates for a block, and click Next to specify a new block. Repeat until all blocks have been specified. Optionally, you can select cases for analysis. Click Select, choose a selection variable, and click Rule.
NONPARAMETRIC TESTS
A number of nonparametric tests are available, including:
The Chi-Square Test
The Binomial Test
The Runs Test
The One-Sample Kolmogorov-Smirnov Test
Two-Independent-Samples Tests
Tests for Several Independent Samples
Two-Related-Samples Tests
Tests for Several Related Samples
NONPARAMETRIC CHI-SQUARE
The Chi-Square Test procedure tabulates a variable into categories and tests the hypothesis that the observed frequencies do not differ from their expected values. The Chi-Square Test allows you to:
Include all categories of the test variable, or limit the test to a specific range.
Use standard or customized expected values.
Obtain descriptive statistics and/or quartiles on the test variable.
Testing Independence
A large hospital schedules discharge support staff assuming that patients leave the hospital at a fairly constant rate throughout the week. However, because of increasing complaints of staff shortages, the hospital administration wants to determine whether the number of discharges varies by the day of the week. This example uses the file dischargedata.sav. Use the Chi-Square Test to test the assumption that patients leave the hospital at a constant rate.
CHI-SQUARE
Each case is a day of the week, and to perform the chi-square test, you must first weight the cases by frequency of patient discharge. To weight the cases, from the Data Editor menus choose: Data > Weight Cases...
Select Weight cases by. Select Average Daily Discharges as the frequency variable. Click OK. The cases are now weighted by frequency of patient discharge.
To begin the analysis, from the menus choose: Analyze > Nonparametric Tests > Chi-Square...
Select Day of the Week as the test variable. Click OK.
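Behind this dialog, the procedure compares the observed daily discharge counts with the equal counts expected under the constant-rate hypothesis. A stdlib-only Python sketch of that computation, using hypothetical counts (the real figures are in dischargedata.sav):

```python
# Hypothetical average daily discharges, Sunday through Saturday
observed = [44, 78, 84, 80, 82, 89, 43]
# Under the constant-rate hypothesis, each day expects an equal share
expected = [sum(observed) / len(observed)] * len(observed)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Standard chi-square critical value for df = 7 - 1 = 6, alpha = 0.05
CRITICAL_05_DF6 = 12.592
reject = chi_square > CRITICAL_05_DF6   # True here: rate is not constant
```

With the low Sunday and high Friday counts in this toy sample, the statistic far exceeds the critical value, mirroring the conclusion of the worked example.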
Summary
Using the Chi-Square Test procedure, you found that the rate at which patients are discharged from the hospital is not constant over the course of an average week. This is largely due to greater numbers of discharges on Friday and vastly fewer numbers of discharges on Sunday. When you restricted the range of the test to weekdays, the discharge rates appeared to be more uniform. You may be able to correct staff shortages by adopting separate weekday and weekend staff schedules.
Related Procedures
The Chi-Square Test procedure is useful when you want to compare a single sample from a polychotomous variable to an expected set of values. The procedure tabulates this variable into a set of frequencies and tests this observed set against either a common expected value or a customized set of expected values. The entire range of the test variable is used by default; however, its range may be restricted to any set of contiguous values. Additionally, descriptive statistics and/or quartiles can be requested. If your variable has only two outcomes, you can alternatively use the Binomial Test procedure.
For example, the percentage of churn for customers subscribing only to basic service was 31%. Similarly, customers who prefer more high-end electronic services churned at a rate of about 27% within the last month.
There are about 280 customers who subscribe to a set of convenience services (three-way calling, call forwarding, call waiting, etc.). Of these, only 16% recently churned. Customers who take advantage of all of the services offered by the firm churned the most, at 37%, about 10 percentage points higher than the average of all customers within the last month.
Of the 266 Basic service customers, 83 churned within the last month. The Observed Prop. column shows that these 83 customers account for 31% of the total Basic service group in this sample. The test proportion of 0.27 suggests that we should expect 0.27 * 266, or about 72 customers, to have churned.
The asymptotic significance value is 0.07, which is above the conventional cutoff for statistical significance (0.05). By that standard, you cannot reject the null hypothesis that the churn rate for basic service customers is equal to the churn rate in the sample at large. The same cannot be said for Plus service customers, however. In this case, the proportion, 0.16, is significantly lower than the test proportion. Many fewer Plus service customers found another service provider last month.
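One plausible reconstruction of that 0.07 asymptotic significance is the one-tailed normal approximation to the binomial with a continuity correction, using only the numbers quoted above (83 churners out of 266, test proportion 0.27). SPSS's exact computation may differ in detail; this stdlib-only sketch is an assumption.

```python
import math

n, k, p0 = 266, 83, 0.27           # Basic service customers, churners, test proportion

mean = n * p0                       # expected churners under H0 (about 71.8)
sd = math.sqrt(n * p0 * (1 - p0))   # binomial standard deviation
z = (k - 0.5 - mean) / sd           # continuity-corrected z statistic
# Upper-tail normal probability via the error function
p_one_tailed = 0.5 * (1 - math.erf(z / math.sqrt(2)))
```

The result comes out near 0.07, consistent with the significance value discussed above.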
Summary
At the other extreme, significantly more Total service customers were lost last month than the test proportion predicts. Using the Binomial Test procedure, you have determined that the rate of churn differs across customer types. Now that Total Service customers have been identified as high-risk, you can focus further efforts on finding out why these customers are dissatisfied.
Summary
Using a cut point to define the groups, you have found that a majority of the customers who churned within the last month fall below the median household income. Now that these customers have been identified as high-risk, you can focus further efforts on determining why these customers are dissatisfied.
Related Procedure
The Binomial Test procedure is useful when you want to compare a single sample from a dichotomous variable to an expected proportion. If the dichotomy does not exist in the data as a variable, one can be dynamically created based upon a cut point on a scale variable. If your variable has more than two outcomes, try the Chi-Square Test procedure. If you want to compare two dichotomous variables, try the McNemar test in the Two-Related-Samples Tests procedure.
Summary
These test tables demonstrate that the results of the runs test may depend on your choice of cut point. In this example, the order of the ratings with respect to the median is not random. On the other hand, the same ratings do not show any order with respect to either modal value. Using the runs test, you discovered that whether the test finds that ratings are related to time spent browsing depends on the cut point. Given that the test variable is composed of integer ratings data from 0 to 20, it's probably safe to treat the site ratings as ordinal. Thus, the results of the two "mode" tests stand, and you can proceed as though the usability scores are independent of time spent browsing.
Related Procedure
You should use the Runs Test procedure when you want to test the hypothesis that the values of a variable are ordered randomly with respect to a cut point of your choosing. The default cut point is the median; however, you can also select the mean, the mode, or even a custom value. Optionally, you can request descriptive statistics and/or quartiles of the test variable. The Runs Test is often used as a precursor to running tests that compare the means of two or more groups, including: The Independent-Samples T Test procedure. The One-Way ANOVA procedure. The Two-Independent-Samples Tests procedure. The Tests for Several Independent Samples procedure.
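The mechanics described above can be sketched in a few lines of stdlib Python: dichotomize the series at the cut point, count the runs, and compare the count with its expectation under randomness. The data are made up; values are split as below versus at-or-above the cut, which is one common convention.

```python
import math

def runs_test(values, cut):
    """Wald-Wolfowitz runs test of randomness about a cut point.

    Returns the number of runs and an approximate z statistic.
    """
    signs = [v >= cut for v in values]              # dichotomize at the cut point
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    n1 = sum(signs)                                 # cases at or above the cut
    n2 = len(signs) - n1                            # cases below it
    mean = 2 * n1 * n2 / (n1 + n2) + 1              # expected runs under randomness
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mean) / math.sqrt(var)
    return runs, z

# A strictly alternating series has the maximum possible number of runs,
# so the z statistic is large and the randomness hypothesis is doubtful
runs, z = runs_test([1, 9, 1, 9, 1, 9, 1, 9], cut=5)
```

Changing the cut argument changes which values fall on each side, which is exactly why the worked example's conclusion depended on the choice of cut point.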
SUMMARY
Using the One-Sample Kolmogorov-Smirnov Test procedure, you found that, overall, the number of automobile accidents per driver does not follow a Poisson distribution. However, once you split the file on gender, the distributions of accidents for males and females can individually be considered Poisson. These results demonstrate that the one-sample Kolmogorov-Smirnov test requires not only that you choose the appropriate distribution but the appropriate parameter(s) for it as well.
Related Procedures
You can use the one-sample Kolmogorov-Smirnov procedure to test the null hypothesis that a sample comes from a particular distribution. Four theoretical distribution functions are available: normal, uniform, Poisson, and exponential. Optionally, you can request descriptive statistics and/or quartiles of the test variable. If you want to compare the distributions of two variables, use the two-sample Kolmogorov-Smirnov test in the Two-Independent-Samples Tests procedure. If your variable is scale, more statistics and tests of normality are available through the Explore procedure.
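The core of the test is the D statistic: the largest gap between the sample's empirical CDF and the chosen theoretical CDF. A stdlib-only sketch against the uniform distribution on [0, 1] (one of the four distributions listed above), using a contrived, evenly spread sample so the gap is exactly 0.05:

```python
def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov D statistic."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Check the gap just after and just before each jump of the empirical CDF
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def uniform_cdf(x):
    return min(max(x, 0.0), 1.0)

# Evenly spread sample: every gap against the uniform CDF is exactly 0.05
sample = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
d = ks_statistic(sample, uniform_cdf)
```

The procedure then converts D into a significance value; the sketch stops at the statistic itself.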
Rank Table
Because the test variables are assumed to be ordinal, the Mann-Whitney and Wilcoxon tests are based on ranks of the original values and not on the values themselves. The rank table is divided into three panels, one panel for each test variable.
Rank Table
The first test variable, Travel ADL, measures the ability to regularly get around the community. It ranges from 0 to 4, where 0 = Same as before illness and 4 = Bedridden. All 46 women in the control group and all 54 women in the treatment group provided valid data for this variable. First, each case is ranked without regard to group membership. Cases tied on a particular value receive the average rank for that value. After ranking the cases, the ranks are summed within groups.
Rank Table
Average ranks adjust for differences in the number of patients in both groups. If the groups are only randomly different, the average ranks should be about equal. For Travel ADL, the average ranks are over 9 points apart. The test variables Cooking ADL and Housekeeping ADL contain missing data. For these variables, the value 4 = Never did any; thus, these scales do not apply to all patients.
However, for those to whom they do apply, there are differences of about 12 to 13 points between the average ranks of the treatment and control groups.
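The pooled ranking with tie averaging described above can be sketched in stdlib Python. The scores here are invented and far smaller than the 100-patient example; the point is only to show how tied cases receive the average of the ranks they span and how rank sums accumulate per group.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied cases the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend over the block of cases tied on this value
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2          # mean of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks

# Two groups of ordinal scores: cases are ranked without regard to group,
# then the ranks are summed within each group
control   = [0, 1, 1, 2]
treatment = [1, 2, 3, 4]
pooled = average_ranks(control + treatment)
control_sum   = sum(pooled[:len(control)])
treatment_sum = sum(pooled[len(control):])
```

The three cases tied at 1 occupy ranks 2, 3, and 4, so each receives rank 3; dividing each group's rank sum by its size gives the mean ranks the table reports.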
Compare Distributions
A grain processor has two corn yields with aflatoxin levels below 20 parts per billion, which makes them safe for human consumption. However, because aflatoxin levels vary widely across yields, he wants to compare the levels of aflatoxin in the two yields. This example uses the file aflatoxin20.sav. Use the two-sample Kolmogorov-Smirnov test to determine whether the distribution of aflatoxin differs significantly between the two safe yields.
Related Procedures
You should use the Mann-Whitney test when you want to test for differences between two groups but you are testing an ordinal variable or you have a scale variable that in some other way does not conform to the assumptions of the independent-samples t test. The Mann-Whitney and Wilcoxon tests assume that the variable you are testing is at least ordinal and that its distribution is similar in both groups. You can use the two-sample Kolmogorov-Smirnov test to validate the assumption of similar distributions. The two-sample Kolmogorov-Smirnov test evaluates the null hypothesis that two samples have the same distribution. It's a very flexible test because no specific shape is assumed for the underlying distribution. However, because the test makes no assumptions, it is sensitive to differences in both location and scale. You may want to center the test variable if you are not interested in location differences; additionally, you may want to standardize the test variable to remove both location and scale. You can alternatively use the Crosstabs procedure to test for differences between two or more groups of an ordinal or nominal variable. If your test variable passes the stricter conditions of the two-sample t test, you can use the Independent-Samples T Test procedure. If your grouping variable has more than two groups, try the Tests for Several Independent Samples procedure. If the samples defined by your grouping variable are not independent, try the Two-Related-Samples Tests procedure.
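The two-sample D statistic mentioned above is the largest vertical gap between the two samples' empirical CDFs. A stdlib-only sketch, on made-up readings standing in for the two yields (not the aflatoxin20.sav data); the samples are chosen not to overlap at all, so the gap reaches its maximum of 1.0:

```python
def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov D statistic."""
    def ecdf(xs, t):
        # Fraction of the sample at or below t
        return sum(x <= t for x in xs) / len(xs)
    points = sorted(set(a) | set(b))        # the gap can only change at data points
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in points)

# Hypothetical aflatoxin-like readings from two yields with no overlap
yield_a = [4.1, 5.0, 5.5, 6.2]
yield_b = [11.3, 12.8, 14.0, 15.5]
d = ks_two_sample(yield_a, yield_b)
```

Because every reading in the first yield falls below every reading in the second, one empirical CDF reaches 1 before the other leaves 0, a location difference of the kind the test is sensitive to.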