Problem 1:
The purpose of this problem was to predict whether or not a student would drop out using the
other explanatory variables. DROPOUT was coded as 1 if the student dropped out and 0 if the
student did not drop out. The variables used to predict the chance of dropout were a measure of
the extent to which each child showed behaviors associated with ADD (ADDSC), whether the
student repeated a grade (REPEAT), and whether the child was reported to have more than the
usual social problems in the 9th grade (SOCPROB).
Response Profile

Ordered Value    dropout    Total Frequency
            1          1                32
            2          0               218
We can conclude that there was a large imbalance between the number of children who did not
drop out (218) and the number who did (32). Ideally, we would want these counts to be closer
together for more accurate predictions.
Testing Global Null Hypothesis: BETA=0

Test                  Chi-Square    Pr > ChiSq
Likelihood Ratio         59.5975        <.0001
Score                    76.7984        <.0001
Wald                     45.9940        <.0001
Looking at the Testing Global Null Hypothesis table, we are able to conclude that at least one of
the variables is significant, with p < .0001.
Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     -4.6295            1.0709            18.6883        <.0001
socprob       1      1.2934            0.6161             4.4079        0.0358
repeat        1      2.6590            0.4810            30.5555        <.0001
addsc         1      0.0288            0.0191             2.2867        0.1305
Using the Analysis of Maximum Likelihood Estimates table we were able to create our equation
for the log odds. We see that the variables SOCPROB and REPEAT are the significant predictors
of DROPOUT, and both estimates are positive, so both increase the odds of dropout.
This leads us to the final logistic regression equation:
Y = -4.63 + 1.29(SOCPROB) + 2.66(REPEAT)
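As a quick sanity check on this equation, the log odds can be converted to a probability with the logistic function. A minimal Python sketch (the student profile is hypothetical; the coefficients are rounded from the maximum likelihood estimates table):

```python
import math

def dropout_probability(socprob, repeat_grade):
    """Convert the fitted log odds to a probability with the logistic function.

    Coefficients are rounded from the Analysis of Maximum Likelihood
    Estimates table: intercept -4.63, SOCPROB 1.29, REPEAT 2.66.
    """
    log_odds = -4.63 + 1.29 * socprob + 2.66 * repeat_grade
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical student with reported social problems who also repeated a grade
p = dropout_probability(socprob=1, repeat_grade=1)
print(round(p, 3))
```

Even with both risk factors present, the predicted dropout probability is only about one in three, which reflects the low dropout base rate in the sample.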
Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
    7.1383     8        0.5218

The Hosmer and Lemeshow test indicates a good fit, since the p-value is
greater than .05 (chi-square = 7.14, df = 8, p = .52).
Classification Table

 Prob          Correct              Incorrect                         Percentages
Level     Event  Non-Event      Event  Non-Event    Correct  Sensitivity  Specificity  False POS  False NEG
0.000        32          0        218          0       12.8        100.0          0.0       87.2          .
0.100        24        186         32          8       84.0         75.0         85.3       57.1        4.1
0.200        22        194         24         10       86.4         68.8         89.0       52.2        4.9
0.300        22        198         20         10       88.0         68.8         90.8       47.6        4.8
0.400        18        203         15         14       88.4         56.3         93.1       45.5        6.5
0.500        10        210          8         22       88.0         31.3         96.3       44.4        9.5
0.600         6        214          4         26       88.0         18.8         98.2       40.0       10.8
0.700         6        216          2         26       88.8         18.8         99.1       25.0       10.7
0.800         2        217          1         30       87.6          6.3         99.5       33.3       12.1
0.900         0        218          0         32       87.2          0.0        100.0          .       12.8
1.000         0        218          0         32       87.2          0.0        100.0          .       12.8
The classification table allows us to determine the cutoff for classifying a person as a 1 or a 0 in
the DROPOUT prediction. After looking at the percent correct, the sensitivity and the specificity,
we are able to conclude that the .300 prob level is the best cutoff. With a percent correct of 88.0,
a sensitivity of 68.8 and a specificity of 90.8, the .300 cutoff gives the most accurate results.
Looking at the .500 cutoff, the percent correct remains 88.0 (the same as the .300 level);
however, a .500 cutoff brings problems. With a much lower sensitivity, its false negative rate is
9.5, almost twice that of the .300 cutoff. So a .500 cutoff would not be ideal in this case.
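As a check on how these percentages are computed, a short sketch recomputes the .300-row figures from its four cell counts (22 correct events, 198 correct non-events, 20 incorrect events, 10 incorrect non-events); the formulas below follow the standard definitions used by these tables:

```python
def classification_metrics(corr_event, corr_nonevent, incorr_event, incorr_nonevent):
    """Recompute classification-table percentages from the four cell counts."""
    total = corr_event + corr_nonevent + incorr_event + incorr_nonevent
    percent_correct = 100 * (corr_event + corr_nonevent) / total
    # Sensitivity: share of true events classified as events
    sensitivity = 100 * corr_event / (corr_event + incorr_nonevent)
    # Specificity: share of true non-events classified as non-events
    specificity = 100 * corr_nonevent / (corr_nonevent + incorr_event)
    # False negative rate: share of predicted non-events that were actually events
    false_neg = 100 * incorr_nonevent / (corr_nonevent + incorr_nonevent)
    return percent_correct, sensitivity, specificity, false_neg

# Cell counts from the .300 probability level of the dropout table
print([round(x, 1) for x in classification_metrics(22, 198, 20, 10)])  # [88.0, 68.8, 90.8, 4.8]
```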
ods graphics on;
ods rtf file='dropout.rtf';
proc print;
run;
proc logistic descending;
  model dropout = socprob repeat addsc / risklimits lackfit
        ctable pprob=(0 to 1 by 0.1);
run;
ods graphics off;
ods rtf close;
Problem 2:
The purpose of this problem was to predict a jury's verdict in a court case. Previous research was
conducted to analyze both the physical attractiveness and the social desirability of the litigants.
By looking at the defendant on several variables, the hope of this study is to be able to predict
the jury's verdict. VERDICT, the predicted variable, was coded as 0 for not guilty and 1 for
guilty. The other explanatory variables used were whether the jury found the defendant
attractive (ATTRACT), the gender (GENDER), the sociability (SOCIABLE), the warmth of the
defendant (WARMTH), the perceived kindness (KIND), the sensitivity (SENSITIV) and the
defendant's intelligence (INTELLIG).

Response Profile

Ordered Value    verdict    Total Frequency
            1          1               130
            2          0                35

We can conclude that there was a major imbalance, with far more guilty verdicts (130) than not
guilty verdicts (35). Ideally, we would want these counts to be closer together for more accurate
predictions.
Testing Global Null Hypothesis: BETA=0

Test                  Chi-Square    Pr > ChiSq
Likelihood Ratio         69.8280        <.0001
Score                    54.1611        <.0001
Wald                     32.5088        <.0001
Looking at the Testing Global Null Hypothesis table, we are able to conclude that at least one of
the variables is significant, with p < .0001.
Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1      9.1177            1.7915            25.9011        <.0001
attract       1      0.3228            0.5203             0.3848        0.5350
gender        1     -1.2556            0.5432             5.3432        0.0208
sociable      1      0.2536            0.2102             1.4555        0.2277
warmth        1     -0.1399            0.2069             0.4568        0.4991
intellig      1     -0.6276            0.2252             7.7685        0.0053
sensitive     1     -0.4458            0.2067             4.6523        0.0310
kind          1     -0.2822            0.1530             3.3989        0.0652
We are able to create our equation for the log odds using the Analysis of Maximum Likelihood
Estimates table. We see that the gender, intelligence, and sensitivity of the defendant were
significant predictors of the verdict's outcome, and all three estimates are negative, so higher
values lower the odds of a guilty verdict.
This leads us to the final logistic regression equation:
Y = 9.11 - 1.26(GENDER) - .62(INTELLIG) - .45(SENSITIVE)
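Since the model was fit with the risklimits option, the significant estimates can also be read as odds ratios by exponentiating them; a small sketch (coefficient values taken from the estimates table):

```python
import math

# Significant coefficients from the Analysis of Maximum Likelihood Estimates table
estimates = {"gender": -1.2556, "intellig": -0.6276, "sensitive": -0.4458}

# exp(beta) is the multiplicative change in the odds of a guilty verdict
# for a one-unit increase in the predictor
odds_ratios = {name: math.exp(b) for name, b in estimates.items()}
for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
```

Each odds ratio is below 1, matching the negative signs: for example, a one-unit increase in rated intelligence roughly halves the odds of a guilty verdict.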
Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
      7.38     8        0.4966

We have a good goodness-of-fit result from the Hosmer and Lemeshow test, since the p-value is
greater than .05 (chi-square = 7.38, df = 8, p = .496).
Classification Table

 Prob          Correct              Incorrect                         Percentages
Level     Event  Non-Event      Event  Non-Event    Correct  Sensitivity  Specificity  False POS  False NEG
0.000       130          0         35          0       78.8        100.0          0.0       21.2          .
0.100       129          0         35          1       78.2         99.2          0.0       21.3      100.0
0.200       125          4         31          5       78.2         96.2         11.4       19.9       55.6
0.300       125         10         25          5       81.8         96.2         28.6       16.7       33.3
0.400       122         15         20          8       83.0         93.8         42.9       14.1       34.8
0.500       119         20         15         11       84.2         91.5         57.1       11.2       35.5
0.600       117         22         13         13       84.2         90.0         62.9       10.0       37.1
0.700       113         27          8         17       84.8         86.9         77.1        6.6       38.6
0.800       101         30          5         29       79.4         77.7         85.7        4.7       49.2
0.900        83         32          3         47       69.7         63.8         91.4        3.5       59.5
1.000         0         35          0        130       21.2          0.0        100.0          .       78.8
The classification table allows us to determine the cutoff for classifying a person as a 1 or a 0 in
the VERDICT prediction. After looking at the percent correct, the sensitivity and the specificity,
we are able to conclude that the .500 prob level is the best cutoff. With a percent correct of 84.2,
a sensitivity of 91.5 and a specificity of 57.1, the .500 cutoff gives the most accurate results.
As the problem asks for a prediction of the verdict at the .500 cutoff, using the model
Y = 9.11 - 1.26(GENDER) - .62(INTELLIG) - .45(SENSITIVE)
I have chosen a female defendant with an intelligence of 4 and a sensitivity of 4. The resulting
log odds was 4.83, which is well above the .500 cutoff on the probability scale, so the model
predicts a guilty VERDICT.
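As a check on that arithmetic (assuming female is coded 0, which the 4.83 result implies), the log odds and the implied probability of a guilty verdict can be recomputed:

```python
import math

# Report's model: Y = 9.11 - 1.26(gender) - .62(intellig) - .45(sensitive)
gender, intellig, sensitive = 0, 4, 4  # female (assumed coded 0), both ratings set to 4
log_odds = 9.11 - 1.26 * gender - 0.62 * intellig - 0.45 * sensitive
prob_guilty = 1 / (1 + math.exp(-log_odds))
print(round(log_odds, 2), round(prob_guilty, 3))  # 4.83 0.992
```

A log odds of 4.83 corresponds to a predicted probability of guilt of about .99, far above the .500 cutoff.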
ods graphics on;
ods rtf file='verdict.rtf';
proc print;
run;
proc logistic descending;
  model verdict = attract gender sociable warmth intellig sensitive kind / risklimits lackfit
        ctable pprob=(0 to 1 by 0.1);
run;
ods graphics off;
ods rtf close;
Problem 3:
The sample was collected from 343 persons. The goal of the study was to determine the best prediction of
whether or not sexual harassment will be reported, using the explanatory variables. REPORTED was
coded as 1 for reported harassment and 0 for unreported harassment. The explanatory variables were the
age of the victim (AGE), the marital status of the victim, with 1 being married and 2 being single
(MARSTAT), the feminist ideology score (FEM), the frequency of the harassment (FREQ), and the
offensiveness of the harassment (OFFENSUV).
Response Profile

Ordered Value    reported    Total Frequency
            1           1               169
            2           0               174
Looking at the response profile for REPORTED, with 1 being a reported incident of sexual harassment
and 0 being no reported harassment, we have a fairly even number of 1s (169) and 0s (174), which is
ideal.
Testing Global Null Hypothesis: BETA=0

Test                  Chi-Square    Pr > ChiSq
Likelihood Ratio         35.4422        <.0001
Score                    33.5983        <.0001
Wald                     30.3996        <.0001
Looking at the Testing Global Null Hypothesis table, we are able to conclude that at least one of the
variables is significant, with p < .0001.
Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     -1.7317            1.4298             1.4670        0.2258
age           1     -0.0137            0.0129             1.1264        0.2886
marstat       1     -0.0723            0.2339             0.0954        0.7574
fem           1     0.00699            0.0146             0.2275        0.6334
freq          1     -0.0464            0.1526             0.0925        0.7610
offensuv      1      0.4878            0.0949            26.4310        <.0001
Using the Analysis of Maximum Likelihood Estimates table we are able to create our equation for the log
odds. We see that the variable OFFENSUV is the only significant predictor of REPORTED.
This leads us to the final logistic regression equation:
Y = -1.73 + .48(OFFENSUV)
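With a single predictor, the predicted probability crosses .50 where the log odds equal zero, i.e. at an offensiveness score of 1.73/.48 ≈ 3.6. A small sketch of this (the example scores are hypothetical):

```python
import math

def report_probability(offensuv):
    """Predicted probability of a report from the fitted log odds."""
    log_odds = -1.73 + 0.48 * offensuv
    return 1 / (1 + math.exp(-log_odds))

threshold = 1.73 / 0.48  # offensiveness score at which the probability is .50
for score in (1, threshold, 5):  # hypothetical offensiveness scores
    print(round(report_probability(score), 3))
```

Below an offensiveness of about 3.6 the model predicts no report; above it, a report becomes the more likely outcome.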
Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
    8.1131     8        0.4225

We have a good goodness-of-fit result from the Hosmer and Lemeshow test, since the p-value is greater
than .05 (chi-square = 8.11, df = 8, p = .42).
Classification Table

 Prob          Correct              Incorrect                         Percentages
Level     Event  Non-Event      Event  Non-Event    Correct  Sensitivity  Specificity  False POS  False NEG
0.000       169          0        174          0       49.3        100.0          0.0       50.7          .
0.100       169          0        174          0       49.3        100.0          0.0       50.7          .
0.200       168          9        165          1       51.6         99.4          5.2       49.5       10.0
0.300       158         29        145         11       54.5         93.5         16.7       47.9       27.5
0.400       140         66        108         29       60.1         82.8         37.9       43.5       30.5
0.500        90        109         65         79       58.0         53.3         62.6       41.9       42.0
0.600        57        145         29        112       58.9         33.7         83.3       33.7       43.6
0.700        28        163         11        141       55.7         16.6         93.7       28.2       46.4
0.800         7        173          1        162       52.5          4.1         99.4       12.5       48.4
0.900         1        174          0        168       51.0          0.6        100.0        0.0       49.1
1.000         0        174          0        169       50.7          0.0        100.0          .       49.3
The classification table allows us to determine the cutoff for classifying a person as a 1 or a 0 in the
REPORTED prediction. After looking at the percent correct, the sensitivity and the specificity, we are
able to conclude that the .400 prob level is the best cutoff. With a percent correct of 60.1, a sensitivity of
82.8 and a specificity of 37.9, the .400 cutoff gives the most accurate results. In the end, we are able to
conclude from the logistic regression model that the best predictor of whether or not sexual harassment
will be reported is the offensiveness of the harassment. One way to improve this study would be to
examine the frequency of the harassment more closely, or whether it has occurred in the past. Research
has shown that many people who experience harassment in the workplace experience it frequently, and
are therefore more hesitant to report it because they see it as normal and not needing to be reported.
ods graphics on;
ods rtf file='harrt.rtf';
proc print;
run;
proc logistic descending;
  model reported = age marstat fem freq offensuv / risklimits lackfit
        ctable pprob=(0 to 1 by 0.1);
run;
ods graphics off;
ods rtf close;
Problem 4:
The purpose of this problem was to use a multiple regression model to predict state
expenditures from several explanatory variables. The government is looking to see
which variables best predict per capita state and local expenditures.
The variables the government uses to do this are the economic ability index, the
percentage of the population living in standard metropolitan areas, the percent change
in population between 1950 and 1960, the percent of the population aged 5-19, the
percent of the population over 65 years of age, and whether the state is a western state or
not.
First, looking at the correlation matrix, we are able to determine whether the explanatory
variables are likely to be good predictors of the per capita state and local
expenditures (EX).
Pearson Correlation Coefficients, N = 48
Prob > |r| under H0: Rho=0

Correlations of EX with each explanatory variable:

Variable          r    Prob > |r|
ECAB        0.65586        <.0001
MET         0.04524        0.7601
GROW        0.40529        0.0043
YOUNG      -0.11940        0.4189
OLD         0.02340        0.8746
WEST        0.37349        0.0089
With ECAB, GROW, and WEST showing significant correlations with EX, we are able to
proceed to a multiple regression to predict EX.
Parameter Estimates (final model after backward elimination)

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             102.29572          26.62470         23932      14.76    0.0004
ECAB                    1.69618           0.26414         66853      41.24    <.0001
WEST                   40.47589          11.63260         19628      12.11    0.0011
After backward elimination removed the nonsignificant variables, we are left with ECAB and
WEST as the significant predictors in the model. ECAB is the economic ability index, in which
income, retail sales, and the value of output (manufactures, minerals, and agriculture) per capita
are equally weighted, and WEST indicates whether the state is located in the west or not. All of
the other variables were removed during the elimination.
Final Multiple Regression model used to predict EX:
Y = 102.30 + 1.69(ECAB) + 40.47(WEST)
In addition, we observed an R-squared value of .55, showing that 55% of the variation in EX
can be explained by this model.
The MSE is 1621.2, giving a Root MSE of about 40.3; predicted EX values should deviate from
the observed values by no more than about two Root MSEs roughly 95% of the time.
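To illustrate, here is a sketch of a single prediction with a rough ±2 Root MSE band (the ECAB value of 100 is hypothetical; 1621.2 is taken as the MSE, so the Root MSE is its square root):

```python
import math

def predict_ex(ecab, west):
    """Per capita expenditure predicted by the backward-elimination model."""
    return 102.30 + 1.69 * ecab + 40.47 * west

root_mse = math.sqrt(1621.2)  # about 40.3
y_hat = predict_ex(ecab=100, west=1)  # hypothetical western state with ECAB = 100
print(round(y_hat, 2), round(y_hat - 2 * root_mse, 1), round(y_hat + 2 * root_mse, 1))
```

The band of roughly ±81 around the prediction is the interval within which the observed EX should fall about 95% of the time.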
Overall, with a good R-square value and a model using few but significant predictors of EX, we
are able to conclude that the model is feasible. We can conclude that the variables ECAB and
WEST are sufficient for the purposes of prediction, and the backward elimination results in a
concise and efficient model with all unnecessary variables removed.
ods graphics on;
ods rtf file='coe.rtf';
proc corr nosimple;
  var ex ecab met grow young old west;
run;
proc reg;
  model ex = ecab met grow young old west / selection=backward;
run;
ods graphics off;
ods rtf close;