
HW10

Problem 1:
The purpose of this problem was to predict whether or not a student would drop out using the other explanatory variables. DROPOUT was coded as 1 if the student dropped out and 0 if not. The variables used to predict the chance of dropout were a measure of the extent to which each child had behaviors associated with ADD (ADDSC), whether the student repeated a grade (REPEAT), and whether the child was reported to have more than the usual number of social problems in the 9th grade (SOCPROB).

Response Profile

Ordered Value   dropout   Total Frequency
      1            1             32
      2            0            218

We can conclude that there was a large imbalance between the number of children who did not drop out (218) and the number who did (32). Ideally, these counts would be closer together for more accurate predictions.
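With 218 of the 250 children not dropping out, a model that always predicts "no dropout" is already correct 87.2% of the time, which is the baseline any cutoff has to beat. A quick sketch of that calculation (in Python, outside the SAS analysis):

```python
# Counts from the Response Profile above.
dropouts = 32
non_dropouts = 218

# Accuracy of always predicting the majority class ("did not drop out").
baseline_accuracy = non_dropouts / (dropouts + non_dropouts)   # 0.872
```

This is why the percent correct in the classification table hovers near 87.2 even at extreme cutoffs.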
Testing Global Null Hypothesis: BETA=0

Test                Chi-Square   DF   Pr > ChiSq
Likelihood Ratio       59.5975    3       <.0001
Score                  76.7984    3       <.0001
Wald                   45.9940    3       <.0001

Looking at the Testing Global Null Hypothesis table, we can conclude that at least one of the variables is a significant predictor, with p < .0001 for all three tests.
Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -4.6295           1.0709           18.6883       <.0001
socprob      1     1.2934           0.6161            4.4079       0.0358
repeat       1     2.6590           0.4810           30.5555       <.0001
addsc        1     0.0288           0.0191            2.2867       0.1305

Using the Analysis of Maximum Likelihood Estimates table we can write our equation for the log odds. The variables SOCPROB and REPEAT are the significant predictors of DROPOUT, and both estimates are positive, so both increase the odds of dropping out.
This leads us to the final logistic regression equation:
Y = -4.63 + 1.29(SOCPROB) + 2.66(REPEAT)
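The fitted log odds convert to a predicted probability through the logistic function. Below is a minimal Python sketch (outside the SAS analysis) using the estimates from the Maximum Likelihood Estimates table; the example student's values are made up for illustration:

```python
import math

# Estimates from the Analysis of Maximum Likelihood Estimates table.
INTERCEPT = -4.6295
B_SOCPROB = 1.2934   # social problems in 9th grade (1 = yes)
B_REPEAT = 2.6590    # repeated a grade (1 = yes)
B_ADDSC = 0.0288     # ADD behavior score (not significant, p = .1305)

def dropout_probability(socprob, repeat, addsc):
    """Turn the fitted log odds into a predicted probability of dropping out."""
    log_odds = INTERCEPT + B_SOCPROB * socprob + B_REPEAT * repeat + B_ADDSC * addsc
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical student: social problems, repeated a grade, ADDSC of 50.
p = dropout_probability(1, 1, 50)   # roughly 0.68
```

A predicted probability above the chosen cutoff classifies the student as a likely dropout.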

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square   DF   Pr > ChiSq
    7.1383    8       0.5218

Using the Hosmer and Lemeshow test we have a good goodness-of-fit result, since the p-value is greater than .05 (chi-square = 7.14, df = 8, p = .52).
Classification Table

Prob     Correct  Correct    Incorrect  Incorrect   Percent  Sensi-  Speci-  False  False
Level    Event    Non-Event  Event      Non-Event   Correct  tivity  ficity  POS    NEG
0.000      32        0        218         0          12.8    100.0     0.0   87.2      .
0.100      24      186         32         8          84.0     75.0    85.3   57.1    4.1
0.200      22      194         24        10          86.4     68.8    89.0   52.2    4.9
0.300      22      198         20        10          88.0     68.8    90.8   47.6    4.8
0.400      18      203         15        14          88.4     56.3    93.1   45.5    6.5
0.500      10      210          8        22          88.0     31.3    96.3   44.4    9.5
0.600       6      214          4        26          88.0     18.8    98.2   40.0   10.8
0.700       6      216          2        26          88.8     18.8    99.1   25.0   10.7
0.800       2      217          1        30          87.6      6.3    99.5   33.3   12.1
0.900       0      218          0        32          87.2      0.0   100.0      .   12.8
1.000       0      218          0        32          87.2      0.0   100.0      .   12.8

The classification table allows us to determine the cutoff for classifying a person as a 1 or a 0 in the DROPOUT prediction. After looking at the percent correct, the sensitivity, and the specificity, we conclude that the .300 prob level is the best cutoff. With 88.0 percent correct, a sensitivity of 68.8, and a specificity of 90.8, the .300 cutoff gives the most accurate results. Looking at the .500 cutoff, the percent correct remains 88.0 (the same as at the .300 level); however, a .500 cutoff brings problems. Its sensitivity is very low, and its false negative rate of 9.5 is almost twice that of the .300 cutoff. So a .500 cutoff would not be ideal in this case.
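Each row of the classification table follows from the four confusion-matrix counts at that cutoff. A short Python sketch (outside the SAS analysis) reproducing the .300 row from the counts in the table above:

```python
# Reproduce the 0.300 row of the classification table from its four counts.
true_pos = 22    # Correct Event: predicted dropout, actually dropped out
true_neg = 198   # Correct Non-Event: predicted no dropout, did not drop out
false_pos = 20   # Incorrect Event: predicted dropout, did not drop out
false_neg = 10   # Incorrect Non-Event: predicted no dropout, dropped out

total = true_pos + true_neg + false_pos + false_neg          # 250 students
percent_correct = 100 * (true_pos + true_neg) / total        # 88.0
sensitivity = 100 * true_pos / (true_pos + false_neg)        # 68.75 -> 68.8
specificity = 100 * true_neg / (true_neg + false_pos)        # 90.8
false_pos_rate = 100 * false_pos / (true_pos + false_pos)    # 47.6 (SAS "False POS")
```

Note that the SAS "False POS" column is the share of predicted events that were wrong, not the usual false positive rate over all non-events.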
proc print;
run;
ods graphics on;
ods rtf file='dropout.rtf';
proc logistic descending;
  model dropout = socprob repeat addsc / risklimits lackfit
        ctable pprob=(0 to 1 by 0.1);
run;
ods graphics off;
ods rtf close;

Problem 2:

The purpose of this problem was to predict a jury's verdict in a court case. Previous research was conducted to analyze both the physical attractiveness and the social desirability of the litigants. Looking at the defendant on several variables, the hope of this study is to be able to predict the jury's verdict. VERDICT was the predicted variable, coded as 0 for not guilty and 1 for guilty. The other explanatory variables were whether the jury found the defendant attractive (ATTRACT), the gender (GENDER), the sociability (SOCIABLE), the warmth of the defendant (WARMTH), the perceived kindness (KIND), the sensitivity (SENSITIV), and the defendant's intelligence (INTELLIG).

Response Profile

Ordered Value   verdict   Total Frequency
      1            1            130
      2            0             35

We can conclude that there was a major imbalance, with many more guilty verdicts (130) than not guilty verdicts (35). Ideally, these counts would be closer together for more accurate predictions.

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square   DF   Pr > ChiSq
Likelihood Ratio       69.8280    7       <.0001
Score                  54.1611    7       <.0001
Wald                   32.5088    7       <.0001

Looking at the Testing Global Null Hypothesis table, we can conclude that at least one of the variables is significant, with p < .0001 for all three tests.
Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1     9.1177           1.7915           25.9011       <.0001
attract      1     0.3228           0.5203            0.3848       0.5350
gender       1    -1.2556           0.5432            5.3432       0.0208
sociable     1     0.2536           0.2102            1.4555       0.2277
warmth       1    -0.1399           0.2069            0.4568       0.4991
intellig     1    -0.6276           0.2252            7.7685       0.0053
sensitive    1    -0.4458           0.2067            4.6523       0.0310
kind         1    -0.2822           0.1530            3.3989       0.0652

We can create our equation for the log odds using the Analysis of Maximum Likelihood Estimates table. We see that the gender, intelligence, and sensitivity of the defendant were the significant predictors of the verdict's outcome.
This leads us to the final logistic regression equation:
Y = 9.11 - 1.26(gender) - .62(intellig) - .45(sensitive)

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square   DF   Pr > ChiSq
    7.3768    8       0.4966

We have a good goodness-of-fit result from the Hosmer and Lemeshow test, since the p-value is greater than .05 (chi-square = 7.38, df = 8, p = .496).

Classification Table

Prob     Correct  Correct    Incorrect  Incorrect   Percent  Sensi-  Speci-  False  False
Level    Event    Non-Event  Event      Non-Event   Correct  tivity  ficity  POS    NEG
0.000     130        0         35         0          78.8    100.0     0.0   21.2      .
0.100     129        0         35         1          78.2     99.2     0.0   21.3  100.0
0.200     125        4         31         5          78.2     96.2    11.4   19.9   55.6
0.300     125       10         25         5          81.8     96.2    28.6   16.7   33.3
0.400     122       15         20         8          83.0     93.8    42.9   14.1   34.8
0.500     119       20         15        11          84.2     91.5    57.1   11.2   35.5
0.600     117       22         13        13          84.2     90.0    62.9   10.0   37.1
0.700     113       27          8        17          84.8     86.9    77.1    6.6   38.6
0.800     101       30          5        29          79.4     77.7    85.7    4.7   49.2
0.900      83       32          3        47          69.7     63.8    91.4    3.5   59.5
1.000       0       35          0       130          21.2      0.0   100.0      .   78.8

The classification table allows us to determine the cutoff for classifying a person as a 1 or 0 in the VERDICT prediction. After looking at the percent correct, the sensitivity, and the specificity, we conclude that the .500 prob level is the best cutoff. With 84.2 percent correct, a sensitivity of 91.5, and a specificity of 57.1, the .500 cutoff gives the most accurate results.
As the problem asks for a prediction of the verdict at the .500 cutoff, using the model
Y = 9.11 - 1.26(gender) - .62(intellig) - .45(sensitive)
I chose a female defendant with an intelligence of 4 and a sensitivity of 4. The resulting log odds for VERDICT were 4.83.
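The 4.83 is a log odds, not a probability; passing it through the logistic function shows how strongly the model leans toward a guilty verdict. A minimal Python sketch (outside the SAS analysis), assuming females are coded 0 on GENDER:

```python
import math

# Log odds for a female defendant (GENDER assumed coded 0) with
# intelligence = 4 and sensitivity = 4, from the fitted model
# Y = 9.11 - 1.26(gender) - .62(intellig) - .45(sensitive).
log_odds = 9.11 - 1.26 * 0 - 0.62 * 4 - 0.45 * 4   # 4.83, matching the text

# Logistic transform: predicted probability of a guilty verdict.
probability = 1.0 / (1.0 + math.exp(-log_odds))    # about 0.99, well above .500
```

Since the probability is far above the .500 cutoff, this defendant would be classified as guilty.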
proc print;
run;
ods graphics on;
ods rtf file='verdict.rtf';
proc logistic descending;
  model verdict = attract gender sociable warmth intellig sensitive kind
        / risklimits lackfit ctable pprob=(0 to 1 by 0.1);
run;
ods graphics off;
ods rtf close;

Problem 3:
The sample was collected from 343 people. The goal of the study was to determine the best prediction of whether or not sexual harassment will be reported, using the explanatory variables. REPORTED was coded as 1 for reported harassment and 0 for unreported harassment. The explanatory variables were the age of the victim (AGE), the marital status of the victim, with 1 being married and 2 being single (MARSTAT), the feminist ideology score (FEM), the frequency of the harassment (FREQ), and the offensiveness of the harassment (OFFENSUV).

Response Profile

Ordered Value   reported   Total Frequency
      1            1             169
      2            0             174

Looking at the response profile for REPORTED, with 1 being a reported incident of sexual harassment and 0 being an unreported incident, we have a fairly even number of 1s and 0s, which is ideal.

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square   DF   Pr > ChiSq
Likelihood Ratio       35.4422    5       <.0001
Score                  33.5983    5       <.0001
Wald                   30.3996    5       <.0001

Looking at the Testing Global Null Hypothesis table, we can conclude that at least one of the variables is significant, with p < .0001 for all three tests.

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -1.7317           1.4298            1.4670       0.2258
age          1    -0.0137           0.0129            1.1264       0.2886
marstat      1    -0.0723           0.2339            0.0954       0.7574
fem          1    0.00699           0.0146            0.2275       0.6334
freq         1    -0.0464           0.1526            0.0925       0.7610
offensuv     1     0.4878           0.0949           26.4310       <.0001

Using the Analysis of Maximum Likelihood Estimates table we can create our equation for the log odds. We see that the variable OFFENSUV is the only significant predictor of REPORTED.
This leads us to the final logistic regression equation:
Y = -1.73 + .48(OFFENSUV)
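The OFFENSUV estimate is easier to interpret as an odds ratio: exponentiating the coefficient gives the multiplicative change in the odds of reporting for each one-point increase in offensiveness. A short Python sketch (outside the SAS analysis) using the estimate from the table:

```python
import math

B_OFFENSUV = 0.4878  # estimate from the Maximum Likelihood Estimates table

# Odds ratio: each one-point increase in offensiveness multiplies
# the odds of the harassment being reported by about 1.63.
odds_ratio = math.exp(B_OFFENSUV)

# A five-point difference in offensiveness scales the odds by about 11.5.
five_point_ratio = math.exp(5 * B_OFFENSUV)
```

This matches the sort of odds-ratio output the RISKLIMITS option requests in the SAS run below.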

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square   DF   Pr > ChiSq
    8.1131    8       0.4225

We have a good goodness-of-fit result from the Hosmer and Lemeshow test, since the p-value is greater than .05 (chi-square = 8.11, df = 8, p = .42).

Classification Table

Prob     Correct  Correct    Incorrect  Incorrect   Percent  Sensi-  Speci-  False  False
Level    Event    Non-Event  Event      Non-Event   Correct  tivity  ficity  POS    NEG
0.000     169        0        174         0          49.3    100.0     0.0   50.7      .
0.100     169        0        174         0          49.3    100.0     0.0   50.7      .
0.200     168        9        165         1          51.6     99.4     5.2   49.5   10.0
0.300     158       29        145        11          54.5     93.5    16.7   47.9   27.5
0.400     140       66        108        29          60.1     82.8    37.9   43.5   30.5
0.500      90      109         65        79          58.0     53.3    62.6   41.9   42.0
0.600      57      145         29       112          58.9     33.7    83.3   33.7   43.6
0.700      28      163         11       141          55.7     16.6    93.7   28.2   46.4
0.800       7      173          1       162          52.5      4.1    99.4   12.5   48.4
0.900       1      174          0       168          51.0      0.6   100.0    0.0   49.1
1.000       0      174          0       169          50.7      0.0   100.0      .   49.3

The classification table allows us to determine the cutoff for classifying a person as a 1 or 0 in the REPORTED prediction. After looking at the percent correct, the sensitivity, and the specificity, we conclude that the .400 prob level is the best cutoff. With 60.1 percent correct, a sensitivity of 82.8, and a specificity of 37.9, the .400 cutoff gives the most accurate results. In the end, we can conclude from the logistic regression model that the best predictor of whether or not sexual harassment will be reported is the offensiveness of the harassment. One way to improve this study would be to look further at the frequency of the harassment (which was not a significant predictor here), or at whether it has occurred in the past. Research has shown that many people who experience harassment in the workplace experience it frequently, and are therefore more hesitant to report it because they see it as normal and not needing to be reported.
proc print;
run;
ods graphics on;
ods rtf file='harrt.rtf';
proc logistic descending;
  model reported = age marstat fem freq offensuv / risklimits lackfit
        ctable pprob=(0 to 1 by 0.1);
run;
ods graphics off;
ods rtf close;

Problem 4:
The purpose of this problem is to use a multiple regression model to predict state expenditures from several explanatory variables. The government is looking to see which variables best predict the per capita state and local expenditure (EX). The variables used to do this are the economic ability index (ECAB), the percentage of the population living in standard metropolitan areas (MET), the percent change in population between 1950 and 1960 (GROW), the percent of the population aged 5-19 (YOUNG), the percent of the population over 65 years of age (OLD), and whether the state is a western state (WEST).
First, looking at the correlation matrix, we can determine whether the explanatory variables are likely to be good predictors of EX.
Pearson Correlation Coefficients, N = 48
Prob > |r| under H0: Rho=0

Correlations with EX:

Variable         r   Prob > |r|
ECAB       0.65586       <.0001
MET        0.04524       0.7601
GROW       0.40529       0.0043
YOUNG     -0.11940       0.4189
OLD        0.02340       0.8746
WEST       0.37349       0.0089

(Only the EX row of the full 7 x 7 correlation matrix is reproduced here; ECAB, GROW, and WEST are the significant correlates.)

With decent correlations between EX and several of the explanatory variables, we can proceed to a multiple regression to predict EX.

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            102.29572         26.62470        23932     14.76   0.0004
ECAB                   1.69618          0.26414        66853     41.24   <.0001
WEST                  40.47589         11.63260        19628     12.11   0.0011

After using the backward elimination method to remove the insignificant explanatory variables, we are left with ECAB and WEST as the significant predictors in the model. ECAB is the economic ability index, in which income, retail sales, and the value of output (manufactures, minerals, and agriculture) per capita are equally weighted, and WEST indicates whether the state is located in the west. All of the other variables were removed during the elimination.
Final multiple regression model used to predict EX:
Y = 102.30 + 1.69(ECAB) + 40.47(WEST)
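The final model can be applied directly for prediction. A minimal Python sketch (outside the SAS analysis; the ECAB value below is made up for illustration):

```python
# Final backward-elimination model: EX = 102.30 + 1.69*ECAB + 40.47*WEST.
def predict_ex(ecab, west):
    """Predicted per capita state and local expenditure."""
    return 102.30 + 1.69 * ecab + 40.47 * west

# Hypothetical western state (WEST = 1) with an economic ability index of 120.
predicted = predict_ex(ecab=120, west=1)   # 345.57
```

The WEST dummy simply shifts the prediction up by 40.47 for western states at any level of ECAB.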
In addition, we observed an R-squared value of .55, showing that 55% of the variation in EX can be explained by this model.
The MSE is 1621.2, so the Root MSE is about 40.3, the typical amount by which a predicted EX deviates from the actual value; roughly 95% of predictions should fall within about two Root MSEs.
Overall, with a good R-square value and a model using few but significant predictors of EX, we can conclude that the model is feasible. The variables ECAB and WEST are sufficient for the purposes of prediction and give the most concise and efficient model, with all unnecessary variables removed.
ods graphics on;
ods rtf file='coe.rtf';
proc corr nosimple;
  var ex ecab met grow young old west;
run;
proc reg;
  model ex = ecab met grow young old west / selection=backward;
run;
ods graphics off;
ods rtf close;
