Assignment 1
AJAY KUMAR
(1411211)
ANU KANKANE
(1411214)
ANKIT GOSWAMI
(1411286)
MALA HARISH
(1411242)
Contents
DESCRIPTIVE STATISTICS
Proportion of Failure in Courses
Probability of Dropout Sports Activity
Proportion of Dropout Average Difficulty Level
Proportion of Dropout High School GPA
Proportion of Dropout Gender
Proportion of Dropout Year of Program
BINARY LOGISTIC MODEL
Bifurcation of Data Set
Variables
Definition of drop out
Independent Variables:
Interaction Variables
New Variables created
Selecting final Independent Variables:
OUTPUT:
The final output
Result of testing on Validation Set
Observations:
Recommendations:
MULTINOMIAL LOGISTIC MODEL
Variables:
Independent Variables:
Selecting final Independent Variables:
OUTPUT:
The final Output-
Observations and Recommendations:
Exhibit 1:
Exhibit 2:
Exhibit 3:
Exhibit 4:
Exhibit 5:
Exhibit 6:
Exhibit 7:
Exhibit 8:
Exhibit 9: Excel Sheet and SPSS Output
DESCRIPTIVE STATISTICS
Descriptive statistics were calculated to understand the relationship between dropout status and other
variables: courses opted for, activity in sports, average difficulty level of courses, High School GPA,
gender and year of program.
The data set was divided into two parts: Training Data and Validation Data. The descriptive analysis and
model building were done using the Training Dataset, which had details of 100 students.
Proportion of Failure in Courses
[Figure: proportion of failure for each course]
From the above graph, it can be observed that for most courses the difficulty level, i.e. the proportion
of students failing the subject, is at most 0.16. However, courses C3 and C24 have a higher proportion of
students who failed. The proportion of failure can be used as a proxy for the Difficulty Index of a
subject. (Exhibit 2)
The dropout codes 0, 1 and 2 used for the computation of the following descriptive statistics are defined
as below.
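Y = 0: If the candidate did not drop out
Y = 1: If the candidate dropped out because he had failed in more than 1 course
Y = 2: If the candidate dropped out despite passing all the courses he took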
Probability of Dropout Sports Activity
                      Y = 0   Y = 1   Y = 2   Grand Total
Active in Sports         28      20      10            58
Inactive in Sports       16      13      13            42
Grand Total              44      33      23           100
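From the above table, a larger proportion (28/44) of students who did not drop out was active in
sports. Also, the proportion of students inactive in sports was higher among those who dropped out:
13/33 for students who failed in more than 1 course and 13/23 for students who dropped out despite
passing all their courses, against 16/44 for students who did not drop out.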
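The proportions quoted from the sports table can be rechecked with a few lines of Python. This is only a sketch: the counts are taken from the table above, reading its three numeric columns as the dropout codes Y = 0, 1 and 2, and the helper name column_shares is made up for the illustration.

```python
# Counts from the sports-activity table; the three columns are read
# as the dropout codes Y = 0, 1, 2 (an assumption about the layout).
counts = {
    "Active in Sports": [28, 20, 10],
    "Inactive in Sports": [16, 13, 13],
}


def column_shares(table):
    """Return each row's share of every column total."""
    totals = [sum(col) for col in zip(*table.values())]
    return {
        row: [n / t for n, t in zip(values, totals)]
        for row, values in table.items()
    }


shares = column_shares(counts)
for row, values in shares.items():
    print(row, [round(v, 3) for v in values])
```

The share of inactive students rises from Y = 0 to Y = 2, which is the pattern the paragraph above describes.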
Proportion of Dropout Average Difficulty Level
[Table: average difficulty level of the courses opted, by dropout code Y = 0, 1, 2]
The average difficulty level of the courses opted by students who have not dropped out of the college
is less than the average difficulty level of the courses opted by students who have dropped out.
Proportion of Dropout High School GPA
[Table: High School GPA, by dropout code Y = 0, 1, 2]
The largest number of students with High School GPA > 3 falls in the category of not dropping out
from the college, and the smallest number falls among the students who dropped out despite passing all
the subjects.
Proportion of Dropout Gender
Y     No of Male students   No of female students
0                      18                      26
1                      18                      15
2                      10                      13
The number of female students not dropping out from the college (26) is higher than the number of male
students not dropping out (18). However, the numbers of male and female students dropping out from the
college are comparable (28 each).
Proportion of Dropout Year of Program
[Table: number of dropouts by year of program (Year 1, Year 2, Year 3)]
BINARY LOGISTIC MODEL
Bifurcation of Data Set
The dependent variable had the following observed classification in the Training Dataset:
Total Number of Students   100
Students Dropped            56
Students Graduated          44
Variables
Dependent Variable: Whether a particular student dropped out during the term or graduated from Lovely
Business School is taken as the dependent variable.
Final Result (Y):
Y = 1   Dropped Out
Y = 0   Graduated/Did not Drop Out
Definition of drop out: A student who has not taken any courses in two consecutive terms is termed a
drop out.
For instance, in the historical data provided, student 3544856 has not taken any subjects in the last
two terms. Hence he is considered a drop out.
Independent Variables:
The provided data has the following continuous variables, and one binary variable (gender).
i) HSGPA
ii) HSPctile
iii) HSSize
iv) SAT
Apart from this, the historical data also has details of the courses taken on a per-term basis.
To deal with such situations, we created 96 interaction variables between the dummy variables and the
continuous variables.
An exhaustive list of all the independent variables is given in Exhibit 1.
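One way to build such interaction variables is sketched below. It assumes that the 96 variables are the products of the 24 course dummies (C1 to C24) with the four continuous variables listed above; the student record, its values and the function name interaction_variables are hypothetical and serve only as an illustration.

```python
# Hypothetical student record; field names follow Exhibit 1, values are made up.
student = {
    "SAT": 1100,
    "HSPcT": 80,
    "HSGPA": 3.2,
    "HSSize": 250,
    "courses": {"C5", "C19"},  # courses the student has taken
}

CONTINUOUS = ["SAT", "HSPcT", "HSGPA", "HSSize"]
COURSES = [f"C{i}" for i in range(1, 25)]  # C1 .. C24


def interaction_variables(record):
    """Multiply every continuous variable by every course dummy (0 or 1)."""
    features = {}
    for course in COURSES:
        dummy = 1 if course in record["courses"] else 0
        for var in CONTINUOUS:
            features[f"{var}*{course}"] = record[var] * dummy
    return features


features = interaction_variables(student)
print(len(features))        # 24 courses x 4 continuous variables
print(features["SAT*C5"])   # SAT score, because the student took C5
print(features["SAT*C1"])   # 0, because the student did not take C1
```

An interaction variable is therefore zero for every course the student did not take and equal to the continuous variable otherwise.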
New Variables created
a) The first new variable that was created is the Difficulty Index of Subjects. The difficulty
index of each subject is calculated as the ratio of the number of students who have taken the
subject and failed it to the total number of students who have taken the subject.
Hence, if the number of students failing in a subject is higher, the difficulty index of that
subject is higher.
$$DI_x = \frac{\text{number of students who took subject } x \text{ and failed}}{\text{total number of students who took subject } x}$$
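The difficulty index defined in a) is a simple ratio, and a minimal sketch of it follows. The counts for C3, C19 and C24 are read from Exhibit 2; taking the average difficulty of a student as the plain mean over the courses taken is an assumption, and the function names and the example student are hypothetical.

```python
def difficulty_index(failed, taken):
    """Share of the students who took a subject and failed it."""
    if taken <= 0:
        raise ValueError("a subject needs at least one student")
    return failed / taken


# (cases failed, total cases) for three courses, as read from Exhibit 2.
courses = {"C3": (1, 2), "C19": (16, 100), "C24": (1, 3)}
di = {name: difficulty_index(f, t) for name, (f, t) in courses.items()}


def average_difficulty(taken_courses, index):
    """Mean difficulty index over the courses one student has taken
    (assumed here to be an unweighted mean)."""
    return sum(index[c] for c in taken_courses) / len(taken_courses)


# Hypothetical student who took C3 and C19.
adi = average_difficulty(["C3", "C19"], di)
print(di, adi)
```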
OUTPUT:
The final output
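Log ( P(Y=1) / P(Y=0) ) = 5.603 + 55.205*ADI - .006*SATC5 - .101*HSPcTC19
where ADI is the average difficulty index of the subjects taken by the student (Averagedifficulty in
Exhibit 5).
The model classified 92% of the training cases correctly (Exhibit 4).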
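To see how the fitted equation turns into a predicted probability, the sketch below plugs the reported coefficients into the logistic function. The student values are hypothetical, the function name is made up, and SATC5 and HSPcTC19 are assumed to be the products of SAT and HSPcT with the 0/1 course dummies; the 0.6 threshold is the cut value reported in Exhibit 4.

```python
import math

# Coefficients as reported in the final output (Exhibit 5).
INTERCEPT = 5.603
B_ADI = 55.205
B_SATC5 = -0.006
B_HSPCTC19 = -0.101
CUT_VALUE = 0.6  # cut value reported in Exhibit 4


def dropout_probability(adi, sat, hspct, took_c5, took_c19):
    """P(Y=1) from the log-odds equation; took_c5 and took_c19 are 0 or 1."""
    z = (INTERCEPT
         + B_ADI * adi
         + B_SATC5 * sat * took_c5
         + B_HSPCTC19 * hspct * took_c19)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical student who took both C5 and C19.
p_both = dropout_probability(adi=0.08, sat=1100, hspct=80, took_c5=1, took_c19=1)
# Same hypothetical student without either course.
p_none = dropout_probability(adi=0.08, sat=1100, hspct=80, took_c5=0, took_c19=0)

print(round(p_both, 4), p_both >= CUT_VALUE)
print(round(p_none, 4), p_none >= CUT_VALUE)
```

With these made-up inputs the student who took both courses falls below the cut value and the other one above it, in line with the negative coefficients of SATC5 and HSPcTC19.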
Result of testing on Validation Set
              Predicted 0   Predicted 1
Observed 0              6             0
Observed 1              0             6
Observations:
The probability of dropping out is dependent on the following:
1. ADI: Average difficulty level of all the subjects taken by a student. This is a logical
conclusion, since a student may perform poorly in an exam because the subject is difficult and
hence fail that course. The more such subjects a student takes, the higher the chance that the
student's performance declines, leading to failure in the examination. Beyond failure, there is
also a probability that the student is not able to handle the pressure and hence drops out of
the course.
2. If a student has taken course C19, the odds of dropping out (Y=1) relative to the base case
(Y=0) decrease, and they decrease further as the student's percentile in Higher Secondary School
(HSPcT) increases. This implies that if a person has taken course C19 and had a high percentile in
school, his probability of dropping out decreases. One reason can be that the course C19 is an
extension of a high school course or has a similar course structure to a subject studied in high
school. With such similarity, the probability of performing well in the subject increases with a
student's percentile in high school (i.e. HSPcT) and hence the drop out probability decreases.
3. C5 is an easy subject as determined by the difficulty index. According to the model, for a
student who has taken C5 the probability of dropping out decreases as the SAT score increases. One
reason behind this result can be the same as in 2: C5 might be an aptitude-based course, so the
probability of performing well in it increases with the SAT score, which in turn implies that the
probability of dropping out decreases.
4. The probability of dropping out or continuing is not dependent on the gender of the student.
5. Also, the SAT score on its own (outside its interaction with C5) and participation in sports do
not affect the probability of dropping out.
Recommendations:
1. Since the probability of dropping out decreases when a student takes courses C19 and C5, the
college should promote these courses among the students to decrease the dropout rate. They
can be made compulsory courses for the students.
2. Since the probability of dropping out increases with the average difficulty level of the subjects
taken by a student, the college should ensure that each student takes a balanced choice of subjects.
The average difficulty of the subjects taken in a particular term should be such that the student's
probability of dropping out does not increase.
MULTINOMIAL LOGISTIC MODEL
Variables:
Y   Definition                                                                   Number of candidates
0   If the candidate did not drop out                                            44
1   If the candidate dropped out, because he had failed in more than 1 course    33
2   If the candidate dropped out despite passing all the courses he took         23
Independent Variables:
Same as used in Binary Logistic Regression.
OUTPUT:
The final Output-
$$P(Y=1) = \frac{\exp(z_1)}{1+\exp(z_1)+\exp(z_2)}$$
$$P(Y=2) = \frac{\exp(z_2)}{1+\exp(z_1)+\exp(z_2)}$$
$$P(Y=0) = \frac{1}{1+\exp(z_1)+\exp(z_2)}$$
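The three probabilities above can be evaluated as in the following sketch, where z1 and z2 stand for the two linear predictors of the multinomial model. The function name is made up, and subtracting the largest exponent before exponentiating is an implementation choice added here to avoid overflow; it does not change the result.

```python
import math


def multinomial_probabilities(z1, z2):
    """Return P(Y=0), P(Y=1), P(Y=2) with Y=0 as the base category.

    Equivalent to exp(z_k) / (1 + exp(z1) + exp(z2)) with z_0 = 0; the
    common shift by the largest exponent keeps exp() from overflowing.
    """
    shift = max(0.0, z1, z2)
    w0 = math.exp(0.0 - shift)
    w1 = math.exp(z1 - shift)
    w2 = math.exp(z2 - shift)
    total = w0 + w1 + w2
    return {0: w0 / total, 1: w1 / total, 2: w2 / total}


# Illustrative values of the linear predictors, not taken from the report.
probs = multinomial_probabilities(0.0, 0.0)
print(probs)
print(multinomial_probabilities(2.0, -1.0))
```

When both predictors are zero the three categories are equally likely, and the probabilities always sum to one.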
[Table: result of testing on the Validation Set, observed versus predicted category (0, 1, 2)]
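$$\frac{\partial P(\text{drop out})}{\partial x_i} = \frac{\beta_{1i}\exp(z_1) + \beta_{2i}\exp(z_2)}{\left(1+\exp(z_1)+\exp(z_2)\right)^2}$$
where $P(\text{drop out}) = 1 - P(Y=0)$ and $\beta_{1i}$, $\beta_{2i}$ are the coefficients of the variable $x_i$ in $z_1$ and $z_2$.
From the above marginal probability it can be observed that if the coefficients $\beta_i$ are positive,
the probability of dropping out increases.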
1. From z2, a higher SAT score implies a higher probability of dropping out. One reason that can
be attributed to this behaviour is that Lovely Business School might not be among the colleges
the students aspire to. If a student has a higher SAT score, it is probable that he/she gets
admission in another college.
2. Also, it can be seen that the probability of dropping out increases with an increase in the
Average Difficulty Index (ADI) of the courses taken by the student. This is a logical conclusion,
since the student might not have been able to handle the pressure of difficult subjects, leading
him to drop out, or might have failed in the exam, which again increases his probability of
dropping out. The coefficient of ADI is stronger for Y=1 than for Y=2, which indicates that ADI has
a higher impact on students who drop out after failing in more than one subject.
3. C5, C18 and C15 are among the courses with a low Difficulty Index, hence it is not surprising
that students who take any of these courses have a lower probability of dropping out; there is
also a significant interaction between past performance and these courses.
4. Males have a lower probability of dropping out than females, since the coefficient of Gender is
negative in both cases. This may point towards a cultural issue in the nation when it comes to
male and female education.
Exhibit 1:
Variable          Description
Dropped/Result    1 = dropped, 0 = continued
Student ID        Identification Number
Course Year       Year in which the course was taken
Semester          Semester within the year
Result            PASS = Student passed the course
Gender            1 = Male, 0 = Female
HSGPA             GPA in Higher Secondary School
HSPct             Percentile in Graduating Class in Higher Secondary
HSSize            Number of students in HS graduating Class
SAT               Overall SAT Score
Sports            1 = Active in Sports
Ci                i = 1 to 24
CPi               Difficulty index of the subject Ci; 0 <= CPi <= 1
SAT*Ci            Interaction Variable between SAT and Ci
HSPcT*Ci          Interaction Variable between HSPcT and Ci
HSGPA*Ci          Interaction Variable between HSGPA and Ci
Exhibit 2:
Course   Total cases   Cases failed   Probability of failing
C1       107           9              0.08411215
C2       7             0              0
C3       2             1              0.5
C4       74            2              0.027027027
C5       44            1              0.022727273
C6       71            7              0.098591549
C7       80            3              0.0375
C8       1             0              0
C9       106           12             0.113207547
C10      66            3              0.045454545
C11      70            11             0.157142857
C12      107           9              0.08411215
C13      78            3              0.038461538
C14      176           12             0.068181818
C15      80            5              0.0625
C16      97            12             0.12371134
C17      24            1              0.041666667
C18      73            3              0.04109589
C19      100           16             0.16
C20      10            0              0
C21      93            3              0.032258065
C22      96            6              0.0625
C23      103           14             0.13592233
C24      3             1              0.333333333
Exhibit 3:
         ROC Area
Course   SAT (0.558)   HSPcT (0.54)   HSGPA (0.553)   Variable Selected
C1       0.506         0.511          0.496           HSPcT*C1
C2       0.443         0.443          0.443           SAT*C2
C3       0.474         0.474          0.474           SAT*C3
C4       0.245         0.248          0.247           HSPcT*C4
C5       0.094         0.094          0.091           SAT*C5
C6       0.209         0.228          0.218           HSPcT*C6
C7       0.322         0.349          0.337           HSPcT*C7
C8       0.508         0.508          0.508           SAT*C8
C9       0.508         0.504          0.487           SAT*C9
C10      0.228         0.245          0.239           HSPcT*C10
C11      0.234         0.268          0.253           HSPcT*C11
C12      0.537         0.531          0.516           SAT*C12
C13      0.293         0.319          0.309           HSPcT*C13
C14      0.506         0.515          0.5             HSPcT*C14
C15      0.283         0.308          0.3             HSPcT*C15
C16      0.426         0.433          0.421           HSPcT*C16
C17      0.292         0.294          0.296           HSGPA*C17
C18      0.235         0.235          0.234           SAT*C18
C19      0.451         0.452          0.437           HSPcT*C19
C20      0.395         0.395          0.395           SAT*C20
C21      0.285         0.3            0.295           HSPcT*C21
C22      0.48          0.466          0.457           SAT*C22
C23      0.451         0.452          0.437           HSPcT*C23
C24      0.503         0.503          0.503           SAT*C24
Exhibit 4:
Classification Table(a)
                                      Predicted Dropped/Result
Observed                              .0      1.0     Percentage Correct
Step 1   Dropped/Result   .0          39      5       88.6
                          1.0         3       53      94.6
         Overall Percentage                           92.0
a. The cut value is .600
Exhibit 5:
Variables in the Equation
                             B        S.E.     Wald     df   Sig.   Exp(B)
Step 1   Averagedifficulty   55.205   26.299   4.406    1    .036   944476893839218000000000.000
         SATC5               -.006    .002     11.580   1    .001   .994
         HSPcTC19            -.101    .035     8.602    1    .003   .904
         Constant            5.603    3.790    2.185    1    .139   271.150
a. Variable(s) entered on step 1: Averagedifficulty, SATC5, HSPcTC19.
Exhibit 6:
Likelihood Ratio Tests
                    Model Fitting Criteria               Likelihood Ratio Tests
Effect              -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept           58.827                               21.880       2    .000
SATC3               54.696                               17.749       2    .000
SATC5               94.745                               57.797       2    .000
SATC18              76.023                               39.076       2    .000
SATC22              54.765a                              17.818       2    .000
HSPcTC1             71.977                               35.029       2    .000
HSPcTC11            71.316                               34.369       2    .000
HSPcTC14            60.626                               23.679       2    .000
HSPcTC15            51.011                               14.064       2    .001
Averagedifficulty   60.115                               23.168       2    .000
Gender              51.829a                              14.882       2    .001
HSGPA               48.209                               11.262       2    .004
SAT                 60.633a                              23.685       2    .000
Exhibit 7:
Y     Effect              B          Std. Error   Wald       df   Sig.
Y=2   Intercept           1248.23    407.134      9.4        1    0.002
      SATC3               -0.473     2.325        0.041      1    0.839
      SATC5               -0.971     0.335        8.386      1    0.004
      SATC18              -0.258     0.06         18.675     1    0
      SATC22              -0.246     0.284        0.753      1    0.386
      HSPcTC1             -5.545     3.354        2.733      1    0.098
      HSPcTC11            5.594      0.904        38.286     1
      HSPcTC14            18.644     4.656        16.032     1
      HSPcTC15            -2.842     1.093        6.765      1    0.009
      Averagedifficulty   9294.003   2667.809     12.137     1
      Gender              -171.18    73.416       5.437      1    0.02
      HSGPA               890.011    94.813       88.116     1
      SAT                 2.616      0.01         69568.06   1
Y=1   Intercept           1204.33    406.612      8.773      1    0.003
      SATC3               -0.474     2.315        0.042      1    0.838
      SATC5               -0.96      0.36         7.116      1    0.008
      SATC18              -0.239     0.059        16.279     1    0
      SATC22              -0.237     0.284        0.699      1    0.403
      HSPcTC1             23.759     15.423       2.373      1    0.123
      HSPcTC11            5.33       0.897        35.311     1    0
      HSPcTC14            -10.587    15.011       0.497      1    0.481
      HSPcTC15            -2.879     1.093        6.946      1    0.008
      Averagedifficulty   8975.397   2662.484     11.364     1    0.001
      Gender              -173.32    73.415       5.574      1    0.018
      HSGPA               889.389    94.822       87.975     1
      SAT                 2.592                              1
Exhibit 8:
Classification
                      Predicted
Observed              0        1        2        Percent Correct
0                     44       0        0        100.0%
1                     0        27       6        81.8%
2                     0        4        19       82.6%
Overall Percentage    44.0%    31.0%    25.0%    90.0%
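The percentages in this classification table follow directly from the cell counts, as the short sketch below shows. It assumes that rows are the observed categories and columns the predicted ones, which is the reading under which the reported row percentages are reproduced; the function name percent_correct is made up for the illustration.

```python
# Cell counts from Exhibit 8; rows are read as observed categories 0, 1, 2
# and columns as predicted categories 0, 1, 2 (an assumption about layout).
confusion = [
    [44, 0, 0],
    [0, 27, 6],
    [0, 4, 19],
]


def percent_correct(matrix):
    """Return the per-category and overall percentage of correct predictions."""
    per_class = [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]
    total = sum(sum(row) for row in matrix)
    hits = sum(matrix[i][i] for i in range(len(matrix)))
    return per_class, 100.0 * hits / total


per_class, overall = percent_correct(confusion)
print([round(p, 1) for p in per_class], round(overall, 1))
```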
Exhibit 9: Excel Sheet and SPSS Output
Multinomial_Binary_LR_Models.xlsx
Binary_LR_Model.spv
Multinomial_LRFinal_Model.spv