
BUSINESS ANALYTICS AND INTELLIGENCE

Assignment 1

AJAY KUMAR
(1411211)
ANU KANKANE
(1411214)
ANKIT GOSWAMI
(1411286)
MALA HARISH
(1411242)

Contents
DESCRIPTIVE STATISTICS
  Proportion of Failure in Courses
  Probability of Dropout Sports Activity
  Proportion of Dropout Average Difficulty Level
  Proportion of Dropout High School GPA
  Proportion of Dropout Gender
  Proportion of Dropout Year of Program
BINARY LOGISTIC MODEL
  Bifurcation of Data Set
  Variables
    Definition of drop out
  Independent Variables
    Interaction Variables
    New Variables created
  Selecting final Independent Variables
  OUTPUT
    The final output
    Result of testing on Validation Set
  Observations
  Recommendations
MULTINOMIAL LOGISTIC MODEL
  Variables
    Independent Variables
  Selecting final Independent Variables
  OUTPUT
    The final Output
  Observations and Recommendations
Exhibit 1
Exhibit 2
Exhibit 3
Exhibit 4
Exhibit 5
Exhibit 6
Exhibit 7
Exhibit 8
Exhibit 9: Excel Sheet and SPSS Output


DESCRIPTIVE STATISTICS
Descriptive statistics were calculated to understand how dropout status relates to other variables: courses opted, activity in sports, average difficulty level of courses, High School GPA, gender and year of program.
The data set was divided into two parts: Training Data and Validation Data. The descriptive analysis and model building were done using the Training Data, which had details of 100 students.
Proportion of Failure in Courses

[Figure: Proportion of Students failed, by course. Vertical axis from 0.00 to 0.60; labelled values 0.50, 0.33, 0.16 and 0.03.]

From the above graph, it can be observed that for most courses the difficulty level, i.e. the proportion of students failing the subject, is 0.16 or lower. However, courses C3 and C24 have a higher proportion of students who failed. The proportion of failure can be used as a proxy for the Difficulty Index of a subject. (Exhibit 2)
The dropout codes 0, 1 and 2 used for the computation of the following descriptive statistics are defined below.

Dropout Code (Y)    Dropout Code Description
0                   If the candidate did not drop out
1                   If the candidate dropped out, and he had failed in one or more courses
2                   If the candidate dropped out despite passing all the courses he took

Probability of Dropout Sports Activity


Dropout Criteria (Y)    Active in Sports    Inactive in Sports    Grand Total
0                       28                  16                    44
1                       20                  13                    33
2                       10                  13                    23
Grand Total             58                  42                    100

From the above table, most students who did not drop out (28 of 44) were active in sports. Most students who dropped out after failing at least one course (20 of 33) were also active in sports, whereas a slight majority (13 of 23) of those who dropped out despite passing all their courses were inactive in sports.
Proportion of Dropout Average Difficulty Level
Y    Average difficulty level of subjects taken
0    0.08028
1    0.09686
2    0.0914

Students who did not drop out of the college took courses with a lower average difficulty level (0.080) than students who dropped out (0.097 for Y = 1 and 0.091 for Y = 2).
Proportion of Dropout High School GPA
Y    No. of students with HSGPA > 3
0    35
1    29
2    19

The number of students with High School GPA > 3 is largest among those who did not drop out of the college, and smallest among those who dropped out despite passing all the subjects. Note that these are counts, not proportions: the three groups contain 44, 33 and 23 students respectively.
Proportion of Dropout Gender
Y    No. of male students    No. of female students
0    18                      26
1    18                      15
2    10                      13


The number of female students not dropping out from the college is higher than the number of male students not dropping out. The two groups, however, have comparable numbers of students dropping out from the college.
Proportion of Dropout Year of Program
Drop out    No. of students who have dropped out
Year 1      29
Year 2      43
Year 3      8

The largest number of dropouts occurs in Year 2.

BINARY LOGISTIC MODEL


Bifurcation of Data Set
The data was divided into two sets. Set 1, Training Data, consisted of 100 randomly selected candidates, and Set 2, Validation Data, consisted of 12 candidates. Set 1 was used to estimate and build the model, and Set 2 was used to validate the model built.

The dependent variable had the following observed classification in the Training Data:

Total Number of Students    100
Students Dropped            56
Students Graduated          44

Variables
Dependent Variable: Whether a particular student dropped out during the term or graduated from Lovely Business School is taken as the dependent variable.

Final Result                  Y
Dropped Out                   1
Graduated/Did not Drop Out    0

Definition of drop out: A particular student, who has not taken any courses in two
consecutive terms, is termed as a drop out.
For instance, from the historical data provided, student 3544856 has not taken any subjects in the last
two terms. Hence he is considered as a drop out.

Independent Variables:
The provided data has the following continuous variables, and one binary variable (gender).
i) HSGPA
ii) HSPctile
iii) HSSize
iv) SAT
Apart from this, the historical data also has details on courses taken on a per-term basis.

Interaction Variables - The probability of passing a difficult course should depend on the past academic performance of the candidate.
a) A person with a higher HSGPA should have a higher probability of passing a course, say C1, than a person with a lower HSGPA. This leads to a difference in intercept between the cases C1=0 and C1=1.
b) The logit of a model with one continuous variable and one course should be a function of the continuous variable, which leads us to expect a slope effect in the interaction terms.


To deal with such situations we created 96 interaction variables between the dummy and continuous variables.
An exhaustive list of all the independent variables is given in Exhibit 1.
New Variables created
a) The first new variable that was created is the Difficulty Index of Subjects. The difficulty index of each subject is calculated as the ratio of the number of students who took the subject and failed it to the total number of students who took the subject. Hence, the larger the share of students failing a subject, the higher the difficulty index of that subject.

$$DI_x = \frac{\text{no. of students who failed in subject } x}{\text{total no. of students who opted for subject } x}$$

The detailed Difficulty Index of each subject is given in Exhibit 2.
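
As an illustration of the ratio above, the following minimal sketch computes a difficulty index per course and the average difficulty of the courses one student took. The record layout (course code, result string) and the toy records are assumptions made for the example; only the PASS/OTHER result coding comes from Exhibit 1.

```python
from collections import defaultdict


def difficulty_index(records):
    """Return {course: students failed / students who took the course}."""
    taken = defaultdict(int)
    failed = defaultdict(int)
    for course, result in records:
        taken[course] += 1
        if result != "PASS":  # Exhibit 1 codes every non-pass result as OTHER
            failed[course] += 1
    return {course: failed[course] / taken[course] for course in taken}


def average_difficulty(courses_taken, di):
    """Average difficulty index over the courses one student has taken."""
    return sum(di[c] for c in courses_taken) / len(courses_taken)


# Hypothetical toy records, not the assignment data.
records = [("C3", "PASS"), ("C3", "OTHER"), ("C19", "PASS"), ("C19", "PASS")]
di = difficulty_index(records)
print(di, average_difficulty(["C3", "C19"], di))
```

Averaging the per-course index over a student's courses is how the sketch interprets the average difficulty variable used later in the models.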
b) The second new variable that was created is Rank. Instead of taking HSPct and HSSize as separate variables, we explored whether

$$\text{Rank} = (1 - \text{HSPct}) \times \text{HSSize}$$

gave us better results. (It did not, so we ended up dropping Rank, and the 24 interaction variables between Rank and the Courses.)

Selecting final Independent Variables:


Step 1: For all the 72 interaction variables the ROC Area was computed. (Exhibit 3)
Step 2: Since all interaction variables involving the same course Ci (i takes the values from 1 to 24) are correlated, for each course we have taken the interaction variable having the highest ROC value.
Step 3: In case the ROC values for the three interaction variables involving Ci are the same, the preference of selection has been SAT>HSPcT>HSGPA since the ROC values individually for these variables are in decreasing order.
Exhibit 3 gives the detailed list of ROC areas and the final variables selected.
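
The selection steps above can be sketched as follows. This is an illustrative reading of the procedure, not the SPSS workflow actually used: `roc_area` is a plain pairwise (Mann-Whitney) estimate of the ROC area, and the function and variable names are hypothetical. The three sets of ROC areas are copied from Exhibit 3.

```python
def roc_area(scores, labels):
    """ROC area as the share of (dropout, non-dropout) pairs in which the
    dropout has the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


PREFERENCE = ["SAT", "HSPcT", "HSGPA"]  # tie-break order from Step 3


def select_per_course(roc_by_course):
    """Keep, for each course, the interaction variable with the highest ROC area."""
    selected = {}
    for course, areas in roc_by_course.items():
        best = max(areas.values())
        # First variable in preference order that attains the best area.
        winner = next(v for v in PREFERENCE if areas[v] == best)
        selected[course] = f"{winner}*{course}"
    return selected


# ROC areas for three courses, as listed in Exhibit 3.
exhibit3_sample = {
    "C1": {"SAT": 0.506, "HSPcT": 0.511, "HSGPA": 0.496},
    "C2": {"SAT": 0.443, "HSPcT": 0.443, "HSGPA": 0.443},
    "C17": {"SAT": 0.292, "HSPcT": 0.294, "HSGPA": 0.296},
}
print(select_per_course(exhibit3_sample))
```

For these three courses the sketch reproduces the selections listed in Exhibit 3 (HSPcT*C1, SAT*C2, HSGPA*C17).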

OUTPUT:
The final output
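$$\ln\left(\frac{P(Y=1)}{P(Y=0)}\right) = 5.603 + 55.205 \cdot ADI - 0.006 \cdot SATC5 - 0.101 \cdot HSPcTC19$$

Here ADI is the average difficulty index of the courses taken by the student (Averagedifficulty in Exhibit 5), SATC5 = SAT × C5 and HSPcTC19 = HSPcT × C19.
The model classified 92% of the training cases correctly (Exhibit 4).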
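
To show how the fitted equation turns into a prediction, here is a minimal sketch that evaluates the logit and applies the cut value of .600 reported in Exhibit 4. The function names and the example student are hypothetical, and the input scales (SAT as a raw score, HSPcT on a 0 to 100 scale) are assumptions.

```python
import math

# Coefficients of the fitted binary model as reported in the text and Exhibit 5.
INTERCEPT = 5.603
B_ADI = 55.205
B_SATC5 = -0.006
B_HSPCTC19 = -0.101
CUT_VALUE = 0.6  # cut value stated under the classification table in Exhibit 4


def dropout_probability(adi, sat, hspct, took_c5, took_c19):
    """P(Y=1) from the logit; took_c5 and took_c19 are 0/1 course dummies."""
    z = (INTERCEPT + B_ADI * adi
         + B_SATC5 * sat * took_c5
         + B_HSPCTC19 * hspct * took_c19)
    return 1.0 / (1.0 + math.exp(-z))


def predict_dropout(adi, sat, hspct, took_c5, took_c19):
    """Classify as dropout (1) when the probability reaches the cut value."""
    return int(dropout_probability(adi, sat, hspct, took_c5, took_c19) >= CUT_VALUE)


# Hypothetical student, not taken from the data set.
p = dropout_probability(adi=0.08, sat=1100, hspct=80, took_c5=1, took_c19=1)
print(round(p, 4), predict_dropout(0.08, 1100, 80, 1, 1))
```

With the same hypothetical inputs but neither course taken, both interaction terms vanish and the predicted probability is close to 1, which shows how strongly the two course interactions drive the fitted model.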

Result of testing on Validation Set


              Predicted 0    Predicted 1
Observed 0    6              0
Observed 1    0              6

Observations:
The probability of dropping out is dependent on the following:
1. ADI: the average difficulty level of all the subjects taken by a student. This is a logical result: a student who takes difficult subjects is more likely to perform poorly in an exam and fail the course. The more such subjects a student takes, the higher the chance that his or her performance drops and leads to failure in the examination. Beyond failure, the student may also be unable to handle the pressure and hence drop out of the course.
2. If a student has taken course C19, his/her probability of dropping out is lower than that of an otherwise similar student who has not taken it, and it decreases further as the student's percentile in Higher Secondary School (HSPcT) increases. This implies that if a person has taken course C19 and had a high percentile in school, his probability of dropping out decreases. One reason can be that course C19 is an extension of a high school course or has a similar course structure to a subject studied in high school. With such similarity, the probability of performing well in the subject increases with a student's percentile in high school (i.e. HSPcT) and hence the dropout probability decreases.
3. C5 is an easy subject as determined by the difficulty index. According to the model, for students who have taken C5 the probability of dropping out decreases as the SAT score increases. One reason behind this result can be the same as in 2: C5 might be an aptitude-based course, so a student's probability of performing well in it increases with the SAT score, which in turn implies that the probability of dropping out decreases.
4. The probability of dropping out or continuing is not dependent on the gender of the students.
5. Also, the SAT score on its own (outside its interaction with C5) and participation in sports do not affect the probability of dropping out.
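
One way to read the size of these effects is through odds ratios, i.e. exp(B), which is the Exp(B) column SPSS reports in Exhibit 5. The sketch below is illustrative; the helper name `odds_multiplier` and the choice of a 0.01 step in ADI are assumptions made for the example.

```python
import math

# Coefficients (B) of the binary model from Exhibit 5.
coefficients = {"ADI": 55.205, "SAT*C5": -0.006, "HSPcT*C19": -0.101}


def odds_multiplier(b, delta=1.0):
    """Factor by which the odds of dropping out change when the variable
    increases by `delta`, holding the other variables fixed."""
    return math.exp(b * delta)


# A one-unit change reproduces the Exp(B) column for the interaction terms.
print(round(odds_multiplier(coefficients["SAT*C5"]), 3))
print(round(odds_multiplier(coefficients["HSPcT*C19"]), 3))

# ADI lies between 0 and 1, so a small step such as 0.01 is more meaningful.
print(odds_multiplier(coefficients["ADI"], delta=0.01))
```

Because ADI is a proportion, the very large Exp(B) for a full one-unit change is not directly interpretable; the effect of a 0.01 step is the more useful figure.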

Recommendations:
1. Since the probability of dropping out decreases when a student takes courses C19 and C5, the
college should promote these courses among the students to decrease the dropout rates. They
can be made compulsory courses for the students.


2. Since the probability of dropping out increases with the average difficulty level of the subjects taken by a student, the college should ensure that each student takes a balanced choice of subjects. The average difficulty of the subjects taken in a particular term should be kept at a level that does not increase the student's probability of dropping out.

MULTINOMIAL LOGISTIC MODEL


Variables:
Dependent Variable: The dependent variable is coded as 0, 1 or 2.

Y    Description
0    If the candidate did not drop out
1    If the candidate dropped out, because he had failed in one or more courses
2    If the candidate dropped out despite passing all the courses he took


The historical data had the following classification:

Y    Number of candidates
0    44
1    33
2    23

Independent Variables:
Same as used in Binary Logistic Regression.

Selecting final Independent Variables:


Step 1: For all the 72 interaction variables the ROC Area was computed. (Exhibit 3)
Step 2: Since all interaction variables involving the same course Ci (i takes the values from 1 to 24) are correlated, for each course we have taken the interaction variable having the highest ROC value.
Step 3: In case the ROC values for the three interaction variables involving Ci are the same, the preference of selection has been SAT>HSPcT>HSGPA since the ROC values individually for these variables are in decreasing order.
Exhibit 3 gives the detailed list of ROC areas and the final variables selected.
Step 4: Variables which were not significant (at the 0.1 level) were removed one by one, till all the variables in the Likelihood Ratio Test remained significant.

OUTPUT:
The final Output-

$$P(Y=1) = \frac{\exp(z_1)}{1+\exp(z_1)+\exp(z_2)}$$

$$P(Y=2) = \frac{\exp(z_2)}{1+\exp(z_1)+\exp(z_2)}$$

$$P(Y=0) = \frac{1}{1+\exp(z_1)+\exp(z_2)}$$

$$z_1 = -1248.233 - 0.971 \cdot SATC5 - 0.258 \cdot SATC18 - 5.545 \cdot HSPcTC1 + 5.594 \cdot HSPcTC11 + 18.644 \cdot HSPcTC14 - 2.842 \cdot HSPcTC15 + 9294 \cdot ADI - 171.182 \cdot Gender - 890.011 \cdot HSGPA + 2.616 \cdot SAT$$

$$z_2 = -1204.327 - 0.960 \cdot SATC5 - 0.239 \cdot SATC18 + 5.330 \cdot HSPcTC11 - 2.879 \cdot HSPcTC15 + 8975.397 \cdot ADI - 173.323 \cdot Gender - 889.389 \cdot HSGPA + 2.592 \cdot SAT$$

The model classified 90% of the training cases correctly (Exhibit 8).
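
The three class probabilities follow from the two logits as sketched below, with Y=0 as the reference category. Because the fitted coefficients are large, the logits can be far from zero, so this sketch subtracts the largest logit before exponentiating; that numerical safeguard and the function names are additions for the example, not part of the original analysis.

```python
import math


def multinomial_probabilities(z1, z2):
    """Return (P(Y=0), P(Y=1), P(Y=2)); the reference class Y=0 has logit 0."""
    m = max(0.0, z1, z2)  # shift so that exp() cannot overflow
    e0, e1, e2 = math.exp(-m), math.exp(z1 - m), math.exp(z2 - m)
    total = e0 + e1 + e2
    return e0 / total, e1 / total, e2 / total


def predicted_class(z1, z2):
    """Class with the highest predicted probability."""
    probs = multinomial_probabilities(z1, z2)
    return max(range(3), key=lambda k: probs[k])


# Hypothetical logits, not computed from the data set.
print(multinomial_probabilities(1.2, -0.3), predicted_class(1.2, -0.3))
```

Shifting by the largest logit leaves the probabilities unchanged, since the same factor cancels in the numerator and the denominator.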
Results of testing on validation set:

              Predicted 0    Predicted 1    Predicted 2
Observed 0    6              0              0
Observed 1    0              6              0
Observed 2    0              0              0

Observations and Recommendations:


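The marginal effect of any variable x_i on P(Y=2), where β_{1i} and β_{2i} are the coefficients of x_i in z_1 and z_2, is

$$\frac{\partial P(Y=2)}{\partial x_i} = \frac{\beta_{2i}\exp(z_2)\left(1+\exp(z_1)+\exp(z_2)\right) - \left(\beta_{1i}\exp(z_1)+\beta_{2i}\exp(z_2)\right)\exp(z_2)}{\left(1+\exp(z_1)+\exp(z_2)\right)^2}$$

The expression for P(Y=1) is analogous. Adding the two gives the effect on the overall probability of dropping out,

$$\frac{\partial \left(1 - P(Y=0)\right)}{\partial x_i} = \frac{\beta_{1i}\exp(z_1)+\beta_{2i}\exp(z_2)}{\left(1+\exp(z_1)+\exp(z_2)\right)^2}$$

From this it can be observed that if the coefficients of x_i are positive, the probability of dropping out increases.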
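
The sign argument can be checked numerically. The sketch below uses the standard multinomial-logit identity dP(Y=k)/dx = P(Y=k) (β_k − Σ_j P(Y=j) β_j), with coefficient 0 for the reference class Y=0; the function names, logits and coefficients are hypothetical values chosen for illustration.

```python
import math


def probabilities(z1, z2):
    """(P(Y=0), P(Y=1), P(Y=2)) with Y=0 as the reference category."""
    m = max(0.0, z1, z2)  # shift so that exp() cannot overflow
    e0, e1, e2 = math.exp(-m), math.exp(z1 - m), math.exp(z2 - m)
    total = e0 + e1 + e2
    return e0 / total, e1 / total, e2 / total


def marginal_effects(z1, z2, b1, b2):
    """Derivative of each class probability with respect to a variable that
    enters z1 with coefficient b1 and z2 with coefficient b2."""
    p0, p1, p2 = probabilities(z1, z2)
    weighted = p1 * b1 + p2 * b2  # the reference class contributes coefficient 0
    return -p0 * weighted, p1 * (b1 - weighted), p2 * (b2 - weighted)


# Hypothetical logits and coefficients.
d0, d1, d2 = marginal_effects(z1=0.4, z2=-0.2, b1=0.8, b2=0.5)
print(d0, d1, d2)
```

With both coefficients positive, the effect on P(Y=0) is negative, so the total dropout probability P(Y=1) + P(Y=2) rises, even though one of the two dropout classes taken alone may move in either direction.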
1. The coefficient of SAT is positive in both z1 and z2, so a higher SAT score implies a higher probability of dropping out. One reason that can be attributed to this behaviour is that Lovely Business School might not be among the colleges these students aspire to. If a student has a higher SAT score, he/she is more likely to get admission in another college.
2. Also, the probability of dropping out increases with the Average Difficulty Index of the courses taken by the student. This seems to be a logical conclusion, since the student might not have been able to handle the pressure of difficult subjects, leading him either to drop out directly or to fail an exam, which in turn increases his probability of dropping out. The coefficient of ADI is larger for Y=1 than for Y=2, which indicates that ADI has a higher impact on students who dropped out after failing at least one subject.

3. C5, C18 and C15 are among the courses with a low Difficulty Index, hence it is not surprising that students who take any of these courses have a lower probability of dropping out; there is also a significant interaction between past performance and these courses.
4. Males have a lower probability of dropping out than females, since the coefficient of Gender (1 = Male) is negative in both cases. This may point towards a cultural issue in the country when it comes to male and female education.

Exhibit 1:
Variable          Description
Dropped/Result    1 = dropped, 0 = continued
Gender            1 = Male, 0 = Female
Student ID        Identification Number
Course Year       Year in which the course was taken
Semester          Semester within the year
Result            PASS = Student passed the course; OTHER = Failed and discontinued
HSGPA             GPA in Higher Secondary School
HSPct             Percentile in Graduating Class in Higher Secondary
HSSize            Number of students in HS graduating Class
SAT               Overall SAT Score
Sports            1 = Active in Sports, 0 = Not a sports person
Ci                1 = enrolled in course with course code Ci, 0 = not enrolled in course Ci; i = 1 to 24
CPi               Difficulty index of the subject Ci; 0 <= CPi <= 1
SAT*Ci            Interaction Variable between SAT and Ci
HSPcT*Ci          Interaction Variable between HSPcT and Ci
HSGPA*Ci          Interaction Variable between HSGPA and Ci

Exhibit 2:
Course    Total cases    Cases failed    Probability of failing
C1        107            9               0.08411215
C2        7              0               0
C3        2              1               0.5
C4        74             2               0.027027027
C5        44             1               0.022727273
C6        71             7               0.098591549
C7        80             3               0.0375
C8        1              0               0
C9        106            12              0.113207547
C10       66             3               0.045454545
C11       70             11              0.157142857
C12       107            9               0.08411215
C13       78             3               0.038461538
C14       176            12              0.068181818
C15       80             5               0.0625
C16       97             12              0.12371134
C17       24             1               0.041666667
C18       73             3               0.04109589
C19       100            16              0.16
C20       10             0               0
C21       93             3               0.032258065
C22       96             6               0.0625
C23       103            14              0.13592233
C24       3              1               0.333333333

Exhibit 3:

ROC Area of the interaction variables, by course

Course    SAT (0.558)    HSPcT (0.54)    HSGPA (0.553)    Variable Selected
C1        0.506          0.511           0.496            HSPcT*C1
C2        0.443          0.443           0.443            SAT*C2
C3        0.474          0.474           0.474            SAT*C3
C4        0.245          0.248           0.247            HSPcT*C4
C5        0.094          0.094           0.091            SAT*C5
C6        0.209          0.228           0.218            HSPcT*C6
C7        0.322          0.349           0.337            HSPcT*C7
C8        0.508          0.508           0.508            SAT*C8
C9        0.508          0.504           0.487            SAT*C9
C10       0.228          0.245           0.239            HSPcT*C10
C11       0.234          0.268           0.253            HSPcT*C11
C12       0.537          0.531           0.516            SAT*C12
C13       0.293          0.319           0.309            HSPcT*C13
C14       0.506          0.515           0.5              HSPcT*C14
C15       0.283          0.308           0.3              HSPcT*C15
C16       0.426          0.433           0.421            HSPcT*C16
C17       0.292          0.294           0.296            HSGPA*C17
C18       0.235          0.235           0.234            SAT*C18
C19       0.451          0.452           0.437            HSPcT*C19
C20       0.395          0.395           0.395            SAT*C20
C21       0.285          0.3             0.295            HSPcT*C21
C22       0.48           0.466           0.457            SAT*C22
C23       0.451          0.452           0.437            HSPcT*C23
C24       0.503          0.503           0.503            SAT*C24

Exhibit 4:
Classification Table (a), Step 1

                           Predicted Dropped/Result
Observed Dropped/Result    .0      1.0     Percentage Correct
.0                         39      5       88.6
1.0                        3       53      94.6
Overall Percentage                         92.0

a. The cut value is .600

Exhibit 5:
Variables in the Equation (Step 1)

                     B         S.E.      Wald      df    Sig.    Exp(B)
Averagedifficulty    55.205    26.299    4.406     1     .036    9.44E+23
SATC5                -.006     .002      11.580    1     .001    .994
HSPcTC19             -.101     .035      8.602     1     .003    .904
Constant             5.603     3.790     2.185     1     .139    271.150

a. Variable(s) entered on step 1: Averagedifficulty, SATC5, HSPcTC19.

Exhibit 6:
Likelihood Ratio Tests

Effect               -2 Log Likelihood of Reduced Model    Chi-Square    df    Sig.
Intercept            58.827                                21.880        2     .000
SATC3                54.696                                17.749        2     .000
SATC5                94.745                                57.797        2     .000
SATC18               76.023                                39.076        2     .000
SATC22               54.765                                17.818        2     .000
HSPcTC1              71.977                                35.029        2     .000
HSPcTC11             71.316                                34.369        2     .000
HSPcTC14             60.626                                23.679        2     .000
HSPcTC15             51.011                                14.064        2     .001
Averagedifficulty    60.115                                23.168        2     .000
Gender               51.829a                               14.882        2     .001
HSGPA                48.209                                11.262        2     .004
SAT                  60.633a                               23.685        2     .000

Exhibit 7:
Parameter Estimates

Y=2
Effect               B           Std. Error    Wald        df    Sig.
Intercept            -1248.23    407.134       9.4         1     0.002
SATC3                -0.473      2.325         0.041       1     0.839
SATC5                -0.971      0.335         8.386       1     0.004
SATC18               -0.258      0.06          18.675      1     0
SATC22               -0.246      0.284         0.753       1     0.386
HSPcTC1              -5.545      3.354         2.733       1     0.098
HSPcTC11             5.594       0.904         38.286      1
HSPcTC14             18.644      4.656         16.032      1
HSPcTC15             -2.842      1.093         6.765       1     0.009
Averagedifficulty    9294.003    2667.809      12.137      1
Gender               -171.182    73.416        5.437       1     0.02
HSGPA                -890.011    94.813        88.116      1
SAT                  2.616       0.01          69568.06    1

Y=1
Effect               B           Std. Error    Wald        df    Sig.
Intercept            -1204.33    406.612       8.773       1     0.003
SATC3                -0.474      2.315         0.042       1     0.838
SATC5                -0.96       0.36          7.116       1     0.008
SATC18               -0.239      0.059         16.279      1     0
SATC22               -0.237      0.284         0.699       1     0.403
HSPcTC1              23.759      15.423        2.373       1     0.123
HSPcTC11             5.33        0.897         35.311      1     0
HSPcTC14             -10.587     15.011        0.497       1     0.481
HSPcTC15             -2.879      1.093         6.946       1     0.008
Averagedifficulty    8975.397    2662.484      11.364      1     0.001
Gender               -173.32     73.415        5.574       1     0.018
HSGPA                -889.389    94.822        87.975      1
SAT                  2.592

Exhibit 8:
Classification

                      Predicted
Observed              0        1        2        Percent Correct
0                     44       0        0        100.0%
1                     0        27       6        81.8%
2                     0        4        19       82.6%
Overall Percentage    44.0%    31.0%    25.0%    90.0%
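
The percentages in this exhibit can be recomputed from the counts. A minimal sketch, with rows as observed classes and columns as predicted classes; the function names are illustrative.

```python
# Counts from Exhibit 8: rows are observed Y = 0, 1, 2; columns are predicted Y.
confusion = [
    [44, 0, 0],
    [0, 27, 6],
    [0, 4, 19],
]


def per_class_accuracy(matrix):
    """Share of each observed class that was predicted correctly."""
    return [row[k] / sum(row) for k, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correctly classified cases (the diagonal) divided by all cases."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


print([round(100 * a, 1) for a in per_class_accuracy(confusion)])
print(round(100 * overall_accuracy(confusion), 1))
```

The per-class figures are what the exhibit labels Percent Correct, and the overall figure is the 90% quoted for the multinomial model.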

Exhibit 9: Excel Sheet and SPSS Output

Multinomial_Binary_LR_Models.xlsx
Binary_LR_Model.spv
Multinomial_LRFinal_Model.spv
