
ASSIGNMENT-2

LDA, LR, KNN

STEP-1:

Collinearity Analysis:
First we performed a collinearity analysis to find out whether any high collinearity exists among the independent
(predictor) variables. The cutoff value for high collinearity was set to 0.7.
Based on the collinearity analysis, we dropped the following 6 variables from our dataset:
- PriceCH
- SalePriceCH
- PriceDiff
- DiscMM
- SalePriceMM
- PctDiscCH
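The screening step above can be sketched as follows. This is only a minimal illustration and not the procedure actually used in the assignment: it assumes that collinearity is measured by the absolute Pearson correlation between pairs of variables, and the numeric values and the column `ToyVar` are made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def high_collinearity_pairs(columns, cutoff=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the cutoff."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) > cutoff:
            flagged.append((name_a, name_b, r))
    return flagged


# Toy values (hypothetical, not taken from the assignment dataset).
toy_columns = {
    "PriceCH": [1.75, 1.86, 1.99, 2.09, 1.69],
    "SalePriceCH": [1.75, 1.86, 1.89, 2.09, 1.69],
    "ToyVar": [0.5, 0.9, 0.2, 0.6, 0.4],
}

flagged_pairs = high_collinearity_pairs(toy_columns, cutoff=0.7)
for name_a, name_b, r in flagged_pairs:
    print(f"{name_a} / {name_b}: r = {r:.2f}")
```

For each flagged pair, one of the two variables would then be a candidate for removal.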
Collinearity output:
After variable reduction using the collinearity method, the dataset has the following remaining non-collinear
variables:
STEP-2:
For further analysis, the dataset was partitioned into training and validation sets in a 70:30 proportion:
70% of the original dataset was assigned to the training set and the remaining 30% to the
validation set.
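A 70:30 partition of this kind can be sketched as below. This is an assumed approach (a random shuffle of row indices), since the report does not state how the rows were assigned; the function name `split_indices` and the seed are hypothetical. The row count 1070 is the sum of the CH and MM observations reported for the original dataset.

```python
import random


def split_indices(n_rows, train_pct=70, seed=1):
    """Shuffle row indices and split them train_pct : (100 - train_pct)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # the seed value is an arbitrary choice
    n_train = n_rows * train_pct // 100    # integer arithmetic avoids float rounding
    return indices[:n_train], indices[n_train:]


train_idx, valid_idx = split_indices(1070)
print(len(train_idx), len(valid_idx))  # 749 321
```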
LOGISTIC REGRESSION ANALYSIS:
The cutoff probability for prediction under this regression analysis was set to 0.5.
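To illustrate how a cutoff probability turns fitted probabilities into class predictions, here is a small sketch. It assumes the model outputs the probability of one class (taken here to be MM, which is an assumption) and that a probability exactly at the cutoff is assigned to that class; the probabilities are made-up values.

```python
def classify(probabilities, cutoff=0.5, positive="MM", negative="CH"):
    """Assign `positive` when the predicted probability reaches the cutoff."""
    return [positive if p >= cutoff else negative for p in probabilities]


# Hypothetical predicted probabilities of the positive class.
toy_probabilities = [0.12, 0.50, 0.73, 0.49]
predicted = classify(toy_probabilities, cutoff=0.5)
print(predicted)  # ['CH', 'MM', 'MM', 'CH']
```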
Training Confusion Matrix:
After training the logistic model and predicting on the training dataset, the following confusion matrix was
obtained:

Accuracy: (425+307)/ (425+67+77+307) = 83.56%


Misclassification: (77+67)/(425+67+77+307) = 16.44%
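The two rates above can be reproduced from the confusion matrix counts with a short sketch. The helper name is hypothetical; it only needs the correctly and incorrectly classified counts, so it does not assume which class each cell belongs to.

```python
def accuracy_and_misclassification(correct, incorrect):
    """Return (accuracy, misclassification rate) from confusion matrix counts."""
    n_correct = sum(correct)
    total = n_correct + sum(incorrect)
    accuracy = n_correct / total
    return accuracy, 1 - accuracy


# Training counts for the logistic model, as used in the formulas above.
acc, err = accuracy_and_misclassification(correct=(425, 307), incorrect=(67, 77))
print(f"{acc:.2%} {err:.2%}")  # 83.56% 16.44%
```

The same function applies to the validation, LDA and KNN matrices by substituting their counts.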
Validation Confusion Matrix:
The confusion matrix obtained by applying the logistic model to the 30% validation data is:

Accuracy: (225+127)/(30+46+225+127) = 82.24%


Misclassification: (30+46)/(30+46+225+127) = 17.76%
Benchmark Accuracy:
The actual number of CH observations in original dataset: 653
The actual number of MM observations in original dataset: 417
If every observation were simply predicted as CH, the more frequent class in the dataset, then:
Accuracy: 653/(653+417) = 61.03% <- This can be considered the benchmark accuracy for evaluating the
performance of the logistic model.
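The benchmark calculation can be sketched as a majority-class baseline. The function name is hypothetical; the class counts are the ones reported above.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


labels = ["CH"] * 653 + ["MM"] * 417
baseline_label, baseline_acc = majority_class_baseline(labels)
print(baseline_label, f"{baseline_acc:.2%}")  # CH 61.03%
```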
The validation accuracy is close to the training accuracy and greater than the benchmark accuracy.
Therefore the model can be considered a good fit and acceptable.
ROC Curve:
The ROC curve lies well above the straight benchmark line, enclosing a larger area under the curve. This shows that
the TPR is significantly higher than the FPR for the logistic model. Therefore the model is a good fit in terms of
prediction accuracy.
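To make the link between the ROC curve and the area under it concrete, the sketch below builds the (FPR, TPR) points by sweeping the threshold over the predicted scores and integrates them with the trapezoidal rule. The scores and labels are toy values, not the model's output, and the function names are hypothetical. A curve lying on the straight benchmark line has an area of 0.5.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, lowering the threshold one distinct score at a time.

    labels use 1 for the positive class and 0 for the negative class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Observations sharing the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores and labels (hypothetical).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
auc = area_under_curve(roc_points(toy_scores, toy_labels))
print(f"AUC = {auc:.3f}")  # AUC = 0.889
```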
STEP-3:
LINEAR DISCRIMINANT MODEL:
Training Confusion Matrix:
The confusion matrix for training data under this model is:

Accuracy: (398+221)/(398+221+73+57) = 82.64%


Misclassification: (57+73)/(398+221+73+57) = 17.36%

Validation Confusion Matrix:

Accuracy: (220+130)/(220+130+43+35) = 81.78%


Misclassification: (35+43)/( 220+130+43+35) = 18.22%

Prediction Histogram:
The histogram for the CH and MM purchase groups shows significant overlap in the central area.
From these results we can conclude that the prediction accuracy of the LDA model is lower
than that of the logistic model. The significant overlap between the two Purchase categories in the histogram
also shows that the LDA model is not a good fit in terms of predictive power for the given dataset.
STEP-4:
KNN ANALYSIS:
Confusion Matrix:

Accuracy: (183+95)/(183+28+15+95) = 86.6%


Misclassification: (15+28)/( 183+28+15+95) =13.4%
The accuracy of this model is good, but KNN does not give the kind of insight into the data that the
LDA and LR models provide. It is a general-purpose classification method whose accuracy depends on the
value of K. Therefore this model cannot be considered as precise and powerful as Logistic Regression.
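The dependence on K can be seen in a minimal KNN sketch. It assumes Euclidean distance and a simple majority vote, which the report does not specify, and it uses made-up two-dimensional points rather than the assignment data.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data (hypothetical values).
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.5, 1.5), (2.0, 2.0)]
toy_point_labels = ["CH", "CH", "CH", "MM", "MM"]
query_point = (1.4, 1.4)

predictions_by_k = {k: knn_predict(toy_points, toy_point_labels, query_point, k)
                    for k in (1, 3, 5)}
print(predictions_by_k)  # {1: 'MM', 3: 'CH', 5: 'CH'}
```

The same query point is assigned to MM with K = 1 but to CH with K = 3 or K = 5, which is why K is usually chosen by comparing validation accuracy across several values.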

CONCLUSION: Among the three models, the Logistic Regression model is the most powerful and accurate and the
best fit for the given type of dataset. It has good accuracy and lower misclassification, and the ROC curve highly
supports this model.
