Data Mining Lab 3 Model Answers

Data Mining and Knowledge Engineering (COMP723) – Laboratory 3 – Model Answers
Task 1 Establish baseline scores for classification accuracy, F-score and classification time (testing)
Record the classification accuracy, overall F-score and classification time (i.e. the time taken to test the
model on the supplied test set).
Classification accuracy:
Correctly Classified Instances 470 89.8662 %
Overall F-score:
0.871
Classification time (i.e. the time taken to test the model on the supplied test set):
Time taken to test model on supplied test set: 0.05 seconds*
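As a quick check on the accuracy figure above, the percentage is the number of correctly classified instances divided by the size of the test set. The sketch below assumes the test set holds 523 instances (the loop bound used in the Task 3 code); the variable names are illustrative only.

```r
# Accuracy = correctly classified instances / total test instances.
# Assumption: the test set has 523 instances, as in the Task 3 loop bounds.
correct <- 470
n_test <- 523
accuracy_pct <- 100 * correct / n_test
print(round(accuracy_pct, 4))
```

The rounded value agrees with the 89.8662 % reported by Weka.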
Task 2 Perform feature selection and investigate effects on accuracy and model building time
(a) Record the classification accuracy, overall F-score and the classification time values with the use
of the CorrelationAttributeEval filter.
Classification accuracy:
Overall F-score:
0.897
Classification time values:
(b) Record the classification accuracy, overall F-score and the classification time values with the use
of the CfsSubsetEval filter.
Overall F-score:
0.886
Classification time:
(c) Record the classification accuracy, overall F-score and the classification time values with the use
of the InfoGainAttributeEval filter.
Overall F-score:
0.898
Classification time:
Task 3 Use the RWeka interface to perform feature selection
Study the code snippet below and run it in R.
library(RWeka) # provides read.arff, InfoGainAttributeEval and J48
TraindataSecom<-read.arff("C:/Users/rpears/Desktop/Secom/SecomTrain.arff")
colnames(TraindataSecom)[colnames(TraindataSecom)=="591"] <- "class" # set the target attribute name to "class" in the training file
TestdataSecom<-read.arff("C:/Users/rpears/Desktop/Secom/SecomTest.arff")
colnames(TestdataSecom)[colnames(TestdataSecom)=="591"] <- "class" # set the target attribute name to "class" in the testing file
actual<-TestdataSecom[, 591] # get the class values from the test file
A<- InfoGainAttributeEval(class ~ . , data = TraindataSecom,na.action=NULL ) # rank features by their information gain score
ranked_list<- A[order(A)] # sort in ascending order, so the lowest ranked features come first
A[order(-A)] # print the features in descending order of information gain, together with their gain values
D<-540 # number of features to be dropped

s<- ranked_list[1:D]
cols.dont.want <- c(names(s)) # identify names of low ranked features
TraindataSecom1<- TraindataSecom[, !names(TraindataSecom) %in% cols.dont.want, drop = TRUE] # drop the low ranked features from the training dataset, keeping the 50 highest ranked attributes
classifier <- J48(class ~ ., data = TraindataSecom1 , na.action=NULL) # build the model on the reduced training dataset (the version with 50 attributes)
TestdataSecom1<- TestdataSecom[, !names(TestdataSecom) %in% cols.dont.want, drop = TRUE] # drop the same low ranked features from the test dataset
pred<-predict(classifier,TestdataSecom1, na.action=NULL,seed=1) # apply the model to the reduced test dataset to make predictions
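# confusion matrix counts: first index is the actual class, second is the predicted class (1 = "-1", 2 = "1")
P11<-0
P12<-0
P21<-0
P22<-0
for ( K in seq(1,523))
{
  if(actual[K]==-1){
    if(pred[K]==-1){
      P11<-P11+1
    } else {
      P12<-P12+1
    }
  } else if (actual[K]==1){
    if(pred[K]==1){
      P22<-P22+1
    } else {
      P21<-P21+1
    }
  }
}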
Prec_1<-(P11/(P11+P21))
Prec_2<-(P22/(P22+P12))
Recall_1<-(P11/(P11+P12))
Recall_2<-(P22/(P22+P21))
F_1<-(2*Prec_1*Recall_1)/(Prec_1+Recall_1)
F_2<-(2*Prec_2*Recall_2)/(Prec_2+Recall_2)
F_overall<-(F_1*462+F_2*61)/523
paste("This is the F overall score",F_overall)
t<-system.time(predict(classifier,TestdataSecom1, na.action=NULL,seed=1)) # time the prediction step
paste("This is the total classification time", t[[1]]) # t[[1]] is the user CPU time in seconds; t[[3]] would give the elapsed time
Remember to alter the pathnames to point to your file locations.

Note that the number of attributes selected is 50 as 540 attributes are dropped.
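The counting loop in the snippet above can also be written with a confusion matrix built by `table()`. The following is a minimal sketch, not the lab's own method: it uses small made-up label vectors in place of the SECOM `actual` and `pred` values, and the names `cm`, `precision`, `recall`, `f_score` and `f_overall` are illustrative assumptions.

```r
# Toy labels standing in for `actual` and `pred`; these are not SECOM data.
actual <- factor(c(-1, -1, -1, -1, 1, 1, -1, 1), levels = c(-1, 1))
pred <- factor(c(-1, -1, 1, -1, 1, -1, -1, 1), levels = c(-1, 1))

# Rows are actual classes, columns are predicted classes.
cm <- table(actual = actual, predicted = pred)

precision <- diag(cm) / colSums(cm) # correct / all instances predicted as that class
recall <- diag(cm) / rowSums(cm) # correct / all instances actually in that class
f_score <- 2 * precision * recall / (precision + recall)

# Weight each class's F-score by its share of the test instances.
f_overall <- sum(f_score * rowSums(cm)) / sum(cm)
print(cm)
print(f_overall)
```

Weighting by `rowSums(cm)` plays the same role as the hard-coded class counts 462 and 61 in the snippet above. If a class is never predicted, its precision is a division by zero and the result is `NaN`, so real code would need to guard that case.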
(a) Run the above code and paste the overall F-score and classification time into your lab report.
The overall F score:

0.903393659122237
Classification time:
0.0399999999999996*
(b) Now use a for loop to perform feature selection with K values in the range [10, 50] in intervals
of 5. You will need to look up R help on using a for loop with increments. Paste your code in your
submission.
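One way to step a loop in increments is to iterate over a sequence built with `seq()`. The sketch below only prints how many features would be kept and dropped at each step; it assumes 590 predictor attributes (implied by dropping 540 to keep 50), and `k_values`, `k` and `d` are illustrative names.

```r
# seq(from, to, by) generates the feature counts to try: 10, 15, ..., 50.
k_values <- seq(10, 50, by = 5)
for (k in k_values) {
  d <- 590 - k # assumption: 590 predictor attributes in total
  cat("keep", k, "features, drop", d, "\n")
}
```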
Code:
library(RWeka) # provides read.arff, InfoGainAttributeEval and J48
TraindataSecom<-read.arff("H:/lab class_COMP723/lab 4/2016 Lab 4/SecomTrain.arff")
colnames(TraindataSecom)[colnames(TraindataSecom)=="591"] <- "class" # set the target attribute name to "class" in the training file
TestdataSecom<-read.arff("H:/lab class_COMP723/lab 4/2016 Lab 4/SecomTest.arff")
colnames(TestdataSecom)[colnames(TestdataSecom)=="591"] <- "class" # set the target attribute name to "class" in the testing file
actual<-TestdataSecom[, 591] # get the class values from the test file
A<- InfoGainAttributeEval(class ~ . , data = TraindataSecom,na.action=NULL ) # rank features by their information gain score
ranked_list<- A[order(A)] # sort in ascending order, so the lowest ranked features come first
#A[order(-A)] # print the features with the highest information gain together with their corresponding gain values
for(i in seq(10,50,5))
{
  cat("Number of features selected:" ,i ,"\n")
  D<-(590-i) # number of features to be dropped
  s<- ranked_list[1:D]
  cols.dont.want <- c(names(s)) # identify names of low ranked features
  TraindataSecom1<- TraindataSecom[, !names(TraindataSecom) %in% cols.dont.want, drop = TRUE] # drop the low ranked features from the training dataset
  classifier <- J48(class ~ ., data = TraindataSecom1 , na.action=NULL) # build the model on the reduced training dataset (the version with the top i attributes)
  TestdataSecom1<- TestdataSecom[, !names(TestdataSecom) %in% cols.dont.want, drop = TRUE] # drop the same low ranked features from the test dataset
  pred<-predict(classifier,TestdataSecom1, na.action=NULL,seed=1) # apply the model to the reduced test dataset to make predictions
  P11<-0
  P12<-0
  P21<-0
  P22<-0
  for ( K in seq(1,523))
  {
    if(actual[K]==-1)
    {
      if(pred[K]==-1)
      {
        P11<-P11+1
      } else {
        P12<-P12+1
      }
    } else if (actual[K]==1)
    {
      if(pred[K]==1)
      {
        P22<-P22+1
      } else {
        P21<-P21+1
      }
    }
  }
  # compute the per-class and overall F-scores and the classification time as in part (a)
  Prec_1<-(P11/(P11+P21))
  Prec_2<-(P22/(P22+P12))
  Recall_1<-(P11/(P11+P12))
  Recall_2<-(P22/(P22+P21))
  F_1<-(2*Prec_1*Recall_1)/(Prec_1+Recall_1)
  F_2<-(2*Prec_2*Recall_2)/(Prec_2+Recall_2)
  F_overall<-(F_1*462+F_2*61)/523
  cat("This is the F overall score:", F_overall, "\n")
  t<-system.time(predict(classifier,TestdataSecom1, na.action=NULL,seed=1)) # time the prediction step
  cat("This is the total classification time:", t[[1]], "\n")
}
Output:
Number of features selected: 10

This is the F overall score: 0.8908404
This is the total classification time: 0.01 *


This is the total classification time: 0.03*




0.7679383

0.8577615
* Classification time can vary
