Attribute      Description
id             a unique identification number
age            age of customer in years (numeric)
sex            MALE / FEMALE
region         inner_city/rural/suburban/town
income         income of customer (numeric)
married        is the customer married (YES/NO)
children       number of children (numeric)
car            does the customer own a car (YES/NO)
save_acct      does the customer have a saving account (YES/NO)
current_acct   does the customer have a current account (YES/NO)
mortgage       does the customer have a mortgage (YES/NO)
pep            did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
This week we will walk through the whole classification process: preprocessing the data,
training the model, testing the model, analysing the model's performance, and running
predictions.
Preprocessing Data
Under the Preprocess tab, load bankdata.arff into your Weka Explorer. Before we run any
machine learning experiment, we must first understand our data and prepare it for the
experiment. Here is a quick review of the items displayed on the Preprocess tab.
Once the data is loaded, Weka recognizes attributes that are shown in the “Attribute” box.
• No.: A number that identifies the order of the attributes as they are in the data file (index
number)
• Selection tick boxes: Allow us to select the attributes to keep in the working relation
• Name: Name of an attribute as it was declared in the data file.
During the scan of the data, Weka computes some basic statistics on each attribute. These
statistics are shown in the “Selected attribute” box on the right panel of the “Preprocess” tab.
Classifier Descriptions
NaiveBayes Can use kernel density estimators, which improve performance if the normality
assumption is grossly incorrect.
BayesNet Learns Bayesian nets under the assumptions of nominal attributes (numeric ones
are prediscretized) and no missing values (any such values are replaced
globally). The BayesNet algorithm first learns a graph structure of attribute
dependencies and then calculates the probabilities for the attributes.
The maximum number of parents allowed in the graph can be changed under the
K2 parameter.
• useKernelEstimator - Use a kernel estimator for numeric attributes rather than a normal
distribution
• useSupervisedDiscretization - Use supervised discretization to convert numeric attributes
to nominal ones
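To see why a kernel estimator can help when the normality assumption is badly wrong, here is a minimal Python sketch. It is not Weka's implementation: the sample values and the bandwidth of 2.0 are made up for illustration, and Weka chooses its own kernel settings.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fit_normal(values):
    """Return (mean, sample standard deviation) of the observed values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var)

def kernel_pdf(x, values, bandwidth):
    """Kernel density estimate: the average of one Gaussian bump per observed value."""
    return sum(normal_pdf(x, v, bandwidth) for v in values) / len(values)

# Hypothetical bimodal sample: ages clustered around 25 and around 60.
ages = [23, 24, 25, 26, 27, 58, 59, 60, 61, 62]
mean, std = fit_normal(ages)

# Between the two clusters the single normal curve is at its peak,
# while the kernel estimate reports almost no density there.
normal_at_gap = normal_pdf(42.5, mean, std)
kernel_at_gap = kernel_pdf(42.5, ages, bandwidth=2.0)
kernel_at_mode = kernel_pdf(25, ages, bandwidth=2.0)
print(normal_at_gap, kernel_at_gap, kernel_at_mode)
```

With data like this, a single normal distribution puts most of its probability where no customers were observed, which is the situation the useKernelEstimator option is meant to address.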
Test Model
Before we run the classification algorithm, we need to set test options. Set test options in the “Test
options” box. The test options that are available are described below:
a) Use training set: Evaluates a classifier on how well it predicts the class of the instances
it was trained on.
b) Supplied test set: Evaluates the classifier on how well it predicts the class of a set of
instances loaded from a file. Clicking on the “Set…” button brings up a dialog allowing you
to choose the file to test on.
c) Cross-validation: Evaluates the classifier by cross-validation, using the number of folds
entered in the “Folds” text field.
d) Percentage split: Evaluates the classifier on how well it predicts a certain percentage of
the data, which is held out for testing. The amount of data held out depends on the value
entered in the “%” field.
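The idea behind option c) can be sketched in a few lines of Python. This is a plain k-fold partition for illustration only: the function name is hypothetical, and the sketch does not randomize or stratify the data the way Weka does.

```python
def k_fold_indices(n_instances, n_folds):
    """Yield (train_indices, test_indices) for each fold of a plain k-fold split."""
    indices = list(range(n_instances))
    for fold in range(n_folds):
        # Every n_folds-th instance, starting at position `fold`, is held out.
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# 600 instances and 10 folds: each instance is tested exactly once.
folds = list(k_fold_indices(600, 10))
for train, test in folds:
    print(len(train), len(test))
```

Each of the 10 rounds trains on 540 instances and tests on the remaining 60, and the reported performance is aggregated over all rounds.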
First, let us evaluate the classifier on how well it predicts the remaining 34% of the data after
training on the other 66%. Leave the default value of 66% (meaning 66% of the data is used for
training) in the “Percentage split” text field.
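As a quick check of the sizes involved, the sketch below computes how many instances end up in each part for the 600-instance relation used in this lab. The function name is hypothetical, and rounding the training size to the nearest whole instance is an assumption about how the split is made.

```python
def percentage_split(n_instances, train_percent):
    """Return (n_train, n_test) for a percentage split of n_instances."""
    n_train = round(n_instances * train_percent / 100)  # assumed rounding rule
    return n_train, n_instances - n_train

n_train, n_test = percentage_split(600, 66)
print(n_train, n_test)  # 396 204
```

The 204 held-out instances are the ones the evaluation statistics are computed on.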
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: bankdata_csv-weka.filters.unsupervised.attribute.Remove-R1
Instances: 600
Attributes: 11
age
sex
region
income
married
children
car
save_act
current_act
mortgage
pep
Test mode: split 66.0% train, remainder test
“Run Information” gives us the following information:
• The algorithm we used – NaiveBayes
• The relation name – bankdata with attribute 1 removed
• The number of instances in the relation – 600
• The number of attributes in the relation – 11 and the list of the attributes: age, sex, region,
income, married, children, car, save_act, current_act, mortgage, pep
• The test mode we selected – split = 66%
Naive Bayes Classifier
                 Class
Attribute          YES       NO
                (0.46)   (0.54)
=====================================
age
  mean         45.1277  40.0982
  std. dev.    14.3018  14.1018
  weight sum       274      326
  precision          1        1
sex
  FEMALE         131.0    171.0
  MALE           145.0    157.0
  [total]        276.0    328.0
…
Time taken to build model: 0 seconds
The classifier model output shows:
• The prior probability for each class value
• For each attribute either:
a) The parameters of the normal distribution computed from the observed values of the
attribute (numeric attribute), conditioned on the particular class value
b) The observed counts of each discrete value (nominal attribute), conditioned on the
particular class value
=== Evaluation on test split ===
Time taken to test model on test split: 0.01 seconds
=== Summary ===
The evaluation is on the test split. “Summary” lists the statistics summarizing how accurately
the classifier was able to predict the true class of the instances under the chosen test mode.
The classifier correctly classifies 136 of 204 instances (accuracy = 67%) and incorrectly
classifies 68 of 204 instances (error = 33%).
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.500    0.185    0.706      0.500   0.585      0.333  0.727     0.709     YES
               0.815    0.500    0.647      0.815   0.721      0.333  0.727     0.752     NO
Weighted Avg.  0.667    0.352    0.675      0.667   0.657      0.333  0.727     0.731
a b <-- classified as
48 48 | a = YES
20 88 | b = NO
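The per-class figures for YES can be reproduced from the confusion matrix alone. The short Python sketch below treats YES as the positive class and uses the standard textbook definitions of each measure; the variable names are ours, not Weka's.

```python
import math

# Confusion matrix from the output above (rows = actual, columns = predicted).
tp, fn = 48, 48   # actual YES: predicted YES, predicted NO
fp, tn = 20, 88   # actual NO:  predicted YES, predicted NO

total = tp + fn + fp + tn
accuracy = (tp + tn) / total            # correctly classified / all test instances
tp_rate = tp / (tp + fn)                # also called recall
fp_rate = fp / (fp + tn)
precision = tp / (tp + fp)
f_measure = 2 * precision * tp_rate / (precision + tp_rate)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [("accuracy", accuracy), ("TP rate", tp_rate),
                    ("FP rate", fp_rate), ("precision", precision),
                    ("F-measure", f_measure), ("MCC", mcc)]:
    print(f"{name}: {value:.3f}")
```

Swapping the roles of the two classes (NO as positive) gives the second row of the table in the same way.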
In this experiment, we would also like to see the predicted labels. Click the “More options” button.
In the “Classifier evaluation options” dialog box, click “Choose” next to “Output predictions” and
change “Null” to “PlainText”. Then click “OK”. Click “Start” for learning to begin. Once the
experiment completes, we can see the predicted labels in the “Classifier output” box before the
“Summary” section.
=== Predictions on test set ===
To save the predictions to a file instead, in the “Classifier evaluation options” dialog box, click
“Choose” next to “Output predictions” and choose “CSV”.
We have to specify where to save the generated CSV output file. Click on the CSV bar.
In the weka.gui.GenericObjectEditor dialog box, click on “outputFile”, select the file path and
enter the name of the output file (e.g., output.csv). If we want attributes to be included in
the predictions, specify their indices in the “attributes” field (e.g., 1-10 means include all
the attributes except the class label “pep”). Click “OK”.
Click the “Start” button. Once the experiment completes, we can see the CSV file in the specified folder.
@data
50,MALE,SUBURBAN,117546,NO,2,NO,YES,NO,NO,?
23,FEMALE,RURAL,11073,YES,3,NO,NO,YES,YES,?
27,MALE,INNER_CITY,9158.5,NO,1,YES,NO,NO,YES,?
22,FEMALE,INNER_CITY,7304.2,NO,0,YES,NO,YES,NO,?
67,MALE,RURAL,58092,NO,1,YES,YES,NO,NO,?
26,MALE,SUBURBAN,18500.6,YES,0,YES,YES,YES,NO,?
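A file of this shape can also be produced with a short script rather than by hand. The Python sketch below is one possible approach, under two assumptions: the header of the new file should be identical to that of the training file, and the class attribute is the last one. The function name and the tiny demo header are hypothetical, not the real bankdata header.

```python
import os
import tempfile

def write_unlabelled_arff(train_path, rows, out_path):
    """Copy the ARFF header of train_path and append rows with '?' as the class."""
    header = []
    with open(train_path, encoding="utf-8") as src:
        for line in src:
            header.append(line)
            if line.strip().lower() == "@data":
                break  # stop before the training instances
        else:
            raise ValueError("no @data line found in " + train_path)
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(header)
        for row in rows:
            # The class value is unknown, so it is written as '?'.
            dst.write(",".join(str(v) for v in row) + ",?\n")

# Demo with a tiny two-attribute header (hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    train = os.path.join(tmp, "train.arff")
    new = os.path.join(tmp, "new.arff")
    with open(train, "w", encoding="utf-8") as f:
        f.write("@relation demo\n@attribute age numeric\n"
                "@attribute pep {YES,NO}\n@data\n40,YES\n")
    write_unlabelled_arff(train, [[50], [23]], new)
    with open(new, encoding="utf-8") as f:
        result = f.read()
print(result)
```

Reusing the training header keeps the attribute names, order and nominal values consistent between the two files.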
We now have “unseen data” with no known output for which we would like to make predictions.
a) On the “Classify” tab, select the “Supplied test set” option in the “Test options” box.
b) Click the “Set” button, click the “Open file” button on the “Test Instances” dialog box and
select the mock new dataset we just created with the name “bankdata_new.arff”. Click
“Close” on the window.
c) Click the “More options…” button to bring up options for evaluating the classifier.
d) Uncheck the information we are not interested in, specifically:
• “Output model”
• “Output per-class stats”
• “Output confusion matrix”
• “Store predictions for visualization”
e) For the “Output predictions” option click the “Choose” button and select “PlainText”.
f) Click the “OK” button to confirm the “Classifier evaluation options”.
g) Right click on the selected model in the “Results list” box.
h) Select “Re-evaluate model on current test set”.
The predictions for each test instance are then listed in the “Classifier Output” box.
Lab Exercise
Work in a group of 2 people and prepare a short report to be submitted on elearn@usm.
Use bankdata for the lab exercise. Train the classifier using only bankdata_train.arff and test your
classifier on bankdata_test.arff.
Post your results and explanations directly on a discussion post (no need to attach a document).