
CDS503: Machine Learning

Lab 3: Bayes Classifiers


Loading Data
In this lab, we will use the bank data set. The bank data set contains attributes of customers and
the class label (pep) indicating whether a customer bought a PEP (Personal Equity Plan) after
the last mailing.

Attribute Description
id a unique identification number
age age of customer in years (numeric)
sex MALE / FEMALE
region inner_city/rural/suburban/town
income income of customer (numeric)
married is the customer married (YES/NO)
children number of children (numeric)
car does the customer own a car (YES/NO)
save_acct does the customer have a saving account (YES/NO)
current_acct does the customer have a current account (YES/NO)
mortgage does the customer have a mortgage (YES/NO)
pep did the customer buy a PEP (Personal Equity Plan) after the last mailing
(YES/NO)

This week we will walk you through the whole classification process: preprocessing the
data, training the model, testing the model, analysing the model's performance, and finally
running predictions.
Preprocessing Data
Under the Preprocess tab, load bankdata.arff into your Weka Explorer. Before we run any
machine learning experiment, we must first understand our data and prepare data to be ready for
machine learning experiments. Here is a quick review of the items displayed on the Preprocess
tab.
Once the data is loaded, the attributes Weka recognizes are shown in the “Attributes” box.

• No.: A number that identifies the order of the attributes as they are in the data file (index
number)
• Selection tick boxes: Allow us to select the attributes for the working relation
• Name: Name of an attribute as it was declared in the data file.
During the scan of the data, Weka computes some basic statistics on each attribute. The
following statistics are shown in the “Selected attribute” box on the right panel of the “Preprocess”
tab.

• Name: Name of an attribute


• Type: Variable type (most commonly “Nominal” or “Numeric”)
• Missing: Number (percentage) of instances in the data for which this attribute is
unspecified

1 | CDS503: Lab 03 (JLSY)


Semester 2, 2017/2018
• Distinct: Number of different values that the data contains for this attribute
• Unique: Number (percentage) of instances in the data having a value for this attribute that
no other instances have
If we select a numeric/continuous attribute (e.g., age), we will see the basic statistics on that
attribute: minimum, maximum, mean and standard deviation. If we select a nominal attribute (e.g.,
sex), we will see the frequency counts of each label.
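These per-attribute statistics are straightforward to reproduce. Below is a small stdlib-Python sketch of the same computations Weka performs, using made-up toy values standing in for the bank data's age and sex columns (the real values live in bankdata.arff):

```python
import statistics
from collections import Counter

# Toy column values, standing in for the bank data's "age" (numeric)
# and "sex" (nominal) attributes -- NOT the real data set.
age = [48, 40, 51, 23, 57, 57, 22, 58, 37, 54]
sex = ["FEMALE", "MALE", "FEMALE", "FEMALE", "FEMALE",
       "FEMALE", "MALE", "MALE", "FEMALE", "MALE"]

# Numeric attribute: the statistics shown in the "Selected attribute" box
numeric_stats = {
    "minimum": min(age),
    "maximum": max(age),
    "mean": statistics.mean(age),
    "std_dev": statistics.stdev(age),  # sample standard deviation
}

# Nominal attribute: frequency count of each label
nominal_counts = Counter(sex)

print(numeric_stats)
print(nominal_counts)
```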
Remove the id attribute.

Train Model: Naïve Bayes Learning


Basic Naïve Bayes makes two “naïve” assumptions about the attributes:

• All attributes are a priori equally important


• All attributes are statistically independent (the value of one attribute is not related to the
value of another attribute)
These assumptions are mostly untrue in practice, yet the algorithm often gives good results. Weka
provides a few implementations of Bayesian classifiers:

Classifier Descriptions
NaiveBayes Can use kernel density estimators, which improve performance if the normality
assumption is grossly incorrect.
BayesNet Learns Bayesian networks under two assumptions: attributes are nominal
(numeric ones are prediscretized) and there are no missing values (any such
values are replaced globally). The BayesNet algorithm first learns a graph
structure of attribute dependencies and then calculates the conditional
probabilities for the attributes. The maximum number of parents a node may
have in the graph can be changed under the K2 search algorithm's parameters.

To perform training and testing, click on the “Classify” tab. In the “Classifier” box, click on the
“Choose” button and select the “NaiveBayes” classifier under the “bayes” folder. You will also find
other Bayesian classifiers under the “bayes” folder.

There are two parameters you can tweak in the NaiveBayes classifier:

• useKernelEstimator – Use a kernel estimator for numeric attributes rather than a normal
distribution
• useSupervisedDiscretization – Use supervised discretization to convert numeric attributes
to nominal ones
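To see what useKernelEstimator changes conceptually, compare a single fitted normal distribution against a kernel density estimate (one small Gaussian centred on every observed value). The sketch below is stdlib Python with an illustrative fixed bandwidth, not Weka's actual implementation (Weka picks its own kernel settings); it shows why a kernel estimator helps when the normality assumption is grossly incorrect, e.g. for bimodal data:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def kde_pdf(x, values, bandwidth):
    """Kernel density estimate: average of one small Gaussian per value."""
    return sum(gaussian_pdf(x, v, bandwidth) for v in values) / len(values)

# A clearly non-normal (bimodal) numeric attribute for one class
values = [1, 1, 1, 9, 9, 9]
mu = sum(values) / len(values)                               # 5.0
sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))

# The single Gaussian smears density over the empty middle of the range;
# the kernel estimate keeps the density near the two observed clusters.
print(gaussian_pdf(1, mu, sigma), kde_pdf(1, values, bandwidth=0.5))
print(gaussian_pdf(5, mu, sigma), kde_pdf(5, values, bandwidth=0.5))
```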
Test Model
Before we run the classification algorithm, we need to set test options. Set test options in the “Test
options” box. The test options that are available are described below:
a) Use training set: Evaluates a classifier on how well it predicts the class of the instances
it was trained on.
b) Supplied test set: Evaluates the classifier on how well it predicts the class of a set of
instances loaded from a file. Clicking on the “Set…” button brings up a dialog allowing you
to choose the file to test on.
c) Cross-validation: Evaluates the classifier by cross-validation, using the number of folds
entered in the “Folds” text field.
d) Percentage split: Evaluates the classifier on how well it predicts a certain percentage of
the data, which is held out for testing. The amount of data held out depends on the value
entered in the “%” field.
First, let us evaluate the classifier on how well it predicts the 34% of the data held out for testing
after using 66% of the data for training. Leave the default value of 66% (meaning 66% of the data
is used for training) in the “Percentage split” text field.
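Conceptually, a percentage split shuffles the data with the chosen random seed and cuts it at the given percentage. The stdlib-Python sketch below illustrates the idea; Weka's exact shuffling and rounding may differ, so the membership of each partition is an assumption, but the resulting sizes (396 training, 204 test instances out of 600) match the output we will see later in this lab:

```python
import random

n_instances = 600                     # size of the bank data set
indices = list(range(n_instances))

random.Random(1).shuffle(indices)     # "Random seed for Xval / % Split" = 1

split_point = int(round(n_instances * 66 / 100))   # 66% for training
train_idx, test_idx = indices[:split_point], indices[split_point:]

print(len(train_idx), len(test_idx))  # 396 training, 204 test instances
```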

You can also specify what is included in the output. Click on the “More options…” button and the
“Classifier evaluation options” dialog will pop up. Make sure that the following options are
checked:
a) Output model: The output is the classification model on the full training set, so that it can
be viewed, visualized, etc.
b) Output per-class stats: The precision/recall and true/false statistics for each class are output.
c) Output confusion matrix: The confusion matrix of the classifier’s predictions is included
in the output.
d) Store predictions for visualization: The classifier’s predictions are remembered so that
they can be visualized.
e) Set “Random seed for Xval / % Split” to 1. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.
The remaining options that you do not use in this activity but are available to you are:
f) Output entropy evaluation measures: Entropy evaluation measures are included in the
output.
g) Output predictions: The classifier’s predictions on the test data are included in the output.
Click “OK” button to close the “Classifier evaluation options” dialog box.

Once the options have been specified, run the classification algorithm. Click the “Start” button to
start the learning process. You can stop the learning process at any time by clicking the “Stop”
button.
Let us break the “Classifier output” into parts and explain each one.
=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: bankdata_csv-weka.filters.unsupervised.attribute.Remove-R1
Instances: 600
Attributes: 11
            age, sex, region, income, married, children,
            car, save_act, current_act, mortgage, pep
Test mode: split 66.0% train, remainder test

“Run Information” gives us the following information:
• The algorithm we used – NaiveBayes
• The relation name – bankdata with attribute 1 (id) removed
• The number of instances in the relation – 600
• The number of attributes in the relation – 11, and the list of the attributes: age, sex,
region, income, married, children, car, save_act, current_act, mortgage, pep
• The test mode we selected – split = 66%

=== Classifier model (full training set) ===

Naive Bayes Classifier

                 Class
Attribute          YES      NO
                (0.46)  (0.54)
===============================
age
  mean          45.1277 40.0982
  std. dev.    14.3018 14.1018
  weight sum       274     326
  precision          1       1

sex
  FEMALE         131.0   171.0
  MALE           145.0   157.0
  [total]        276.0   328.0
…

The NaiveBayes classifier outputs:
• The prior probability of each class value
• For each attribute, either:
  a) the parameters of the normal distribution computed from the observed values of the
     attribute (numeric attribute), conditioned on the particular class value, or
  b) the observed counts of each discrete value (nominal attribute), conditioned on the
     particular class value
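These priors and per-attribute parameters are everything Naive Bayes needs to score a new instance: it multiplies the class prior by the likelihood of each attribute value given the class. The sketch below does this arithmetic in stdlib Python using the age and sex figures reported above; it is deliberately simplified to two attributes (the real model multiplies over all ten), and the hypothetical customer (a 45-year-old female) is made up for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Likelihood of a numeric value under the class's normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Figures from the Weka model output above (two attributes only)
priors = {"YES": 0.46, "NO": 0.54}
age_params = {"YES": (45.1277, 14.3018), "NO": (40.0982, 14.1018)}
sex_counts = {"YES": {"FEMALE": 131.0, "MALE": 145.0, "total": 276.0},
              "NO":  {"FEMALE": 171.0, "MALE": 157.0, "total": 328.0}}

def score(cls, age, sex):
    mu, sigma = age_params[cls]
    p_sex = sex_counts[cls][sex] / sex_counts[cls]["total"]
    return priors[cls] * gaussian_pdf(age, mu, sigma) * p_sex

# Score a hypothetical 45-year-old female customer
scores = {cls: score(cls, 45, "FEMALE") for cls in ("YES", "NO")}
predicted = max(scores, key=scores.get)

# Normalizing the scores gives a probability for the predicted class
confidence = scores[predicted] / sum(scores.values())
print(predicted, round(confidence, 3))
```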
Time taken to build model: 0 seconds

=== Evaluation on test split ===

Time taken to test model on test split: 0.01 seconds

=== Summary ===

Correctly Classified Instances         136               66.6667 %
Incorrectly Classified Instances        68               33.3333 %
Kappa statistic                          0.32
Mean absolute error                      0.4165
Root mean squared error                  0.4599
Relative absolute error                 83.7882 %
Root relative squared error             92.0585 %
Total Number of Instances              204

The evaluation is on the test split. “Summary” lists the statistics summarizing how accurately the
classifier was able to predict the true class of the instances under the chosen test mode. The
classifier correctly classifies 136 of 204 instances (accuracy = 67%) and incorrectly classifies 68
of 204 instances (error = 33%).

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.500    0.185    0.706      0.500   0.585      0.333  0.727     0.709     YES
               0.815    0.500    0.647      0.815   0.721      0.333  0.727     0.752     NO
Weighted Avg.  0.667    0.352    0.675      0.667   0.657      0.333  0.727     0.731

“Detailed Accuracy By Class” gives a more detailed per-class breakdown of the classifier’s
prediction accuracy.

=== Confusion Matrix ===

  a  b   <-- classified as
 48 48 |  a = YES
 20 88 |  b = NO

From the confusion matrix, we can see that 48 instances of class “YES” have been assigned to
class “NO”, and 20 instances of class “NO” have been assigned to class “YES”.
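Every figure in the summary and per-class tables can be derived from the confusion matrix using the standard definitions. The stdlib-Python sketch below recomputes accuracy, precision and recall for class YES, and the kappa statistic from the four cells above:

```python
# Confusion matrix cells from the output above:
tp, fn = 48, 48    # actual YES: 48 classified correctly, 48 sent to NO
fp, tn = 20, 88    # actual NO : 20 sent to YES, 88 classified correctly

total = tp + fn + fp + tn                      # 204 instances
accuracy = (tp + tn) / total                   # 136 / 204

# Per-class measures for class YES (first row of "Detailed Accuracy")
precision_yes = tp / (tp + fp)                 # 48 / 68
recall_yes = tp / (tp + fn)                    # 48 / 96 (= TP rate)

# Kappa: agreement beyond what the class proportions alone would produce
p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
kappa = (accuracy - p_chance) / (1 - p_chance)

print(round(accuracy, 4), round(precision_yes, 3),
      round(recall_yes, 3), round(kappa, 2))
```

These reproduce the 66.6667% accuracy, 0.706 precision, 0.500 recall and 0.32 kappa reported by Weka.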

When we use “Percentage split” as our test option, Weka makes a random split of the data set. If
we want to evaluate different models on a fixed test set, we can select the “Supplied test set” option.
First, go back to the “Preprocess” tab. We should first download the bankdata_train.arff and
bankdata_test.arff from elearn@USM. The bankdata has been split into a training set containing
400 instances and a test set containing 200 instances. Load bankdata_train.arff into Weka
Explorer. Notice that attribute id has been removed from both the training set and test set.
Go to “Classify” tab. Select the NaiveBayes classifier. In the “Test options” box, select “Supplied
test set” option. Click the “Set” button. Click the “Open file…” button on the “Test Instances” dialog
box. Select bankdata_test.arff and click “Open”. Make sure the Class field is set to the class label,
which is “pep” in our case. After we are done selecting the test set, click the “Close” button.

In this experiment, we would also like to see the predicted labels. Click the “More options” button.
In the “Classifier evaluation options” dialog box, click “Choose” next to “Output predictions” and
change “Null” to “PlainText”. Then click “OK”. Click “Start” to begin learning. Once the
experiment completes, we can see the predicted labels in the “Classifier output” box before the
“Summary” section.
=== Predictions on test set ===

inst# actual predicted error prediction


1 2:YES 1:NO + 0.564
2 1:NO 1:NO 0.798
3 1:NO 2:YES + 0.631

We can also save the predictions into a csv file.

In the “Classifier evaluation options” dialog box, click “Choose” next to “Output predictions” and
choose “CSV”.

We will have to specify the location to save the generated csv output file. Click on the CSV bar.
In the weka.gui.GenericObjectEditor dialog box, click on “outputFile”, select the file path and
enter the name of the output file (e.g., output.csv). If we want attributes to be included in
the predictions, specify the indices of the attributes to include in the “attributes” field (e.g., 1-10
means include all the attributes except the class label “pep”). Click “OK”.

Click “Start” button. Once the experiment completes, we can see the csv file in the specified folder.

We can also save a classifier by right-clicking the model in the “Result list” box and saving the
model to a desired file path.
Run Prediction
After training a model we are satisfied with, we can now make predictions on new data. You can
either create your own new data or download bankdata_new.arff from elearn@usm.
If you are creating your own data, open bankdata_test.arff and modify the values of the
instances in the file (to avoid reformatting your new data file). For the sake of simplicity, keep
only a handful of instances in the file and remove the others (the example below keeps six). The
class value (output variable) that we want to predict is at the end of each line. Delete each of the
output class labels and replace them with question mark symbols (?). Save the new data file as
bankdata_new.arff.
@relation bankdata_new-weka.filters.unsupervised.attribute.Remove-R1

@attribute age numeric


@attribute sex {FEMALE,MALE}
@attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
@attribute income numeric
@attribute married {NO,YES}
@attribute children numeric
@attribute car {NO,YES}
@attribute save_act {NO,YES}
@attribute current_act {NO,YES}

@attribute mortgage {NO,YES}
@attribute pep {NO,YES}

@data
50,MALE,SUBURBAN,117546,NO,2,NO,YES,NO,NO,?
23,FEMALE,RURAL,11073,YES,3,NO,NO,YES,YES,?
27,MALE,INNER_CITY,9158.5,NO,1,YES,NO,NO,YES,?
22,FEMALE,INNER_CITY,7304.2,NO,0,YES,NO,YES,NO,?
67,MALE,RURAL,58092,NO,1,YES,YES,NO,NO,?
26,MALE,SUBURBAN,18500.6,YES,0,YES,YES,YES,NO,?
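To make the file format concrete, here is a tiny stdlib-Python sketch that reads the @data lines above, treating “?” as a missing value. This is a hand-rolled reader for this specific snippet, not a general ARFF parser (Weka itself handles the full format):

```python
# The @data section above, with "?" marking the unknown class label (pep)
raw = """50,MALE,SUBURBAN,117546,NO,2,NO,YES,NO,NO,?
23,FEMALE,RURAL,11073,YES,3,NO,NO,YES,YES,?
27,MALE,INNER_CITY,9158.5,NO,1,YES,NO,NO,YES,?
22,FEMALE,INNER_CITY,7304.2,NO,0,YES,NO,YES,NO,?
67,MALE,RURAL,58092,NO,1,YES,YES,NO,NO,?
26,MALE,SUBURBAN,18500.6,YES,0,YES,YES,YES,NO,?"""

def parse_value(token):
    if token == "?":                  # missing value / unknown class
        return None
    try:
        return float(token)           # numeric attributes
    except ValueError:
        return token                  # nominal attributes

instances = [[parse_value(t) for t in line.split(",")]
             for line in raw.splitlines()]

# Each instance has 11 values; the last one (pep) is unknown
print(len(instances), instances[0])
```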

We now have “unseen data” with no known output for which we would like to make predictions.

a) On the “Classify” tab, select the “Supplied test set” option in the “Test options” box.
b) Click the “Set” button, click the “Open file” button on the “Test Instances” dialog box and
select the mock new dataset we just created with the name “bankdata_new.arff”. Click
“Close” on the window.
c) Click the “More options…” button to bring up options for evaluating the classifier.
d) Uncheck the information we are not interested in, specifically:
• “Output model”
• “Output per-class stats”
• “Output confusion matrix”
• “Store predictions for visualization”
e) For the “Output predictions” option click the “Choose” button and select “PlainText”.
f) Click the “OK” button to confirm the “Classifier evaluation options”.
g) Right click on the selected model in the “Results list” box.
h) Select “Re-evaluate model on current test set”.

The predictions for each test instance are then listed in the “Classifier Output” box.

Lab Exercise

Work in a group of 2 and prepare a short report to be submitted on elearn@USM.
Use bankdata for the lab exercise. Train the classifier using only bankdata_train.arff and test your
classifier on bankdata_test.arff.

1) Train a NaiveBayes classifier by setting the parameter useKernelEstimator to True. Train


another NaiveBayes classifier by setting the parameter useSupervisedDiscretization to
True. Compare the performance results against the NaïveBayes classifier using default
parameter values. Which parameter setup results in the classifier with the best
performance? Can you explain why?
2) Train a BayesNet classifier. Experiment with different parameter values until you find the
best performing classifier. Report the parameter values you have selected and analyse
the performance results. Compare the performance results of the BayesNet classifier to
the best performing NaiveBayes classifier in (1).

Post your results and explanations directly on a discussion post (no need to attach a document).
