
CS 440 / ECE 448: Introduction to AI

MP 2
Teaching Assistant: Jason Cho
Due Date: 11:59 pm October 13, 2014
Your answers must be concise and clear. Explain sufficiently that we can
easily determine what you understand. We will give more points for a brief
interesting discussion with no answer than for a misleading answer.

Dr. Chen's Problem

Dr. Chen (Mingcheng) is a renowned doctor who has a Ph.D. in Cardiovascular Surgery (CS). He runs a very successful hospital and is researching ways to admit more patients. This would allow him to generate more revenue, which in turn would allow him to expand the size of the hospital. After careful research, he finds that the vast majority of the time is spent on deciding whether or not to hospitalize patients. Patients are hospitalized if they have any sort of heart disease, regardless of its severity. His colleague Sunny tells him that decision trees may be helpful in solving the task at hand. After weeks of careful research, Dr. Chen decides to build decision trees using WEKA. Your task, as Dr. Chen's research assistant, is to help him develop a decision tree classifier.

1.1 Convincing the board members

Unfortunately, Dr. Chen's board members are not quite convinced that decision trees will work. Your task, as an expert researcher, is to demonstrate to the board members that decision trees are, indeed, helpful in growing the hospital[1]. The board members will be convinced if you can walk them through the decision tree construction process. For the purpose of the demonstration, you have randomly picked 10 patients. Their medical records can be seen in Table 1.
In this problem, you will solve some relevant entropy problems.
[1] The following YouTube video might be useful in demonstrating how difficult it is to convince board members: http://www.youtube.com/watch?v=BKorP55Aqvg

Patient   Age       Blood Pressure   Weight   Heart Disease?
1         Young     High             Light    False
2         Old       Low              Light    False
3         Young     Low              Light    False
4         Mid       High             Heavy    False
5         Young     Low              Heavy    False
6         Ancient   High             Heavy    True
7         Mid       Low              Mid      True
8         Young     High             Heavy    True
9         Old       High             Mid      True
10        Ancient   High             Mid      True

Table 1: Patient Data


1. Calculate the entropy for Heart Disease for the full set of Patient Data.
2. Consider testing on each of the three attributes: Age, Blood Pressure,
Weight. For each attribute calculate the entropies of the resulting
subsets and the expected combined entropy after the split.
3. What are the maximum and minimum entropy for a feature that can
take 2^k values?
4. What is the estimated information gain for each of the attributes?
Which should be the first Decision Tree test?
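As a sanity check for these calculations, the sketch below computes the label entropy and the expected entropy after splitting on Age for the data in Table 1. The class `EntropyDemo` and its method names are purely illustrative and are not part of the assignment's required interface:

```java
import java.util.*;

public class EntropyDemo {
    // Shannon entropy (in bits) of a discrete distribution given as counts.
    public static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Expected entropy of the label after splitting on an attribute:
    // the weighted average of the entropy of each resulting subset.
    public static double expectedEntropy(String[] attr, boolean[] label) {
        Map<String, int[]> byValue = new LinkedHashMap<>();
        for (int i = 0; i < attr.length; i++) {
            int[] c = byValue.computeIfAbsent(attr[i], k -> new int[2]);
            c[label[i] ? 1 : 0]++;
        }
        double h = 0.0;
        for (int[] c : byValue.values()) {
            h += ((double) (c[0] + c[1]) / attr.length) * entropy(c);
        }
        return h;
    }

    public static void main(String[] args) {
        // The Age column and Heart Disease label from Table 1, patients 1-10.
        String[] age = {"Young", "Old", "Young", "Mid", "Young",
                        "Ancient", "Mid", "Young", "Old", "Ancient"};
        boolean[] disease = {false, false, false, false, false,
                             true, true, true, true, true};

        double labelEntropy = entropy(new int[]{5, 5});
        double afterAgeSplit = expectedEntropy(age, disease);
        System.out.printf("H(label) = %.4f%n", labelEntropy);                 // 1.0000
        System.out.printf("E[H | Age] = %.4f%n", afterAgeSplit);              // 0.7245
        System.out.printf("Gain(Age) = %.4f%n", labelEntropy - afterAgeSplit); // 0.2755
    }
}
```

The same two helpers answer questions 1, 2, and 4 for the other attributes by swapping in the Blood Pressure or Weight column.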

1.2 Entropy and Information Gain Calculation

After you show your calculations, the board members are somewhat convinced that decision trees may work. However, they now wish to see it work on a bigger scale. To do this, they ask you to write a short program that calculates entropy and information gain. The program should be written in Java and should be named EntropyCalc_[netid].java, where [netid] is your NetID. The program will take two arguments, as follows:
java EntropyCalc_[netid] [input file name] [output file name]

The [input file name] argument refers to an input file name. This is a CSV file where the last column is the label and all the other columns are attributes. The [output file name] argument refers to an output file name; the program should calculate entropies and information gains for each column, writing one value per line. The last line should contain the entropy for the label. To help you with the assignment, we have also released a file named entropycalc_released.csv.
As an example, let us assume there are five columns in [input file name] (in this case, entropycalc_released.csv). The [output file name] file should then contain nine lines, where the first five lines have the entropies for each column, and the next four lines have the information gains with respect to the fifth column. We do not guarantee that the file used for grading will have the same number of columns as the one we have released for this assignment.
The program should compile with the following command:
javac *.java

This assignment is easy enough that you should not need any external libraries.
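One possible program structure is sketched below, using only the standard library and following the nine-line output ordering of the example above (entropies for every column, then the gains for each attribute). The class name and helper names are illustrative only; your submission must follow the naming rules stated above:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class EntropyCalcSketch {

    // Entropy (bits) of the empirical distribution of values in one column.
    public static double columnEntropy(List<String[]> rows, int col) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) counts.merge(row[col], 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of the label column from splitting on `col`.
    public static double infoGain(List<String[]> rows, int col, int labelCol) {
        Map<String, List<String[]>> subsets = new HashMap<>();
        for (String[] row : rows)
            subsets.computeIfAbsent(row[col], k -> new ArrayList<>()).add(row);
        double remainder = 0.0;
        for (List<String[]> subset : subsets.values())
            remainder += ((double) subset.size() / rows.size())
                         * columnEntropy(subset, labelCol);
        return columnEntropy(rows, labelCol) - remainder;
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(args[0])))
            if (!line.isEmpty()) rows.add(line.split(","));
        int labelCol = rows.get(0).length - 1;

        try (PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            for (int c = 0; c <= labelCol; c++)   // one entropy per column
                out.println(columnEntropy(rows, c));
            for (int c = 0; c < labelCol; c++)    // one gain per attribute
                out.println(infoGain(rows, c, labelCol));
        }
    }
}
```

If the released CSV has a header row, you would need to skip it before counting; check entropycalc_released.csv before assuming either way.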

1.3 The Actual Task

The board members are sufficiently happy and let you proceed to the main task. In this problem set, you will learn how to use WEKA[2] to build a decision tree classifier.
In order to help Dr. Chen, we will be using a modified heart disease dataset from the UCI Machine Learning Repository[3]. The dataset is in arff file format. This is a native WEKA format that you will be using. The attributes can be either numeric or nominal, and are given (sequentially) as the following:
age (numeric), sex (nominal), cp (nominal), trestbps (numeric), chol (numeric), fbs (nominal), restecg (nominal), thalach (numeric), exang (nominal), oldpeak (numeric), slope (nominal), ca (nominal), thal (nominal)
The last column, num (nominal), is the true label we wish to predict. Its values are <50 if healthy and >50_1 if not.
a. Your first classifier. It is now time to build your first classifier. We will provide the relevant arff files for you to run experiments. There are two different data sets, which can be downloaded from the following links:
http://courses.engr.illinois.edu/cs440/hw/hw3/heart1.tar.gz
http://courses.engr.illinois.edu/cs440/hw/hw3/heart2.tar.gz
You will find two files in each tar.gz file: a training and a testing dataset.
WEKA's GUI interface is shown in Figure 1. Click on Explorer, which should bring up the screen in Figure 2. Now load your training data by clicking on Open file.

[2] http://www.cs.waikato.ac.nz/ml/weka/
[3] http://archive.ics.uci.edu/ml/

Figure 1: WEKA GUI interface

Figure 2: WEKA's Explorer interface

Figure 3: WEKA's classifier interface

To train your first classifier, click on the Classify tab. Click on the Choose button and choose J48. This sets your classifier to C4.5. Next, click on the text field next to the Choose button; this should bring up the interface shown in Figure 3.
Set unpruned to True. This option prevents your classifier from pruning your decision tree. Next, we will train the classifier. Select the Use training set radio button, and then click on Start. This will train your model. Next, right-click on the results list, as seen in Figure 4. Click on Save model, and save your classifier model in a directory of your preference. Congratulations! You have now trained and saved your first classifier. Note the classification accuracy.
Now that we have our first classifier, we should test our model. Click on Supplied test set and load the test dataset, supplied to you in the tar.gz files that we have provided. Next, right-click on the results list again, and click Re-evaluate model on current test set. Note the differences in accuracy.
Report: 1) the decision tree that C4.5 generates for both the heart-1 and heart-2 data sets; 2) the accuracy on the training data; 3) the accuracy on the testing data; 4) Are there any significant performance differences between heart-1 and heart-2? If so, describe the reasons for the performance differences; if not, detail why you think there are none. Finally, 5) why is the accuracy on the testing data lower than that on the training data?
Figure 4: WEKA's results list

b. The Big Data Analyst! Here, you will analyze the impact of various attributes. Consider the training data set again. On heart-1.train.arff, let us remove the restecg attribute. Then test the trained classifier on heart-1.test.arff after removing its restecg attribute as well. What is the impact of removing this attribute? Explain why removing this attribute had such an impact on the performance.
Next, load heart-2.train.arff and repeat the above, this time removing the cp attribute.
You are encouraged to consult the test data distribution as well as the training data distribution to answer this question for both of the data sets.
c. Decision Trees Gone Wild. Several parameters in WEKA can be used to fine-tune the behavior of J48. For this part of the assignment, you will tune the number of leaves a pruned decision tree can have. On the GUI interface, set unpruned to False. This allows the tree to prune leaf nodes. Next, divide each of heart-1.test.arff and heart-2.test.arff into two equal-sized datasets. The first half of the testing data is the validation set and the second half is the test set. You will use the validation set to try out your parameters, and verify the performance on the test set.
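The splitting step above can be scripted rather than done by hand. The sketch below is one way to halve an arff file while keeping the header intact; the class name `ArffSplitter` is illustrative, and it assumes a simple split at the midpoint without shuffling:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ArffSplitter {

    // Splits the @data section of an arff file in half; both output files
    // keep the full header (everything up to and including the @data line).
    public static void split(Path in, Path firstHalf, Path secondHalf)
            throws IOException {
        List<String> lines = Files.readAllLines(in);
        int dataStart = 0;
        while (!lines.get(dataStart).trim().equalsIgnoreCase("@data")) dataStart++;
        dataStart++;  // index of the first instance line

        List<String> header = lines.subList(0, dataStart);
        List<String> data = new ArrayList<>();
        for (String line : lines.subList(dataStart, lines.size()))
            if (!line.trim().isEmpty()) data.add(line);

        int mid = data.size() / 2;
        write(firstHalf, header, data.subList(0, mid));
        write(secondHalf, header, data.subList(mid, data.size()));
    }

    private static void write(Path out, List<String> header, List<String> data)
            throws IOException {
        List<String> all = new ArrayList<>(header);
        all.addAll(data);
        Files.write(out, all);
    }

    public static void main(String[] args) throws IOException {
        split(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
    }
}
```

For example, `java ArffSplitter heart-1.test.arff heart-1.valid.arff heart-1.final.arff` would produce the validation and test halves described above.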
Report: 1) the performance results on the test set and training set for each of the heart-1 and heart-2 datasets; 2) the performance of the decision tree (on the test and training sets) as you increase the minimum number of instances per leaf (the minNumObjs parameter); 3) a comparison between the unpruned decision tree and the pruned decision tree. Which one performs better?
d. DU Top Researcher[4]! All of the training will not be meaningful if the algorithm does not perform well on unseen data. After you have found the best set of parameters, train your model on both the training and testing data. Then save your .model file. Please name your model file [netid].1.model for the heart-1 dataset and [netid].2.model for the heart-2 dataset. For example, if your NetID is asdf, then the model files should be named asdf.1.model and asdf.2.model.

[4] DU stands for Determined and Useful.

1.4 What to submit

Please submit [netid].1.model, [netid].2.model, and EntropyCalc_[netid].java. Please also submit the final report as a PDF file. Assignments must be submitted via Compass 2g. You are strongly encouraged to typeset your answers in LaTeX. Do not submit handwritten solutions.
