
Assignment 1

Decision Trees
1. Study the animals in the document (zoo.xls). Without using a data mining tool, draw
a decision tree three to five levels deep that classifies each animal as a mammal, bird,
reptile, fish, amphibian, insect or invertebrate.
2. Read about the ARFF format here. Construct the header for the animal file.
3. Download datasets.zip and unzip it. Start Weka, choose the Explorer, and open
zoo.arff.
4. Find out in Weka how many animals this dataset contains.
5. Go to the Classify tab and select the decision tree classifier J48. Click on the line
behind the Choose button. This shows you the parameters you can set and a button
called 'More'. Which algorithm is implemented by J48?
6. What percentage of instances does J48 classify correctly? Which families are
mistaken for each other?
7. Go to the parameter settings again by clicking on the box after the Choose button.
Now set binarySplits to true and build a new decision tree. What is the difference?
8. Experiment with some of the other classifiers until you get better classification
performance. Write down the classifier and its performance.
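Regarding question 5: J48 is Weka's implementation of the C4.5 algorithm, which picks the attribute to split on by gain ratio (information gain normalized by split information). A minimal sketch of that criterion, assuming a made-up toy split rather than the real zoo data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, split):
    """Gain ratio of partitioning `labels` into the branches in `split`
    (a list of label-lists), the split criterion used by C4.5."""
    n = len(labels)
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in split if g)
    split_info = -sum((len(g) / n) * math.log2(len(g) / n) for g in split if g)
    return info_gain / split_info if split_info else 0.0

# Toy example: a boolean "milk" attribute splits 6 animals into
# three mammals versus three non-mammals.
labels = ["mammal"] * 3 + ["bird", "fish", "reptile"]
split = [["mammal"] * 3, ["bird", "fish", "reptile"]]
print(round(gain_ratio(labels, split), 3))
```

Here the split separates the mammals perfectly, so the gain ratio is high; an attribute like "name" would split every animal into its own branch and be penalized by the split information term.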

Assignment 2
Clustering
Goals
Gain familiarity with the data mining toolkit, Weka
Learn to apply clustering algorithms using Weka
Understand outputs produced by clustering tools in Weka
Procedure
Dataset
In this lab we will work with a dataset from HARTIGAN (file.06). The file has been translated
into ARFF, the default data file format in Weka. Download the dataset here. The dataset
gives the nutrient levels of 27 kinds of food. The amounts of energy, protein, fat, calcium and iron
have been measured in a 3-ounce portion of each food.
Press the Preprocess tab, then press the Open button and load food.arff. A description of
each attribute can be seen by selecting the attribute from the list on the left-hand side of the
screen; the description appears on the right-hand side. Press the Edit button to read and
edit each instance.
More info on Explorer-Preprocessing is available in the Explorer User Guide.
Cluster Data
Several clustering algorithms are implemented in Weka. In this lab we experiment with an
implementation of K-means, SimpleKMeans, and an implementation of a density-based
method, MakeDensityBasedClusterer.
To cluster the data, click on the Cluster tab. Press the Choose button to select the
clustering algorithm. Click on the line that has appeared to the right of the Choose button to
edit the properties of the algorithm. You can find a detailed description of the algorithm by
pressing the More button. Set the desired properties and press OK. In the Cluster mode,
select "Use training set". Press the Ignore attributes button to specify which attributes
should be used in the clustering. Click Start.
Check the output on the right hand side of the screen. You can right click the result set in the
"Result list" panel and view the results of clustering in a separate window. The result window
shows the centroid of each cluster as well as statistics on the number and percentage of
instances assigned to different clusters. Another way of understanding the characteristics of
each cluster is through visualization. We can do this by right-clicking the result set on the
left "Result list" panel and selecting "Visualize cluster assignments". You also can click
the Save button in the visualization window and save the result as an arff file.
More info on Explorer-Clustering is available in the Explorer User Guide.
o Simple K Means
Apply SimpleKMeans to your data. In Weka, SimpleKMeans uses Euclidean distance. You
can set the number of clusters and the seed of the random algorithm that generates the
initial cluster centers. Experiment with the algorithm as follows:
1. Choose a set of attributes for clustering and give a motivation. (Hint: always ignore
attribute "name". Why does the name attribute need to be ignored?)
2. Experiment with at least two different numbers of clusters, e.g. 2 and 5, but with the
same seed value 10.
3. Then try with a different seed value, i.e. different initial cluster centers. Compare the
results with the previous results. Explain what the seed value controls.
4. Do you think the clusters are "good" clusters? (Are all of its members "similar" to each
other? Are members from different clusters dissimilar?)
5. What does each cluster represent? Choose one of the results. Make up labels (words or
phrases in English) which characterize each cluster.
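The algorithm behind SimpleKMeans can be sketched as follows; the seed fixes which points are drawn as the initial centers, which is why different seed values can produce different clusterings. The 2-D points below are invented, not the food data:

```python
import random

def kmeans(points, k, seed):
    """Plain K-means with Euclidean distance. `seed` fixes the random choice
    of initial centers, which is the role of the seed option in SimpleKMeans."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(100):
        # Assign every point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned points.
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Invented 2-D points forming two obvious groups; the food data would
# instead use whichever attributes you chose to keep.
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (7.8, 8.2), (8.1, 7.9)]
for seed in (10, 42):
    centers, clusters = kmeans(pts, k=2, seed=seed)
    print(seed, sorted(len(c) for c in clusters))
```

Running this with different seeds shows the effect you are asked to observe in step 3: only the initial centers change, yet the final clustering can differ.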
o Make Density Based Clusters
With MakeDensityBasedClusterer, SimpleKMeans is turned into a density-based
clusterer. You can set the minimum standard deviation for the normal density calculation.
Experiment with the algorithm as follows:

1. Use the SimpleKMeans clusterer that gave the result you chose in question 5 above.
2. Experiment with at least two different standard deviations and compare the results. (Hint:
increasing the standard deviation to higher values makes the differences between runs
more obvious, so it is easier to conclude what the parameter does.)
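The density-based clusterer models each attribute within a cluster with a normal distribution, and the minimum standard deviation floors the spread used in that density. A minimal sketch of a univariate normal density with such a floor (parameter names here are illustrative, not Weka's API):

```python
import math

def normal_density(x, mean, std, min_std=1e-6):
    """Univariate normal density, with the standard deviation floored at
    `min_std` -- a sketch of the effect of the minimum-std-dev setting."""
    std = max(std, min_std)
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# A tight cluster (std 0.001) evaluated with and without a large floor:
# the floor flattens the peak, blurring the cluster's boundary.
print(normal_density(5.0, 5.0, 0.001))               # very sharp peak
print(normal_density(5.0, 5.0, 0.001, min_std=2.0))  # flattened by the floor
```

This is why raising the minimum standard deviation makes the differences between runs easier to see: it smears each cluster's density over a wider range.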
Assignment 3

Association Rule Mining and Clustering


Goals
Cluster a given dataset and use association analysis to describe the clusters obtained.
Procedure
1. Load the dataset
In this exercise, you will work with the Iris dataset, available from
http://staffwww.itn.liu.se/~aidvi/courses/06/dm/labs/iris.arff

2. Preprocessing
Since the association analysis in Weka (the Apriori algorithm) cannot cope with continuous
attributes, we must discretize the Iris dataset before starting the mining process. Weka
provides several filters to apply to the data. Discretize the attributes (except the class
attribute) with an unsupervised filter, using 3 bins. Verify that the data is actually
discretized by pressing the Edit button.
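The unsupervised discretization applied here amounts to equal-width binning: each attribute's range is cut into 3 intervals of equal size. A rough sketch, using made-up values rather than the real Iris measurements:

```python
def discretize_equal_width(values, bins=3):
    """Unsupervised equal-width binning: split [min, max] into `bins`
    equal ranges and map each value to its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) if width else 0 for v in values]

# Invented petal lengths spanning 1.0-7.0 fall into bins of width 2.0.
print(discretize_equal_width([1.0, 1.4, 3.5, 4.0, 5.9, 7.0]))  # → [0, 0, 1, 1, 2, 2]
```

After the filter runs, every numeric attribute holds one of 3 nominal bin labels, which is what Apriori needs.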
3. Association Rules Mining (ARM)
Mine association rules that predict the class attribute (so, these rules are also classification
rules). You can find a detailed description of the algorithm by pressing the More button.
Then, set the desired properties. Find out the meaning of the numbers after the LHS and
RHS of each rule.
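As a hint for interpreting those numbers: a rule's quality is usually summarized by support counts and confidence. The sketch below computes, for a rule LHS ⇒ RHS, the number of records matching the antecedent, the number matching both sides, and their ratio (the confidence); the toy records are invented, and the claim that Weka prints exactly these two counts is my reading of its output, so verify it against the More description:

```python
def rule_stats(transactions, lhs, rhs):
    """Antecedent count, joint count, and confidence of the rule lhs => rhs,
    where each transaction is a set of attribute=value items."""
    lhs_count = sum(1 for t in transactions if lhs <= t)
    both_count = sum(1 for t in transactions if (lhs | rhs) <= t)
    return lhs_count, both_count, both_count / lhs_count

# Invented discretized records: attribute=bin items plus a class item.
data = [
    {"petallength=0", "petalwidth=0", "class=setosa"},
    {"petallength=0", "petalwidth=0", "class=setosa"},
    {"petallength=1", "petalwidth=1", "class=versicolor"},
    {"petallength=1", "petalwidth=1", "class=virginica"},
]
print(rule_stats(data, {"petallength=0"}, {"class=setosa"}))  # → (2, 2, 1.0)
```

A confidence of 1.0 means every record matching the antecedent also matches the consequent, which is exactly what makes a rule usable as a classification rule.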
4. Clustering
Use SimpleKMeans to cluster the data into 3 clusters (since we know there are 3 types
of Iris flowers) and use seed value 10. In the Cluster mode, select Classes to clusters
evaluation to cross-tabulate the clustering and the class labeling. It is important that the
clustering ignores the class attribute. Find the centroids of each cluster. Visualize the
cluster assignments.
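The classes-to-clusters evaluation assigns each cluster the class most frequent among its members and counts the instances that fall outside that majority class. A minimal sketch with invented assignments, not actual Weka output:

```python
from collections import Counter

def classes_to_clusters(assignments, labels):
    """Map each cluster to its majority class and count the instances
    falling outside it, as in a classes-to-clusters evaluation."""
    by_cluster = {}
    for cluster, label in zip(assignments, labels):
        by_cluster.setdefault(cluster, []).append(label)
    errors = sum(len(labs) - Counter(labs).most_common(1)[0][1]
                 for labs in by_cluster.values())
    return errors, errors / len(labels)

# Hypothetical clustering of 9 flowers into 3 clusters: cluster 1 mixes
# one virginica in with the versicolors, so 1 instance is "misplaced".
assign = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels = (["setosa"] * 3 + ["versicolor", "versicolor", "virginica"]
          + ["virginica"] * 3)
print(classes_to_clusters(assign, labels)[0])  # → 1
```

A low error count here means the unsupervised clusters line up well with the known Iris species.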
5. Describing the Clusters through ARM
Use association rule mining to assist you in describing (in plain English) the clusters found in
the Iris dataset. To this end, add a new attribute to the data representing the cluster label
assigned to each record. Then find rules that are accurate and such that the antecedent
does not contain the class attribute and the consequent contains only the cluster attribute,
i.e. rules predicting the cluster. Find such rules for the 3 clusters. This should help
you to describe the instances grouped in each cluster. Finally, repeat the exercise with a
different combination of clustering algorithm, number of clusters and/or number of bins in
the discretization filter, in order to see whether you get better or worse results. Indicate the
SSE measure for each case.
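The SSE (within-cluster sum of squared errors) that you are asked to report is, for every point, the squared Euclidean distance to its cluster's centroid, summed over all clusters. A minimal sketch with invented points:

```python
def sse(clusters, centers):
    """Within-cluster sum of squared errors: squared Euclidean distance of
    each point to its cluster's centroid, summed over all clusters."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for cl, c in zip(clusters, centers) for p in cl)

# Two tiny invented clusters with their centroids: the first contributes
# 1 + 1, the second 0, so the total SSE is 2.0.
clusters = [[(1.0, 2.0), (3.0, 2.0)], [(10.0, 10.0)]]
centers = [(2.0, 2.0), (10.0, 10.0)]
print(sse(clusters, centers))  # → 2.0
```

Lower SSE indicates tighter clusters, so it gives you a single number for comparing the different algorithm/cluster-count/bin-count combinations.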
