1. Introduction
Weka is open-source software released under the GNU General Public License,
developed at the University of Waikato in New Zealand. "Weka" stands for the
Waikato Environment for Knowledge Analysis. The software is freely available at
http://www.cs.waikato.ac.nz/ml/weka. The system is written in Java, an
object-oriented language. Weka provides implementations of state-of-the-art
data mining and machine learning algorithms, with modules for data
preprocessing, classification, clustering, and association rule extraction.
It can be used at several different levels, through four main interfaces:
• Explorer
– preprocessing, attribute selection, learning, visualization
• Experimenter
– testing and evaluating machine learning algorithms
• Knowledge Flow
– visual design of the KDD process
• Simple Command-line
– a simple interface for typing commands
Attribute-Relation File Format (ARFF) is the default file type for data analysis
in Weka, but data can also be imported from various other formats.
The ARFF version of the weather dataset from Weka's sample data is presented here.
Attribute types are specified in the header section: a nominal attribute lists its
distinct values in curly brackets after the attribute name, while a numeric
attribute is specified by the keyword real after the attribute name.
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
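To make the header/data split concrete, here is a minimal, hypothetical ARFF
reader sketch in Python. It handles only nominal and real attributes and plain
comma-separated data rows (no quoting or sparse format), and is in no way a
substitute for Weka's own loader:

```python
def parse_arff(lines):
    """Tiny ARFF reader: returns (attributes, data rows).

    Nominal attributes become (name, [values]); numeric ones (name, 'real').
    Quoting, sparse data and dates are NOT handled.
    """
    attributes, data, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('%'):    # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@attribute'):
            _, rest = line.split(None, 1)
            name, spec = rest.split(None, 1)
            if spec.startswith('{'):            # nominal: values in curly brackets
                values = [v.strip() for v in spec.strip('{}').split(',')]
                attributes.append((name, values))
            else:                               # numeric: keyword real/numeric
                attributes.append((name, spec.strip()))
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            data.append(line.split(','))
    return attributes, data
```

Running this on a header like the weather dataset's yields one entry per
declared attribute, followed by the raw data rows.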
2. WEKA Explorer
Some attributes may not be required in the analysis; those attributes can be
removed from the dataset before analysis. For example, the instance-number
attribute of the iris dataset is not needed, and can be removed by selecting it
in the Attributes check box and clicking Remove (Fig. 3). The resulting dataset
can then be stored in ARFF format.
If some attributes need to be removed before the data mining step, this can be
done using the attribute filters in WEKA. In the "Filter" panel, click the
"Choose" button. This shows a popup window with a list of available filters.
Scroll down the list and select the "weka.filters.unsupervised.attribute.Remove"
filter as shown in Figure 4. Next, click the text box immediately to the right
of the "Choose" button. In the resulting dialog box, enter the index of the
attribute to be filtered out (this can be a range or a comma-separated list). In
this case, we enter 1, which is the index of the "id" attribute (see the left
panel). Make sure that the "invertSelection" option is set to false (otherwise
everything except attribute 1 will be filtered out) (Fig. 5). Then click "OK".
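The effect of removing attributes by index, including the invertSelection
option, can be illustrated with a small Python sketch. The 1-based indices
mirror Weka's attribute numbering, but the function itself is hypothetical,
not Weka's API:

```python
def remove_attributes(rows, indices, invert_selection=False):
    """Drop the attributes at the given 1-based indices from every row.

    With invert_selection=True, everything EXCEPT those indices is dropped,
    mirroring the filter's invertSelection option.
    """
    def keep(position):               # position is 1-based, like Weka
        selected = position in indices
        return selected if invert_selection else not selected
    return [[value for j, value in enumerate(row, start=1) if keep(j)]
            for row in rows]
```

For example, removing index 1 drops the first ("id") column from every row,
while invert_selection=True keeps only that column.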
4.2 Discretization
You can observe that WEKA has assigned its own labels to each of the value
ranges for the discretized attribute. For example, the lower range in the "age"
attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape
characters), while the middle range is labeled "(34.333333-50.666667]", and so
on. These labels now also appear in the data records where the original age value
was in the corresponding range.
Fig. 6: Discretization Filter
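Interval boundaries like those above arise from equal-width binning. The sketch
below shows how such cut points are computed; the age range [18, 67] and the
three bins are only assumptions, chosen so that the cut points match the labels
shown above:

```python
def equal_width_edges(lo, hi, bins):
    """Interior cut points for equal-width discretization of [lo, hi]."""
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]

# Hypothetical range: with lo=18, hi=67 and 3 bins the cut points are
# 34.333... and 50.666..., giving the intervals (-inf-34.333333],
# (34.333333-50.666667] and (50.666667-inf).
```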
The first two columns are the TP Rate (True Positive Rate) and the FP Rate
(False Positive Rate). For the first row, where play=yes, the TP Rate is the
ratio of play=yes cases predicted correctly to the total number of positive
cases (e.g. 8 out of 9 predicted correctly = 8/9 ≈ 0.889).
The FP Rate is the ratio of play=no cases incorrectly predicted as play=yes
to the total number of play=no cases. One play=no case was wrongly predicted
as play=yes, so the FP Rate is 1/5 = 0.2.
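The two rates can be checked with a few lines of Python; the counts 8, 9, 1
and 5 are taken from the example above:

```python
# play=yes: 8 of 9 positive cases predicted correctly
tp, positives = 8, 9
tp_rate = tp / positives        # 8/9 ≈ 0.889

# play=no: 1 of 5 negative cases wrongly predicted as play=yes
fp, negatives = 1, 5
fp_rate = fp / negatives        # 1/5 = 0.2
```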
The next two columns are terms from information retrieval theory. When
searching for relevant documents, it is often not possible to reach them
easily or directly; in many cases a search yields many results, a lot of
which are irrelevant. Under these circumstances it is often impractical to
examine all results at once, so only a portion is examined at a time. In
such cases the terms recall and precision become important.
Recall is the ratio of relevant documents found in the search result to the
total number of relevant documents. Thus higher recall at a given cutoff means
that relevant documents are found earlier in the results. A recall of 30% at
10% means that 30% of the relevant documents were found after examining only
10% of the results.
Precision is the proportion of relevant documents in the results returned. Thus
a precision of 0.75 means that 75% of the returned documents were relevant.
In our example, such measures are not very meaningful: the recall simply
corresponds to the TP Rate, since we are always looking at 100% of the test
sample, and precision is just the proportion of cases of the corresponding
class in the test sample.
The F-measure combines recall and precision into a single measure of
performance. Its formula is: F = 2 × recall × precision / (recall + precision).
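As a quick sketch (the helper name is ours, not Weka's):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, f_measure(0.75, 0.6) gives 0.9/1.35 ≈ 0.667; the harmonic mean
penalizes an imbalance between the two scores more than an arithmetic mean
would.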
Fig. 8: ID3 algorithm in weka
The confusion matrix shows how the instances are distributed over the result
classes. For example, class a contains the majority of objects (8) from the yes
category, so a is treated as the class of the "yes" group. Similarly, b contains
the majority of objects (4) from the no category, so b is treated as the class
of the "no" group. One object from each class is misclassified, giving 2
incorrectly classified instances. The user can also view a plot of the tree.
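Reading off the matrix programmatically; the counts below reproduce the
example (8 correct yes, 4 correct no, one error each way):

```python
# rows = actual class (yes, no); columns = predicted class (a, b)
confusion = [
    [8, 1],   # actual yes: 8 predicted as a, 1 as b
    [1, 4],   # actual no:  1 predicted as a, 4 as b
]
correct = sum(confusion[i][i] for i in range(2))
misclassified = sum(confusion[i][j]
                    for i in range(2) for j in range(2) if i != j)
accuracy = correct / (correct + misclassified)   # 12 of 14 instances
```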
K-means is the most widely used clustering algorithm. The user needs to specify
the number of clusters (k) in advance. The algorithm randomly selects k objects
as the initial cluster means (centers) and then works towards minimizing the
squared-error criterion function, defined as:
E = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − m_i||², where m_i is the mean of cluster C_i.
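The procedure just described (random initial means, then alternating
assignment and mean-update steps) can be sketched for one-dimensional data.
This is an illustration only, not Weka's SimpleKMeans:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal 1-D k-means: returns the final cluster means."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                 # k objects as initial means
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for x in points:                          # assign to nearest mean
            nearest = min(range(k), key=lambda i: (x - means[i]) ** 2)
            clusters[nearest].append(x)
        new_means = [sum(c) / len(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        if new_means == means:                    # converged
            break
        means = new_means
    return means
```

On well-separated data such as [0.9, 1.0, 1.1, 9.8, 10.0, 10.2] with k = 2,
the means converge to roughly 1.0 and 10.0.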
The main steps of the k-means algorithm are: