
WEKA Software

(PDAM)
Submission 2

WEKA
Classification & Clustering
Table of Contents
WEKA – Classification algorithms
    Decision Tree
    Random Forest
    K-NN – Nearest Neighbour (IBk)
Analysis and Conclusion
WEKA - Clustering algorithms
Bibliography



Data mining with WEKA Software

WEKA – Classification algorithms

Decision Tree

The ultimate aim of a decision tree is to predict where a target variable would lie. It starts with a data
collection and lets the system learn about the various classes; the resulting model can later be used to predict where a
new instance would fit. This is done by splitting the source data into subsets based on a set of rules. The process is
called recursive partitioning (IBM, 2012), as it is repeated for each subset of data until further partitioning no longer
adds any value to the predictions, or until all instances at a node have the same value of the target variable. Various
algorithms are available for performing the partitioning, such as J48, CART and C4.5.
Decision Tree
Advantages:
• Simple to understand and interpret
• Able to handle both numerical and categorical data
• Requires little data preparation
• Possible to validate a model using statistical tests
• Robust: performs well even if its assumptions are somewhat violated by the true model from which the data was generated
• Performs well with large data in a short time
Limitations:
• Trees can get very complex
• Sometimes trees can become prohibitively large
• Information from a tree is biased towards attributes with more levels

Decision trees support both classification and regression problems; for this reason, the decision tree is more recently
also referred to as the Classification and Regression Tree (CART). It works by creating a tree that evaluates an instance of data,
starting at the root of the tree and moving down to the leaves until a prediction can be made. The tree is created by
selecting the best split point for making predictions and repeating the process until the tree reaches a fixed depth.
After the tree is constructed, it is pruned in order to improve the model's ability to generalize to new data.
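As an illustration only, the following minimal sketch (assuming the standard WEKA Java API with weka.jar on the classpath, and an illustrative file path) shows how a pruned J48 decision tree can be built programmatically; the experiments in this report were run through the WEKA Experimenter GUI, not this code.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Sketch {
        public static void main(String[] args) throws Exception {
            // Load an ARFF file (path is illustrative only)
            Instances data = DataSource.read("vowel.arff");
            // The last attribute is assumed to be the class
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.setUnpruned(false);          // keep post-pruning enabled (the default)
            tree.setConfidenceFactor(0.25f);  // default pruning confidence
            tree.buildClassifier(data);

            // Print the induced tree
            System.out.println(tree);
        }
    }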

Random Forest

Random forest (RF) is an instance of the general technique of random decision forests, an ensemble learning
method for classification, regression and other tasks. It operates by constructing a multitude of decision trees at
training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of
the individual trees. In other words, it builds multiple decision trees and merges them together to obtain a more accurate and
stable prediction. Random decision forests correct the decision trees' habit of overfitting to their training set.
A RF consists of an arbitrary number of simple trees, which are used to determine the final outcome. For
classification problems, the ensemble of simple trees votes for the most popular class; in regression problems, their
responses are averaged to obtain an estimate of the dependent variable. The RF method is particularly well suited to
"small n, large p" problems (Strobl, Malley, & Tutz, 2009).
Random Forest
Advantages:
• Random forests are extremely flexible and have very high accuracy
• Maintain accuracy even when a large proportion of the data is missing
• Automatic predictor selection from a large number of candidates
• Resistance to over-training
• Ability to handle data without pre-processing
• Cluster identification can be used to generate tree-based clusters through sample proximity
• Random forests also have less variance than a single decision tree, which means they work well for a wider range of data than single decision trees
Limitations:
• They are much harder and more time-consuming to construct than decision trees
• They require more computational resources
• With a large collection of decision trees, it is hard to have an intuitive grasp of the relationships existing in the input data
• Poor performance on imbalanced data
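A comparable sketch for Random Forest is shown below. It is a hedged illustration using the standard WEKA Java API; the file name, number of trees and seed are arbitrary example values, and the exact option names may differ slightly between WEKA versions.

    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RandomForestSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("vehicle.arff");   // illustrative path
            data.setClassIndex(data.numAttributes() - 1);

            RandomForest rf = new RandomForest();
            // "-I 100" sets the number of trees in the ensemble, "-S 1" the random seed
            rf.setOptions(weka.core.Utils.splitOptions("-I 100 -S 1"));
            rf.buildClassifier(data);

            System.out.println(rf);
        }
    }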

K-NN – Nearest Neighbour (IBk)


The k-nearest neighbours (K-NN) algorithm supports both classification and regression and is a well-known instance-based
(lazy) technique. For a noisy dataset the accuracy may be improved if k is slightly larger (Abernethy, 2010), but increasing
k too far, towards the size of the dataset, will actually decrease the accuracy. It is a simple algorithm, but one that does
not assume very much about the problem other than that the distance between data instances is meaningful in making
predictions. As such, it often achieves very good performance. When making predictions on classification problems,
K-NN takes the mode (most common class) of the k most similar instances in the training dataset, and the size of the
neighbourhood is controlled by the k parameter.

It works by storing the entire training dataset and querying it to locate the k most similar training patterns when
making a prediction. As such, there is no model other than the raw training dataset and the only computation performed
is the querying of the training dataset when a prediction is requested (Hassanat, Abbadi, Altarawneh, & Alhasanat,
2014).

In WEKA software, Euclidean distance is used by default to calculate the distance between instances, which is
appropriate for numerical data on the same scale. However, Manhattan distance is a good choice if the attributes differ in
measure or type, as can be observed in Figure 1.
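As a hedged sketch of that configuration (assuming the standard WEKA Java API; the file name and k value are only examples), the distance function used by IBk can be swapped from the default Euclidean to Manhattan as follows:

    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;
    import weka.core.ManhattanDistance;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.neighboursearch.LinearNNSearch;

    public class IBkSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("solar-flare1.arff");  // illustrative path
            data.setClassIndex(data.numAttributes() - 1);

            IBk knn = new IBk();
            knn.setKNN(3);  // size of the neighbourhood (the k parameter)

            // Replace the default Euclidean distance with Manhattan distance
            LinearNNSearch search = new LinearNNSearch();
            search.setDistanceFunction(new ManhattanDistance());
            knn.setNearestNeighbourSearchAlgorithm(search);

            knn.buildClassifier(data);  // IBk simply stores the training data
            System.out.println(knn);
        }
    }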

K-NN
Advantages:
• Robust to noisy training data
• Effective if the training data is large
• Very simple to understand and equally easy to implement
• No assumptions need to be met to implement K-NN
• Memory-based approach: the classifier adapts immediately when new training data is collected
• Can be used for both classification and regression
• Variety of distance criteria to choose from (Euclidean, Hamming, Manhattan, Minkowski)
Limitations:
• With distance-based learning it is not clear which type of distance and which attributes to use to produce the best results
• Computation cost is quite high because the distance from each query instance to all training samples must be computed
• The optimal number of neighbours must be chosen before classifying a new data entry
• K-NN inherently has no capability of dealing with missing values



Analysis and Conclusion

For the experiment, 40 datasets were used; two of them are tailed, resulting in 39 outputs (Figure 1). The selected
datasets were further divided into 3 categories (types of data): Mix, Numeric and Nominal, and all 3 algorithms were
applied using the WEKA Experimenter. For the K-NN algorithm two distance measures were used, Euclidean and Manhattan.

Figure 1 - Dataset results

Note: For a larger output view please see Appendix A or attached spreadsheet file.

Figure 2 - Datasets by type


Figure 3 shows different stages of output from WEKA. Decision Tree and Random Forest were run in the same
instance.

Figure 3 - WEKA Experiment environment results



For analysis purposes, an initial focus is given to 3 datasets (one from each type), applying all 3 algorithms and analysing
the accuracy. The chosen datasets have a relatively similar number of instances and attributes. The datasets used are:
▪ Vowel.arff – Mix (Instances: 990; Attributes: 14; Classes: 11)
▪ Vehicle.arff – Numeric (Instances: 846; Attributes: 19; Classes: 4)
▪ Solar flare1.arff – Nominal (Instances: 1066; Attributes: 13; Classes: 6)
The classifiers were trained and tested and the obtained results recorded. All evaluations used the
default configuration of parameters.
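The runs themselves were performed in the WEKA Experimenter. Purely as a sketch of what that tool does under the hood (assuming the standard WEKA Java API, 10-fold cross-validation and an illustrative file name), the training-set and cross-validation accuracies reported in Table 1 could be approximated as follows:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.trees.J48;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareClassifiers {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("vowel.arff");   // one dataset per run
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] classifiers = { new J48(), new RandomForest(), new IBk(1) };
            for (Classifier c : classifiers) {
                // 10-fold cross-validation accuracy (works on internal copies of the classifier)
                Evaluation cvEval = new Evaluation(data);
                cvEval.crossValidateModel(c, data, 10, new Random(1));

                // Training-set accuracy: build on all data, then evaluate on the same data
                c.buildClassifier(data);
                Evaluation trainEval = new Evaluation(data);
                trainEval.evaluateModel(c, data);

                System.out.printf("%-15s train=%.2f%%  cv=%.2f%%%n",
                        c.getClass().getSimpleName(),
                        trainEval.pctCorrect(), cvEval.pctCorrect());
            }
        }
    }

Repeating the loop with IBk instances for the other k values would fill in the remaining rows of Table 1.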

Comparison results of classifiers (accuracy, %)

Vowel (Mix)
Classifier            Training Set    Cross-Validation
J48                   97.67           80.20
Random Forest         100             98.10
KNN (k=1)             100             99.05
KNN (k=3)             99.59           96.99
KNN (k=5)             98.48           93.39

Vehicle (Numeric)
Classifier            Training Set    Cross-Validation
J48                   96.92           72.28
Random Forest         100             74.87
KNN (k=1)             100             69.59
KNN (k=2)             86.52           67.63
KNN (k=3)             85.69           70.21

Solar Flare (Nominal)
Classifier            Training Set    Cross-Validation
J48                   99.53           99.53
Random Forest         99.90           99.45
KNN (k=1)             99.90           99.25
KNN (k=3)             99.53           99.53
KNN (k=5)             99.53           99.53

Table 1 - Comparison results of classifiers

In the table above we observe that for the mixed data type, KNN achieves higher accuracy than J48, with RF close behind (less
than 1% difference). RF performs better on numeric attributes, while for nominal data J48 and KNN share the same value and RF
is very close (0.08% difference).

A second analysis was then carried out on multiple datasets of the same type. For this analysis all values were included
but separated by data type. The results are represented as box plots comparing the distributions between
datasets; the median value is marked with an X and the line represents the average value of the data block. The
analysis results revealed the following:

For the mixed dataset types (Figure 4), we can observe that RF has a better average performance than the other
algorithms, while J48 and KNN (E/M) have nearly identical values. Furthermore, the mixed data types have the lowest
overall accuracy (84%) and fall below the min-max accuracy range of the numeric and nominal types.

Figure 4 - Mix datatype average accuracy based on total number of datasets



For the numeric data type, RF again performs better than the others on average, but the KNN (E/M) values are very
close. The numeric data types have the accuracy closest to 100% (94-95%). Therefore, we can conclude that the numeric
data type gives the best performance.

Figure 5 - Numeric datatype average accuracy based on total number of datasets

For the nominal data type, all 3 algorithms perform very close to each other and the overall accuracy is
around 92%.

Figure 6 - Nominal data type average accuracy based on total number of datasets



Another analysis carried out was to see whether the number of instances has any impact on the accuracy. Analysing
Figure 7, we can observe that a higher number of instances (more available data) improves the accuracy for all
algorithms, regardless of the data type. However, the time required to compute the final result(s) increases as well.

Figure 7 - Accuracy based on number of instances



The following analysis was carried out to see whether the number of attributes in a dataset has an impact on the overall
accuracy. Analysing Figure 8, we can notice that for mixed datasets a larger number of attributes has a slightly positive
impact. For the numeric and nominal types, the impact is negative.

Figure 8 - Accuracy trending based on number of attributes

These experimental results show that different classifiers have different advantages and disadvantages: K-NN had
the best running time, RF produced the best accuracy and J48 had the best F-measure. Therefore, there is no perfect classifier
for a given problem, as each classifier has its own characteristics. Furthermore, it should be noted that for better
detection accuracy, the optimization of machine learning classifiers and the use of hybrid classifiers are crucial factors.
In some cases, combining two or more classifiers can produce lower training/testing times and lower
false positive rates.



WEKA - Clustering algorithms

Cluster analysis is used to assign a set of objects to groups such that the objects are more similar to other objects within
their own group than to those belonging to other groups. As a statistical data analysis tool, cluster analysis forms the backbone of
various disciplines including marketing research, pattern finding, intelligence gathering and image analysis. Cluster
analysis comprises a set of algorithms, each used in specific situations; these include hierarchical clustering, K-Means
clustering and others. One defining benefit of clustering over classification is that every attribute in the data set is used to
analyse the data. A major disadvantage of clustering is that the user is required to know ahead of time how many
groups they want to create; for a user without any real knowledge of the data, this might be difficult (Abernethy, 2010).
K Means Clustering
This method has the objective of partitioning a set of n objects into k clusters, based on their closeness to the cluster
centres. The closeness to the cluster centres is measured using a standard distance function. K-Means clustering is
computationally very fast compared with other types of clustering algorithms. The number of clusters is expected as an
input for the algorithm to work.
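A minimal sketch of K-Means in WEKA is shown below (assuming the standard WEKA Java API; the number of clusters, seed and file name are illustrative). The class attribute, if present, would normally be removed before clustering.

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class KMeansSketch {
        public static void main(String[] args) throws Exception {
            // Load a dataset without a class attribute (path is illustrative)
            Instances data = DataSource.read("dataset.arff");

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(3);   // k must be supplied as an input
            km.setSeed(10);         // seed for the random initial centroids
            km.buildClusterer(data);

            // Cluster centroids and within-cluster sum of squared errors
            System.out.println(km.getClusterCentroids());
            System.out.println("SSE: " + km.getSquaredError());
        }
    }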
Agglomeration Method
The agglomeration method, or hierarchical clustering, proceeds successively either by merging smaller clusters into larger
ones or by splitting larger clusters. The result of the algorithm is a tree of clusters, called a dendrogram, which shows
how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint
groups is obtained.
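A corresponding sketch for hierarchical clustering is given below (again assuming the standard WEKA Java API; the linkage type and cluster count are example choices, and option names may vary between versions). WEKA's HierarchicalClusterer effectively "cuts" the dendrogram by being asked for a fixed number of clusters, and can print the tree in Newick format.

    import weka.clusterers.HierarchicalClusterer;
    import weka.core.Instances;
    import weka.core.Utils;
    import weka.core.converters.ConverterUtils.DataSource;

    public class HierarchicalSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff");  // illustrative path

            HierarchicalClusterer hc = new HierarchicalClusterer();
            // "-N 3" cuts the dendrogram into 3 clusters, "-L AVERAGE" selects average linkage
            hc.setOptions(Utils.splitOptions("-N 3 -L AVERAGE"));
            hc.buildClusterer(data);

            // The dendrogram as a Newick-format string
            System.out.println(hc.graph());
        }
    }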
Performance Comparison
The main objective of this test is to make a comparative analysis of K-Means and Hierarchical Clustering.
For this task 10 datasets were selected, with the number of classes varying from 2 to 6 and covering all 3 types of data. However,
the number of instances was kept below 5000, because the hierarchical technique requires a larger amount of computing
power and the running time would otherwise be extremely long.
Each dataset was recorded in a table (Figure 9) including the data type, number of instances, number of attributes and
number of classes. The K-Means and Hierarchical Clustering algorithms were applied to each dataset and the running time
(in seconds) was recorded.
Note: The recorded running time may be affected by the physical configuration of the computer.
Additionally, the incorrectly clustered instances (as a percentage) and the allocation of clustered instances were
recorded for each dataset.
Figure 9 - Cluster analysis datasets
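As a rough sketch of how such measurements could be reproduced outside the GUI (assuming the standard WEKA Java API; the file name is illustrative), the percentage of incorrectly clustered instances comes from WEKA's classes-to-clusters evaluation, and the running time can be measured around buildClusterer(). The same pattern applies with HierarchicalClusterer in place of SimpleKMeans.

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClusterBenchmark {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff");   // illustrative path
            data.setClassIndex(data.numAttributes() - 1);

            // Remove the class attribute before clustering (classes-to-clusters setup)
            Remove remove = new Remove();
            remove.setAttributeIndices(String.valueOf(data.classIndex() + 1));
            remove.setInputFormat(data);
            Instances dataNoClass = Filter.useFilter(data, remove);

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(data.numClasses());

            long start = System.nanoTime();
            km.buildClusterer(dataNoClass);              // time only the model building
            double seconds = (System.nanoTime() - start) / 1e9;

            // Classes-to-clusters evaluation reports incorrectly clustered instances
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(km);
            eval.evaluateClusterer(data);                // data still contains the class

            System.out.printf("Build time: %.3f s%n", seconds);
            System.out.println(eval.clusterResultsToString());
        }
    }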



The first analysis conducted for the clustering techniques was to see whether there is any relation between the number of clusters
and the accuracy. We can observe in Figure 10 that a larger number of clusters is prone to a higher number of errors, and hence
lower accuracy.

Figure 10 - Error analysis in relation with number of clusters

The second test conducted was a speed performance comparison. Observing Figure 11, we can conclude that hierarchical clustering
cannot handle big data well but K-Means clustering can. This is because the time complexity of K-Means is linear
(O(n)) while that of hierarchical clustering is quadratic (O(n²)); for example, scaling a dataset from 1,000 to 5,000 instances
would multiply the K-Means running time by roughly 5, but the hierarchical running time by roughly 25.

Figure 11 - Speed performance



The third analysis was carried out against the data type. For this analysis the datasets were organised by data type and for each
algorithm the average was calculated (Figure 12).

Figure 12 - Datasets sorted on type and average error value

Analysing the resulting data (Figure 13), we can observe that both algorithms achieve their best performance on nominal
data types, with very close values. For mixed data types HC performs better than K-Means, whereas for numeric data it is
the other way around.

Figure 13 - Accuracy analysis

After analysing the results of testing the algorithms, the following conclusions are drawn:
▪ The overall performance of the K-Means algorithm is better than that of the Hierarchical Clustering algorithm, and it is very
efficient in processing large datasets.
▪ All the algorithms show some ambiguity on noisy data when clustering.
▪ Density-based clustering algorithms are not suitable for data with high variance in density.
▪ The K-Means algorithm produces quality clusters when using huge datasets.
▪ The Hierarchical Clustering algorithm requires more computing power and is in every aspect slower than K-Means.



▪ The Hierarchical Clustering algorithm is more sensitive to noisy data.

Bibliography
Abernethy, M. (2010, May 12). Classification and clustering. Retrieved from IBM Developer:
https://developer.ibm.com/articles/os-weka2/

Abernethy, M. (2010, Jun 08). Data mining with WEKA: Nearest Neighbor and server-side library. Retrieved from
IBM: https://www.ibm.com/developerworks/library/os-weka3/index.html

Hassanat, A. B., Abbadi, M. A., Altarawneh, G. A., & Alhasanat, A. A. (2014). Solving the Problem of the K Parameter
in the KNN Classifier Using an Ensemble Learning Approach. International Journal of Computer Science and
Information Security, 12(8), 33-39.

IBM. (2012). Decision Tree Models. Retrieved from IBM Knowledge Center:
https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/nodes_treebuilding.htm

Strobl, C., Malley, J., & Tutz, G. (2009, Dec). An introduction to recursive partitioning: Rationale, application, and
characteristics of classification and regression trees, bagging, and random forests. Psychological Methods,
14(4), 323-348. doi:10.1037/a0016973




Appendix A - Datasets
