(PDAM)
Submission 2
WEKA
Classification & Clustering
Table of Contents
WEKA – Classification algorithms
Decision Tree
Random Forest
K-NN – Nearest Neighbour (IBk)
Analysis and Conclusion
WEKA - Clustering algorithms
Bibliography
Decision Tree
The ultimate aim of a decision tree is to predict the value of a target variable. It starts from a training data collection and learns the various classes in it; the resulting model can then be used to predict where a new instance would fit. It does this by splitting the source data into subsets based on attribute tests. This process is called recursive partitioning (IBM, 2012) because it is repeated for each subset of the data; it stops when further partitioning no longer adds any value to the predictions, for example when all instances at a node share the same target value. There are various algorithms available for performing the partitioning, such as CART and C4.5 (implemented in WEKA as J48).
Decision Tree

Advantages:
• Simple to understand and interpret.
• Able to handle both numerical and categorical data.
• Requires little data preparation.
• Possible to validate a model using statistical tests.
• Robust: performs well even if its assumptions are somewhat violated by the true model from which the data was generated.
• Performs well with large data in a short time.

Limitations:
• Trees can get very complex.
• Sometimes trees can become prohibitively large.
• Information from a tree is biased towards attributes with more levels.
Decision trees support both classification and regression problems, and for this reason they are also referred to as Classification and Regression Trees (CART). A tree evaluates an instance of data by starting at the root and moving down to the leaves until a prediction can be made. The tree is built by selecting the best split point for making predictions and repeating the process on each subset until a stopping criterion, such as a fixed depth, is reached. After the tree is constructed, it is pruned in order to improve the model's ability to generalize to new data.
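As a minimal sketch of the split-selection step (not WEKA's J48 itself), the following pure-Python example picks the threshold on a single numeric attribute that minimises the weighted Gini impurity of the two resulting subsets; the dataset and function names are illustrative assumptions:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Try each attribute value as a threshold and keep the one that
    minimises the weighted Gini impurity of the two subsets."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: one numeric attribute and two cleanly separated classes.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # → (3.0, 0.0): a pure split at x <= 3.0
```

Recursive partitioning then repeats this search within each subset until a stopping criterion is met.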
Random Forest
Random forest (RF) is an instance of the general technique of random decision forests, an ensemble learning method for classification, regression and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' classes (classification) or their mean prediction (regression). In other words, it builds multiple decision trees and merges them together to obtain a more accurate and stable prediction. Random decision forests correct the decision trees' habit of overfitting to their training set.
A RF consists of an arbitrary number of simple trees which are used to determine the final outcome. For classification problems, the ensemble of simple trees votes for the most popular class; for regression problems, their responses are averaged to obtain an estimate of the dependent variable. The RF method is particularly well suited to "small n, large p" problems (Strobl, Malley, & Tutz, 2009).
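The bootstrap-and-vote mechanism can be sketched as follows. This toy example deliberately uses one-level "stumps" on a single numeric feature as the simple trees, an illustrative simplification of a real random forest such as WEKA's (which also randomises feature selection at each split); all names and data here are assumptions:

```python
import random
from collections import Counter

def train_stump(sample):
    """A 'simple tree': split at the mean of the feature and give each
    side the majority class of the bootstrap sample on that side."""
    t = sum(x for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x <= t]
    right = [y for x, y in sample if x > t]
    left_label = Counter(left or right).most_common(1)[0][0]
    right_label = Counter(right or left).most_common(1)[0][0]
    return t, left_label, right_label

def random_forest(data, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data])
            for _ in range(n_trees)]

def predict(forest, x):
    """Output the mode of the individual trees' classes."""
    votes = [(l if x <= t else r) for t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]

data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b"), (9.5, "b")]
forest = random_forest(data)
print(predict(forest, 1.2), predict(forest, 9.2))  # → a b
```

Because each tree sees a different bootstrap sample, their individual errors tend to cancel out in the vote, which is what counteracts overfitting.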
Random Forest

Advantages:
• Extremely flexible, with very high accuracy.
• Maintains accuracy even when a large proportion of the data is missing.
• Automatic predictor selection from a large number of candidates.
• Resistance to over-training.
• Ability to handle data without pre-processing.

Limitations:
• Much harder and more time-consuming to construct than decision trees.
• Requires more computational resources.
• With a large collection of decision trees, it is hard to have an intuitive grasp of the relationships existing in the input data.
• Poor performance on imbalanced data.
K-NN – Nearest Neighbour (IBk)
K-NN works by storing the entire training dataset and querying it to locate the k most similar training patterns when making a prediction. As such, there is no model other than the raw training dataset, and the only computation performed is the querying of the training dataset when a prediction is requested (Hassanat, Abbadi, Altarawneh, & Alhasanat, 2014).
In the WEKA software, Euclidean distance is used by default to calculate the distance between instances, which is well suited to numerical data on the same scale. However, Manhattan distance is a good choice if the attributes differ in measure or type, as can be observed in Figure 1.
K-NN

Advantages:
• Robust to noisy training data.
• Effective if the training data is large.
• Very simple to understand and equally easy to implement.
• No assumptions about the underlying data need to be met.
• Memory-based: the classifier immediately adapts when new training data is collected.
• Can be used both for classification and regression.
• Variety of distance criteria to choose from (Euclidean, Hamming, Manhattan, Minkowski).

Limitations:
• With distance-based learning it is not clear which type of distance, and which attributes, produce the best results.
• Computation cost is quite high, because the distance from each query instance to all training samples must be computed.
• The optimal number of neighbours to consider when classifying a new data entry must be chosen.
• K-NN inherently has no capability of dealing with the missing-value problem.
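The memory-based approach described above can be sketched in a few lines: "training" is just storing the dataset, and prediction queries it for the k nearest patterns under a chosen distance. Both distance functions discussed for WEKA's IBk are shown; the dataset and helper names are illustrative assumptions:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3, distance=euclidean):
    """Find the k training patterns closest to the query and return
    their majority class; no model beyond the stored dataset exists."""
    neighbours = sorted(train, key=lambda row: distance(row[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Stored "training set": (feature vector, class label) pairs.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.8), "b"), ((4.9, 5.1), "b")]
print(knn_predict(train, (1.1, 1.0)))                      # → a
print(knn_predict(train, (5.1, 5.0), distance=manhattan))  # → b
```

Swapping the `distance` argument is the analogue of changing the distance function in IBk's configuration.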
Analysis and Conclusion
For the experiment 40 datasets were used, two of which were joined together ("tailed"), resulting in 39 outputs (Figure 1). The selected datasets were further divided into 3 categories by type of data (Mix, Numeric and Nominal), and all 3 algorithms were applied with the WEKA Experimenter. For the K-NN algorithm two distance measures were used, Euclidean and Manhattan.
Note: For a larger view of the output please see Appendix A or the attached spreadsheet file.
[Results table: classification accuracy per technique and classifier, evaluated with cross-validation and on the training set.]
In the table above we observe that for mixed data types K-NN achieves higher accuracy than J48, with RF close behind (less than 1% difference). RF performs better for numeric attributes, while for nominal data J48 and K-NN have the same value and RF is very close (0.08% difference).
A second analysis was then carried out across multiple datasets of the same type: all values were included, but separated by data type. Each dataset analysis is represented as a boxplot comparing the distributions between datasets; the median of each data block is marked with an X and the line represents its average value. The analysis results revealed the following:
These experimental results show that the different classifiers have different advantages and disadvantages: K-NN had the best time, RF produced the best accuracy, and J48 had the best F-measure. There is therefore no perfect classifier for a given problem, as each classifier has its own unique characteristics. Furthermore, it should be noted that for better detection accuracy, optimizing the machine learning classifiers and using hybrid classifiers is crucial: in some cases, combining two or more classifiers can produce lower training/testing times and lower false-positive rates.
WEKA - Clustering algorithms
Cluster analysis is used to assign a set of objects to groups such that objects are more similar to the other objects within their group than to those belonging to another. As a statistical data-analysis tool, cluster analysis forms the backbone of various disciplines including marketing research, pattern finding, intelligence gathering and image analysis. Cluster analysis comprises a set of algorithms, each used in specific situations; these include Hierarchical Clustering, K-Means clustering and others. One defining benefit of clustering over classification is that every attribute in the data set is used to analyse the data. A major disadvantage of clustering is that the user is required to know ahead of time how many groups to create; for a user without any real knowledge of the data, this might be difficult (Abernethy, 2010).
K Means Clustering
This method has the objective of classifying a set of n objects into k clusters, based on their closeness to the cluster centres. Closeness to a cluster centre is measured with a standard distance metric. K-Means clustering is computationally very fast compared to other types of clustering algorithms. The number of clusters is expected as an input for the algorithm to work.
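A hedged sketch of the standard K-Means procedure (Lloyd's algorithm) on one-dimensional data follows; note that the number of clusters k is supplied as input, as described above, and the sample data is an illustrative assumption:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Alternate between assigning points to their closest centre and
    moving each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres: random points
    for _ in range(iters):
        # Assignment step: each point joins its closest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[i].append(p)
        # Update step: each centre moves to the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
print(kmeans(points, 2))  # centres converge near 1.0 and 9.0
```

Each iteration costs one pass over the n points per centre, which is the linear behaviour contrasted with hierarchical clustering later in this section.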
Agglomeration Method
The agglomeration method, or hierarchical clustering, proceeds successively either by merging smaller clusters into larger ones (agglomerative) or by splitting larger clusters (divisive). The result of the algorithm is a tree of clusters, called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.
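The agglomerative (bottom-up) variant can be sketched as follows on one-dimensional data, here with single linkage as an assumed choice of inter-cluster distance; stopping once the desired number of clusters remains plays the role of cutting the dendrogram at that level:

```python
def agglomerate(points, n_clusters):
    """Start with every point in its own cluster and repeatedly merge
    the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Single linkage: cluster distance = closest pair of members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerate([1.0, 1.2, 9.0, 9.1, 5.0], 2))
# → [[1.0, 1.2, 5.0], [9.0, 9.1]]
```

Every merge step scans all pairs of remaining clusters, which is where the heavy computational cost discussed below comes from.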
Performance Comparison
The main objective of this test is to make a comparative analysis of K-Means and Hierarchical clustering.
For this task 10 datasets were selected, with the number of classes varying from 2 to 6 and covering all 3 types of data. However, the number of instances was kept below 5,000, because the hierarchical technique requires a larger amount of computing power and the running time would otherwise be extremely long.
Each data set was recorded in a table (Figure 9) including the data type, number of instances, number of attributes and number of classes. The K-Means and Hierarchical clustering algorithms were applied to each data set and the running time (in seconds) was recorded.
Note: The running time may be affected by the physical configuration of the computer.
The second test conducted was of speed performance. Observing Figure 11 we can conclude that hierarchical clustering cannot handle big data well but K-Means clustering can, because the time complexity of K-Means is linear (O(n)) while that of hierarchical clustering is quadratic (O(n²)).
Analysing the resulting data (Figure 13), we can observe that both algorithms achieve their best performance on nominal data types, with very close values. For mixed data types HC performs better than K-Means, while for numeric values it is the other way around.
After analysing the results of testing the algorithms, the following conclusions are drawn:
▪ The overall performance of the K-Means algorithm is better than that of the Hierarchical Clustering algorithm, and it is very efficient in processing large datasets.
▪ All the algorithms show some ambiguity when clustering noisy data.
▪ Density-based clustering algorithms are not suitable for data with high variance in density.
▪ The K-Means algorithm produces quality clusters when using huge datasets.
▪ The Hierarchical clustering algorithm requires more computing power and is slower than K-Means in every respect.
Bibliography
Abernethy, M. (2010, May 12). Classification and clustering. Retrieved from IBM Developer:
https://developer.ibm.com/articles/os-weka2/
Abernethy, M. (2010, Jun 08). Data mining with WEKA: Nearest Neighbor and server-side library. Retrieved from
IBM: https://www.ibm.com/developerworks/library/os-weka3/index.html
Hassanat, A. B., Abbadi, M. A., Altarawneh, G. A., & Alhasanat, A. A. (2014). Solving the Problem of the K Parameter
in the KNN Classifier Using an Ensemble Learning Approach. International Journal of Computer Science and
Information Security, 12(8), 33-39.
IBM. (2012). Decision Tree Models. Retrieved from IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/nodes_treebuilding.htm
Strobl, C., Malley, J., & Tutz, G. (2009, Dec). An introduction to recursive partitioning: Rationale, application, and
characteristics of classification and regression trees, bagging, and random forests. Psychological Methods,
14(4), 323-348. doi:10.1037/a0016973
Appendix A - Datasets