
A Software Tool for Data Clustering Using Particle Swarm Optimization

Kalyani Manda¹, A. Sai Hanuman², Suresh Chandra Satapathy³,
Vinaykumar Chaganti⁴, and A. Vinaya Babu⁵

¹ Maharaj Vijayaram Gajapat Raj Engineering College, Vijayanagaram, India
² GRIET, Hyderabad, India
³ Anil Neerukonda Institute of Technology and Science, Visakhapatnam, India
⁴ GITAM, Vishakapatnam, India
⁵ JNTU, Hyderabad, India

Abstract. Many universities around the world have offered courses on swarm
intelligence since the 1990s. Particle Swarm Optimization (PSO) is a swarm
intelligence technique. It is relatively young, with a pronounced need for a mature
teaching method. This paper presents an educational software tool in MATLAB to
aid the teaching of PSO fundamentals and its applications to data clustering. The
software can run the classical K-Means clustering algorithm and can also simulate
hybridizations of K-Means with PSO to explore better clustering performance. The
graphical user interfaces are user-friendly and offer good learning scope to
aspiring learners of PSO.

Keywords: Particle swarm optimization, data clustering, learning tools.

1 Introduction
Computational techniques inspired by nature, such as artificial neural networks [1],
fuzzy systems [2], evolutionary computation [3], and swarm intelligence [4], have
attracted considerable scholarly interest. Particle Swarm Optimization (PSO) is an
approach to swarm intelligence based on simplified simulations of animal social
behaviors such as fish schooling and bird flocking. It was first introduced by
Kennedy and Eberhart [5] as a self-adaptive search optimization technique. Its
applications are generally found in solving complex engineering problems, mainly in
non-linear function minimization, optimal capacitor placement in distributed
systems, shape optimization, dynamic systems and game theory, constrained and
unconstrained optimization, multi-objective optimization problems, control systems,
and others.
Of late, interest in and scope for PSO research have been high. It is therefore
worthwhile to offer good-quality learning material to beginners in the field, and
simulation is among the better teaching methods. This paper introduces a software
tutorial for PSO, developed to aid the teaching of PSO concepts and their
application to data clustering. The software can simulate the classical K-Means [6]
clustering algorithm, PSO clustering, and hybridizations of K-Means and PSO. It
provides scope for experimentation by letting the learner choose different tuning
parameters for PSO, along with suitable swarm sizes and numbers of iterations, to
obtain better clustering performance. The software is GUI based and supported by
various plots and graphs for better presentation of the derived results. This work
is done using MATLAB (Matrix LABoratory), a computational environment for modeling,
analyzing, and simulating a wide variety of intelligent systems. It is also readily
accessible to students, providing numerous design and analysis tools in its Fuzzy
Systems, Neural Networks, and Optimization toolboxes.

B.K. Panigrahi et al. (Eds.): SEMCCO 2010, LNCS 6466, pp. 278–285, 2010.
© Springer-Verlag Berlin Heidelberg 2010

The remainder of this paper is organized as follows. Section 2 discusses the three
clustering approaches (K-Means, PSO, and the hybrid algorithms) as applied to three
numerical datasets: Iris, Wine, and Cancer, collected from the UCI Machine Learning
Repository. Section 3 presents the software for PSO-based data clustering, covering
the conventional K-Means clustering algorithm, PSO, and the hybrid clustering
algorithms. Section 4 gives a comparative analysis of all the clustering algorithms
with experimental results, based on their intra- and inter-cluster similarities and
quantization error. Section 5 concludes the paper.

2 Data Clustering
Data clustering is a process of grouping a set of data vectors into a number of clusters
or bins such that elements or data vectors within the same cluster are similar to one
another and are dissimilar to the elements in other clusters. Broadly, there are two
classes of clustering algorithms, supervised and unsupervised. With supervised
clustering, the learning algorithm has an external teacher that indicates the target class
to which the data vector should belong. For unsupervised clustering, a teacher does
not exist, and data vectors are grouped based on distance from one another.

2.1 K-Means Clustering


The K-Means algorithm is a partitional clustering technique. It was introduced by
MacQueen [6]. The K in K-Means signifies the number of clusters into which the data
is to be partitioned. The algorithm aims at assigning each pattern of a given
dataset to the cluster having the nearest centroid. K-Means uses a similarity
measure to determine the closeness of two patterns. Similarity can be measured
using the Euclidean, Manhattan, or Minkowski distance. In this paper, the Euclidean
distance is used as the similarity measure. For more on the K-Means clustering
algorithm, refer to MacQueen [6].
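
The assign-to-nearest-centroid idea described above can be sketched in a few lines of Python. This is only an illustrative batch (Lloyd-style) variant under our own assumptions, not the MATLAB code of the tool: the function names `euclidean` and `k_means`, the explicit `init_centroids` argument, and the toy data are all hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_means(data, init_centroids, max_iter=100):
    """Batch K-Means sketch (hypothetical helper): returns (centroids, labels)."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each pattern goes to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                      for p in data]
        if new_labels == labels:
            break  # assignments no longer change, so stop
        labels = new_labels
        # Update step: each centroid becomes the mean of its member patterns.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy data (made up for illustration): two well-separated groups in 2-D.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(data, init_centroids=[data[0], data[3]])
```

Passing the initial centroids explicitly keeps the sketch deterministic; a random choice of K data vectors would be the more usual starting point.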

2.2 PSO Clustering


The concept of Particle Swarm Optimization was discovered through simple social
model simulation. It is related to bird flocking, fish schooling, and swarming theory.
A “swarm” is an apparently disorganized population of moving particles that tend to
cluster together while each particle seems to be moving in a random direction.
In the context of PSO clustering, a single particle represents $N_k$ cluster centroid
vectors. Each particle $x_i$ is constructed as follows:

$x_i = (a_{i1}, a_{i2}, \ldots, a_{ij}, \ldots, a_{iN_k})$    (1)

where $a_{ij}$ is the $j$-th cluster centroid vector of the $i$-th particle, in cluster $C_{ij}$.

The fitness of a particle is measured as the intra-cluster distance (the distance
among the vectors of a given cluster), which needs to be minimized. It is given by

$\frac{\sum_{j=1}^{N_k} \left[ \sum_{\forall z_p \in C_{ij}} d(z_p, a_j) \right]}{N_k}$    (2)

Here $z_p$ denotes the $p$-th data vector, $C_{ij}$ is the $j$-th cluster of the $i$-th
particle, $a_j$ denotes the centroid vector of cluster $j$,
$d(z_p, a_j) = \sqrt{\sum_{k=1}^{N_d} (z_{pk} - a_{jk})^2}$ denotes the Euclidean
distance, and $N_k$ denotes the number of cluster centroid vectors.
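
As a rough illustration of equation (2), the following Python sketch evaluates the fitness of one particle. It assumes, as in the clustering step, that each data vector belongs to the cluster of its nearest centroid; the names `euclidean` and `fitness` and the toy values are ours, not the tool's.

```python
import math


def euclidean(z, a):
    """Euclidean distance d(z, a) between a data vector and a centroid."""
    return math.sqrt(sum((zk - ak) ** 2 for zk, ak in zip(z, a)))


def fitness(particle, data):
    """Equation (2) for one particle (a list of N_k centroid vectors).

    Assumption: every data vector is a member of the cluster whose centroid
    is nearest, so the double sum reduces to one nearest-centroid distance
    per data vector.
    """
    total = sum(min(euclidean(z, a) for a in particle) for z in data)
    return total / len(particle)  # divide by N_k


# Toy example (made-up numbers): two centroids, four data vectors.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
particle = [(0.0, 1.0), (10.0, 1.0)]
value = fitness(particle, data)
```

Here every data vector lies at distance 1 from its nearest centroid, so the summed distance is 4 and dividing by $N_k = 2$ gives a fitness of 2.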


There are different versions of PSO models [5]. The proposed software uses the basic
PSO model, called the gbest model, in which every particle interacts with every
other particle to decide its optimum direction. This section presents the standard
gbest PSO clustering algorithm.
Data vectors can be clustered using standard gbest PSO as follows:
i. Randomly select $N_k$ cluster centroids to initialize each particle
ii. For $I = 1$ to $I_{max}$ do
    a) For each particle $i$ do
    b) For each data vector $z_p$
        i. calculate the Euclidean distance $d(z_p, a_{ij})$ to all cluster
           centroids $C_{ij}$
        ii. assign $z_p$ to the cluster $C_{ij}$ such that
            $d(z_p, a_{ij}) = \min_{\forall k = 1, \ldots, N_k} \{ d(z_p, a_{ik}) \}$
        iii. calculate the fitness using equation (2)
    c) Update the pbest and gbest positions
    d) Update the cluster centroids using the equations below

$vel_{id}(I) = w \cdot vel_{id}(I-1) + c_1 \cdot rand() \cdot (p_{id} - x_{id}(I-1)) + c_2 \cdot rand() \cdot (p_{gd} - x_{id}(I-1))$    (3)

$x_{id}(I) = x_{id}(I-1) + vel_{id}(I)$    (4)

where $I_{max}$ is the maximum number of iterations.
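
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: particles are initialized from randomly chosen data vectors, velocities start at zero, `rand()` is drawn afresh for every dimension, and the names `pso_cluster`, `fitness`, and `euclidean` are hypothetical. The default parameters mirror the sample settings used later in the paper (swarm size 3, w = 0.72, c1 = c2 = 1).

```python
import math
import random


def euclidean(z, a):
    return math.sqrt(sum((zk - ak) ** 2 for zk, ak in zip(z, a)))


def fitness(particle, data):
    # Equation (2), assuming each vector joins the cluster of its nearest centroid.
    return sum(min(euclidean(z, a) for a in particle) for z in data) / len(particle)


def pso_cluster(data, n_clusters, swarm_size=3, w=0.72, c1=1.0, c2=1.0,
                max_iter=50, seed=0):
    """gbest PSO clustering sketch: returns (gbest particle, gbest fitness)."""
    rng = random.Random(seed)
    # Step i: each particle holds n_clusters centroids picked from the data
    # (an assumption; the source only says "randomly select").
    swarm = [[list(p) for p in rng.sample(data, n_clusters)]
             for _ in range(swarm_size)]
    vel = [[[0.0] * len(c) for c in particle] for particle in swarm]
    pbest = [[c[:] for c in particle] for particle in swarm]
    pbest_fit = [fitness(particle, data) for particle in swarm]
    g = min(range(swarm_size), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = [c[:] for c in pbest[g]], pbest_fit[g]

    for _ in range(max_iter):                      # step ii
        for i, particle in enumerate(swarm):       # steps a), b): evaluate fitness
            f = fitness(particle, data)
            if f < pbest_fit[i]:                   # step c): update pbest
                pbest[i], pbest_fit[i] = [c[:] for c in particle], f
            if f < gbest_fit:                      # step c): update gbest
                gbest, gbest_fit = [c[:] for c in particle], f
        for i, particle in enumerate(swarm):       # step d): move the centroids
            for j, centroid in enumerate(particle):
                for d in range(len(centroid)):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - centroid[d])
                                    + c2 * r2 * (gbest[j][d] - centroid[d]))  # eq. (3)
                    centroid[d] += vel[i][j][d]                               # eq. (4)
    return gbest, gbest_fit


# Toy data (made up): two groups in 2-D.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
start, start_fit = pso_cluster(data, 2, max_iter=0)   # best of the initial swarm
best, best_fit = pso_cluster(data, 2, max_iter=30)
```

Because gbest is replaced only by a strictly better particle, the returned fitness can never be worse than that of the best initial particle generated from the same seed.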

2.3 Hybridized Clustering with K-Means and PSO


In the proposed software, hybridization is tried in two ways. The first is the
K-Means + PSO technique, in which the K-Means clustering algorithm is executed
first and its resulting centroids are used to seed the initial swarm, while the
rest of the swarm is initialized randomly. The PSO algorithm is then executed (as
in Section 2.2).
The second is the PSO + K-Means technique. Here the PSO algorithm is executed once
first, and its resulting gbest is used as one of the centroids for K-Means, while
the rest of the centroids are initialized randomly. The K-Means algorithm is then
executed.
The software lets the user explore these possibilities with various choices of
parameters and numbers of iterations.
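
The seeding step of the K-Means + PSO variant might look like the following Python sketch. It is an assumption-laden illustration: the helper name `seed_swarm` is ours, the K-Means centroids are hard-coded stand-ins rather than the output of a real run, and "initialized randomly" is interpreted here as drawing centroids from the data vectors.

```python
import random


def seed_swarm(kmeans_centroids, data, swarm_size, rng):
    """Build an initial swarm for K-Means + PSO (hypothetical helper).

    Particle 0 is the K-Means result; the remaining particles are filled
    with centroids drawn at random from the data vectors.
    """
    n_clusters = len(kmeans_centroids)
    swarm = [[list(c) for c in kmeans_centroids]]
    while len(swarm) < swarm_size:
        swarm.append([list(p) for p in rng.sample(data, n_clusters)])
    return swarm


# Made-up data and a stand-in for centroids produced by a K-Means run.
data = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
kmeans_centroids = [(1.1, 0.9), (8.1, 7.95)]
swarm = seed_swarm(kmeans_centroids, data, swarm_size=3, rng=random.Random(0))
```

The resulting swarm would then be handed to the PSO loop in place of a fully random initial swarm.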

3 Software Tutorial for PSO Based Data Clustering


The objective of this software is to let users learn how PSO can be applied to
clustering. The idea is to involve the user in setting the parameters of the PSO
clustering algorithm. Three datasets are used for this application: Iris, Wine, and
Breast Cancer. As these datasets are pre-classified, the number of clusters is
taken to be the same as the number of classes.

Table 1. Results of K-Means Clustering

Measures/datasets         Iris        Wine       Cancer
Intra cluster distance    1.94212     293.617    671.53
Inter cluster distance    10.167      1474.22    1331.33
Quantization error        0.647374    97.0723    335.765
Time (in sec)             24.3050     1.7562     6.75965

The results of clustering are reported in terms of intra-cluster and inter-cluster
similarities and the quantization error (Table 1). A confusion matrix is also
given, so that the expected clusters can be checked against the actual clusters for
accuracy. The time taken by the algorithm to cluster the data is also reported. The
K-Means results of Table 1 are shown in Fig. 1 as displayed by the software.

Fig. 1. Results of K-means clustering on three datasets
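
A confusion matrix of the kind mentioned above can be computed as in the Python sketch below. The paper does not say how the tool matches cluster indices to class labels, so the purity-style score here (each cluster credited with its most frequent class) is our assumption, as are the function names and the toy labels.

```python
def confusion_matrix(expected, actual, n_classes, n_clusters):
    """Rows are expected classes, columns are clusters produced by the algorithm."""
    matrix = [[0] * n_clusters for _ in range(n_classes)]
    for e, a in zip(expected, actual):
        matrix[e][a] += 1
    return matrix


def purity(matrix):
    """Fraction of vectors that fall in the majority class of their cluster.

    Assumption: cluster indices are arbitrary, so each cluster is credited
    with its most frequent expected class.
    """
    total = sum(sum(row) for row in matrix)
    majority = sum(max(col) for col in zip(*matrix))
    return majority / total


# Made-up labels for nine data vectors in three classes.
expected = [0, 0, 0, 1, 1, 1, 2, 2, 2]
actual = [1, 1, 1, 0, 0, 2, 2, 2, 2]
matrix = confusion_matrix(expected, actual, n_classes=3, n_clusters=3)
score = purity(matrix)
```

In this toy case one class-1 vector lands in the cluster dominated by class 2, so eight of the nine vectors are counted as correctly grouped.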

Fig. 2 displays the options given to the user to specify all the PSO parameters,
such as swarm size, inertia weight, and acceleration coefficients. The results of
clustering are shown in the same way as for K-Means clustering (Table 2). Sample
results are computed with a swarm size of 3, an inertia weight of 0.72, and c1 and
c2 both set to 1. However, the user can experiment with the software, giving any
values to see how the PSO clustering algorithm performs.

Table 2. Results of gbest PSO Clustering

Swarm size = 3, inertia weight = 0.72, c1 = 1, and c2 = 1

Measures/datasets         Iris        Wine       Cancer
Intra cluster distance    0.648096    145.849    222.833
Inter cluster distance    3.37355     749.14     432.382
Quantization error        0.216032    48.6163    111.416
Time (in sec)             19.9997     12.9741    76.1937

Fig. 2. Results of PSO based Clustering

Similarly, the sample results and screen displays from the software for the two
proposed hybridization algorithms are presented below: K-Means+PSO (Table 3,
Fig. 3) and PSO+K-Means (Table 4, Fig. 4).

Table 3. Results of K-Means+PSO Clustering Algorithm

Swarm size = 3, inertia weight = 0.72, c1 = 1, and c2 = 1

Measures/datasets         Iris        Wine       Cancer
Intra cluster distance    0.986767    148.68     334.202
Inter cluster distance    4.95916     811.311    640.836
Quantization error        0.328922    49.5601    167.101
Time (in sec)             12.6541     14.3183    43.847

Table 4. Results of PSO+K-Means Clustering Algorithm

Swarm size = 3, inertia weight = 0.72, c1 = 1, and c2 = 1

Measures/datasets         Iris        Wine       Cancer
Intra cluster distance    0.621062    142.808    220.765
Inter cluster distance    5.08348     737.112    665.667
Quantization error        0.223687    47.9361    111.882
Time (in sec)             8.75372     10.8275    38.1585

Fig. 3. Results of K-Means+PSO

Fig. 4. Results of PSO+K-Means

4 Comparative Analysis with Experimental Results


This software tutorial gives a comparative study of all four clustering algorithms:
K-Means, PSO, K-Means+PSO, and PSO+K-Means. According to the experimental results
obtained, the accuracy rate of PSO+K-Means is high; it also records the lowest
intra-cluster distance on all three datasets. Table 5 shows the results, and Fig. 5
shows the corresponding display from the software.

Table 5. Comparative results of four clustering algorithms

Results of K-Means Clustering

Measures/datasets         Iris        Wine       Cancer
Intra cluster distance    1.94212     293.617    671.53
Inter cluster distance    10.167      1474.22    1331.33
Quantization error        0.647374    97.0723    335.765
Time (in sec)             24.3050     1.7562     6.75965

Results of gbest PSO Clustering
Swarm size = 3, inertia weight = 0.72, c1 = 1, and c2 = 1

Measures/datasets         Iris        Wine       Cancer
Intra cluster distance    0.648096    145.849    222.833
Inter cluster distance    3.37355     749.14     432.382
Quantization error        0.216032    48.6163    111.416
Time (in sec)             19.9997     12.9741    76.1937

Results of K-Means + PSO Clustering Algorithm
Swarm size = 3, inertia weight = 0.72, c1 = 1, and c2 = 1

Measures/datasets         Iris        Wine       Cancer
Intra cluster distance    0.986767    148.68     334.202
Inter cluster distance    4.95916     811.311    640.836
Quantization error        0.328922    49.5601    167.101
Time (in sec)             12.6541     14.3183    43.847

Results of PSO + K-Means Clustering Algorithm
Swarm size = 3, inertia weight = 0.72, c1 = 1, and c2 = 1

Measures/datasets         Iris        Wine       Cancer
Intra cluster distance    0.621062    142.808    220.765
Inter cluster distance    5.08348     737.112    665.667
Quantization error        0.223687    47.9361    111.882
Time (in sec)             8.75372     10.8275    38.1585

Fig. 5. Fitness Curves

Fig. 5 shows the intra- and inter-cluster distances, the quantization error, and the
time, marked in blue, red, green, and black respectively, for all four algorithms.

5 Conclusion and Future Scope


PSO is a stochastic algorithm based on the sociometry of bird flocking behavior.
Each particle in PSO interacts with the others in finding the optimal destination,
using its own cognitive decision component and a social decision component. The
simple mathematical equations that update a particle's next velocity and position
have made this algorithm very popular among researchers in various fields. This
paper presented a learning software tool for applying PSO to a specific application
in data mining, namely data clustering. Through this software, learners can gain
first-hand knowledge of PSO basics and can go on to investigate the fundamentals of
clustering algorithms. The entire software has been developed in MATLAB. The GUIs
generated with MATLAB are convenient for users to use and experiment with. Users
are also given various options to choose suitable parameters and check their effect
on the clustering results. The fitness graph generated while comparing all four
clustering algorithms discussed earlier can provide better insight into their
performance. The confusion matrices generated indicate the accuracy of each
algorithm on the investigated dataset. To the authors' knowledge, no such
comprehensive tool has been developed so far to explore PSO-based clustering using
MATLAB. It is envisioned that the presented software will offer a good learning
environment to students with an interest in this field.
As future scope, other PSO models are to be included, with facilities for setting
more parameters. Variants of PSO can also be explored for the purpose, and a
complete package can be developed for clustering applications.

References
1. Bishop, C.M.: Neural networks for pattern recognition. Oxford University Press, Oxford
(1995)
2. Yurkovich, S., Passino, K.M.: A laboratory course on fuzzy control. IEEE Trans.
Educ. 42(1), 15–21 (1999)
3. Coelho, L.S., Coelho, A.A.R.: Computational intelligence in process control: fuzzy,
evolutionary, neural, and hybrid approaches. Int. J. Knowl-Based Intell. Eng. Sys. 2(2),
80–94 (1998)
4. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm intelligence: from natural to artificial
systems. Oxford University Press, Oxford (1999)
5. Kennedy, J.F., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of the IEEE
International conference on neural networks, Perth, Australia, vol. 4, pp. 1942–1948 (1995)
6. MacQueen, J.B.: Some Methods for classification and Analysis of Multivariate
Observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and
Probability, vol. 1, pp. 281–297. University of California Press (1967)