
International Journal of Computational Intelligence and Information Security, August 2011 Vol. 2, No. 8

A survey: Performance improving of K-mean by Genetic Algorithm


Amit Dubey1, Prof. Anurag Jain2 and Dr. A.K. Sachan3
1 Computer Science department, Radharaman Institute of Science & Technology, Bhopal
2 Computer Science department, Radharaman Institute of Science & Technology, Bhopal
3 Computer Science department, Radharaman Institute of Science & Technology, Bhopal
amit_23dubey@yahoo.co.in, Anurag.akjain@gmail.com, sachanak_12@yahoo.com

Abstract
This paper presents an initialization technique for K-means clustering in which centroid selection is performed by a genetic algorithm. These centroids act as starting points for K-means. The paper is a survey of K-means improved using genetic algorithms. To measure cluster compactness, a within-cluster scatter criterion is used.

Keywords: K-Means, GAIK, genetic algorithm, IGA-FKKM, Entropy Weighting

I. Introduction

Clustering is the process of grouping data into groups that have similar properties. It is widely used in many areas, including data mining, statistics, biology, and machine learning. Objects within a cluster are highly similar to one another but dissimilar to the objects in other clusters [1]. These similarities are assessed on the basis of attribute values.

1.2 Types of clustering


1. Partition based: The partitioning method initially creates partitions. An iterative relocation technique then improves the partitioning by moving objects from one group to another.
2. Hierarchical: A hierarchical method creates a hierarchical decomposition of the given set of data objects.
3. Density based: The density-based approach continues growing a given cluster as long as the density, i.e. the number of objects or data points in the neighborhood, exceeds some threshold.
4. Grid based: Grid-based methods quantize the object space into a finite number of cells that form a grid structure.
5. Model based: Model-based clustering hypothesizes a model for each of the clusters and finds the best fit of the data to the given model.

The K-means algorithm is a partition-based clustering method and one of the most popular methods in data clustering because of its good computational performance [2]. However, it is well known that its result depends on the initialization process, which is generally done by random selection, so different runs of K-means on the same input data may produce different results. To improve performance, a new initialization technique has been proposed.

Genetic algorithms (GAs) are based on the ideas of natural evolution. In general, a GA starts with an initial population, and a new population is then created based on the fitness values of the chromosomes. Fitness measures how good each member of the population is; a distance measure is the most common choice [3]. A process called crossover is then applied to the new population, swapping substrings between selected chromosomes to produce new chromosomes. After that, mutation is applied to introduce randomization. This process continues until a termination condition is met.

II. K-Means For Clustering

K-Means is one of the most common algorithms used for clustering. The algorithm assigns data points (pixels, in image clustering) to a predefined number of clusters, k. The idea is to choose random cluster centers, called centroids, one for each cluster. These centroids should preferably be as far as possible from each other, because the initial points affect the clustering process and its results. Each pixel is then compared with all cluster centers through a distance measure and assigned to the most similar cluster, that is, to the nearest cluster center. When this assignment process is over, a new centroid is calculated for each cluster from the pixels in it: the mean of the coordinates of all points in the cluster becomes the coordinates of the new center. Once these k new centroids are available, the assignment process starts over. This is repeated until the centroids no longer change. The algorithm aims at minimizing an objective function, which in this case is the squared error function given by Eq. 1 [4].

$$E = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{a=1}^{A} (x_a - m_{k,a})^2 \qquad (1)$$

In this formula K is the number of clusters, x represents a data point with attribute values x_a, C_k represents cluster k, m_k represents the mean of cluster k with components m_{k,a}, and A is the total number of attributes of a data point. The K-means algorithm is expressed as follows [4].

Step 1: Choose k random points and set them as cluster centers.
Step 2: Assign each object to the cluster of the closest centroid.
Step 3: When all objects have been assigned, recalculate the positions of the centroids.
Step 4: Go back to Step 2 unless the centroids are no longer changing.

One drawback of K-means is that it is sensitive to the initially selected points, so it does not always produce the same output. To reduce this problem, the algorithm may be run many times and the average over all runs taken, or at least the median value.
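Steps 1 to 4 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in any of the surveyed papers: the function names, the `initial_centroids` parameter and the choice to let an empty cluster keep its old centroid are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance, summed over all attributes."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Step 2: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def recompute(points, labels, centroids):
    """Step 3: mean of each cluster; an empty cluster keeps its old centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def k_means(points, k, initial_centroids=None, max_iter=100, seed=None):
    """Run Steps 1-4 and return (centroids, labels, squared_error)."""
    rng = random.Random(seed)
    # Step 1: random centres unless the caller supplies them (hypothetical option).
    if initial_centroids:
        centroids = [tuple(c) for c in initial_centroids]
    else:
        centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = recompute(points, labels, centroids)
        if updated == centroids:  # Step 4: stop when no centroid moves
            break
        centroids = updated
    labels = assign(points, centroids)
    error = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, error
```

Calling `k_means(points, 2, seed=1)` and `k_means(points, 2, seed=2)` on the same list of tuples may return different centroids, which is the sensitivity to initialization discussed above. The `initial_centroids` argument is the hook through which a better starting point can be supplied.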

III. Initializing K-Means With GA

In the literature, a genetic algorithm has been used to initialize K-means, an approach known as GA-initialized K-means (GAIK). The purpose of the GA is to optimize the performance of K-means, which depends on the initial centroid selection. The GA provides the initial cluster centroids, which act as the starting point for K-means. To use GAs in clustering, an initial population of random clusters is generated. At each generation, each individual is evaluated and recombined with others on the basis of its fitness. New individuals are created using crossover and mutation.

A. Chromosome representation
The first step of GA is representation (or encoding) of chromosomes. The encoding may be done in binary, integer or real numbers. Different research uses different encoding schemes. Fig 1 shows cluster centers as chromosomes.
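One common real-valued encoding concatenates the k cluster centers into a single flat chromosome. The sketch below assumes that layout; the helper names `encode` and `decode` are hypothetical, and the encoding shown in Fig. 1 may differ in detail.

```python
def encode(centroids):
    """Flatten k centroids of A attributes each into one chromosome of length k * A."""
    return [value for centroid in centroids for value in centroid]


def decode(chromosome, n_attributes):
    """Split a flat chromosome back into centroids of n_attributes values each."""
    if len(chromosome) % n_attributes != 0:
        raise ValueError("chromosome length must be a multiple of n_attributes")
    return [
        tuple(chromosome[i:i + n_attributes])
        for i in range(0, len(chromosome), n_attributes)
    ]
```

With this layout, a crossover point that falls on a multiple of `n_attributes` exchanges whole centroids, while any other cut point mixes the attributes of one centroid.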

Figure 1: Encoding in genetic algorithms [5]

B. Fitness evaluation
A fitness function is needed to evaluate the fitness of chromosomes. The fitness function should return some real value. Eq. 2 is used for fitness evaluation.

(2)

Cluster Ck, which makes it similar to the k-means algorithm [6].
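A fitness function in the spirit of the within-cluster scatter criterion mentioned in the abstract can be sketched as follows. The exact form of Eq. 2 is not assumed here: the function names and the `1 / (1 + scatter)` mapping are illustrative choices that turn "smaller scatter" into "larger fitness".

```python
def within_cluster_scatter(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
        )
    return total


def fitness(points, centroids):
    """Map scatter to a real value in (0, 1]; more compact clusterings score higher."""
    return 1.0 / (1.0 + within_cluster_scatter(points, centroids))
```

Because every point is charged to its nearest center, a chromosome is evaluated the same way the K-means assignment step would treat it.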

C. Selection
Selection is the process of choosing chromosomes to produce the new generation. It is done on the basis of the fitness value: the fittest chromosomes are selected for crossover. There are a variety of selection procedures, such as uniform selection, roulette wheel selection, and tournament selection.
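Two of the selection procedures named above can be sketched as follows. The sketch assumes non-negative fitness values where larger is better; the function names and the `rng` parameter are hypothetical.

```python
import random


def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)  # no information: fall back to uniform choice
    threshold = rng.random() * total   # a point on the wheel, in [0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running > threshold:
            return individual
    return population[-1]  # guard against floating-point round-off


def tournament_select(population, fitnesses, size=2, rng=random):
    """Pick the fittest of `size` distinct individuals drawn at random."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitnesses[i])]
```

Roulette wheel selection depends on the scale of the fitness values, whereas tournament selection depends only on their ranking, which makes it less sensitive to how the fitness function is normalized.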

D. Crossover
The purpose of crossover is to create two new chromosomes from two existing chromosomes selected from the current population. Typical crossover operators are one-point crossover, two-point crossover, cycle crossover and uniform crossover. Fig. 2 shows the generation of new chromosomes through the crossover process.
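One-point and uniform crossover on list-based chromosomes can be sketched as below. The function names, the `swap_probability` parameter and the `rng` parameter are assumptions for illustration.

```python
import random


def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    cut = rng.randint(1, len(parent_a) - 1)  # cut lies strictly inside the chromosome
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def uniform_crossover(parent_a, parent_b, swap_probability=0.5, rng=random):
    """Swap each gene between the two children independently."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have equal length")
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < swap_probability:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b
```

Both operators only rearrange genes that already exist in the parents, so new gene values can enter the population only through mutation.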


Figure 2: Generation of new individuals through crossover

E. Mutation
Mutation introduces randomization and extends the search space. It is applied at a predefined rate called the mutation probability.
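For a binary chromosome, mutation with a given mutation probability can be sketched as a bit flip. The Gaussian variant for real-valued chromosomes is an added assumption, included only because centroid coordinates are often encoded as real numbers; both function names are hypothetical.

```python
import random


def bit_flip_mutation(chromosome, mutation_probability=0.01, rng=random):
    """Return a copy in which each bit is flipped with the given probability."""
    return [
        1 - bit if rng.random() < mutation_probability else bit
        for bit in chromosome
    ]


def gaussian_mutation(chromosome, mutation_probability=0.01, sigma=0.1, rng=random):
    """Return a copy in which each real-valued gene is perturbed with the given probability."""
    return [
        gene + rng.gauss(0.0, sigma) if rng.random() < mutation_probability else gene
        for gene in chromosome
    ]
```

A small mutation probability keeps most of what selection and crossover have found, while a large one makes the search behave more like random sampling.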

Figure 3: Mutation process

In mutation, a particular bit is changed randomly with the mutation probability. Fig. 3 shows two new children produced by mutation; the mutated bit is highlighted. This process is repeated until the specified number of generations has been completed. Finally, a chromosome containing cluster centroids is obtained, and it becomes the starting point for K-means. The stopping criteria determine when to stop the algorithm: when the condition is met, the algorithm terminates.
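Putting the pieces together, the overall GAIK flow (evolve candidate centroid sets for a fixed number of generations, then hand the best one to K-means) can be sketched as follows. This is an illustrative reading of the description above, not the procedure of any particular paper. The population size, tournament selection, crossover at centroid boundaries and the mutation that swaps in a random data point are all assumptions, and fitness is recomputed on every comparison for brevity rather than speed.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def scatter(points, centroids):
    """Within-cluster scatter: squared distance of each point to its nearest centre."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def ga_initial_centroids(points, k, pop_size=20, generations=30,
                         mutation_probability=0.1, seed=0):
    """Evolve sets of k centroids and return the most compact set found."""
    rng = random.Random(seed)
    # Each individual is a list of k centroids drawn from the data.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    best = min(population, key=lambda ind: scatter(points, ind))
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # Tournament selection of two parents (lower scatter wins).
            mother = min(rng.sample(population, 2), key=lambda ind: scatter(points, ind))
            father = min(rng.sample(population, 2), key=lambda ind: scatter(points, ind))
            # One-point crossover at a centroid boundary.
            cut = rng.randint(1, k - 1) if k > 1 else 0
            child = mother[:cut] + father[cut:]
            # Mutation: replace a centroid with a random data point.
            child = [rng.choice(points) if rng.random() < mutation_probability else c
                     for c in child]
            offspring.append(child)
        population = offspring
        candidate = min(population, key=lambda ind: scatter(points, ind))
        if scatter(points, candidate) < scatter(points, best):
            best = candidate
    return best


def k_means_from(points, centroids, max_iter=100):
    """Standard K-means iterations starting from the supplied centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(v) / len(members) for v in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def gaik(points, k, **ga_options):
    """GA picks the starting centroids; K-means refines them."""
    start = ga_initial_centroids(points, k, **ga_options)
    return k_means_from(points, start)
```

Here the stopping criterion is simply the generation count; a threshold on the improvement of the best scatter value would be an equally reasonable choice.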

IV. K Means Clustering & Genetic Algorithm Related Work

4.1 Gene Expression Analysis Using Clustering [7]


Data mining has become an important topic in the effective analysis of gene expression data because of its wide application in the biomedical industry. In this paper, the k-means clustering algorithm is studied extensively for gene expression analysis. Since the authors' purpose is to demonstrate the effectiveness of k-means on a wide variety of data sets, two pattern recognition data sets and thirteen microarray data sets with both overlapping and non-overlapping class boundaries were studied, where the number of features/genes ranges from 4 to 7129 and the number of samples ranges from 32 to 683. The number of clusters ranges from two to eleven. For pattern recognition they use the IRIS and WBCD data, and for microarray data they use serum, yeast, leukemia, breast, lymphoma, lung cancer, and St. Jude leukemia data. To identify common subtypes in independent disease data, four different types of breast data and four Diffuse Large B-cell Lymphoma (DLBCL) data sets were used. Clustering error rate (or, equivalently, clustering accuracy) is used as the evaluation metric for k-means. Clustering is an efficient way of analyzing information from microarray data, and K-means is a basic method for it that can be applied to microarray data very easily. The performance of K-means varies with the nature and complexity of the data: the authors achieve the highest accuracy on the IRIS data and the lowest on DLBCL D. K-means has some serious drawbacks, and many papers have been presented in the past to improve it.

4.2 HIG Algorithm with Variable Length Chromosome for Image Clustering
Clustering is the process of putting similar data into groups. The authors of [8] present data clustering using an improved genetic algorithm (IGA) in which efficient methods of crossover and mutation are implemented. The IGA is further hybridized with the popular Nelder-Mead simplex search and K-means to exploit the potential of both in the hybridized algorithm. The performance of the hybrid approach is evaluated on a few data clustering problems. A variable-length IGA is also proposed, which finds the optimal clusters of benchmark image datasets, and its performance is compared with K-means and GCUK [9]. The reported results for IGA and its hybridization with other algorithms are very encouraging.

The authors explore the capability of improved GA-based clustering on some well-known data sets. Although K-means clustering is a well-established approach, it suffers from sensitivity to initialization and from falling into local minima. GA, being a randomized approach, can alleviate these problems. In the improved version of GA (IGA), a new approach to crossover and offspring formation is adopted. When applied to the data clustering problem, IGA performs better than K-means on all data sets studied in the paper. To further improve the performance of IGA on data clustering, it is hybridized with K-means, resulting in KM-IGA, and to boost KM-IGA further it is hybridized with Nelder-Mead, resulting in KM-NM-IGA. In the hybrid algorithm (KM-NM-IGA), the outcome of K-means becomes one of the chromosomes in the initial population of NM-IGA. The results show that the hybrid algorithm gives better results than K-means, IGA and Nelder-Mead. Since the clustering results achieved by IGA are satisfactory, the authors apply it to the image clustering problem by proposing a new variable-length IGA (VLIGA) for automatic evolution of clusters.

Experiments were carried out with three standard natural grey-scale images to evaluate VLIGA. The results show that VLIGA is effective compared with GCUK and traditional K-means. Planned enhancements include the study of higher-dimensional and larger data sets and of datasets with mixed data. The authors also plan to study the suitability of the hybrid algorithm (KM-NM-IGA) for image clustering and to extend it to color images.

4.3 Fuzzy Kernel K-Means Clustering Method Based on Immune Genetic Algorithm
A fuzzy kernel k-means clustering method based on an immune genetic algorithm (IGA-FKKM [10]) is proposed to overcome the fuzzy k-means algorithm's dependence on the shape of the sample space and its tendency toward local optima. By mapping samples from the low-dimensional space into a high-dimensional feature space with a Mercer kernel, the method removes the influence of the shape of the sample space on clustering accuracy. The immune genetic algorithm increases the probability of reaching the global optimum; it is also used to suppress the fluctuation that occurs late in the evolution and to avoid local optima. Compared with the fuzzy k-means clustering method (FKM) and the fuzzy k-means clustering method based on a genetic algorithm (GA-FKM), the experimental results show that IGA-FKKM obtains the global optimum and achieves higher clustering accuracy. Further study will focus on the sensitivity of the clustering algorithm to initial values.
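The reason a Mercer kernel makes the high-dimensional mapping practical is that distances in feature space can be computed from kernel values alone. The derivation below is a standard identity, not taken from [10]. The symbols \phi (the implicit mapping), v (a cluster prototype kept in the input space) and \sigma (the Gaussian kernel width) are notation introduced here, and whether IGA-FKKM keeps its prototypes in the input space or in the feature space is not assumed.

```latex
% Mercer kernel: K(x, y) = <phi(x), phi(y)>
\begin{aligned}
\lVert \phi(x) - \phi(v) \rVert^2
  &= \langle \phi(x), \phi(x) \rangle
     - 2\,\langle \phi(x), \phi(v) \rangle
     + \langle \phi(v), \phi(v) \rangle \\
  &= K(x, x) - 2\,K(x, v) + K(v, v).
\end{aligned}

% Gaussian kernel: K(x, v) = exp(-||x - v||^2 / (2 sigma^2)), so K(x, x) = K(v, v) = 1
\lVert \phi(x) - \phi(v) \rVert^2 = 2\,\bigl(1 - K(x, v)\bigr).
```

Substituting this distance for the squared Euclidean distance in the k-means objective yields a kernelized objective without ever computing \phi explicitly.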

4.4 Web usage Data Clustering from K-means with Genetic Algorithm
Web usage mining applies data mining techniques to discover usage patterns from web data, in order to understand and better serve the needs of web-based applications. Clustering is one of its important functions. Recent attempts have adapted the K-means clustering algorithm, as well as genetic algorithms based on rough sets, to find interval sets of clusters. An important point is that, so far, researchers have not worked on improving cluster quality once the data has been clustered. The authors of [11] propose a new framework that improves the quality of web session clusters obtained from k-means by using a genetic algorithm (GA). The framework clusters user sessions in two steps. In the first step, a modified k-means algorithm is used: the initial cluster centers are selected by a statistical mode-based calculation, and this refined starting condition allows the iterative algorithm to converge to a better local minimum. In the second step, a GA-based refinement algorithm improves the cluster quality. The proposed algorithm is tested with web access logs collected from the Internet Traffic Archive (ITA), and the results show that refined initial starting points and post-processing refinement of clusters do lead to improved solutions.
The method is scalable and can be coupled with a scalable clustering algorithm to address large-scale clustering problems in web data mining.

4.5 IHKMCA for High dimensional Dataset & Its Performance Analysis.
In practice, the data objects around us are growing rapidly, which demands more features and attributes in data sets and in turn increases their dimensionality. As the dimension increases, the problem known as the curse of dimensionality comes into the picture. For this reason, an improved and efficient dimension reduction technique is crucial for mining a high-dimensional data set. Numerous methods have been proposed and many experimental analyses have been carried out to find an efficient technique that reduces the dimension of a high-dimensional data set without affecting the original data. The authors of [12] propose the use of Canonical Variate Analysis (CVA), which reduces the dimensions of a high-dimensional data set efficiently and effectively. A modified k-means clustering is then applied to the reduced, low-dimensional data set. A genetic algorithm is used to initialize the centroids of the Improved Hybridized K-Means Clustering Algorithm (IHKMCA) so as to obtain a more accurate result. The authors report that the proposed work has better accuracy, is more efficient, and has lower time complexity than other approaches, and that IHKMCA performs better than the earlier Hybridized K-Means Clustering algorithm using PCA. The problem of determining the number of clusters beforehand remains open, and improvements in dimension reduction and outlier elimination are still to be explored.

4.6 K-means clustering and Genetic Algorithm for Nonlinear Optimization [13]
To reduce the computational cost and improve estimation accuracy for nonlinear optimization, a new algorithm, K-means clustering with Chaos Genetic Algorithm (KCGA), is proposed, in which the initial population is generated by chaos mapping and refined by competition. In addition to the evolution of the genetic algorithm, the K-means clustering algorithm is applied to achieve faster convergence and a quicker evolution of the population. The main purpose of the paper is to demonstrate how the genetic algorithm optimizer can be improved by incorporating a hybridization strategy. Experimental studies showed that the hybrid KCGA approach produces much more accurate estimates of the true optimum points than the other two optimization procedures, the chaos genetic algorithm and the genetic algorithm. The hybrid KCGA approach also exhibits superior convergence characteristics compared with the other algorithms considered in that paper. On the whole, the new approach is demonstrated to be effective and efficient at locating optimal solutions and is verified by an empirical example from construction. The proposed procedure joins K-means and chaos attributes on the basis of a genetic algorithm. It aims not only to enhance the diversity of the GA for more accuracy but also to extract clustering rules that capture a potential trend of evolution. It can also mitigate some drawbacks of the traditional GA, such as long running time and getting trapped in local optima, and it can contribute to construction management in practice.
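Generating an initial population by chaos mapping can be illustrated with the logistic map, a common choice for this purpose. Which chaotic map [13] actually uses is not stated above, so the logistic map, the seed value `x0 = 0.3` and both function names are assumptions made for this sketch.

```python
def logistic_map_sequence(length, x0=0.3, r=4.0):
    """Iterate x <- r * x * (1 - x); for r = 4 and x0 in (0, 1) values stay in [0, 1]."""
    values, x = [], x0
    for _ in range(length):
        x = r * x * (1.0 - x)
        values.append(x)
    return values


def chaotic_population(pop_size, bounds, x0=0.3):
    """Scale one chaotic sequence into pop_size individuals within per-gene bounds."""
    seq = iter(logistic_map_sequence(pop_size * len(bounds), x0))
    return [
        [low + next(seq) * (high - low) for low, high in bounds]
        for _ in range(pop_size)
    ]
```

Unlike a pseudo-random generator, the sequence is fully determined by `x0`, so values such as 0.25, 0.5 or 0.75, which lead the map to a fixed point, should be avoided as seeds.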

4.7 Initializing K-Means using Genetic Algorithms


In this paper, the authors [14] propose two algorithms to solve the initialization problem in K-means: Genetic Algorithm Initializes K-means (GAIK) and K-means Initializes Genetic Algorithm (KIGA). To show the effectiveness and efficiency of these algorithms, a comparative study was carried out among GAIK, KIGA, the Genetic-based Clustering Algorithm (GCA), and FCM. Their experimental evaluation scheme provides a common base for performance assessment and comparison with other methods. From the experiments on eight data sets, they find that the pre-initialized algorithms work well and yield meaningful and useful results: they find good clustering configurations that contain interdependence information within clusters and discriminative information for clustering. The approach is also more meaningful for selecting, from each cluster, significant centers that have high multiple interdependence with the other points in the cluster. Finally, comparing the experimental results of K-means, GKA, GAIK and KIGA, they find that KIGA is better than the others. The results on all data sets show that KIGA achieves high clustering accuracy compared with the other algorithms.

4.8 New Model for Improving K-Means Clustering Based On Genetic Algorithm [15]
Clustering data into appropriate classes and categories is one of the important topics in pattern recognition. A good clustering minimizes the number of misclassified data items; in other words, the data assigned to each class should be as similar to one another as possible. The article first describes a fundamental data clustering method, K-means clustering, and then introduces the authors' proposed model, named GA-Clustering, which uses a genetic algorithm to improve the K-means method. Finally, the model is examined on some well-known data sets. The results show that the method clusters data significantly better than the traditional K-means clustering algorithm. The paper surveys K-means, one of the popular clustering techniques, and applies the genetic algorithm, an optimization method, to improve the unsupervised classification procedure. Genetic algorithms are population-based methods that apply operators to the chromosomes of a population. In this research, the authors define a chromosome string representation and combine K-means and GA. Simulations over different runs show that K-means clustering based on the genetic algorithm improves the clustering measure considerably and is more efficient than pure K-means.

V. Conclusion and Future Research
K-means is considered one of the major algorithms widely used in clustering. However, it still has some problems. One of them lies in its initialization step, which is normally done randomly. Another is that K-means converges to local minima. Genetic algorithms are evolutionary algorithms inspired by nature and have been applied in the field of clustering. The initialization step is very important for any clustering algorithm. The survey shows that the partition-based random initialization method performs well and yields more compact clusters than normal random selection. The Entropy Weighting Genetic k-Means Algorithm benefits from both the genetic algorithm (GA) and k-means: since the GA searches the space more thoroughly than k-means, the genetic k-means algorithm is less likely to be trapped in a local optimum. In that work, the author uses weight entropy, minimizing the within-cluster dispersion and maximizing the negative weight entropy during clustering, so that more dimensions contribute to the identification of each cluster. The problem of identifying clusters by a few sparse dimensions can thus be avoided.


References
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[2] J. Pena, J. Lozano, and P. Larranaga, An empirical comparison of four initialization methods for the k-means algorithm, Pattern Recognition Letters, vol. 20(10), pp. 1027-1032, 1999.
[3] Kailash Chander, Dr. Dinesh Kumar, Vijay Kumar, Enhancing Cluster Compactness using Genetic Algorithm Initialized K-means, International Journal of Software Engineering Research & Practices, Vol. 1, Issue 1, Jan 2011.
[4] A. K. Jain, M. N. Murty, and P. J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3), pp. 264-323, 1999.
[5] Qin Ding and Jim Gasvoda, A genetic algorithm for clustering on image data, International Journal of Computational Intelligence, vol. 1, 2005.
[6] M. Painho and F. Bao, Using genetic algorithms in clustering problems, Proceedings of GeoComputation Conference, 2000.
[7] Kumar Dhiraj and Santanu Kumar Rath, Gene Expression Analysis Using Clustering, International Journal of Computer and Electrical Engineering, Vol. 1, No. 2, June 2009.
[8] Venkatesh Katari, Suresh Chandra Satapathy, Hybridized Improved Genetic Algorithm with Variable Length Chromosome for Image Clustering, IJCSNS International Journal of Computer Science and Network Security, Vol. 7, No. 11, November 2007.
[9] Bandyopadhyay, S., and Maulik, U., Genetic clustering for automatic evolution of clusters and application to image classification, IEEE pattern recognition, Vol. 35, pp. 1197-1208, 2002.
[10] Chengjie Gu, Shunyi Zhang, Kai Liu, He Huang, Fuzzy Kernel K-Means Clustering Method Based on Immune Genetic Algorithm, journal of computation systems, 2011.
[11] N. Sujatha, K. Iyakutty, Refinement of Web usage Data Clustering from K-means with Genetic Algorithm, European Journal of Scientific Research, ISSN 1450-216X, Vol. 42, No. 3, 2010.
[12] H.S. Behera, Rosly Boy Lingdoh and Diptendra Kodamasingh, An Improved Hybridized Kmeans Clustering Algorithm (IHKMCA) for Highdimensional Dataset & Its Performance Analysis, International Journal on Computer Science and Engineering, Vol. 3, No. 3, Mar 2011.
[13] Cheng Min-Yuan and Huang Kuo-Yu, K-means clustering and Chaos Genetic Algorithm for Nonlinear Optimization, 26th International Symposium on Automation and Robotics in Construction, 2009.
[14] Bashar Al-Shboul and Sung-Hyon Myaeng, Initializing K-Means using Genetic Algorithms, World Academy of Science, Engineering and Technology 54, 2009.
[15] Rouhollah Maghsoudi, Arash Ghorbannia Delavar, Somayye Hoseyny, Rahmatollah, Representing the New Model for Improving K-Means Clustering Algorithm based on Genetic Algorithm, The Journal of Mathematics and Computer Science, Vol. 2, No. 2, 2011.
