Vous êtes sur la page 1sur 7

K-Means Clustering

The Algorithm
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids shoud be placed in a cunning way because of different location causes different result. So, the better choice is to place them as much as possible far away from each other. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early groupage is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes are done. In other words centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function

, where is a chosen distance measure between a data point and the cluster

centre , is an indicator of the distance of the n data points from their respective cluster centres. The algorithm is composed of the following steps: 1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids. 2. Assign each object to the group that has the closest centroid. 3. When all objects have been assigned, recalculate the positions of the K centroids. 4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated. Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the most optimal configuration, corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial randomly selected cluster centres. The k-means algorithm can be run multiple times to reduce this effect. K-means is a simple algorithm that has been adapted to many problem domains. As we are going to see, it is a good candidate for extension to work with fuzzy feature vectors.



Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to

separate them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k distances. This suggests the following procedure for finding the k means:

Make initial guesses for the means m1, m2, ..., mk Until there are no changes in any mean o Use the estimated means to classify the samples into clusters o For i from 1 to k Replace mi with the mean of all of the samples for cluster i o end_for end_until

Here is an example showing how the means m1 and m2 move into the centers of two clusters.

Remarks This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centers. It does have some weaknesses:

The way to initialize the means was not specified. One popular way to start is to randomly choose k of the samples. The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points. It can happen that the set of samples closest to mi is empty, so that mi cannot be updated. This is an annoyance that must be handled in an implementation, but that we shall ignore. The results depend on the metric used to measure || x - mi ||. A popular solution is to normalize each variable by its standard deviation, though this is not always desirable. The results depend on the value of k.

This last problem is particularly troublesome, since we often have no way of knowing how many clusters exist. In the example shown above, the same algorithm applied to the same data produces the following 3-means clustering. Is it better or worse than the 2-means clustering?

Unfortunately there is no general theoretical solution to find the optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs with different k classes and choose the best one according to a given criterion (for instance the Schwarz Criterion - see Moore's slides), but we need to be careful because increasing k results in smaller error function values by definition, but also an increasing risk of overfitting.

K-Means Clustering Algorithm

Definition: This nonheirarchial method initially takes the number of components of the population equal to the final required number of clusters. In this step itself the final required number of clusters is chosen such that the points are mutually farthest apart. Next, it examines each component in the population and assigns it to one of the clusters depending on the minimum distance. The centroid's position is recalculated everytime a component is added to the cluster and this continues until all the components are grouped into the final required number of clusters.

Example: Let us apply the k-Means clustering algorithm to the same example as in the previous page and obtain four clusters : Food item # Protein content, P Fat content, F Food item #1 1.1 60 Food item #2 8.2 20 Food item #3 4.2 35 Food item #4 1.5 21 Food item #5 7.6 15 Food item #6 2.0 55 Food item #7 3.9 39 Let us plot these points so that we can have better understanding of the problem. Also, we can select the three points which are farthest apart.

We see from the graph that the distance between the points 1 and 2, 1 and 3, 1 and 4, 1 and 5, 2 and 3, 2 and 4, 3 and 4 is maximum. Thus, the four clusters chosen are : Cluster number C1 C2 C3 C4 Protein content, P 1.1 8.2 4.2 1.5 Fat content, F 60 20 35 21

Also, we observe that point 1 is close to point 6. So, both can be taken as one cluster. The resulting cluster is called C16 cluster. The value of P for C16 centroid is (1.1 + 2.0)/2 = 1.55 and F for C16 centroid is (60 + 55)/2 = 57.50. Upon closer observation, the point 2 can be merged with the C5 cluster. The resulting cluster is called C25 cluster. The values of P for C25 centroid is (8.2 + 7.6)/2 = 7.9 and F for C25 centroid is (20 + 15)/2 = 17.50 The point 3 is close to point 7. They can be merged into C37 cluster. The values of P for C37 centroid is (4.2 + 3.9)/2 = 4.05 and F for C37 centroid is (35 + 39)/2 = 37. The point 4 is not close to any point. So, it is assigned to cluster number 4 i.e., C4 with the value of P for C4 centroid as 1.5 and F for C4 centroid is 21. Finally, four clusters with three centroids have been obtained. Cluster number Protein content, P Fat content, F C16 1.55 57.50 C25 7.9 17.5 C37 4.05 37 C4 1.5 21

In the above example it was quite easy to estimate the distance between the points. In cases in which it is more difficult to estimate the distance, one has to use euclidean metric to measure the distance between two points to assign a point to a cluster.

Understanding Dilation and Erosion

Morphology is a broad set of image processing operations that process images based on shapes. Morphological operations apply a structuring element to an input image, creating an output image of the same size. In a morphological operation, the value of each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbors. By choosing the size and shape of the neighborhood, you can construct a morphological operation that is sensitive to specific shapes in the input image. The most basic morphological operations are dilation and erosion. Dilation adds pixels to the boundaries of objects in an image, while erosion removes pixels on object boundaries. The number of pixels added or removed from the objects in an image depends on the size and shape of the structuring element used to process the image. In the morphological dilation and erosion operations, the state of any given pixel in the output image is determined by applying a rule to the corresponding pixel and its neighbors in the input image. The rule used to process the pixels defines the operation as a dilation or an erosion. This table lists the rules for both dilation and erosion. Rules for Dilation and Erosion Operation Rule Dilation The value of the output pixel is the maximum value of all the pixels in the input pixel's neighborhood. In a binary image, if any of the pixels is set to the value 1, the output pixel is set to 1. The value of the output pixel is the minimum value of all the pixels in the input pixel's neighborhood. In a binary image, if any of the pixels is set to 0, the output pixel is set to 0.


The following figure illustrates the dilation of a binary image. Note how the structuring element defines the neighborhood of the pixel of interest, which is circled. (See Understanding Structuring Elements for more information.) The dilation function applies the appropriate rule to the pixels in the neighborhood and assigns a value to the corresponding pixel in the output image. In the figure, the morphological dilation function sets the value of the output pixel to 1 because one of the elements in the neighborhood defined by the structuring element is on.

Morphological Dilation of a Binary Image

The following figure illustrates this processing for a grayscale image. The figure shows the processing of a particular pixel in the input image. Note how the function applies the rule to the input pixel's neighborhood and uses the highest value of all the pixels in the neighborhood as the value of the corresponding pixel in the output image. Morphological Dilation of a Grayscale Image

Processing Pixels at Image Borders (Padding Behavior) Morphological functions position the origin of the structuring element, its center element, over the pixel of interest in the input image. For pixels at the edge of an image, parts of the neighborhood defined by the structuring element can extend past the border of the image. To process border pixels, the morphological functions assign a value to these undefined pixels, as if the functions had padded the image with additional rows and columns. The value of these padding pixels varies for dilation and erosion operations. The following table describes the padding rules for dilation and erosion for both binary and grayscale images. Rules for Padding Images Operation Rule Dilation Pixels beyond the image border are assigned the minimum value afforded by the data type. For binary images, these pixels are assumed to be set to 0. For grayscale images, the minimum value for uint8 images is 0. Erosion Pixels beyond the image border are assigned the maximum value afforded by the data type. For binary images, these pixels are assumed to be set to 1. For grayscale images, the maximum value for uint8 images is 255. Note By using the minimum value for dilation operations and the maximum value for

erosion operations, the toolbox avoids border effects, where regions near the borders of the output image do not appear to be homogeneous with the rest of the image. For example, if erosion padded with a minimum value, eroding an image would result in a black border around the edge of the output image. Back to Top

Understanding Structuring Elements

An essential part of the dilation and erosion operations is the structuring element used to probe the input image. A structuring element is a matrix consisting of only 0's and 1's that can have any arbitrary shape and size. The pixels with values of 1 define the neighborhood. Two-dimensional, or flat, structuring elements are typically much smaller than the image being processed. The center pixel of the structuring element, called the origin, identifies the pixel of interest -- the pixel being processed. The pixels in the structuring element containing 1's define the neighborhood of the structuring element. These pixels are also considered in dilation or erosion processing. Three-dimensional, or nonflat, structuring elements use 0's and 1's to define the extent of the structuring element in the x- and y-planes and add height values to define the third dimension.