
Hyperspherical Possibilistic Fuzzy c-Means for High-Dimensional Data Clustering

Yang Yan, Lihui Chen
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798
yany0014@ntu.edu.sg, elhchen@ntu.edu.sg

Abstract: Possibilistic fuzzy c-means (PFCM) [1] has been proposed for clustering unlabeled data. As a hybridization of possibilistic c-means (PCM) and fuzzy c-means (FCM), PFCM has been shown to solve the noise-sensitivity issue of FCM while avoiding the coincident-clusters problem of PCM, using numerical examples on low-dimensional data sets. In this paper, we further evaluate PFCM on high-dimensional data and propose a revised version of PFCM called Hyperspherical PFCM (HPFCM). The original PFCM objective function is modified so that the cosine similarity measure can be incorporated into the approach. We apply both the original and the revised approaches to six large benchmark data sets and compare their performance with several traditional and recent clustering algorithms for automatic document categorization. Our analytical and experimental study shows that HPFCM is promising for handling complex high-dimensional data sets and achieves more stable performance. The remaining problems of the PFCM approach are also discussed.

Keywords: cosine similarity, possibilistic fuzzy clustering, high dimensional

I. INTRODUCTION

In the current information age, the number of on-line high-dimensional data repositories, such as collections of web pages, is growing ever faster, which has created the need for sophisticated yet practical categorization techniques for automatic organization. To meet this demand, researchers have made significant efforts and progress, particularly in the area of data clustering. The use of clustering is supported by the cluster hypothesis [2], which assumes that documents relevant to a given query tend to be more similar in content to each other than to irrelevant documents and hence are likely to be clustered together. Recent clustering techniques for web documents include clustering ensembles, clustering based on matrix factorization [3], information-theoretic co-clustering [4], relational data clustering [5] and semi-supervised clustering [6]. These works have merits in various aspects: co-clustering, for example, is generally effective for handling high-dimensional data, while fuzzy clustering algorithms, particularly those based on fuzzy c-means (FCM) [7] such as [8], are often used for categorization applications that require a realistic representation of overlapping clusters. Meanwhile, FCM-based algorithms are known to be vulnerable to outliers [1]. To address this problem, possibilistic clustering, pioneered by the possibilistic c-means (PCM) algorithm [9], was developed and has been shown to be more robust to outliers than FCM. However, the robustness of PCM comes at the cost of stability [10]: early PCM-based algorithms are very sensitive to initialization and sometimes lead to coincident clusters [1, 9]. It has been reported that applying PCM to high-dimensional data sets frequently suffers from this coincident-cluster problem, terminating with fewer distinct clusters than the actual number of clusters; the larger the actual number of clusters, the more likely the problem appears for high-dimensional data. In 2005, a new model called possibilistic fuzzy c-means (PFCM) [1], which hybridizes FCM and PCM, was proposed to combine the benefits and avoid the shortcomings of both models. Numerical examples on low-dimensional (2-D) data sets have been used to confirm the advantages of PFCM. In this paper, we evaluate the effectiveness of PFCM on high-dimensional data sets and discuss the trade-offs, if any.

The remainder of this paper is organized as follows. Section II presents some considerations regarding the selection of a distance function for document collections, and derives a modified possibilistic fuzzy c-means clustering algorithm that replaces the squared Euclidean norm with a dissimilarity function common to IR systems. Section III describes the experimental work and presents and analyzes the results. Finally, Section IV concludes the paper.

II. THE PROPOSED HPFCM

Since the similarity measure plays a critical role in clustering, we consider replacing the Euclidean distance in the original PFCM framework with one more suitable for high-dimensional data. In this section, we first discuss the selection of the distance function, and then introduce the proposed HPFCM for clustering high-dimensional document sets.

A. Selection of Distance Function
The choice of a suitable distance function may affect the clustering results [11].


In the Vector Space Model, documents are represented as term vectors, and those vectors tend to be high-dimensional and very sparse. The Euclidean distance, which is applied in [1] and is also commonly used in FCM-based algorithms, is not the most suitable metric for measuring the proximity between text documents: with this norm, the non-occurrence of the same terms in both documents is treated in much the same way as the co-occurrence of terms [8]. The cosine measure, denoted here as S, is simply the inner product of two M-dimensional vectors (x_k and x_l) after normalization to unit length; the higher the cosine value, the higher the similarity between the two vectors [8]:

S(x_k, x_l) = \langle x_k, x_l \rangle = \sum_{j=1}^{M} x_{kj} x_{lj}    (1)

When similarity is expressed through the Euclidean distance, a shorter distance corresponds to a higher similarity. Therefore, to remain consistent with the objective functions of the algorithms discussed so far, a simple transformation of equation (1) yields the dissimilarity measure in equation (2):

D(x_k, v_i) = 1 - S(x_k, v_i) = 1 - \sum_{j=1}^{M} x_{kj} v_{ij}    (2)
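As a small illustration (not part of the original paper), the following NumPy sketch computes the cosine similarity of equation (1) and the dissimilarity of equation (2) for unit-normalized term vectors; the toy vectors and function names are illustrative assumptions.

```python
import numpy as np

def unit_normalize(X):
    """Scale each row (document vector) to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard against empty documents
    return X / norms

def cosine_similarity(x, v):
    """Equation (1): inner product of two unit-length vectors."""
    return float(np.dot(x, v))

def cosine_dissimilarity(x, v):
    """Equation (2): D(x, v) = 1 - S(x, v)."""
    return 1.0 - cosine_similarity(x, v)

# Toy term-frequency vectors over a 6-term vocabulary (made-up numbers).
X = unit_normalize(np.array([[2., 0., 1., 0., 3., 0.],
                             [0., 1., 0., 2., 0., 1.]]))
print(cosine_similarity(X[0], X[1]))     # near 0 for almost disjoint vocabularies
print(cosine_dissimilarity(X[0], X[1]))  # correspondingly near 1
```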

B. Hyperspherical Possibilistic Fuzzy c-Means Algorithm
We now apply the dissimilarity function (2) in place of the Euclidean distance for clustering normalized document vectors in the PFCM framework, since it is widely used and regarded as a better measure for text documents in the vector space model and in IR. The modified algorithm is called Hyperspherical Possibilistic Fuzzy c-Means (HPFCM), as both the data vectors and the cluster centers lie on an M-dimensional hypersphere of unit radius. The modified objective function is similar to the original one [1], the difference being the replacement of the squared norm by the function given in (2):

J_{m,n}(U, T, V; X) = \sum_{k=1}^{N} \sum_{i=1}^{C} (a u_{ik}^{m} + b t_{ik}^{n}) D_{ik} + \sum_{i=1}^{C} \gamma_i \sum_{k=1}^{N} (1 - t_{ik})^{n}
                    = \sum_{k=1}^{N} \sum_{i=1}^{C} (a u_{ik}^{m} + b t_{ik}^{n}) \left(1 - \sum_{j=1}^{M} x_{kj} v_{ij}\right) + \sum_{i=1}^{C} \gamma_i \sum_{k=1}^{N} (1 - t_{ik})^{n}    (3)

where C denotes the number of clusters and N denotes the number of documents in the data set. The constants a and b define the relative weights of the fuzzy membership and typicality values, respectively, and m and n are the fuzzifiers controlling the degree of fuzziness of the memberships and typicality values. The constraints on the membership and typicality values are the same as those in the original FCM and PCM, so the update expressions for u_{ik} and t_{ik} keep the same form as the original ones, namely

u_{ik} = \left[ \sum_{r=1}^{C} \left( \frac{D_{ik}}{D_{rk}} \right)^{1/(m-1)} \right]^{-1} = \left[ \sum_{r=1}^{C} \left( \frac{1 - \sum_{j=1}^{M} x_{kj} v_{ij}}{1 - \sum_{j=1}^{M} x_{kj} v_{rj}} \right)^{1/(m-1)} \right]^{-1}    (4)

t_{ik} = \frac{1}{1 + \left( \frac{b}{\gamma_i} D_{ik} \right)^{1/(n-1)}} = \frac{1}{1 + \left( \frac{b}{\gamma_i} \left(1 - \sum_{j=1}^{M} x_{kj} v_{ij}\right) \right)^{1/(n-1)}}    (5)

The parameter \gamma_i in equation (5) acts as a weighting factor from the membership values to the typicality values. It is a set of constants rather than a single value; each cluster is typically assigned a distinct \gamma_i, obtained from the terminal membership matrix of an FCM initialization. The authors of [1] also suggest choosing the \gamma_i by computing

\gamma_i = K \frac{\sum_{k=1}^{N} u_{ik}^{m} D_{ik}}{\sum_{k=1}^{N} u_{ik}^{m}} = K \frac{\sum_{k=1}^{N} u_{ik}^{m} \left(1 - \sum_{j=1}^{M} x_{kj} v_{ij}\right)}{\sum_{k=1}^{N} u_{ik}^{m}}    (6)

where K is a user-defined constant that scales the magnitude of \gamma_i for different data sets in order to achieve better clustering results.
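The update rules (4)-(6) translate directly into array code. The sketch below is a minimal NumPy rendering under the assumption that documents are stored as rows of a unit-normalized matrix X and centers as rows of V; the helper names are ours, not from the paper.

```python
import numpy as np

def dissimilarity_matrix(X, V):
    """D[i, k] = 1 - sum_j x_kj * v_ij, as in equation (2)."""
    return 1.0 - V @ X.T                       # shape (C, N)

def update_memberships(D, m, eps=1e-12):
    """Equation (4): fuzzy memberships from the dissimilarities."""
    D = np.maximum(D, eps)                     # avoid division by zero
    ratios = (D[:, None, :] / D[None, :, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=1)            # shape (C, N)

def update_typicalities(D, gamma, b, n):
    """Equation (5): possibilistic typicality values."""
    return 1.0 / (1.0 + ((b / gamma[:, None]) * D) ** (1.0 / (n - 1.0)))

def estimate_gamma(U, D, m, K=1.0):
    """Equation (6): per-cluster gamma_i from an FCM-style initialization."""
    Um = U ** m
    return K * (Um * D).sum(axis=1) / Um.sum(axis=1)
```

These pieces reappear in the end-to-end sketch given at the close of this section.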

A new update expression for the cluster centers v_i needs to be derived. First, according to equations (1)-(2), the constraint forces the cluster centers to be normalized to unit length, just like the document vectors. To minimize the objective function with respect to v_i (with u_{ik} fixed, t_{ik} fixed, or both fixed), we use the method of Lagrange multipliers, adding a new term that represents the influence of the constraint on v_i. The Lagrange function of HPFCM is defined as

L(v_i, \lambda_i) = J_{m,n}(U, T, v_i; X) + \lambda_i \left( S(v_i, v_i) - 1 \right)
                 = \sum_{k=1}^{N} \sum_{i=1}^{C} (a u_{ik}^{m} + b t_{ik}^{n}) \left(1 - \sum_{j=1}^{M} x_{kj} v_{ij}\right) + \sum_{i=1}^{C} \gamma_i \sum_{k=1}^{N} (1 - t_{ik})^{n} + \lambda_i \left( \sum_{j=1}^{M} v_{ij}^{2} - 1 \right)

where \lambda_i is the Lagrange multiplier. Setting the derivative of the Lagrange function with respect to v_i to zero,

\frac{\partial L(v_i, \lambda_i)}{\partial v_i} = -\sum_{k=1}^{N} (a u_{ik}^{m} + b t_{ik}^{n}) x_k + 2 \lambda_i v_i = 0
\quad \Rightarrow \quad v_i = \frac{1}{2 \lambda_i} \sum_{k=1}^{N} (a u_{ik}^{m} + b t_{ik}^{n}) x_k

Finally, expressing 1/(2\lambda_i) in terms of u_{ik} and t_{ik} through the unit-norm constraint, we obtain the update equation for v_i:

v_i = \frac{\sum_{k=1}^{N} (a u_{ik}^{m} + b t_{ik}^{n}) x_k}{\left[ \sum_{j=1}^{M} \left( \sum_{k=1}^{N} (a u_{ik}^{m} + b t_{ik}^{n}) x_{kj} \right)^{2} \right]^{1/2}}    (7)

With the cosine-based dissimilarity measure, HPFCM is expected to outperform the original PFCM on high-dimensional data. In the next section, we conduct experiments to verify this assumption.
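To show how the pieces fit together, here is a hedged, self-contained sketch of HPFCM iterations using the update rules (4)-(7). The seeding strategy, the length of the FCM-style warm-up used to obtain the gamma values, the default parameter values and the stopping test are illustrative assumptions, not the exact protocol of the paper.

```python
import numpy as np

def hpfcm(X, C, a=1.0, b=1.0, m=1.02, n=5.0, K=1.0, max_iter=200, tol=1e-5, seed=0):
    """Sketch of HPFCM. X is an (N, M) matrix of unit-normalized document vectors.

    Returns fuzzy memberships U (C, N), typicalities T (C, N), centers V (C, M)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    V = X[rng.choice(N, C, replace=False)].copy()   # random document seeds (assumption)
    eps = 1e-12

    def dissim(V):                                  # equation (2) for all document/center pairs
        return np.maximum(1.0 - V @ X.T, eps)       # shape (C, N)

    def memberships(D):                             # equation (4)
        ratios = (D[:, None, :] / D[None, :, :]) ** (1.0 / (m - 1.0))
        return 1.0 / ratios.sum(axis=1)

    # FCM-style warm-up to obtain gamma_i, as suggested in [1] (equation (6)).
    for _ in range(20):
        U = memberships(dissim(V))
        V = (U ** m) @ X
        V /= np.linalg.norm(V, axis=1, keepdims=True)
    D = dissim(V)
    Um = U ** m
    gamma = K * (Um * D).sum(axis=1) / Um.sum(axis=1)

    prev = np.inf
    for _ in range(max_iter):
        D = dissim(V)
        U = memberships(D)
        T = 1.0 / (1.0 + ((b / gamma[:, None]) * D) ** (1.0 / (n - 1.0)))  # equation (5)
        W = a * U ** m + b * T ** n
        V = W @ X                                       # numerator of equation (7)
        V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-norm constraint of equation (7)
        J = (W * dissim(V)).sum() + (gamma[:, None] * (1.0 - T) ** n).sum()  # objective (3)
        if abs(prev - J) < tol:
            break
        prev = J
    return U, T, V
```

Applied to a row-normalized term-document matrix, the returned memberships can be hardened by taking the arg-max over clusters for each document.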

III. EXPERIMENTAL STUDY

A. Description of the Data Sets
To demonstrate the usefulness of (H)PFCM in categorizing real-world data, we ran simulations on six benchmark document data sets. The details of the data sets are shown in Table 1. From the categories of the data sets, we can see variations in their complexity due to, among other things, more overlapping categories and more unbalanced category sizes. Through these variations, we can observe how the algorithms perform under different data-set conditions.

Table 1: Dataset Descriptions

Dataset    No. of docs.  No. of words  Clusters (No. of docs.)
Binary     500           3377          Politics (250), Middle-east (250)
Classic3   2891          2176          Medical (1033), Aerospace (1398), Inform. Retrieval (1460)
Multi5     500           2889          Computer Graphics (100), Motorcycle (100), Baseball (100), Space (100), Middle-east (100)
Multi10    500           2015          Atheism (50), Hardware (50), Forsale (50), Rec.autos (50), Hockey (50), Crypt (50), Electronics (50), Medical (50), Space (50), Politics (50)
Reuters3   1076          2837          Trade (361), Crude (408), Money-fx (307)
Yahoo_K1   2340          3640          Health (494), Entertainment (1389), Sports (141), Politics (114), Technology (60), Business (142)

B. Document Representation
The documents were pre-processed using the Matrix Creation (MC) tool [12]. Words occurring in less than 0.5% or in more than 99.5% of the documents were removed for all data sets. Each document was automatically indexed for keyword-frequency extraction; stemming was performed and stop words were discarded. The document vectors were then organized into an (M x N) term-document matrix, where M is the total number of indexing terms and N is the collection size of the data set.
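The MC tool itself is not reproduced here; as an assumed, roughly equivalent preprocessing pipeline, the sketch below uses scikit-learn to index keywords, prune terms by document frequency with the 0.5% / 99.5% bounds mentioned above, and normalize each document vector to unit length (stemming is omitted, and the code uses the documents-by-terms orientation).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Hypothetical in-memory corpus; in the paper the documents come from the benchmark sets.
docs = ["graphics card rendering pipeline",
        "baseball season opening game",
        "space shuttle orbit mission"]

# Keep terms appearing in more than 0.5% and fewer than 99.5% of the documents,
# and drop stop words (stemming, used in the paper, is omitted in this sketch).
vectorizer = CountVectorizer(min_df=0.005, max_df=0.995, stop_words="english")
X = vectorizer.fit_transform(docs).astype(float)

# Normalize each document vector to unit Euclidean length.
X = normalize(X, norm="l2", axis=1)
print(X.shape)   # (number of documents, number of indexed terms)
```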

C. Performance Evaluation
The validity of a fuzzy clustering algorithm is often evaluated using internal performance measures, i.e. measures that are algorithm-dependent and do not use any external knowledge about the actual structure of the data. In this paper, since ground-truth categories are available for the benchmark data sets, we instead use three traditional and popular measures typically used in IR systems: precision, recall and purity [13].
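Definitions of precision and recall for clustering vary across papers; the sketch below shows one common convention, in which each cluster is matched to its majority class before computing purity, precision and recall against the ground-truth labels. It is an illustrative assumption, not necessarily the exact formulas of [13].

```python
import numpy as np

def confusion(labels_true, labels_pred, C, L):
    """counts[i, c] = number of documents of true class c assigned to cluster i."""
    counts = np.zeros((C, L), dtype=int)
    for i, c in zip(labels_pred, labels_true):
        counts[i, c] += 1
    return counts

def purity(counts):
    """Fraction of documents that belong to the majority class of their cluster."""
    return counts.max(axis=1).sum() / counts.sum()

def macro_precision_recall(counts):
    """Per-cluster precision/recall against the best-matching class, macro-averaged."""
    best = counts.argmax(axis=1)
    prec = counts[np.arange(len(best)), best] / np.maximum(counts.sum(axis=1), 1)
    rec = counts[np.arange(len(best)), best] / np.maximum(counts.sum(axis=0)[best], 1)
    return prec.mean(), rec.mean()

# Toy example with 2 clusters and 2 classes (made-up assignments).
true = np.array([0, 0, 1, 1, 1, 0])
pred = np.array([0, 0, 1, 1, 0, 1])
cm = confusion(true, pred, C=2, L=2)
print(purity(cm), *macro_precision_recall(cm))
```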

D. Parameter Setting
In the experiments, we compared HPFCM with the original PFCM and four other algorithms on the high-dimensional data sets. The main goal is to observe the improvement brought by the cosine similarity measure to PFCM. All six algorithms were randomly initialized. Since different initializations may lead to different outcomes, every algorithm was run for 30 trials on each data set, and the average performance is reported. The maximum number of iterations was set to 200, and the number of clusters C was set according to the number of ground-truth categories of documents given in each data set. The convergence threshold was set to 10^-5 throughout the experiments; the simulation iterates until a local minimum of the objective function is found or the maximum number of iterations is reached.

Due to the size and complexity of the high-dimensional data sets, choosing a suitable set of parameters is always a tedious process, especially for (H)PFCM. Effort was spent on finding the right combination of parameters for each data set. The parameters used in the simulations are listed below:
For HFCM: m = 1.02 for all data sets, except Multi10 where m = 1.05.
For PFCM and HPFCM: a = 1 for all data sets; the other parameters are shown in Table 2.
For ITCC: the number of word clusters was set to 200.

Table 2: The Parameters for PFCM and HPFCM

           PFCM                     HPFCM
Dataset    m      n    K    b       m      n    K    b
Binary     1.02   5    1    1       1.02   5    1    1
Classic3   1.02   5    1    1       1.05   5    1    1
Multi5     1.02   4    1    1       1.02   2    10   1
Multi10    -      -    -    -       1.05   5    1    2
Reuters3   1.02   2    1    1       1.02   2    1    1
Yahoo_K1   1.02   5    1    1       1.05   3    1    1

E. Results and Discussions
Table 3 shows the performance of HPFCM, PFCM and HFCM. In addition to the average precision, recall and purity over 30 trials, the standard deviation is given in brackets. Table 4 shows the performance of three other algorithms, S-Kmeans [14], NMF [3] and ITCC [4], which are popular clustering approaches for high-dimensional document categorization in the literature.

Several observations can be made from Table 3. Firstly, HPFCM generally shows improved or comparable results compared with PFCM. As Table 2 indicates, the suitable parameter set for HPFCM may not be the same as for PFCM, e.g. for Multi5.

Table 3: The Results of HPFCM, HFCM, PFCM (average over 30 trials; standard deviation in brackets)

           HPFCM                                 HFCM                                    PFCM
Dataset    Precision   Recall      Purity        Precision   Recall      Purity          Precision   Recall      Purity
Binary     76.3(1.4)   68.8(0.8)   68.8(0.8)     70.6(5)     68.4(7.5)   68.4(7.5)       75.2(2.0)   67.8(1.4)   67.8(1.4)
Classic3   98.7(0)     98.8(0)     99(0)         99(0)       98.8(0)     98.9(0)         94(0.6)     94.6(0.4)   94.2(0.4)
Multi5     99(0)       99(0)       99(0)         86.4(12)    85.7(11.6)  85.7(11.6)      91.4(3.6)   87.2(1.3)   87.2(1.3)
Multi10    83.4(7.0)   79.6(5.3)   79.6(5.3)     72.6(8.7)   70.5(11.6)  70.5(11.6)      -           -           -
Reuters3   67.5(6.8)   65(6.4)     68.8(3.2)     64.6(9.5)   62.4(9.4)   64.2(10.2)      65(6.4)     64.5(5.3)   68.2(4.9)
Yahoo_K1   78.4(1.2)   58.6(2.4)   83.3(0.9)     53.4(7)     39.2(3.6)   80.2(3.2)       77.6(1.3)   50.2(6.1)   81.6(1)

Table 4: The Results of S-Kmeans, NMF, ITCC

           S-Kmeans                     NMF                          ITCC
Dataset    Precision  Recall  Purity    Precision  Recall  Purity    Precision  Recall  Purity
Binary     72.50      69.0    69.       76.6       66      66        68.4       63      63
Classic3   98.1       97.9    97.8      98.1       97.9    97.8      98.1       98.1    98.1
Multi5     75.4       73.9    73.9      87.7       79.3    79.3      83.2       79.5    79.5
Multi10    32.6       32.8    32.9      73.1       66      66        56.3       55.2    55.2
Reuters3   63.0       59.2    61.7      58.6       44      44.8      58.2       52.7    56.4
Yahoo_K1   79.8       47.4    83.0      80.5       67.1    81.9      58.2       32.5    75.6

In addition, when the number of categories increases, PFCM may run into the coincident-cluster problem, whereas with a suitable set of parameters HPFCM can avoid it. This also shows the improvement brought by a more suitable similarity measure. For the remaining three data sets, both algorithms give comparable results. For some data sets, such as Binary, changes to the parameter set or to the distance function have little effect on the clustering results.

Secondly, HPFCM outperforms HFCM on all six data sets. In particular, for Multi5 and Multi10, HPFCM gives a significant improvement in precision and recall. On a data set with highly unbalanced category sizes such as Yahoo_K1, none of the algorithms perform as well as on the other data sets, but HPFCM is still clearly better than the others. Thirdly, HPFCM achieves more stable performance than the other algorithms, and the chance of getting trapped in a poor local minimum is reduced.

Table 4 reports the performance of S-Kmeans, NMF and ITCC on the same six data sets. We conclude that HPFCM achieves comparable results on Yahoo_K1 and better accuracy than these three algorithms on the other five data sets.

We now discuss the existing problems and limitations of the proposed method.

1) Impact of the Parameters
As mentioned above, (H)PFCM has a much more complicated parameter-selection process than HFCM, and it may be difficult to find the best set of parameters and achieve satisfactory results for some data sets. Based on extensive experiments, we summarize a few guidelines on how to find the applicable region of the five parameters, which helps to improve the effectiveness of the algorithm. Below we discuss the meaning of each parameter and its influence on the performance, membership and typicality values.

a) a and b
As mentioned in Section II, the constants a and b define the relative weights of the fuzzy membership and typicality values in the objective function. The constant b has a direct influence on the typicality values: if the data contain outliers, giving more importance to T by increasing b improves the prototypes [1]. If the absolute value of b is too small, PFCM tends to fall back to the FCM model, reducing the robustness to outliers and hence lowering the clustering accuracy. From equation (5), t_{ik} depends largely on the ratio b/\gamma_i, not only on \gamma_i as in PCM. However, similarly to PCM, when b/\gamma_i decreases, the term \left( \frac{b}{\gamma_i} \left(1 - \sum_{j=1}^{M} x_{kj} v_{ij}\right) \right)^{1/(n-1)} decreases and t_{ik} rises. Assigning a larger value to a is almost equivalent to assigning a smaller value to b while keeping a unchanged, so the typicality also increases. When a and b increase or decrease together while keeping the same ratio, the change of the typicality values is adjusted by \gamma_i. In addition, a relatively large b causes the coincident-cluster problem, since the algorithm then behaves more like the PCM model; the same happens when a has a very small absolute value, even though the ratio between a and b is valid.

b) m and n
The valid value of m is limited to roughly 1 to 1.2, and for certain data sets it must be less than 1.1 (e.g. Classic3); otherwise the membership values all approach 1/C. It is better to set n to a large value rather than a small one to avoid the coincident-cluster problem. If m is in the right range, increasing n does not affect the results much until n becomes drastically large, at which point the effect of the typicality is rapidly reduced and the results become similar to those of FCM. Note that when n is small, the typicality values tend to become crisp: inspection of T reveals that for the point closest to the cluster center in each cluster the typicality is almost 1.0, while the typicality values of all other points in the same cluster are very small.

2) Other Limitations
In [1], the examples show that the difference in typicality value between outliers and normal data points is obvious. However, in our experiments on high-dimensional data sets, the typicality values of any document k towards any cluster i are nearly the same in every single run, so they do not help much in classifying the documents, and the partition depends mainly on the membership values. A possible reason is that the weight (occurrence) in each dimension (word) of a document becomes very small after normalizing each vector to unit length in a high-dimensional data set, so the term 1 - \sum_{j=1}^{M} x_{kj} v_{ij} in equation (5) cannot dominantly influence the typicality value.
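To illustrate this point, the small synthetic sketch below (ours, not from the paper) draws random sparse unit-length "documents" in a high-dimensional space and computes the typicality of equation (5) towards two random unit-length centers; because most inner products are close to zero, the dissimilarities, and hence the typicalities, come out nearly identical across clusters for most documents. The dimensions, sparsity level and parameter values are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, nnz = 3000, 200, 20              # vocabulary size, documents, non-zeros per document

def random_sparse_unit(rows):
    """Random sparse non-negative vectors, normalized to unit Euclidean length."""
    X = np.zeros((rows, M))
    for r in range(rows):
        idx = rng.choice(M, nnz, replace=False)
        X[r, idx] = rng.random(nnz)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

X = random_sparse_unit(N)              # unit-normalized "document" vectors
V = random_sparse_unit(2)              # two unit-normalized cluster centers

D = 1.0 - V @ X.T                      # equation (2); most entries are close to 1
b, gamma, n = 1.0, 0.5, 5.0            # illustrative parameter values
T = 1.0 / (1.0 + ((b / gamma) * D) ** (1.0 / (n - 1.0)))   # equation (5)

# Spread of typicality values between the two clusters, per document.
print(np.abs(T[0] - T[1]).mean(), np.abs(T[0] - T[1]).max())
```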

However, this does not mean that the typicality values are useless: we actually obtain a better membership distribution than HFCM owing to the influence of the typicality values and their coefficient b. A suitable combination of a and b should let the clustering process be biased neither towards FCM nor towards PCM. Another cost of HPFCM is the processing time required: the run time of HPFCM is much longer than that of HFCM, since HFCM is used as the initialization process in HPFCM.

IV. CONCLUSION AND FUTURE WORK

We have presented a modified possibilistic fuzzy c-means algorithm that replaces the Euclidean distance with the cosine similarity measure for high-dimensional data. We have demonstrated, through experiments, the effectiveness of HPFCM for clustering large high-dimensional data sets. We have also shown, analytically and empirically, that HPFCM is more robust to noise than HFCM, although we cannot distinguish the noise directly from the terminal typicality values. However, the trade-off of applying PFCM is the run-time cost, especially for large data sets; furthermore, it is non-trivial to search for a suitable set of parameters. We believe the (H)PFCM framework has the potential to tackle real-world categorization problems once some of the existing limitations are overcome. Potential future directions include other forms of distance-based objective functions feasible for high-dimensional data, or a systematic way of selecting the parameters.

REFERENCES

[1] N. R. Pal, K. Pal, J. M. Keller, and J. C. Bezdek, "A possibilistic fuzzy c-means clustering algorithm," IEEE Transactions on Fuzzy Systems, vol. 13, pp. 517-530, 2005.
[2] C. J. van Rijsbergen, Information Retrieval, 2nd ed., 1979.
[3] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," SIGIR Forum (ACM Special Interest Group on Information Retrieval), pp. 267-273, 2003.
[4] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proc. 9th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD '03), pp. 89-98, 2003.
[5] R. J. Hathaway, J. W. Davenport, and J. C. Bezdek, "Relational duals of the c-means clustering algorithms," Pattern Recognition, vol. 22, pp. 205-212, 1989.
[6] H. Frigui and C. Hwang, "Semi-supervised clustering and aggregation of relational data," in Proc. IEEE Symposium on Computers and Communications, pp. 590-595, 2008.
[7] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press, 1981.
[8] M. E. S. Mendes and L. Sacks, "Evaluating fuzzy clustering for relevance-based information access," in Proc. IEEE International Conference on Fuzzy Systems, pp. 648-653, 2003.
[9] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Transactions on Fuzzy Systems, vol. 1, pp. 98-110, 1993.
[10] M. Barni, V. Cappellini, and A. Mecocci, "Comments on 'A possibilistic approach to clustering'," IEEE Transactions on Fuzzy Systems, vol. 4, pp. 393-396, 1996.
[11] F. Klawonn and A. Keller, "Fuzzy clustering based on modified distance measures," in Advances in Intelligent Data Analysis, pp. 291-301, 1999.
[12] W. C. Tjhi and L. Chen, "A heuristic-based fuzzy co-clustering algorithm for categorization of high-dimensional data," Fuzzy Sets and Systems, vol. 159, pp. 371-389, 2008.
[13] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.
[14] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, pp. 143-175, 2001.
