
Algorithms for Clustering High Dimensional and Distributed Data

Tao Li, Shenghuo Zhu, Mitsunori Ogihara
Computer Science Department, University of Rochester, Rochester, NY 14627-0226, USA
{taoli, zsh, ogihara}@cs.rochester.edu

Abstract

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. The clustering problem has been widely studied in machine learning, databases, and statistics. This paper studies the problem of clustering high dimensional data. The paper proposes an algorithm called the CoFD algorithm, which is a non-distance based clustering algorithm for high dimensional spaces. Based on the Maximum Likelihood Principle, CoFD attempts to optimize its parameter settings to maximize the likelihood of the observed data points under the model generated by the parameters. Distributed versions of the algorithm, called the D-CoFD algorithms, are also proposed. Experimental results on both synthetic and real data sets show the efficiency and effectiveness of the CoFD and D-CoFD algorithms. Keywords: CoFD, clustering, high dimensional, maximum likelihood, distributed.

The contact author is Tao Li. His contacting information: Email: taoli@cs.rochester.edu, telephone: +1-585-275-8479, fax: +1-585-273-4556.

1 Introduction

The problem of clustering data arises in many disciplines and has a wide range of applications. Intuitively, the clustering problem can be described as follows: let D be a set of data points in a multi-dimensional space; find a partition of D into classes such that the points within each class are similar to each other. To measure similarity between points, a distance function is used, and a wide variety of functions has been used to define it. The clustering problem has been studied extensively in machine learning [17, 67], databases [32, 74], and statistics [13] from various perspectives and with various approaches and focuses. Most clustering algorithms do not work efficiently in high dimensional spaces due to the curse of dimensionality. It has been shown that in a high dimensional space, the distance between every pair of points is almost the same for a wide variety of data distributions and distance functions [14]. Many feature selection techniques have been applied to reduce the dimensionality of the space. However, as demonstrated in [1], the correlations among the dimensions are often specific to data locality; in other words, some data points are correlated with a given set of features and others are correlated with respect to different features. As pointed out in [40], all methods that overcome the dimensionality problems use a metric for measuring neighborhoods, which is often implicit and/or adaptive. In this paper, we present a non-distance-based algorithm for clustering in high dimensional spaces, called CoFD¹. The main idea of CoFD is as follows: suppose that a data set D with feature set F needs to be clustered into k classes,

with the possibility of recognizing some data points as outliers. The clustering of the data is then represented by two functions, the data map D : D -> {0, 1, ..., k} and the feature map F : F -> {0, 1, ..., k}, where the values 1, ..., k correspond to the clusters and 0 corresponds to the set of outliers. The accuracy of such a representation is measured using the log likelihood. Then, by the Maximum Likelihood Principle, the best clustering is the representation that maximizes the likelihood. In CoFD, several approximation methods are used to optimize D and F iteratively. The CoFD algorithm can also be easily adapted to estimate the number of classes when the value of k is not given as part of the input. An added bonus of CoFD is that it produces interpretable descriptions of the resulting classes, since it produces an explicit feature map. Furthermore, since the data and feature maps provide natural summary and representative information about the datasets and cluster structures, CoFD can be naturally extended to cluster distributed datasets. We present its distributed version, D-CoFD. The rest of the paper is organized as follows: Section 2 presents the CoFD algorithm. Section 3 describes the D-CoFD algorithms. Section 4 shows our experimental results on both synthetic and real data sets. Section 5 surveys the related work. Finally, our conclusions are presented in Section 6.

¹ CoFD stands for Co-learning between Feature maps and Data maps.

2 The CoFD Algorithm

This section describes CoFD and the core idea behind it. We first present the CoFD algorithm for binary data sets. Then we show how to extend it to continuous or non-binary categorical data sets in Section 2.6.

2.1 The Model of CoFD


Suppose we wish to divide the data set D into k classes, with the possibility of declaring some data points as outliers. Such a clustering can be represented by a pair of functions (F, D), where F : F -> {0, 1, ..., k} is the feature map and D : D -> {0, 1, ..., k} is the data map.

Given a representation (F, D), we wish to be able to evaluate how good the representation is. To accomplish this we use the concept of positive features. Intuitively, a positive feature is one that best describes the class it is associated with. Suppose that we are dealing with a data set of animals in a zoo, where the vast majority of the animals are monkeys and the vast majority of the animals are four-legged. Then, given an unidentified animal having four legs in the zoo, it is quite natural to guess that the animal is a monkey, because the conditional probability of that event is high. Therefore, we regard the feature "having four legs" as a positive (characteristic) feature of the class. In most practical cases, characteristic features of a class do not overlap with those of another class. Even if some overlaps exist, we can add combinations of those features to the feature space.

Let n be the total number of data points and let m be the number of features. Since we are assuming that the features are binary, D can be represented as an n x m {0, 1}-matrix W, which we call the data-feature matrix. We say that the i-th data point possesses the j-th feature if and only if W_{ij} = 1, and that the j-th feature is active in the i-th point if and only if W_{ij} = 1.

The key idea behind the CoFD algorithm is the use of the Maximum Likelihood Principle, which states that the best model is the one that has the highest likelihood of generating the observed data. We apply this principle by regarding the data-feature matrix W as the observed data and the representation (F, D) as the model.

Let the data map D and the feature map F be given. Consider the n x m matrix Y defined as follows: for each i, 1 <= i <= n, and each j, 1 <= j <= m, the entry Y_{ij} is 1 if D(i) = F(j) != 0 and 0 otherwise. Y is the model represented by D and F, and Y_{ij} can be interpreted as the consistency of D(i) and F(j). For all i and j we consider the probability of the entry W_{ij} being observed in the real data conditioned upon the corresponding entry Y_{ij} in the model, and we assume that this conditional probability depends only on the values of W_{ij} and Y_{ij}. Let p_{uv} denote the probability of an entry being observed as u while the entry in the model is equal to v, and let q_{uv} denote the proportion of pairs (i, j) such that W_{ij} = u and Y_{ij} = v. Then the log likelihood of the model can be expressed as

    log L(D, F) = nm \sum_{u,v \in \{0,1\}} q_{uv} \log p_{uv}.

For fixed D and F, the likelihood is maximized when each p_{uv} is set to its empirical estimate, and the maximum of (1/nm) log L(D, F) is then the negative conditional entropy -H(W | Y) of the observed entries given the model entries; maximizing the likelihood is therefore equivalent to minimizing H(W | Y).

We apply a hill-climbing method to maximize log L(D, F), i.e., we alternately optimize one of D and F while fixing the other. First, we try to optimize F while fixing D. The problem of optimizing F over all data-feature pairs can be approximately decomposed into subproblems of optimizing F(j) for each feature j, i.e., minimizing the conditional entropy restricted to the data-feature pairs of feature j. If D is given, these entropies can be directly estimated; therefore feature j is assigned to the class for which its conditional entropy is the smallest. Optimizing D while fixing F is the dual problem. A straightforward approximation method is to assign each data point to the class whose positive features it possesses most often and, dually, to assign each feature to the class whose data points possess it most often:

    D(i) = \arg\max_c |{ j : F(j) = c, W_{ij} = 1 }|,    (1)
    F(j) = \arg\max_c |{ i : D(i) = c, W_{ij} = 1 }|.    (2)

A direct improvement of this approximation is possible by using the idea of the entropy-constrained vector quantizer [21], which adjusts the counts in (1) and (2) by terms that depend on m_c and n_c, the number of features and the number of points currently in class c, so that the class sizes are taken into account.
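As a concrete illustration of this objective, the following minimal Python sketch (the function names and the numerical tolerance are our own, not part of the paper) builds the model matrix Y from a candidate pair (D, F) and evaluates the log likelihood, which for the empirical p_{uv} equals the negative conditional entropy of the observed matrix given the model:

```python
import numpy as np

def model_matrix(data_map, feature_map):
    """Model Y implied by a data map and a feature map (class 0 = outliers):
    Y[i, j] = 1 iff point i and feature j are assigned to the same non-outlier class."""
    D = np.asarray(data_map)        # shape (n,), values in 0..k
    F = np.asarray(feature_map)     # shape (m,), values in 0..k
    return ((D[:, None] == F[None, :]) & (D[:, None] != 0)).astype(int)

def log_likelihood(W, Y, eps=1e-12):
    """log L(D, F) = n*m * sum_{u,v} q_uv * log p_uv with p_uv at its empirical
    estimate, i.e. -n*m * H(W | Y)."""
    W, Y = np.asarray(W), np.asarray(Y)
    ll = 0.0
    for v in (0, 1):
        mask = (Y == v)
        nv = mask.sum()
        if nv == 0:
            continue
        for u in (0, 1):
            nuv = np.logical_and(W == u, mask).sum()
            q_uv = nuv / W.size
            p_uv = nuv / nv          # empirical P(observed = u | model = v)
            ll += W.size * q_uv * np.log(p_uv + eps)
    return ll
```

The hill-climbing iteration described above simply searches for maps D and F that make this quantity as large as possible.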

2.2 Algorithm Description


There are two auxiliary procedures in the algorithm: one estimates the feature map from the data map, and the other estimates the data map from the feature map. Chi-square tests are used for deciding whether a feature is an outlier or not. The main procedure attempts to find the best clustering by an iterative process similar to the EM algorithm: the algorithm repeatedly estimates the data and feature maps based on the estimates made in the previous round, until no more changes occur in the feature map. The pseudo-code of the algorithm is shown in Figure 1. It can be observed from the pseudo-code description that the time complexities of both auxiliary procedures scale with the size of the data-feature matrix, and the number of iterations of the main procedure is not related to n or m.
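The following sketch makes the iteration concrete (Python; the helper names estimate_feature_map and estimate_data_map are ours, and the majority-count rule with ties treated as outliers is a simplification of the conditional-entropy and chi-square criteria described in the text):

```python
import numpy as np

def estimate_feature_map(W, data_map, k):
    """Assign each feature to the class whose member points possess it most
    often; ambiguous features (ties) go to class 0 (outlier)."""
    F = np.zeros(W.shape[1], dtype=int)
    for j in range(W.shape[1]):
        counts = np.array([W[data_map == c, j].sum() for c in range(1, k + 1)])
        best = counts.argmax()
        F[j] = best + 1 if (counts == counts[best]).sum() == 1 else 0
    return F

def estimate_data_map(W, feature_map, k):
    """Assign each point to the class whose positive features it possesses most."""
    D = np.zeros(W.shape[0], dtype=int)
    for i in range(W.shape[0]):
        counts = np.array([W[i, feature_map == c].sum() for c in range(1, k + 1)])
        best = counts.argmax()
        D[i] = best + 1 if (counts == counts[best]).sum() == 1 else 0
    return D

def cofd(W, seeds, k, max_iter=100):
    """Alternate the two estimation steps until the feature map stops changing."""
    D = np.zeros(W.shape[0], dtype=int)
    for c, i in enumerate(seeds, start=1):   # seed point for each class
        D[i] = c
    F = estimate_feature_map(W, D, k)
    for _ in range(max_iter):
        D = estimate_data_map(W, F, k)
        F_new = estimate_feature_map(W, D, k)
        if np.array_equal(F_new, F):
            break
        F = F_new
    return D, F
```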

  

2.3 An Example
To illustrate how CoFD works, an example is given in this section. Suppose we have the data set presented in Table 1, which is to be clustered into two classes. Initially, the data points 2 and 5 are chosen as seed points, with data point 2 in class 1 and data point 5 in class 2, i.e., D(2) = 1 and D(5) = 2. Estimating the feature map from this data map returns that the features a, b and c are positive in class 1, that the features e and f are positive in class 2, and that the features d and g are outliers; in other words, F(a) = F(b) = F(c) = 1, F(e) = F(f) = 2, and F(d) = F(g) = 0. Estimating the data map from this feature map then assigns the data points 1, 2 and 3 to class 1 and the data points 4, 5 and 6 to class 2. Re-estimating the feature map then asserts that the features a, b and c are positive in class 1, that the features d, e and f are positive in class 2, and that the feature g is an outlier. In the next iteration the result does not change, and at this point the algorithm stops. The resulting clusters are: class 1 for the data points 1, 2 and 3, and class 2 for the data points 4, 5 and 6.
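Running the cofd sketch from Section 2.2 on the matrix of Table 1 (as we read it: rows = data points 1-6, columns = features a-g), with points 2 and 5 as seeds, reproduces this trace:

```python
import numpy as np
# Requires estimate_feature_map / estimate_data_map / cofd from the Section 2.2 sketch.
# Rows = data points 1-6, columns = features a-g, transcribed from Table 1.
W = np.array([
    [1, 1, 0, 0, 1, 0, 0],   # point 1
    [1, 1, 1, 1, 0, 0, 1],   # point 2
    [1, 0, 1, 0, 0, 0, 0],   # point 3
    [0, 1, 0, 0, 1, 1, 0],   # point 4
    [0, 0, 0, 1, 1, 1, 1],   # point 5
    [0, 0, 0, 1, 0, 1, 0],   # point 6
])
# Seeds: point 2 -> class 1, point 5 -> class 2 (0-based row indices 1 and 4).
D, F = cofd(W, seeds=[1, 4], k=2)
print("data map   :", D)   # [1 1 1 2 2 2]
print("feature map:", F)   # a, b, c -> 1; d, e, f -> 2; g -> 0 (outlier)
```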

2.4 Refining Methods


Clustering results are sensitive to the initial seed points. Randomly chosen seed points may leave the search trapped in local minima. We therefore use a refining procedure whose idea is to use conditional entropy to measure the similarity between a pair of clustering results: CoFD attempts to find the clustering result having the smallest average conditional entropy against all others.

Clustering a large data set may be time consuming. To speed up the algorithm, we focus on reducing the number of iterations. A small subset of the data points, for example a small fraction of the entire data set, may be selected as the bootstrap data set. First, the clustering algorithm is executed on the bootstrap data set. Then, the clustering algorithm is run on the entire data set using the data map obtained from clustering the bootstrap data set (instead of using randomly generated seed points).

CoFD can also be easily adapted to estimate the number of classes instead of taking k as an input parameter. We observed that the best number of classes results in the smallest average conditional entropy between clustering results obtained from different random seed point sets. Based on this observation, the algorithm for guessing the number of clusters is described in Figure 3.
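A minimal sketch of this selection procedure is shown below (Python; cluster_fn stands for any clustering routine, e.g. the CoFD sketch above started from random seed points, and its interface is an assumption of ours rather than part of the paper):

```python
import numpy as np
from itertools import combinations

def conditional_entropy(a, b):
    """H(A | B) in nats for two labelings of the same points."""
    a, b = np.asarray(a), np.asarray(b)
    n, h = len(a), 0.0
    for vb in np.unique(b):
        mask = (b == vb)
        p_b = mask.sum() / n
        _, counts = np.unique(a[mask], return_counts=True)
        p = counts / counts.sum()
        h -= p_b * np.sum(p * np.log(p))
    return h

def guess_num_classes(W, cluster_fn, candidate_ks, n_runs=5, seed=0):
    """Score each candidate k by the average pairwise conditional entropy
    between clusterings obtained from different random seed sets, and return
    the k whose clusterings agree with each other the most."""
    rng = np.random.default_rng(seed)
    best_k, best_score = None, np.inf
    for k in candidate_ks:
        runs = [cluster_fn(W, k, rng) for _ in range(n_runs)]  # label arrays
        score = np.mean([conditional_entropy(x, y) + conditional_entropy(y, x)
                         for x, y in combinations(runs, 2)]) / 2.0
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```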

2.5 Using Graph Partitioning to Initialize Feature Map


In this section, we describe the method of using graph partitioning for initializing the feature map. For binary datasets, we can find the frequent associations between features and build the association graph of the features. Then we use a graph partitioning algorithm to partition the features into several parts, i.e., to induce the initial feature map.

2.5.1 Feature Association

The problem of finding all frequent associations among attributes in categorical (basket) databases [3], called association mining, is one of the most fundamental and most popular problems in data mining. Formally, association mining is the problem of, given a database of transactions, each of which is a subset of a universe U (the set of all items), and a threshold value sigma (the so-called minimum support), enumerating all nonempty subsets S of U such that the proportion of the transactions containing every element of S is at least sigma. Such a set S is called a frequent itemset. (There are variations in which the task is to enumerate all frequent itemsets having certain properties, e.g., [10, 22, 29].) There are many practical algorithms for this problem in various settings, e.g., [4, 5, 20, 34, 38, 43, 73] (see the survey by Hipp et al. [41]), and, using current commodity machines, these practical algorithms can be made to run reasonably quickly.

2.5.2 Graph Partitioning

After finding all the frequent associations², we build the association graph. In the association graph, the nodes represent the features and the edge weights are the supports of the associations. We can then use graph partitioning to get an initial feature map. Graph partitioning divides the nodes of a graph into several parts such that the total weight of the edges connecting nodes in different parts is minimized. The partitioning of the association graph therefore divides the features into several subsets so that the connections between the subsets are minimized. In our experiments, we apply the Metis algorithms [50] to get an initial feature map.

² In our experiment, we use the Apriori algorithm with a fixed minimum support.

The use of graph partitioning to initialize the feature map enables our algorithm to explicitly consider the correlations between the features. In general, the feature space is much smaller than the data space, so the graph partitioning on the feature space is very efficient.
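A small sketch of this initialization is given below (Python). It restricts itself to pairwise feature associations and uses spectral bisection of the association graph as a simple stand-in for the Apriori-plus-Metis pipeline used in the paper; the parameter values are illustrative:

```python
import numpy as np

def feature_association_graph(W, min_support=0.1):
    """Weighted adjacency matrix over features: edge weight = support of the
    pairwise association {j, j'}, kept only if it reaches min_support."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    support = (W.T @ W) / n          # fraction of points containing both features
    np.fill_diagonal(support, 0.0)
    support[support < min_support] = 0.0
    return support

def bisect_features(adjacency):
    """Spectral bisection (sign of the Fiedler vector) of the association graph,
    a stand-in for the multilevel Metis partitioner."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]          # eigenvector of the 2nd-smallest eigenvalue
    return np.where(fiedler >= 0, 1, 2)   # initial feature map over two classes

# Example: initial feature map for a binary data-feature matrix W
# initial_F = bisect_features(feature_association_graph(W))
```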

2.6 Extending to Non-binary Data Sets


In order to handle non-binary data sets, we first translate the raw attribute values of the data points into binary feature spaces. The translation scheme in [68] can be used to discretize categorical and continuous attributes. However, this type of translation is vulnerable to outliers that may drastically skew the range. In our algorithm, we use two other discretization methods. The first combines the Equal Frequency Intervals method with the idea of CMAC [6]: given the instances, the method divides each dimension into bins, with each bin containing adjacent values. In other words, each dimension is divided into several overlapped segments, and the size of the overlap is controlled by a parameter³. An attribute is then translated into a binary sequence whose bit-length equals the number of overlapped segments, where each bit indicates whether the attribute value belongs to the corresponding segment. We can also use Gaussian mixture models to fit each attribute of the original data sets, since most of them are generated from Gaussian mixture models. The number of Gaussian distributions can be obtained by maximizing the Bayesian information criterion of the mixture model. Each value is then translated into feature values in the binary feature space: the i-th feature value is 1 if the probability that the value of the data point is generated from the i-th Gaussian distribution is the largest.

³ The overlap parameter can be a constant for all bins or different constants for different bins, depending on the distribution density of the dimension; in our experiment we set it to a fixed constant.
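The sketch below illustrates the first discretization method (Python; the number of bins and the overlap fraction are illustrative parameters, since the paper leaves them as tunable constants):

```python
import numpy as np

def overlapping_equal_frequency_bins(column, n_bins=5, overlap=0.5):
    """Discretize one continuous attribute into overlapping equal-frequency
    segments and return a binary feature block: bit r is 1 iff the value
    falls inside segment r."""
    x = np.asarray(column, dtype=float)
    n = len(x)
    order = np.argsort(x)
    step = n / n_bins
    width = int(round(step * (1 + overlap)))          # segments share part of a bin
    features = np.zeros((n, n_bins), dtype=int)
    for r in range(n_bins):
        start = int(round(r * step))
        members = order[start:min(start + width, n)]  # adjacent values in sorted order
        lo, hi = x[members].min(), x[members].max()
        features[:, r] = ((x >= lo) & (x <= hi)).astype(int)
    return features

# Example: binarize every column of a numeric data matrix X
# B = np.hstack([overlapping_equal_frequency_bins(X[:, j]) for j in range(X.shape[1])])
```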

3 Distributed Clustering

In this section, we first briefly review the motivation for distributed clustering and then give the D-CoFD algorithms for distributed clustering.

3.1 Motivation for Distributed Clustering


Over the years, data set sizes have grown rapidly with advances in technology: ever-increasing computing power and storage capacity, the permeation of the Internet into daily life, and increasingly automated business, manufacturing and scientific processes. Moreover, many of these data sets are, by nature, geographically distributed across multiple sites. For example, the huge number of sales records of hundreds of chain stores are stored at different locations. To cluster such large and distributed data sets, it is important to investigate efficient distributed algorithms that reduce the communication overhead, the central storage requirements, and the computation times. With the high scalability of distributed systems and the easy partitioning and distribution of a centralized dataset, distributed clustering algorithms can also bring the resources of multiple machines to bear on a given problem as the data size scales up.

3.2 D-CoFD Algorithms


In a distributed environment, data sites may be homogeneous, i.e., different sites contain data for exactly the same set of features, or heterogeneous, i.e., different sites store data for different sets of features, possibly with some features common among sites. As already mentioned, the data and feature maps carry natural summary information about the dataset and the cluster structures. Based on this observation, we present extensions of our CoFD algorithm for clustering homogeneous and heterogeneous datasets.

3.2.1 Clustering Homogeneous Datasets

We first present an extension of the CoFD algorithm, D-CoFD-Hom, for clustering homogeneous datasets. In this paper, we assume that the data sets have similar distributions and hence the number of classes at each site is usually the same.
Algorithm D-CoFD-Hom
begin
1. Each site sends a sample of its data to the central site;
2. The central site decides the parameters for data transformation (e.g., the number of mixtures for each dimension) and broadcasts them to each site;
3. Each site preprocesses its data, performs CoFD, and obtains its data map and its corresponding feature map;
4. Each site sends its feature map to the central site;
5. The central site decides the global feature map and broadcasts it to each site;
6. Each site estimates its data map from the global feature map.
end

The task of estimating the global feature map at the central site can be described as follows. First we need to identify the correspondence between the classes of each pair of sites. If we view each class as a collection of features⁴, we need to find a mapping between the classes of two sites that maximizes the mutual information between the two sets of features. This can be reduced to a maximum-weight bipartite matching problem, for which efficient algorithms exist, e.g., [42]. In our experiments, however, the number of classes is relatively small, so we used brute-force search to find the mapping between the classes of two sites. Once the best correspondence has been found for each pair of sites, the global feature map can be established by the feature-map estimation procedure, i.e., by assigning each feature to a class.

⁴ The feature indices are the same among all sites.
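A brute-force sketch of this correspondence step is given below (Python; maximizing the total number of co-assigned features is used here as a simple proxy for maximizing the mutual information, and the function names are ours). For feature maps whose confusion matrix is the one shown in Table 9, this search returns the pairing shown there (site A classes 1-5 corresponding to site B classes 3, 4, 1, 5, 2).

```python
import numpy as np
from itertools import permutations

def class_confusion(F_a, F_b, k):
    """C[i, j] = number of features positive in class i+1 at site A and in
    class j+1 at site B (class 0 = outlier is ignored)."""
    F_a, F_b = np.asarray(F_a), np.asarray(F_b)
    C = np.zeros((k, k), dtype=int)
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            C[i - 1, j - 1] = np.sum((F_a == i) & (F_b == j))
    return C

def best_correspondence(F_a, F_b, k):
    """Search all permutations for the matching between the two sites' classes
    with the largest total agreement; feasible because k is small."""
    C = class_confusion(F_a, F_b, k)
    best_perm, best_score = None, -1
    for perm in permutations(range(k)):
        score = sum(C[i, perm[i]] for i in range(k))
        if score > best_score:
            best_perm, best_score = perm, score
    # class i+1 at site A corresponds to class best_perm[i]+1 at site B
    return {i + 1: best_perm[i] + 1 for i in range(k)}
```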

3.2.2 Clustering Heterogeneous Datasets

Here we present another extension of the CoFD algorithm, D-CoFD-Het, for clustering heterogeneous datasets. We assume that each site contains relational data and that the row indices are used as the key linking the rows from different sites. The duality of the feature map and the data map in CoFD makes it natural to extend the algorithm to clustering heterogeneous datasets.
Algorithm D-CoFD-Het
begin
1. Each site preprocesses its data, performs CoFD, and obtains its data map and its corresponding feature map;
2. Each site sends its data map to the central site;
3. The central site decides the global data map and broadcasts it to each site, using an approach similar to that for estimating the global feature map;
4. Each site estimates its feature map from the global data map.
end

4 Experimental Results

4.1 Experiments with the CoFD Algorithm


There are many ways to measure how accurately CoFD performs. One is the confusion matrix described in [1]: the entry (i, j) of a confusion matrix is the number of data points assigned to output class i that were generated from input class j. The input map I is the map of the data points to the input classes, so the information of the input map can be measured by the entropy H(I). The goal of clustering is to find an output map O that recovers this information. The conditional entropy H(I | O) is interpreted as the information of the input map given the output map, i.e., the portion of the information which is not recovered by the clustering algorithm. Therefore the recovering rate of a clustering algorithm, defined as

    r = I(I; O) / H(I) = 1 - H(I | O) / H(I),

can also be used as a performance measure for clustering⁵. The purity [75], which measures the extent to which each cluster contains data points from primarily one class, is also a good metric for cluster quality. The purity of a clustering solution is obtained as a weighted sum of individual cluster purities and is given by

    Purity = \sum_{r=1}^{k} (n_r / n) P(S_r),    P(S_r) = (1 / n_r) \max_i n_r^i,

where S_r is a particular cluster of size n_r, n_r^i is the number of documents of the i-th input class that were assigned to the r-th cluster, k is the number of clusters, and n is the total number of points⁶. In general, the larger the value of the purity, the better the clustering solution is; for a perfect clustering solution the purity is 1. All the experiments are performed on a Sun workstation running SunOS 5.7.

⁵ I(I; O) is the mutual information between I and O.
⁶ P(S_r) is also called the individual cluster purity.
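Both measures can be computed directly from a confusion matrix. The sketch below (Python) assumes that rows are output classes and columns are input classes, with entropies in nats:

```python
import numpy as np

def recovering_rate(confusion):
    """r = I(I; O) / H(I) = 1 - H(I|O)/H(I) from a confusion matrix whose entry
    (o, i) counts points of input class i assigned to output class o."""
    C = np.asarray(confusion, dtype=float)
    n = C.sum()
    p_i = C.sum(axis=0) / n                       # input-class distribution
    h_i = -np.sum(p_i[p_i > 0] * np.log(p_i[p_i > 0]))
    h_i_given_o = 0.0
    for o in range(C.shape[0]):
        row = C[o]
        if row.sum() == 0:
            continue
        p_o = row.sum() / n
        p = row / row.sum()
        h_i_given_o -= p_o * np.sum(p[p > 0] * np.log(p[p > 0]))
    return 1.0 - h_i_given_o / h_i

def purity(confusion):
    """Weighted sum of individual cluster purities: sum_r (n_r/n) * max_i n_r^i / n_r."""
    C = np.asarray(confusion, dtype=float)
    return C.max(axis=1).sum() / C.sum()

# Table 2 (binary synthetic data): every point is recovered, so both measures are 1.
table2 = np.array([[0, 0, 0, 83, 0],
                   [0, 0, 0, 0, 81],
                   [0, 0, 75, 0, 0],
                   [86, 0, 0, 0, 0],
                   [0, 75, 0, 0, 0]])
print(recovering_rate(table2), purity(table2))    # 1.0 1.0
```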


4.1.1 A Binary Synthetic Data Set

We first generate a binary synthetic data set to evaluate our algorithm. The data points are drawn from five clusters, and each point is described by binary features; each cluster has a set of positive features and a set of negative features, and for a point of the cluster the positive features take the value 1 with high probability, the negative features take the value 1 with low probability, and the remaining features take the value 1 with an intermediate probability. Table 2 shows the confusion matrix of this experiment: each output class contains the points of exactly one input class (86, 75, 75, 83 and 81 points, respectively) and all other entries are zero. So H(I | O) = 0, i.e., the recovering rate of the algorithm is 1, i.e., all data points are correctly recovered.

4.1.2 A Continuous Synthetic Data Set

In this experiment we attempt to cluster a continuous data set. We use the method described in [1] to generate a data set of 100,000 data points in a high-dimensional space; all input classes are generated in lower-dimensional subspaces. Five percent of the data points are chosen to be outliers, which are distributed uniformly at random throughout the entire space. Using the second translation method described in Section 2.6, we map all the data points into a binary feature space. Then a small subset of the data points is randomly chosen as the bootstrap data set. By running the CoFD algorithm on the bootstrap data set, we obtain a clustering of the bootstrap data set. Using this clustering as the seed points, we run the algorithm on the entire data set. Table 3 shows the confusion matrix of this experiment: about 99.7% of the data points are correctly recovered, and the conditional entropy H(I | O) is small relative to the input entropy H(I), giving a high recovering rate. We make a rough comparison with the result reported in [1]: computing the recovering rate from the confusion matrix reported in that paper seems to indicate that our algorithm is better than theirs in terms of recovering rate.

4.1.3 Zoo Database

We also evaluate the performance of the CoFD algorithm on the zoo database available at the UC Irvine Machine Learning Repository. The database contains 101 animals, each of which has 15 boolean attributes and one numeric attribute⁷. We translate each boolean attribute into two features, indicating whether the attribute is active and whether it is inactive. We translate the numeric attribute, legs, into six features, which correspond to 0, 2, 4, 5, 6 and 8 legs, respectively. Table 4 shows the confusion matrix of this experiment, from which the conditional entropy, the input entropy and hence the recovering rate can be computed. In the confusion matrix, we found that the clusters with a large number of animals are likely to be correctly clustered.

CoFD comes with an important by-product: the resulting classes can be easily described in terms of features, since the algorithm produces an explicit feature map. For example, the positive features of one class are no eggs, no backbone, venomous and eight legs; the positive features of another class are feather, airborne, two legs, domestic and catsize; the positive feature of a third class is no legs; the positive features of a fourth class are aquatic, no breathes, fins and five legs; the positive features of a fifth class are six legs and no tail; and the positive feature of a sixth class is four legs. Hence the second of these classes can be described as animals having feathers and two legs and being airborne, which are the representative features of the birds, and the fourth can be described as animals that are aquatic, do not breathe and have fins⁹.

⁷ The original data set has 101 data points, but one animal, frog, appears twice, so we eliminated one of them. We also eliminated two attributes, animal name and type.
⁹ The animals dolphin and porpoise belong to the mammal class but were clustered into the aquatic class, because their attributes aquatic and fins make them more like the animals in that class than their attribute milk does for the mammal class.

4.2 Scalability Results


In this subsection we present computational scalability results for our algorithm. The results are averaged over five runs in each case to eliminate randomness effects. We report the scalability results in terms of the number of points and the dimensionality of the space.

Number of points: The data sets we test all have the same number of features (dimensions); they all contain the same number of clusters, and each cluster has a fixed average number of positive features. Figures 4 and 5 show the scalability of the algorithm in terms of the number of data points. The y-coordinate in Figure 4 is the running time in seconds. We implemented the algorithm using GNU Octave on Linux; if it were implemented in a compiled language, the running time could be reduced considerably. Figure 4 contains two curves, one with bootstrap and one without. It can be seen from Figure 4 that with bootstrap our algorithm scales sub-linearly with the number of points, and without bootstrap it scales linearly with the number of points. The y-coordinate in Figure 5 is the ratio of the running time to the number of points; the linear and sub-linear scalability properties can also be observed there.

Dimensionality of the space: The data sets we test all have the same number of data points and the same number of clusters, and the average number of features (dimensions) for each cluster is a fixed fraction of the total number of features. Figures 6 and 7 show the scalability of the algorithm in terms of the dimensionality of the space. The y-coordinate in Figure 6 is the running time in seconds; it can be seen that our algorithm scales linearly with the dimensionality of the space. The y-coordinate in Figure 7 is the ratio of the running time to the number of features.

4.3 Document Clustering Using CoFD


In this section, we apply our CoFD algorithm to cluster documents and compare its performance with other standard clustering algorithms. Document clustering has been used as a means of improving document retrieval performance. In our experiments, documents are represented using the binary vector-space model, where each document is a binary vector in the term space and each element of the vector indicates the presence or absence of the corresponding term.

4.3.1 Document Datasets

Several datasets are used in our experiments:


URCS: The URCS dataset is the collection of technical reports published in the year 2002 by the Computer Science Department at the University of Rochester¹⁰. There are 3 main research areas at URCS: AI¹¹, Systems and Theory, and AI has been loosely divided into two sub-areas: NLP¹² and Robotics/Vision. Hence the dataset contains 476 abstracts that are divided into 4 different research areas.

WebKB: The WebKB dataset contains webpages gathered from university computer science departments. There are about 8300 documents, divided into 7 categories: student, faculty, staff, course, project, department and other. The raw text is about 27MB. Among these 7 categories, student, faculty, course and project are the four most populous entity-representing categories; the associated subset is typically called WebKB4. In this paper, we did experiments on both the 7-category and the 4-category datasets.

Reuters: The Reuters-21578 Text Categorization Test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. In our experiments, we use a subset of the collection which includes the 10 most frequent categories among the 135 topics; we call it Reuters-top 10.

K-dataset: The K-dataset is from the WebACE project [35] and was used in [15] for document clustering. It contains 2340 documents consisting of news articles from the Reuters news service via the Web in October 1997. These documents are divided into 20 classes.

The datasets and their characteristics are summarized in Table 5. To preprocess the datasets, we remove the stop words using a standard stop list and perform stemming using a Porter stemmer; all HTML tags are skipped and all header fields except subject and organization of the posted articles are ignored. In all our experiments, we first select the top 1000 words by mutual information with the class labels. The feature selection is done with the rainbow package [56].

4.3.2 Document Clustering Comparisons

Document clustering methods can be mainly categorized into two types: partitioning and hierarchical clustering [69]. Partitioning methods decompose a collection of documents into a given number of disjoint clusters which are optimal in terms of some predefined criterion functions. For example, the traditional K-means method tries to minimize the sum-of-squared-errors criterion function. The criterion functions of adaptive K-means approaches usually take more factors into consideration and hence are more complicated. There are also many different hierarchical clustering
The dataset can be downloaded from http://www.cs.rochester.edu/u/taoli/data. Articial Intelligence. 12 Natural Language Processing.
11 10

13

algorithms with different policies for combining clusters, such as single-linkage, complete-linkage and the UPGMA method [45]. Single-linkage and complete-linkage use the minimum and the maximum distance between the two clusters, respectively, while UPGMA (Unweighted Pair-Group Method with Arithmetic mean) uses the average distance between the clusters to define the similarity of two clusters for merging. In our experiments, we compare the performance of CoFD on the datasets with K-means, partitioning methods with various adaptive criterion functions, and hierarchical clustering algorithms using the single-linkage, complete-linkage and UPGMA merging measures. For the partitioning and hierarchical algorithms, we use the CLUTO clustering package described in [75, 76]. The comparisons are shown in Table 6; each entry is the purity of the corresponding column algorithm on the row dataset. P1, P2, P3 and P4 are partitioning algorithms with different adaptive criterion functions, as shown in Table 7. The Slink, Clink and UPGMA columns are hierarchical clustering algorithms using single-linkage, complete-linkage and UPGMA merging policies. In our experiments, we use the cosine of the angle between two document vectors as their similarity. From Table 6, we observe that CoFD achieves the best performance on the URCS and Reuters-top 10 datasets. On all other datasets, the results obtained by CoFD are very close to the best results: on WebKB4, K-means has the best performance of 0.647 and CoFD gives 0.638; P2 achieves the best result of 0.505 on WebKB and P1 gives the highest performance of 0.762 on K-dataset, while the results obtained by CoFD on WebKB and K-dataset are 0.501 and 0.754, respectively. Figure 8 shows the graphical comparison. The comparison shows that, although there is no single winner on all the datasets, CoFD is a viable and competitive algorithm in the document clustering domain. To get a better understanding of CoFD, Table 8 gives the confusion matrix built from the clustering results on the URCS dataset; the columns of the confusion matrix are NLP, Robotics/Vision, Systems and Theory, respectively. The result shows that Systems and Theory are quite different from each other and from NLP and Robotics/Vision, while NLP and Robotics/Vision are similar to each other, and AI is more similar to Systems than Robotics is.
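For reference, the binary vector-space representation and the cosine similarity used above amount to the following (a toy Python sketch; the vocabulary and documents are made up for illustration):

```python
import numpy as np

def binary_vector(doc_terms, vocabulary):
    """Binary vector-space representation: 1 iff the term occurs in the document."""
    return np.array([1 if term in doc_terms else 0 for term in vocabulary])

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors; for binary vectors this is
    |shared terms| / sqrt(|terms of u| * |terms of v|)."""
    denom = np.sqrt(u.sum() * v.sum())
    return float(u @ v) / denom if denom else 0.0

vocab = ["cluster", "feature", "likelihood", "graph"]       # toy vocabulary
d1 = binary_vector({"cluster", "feature", "graph"}, vocab)
d2 = binary_vector({"cluster", "likelihood"}, vocab)
print(cosine_similarity(d1, d2))                            # 1/sqrt(6) ~ 0.408
```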

4.4 Experimental Results of the D-CoFD Algorithms


This subsection presents the experimental results of the D-CoFD algorithms for distributed clustering.


4.4.1 An Artificial Experiment

We generate the test data set using the algorithm in [59]; the data points lie in a multi-dimensional space that includes noise dimensions. The first translation method described in Section 2.6 is used for preprocessing. We first apply the centralized CoFD algorithm to the dataset, for which the recovering rate is 1, i.e., all data points are correctly recovered. We then horizontally partition the dataset into five subsets of equal size and apply the D-CoFD-Hom algorithm. An important step of D-CoFD-Hom is to identify the correspondence between the classes of each pair of sites. We use a brute-force approach to find the mapping that maximizes the mutual information between them. For example, given the feature maps from site A and site B, we want to establish the correspondence between the classes on site A and those on site B. We first construct the confusion matrix of the two feature maps, whose entry (i, j) is the number of features that are positive in both class i at site A and class j at site B. From the confusion matrix we can easily derive the correspondence; Table 9 gives a detailed example of such a confusion matrix and its resulting correspondence. Overall, all the points are again correctly clustered and thus the recovering rate is 1.

In another experiment, we vertically partition the dataset into three subsets, each containing all the data points but only a portion of the features. As in the homogeneous case, to identify the correspondence between the sets of classes we use brute-force search to find the mapping that maximizes the mutual information between them. For example, given the data maps of site A and site B, we want to establish the correspondence between the classes of site A and those of site B. We first construct the confusion matrix of the two data maps, whose entry (i, j) is the number of points that are clustered into both class i at site A and class j at site B; this matrix is built in the same way as the confusion matrix of the feature maps in the homogeneous case, and from it we can easily derive the mapping. Table 10 presents such a confusion matrix and its correspondence. The outcome of the D-CoFD-Het algorithm is shown in Table 11, and the recovering rate in this case is again close to that of centralized clustering. The above experiments demonstrate that the results of distributed clustering are very close to those of centralized clustering.

4.4.2 Dermatology Database

We also evaluate our algorithms on the dermatology database from the UC Irvine Machine Learning Repository. The database is used for the differential diagnosis of erythemato-squamous diseases; these diseases all share the clinical features of erythema and scaling, with very few differences. The dataset contains 366 data points over 34 attributes and was previously used for classification; in our experiment, we use it to demonstrate our distributed clustering algorithms. The confusion matrix of centralized clustering, i.e., CoFD, on this dataset is presented in Table 12. We then horizontally partition the dataset into several subsets of roughly equal size; the confusion matrix obtained by D-CoFD-Hom is shown in Table 13.

5 Related Work

Self-Organizing Map (SOM) [53] is widely used for clustering, and is particularly well suited to the task of identifying a small number of prominent classes in a multidimensional data set. SOM uses an incremental approach where points/patterns are processed one by one. However, centroids with huge dimensionality are hard to interpret in SOM. Other traditional clustering techniques can be classified into partitional, hierarchical, density-based and grid-based. Partitional clustering attempts to directly decompose the data set into disjoint classes such that the data points in a class are nearer to one another than the data points in other classes. Hierarchical clustering proceeds successively by building a tree of clusters. Density-based clustering groups the neighboring points of a data set into classes based on density conditions. Grid-based clustering quantizes the object space into a finite number of cells that form a grid structure and then performs clustering on the grid structure. Most of these methods use distance functions as objective criteria and are not effective in high dimensional spaces.

Next we review some recent clustering algorithms which have been proposed for high dimensional spaces or which do not use distance functions, and which are most closely related to our work. CLIQUE [2] is an automatic subspace clustering algorithm for high dimensional spaces. It uses equal-size cells and cell density to find dense regions in each subspace of a high dimensional space. The cell size and the density threshold need to be provided by the user as inputs. CLIQUE does not produce disjoint classes, and the highest dimensionality of subspace classes reported is about ten. Our algorithm produces disjoint classes and does not require additional parameters such as the cell size and density; it can also find classes with higher dimensionality. Aggarwal et al. [1] introduce projected clustering and present algorithms for discovering interesting patterns in subspaces of high dimensional spaces. The core idea is a generalization of feature selection which allows the selection of different sets of dimensions for different subsets of the data sets. However, the algorithms are based on the Euclidean or Manhattan distance, their feature selection method is a variant of singular value decomposition, and the algorithms assume that the number of projected dimensions is given beforehand. Our algorithm needs neither the distance measures nor the number of dimensions for each class, and it does not require all projected classes to have the same number of dimensions. Ramkumar and Swami [64] propose a method for clustering without distance functions. The method is based on the principles of co-occurrence maximization and label minimization, which are normally implemented using association and classification techniques. The method does not consider correlations of dimensions with respect to data locality.

The idea of co-clustering of data points and attributes dates back to [7, 39, 62]. Co-clustering is a simultaneous clustering of both points and their attributes that exploits the canonical duality contained in the point-by-attribute data representation. Govaert [31] studies simultaneous block clustering of the rows and columns of contingency tables. Dhillon [24] presents a co-clustering algorithm using a bipartite graph between documents and words. Our algorithms, however, use the association graph of the features to initialize the feature map and then apply an EM-type algorithm. Han et al. [36] propose a clustering algorithm based on association rule hypergraphs, but they do not consider the co-learning problem of the data map and the feature map. A detailed survey on hypergraph-based clustering can be found in [37]. Cheng et al. [19] propose an entropy-based subspace clustering method for mining numerical data; the criterion of the entropy-based method is that a low-entropy subspace corresponds to a skewed distribution of unit densities. A decision-tree-based clustering method is proposed in [55]. There is also some recent work on clustering categorical datasets, such as ROCK [33] and CACTUS [30]. Our algorithm can be applied to both categorical and numerical data.

There are also many probabilistic clustering approaches, where the data are considered to be a sample independently drawn from a mixture of several probability distributions [58]. The area around the mean of each distribution constitutes a natural cluster, so the clustering task is reduced to finding the parameters, such as the mean and variance, of the distributions so as to maximize the probability that the data are drawn from the mixture model (also known as maximizing the likelihood). The Expectation-Maximization (EM) method [23] is then used to find the mixture model parameters that maximize the likelihood. Moore [60] suggests an acceleration of the EM method based on a special data index, the KD-tree, where the data are divided at each node into two descendants by splitting the widest attribute at the center of its range. The algorithm AutoClass [18] utilizes a mixture model and covers a wide range of distributions including Bernoulli, Poisson, Gaussian and log-normal distributions. The algorithm MCLUST [28] uses Gaussian models with ellipsoids of different volumes, shapes and orientations. Smyth [67] uses probabilistic clustering for multivariate and sequential data, and Cadez et al. [16] use mixture models for customer profiling based on transactional information. Barash and Friedman [9] use a context-specific Bayesian clustering method to help in understanding the connections between transcription factors and functional classes of genes; the context-specific approach can deal with many attributes that are irrelevant to the clustering and is suitable for the underlying biological problem. Kearns et al. [51] give an information-theoretic analysis of hard assignments (used by k-means) and soft assignments (used by the EM algorithm) for clustering and propose a posterior partition algorithm which is close to the soft assignments of EM. The relationship between maximum likelihood and clustering is also discussed in [47].

Minimum encoding length approaches, including Minimum Message Length (MML) [70] and Minimum Description Length (MDL) [65], can also be used for clustering. Both approaches perform induction by seeking a theory that enables the most compact encoding of both the theory and the available data, and both admit probabilistic interpretations: given prior probabilities for both theories and data, minimization of the MML encoding closely approximates maximization of the posterior probability, while an MDL code length defines an upper bound on the unconditional likelihood [66, 71]. Minimum encoding is an information-theoretic criterion for parameter estimation and model selection; using minimum encoding for clustering, we look for the cluster structure which minimizes the size of the encoding. The algorithm SNOB [72] uses a mixture model in conjunction with the Minimum Message Length principle. An empirical comparison of prominent criteria is given in [63], which concludes that Minimum Message Length appears to be the best criterion for selecting the number of components for mixtures of Gaussian distributions. Fasulo [26] gives a detailed survey of clustering approaches based on mixture models [8], dynamical systems [52] and clique graphs [11]. Similarity based on shared features has also been analyzed in cognitive science, for example in the family resemblances study by Rosch and Mervis. Some methods have been proposed for classification using maximum entropy (or maximum likelihood) [12, 61]; the classification problem is a supervised learning problem, for which the data points are already labeled with known classes, whereas the clustering problem is an unsupervised learning problem, as no labeled data are available.

Recently there has been a lot of interest in distributed clustering. Kargupta et al. [49] present collective principal component analysis for clustering heterogeneous datasets. Johnson and Kargupta [46] study hierarchical clustering of distributed and heterogeneous data. Both of the above works are oriented toward vertically partitioned data; in our D-CoFD algorithms, we envisage data that is horizontally partitioned as well as vertically partitioned. Lazarevic et al. [54] propose a distributed clustering algorithm for learning regression models from spatial datasets. Hulth and Grenholm [44] propose a distributed clustering algorithm which is intended to be incremental and to work in a real-time situation. McClean et al. [57] present algorithms to cluster databases that hold aggregate count data on a set of attributes that have been classified according to heterogeneous classification schemes. Forman and Zhang [27] present distributed K-means, K-Harmonic-Means and EM algorithms for homogeneous (horizontally partitioned) datasets. Parallel clustering algorithms are also discussed, for example, in [25]. More work on distributed clustering can be found in [48].

6 Conclusions

In this paper, we first propose a novel clustering algorithm, CoFD, which does not require a distance function for high dimensional spaces. We offer a new perspective on the clustering problem by interpreting it as the dual problem of optimizing the feature map and the data map. As a by-product, CoFD also produces the feature map, which can provide interpretable descriptions of the resulting classes. CoFD also provides a method to select the number of classes based on the conditional entropy. We introduce the recovering rate, an accuracy measure of clustering, to measure the performance of clustering algorithms and to compare clustering results of different algorithms. We extend CoFD and propose the D-CoFD algorithms for clustering distributed datasets. Extensive experiments have been conducted to demonstrate the efficiency and effectiveness of the algorithms. Documents are usually represented as high-dimensional vectors in term space; we evaluate CoFD for clustering on a variety of document collections and compare its performance with other widely used document clustering algorithms. Although there is no single winner on all the datasets, CoFD outperforms the rest on two datasets and its performance is close to the best on the other three datasets. In summary, CoFD is a viable and competitive clustering algorithm.

Acknowledgments
The project is supported in part by NIH Grants 5-P41-RR09283, RO1-AG18231, and P30-AG18254 and by NSF Grants EIA-0080124, EIA-0205061, CCR-9701911, and DUE-9980943.


References
[1] Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, and Jong Soo Park. Fast algorithms for projected clustering. In ACM SIGMOD Conference, pages 61-72, 1999.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD-98, 1998.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. In Proceedings of the ACM-SIGMOD 1993 International Conference on Management of Data, pages 207-216, 1993.
[4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307-328. AAAI/MIT Press, 1996.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th Conference on Very Large Databases, pages 487-499, 1994.
[6] J. S. Albus. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Trans. of the ASME, J. Dynamic Systems, Measurement, and Control, 97(3):220-227, September 1975.
[7] M. R. Anderberg. Cluster Analysis for Applications. Academic Press Inc., 1973.
[8] J. Banfield and A. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821, 1993.
[9] Yoseph Barash and Nir Friedman. Context-specific Bayesian clustering for gene expression data. In RECOMB, pages 12-21, 2001.
[10] R. J. Bayardo Jr. and R. Agrawal. Mining the most interesting rules. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, pages 145-154, New York, NY, 1999. ACM Press.
[11] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281-297, 1999.
[12] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 1996.

[13] M. Berger and I. Rigoutsos. An algorithm for point clustering and grid generation. IEEE Trans. on Systems, Man and Cybernetics, 21(5):1278-1286, 1991.
[14] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In ICDT Conference, 1999.
[15] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. Document categorization and query generation on the World Wide Web using WebACE. AI Review, 13(5).
[16] I. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transactional data with applications to profiling, visualization, and prediction. In KDD 2001, pages 37-46, San Francisco, CA, 2001.
[17] P. Cheeseman, J. Kelly, and M. Self. AutoClass: A Bayesian classification system. In ICML'88, 1988.
[18] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
[19] C.-H. Cheng, A. W.-C. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In KDD-99, 1999.
[20] D. Cheung, J. Han, V. T. Ng, A. W. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In International Conference on Parallel and Distributed Information Systems, pages 31-42, 1996.
[21] P. A. Chou, T. Lookabaugh, and R. M. Gray. Entropy-constrained vector quantization. IEEE Trans., ASSP-37(1):31, 1989.
[22] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13(1):64-78, 2001.
[23] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
[24] I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. Technical Report 2001-05, UT Austin CS Dept, 2001.


[25] Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, pages 245-260, 2000.
[26] D. Fasulo. An analysis of recent work on clustering algorithms. Technical Report 01-03-02, University of Washington, Dept. of Computer Science and Engineering, 1999.
[27] G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. SIGKDD Explorations, 2(2):34-38, 2001.
[28] C. Fraley and A. Raftery. MCLUST: Software for model-based cluster and discriminant analysis. Technical Report 342, Dept. of Statistics, University of Washington, 1999.
[29] A. Fu, R. Kwong, and J. Tang. Mining most interesting itemsets. In Proceedings of the 12th International Symposium on Methodologies for Intelligent Systems, pages 59-67. Springer-Verlag Lecture Notes in Computer Science 1932, 2000.
[30] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS: Clustering categorical data using summaries. In Knowledge Discovery and Data Mining, pages 73-83, 1999.
[31] G. Govaert. Simultaneous clustering of rows and columns. Control and Cybernetics, (24):437-458, 1985.
[32] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD Conference, 1998.
[33] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345-366, 2000.
[34] E. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD Conference, pages 277-288. ACM Press, 1997.
[35] Eui-Hong Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. WebACE: A web agent for document categorization and exploration. In Katia P. Sycara and Michael Wooldridge, editors, Proceedings of the 2nd International Conference on Autonomous Agents (Agents'98), pages 408-415, New York, 9-13, 1998. ACM Press.

[36] Eui-Hong Han, George Karypis, Vipin Kumar, and Bamshad Mobasher. Clustering based on association rule hypergraphs. In Research Issues on Data Mining and Knowledge Discovery, 1997.
[37] Eui-Hong Han, George Karypis, Vipin Kumar, and Bamshad Mobasher. Hypergraph based clustering in high-dimensional data sets: A summary of results. Data Engineering Bulletin, 21(1):15-22, 1998.
[38] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Conference, pages 1-12. ACM Press, 2000.
[39] J. Hartigan. Clustering Algorithms. Wiley, 1975.
[40] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[41] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining: a general survey and comparison. SIGKDD Explorations, 2(1):58-63, 2000.
[42] J. Hopcroft and R. Karp. An algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225-231, 1973.
[43] M. Houtsma and A. Swami. Set-oriented mining of association rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, CA, October 1993.
[44] N. Hulth and P. Grenholm. A distributed clustering algorithm. Technical Report 74, Lund University Cognitive Science, 1998.
[45] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[46] Erik L. Johnson and Hillol Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In M. Zaki and C. Ho, editors, Large-Scale Parallel Systems, pages 221-244. Springer, 1999.
[47] M. I. Jordan. Graphical Models: Foundations of Neural Computation. MIT Press, 2001.
[48] Hillol Kargupta and Philip Chan, editors. Advances in Distributed and Parallel Data Mining. AAAI Press, 2000.


[49] Hillol Kargupta, Weiyun Huang, Krishnamoorthy Sivakumar, and Erik Johnson. Distributed clustering using collective principal component analysis. In ACM SIGKDD-2000 Workshop on Distributed and Parallel Knowledge Discovery, 2000.
[50] George Karypis and Vipin Kumar. Multilevel algorithms for multi-constraint graph partitioning. Technical Report 98-019, 1998.
[51] M. Kearns, Y. Mansour, and A. Y. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In Learning in Graphical Models. Kluwer AP, 1998.
[52] Jon M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the 29th ACM Symposium on Theory of Computing, pages 599-608, 1997.
[53] Teuvo Kohonen. The Self-Organizing Map. In New Concepts in Computer Science, 1990.
[54] A. Lazarevic, D. Pokrajac, and Z. Obradovic. Distributed clustering and local regression for knowledge discovery in multiple spatial databases. In 8th European Symposium on Artificial Neural Networks, ESANN 2000, 2000.
[55] Bing Liu, Yiyuan Xia, and Philip S. Yu. Clustering through decision tree construction. In SIGMOD-00, 2000.
[56] Andrew Kachites McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[57] S. McClean, B. Scotney, K. Greer, and R. Pairceir. Conceptual clustering of heterogeneous distributed databases. In Workshop on Ubiquitous Data Mining, PAKDD'01, 2001.
[58] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, NY, 1988.
[59] G. Milligan. An algorithm for creating artificial test clusters. Psychometrika, 50(1):123-127, 1985.
[60] A. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Proceedings of the Neural Information Processing Systems Conference, 1998.


[61] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61-67, 1999.
[62] S. Nishisato. Analysis of Categorical Data: Dual Scaling and its Applications. University of Toronto Press, Toronto, 1980.
[63] J. J. Oliver, R. A. Baxter, and C. S. Wallace. Unsupervised learning using MML. In Machine Learning: Proceedings of the Thirteenth International Conference (ICML 96), pages 364-372. Morgan Kaufmann Publishers, 1996.
[64] G. D. Ramkumar and A. Swami. Clustering data without distance functions. IEEE Data(base) Engineering Bulletin, 21:9-14, 1998.
[65] J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416-431, 1983.
[66] Jorma Rissanen. Stochastic complexity. Journal of the Royal Statistical Society B, 49(3), 1987.
[67] P. Smyth. Probabilistic model-based clustering of multivariate and sequential data. In Proceedings of Artificial Intelligence and Statistics, pages 299-304, San Mateo, CA, 1999. Morgan Kaufmann.
[68] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In SIGMOD-96, 1996.
[69] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[70] C. S. Wallace and D. M. Boulton. An information measure for classification. Computer Journal, 11(2):185-194, 1968.
[71] C. S. Wallace and P. R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society (Series B), 49(3), 1987.
[72] C. S. Wallace and D. L. Dowe. Intrinsic classification by MML: the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pages 37-44. World Scientific, 1994.
[73] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-373, 1998.


[74] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, 1996.
[75] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report 01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001.
[76] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. Technical Report 02-022, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2002.


List of Tables
1   A data set example.
2   Confusion matrix.
3   Confusion matrix (with outliers).
4   Confusion matrix of the zoo data.
5   Document data sets descriptions.
6   Document clustering comparison table.
7   Criteria functions for partitioning algorithms.
8   Confusion matrix of technical reports by CoFD.
9   Experiments in the homogeneous setting.
10  Experiments in the heterogeneous setting.
11  Confusion matrix on Site A.
12  Confusion matrix of the centralized result for the Dermatology database.
13  Confusion matrix of the distributed result for the Dermatology database.


                 feature
data point   a   b   c   d   e   f   g
    1        1   1   0   0   1   0   0
    2        1   1   1   1   0   0   1
    3        1   0   1   0   0   0   0
    4        0   1   0   0   1   1   0
    5        0   0   0   1   1   1   1
    6        0   0   0   1   0   1   0

Table 1: A data set example.


Output \ Input     A     B     C     D     E
      1            0     0     0    83     0
      2            0     0     0     0    81
      3            0     0    75     0     0
      4           86     0     0     0     0
      5            0    75     0     0     0

Table 2: Confusion matrix.


Output \ Input       A       B       C       D       E   Outliers
      1              1       0       1   17310       0         12
      2              0   15496       0       2      22        139
      3              0       0       0       0   24004          1
      4             10       0   17425       0       5         43
      5          20520       0       0       0       0          6
   Outliers         16      17      15      10      48       4897

Table 3: Confusion matrix (with outliers).


Output \ Input    1    2    3    4    5    6    7
      A           0    0    0    0    0    0    1
      B           0   20    0    0    0    0    0
      C          39    0    0    0    0    0    0
      D           0    0    2    0    0    0    0
      E           2    0    1   13    0    0    4
      F           0    0    0    0    0    8    5
      G           0    0    2    0    3    0    0

Table 4: Confusion matrix of the zoo data.

Dataset          # documents   # classes
URCS                     476           4
WebKB4                 4,199           4
WebKB                  8,280           7
Reuters-top 10         2,900          10
K-dataset              2,340          20

Table 5: Document data sets descriptions.

Dataset          CoFD    k-means   P1      P2      P3      P4      Slink   Clink   UPMGA
URCS             0.889   0.782     0.868   0.836   0.872   0.878   0.380   0.422   0.597
WebKB4           0.638   0.647     0.639   0.644   0.646   0.640   0.392   0.446   0.395
WebKB            0.501   0.489     0.488   0.505   0.503   0.493   NA      NA      NA
Reuters-top 10   0.734   0.717     0.681   0.684   0.724   0.691   0.393   0.531   0.498
K-dataset        0.754   0.644     0.762   0.748   0.716   0.750   0.220   0.514   0.551

Table 6: Document clustering comparison table. Each entry is the purity of the corresponding column algorithm on the row dataset.
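The purity values in Table 6 follow the usual definition: each cluster contributes the number of its points that belong to its dominant class, and the contributions are summed and divided by the total number of points. The snippet below is a small sketch of that computation, not the code used in the paper; applied to the technical-report confusion matrix of Table 8, where the rows are the CoFD clusters, it gives a purity of about 0.878.

```python
import numpy as np

def purity(confusion):
    """Purity from a confusion matrix whose rows are clusters and whose
    columns are true classes: each cluster contributes the size of its
    dominant class (transpose first if the matrix is laid out the other way)."""
    confusion = np.asarray(confusion, dtype=float)
    return confusion.max(axis=1).sum() / confusion.sum()

# Confusion matrix of Table 8: rows are clusters A-D, columns are classes 1-4.
reports = [[68,  0,   8,   0],
           [ 8,  1,   4, 120],
           [ 0,  0, 160,   0],
           [25, 70,   6,   6]]
print(round(purity(reports), 3))   # 0.878
```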

Table 7: Criteria functions for the different partitioning algorithms. Each of P1-P4 maximizes or minimizes a criterion function defined over the data set D, the clusters S_1, ..., S_k of sizes n_1, ..., n_k, the number of clusters k, and the pairwise distances/similarities sim(d_i, d_j) between data points d_i and d_j.


Output \ Input     1     2     3     4
      A           68     0     8     0
      B            8     1     4   120
      C            0     0   160     0
      D           25    70     6     6

Table 8: Confusion matrix of technical reports by CoFD.

Site A \ Site B   Outlier     1     2     3     4     5
        Outlier         7     0     0     0     0     0
              1         1     0     0    25     2     0
              2         1     0     0     0    27     0
              3         0    29     0     0     0     0
              4         1     0     0     0     1    28
              5         0     0    28     0     0     0

Correspondence of feature maps:
Site A   1   2   3   4   5
Site B   3   4   1   5   2

Table 9: Confusion matrix (rows: Site A clusters, columns: Site B clusters) and correspondence of feature maps in the homogeneous setting.
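The correspondence listed in Table 9 can be read off the confusion matrix by pairing each Site A cluster with the Site B cluster that receives most of its points. The snippet below is only an illustrative sketch (a plain row-wise argmax that ignores ties and the outlier row and column); it is not claimed to be the matching procedure used by D-CoFD.

```python
import numpy as np

def correspondence(confusion, row_labels, col_labels):
    """Pair each row cluster with the column cluster holding most of its points."""
    confusion = np.asarray(confusion)
    return {row_labels[i]: col_labels[j]
            for i, j in enumerate(confusion.argmax(axis=1))}

# Table 9 without the outlier row and column
# (rows: Site A clusters 1-5, columns: Site B clusters 1-5).
m = [[ 0,  0, 25,  2,  0],
     [ 0,  0,  0, 27,  0],
     [29,  0,  0,  0,  0],
     [ 0,  0,  0,  1, 28],
     [ 0, 28,  0,  0,  0]]
print(correspondence(m, [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
# {1: 3, 2: 4, 3: 1, 4: 5, 5: 2}, the pairing listed in Table 9
```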

Site A \ Site B     1     2     3     4     5
          1      1989     0     0     3     6
          2         1     2     0     0  1997
          3         2     0     0  1958     1
          4         0  1998     3    14     0
          5         0     0  1998    28     0

Correspondence of data maps:
Site A   1   2   3   4   5
Site B   1   5   4   2   3

Table 10: Confusion matrix (rows: Site A clusters, columns: Site B clusters) and correspondence of data maps in the heterogeneous setting.


Output \ Input      1     2     3     4     5
      1             0     0     1     0  1997
      2          1999     0     0     0     1
      3             1     0  1958     0     2
      4             0  2000    13     2     0
      5             0     0    28  1998     0

Table 11: Confusion matrix on Site A.

Output \ Input      1     2     3     4     5     6
      A             0     5     0     1     0    20
      B           112    13     0     2     4     0
      C             0     0    72     1     0     0
      D             0    14     0    15     5     0
      E             0     7     0     4    43     0
      F             0    22     0    26     0     0

Table 12: Confusion matrix of the centralized result for the Dermatology database.

Output \ Input      1     2     3     4     5     6
      A             3     0     9     0    41     0
      B            99     6     0     1     0     0
      C             1     0    63     1     0     0
      D             5     4     0     2     8    20
      E             2    16     0     3     3     0
      F             2    35     0    42     0     0

Table 13: Confusion matrix of the distributed result for the Dermatology database.


List of Figures
1   Description of the CoFD algorithm.
2   Clustering and refining algorithm.
3   Algorithm for guessing the number of clusters.
4   Scalability with the number of points: running time.
5   Scalability with the number of points: running time per point.
6   Scalability with the number of features: running time.
7   Scalability with the number of features: running time per feature.
8   Purity comparisons on various document datasets.


Figure 1: Description of the CoFD algorithm. The pseudocode consists of a procedure that computes the feature map from the data points and the current data map, a procedure that computes the data map from the data points and the current feature map (features or data points that fit no class well are marked as outliers), and the main algorithm, which seeds the data map with k randomly chosen distinct data points (one per class), derives an initial feature map from it, and then alternately recomputes the data map and the feature map until the conditional entropy between successive feature maps is zero.
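A rough, runnable sketch of the alternating scheme outlined in Figure 1 is given below. The concrete update rules used here (assign each feature to the class whose members use it most, then each point to the class whose chosen features it shares most), the fixed iteration cap, and the omission of outlier handling are assumptions made for illustration only; they are not claimed to be CoFD's exact rules.

```python
import numpy as np

def cofd_sketch(X, k, n_iter=20, seed=0):
    """Alternate between a feature map and a data map on a binary
    point-by-feature matrix X (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Seed the data map with k randomly chosen distinct points, one per class;
    # all other points start out unassigned (-1).
    data_map = np.full(n, -1)
    data_map[rng.choice(n, size=k, replace=False)] = np.arange(k)
    feature_map = np.zeros(m, dtype=int)
    for _ in range(n_iter):
        # Feature map: each feature goes to the class whose members use it most.
        counts = np.zeros((k, m))
        for c in range(k):
            members = X[data_map == c]
            if len(members) > 0:
                counts[c] = members.sum(axis=0)
        feature_map = counts.argmax(axis=0)
        # Data map: each point goes to the class whose features it shares most.
        scores = np.stack([X[:, feature_map == c].sum(axis=1)
                           for c in range(k)], axis=1)
        new_map = scores.argmax(axis=1)
        if np.array_equal(new_map, data_map):   # maps stopped changing
            break
        data_map = new_map
    return data_map, feature_map

# The example data set of Table 1: rows are data points 1-6, columns features a-g.
X = np.array([[1, 1, 0, 0, 1, 0, 0],
              [1, 1, 1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1, 1, 1],
              [0, 0, 0, 1, 0, 1, 0]])
print(cofd_sketch(X, k=2))
```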


Figure 2: Clustering and refining algorithm. The pseudocode runs CoFD on the data points several times for the given number of classes, computes the average conditional entropies among the resulting clusterings, and returns a refined clustering based on them.
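The refining step in Figure 2 relies on conditional entropies between clusterings. A minimal illustration follows; the selection rule (keep the run whose average conditional entropy against the other runs is smallest) is an assumption made for this sketch, not necessarily the paper's exact criterion.

```python
import numpy as np
from collections import Counter

def conditional_entropy(labels_a, labels_b):
    """H(A | B) in bits for two labelings of the same points; it is zero
    exactly when labels_a is a function of labels_b."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    b_counts = Counter(labels_b)
    return -sum((c / n) * np.log2(c / b_counts[b])
                for (a, b), c in joint.items())

def refine(run_clustering, n_runs=10):
    """Run the clustering several times and keep the most 'central' run,
    i.e. the one with the smallest average conditional entropy against
    the other runs (assumed selection rule)."""
    runs = [run_clustering() for _ in range(n_runs)]
    def avg_ce(i):
        return np.mean([conditional_entropy(runs[i], runs[j])
                        for j in range(n_runs) if j != i])
    return runs[min(range(n_runs), key=avg_ce)]
```

Here `run_clustering` is any zero-argument callable returning a label vector, for example a wrapper around the cofd_sketch above with a fresh random seed on each call.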



Figure 3: Algorithm for guessing the number of clusters. Given an estimated maximum number of clusters, the pseudocode tries each candidate number in turn, runs the clustering several times for it, computes the average conditional entropies of the results, and uses them to choose the number of clusters to return.
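The search in Figure 3 can be organized as below. The scoring rule used here (prefer the number of clusters whose repeated runs agree best, measured by whatever disagreement function is passed in, such as the conditional_entropy above) is again an assumption made for illustration.

```python
from itertools import combinations
from statistics import mean

def guess_k(run_clustering, k_max, disagreement, n_runs=5, k_min=2):
    """For each candidate number of clusters k, repeat the clustering n_runs
    times and score k by the average pairwise disagreement between runs;
    return the most stable k. `run_clustering(k)` must return a label vector
    and `disagreement(a, b)` a non-negative score."""
    scores = {}
    for k in range(k_min, k_max + 1):
        runs = [run_clustering(k) for _ in range(n_runs)]
        scores[k] = mean(disagreement(a, b) for a, b in combinations(runs, 2))
    return min(scores, key=scores.get)
```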


Figure 4: Scalability with the number of points: running time. (Log-log plot of running time in seconds against the number of points, from about 100 to 100,000, with and without bootstrap.)


Figure 5: Scalability with the number of points: running time per point. (Log-log plot of running time/#points against the number of points, from about 100 to 100,000, with and without bootstrap.)

Figure 6: Scalability with the number of features: running time. (Running time in seconds against the number of features: 20, 200, 500, and 1000.)


Figure 7: Scalability with the number of features: running time per feature. (Running time/#features against the number of features: 20, 200, 500, and 1000.)
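Scalability curves like those in Figures 4-7 can be produced with a simple timing loop. The sketch below is only illustrative: the random binary generator stands in for the synthetic data generator actually used, and `cluster_fn` is whatever clustering routine is being measured.

```python
import time
import numpy as np

def time_clustering(cluster_fn, sizes, n_features=100, density=0.1, seed=0):
    """Measure the wall-clock running time of cluster_fn(X) on random binary
    data sets of increasing size, reporting (n, seconds, seconds per point)."""
    rng = np.random.default_rng(seed)
    results = []
    for n in sizes:
        X = (rng.random((n, n_features)) < density).astype(int)
        start = time.perf_counter()
        cluster_fn(X)
        elapsed = time.perf_counter() - start
        results.append((n, elapsed, elapsed / n))
    return results

# e.g. time_clustering(lambda X: cofd_sketch(X, k=5), [100, 1000, 10000]),
# using the cofd_sketch defined in the earlier sketch.
```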

Figure 8: Purity comparisons on various document datasets. (Bar chart of the purities in Table 6 for the URCS, WebKB4, WebKB, Reuters-top10, and K-dataset collections and the CoFD, k-means, P1-P4, Slink, Clink, and UPMGA algorithms.)

