
Available online at www.sciencedirect.com

Information Fusion 9 (2008) 223–233


www.elsevier.com/locate/inffus

k-ANMI: A mutual information based clustering algorithm for categorical data

Zengyou He *, Xiaofei Xu, Shengchun Deng
Department of Computer Science and Engineering, Harbin Institute of Technology, 92 West Dazhi Street, P.O. Box 315, 150001, PR China

Received 25 November 2005; received in revised form 25 May 2006; accepted 31 May 2006
Available online 10 July 2006

Abstract

Clustering categorical data is an integral part of data mining and has attracted much attention recently. In this paper, we present k-ANMI, a new efficient algorithm for clustering categorical data. The k-ANMI algorithm works in a way that is similar to the popular k-means algorithm, and the goodness of clustering in each step is evaluated using a mutual information based criterion (namely, average normalized mutual information - ANMI) borrowed from cluster ensembles. The algorithm is easy to implement, requiring multiple hash tables as the only major data structure. Experimental results on real datasets show that the k-ANMI algorithm is competitive with state-of-the-art categorical data clustering algorithms with respect to clustering accuracy.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Clustering; Categorical data; Mutual information; Cluster ensemble; Data mining

1. Introduction

Clustering is an important data mining technique that groups together similar data objects. Recently, much attention has been paid to clustering categorical data [1-21,26-31], where data objects are made up of non-numerical attributes. Fast and accurate clustering of categorical data has many potential applications in customer relationship management, e-commerce intelligence, etc.

In [21], the categorical data clustering problem is defined as an optimization problem using a mutual information sharing based objective function (namely, average normalized mutual information - ANMI) from the viewpoint of cluster ensembles [22-24]. However, the algorithms in [21] were developed from intuitive heuristics rather than from the vantage point of direct optimization, which cannot guarantee finding a reasonable solution.

The k-means algorithm is one of the most popular clustering techniques for finding structure in unlabeled data sets. It is based on a simple iterative scheme for finding a locally minimal solution and is well known for its efficiency in clustering large data sets. This paper proposes the k-ANMI algorithm, a new k-means-like clustering algorithm for categorical data that directly optimizes the mutual information sharing based objective function.

The k-ANMI algorithm takes the number of desired clusters (say k) as input and iteratively changes the class label of each data object to improve the value of the objective function. That is, the current label of each object is changed to each of the other k - 1 possible labels and the ANMI objective is re-evaluated. If the ANMI increases, the object's label is changed to the best new value and the algorithm proceeds to the next object. When all objects have been checked for possible improvements, one cycle is completed. If at least one object's label was changed in a cycle, we initiate a new cycle. The algorithm terminates when a full cycle does not change any labels, thereby indicating that a local optimum has been reached.

* Corresponding author. Tel.: +86 451 86414906x8001.
E-mail address: zengyouhe@yahoo.com (Z. He).

1566-2535/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.inffus.2006.05.006

The basic idea of k-ANMI is very simple; it has also been exploited in the cluster ensemble literature [22] and implemented in a straightforward manner. However, such a straightforward implementation can be extremely slow due to its exponential time complexity. As reported in [22], the average running time of the straightforward implementation is about one hour per run on a 1 GHz PC, even on a small dataset with 400 objects and eight attributes when k = 10.

Hence, it is a great research challenge to implement the k-ANMI algorithm in an efficient way such that it is scalable to large datasets. The goal of this paper is to present a simple and efficient implementation of the k-ANMI algorithm. To that end, we employ multiple hash tables to improve its efficiency. More precisely, in a general clustering problem with r attributes and k clusters, we need only (k + 1)r hash tables as our major data structure. Through the use of these hash tables, the ANMI value can be derived without accessing the original dataset, so the computation is very efficient.

We also provide an analysis of the time and space complexity of the k-ANMI algorithm. The analysis shows that the computational complexity of the k-ANMI algorithm is linear in both the number of objects and the number of attributes; thus it is capable of handling large categorical datasets. Experimental results on both real and synthetic datasets provide further evidence of the superiority of the k-ANMI algorithm.

The remainder of this paper is organized as follows. Section 2 presents a critical review of related work. Section 3 introduces basic concepts and formulates the problem. In Section 4, we present the k-ANMI algorithm and provide a complexity analysis. Experimental studies and conclusions are given in Sections 5 and 6, respectively.

2. Related work

Many algorithms have been proposed in recent years for clustering categorical data [1-21,26-31]. In [1], an association rule based clustering method is proposed for clustering customer transactions in a market database. Algorithms for clustering categorical data based on non-linear dynamical systems are studied in [2,3].

The k-modes algorithm [4,5] extends the k-means paradigm to the categorical domain by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes. Based on the k-modes algorithm, Ref. [6] proposes an adapted mixture model for categorical data, which gives a probabilistic interpretation of the criterion optimized by the k-modes algorithm. A fuzzy k-modes algorithm is presented in [7], and the tabu search technique is applied in [8] to improve the fuzzy k-modes algorithm. An iterative initial-points refinement algorithm for k-modes clustering is presented in [9]. The work in [19] can be considered as an extension of the k-modes algorithm to transactional data.

Based on a novel formalization of a cluster for categorical data, a fast summarization based algorithm, CACTUS, is presented in [10]. The ROCK algorithm [11] is an adaptation of agglomerative hierarchical clustering based on a novel "link" based distance measure. Squeezer [16] is a threshold based one-pass categorical data clustering algorithm, which is also suitable for clustering categorical data streams.

Based on the notion of large items, an allocation and refinement strategy based algorithm is presented for clustering transactions [12]. Following the large item method in [12], a new measurement, called the small-large ratio, is proposed and utilized to perform the clustering [13]. In [14], the authors consider the item taxonomy in performing cluster analysis. Xu and Sung [15] propose an algorithm based on "caucuses", which are fine-partitioned demographic groups based on the purchase features of customers.

COOLCAT [17] is an entropy-based algorithm for categorical clustering. CLOPE [18] is an iterative clustering algorithm that is based on the concept of the height-to-width ratio of the cluster histogram. Based on the notion of generalized conditional entropy, a genetic algorithm is utilized for discovering categorical clusters in [20]. LIMBO, introduced in [27], is a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck framework. Li et al. [28] show that the entropy-based criterion in categorical data clustering can be derived in the formal framework of probabilistic clustering models and develop an iterative Monte-Carlo procedure to search for the partitions minimizing the criterion.

He et al. [21] formally define the categorical clustering problem as an optimization problem from the viewpoint of cluster ensembles, and apply the cluster ensemble approach to clustering categorical data. Simultaneously, Gionis et al. [30] use a disagreement measure based cluster ensemble method to solve the problem of categorical clustering. Chen and Chuang [26] develop the CORE algorithm by employing the concept of correlated-force ensembles. He et al. [29] propose the TCSOM algorithm for clustering binary data by extending the traditional self-organizing map (SOM). Chang and Ding [31] present a method for visualizing clustered categorical data such that users' subjective factors can be reflected by adjusting clustering parameters, thereby increasing the reliability of the clustering results.

Categorical data clustering can also be considered as a special case of symbolic data clustering [32-36], in which categorical values are taken by the attributes of symbolic objects. However, most of the techniques used in the literature for clustering symbolic data are based on the hierarchical methodology, which is not efficient for clustering large data sets.

3. Introductory concepts and problem formulation

3.1. Notations

Let A1, . . . , Ar be a set of categorical attributes with domains D1, . . . , Dr, respectively. Let the dataset D = {X1, X2, . . . , Xn} be a set of objects described by the r categorical attributes A1, . . . , Ar.

The value set Vi of Ai is the set of values of Ai that are present in D. For each v ∈ Vi, the frequency f(v), denoted as fv, is the number of objects O ∈ D with O.Ai = v. Suppose the number of distinct attribute values of Ai is pi; we define the histogram of Ai as the set of pairs hi = {(v1, f1), (v2, f2), . . . , (vpi, fpi)}, and the size of hi is pi. The histogram of dataset D is defined as H = {h1, h2, . . . , hr}.

Let X, Y be two categorical objects described by r categorical attributes. The dissimilarity measure or distance between X and Y can be defined as the total number of mismatches of the corresponding attribute values of the two objects; the smaller the number of mismatches, the more similar the two objects. Formally,

d_1(X, Y) = \sum_{j=1}^{r} \delta(x_j, y_j)    (1)

where

\delta(x_j, y_j) = \begin{cases} 0 & (x_j = y_j) \\ 1 & (x_j \neq y_j) \end{cases}    (2)

Given the dataset D = {X1, X2, . . . , Xn} and an object Y, the dissimilarity measure between D and Y can be defined as the average of the distances between the Xj and Y:

d_2(D, Y) = \frac{\sum_{j=1}^{n} d_1(X_j, Y)}{n}    (3)

If we take the histograms H = {h1, h2, . . . , hr} as the compact representation of dataset D, formula (3) can be redefined as (4):

d_3(H, Y) = \frac{\sum_{j=1}^{r} \phi(h_j, y_j)}{n}    (4)

where

\phi(h_j, y_j) = \sum_{l=1}^{p_j} f_l \cdot \delta(v_l, y_j)    (5)

From the viewpoint of implementation efficiency, formula (4) can be presented in the form of formula (6):

d_4(H, Y) = \frac{\sum_{j=1}^{r} \psi(h_j, y_j)}{n}    (6)

where

\psi(h_j, y_j) = n - \sum_{l=1}^{p_j} f_l \cdot (1 - \delta(v_l, y_j))    (7)

If each histogram is realized in the form of a hash table, formula (6) can be computed more efficiently, since only the frequencies of matched attribute value pairs are required.
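To make the hash-table realization concrete, the following Python sketch (our own illustration; the function name d4 and the dictionary layout are assumptions, not code from the paper) evaluates formula (6) using one dictionary per attribute as the hash table.

# Minimal sketch of the histogram-based distance d4 (formulas (6) and (7)).
# Each histogram h_j is a dict mapping an attribute value to its frequency,
# so only the matched attribute value needs to be looked up, as noted above.

def d4(histograms, y, n):
    """Distance between a dataset summarized by `histograms` and object `y`.

    histograms : list of dicts, one per attribute (h_1, ..., h_r)
    y          : list of attribute values (y_1, ..., y_r)
    n          : number of objects summarized by the histograms
    """
    total = 0
    for h_j, y_j in zip(histograms, y):
        # psi(h_j, y_j) = n - f(y_j), because (1 - delta(v_l, y_j)) is
        # non-zero only for the single entry with v_l == y_j.
        total += n - h_j.get(y_j, 0)
    return total / n

# Example with two attributes over a 10-object dataset:
H = [{"M": 5, "F": 5}, {"A": 3, "B": 3, "C": 4}]
print(d4(H, ["M", "C"], 10))   # ((10 - 5) + (10 - 4)) / 10 = 1.1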

3.2. A unified view in the cluster ensemble framework

Cluster ensemble (CE) is a method that combines several runs of different clustering algorithms to obtain a common partition of the original dataset, aiming to consolidate the results from a portfolio of individual clustering results. Clustering aims at discovering groups and identifying interesting patterns in a dataset. We call a particular clustering algorithm with a specific view of the data a clusterer. Each clusterer outputs a clustering or labeling, comprising the group labels for some or all objects.

Given a dataset D = {X1, X2, . . . , Xn}, a partitioning of these n objects into k clusters can be represented as a set of k sets of objects {Cl | l = 1, . . . , k} or as a label vector λ ∈ N^n. A clusterer Φ is a function that delivers a label vector given a set of objects. Fig. 1 (adapted from [24]) shows the basic setup of the cluster ensemble: a set of r labelings λ^(1,2,...,r) is combined into a single labeling λ (the consensus labeling) using a consensus function Γ.

Fig. 1. The cluster ensemble. A consensus function Γ combines clusterings λ^(q) from a variety of sources.

As shown in [21], the categorical data clustering problem can be considered as a cluster ensemble problem. That is, for a categorical dataset, if we consider attribute values as cluster labels, each attribute with its attribute values provides a "best clustering" of the dataset without considering the other attributes. Hence, the categorical data clustering problem can be considered as a cluster ensemble problem in which the attribute values of each attribute define the outputs of different clustering algorithms.

More precisely, for a given dataset D = {X1, X2, . . . , Xn} with r categorical attributes A1, . . . , Ar, let Vi be the set of attribute values of Ai that are present in D. According to the CE framework described in Fig. 1, if we define each clusterer Φ^(i) as a function that transforms the values in Vi into distinct natural numbers, we obtain the optimal partitioning λ^(i) determined by each attribute Ai as λ^(i) = {Φ^(i)(Xj.Ai) | Xj.Ai ∈ Vi, Xj ∈ D}. Then, we can combine the set of r labelings λ^(1,2,...,r) into a single labeling λ using a consensus function Γ to obtain the solution of the original categorical data clustering problem.
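As a small illustration of this view (our own sketch, with invented names), each attribute column can be turned into a labeling λ^(i) simply by encoding its values as distinct natural numbers.

# Sketch: derive one labeling per attribute by encoding attribute values
# as distinct natural numbers (the clusterer Phi^(i) of Section 3.2).

def attribute_labelings(dataset):
    """dataset: list of objects, each a list of r categorical values.
    Returns one label vector per attribute."""
    r = len(dataset[0])
    labelings = []
    for i in range(r):
        codes = {}                      # attribute value -> natural number
        column = [row[i] for row in dataset]
        labelings.append([codes.setdefault(v, len(codes)) for v in column])
    return labelings

data = [["M", "A"], ["M", "B"], ["F", "B"], ["F", "A"]]
print(attribute_labelings(data))  # [[0, 0, 1, 1], [0, 1, 1, 0]]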

3.3. Objective function

Continuing Section 3.2, intuitively, a good combined clustering should share as much information as possible with the given r labelings. Strehl and Ghosh [22-24] use mutual information from information theory to measure this shared information, and their formulation can be applied directly in our setting.

More precisely, as shown in [23,24], given r groupings, with the qth grouping λ^(q) having k^(q) clusters, a consensus function Γ is defined as a function N^{n \times r} \to N^n mapping a set of clusterings to an integrated clustering:

\Gamma : \{\lambda^{(q)} \mid q \in \{1, 2, \ldots, r\}\} \to \lambda    (8)

The set of groupings is denoted as Λ = {λ^(q) | q ∈ {1, 2, . . . , r}}. The optimal combined clustering should share the most information with the original clusterings.

In information theory, mutual information is a symmetric measure that quantifies the statistical information shared between two distributions. Let A and B be the random variables described by the cluster labelings λ^(a) and λ^(b), with k^(a) and k^(b) groups respectively. Let I(A, B) denote the mutual information between A and B, and H(A) denote the entropy of A. As Strehl has shown in [24], I(A, B) \leq \frac{H(A) + H(B)}{2} holds. Hence, the [0, 1]-normalized mutual information (NMI) [24] used is

NMI(A, B) = \frac{2 I(A, B)}{H(A) + H(B)}    (9)

Obviously, NMI(A, A) = 1. Eq. (9) has to be estimated from the sampled quantities provided by the clusterings [24]. As shown in [24], let n^(h) be the number of objects in cluster Ch according to λ^(a), let n_g be the number of objects in cluster Cg according to λ^(b), and let n_g^(h) be the number of objects in cluster Ch according to λ^(a) as well as in cluster Cg according to λ^(b). The [0, 1]-normalized mutual information criterion φ^(NMI) is computed as follows [23,24]:

\phi^{(NMI)}(\lambda^{(a)}, \lambda^{(b)}) = \frac{2}{n} \sum_{h=1}^{k^{(a)}} \sum_{g=1}^{k^{(b)}} n_g^{(h)} \log_{k^{(a)} k^{(b)}} \left( \frac{n_g^{(h)} \, n}{n^{(h)} \, n_g} \right)    (10)

Therefore, the average normalized mutual information (ANMI) between a set of r labelings, Λ, and a labeling λ̃ is defined as follows [24]:

\phi^{(ANMI)}(\Lambda, \tilde{\lambda}) = \frac{1}{r} \sum_{q=1}^{r} \phi^{(NMI)}(\tilde{\lambda}, \lambda^{(q)})    (11)

According to [23,24], the optimal combined clustering λ^(k-opt) should be defined as the one that has the maximal average mutual information with all the individual partitionings λ^(q), given that the desired number of consensus clusters is k. Hence, λ^(k-opt) is defined as

\lambda^{(k\text{-}opt)} = \arg\max_{\tilde{\lambda}} \sum_{q=1}^{r} \phi^{(NMI)}(\tilde{\lambda}, \lambda^{(q)})    (12)

where λ̃ ranges over all possible k-partitions.

Taking ANMI as the objective function of our k-ANMI algorithm, we have to compute the value of φ^(NMI); more precisely, we should be able to obtain the exact value of each parameter in Eq. (10) efficiently. In the next section, we describe our data structure and the k-ANMI algorithm in detail.
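For reference, Eqs. (10) and (11) can be evaluated directly from label vectors as in the following Python sketch; this is our own illustrative code rather than the authors' implementation, and it ignores the histogram-based shortcut introduced in Section 4.

import math
from collections import Counter

def phi_nmi(la, lb):
    """Normalized mutual information of Eq. (10) between two label vectors."""
    n = len(la)
    ka, kb = len(set(la)), len(set(lb))
    if ka * kb == 1:                      # degenerate single-cluster case
        return 1.0
    nh = Counter(la)                      # n^(h): cluster sizes under lambda^(a)
    ng = Counter(lb)                      # n_g : cluster sizes under lambda^(b)
    nhg = Counter(zip(la, lb))            # n_g^(h): joint counts
    total = 0.0
    for (h, g), c in nhg.items():
        total += c * math.log(c * n / (nh[h] * ng[g]), ka * kb)
    return 2.0 * total / n

def phi_anmi(labelings, lam):
    """Average NMI of Eq. (11) between a set of labelings and lambda."""
    return sum(phi_nmi(lam, lq) for lq in labelings) / len(labelings)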
4. The k-ANMI algorithm

4.1. Overview

The k-ANMI algorithm takes the number of desired clusters (say k) as input and iteratively changes the class label of each data object to improve the value of the objective function. That is, for each object, the current label is changed to each of the other k - 1 possible labels and the ANMI objective is re-evaluated. If the ANMI increases, the object's label is changed to the best new value and the algorithm proceeds to the next object. When all objects have been checked for possible improvements, a sweep is completed. If at least one label was changed in a sweep, we initiate a new sweep. The algorithm terminates when a full sweep does not change any labels, thereby indicating that a local optimum has been reached.

4.2. Data structure

Taking the dataset D = {X1, X2, . . . , Xn} with r categorical attributes A1, . . . , Ar and the number of desired clusters k as input, we need (k + 1)r hash tables as our basic data structure. Each hash table is the materialization of a histogram. The concept and structure of histograms have been discussed in Section 3.1; in the remainder of the paper, we use histogram and hash table interchangeably.

As discussed in Section 3.2, each attribute Ai determines an optimal partitioning λ^(i) without considering the other attributes. Storing λ^(i) in its original format is costly with respect to both space and computation. Therefore, we only keep the histogram of Ai on D, denoted as AHi, as the compact representation of λ^(i). Since we have r attributes, r such histograms are constructed.

Suppose that the partition of the n objects into k clusters is represented as a set of k sets of objects {Cl | l = 1, . . . , k} or as a label vector λ̃ ∈ N^n. For each Cl, we construct a histogram for each attribute separately, and we denote the histogram of Ai on Cl as CAHl,i. Hence, we need r histograms for each Cl and rk histograms for λ̃. Overall, we need (k + 1)r histograms in total.
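A minimal sketch of how the (k + 1)r hash tables might be organized is given below; the function name build_histograms and the use of Python dictionaries are our own choices for illustration, not the authors' Java data structures.

from collections import defaultdict

def build_histograms(dataset, labels, k):
    """Build AH_i (per attribute, whole dataset) and CAH_{l,i} (per cluster,
    per attribute) as dictionaries mapping value -> frequency.

    dataset : list of objects, each a list of r categorical values
    labels  : cluster label (0..k-1) of each object
    k       : number of clusters
    """
    r = len(dataset[0])
    AH = [defaultdict(int) for _ in range(r)]
    CAH = [[defaultdict(int) for _ in range(r)] for _ in range(k)]
    for obj, l in zip(dataset, labels):
        for i, v in enumerate(obj):
            AH[i][v] += 1
            CAH[l][i][v] += 1
    return AH, CAH   # r + k*r = (k + 1)r hash tables in total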

Example 1. Table 1 shows a categorical dataset with 10 objects, each described by two categorical attributes. If only "Attribute 1" is considered, we obtain the optimal partitioning {(1, 2, 5, 7, 10), (3, 4, 6, 8, 9)} with two clusters. Similarly, "Attribute 2" gives an optimal partitioning {(1, 4, 9), (2, 3, 10), (5, 6, 7, 8)} with three clusters.

Suppose that we are testing a candidate partition λ̃ = {(1, 2, 3, 4, 5), (6, 7, 8, 9, 10)} when k = 2. In this case, we need (2 + 1) · 2 = 6 histograms, which are shown in Fig. 2. Among these six histograms, AH1 and AH2 are the histograms of "Attribute 1" and "Attribute 2" with respect to the original dataset. The histograms of "Attribute 1" and "Attribute 2" with respect to cluster Cl in λ̃ are CAHl,1 and CAHl,2, where l = 1 and 2. It is not difficult to obtain the frequency values in each histogram by counting the corresponding attribute values.

Table 1
Sample categorical data set

Object number   Attribute 1   Attribute 2
1               M             A
2               M             B
3               F             B
4               F             A
5               M             C
6               F             C
7               M             C
8               F             C
9               F             A
10              M             B

Fig. 2. The six histograms in Example 1:
AH1    = {(M, 5), (F, 5)}      AH2    = {(A, 3), (B, 3), (C, 4)}
CAH1,1 = {(M, 3), (F, 2)}      CAH1,2 = {(A, 2), (B, 2), (C, 1)}
CAH2,1 = {(M, 2), (F, 3)}      CAH2,2 = {(A, 1), (B, 1), (C, 3)}

4.3. Computation of ANMI

In this section, we show how to use the histograms introduced in Section 4.2 to compute the ANMI value. To compute the ANMI between a set of r labelings, Λ, and a labeling λ̃, we only need to compute φ^(NMI)(λ̃, λ^(i)) for each λ^(i) ∈ Λ. Therefore, we focus on the computation of φ^(NMI)(λ̃, λ^(i)). To be consistent with the description in Section 3.3, we use φ^(NMI)(λ^(a), λ^(b)) instead of φ^(NMI)(λ̃, λ^(i)) for illustration, setting λ^(a) = λ̃ and λ^(b) = λ^(i).

Recall from Eq. (10) that \phi^{(NMI)}(\lambda^{(a)}, \lambda^{(b)}) = \frac{2}{n} \sum_{h=1}^{k^{(a)}} \sum_{g=1}^{k^{(b)}} n_g^{(h)} \log_{k^{(a)} k^{(b)}} ( \frac{n_g^{(h)} n}{n^{(h)} n_g} ), where n^(h) is the number of objects in cluster Ch according to λ^(a), n_g is the number of objects in cluster Cg according to λ^(b), n_g^(h) is the number of objects in cluster Ch according to λ^(a) as well as in cluster Cg according to λ^(b), k^(a) is the number of clusters in λ^(a), and k^(b) is the number of clusters in λ^(b).

To compute the value of φ^(NMI)(λ^(a), λ^(b)), we must know six values: n, k^(a), k^(b), n^(h), n_g and n_g^(h).

(1) The value of n is the number of objects in the given dataset, which is a constant in a clustering problem.
(2) Since λ^(a) = λ̃, we have k^(a) = k.
(3) Since λ^(b) is the partition derived from attribute Ab, k^(b) is equal to the size of AHb, where AHb is the histogram of Ab on D. Note that the value of k^(b) can be directly derived from the corresponding histogram.
(4) The value of n^(h) is equal to the sum of the frequencies of the attribute values in any histogram CAHh,i, where 1 ≤ i ≤ r.
(5) Suppose the cluster Cg in λ^(b) is determined by attribute value v; then n_g is equal to the frequency value of v in histogram AHb.
(6) As in (5), suppose the cluster Cg in λ^(b) is determined by attribute value v; then n_g^(h) is equal to the frequency value of v in histogram CAHh,b if v has an entry in CAHh,b. Otherwise, n_g^(h) = 0.

From (1)-(6), we know that the ANMI value can be computed through the use of the histograms without accessing the original dataset. Thus, the computation is very efficient.

Example 2. Continuing Example 1, suppose that we are trying to compute φ^(NMI)(λ^(a), λ^(b)), where λ^(a) = λ̃ = {(1, 2, 3, 4, 5), (6, 7, 8, 9, 10)} and λ^(b) = {(1, 2, 5, 7, 10), (3, 4, 6, 8, 9)}. In this case, we have n = 10, k^(a) = k = 2 and k^(b) = 2. Furthermore, suppose that Ch = (1, 2, 3, 4, 5) and Cg = (1, 2, 5, 7, 10). We have n^(h) = 3 + 2 = 5 (according to CAH1,1) or n^(h) = 2 + 2 + 1 = 5 (according to CAH1,2), n_g = 5 (the frequency value of "M" in histogram AH1) and n_g^(h) = 3 (the frequency value of "M" in histogram CAH1,1).
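Putting points (1)-(6) together, φ^(NMI)(λ̃, λ^(b)) can be evaluated from the hash tables alone. The following Python sketch is our own illustration of this computation; it assumes the histogram layout of the build_histograms sketch given after Section 4.2 and is not the authors' code.

import math

def phi_nmi_from_histograms(AH, CAH, b, n, k):
    """Evaluate phi^(NMI)(lambda~, lambda^(b)) using only the histograms,
    following points (1)-(6) of Section 4.3.

    AH, CAH : histograms as returned by build_histograms
    b       : index of the attribute A_b that defines lambda^(b)
    n       : number of objects; k : number of clusters in lambda~
    """
    kb = len(AH[b])                        # point (3): size of AH_b
    if k * kb == 1:
        return 1.0
    total = 0.0
    for h in range(k):
        n_h = sum(CAH[h][b].values())      # point (4): size of cluster C_h
        if n_h == 0:
            continue
        for v, n_g in AH[b].items():       # point (5): n_g = frequency of v
            n_hg = CAH[h][b].get(v, 0)     # point (6): joint count, 0 if absent
            if n_hg > 0:
                total += n_hg * math.log(n_hg * n / (n_h * n_g), k * kb)
    return 2.0 * total / n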
4.4. The algorithm

Fig. 3 shows the k-ANMI algorithm. The collection of objects is stored in a file on disk, and we read each object t in sequence.

In the initialization phase of the k-ANMI algorithm, we first select the first k objects from the dataset to construct the initial histograms for each cluster. Each subsequent object is put into the closest cluster according to formula (6), and the cluster label of each object is stored. At the same time, the histogram of the partition derived from each attribute is also constructed and updated.

In the iteration phase, we read each object t (in the same order as in the initialization phase) and move t to an existing cluster (possibly the one it is already in) so as to maximize ANMI. After each move, the cluster identifier is updated. If no object is moved in one pass over all objects, the iteration phase terminates; otherwise, a new pass begins. Essentially, at each step we locally optimize the ANMI criterion. The key step is to find the destination cluster for moving an object according to the value of ANMI; how to efficiently compute the ANMI value using histograms has been discussed in Section 4.3.

Algorithm k-ANMI
Input:  D  // the categorical database
        k  // the number of desired clusters
Output: clustering of D

/* Phase 1 - Initialization */
01 Begin
02   foreach object t in D
03     counter++
04     update histograms for each attribute
05     if counter <= k then
06       put t into cluster Ci where i = counter
07     else
08       put t into the cluster Ci to which t has the smallest distance
09     write <t, i>

/* Phase 2 - Iteration */
10 Repeat
11   not_moved = true
12   while not end of the database do
13     read next object <t, Ci>
14     move t to an existing cluster Cj so as to maximize ANMI
15     if Ci != Cj then
16       write <t, j>
17       not_moved = false
18 Until not_moved
19 End

Fig. 3. The k-ANMI algorithm.
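To complement the pseudocode of Fig. 3, the sketch below mirrors the two phases at a high level in Python. It is our own illustration (in-memory rather than file-based, with a simplified initialization and an externally supplied ANMI evaluator), not the authors' Java implementation.

def k_anmi(dataset, k, anmi):
    """Sketch of the two phases of Fig. 3.

    anmi(labels) -> float : ANMI of a candidate labeling, e.g. evaluated
    from the (k + 1)r histograms as described in Section 4.3.
    """
    n = len(dataset)
    # Phase 1 - Initialization (simplified): the first k objects seed
    # clusters 0..k-1; the paper assigns every later object to the closest
    # cluster by formula (6), whereas here it simply starts in cluster 0.
    labels = [i if i < k else 0 for i in range(n)]

    # Phase 2 - Iteration: move each object to the label that maximizes
    # ANMI; repeat full sweeps until no object moves (a local optimum).
    moved = True
    while moved:
        moved = False
        for i in range(n):
            original = labels[i]
            best_label, best_score = original, anmi(labels)
            for c in range(k):
                if c == original:
                    continue
                labels[i] = c              # tentative move
                score = anmi(labels)
                if score > best_score:
                    best_label, best_score = c, score
            labels[i] = best_label         # keep the best label found
            if best_label != original:
                moved = True
    return labels

A real implementation would not re-evaluate ANMI from scratch for every tentative move; it would instead update the two affected cluster histograms (the CAH tables) incrementally, which is the source of the efficiency analyzed in Section 4.5.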

4.5. Time and space complexities

4.5.1. Worst-case analysis

The time and space complexities of the k-ANMI algorithm depend on the size of the dataset (n), the number of attributes (r), the number of histograms, the size of each histogram, the number of clusters (k) and the number of iterations (I).

To simplify the analysis, we assume that every attribute has the same number of distinct attribute values, p. Then, in the worst case, the initialization phase has time complexity O(nkrp). In the iteration phase, one computation of ANMI requires O(rp^2 k), and hence this phase has time complexity O(Ink^2 rp^2). In total, the algorithm has worst-case time complexity O(Ink^2 rp^2).

The algorithm only needs to store the (k + 1)r histograms and the original dataset in main memory, so the space complexity of the algorithm is O(rkp + nr).

4.5.2. Practical analysis

As pointed out in [10], categorical attributes usually have small domains. An important implication of the compactness of categorical domains is that the parameter p can be expected to be very small. Furthermore, the use of the hashing technique in the histograms also reduces the impact of p. So, in practice, the time complexity of k-ANMI can be expected to be O(Ink^2 rp).

The above analysis shows that the time complexity of k-ANMI is linear in the size of the dataset, the number of attributes and the number of iterations, which gives the algorithm good scalability.

4.6. Enhancement for real applications

The data sets in real-life applications are usually complex: they contain not only categorical but also numeric data, and sometimes they are incomplete. The proposed method cannot be applied to such data unless pre-processing techniques are applied before performing the clustering process. Many pre-processing techniques and tools are available in the literature. In this section, we discuss techniques for handling data with these characteristics in k-ANMI.

4.6.1. Handling numeric data

To process numeric data, we apply the widely used binning technique [37] to discretize the numeric data, or use a well-designed numeric clustering algorithm to transform the numeric data into categorical class labels.
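As a deliberately simple illustration of the binning step, the sketch below uses equal-width bins; the paper only cites binning in general [37], so this particular discretization is our assumption, not necessarily what the authors used.

def equal_width_bins(values, num_bins=5):
    """Discretize a numeric column into categorical bin labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0   # guard against a constant column
    labels = []
    for v in values:
        b = min(int((v - lo) / width), num_bins - 1)
        labels.append("bin%d" % b)
    return labels

print(equal_width_bins([1.0, 2.5, 3.7, 9.9, 10.0], num_bins=3))
# ['bin0', 'bin0', 'bin0', 'bin2', 'bin2']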

4.6.2. Handling missing values

To handle incomplete data, we provide two choices. First, missing values in an incomplete object are simply not considered when updating the histograms. Second, missing values are treated as special categorical attribute values. In our current implementation, we use the second option.

5. Experimental results

A performance study has been conducted to evaluate our method. In this section, we describe the experiments and their results. We ran our algorithm on real-life datasets obtained from the UCI Machine Learning Repository [25] to test its clustering performance against other algorithms. Furthermore, one larger synthetic dataset is used to demonstrate the scalability of our algorithm.

5.1. Real-life datasets and evaluation method

We experimented with three real-life datasets: the Congressional Votes dataset, the Mushroom dataset and the Wisconsin Breast Cancer dataset, all obtained from the UCI Repository [25]. We now give a brief introduction to these datasets.

• Congressional Votes data: the United States Congressional Voting Records of 1984. Each record represents one Congressman's votes on 16 issues. All attributes are Boolean with Yes (denoted as y) and No (denoted as n) values. A classification label of Republican or Democrat is provided with each record. The dataset contains 435 records, with 168 Republicans and 267 Democrats.
• Mushroom data: it has 22 attributes and 8124 records. Each record represents the physical characteristics of a single mushroom. A classification label of poisonous or edible is provided with each record. The numbers of edible and poisonous mushrooms in the dataset are 4208 and 3916, respectively.
• Wisconsin Breast Cancer data¹: it has 699 instances with nine attributes. Each record is labeled as benign (458 records, or 65.5%) or malignant (241 records, or 34.5%). In this paper, all attributes are considered to be categorical.

¹ We use a dataset that is slightly different from its original format in the UCI Machine Learning Repository; it has 683 instances with 444 benign records and 239 malignant records. It is publicly available at: http://research.cmis.csiro.au/rohanb/outliers/breast-cancer/brcancerall.dat.

Validating clustering results is a non-trivial task. In the presence of true labels, as is the case for the datasets we used, the clustering accuracy for measuring the clustering results was computed as follows. Given the final number of clusters, k, the clustering accuracy is defined as \frac{\sum_{i=1}^{k} a_i}{n}, where n is the number of objects in the dataset and ai is the number of objects with the class label that dominates cluster i. Consequently, the clustering error is defined as 1 - \frac{\sum_{i=1}^{k} a_i}{n}.

The intuition behind the clustering error defined above is that clusterings with "pure" clusters, i.e., clusters in which all objects have the same class label, are preferable. That is, if a partition has a clustering error equal to 0, it contains only pure clusters. Such clusters are also interesting from a practical perspective. Hence, we can conclude that a smaller clustering error means better clustering results in real-world applications.

It should be noted that some well-known clustering validity indices are available in the literature and have recently been extended to symbolic data [34]. Hence, we also evaluated our results using these clustering validity indices in our experiments. As expected, we observed that these validity indices coincide with the clustering error defined above; in other words, the results of the performance comparisons using the clustering error and other validity indices are very similar. Therefore, the experimental results with other validity indices are omitted in this paper.
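For concreteness, the clustering error defined above can be computed as in the following sketch (our own illustrative code, not part of the paper's evaluation tools): for each cluster, count the objects carrying its dominant true class label, sum these counts, and subtract the resulting accuracy from one.

from collections import Counter, defaultdict

def clustering_error(cluster_labels, true_labels):
    """1 - (sum of dominant-class counts per cluster) / n."""
    by_cluster = defaultdict(list)
    for c, t in zip(cluster_labels, true_labels):
        by_cluster[c].append(t)
    dominant = sum(Counter(members).most_common(1)[0][1]
                   for members in by_cluster.values())
    return 1.0 - dominant / len(true_labels)

# Two clusters over six objects with true classes a/b:
print(clustering_error([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"]))
# 1 - (2 + 2)/6 = 0.333...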
5.2. Experiment design

We studied the clusterings found by five algorithms: our k-ANMI algorithm, the Squeezer algorithm introduced in [16], the GAClust algorithm proposed in [20], the standard k-modes algorithm [4,5] and the ccdByEnsemble algorithm of [21].

To date, there is no well-recognized standard methodology for categorical data clustering experiments. However, we observed that most clustering algorithms require the number of clusters as an input parameter, so in our experiments we cluster each dataset into different numbers of clusters, varying from 2 to 9. For each fixed number of clusters, the clustering errors of the different algorithms are compared.

In all the experiments, except for the number of clusters, all the parameters required by the ccdByEnsemble algorithm are set to their defaults as in [21]. The Squeezer algorithm requires a similarity threshold as an input parameter, so we set this parameter to a proper value to obtain the desired number of clusters. In the GAClust algorithm, we set the population size to 50 and set the other parameters to their default values². In the k-modes algorithm, the initial k modes are constructed by selecting the first k objects from the dataset.

² The source code for GAClust is publicly available at: http://www.cs.umb.edu/~dana/GAClust/index.html. The readers may refer to this site for details about the other parameters.

Fig. 4. Clustering error vs. different number of clusters (votes dataset). (Clustering error, 0-0.45, plotted against the number of clusters, 2-9, for Squeezer, GAClust, ccdByEnsemble, k-modes and k-ANMI.)

Moreover, since the clustering results of the k-ANMI, ccdByEnsemble, k-modes and Squeezer algorithms are fixed for a particular dataset once the parameters are fixed, only one run is used for these algorithms. The GAClust algorithm is a genetic algorithm, so its outputs may differ between runs. However, we observed in the experiments that its clustering error is very stable, so the clustering error of this algorithm is reported for its first run. In summary, we use one run to obtain the clustering errors of all algorithms.

5.3. Clustering results on congressional voting (votes) data

Fig. 4 shows the results of the different clustering algorithms on the votes dataset. From Fig. 4, we can summarize the relative performance of these algorithms as in Table 2. In Table 2, the numbers in the column labeled by rank j (j = 1, 2, 3, 4 or 5) give the number of times an algorithm achieved rank j among these algorithms. For instance, in the eight experiments, the k-ANMI algorithm performed third best in two cases, that is, it was ranked 3 two times.

Compared to the other algorithms, the k-ANMI algorithm performed best in most cases and never performed worst, and its average clustering error is significantly smaller than that of the other algorithms. Thus, the clustering performance of k-ANMI on the votes dataset is superior to that of all four other algorithms.

Table 2
Relative performance of different clustering algorithms (votes dataset)

Ranking          1   2   3   4   5   Average clustering error
Squeezer         0   0   2   1   5   0.163
GAClust          1   0   2   2   3   0.136
ccdByEnsemble    1   3   0   4   0   0.115
k-modes          2   4   1   1   0   0.097
k-ANMI           5   1   2   0   0   0.092

5.4. Clustering results on mushroom data

The experimental results on the mushroom dataset are described in Fig. 5, and the relative performance is summarized in Table 3. As shown in Fig. 5 and Table 3, our algorithm beats all the other algorithms in average clustering error. Furthermore, although the k-ANMI algorithm did not always perform best on this dataset, it performed best in five cases and never performed worst. That is, the k-ANMI algorithm performed best in the majority of the cases.

Moreover, the results of the k-ANMI algorithm are significantly better than those of the ccdByEnsemble algorithm in most cases. This demonstrates that the direct optimization strategy utilized in k-ANMI is more desirable than the intuitive heuristics in the ccdByEnsemble algorithm.

Table 3
Relative performance of different clustering algorithms (mushroom dataset)

Ranking          1   2   3   4   5   Average clustering error
Squeezer         1   1   4   0   2   0.206
GAClust          0   1   1   2   4   0.393
ccdByEnsemble    2   1   0   3   2   0.315
k-modes          0   5   0   3   0   0.206
k-ANMI           5   1   2   0   0   0.165

Fig. 5. Clustering error vs. different number of clusters (mushroom dataset). (Clustering error, 0-0.6, plotted against the number of clusters, 2-9, for the same five algorithms.)

5.5. Clustering results on Wisconsin Breast Cancer (cancer) data



Fig. 6. Clustering error vs. different number of clusters (cancer dataset). (Clustering error, 0-0.25, plotted against the number of clusters, 2-9, for the same five algorithms.)

The experimental results on the cancer dataset are described in Fig. 6, and the relative performance of the five algorithms is summarized in Table 4. From Fig. 6 and Table 4, some important observations can be made:

(1) Our algorithm beats all the other algorithms with respect to average clustering error.
(2) The k-ANMI algorithm performed best in almost all cases (except when the number of clusters is 4 or 5); furthermore, in almost every case the k-ANMI algorithm achieves better output than the ccdByEnsemble algorithm, which verifies the effectiveness of the direct optimization strategy in k-ANMI. In particular, when the number of clusters is set to 2 (the true number of clusters for the cancer dataset), our k-ANMI algorithm obtains a clustering output whose clustering error is significantly smaller than that of the other algorithms.

Table 4
Relative performance of different clustering algorithms (cancer dataset)

Ranking          1   2   3   4   5   Average clustering error
Squeezer         0   0   3   3   2   0.091
GAClust          0   0   1   2   5   0.117
ccdByEnsemble    1   4   1   2   0   0.071
k-modes          1   3   2   2   0   0.070
k-ANMI           6   1   1   0   0   0.039

5.6. Scalability tests

The purpose of this experiment was to test the scalability of the k-ANMI algorithm when handling very large datasets. A synthetic categorical dataset created with the software developed by Cristofor [20] is used. The data size (i.e., the number of objects), the number of attributes and the number of classes are the major parameters of the synthetic categorical data generator; they were set to 100,000, 10 and 10, respectively. Moreover, we set the random generator seed to 5. We refer to this synthetic dataset as DS1.

We tested two types of scalability of the k-ANMI algorithm on this large dataset: first, the scalability against the number of objects for a given number of clusters, and second, the scalability against the number of clusters for a given number of objects. Our k-ANMI algorithm was implemented in Java. All experiments were conducted on a Pentium 4 2.4 GHz machine with 512 MB of RAM running Windows 2000. Fig. 7 shows the results of using k-ANMI to find two clusters with different numbers of objects. Fig. 8 shows the results of using k-ANMI to find different numbers of clusters on the DS1 dataset.

Fig. 7. Scalability of k-ANMI to the number of objects when mining two clusters from the DS1 dataset. (Run time in seconds, 0-200, against the number of objects in units of 10,000, from 1 to 10.)

Fig. 8. Scalability of k-ANMI to the number of clusters when clustering 100,000 objects of the DS1 dataset. (Run time in seconds, 0-2500, against the number of clusters, from 2 to 11.)

One important observation from these figures is that the run time of the k-ANMI algorithm tends to increase linearly as the number of objects increases, which is highly desirable in real data mining applications. Furthermore, although the run time of the k-ANMI algorithm does not increase linearly as the number of clusters increases, it still achieves scalability at an acceptable level.

6. Conclusions

Entropy-based criteria for the heterogeneity of clusters have been used for a long time, and recently they have been applied extensively to categorical data clustering [17,20,21,27,28]. At the same time, the k-means-type algorithm is well known for its efficiency in clustering large data sets. Hence, developing effective and efficient k-means-like algorithms with an entropy-based criterion as the objective function for clustering categorical data is much desired in practice. However, such algorithms have not been available to date. To fill this void, this paper proposes a new k-means-like clustering algorithm called k-ANMI for categorical data, which directly optimizes a mutual information sharing based objective function. The superiority of our algorithm is verified by the experiments.

As we have argued in [21], categorical data clustering and cluster ensemble are essentially two equivalent problems, and algorithms developed in both domains can be used interchangeably. Thus, it is reasonable to employ k-ANMI as an effective algorithm for solving the cluster ensemble problem in practice. Further study of the effectiveness of the k-ANMI algorithm in cluster ensemble applications would be a promising direction for future research.

In light of the fact that a large number of clustering algorithms exist, the proposed k-ANMI algorithm offers some special advantages since it is rooted in cluster ensembles. Firstly, as we have just noted, the k-ANMI algorithm is suitable for both categorical data clustering and cluster ensembles. Secondly, it can easily be deployed for clustering distributed categorical data. Finally, it is flexible in handling heterogeneous data that contains a mix of categorical and numerical attributes.

Besides proposing the k-ANMI algorithm, a further, more implicit contribution of this paper is to provide a general framework for implementing k-means-like algorithms with entropy-based criteria as objective functions in the context of categorical data clustering and cluster ensembles. More precisely, in a general clustering problem with r attributes and k clusters, we can use only (k + 1)r hash tables (histograms) as the basic data structure. With the help of this histogram-based data structure, we are able to develop other kinds of k-means-type algorithms using various entropy-based criteria. Hence, we believe that our idea provides general guidelines for future research on this topic.

Acknowledgements

The comments and suggestions of the anonymous reviewers greatly improved the paper. This work was supported by the High Technology Research and Development Program of China (No. 2004AA413010, No. 2004AA413030), the National Nature Science Foundation of China (No. 40301038) and the IBM SUR Research Fund.

References

[1] E.H. Han, G. Karypis, V. Kumar, B. Mobasher, Clustering based on association rule hypergraphs, in: Proc. of 1997 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997, pp. 9-13.
[2] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: an approach based on dynamical systems, in: Proc. of VLDB'98, 1998, pp. 311-323.
[3] Y. Zhang, A.W. Fu, C.H. Cai, P.A. Heng, Clustering categorical data, in: Proc. of ICDE'00, 2000, pp. 305-305.
[4] Z. Huang, A fast clustering algorithm to cluster very large categorical data sets in data mining, in: Proc. of 1997 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997, pp. 1-8.
[5] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2 (3) (1998) 283-304.
[6] F. Jollois, M. Nadif, Clustering large categorical data, in: Proc. of PAKDD'02, 2002, pp. 257-263.
[7] Z. Huang, M.K. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Transactions on Fuzzy Systems 7 (4) (1999) 446-452.
[8] M.K. Ng, J.C. Wong, Clustering categorical data sets using tabu search techniques, Pattern Recognition 35 (12) (2002) 2783-2790.
[9] Y. Sun, Q. Zhu, Z. Chen, An iterative initial-points refinement algorithm for categorical data clustering, Pattern Recognition Letters 23 (7) (2002) 875-884.
[10] V. Ganti, J. Gehrke, R. Ramakrishnan, CACTUS - clustering categorical data using summaries, in: Proc. of KDD'99, 1999, pp. 73-83.
[11] S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345-366.
[12] K. Wang, C. Xu, B. Liu, Clustering transactions using large items, in: Proc. of CIKM'99, 1999, pp. 483-490.
[13] C.H. Yun, K.T. Chuang, M.S. Chen, An efficient clustering algorithm for market basket data based on small large ratios, in: Proc. of COMPSAC'01, 2001, pp. 505-510.
[14] C.H. Yun, K.T. Chuang, M.S. Chen, Using category based adherence to cluster market-basket data, in: Proc. of ICDM'02, 2002, pp. 546-553.
[15] J. Xu, S.Y. Sung, Caucus-based transaction clustering, in: Proc. of DASFAA'03, 2003, pp. 81-88.
[16] Z. He, X. Xu, S. Deng, Squeezer: an efficient algorithm for clustering categorical data, Journal of Computer Science & Technology 17 (5) (2002) 611-624.
[17] D. Barbara, Y. Li, J. Couto, COOLCAT: an entropy-based algorithm for categorical clustering, in: Proc. of CIKM'02, 2002, pp. 582-589.
[18] Y. Yang, S. Guan, J. You, CLOPE: a fast and effective clustering algorithm for transactional data, in: Proc. of KDD'02, 2002, pp. 682-687.
[19] F. Giannotti, G. Gozzi, G. Manco, Clustering transactional data, in: Proc. of PKDD'02, 2002, pp. 175-187.
[20] D. Cristofor, D. Simovici, Finding median partitions using information-theoretical-based genetic algorithms, Journal of Universal Computer Science 8 (2) (2002) 153-172.
[21] Z. He, X. Xu, S. Deng, A cluster ensemble method for clustering categorical data, Information Fusion 6 (2) (2005) 143-151.
[22] A. Strehl, J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3 (2002) 583-617.
[23] A. Strehl, J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining partitionings, in: Proc. of the 8th National Conference on Artificial Intelligence and 4th Conference on Innovative Applications of Artificial Intelligence, 2002, pp. 93-99.
[24] A. Strehl, Relationship-based clustering and cluster ensembles for high-dimensional data mining, Ph.D. thesis, The University of Texas at Austin, May 2002.
[25] C.J. Merz, P. Murphy, UCI Repository of Machine Learning Databases, 1996. Available from: <http://www.ics.uci.edu/~mlearn/MLRRepository.html>.
[26] M. Chen, K. Chuang, Clustering categorical data using the correlated-force ensemble, in: Proc. of SDM'04, 2004.
[27] P. Andritsos, P. Tsaparas, R.J. Miller, K.C. Sevcik, LIMBO: scalable clustering of categorical data, in: Proc. of EDBT'04, 2004, pp. 123-146.
[28] T. Li, S. Ma, M. Ogihara, Entropy-based criterion in categorical clustering, in: Proc. of ICML'04, 2004.

[29] Z. He, X. Xu, S. Deng, TCSOM: clustering transactions using self-organizing map, Neural Processing Letters 22 (3) (2005) 249-262.
[30] A. Gionis, H. Mannila, P. Tsaparas, Clustering aggregation, in: Proc. of ICDE'05, 2005, pp. 341-352.
[31] C. Chang, Z. Ding, Categorical data visualization and clustering using subjective factors, Data & Knowledge Engineering 53 (3) (2005) 243-263.
[32] K.C. Gowda, E. Diday, Symbolic clustering using a new dissimilarity measure, Pattern Recognition 24 (6) (1991) 567-578.
[33] K.C. Gowda, T.V. Ravi, Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity, Pattern Recognition 28 (8) (1995) 1277-1282.
[34] K. Mali, S. Mitra, Clustering and its validation in a symbolic framework, Pattern Recognition Letters 24 (14) (2003) 2367-2376.
[35] D.S. Guru, B.B. Kiranagi, Multivalued type dissimilarity measure and concept of mutual dissimilarity value for clustering symbolic patterns, Pattern Recognition 38 (1) (2005) 151-156.
[36] D.S. Guru, B.B. Kiranagi, P. Nagabhushan, Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns, Pattern Recognition Letters 25 (10) (2004) 1203-1213.
[37] H. Liu, F. Hussain, C.L. Tan, M. Dash, Discretization: an enabling technique, Data Mining and Knowledge Discovery 6 (2002) 393-423.
