ELSEVIER    Pattern Recognition Letters 16 (1995) 1147-1157
Abstract
Clustering techniques are important for knowledge acquisition. Traditionally, numerical clustering methods have been viewed in opposition to the conceptual clustering methods developed in Artificial Intelligence. Numerical techniques emphasize the determination of homogeneous clusters but provide only low-level descriptions of them. A conceptual approach is more concerned with high-level, i.e. more understandable, descriptions of classes. In this paper, we propose a hybrid numeric-symbolic method that integrates an extended version of the K-means algorithm for cluster determination and a complementary conceptual characterization algorithm for cluster description.
come these limitations. Cluster/2 (Michalski, 1984) and Cluster/S (Stepp and Michalski, 1986) used an extension of a predicate calculus formalism to represent data and knowledge. They determine concepts that are described using logical predicates. Conceptual clustering systems work well on applications involving complex data and knowledge, but they are not well suited to noisy numerical data without background knowledge. Another approach has been to extend the similarity or distance used by numerical methods to cluster objects. In (Kodratoff and Tecuci, 1988; Gowda and Diday, 1991; Esposito et al., 1992), conceptual distances that take structure and background knowledge into account are defined on the objects. In a first step the conceptual distances between the observations are computed, and the result is then analyzed using classical numerical hierarchical methods. The weakness of this approach lies in the difficulty of coming back to the initial knowledge to interpret the resulting classes.

In this paper, we are interested in extending the well-known K-means clustering method (Jain and Dubes, 1988) to deal with mixed data (data characterised by numerical and symbolic features) and to give a conceptual interpretation of the resulting clusters. The K-means method is very popular because of its ability to cluster huge amounts of numerical and noisy data quickly and efficiently. It remains a basic framework for developing numerical (Ralambondrainy, 1987; Venkateswarlu and Raju, 1992) or conceptual clustering systems (Hanson, 1990) because of the various possibilities of distance and prototype choice. We show that this kind of method discovers statistical concepts that are not displayed by the usual numerical interpretation tables provided by clustering programs. We propose a distance to deal with mixed data and a characterization algorithm to derive abstract descriptions of the resulting clusters.

The paper is organized as follows. Section 2 describes the extended K-means algorithm proposed to deal with mixed data. Section 3 describes the formalisms used to represent data and knowledge. Section 4 is concerned with the conceptual cluster characterization process. Section 5 compares our algorithm with a decision tree method on one application.

2. The K-means algorithm extended to mixed data

Clustering algorithms include two steps (Stepp, 1987; Fisher and Langley, 1985): the first one is the aggregation step, when a partition of the observations is calculated, and the second one is the characterization of the classes resulting from that partition.

Let (X_1, R), ..., (X_p, R) be p numerical attributes. X_j is the name of the jth attribute, whose domain is the set of reals R. Each observation o in O is a p-tuple o = ((X_1, a_1), ..., (X_p, a_p)) that is represented by the vector o = (a_1, ..., a_p).

A distance d on numerical data must be chosen on the observation space. Let o = (a_j)_{1<=j<=p} and o' = (b_j)_{1<=j<=p} be two given observations. If the usual Euclidean distance is chosen, we have

    d^2(o, o') = sum_{1<=j<=p} (a_j - b_j)^2.

If the attributes are not homogeneous, the above distance is not suitable. The normalized distance is preferred because its value depends neither on the unit used to measure the attributes nor on the amplitude of each one. Let sigma_j denote the standard deviation of the jth attribute. In the observation space the normalized distance is written

    d^2(o, o') = sum_{1<=j<=p} (a_j - b_j)^2 / sigma_j^2.

An attribute X is said to be discrete or symbolic if its range of possible values (called modalities) D = {d_1, ..., d_m} is finite. To be processed, symbolic attributes must be coded numerically. The usual way to do that is to associate a binary attribute with each modality. For example, the attribute (sex, {male, female}) will be replaced by the binary attributes (sex_male, {1, 0}) and (sex_female, {1, 0}). When all the attributes are symbolic, an observation o is represented by a binary vector o = (a_1, ..., a_j, ..., a_q), where a_j in {1, 0} and q is the total number of modalities.

The usual Euclidean distance can be used on symbolic data, but it gives equal importance to each modality. The chi-square distance used in correspondence analysis (Greenacre, 1984) takes into account the weight of each modality in computing the distance between two observations o = (a_j)_{1<=j<=q} and o' = (b_j)_{1<=j<=q}.

The K-means algorithm is a classical optimization method that seeks a partition that minimizes the intraclass inertia

    W = sum_{1<=r<=k} sum { p_i d^2(o_i, g_r) | o_i in C_r },

where g_r denotes the prototype of cluster C_r.
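As an illustration of the coding and the normalized distance above, here is a minimal sketch; the attribute names, values and standard deviations are invented for the example and are not from the paper:

```python
def one_hot(value, modalities):
    """Code a symbolic value as a binary vector, one slot per modality."""
    return [1.0 if value == m else 0.0 for m in modalities]

def normalized_distance_sq(o, o_prime, sigma):
    """Squared normalized Euclidean distance:
    d^2(o, o') = sum_j (a_j - b_j)^2 / sigma_j^2."""
    return sum((a - b) ** 2 / s ** 2 for a, b, s in zip(o, o_prime, sigma))

# A mixed observation: one numerical attribute plus a coded symbolic one,
# giving the vector (height, sex_male, sex_female).
modalities = ["male", "female"]
o1 = [172.0] + one_hot("male", modalities)
o2 = [160.0] + one_hot("female", modalities)
sigma = [10.0, 0.5, 0.5]  # per-attribute standard deviations (invented)

d2 = normalized_distance_sq(o1, o2, sigma)
```

Dividing each squared difference by the attribute's variance makes the numerical and binary-coded symbolic attributes commensurable, which is the point of the normalized distance in the mixed-data setting.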
The lattice of the objects

The object set has a lattice structure: L = L_1 x ... x L_p = (Omega = E_1 x ... x E_p, <=, v, ^, *, 0), the product of the lattices L_j. The partial order on L is the product of the orders defined on the E_j, and for every omega = (omega_j)_{1<=j<=p} and o = (o_j)_{1<=j<=p} in Omega we have:

    omega <= o  iff  omega_j <= o_j for every j, 1 <= j <= p.

The set of examples will be defined as the most typical observations of the cluster to be described. More precisely, an observation is an example of a given cluster if its distance to the prototype is less than the average distance of the cluster observations from the prototype of this cluster. The observations of all the other clusters will be the counterexamples.
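The product order and the lattice operations can be sketched componentwise. This is a minimal illustration with numerical components and names of our own choosing, not code from the paper:

```python
def leq(omega, o):
    """Product partial order: omega <= o iff omega_j <= o_j for every j."""
    return all(w <= x for w, x in zip(omega, o))

def join(omega, o):
    """Least upper bound (v) in the product lattice, taken componentwise."""
    return tuple(max(w, x) for w, x in zip(omega, o))

def meet(omega, o):
    """Greatest lower bound (^) in the product lattice, componentwise."""
    return tuple(min(w, x) for w, x in zip(omega, o))
```

Because the order is the product of the orders on the components, two tuples can be incomparable: neither leq((3, 0), (2, 3)) nor leq((2, 3), (3, 0)) holds, yet their join and meet are still defined.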
Fig. 2. [Figure: a run on examples {a, b, c, d} and counterexamples {x, y}; iterations 1 and 2 are shown with the NEW, DROP and DISC sets at each step.]
NEW = the set of examples,
DISC = { }.

1. Compute predicates that subsume at least two examples:

    NEW = {A = g v g' | g, g' in NEW and g /= g'}.

2. Remove all elements from NEW that are not alpha-discriminating:

    DROP = {A | A in NEW and ...

Fig. 2 illustrates the algorithm on a very simple example.

Let S = {A_1, ..., A_m} be the resulting set of predicates that satisfy the two conditions (C1) and (C2). The predicate R_S = A_1 v ... v A_m is an approximation of the characteristic function of the cluster C. This predicate may be made of a great number of terms and may include predicates A_j, A_k such that A_j <= A_k. It can be simplified according to the following
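The iteration structure described above can be sketched as follows. This is a loose illustration, not the paper's exact GENER procedure: predicates are represented here as interval boxes, generalizing two predicates means taking their lattice join, and we read "alpha-discriminating" as covering at most alpha counterexamples.

```python
from itertools import combinations

def covers(box, point):
    """A box subsumes a point if the point lies inside every interval."""
    return all(lo <= x <= hi for (lo, hi), x in zip(box, point))

def join(b1, b2):
    """Generalize two boxes: the smallest box subsuming both (lattice join)."""
    return tuple((min(l1, l2), max(h1, h2))
                 for (l1, h1), (l2, h2) in zip(b1, b2))

def gener_sketch(examples, counterexamples, alpha, max_iter=10):
    """NEW starts as the examples (degenerate boxes); each iteration joins
    pairs of predicates and drops those covering more than alpha
    counterexamples; the survivors accumulate in DISC."""
    new = {tuple((x, x) for x in e) for e in examples}
    disc = set()
    for _ in range(max_iter):
        candidates = {join(g, g2) for g, g2 in combinations(new, 2)}
        drop = {a for a in candidates
                if sum(covers(a, c) for c in counterexamples) > alpha}
        keep = candidates - drop
        if not keep or keep <= disc:
            break  # stationary: nothing new was produced
        disc |= keep
        new = keep
    return disc

# Two 2-dimensional examples, one far-away counterexample.
predicates = gener_sketch([(1.0, 2.0), (2.0, 3.0)], [(10.0, 10.0)], alpha=0)
```

In this toy run the only candidate is the join of the two examples, the box ((1, 2), (2, 3)); it covers no counterexample, so it survives and the loop stops at the next iteration.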
Table 1
Party                                                        Mean   sigma
Communist party: European election (Cm89)                     7.5   3.5
Communist party: Regional election (Cm92)                     7.8   3.7
Socialist party: European election (So89)                    24.0   3.5
Socialist party: Regional election (So92)                    21.6   6.2
Environment party: European election (En89)                  11.6   2.1
Environment party: Regional election (En92)                  13.7   3.8
Conservative parties UDF + RPR: European election (Cs89)     37.5   5.4
Conservative parties UDF + RPR: Regional election (Cs92)     38.6   7.0
Ultraright National Front party: European election (Nf89)    10.8   3.8
Ultraright National Front party: Regional election (Nf92)    12.9   5.0
Table 2
Departement   Cm89   Cm92   So89   So92    En89    En92   Cs89   Cs92   Nf89   Nf92
01 Ain        4.93   4.78   22.8   21.35   11.54   15.2   40.9   43.3   12.5   15.2
02 Aisne      9.83   9.83   24.4   20.4    11.2    15.3   33.3   32.3   11.1   12.5
03 Allier     16.9   20.8   22.1   16.3    8.8     11.2   37.1   37.0   7.86   9.37
Fig. 3. Contributions of attributes to cluster determination. [Four panels, one per cluster (Cluster 1 to Cluster 4), plotting each attribute Cm89, Cm92, So89, So92, Ev89, Ev92, Cs89, Cs92, Nf89, Nf92 on a -1/0/+1 scale.]
We define the degree of participation of attributes to cluster determination by the coefficient

    coef(j, r) = (m_j^r - m_j) / sigma_j,

where m_j is the mean, sigma_j the standard deviation of the jth attribute, and m_j^r is the mean in the class C_r. The attribute j is typical for the cluster C_r if |coef(j, r)| is greater than 1, i.e., the attribute mean in this class differs significantly from the attribute mean in the population. Fig. 3 displays coef(j, r) for each cluster r and each attribute j.

We can give the following interpretation of the clusters. Cluster 1 (21%) regroups leftist departements that vote heavily for the Socialist party. Cluster 2 (21%) can be labelled "Conservative": departements of this cluster support the center-right party UDF and the neo-Gaullist party RPR. The scores of parties in the biggest Cluster 3 (38%) are not significantly different from the average, except for the Environment party. The last Cluster 4 (20%) reflects "extreme" political opinion: only the ultraright National Front and Communist parties have high scores.

The GENER algorithm has been applied to describe each cluster. Only the results related to Cluster 2 (20 examples) are given here:

    alpha = 2; beta = 7; Examples = Cluster 2; Counterexamples = Cluster 1 U Cluster 3 U Cluster 4.

The process became stationary after nine iterations. The number of rules that satisfy the conditions (C1) and (C2) is 115. The number of rules after the simplification step is 3.

From Eq. (3) in Section 3, we define:

    (Cons89+) = (38.6 <= Cs89 <= 49),
    (Cons92+) = (44.2 <= Cs92 <= 51.7),
    (Cons+)   = (Cons89+ ^ Cons92+),
    (Com89-)  = (2.3 <= Cm89 <= 8.8),
    (Com92-)  = (2.9 <= Cm92 <= 8.8),
    (Com-)    = (Com89- ^ Com92-),
    (Nati89-) = (4.9 <= Nf89 <= 10.7),
    (Nati92-) = (5 <= Nf92 <= 12.9),
    (Nati-)   = (Nati89- ^ Nati92-),
    (Soci89=) = (20 <= So89 <= 26.8),
    (Soci92=) = (14.5 <= So92 <= 25),
    (Soci=)   = (Soci89= ^ Soci92=),
    (Envi89=) = (9.5 <= En89 <= 12.6),
    (Envi92=) = (8.3 <= En92 <= 15.3),
    (Envi=)   = (Envi89= ^ Envi92=),

where "+" means higher than the average, "-" means lower than the average, and "=" stands for around the average.
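The coefficient coef(j, r) can be computed as sketched below; this is a minimal illustration with invented sample values, not the figures from the tables:

```python
import statistics

def coef(cluster_values, population_values):
    """Degree of participation of an attribute to a cluster:
    coef(j, r) = (mean in cluster r - population mean) / population std."""
    m = statistics.mean(population_values)
    sigma = statistics.pstdev(population_values)
    m_r = statistics.mean(cluster_values)
    return (m_r - m) / sigma

# The attribute is "typical" for the cluster when |coef| > 1, i.e. the
# cluster mean sits more than one standard deviation from the overall mean.
population = [4.93, 9.83, 16.9, 20.8, 7.8]   # invented attribute scores
cluster = [16.9, 20.8]                        # the observations in one cluster
typical = abs(coef(cluster, population)) > 1
```

Dividing by the population standard deviation puts all attributes on the same scale, so the same threshold of 1 can be applied to every attribute in Fig. 3.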
Fig. 4. The tree on clusters. [Tree figure not recovered; node labels include the split value 44.2 and the number of objects at each node.]
Fisher, D. and P. Langley (1985). Approaches to conceptual clustering. In: Proc. Ninth Internat. Joint Conf. on Artificial Intelligence, Los Angeles, CA. Kaufmann, 688-697.
Friedman, J.H. (1977). A recursive partitioning decision rule for non-parametric classification. IEEE Trans. Comput. 26, 404-408.
Gowda, K.C. and E. Diday (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition 24, 567-578.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. Academic Press, New York.
Hanson, S.J. and M. Bauer (1986). Machine learning, clustering and polymorphy. In: L.N. Kanal and J.F. Lemmer, Eds., Uncertainty in Artificial Intelligence. North-Holland, Amsterdam, 415-428.
Hanson, S.J. (1990). Conceptual clustering and categorization: bridging the gap between induction and causal models. In: Y. Kodratoff and R.S. Michalski, Eds., Machine Learning: An Artificial Intelligence Approach, Vol. 3. Morgan Kaufmann, Los Altos, CA, 235-268.
Jain, A.K. and R.C. Dubes (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
Kodratoff, Y. and G. Tecuci (1988). Learning based on conceptual distance. IEEE Trans. Pattern Anal. Mach. Intell. 10, 897-909.
Matheus, C.J. (1987). Conceptual purpose: implications for representation and learning in machines and humans. Ph.D. Thesis, Computer Science, University of Illinois, Urbana-Champaign, IL.
Michalski, R. (1984). A theory and method of inductive learning. In: R. Michalski, J. Carbonell and T. Mitchell, Eds., Machine Learning: An Artificial Intelligence Approach. Springer, New York, 83-134.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning 1, 81-106.
Ralambondrainy, H. (1987). A clustering method for nominal data and mixture of numerical and nominal data. In: Proc. First Conf. of the Internat. Federation of Classification Societies, Aachen.
Stepp, R.E. and R.S. Michalski (1986). Conceptual clustering of structured objects: a goal-oriented approach. Artificial Intelligence 28, 43-69.
Stepp, R.E. (1987). Concepts in conceptual clustering. In: Proc. Tenth Internat. Joint Conf. on Artificial Intelligence, Milan.
Smith, E.E. and D.L. Medin (1981). Categories and Concepts. Harvard Univ. Press, Cambridge, MA.
Takagi, T., T. Yamaguchi and M. Sugeno (1991). Conceptual fuzzy sets. In: Fuzzy Engineering toward Human Friendly Systems (IFES '91). IOS Press, 261-272.
Venkateswarlu, N.B. and P.S.V.S.K. Raju (1992). Fast Isodata clustering algorithms. Pattern Recognition 25, 335-342.