
Expert Systems with Applications 39 (2012) 335–349


Robust data clustering by learning multi-metric Lq-norm distances


Junying Zhang a,*, Liuqing Peng a, Xiaoxue Zhao b, Ercan E. Kuruoglu c

a School of Computer Science and Technology, Xidian University, Xi'an 710071, China
b Department of Physics, Fudan University, Shanghai 200433, China
c Institute of Science and Technology of Information ''A. Faedo'', CNR, Pisa 56124, Italy

* Corresponding author. Tel.: +86 13992815979; fax: +86 29 88203692. E-mail address: jyzhang@mail.xidian.edu.cn (J. Zhang).

Keywords: Robust clustering; Multi-metric Lq-norm distance; Distance learning; Outlier detection

Abstract

Unsupervised clustering for datasets containing severe outliers is a difficult task. In this paper, we propose a cluster-dependent multi-metric clustering approach which is robust to severe outliers. A dataset is modeled as clusters, each contaminated by noise of an unknown, cluster-dependent level that gives rise to the outliers of that cluster. With such a model, a multi-metric Lp-norm transformation is proposed and learnt which maps each cluster to the most Gaussian distribution by minimizing a non-Gaussianity measure. The approach is composed of two consecutive phases: multi-metric location estimation (MMLE) and multi-metric iterative chi-square cutoff (ICSC). Algorithms for MMLE and ICSC are proposed. It is proved that the MMLE algorithm searches for the solution of a multi-objective optimization problem and in fact learns a cluster-dependent multi-metric Lq-norm distance and/or a cluster-dependent multi-kernel defined in data space for each cluster. Experiments on heavy-tailed alpha-stable mixture datasets, on Gaussian mixture datasets with radial and diffuse outliers added respectively, and on the real Wisconsin breast cancer dataset and lung cancer dataset show that the proposed method is superior to many existing robust clustering and outlier detection methods in both clustering and outlier detection performance.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Unsupervised data clustering is the task of assigning points to clusters while simultaneously estimating cluster location and shape. The performance of standard multivariate analysis techniques relies on the estimation of both the location and the shape of the cluster data distribution. However, this is difficult in robust statistics in the presence of outliers (Hardin & Rocke, 2004; Peña & Prieto, 2001; Rocke & Woodruff, 1996): on one hand, only when outliers are detected and eliminated correctly can location and shape be estimated with high quality; on the other hand, only when location and shape are estimated with high quality can outliers be detected correctly. Such a reciprocal relation between location/shape estimation and outlier detection brings more difficulty in the severe outlier case, since severe outliers bias location and shape estimation more seriously and make outlier detection less effective. However, clustering methods need to be robust against outliers/noise if they are to be useful in practice (Davé & Krishnapuram, 1997). In addition, in some applications outliers can in fact carry significant information.

Many popular unsupervised clustering techniques are prototype-based, such as K-means, fuzzy c-means, PCM and their variants (Davé & Krishnapuram, 1997). Under this framework, most robust clustering approaches fall into two directions: distance measure definition and outlier identification. The former generally defines a fixed and global distance measure over data space which provides robustness, and the latter generally clusters the data after identification and removal of all potential outliers.

The K-means clustering method partitions samples {x_i : x_i ∈ R^d, i = 1, 2, ..., N} into K clusters (c_j, j = 1, 2, ..., K) using the classical L2-norm distance:

    I(c_j | x_i) = argmin_{1 ≤ j ≤ K} ||x_i − m_j||_2^2.   (1)

It is fast but easily affected by even a single outlier, leading to miscalculated locations and hence wrong clustering decisions, especially in the presence of impulsive outliers (Hodge & Austin, 2004). For dealing with outliers, a successful extension is the use of the Lp-norm distance:

    I(c_j | x_i) = argmin_{1 ≤ j ≤ K} ||x_i − m_j||_p^p   (2)

in Miyamoto and Agusta (1996) and Hathaway, Bezdek, and Hu (2000), with p = 1 employed on all data. In Hathaway et al.'s work (2000), fuzzy c-means using general Lp-norm distances was developed with a discrete and fixed p, p ∈ (0, +∞), over all clusters. It concluded that p = 1 exhibits some robustness properties. It also summarized that choices of p other than p = 1 or p = 2 lead to models that are more difficult to optimize.
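As an illustration of Eq. (2), the following minimal Python sketch (our own illustration, not code from the paper) assigns samples to the prototype minimizing the Lp-norm distance and re-estimates each prototype; for p = 1 the coordinate-wise median is the standard update, and p = 2 recovers classical K-means. Intermediate p would need a dedicated solver, so the update below is only a heuristic there.

```python
import numpy as np

def lp_kmeans(X, K, p=1.0, n_iter=50, seed=0):
    """Toy Lp-norm K-means, Eq. (2): p=2 is classical K-means, p=1 uses medians."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), K, replace=False)].astype(float)   # initial prototypes
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.abs(X[:, None, :] - m[None, :, :]) ** p          # |x_i - m_j|^p per coordinate
        labels = d.sum(axis=2).argmin(axis=1)                   # argmin_j ||x_i - m_j||_p^p
        for j in range(K):
            pts = X[labels == j]
            if len(pts) == 0:
                continue
            # median minimizes the L1 objective, mean the L2 one; other p: heuristic only
            m[j] = np.median(pts, axis=0) if p <= 1.0 else pts.mean(axis=0)
    return m, labels
```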


In the work of Chen and Wang (1999), robust clustering is implemented by introducing a density measurement, which reveals the degree of density around an input point, as the weight of the distance measure with p = 2 for that point. In Hubert and his collaborators' work (1997), and many other papers, the Lp-norm measure was examined at the usual values, i.e., 1, 2, and +∞. These three measures imply an assumption of three distributions: Cauchy, normal and uniform, respectively.

Another successful direction for robust clustering is based on the detection and removal of potential outliers. Hardin and Rocke (2004) developed a distributional fit to Mahalanobis distances which uses a robust shape and location estimate, namely the minimum covariance determinant (MCD) (Hardin & Rocke, 2004; Rousseeuw & Van Zomeren, 1990), also called the minimum volume ellipsoid (MVE) by Davé and Krishnapuram (1997), with many follow-up studies (Fauconnier & Haesbroeck, 2009; Hubert, Rousseeuw, & Van Aelst, 2008; Peña & Prieto, 2001). The MCD is computed from the claimed ''closest'' ''half sample'' (in which it is believed no outliers are included). Though it is ''a polynomial time algorithm for fixed dimension of the data'' and ''NP-hard if the dimension varies'' (Bernholt & Fischer, 2004), based on the jointly estimated location and shape, a chi-square distribution cutoff and an F distribution cutoff, referred to as the Chisq and Adjusted F methods in Hardin and Rocke (2004), respectively, are used for outlier detection. The latter considers that an F distribution fits the extreme points much more accurately across all sample sizes, especially small ones, and hence is superior to the Chisq in robustness (Hardin & Rocke, 2004). In the work of Cuesta-Albertos, Gordaliza, and Matrán (1997), a robust clustering method aiming at robustifying K-means is proposed, with a user-specified true percentage of outliers that must be properly selected. In the noise clustering proposed by Rehm, Klawonn, and Kruse (2007), a single noise cluster which will hopefully contain all noise samples is introduced; samples whose distances to all clusters exceed a certain threshold are considered as noise. Kernel-based clustering originated for clustering data which does not follow Gaussian distributions (Ben-Hur, Horn, Siegelmann, & Vapnik, 2001; Girolami, 2002), and was recently employed successfully for outlier identification (Wang, 2009). How to determine a suitable kernel function (including its parameters) is generally decided by experience (Girolami, 2002; Guo, Chen, & Tsai, 2009).

With respect to model-based clustering approaches, a robust clustering approach based on a finite Student's-t mixture model (SMM) and its extension to the fuzzy case are studied in the work of Archambeau and Verleysen (2007) and Chatzis and Varvarigou (2008), respectively. A deterministic annealing EM (DAEM) algorithm for this mixture model has been proposed which significantly improves upon the noise and initialization sensitivity of traditional mixture decomposition algorithms (Guo et al., 2009). In addition, a robust probabilistic PCA (PPCA) model using the multivariate Student's-t distribution has recently been proposed for the situation where there may be outliers within a single-model (single-cluster) dataset (Chen, Martin, & Montague, 2009). These approaches use the computationally intensive EM algorithm for model parameter estimation (Archambeau & Verleysen, 2007; Chatzis & Varvarigou, 2008; Chen et al., 2009). Recently, a new direction of projected outlier detection has emerged for applications with sparse high-dimensional data; the main focus is on how to search for the projected subspace (Ye, Li, & Orlowska, 2009). Robustness to severe noise situations remains to be studied.

To sum up, in the above approaches the distance/kernel is defined by experience and/or set globally fixed (single-metric) over all clusters. Some research works are related to distance metric learning. The Mahalanobis distance (i.e., the related covariance matrix) can be learnt given examples of similar (and, if desired, dissimilar) pairs of points in input space, but this requires ''side information'' for clustering (Xing, Ng, Jordan, & Russell, 2002; Bar-Hillel, Hertz, Shental, & Weinshall, 2003). Other examples are approaches on how to learn a Mahalanobis distance for kNN classification from labeled examples (Weinberger, Blitzer, & Saul, 2005) and how to learn a kernel-based distance metric for microarray data classification (Xiong & Chen, 2006). Learning a distance metric from a dataset by knowledge embedding has also been proposed (Zhang, Zhang, & Zhang, 2004). Most of these are supervised learning methods, and outliers are not paid great attention.

In this paper we propose a new prototype-based approach: the robust multi-metric clustering (RMMC) algorithm. It is composed of the robust multi-metric location estimation (MMLE) algorithm for location estimation and the iterative chi-square cutoff (ICSC) algorithm for outlier detection/shape estimation. By assuming that each cluster in the dataset is contaminated with noise of an unknown cluster-dependent level, in MMLE we propose a novel cluster-dependent nonlinear map for mapping a cluster to a symmetric distribution. Specifically, it learns a nonlinear map for each cluster which maps the cluster to a Gaussian distribution. Then, under the prototype-based framework, the mean of the mapped samples is mapped back to data space as the update of the cluster location in data space. It is rigorously proved that this is equivalent to learning a cluster-dependent Lq-norm distance in data space, where q is learnt from each cluster. A non-Gaussianity measure of a random vector is employed to direct the learning of the parameter q for each cluster. Based on the locations robustly estimated by MMLE, the proposed ICSC is employed for robust outlier detection, which finally gives a Gaussian-mixture based description for the rest of the data. Different from conventional robust approaches, which estimate location and shape jointly with a distance measure defined globally and assumed fixed over data space, the proposed RMMC approach estimates location and shape consecutively, with the distance measure being cluster-dependent, local to each cluster, and learnt from the cluster. The distinct feature of the approach is its strong robustness to severe outliers. This comes from the learnability of the shrinkage towards the cluster location: the more severe the outlier situation of a cluster, the stronger the shrinkage of the map that will be learnt.

Our experiments, a comparison with five typical robust clustering algorithms (L1-norm K-means, Lq-norm K-means, FSMM, DWFCM and MCD) on impulsive heavy-tailed alpha-stable mixture data, and a comparison with two typical outlier detection algorithms (Chisq and Adjusted F, both MCD based) on outlier-contaminated Gaussian mixture data, verify the distinctly strong robustness of the proposed MMLE and ICSC to even severe outliers. Finally, the RMMC is successfully applied to two real datasets, the Wisconsin breast cancer dataset and the lung cancer dataset.

The rest of the paper is organized as follows. In Section 2, we show our motivation for the RMMC. In the following two sections, the MMLE algorithm and the ICSC algorithm are proposed respectively and their robustness is studied. Experimental results are given in Section 5. Discussions about the approach are in Section 6. Finally, the conclusions are in Section 7.

2. Motivation for multi-metric clustering

A dataset is generally composed of multiple clusters and contaminated with noise. Additionally, the impulsiveness of outliers with respect to different clusters is generally unknown and might differ due to the underlying physical reality. Thus, in this paper we model the data as clusters, each contaminated with outliers caused by a cluster-dependent unknown noise level.

The above model is reasonable in many applications, such as network intrusion detection, fraud detection and fault detection

in manufacturing, among other things. By considering cluster data as usual events, and outliers as unusual events, more than one type of unusual event may occur in given data: the event types can be expected to differ markedly from one another (Zhang, Gatica-Perez, Bengio, & McCowan, 2005), and different types of unusual events might be differently impulsive relative to the usual events. For example, as is known, cancer-related gene expression can be used to detect cancer. It often occurs that for some type of cancer a small fluctuation of gene expression levels indicates that the test tissue is cancerous, while for some other type of cancer a large fluctuation is required to reach the same conclusion. Such a heterogeneity of the unusual events, together with their rare occurrence and unexpectedness, exacerbates the problem of unsupervised clustering if we still fix a single global model, as the conventional approaches do, to capture all unusual events.

For clustering such a noisy dataset, especially in the presence of severe outliers, location estimation is crucial and should be insensitive to outliers and estimated with precision. Only then can the outliers be detected and removed correctly, and hence the clustering be performed correctly without the influence of outliers. The cluster-dependent noise level of the above model motivates the scheme of learning a cluster-dependent metric for each cluster. With the cluster number K known a priori, as in most conventional robust clustering approaches, we propose the robust multi-metric clustering (RMMC) algorithm. It includes two consecutively connected phases:

(1) Estimation of the location m_k and the corresponding metric parameter p_k for each cluster c_k, k = 1, 2, ..., K, with the MMLE algorithm proposed in Section 3;
(2) Outlier detection and data clustering, with the ICSC algorithm proposed in Section 4.

A prototype-based approach is adopted in the first phase, i.e., after location initialization, the following two steps are repeated iteratively until convergence: (1) assigning samples to clusters according to their locations, and (2) updating the locations according to the assignment. Unlike the classical approach, in which the location update is performed in data space, in this approach we learn a multi-metric shrinking map in an attempt to avoid the influence of outliers on location estimation.

A nonlinear map is searched for that maps a cluster to a Gaussian distribution. Firstly, the map should be a shrinking function which can shrink the samples towards the location. Secondly, the shrinkage for a cluster with a high noise level should be stronger than that for a cluster with a low noise level; hence, the map is a cluster-dependent multi-metric transform. Lastly, we desire that the mapped space be symmetric for the cluster, e.g., Gaussian distributed. A schematic demonstration is shown in Fig. 1. At iteration n, samples are assigned to clusters according to their locations. Then, the samples in the kth cluster, shown as circles in the left panel of Fig. 1, are mapped (shrinking towards the location of the cluster) to the mapped space, shown as circles in the right panel of Fig. 1, by the shrinkage map ψ_k(·), where the mapped samples are Gaussian distributed. The mean μ_k of these mapped samples, denoted by the plus sign in the right panel of Fig. 1, is next inversely mapped, by ψ_k^{-1}(·), to data space to get the updated cluster location m_{k,n+1}. This process continues until convergence. Note that the shrinkage map ψ_k(·) depends on the cluster k, since different clusters have different noise levels.

Mapping a cluster to a Gaussian distribution makes location estimation robust to outliers. Outliers, in principle, behave differently from the majority of samples, and the number of outliers is very small compared with the total number of samples in the cluster. Additionally, the Gaussian distribution is symmetric about its mean. The sparseness of outliers in each cluster and the symmetry of the Gaussian distribution in the mapped space jointly make the mean of the mapped samples a good location estimate of the mapped cluster, robust to outliers. If so, the update of the location in data space will be precise and robust to outliers, and hence at convergence the locations in data space will be discovered with precision. Another reason we map a cluster to a Gaussian distribution is that we can use an existing non-Gaussianity measure (Hyvärinen & Oja, 2000) to help direct the search for the map.

It is known that the use of the Lp-norm provides robustness for dealing with outliers, and the L1-norm is a particular case. Hence, we adopt an Lp-norm based map for the shrinkage. For data x with location at the origin, we introduce a map from the data space to the mapped space, x → z, such that the Lp-norm of samples in the data space equals the L2-norm in the mapped space, i.e.,

    ||x||_p^p = ||z||_2^2.   (3)

We then obtain the map function

    z = |x|^{p/2} · sgn(x),   (4)

where p ∈ [0, 2], so that the map shrinks rather than expands. The corresponding inverse map function is then

    x = |z|^{2/p} · sgn(z).   (5)
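The coordinate-wise map and its inverse are straightforward to implement. The sketch below is our own illustration (not code from the paper); it follows Eqs. (4) and (5), with the location m made explicit as in the general form introduced later in Eq. (9).

```python
import numpy as np

def shrink_map(x, m, p):
    """z = |x - m|^(p/2) * sgn(x - m): shrinks samples towards the location m for p in [0, 2]."""
    r = np.asarray(x, dtype=float) - m
    return np.abs(r) ** (p / 2.0) * np.sign(r)

def inverse_map(z, m, p):
    """x = m + |z|^(2/p) * sgn(z): maps a point of the mapped space back to data space."""
    return m + np.abs(z) ** (2.0 / p) * np.sign(z)

# Usage: a coordinate far from the location is shrunk strongly, and the round trip is exact.
x = np.array([105.0, -3.0]); m = np.array([5.0, 5.0])
z = shrink_map(x, m, 1.4)
x_back = inverse_map(z, m, 1.4)   # equals x up to floating-point error
```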


Fig. 1. Schematic figure demonstrating the problem definition and the motivation of the proposed approach. The plus sign corresponds to the cluster location and each circle to a sample in the cluster; m_{k,n+1} is the location updated from the location m_{k,n} for the kth cluster at iteration n, and μ_k is the mean of the cluster samples in the mapped space.

In Eqs. (4) and (5), the vectors |x|^{p/2} and sgn(x) are of the same size as x, and are defined, respectively, as

    |x|^{p/2} ≜ (|x_1|^{p/2}, |x_2|^{p/2}, ..., |x_d|^{p/2})^T   (6)

and

    sgn(x) ≜ (sgn(x_1), sgn(x_2), ..., sgn(x_d))^T,   (7)

where

    sgn(x_i) ≜ +1 if x_i ≥ 0, and −1 otherwise,   (8)

and the dot product of vectors b and c, i.e., d = b · c, is defined as the vector whose ith element is the product of the ith element of b and the ith element of c.

Similarly, for data x with location m in data space and located at the origin in the mapped space, the corresponding map and inverse map become

    z = ψ(x) = |x − m|^{p/2} · sgn(x − m),
    x = ψ^{-1}(z) = m + |z|^{2/p} · sgn(z).   (9)

The above map is a shrinking map which adapts to the cluster-dependent noise levels of the data model. Firstly, it is a shrinking map for p ∈ [0, 2], with p = 2 corresponding to the identity map and smaller p corresponding to stronger shrinkage. Secondly, for a specific p, the farther a sample is from the cluster location m in data space, the more strongly it is shrunk towards the cluster location. Both properties can be seen from Fig. 2, where p = 1.2, 1.4 and 2. Due to the cluster-dependent noise levels, p is cluster dependent and learnt from the cluster, i.e., for a cluster with a higher noise level a smaller p should be used/will be learnt. Only then can the mapped cluster be Gaussian distributed.

Though the above map ψ(·) may not guarantee that the mapped cluster follows an exactly Gaussian distribution, due to its fixed form in Eq. (9), the parameter p can be adjusted such that the mapped cluster approaches a Gaussian distribution.

Fig. 2. Shrinkage map function for different p. A sample x which is at distance 100 from the cluster location m shrinks by the amount indicated by the arrow length in the figure for p = 1.4. A cluster with more severe or impulsive outliers requires a map function with a stronger shrinkage ability, obtained with a smaller p.

The distinct feature of the above methodology is that it can tackle severe impulsive outliers in the data, as will be seen from our experiments in Section 5.1, where three clusters range only from −20 to +30 while outliers lie in the range from −600 to +600, with the outliers of each cluster ranging very differently. Such strong robustness of the proposed approach comes from the learnable shrinkage towards each cluster location: the more severe the outlier situation of a cluster, the stronger the shrinkage (the smaller the p) of the map that will be learnt.

3. Robust location estimation: MMLE algorithm

In this section, we propose the MMLE algorithm, in which the noise level is estimated with the least non-Gaussianity criterion. First, the algorithm is presented in Section 3.1, followed by the corresponding objective function and the non-Gaussianity measure, studied in Sections 3.2 and 3.3, respectively. The initialization algorithm, referred to as the CDM initialization algorithm, which adapts to severe outliers, is also proposed and given in Appendix A.

3.1. Robust location estimation: MMLE algorithm

The proposed MMLE algorithm is stated as follows:

MMLE algorithm for robust location estimation

Input: dataset {x_i : x_i ∈ R^d, i = 1, 2, ..., N}, and the total number of clusters K;
Output: {(m_k, q_k), k = 1, 2, ..., K}, where m_k is the estimated location and q_k the estimated metric parameter of the kth cluster, respectively.
Step 1: (initialization) Initialize the location of each cluster as m_{k,0} and the metric parameter as p_{k,0} = 2 for k = 1, 2, ..., K; set the iteration number n = 0.
Step 2: (partitioning data into K clusters) Let q_{k,n} = 1 + p_{k,n}/2, k = 1, 2, ..., K. Assign sample x_i, i = 1, 2, ..., N, to the kth cluster c_k, i.e., x_i ∈ c_k, according to

    k = argmin_{j=1,2,...,K} ||x_i − m_{j,n}||_{q_{j,n}}^{q_{j,n}}.   (10)

Step 3: (learning the cluster-dependent map) Search for p_{k,n+1} in the range [0, 2] for each cluster c_k, k = 1, 2, ..., K, such that the map

    z_l = |x_l − m_{k,n}|^{p_{k,n+1}/2} · sgn(x_l − m_{k,n}) for all x_l ∈ c_k   (11)

makes the mapped samples {z_l, z_l ∈ c_k} follow a Gaussian distribution as closely as possible.
Step 4: (computing the location of the mapped cluster) The location of the mapped samples {z_l, z_l ∈ c_k} is

    μ_k = (1/N_k) Σ_{z_l ∈ c_k} z_l   (12)

for k = 1, 2, ..., K, where N_k is the number of samples in the kth cluster c_k.
Step 5: (location update in data space) Update the location of the cluster c_k, for k = 1, 2, ..., K, by

    m_{k,n+1} = m_{k,n} + |μ_k|^{2/p_{k,n+1}} · sgn(μ_k).   (13)

Step 6: Set n = n + 1 and repeat Step 2 through Step 5 until convergence. Then we have m_k = m_{k,n+1} and q_k = q_{k,n+1} for k = 1, 2, ..., K.
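A compact Python sketch of Steps 1-6, under our own reading of the algorithm, is given below. The negentropy proxy follows Eqs. (17), (18) and (20) with H2(u) = −exp(−u²/2), each mapped coordinate being standardized first since Eq. (17) assumes zero mean and unit variance; the grid search over p mirrors the simple fixed-increment search used in the experiments. The random-prototype initialization and the function names are our own simplifications (the paper initializes with the CDM algorithm of Appendix A).

```python
import numpy as np

H2_GAUSS = -1.0 / np.sqrt(2.0)          # E[-exp(-v^2/2)] for a standard normal v

def negentropy_proxy(Z):
    """Sum over coordinates of (E[H2(z_i)] - E[H2(v)])^2, cf. Eqs. (17)-(20)."""
    Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)   # standardize each coordinate
    h = -np.exp(-Z ** 2 / 2.0).mean(axis=0)
    return np.sum((h - H2_GAUSS) ** 2)

def mmle(X, K, n_iter=30, p_grid=np.arange(0.2, 2.01, 0.2), seed=0):
    """Sketch of the MMLE iteration; returns locations m_k, metric parameters q_k, labels."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), K, replace=False)].astype(float)
    p = np.full(K, 2.0)                                   # p_{k,0} = 2
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        q = 1.0 + p / 2.0
        # Step 2: assign each sample by its cluster-dependent Lq-norm distance
        dist = np.stack([(np.abs(X - m[k]) ** q[k]).sum(axis=1) for k in range(K)], axis=1)
        labels = dist.argmin(axis=1)
        for k in range(K):
            Xk = X[labels == k]
            if len(Xk) < 2:
                continue
            # Step 3: pick the p whose mapped cluster looks most Gaussian
            best = min(p_grid, key=lambda pk: negentropy_proxy(
                np.abs(Xk - m[k]) ** (pk / 2.0) * np.sign(Xk - m[k])))
            p[k] = best
            # Steps 4-5: mean in the mapped space, then map it back to data space
            mu = (np.abs(Xk - m[k]) ** (best / 2.0) * np.sign(Xk - m[k])).mean(axis=0)
            m[k] = m[k] + np.abs(mu) ** (2.0 / best) * np.sign(mu)
    return m, 1.0 + p / 2.0, labels
```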

The algorithm (MMLE) is prototype-based. Initialization in Step 1 is given in Appendix A. Step 2 partitions the data into K clusters with a different metric distance for each cluster. After the partition in Step 2, we map each cluster to a mapped space by learning a transformation map (Step 3), compute the location in the mapped space (Step 4), and then map this location back to data space as the update of the location in data space (Step 5), with the multi-metric p_k searched for such that the mapped space is the most Gaussian distributed for each cluster k = 1, 2, ..., K. It will be seen in Section 3.2 that, theoretically, an assumption is made on the dataset that in data space each cluster follows a Multivariate Generalized Gaussian Distribution (MGGD) with shape parameter q_k = 1 + p_k/2. The MMLE algorithm in fact searches simultaneously for robust locations m_k and multi-metric parameters p_k, k = 1, 2, ..., K, for such a dataset.

The MMLE degrades to the conventional K-means / L1-norm K-means / Lp-norm K-means when a fixed single metric p_k = 2 / p_k = 1 / p_k = p is set over all clusters k = 1, 2, ..., K rather than searched for in Step 3.

3.2. The objective function

In this section, we discuss the objective function with respect to the Lq-norm distance metric with a different metric parameter q = q_k for each cluster, assumed known a priori. We will show that minimizing it is the convergent solution of the MMLE algorithm; in Section 3.3 we study the optimization of q_k, k = 1, 2, ..., K, for the case where they are unknown.

Theorem 1. Let a dataset be {x_i : x_i ∈ R^d, i = 1, 2, ..., N} and K be the total number of clusters. With the metric parameters q_k known a priori for k = 1, 2, ..., K, the MMLE searches for the optimal solution of the least sum of Lq-norm distances between cluster samples and their locations over all clusters, i.e.,

    min_{m_k, k=1,2,...,K} J_a(m_k; k = 1, 2, ..., K) = Σ_{k=1}^{K} (1/N_k) Σ_{x_i ∈ c_k} ||x_i − m_k||_{q_k}^{q_k},   (14)

where N_k is the total number of samples in c_k.

The proof of Theorem 1 is in Appendix B. It can also be seen from Appendix B that q_k = 1 + p_k/2. According to the criterion of minimizing J_a, it is evident that the partition strategy should be x ∈ c_k if k = argmin_{j=1,2,...,K} ||x − m_j||_{q_j}^{q_j}, and hence we have the partition process of Step 2 of the MMLE algorithm.

Now we study the convexity of the objective function J_a. As is known, the m_k-related term of J_a, i.e., J_k(m_k) = (1/N_k) Σ_{x_i ∈ c_k} ||x_i − m_k||_{q_k}^{q_k}, is a convex function of m_k for q_k ≥ 1; however, f(x) + f(y) is not strictly convex even though both f(x) and f(y) are convex. Hence, J_a = Σ_{k=1}^{K} J_k(m_k) is not strictly convex, and there might exist local minima of J_a. This means that the algorithm might converge to a local minimum. In an attempt to avoid this, we propose the CDM initialization algorithm, given in Appendix A, for robust initialization of clustering algorithms, including MMLE, for data in the presence of severe outliers.

The objective function in Eq. (14) is equivalent to the assumption that each cluster, with its noise samples included, follows a symmetric Multivariate Generalized Gaussian Distribution (MGGD) with pdf

    f(x | c_k) ∝ exp(−||x − m_k||_{q_k}^{q_k}),   (15)

where the q_k are known a priori and the m_k are to be estimated, k = 1, 2, ..., K.

3.3. Non-Gaussianity measure of a mapped cluster

We have introduced the objective function in Eq. (14) for known q_k, k = 1, 2, ..., K. However, the noise level is unknown a priori and hence q_k has to be estimated with some criterion. We have mentioned in the motivation section that mapping a cluster to the most Gaussian distribution brings robustness to outliers. Hence, a computationally available multivariate non-Gaussianity measure is required.

A very important measure of non-Gaussianity is negentropy. It is based on the information-theoretic quantity of (differential) entropy, and defined as

    G(z) = H(z_Gauss) − H(z),   (16)

where z_Gauss is a Gaussian random vector with the same covariance matrix as z. The problem in using negentropy is that it is computationally very expensive: estimating negentropy using the definition would require an estimate (possibly non-parametric) of the probability density function (Hyvärinen & Oja, 2000).

An estimate of the negentropy of a random variable z assumed to be zero mean and unit variance (i.e., standardized) was given by Hyvärinen and Oja (2000) as

    G(z) ≈ Σ_{i=1}^{m} k_i [E{H_i(z)} − E{H_i(ν)}]^2,   (17)

where the k_i are positive constants, ν is a Gaussian variable of zero mean and unit variance (i.e., standardized), and the functions H_i are non-quadratic functions. Generally, m is set to 1, and choices of H which have proved very useful are

    H_1(u) = (1/a_1) log cosh(a_1 u), and H_2(u) = −exp(−u²/2),   (18)

where 1 ≤ a_1 ≤ 2 is a suitable constant. We use H_2(u) in our experiments.

The above negentropy is useful for measuring the non-Gaussianity G(z) of a random variable z. However, in this approach it is required to estimate the negentropy G(z) of a random vector z = (z_1, z_2, ..., z_d)^T in the mapped space. Discouraged by the complexity of G(z), we roughly measure it with the negentropies G(z_i) of the z_i, i = 1, 2, ..., d. Note that

    G(z) = I(z_1, z_2, ..., z_d) + Σ_{i=1}^{d} G(z_i) − (1/2) log(Π_{i=1}^{d} C_ii / |C|),   (19)

where I(z_1, z_2, ..., z_d) is the mutual information among z_1, z_2, ..., z_d, and C is the covariance matrix of a mapped cluster. Since the map is searched for such that each cluster in the mapped space is most Gaussian distributed, for simplicity we assume z_1, z_2, ..., z_d are independent and have a diagonal covariance matrix. Hence the first and last terms on the right side of Eq. (19) vanish. Then, we can simply use the sum of the negentropies over the elements of the vector z as a rough estimate of the non-Gaussianity of z, i.e.,

    G(z) ≈ Σ_{i=1}^{d} G(z_i).   (20)

Hence, our criterion for searching for the optimal q_k for k = 1, 2, ..., K is

    min_{q_k ∈ [1,2]} J_b(q_k) = Σ_{i=1}^{d} G(z_i),
    s.t. z = (z_1, z_2, ..., z_d)^T = |x − m_k|^{q_k − 1} · sgn(x − m_k), ∀x ∈ c_k,   (21)

which is Step 3 of the algorithm.

In summary of the study in Sections 3.2 and 3.3, the proposed MMLE in fact minimizes J_a in Eq. (14) over the m_k, with each unknown metric parameter q_k in J_a searched for by minimizing J_b in Eq. (21). Hence, the proposed MMLE solves a multi-objective optimization problem for the estimation of both the location m_k and the metric parameter q_k, k = 1, 2, ..., K, simultaneously:

    min_{m_k} J_a(m_k, q_k; k = 1, 2, ..., K) = Σ_{k=1}^{K} (1/N_k) Σ_{x_i ∈ c_k} ||x_i − m_k||_{q_k}^{q_k},
    min_{q_k ∈ [1,2]} J_b(m_k, q_k) = Σ_{i=1}^{d} G(z_i), for k = 1, 2, ..., K,
    s.t. x ∈ c_k where k = argmin_{j=1,2,...,K} ||x − m_j||_{q_j}^{q_j};
         z = (z_1, z_2, ..., z_d)^T = |x − m_k|^{q_k − 1} · sgn(x − m_k), ∀x ∈ c_k.   (22)

Four techniques are used in the MMLE for solving the above problem: (1) a prototype-based iterative process, (2) the multi-metric Lp-norm map and its inverse map, (3) a non-Gaussianity measure for a random vector, and (4) a search process for the optimal metric q_k, k = 1, 2, ..., K. We will demonstrate in the experiment section that all of these together make the estimated locations very robust to outliers without prior knowledge of the noise levels, even when the noise situation is serious.

4. Robust outlier detection: ICSC algorithm

If the location is robustly estimated by the MMLE and held fixed, the chi-square cutoff method can simply be used, but its performance is still influenced by impulsive outliers. This influence is serious especially in the severe outlier situation, because estimating the shape precisely requires removing all potential outliers. To decrease the influence, an apparently natural way is to detect and remove outliers by iterative use of the conventional chi-square cutoff method. Specifically, we compute the covariance matrix of the cluster with respect to the fixed location and remove the samples with a large Mahalanobis distance from the location as potential outliers, i.e., remove the samples whose Mahalanobis distance is larger than the quantile χ²_{d,ε} of the chi-squared distribution, where χ²_{d,ε} denotes the ε × 100% quantile of the chi-squared distribution, with d the dimension of the data space. The above process proceeds until no outliers can be detected. This guarantees that the estimation of the covariance matrix is less and less influenced by outliers over the iterations, even though the original outlier situation is very severe.

ICSC algorithm for outlier detection and data clustering

Parameter: chi-square cutoff threshold ε;
Input: total number of clusters K; dataset {x_i : x_i ∈ R^d, i = 1, 2, ..., N}; location m_k and the corresponding metric parameter q_k for k = 1, 2, ..., K;
Output: covariance matrix Σ_k for k = 1, 2, ..., K, and the detected outliers.

Assign the ith sample x_i to cluster c_k, i.e., x_i ∈ c_k, where k = argmin_{j=1,2,...,K} ||x_i − m_j||_{q_j}^{q_j}, for i = 1, 2, ..., N. Denote by N_k the total number of samples in cluster c_k.
For each cluster c_k, k = 1, 2, ..., K, do
    Repeat until no outlier can be detected
        Calculate the covariance matrix Σ_k of cluster c_k relative to the location m_k by

            Σ_k = (1/(N_k − 1)) Σ_{x ∈ c_k} (x − m_k)(x − m_k)^T.   (23)

        Detect outliers using the chi-square cutoff method with parameter ε, i.e., for each sample x ∈ c_k, compute the squared Mahalanobis distance between x and m_k, defined by

            MD²(x, m_k) = (x − m_k)^T Σ_k^{-1} (x − m_k).   (24)

        x is a potential outlier if MD²(x, m_k) > χ²_{d,ε}. The set of outliers detected at this iteration is denoted as O.
        Remove all the detected outliers from the cluster, i.e., c_k = c_k \ O.
    End
    Compute the location and covariance matrix of the cleaned cluster, i.e.,

        m_k = (1/N_k) Σ_{x ∈ c_k} x,  Σ_k = (1/(N_k − 1)) Σ_{x ∈ c_k} (x − m_k)(x − m_k)^T.   (25)

End
For a test sample x, x is assigned to cluster c_k, i.e., x ∈ c_k, where

    k = argmin_{j=1,2,...,K} MD²(x, m_j),   (26)

and it is a potential outlier if MD²(x, m_k) > χ²_{d,ε}.

The ICSC algorithm is initialized with the location m_k and the metric parameter q_k of each cluster, k = 1, 2, ..., K, which have been determined by the MMLE algorithm. The covariance matrix (shape) of each cluster is estimated after the data is clustered according to the Lq-norm distance with q = q_k for k = 1, 2, ..., K. This is realized by iteratively removing potential outliers far away from the cluster. The shape of the cluster is computed relative to the cluster's location by Eq. (23). Noticing that the outliers of a cluster c_k are generally far away from the location, the determinant of the covariance matrix Σ_{k,n} in iteration n is larger than that of the clean cluster Σ_k, i.e., |Σ_{k,n}| > |Σ_k|. Hence, the Mahalanobis distance (MD) between any sample x in cluster c_k and m_k will be less than that computed according to the clean cluster, i.e., MD_n(x, m_k) < MD(x, m_k). This indicates that, with a specific ε, some outliers might not be detected, and hence further iteration is required. This leads to

    |Σ_{k,n+1}| < |Σ_{k,n}|.   (27)

The estimated covariance matrix approaches the clean one while outliers are being detected and removed. At convergence of the algorithm, all the potential outliers are supposed to have been removed, and data clustering is based on the locations and covariance matrices derived from the clean clusters. Hence, Eq. (25) is used, and the data will be correctly clustered.
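The sketch below is our own illustration of the ICSC loop under the assumption that ε denotes the upper-tail mass, so the threshold is the (1 − ε) quantile of the chi-squared distribution; the locations m_k and metric parameters q_k are those returned by MMLE. Function names are ours, and the final per-cluster shape is recomputed from the cleaned cluster as in Eq. (25).

```python
import numpy as np
from scipy.stats import chi2

def icsc(X, m, q, eps=0.005):
    """Iterative chi-square cutoff around the fixed MMLE locations m (sketch).

    m: (K, d) locations, q: (K,) metric parameters, eps: cutoff parameter."""
    K, d = m.shape
    cut = chi2.ppf(1.0 - eps, df=d)                   # MD^2 threshold, upper-tail reading of eps
    dist = np.stack([(np.abs(X - m[k]) ** q[k]).sum(axis=1) for k in range(K)], axis=1)
    labels = dist.argmin(axis=1)
    outliers = np.zeros(len(X), dtype=bool)
    covs = []
    for k in range(K):
        idx = np.where(labels == k)[0]
        while True:
            Xk = X[idx]
            S = (Xk - m[k]).T @ (Xk - m[k]) / (len(Xk) - 1)       # Eq. (23), location fixed
            md2 = np.einsum('ij,jk,ik->i', Xk - m[k], np.linalg.inv(S), Xk - m[k])
            bad = md2 > cut
            if not bad.any():
                break
            outliers[idx[bad]] = True
            idx = idx[~bad]                                       # remove detected outliers, iterate
        covs.append(np.cov(X[idx].T))                             # Eq. (25): shape of the cleaned cluster
    return labels, outliers, covs
```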

5. Experiments and results

Synthetic and real datasets were used to test the performance of the proposed algorithm. First, the robustness of the MMLE algorithm was tested on very impulsive, heavy-tailed alpha-stable mixture datasets in two dimensions, and compared with the claimed-robust L1-norm K-means and with the single-metric location estimation (SMLE) algorithm; in MMLE, q_k is learnt from the cluster, whereas in SMLE the q_k are set the same and kept fixed over all clusters. Then, a comparison experiment on the same data was performed between the proposed MMLE and three typical robust clustering algorithms (FSMM, DWFCM and MCD), with the performance evaluated on data clustering results and maximal fuzzy membership functions. In addition, outlier detection performance was evaluated and compared among the proposed RMMC, the chi-square distribution cutoff (Chisq) method and the F distribution cutoff (Adjusted F) method via the synthetic Gaussian mixture R-data and D-data of 4, 7 and 10 dimensions, in the presence of radial and diffuse outliers respectively. Finally, the RMMC was tested on the Wisconsin breast cancer dataset, with a comparison to the outlier factor method (Hawkins, He, Williams, & Baxter, 2002), and on the lung cancer dataset. All the algorithms were initialized with the proposed CDM algorithm given in Appendix A in these experiments. (Footnote: the termination condition of the algorithm is Σ_j ||m_{j,n} − m_{j,n−1}||_2^2 / K ≤ η, with η set to 0.001 in the experiments.)

In our experiments, we used the simplest way to search for the optimal p_k, k = 1, 2, ..., K, since our emphasis is on understanding the properties of the clustering rather than on computational efficiency. That is, we search for the optimal q ∈ [1, 2] with an increment of Δq (we set Δq = 0.1 in our experiments).

5.1. Location estimation on alpha-stable mixture data

The quality of the estimated locations is measured, as in Hathaway, Bezdek, and Hu (2000), by the center deviation (CD) between the estimated locations m_i and the true ones m_i^0, i.e.,

    CD = (1/K) Σ_{i=1}^{K} ||m_i − m_i^0||_2^2.   (28)

In fact, the experimental result of the Lq-norm fuzzy c-means algorithm on ''two cluster data'' with q = 0.5, 1, 2, 4, 4.3, 5 demonstrated in Fig. 5 of Hathaway et al. (2000) supports our conclusion that q should be within [1, 2], and that outlying data pulls the prototypes more and more strongly as q increases above two.

In this study, experiments on much more impulsive data were performed. As is well known, the Generalized Gaussian distribution and the alpha-stable distribution are classes of probability distributions that generalize the Gaussian to much more impulsive data; the intersection of the two classes is the Gaussian distribution. The ''heavy-tail'' behavior of the alpha-stable distribution causes the variance of the distributions to be infinite, while the Generalized Gaussian distribution has finite variance. In order to test the MMLE on its robustness to severe outlier situations and its adaptability to data models other than the Generalized Gaussian model, which is proved theoretically optimal for the proposed algorithm (studied in Section 3.2), we exemplify our experiments on datasets following symmetric alpha-stable distributions, due to the limitation of space.

The probability density function of an alpha-stable distribution, denoted as S_α(γ, δ), does not have a compact form, while the characteristic function does and can be written as

    E[exp(iu^T x)] = exp(−γ^α |u|^α + iu^T δ),   (29)

where α ∈ (0, 2] is the tail index, γ the dispersion, and δ the d-dimensional location vector of the distribution. As is known, the distribution degrades to the Gaussian distribution when α = 2 and to the Cauchy distribution when α = 1, and it becomes more impulsive for smaller α. The parameter α is responsible for specifying the heaviness of the pdf tails: the smaller its value, the heavier the tail and hence the more severe the outlier situation.

We synthesized 100 datasets, each composed of N = 2000 samples drawn from an alpha-stable mixture of three alpha-stable clusters in 2-dimensional space. The pdf of the datasets is expressed as

    p(x, y) = 0.25 S_1(0.3, [5, 5]) + 0.4 S_{1.5}(0.6, [10, 10]) + 0.35 S_{0.8}(0.4, [0, 14]).   (30)

Besides the location differences, the tail index and dispersion are also different, being α = 1, 1.5, 0.8 and γ = 0.3, 0.6, 0.4, respectively. Shown in Fig. 3 is the scatter plot of the dataset, in the original scale and in a very small region, respectively. Notice that the parameters α of the three alpha-stable distributions in Eq. (30) are different and all less than 2. This means that these three clusters are far from Gaussian (which corresponds to α = 2), and their tails are very different and all very heavy compared with that of a Gaussian. This is the reason why the vertical and horizontal scales in Fig. 3a (both from −600 to +600) are very large, while the real data structure can be viewed on a comparatively smaller scale in Fig. 3b (both from −20 to +30). This indicates that the outlier situation is very severe in this type of dataset. Our task is to discover the locations in this severe noise situation.

We performed clustering experiments with our proposed MMLE, with the claimed-robust L1-norm K-means, and with SMLE, where the metric parameters q are, respectively, cluster-dependent and learnt from the clusters, fixed over all clusters to 1, and fixed from 1.125 to 2.000 with an interval of 0.125. Shown in Table 1 is our experimental result.

It is seen from Table 1 that our proposed MMLE is superior to SMLE and L1-norm K-means in location estimation quality. Among the three algorithms, L1-norm K-means is the worst: both the very large mean and the large standard deviation of the CDs indicate that it converges to outliers and the result is very unstable. This means that although it is claimed robust, it is still not robust enough for such a severe outlier situation. SMLE also possibly converges to outliers for some q values, say q = 1.625, 1.75, 1.875, 2. Though there are situations where it converges to better results, e.g., q = 1.125, 1.250, 1.375, 1.5, the results are still worse than those of the MMLE (the obtained metric parameters q_k are 1.3, 1.6 and 1.35 for the three clusters, respectively). This result indicates that the proposed MMLE is very robust to the presence of outliers even though the noise situation is very serious.

We performed many experiments on datasets of different dimensions, each generated from an alpha-stable mixture model with a different number of mixed alpha-stable distributions with different model parameters. From these experiments we reached the same conclusion: the MMLE is always superior to the other two algorithms in its robustness to very severe outliers.

5.2. Comparisons with typical robust clustering algorithms

Comparisons were made between the proposed MMLE algorithm and three typical claimed-robust unsupervised clustering algorithms. The three algorithms are the fuzzy mixture of Student's-t distribution model (FSMM) proposed by Chatzis and Varvarigou (2008), density-weighted fuzzy c-means (DWFCM) by Chen and Wang (1999), and the minimum covariance determinant (MCD) by Hardin and Rocke (2004). Robustness was compared via both the clustering results and the maximal fuzzy membership functions. We demonstrate the results on the data formulated by Eq. (30). The parameter setting for FSMM is m = 1.4, and for DWFCM it is φ = 2 and h = 2.
1
We synthesized 100 datasets each is composed of N = 2000 ically decreasing sigmoidal function of uk ðxÞ ¼ 1þexpð0:05dðx;m ,
k ÞÞ
samples and is an alpha-stable mixture composed of three alpha- where d(x, mk) is the Mahalanobis distance for the MCD and the

Fig. 3. The heavy-tailed synthetic dataset used in the experiment. (a) The whole region occupied by the dataset; (b) a selected region for illustrating the main part of the
dataset.
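The heavy-tailed mixture of Eq. (30) can be simulated, for illustration, with scipy.stats.levy_stable. The sketch below is our own assumption about the generation: each coordinate receives independent symmetric (β = 0) stable noise and the dispersion is used as the scale parameter, which is one common convention; the paper does not specify its sampling procedure.

```python
import numpy as np
from scipy.stats import levy_stable

def alpha_stable_mixture(n=2000, seed=0):
    """Draw a 2-D dataset resembling Eq. (30) (illustrative parameterization only)."""
    rng = np.random.default_rng(seed)
    comps = [  # (weight, alpha, dispersion, location)
        (0.25, 1.0, 0.3, np.array([5.0, 5.0])),
        (0.40, 1.5, 0.6, np.array([10.0, 10.0])),
        (0.35, 0.8, 0.4, np.array([0.0, 14.0])),
    ]
    counts = rng.multinomial(n, [w for w, _, _, _ in comps])
    X, y = [], []
    for k, ((w, a, g, loc), nk) in enumerate(zip(comps, counts)):
        noise = levy_stable.rvs(a, 0.0, loc=0.0, scale=g, size=(nk, 2),
                                random_state=seed + k)
        X.append(loc + noise)      # independent stable noise on each coordinate
        y.append(np.full(nk, k))
    return np.vstack(X), np.concatenate(y)
```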

Table 1
Comparison of MMLE, SMLE, and L1-norm K-means on location estimation performance for alpha-stable mixture datasets.

Location estimation algorithm | q_k | CD
L1-norm K-means | 1.000 | 5.60 × 10^9 ± 3.87 × 10^10
SMLE | 1.125 | 0.023 ± 0.025
SMLE | 1.250 | 0.014 ± 0.009
SMLE | 1.375 | 0.012 ± 0.008
SMLE | 1.500 | 0.063 ± 0.060
SMLE | 1.625 | 1.08 × 10^8 ± 7.93 × 10^8
SMLE | 1.750 | 1.25 × 10^8 ± 8.00 × 10^8
SMLE | 1.875 | 1.32 × 10^8 ± 7.95 × 10^8
SMLE | 2.000 | 1.34 × 10^8 ± 7.99 × 10^8
MMLE | 1.30, 1.60, 1.35 | 0.008 ± 0.005

Both FSMM and DWFCM are fuzzy-based, while MCD and MMLE are distance-based. For comparison, we map the distance measure to a pseudo fuzzy membership measure, i.e., we set the membership function of a data point x to cluster c_k to be the monotonically decreasing sigmoidal function u_k(x) = 1 / (1 + exp(0.05 d(x, m_k))), where d(x, m_k) is the Mahalanobis distance for the MCD and the Lq-norm (q = q_k) distance for the MMLE between x and m_k. With the membership functions of FSMM and DWFCM, and the pseudo ones for the MCD and MMLE, the maximal fuzzy membership (MFM) of x is set to be v(x) = max_k u_k(x).

Shown in each panel of Fig. 4 is the clustering result, and shown in each panel of Fig. 5 is the contour plot of the MFM function, obtained by each algorithm respectively. Though the data is distributed over a very large range on both the horizontal and vertical axes (both from −600 to +600), due to the heavy-tail property of the alpha-stable distribution of each cluster, we demonstrate the results in the figures over only a very limited horizontal and vertical range, to visualize the concentration of the dataset.

From Fig. 4 we can see that the cluster centers obtained by each algorithm are all very close to the true cluster locations of the data. This indicates that the four algorithms are all robust to outliers in location estimation. It is also seen from Fig. 4 that, in the demonstrated range, DWFCM and MMLE are comparable in clustering performance and superior to FSMM and MCD.

Ideally, the contour of the MFM should be symmetric about each cluster location, since the data was simulated by symmetric alpha-stable distributions for all clusters. By comparing the contours obtained from the four algorithms shown in Fig. 5, it is seen that both MMLE and MCD keep the symmetry of the contours. On the contrary, for DWFCM the contour is biased, and for FSMM it is even more seriously biased and non-symmetric. This is because the serious outliers also participate in the clustering process in these two algorithms. On the other hand, for the MCD, the covariance matrix of some clusters was under-estimated due to its ''half sample'' participation in the estimation, which is the reason why some clusters, i.e., the clusters centered at [0, 14] and at [5, 5], are surrounded by the cluster centered at [10, 10], as seen from Figs. 4c and 5c, respectively. This indicates that MCD is not robust in covariance matrix estimation. From this viewpoint, the contour of the MMLE keeps its symmetry and is robust to outliers even though serious outliers participate in the clustering process.

It can also be seen from Fig. 5 that the contour of the MMLE is more uniform over data space than those of the other algorithms. For those algorithms, the contour becomes sparser in the space farther away from the cluster locations, while the MMLE keeps its uniformity over a larger space. This indicates that the MMLE retains roughly the same spatial resolution even in the space where outliers are located, which might be the reason why the proposed MMLE is more robust to outliers. Shown in Fig. 6 is the MFM function v(x) obtained by the four respective algorithms.

5.3. Outlier detection on R-data and D-data

In this subsection, we compare the proposed RMMC algorithm (i.e., first MMLE, then ICSC) with the two typical robust outlier detection methods: the chi-square distribution cutoff (Chisq) and the F distribution cutoff (Adjusted F) (Hardin & Rocke, 2004), both based on the minimum covariance determinant (MCD), the latter claimed to be more robust than the former. All three approaches require a user cutoff parameter ε, and in our experiment ε was set the same, to 0.005.

A technique similar to the one described in Hardin and Rocke (2004) was adopted for generating two types of datasets: R-data, in which radial outliers are added, and D-data, in which diffuse outliers are added, to a Gaussian mixture dataset with two Gaussian components in d-dimensional space. For the two clean clusters, half of the samples came from one Gaussian distribution and the other half from another Gaussian distribution; both distributions have the same covariance matrix but separate means. The total number of samples generated in the clean clusters is set to n_d. Then outliers were added according to whether R-data or D-data was to be generated. For generating the radial outliers in R-data, Gaussian data with the same center as a cluster but with a larger variance Σ_radial were generated, and the samples

Fig. 4. Clustering result on the alpha-stable mixture data by the FSMM, DWFCM, MCD and MMLE algorithm, respectively.

exceeding the 99.9% containment ellipse of any clean cluster were considered as radial outliers and were added to form the R-data. For generating the diffuse outliers in D-data, Gaussian data with mean m_diffuse and covariance matrix Σ_diffuse were generated, where m_diffuse and Σ_diffuse are the estimated center and covariance matrix of the entire set of clean clusters, and the samples exceeding the 99.9% containment ellipse of the entire set of clean clusters were considered as diffuse outliers and were added to form the D-data. In both the R-data and the D-data, the total number of outliers was controlled to be 5% of the total number of samples in the dataset. The parameters for generating the synthetic datasets are shown in Table 2, with Δ = sqrt(χ²_{d,0.999} / d) (see Hardin & Rocke, 2004), where χ²_{d,0.999} is the 99.9% quantile of the chi-square distribution and d is the dimension of the data space.

Outlier detection performance is measured by the type I and type II errors, respectively. The type I error measures the percentage of clean cluster points identified as outliers over all clean cluster points, and the type II error measures the percentage of true outlying points identified as non-outlying over all true outlying points. Our results are shown in Table 3. Each entry in the table is the average type I or type II error over 100 independently generated R-data and D-data datasets of dimensions d = 4, 7, 10, respectively.

Comparing the performance of the different outlier detection approaches, it is seen that the Chisq is greatly inferior to the other two in that it leads to very large type I errors for all the datasets, though the type II errors are all small or even zero. As to the Adjusted F and the RMMC for R-data, it can be seen that the RMMC is superior to the Adjusted F in that the type I errors are comparable (more precisely, a little bit larger) while the type II errors are all zero. This is not surprising, since the shape parameter (covariance matrix) is estimated using half samples in Adjusted F, which might make the shape parameter under-estimated. Note that neither the RMMC nor the Adjusted F appears to be affected significantly by the dimension of the datasets in their type I errors. Furthermore, the comparison for D-data indicates that the RMMC is much superior to the Adjusted F, in that the type II error is reduced greatly, by at least 2 to 10.5 times, with the type I errors kept comparable.

Now we compare the performance of the RMMC on R-data and D-data. It is seen from Table 3 that the performance on R-data is better than that on D-data, in that the type I errors are comparable while the type II errors for R-data are all zero. The reason appears to be the match between the R-data and the model in this study, in which each outlier is a sample of some cluster generated by the noise level of that cluster. Though the performance of the RMMC on D-data is inferior to that on R-data, it is still greatly superior to that of the Adjusted F.
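For completeness, the two error rates defined above are simple to compute from boolean masks; the helper below is our own illustration, not code from the paper.

```python
import numpy as np

def outlier_error_rates(is_outlier_true, is_outlier_pred):
    """Type I: clean points flagged as outliers / clean points.
    Type II: true outliers flagged as clean / true outliers. Returned in percent."""
    t = np.asarray(is_outlier_true, dtype=bool)
    p = np.asarray(is_outlier_pred, dtype=bool)
    type_i = np.mean(p[~t]) if (~t).any() else 0.0
    type_ii = np.mean(~p[t]) if t.any() else 0.0
    return 100.0 * type_i, 100.0 * type_ii
```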

Fig. 5. Contour plot of the maximal fuzzy membership function v(x) gained by the FSMM, DWFCM, MCD and MMLE algorithm respectively.

Fig. 6. The MFM function v(x) gained by the FSMM, DWFCM, MCD and MMLE algorithm, respectively.

Table 2
Parameters for generating the R-data and D-data datasets.

Type of dataset | Component | Center m | Covariance matrix Σ | Cluster size n (d = 4 / 7 / 10)
Clean clusters | Cluster 1 | [0, 0, ..., 0] | I | 300 / 500 / 500
Clean clusters | Cluster 2 | [2Δ, 0, ..., 0] | I | 300 / 500 / 500
Radial outliers | Outlier subset 1 | [0, 0, ..., 0] | 5I | 15 / 25 / 25
Radial outliers | Outlier subset 2 | [2Δ, 0, ..., 0] | 5I | 15 / 25 / 25
Diffuse outliers | Outlier set | m_diffuse | Σ_diffuse | 30 / 50 / 50
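The R-data recipe described above and parameterized in Table 2 can be sketched as follows; this is our own reading of the construction (in particular, "exceeding the 99.9% containment ellipse of any clean cluster" is interpreted as lying outside the ellipse of every clean cluster), not the authors' code.

```python
import numpy as np
from scipy.stats import chi2

def make_r_data(d=4, n_clean=300, n_out=15, seed=0):
    """Two unit-covariance Gaussian clusters 2*Delta apart, plus radial outliers drawn
    with inflated variance (5I) and kept only if outside the 99.9% ellipse of each cluster."""
    rng = np.random.default_rng(seed)
    delta = np.sqrt(chi2.ppf(0.999, d) / d)
    centers = [np.zeros(d), np.r_[2 * delta, np.zeros(d - 1)]]
    X = np.vstack([rng.normal(c, 1.0, size=(n_clean, d)) for c in centers])
    outliers = []
    while len(outliers) < 2 * n_out:
        c = centers[rng.integers(2)]
        x = rng.normal(c, np.sqrt(5.0), size=d)
        # unit covariance, so the squared Euclidean distance is the squared Mahalanobis distance
        if all(np.sum((x - ck) ** 2) > chi2.ppf(0.999, d) for ck in centers):
            outliers.append(x)
    O = np.vstack(outliers)
    return np.vstack([X, O]), np.r_[np.zeros(len(X)), np.ones(len(O))]
```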

Table 3
proposed in Hawkins et al. (2002), which is based on replicator neural networks (RNNs). Both the RMMC and the Adjusted F used the same chi-square cutoff threshold e, i.e., e = 0.0005. The results are shown in Table 4. Using the RMMC, 46 records were identified as malignant outliers, of which 37 are in fact true malignant records, comprising 95% of all the malignant records. On the contrary, the Adjusted F always failed to detect malignant outliers even though the method was run many times. The reason is that the covariance matrix estimated from the ‘‘half sample’’ was singular in every run of the Adjusted F, so the procedure could not proceed. Although Hardin and Rocke (2004) suggest adding more samples to the half sample to avoid singularity of the covariance matrix, some work indicates that the samples must be chosen carefully so that the covariance is nonsingular (Davé & Krishnapuram, 1997), and the estimated covariance matrix is generally poorly conditioned, which makes the algorithm sensitive to the added samples and hence unreliable for outlier detection. To make the Adjusted F work for comparison, we performed principal component analysis (PCA) retaining 90% of the energy before running the Adjusted F. The situation remained exactly the same: the covariance matrix was singular in every run. We therefore reduced the retained energy to 70%, and the Adjusted F did run, but the dimension was reduced to only one, whereas the original data space has nine dimensions. Note that outliers strongly affect the principal directions, especially in severe outlier cases, and useful information about the true clusters is lost if the dimensionality is reduced too much. Both observations indicate that some real-world applications do encounter the singularity problem when using the Adjusted F. Since a direct comparison of the proposed RMMC with the Adjusted F was therefore impossible, we compared it instead with the outlier factor method proposed by Hawkins et al. (2002), in which records are ranked according to their outlier factors. In our experiment on this dataset, within the top 48 ranked records only 35 malignant records, comprising 89.7% of all the malignant records, were identified.

Shown in Fig. 7 are the Receiver Operating Characteristic (ROC) curves obtained by the proposed RMMC and by the outlier factor method; the former is traced with respect to the chi-square cutoff threshold e and the latter with respect to the top-ranking threshold (Hawkins et al., 2002). The area under the ROC curve obtained by the RMMC is larger than that obtained by the outlier factor method. This application therefore indicates that the proposed RMMC approach is superior in identifying more malignant records among fewer flagged records.

In the lung cancer dataset there are 181 records in total, 31 normal and 150 diseased, with 12,533 genes/attributes. Three influential genes were found by applying the gene selection method proposed by Guyon, Weston, Barnhill, & Vapnik (2002); the Affymetrix Probe Set IDs of these feature genes are 33328_at, 39409_at and 31622_f_at, respectively. After the RMMC method was applied, we obtained one cluster and its corresponding outlier set. Specifically, 149 records constituted the cluster, of which 146 are true disease records; the disease records form the main cluster because of their dominant number. In addition, 33 records were identified as outliers, of which 29 are in fact true normal records. The type I error and type II error are 2.67% and 6.45%, respectively.
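To make the evaluation protocol concrete, the sketch below (an illustration only, not the authors' code; the function name and arguments are hypothetical) traces an ROC curve for a chi-square cutoff detector of the kind used by the RMMC: a record is flagged as an outlier when its robust squared Mahalanobis distance exceeds the chi-square quantile at level 1 - e, and sweeping e yields one point of the curve per threshold.

import numpy as np
from scipy.stats import chi2

def roc_by_chi2_cutoff(sq_mahalanobis, is_outlier, d, e_grid):
    """Trace an ROC curve by sweeping the chi-square cutoff threshold e.

    sq_mahalanobis : squared robust Mahalanobis distance of each record
    is_outlier     : 1 for records that are truly outliers (e.g. malignant), else 0
    d              : data dimensionality (degrees of freedom of the chi-square)
    e_grid         : values of the cutoff threshold e to sweep
    """
    sq_mahalanobis = np.asarray(sq_mahalanobis, dtype=float)
    is_outlier = np.asarray(is_outlier)
    fpr, tpr = [], []
    for e in e_grid:
        cutoff = chi2.ppf(1.0 - e, df=d)           # chi-square quantile
        flagged = sq_mahalanobis > cutoff          # records declared outliers
        tp = np.sum(flagged & (is_outlier == 1))
        fp = np.sum(flagged & (is_outlier == 0))
        tpr.append(tp / max(np.sum(is_outlier == 1), 1))
        fpr.append(fp / max(np.sum(is_outlier == 0), 1))
    order = np.argsort(fpr)
    fpr, tpr = np.asarray(fpr)[order], np.asarray(tpr)[order]
    return fpr, tpr, np.trapz(tpr, fpr)            # last value: area under the curve

A larger area under the curve corresponds to identifying more true outliers at a given false positive ratio, which is how the two methods are compared in Fig. 7.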
Outlier detection performance comparison for the three outlier detection methods.

Method        d     Type I error (%)          Type II error (%)
                    R-data      D-data        R-data      D-data
RMMC          4     0.44        0.49          0.00        0.67
              7     0.52        0.46          0.00        0.60
              10    0.56        0.44          0.00        2.74
Adjusted F    4     0.38        0.44          0.47        7.03
              7     0.47        0.43          0.06        4.52
              10    0.52        0.46          0.00        5.82
Chisq         4     14.36       14.80         0.00        0.23
              7     11.8        11.53         0.00        0.18
              10    11.45       11.25         0.00        0.26

Table 4
Outlier detection performance comparison among RMMC, Adjusted F and outlier factor method for Wisconsin breast cancer dataset.

Method                   Malignant    Type I error    Type II error
RMMC                     37/46        2.03%           5.13%
Adjusted F               -            -               -
Outlier factor method    35/48        2.93%           10.26%
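For concreteness, the percentages in Table 4 are consistent with a test set of roughly 39 malignant and 443 benign records (these class sizes are inferred from the reported figures and are not stated in this section): the type I error is the fraction of benign records wrongly flagged and the type II error is the fraction of malignant records missed, e.g. for the RMMC

\text{Type I} = \frac{46 - 37}{443} \approx 2.03\%, \qquad \text{Type II} = \frac{39 - 37}{39} \approx 5.13\%,

and for the outlier factor method \((48 - 35)/443 \approx 2.93\%\) and \((39 - 35)/39 \approx 10.26\%\).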
Fig. 7. ROC curves obtained by the proposed RMMC and the outlier factor method proposed by Hawkins et al. (2002) for the Wisconsin breast cancer dataset. (The plot shows the true positive ratio against the false positive ratio, with one curve for the proposed method, labelled MMLE, and one for the outlier factor method.)

6. Discussions

The proposed approach has many distinct characteristics over conventional prototype-based approaches in tackling severe outlier situations:

(1) A more flexible data model is introduced. Unlike conventional approaches, which assume that outliers belong to a single noise cluster (Rehm et al., 2007) and/or that outliers are samples far from all the physical clusters, here it is assumed that any outlier belongs to one of the clusters through noise contamination, and that the noise level is cluster dependent according to the underlying physical reality of that cluster. This assumption is reasonable and more flexible in that a sample may be an outlier that is far from some of the clean clusters but not from all of them, and it need not be viewed as a sample of a ‘‘noise cluster’’. Although the model is proved to be MGGD only theoretically, our experimental results indicate that the approach is also effective for other data models (e.g., alpha-stable mixture datasets and the R-data and D-data models) and for many application datasets.

(2) Location and shape are estimated consecutively (rather than jointly), both robust to outliers. Robust location estimation is fundamental and crucial to robust shape estimation: only when the location is estimated robustly can samples far from the location be correctly detected as outliers and removed, so that the shape is estimated from clean data with all potential outliers excluded, making the shape estimation robust to outliers as well. This is contrary to the conventional approaches (Hardin & Rocke, 2004; Rousseeuw & Van Zomeren, 1990), which estimate location and shape jointly. They use samples assumed to come surely from the clean cluster, i.e., the ‘‘half sample’’ selected from a cluster by minimizing the determinant of the sample covariance matrix, to estimate both location and shape jointly. Although this does provide some robustness, the fact that the half sample does not actually form the clean cluster may bias the estimates of both location and shape. In addition, the samples must be chosen carefully so that the covariance is nonsingular (Davé & Krishnapuram, 1997).

(3) Mapping a cluster to the most Gaussian distribution makes location estimation robust to severe outliers. A Gaussian distribution is symmetric, and its location can simply be computed as the mean of the samples. If a cluster is mapped to the most Gaussian distribution by some mapping function, the location computed in the mapped space is insensitive to the inclusion of some outlying samples in data space, even when the outlier situation is severe. This is due to the near-Gaussianity of the mapped cluster and the very small portion of outliers among the cluster samples. Hence the introduction of a mapping function that maps a cluster to the most Gaussian distribution is potentially helpful to the robustness of the approach in severe outlier situations.

In this approach, mapping a cluster to the most Gaussian distribution is a local shrinking. It is fundamentally different from the local-shrinking-based CLUES approach proposed by Wang, Qiu, & Zamar (2007), which clusters data without prior knowledge of the total number of clusters K. First, the shrinkage in CLUES is performed in data space, whereas here it is from data space to a mapped space with a cluster-dependent mapping, which is equivalent to shrinking in data space with a different shrinking intensity for each cluster. Secondly, in CLUES the samples converge to a few so-called focal points, which are in fact the locations, whereas here the mapped cluster converges to the most Gaussian distribution. Thirdly, CLUES uses the k-nearest neighbours to determine the direction and magnitude of the movement during shrinking, whereas here the least non-Gaussianity measure directs them. Owing to this working principle, CLUES is not expected to be robust to severe outliers.

(4) The distance measure is locally defined and cluster dependent. The approach in fact defines a location-based radial distance measure in data space whose metric parameter qk depends on cluster ck. It is conceptually different from the conventional Lq-norm distance measure with q fixed and identical over the whole data space (K-means is the specific case in which q = 2), in which a distance is defined between any pair of samples in the data space. First, the distance being $\|x - m_k\|_{q_k}^{q_k}$ for cluster ck means that it is defined cluster-dependently rather than cluster-independently in data space. Secondly, it is location based, i.e., we can only measure the distance of a sample to a specific cluster (or, equivalently, to its location), and the distance between two arbitrary samples in the data space is not defined. On the other hand, the defined distance measure is still a radial function, in this respect similar to the conventional distance measure. The advantage of introducing a cluster-dependent distance measure is that it allows the noise levels of different clusters to differ, which matches the data model introduced in the approach.

(5) The distance measures are learnt from the dataset. After partitioning the samples into clusters, the distance measure of each cluster is learnt by mapping the cluster to the most Gaussian distribution, using the criterion of least non-Gaussianity, for the cluster-dependent metric parameter qk. Hence the approach provides a technique for learning a distance measure for each cluster, rather than fixing it over all the clusters (the data space), as is done in nearly all related approaches. In fact, the multi-metric Lq-norm distance and the Mahalanobis distance are learnt in the location estimation and the shape estimation processes, respectively, which is what makes the approach robust to severe outliers.

(6) The covariance determinant is minimized. In this approach, shape estimation builds on robust location estimation, which is accomplished by the proposed MMLE. The proposed ICSC detects and removes potential outliers, and it is seen from Eq. (27) that the covariance determinant decreases as the iteration proceeds, until all potential outliers are deemed detected and removed (controlled by the user parameter e). This indicates that at convergence the covariance determinant reaches its minimum and is estimated from the cluster with all potential outliers removed, i.e., from a clean cluster. This scheme is fundamentally different from the MCD-related approaches (Davé & Krishnapuram, 1997; Fauconnier & Haesbroeck, 2009; Hardin & Rocke, 2004; Hubert et al., 2008; Peña & Prieto, 2001; Rousseeuw & Van Zomeren, 1990), where the MCD is computed from the claimed ‘‘closest’’ ‘‘half sample’’ of a cluster, which is only a portion of the cluster samples rather than all the samples of the clean cluster, possibly biasing the estimates of both location and shape.

(7) The approach extends easily to robust clustering by learning multi-metric hyper-kernels. Unlike conventional kernel-based approaches, where the kernel is defined globally over the data space and learnt from the overall dataset, e.g., learning a polynomial kernel of order r or a Gaussian kernel of variance $\sigma^2$, or a convex combination of such kernels (Kim, Magnani, & Boyd, 2006), the approach also motivates a cluster-dependent, locally defined, multi-metric and more general kernel which is learnt from each cluster to tackle datasets in the presence of severe outliers. Let z = w(x) be an explicit map from data space to a mapped space and $\phi(z)$ an implicit map from the mapped space to some feature space, with the corresponding Mercer kernel $\kappa(z_i, z_j) = \phi^{T}(z_i)\phi(z_j)$. The transformation from data space to the feature space is then the implicit map $\phi(w(x))$. The corresponding Mercer kernel, referred to as a hyper-kernel, is

\widetilde{K}(x_i, x_j) = \phi^{T}(w(x_i))\,\phi(w(x_j)) = \kappa(w(x_i), w(x_j))    (31)

It is a hyper-polynomial kernel when $\kappa(z_i, z_j)$ is specified to be a conventional polynomial kernel, and a hyper-Gaussian kernel when $\kappa(z_i, z_j)$ is specified to be a conventional Gaussian kernel.

Specifically, when we set w(x) to be the map in Eq. (9), the hyper-polynomial kernel of order r is

\widetilde{K}(x_i, x_j) = \bigl(1 + w^{T}(x_i)\,w(x_j)\bigr)^{r} = \Bigl(1 + \sum_{k=1}^{d} |x_{ik}x_{jk}|^{m}\,\operatorname{sgn}(x_{ik}x_{jk})\Bigr)^{r} = \Bigl(1 + \operatorname{trace}\bigl[\,|x_i x_j^{T}|^{m} \odot \operatorname{sgn}(x_i x_j^{T})\,\bigr]\Bigr)^{r}    (32)

where $|\cdot|^{m}$, $\operatorname{sgn}(\cdot)$ and $\odot$ act elementwise, and the hyper-Gaussian kernel with variance $\sigma^2$ is

\widetilde{K}(x_i, x_j) = \exp\bigl(-\|w(x_i) - w(x_j)\|_2^{2}/\sigma^{2}\bigr) = \exp\bigl(-\,\bigl\||x_i|^{m}\odot\operatorname{sgn}(x_i) - |x_j|^{m}\odot\operatorname{sgn}(x_j)\bigr\|_2^{2}/\sigma^{2}\bigr)    (33)

Neither can be implemented by the conventional kernels, and both are more flexible, since an additional kernel parameter m is introduced. The approach is in fact a multi-kernel approach in which each kernel is a hyper-polynomial kernel with r = 1 and m (0 <= m = p/2 <= 1) learnt from each cluster while the clustering is being performed. We believe that this idea of learning a cluster-/class-/pattern-dependent, locally defined multi-hyper-kernel from the cluster/class/pattern is also applicable to clustering, classification and pattern analysis in general.
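As an illustration of Eqs. (31)-(33), the following sketch builds the map w(x) = |x|^m * sgn(x) elementwise and the two hyper-kernels. It is a minimal sketch following the stated equations, not the authors' implementation, and all function and variable names are hypothetical.

import numpy as np

def w_map(x, m):
    """Elementwise map w(x) = |x|**m * sign(x) (cf. Eq. (9), with m = p/2)."""
    x = np.asarray(x, dtype=float)
    return np.abs(x) ** m * np.sign(x)

def hyper_polynomial_kernel(xi, xj, m, r=1):
    """Hyper-polynomial kernel of order r, Eq. (32): (1 + w(xi)'w(xj))**r."""
    return (1.0 + w_map(xi, m) @ w_map(xj, m)) ** r

def hyper_gaussian_kernel(xi, xj, m, sigma2):
    """Hyper-Gaussian kernel with variance sigma2, Eq. (33)."""
    diff = w_map(xi, m) - w_map(xj, m)
    return np.exp(-(diff @ diff) / sigma2)

# With m = 1 and r = 1 the hyper-polynomial kernel reduces to the ordinary
# inhomogeneous linear kernel 1 + xi'xj, so the conventional kernels are the
# special case in which the extra parameter m is not learnt.
xi, xj = np.array([1.0, -2.0, 0.5]), np.array([0.3, 0.7, -1.2])
print(hyper_polynomial_kernel(xi, xj, m=1.0), 1.0 + xi @ xj)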
7. Conclusions

A robust multi-metric clustering (RMMC) approach is proposed for datasets with severe outliers. It is composed of two consecutive procedures, multi-metric location estimation (MMLE) and multi-metric iterative chi-square cutoff (ICSC). It is verified that MMLE solves a multi-objective optimization problem for the multivariate generalized Gaussian mixture data model in the presence of outliers, with the centers (mk) and metric parameters (qk) unknown and to be discovered. ICSC solves the shape problem (i.e., the covariance matrices Σk) for the Gaussian mixture model with known and fixed locations mk, for k = 1, 2, . . . , K. The total number of clusters K is assumed known a priori, and each underlying cluster is supposed to be Gaussian distributed, as most conventional clustering algorithms assume (Rehm et al., 2007). The approach reduces to the conventional K-means when the cluster-dependent multi-metric learnt from the data is replaced by the cluster-independent setting of a single metric q = 2 over all clusters.
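To make this reduction explicit, the sketch below shows the assignment step with cluster-dependent metric parameters; setting every qk = 2 recovers the classical K-means rule of Eq. (1). This is an illustration only: the MMLE update and the non-Gaussianity criterion by which the qk are learnt are defined earlier in the paper and are not reproduced here, and the function name is hypothetical.

import numpy as np

def assign_multi_metric(X, centers, q):
    """Assign each sample to the cluster minimizing the cluster-dependent
    Lq 'distance' ||x - m_k||_{q_k}^{q_k}.

    X       : (N, d) data matrix
    centers : (K, d) cluster locations m_k
    q       : length-K array of cluster-dependent metric parameters q_k
    """
    X, centers, q = np.asarray(X), np.asarray(centers), np.asarray(q)
    dist = np.stack(
        [np.sum(np.abs(X - centers[k]) ** q[k], axis=1) for k in range(len(q))],
        axis=1,
    )                                   # dist[i, k] = sum_j |x_ij - m_kj|**q_k
    return np.argmin(dist, axis=1)

# q = [2, 2, ..., 2] gives the squared-Euclidean (K-means) assignment;
# the RMMC instead learns a separate q_k for every cluster.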
The most significant feature of the approach is its strong robustness to severe outliers. Experiments on impulsive-noise, heavy-tailed alpha-stable mixture data, with comparisons against four typical robust clustering algorithms (L1-norm K-means, FSMM, DWFCM and MCD) in terms of clustering performance and maximal fuzzy membership functions, together with experiments on the synthetic R-data and D-data, with comparisons against two typical outlier detection algorithms (Chisq and Adjusted F, both MCD based), verify the strong robustness of the approach. Experiments on real datasets, i.e., the Wisconsin breast cancer dataset and the lung cancer dataset, also indicate that the proposed algorithm is superior to the RNN-based outlier factor method proposed by Hawkins et al. (2002).

Further work can improve the search process for the multi-metric pk, k = 1, 2, . . . , K. Other search methods, such as the simple k-nearest-neighbour based method used in Wang et al. (2007), Monte Carlo search and/or simulated annealing, might be more computationally efficient. In addition, the proposed ICSC might be improved by taking into account the distributional properties of extreme points, especially for small samples, as is done in the F-distribution cutoff approach proposed by Hardin & Rocke (2004). Furthermore, MMLE could be extended to the multi-metric weighted Lq-norm distance case with the covariance included. How to determine the total number of clusters K, especially in the severe outlier case, and how to extend the approach to projective clustering for very high dimensional data in the presence of severe outliers, e.g., for gene expression data (Bandyopadhyay & Santra, 2008; Kashef & Kamel, 2008; Zhao, Chan, Cheng, & Yan, 2009), are also challenging problems.

Another promising direction is extending the approach to a kernel-based or a fuzzy-based method. In fact, the proposed approach can be viewed as a kernel-based approach with a cluster-dependent multi-kernel, namely a multi-hyper-polynomial kernel of order r = 1 with the kernel parameter m = p/2 learnt from the cluster. From this viewpoint, other kernels (e.g., the more flexible hyper-kernel) with kernel parameters learnt from the data might be helpful for clustering datasets with arbitrary cluster shapes rather than Gaussian shapes and in the presence of severe outliers.

Distance/kernel is one of the most fundamental concepts in the pattern analysis domain. Unlike the conventional globally defined single-metric distance/single-kernel approaches in pattern analysis, a cluster-dependent multi-metric distance/multi-kernel is introduced here and learnt from the cluster. This might motivate cluster-/class-/pattern-dependent, multi-metric, locally defined distances/kernels with criteria adapted to the application, and thereby influence and improve pattern analysis performance from the viewpoints of feature selection, feature extraction, clustering, classification, prediction, pattern discovery and so forth.

Acknowledgements

This work was supported by the National Natural Science Foundation (Grant No. 61070137) and the National Key Natural Science Foundation of China (Grant No. 60933009). It was also supported by the Chinese-Italian bilateral project ‘‘Statistical learning techniques for cancer diagnosis’’ funded by the Chinese Ministry of Science and Technology and the Italian Ministry of Foreign Affairs.

Appendix A. CDM algorithm

The initialization algorithm, referred to as the CDM (clustering, deleting and merging) algorithm, is proposed in this study. It is suitable for initializing most prototype-based clustering algorithms even for datasets in severe outlier situations. The process comprises clustering, deleting and merging. Let the total number of clusters be K. First, the data is partitioned into a much larger number of clusters than K by the simple conventional K-means, i.e., with K0 ≫ K; as a result, outliers generally end up in some small-sized clusters. Second, we delete the clusters of small size, i.e., those whose sizes are below a size threshold d (in our experiments, d is taken to be 5%). Then the clusters that are close to each other in Euclidean distance in the data space are merged to form a new cluster, and the mean of the samples lying in that cluster is taken as the initialization of the location. This strategy is insensitive to outliers, since the clusters containing outliers are generally deleted.

For merging the close clusters into new clusters and obtaining altogether K clusters, the partitioning process proposed by Wang et al. (2007) and Ankerst et al. (1999) is adopted. Specifically, after clustering and deleting, the remaining locations are ordered according to their closeness, and the resulting sequence is partitioned into K subsequences according to the K maximum center distances in the sequence (see Wang et al., 2007; Ankerst, Breunig, Kriegel, & Sander, 1999). Finally, the clusters corresponding to the locations of each subsequence are merged to form a new cluster, and the mean of the samples in this new cluster is computed and used as the initialization of the location of that cluster.
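A rough sketch of this initialization, assuming scikit-learn is available, is given below. It follows the clustering-deleting-merging steps just described, except that the ordering-and-splitting step of Wang et al. (2007) and Ankerst et al. (1999) is replaced by a plain agglomerative grouping of the surviving centres, so it is only an approximation of the CDM procedure; the function name and parameters are hypothetical.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def cdm_init(X, K, K0=None, size_frac=0.05):
    """Sketch of CDM (clustering, deleting, merging) initialization."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    K0 = K0 if K0 is not None else 10 * K                   # over-cluster: K0 >> K
    labels = KMeans(n_clusters=K0, n_init=10).fit_predict(X)            # clustering
    keep = [k for k in range(K0) if np.sum(labels == k) >= size_frac * N]  # deleting
    centers = np.array([X[labels == k].mean(axis=0) for k in keep])
    # Merging: group the surviving centres into K groups (a stand-in for the
    # ordering-based splitting; assumes at least K clusters survive deletion).
    groups = AgglomerativeClustering(n_clusters=K).fit_predict(centers)
    init = np.zeros((K, X.shape[1]))
    for g in range(K):
        merged = [keep[i] for i in np.where(groups == g)[0]]
        init[g] = X[np.isin(labels, merged)].mean(axis=0)   # mean of merged samples
    return init                                             # K initial locations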

Appendix B. Proof of Theorem 1

$J_a(m_k) = \sum_{k=1}^{K}\frac{1}{N_k}\sum_{x_i \in c_k}\|x_i - m_k\|_{q_k}^{q_k}$ is a continuous function of $m_k$, differentiable everywhere for $q_k \neq 1$ but not differentiable everywhere for $q_k = 1$. Hence the solution of Eq. (14) satisfies

\partial J_a / \partial m_k = 0, \quad k = 1, 2, \ldots, K    (B1-a)

for $q_k \neq 1$, where $\partial J_a/\partial m_k$ is the derivative of $J_a$ with respect to $m_k$; or

-a \le \partial J_a / \partial m_k \le b, \quad k = 1, 2, \ldots, K    (B1-b)

for $q_k = 1$, where $\partial J_a/\partial m_k$ is the subdifferential of $J_a$ with respect to $m_k$ and $a$ and $b$ are non-negative d-dimensional vectors.

Noticing that $|a| = a\,\operatorname{sgn}(a)$ for any scalar $a$, and thus $\frac{\partial}{\partial A}\|A\|_q^q = q|A|^{q-1}\odot\operatorname{sgn}(A)$ for $q \neq 1$ and $\frac{\partial}{\partial A}\|A\|_q^q = \operatorname{sgn}(A) \in [-I, +I]$ for $q = 1$ for any vector $A$, where $I$ is a same-sized vector of ones, the $m_k$ satisfying

\frac{\partial J_a}{\partial m_k} = -\frac{1}{N_k}\sum_{x_i \in c_k} q_k\,|x_i - m_k|^{q_k - 1}\odot\operatorname{sgn}(x_i - m_k) = 0, \quad q_k \neq 1,    (B2-a)

or

\frac{\partial J_a}{\partial m_k} = -\frac{1}{N_k}\sum_{x_i \in c_k}\operatorname{sgn}(x_i - m_k) \in [-a, +b], \quad q_k = 1,    (B2-b)

is the optimal solution of Eq. (14), for k = 1, 2, . . . , K.

This is exactly what the MMLE algorithm pursues. To prove this, we denote the converged solution by starred symbols and check whether it satisfies Eq. (B2). At the convergence of the algorithm we have $m_{k,n+1} = m_{k,n} = m_k^{*}$. Hence, from Eq. (13),

\mu_k^{*} = 0.    (B3)

Substituting Eqs. (11) and (12) into Eq. (B3) gives

\sum_{x_l \in c_k}|x_l - m_k^{*}|^{p_k/2}\odot\operatorname{sgn}(x_l - m_k^{*}) = 0.    (B4)

This is just Eq. (B2-a) if we set

q_k = 1 + \frac{p_k}{2}, \quad k = 1, 2, \ldots, K    (B5)

for $q_k \neq 1$.

For $q_k = 1$ we have $p_k = 0$ by Eq. (B5), and the converged $m_k^{*}$ satisfying Eq. (B4) then yields $\sum_{x_i \in c_k}\operatorname{sgn}(x_{ij} - m^{*}_{kj}) = 0$ for j = 1, 2, . . . , d, where $x_{ij}$ and $m^{*}_{kj}$ are the jth elements of $x_i$ and $m_k^{*}$, respectively. Without loss of generality, this indicates that the converged $m^{*}_{kj}$ has the same number of samples on its left as on its right, for every j = 1, 2, . . . , d. If we shift $m_{kj}$ around $m^{*}_{kj}$ without reaching its nearest neighbouring samples, the value of $\sum_{x_i \in c_k}\operatorname{sgn}(x_{ij} - m_{kj})$ remains 0. However, by shifting $m_{kj}$ to the nearest neighbouring sample on its left, say $m_{kj} = m^{*}_{kj} - \Delta e_1$, the left side of Eq. (B4) becomes $\sum_{x_i \in c_k}\operatorname{sgn}(x_{ij} - m^{*}_{kj} + \Delta e_1) = +2$; similarly, by shifting $m_{kj}$ to the nearest neighbouring sample on its right, say $m_{kj} = m^{*}_{kj} + \Delta e_2$, the left side of Eq. (B4) becomes $\sum_{x_i \in c_k}\operatorname{sgn}(x_{ij} - m^{*}_{kj} - \Delta e_2) = -2$. Therefore the subdifferential $\partial J_a/\partial m_{kj} = -\frac{1}{N_k}\sum_{x_i \in c_k}\operatorname{sgn}(x_{ij} - m^{*}_{kj})$ on the left side of Eq. (B2-b) follows the curve shown in Fig. B and lies in the range $\frac{1}{N_k}[-2, 2]$; hence the subdifferential $\partial J_a/\partial m_k$ at $m_k^{*}$ lies in a range $[-a, b]$ with $a$ and $b$ nonnegative vectors. Therefore the converged $m_k^{*}$ for $q_k = 1$ satisfies Eq. (B2-b).

Combining the two situations $q_k \neq 1$ and $q_k = 1$, we conclude that the algorithm searches for the optimal solution of Eq. (14) for the $m_k$'s when the $q_k$'s are known a priori, for k = 1, 2, . . . , K.

Fig. B. Subdifferential of the function $J_a(m_{kj})$ with respect to $m_{kj}$ at the converged $m_{kj} = m^{*}_{kj}$. (The horizontal axis marks $m^{*}_{kj} - \Delta e_1$, $m^{*}_{kj}$ and $m^{*}_{kj} + \Delta e_2$.)

References

Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD.
Archambeau, C., & Verleysen, M. (2007). Robust Bayesian clustering. Neural Networks, 20, 129–138.
Bandyopadhyay, S., & Santra, S. (2008). A genetic approach for efficient outlier detection in projected space. Pattern Recognition, 41(4), 1338–1349.
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In Proceedings of the 20th international conference on machine learning.
Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125–137.
Bernholt, T., & Fischer, P. (2004). The complexity of computing the MCD-estimator. Theoretical Computer Science, 326(1–3), 383–398.
Chatzis, S., & Varvarigou, T. (2008). Robust fuzzy clustering using mixtures of student's-t distributions. Pattern Recognition Letters, 29, 1901–1905.
Chen, J. L., & Wang, J. H. (1999). A new robust clustering algorithm – density-weighted fuzzy c-means. In IEEE international conference on systems, man, and cybernetics.
Chen, T., Martin, E., & Montague, G. (2009). Robust probabilistic PCA with missing data and contribution analysis for outlier detection. Computational Statistics and Data Analysis, 53(10), 3706–3716.
Cuesta-Albertos, J. A., Gordaliza, A., & Matrán, C. (1997). Trimmed k-means: An attempt to robustify quantizers. Annals of Statistics, 25(2), 553–576.
Davé, R. N., & Krishnapuram, R. (1997). Robust clustering models: A unified view. IEEE Transactions on Fuzzy Systems, 5(2), 270–293.
Fauconnier, C., & Haesbroeck, G. (2009). Outliers detection with the minimum covariance determinant estimator in practice. Statistical Methodology, 6(4), 363–379.
Girolami, M. (2002). Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3), 780–784.
Gordon, G. J., Jensen, R. V., Hsiao, L. L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., et al. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62, 4963–4967.
Guo, S. M., Chen, L. C., & Tsai, J. S. H. (2009). A boundary method for outlier detection based on support vector domain description. Pattern Recognition, 42(1), 77–83.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
Hardin, J., & Rocke, D. M. (2004). Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics and Data Analysis, 44(4), 625–638.
Hathaway, R. J., Bezdek, J. C., & Hu, Y. (2000). Generalized fuzzy c-means clustering strategies using Lp norm distances. IEEE Transactions on Fuzzy Systems, 8(5), 576–582.
Hawkins, S., He, H., Williams, G., & Baxter, R. (2002). Outlier detection using replicator neural networks. Lecture Notes in Computer Science, 2454, 170–180.
Hodge, V., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85–126.
Hubert, L., Arabie, P., & Meulman, J. (1997). Hierarchical clustering and the construction of (optimal) ultrametrics using Lp-norms. In L1-statistical procedures and related topics. IMS Lecture Notes – Monograph Series (Vol. 31, pp. 457–472).
Hubert, M., Rousseeuw, P. J., & Van Aelst, S. (2008). High-breakdown robust multivariate methods. Statistical Science, 23(1), 92–119.
Hyvärinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13, 411–430.
Kashef, R., & Kamel, M. S. (2008). Towards better outliers detection for gene expression datasets. In International conference on biocomputation, bioinformatics, and biomedical technologies.
Kim, S. J., Magnani, A., & Boyd, S. (2006). Optimal kernel selection in kernel fisher discriminant analysis. In Proceedings of the 23rd international conference on machine learning.

Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News, 23(5), 1–18.
Miyamoto, S., & Agusta, Y. (1996). Efficient algorithms for Lp fuzzy c-means and their termination properties. In The 5th conference of the international federation of classification societies.
Peña, D., & Prieto, F. J. (2001). Multivariate outlier detection and robust covariance matrix estimation. Technometrics, 43(3), 286–310.
Rehm, F., Klawonn, F., & Kruse, R. (2007). A novel approach to noise clustering for outlier detection. Soft Computing, 11(5), 489–494.
Rocke, D. M., & Woodruff, D. L. (1996). Identification of outliers in multivariate data. Journal of the American Statistical Association, 91(435), 1047–1061.
Rousseeuw, P. J., & Van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411), 633–651.
Wang, C. H. (2009). Outlier identification and market segmentation using kernel-based clustering techniques. Expert Systems with Applications, 36(2), 3744–3750.
Wang, X., Qiu, W., & Zamar, R. H. (2007). An iterative non-parametric clustering algorithm based on local shrinking. Computational Statistics and Data Analysis, 52, 286–298.
Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2005). Distance metric learning for large margin nearest neighbor classification. In NIPS 2005.
Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2002). Distance metric learning with application to clustering with side-information. In NIPS 2002.
Xiong, H. L., & Chen, X. W. (2006). Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics, 7(299), 1–11.
Ye, M., Li, X., & Orlowska, M. E. (2009). Projected outlier detection in high-dimensional mixed-attributes data set. Expert Systems with Applications, 36(3), 7104–7113.
Zhang, D., Gatica-Perez, D., Bengio, S., & McCowan, L. (2005). Semi-supervised adapted HMMs for unusual event detection. In IEEE computer society conference on computer vision and pattern recognition.
Zhang, Y. G., Zhang, C. S., & Zhang, D. (2004). Distance metric learning by knowledge embedding. Pattern Recognition, 37, 161–163.
Zhao, H., Chan, K. L., Cheng, L. M., & Yan, H. (2009). A probabilistic relaxation labeling framework for reducing the noise effect in geometric biclustering of gene expression data. Pattern Recognition, 42(11), 2578–2588.
