Abstract—Temporal data clustering provides underpinning techniques for discovering the intrinsic structure and condensing
information over temporal data. In this paper, we present a temporal data clustering framework via a weighted clustering
ensemble of multiple partitions produced by initial clustering analysis on different temporal data representations. In our
approach, we propose a weighted consensus function guided by clustering validation criteria to reconcile initial partitions to
candidate consensus partitions from different perspectives and then introduce an agreement function to further reconcile those
candidate consensus partitions to a final partition. As a result, the proposed weighted clustering ensemble algorithm provides an
effective enabling technique for the joint use of different representations, which cuts the information loss in a single
representation and exploits various information sources underlying temporal data. In addition, our approach tends to capture the
intrinsic structure of a data set, e.g., the number of clusters. Our approach has been evaluated with benchmark time series,
motion trajectory, and time-series data stream clustering tasks. Simulation results demonstrate that our approach yields favorable
results for a variety of temporal data clustering tasks. As our weighted clustering ensemble algorithm can combine any input
partitions to generate a clustering ensemble, we also investigate its limitations by formal analysis and empirical studies.
Index Terms— Temporal data clustering, clustering ensemble, different representations, weighted consensus function, model
selection
—————————— ——————————
1 INTRODUCTION
well presented in its own representation space and inevitably incurs a loss of useful information. Furthermore, it is difficult to select a representation that presents a given temporal data set properly without prior knowledge and a careful analysis. These problems often hinder a representation-based approach from achieving satisfactory performance.
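To make the representation-based approach concrete, two of the global representations used later in this paper (polynomial curve fitting and Fourier descriptors; see Sect. 4.1) can be sketched as below. The function names, the polynomial order R = 3, and the number of retained DFT coefficients d = 8 are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def pcf_representation(x, R=3):
    """Polynomial curve fitting (PCF): fit x(t) with a degree-R polynomial
    by least squares and use the R+1 coefficients as the representation."""
    t = np.arange(len(x))
    return np.polyfit(t, x, deg=R)  # coefficients, highest order first

def dft_representation(x, d=8):
    """Fourier descriptor: keep the real and imaginary parts of the d
    lowest-frequency DFT coefficients for robustness against noise."""
    coeffs = np.fft.fft(x) / len(x)
    low = coeffs[:d]
    return np.concatenate([low.real, low.imag])

# A noisy sine series yields a compact description in either space.
rng = np.random.default_rng(0)
t = np.arange(128)
x = np.sin(2 * np.pi * t / 32) + 0.1 * rng.standard_normal(128)
print(pcf_representation(x).shape, dft_representation(x).shape)  # (4,) (16,)
```

Either function maps a variable-length series to a fixed-length vector, which is exactly what makes a subsequent clustering step convenient, and also where the information loss discussed above arises.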
As an emerging area in machine learning, clustering ensemble algorithms have recently been studied from different perspectives, e.g., clustering ensembles with graph partitioning [16],[18], evidence aggregation [17],[19]-[21] and optimization via semi-definite programming [22]. The basic idea behind a clustering ensemble is to combine multiple partitions of the same data set to produce a consensus partition expected to be superior to any of the given input partitions. Although there are few
studies on theoretical justification of the clustering ensemble methodology, growing empirical evidence supports the idea and indicates that a clustering ensemble is capable of detecting novel cluster structures [16]-[22]. In addition, a formal analysis of clustering ensembles reveals that, under certain conditions, a proper consensus solution uncovers the intrinsic structure underlying a given data set [23]. Thus, the clustering ensemble provides a generic enabling technique for using different representations jointly for temporal data clustering.

Fig. 1. Distributions of the time series data set in various PCA representation manifolds formed by the first two principal components of their representations. (a) PLS. (b) PDWT. (c) PCF. (d) DFT.

Motivated by recent clustering ensemble studies [16]-[23] and our success in the use of different representations to deal with difficult pattern classification tasks [24]-[28], we present an approach to temporal data clustering with different representations to overcome the fundamental weakness of representation-based temporal data clustering analysis. Our approach consists of initial clustering analysis on different representations, to produce multiple partitions, and clustering ensemble construction, to produce a final partition by combining the partitions achieved in the initial clustering analysis. While the initial clustering analysis can be done by any existing clustering algorithm, we propose a novel weighted clustering ensemble algorithm with a two-stage reconciliation process. In our proposed algorithm, a weighted consensus function reconciles the input partitions into candidate consensus partitions according to various clustering validation criteria. Then, an agreement function further reconciles those candidate consensus partitions to yield a final partition.

The contributions of this paper are summarized as follows. First, we develop a practical temporal data clustering model using different representations via clustering ensemble learning, to overcome the fundamental weakness of representation-based temporal data clustering analysis. Next, we propose a novel weighted clustering ensemble algorithm, which not only provides an enabling technique to support our model but can also be used to combine any input partitions; a formal analysis has also been done. Finally, we demonstrate the effectiveness and efficiency of our model on a variety of temporal data clustering tasks, as well as its easy-to-use nature, as all internal parameters are fixed in our simulations.

In the rest of the paper, Sect. 2 describes the motivation and our model, and Sect. 3 presents our weighted clustering ensemble algorithm along with an algorithm analysis. Sect. 4 reports simulation results on a variety of temporal data clustering tasks. Sect. 5 discusses issues relevant to our approach, and the last section draws conclusions.

2 TEMPORAL DATA CLUSTERING WITH DIFFERENT REPRESENTATIONS

In this section, we first describe our motivation for proposing our temporal data clustering model. Then, we present our temporal data clustering model working on different representations via clustering ensemble learning.

2.1 Motivation
It is known that different representations encode various facets of the structural information of temporal data in their representation spaces. For illustration, we perform principal component analysis (PCA) on four typical representations (see Sect. 4.1 for details) of a synthetic time series data set. The data set is produced by the stochastic function $F(t) = A\sin(2\pi\omega t + B) + \varepsilon(t)$, where $A$, $B$ and $\omega$ are free parameters and $\varepsilon(t)$ is additive noise drawn from the normal distribution $N(0,1)$. The use of four different parameter sets $(A, B, \omega)$ leads to time series of four classes, with 100 time series in each class.

As shown in Fig. 1, the four representations of the time series exhibit various distributions in their PCA representation subspaces. For instance, the classes marked with blue triangles and magenta stars are easily separated from the other two overlapping classes in Fig. 1(a), whilst so are the classes marked with green circles and red dots in Fig. 1(b). Similarly, different yet useful structural information can also be observed in the plots of Fig. 1(c) and 1(d). Intuitively, our observation suggests that a single representation captures only partial structural information and that the joint use of different representations is more likely to capture the intrinsic structure of a given temporal data set. When a clustering algorithm is applied to different representations, diverse partitions would be
YANG AND CHEN: TEMPORAL DATA CLUSTERING 3
generated. To exploit all information sources, we need to reconcile these diverse partitions to find a consensus partition superior to any of the input partitions.
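The reconciliation step can be sketched with the common evidence-accumulation scheme: build a co-association matrix over all input partitions and cut it at a threshold. This is a minimal stand-in for illustration, not the paper's weighted consensus function, and the threshold tau = 0.5 is an assumption.

```python
import numpy as np

def co_association(partitions):
    """Average co-association matrix: entry (i, j) is the fraction of
    input partitions that place objects i and j in the same cluster."""
    P = np.asarray(partitions)          # shape (M, N), integer labels
    return (P[:, :, None] == P[:, None, :]).mean(axis=0)

def consensus_by_threshold(partitions, tau=0.5):
    """Cut the co-association graph at tau and label connected components."""
    A = co_association(partitions) >= tau
    N = A.shape[0]
    labels = -np.ones(N, dtype=int)
    current = 0
    for seed in range(N):
        if labels[seed] >= 0:
            continue
        stack = [seed]
        labels[seed] = current
        while stack:                    # depth-first flood fill
            i = stack.pop()
            for j in np.nonzero(A[i] & (labels < 0))[0]:
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

# Three partitions that agree (up to label permutation) on objects {0,1} vs {2,3}.
parts = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
print(consensus_by_threshold(parts))    # [0 0 1 1]
```

Note that the consensus is invariant to how each input partition names its clusters, which is precisely why co-association evidence is a convenient common currency for combining partitions.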
From Fig. 1, we further observe that the partitions yielded by a clustering algorithm are unlikely to carry equal amounts of useful information, due to their different distributions in the different representation spaces. However, most existing clustering ensemble methods treat all partitions equally during the reconciliation, which brings about an averaging effect. Our previous empirical studies [29] found that such treatment could have the following adverse effects. When the partitions to be combined are radically different, clustering ensemble methods often yield a worse final partition. Moreover, the averaging effect is particularly harmful in a majority voting mechanism, especially when many highly correlated partitions are highly inconsistent with the intrinsic structure of a given data set. As a result, we strongly believe that input partitions should be treated differently so that their contributions are taken into account, via a weighted consensus function.
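The weighting idea can be sketched as follows, using a simple between/within scatter ratio as a stand-in validity criterion π (the paper's actual criteria, e.g. DVI, MHΓ and NMI, are introduced in Sect. 3) and normalizing criterion values into weights that sum to one.

```python
import numpy as np

def scatter_ratio(X, labels):
    """A toy validity criterion: between-cluster scatter divided by
    within-cluster scatter (higher means a cleaner partition)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    within, between = 0.0, 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)
        within += ((Xk - ck) ** 2).sum()
        between += len(Xk) * ((ck - mu) ** 2).sum()
    return between / max(within, 1e-12)

def partition_weights(X, partitions):
    """Normalize criterion values into weights that sum to one."""
    scores = np.array([scatter_ratio(X, np.asarray(p)) for p in partitions])
    return scores / scores.sum()

# Two well-separated blobs: the correct split should earn the larger weight.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 5])
good = np.array([0] * 5 + [1] * 5)
bad = np.array([0, 1] * 5)
w = partition_weights(X, [good, bad])
print(w)   # first weight close to 1
```

Under such a weighting, a poor partition contributes little to the consensus rather than dragging it toward an average, which is the adverse effect described above.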
Without the ground truth, the contribution of a partition is actually unknown in general. Fortunately, existing clustering validation criteria [30] measure the clustering quality of a partition from different perspectives, e.g., the validation of intra- and inter-class variation of clusters. To a great extent, we can employ clustering validation criteria to estimate the contributions of partitions in terms of clustering quality. However, a clustering validation criterion often measures the clustering quality from a specific viewpoint only, simply highlighting a certain aspect. In order to estimate the contribution of a partition precisely in terms of clustering quality, we need to use various clustering validation criteria jointly. This idea is empirically justified by a simple example below. In the following description, we omit the technical details of weighted clustering ensembles, which will be presented in Sect. 3, for illustration only.

Fig. 2. Results of clustering analysis and clustering ensembles. (a) The data set of ground truth. (b) The partition of maximum DVI. (c) The partition of maximum MHΓ. (d) The partition of maximum NMI. (e) DVI WCE. (f) MHΓ WCE. (g) NMI WCE. (h) Multiple criteria WCE. (i) The Cluster Ensemble [16].

Fig. 2(a) shows a two-dimensional synthetic data set subject to a mixture of Gaussian distributions in which there are five intrinsic clusters of heterogeneous structures; the ground-truth partition is given for evaluation and marked by red (cluster 1), blue (cluster 2), black (cluster 3), cyan (cluster 4) and green (cluster 5). Visual inspection of the structure of the data set shown in Fig. 2(a) suggests that clusters 1 and 2 are relatively separate, while cluster 3 spreads widely and clusters 4 and 5, of different populations, overlap each other.

Applying the K-mean algorithm to the data set under different initialization conditions, including the cluster centers and the number of clusters, yields 20 partitions. Using different clustering validation criteria [30], we evaluate the clustering quality of each single partition. Fig. 2(b)-(d) depict the single partitions of maximum value in terms of the different criteria. The DVI criterion always favors a partition of balanced structure. Although the partition in Fig. 2(b) meets this criterion well, it properly groups clusters 1-3 only but fails to work on clusters 4 and 5. The MHΓ criterion generally favors partitions with a larger number of clusters. The partition in Fig. 2(c) meets this criterion but fails to group clusters 1-3 properly. Similarly, the partition in Fig. 2(d) fails to separate clusters 4 and 5 but is still judged the best partition in terms of the NMI criterion, which favors the most common structures detected in all partitions. Using a single criterion to estimate the contribution of partitions in the weighted clustering ensemble (WCE) thus inevitably leads to the incorrect consensus partitions illustrated in Fig. 2(e)-2(g), respectively. As the three criteria reflect different yet complementary facets of clustering quality, their joint use to estimate the contribution of partitions becomes a natural choice. As illustrated in Fig. 2(h), the consensus partition yielded by the multiple-criteria based WCE is very close to the ground truth in Fig. 2(a). As a classic approach, Cluster Ensemble [16] treats all partitions equally while reconciling the input partitions. When applied to this data set, it yields a consensus partition, shown in Fig. 2(i), that fails to detect the intrinsic structure underlying the data set.

In summary, the above intuitive demonstration strongly suggests the joint use of different representations for temporal data clustering and the necessity of developing a weighted clustering ensemble algorithm.

2.2 Model Description
Based on the motivation described in Sect. 2.1, we propose a temporal data clustering model with a weighted clustering ensemble working on different representations. As illustrated in Fig. 3, the model consists of three modules, i.e., representation extraction, initial clustering analysis and weighted clustering ensemble.

Temporal data representations are generally classified into two categories: piecewise and global representations. A piecewise representation is generated by partitioning the temporal data into segments at critical points based on a criterion, and each segment is then modeled by a concise representation. All segment representations, in order, collectively form a piecewise representation, e.g.,
4 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
adaptive piecewise constant approximation [8] and curvature-based PCA segments [9]. In contrast, a global representation is derived from modeling the temporal data via a set of basis functions, so that the coefficients of the basis functions constitute a holistic representation, e.g., polynomial curve fitting [10],[11], discrete Fourier transforms [13],[14], and discrete wavelet transforms [12]. In general, temporal data representations used in this module should

matrix into a dendrogram [3], where its horizontal axis indexes all the data in a given data set, while its vertical

Fig. 3. Temporal data clustering with different representations.

for a formal clustering ensemble analysis [23].

A. Partition Weighting Scheme
Assume that $X = \{x_n\}_{n=1}^{N}$ is a data set of $N$ objects and there are $M$ partitions $P = \{P_m\}_{m=1}^{M}$ on $X$, obtained from the initial clustering analysis, where the cluster numbers in the $M$ partitions could differ. Our partition weighting scheme assigns a weight $w_m^{\pi}$ to each $P_m$ in terms of a clustering validation criterion $\pi$, and the weights for all partitions based on the criterion $\pi$ collectively form a weight vector $w^{\pi} = \{w_m^{\pi}\}_{m=1}^{M}$ for the partition collection $P$. In the partition weighting scheme, we define a weight
$$w_m^{\pi} = \frac{\pi(P_m)}{\sum_{m=1}^{M} \pi(P_m)}, \qquad (1)$$

$$\mathrm{NMI}(P_m, P_o) = \frac{-2\sum_{i=1}^{K_m}\sum_{j=1}^{K_o} N_{ij}^{mo}\,\log\!\Big(\frac{N_{ij}^{mo}\,N}{N_i^m N_j^o}\Big)}{\sum_{i=1}^{K_m} N_i^m \log\!\Big(\frac{N_i^m}{N}\Big) + \sum_{j=1}^{K_o} N_j^o \log\!\Big(\frac{N_j^o}{N}\Big)}, \qquad (4)$$
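Eq. (4) can be computed directly from the contingency counts of two label arrays. The following minimal sketch assumes integer cluster labels and is not the paper's implementation.

```python
import numpy as np

def nmi(pa, pb):
    """Normalized mutual information between two partitions, eq. (4) form:
    numerator  -2 * sum_ij N_ij * log(N_ij * N / (N_i * N_j)),
    denominator sum_i N_i log(N_i / N) + sum_j N_j log(N_j / N)."""
    pa, pb = np.asarray(pa), np.asarray(pb)
    N = len(pa)
    num = 0.0
    for i in np.unique(pa):
        for j in np.unique(pb):
            nij = np.sum((pa == i) & (pb == j))
            if nij:
                ni, nj = np.sum(pa == i), np.sum(pb == j)
                num += -2.0 * nij * np.log(nij * N / (ni * nj))
    den = sum(np.sum(pa == i) * np.log(np.sum(pa == i) / N) for i in np.unique(pa)) \
        + sum(np.sum(pb == j) * np.log(np.sum(pb == j) / N) for j in np.unique(pb))
    return num / den

p = [0, 0, 1, 1, 2, 2]
print(nmi(p, p))            # ≈ 1.0 for identical partitions
```

Because the score depends only on the contingency table, it is invariant to relabeling of the clusters, which is what makes it usable for comparing partitions produced independently.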
tion of input partitions in general. If we know the ground-truth partition, $C$, and all possible partitions, $P_i$, of the given data set, the ground-truth partition would be the "mean" of all possible partitions [23]:
$$C = \arg\min_{P} \sum_{i} \Pr(P_i \mid C)\, d(P_i, P). \qquad (9)$$

a single clustering validation criterion $\pi$, or $w_m$ produced by the joint use of multiple criteria. By inserting the estimated "mean" in the above form and the intrinsic "mean" into (12), the second term of (12) becomes
$$\Big(\sum_{m=1}^{M} (\gamma_m - w_m)\, S_m\Big)^{2}. \qquad (13)$$
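The "mean"-partition view of (9) can be illustrated by restricting the search to the input partitions themselves, with uniform probabilities and a pairwise co-clustering disagreement as the distance d; both restrictions are simplifying assumptions made only for this sketch.

```python
import numpy as np

def disagreement(pa, pb):
    """d(Pa, Pb): fraction of object pairs on which the two partitions
    disagree about being co-clustered (a Rand-index-style distance)."""
    pa, pb = np.asarray(pa), np.asarray(pb)
    sa = pa[:, None] == pa[None, :]
    sb = pb[:, None] == pb[None, :]
    iu = np.triu_indices(len(pa), k=1)
    return np.mean(sa[iu] != sb[iu])

def median_partition(partitions):
    """Pick the input partition minimizing the summed distance to all
    others, i.e. eq. (9) restricted to the input set with uniform Pr."""
    costs = [sum(disagreement(p, q) for q in partitions) for p in partitions]
    return partitions[int(np.argmin(costs))]

parts = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(median_partition(parts))   # [0, 0, 1, 1]
```

The majority structure wins because it is the closest, on average, to all inputs, which is the intuition behind treating the consensus as a "mean" partition.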
TABLE 1
TIME SERIES BENCHMARK INFORMATION [32].

levels leads to a local representation with all coefficients:
$$\{x(t)\}_{t=n|W|+1}^{(n+1)|W|} \mapsto \{L_J, \{H_j\}_{j=1}^{J}\}. \qquad (17)$$
However, the dimension of this representation is the window size, $|W|$. For dimensionality reduction, we apply the Sammon mapping technique [35], mapping the wavelet coefficients in (17) non-linearly onto a pre-specified low-dimensional space to form our PDWT representation.
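The windowed decomposition can be sketched with a plain Haar transform; the wavelet choice, the window size |W| = 8, and the omission of the Sammon-mapping reduction step are simplifications for illustration, not the paper's exact setup.

```python
import numpy as np

def haar_dwt(window, levels):
    """Decompose a window into approximation L_J and details {H_j}:
    each level averages/differences adjacent pairs of samples."""
    a = np.asarray(window, dtype=float)
    details = []
    for _ in range(levels):
        a, h = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        details.append(h)
    return a, details  # (L_J, [H_1, ..., H_J])

def pdwt_coefficients(x, window=8, levels=3):
    """Concatenate all coefficients of each length-|W| window; the result
    per window still has dimension |W|, hence the later reduction step."""
    feats = []
    for start in range(0, len(x) - window + 1, window):
        L, Hs = haar_dwt(x[start:start + window], levels)
        feats.append(np.concatenate([L] + Hs[::-1]))
    return np.array(feats)

x = np.sin(np.linspace(0, 4 * np.pi, 64))
print(pdwt_coefficients(x).shape)   # (8, 8): 8 windows, |W| = 8 coefficients
```

With the orthonormal normalization used here, the transform preserves the energy of each window, so no information is lost before the dimensionality-reduction step.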
C. PCF Representation
In [9], a time series is modeled by fitting it to the parametric polynomial function
$$x(t) = \alpha_R t^{R} + \alpha_{R-1} t^{R-1} + \cdots + \alpha_1 t + \alpha_0.$$
Here $\alpha_r$ ($r = 0, 1, \ldots, R$) is the polynomial coefficient of the $r$th order. The fitting is carried out by minimizing a least-square error criterion. All $R+1$ coefficients obtained via the optimization constitute a PCF representation, a location-dependent global representation.

D. DFT Representation
Discrete Fourier transforms have been applied to derive a global representation of time series in the frequency domain [10]. The DFT of a time series $\{x(t)\}_{t=1}^{T}$ yields a set of Fourier coefficients:
$$a_d = \frac{1}{T}\sum_{t=1}^{T} x(t)\exp\!\Big(\!-\frac{j2\pi dt}{T}\Big), \quad d = 0, 1, \ldots, T-1.$$
Then, we retain only the top $d$ ($d \ll T$) coefficients for robustness against noise; i.e., the real and imaginary parts corresponding to low frequencies collectively form a Fourier descriptor, a location-independent global representation.

4.2 Time Series Benchmarks
Time series benchmarks of 16 synthetic or real-world time series data sets [32] have been collected to evaluate time series classification and clustering algorithms in the context of temporal data mining. In this collection [32], the ground truth, i.e., the class label of the time series in a data set and the number of classes K*, is given, and each data set is further divided into training and testing subsets for the evaluation of classification algorithms. The information on all 16 data sets is tabulated in Table 1. In our simulations, we use all 16 whole data sets, containing both training and test subsets, to evaluate clustering algorithms.

In the first part of our simulations, we conduct three types of experiments. First, we employ classic temporal data clustering algorithms working directly on the time series to establish their performance on the benchmark collection, used as a baseline yielded by temporal-proximity and model-based algorithms. We use the hierarchical clustering (HC) algorithm [3] and the K-mean based hidden Markov model (K-HMM) [5], while the performance of the K-mean algorithm is provided by the benchmark collectors. Next, our experiment examines the performance on the four representations to see if the use of a single representation is enough to achieve good performance. For this purpose, we employ two well-known algorithms, i.e., K-mean, an essential algorithm, and DBSCAN [36], a sophisticated density-based algorithm that can discover clusters of arbitrary shape and is good at model selection. Last, we apply our weighted clustering ensemble (WCE) algorithm to combine the input partitions yielded by K-mean, HC and DBSCAN with different representations during initial clustering analysis.

We adopt the following experimental settings for the first two types of experiments. For clustering directly on temporal data, we allow the K-HMM to use the correct cluster number, K*, and the results of K-mean were achieved under the same condition. There is no parameter setting in HC. For clustering on single representations, K* is also used in K-mean. Although DBSCAN does not need prior knowledge of a given data set, two parameters in the algorithm need to be tuned for good performance [36]. We follow the suggestions in the literature [36] to find the best parameters for each data set by an exhaustive search within a proper parameter range. Given that K-mean is sensitive to initial conditions even when K* is given, we run the algorithm 20 times on each data set with different initial conditions in the two aforementioned experiments. As the results achieved are used for comparison to clustering ensembles, we report only the best result for fairness.

For clustering ensembles, we do not use the prior knowledge, K*, in K-mean to produce partitions on different representations, in order to test the model selection capability. As a result, we take the following procedure to produce a partition with K-mean. With a random number generator of uniform distribution, we draw a number within the range $K^* - 2 \le K \le K^* + 2$ ($K > 0$). The chosen number is then the K used in K-mean to produce a partition under an initial condition. For each data set, we repeat the above procedure to produce 10 partitions on each single representation, so that there are in total 40 partitions for combination when K-mean is used for initial clustering analysis. As mentioned previously, there is no parameter setting in HC, and therefore only one partition is yielded for a given data set; thus, we need to combine only the four partitions returned by HC on the different representations for each data set. When DBSCAN is used in initial clustering analysis on different representations, we combine the four best partitions of each data set achieved in the single-representation experiments mentioned above.

It is worth mentioning that for all representation-based clustering in the aforementioned experiments, we always fix the internal parameters of the representations (see Sect.
TABLE 2
CLASSIFICATION ACCURACY (%) OF DIFFERENT CLUSTERING ALGORITHMS ON TIME SERIES BENCHMARKS [32].
4.1 for details) for all 16 data sets. Here, we emphasize that there is no parameter tuning in our simulations.

As used by the benchmark collectors [32], the classification accuracy [37] is used for performance evaluation. It is defined by
$$\mathrm{Sim}(C, P_m) = \frac{1}{K^{*}}\sum_{i=1}^{K^{*}} \max_{j \in \{1,\ldots,K\}} \frac{2\,|C_i \cap P_{mj}|}{|C_i| + |P_{mj}|}.$$
Here, $C = \{C_1, \ldots, C_{K^*}\}$ is a labeled data set that offers the ground truth and $P_m = \{P_{m1}, \ldots, P_{mK}\}$ is a partition produced by a clustering algorithm for the data set. For reliability, we use two additional common evaluation criteria for further assessment but report those results in Appendix 2 due to the limited space here.

Table 2 collectively lists all the results achieved in the three types of experiments. For clustering directly on temporal data, it is observed from Table 2 that K-HMM, a model-based method, generally outperforms K-mean and HC, the temporal-proximity methods, as it has the best performance on 10 out of 16 data sets. Given that HC is capable of finding a cluster number in a given data set, we also report the model selection performance of HC, with the notation that * is added behind the classification accuracy if HC finds the correct cluster number. As a result, HC manages to find the correct cluster number for only four data sets, as shown in Table 2, which indicates the model selection challenge in clustering temporal data of high dimensions. It is worth mentioning that K-HMM takes a considerably longer time in comparison with the other algorithms, including the clustering ensembles.

With regard to clustering on single representations, it is observed from Table 2 that no representation always leads to the best performance on all data sets, no matter which algorithm, K-mean or DBSCAN, is used, and the winning performance is achieved across the four representations for different data sets. The results demonstrate the difficulty in choosing an effective representation for a given temporal data set. From Table 2, we observe that DBSCAN is generally better than the K-mean algorithm in the appropriate representation space, and it correctly detects the right cluster number for nine out of 16 data sets in total with appropriate representations and the best parameter setup. This implies that model selection in a single representation space is rather difficult, given that even a sophisticated algorithm does not perform well due to the information loss in the representation extraction.

From Table 2, it is evident that our proposed approach achieves significantly better performance in terms of both classification accuracy and model selection, no matter which algorithm is used in initial clustering analysis. It is observed that our clustering ensemble combining input partitions produced by a clustering algorithm on different representations always outperforms the best partition yielded by the same algorithm on single representations, even though, when K-mean is used, we compare the average result of our clustering ensemble to the best one on single representations. In terms of model selection, our clustering ensemble fails to detect the right cluster numbers for only two and one out of the 16 data sets, respectively, when K-mean and HC are used for initial clustering analysis, whilst the DBSCAN clustering ensemble is successful for ten of the 16 data sets. For comparison across all three types of experiments, we present the best performance with bold entries in Table 2. It is observed that our approach achieves the best classification accuracy for 15 out of 16 data sets, given that the K-mean, HC and DBSCAN based clustering ensembles achieve the best performance for six, four and five out of the 16 data sets, respectively, and K-HMM working on raw time series yields the best performance for the Adiac data set.

In our approach, an alternative clustering ensemble algorithm can be used to replace our WCE algorithm. For comparison, we employ three state-of-the-art algorithms developed from different perspectives, i.e., Cluster Ensemble (CE) [16], the hybrid bipartite graph formulation (HBGF) algorithm [18] and the semidefinite-programming based clustering ensemble (SDP-CE) [22]. In our experiments, K-mean is used for initial clustering analysis. Given that all three algorithms were developed without addressing model selection, we use the correct cluster number of each data set, K*, in K-mean. Thus, 10 partitions for a single representation are generated with different initial conditions, and overall there are 40 partitions to be combined for each data set. For our WCE, we use exactly the same
TABLE 3
CLASSIFICATION ACCURACY (%) OF CLUSTERING ENSEMBLES

Fig. 4. All motion trajectories in the CAVIAR database.

procedure for K-mean to produce partitions as described earlier, where K is randomly chosen from $K^* - 2 \le K \le K^* + 2$ ($K > 0$). For reliability, we conduct 20 trials and report the average and the standard deviation of the classification accuracy rates.
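The bookkeeping of this protocol (drawing K around K*, repeating trials, and scoring with the classification accuracy Sim defined earlier) can be sketched on synthetic labels; the data, seed, and K* = 4 are illustrative, and the clustering step itself is omitted.

```python
import numpy as np

def sim(truth, pred):
    """Classification accuracy: for each ground-truth class C_i, find the
    best-matching cluster by the overlap 2|C_i ∩ P_j| / (|C_i| + |P_j|),
    then average over the K* classes."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    classes = np.unique(truth)
    total = 0.0
    for c in classes:
        ci = truth == c
        total += max(
            2 * np.sum(ci & (pred == j)) / (ci.sum() + np.sum(pred == j))
            for j in np.unique(pred)
        )
    return total / len(classes)

rng = np.random.default_rng(1)
k_star = 4
# Draw K uniformly from [K*-2, K*+2], clipped to stay positive, per trial.
ks = rng.integers(max(1, k_star - 2), k_star + 3, size=20)
print(ks.min() >= 2, ks.max() <= 6)   # True True
truth = np.repeat(np.arange(4), 25)
print(sim(truth, truth))              # 1.0 when the partition is perfect
```

Because each class is matched to its best cluster, the score is invariant to cluster relabeling and remains defined even when the partition has more or fewer clusters than K*.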
Table 3 shows the performance of the four clustering ensemble algorithms, where the notation is the same as used in Table 2. It is observed from Table 3 that the SDP-CE, HBGF and CE algorithms win on five, two and one data sets, respectively, while our WCE algorithm has the best results for the remaining eight data sets. Moreover, a closer observation indicates that our WCE algorithm also achieves the second-best results for four of the eight data sets where other algorithms win. In terms of computational complexity, the SDP-CE suffers from the highest computational burden, while our WCE has a higher computational cost than the CE and the HBGF. Considering its capability of model selection and its trade-off between performance and computational efficiency, we believe that our WCE algorithm is especially suitable for temporal data clustering with different representations. Further
assessment with other common evaluation criteria has also been done, and the entirely consistent results reported in Appendix 2 allow us to draw the same conclusion.

Fig. 5. The final partition on the CAVIAR database by our approach; plots in (a)-(o) correspond to 15 clusters of moving trajectories.

In summary, our approach yields favorable results on the benchmark time series collection in comparison to classic temporal data clustering and state-of-the-art clustering ensemble algorithms. Hence, we conclude that our proposed approach provides a promising yet easy-to-use technique for temporal data mining tasks.

4.3 Motion Trajectory
In order to explore a potential application, we apply our approach to the CAVIAR database for trajectory clustering analysis. The CAVIAR database [33] was originally designed for video content analysis; it contains manually annotated video sequences of pedestrians, resulting in 222 motion trajectories, as illustrated in Fig. 4.

A spatio-temporal motion trajectory is 2D spatio-temporal data with the notation $\{(x(t), y(t))\}_{t=1}^{T}$, where $(x(t), y(t))$ are the coordinates of an object tracked at frame $t$; it can therefore also be treated as two separate time series $\{x(t)\}_{t=1}^{T}$ and $\{y(t)\}_{t=1}^{T}$ by considering its x- and y-projections, respectively. As a result, the representation of a motion trajectory is simply a collective representation of the two time series corresponding to its x- and y-projections. Motion trajectories tend to have various lengths, and therefore a normalization technique needs to be used to facilitate the representation extraction. Thus, each motion trajectory is re-sampled to a pre-specified number of sample points, i.e., a length of 1500 in our simulations, by a polynomial interpolation algorithm. After re-sampling, all trajectories are normalized to a Gaussian distribution of zero mean and unit variance in the x- and y-directions. Then, the four different representations described in Sect. 4.1 are extracted from the normalized trajectories.

Given that there is no prior knowledge on the "right" number of clusters for this database, we run the K-mean algorithm 20 times, randomly choosing a K value from an interval between five and 25 and initializing the cluster centers randomly, to generate 20 partitions on each of the four representations. In total, 80 partitions are fed to our WCE to yield a final partition, as shown in Fig. 5. Without the ground truth, human visual inspection has to be applied for evaluating the results, as suggested in [38].

TABLE 4
PERFORMANCE ON THE CAVIAR DATABASE CORRUPTED WITH NOISE.

Fig. 6. Typical clustering results with a single representation only.

By
the common human visual experience, behaviors of pede‐
strians across the shopping mall are roughly divided into
five categories: “move up”, “move down”, “stop”, “move
left”, and “move right” from the camera viewpoint [35].
Ideally trajectories of the similar behaviors are grouped
Fig. 7. Classification accuracy of our approach on the CAVIAR data-
together along a motion direction and then results of clus‐ base and its noisy version in simulated occlusion situations.
tering analysis are used to infer different activities at a
semantic level, e.g., “enter the store”, “exit from the
store”, “pass in front” and “stop to watch”. ful clusters as validated by human visual inspection.
As observed from Fig. 5, coherent motion trajectories To demonstrate the benefit from the joint use of differ‐
have been properly grouped together while dissimilar ent representations, Fig. 6 illustrates several meaningless
ones are distributed into different clusters. For example, the trajectories corresponding to the activity of “stop to watch” are accurately grouped in the cluster shown in Fig. 5(e). Those trajectories corresponding to moving from left to right and from right to left are properly grouped into two separate clusters, as shown in Fig. 5(c) and 5(f). The trajectories corresponding to “move up” and “move down” are grouped very well into two clusters, as shown in Fig. 5(j) and 5(k). Fig. 5(a), 5(d), 5(g), 5(h), 5(i), 5(n) and 5(o) indicate that trajectories corresponding to most activities of “enter the store” and “exit from the store” are properly grouped together via multiple clusters in light of various starting positions, locations, moving directions and so on. Finally, Fig. 5(l) and 5(m) illustrate two clusters roughly corresponding to the activity “pass in front”.
The CAVIAR database was also used in [38], where the self-organizing map with the single DFT representation was used for clustering analysis. In their simulation, the number of clusters was determined manually and all trajectories were simply grouped into nine clusters. Although most of the clusters achieved in their simulation are consistent with ours, their method failed to separate a number of trajectories corresponding to different activities, simply putting them together into a cluster called “abnormal behaviors” instead. In contrast, ours properly groups them into several clusters shown in Fig. 5(a), 5(b), 5(e) and 5(i). If the clustering analysis results are employed for modeling events or activities, their merged cluster inevitably fails to provide any useful information for a higher level analysis. Moreover, the cluster shown in Fig. 5(j), corresponding to “move up”, was missing from their partition without any explanation [38]. Here, we emphasize that, unlike their manual approach to model selection, our approach automatically yields 15 meaningful clusters.
Fig. 6 shows clusters of trajectories, judged by visual inspection, yielded by the same clustering ensemble but on single representations, respectively. Fig. 6(a) and 6(b) show that, using only the PCF representation, some trajectories of line structures along the x- and y-axes are improperly grouped together. For these trajectories “perpendicular” to each other, their x and y components have only a considerable coefficient value on their linear basis but a tiny coefficient value on any higher order basis. Consequently, this leads to a short distance between such trajectories in the PCF representation space, which is responsible for the improper grouping. Fig. 6(c) and 6(d) illustrate a limitation of the DFT representation; i.e., trajectories with the same orientation but with different starting points are improperly grouped together, since the DFT representation is in the frequency domain and therefore independent of spatial locations. Although the PLS and PDWT representations highlight local features, global characteristics of trajectories could be neglected. The cluster based on the PLS representation shown in Fig. 6(e) improperly groups trajectories belonging to two clusters in Fig. 5(k) and 5(n). Likewise, Fig. 6(f) shows an improper grouping based on the PDWT representation that merges three clusters in Fig. 5(a), 5(d) and 5(i). All the above results suggest that the joint use of different representations is capable of overcoming the limitations of individual representations.
For further evaluation, we conduct two additional simulations. The first one intends to test the generalization performance on noisy data produced by adding different amounts of Gaussian noise, N(0, σ²), to the range of coordinates of moving trajectories. The second one simulates a scenario in which a tracked moving object is occluded by other objects or the background, which leads to missing data in a trajectory. For robustness, 50 independent trials have been done in our simulations.
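The DFT limitation noted above, insensitivity to spatial location, can be sketched with a toy example (ours, not the paper's actual trajectory pipeline): two parallel straight-line trajectories that differ only by a constant spatial offset share every non-zero-frequency Fourier coefficient, so a DFT-based distance that drops or normalizes the DC term treats them as nearly identical.

```python
import numpy as np

# Two parallel straight-line trajectories (same orientation),
# shifted by a constant spatial offset of 5 units.
t = np.linspace(0.0, 1.0, 64)
traj_a = t            # y = t
traj_b = t + 5.0      # same line, different starting point

# DFT coefficients of each trajectory (one coordinate as a 1-D signal).
fa = np.fft.fft(traj_a)
fb = np.fft.fft(traj_b)

# The spatial offset only changes the DC (zero-frequency) term ...
dc_gap = abs(fa[0] - fb[0]) / len(t)   # recovers the offset itself

# ... so a DFT feature vector that excludes the DC coefficient
# cannot tell the two trajectories apart.
dist_without_dc = np.linalg.norm(fa[1:] - fb[1:])

print(round(dc_gap, 6))            # 5.0
print(round(dist_without_dc, 6))   # 0.0
```

This is why trajectories with the same orientation but different starting points can collapse into one cluster under a purely frequency-domain representation.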
YANG AND CHEN: TEMPORAL DATA CLUSTERING 11
Table 4 presents the results of classifying noisy trajectories with the final partition shown in Fig. 5, where a decision is made by finding the cluster whose center is closest to the tested trajectory in terms of the Euclidean distance and then checking whether its clean version belongs to this cluster. Apparently, the classification accuracy highly depends on the quality of the clustering analysis. It is evident from Table 4 that the performance is satisfactory in contrast to those of the clustering ensemble on a single representation, especially as a substantial amount of noise is added, which again demonstrates the synergy between different representations.
TABLE 5
RESULTS OF THE ODAC ALGORITHM VS. OUR APPROACH.
To simulate trajectories with missing data, we remove five segments of identical length from a trajectory at random locations, and missing segments of various lengths are used for testing. As a result, the task is to classify a trajectory with missing data using its observed data only, and the same decision-making rule mentioned above is used. We conduct this simulation on all trajectories with missing data and their noisy versions produced by adding the Gaussian noise N(0, 0.1). Fig. 7 shows the performance evolution in the presence of missing data, measured by a percentage of the trajectory length. It is evident that our approach performs well in simulated occlusion situations.
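The decision rule described above can be sketched as follows (an illustrative implementation with our own naming, assuming cluster centers are available as plain vectors in the chosen representation space; the masked variant mirrors the missing-data case, where only observed entries are compared):

```python
import numpy as np

def classify_by_nearest_center(trajectory, centers, mask=None):
    """Assign a trajectory to the cluster whose center is closest in
    Euclidean distance; with missing data, compare observed entries only."""
    if mask is None:
        mask = np.ones_like(trajectory, dtype=bool)
    dists = [np.linalg.norm(trajectory[mask] - c[mask]) for c in centers]
    return int(np.argmin(dists))

# Toy example: three cluster "centers" and two query trajectories.
centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([10.0, 0.0])]
noisy = np.array([4.6, 5.3])        # noisy copy of center 1
occluded = np.array([4.6, 0.0])     # second entry unobserved (occluded)
observed = np.array([True, False])  # mask of observed entries

print(classify_by_nearest_center(noisy, centers))               # 1
print(classify_by_nearest_center(occluded, centers, observed))  # 1
```

As in the simulations, classification quality under noise or occlusion then hinges entirely on how well the cluster centers reflect the intrinsic structure of the data.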
number of clusters on eight sensors from 2 to 9. The best results on the two collections were reported in [39], where their ODAC algorithm working on the time-series data streams only found three sensor clusters for each user, and their clustering quality was evaluated by the MHΓ and the DVI. From Table 5, it is observed that our approach achieves much higher MHΓ and DVI values but finds considerably different cluster structures; i.e., five clusters found for userID=6 and four clusters found for userID=25.
Although our approach considerably outperforms the ODAC algorithm in terms of clustering quality criteria, we would not conclude that our approach is much better than the ODAC algorithm, given the fact that those criteria evaluate the clustering quality from a specific perspective only and there is no ground truth available. We believe that a whole stream of all observations should contain the precise information on its intrinsic structure. In order to verify our results, we apply a batch hierarchical clustering (BHC) algorithm [3] to the two pooled whole streams. Fig. 8 depicts the results of the batch clustering algorithm in contrast to ours. It is evident that ours is completely consistent with that of the BHC on userID=6, and ours on userID=25 is identical to that of the BHC except that ours groups sensors 2 and 7 in a cluster while the BHC separates them and merges sensor 2 into a larger cluster. Comparing with the structures uncovered by the ODAC [39], one can clearly see that theirs are quite distinct from those yielded by the BHC algorithm. Although our partition on userID=25 is not entirely consistent with that of the BHC, the overall results on the two streams are considerably better than those of the ODAC [39].
In summary, the simulation described above demonstrates how our approach is applied to an emerging application field in temporal data clustering. By exploiting the additional information, our approach leads to a substantial improvement. Using the clustering ensemble with different representations, however, our approach has a higher computational burden and requires slightly larger memory for initial clustering analysis. Nevertheless, it is apparent from Fig. 3 that the initial clustering analysis on different representations is completely independent, and so is the generation of candidate consensus partitions. Thus, we firmly believe that advanced computing technology nowadays, e.g., parallel computation, can be adopted to overcome this weakness in a real application.
5 DISCUSSION
The use of different temporal data representations in our approach plays an important role in cutting information loss during representation extraction, a fundamental weakness of representation-based temporal data clustering. Conceptually, temporal data representations acquired from different domains, e.g., temporal vs. frequency, and on different scales, e.g., local vs. global as well as fine vs. coarse, tend to be complementary. In our simulations reported in this paper, we simply use four temporal data representations of a complementary nature to demonstrate our idea of cutting information loss. Although our work concerns representation-based clustering, we have addressed little on the representation-related issues per se, including the development of novel temporal data representations and the selection of representations to establish a synergy to produce appropriate partitions for a clustering ensemble. We anticipate that our approach would be improved once those representation-related problems are tackled effectively.
The cost function derived in (12) suggests that the performance of a clustering ensemble depends on both the quality of input partitions and the clustering ensemble scheme. First, initial clustering analysis is a key factor responsible for performance. According to the first term of (12) in Sect. 3.3, good performance demands that the variance of input partitions is small and the optimal “mean” is close to the intrinsic “mean”, i.e., the ground-truth partition. Hence, appropriate clustering algorithms need to be chosen to match the nature of a given problem to produce input partitions of such a property, apart from the use of different representations. When domain knowledge is available, it can be integrated via appropriate clustering algorithms during initial clustering analysis. Moreover, the structural information underlying a given data set may be exploited, e.g., via manifold clustering [40], to produce input partitions reflecting its intrinsic structure. As long as an initial clustering analysis returns input partitions encoding domain knowledge and characterizing the intrinsic structural information, the “abstract” similarity (i.e., whether or not two entities are in the same cluster) used in our weighted clustering ensemble will inherit them during the combination of input partitions. In addition, the weighting scheme in our algorithm also allows any other useful criteria and domain knowledge to be integrated. All of the above paves a new way to improve our approach.
As demonstrated, a clustering ensemble algorithm provides an effective enabling technique for using different representations in a flexible yet effective way. Our previous work [24]-[28] shows that a single learning model working on a composite representation formed by lumping different representations together is often inferior to an ensemble of multiple learning models on different representations for supervised and semi-supervised learning. Moreover, our earlier empirical studies [29] and those not reported here also confirm our previous finding for temporal data clustering. Therefore, our approach is more effective and efficient than a single learning model on the composite representation of a much higher dimension.
As a generic technique, our weighted clustering ensemble algorithm is applicable to the combination of any input partitions in its own right, regardless of temporal data clustering. Therefore, we relate our algorithm to the most relevant work and highlight the essential differences between them.
The Cluster Ensemble algorithm [16] presents three heuristic consensus functions to combine multiple partitions. In their algorithm [16], the three consensus functions are applied to produce three candidate consensus partitions, respectively, and then the NMI criterion is employed to find a final partition by selecting the one with the maximum NMI value from the candidate consensus partitions. Although there is a two-stage reconciliation process in both their algorithm [16] and ours, the following characteristics distinguish ours from theirs. First, ours uses only a uniform weighted consensus function that allows various clustering validation criteria for weight generation. Various clustering validation criteria are used to produce multiple candidate consensus partitions (in this paper, we use only three criteria). Then, we use an agreement function to generate a final partition by combining all candidate consensus partitions rather than selecting one. Our consensus and agreement functions are developed under the evidence accumulation framework [19]. Unlike the original algorithm [19], where all input partitions are treated equally, we use the accumulated evidence in a selective way. When (13) is applied to the original algorithm [19] for analysis, it can be viewed as a special case of our algorithm with w_m = 1/M. Thus, its cost defined in (12) is simply a constant independent of the combination. In other words, the algorithm in [19] does not exploit the useful information on the relationship between input partitions and works well only if all input partitions have a similar distance to the ground-truth partition in the partition space. Thus, we believe that this analysis identifies the fundamental weakness of a clustering ensemble algorithm that treats all input partitions equally during combination.
Alternative weighted clustering ensemble algorithms [41]-[43] have also been developed. In general, they can be divided into two categories in terms of the weighting scheme: cluster vs. partition weighting. A cluster weighting scheme [41],[42] associates clusters in a partition with a weighting vector and embeds it in the subspace spanned by an adaptive combination of feature dimensions, while a partition weighting scheme [43] assigns a weight vector to the partitions to be combined. Our algorithm clearly belongs to the latter category but adopts a different principle from that used in [43] to generate weights. The algorithm in [43] comes up with an objective function encoding the overall weighted distance between all input partitions to be combined and the consensus partition to be found. Thus, an optimization problem has to be solved to find the optimal consensus partition. According to our analysis in Sect. 3.3, however, the optimal “mean” in terms of their objective function may be inconsistent with the optimal “mean” obtained by minimizing the cost function defined in (11), and hence the quality of their consensus partition is not guaranteed. Moreover, although their algorithm is developed under the non-negative matrix factorization framework [43], the iterative procedure for the optimal solution incurs a high computational complexity of O(n³). In contrast, our algorithm calculates weights directly with clustering validation criteria, which allows for the use of multiple criteria to measure the contribution of partitions and leads to much faster computation. Note that the efficiency issue is critical for some real applications, e.g., temporal data stream clustering. In our ongoing work, we are developing a weighting scheme exploiting the synergy between cluster and partition weighting.
6 CONCLUSION
In this paper, we have presented a temporal data clustering approach via a weighted clustering ensemble on different representations and have further proposed a useful measure to understand clustering ensemble algorithms based on a formal clustering ensemble analysis [23]. Simulations show that our approach yields favorable results for a variety of temporal data clustering tasks in terms of clustering quality and model selection. As a generic framework, our weighted clustering ensemble approach allows other validation criteria [30] to be incorporated directly to generate a new weighting scheme as long as they better reflect the intrinsic structure underlying a data set. In addition, our approach does not suffer from a tedious parameter tuning process or a high computational complexity. Thus, our approach provides a promising yet easy-to-use technique for real world applications.
ACKNOWLEDGMENT
The authors are grateful to Vikas Singh for providing the SDP-CE [22] Matlab code used in our simulations, and to the anonymous reviewers for their comments, which have significantly improved the presentation of this paper. The Matlab code of the CE [16] and the HBGF [18] used in our simulations was downloaded from the authors' websites.
REFERENCES
[1] J. Kleinberg, “An impossibility theorem for clustering,” Proc. Advances in Neural Information Processing Systems, vol. 15, 2002.
[2] E. Keogh and S. Kasetty, “On the need for time series data mining benchmarks: A survey and empirical study,” Knowledge and Data Discovery, vol. 6, pp. 102-111, 2002.
[3] A. Jain, M. Murty, and P. Flynn, “Data clustering: A review,” ACM Computing Surveys, vol. 31, pp. 264-323, 1999.
[4] R. Xu and D. Wunsch II, “Survey of clustering algorithms,” IEEE Trans. Neural Networks, vol. 16, pp. 645-678, 2005.
[5] P. Smyth, “Probabilistic model-based clustering of multivariate and sequential data,” Proc. Int. Workshop on Artificial Intelligence and Statistics, pp. 299-304, 1999.
[6] K. Murphy, “Dynamic Bayesian networks: representation, inference and learning,” PhD thesis, Dept. Comp. Sci., University of California, Berkeley, U.S.A., 2002.
[7] Y. Xiong and D. Yeung, “Mixtures of ARMA models for model-based time series clustering,” Proc. IEEE Int. Conf. Data Mining, pp. 717-720, 2002.
[8] N. Dimitrova and F. Golshani, “Motion recovery for video content classification,” ACM Trans. Information Systems, vol. 13, pp. 408-439, 1995.
[9] W. Chen and S. Chang, “Motion trajectory matching of video objects,” Proc. SPIE/IS&T Conf. Storage & Retrieval for Media Database, 2000.
[10] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, “Fast subsequence matching in time-series databases,” Proc. ACM SIGMOD Conf., pp. 419-429, 1994.
[11] E. Sahouria and A. Zakhor, “Motion indexing of video,” Proc. IEEE Int. Conf. Image Processing, vol. 2, pp. 526-529, 1997.
[12] C. Cheong, W. Lee, and N. Yahaya, “Wavelet-based temporal clustering analysis on stock time series,” Proc. ICOQSIA, 2005.
[13] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Locally adaptive dimensionality reduction for indexing large scale time series databases,” Proc. ACM SIGMOD Conf., pp. 151-162, 2001.
14 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
[14] F. Bashir, “MotionSearch: object motion trajectory-based video database system – Index, retrieval, classification and recognition,” Ph.D. thesis, Dept. Elect. Eng., Univ. of Illinois, Chicago, U.S.A., 2005.
[15] E. Keogh and M. Pazzani, “A simple dimensionality reduction technique for fast similarity search in large time series databases,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 122-133, 2001.
[16] A. Strehl and J. Ghosh, “Cluster ensembles – A knowledge reuse framework for combining multiple partitions,” J. Machine Learning Research, vol. 3, pp. 583-617, 2002.
[17] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, “Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data,” Machine Learning, vol. 52, pp. 91-118, 2003.
[18] X. Fern and C. Brodley, “Solving cluster ensemble problem by bipartite graph partitioning,” Proc. Int. Conf. Machine Learning, pp. 36-43, 2004.
[19] A. Fred and A. Jain, “Combining multiple clusterings using evidence accumulation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, pp. 835-850, 2005.
[20] N. Ailon, M. Charikar, and A. Newman, “Aggregating inconsistent information: Ranking and clustering,” Proc. STOC'05, pp. 684-693, 2005.
[21] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering aggregation,” ACM Trans. Knowledge Discovery from Data, vol. 1, 2007.
[22] V. Singh, L. Mukherjee, J. Peng, and J. Xu, “Ensemble clustering using semidefinite programming,” Advances in Neural Information Processing Systems 20, pp. 1353-1360, 2007.
[23] A. Topchy, M. Law, A. Jain, and A. Fred, “Analysis of consensus partition in cluster ensemble,” Proc. IEEE Int. Conf. Data Mining, pp. 225-232, 2004.
[24] K. Chen, L. Wang, and H. Chi, “Methods of combining multiple classifiers with different feature sets and their applications to text-independent speaker identification,” Int. J. Pattern Recognition and Artificial Intelligence, vol. 11, pp. 417-445, 1997.
[25] K. Chen, “A connectionist method for pattern classification on diverse feature sets,” Pattern Recognition Letters, vol. 19, pp. 545-558, 1998.
[26] K. Chen and H. Chi, “A method of combining multiple probabilistic classifiers through soft competition on different feature sets,” Neurocomputing, vol. 20, pp. 227-252, 1998.
[27] K. Chen, “On the use of different speech representations for speaker modeling,” IEEE Trans. Systems, Man, and Cybernetics (Part C), vol. 35, pp. 301-314, 2005.
[28] S. Wang and K. Chen, “Ensemble learning with active data selection for semi-supervised pattern classification,” Proc. Int. Joint Conf. Neural Networks, 2007.
[29] Y. Yang and K. Chen, “Combining competitive learning networks on various representations for temporal data clustering,” in Trends in Neural Computation, Springer, pp. 315-336, 2007.
[30] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On clustering validation techniques,” J. Intelligent Information Systems, vol. 17, pp. 107-145, 2001.
[31] M. Cox, C. Eio, G. Mana, and F. Pennecchi, “The generalized weighted mean of correlated quantities,” Metrologia, vol. 43, pp. 268-275, 2006.
[32] E. Keogh, Temporal data mining benchmarks. [Online]. Available: http://www.cs.ucr.edu/~eamonn/time_series_data.
[33] CAVIAR: Context aware vision using image-based active recognition. School of Informatics, The University of Edinburgh. [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/CAVIAR.
[34] PDMC: Physiological Data Modeling Contest Workshop, International Conference on Machine Learning, 2004. [Online]. Available: http://www.cs.utexas.edu/users/sherstov/pdmc/.
[35] J. Sammon Jr., “A nonlinear mapping for data structure analysis,” IEEE Trans. Computers, vol. C-18, pp. 401-409, 1969.
[36] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” Proc. Int. Conf. Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[37] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani, “Mining the stock market: Which measure is best?” Proc. Int. Conf. Knowledge Discovery and Data Mining, pp. 487-496, 2000.
[38] A. Naftel and S. Khalid, “Classifying spatiotemporal object trajectories using unsupervised learning in the coefficient feature space,” Multimedia Systems, vol. 12, pp. 227-238, 2006.
[39] P. P. Rodrigues, J. Gama, and J. P. Pedroso, “Hierarchical clustering of time-series data streams,” IEEE Trans. Knowledge and Data Engineering, vol. 20, pp. 615-627, 2008.
[40] R. Souvenir and R. Pless, “Manifold clustering,” Proc. IEEE Int. Conf. Computer Vision, pp. 648-653, 2005.
[41] M. Al-Razgan and C. Domeniconi, “Weighted clustering ensembles,” Proc. SIAM Int. Conf. Data Mining, pp. 258-269, 2006.
[42] H. Cheng, K. A. Hua, and K. Vu, “Constrained locally weighted clustering,” Proc. ACM VLDB, pp. 90-101, 2008.
[43] T. Li and C. Ding, “Weighted consensus clustering,” Proc. SIAM Int. Conf. Data Mining, pp. 798-809, 2008.

Yun Yang received the BSc degree from Lancaster University with first class honours in 2004, the MSc degree from Bristol University in 2005, and the MPhil degree from The University of Manchester in 2006. He is currently a PhD candidate at The University of Manchester. His research interest lies in pattern recognition and machine learning.

Ke Chen received the BSc, MSc, and PhD degrees in 1984, 1987, and 1990, respectively, all in computer science. He has been with The University of Manchester since 2003. He was previously with The University of Birmingham, Peking University, The Ohio State University, Kyushu Institute of Technology, and Tsinghua University. He was a visiting professor at Microsoft Research Asia in 2000 and at Hong Kong Polytechnic University in 2001. He has been on the editorial board of several academic journals, including IEEE Transactions on Neural Networks, and serves as the category editor of Machine Learning and Pattern Recognition in Scholarpedia. He was the program chair of the first International Conference on Natural Computation and has been a member of the technical program committee of numerous international conferences, including CogSci and IJCNN. He chairs the Intelligent Systems Applications Technical Committee (ISATC) and the University Curricula Subcommittee of the IEEE Computational Intelligence Society. He has also served as task force chair and a member of the NNTC, ETTC, and DMTC in IEEE CIS. He is a recipient of several academic awards, including the NSFC Distinguished Principal Young Investigator Award and the JSPS Research Award. He has published over 100 academic papers in refereed journals and conferences. His current research interests include pattern recognition, machine learning, machine perception, and computational cognitive systems.