INTRODUCTION
Dong Huang is with the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China, and with the College
of Mathematics and Informatics, South China Agricultural University,
Guangzhou, China.
E-mail: huangdonghere@gmail.com.
Fig. 1. The distribution of the link weights for the similarity graph induced by the CA matrix for the MNIST dataset.
TABLE 1
The distribution of the links with different weights and the proportion of them making correct decisions for the MNIST dataset

Weight                 1     0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
Proportion of links    2%    2%    2%    3%    3%    4%    6%    9%    18%   51%
Correct decisions      83%   73%   65%   58%   50%   43%   35%   27%   18%   9%
Fig. 2. How are the relationships between x1 and x2 and between x2 and x3 in the context of (a) a graph with three nodes, (b) a graph with six nodes, and (c) a graph with eight nodes?
To deal with the uncertain links, we propose a k-nearest-neighbor-like strategy termed elite neighbor selection (ENS) to identify the uncertain links and build a sparse graph termed K-elite neighbor graph (K-ENG) that preserves only a small number of probably reliable links. Two microclusters are elite neighbors if the link between them has a large weight (see Definition 6). We argue that the use of a small number of probably reliable links is capable of reflecting the overall structure of the graph and may lead to much better and more robust clustering results than using all graph links without considering their reliability. According to our experimental analysis, we find that preserving only several percent or even less than one percent of the links via the ENS strategy can result in significant improvements to the final consensus results compared to using all links in the original graph.
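To make the ENS step concrete, the following is a minimal sketch of a locally adaptive pruning in the spirit described above: for each node, only the links to its K largest-weight neighbors are kept as probably reliable, and everything else is treated as uncertain and dropped. The function name and the symmetrization rule are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def elite_neighbor_graph(W, K):
    """Sketch of ENS: for each node, keep only the links to its K
    highest-weight neighbors (its elite neighbors); all other links
    are treated as uncertain and removed. W is a symmetric similarity
    matrix over the graph nodes."""
    n = W.shape[0]
    A = np.zeros_like(W, dtype=float)
    for i in range(n):
        w = W[i].astype(float).copy()
        w[i] = -np.inf                      # exclude self-links
        elite = np.argsort(w)[::-1][:K]     # indices of the K largest weights
        A[i, elite] = W[i, elite]
    # Assumed rule: a link survives if either endpoint selects it as elite.
    return np.maximum(A, A.T)
```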
Having constructed the K -ENG graph, we proceed to
exploit the global structure information to refine local links
by means of random walks. The random walk technique has
proved to be a powerful tool for finding community structures (or cluster structures) in the field of community discovery [15], [16], [17], [18]. The random walks are performed
on the sparse graph K -ENG and each node in K -ENG is
treated as a start node for a random walker. We propose a
new transition probability matrix that simultaneously considers the link weights and the node sizes. By analyzing the
probability trajectories of the random walkers starting from
different initial nodes, we derive a novel and dense similarity measure termed probability trajectory based similarity
(PTS) from the sparse graph. The PTS explores the pair-wise
relationships by capturing the global structure information
in the sparse graph via random walk trajectories. Based on
PTS, two consensus functions are further proposed, namely,
probability trajectory accumulation (PTA) and probability
trajectory based graph partitioning (PTGP).
The overall process of our approach is illustrated in
Fig. 3. Given the ensemble of base clusterings, we first map
the data objects to a set of microclusters and compute the
microcluster based co-association (MCA) matrix. With each
microcluster treated as a node, we construct the microcluster similarity graph (MSG) according to the MCA matrix.
Then, the ENS strategy is performed on the MSG and the sparse graph K-ENG is constructed by preserving a small number of probably reliable links. The random walks are conducted on the K-ENG graph and the PTS similarity is obtained by comparing random walk trajectories. Having computed the new similarity matrix, any pair-wise similarity based clustering method can be used to achieve the consensus clustering. Specifically, we propose two novel consensus functions, termed PTA and PTGP, respectively. Note that PTA is based on agglomerative clustering, while PTGP is based on graph partitioning. Figure 4 summarizes the average performance of the proposed methods and the baseline methods across nine benchmark datasets with varying ensemble sizes.
Fig. 3. Flowchart of the proposed approach: given the ensemble of base clusterings as input, the microcluster representation and the MCA matrix are computed, the MSG graph is constructed, and clustering using PTS is performed by agglomerative clustering (PTA) or graph partitioning (PTGP).

Fig. 4. Average performance (in terms of NMI) of PTA, PTGP, MCLA, GP-MGLA, EAC, and WEAC with varying ensemble size.
RELATED WORK
and obtained the consensus clustering by partitioning the
bipartite graph. Ren et al. [8] proposed to assign weights
to data objects with regard to how difficult it is to cluster
them and presented three graph partitioning algorithms
based on the weighted-object scheme, that is, weighted-object meta clustering (WOMC), weighted-object similarity
partition (WOSP) clustering, and weighted-object hybrid
bipartite (WOHB) graph partition clustering.
Despite their significant success, there are still two limitations in most of the existing ensemble clustering approaches.
First, the existing approaches mostly overlook the problem
of uncertain links which may mislead the consensus process.
Second, most of them lack the ability to incorporate global
structure information to refine local links accurately and
efficiently. Recently, some efforts have been made to address
these two limitations. To deal with the uncertain links, Yi et
al. [7] proposed an ensemble clustering approach based on
global thresholding and matrix completion. However, using
global thresholds may lead to isolated nodes, i.e., all of the
links connected to a node may be cut out, due to the lack of
local adaptivity. Moreover, in the approach of [7], a parameter C is used to scale the noise term in the objective function
(see [7] for more details) and plays a sensitive and crucial role for yielding a good consensus clustering. Without
knowing the ground-truth in advance, tuning the parameter
C for [7] is very difficult and computationally expensive. To
exploit the indirect relationships in the ensemble, Iam-On et
al. [4], [6] proposed to refine the CA matrix by considering
the shared neighbors between clusters. The approaches in
[4] and [6] utilize the common neighborhood information,
i.e., the 1-step indirect relationships, and have not gone
beyond the 1-step indirect relationships to explore the more
comprehensive structure information in the ensemble. To
utilize multi-step indirect structure information, Iam-On et al. [23] proposed to refine pair-wise links by
the SimRank similarity (SRS), which, however, suffers from
its high computational complexity and is not feasible for
large datasets (see Fig. 11). Different from [7] and [23], in this
paper, we propose to tackle these two limitations in a unified
and efficient manner. We present the elite neighbor selection
strategy to identify the uncertain links by locally adaptive
thresholds and build a sparse graph with a small number
of probably reliable links, which has shown its advantage
compared to using the whole original graph without considering the reliability of the links (see Section 5). To explore
the global structure information of the graph, we exploit
the random walk process with a new transition probability
matrix. By analyzing the probability trajectories of random
walkers, a novel and dense similarity measure termed PTS
is derived from the sparse graph. Specifically, based on
the PTS, two ensemble clustering algorithms are further
proposed, termed PTA and PTGP, respectively.
PROBLEM FORMULATION
TABLE 2
Summary of notations

$N$ : Number of data objects
$x_i$ : The $i$-th data object
$X$ : The set of data objects
$n^k$ : Number of clusters in the base clustering $\pi^k$
$C_j^k$ : The $j$-th cluster in $\pi^k$
$\pi^k$ : The $k$-th base clustering
$\Pi$ : The ensemble of base clusterings
$M$ : Number of base clusterings in $\Pi$
$N_c$ : Total number of clusters in $\Pi$
$C_i$ : The $i$-th cluster in $C$
$C$ : The set of clusters in $\Pi$
$\tilde{N}$ : Number of microclusters
$y_i$ : The $i$-th microcluster
$Y$ : The set of microclusters
$\tilde{n}_i$ : Number of objects in the microcluster $y_i$
$b_{ij}$ : Number of times $x_i$ and $x_j$ occur in the same cluster among the $M$ base clusterings
$a_{ij}$ : The $(i,j)$-th entry of the CA matrix
$A$ : The co-association (CA) matrix
$\tilde{b}_{ij}$ : Number of times $y_i$ and $y_j$ occur in the same cluster among the $M$ base clusterings
$\tilde{a}_{ij}$ : The $(i,j)$-th entry of the MCA matrix
$\tilde{G}$ : Microcluster similarity graph (MSG)
$\tilde{V}$ : Node set of $\tilde{G}$
$\tilde{L}$ : Link set of $\tilde{G}$
$\tilde{w}_{ij}$ : Weight of the link between nodes $y_i$ and $y_j$ in $\tilde{G}$
$G'$ : $K$-elite neighbor graph ($K$-ENG)
$V'$ : Node set of $G'$
$L'$ : Link set of $G'$
$w'_{ij}$ : Weight of the link between nodes $y_i$ and $y_j$ in $G'$
$p_{ij}$ : Transition probability from node $y_i$ to node $y_j$
$P$ : Transition probability matrix
$p^T_{ij}$ : Probability that a random walker starting from $y_i$ arrives at $y_j$ at step $T$
$P^T$ : $T$-step transition probability matrix
$p^T_{i:}$ : The $i$-th row of $P^T$, $p^T_{i:} = \{p^T_{i1}, \ldots, p^T_{i\tilde{N}}\}$
$PT_i^T$ : Probability trajectory of the random walker starting from $y_i$ with length $T$
$PTS_{ij}$ : Probability trajectory based similarity between $y_i$ and $y_j$
$R^{(0)}$ : Initial set of regions for PTA
$S^{(0)}$ : Initial similarity matrix for PTA
$R^{(t)}$ : Set of regions at step $t$ of PTA
$S^{(t)}$ : The $t$-step similarity matrix for PTA, $S^{(t)} = \{s^{(t)}_{ij}\}_{|R^{(t)}| \times |R^{(t)}|}$
$G$ : Microcluster-cluster bipartite graph (MCBG)
$\hat{N}$ : Number of nodes in $G$
$V$ : Node set of $G$
$L$ : Link set of $G$
$w_{ij}$ : Weight of the link between nodes $v_i$ and $v_j$ in $G$
TABLE 3
An example of the microcluster representation

Data objects                    x1      x2      x3      x4      x5      x6      x7      x8
Cluster labels in $\pi^1$       $C_1^1$ $C_1^1$ $C_1^1$ $C_1^1$ $C_2^1$ $C_2^1$ $C_2^1$ $C_2^1$
Cluster labels in $\pi^2$       $C_1^2$ $C_1^2$ $C_1^2$ $C_2^2$ $C_2^2$ $C_2^2$ $C_3^2$ $C_3^2$
Microcluster labels             y1      y1      y1      y2      y3      y3      y4      y4

Fig. 5. Illustration of the example in Table 3. (a) The base clustering $\pi^1$. (b) The base clustering $\pi^2$. (c) The intersection of $\pi^1$ and $\pi^2$. (d) The microcluster representation for the ensemble of $\pi^1$ and $\pi^2$.

PROPOSED APPROACH
In this section, we describe the proposed ensemble clustering approach based on sparse graph representation and
probability trajectory analysis.
The overall process of the proposed approach is illustrated in Fig. 3. The microclusters are used as a compact
representation for the clustering ensemble. The microcluster based co-association (MCA) matrix for the ensemble
is computed and a microcluster similarity graph (MSG) is
constructed from the MCA matrix with the microclusters
treated as graph nodes. In order to deal with the uncertain
links, we propose a k-nearest-neighbor-like pruning strategy termed elite neighbor selection (ENS), which is able to identify the uncertain links by locally adaptive thresholds. A sparse graph termed K-elite neighbor graph (K-ENG) is then constructed with only a small number of probably reliable links. The ENS strategy is a crucial step in our approach. We argue that using a small number of probably reliable links may lead to significantly better consensus results than using all graph links regardless of their reliability. The random walk process driven by a new transition probability matrix is performed on the K-ENG to explore the global structure information. From the sparse graph K-ENG, a dense similarity measure termed PTS is derived by exploiting the probability trajectories of the random walkers. Two novel consensus functions are further proposed, termed PTA and PTGP, respectively. In the following, we describe each step of our framework in detail.
4.1 Microcluster Representation

The set of microclusters obtained from the ensemble is denoted as
$$Y = \{y_1, \ldots, y_{\tilde{N}}\}. \qquad (1)$$
Intuitively, the set of microclusters is produced by intersecting the $M$ base clusterings. As it is defined, there are no clues to distinguish the objects in the same microcluster given the information of the ensemble $\Pi$. In our work, the microclusters are utilized as the primitive objects. For any microcluster $y_i \in Y$ and cluster $C_j^k \in \pi^k$, it holds that either every object in $y_i$ is in $C_j^k$ or no object in $y_i$ is in $C_j^k$. Because the microclusters are treated as primitive objects, in the following, if all objects in $y_i$ are in $C_j^k$, we write it as $y_i \in C_j^k$ rather than $y_i \subseteq C_j^k$; otherwise, we write it as $y_i \notin C_j^k$ rather than $y_i \not\subseteq C_j^k$. Let $Cls^k(y_i)$ denote the cluster in $\pi^k$ that contains the microcluster $y_i$. If $y_i \in C_j^k$, then $Cls^k(y_i) = C_j^k$.
In Table 3 and Fig. 5, we show an example of a dataset with eight objects to describe the construction of the microclusters. There are two base clusterings in the example, namely, $\pi^1$ and $\pi^2$, which consist of two and three clusters, respectively. With regard to Definition 1, four microclusters can be produced using $\pi^1$ and $\pi^2$, which correspond to the intersection of $\pi^1$ and $\pi^2$ (as illustrated in Fig. 5(c) and 5(d)).
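As an illustration of this intersection step, the following minimal sketch groups objects by their vector of base-clustering labels; the function name is illustrative and the label encoding is an assumption.

```python
from collections import defaultdict

def microclusters(base_clusterings):
    """Sketch: objects that share the same cluster label in every base
    clustering form a microcluster, i.e., the intersection of the M
    clusterings. base_clusterings[m][i] is the label of object i in
    base clustering m."""
    groups = defaultdict(list)
    n_objects = len(base_clusterings[0])
    for i in range(n_objects):
        key = tuple(labels[i] for labels in base_clusterings)  # label vector of object i
        groups[key].append(i)
    return list(groups.values())  # each entry is the object list of one microcluster

# The example of Table 3: pi1 has clusters {x1..x4} and {x5..x8};
# pi2 has {x1..x3}, {x4..x6}, and {x7, x8}; four microclusters result.
pi1 = [1, 1, 1, 1, 2, 2, 2, 2]
pi2 = [1, 1, 1, 2, 2, 2, 3, 3]
print(microclusters([pi1, pi2]))  # [[0, 1, 2], [3], [4, 5], [6, 7]]
```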
Definition 2. The co-association (CA) matrix of the clustering ensemble is defined as
$$A = \{a_{ij}\}_{N \times N}, \qquad (2)$$
where
$$a_{ij} = \frac{b_{ij}}{M} \qquad (3)$$
and $b_{ij}$ denotes how many times the objects $x_i$ and $x_j$ occur in the same cluster among the $M$ base clusterings.

Definition 3. The microcluster based co-association (MCA) matrix of the clustering ensemble is defined as
$$\tilde{A} = \{\tilde{a}_{ij}\}_{\tilde{N} \times \tilde{N}}, \qquad (4)$$
where
$$\tilde{a}_{ij} = \frac{\tilde{b}_{ij}}{M} \qquad (5)$$
and $\tilde{b}_{ij}$ denotes how many times the microclusters $y_i$ and $y_j$ occur in the same cluster among the $M$ base clusterings.
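To make Definitions 2 and 3 concrete, the following sketch computes the MCA matrix of Eqs. (4)-(5); it builds on the microclusters() function from the previous snippet and is illustrative rather than the authors' implementation.

```python
import numpy as np

def mca_matrix(base_clusterings, micro):
    """Sketch of Eqs. (4)-(5): entry (i, j) is the fraction of base
    clusterings in which microclusters i and j fall in the same cluster.
    micro is the output of microclusters(); by construction, a
    microcluster's label in a base clustering is the label of any of
    its member objects."""
    M = len(base_clusterings)
    n = len(micro)
    A = np.zeros((n, n))
    for labels in base_clusterings:
        rep = [labels[members[0]] for members in micro]  # one representative per microcluster
        for i in range(n):
            for j in range(n):
                if rep[i] == rep[j]:
                    A[i, j] += 1.0 / M
    return A
```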
Theorem 1. For all $x_i, x_j \in X$ and $y_k, y_l \in Y$ such that $x_i \in y_k$ and $x_j \in y_l$, it holds that $a_{ij} = \tilde{a}_{kl}$.

Proof. Given $x_i, x_j \in X$ and $y_k, y_l \in Y$, we have $a_{ij} = b_{ij}/M$ and $\tilde{a}_{kl} = \tilde{b}_{kl}/M$ according to Definitions 2 and 3. To prove $a_{ij} = \tilde{a}_{kl}$, we need to prove $b_{ij} = \tilde{b}_{kl}$. Given $x_i \in y_k$ and $x_j \in y_l$, for any base clustering $\pi^m$, if $y_k$ and $y_l$ are in the same cluster in $\pi^m$, then $x_i$ and $x_j$ are in the same cluster in $\pi^m$. Thus we have $b_{ij} \geq \tilde{b}_{kl}$. If $y_k$ and $y_l$ are in different clusters in $\pi^m$, then $x_i$ and $x_j$ are in different clusters. Thus we have $b_{ij} \leq \tilde{b}_{kl}$. Because $b_{ij} \geq \tilde{b}_{kl}$ and $b_{ij} \leq \tilde{b}_{kl}$, we have $b_{ij} = \tilde{b}_{kl}$, which leads to $a_{ij} = \tilde{a}_{kl}$.
Similar to the conventional CA matrix [1] (see Definition 2), the microcluster based co-association (MCA) matrix is computed by considering how many times two microclusters occur in the same cluster in $\Pi$. Then the microcluster similarity graph (MSG) is constructed based on the MCA matrix (see Definition 4). By using the microclusters, the size of the similarity graph is reduced from $N$ to $\tilde{N}$.
Definition 4. The microcluster similarity graph (MSG) is defined as
$$\tilde{G} = (\tilde{V}, \tilde{L}), \qquad (6)$$
where
$$\tilde{V} = Y \qquad (7)$$
is the node set and $\tilde{L}$ is the link set. The weight of the link between nodes $y_i$ and $y_j$ is defined as
$$\tilde{w}_{ij} = \tilde{a}_{ij}. \qquad (8)$$
Fig. 6. Illustration of a graph with three microclusters as nodes. (a) The microcluster-microcluster links. (b) The (hidden) object-object links.

The transition probability from node $y_i$ to node $y_j$ is defined as
$$p_{ij} = \frac{\tilde{n}_j w'_{ij}}{\sum_{k \neq i} \tilde{n}_k w'_{ik}}, \qquad (10)$$
where $w'_{ij}$ is the weight of the link between $y_i$ and $y_j$ in the $K$-ENG and $\tilde{n}_j$ is the number of objects in $y_j$.
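A minimal sketch of Eq. (10) follows: each link is weighted by both its weight and the object count of the target node before row normalization. The matrix form is an illustrative reading of the formula, not the authors' code.

```python
import numpy as np

def transition_matrix(W, sizes):
    """Sketch of Eq. (10): the transition probability weighs each link
    by its weight and by the size (object count) of the target node.
    W: link weights of the K-ENG; sizes: number of objects per microcluster."""
    P = W * np.asarray(sizes, dtype=float)[None, :]  # n_j * w_ij
    np.fill_diagonal(P, 0.0)                         # the sum in Eq. (10) excludes k = i
    row_sums = P.sum(axis=1, keepdims=True)
    return P / np.maximum(row_sums, 1e-12)           # normalize each row
```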
The random walk process is driven by the transition probability matrix $P$. Let $P^T = \{p^T_{ij}\}_{\tilde{N} \times \tilde{N}}$ be the $T$-step transition probability matrix, i.e., the probability distribution at step $T$, where $p^T_{ij}$ is the probability that a random walker starting from node $y_i$ arrives at node $y_j$ at step $T$. The probability distribution at step 1 is obviously the transition probability matrix, i.e., $P^1 = P$. The probability distribution at step $T$ is computed as $P^T = P \cdot P^{T-1}$, for $T \geq 2$.

Let $p^T_{i:} = \{p^T_{i1}, \ldots, p^T_{i\tilde{N}}\}$ denote the probability distribution of node $y_i$ at step $T$, which is the $i$-th row of $P^T$
and represents the probability of going from node yi to each
node in the graph by the random walk process at step T . The
relationship between node yi and node yj can be studied by
comparing their probability distributions at a certain step
[16]. However, the probability distributions at different steps
reflect different scales of information for the graph structure.
Using a single step of probability distribution as the feature
of a node overlooks the properties of this node at different
scales. In order to take advantage of multi-scale information
in the graph, we propose to exploit the probability trajectory
for describing the random walk process starting from each
node, which considers the probability distributions from
step 1 to step T rather than a single step. The formal definition of the probability trajectory is given in Definition 9.
Definition 9. The probability trajectory of a random walker starting from node $y_i$ with length $T$ is defined as a $T\tilde{N}$-tuple:
$$PT_i^T = \{p^1_{i:}, p^2_{i:}, \ldots, p^T_{i:}\}. \qquad (11)$$

The probability trajectory based similarity (PTS) between two nodes $y_i$ and $y_j$ is then obtained by comparing their probability trajectories, i.e.,
$$PTS_{ij} = Sim(PT_i^T, PT_j^T), \qquad (12)$$
where $Sim(\cdot, \cdot)$ measures the similarity between two trajectories.
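A sketch of Definition 9 and Eq. (12) is given below, with cosine similarity used as an illustrative choice of $Sim(\cdot, \cdot)$; the function name and this choice are assumptions.

```python
import numpy as np

def pts_matrix(P, T):
    """Sketch: stack the step-1..T transition distributions of every
    node into trajectories (Definition 9), then compare trajectories
    pair-wise (Eq. (12)); cosine similarity is an assumed Sim(.,.)."""
    n = P.shape[0]
    traj, Pt = [], np.eye(n)
    for _ in range(T):
        Pt = Pt @ P                 # P^t = P . P^(t-1)
        traj.append(Pt)
    F = np.hstack(traj)             # row i is the T*n-tuple PT_i^T
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    F = F / np.maximum(norms, 1e-12)
    return F @ F.T                  # pair-wise cosine similarities
```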
The random walks are performed on the sparse graph K-ENG, which preserves only a small number of probably reliable links obtained by the ENS strategy. The probability trajectories on the K-ENG are used as the feature vectors for the graph nodes, which incorporate multi-scale graph information into a $T\tilde{N}$-tuple by the different steps of the random walks. Theoretically, it is possible that the K-ENG may consist of more than one connected component, in which case we can perform the random walk on each connected component of the graph separately and then map the random walk trajectories at each component back to the whole graph to facilitate the computation.
For clarity, the algorithm of computing PTS is given in
Algorithm 1.
Algorithm 1 (Computation of Probability Trajectory Based Similarity)
Input: the clustering ensemble $\Pi$; parameters $K$ and $T$.
1: Obtain the set of microclusters $Y$ from $\Pi$.
2: Compute the MCA matrix by Eq. (4).
3: Build the graph MSG using the MCA matrix.
4: Construct the sparse graph K-ENG by the ENS strategy.
5: Perform random walks on the K-ENG with the transition probability matrix given in Eq. (10).
6: Compute the new similarity PTS by Eq. (12).
Output: $Y$, $\{PTS_{ij}\}_{\tilde{N} \times \tilde{N}}$.
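As a hedged illustration, chaining the sketches above mirrors the steps of Algorithm 1; all names come from the earlier snippets and remain illustrative.

```python
# Hypothetical end-to-end use of the sketches above, mirroring Algorithm 1.
micro = microclusters([pi1, pi2])            # step 1: microclusters Y
A = mca_matrix([pi1, pi2], micro)            # step 2: MCA matrix
W = elite_neighbor_graph(A, K=2)             # steps 3-4: MSG pruned to the K-ENG
sizes = [len(m) for m in micro]
P = transition_matrix(W, sizes)              # step 5: random walk on the K-ENG
PTS = pts_matrix(P, T=4)                     # step 6: trajectory-based similarity
```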
4.4 Consensus Functions
In the PTA method, the microclusters are first treated as initial regions, which are then iteratively merged according to the PTS measure. The similarity between two regions $R^{(t)}_i$ and $R^{(t)}_j$ at step $t$ is computed with respect to average-link (AL), complete-link (CL), or single-link (SL), that is,
$$s^{(t)}_{ij} = \begin{cases} \frac{1}{|R^{(t)}_i| |R^{(t)}_j|} \sum_{y_k \in R^{(t)}_i, y_l \in R^{(t)}_j} PTS_{kl}, & \text{if Method=AL}, \\ \min_{y_k \in R^{(t)}_i, y_l \in R^{(t)}_j} PTS_{kl}, & \text{if Method=CL}, \\ \max_{y_k \in R^{(t)}_i, y_l \in R^{(t)}_j} PTS_{kl}, & \text{if Method=SL}, \end{cases} \qquad (14)$$
where $|R^{(t)}_i|$ is the number of microclusters in $R^{(t)}_i$.
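A small sketch of Eq. (14) follows, assuming that complete-link takes the minimum PTS value between two regions (the min/max roles follow the usual similarity-based convention):

```python
import numpy as np

def region_similarity(PTS, Ri, Rj, method="AL"):
    """Sketch of Eq. (14): similarity between two regions (given as
    lists of microcluster indices) under average-, complete-, or
    single-link."""
    block = PTS[np.ix_(Ri, Rj)]          # PTS values between the two regions
    if method == "AL":
        return block.mean()              # average link
    if method == "CL":
        return block.min()               # complete link (assumed min-similarity)
    return block.max()                   # single link
```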
The region merging process is performed iteratively and the number of regions decreases by one at each step. Obviously, after the $(\tilde{N}-1)$-th step, there will be only one region left, which contains the entire set of microclusters. Then we have a dendrogram, i.e., a hierarchical representation of clusterings. Each level in the dendrogram represents a clustering with a certain number of clusters (or regions). The final clustering is obtained by specifying a level of the dendrogram.
An advantage of agglomerative clustering is that it can
efficiently generate a hierarchy of clusterings where each
level represents a clustering with a certain number of clusters. However, the region merging process is inherently local
and greedy. A mistaken merging may lead to increasing errors in the following merging steps. The similarity measure
determines the region merging order and plays a crucial role
in agglomerative clustering. In our work, the PTS measure
is able to deal with the uncertain links and incorporate
the global structure information in the ensemble, which
is beneficial for improving the accuracy and robustness of
the agglomerative clustering. The experimental results also
show the advantage of the PTA method (based on PTS)
compared to other pair-wise similarity based methods [1],
[4], [7], [19], [23] (see Section 5.5 and Table 9).
For clarity, the PTA algorithm is summarized in Algorithm 2.
Algorithm 2 (Probability Trajectory Accumulation)
Input: the clustering ensemble $\Pi$; the number of clusters $k$.
1: Compute the microclusters and the PTS measure according to Algorithm 1.
2: Initialize the set of regions $R^{(0)}$.
3: Construct the dendrogram iteratively:
   for $t = 1, 2, \ldots, \tilde{N}-1$
       Merge the two most similar regions in $R^{(t-1)}$ w.r.t. $S^{(t-1)}$.
       Obtain the new set of regions $R^{(t)}$.
       Compute the new similarity matrix $S^{(t)}$.
   end for
4: Find the clustering with $k$ clusters in the dendrogram.
5: Obtain the final clustering by mapping microclusters back to objects.
Output: the consensus clustering.
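As a sketch only, the dendrogram construction of PTA can be emulated with an off-the-shelf agglomerative routine applied to a PTS-derived distance; the 1 - PTS conversion and the use of SciPy's average linkage are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def pta_sketch(PTS, micro, n_objects, k):
    """Sketch of PTA: agglomerative clustering over microclusters using
    PTS as the similarity, then mapping microclusters back to objects."""
    D = 1.0 - PTS                        # assumed similarity-to-distance conversion
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")  # AL variant
    micro_labels = fcluster(Z, t=k, criterion="maxclust")
    labels = np.empty(n_objects, dtype=int)
    for mc_id, members in enumerate(micro):
        labels[members] = micro_labels[mc_id]   # map microclusters back to objects
    return labels
```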
(Figure: the microcluster-cluster bipartite graph, with the microclusters $y_1, y_2, \ldots, y_i, \ldots$ on one side and the clusters $C_1, \ldots, C_j, \ldots$ on the other.)
Definition 12. The microcluster-cluster bipartite graph (MCBG) is defined as
$$G = (V, L), \qquad (16)$$
where $V = Y \cup C = \{v_1, \ldots, v_{\hat{N}}\}$ is the node set, $\hat{N} = \tilde{N} + N_c$ is the number of nodes, and $L$ is the link set. The weight of the link between nodes $v_i$ and $v_j$ is defined as
$$w_{ij} = \begin{cases} Sim_{mc}(v_i, v_j), & \text{if } v_i \in Y, v_j \in C \text{ or } v_i \in C, v_j \in Y, \\ 0, & \text{otherwise}, \end{cases} \qquad (17)$$
where $Sim_{mc}(\cdot, \cdot)$ is the microcluster-cluster similarity given in Eq. (15).
By treating both clusters and microclusters as nodes, the
microcluster-cluster bipartite graph (MCBG) is constructed
according to Definition 12. With regard to the bipartite structure of the MCBG, the efficient graph partitioning method
Tcut [26] can be used to partition the graph into a certain
number of disjoint sets of nodes. The microcluster nodes
in the same segment are treated as a cluster and thus the
final consensus clustering can be obtained by mapping the
microclusters back to the data objects.
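A hedged sketch of this partitioning step is given below. Tcut [26] itself is not reproduced; a generic spectral partition of the full bipartite adjacency is used as a stand-in, and the matrix B is assumed to hold the microcluster-cluster weights of Eq. (17).

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def ptgp_sketch(B, k):
    """Sketch of the PTGP partition step. B is the (n_micro x n_cluster)
    microcluster-cluster weight matrix; Tcut [26] is replaced here by a
    generic spectral partition of the bipartite adjacency."""
    n_micro, n_cluster = B.shape
    n = n_micro + n_cluster
    W = np.zeros((n, n))
    W[:n_micro, n_micro:] = B        # links only run between the two node sets
    W[n_micro:, :n_micro] = B.T
    sc = SpectralClustering(n_clusters=k, affinity="precomputed",
                            assign_labels="kmeans", random_state=0)
    labels = sc.fit_predict(W)
    return labels[:n_micro]          # keep the segments of the microcluster nodes
```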
For clarity, we summarize the PTGP algorithm in Algorithm 3.
Algorithm 3 (Probability Trajectory Based Graph Partitioning)
Input: the clustering ensemble $\Pi$; the number of clusters $k$.
1: Compute the microclusters and the PTS measure according to Algorithm 1.
2: Compute the microcluster-cluster similarity by Eq. (15).
3: Build the bipartite graph MCBG.
4: Partition the MCBG into $k$ clusters using Tcut.
5: Obtain the final clustering by mapping microclusters back to objects.
Output: the consensus clustering.
EXPERIMENTS

TABLE 4
Description of the benchmark datasets

Dataset      #Object   #Attribute   #Class
MF           2,000     649          10
IS           2,310     19           7
MNIST        5,000     784          10
ODR          5,620     64           10
LS           6,435     36           6
PD           10,992    16           10
USPS         11,000    256          10
FC           11,340    54           7
KDD99-10P    49,402    41           23
KDD99        494,020   41           23

5.1 Construction of Ensembles

Parameter Analysis
In this paper, we propose two ensemble clustering methods, termed PTA and PTGP, respectively. There are two
parameters, namely, K and T , in the proposed methods.
The parameter K specifies how many neighbors of a node
will be treated as elite neighbors and preserved. The smaller
the parameter K is, the sparser the K-ENG will be. The
parameter T is the length of the probability trajectories.
TABLE 5
Average information of the MSG graph over 20 runs

Dataset      #Node (i.e., $\tilde{N}$)   #Link
MF           242         12,978
IS           297         19,778
MNIST        1,438       468,453
ODR          899         153,949
LS           1,064       219,753
PD           1,095       208,146
USPS         2,975       1,825,938
FC           1,837       583,369
KDD99-10P    230         19,260
KDD99        301         34,509
TABLE 6
Average RatioPL over 20 runs with varying parameter K

Dataset      K=1    K=2    K=4    K=8    K=16   K=32   K=64   ALL
MF           2.4%   3.4%   6.6%   11.4%  21.1%  43.5%  82.1%  100.0%
IS           2.0%   3.2%   5.9%   10.5%  19.4%  37.9%  70.1%  100.0%
MNIST        0.6%   0.9%   1.7%   2.9%   5.0%   8.7%   16.2%  100.0%
ODR          1.0%   1.6%   3.0%   4.9%   8.5%   15.0%  31.0%  100.0%
LS           0.8%   1.1%   2.3%   3.8%   6.9%   12.4%  23.9%  100.0%
PD           0.9%   1.2%   2.4%   4.0%   7.1%   13.5%  28.1%  100.0%
USPS         0.3%   0.5%   1.0%   1.7%   2.9%   5.0%   8.7%   100.0%
FC           0.5%   0.7%   1.5%   2.4%   4.4%   8.0%   15.2%  100.0%
KDD99-10P    1.7%   2.1%   4.2%   9.2%   23.9%  53.4%  90.0%  100.0%
KDD99        1.6%   1.9%   3.6%   7.2%   14.3%  34.0%  68.9%  100.0%
TABLE 7
Average performance (in terms of NMI) of PTA over 20 runs with varying parameters K and T

With varying K:
Dataset      K=1    K=2    K=4    K=8    K=16   K=32   K=64   ALL
MF           0.585  0.597  0.630  0.627  0.605  0.560  0.538  0.535
IS           0.574  0.612  0.632  0.615  0.595  0.592  0.610  0.609
MNIST        0.556  0.575  0.585  0.584  0.589  0.574  0.538  0.486
ODR          0.770  0.792  0.819  0.820  0.813  0.798  0.757  0.713
LS           0.599  0.595  0.615  0.618  0.621  0.612  0.586  0.539
PD           0.714  0.715  0.757  0.765  0.761  0.735  0.698  0.678
USPS         0.560  0.565  0.585  0.590  0.591  0.579  0.570  0.442
FC           0.235  0.237  0.247  0.254  0.250  0.237  0.221  0.199
KDD99-10P    0.565  0.599  0.624  0.659  0.664  0.592  0.562  0.561
KDD99        0.538  0.560  0.613  0.682  0.700  0.644  0.570  0.562

With varying T:
Dataset      T=1    T=2    T=4    T=8    T=16   T=32   T=64   T=128
MF           0.613  0.617  0.619  0.620  0.620  0.620  0.619  0.615
IS           0.613  0.616  0.611  0.612  0.611  0.612  0.611  0.612
MNIST        0.567  0.585  0.582  0.591  0.593  0.591  0.592  0.592
ODR          0.790  0.816  0.813  0.817  0.816  0.817  0.812  0.810
LS           0.600  0.613  0.618  0.620  0.620  0.627  0.632  0.637
PD           0.722  0.759  0.768  0.765  0.762  0.762  0.761  0.760
USPS         0.574  0.587  0.588  0.592  0.595  0.594  0.594  0.587
FC           0.246  0.245  0.257  0.255  0.254  0.250  0.250  0.250
KDD99-10P    0.677  0.681  0.686  0.677  0.671  0.668  0.676  0.670
KDD99        0.687  0.693  0.698  0.700  0.691  0.689  0.694  0.693
TABLE 8
Average performance (in terms of NMI) of PTGP over 20 runs with varying parameters K and T

With varying K:
Dataset      K=1    K=2    K=4    K=8    K=16   K=32   K=64   ALL
MF           0.609  0.610  0.627  0.621  0.596  0.536  0.499  0.495
IS           0.610  0.615  0.623  0.616  0.589  0.582  0.603  0.594
MNIST        0.575  0.577  0.585  0.587  0.587  0.576  0.533  0.471
ODR          0.809  0.815  0.822  0.830  0.821  0.799  0.745  0.679
LS           0.604  0.607  0.615  0.623  0.623  0.621  0.586  0.505
PD           0.746  0.747  0.759  0.762  0.756  0.727  0.696  0.649
USPS         0.575  0.570  0.578  0.586  0.582  0.572  0.556  0.404
FC           0.224  0.227  0.235  0.238  0.237  0.234  0.218  0.195
KDD99-10P    0.618  0.620  0.627  0.643  0.646  0.591  0.524  0.521
KDD99        0.620  0.622  0.624  0.686  0.701  0.653  0.571  0.567

With varying T:
Dataset      T=1    T=2    T=4    T=8    T=16   T=32   T=64   T=128
MF           0.619  0.617  0.619  0.616  0.610  0.607  0.599  0.597
IS           0.615  0.618  0.614  0.611  0.613  0.611  0.608  0.616
MNIST        0.581  0.583  0.585  0.586  0.588  0.589  0.589  0.587
ODR          0.812  0.817  0.824  0.824  0.824  0.821  0.818  0.814
LS           0.615  0.616  0.616  0.615  0.626  0.629  0.631  0.631
PD           0.749  0.751  0.759  0.765  0.766  0.761  0.760  0.758
USPS         0.566  0.570  0.576  0.580  0.580  0.588  0.584  0.574
FC           0.229  0.231  0.232  0.236  0.237  0.237  0.236  0.234
KDD99-10P    0.673  0.675  0.665  0.659  0.655  0.656  0.664  0.662
KDD99        0.709  0.713  0.720  0.722  0.721  0.698  0.696  0.704
We first test the influence of parameter K on the proportion of preserved links in the K-ENG. The average information of the MSG over 20 runs on each dataset is shown in Table 5. Because the microclusters are generated by intersecting multiple base clusterings, the number of microclusters is affected by the total number of objects as well as the shapes of cluster boundaries. For the KDD99-10P and KDD99 datasets, the number of microclusters is less than 1% of the number of the original objects. For the other datasets, the number of microclusters is on average about 10% to 20% of the number of the original objects, i.e., using microclusters as nodes reduces the graph size by about 80% to 90%. The number of links in the MSG is also given in Table 5. Note that a link between two nodes exists if and only if the weight between them is non-zero, i.e., a link with zero weight does not count as a link here. By cutting out the probably unreliable links in the MSG via the ENS strategy, the sparse graph K-ENG is constructed. The ratio
of links preserved in the K-ENG is defined as
$$\text{RatioPL} = \frac{\#\text{Links in } K\text{-ENG}}{\#\text{Links in MSG}}. \qquad (18)$$
(Figure: average NMI of PTA, PTGP, and the base clusterings on the benchmark datasets MF, IS, MNIST, ODR, LS, PD, USPS, FC, KDD99-10P, and KDD99.)
TABLE 9
Average performances (in terms of NMI) over 100 runs by different ensemble clustering methods (the three highest scores in each column are highlighted in bold, while the highest one is indicated with brackets). The compared methods are PTA-AL, PTA-CL, PTA-SL, PTGP, CSPA, HGPA, MCLA, SRW, GP-MGLA, EAC-AL/CL/SL, SRS-AL/CL/SL, WCT-AL/CL/SL, ECMC-AL/CL/SL, and WEAC-AL/CL/SL; each of the ten benchmark datasets contributes a Best-k column and a True-k column (20 columns in total).
The PTA-CL method achieves better consensus results than the other pair-wise similarity based methods associated with CL on all benchmark datasets except IS. The PTA-SL method achieves far better consensus results than the other pair-wise similarity based methods associated with SL on all of the benchmark datasets. With the ability of handling uncertain links and incorporating global information to construct more accurate local links, the proposed PTA and PTGP methods perform significantly better than the baseline ensemble clustering methods on the benchmark datasets.
Fig. 9. The number of times each method is ranked in the top 3 (in terms of NMI) across the 20 columns in Table 9.

5.6 Robustness to Varying Ensemble Sizes
Fig. 10. The average performances over 20 runs by different approaches with varying ensemble sizes M: (a) MF, (b) IS, (c) MNIST, (d) ODR, (e) LS, (f) PD, (g) USPS, (h) FC, (i) KDD99-10P, (j) KDD99.
To evaluate the robustness to varying ensemble sizes M, we run the proposed methods and the baseline methods 20 times and report their average performances in Fig. 10. Here, all pair-wise similarity
based methods are associated with AL. The PTA and PTGP
methods produce consistently good results with different
ensemble sizes. As shown in Fig. 10, the PTA and PTGP
methods yield the best or nearly the best performance
with varying M for the benchmark datasets. Especially for
the MF, IS, PD, FC, KDD99-10P, and KDD99 datasets, the
PTA and PTGP methods exhibit significant advantages in
the robustness to varying ensemble sizes over the baseline
methods. Further, we illustrate the average performances
of different approaches over nine datasets, KDD99 not
included, in Fig. 4, which is in fact the average of the
first nine sub-figures in Fig. 10, i.e., the sub-figures from
Fig. 10(a) to Fig. 10(i). Figure 4 provides an average view
to compare our methods and the baseline methods across
datasets, which demonstrates the advantage of our methods
in the robustness to various datasets and ensemble sizes. In
particular, the advantage of the proposed methods becomes
even greater when the ensemble size gets larger, e.g., when
the ensemble size goes beyond 20 (see Fig. 4).
5.7 Execution Time

Fig. 11. Execution time of different ensemble clustering approaches as the data size varies (a) from 0 to 10,000 and (b) from 0 to 494,020.
CONCLUSION
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their insightful comments and suggestions which
helped enhance this paper significantly. This project was
supported by NSFC (61173084 & 61502543), National Science & Technology Pillar Program (No. 2012BAK16B06),
the GuangZhou Program (No. 201508010032), and the PhD
Start-up Fund of Natural Science Foundation of Guangdong
Province, China (No. 2014A030310180).
REFERENCES
Dong Huang received his B.S. degree in computer science in 2009 from South China University of Technology, China. He received his M.Sc.
degree in computer science in 2011 and his
Ph.D. degree in computer science in 2015, both
from Sun Yat-sen University, China. He joined
South China Agricultural University in 2015 as an Assistant Professor with the College of Mathematics and Informatics. His research interests include
data mining and pattern recognition. He is a
member of the IEEE.