Rui Máximo Esteves
University of Stavanger,
Norway
Thomas Hacker
Computer and Information Technology,
Purdue University,
West Lafayette, Indiana
E-mail: tjhacker@purdue.edu
Chunming Rong
Department of Electrical and Computer Engineering,
University of Stavanger,
Norway
E-mail: chunming.rong@uis.no
Abstract: The tremendous growth in data volumes has created a need for new tools and
algorithms to quickly analyse large datasets. Cluster analysis techniques, such as K-Means, can
be distributed across several machines. The accuracy of K-Means depends on the selection of seed
centroids during initialisation. K-Means++ improves on the K-Means seeder, but suffers from
problems when it is applied to large datasets. In this paper, we describe a new algorithm and
a MapReduce implementation we developed that address these problems. We compared its
performance with three existing algorithms and found that our algorithm improves cluster
analysis accuracy and decreases variance. Our results show that our new algorithm produced a
speedup of 76.9 times compared with the serial K-Means++ and is as fast as the streaming
K-Means. Our work provides a method to select a good initial seeding in less time, facilitating
fast and accurate cluster analysis over large datasets.
Keywords: K-Means; K-Means++; streaming K-Means; SK-Means; MapReduce.
Reference to this paper should be made as follows: Esteves, R.M., Hacker, T. and Rong, C.
(2014) 'A new approach for accurate distributed cluster analysis for Big Data: competitive
K-Means', Int. J. Big Data Intelligence, Vol. 1, Nos. 1/2, pp.50–64.
Biographical notes: Rui Máximo Esteves is a researcher at the University of Stavanger (UiS) in
Norway, where his work focuses on data-intensive (Big Data) machine learning, optimisation
and cloud computing. He was a Guest Editor for the special issue 'Cloud Computing and
Big Data' of the Journal of Internet Technology and Chair of the Cloud Computing Contest at the
International Conference on Cloud Computing Technology and Science (CloudCom). He was an
Assistant Professor in Pattern Recognition and in Semantic Web Technologies at UiS. He
lectured at the University of Trás-os-Montes in Portugal in Forestry Statistics and in Forestry
Remote Detection, and worked for the National Institute of Statistics in Portugal. He has participated
in research projects related to optimisation of energy consumption, statistics and remote detection
applied to forestry.
Thomas Hacker is an Associate Professor of Computer and Information Technology at Purdue
University and a Visiting Professor in the Department of Electrical Engineering and Computer
Science at the University of Stavanger in Norway. His research interests centre on
high-performance computing and networking at the operating system and middleware layers.
Recently, his research has focused on cloud computing, cyberinfrastructure, scientific
workflows, and data-oriented infrastructure. He is also a co-leader for Information Technology
for the Network for Earthquake Engineering Simulation (NEES), which brings together
researchers from 14 universities across the country to share innovations in earthquake research
and engineering. He received his BS in Physics and BS in Computer Science from Oakland
University in Rochester, Michigan, USA, and his MS and PhD in Computer Science and
Engineering from the University of Michigan, Ann Arbor, Michigan.
Chunming Rong is the Head of the Center for IP-based Service Innovation (CIPSI) at the
University of Stavanger in Norway, where his work focuses on big data analytics, cloud
computing, security and privacy. He is an IEEE Senior Member and has been honoured as a
member of the Norwegian Academy of Technological Sciences since 2011. He is a Visiting
Chair Professor at Tsinghua University (2011–2014) and also served as an Adjunct Professor at
the University of Oslo (2005–2009). He is the co-founder and Chairman of the Cloud Computing
Association (CloudCom.org) and its associated IEEE conference and workshop series.
This paper is a revised and expanded version of a paper entitled 'Competitive K-Means, a
new accurate and distributed K-Means algorithm for large datasets' presented at the 5th IEEE
CloudCom Conference, Bristol, UK, 2–5 December 2013.
1 Introduction
2.1 K-Means
K-Means is a partitional cluster analysis algorithm that
tries to solve the following clustering problem: given an
integer k, a distance measure dm and a set of n data points in
a d-dimensional space, choose k centroids so as to minimise
a cost function, usually defined as the total distance between
each point in the dataset and the centroid closest to that point.
Finding the exact solution to this problem is NP-hard, and
K-Means provides an approximate solution with O(nkd)
running time (Ailon et al., 2009).
The K-Means algorithm is simple and straightforward:
first, it randomly selects k points from the whole dataset.
These points represent the initial centroids (or seeds). Each
remaining point in the dataset is assigned to the cluster whose
centroid is closest to that point. The coordinates of each
centroid are then recalculated as the average of all points
assigned to the respective cluster. This process iterates until
the cost function converges to an optimum, with no guarantee
that it is the global one. Therefore, selecting the best possible
set of centroids during the initialisation process is essential
(Ostrovsky and Rabani, 2006). The accuracy of large
dataset cluster analyses using K-Means depends on the
accuracy of centroid initialisation methods that are adapted
to datasets distributed across several machines.
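To make the loop concrete, here is a minimal sketch of this iteration in Python with numpy; the function and parameter names (kmeans, init, max_iter) are ours, and the optional init parameter lets the seeding methods discussed below be plugged in:

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    # Seeding: k random points from the dataset, unless seeds are supplied.
    centroids = init if init is not None else X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged, possibly to a local optimum
            break
        centroids = new
    return centroids, labels
```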
for selection of the centres increases the robustness of their
approach to outliers. The authors tested their algorithm with
an artificial dataset and with an image dataset (described as
the well-known 'baboon' image in their paper). Both
datasets tested by the authors have at most eight dimensions.
Relying on the distribution of a single dimension
to determine the initial position of centroids assumes that
the chosen dimension is representative of the distribution of
the dataset. To be representative, the variance
of the selected dimension has to be significantly higher than
that of all the others. Suppose instead that several dimensions
have similarly high variance, leading to several possible
choices. If each choice corresponds to a different sorting of
the data, the choice of dimension will affect the
selection of the initial centroids. Their approach also
assumes that the maximum variance is observed
in only one dimension. These assumptions can be
reasonable in datasets with a limited number of dimensions.
However, in the Big Data scenario we face the challenge of
high dimensionality. Consider, for example, the
use case of document clustering applied to Wikipedia
presented in our previous work (Esteves et al., 2011).
The analysed Wikipedia dataset occupies 30 GB and has
11,500 dimensions after pre-processing. A 30 GB dataset
is a modest use case of Big Data, but it is
sufficient to show that, among 11,500 dimensions, chances
are that several have similar variance with
distinct sortings. Thus, the method presented by Al-Daoud is
not suitable for high dimensionality and consequently not for
Big Data.
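For concreteness, the single-dimension strategy criticised here can be sketched as follows; this is our reading of the approach, and the exact placement of centroids along the chosen dimension may differ in Al-Daoud's paper:

```python
import numpy as np

def variance_based_seeds(X, k):
    # Assumes that the one dimension with clearly dominant variance
    # is representative of the whole distribution.
    dim = X.var(axis=0).argmax()           # ties make this choice arbitrary
    order = X[:, dim].argsort()            # sort the points along that dimension
    groups = np.array_split(X[order], k)   # k equal-frequency slices
    # One seed per slice, here its coordinate-wise median.
    return np.array([np.median(g, axis=0) for g in groups])
```

With 11,500 dimensions of comparable variance, the argmax in the first step is essentially arbitrary, and each choice yields a different ordering and hence different seeds, which is precisely the instability described above.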
Redmond and Heneghan (2007) propose a method for
initialisation of K-Means using kd-trees. The kd-tree is a
binary tree in which every node is a k-dimensional point.
Redmond and Heneghan use the kd-tree as a top-down
hierarchical scheme for partitioning data. Every non-leaf
node represents a split of the data along the longest
dimension of the parent node, with the median value along
that dimension used as the splitting criterion. After creating
the kd-tree, their method computes the density
and the mean value of each leaf node, as well as the
distances between the mean values of the leaf nodes. It
then combines the computed densities and distances to
select the centres of leaf nodes that have high density and
are far apart from each other. The selected centres are the
initial centroids for K-Means. The authors tested their
algorithm with artificial and real-world datasets that have
fewer than 20 dimensions. According to Redmond and
Heneghan, kd-trees scale poorly to high dimensions;
therefore, this method is not appropriate for Big Data.
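The scheme can be sketched as follows; this is a simplified rendering, and Redmond and Heneghan's density estimate and selection rule differ in detail:

```python
import numpy as np

def kdtree_leaves(X, leaf_size=64):
    # Recursively split along the longest dimension at its median,
    # collecting the leaf buckets of the kd-tree.
    if len(X) <= leaf_size:
        return [X]
    dim = (X.max(axis=0) - X.min(axis=0)).argmax()  # longest dimension
    median = np.median(X[:, dim])                   # splitting criterion
    left, right = X[X[:, dim] <= median], X[X[:, dim] > median]
    if len(left) == 0 or len(right) == 0:           # degenerate split, stop
        return [X]
    return kdtree_leaves(left, leaf_size) + kdtree_leaves(right, leaf_size)

def kdtree_seeds(X, k, leaf_size=64):
    leaves = kdtree_leaves(X, leaf_size)
    means = np.array([leaf.mean(axis=0) for leaf in leaves])
    # Density proxy: points per unit of leaf bounding-box volume.
    vols = np.array([np.prod(np.ptp(leaf, axis=0) + 1e-12) for leaf in leaves])
    density = np.array([len(leaf) for leaf in leaves]) / vols
    # Greedily pick leaf means that are dense and far from the chosen seeds.
    seeds = [means[density.argmax()]]
    while len(seeds) < k:
        d = np.linalg.norm(means[:, None] - np.array(seeds)[None], axis=2).min(axis=1)
        seeds.append(means[(density * d).argmax()])
    return np.array(seeds)
```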
El Agha and Ashour (2012) presented an initialisation
method for K-Means. Taking a two-dimensional
dataset as an example, ElAgha initialisation first finds the
boundaries of the data points and then divides the area
covered by the points into K rows and K columns, forming
a 2D grid. ElAgha initialisation then uses the upper-left
corners of the cells lying on the diagonal as base points.
Then the base
Rather than using Hadoop Streaming, RHIPE provides its
own Java map and reduce functions. Since it does not rely
on Hadoop Streaming, RHIPE is the fastest of the three
approaches. We therefore chose RHIPE to implement our
new CK-Means approach.
3 Algorithms
In this section, we describe a MapReduce implementation of our
new CK-Means (Algorithms 4 to 6).
Algorithm 2  Streaming K-Means
1: Partition the dataset X into chunks x1, x2, ..., xm
2: For each chunk xi do
3:   Run K-Means++ on xi to obtain a set Ti of intermediate centres, weighting each centre by the number of points assigned to it
4: End for
5: Sw ← T1 ∪ T2 ∪ ... ∪ Tm
6: Run K-Means++ on the weighted set Sw to obtain the k final centres
7: Return the k centres

Algorithm 3  Serial K-Means++
1: IC ← {one point sampled uniformly at random from X}
2: While || IC || < k do
3:   Compute D(dp), the distance from each point dp ∈ X to its nearest centroid in IC
4:   Sample a point dp ∈ X with probability D(dp)² / Σx∈X D(x)²
5:   IC ← IC ∪ {dp}
6: End while
7: Return IC
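Algorithm 3 can be rendered directly in Python/numpy; the sketch below is minimal, and the function name kmeans_pp_seeds is ours:

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng=None):
    rng = np.random.default_rng(rng)
    ic = [X[rng.integers(len(X))]]                 # step 1: uniform first centroid
    while len(ic) < k:                             # step 2
        # D(dp): distance from each point to its nearest chosen centroid.
        d = np.linalg.norm(X[:, None] - np.array(ic)[None], axis=2).min(axis=1)
        probs = d ** 2 / np.sum(d ** 2)            # P(dp) = D(dp)^2 / sum D(x)^2
        # Steps 4-5: IC <- IC U {dp}; a point already in IC has D = 0
        # and therefore cannot be drawn again.
        ic.append(X[rng.choice(len(X), p=probs)])
    return np.array(ic)                            # step 7
```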
Algorithms 4 to 6  MapReduce CK-Means seeder and cluster analysis
Si = f(clxi)
Skey = f(clxkey)
Input: A HDFS path to the stored data points and the number of clusters k
Output: X points grouped into k clusters and respective centroids C
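Although the full listings of Algorithms 4 to 6 are not reproduced here, the competitive idea they distribute over MapReduce can be sketched serially in Python, reusing the kmeans and kmeans_pp_seeds sketches above. We assume WSSQ as the fitness f; whether the winning partition emits its seeding ICi or its refined local centroids is a detail of the full listings, and this sketch returns ICi:

```python
import numpy as np

def wssq(X, centroids):
    # Within-cluster sum of squares: the fitness f used in our experiments.
    d2 = np.linalg.norm(X[:, None] - centroids[None], axis=2).min(axis=1) ** 2
    return float(d2.sum())

def ck_means_seeds(X, k, m, seed=0):
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X), m)  # m competing partitions
    best_ic, best_fit = None, np.inf
    for xi in parts:                               # map side: one task per x_i
        ic = kmeans_pp_seeds(xi, k)                # IC_i from serial K-Means++
        centroids, _ = kmeans(xi, k, init=ic)      # local analysis clx_i
        fit = wssq(xi, centroids)                  # S_i = f(clx_i)
        if fit < best_fit:                         # reduce side: keep the winner
            best_ic, best_fit = ic, fit
    return best_ic                                 # seeds K-Means over all of X
```

The competition rests on the correlation evaluated in Experiment A: a seeding that scores well on its own partition tends to score well on the whole dataset.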
4 Experimental setup
possible through the use of our new algorithm described in
Section 3. We show this by evaluating the following five
hypotheses:
H1
H2
H3
H4
H5
4.3.1 Experiment A
The aim of Experiment A was to test Hypothesis 1. To
achieve this aim we ran serial K-Means++ on each partition
set element {x1, x2, ..., xm} of the dataset X to obtain IC1,
IC2, ..., ICm sets of k initial centroids. We performed cluster
analyses CLx1, CLx2, ..., CLxm, using K-Means on X, and
using IC1, IC2, ..., ICm as the initial centroids. We then
performed clx1, clx2, ..., clxm cluster analyses using K-Means
on each partition set element {x1, x2, ..., xm}, using the
respective IC1, IC2, ..., ICm as initial centroids. We measured
the correlation between the fitness of CLx and the fitness of
clx. We used the entire hypercube and electrical datasets.
Several days were required to run one serial K-Means++
over the entire KDD99 and the Google datasets; thus we
performed the experiment with 10% of the KDD99 and
Google datasets. All datasets were tested with k = 50, 100,
500 and 1,000 centroids and with m = 6 competitors. We
calculated each correlation coefficient from 100 repetitions
of the cluster analyses.
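As an illustration, the correlation measured in this experiment can be computed along these lines, reusing the kmeans, kmeans_pp_seeds and wssq sketches above; experiment_a is our illustrative name, and np.corrcoef yields the Pearson coefficient:

```python
import numpy as np

def experiment_a(X, k, m=6, repetitions=100, seed=0):
    rng = np.random.default_rng(seed)
    f_clx, f_CLx = [], []
    for _ in range(repetitions):
        # Partition X into the m competitor subsets {x1, ..., xm}.
        parts = np.array_split(rng.permutation(X), m)
        for xi in parts:
            ic = kmeans_pp_seeds(xi, k)        # IC_i from serial K-Means++
            local, _ = kmeans(xi, k, init=ic)  # clx_i: K-Means on the partition
            full, _ = kmeans(X, k, init=ic)    # CLx_i: K-Means on all of X
            f_clx.append(wssq(xi, local))
            f_CLx.append(wssq(X, full))
    # Pearson correlation between local and global fitness values.
    return np.corrcoef(f_clx, f_CLx)[0, 1]
```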
4.3.2 Experiment B
SK-Means

4.3.3 Experiment C

4.3.4 Experiment D
The aim of Experiment D was to test H5. For this
experiment we performed cluster analysis using a
MapReduce implementation of our new CK-Means, with
the KDD99 dataset as the baseline. To test how the
algorithm scales with an increasing number of points, we
compared the execution time of the cluster analysis of the
KDD99 dataset with that of the KDD99n2 dataset. To test
how it scales with an increasing number of dimensions, we
compared the execution time of the cluster analysis of the
KDD99 dataset with that of the KDD99d2 dataset.
The KDD99d2 and KDD99n2 datasets each occupy
4 GB of disk space. We used 15 six-core machines running
Hadoop, R and the RHIPE package. The HDFS block size

5 Experimental results

5.1 Experiment A
Table 1 shows the correlation coefficients obtained from a
correlation analysis of the fitness of CLx and the fitness of
clx as defined in Section 3. We observe that the correlation
increases with the size of the dataset, i.e., with the number
of points: the Google dataset has 13 M points, while the
hypercube dataset has only 10 K points. We also observe
that the correlation is stronger for smaller values of k.

Table 1  Correlation coefficients between f(CLx) and f(clx) for the four datasets of size N for varying k

Dataset      N     k = 50   k = 100   k = 500   k = 1,000
Hypercube    10K   0.80     0.70      0.56      0.24
Electrical   2M    0.80     0.71      0.53      0.43
KDD99        8M    0.79     0.76      0.74      0.72
Google       13M   0.92     0.86      0.78      0.65

5.2 Experiment B
The comparison is repeated for k = 50, 100, 500 and 1,000.
Figures 2 to 4 show the same information as Figure 1 for the
hypercube, KDD99 and Google datasets, respectively. The
y-axes in Figures 1 to 5 represent the fitness function
f = WSSQ; a lower WSSQ value indicates a better selection
of initial centroids for cluster analysis.
We observe in Figure 1 that our CK-Means is more
accurate than SK-Means and K-Means for k = 50 and
k = 100. However, the accuracy of CK-Means relative to
SK-Means deteriorates slightly when k = 500 and
dramatically when k = 1,000. This
decline is explained because each partition of X does not
Figure 1  WSSQ of K-Means, SK-Means and our new CK-Means with different values of k (see online version for colours)
Figure 2  WSSQ of K-Means, SK-Means and our new CK-Means with different values of k (see online version for colours)
5.3 Experiment C
Table 2 compares the running times of the distributed
MapReduce implementation and a non-distributed
implementation of our seeder (Algorithm 5).
Table 2  Running times of our seeder (Algorithm 5) on a single node and on 15 nodes (mean x̄; standard deviation σ)

              k = 50            k = 100            k = 50            k = 100
Single node   x̄ = 828; σ = 13   x̄ = 1,877; σ = 18  x̄ = 885; σ = 15   x̄ = 1,638; σ = 21
15 nodes      x̄ = 82; σ = 8     x̄ = 142; σ = 12    x̄ = 81; σ = 9     x̄ = 142; σ = 10
Our CK-Means seeder thus benefits from MapReduce to reduce the
execution time, proving that Hypothesis 4 is true.
5.4 Experiment D
Table 3 shows how the execution time of our MapReduce
CK-Means seeder scales with dataset growth and with the
number of clusters. The execution time is more sensitive to
increases in n and k than to an increase in d.
Table 3  Scaling of the MapReduce CK-Means seeder with the KDD99 variants for different values of k

Dataset    k = 50   k = 100   k = 500   k = 1,000
KDD99      82       142       577       1,107
KDD99d2    112      172       713       1,394
KDD99n2    148      254       1,137     2,196
Figure 3  WSSQ of K-Means, SK-Means and our new CK-Means with different values of k (see online version for colours)
Figure 4  WSSQ of K-Means, SK-Means and our new CK-Means with different values of k (see online version for colours)
6 Conclusions
References
Ackermann, M., Lammersen, C., Märtens, M., Raupach, C., Sohler, C. and Swierkot, K. (2010) 'StreamKM++: a clustering algorithm for data streams', ALENEX, pp.173–187.
Ailon, N., Jaiswal, R. and Monteleoni, C. (2009) 'Streaming K-Means approximation' [online] http://scholar.google.com.au/scholar.bib?q=info:eeMPmjm4TNsJ:scholar.google.com/&output=citation&hl=en&as_sdt=2000&ct=citation&cd=0.
Al-Daoud, M.B. (2007) 'A new algorithm for cluster initialization', World Academy of Science, Engineering and Technology, Vol. 1, No. 4, pp.568–570.
Arthur, D. and Vassilvitskii, S. (2007) 'K-Means++: the advantages of careful seeding', SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms [online] http://ilpubs.stanford.edu:8090/778/.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S. (2012) 'Scalable K-Means++', Proc. VLDB Endow., Vol. 5, No. 7, pp.622–633.
Crainic, T.G. and Toulouse, M. (2010) Handbook of Metaheuristics, Vol. 146, pp.497–541, International Series in Operations Research & Management Science, Springer, USA [online] http://dx.doi.org/10.1007/978-1-4419-1665-5_17.
Davidson, I. and Satyanarayana, A. (2003) 'Speeding up K-Means clustering by bootstrap averaging', IEEE Data Mining Workshop on Clustering Large Datasets.
Dean, J. and Ghemawat, S. (2008) 'MapReduce: simplified data processing on large clusters', Commun. ACM, Vol. 51, No. 1, pp.107–113, doi:10.1145/1327452.1327492.