Accurate Distributed Cluster Analysis For Big Data Competitive K-Means

50
Int. J. Big Data Intelligence, Vol. 1, Nos. 1/2, 2014
A new approach for accurate distributed cluster

analysis for Big Data: competitive K-Means
Rui Mximo Esteves*
Department of Electrical and Computer Engineering,
University of Stavanger,
Norway
E-mail: rui.esteves@uis.no
*Corresponding author
Thomas Hacker
Computer and Information Technology,
Purdue University,
West Lafayette, Indiana
E-mail: tjhacker@purdue.edu
Chunming Rong
Department of Electrical and Computer Engineering,
University of Stavanger,
Norway
E-mail: chunming.rong@uis.no
Abstract: The tremendous growth in data volumes has created a need for new tools and
algorithms to quickly analyse large datasets. Cluster analysis techniques, such as K-Means can be
distributed across several machines. The accuracy of K-Means depends on the selection of seed
centroids during initialisation. K-Means++ improves on the K-Means seeder, but suffers from
problems when it is applied to large datasets. In this paper, we describe a new algorithm and
a MapReduce implementation we developed that addresses these problems. We compared
the performance with three existing algorithms and found that our algorithm improves cluster
analysis accuracy and decreases variance. Our results show that our new algorithm produced a
speedup of 76 9 times compared with the serial K-Means++ and is as fast as the streaming
K-Means. Our work provides a method to select a good initial seeding in less time, facilitating
fast accurate cluster analysis over large datasets.
Keywords: K-Means; K-Means++; streaming K-Means; SK-Means; MapReduce.
Reference to this paper should be made as follows: Esteves, R.M., Hacker, T. and Rong, C.
(2014) A new approach for accurate distributed cluster analysis for Big Data: competitive
K-Means, Int. J. Big Data Intelligence, Vol. 1, Nos. 1/2, pp.5064.
Biographical notes: Rui Mximo Esteves is a researcher at the University of Stavanger (UiS) in
Norway, where his works focuses on data-intensive (Big Data) machine learning, optimisation
and cloud computing. He was a Guest Editor for the special issue Cloud computing and
Big Data in the Journal of Internet Technology and Chair of Cloud Computing Contest at
International Conferences on Cloud Computing Technology and Science (CloudCom). He was a
Professor Assistant in Pattern Recognition and in Web Semantic Technologies at UiS. He
lectured in the University of Trs-os-Montes in Portugal in Forestry Statistics and in Forestry
Remote Detection. He worked for the National Institute of Statistics in Portugal He participated
in research projects related to optimisation of energy consumption, statistics and remote detection
applied to forestry.
Thomas Hacker is an Associate Professor of Computer and Information Technology at Purdue
University and Visiting Professor in the Department of Electrical Engineering and Computer
Science at the University of Stavanger in Norway. His research interests centre around high
performance computing and networking on the operating system and middleware layers.
Recently, his research has focused on cloud computing, cyber infrastructure, scientific
workflows, and data-oriented infrastructure. He is also a co-Leader for Information Technology
Copyright 2014 Inderscience Enterprises Ltd.
A new approach for accurate distributed cluster analysis for Big Data
51
for the Network for Earthquake Engineering Simulation (NEES), which brings together
researchers from 14 universities across the country to share innovations in earthquake research
and engineering. He received his BS in Physics and BS in Computer Science from Oakland
University in Rochester, Michigan, USA. He received his MS and PhD in Computer Science and
Engineering from the University of Michigan, Ann Arbor, Michigan.
Chunming Rong is the Head of the Center for IP-based Service Innovation (CIPSI) at the
University of Stavanger in Norway, where his work focuses on big data analytics, cloud
computing, security and privacy. He has been an IEEE senior member and is honoured as a
member of the Norwegian Academy of Technological Sciences since 2011. He is a Visiting
Chair Professor at Tsinghua University (20112014) and served also as an Adjunct Professor at
the University of Oslo (20052009). He is the co-founder and Chairman of the Cloud Computing
Association (CloudCom.org) and its associated IEEE conference and workshop series.
This paper is a revised and expanded version of a paper entitled Competitive K-Means, a
new accurate and distributed K-Means algorithm for large datasets presented at the 5th IEEE
Cloudcom Conference, Bristol, UK, 25 December 2013.
Introduction
The volume of data generated daily is growing at an

exponential rate. This growth is due in part to the
proliferation of sensors and the increase in resolution of
those sensors. To distil meaningful information from this
growing mountain of data, there is a growing need for
advanced data analysis techniques, such as cluster analysis.
Clustering is a key factor in the Big Data problem. In a
Big Data context it is not feasible to label large collection
of objects. It is common to not have a prior knowledge of
the underlying data structure and the number of groups and
the group nature. Moreover, in the Big Data context the data
tend to change in time and therefore using clustering
methods can produce clusters that are dynamic. Clustering
provides efficient browsing, search, recommendation and
assist in the document classification, which are relevant
tasks necessary for Big Data. As a consequence, cluster
analysis faces new challenges of processing tremendously
large and complex datasets stored and analysed across many
computers. Since moving a large amount of data between
machines is more costly than moving the computation to the
data, a recent trend is to move algorithms, which typically
represent a few KB, to process chunks of the dataset
independently. The MapReduce approach is a seamless
solution to distributed computation that can be used to solve
this problem; however, it requires new algorithms that can
benefit from MapReduce technology.
K-Means (Zhao et al., 2009; Ekanayake et al., 2008), is
a cluster analysis algorithm that can be implemented using
an embarrassingly parallel approach for clustering large
datasets distributed across several machines.
A proper initialisation of K-Means is crucial to
obtaining a good final solution. There are no efficient
lightweight techniques for improving the choice of initial
centroids when the dataset has the following characteristics:
a large number of clusters
a high-feature dimensionality
a large number of data points
storage across several systems. K-Means++

(developed by Arthur and Vassilvitskii, 2007) employs
an improved seeding method to improve K-Means
quality by choosing initial cluster centroids largely
distant from each other.
However, we observed two major problems when we

applied K-Means++ seeding method to large datasets. First,
K-Means++ is a stochastic algorithm, which means that the
results produced by K-Means++ were considerably different
across several analysis runs using the same initial
conditions. We observed that the difference in the results
grows as the dataset contains more points and has higher
feature dimensionality. Still, we found that the quality of the
K-Means++ was much better than K-Means when using a
random selection of initial centroids. Second, K-Means++ is
an inherently serial algorithm that is time-consuming for
large datasets.
In related work, several authors proposed improvements
to the second problem without considering the first problem
(Esteves et al., 2012; Bahmani et al., 2012; Ailon et al.,
2009). Pavan et al. (2010) addresses the first problem, but at
the expense of worsening the second problem.
In this paper, an extended version of a paper we
presented at CloudCom 2013 (Esteves, 2013), we propose a
new parallel seeding algorithm named competitive K-Means
(CK-Means) that addresses both problems affecting serial
K-Means++. We also propose an efficient MapReduce
implementation of our new CK-Means that we found scales
well with large datasets.
52
R.M. Esteves et al.
Theory and background
2.1 K-Means
K-Means is a partition type of cluster analysis algorithm that
tries to solve the following clustering problem. Given an
integer k, a distance measure dm and a set of n data points in
a d-dimensional space, the goal is to choose k centroids so
as to minimise a cost function usually defined as the total
distance between each point in the dataset and the closest
centroid to that point. The exact solution to this problem is
NP-hard and K-Means provides an approximate solution
that has O(nkd) running time (Ailon et al., 2009).
The K-Means algorithm is simple and straightforward:
first, it randomly selects k points from the whole dataset.
These points represent the initial centroids (or seeds). Each
remaining point in the dataset is assigned to a cluster whose
centroid is closest to that point. The coordinates of the
centroid are then recalculated. The new coordinates of a
specific centroid corresponds to the average of all points
assigned to the respective cluster. This process iterates until
a cost function converges to an optimum without a
guarantee that it is the global one. Therefore, the initial
selection of centroids during the initialisation process to
select the best possible set of centroids is essential
(Ostrovsky and Rabani, 2006). The accuracy of large
dataset cluster analyses using K-Means depends on the
accuracy of centroid initialisation methods that are adapted
to datasets distributed across several machines.
2.2 The problem of selecting initial centroids for

large Big Data datasets
A considerable amount of prior work has been undertaken
in cluster analysis and in developing new approaches to
select an excellent set of initial centroids for cluster analysis
techniques such as K-Means. In the era of Big Data,
however, the techniques that have been developed for initial
centroid selection may not scale to the hundreds of
thousands to millions of data points present in large
datasets, or to the high degree of dimensionality in these
large datasets. The inherent characteristics of Big Data,
known as the 4-Vs, present considerable challenges to
existing methods. The volume of data is the first critical
characteristic. Big Data involves the management and
analysis of tremendous volumes of data reaching in the
petabytes. One example is from the Compact Muon
Solenoid project, which produces petabytes of data annual
from the Large Hadron Collider at CERN. Another example
of volume is the production of log records for large scale
supercomputer systems, which can be comprised of
thousands of individual computational nodes, each
producing thousands of log entries each day. Another
distinguishing characteristic is variety. The inherent
dimensional complexity of Big Data, along with the much
larger potential degrees of freedom, require faster and more
accurate techniques for cluster analysis and centroid
selection. Moreover, the growing pervasive availability of
multi-core processors and commodity-based cluster
computing systems coupled with high performance parallel

storage systems, such as Hadoop and Lustre, provide a
fertile new ground for the development of new parallel
algorithms and data analytic techniques take can best take
full advantage of these new architectures and platforms.
This is the motivation for our work described in this paper.
According to Meil and Heckerman (1998), there is no
formal delimitation between initial search and search. The
task of finding the initial position of the centroids means
finding the global optimum to a certain accuracy. This task
represents a search per se, and in some circumstances it can
be a more challenging one than that performed by the main
clustering algorithm. The concept of best generally
implies a tradeoff between accuracy and computation cost.
To Meil and Heckerman the best initialisation method is
therefore an ill-defined notion, which depends on the data,
the accuracy/cost tradeoff and the clustering algorithm.
Meil and Heckerman studied the effect of three different
initialisation methods on the expectation-maximisation
clustering algorithm: the Random approach; the
noisy-marginal method of Thiesson et al. (1997) and the
hierarchical agglomerative clustering. The last two
initialisation methods are data dependent while the first one
is data independent. On a synthetic dataset created by Meil
and Heckerman, the Random method performed worse than
did the data-dependent initialisation method, but this
difference was not seen for the real-world data.
Khan and Ahmad (2004) present their cluster centre
initialisation method (CCIA) for K-Means. Their algorithm
first computes K cluster centres for individual attributes.
After the first step, the method choses seeds by examining
each of the attributes individually to extract a list of K
possible seed locations. It might happen that K > K
(more clusters centres were computed than needed). To
reduce the number of centres to K, their method uses
density-based multi scale data condensation (DBMSDC).
DBMSDC computes the density of the data at a point, and
then sorts the points according to their density. The point in
the top of the sorted list is chosen and all the points within a
radius inversely proportional to the density of that point are
pruned. The DBMSDC algorithm moves to the next point,
which has not been pruned from the list and repeats. This is
repeated until the desired number of points K remains. Their
method assumes that each attribute is normally distributed.
It also assumes that individual attributes can provide some
hints to initial cluster centres. The authors test their method
in a scenario of a small dataset. They tested with a real
dataset of 87 samples and 6 dimensions. In the scenario of
Big Data, where we have dimensions in the number of
hundreds or thousands, the presented assumptions are
difficult to guarantee as a priori conditions.
Al-Daoud (2007) introduced a new algorithm for
initialisation. The idea of the algorithm is to find the
dimension with maximum variance, sorting it by the
dimension selected, dividing it into a set of groups of data
points then finding the median for each group, using the
corresponding data points (vectors) to initialise the
K-Means. The usage of the median instead of the average
for selection of the centres increases the robustness of their
approach to outliers. The authors tested their algorithm with
an artificial dataset and with an image dataset (described as
the well known baboon image in their paper). Both
datasets tested by the authors have eight dimensions at
maximum. Relying on the distribution of a single dimension
to determine the initial position of centroids assumes that
the dimension chosen is representative of the distribution of
the dataset. To be representative of the dataset the variance
of the selected dimension has to be significantly higher that
all the others. Suppose that we have several dimensions
with similar high variance leading to several possible
choices. If each choice corresponds to a different sorting of
the data, the choice of the dimension will affect the
selection of the initial centroids. Their approach also
assumes that the value of the maximum variance is observed
in only one dimension. These assumptions can be
reasonable in datasets with a limited number of dimensions.
However, in the Big Data scenario, we face the challenge of
high dimensionality. For example, considering the
use case of document clustering applied to Wikipedia as it
was presented in our previous work (Esteves et al., 2011).
The analysed Wikipedia dataset has 30 GB and
11,500 dimensions after pre-processing. A dataset with
30 GB is a modest use case of Big Data, however it is
sufficient to understand that in 11,500 dimensions chances
are that several dimensions have similar variance with
distinct sorting. Thus, the method presented by Al-Daoud is
not suitable for high dimensionality and consecutively for
Big Data.
Redmond and Heneghan (2007) propose a method for
initialisation of K-Means using kd trees. The kd-tree is a
binary tree in which every node is a k-dimensional point.
Redmond and Heneghan use the kd-tree as a top-down
hierarchical scheme for partitioning data. Every non-leaf
node represents a splitting of the data along the longest
dimension of the parent node. The value of the median
computed for the longest dimension of the parent node is
used as the criteria for splitting. After creating the kd-tree,
the Redmond and Heneghan method computes the density
and the mean value for each leaf-node. It also computes the
distances between the mean values of the leaf-nodes. The
methods combines the information obtained from the
computed densities and distances to select centres of
leaf-nodes that have high density and are far apart from
each other. The selected centres are the initial centroids for
K-Means. The authors tested their algorithm with artificial
and real world datasets that have less than 20 dimensions.
According to Redmond and Heneghan, kd-trees have poor
ability to scale to high dimensions. Therefore this method is
not appropriated for Big Data.
El Agha and Ashour (2012) presented an initialisation
method for K-Means. Taking as example a two-dimension
dataset, ElAgha initialisation first finds the boundaries of
data points, and then it divides the area covered by the
points into K rows and K columns forming a 2D grid.
ElAgha initialisation then uses the upper left corner of the
cells lied on the diagonal as a base points. Then the base
53
points are randomly biased to generate the actual initial

centroids for K-Means. ElAgha initialisation generates K
points using semi random technique. It makes the diagonal
of the data as a starting line and selects the points randomly
around it. El Agha and Ashour tested their algorithm with
three artificial datasets created by the authors and with the
well-known public available Iris dataset by Fisher (1936).
The three artificial datasets has just two dimensions, with no
outliers and well-defined clusters that are homogenously
distributed in the sample space. With just four dimensions
and four classes, the Iris dataset is also a simple dataset.
El Agha initialisation archived better results than Random
initialisation mainly with the artificial datasets. Our
interpretation is that by using a criterion of setting the base
points from a diagonal of the data, ElAgha method assumes
that the data obeys to a particularly geometric structure of
data. For example, ElAgha initialisation method does not
select initial centroids from the opposite corners of the
diagonal. In contrast, by using our new CK-Means, we pick
initial centroids based on a probability distribution
according to the distance of the centroids. Thus, we do not
favour or exclude any specific regions that are pre-mapped
in the clustering space, neither we assume any particularly
structure in the data. Therefore, if a dataset has clusters
situated in the opposite corners of the diagonal, ElAgha
does not pick any point, while our CK-Means picks points
in those regions with high probability as long as the
centroids are far away from each other. Making assumptions
about the structure of datasets in advance might not be a
major problem for clustering small and simple datasets. In a
Big Data scenario this is usually not the case, therefore our
new algorithm is more suitable for Big Data.
2.3 Related work in centroid initialisation methods

for K-Means
Several studies have investigated the parallelisation of
K-Means (Zhang et al., 2006; Gursoy, 2004; Stoffel and
Belkoniene, 1999; Zhao et al., 2009; Jin et al., 2006; Kumar
et al., 2011; Wasif and Narayanan, 2011; Ekanayake et al.,
2008) to improve performance, but neglected the problem of
the seeding initialisation needed to select good centroids for
cluster analysis. Niknam et al. (2011) presents a survey of
these methods. Evolutionary algorithms are suitable to be
parallelised (Knysh and Kureichik, 2010; Crainic and
Toulouse, 2010) but they require input parameters that may
not be easy to determine (Eiben et al., 2007). The
K-Means++ algorithm (Arthur and Vassilvitskii, 2007)
improves the initialisation of K-Means by selecting an
initial set of centroids that has a higher probability of being
closer to the optimum solution.
A few studies investigate the parallelisation of
K-Means++. Bahmani et al. (2012) presents a scalable
algorithm inspired by the original serial K-Means++. Their
main idea is that instead of sampling a single point
in each pass of the K-Means++ algorithm, O(k) points
can be sampled per iteration. The process repeats for
approximately O(log n) iterations. At the end of the
54
R.M. Esteves et al.
iterations, O(k log n) points are candidates to form a

solution. In the next step these candidates are clustered into
k clusters using serial K-Means++. The resulting k centroids
are then the seed centroids for K-Means analysis.
The scalable K-Means++ from Bahami has one major
downside, which is the requirement of an extra input
parameter called oversampling factor l. The value of l is not
intuitive, and its choice can dramatically change the quality
of the results. As the authors admit, the analysis of the
provable guarantees is highly non-trivial and requires new
insights in contrast to the analysis of K-Means++.
In previous work (Esteves et al. 2012), we described a
solution to parallelise the most intensive calculations of the
serial K-Means++. Our approach maintains the batch
structure of the original serial K-Means++; therefore the
provable guarantees of being an expected O(log (k))
approximation to the K-Means problem are maintained. Our
approach presented in Esteves et al. (2012) reduced the
execution time by half compared with the serial
K-Means++, with no need for additional parameters.
However, these results were obtained by using only a single
multicore machine. Since our solution presented in Esteves
et al. (2012) relies on the same batch principles as the serial
K-Means++, it is inefficient for datasets stored across
several machines.
Ailon et al. (2009) proposed an algorithm named
streaming K-Means (SK-Means) that is another scalable
approximation to K-Means++.
Ackermann et al. (2010) introduced a streaming
algorithm based on K-Means++ that shares similarities with
the one presented in Ailon et al. (2009). Both algorithms
presented in Ailon et al. (2009) and Ackermann et al. (2010)
follow a divide-and-conquer strategy, whereby the dataset is
partitioned into smaller subsets, and analyses are done in
parallel to each partition. Both algorithms address the
problem of speeding up the K-Means++ running time;
however they do not address the problem that K-Means++
is a stochastic algorithm that can produce considerably
different results from several runs using similar initial
conditions.
Pavan et al. (2010) proposed a deterministic algorithm
based on K-Means++ that addresses the stochastic problem.
To find each cluster centroid, Pavans method calculates a
distance matrix between every point in the dataset. Pavans
algorithm is serial and has O(n2kd) complexity, which is
worse than K-Means++. Thus the worst-case running time
is even larger than serial K-Means++.
Our new CK-Means reduces the running time compared
with serial K-Means++. Several runs of our new algorithm
using the same initial conditions produces results with less
variance and better quality than the serial K-Means++. Thus
our new algorithm improves the seeding compared with
K-Means++.
2.4 Using MapReduce/Hadoop for distributed

cluster analysis
Hadoop (White, 2010) provides a distributed file system and
a framework for the analysis and transformation of very
large datasets using the MapReduce (Dean and Ghemawat,
2008) programming model. The Hadoop distributed file
system (HDFS) (Shvachko et al., 2010) is designed to
reliably store very large datasets across several systems.
HDFS is suitable for storing very large files (up to TBs in
size) to be processed in a write-once, read-many-times
pattern. A typical MapReduce job involves reading a large
proportion, if not all, of the dataset; so the time required for
reading the whole dataset is more important than the latency
incurred in reading the first record (Shvachko et al., 2010).
Our new CK-Means approach is suitable for efficiently
performing cluster analyses on large datasets distributed
across several machines. The algorithm can be executed in
parallel by a cluster of computers. The heavy calculations
can be performed by each machine in a cluster on a chunk
of data independently of the remaining dataset. Thus our
new algorithm can be easily parallelised to use MapReduce
and Hadoop.
2.5 Hadoop and R for distributed cluster analysis

R (The Comprehensive R Archive Network, 2012), an
open-source statistics package, is by default a serial
program that uses only a single computational core. We can
use the doMC package (The Comprehensive R Archive
Network, 2012) to extend R to use multiple cores in a single
machine. There are three approaches available at the
moment (Holmes, 2012) to integrate R with a Hadoop
cluster:
a
R + streaming by using Hadoop streaming

functionality, a user launches a streaming job and
provides the map-side and reduce-side R scripts
RHadoop an R package integrated with the R

environment. RHadoop provides an R wrapper on top
of Hadoop and streaming
RHIPE similar to RHadoop that is integrated with the

R environment.
Rather than using streaming, RHIPE uses its own map and
reducer Java functions. Since it does not rely on
Hadoop streaming, RHIPE is the fastest of the three
approaches. Therefore we chose RHIPE to implement our
new CK-Means approach.
Algorithms
In this section, we describe the existing serial K-Means++

(Arthur and Vassilvitskii, 2007) (Algorithm 1) along with
the existing SK-Means (Ailon et al., 2009) (Algorithm 2),
which is a parallel algorithm that has been theoretically
proven to provide results similar to the serial K-Means++,
and our new CK-Means (Algorithm 3). At the end of this
55
section, we describe a MapReduce implementation of our
new CK-Means (Algorithms 4 to 6).
3.1 Serial K-Means++

The main idea in the K-Means++ algorithm shown in
Algorithm 1 is to choose the set of initial centroids IC for
K-Means one-by-one in a sequential manner, where the
current set of chosen centroids will stochastically bias the
selection of the next centroid (Arthur and Vassilvitskii,
2007).
Algorithm 1
Algorithm 2
Streaming K-Means
Input: A set of data points X; the number of clusters k and the

number of partitions, m.
A, K-Means++ modified to select 3log(k) points per
iteration.
A, K-Means++ selecting one point per iteration.
Output: X points grouped into k clusters and respective final
centroids FC
1:
Partition X into X1, X2, ..., Xm
2:
For each i {1, 2, , m} do
3:
Serial K-Means++
4:
Run A on Xi to get 3k log(k) centroids

Ti = {ti1, ti2, }
Denote the induced clusters of Xi as Si1 Si2
Input: A set of data points X and the number of desired

centroids k
5:
Sw T1 T2 .. Tm

centroids FC
6:
Run A on Sw to get k centroids C
7:
Run K-Means on X using the set of initial centroids FC
1:
IC a single data point uniformly sampled at random

from X
2:
While || IC || <k do
3:
For each data point dp X, compute D(dp, ic),

where D is the shortest distance from x to the
closest ic IC
4:
Sample dp X with probability

D 2 (dp, ic)
D 2 (dp, ic)
5:
IC IC {dp}
6:
End while
7:
K-Means on X using the set of initial centroids IC

Source: Arthur and Vassilvitskii (2007)
K-Means++ is a serial algorithm that repeats steps 2 to 5 k

times to select the set of initial centroids IC for the
K-Means. When the selection of IC centroids is complete,
K-Means++ proceeds with step 7 and performs cluster
analysis on X, using K-Means with IC as the set of initial
centroids. At the end of step 7, the K-Means group X points
into k clusters and calculates the final set of centroids FC,
which corresponds to the averages of points within each
cluster.
3.2 Streaming K-Means

SK-Means shown in Algorithm 2, divides the input into m
equal-sized groups in step 1. In step 2, the algorithm runs in
each group a variant of K-Means++ that selects 3 log(k)
points in each iteration a total of k times (traditional
K-Means++ selects only a single point). In step 4, the
algorithm weights the m sets of points. The process of
weighting the points is not clearly specified in Ailon et al.
(2009), and the authors provide no guidelines on how to
weight them. In step 6, it runs serial K-Means++ on the
weighted set of these points to reduce the number of
centroids to k (Ailon et al., 2009).
Source: Ailon et al. (2009)
3.3 Our new CK-Means

We observed that running K-Means++ over large datasets
produces varying results of the algorithms inherent
stochastic nature. We can decrease the variability of the
results if we run several instances of serial K-Means++ in
parallel and select for our cluster analysis the instance
that produces the most-accurate clustering results.
However, running serial K-Means++ over large datasets is
time-consuming. To solve this problem, our new approach
reduces the execution time of the serial K-Means++ by
performing several cluster analysis instances of K-Means++
over subsets of the dataset in parallel. The result of each
cluster analysis instance on a subset of the dataset is then
scored using a fitness measure. The K-Means++ instances
compete with each other, and the winner is the instance with
the best-fit cluster analysis. The set of initial centroids IC of
the winner K-Means++ is then used as the initial centroids
IC for K-Means cluster analysis over the entire dataset.
For an overview of our CK-Means, consider a dataset X
for which cluster analysis is needed. We can randomly
select x, a subset from X to be used as input for K-Means++
with the aim of selecting a set of initial centroids IC that
will be used to perform cluster analysis on the entire dataset
X and produce a set CLx as the result of such analysis. We
define f as a fitness measure used to score the results of a
cluster analysis. For this paper, we defined the fitness
measure f to be the within sum of squares (WSSQ) function
(Hartigan and Wong 1979). Any of a number of fitness
measures as Silhouette (Rousseeuw, 1987), intra-cluster
similarity technique and centroid similarity technique
(Steinbach et al., 2000) can be used. However, we found
that the WSSQ was adequate. For a given clustering
problem where we perform cluster analysis several times
using WSSQ as f, the Best-fit cluster analysis corresponds
to the one with lowest WSSQ.
From our experiments, we found that the expectation
value E[f(CLx)] is a good approximation to the expectation
56
R.M. Esteves et al.
value of E[f(CLX)], where CLX is the resulting output of

cluster analysis on the full dataset X. Consequently, we
inferred that the fitness function f(CLx) f(CLX), using
K-Means++ on the complete dataset X and using
K-Means++ on a randomly selected representative subset x.
Let clx be the cluster analysis of x with initial k
centroids selected by running K-Means++ over x.
We observed that when x is representative of X, there is
a strong correlation (~0.70.9) between f (CLx) and f (clx).
Thus we can use f (clx) to select the best-fit set of initial CI
centroids and CI as the initial centroids for K-Means to
perform cluster analysis over the full dataset X.
Our new strategy is shown in Algorithm 3. The choice
of parameter m is optimal if it equals the total number of
available computational cores and simultaneously meets the
restriction whereby each partition xi must have more than
the minimum sample size to be representative of X. The
minimum sample size is O(k), and it is usually determined
empirically by trial and error (Davidson and Satyanarayana,
2003).
Algorithm 3
Our new CK-Means
Input: A set of data points X shuffled into a random order; the

number of centroids k; the number of competitors m and a
fitness measure f
centroids CF
1:
Partition X into x1, x2, ..., xm
2:
For each i {1, 2, , m} do
3:
Run K-Means++ on xi to get k centroids ICi and

clusters clxi
4:
Si = f (clxi)
5:
C ICi, where i Best fit(Si)
6:
Run K-Means on X with C as initial centroids to obtain

cluster analysis output
3.4 CK-Means: a MapReduce implementation

Our new CK-Means can be easily implemented using
MapReduce and HDFS. We can benefit from the distributed
platform provided by MapReduce and HDFS, using
the approach described in Algorithm 4 to write X into
HDFS where each partition is assigned to a unique
MapReduce key.
Algorithm 4
Our approach to write the dataset into HDFS
Input: A set of data points X shuffled into a random order and

the number of partitions, m
1:
Partition X into x1, x2, ..., xm
2:
Assign to each xkey a unique key {1, 2, , m}
3:
Write the pairs (key, xkey) in the HDFS
Ideally, the input parameter m should match the maximum

number of mappers that the computer cluster can run
simultaneously. In a scenario where we have many
machines and a relatively small dataset, m should be
selected by considering that each partition has sufficient

data points for representing X. When RHIPE is used, the
maximum size of each partition cannot exceed 256 MB,
which is a limitation imposed by the Google Protocol
Buffers (Protocol Buffers Google Developers, 2012)
that is a serialisation protocol used by RHIPE. HDFS
automatically partitions and distributes X into the machines
in the cluster. Steps 3 to 5 of Algorithm 3 can be easily
parallelised using a map function. The MapReduce platform
launches several map tasks, each one processing a partition
xi in the machine where it is stored. At the end of the
computation, each map emits to a reducer a candidate
solution with a set of the initial centroids and a fitness score.
The reducer chooses the fittest set of centroids. We present
the details of the MapReduce function in Algorithm 5.
Algorithm 5
Our MapReduce seeder
Input: A path to the stored data points in HDFS; the number of

clusters k and a fitness measure f
Output: k centroids C
1:
Map every (key, xkey) pair
2:
Run K-Means++ on xkey to get k centroids ICkey and k

clusters clxkey
3:
Skey = f (clxkey)
4:
Emit to the reducer the (key, (Skey , ICkey )) pair where

key is a constant
5:
6:
Reduce {(S1, IC1), (S2, IC2), , (Sm, ICm)}

C = IC y , where min ( S y )
1 y i
Our new CK-Means uses a limited amount of bandwidth.

The input values to the reducer are a fixed set of m pairs
(S, IC) emitted by the mappers that are independent of the
total number of points existent in the dataset. Since S is a
single value and IC is a set of k centroids, and each centroid
has d values, a single map emits to the reducer (k d) + 1
values. This represents an advantage of our new algorithm,
since the network is a bottleneck-shared resource in large
MapReduce jobs (Zaharia et al., 2008). The second step of
Algorithm 6 calls a MapReduce implementation of
K-Means, as described in Zhao et al. (2009).
Algorithm 6
Our CK-Means MapReduce
Input: A HDFS path to the stored data points and the number
of clusters k
Output: X points grouped into k clusters and respective
centroids C
1:
Run MapReduce seeder on HDFS path to get k centroids C
2:
Run K-Means_MR on HDFS path with C as initial

centroids and get CLx
Experimental setup
In this section, we demonstrate the significant

improvements in processing time and cluster quality
possible through the use of our new algorithm described in
Section III. We show this by evaluating the following five
hypotheses:
H1
A correlation exists between f (CLx) and f (clx).
H2
Our new CK-Means improves the quality of cluster

analysis compared with SK-Means, which is a
parallel algorithm that has been theoretically proven
by Ailon et al. (2009) to provide similar results as
serial K-Means++.
H3
Our new CK-Means has a similar running time when

compared with SK-Means.
H4
Our new CK-Means benefits from the usage of

MapReduce to reduce the execute time needed for
cluster analysis.
H5
Our new CK-Means scales with the dataset size,

both with the total number of points and the
dimensionality of the dataset.
57
dimensions, but twice the number of points, (16 M).

The KDD99d2 has the same amount of points as the
KDD99, but twice the number of dimensions (74).
Google a dataset of failures collected from a Google

computer cluster (TraceVersion2 Google Cluster Data
Second Format of Cluster-Usage Traces Traces of
Google Workloads Google Project Hosting, 2012).
Using this dataset, we tested the performance of our
new algorithm in a specific clustering situation where
we have a great many points with few dimensions. The
dataset has 13 M points with two dimensions that
represent the time stamp when a node has a failure and
the location identifier of the node where the failure
occurs. We aim to reduce the spatial temporal
dependency present in this dataset by compressing the
data into cluster prototypes and using the prototypes as
data representatives to further the processing. The
problem is described by Hacker et al. (2009).
4.2 The equipment

4.1 The datasets
To test our hypotheses, we used 4 datasets with different
characteristics.
Hypercube this is a synthetic dataset that we

generated to test the behaviour of our new algorithm in
an especially difficult situation of multiple overlapping
clusters. We created the dataset by using the hypercube
function in the R package mlbench. We generated seven
hypercubes, each with ten dimensions. The lengths of
the hypercube sides were sampled using a Gaussian
distribution with a variance of 0.25 and an average of 1.
We added points around the vertices sampled, using a
Gaussian distribution with a variance of 0.25. The
dataset has 10 K points with ten dimensions.
Electrical this is a dataset publicly available at the
UCI machine learning repository (Frank and Asuncion,
2010) that consists of real measurements of electric
power consumption in one household with a
one-minute sampling rate over a period of almost four
years. The dataset has 2 M points with nine dimensions.
We chose it to test our algorithm in a relatively small
real-world clustering problem. Its public availability
makes it easy for the reader to replicate the tests and
benchmark with other algorithms.
KDD99 this dataset is publicly available at the UCI
machine learning repository (Frank and Asuncion,
2010) and has been used for clustering benchmarks. We
chose it to test our algorithm in a medium real-world
clustering problem with a medium number of features.
The dataset consists of data from a network-intrusion
detector with 8 M points in 37 dimensions. The
network was submitted with 24 distinct attacks, which
suggests the possibility of 24 existing clusters. We
created two variations of the KDD99 dataset
specifically to test H5. The KDD99n2 also has 37
To follow Hadoop paradigm we used a cluster based on

commodity hardware. Each machine has the following
configuration: 16 GB 1,333 MHZ DDR3 RAM;
1 TB hardrive, SATA 6 Gb/s, 64 MB Cache, 7,200 RPM;
1 CPU AMD Phenom II X6, 6 cores 3,300 MHz;
Onboard 1,000 Mbps LAN; Linux CentOS with
kernel
Version
2.6.32-358.2.1.el6.centos.plus.x86_64;
Hadoop Version 0.20.203.0; R version 2.15.2; Rhipe
version 0.69; doMC version 1.3.0.
4.3 The experiments

For all of the experiments, we normalised the data and used
Euclidean distance as distance measure and WSSQs as the
fitness measure f.
4.3.1 Experiment A
The aim of Experiment A was to test Hypothesis 1. To
achieve this aim we ran serial K-Means++ on each partition
set element {x1, x2, ..., xm} of the dataset X to obtain IC1,
IC2, ..., ICm sets of k initial centroids. We performed cluster
analyses CLx1, CLx2, ..., CLxm, using K-Means on X, and
using IC1, IC2, ..., ICm as the initial centroids. We then
performed clx1, clx2, ..., clxm cluster analyses using K-Means
on each partition set element {x1, x2, ..., xm}, using the
respective IC1, IC2, ..., ICm as initial centroids. We measured
the correlation between the fitness of CLx and the fitness of
clx. We used the entire hypercube and electrical datasets.
Several days were required to run one serial K-Means++
over the entire KDD99 and the Google datasets; thus we
performed the experiment with 10% of the KDD99 and
Google datasets. All datasets were tested for number of
centroids, k = 50, 100, 500, and 1,000, and with a number of
competitors m = 6. We calculated each correlation
coefficient based on 100 repetitions of cluster analyses
58
R.M. Esteves et al.
under the same circumstances. We used a single machine

with six cores running the R and doMC package.
4.3.2 Experiment B
was 32 MB. We tested the three datasets with k = 50, 100.

Since we have 15 machines with six cores each, we chose
m = 90. We repeated each test five times and measured
the execution times used by our MapReduce seeder
(Algorithm 5).
The aim of Experiment B was to test H2 and H3. We

performed cluster analysis using the following algorithms:
a
K-Means with random initial centroids
SK-Means
our new CK-Means.
We performed the experiment using a random sample of

10% of the KDD99 and Google datasets. The 10% sample
of the KDD99 and Google datasets was the same for all
tests. We used the entire hypercube and electrical datasets.
All datasets were tested for k = 50, 100, 500, 1,000. We
repeated each test 100 times and measured the WSSQ. To
measure the execution times, we repeated each experiment
five times. We used a single machine with six cores with R
and doMC package.
Experimental results
5.1 Experiment A
Table 1 shows the correlation coefficients obtained from a
correlation analysis of the fitness of CLx and the fitness of
clx as defined in Section 3. We observe that the correlation
increases as the size of the dataset increases. This
observation means that an increase of the correlation is
associated with the rise in the number of points in the
datasets. The Google dataset has 13 M points while the
hypercube has only 10 K points. We observe that the
correlation is stronger for smaller values of k.
Correlation coefficients between f(CLx) and f(clx) for
the four datasets of size N for varying K
Table 1
k = 50
k = 100
k = 500
k = 1,000
10K
0.80
0.70
0.56
0.24
0.71
0.53
0.43
0.76
0.74
0.72
0.78
0.65
4.3.3 Experiment C
Hypercube
The aim of Experiment C was to test H4. For this

experiment, we performed cluster analysis using a
MapReduce implementation of our new CK-Means. We
used 15 machines with six cores with Hadoop, R
and RHIPE package, and compare the results with one
machine with six cores with R and doMC package.
The HDFS block size was 32 MB. We did the experiment
with the entire KDD99 and Google datasets for k = 50, 100.
Because we have 15 machines with six cores each, we
chose m = 90. We repeated each test five times and
measured the execution times used by our MapReduce
seeder (Algorithm 5).
Electrical
2M
0.80
KDD99
8M
0.79
Google
13M
0.92
0.86
4.3.4 Experiment D
The aim of Experiment D was to test H5. For this
experiment we performed cluster analysis using a
MapReduce implementation of our new CK-Means; the
KDD99 dataset was the baseline. For testing how the
algorithm scales with an increasing number of points, we
compared the execution times of cluster analysis of the
KDD99 dataset with the KDD99n2 dataset. To test how it
scales with an increasing number of dimensions, we
compared the execution times of cluster analysis of the
KDD99 dataset with the KDD99d2 dataset.
The KDD99d2 and KDD99n2 datasets occupy each
4 GB in disk space. We used 15 machines with six cores
with Hadoop, R, and RHIPE package. The HDFS block size
The hypercube and electrical datasets have weak correlation

for k = 500, 1,000. The drop in the correlation is explained
by the diminishing of the representativeness of each
partition of X when k increases. As we increase the number
of clusters, the minimum number of points per partition
necessary to represent the dataset rises. Thus this problem is
more noticeable in the smallest dataset tested, which is the
hypercube. The solution for this problem is to reduce the
number of partitions m for k equal to 500 and 1,000 in both
hypercube and electrical datasets. However, we maintained
m for Experiment B to explore the effects of the lack of
representativeness in the quality of the cluster analysis.
The results show that a correlation exists between
f(CLx) and f(clx), which proves that our first Hypothesis is
true. However, the correlation decreases when each partition
of X lacks enough points to represent X.
5.2
Experiment B
Figure 1 compares the results of cluster analysis of the

hypercube dataset between the following methods:
a
the K-Means using random initial centroids
SK-Means, which is a parallel implementation with

equivalent results to the serial K-Means++
our new CK-Means.
59
The comparison is repeated for k = 50, 100, 500, 1,000.
Figures 2 to 4 show the same information as Figure 1 for the
datasets hypercube, KDD99, and Google, respectively. The
y-axis in Figures 1 to 5 represent the fitness function
f = WSSQ. A lower WSSQ value indicates a better selection
of initial centroids for cluster analysis.
We observe in Figure 1 that our CK-Means is more
accurate than SK-Means and K-Means for k = 50 and
k = 100. However, the relative accuracy of CK-Means in
comparison with SK-Means slightly deteriorates when
k = 500 and dramatically deteriorates when k = 1,000. This
decline is explained because each partition of X does not
Figure 1
have the minimum sample size. The increase in the number

of clusters requires an increase in the minimum sample size
that is necessary to represent more clusters. However, the
same undesired behaviour is expressed by SK-Means. The
accuracy of SK-Means when compared with the K-Means
drops dramatically in the same way as the CK-Means drop.
Figure 2 shows, in contrast to Figure 1, that the accuracy
of CK-Means compared with the SK-Means and with the
K-Means for the electrical dataset did not decrease for
higher values of k. Although the correlation coefficient is
above 0.5 for k equal to 1,000, CK-Means still produces
better results when compared with SK-Means and K-Means.
WSSQ of K-Means; SK-Means and our new CK-Means with different values of k (see online version for colours)
2720.00
WSSQ box plot k=50
3200.00
2715.00
WSSQ box plot k=100
3190.00
2710.00
3180.00
2705.00
3170.00
2695.00
WSSQ
WSSQ
2700.00
2690.00
3160.00
3150.00
2685.00
3140.00
2680.00
3130.00
2675.00
2670.00
3120.00
K-means
SK-means
Min Value
K-means
Max Value
SK-means
Min Value
WSSQ box plot k=500
12500.00
5000.00
12400.00
4900.00
12300.00
4800.00
12200.00
CK-means
Max Value
WSSQ box plot k=1000
WSSQ
WSSQ
5100.00
CK-means
4700.00
12100.00
4600.00
12000.00
4500.00
11900.00
4400.00
11800.00
K-means
SK-means
Min Value
Notes: Dataset: hypercube. Lower is better.
Max Value
CK-means
K-means
SK-means
Min Value
Max Value
CK-means
60
R.M. Esteves et al.

WSSQ
7600.00
WSSQ box plot k=50
6000.00
7400.00
5800.00
7200.00
5600.00
7000.00
5400.00
6800.00
5200.00
WSSQ
Figure 2
6600.00
5000.00
6400.00
4800.00
6200.00
4600.00
6000.00
4400.00
K-means
SK-means
Min Value
2750.00
WSSQ box plot k=100
CK-means
K-means
Max Value
SK-means
Min Value
WSSQ box plot k=500
2050.00
2700.00
CK-means
Max Value
2000.00
2650.00
1950.00
2550.00
WSSQ
WSSQ
2600.00
2500.00
1900.00
1850.00
2450.00
1800.00
2400.00
2350.00
1750.00
K-means
SK-means
Min Value
CK-means
K-means
Max Value
SK-means
Min Value
CK-means
Max Value
Notes: Dataset: electrical. Lower is better.

Figures 3 and 4 show for the KDD99 and Google datasets
that CK-Means is more accurate than SK-Means and
K-Means in a consistent pattern for the four different values
of k. A lower box means better cluster quality, and a thinner
box means less variance. Our CK-Means produced lower
and thinner boxes than SK-Means and K-Means for all
tested situations.
These results show that our new CK-Means improves
the quality compared with the SK-Means, thus proving our
second hypothesis. In terms of execution time, CK-Means
consumes (1 1)% times more than the SK-Means. When
using one machine with six cores, our new approach
reduces serial K-Means++ execution time by (86.96
0.65)%.
5.3 Experiment C
Table 2 shows a comparison of running times between a
MapReduce distributed implementation and a nondistributed implementation of our MapReduce seeder
(Algorithm 5).
Table 2
comparison of Ck-means MapReduce seeder

execution times (seconds) for 15-node MapReduce
vs. single-thread
KDD99
Google
k = 50
k = 100
k = 50
k = 100
Single
node
x = 828;
= 13
x = 1,877;
= 18
x = 885;
= 15
x = 1,638;
= 21
15
nodes
x = 82;
=8
x = 142;
= 12
x = 81;
=9
x = 142;
= 10
Using MapReduce to extend the computation to a cluster of

15 nodes, we can speed up CK-Means to 10 1 times
versus one node with six cores. Compared with a single
node, MapReduce reduces time consistently for different
values of k and for both datasets with different
characteristics. Using Hadoop, we also found that our new
approach produced a speedup of 76 9 times versus the
serial K-Means++ seeder proposed by Arthur and
Vassilvitskii. Our results show that our new CK-Means
61
benefits from MapReduce to reduce the execution time, thus
proving that Hypothesis 4 is true.
5.4 Experiment D
Table 3 shows how the execution time of our MapReduce
implementation of CK-Means MapReduce seeder scales
with the dataset growth and with the number of clusters.
The execution time is more sensitive to increases in both n
and k than in an increase in d.
Table 3
Comparison of CK-Means MapReduce seeder

execution times (seconds) with varying dataset size
k = 50
k = 100
k = 500
k = 1,000
KDD99
82
142
577
1,107
KDD99d2
112
172
713
1,394
KDD99n2
148
254
1,137
2,196
Figure 3
We observed that when the dataset is doubled in size by

increasing n, or by increasing d, the execution time of CKMeans grows less than twofold. This means that the CKMeans becomes proportionally faster as we increase the size
of the dataset. The explanation for this is that our
MapReduce implementation of the seeder launches a fixed
number of mappers that is dependent on m and on k and
independent from n and from d. Increasing n or d and
maintaining the value m and k produce the same number of
mappers, but assigns each map with more data to process.
Thus all three variants of KDD99 spend the same amount of
execution time for setting the mappers and for
communication between the mappers and the reducers along
the network. Figure 5 shows that increasing k has almost a
linear effect on the execution time for the several different
variants of the KDD99 dataset. Our new algorithm scales
with the dimension of the dataset and thus the Hypothesis 5
is true.
6000.00
WSSQ box plot k=50
3500.00
WSSQ box plot k=100
3000.00
5000.00
2500.00
4000.00
WSSQ
WSSQ
2000.00
3000.00
1500.00
2000.00
1000.00
1000.00
500.00
0.00
0.00
K-means
SK-means
Min Value
1000.00
CK-means
K-means
Max Value
SK-means
Min Value
WSSQ box plot k=500
700.00
900.00
CK-means
Max Value
600.00
800.00
500.00
700.00
400.00
500.00
WSSQ
WSSQ
600.00
400.00
300.00
300.00
200.00
200.00
100.00
100.00
0.00
0.00
K-means
SK-means
Min Value
Notes: Dataset: KKD99. Lower is better.
Max Value
CK-means
K-means
SK-means
Min Value
Max Value
CK-means
Figure 4
R.M. Esteves et al.

WSSQ
4000.00
WSSQ box plot k=50
1600.00
3500.00
1400.00
3000.00
1200.00
2500.00
1000.00
2000.00
800.00
WSSQ
62
1500.00
600.00
1000.00
400.00
500.00
200.00
0.00
0.00
K-means
SK-means
Min Value
250.00
WSSQ box plot k=100
CK-means
K-means
SK-means
Min Value
Max Value
WSSQ box plot k=500
140.00
CK-means
Max Value
120.00
200.00
100.00
80.00
WSSQ
WSSQ
150.00
100.00
60.00
40.00
50.00
20.00
0.00
0.00
K-means
SK-means
Min Value
Max Value
CK-means
K-means
SK-means
Min Value
CK-means
Max Value
Notes: Dataset: Google. Lower is better.

Figure 5
Scaling of MapReduce CK-Means MapReduce seeder with KDD99s variants for different values of k
Conclusions
K-Means is a fast clustering method. However, initial

seeding heavily influences the cluster quality. In this paper,
we presented a new strategy to parallelise K-Means++ that
improves the speed and accuracy of cluster analysis for
large datasets. Our new CK-Means is highly scalable and
benefits from the use of Hadoop and MapReduce. We
observed that a Hadoop cluster of 15 machines running our
algorithm produced speedup of 76 9 times compared with
serial K-Means++ with improved accuracy. We found that
our new CK-Means consistently improves cluster analysis
accuracy compared with SK-Means. We proved that the
third hypothesis is true, since our new algorithm is only 1%
slower when compared with the SK-Means. We found that
MapReduce largely decreased the running time by a speed
up of 10 1 times compared with a non-distributed
implementation. We found that our new algorithm scales
with the dimension of the dataset. The running time is more
sensitive to variations in the number of data points and of
clusters than to variations in the number of dimensions.
With our findings, we have addressed the problem of
finding a good initial seeding in less time. Thus performing
accurate cluster analysis over large datasets can now be
done by using our new CK-Means approach.
References
Ackermann, M., Lammersen, C., Mrtens, M., Raupach, C.,
Sohler, C. and Swierkot, K. (2010) StreamKM++:
A Clustering Algorithm for Data Streams, pp.17387,
ALENEX.
Ailon, N., Jaiswal, R. and Monteleoni, C. (2009) Streaming
K-Means approximation [online]
http://scholar.google.com.au/scholar.bib?q=info:eeMPmjm4T
NsJ:scholar.google.com/&output=citation&hl=en&as_sdt=20
00&ct=citation&cd=0.
Al-Daoud, M.B. (2007) A new algorithm for cluster
initialization, World Academy of Science, Engineering and
Technology, Vol. 1, No. 4, pp.568570.
Arthur, D. and Vassilvitskii, S. (2007) K-Means++:
the advantages of careful seeding, SODA 07 Proceedings of
the Eighteenth Annual ACM-SIAM Symposium on Discrete
Algorithms [online] http://ilpubs.stanford.edu:8090/778/.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and
Vassilvitskii, S. (2012) Scalable K-Means++, Proc. VLDB
Endow, Vol. 5, No. 7, pp.62233.
Crainic, T.G. and Toulouse, M. (2010) Handbook of
Metaheuristics, Vol. 146, pp.497541; International Series in
Operations Research & Management Science, Springer, USA
[online] http://dx.doi.org/10.1007/978-1-4419-1665-5_17.
Davidson, I, and Satyanarayana, A. (2003) Speeding up K-Means
clustering by bootstrap averaging, in IEEE Data Mining
Workshop on Clustering Large Datasets.
Dean, J. and Ghemawat, S. (2008) MapReduce: simplified data
processing on large clusters, Commun. ACM, Vol. 51, No. 1,
pp.107113, doi:10.1145/1327452.1327492.
63
Eiben, A., Michalewicz, Z., Schoenauer, M. and Smith, J. (2007)

Parameter setting in evolutionary algorithms, Studies in
Computational Intelligence, Vol. 54, pp.1946, Springer
Berlin/Heidelberg [online] http://dx.doi.org/10.1007/978-3540-69432-8_2.
Ekanayake, J., Pallickara, S. and Fox, G. (2008) MapReduce for
data intensive scientific analyses, in eScience 08, IEEE
Fourth International Conference on, pp.277284.
El Agha, M. and Ashour, W.M. (2012) Efficient and fast
initialization algorithm for K-Means clustering, International
Journal of Intelligent Systems and Applications (IJISA),
Vol. 4, No. 1, p.21.
Esteves, R.M. and Rong, C. (2011) Using mahout for clustering
Wikipedias latest articles: a comparison between
K-Means and fuzzy C-Means in the cloud, Cloud Computing
Technology and Science (CloudCom), 2011 IEEE Third
International Conference on, pp.565569, 29 November to
1 December, doi: 10.1109/CloudCom.2011.86.
Esteves, R.M., Hacker, T. and Rong, C. (2012) Cluster analysis
for the cloud: parallel competitive fitness and parallel
K-Means++ for large dataset analysis, in Proceedings of the
2012 4th IEEE International Conference on Cloud
Computing Technology and Science, IEEE Computer Society,
Taipei, Taiwan.
Frank, A. and Asuncion, A. (2010) UCI Machine Learning
Repository, University of California, Irvine, School of
Information and Computer Sciences [online]
http://archive.ics.uci.edu/ml.
Gursoy, A. (2004) Data decomposition for parallel K-Means
clustering, Lecture Notes in Computer Science, Vol. 3019,
pp.241248, Springer Berlin/Heidelberg [online]
http://dx.doi.org/10.1007/978-3-540-24669-5_31.
Hacker, T.J., Romero, F. and Carothers, C.D. (2009) An analysis
of clustered failures on large supercomputing systems,
J. Parallel Distrib. Comput., Vol. 69, No. 7, pp.652665.
Hartigan, J.A. and Wong, M.A. (1979) Algorithm AS 136:
a K-Means clustering algorithm, Journal of the Royal
Statistical Society, Series C (Applied Statistics), Vol. 28,
No. 1, pp.100108.
Holmes, A. (2012) Hadoop in Practice, Manning Publications Co.,
New York.
Jin, R., Goswami, A. and Agrawal, G. (2006) Fast and exact
out-of-core and distributed K-Means clustering, Knowledge
and Information Systems, Vol. 10, No. 1, pp.1740,
doi:10.1007/s10115-005-0210-0.
Khan, S.S. and Ahmad, A. (2004) Cluster center initialization
algorithm for K-Means clustering, Pattern Recognition
Letters, Vol. 25, No. 11, pp.2931302,
doi:10.1016/j.patrec.2004.04.007.
Knysh, D. and Kureichik, V. (2010) Parallel genetic algorithms:
a survey and problem state of the art, Journal of Computer
and Systems Sciences International, Vol. 49, No. 4,
pp.579589, doi:10.1134/S1064230710040088.
Kumar, J., Mills, R.T., Hoffman, F.M. and Hargrove, W.W. (2011)
Parallel K-Means clustering for quantitative ecoregion
delineation using large data sets, Procedia Computer
Science, Vol. 4, pp.16021611.
Meil, M. and Heckerman, D. (1998) An experimental
comparison of several clustering and initialization methods,
in Proceedings of the Fourteenth Conference on Uncertainty
in Artificial Intelligence, pp.386395, Morgan Kaufmann
Publishers Inc., Madison, Wisconsin.
64
R.M. Esteves et al.
Niknam, T., Fard, E.T., Pourjafarian, N. and Rousta, A. (2011)

An efficient hybrid algorithm based on modified imperialist
competitive algorithm and K-Means for data clustering,
Engineering Applications of Artificial Intelligence, Vol. 24,
No. 2, pp.306317, doi:10.1016/j.engappai.2010.10.001.
Ostrovsky, R. and Rabani, Y. (2006) The effectiveness of lloydtype methods for the k-means problem, in 47th IEEE
Symposium on the Foundations of Computer Science (FOCS),
pp.165176.
Pavan, KK., Rao, A.A., Rao, A.V.D. and Sridhar, G.R. (2010)
Single pass seed selection algorithm for K-Means, Journal
of Computer Science, Vol. 6, No. 1, pp.6066,
doi:10.3844/jcssp.2010.60.66.
Protocol Buffers Google Developers (2012) [online]
https://developers.google.com/protocol-buffers/
(accessed 16 November).
Redmond, S.J. and Heneghan, C. (2007) A method for initialising
the K-Means clustering algorithm using Kd-trees, Pattern
Recognition Letters, Vol. 28, No. 8, pp.965973,
doi:10.1016/j.patrec.2007.01.001.
Rousseeuw, P.J. (1987) Silhouettes: a graphical aid to the
interpretation and validation of cluster analysis, Journal of
Computational and Applied Mathematics, November,
Vol. 20, pp.5365, doi:10.1016/0377-0427(87)90125-7.
Shvachko, K., Kuang, H., Radia, S. and Chansler, R. (2010) The
Hadoop distributed file system, in Proceedings of the IEEE
26th Symposium on Mass Storage Systems and Technologies
(MSST), pp.110, IEEE Computer Society.
Steinbach, M., Karypis, G. and Kumar, V. (2000) A comparison
of document clustering techniques, in KDD Workshop on
Text Mining.
Stoffel, K. and Belkoniene, A. (1999) Parallel K/h-means

clustering for large data sets, in Euro-Par99 Parallel
Processing, , Lecture Notes in Computer Science,
Vol. 1685, pp.1451154, Springer, Berlin/Heidelberg [online]
http://dx.doi.org/10.1007/3-540-48311-X_205.
The Comprehensive R Archive Network (2012) [online]
http://cran.r-project.org/
(accessed May 10).
TraceVersion2 Google Cluster Data Second Format of ClusterUsage Traces Traces of Google Workloads Google
Project Hosting (2012) [online]
http://code.google.com/p/googleclusterdata/wiki/TraceVersio
n2 (accessed 20 November).
Wasif, M.K. and Narayanan, P.J. (2011) Scalable clustering using
multiple GPUs, in High Performance Computing (HiPC),
2011 18th International Conference on, pp.110,
doi:10.1109/HiPC.2011.6152713.
White, T. (2010) Hadoop: The Definitive Guide, 2nd ed., OReilly
Media/Yahoo Press.
Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R. and Stoica, I.
(2008) Improving MapReduce performance in heterogeneous
environments, in Proceedings of the 8th USENIX Conference
on Operating Systems Design and Implementation, pp.2942,
USENIX Association, San Diego, California.
Zhang, Y., Xiong, Z., Mao, J. and Ou, L. (2006) The study of
parallel K-Means algorithm, in Intelligent Control and
Automation, WCICA, The Sixth World Congress on, Vol. 2,
pp.58685871.
Zhao, W., Ma, H. and He, Q. (2009) Parallel K-Means clustering
based on MapReduce, Lecture Notes in Computer Science,
Vol. 5931, pp.674679, Springer, Berlin/Heidelberg [online]
http://dx.doi.org/10.1007/978-3-642-10665-1_71.

Accurate Distributed Cluster Analysis For Big Data Competitive K-Means

Transféré par

Informations du document

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Accurate Distributed Cluster Analysis For Big Data Competitive K-Means

Transféré par

Droits d'auteur :

Formats disponibles

50

Int. J. Big Data Intelligence, Vol. 1, Nos. 1/2, 2014

A new approach for accurate distributed cluster

Copyright 2014 Inderscience Enterprises Ltd.

The volume of data generated daily is growing at an

a large number of clusters

a large number of data points

storage across several systems. K-Means++

However, we observed two major problems when we

R.M. Esteves et al.

Theory and background

2.2 The problem of selecting initial centroids for

computing systems coupled with high performance parallel

points are randomly biased to generate the actual initial

2.3 Related work in centroid initialisation methods

R.M. Esteves et al.

iterations, O(k log n) points are candidates to form a

2.4 Using MapReduce/Hadoop for distributed

2.5 Hadoop and R for distributed cluster analysis

R + streaming by using Hadoop streaming

RHadoop an R package integrated with the R

RHIPE similar to RHadoop that is integrated with the

In this section, we describe the existing serial K-Means++

3.1 Serial K-Means++

Input: A set of data points X; the number of clusters k and the

Partition X into X1, X2, ..., Xm

For each i {1, 2, , m} do

Run A on Xi to get 3k log(k) centroids

Input: A set of data points X and the number of desired

Output: X points grouped into k clusters and respective final

Run A on Sw to get k centroids C

Run K-Means on X using the set of initial centroids FC

IC a single data point uniformly sampled at random

For each data point dp X, compute D(dp, ic),

Sample dp X with probability

K-Means on X using the set of initial centroids IC

K-Means++ is a serial algorithm that repeats steps 2 to 5 k

3.2 Streaming K-Means

Source: Ailon et al. (2009)

3.3 Our new CK-Means

R.M. Esteves et al.

value of E[f(CLX)], where CLX is the resulting output of

Our new CK-Means

Input: A set of data points X shuffled into a random order; the

Partition X into x1, x2, ..., xm

For each i {1, 2, , m} do

Run K-Means++ on xi to get k centroids ICi and

C ICi, where i Best fit(Si)

Run K-Means on X with C as initial centroids to obtain

3.4 CK-Means: a MapReduce implementation

Our approach to write the dataset into HDFS

Input: A set of data points X shuffled into a random order and

Partition X into x1, x2, ..., xm

Assign to each xkey a unique key {1, 2, , m}

Write the pairs (key, xkey) in the HDFS

Ideally, the input parameter m should match the maximum

selected by considering that each partition has sufficient

Our MapReduce seeder

Input: A path to the stored data points in HDFS; the number of

Map every (key, xkey) pair

Run K-Means++ on xkey to get k centroids ICkey and k

Emit to the reducer the (key, (Skey , ICkey )) pair where

Reduce {(S1, IC1), (S2, IC2), , (Sm, ICm)}