
Implementation of K-means algorithm using different Map-Reduce Paradigms
Prajakta Purohit (pppurohi@indiana.edu), Swathi Gurram (swgurram@indiana.edu)

Introduction
We have attempted a performance comparison of the K-means algorithm using different map-reduce programming paradigms, namely Twister and Hadoop. K-means is a pleasingly parallel algorithm that fits very naturally into the iterative map-reduce model. Our project involved understanding the algorithm and implementing it in both the Twister (iterative map-reduce) and Hadoop (map-reduce) frameworks.

K-means algorithm
Clustering is a central task in exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis involves partitioning a set of objects into clusters such that objects in the same cluster are more similar to each other than to objects in other clusters.
K-means clustering is a well-known clustering algorithm that aims to cluster a set of data points into a predefined number of clusters. A simple definition of the K-means algorithm: it partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean. The final result is obtained over a number of iterations, and its accuracy depends on the selection of the initial means of the clusters. The initial means (seeds) are found using the k-means++ algorithm.

Working of K-means
The algorithm can be implemented in map-reduce style as follows. Each map function gets a portion of the data, and it needs to access this data split in each iteration. These data items do not change over the iterations, so the split is loaded once for the entire set of iterations. The variable data is the set of current cluster centers calculated during the previous iteration, which is used as the input value for the map function.

All the mappers get the same input at each iteration, and each mapper computes partial cluster centers by going through its own input data split. A reducer function accumulates all the partial cluster centers and calculates the mean of each cluster to produce the new cluster centers for the next step. Depending on the difference between the previous and current cluster centers, a decision is made on whether another iteration of the algorithm needs to be executed.
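To make the map and reduce steps concrete, the following is a minimal Python sketch of one such iteration. It is a framework-free illustration, not our actual framework code; the names kmeans_map and kmeans_reduce are ours, not Twister or Hadoop API calls.

    def kmeans_map(points_split, centers):
        """Mapper: for its data split, emit per-center partial sums and counts."""
        dim = len(centers[0])
        partial = [([0.0] * dim, 0) for _ in centers]
        for p in points_split:
            # Assign the point to the nearest current center.
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            s, n = partial[nearest]
            partial[nearest] = ([a + b for a, b in zip(s, p)], n + 1)
        return partial

    def kmeans_reduce(all_partials, old_centers):
        """Reducer: accumulate all partial sums and take the mean of each cluster."""
        new_centers = []
        for i, old in enumerate(old_centers):
            total, count = [0.0] * len(old), 0
            for partial in all_partials:
                s, n = partial[i]
                total = [a + b for a, b in zip(total, s)]
                count += n
            # Keep the old center if no point was assigned to this cluster.
            new_centers.append([x / count for x in total] if count else old)
        return new_centers

One full iteration is then kmeans_reduce([kmeans_map(s, centers) for s in splits], centers).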
Map Reduce Frameworks
Twister is an iterative map-reduce framework. It is best suited for algorithms like K-means, and it gave us the best runtimes for our K-means implementation. It does not have its own file system like Hadoop, but uses the local filesystem. These are the salient features of Twister:
- Configurable long-running (cacheable) map/reduce tasks
- Pub/sub messaging based communication/data transfers
- Efficient support for iterative MapReduce computation
- Combine phase to collect all reduce outputs
- Data access via local disks

Hadoop is a software framework that supports data-intensive distributed applications and is suitable for simple map-reduce applications. It is, however, not well suited to iterative map-reduce computations like K-means. Hadoop has the following features:
- Uses the map-reduce programming model
- Has its own filesystem (HDFS, the Hadoop Distributed File System, based on the Google File System), which is specifically tailored for dealing with large files
- Can intelligently manage the distribution of processing and of files, breaking those files down into more manageable chunks for processing

HaLoop is a modified version of the Hadoop map-reduce framework; we came across this prototype during our project. Major features of HaLoop include:
- caching options for loop-invariant data access
- reuse of major building blocks from Hadoop implementations
- intra-job fault-tolerance mechanisms similar to Hadoop's
HaLoop reduces query runtimes by a factor of 1.85 compared with Hadoop [7].

Validation
We validated our results by calculating the sum of the distances from each point to the cluster center it belongs to. Since our aim is to find the best cluster centers for the given data, each different set of initial centroids gives a new answer. The best of these results is the one for which the sum of distances from each point to its own cluster center is minimum.
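A minimal sketch of this metric (our illustration, using Euclidean distance):

    import math

    def clustering_cost(points, centers):
        """Sum of distances from each point to its nearest (i.e., its own) center."""
        return sum(min(math.dist(p, c) for c in centers) for p in points)

The run whose initial centroids yield the smallest clustering_cost is taken as the best result.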
Implementation
We have implemented the K-means algorithm in both Twister and Hadoop. The basic idea of the mapper and reducer remains the same in both frameworks. In the Twister implementation, iteration control is simple: the user program can directly access the Twister output, compare it to the previous input, and based on a threshold decide whether to run another iteration of the clustering algorithm.

1. Initial centroids are read from a text file.
2. The driver loads the initial centroids and passes them to the Twister framework as a parameter.
3. Twister mapper: 1) loads its partition of the data; 2) forms clusters corresponding to each centroid; 3) sends the accumulated clusters to the reducer.
4. Twister reducer: 1) collects the output from all mappers; 2) calculates a new centroid for each new cluster; 3) sends the output to the driver.
5. Driver: gets the centroids from Twister, checks the difference between the previous and current centroids, and, if the difference is greater than the threshold, repeats the process.

Figure 1: K-means implementation in Twister
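The driver's convergence test in step 5 can be sketched as follows. This is an illustration, not Twister's API: run_iteration stands in for one configured map-reduce pass, and the threshold value is hypothetical.

    import math

    THRESHOLD = 1e-3  # hypothetical convergence threshold

    def drive(centers, run_iteration):
        """Repeat map-reduce passes until no center moves more than THRESHOLD."""
        while True:
            new_centers = run_iteration(centers)
            # Largest movement of any center between successive iterations.
            diff = max(math.dist(c, n) for c, n in zip(centers, new_centers))
            centers = new_centers
            if diff <= THRESHOLD:
                return centers

Because Twister keeps its long-running map tasks (and their cached data splits) alive across iterations, a loop of this shape is essentially all the driver needs.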
In Hadoop, there is no easy way to get the output from one map-reduce pass and feed it as input to the next. The output of a map-reduce job can only be written to a file in HDFS. The user program has to manually move this file from HDFS to the local file system, read the results from the local file, and pass them as input to the next iteration.

1. Initial centroids are read from a text file.
2. The driver loads the initial centroids to a temporary HDFS file and initiates the Hadoop job.
3. Hadoop mapper: 1) for each point, decides which centroid is closest; 2) sends the result to the reducer.
4. Hadoop reducer: 1) collects the output from all mappers; 2) calculates a new centroid for each new cluster; 3) sends the output to the driver.
5. Driver: gets the centroids from Hadoop and checks the difference between the previous and current centroids. If the difference is greater than the threshold: 1) copies the reducer output from HDFS to the local file system; 2) deletes the previous initial centroids and loads the new centroids into that file in the local directory; 3) deletes the HDFS output directory; then the process repeats. Otherwise, the final results are obtained.

Figure 2: K-means implementation in Hadoop
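The per-iteration bookkeeping in step 5 amounts to shelling out to HDFS commands between jobs. The following is a hypothetical sketch: the jar name, paths, and part-file name are placeholders, while hadoop jar, hadoop fs -get, and hadoop fs -rm -r are standard Hadoop shell commands.

    import os
    import subprocess

    def hadoop_iteration(centroid_file, hdfs_out):
        # Run one map-reduce pass (job details elided; kmeans.jar is a placeholder).
        subprocess.run(["hadoop", "jar", "kmeans.jar", centroid_file, hdfs_out],
                       check=True)
        # Delete the previous centroids, then copy the reducer output from HDFS
        # to the local file system as the input for the next iteration.
        os.remove(centroid_file)
        subprocess.run(["hadoop", "fs", "-get", hdfs_out + "/part-r-00000",
                        centroid_file], check=True)
        # Delete the HDFS output directory so the next job can recreate it.
        subprocess.run(["hadoop", "fs", "-rm", "-r", hdfs_out], check=True)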
Runtime Comparison
We have noticed a large difference in the time taken by the K-means implementations in the Twister and Hadoop frameworks. Twister is very fast when compared to Hadoop. We took data sets ranging from 20,000 points to 80,000 points. Our initial centroid sets were generated programmatically, and we used these centroid sets as inputs for 10 different executions. Twister's execution times were about 1 to 1.5 seconds. However, a single iteration of K-means on Hadoop by itself took about 40 seconds, and the iterative execution of K-means took more than 650 seconds.

The figure below shows the comparison of execution times. The X-axis simply denotes the different input sets of initial centroids. This graph is plotted for a data set of 20,000 points. When we attempted to execute the K-means Hadoop implementation on a larger data set, we often found that Hadoop crashed while trying to execute it. The K-means Twister implementation gave very good results even with 80,000 points, with execution times in the range of 1.5 to 2 seconds.

Figure 3: Twister-Hadoop comparison. Execution time in seconds for each of the ten initial centroid sets (20,000-point data set):

Centroid set    1       2       3       4       5       6       7       8       9       10
Hadoop          603     630     886     642     646     942     483     690     671     713
Twister         1.1542  1.1263  1.1264  1.1097  1.1137  1.1262  1.0926  1.1102  1.1034  1.1159

Conclusion
Through our project we have understood the difference between iterative map-reduce and simple map-reduce frameworks. We are able to conclude that for iterative algorithms Twister is more suitable than Hadoop, which is very evident from the execution times of K-means on Twister and Hadoop.

References
[1] http://salsahpc.indiana.edu
[2] http://www.iterativemapreduce.org/samples.html
[3] http://hadoop.apache.org/
[4] http://en.wikipedia.org/wiki/Apache_Hadoop
[5] http://clue.cs.washington.edu/node/14
[6] http://code.google.com/p/haloop/
[7] http://www.cs.washington.edu/homes/billhowe/pubs/HaLoop.pdf
