Contents
1. Abstract
2. Keywords
3. Introduction
4. Clustering
5. Partitional Algorithms
6. K-medoid Algorithms
6.1 PAM
6.2 CLARA
6.3 CLARANS
7. Analysis
8. Conclusion
9. References
PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING
1. ABSTRACT

In the last few years there has been tremendous research interest in devising efficient data mining algorithms, and clustering is an essential component of data mining techniques. Interestingly, the special nature of data mining makes the classical clustering algorithms unsuitable: the datasets are usually very large, the data need not be numeric, and hence importance should be given to efficient input and output operations rather than to algorithmic complexity. As a result, in the last few years a number of clustering algorithms have been proposed for data mining. The present paper gives a brief overview of the partitional clustering algorithms used in data mining. The first part of the paper discusses the clustering techniques used in data mining; the second part discusses the different partitional clustering algorithms used in mining of data.

2. KEYWORDS:

Knowledge discovery in databases, data mining, clustering, partitional algorithms, PAM, CLARA, CLARANS.

3. INTRODUCTION

Knowledge discovery in databases is a well-defined process, consisting of several distinct steps. Data mining is the core step in the process, the one which results in the discovery of knowledge; it is a high-level application technique used to present and analyze data for decision-makers. There is an enormous wealth of information embedded in the huge databases belonging to enterprises, and this has spurred tremendous interest in the areas of knowledge discovery and data mining. The fundamental goals of data mining are prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest, while description focuses on finding patterns describing the data and on their subsequent presentation for user interpretation. There are several mining techniques for prediction and description; these are categorized as association, classification, sequential patterns and clustering. The basic premise of association is to find all associations such that the presence of one set of items in a transaction implies the presence of other items. Classification develops profiles of different groups. Sequential-pattern mining identifies sequential patterns subject to a user-specified minimum constraint. Clustering segments a database into subsets, or clusters.
4. CLUSTERING
There are two main types of clustering techniques: partitional clustering techniques and hierarchical clustering techniques. Partitional clustering techniques construct a partition of the database into a predefined number of clusters. Hierarchical clustering techniques produce a sequence of partitions in which each partition is nested into the next partition in the sequence.

(Figure: Datasets after clustering)

5. PARTITIONAL ALGORITHMS

Partitional algorithms construct a partition of a database of n objects into a set of k clusters. The construction involves determining the optimal partition with respect to an objective function. There are approximately kⁿ/k! ways of partitioning a set of n data points into k subsets. An exhaustive enumeration method could therefore find the global optimal partition, but it is practically infeasible except when n and k are very small. A partitional clustering algorithm usually adopts an iterative optimization paradigm: it starts with an initial partition and uses an iterative control strategy, trying swaps of data points to see whether such a swap improves the quality of the clustering. When no swap yields an improvement, the algorithm has found a locally optimal partition. The quality of this clustering is very sensitive to the initially selected partition. There are mainly two different categories of partitioning algorithms.

6. K-MEDOID ALGORITHMS

6.1 PAM

PAM uses a k-medoid method to identify the clusters. PAM selects k objects arbitrarily from the data as medoids. In each step, a swap between a selected object Oi and a non-selected object Oh is made as long as such a swap would result in an improvement of the quality of the clustering. To calculate the effect of such a swap between Oi and Oh, a cost Cih is computed, which is related to the quality of partitioning the non-selected objects into the k clusters represented by the medoids. At this stage it is therefore necessary first to understand the method of partitioning the data objects when a set of k medoids is given.
Partitioning

Given a set of k medoids, we say that a non-selected object Oj belongs to the cluster represented by the medoid Oi if d(Oj,Oi) = Min_Oe d(Oj,Oe), where the minimum is taken over all medoids Oe and d(Oa,Ob) denotes the distance or dissimilarity between objects Oa and Ob. The dissimilarity matrix is known prior to the commencement of PAM. The quality of the clustering is measured by the average dissimilarity between an object and the medoid of the cluster to which the object belongs.

When a swap between a medoid Oi and a non-selected object Oh is considered, there are three types of changes that can occur for a non-selected object Oj. Define a cost Cjih for each:

• A non-selected object Oj Є Ci moves to Cj΄, j΄ ≠ h: before the swap, Min_Oe d(Oj,Oe) = d(Oj,Oi), and after it, Min_Oe d(Oj,Oe) = d(Oj,Oj΄). Define the cost as Cjih = d(Oj,Oj΄) − d(Oj,Oi).

• A non-selected object Oj Є Cj΄, j΄ ≠ i, remains in Cj΄: its nearest medoid is unchanged, so Cjih = 0.

• A non-selected object Oj Є Cj΄ moves to Ch: before the swap, Min_Oe d(Oj,Oe) = d(Oj,Oj΄), and after it, Min_Oe d(Oj,Oe) = d(Oj,Oh). Define the cost as Cjih = d(Oj,Oh) − d(Oj,Oj΄).

Define the total cost of swapping Oi and Oh as Cih = ∑j Cjih, computed over all non-selected objects Oj; this is done for all non-selected objects Oh. If Cih is negative, then the quality of the clustering is improved by making Oh a medoid in place of Oi. The process is repeated until no negative Cih can be found.
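The swap test above can be sketched as a short program. The following is a minimal, illustrative PAM-style implementation, not the paper's own code: the Euclidean distance and the names `dist`, `total_cost` and `pam` are assumptions, and the swap cost Cih is evaluated here as a difference of total clustering costs rather than through the three per-object cases.

```python
import random
from itertools import product

def dist(a, b):
    # Assumed dissimilarity: Euclidean distance between points.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(data, medoids):
    # Quality of clustering: total dissimilarity of each object to the
    # medoid of the cluster it belongs to (its nearest medoid).
    return sum(min(dist(o, m) for m in medoids) for o in data)

def pam(data, k, seed=0):
    random.seed(seed)
    medoids = random.sample(data, k)   # k arbitrary objects as medoids
    improved = True
    while improved:
        improved = False
        # Consider swapping each medoid Oi with each non-selected object Oh.
        for oi, oh in product(list(medoids), data):
            if oh in medoids:
                continue
            candidate = [oh if m == oi else m for m in medoids]
            # Cih is the change in total cost; a negative Cih means the
            # swap improves the clustering, so it is accepted.
            if total_cost(data, candidate) < total_cost(data, medoids):
                medoids = candidate
                improved = True
    return medoids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids = pam(points, 2)
```

On this toy dataset the loop keeps accepting negative-cost swaps until the medoids settle on one representative per cluster, which is exactly the "repeat until no negative Cih" stopping rule described above.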
6.2 CLARA

It can be observed that the major computational effort of PAM is to determine the k medoids through an iterative optimization. CLARA, though it follows the same principle, attempts to reduce the computational effort by relying on sampling to handle large datasets. Instead of finding representative objects for the entire dataset, CLARA draws a sample of the dataset, applies PAM on this sample and finds the medoids of the sample. If the sample is drawn in a sufficiently random way, the medoids of the sample approximate the medoids of the entire dataset. The steps of CLARA are summarized below:

ALGORITHM
• Input: Database of D objects
• Repeat until
1. Draw a sample S ⊂ D randomly

6.3 CLARANS

CLARA reduces the computational effort by restricting its search to a smaller sample of the database: if the sample size is s ≤ N, it examines at most k(s − k) pairs at every iteration. CLARANS does not restrict the search to any particular subset of objects, but neither does it search the entire dataset: it randomly selects a few pairs for swapping at the current state. CLARANS, like PAM, starts with a randomly selected set of k medoids. It checks at most maxneighbour pairs for swapping, and if a pair with negative cost is found, it updates the medoid set and continues. Otherwise, it records the current selection of medoids as a local optimum and restarts with a new randomly selected medoid set to search for another local optimum. CLARANS stops after the numlocal number of locally optimal medoid sets have been determined, and returns the best among these.

ALGORITHM
• Input(D, k, maxneighbour, numlocal)
Increment j ← j + 1
End do
Compare the cost of clustering with mincost
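The CLARANS loop above (random restarts; at most maxneighbour swap checks per step; the best of numlocal local optima tracked in mincost) can be sketched as follows. This is an illustrative rendering under assumed choices, not the original pseudocode: the Euclidean dissimilarity and the names `dist`, `cost` and `clarans` are assumptions.

```python
import random

def dist(a, b):
    # Assumed dissimilarity: Euclidean distance between points.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cost(data, medoids):
    # Quality of a medoid set: total dissimilarity of each object
    # to its nearest medoid.
    return sum(min(dist(o, m) for m in medoids) for o in data)

def clarans(data, k, maxneighbour, numlocal, seed=0):
    rng = random.Random(seed)
    mincost, best = float("inf"), None
    for _ in range(numlocal):          # search numlocal local optima
        current = rng.sample(data, k)  # random initial medoid set
        while True:
            for _ in range(maxneighbour):
                # Pick a random (medoid, non-medoid) pair to try swapping.
                oi = rng.choice(current)
                oh = rng.choice([o for o in data if o not in current])
                candidate = [oh if m == oi else m for m in current]
                if cost(data, candidate) < cost(data, current):
                    current = candidate   # negative swap cost: accept it
                    break
            else:
                # maxneighbour checks found no improvement: local optimum.
                break
        # Compare the cost of clustering with mincost; keep the best.
        if cost(data, current) < mincost:
            mincost, best = cost(data, current), current
    return best, mincost

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
best, c = clarans(points, 2, maxneighbour=20, numlocal=3)
```

The inner for/else gives up on the current medoid set once maxneighbour consecutive random swap checks fail to find a negative-cost swap, which mirrors the "record a local optimum and restart" behaviour described above.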
7. ANALYSIS
8. CONCLUSION