
A TECHNICAL PAPER ON

DATA MINING AND DATA WAREHOUSING, WITH SPECIAL REFERENCE TO
PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

Gudlavalleru Engineering College


by

K. PRADEEP KUMAR
III/IV B.Tech CSE
email: kothapradeep550@gmail.com
Phone: 08674-240673

I. RAHUL
III/IV B.Tech CSE
email: rahulintety@yahoo.co.in
Phone: 08674-247222

Contents

1. Abstract

2. Keywords

3. Introduction

4. Clustering

5. Partitional Algorithms

6. k-Medoid Algorithms

6.1 PAM

6.2 CLARA

6.3 CLARANS

7. Analysis

8. Conclusion

9. References

PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

1. ABSTRACT

In the last few years there has been tremendous research interest in devising efficient data mining algorithms, and clustering is an essential component of data mining techniques. Interestingly, the special nature of data mining makes the classical clustering algorithms unsuitable: the datasets involved are usually very large, the data need not be numeric, and importance should therefore be given to efficient input and output operations rather than to algorithmic complexity alone. As a result, a number of clustering algorithms have been proposed for data mining in recent years. The present paper gives a brief overview of the partitional clustering algorithms used in data mining. The first part of the paper discusses the clustering technique as used in data mining; the second part discusses the different partitional clustering algorithms used in the mining of data.

2. KEYWORDS:

Knowledge discovery in databases, data mining, clustering, partitional algorithms, PAM, CLARA, CLARANS.

3. INTRODUCTION:

Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Knowledge discovery in databases (KDD) is a well-defined process consisting of several distinct steps, with data mining as the core step that results in the discovery of knowledge. Data mining is a high-level application technique used to present and analyze data for decision-makers. There is an enormous wealth of information embedded in the huge databases belonging to enterprises, and this has spurred tremendous interest in the areas of knowledge discovery and data mining.

The fundamental goals of data mining are prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest, while description focuses on finding patterns describing the data and on the subsequent presentation of those patterns for user interpretation. There are several mining techniques for prediction and description; these are categorized as association, classification, sequential patterns, and clustering. The basic premise of association is to find all associations such that the presence of one set of items in a transaction implies the presence of other items. Classification develops profiles of different groups. Sequential-pattern mining identifies sequential patterns subject to a user-specified minimum support constraint. Clustering segments a database into subsets, or clusters.

4. Clustering

Clustering is a useful technique for the discovery of the data distribution and of patterns in the underlying data. The goal of clustering is to discover dense and sparse regions in a data set. Data clustering has been studied in the statistics, machine learning, and database communities with diverse emphases.
There are two main types of clustering techniques: partitional clustering techniques and hierarchical clustering techniques. Partitional clustering techniques construct a partition of the database into a predefined number of clusters. Hierarchical clustering techniques produce a sequence of partitions in which each partition is nested into the next partition in the sequence.

[Figure: Datasets before clustering]

[Figure: Datasets after clustering]

5. PARTITIONAL ALGORITHMS

Partitional algorithms construct a partition of a database of n objects into a set of k clusters. The construction involves determining the optimal partition with respect to an objective function. There are approximately kⁿ/k! ways of partitioning a set of n data points into k subsets. An exhaustive enumeration method could find the globally optimal partition, but it is practically infeasible except when n and k are very small.

The partitional clustering algorithm therefore usually adopts an iterative optimization paradigm. It starts with an initial partition and uses an iterative control strategy: it tries swapping data points to see whether such a swap improves the quality of the clustering. When no swap yields an improvement, it has found a locally optimal partition. The quality of this clustering is very sensitive to the initially selected partition. There are mainly two different categories of partitioning algorithms:

• k-means algorithms, where each cluster is represented by the center of gravity of the cluster.

• k-medoid algorithms, where each cluster is represented by one of the objects of the cluster located near its center.

Most of the special clustering algorithms designed for data mining are k-medoid algorithms. The main k-medoid algorithms are PAM, CLARA, and CLARANS.
6. k-Medoid Algorithms

6.1 PAM

PAM (Partitioning Around Medoids) uses the k-medoid method to identify clusters. PAM selects k objects arbitrarily from the data as medoids. In each step, a swap between a selected object Oi and a non-selected object Oh is made as long as such a swap would result in an improvement of the quality of the clustering. To calculate the effect of such a swap between Oi and Oh, a cost Cih is computed, which is related to the quality of partitioning the non-selected objects into the k clusters represented by the medoids. At this stage it is therefore necessary first to understand how the data objects are partitioned when a set of k medoids is given.

Partitioning

If Oj is a non-selected object and Oi is a medoid, we say that Oj belongs to the cluster represented by Oi if d(Oi,Oj) = Min(e) d(Oj,Oe), where the minimum is taken over all medoids Oe, and d(Oa,Ob) denotes the distance or dissimilarity between objects Oa and Ob. The dissimilarity matrix is known prior to the commencement of PAM. The quality of the clustering is measured by the average dissimilarity between an object and the medoid of the cluster to which the object belongs.
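As a minimal Python sketch (not part of the original paper) of this partitioning rule and quality measure, assuming a precomputed dissimilarity matrix d, where d[a][b] gives the dissimilarity between objects a and b indexed 0 to n-1:

```python
# Sketch of the partitioning step, given a precomputed dissimilarity
# matrix d (a list of lists with d[a][b] = d(Oa, Ob)).

def assign_to_medoids(d, medoids):
    """Assign every object to its nearest medoid: Oj belongs to the
    cluster of Oi iff d(Oj, Oi) = min over all medoids Oe of d(Oj, Oe)."""
    return {j: min(medoids, key=lambda m: d[j][m]) for j in range(len(d))}

def clustering_quality(d, medoids):
    """Quality = average dissimilarity between an object and the medoid
    of the cluster to which the object belongs (lower is better)."""
    assignment = assign_to_medoids(d, medoids)
    return sum(d[j][m] for j, m in assignment.items()) / len(d)
```

For example, with medoids = [0, 3], every object is mapped to whichever of O0 and O3 is nearer under d.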
Iterative Selection of Medoids

Let us assume that O1, O2, …, Ok are the k medoids selected at any stage, and denote by C1, C2, …, Ck the respective clusters. From the foregoing discussion, for a non-selected object Oj, if Oj Є Ci then Min(1≤e≤k) d(Oj,Oe) = d(Oj,Oi).

Let us now analyze the effect of swapping Oi and Oh. In other words, let us compare the quality of the clustering if we select the k medoids as O1, O2, …, Oi-1, Oh, Oi+1, …, Ok, where Oh replaces Oi as one of the medoids. Due to the change in the set of medoids, three types of changes can occur in the actual clustering:

• A non-selected object Oj such that Oj Є Ci before the swap and Oj Є Ch after the swap. This case arises when Min(e) d(Oj,Oe) = d(Oj,Oi) before the swap and Min(e≠i) d(Oj,Oe) = d(Oj,Oh) after the swap. Define the cost as Cjih = d(Oj,Oh) − d(Oj,Oi).

• A non-selected object Oj such that Oj Є Ci before the swap and Oj Є Cj΄, j΄ ≠ h, after the swap. This case arises when Min(e) d(Oj,Oe) = d(Oj,Oi) before the swap and Min(e≠i) d(Oj,Oe) = d(Oj,Oj΄) after the swap. Define the cost as Cjih = d(Oj,Oj΄) − d(Oj,Oi).

• A non-selected object Oj such that Oj Є Cj΄ before the swap and Oj Є Ch after the swap. Here Min(e) d(Oj,Oe) = d(Oj,Oj΄) before the swap and Min(e≠i) d(Oj,Oe) = d(Oj,Oh) after the swap, so Cjih = d(Oj,Oh) − d(Oj,Oj΄).

Define the total cost of swapping Oi and Oh as Cih = Σj Cjih. If Cih is negative, then the quality of the clustering is improved by making Oh a medoid in place of Oi. The process is repeated until no negative Cih can be found.
The algorithm can be stated as follows:

ALGORITHM

• Input: database of objects D.
• Select arbitrarily k representative objects. Mark these objects as “selected” and mark the remaining objects as “non-selected”.
• Repeat until no swap improves the clustering:
  • Do for all selected objects Oi:
      Do for all non-selected objects Oh:
        Compute Cih
      End do
  • End do
  • Select imin, hmin such that Cimin,hmin = Min(i,h) Cih.
  • If Cimin,hmin < 0, then mark Oimin as non-selected and Ohmin as selected, and repeat.
• Find the clusters C1, C2, …, Ck.
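Putting the pieces together, the following is a self-contained Python sketch of the PAM iteration just described (illustrative only; it recomputes costs naively rather than incrementally, and starts from the first k objects as the arbitrary medoids):

```python
def pam(d, k):
    """PAM sketch: start from k arbitrary medoids, then repeatedly apply
    the best (most negative cost) swap until no swap improves things."""
    n = len(d)
    medoids = list(range(k))  # arbitrary initial "selected" objects

    def total_cost(ms):
        return sum(min(d[j][m] for m in ms) for j in range(n))

    while True:
        non_selected = [j for j in range(n) if j not in medoids]
        # Do for all selected Oi and non-selected Oh: compute C_ih and
        # keep the pair (imin, hmin) with minimum cost.
        best_pair, best_cost = None, 0
        for i in medoids:
            for h in non_selected:
                swapped = [h if m == i else m for m in medoids]
                c = total_cost(swapped) - total_cost(medoids)
                if c < best_cost:
                    best_pair, best_cost = (i, h), c
        if best_pair is None:      # C_imin,hmin >= 0: locally optimal
            return medoids         # clusters follow by the nearest-medoid rule
        i, h = best_pair
        medoids = [h if m == i else m for m in medoids]
```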
6.2 CLARA

It can be observed that the major computational effort of PAM is in determining the k medoids through an iterative optimization. CLARA (Clustering LARge Applications), though it follows the same principle, attempts to reduce the computational effort by relying on sampling to handle large datasets. Instead of finding representative objects for the entire dataset, CLARA draws a sample of the dataset, applies PAM on this sample, and finds the medoids of the sample. If the sample is drawn in a sufficiently random way, the medoids of the sample approximate the medoids of the entire dataset. The steps of CLARA are summarized below:

ALGORITHM

• Input: database of objects D.
• Repeat for a fixed number of samples:
  1. Draw a sample S ⊆ D randomly from D.
  2. Call PAM(S, k) to get k medoids.
  3. Classify the entire data set D into clusters C1, C2, …, Ck.
  4. Calculate the quality of the clustering as the average dissimilarity.
• End.
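A sketch of CLARA along these lines, reusing the pam() sketch above. The number of samples drawn (five here) and the sample size are tuning parameters not fixed by the paper, and retaining the best medoid set across samples is the natural use of the quality computed in step 4:

```python
import random

def clara(d, k, sample_size, num_samples=5):
    """CLARA sketch: run PAM on random samples of the data and keep the
    medoid set that clusters the ENTIRE dataset best (steps 1-4 above)."""
    n = len(d)
    best_medoids, best_quality = None, float("inf")
    for _ in range(num_samples):
        sample = random.sample(range(n), sample_size)        # step 1
        sub = [[d[a][b] for b in sample] for a in sample]
        medoids = [sample[m] for m in pam(sub, k)]           # step 2
        # Steps 3-4: classify all of D and take the average dissimilarity.
        quality = sum(min(d[j][m] for m in medoids) for j in range(n)) / n
        if quality < best_quality:
            best_medoids, best_quality = medoids, quality
    return best_medoids
```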
6.3 CLARANS

CLARANS (Clustering Large Applications based on RANdomized Search) is similar to PAM, but it applies a randomized iterative optimization for the determination of the medoids. It is easy to see that PAM examines, at every iteration, all k(N−k) swaps to determine the pair corresponding to minimum cost. CLARA, on the other hand, tries to examine fewer elements by restricting its search to a smaller sample of the database: if the sample size is s ≤ N, it examines at most k(s−k) pairs at every iteration. CLARANS does not restrict the search to any particular subset of objects, but neither does it search the entire dataset: it randomly selects a few pairs for swapping at the current state. CLARANS, like PAM, starts with a randomly selected set of k medoids. It checks at most maxneighbour pairs for swapping, and if a pair with negative cost is found, it updates the medoid set and continues. Otherwise, it records the current selection of medoids as a local optimum and restarts with a new randomly selected medoid set to search for another local optimum. CLARANS stops after numlocal locally optimal medoid sets have been determined and returns the best among these.

ALGORITHM

• Input: (D, k, maxneighbour, numlocal).
• Select arbitrarily k representative objects. Mark these objects as “selected” and all other objects as “non-selected”. Call this set current.
• Set e = 1.
• Do while (e ≤ numlocal)
  Set j = 1.
  • Do while (j ≤ maxneighbour)
    o Consider randomly a pair (i, h) such that Oi is a selected object and Oh is a non-selected object.
    o Calculate the cost Cih.
    o If Cih is negative: “update current” (mark Oi non-selected and Oh selected) and set j = 1.
    o Else increment j ← j + 1.
  • End do
  • Compare the cost of the clustering with mincost. If current_cost < mincost:
    o mincost ← current_cost
    o best_node ← current
  • Increment e ← e + 1.
• End do
• Return best_node.
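A Python sketch of the CLARANS procedure above; current, mincost, and best_node play the same roles as in the pseudocode, and d is the same precomputed dissimilarity matrix assumed earlier:

```python
import random

def clarans(d, k, maxneighbour, numlocal):
    """CLARANS sketch: examine at most `maxneighbour` random swaps from
    the current medoid set; restart `numlocal` times, return the best."""
    n = len(d)

    def cost(medoids):
        return sum(min(d[j][m] for m in medoids) for j in range(n))

    best_node, mincost = None, float("inf")
    for _ in range(numlocal):                     # outer e-loop
        current = random.sample(range(n), k)      # random "selected" set
        j = 1
        while j <= maxneighbour:                  # inner j-loop
            i = random.choice(current)
            h = random.choice([o for o in range(n) if o not in current])
            neighbour = [h if m == i else m for m in current]
            if cost(neighbour) < cost(current):   # C_ih is negative
                current, j = neighbour, 1         # "update current"
            else:
                j += 1
        if cost(current) < mincost:               # record the local optimum
            mincost, best_node = cost(current), current
    return best_node
```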

7. ANALYSIS

PAM is very robust to the existence of outliers, and the clusters found by this method do not depend on the order in which the objects are examined. However, it cannot handle very large datasets. CLARA samples the large dataset and applies PAM on this sample, so its result is only as good as the sample. CLARANS applies a randomized iterative optimization for the determination of the medoids, and it can be applied to large datasets as well. It is more efficient than the earlier medoid-based methods, but it suffers from two major drawbacks: it assumes that all objects fit in main memory, and its result is very sensitive to the input order. In addition, it may not find a real local minimum, due to the trimming of the search controlled by maxneighbour.

8. CONCLUSION

The PAM algorithm is efficient and gives good results when the dataset is small; however, it cannot be applied to large datasets. CLARA's efficiency is determined by the sample of data taken in the sampling phase. CLARANS is efficient for large datasets. Since the datasets from which the required data is mined are typically large, CLARANS is used in practice and is an efficient partitional algorithm compared to PAM and CLARA.

9. REFERENCES:

Vasudha Bhatnagar, “On Mining of Data,” IETE Journal of Research, 2001.

Dunham, Data Mining and Warehousing.

IEEE papers.

www.datawarehouse.com

www.itpapers.com