
AN AMALGAM CLUSTERING ALGORITHM FOR DATA MINING

Shobana.K[1], Sasikala.M[2]
[1] Research Scholar, Department of Computer Science, KG College of Arts and Science.
[2] Assistant Professor, Department of Computer Science, KG College of Arts and Science.

ABSTRACT
Data clustering is a data mining technique used to place data elements into related groups. A clustering algorithm aims at efficient clustering to determine the innate categorization in a set of unlabeled data. This paper expounds a proposed amalgam clustering algorithm based on the K-means algorithm and the K-Harmonic Means (KHM) algorithm. The proposed algorithm is evaluated on five disparate datasets, with the investigation fixated on swift and meticulous clustering. Its achievement is compared with the conventional K-means and KHM algorithms. The outcome obtained from the proposed amalgam algorithm is exceptional compared to the traditional K-means and KHM algorithms.

KEYWORDS
Data set; Clusters; Clustering Algorithm; K-means; K-Harmonic Means; Amalgam Clustering Algorithm

1. INTRODUCTION
The field of information mining and knowledge discovery is rising as a brand new, elementary analysis space with vital applications to science, engineering, medicine, business, and education. Data processing tries to formulate, analyze and implement elementary processes that help the extraction of meaningful data and knowledge from unstructured data.

The size of databases in scientific and business applications is big, and the quantity of records in a dataset may vary from some thousands to thousands of millions. Clustering is a category of data mining task in which algorithms are used to discover interesting data distributions within the underlying data space. The formation of clusters is predicated on the principle of increasing the similarity between patterns assigned to the same cluster, and similarity or proximity is typically defined as a distance function on pairs of patterns.

Several algorithms are emerging; different starting points and criteria typically result in different taxonomies of algorithms. The normally used partition strategies at present are K-Means (KM), K-Harmonic Means (KHM), Fuzzy C-Means (FCM) and Spectral Clustering (SPC), along with many other techniques built on top of the above-mentioned methods. The KM algorithm is a standard partition method. It is an iterative hill-climbing algorithm, and the solution obtained completely relies on the initial grouping. The K-means algorithm has been applied with success to several practical problems; nevertheless, it has been proved that the algorithm may fail to converge to a global minimum under specific conditions.

The issue with the K-means algorithm is that its performance is bad when the data are complex in structure. The KHM algorithm, by contrast, is based on averages of the distances from each data point to the centers, and it has been shown that KHM is not sensitive to the initialization of the centers. In specific cases KHM greatly increases the quality of the clustering results, so it suits large data sets. Existing algorithms demand multiple iterations over the dataset to achieve convergence, and most of them are sensitive to the starting point.
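The weakness of the plain (arithmetic) mean that motivates a harmonic-mean alternative can be seen in a small Java sketch; the class name and the sample values below are illustrative, not from the paper. One extreme item value drags the arithmetic mean far away from the bulk of the data, while the harmonic mean stays near the small values.

```java
// Illustrative sketch: sensitivity of the arithmetic mean, versus the
// harmonic mean, to one extreme value in the data.
public class MeanSensitivity {
    public static double arithmeticMean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    // Harmonic mean: n divided by the sum of reciprocals (positive values only).
    public static double harmonicMean(double[] xs) {
        double s = 0;
        for (double x : xs) s += 1.0 / x;
        return xs.length / s;
    }

    public static void main(String[] args) {
        double[] withOutlier = {2, 4, 4, 1000};          // one extreme item value
        System.out.println(arithmeticMean(withOutlier)); // 252.5
        System.out.println(harmonicMean(withOutlier));   // ~3.996
    }
}
```

The harmonic mean resists the outlier because the reciprocal of a large value (here 1/1000) contributes almost nothing to the sum of reciprocals.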
2. REVIEW OF ALGORITHMS
The K-Means and K-Harmonic Means clustering algorithms are described in brief to suggest the path that gives rise to the proposed amalgam clustering algorithm.

2.1. K-MEANS ALGORITHM
The K-means algorithm is a distance-based algorithm. It splits the data into a specific number of clusters and manipulates only numeric values. Being distance based, it depends on a distance metric to measure the analogy between the data points, and data points are assigned to the nearest cluster according to the distance metric used. The K-means algorithm can be given as follows.

1. begin
2. initialize N, K, C1, C2, . . . , CK;
   where N is the size of the data set,
   K is the number of clusters,
   C1, C2, . . . , CK are the cluster centers.
3. do assign the N data points to the closest Ci;
   recompute C1, C2, . . . , CK using the simple mean function;
   until no change in C1, C2, . . . , CK;
4. return C1, C2, . . . , CK;
5. end

Figure 1. K-means algorithm

2.2. K-HARMONIC MEANS ALGORITHM
The K-Harmonic Means algorithm is a center-based, iterative algorithm that refines the clusters defined by K centers. KHM examines the average of the squared distances from a data point to the centers. The algorithm initializes with a set of starting positions for the centers, calculates the distances with its distance function, and then calculates the new positions of the centers. This process iterates until the performance value stays constant. It has been proved that KHM is essentially insensitive to initialization: even when the initialization is poor it converges smoothly, in both convergence rate and clustering results, and it suits large-scale data sets very well. The K-Harmonic Means algorithm thus has many advantages, and hence it is used in the proposed algorithm.

1. begin
2. initialize N, K, C1, C2, . . . , CK;
   where N is the size of the data set,
   K is the number of clusters,
   C1, C2, . . . , CK are the cluster centers.
3. do assign the N data points to the closest Ci;
   recompute C1, C2, . . . , CK using the distance function;
   until no change in C1, C2, . . . , CK;
4. return C1, C2, . . . , CK;
5. end

Figure 2. K-Harmonic Means algorithm

3. PROPOSED ALGORITHM
Clustering is possible only on akin data sets, and cluster analysis makes it vital to compute the similarity between items based on distance. The situation grows tedious when the data is too large or is arranged in a scattered manner, making it arduous to arrange the items into proper groups. The dilemma with a mean-based algorithm is that the mean is highly distorted by extreme values. To defeat this complication a new algorithm is proposed, which implements two approaches to find the mean instead of one.

The underlying idea of the new hybrid clustering algorithm is to apply two techniques to find the mean sequentially until the destination is reached. The accuracy of the result is massive compared to the K-means and KHM algorithms. The harmonic mean converges well in the vital aspects of both result and speed, even when the initialization is poor; hence the proposed algorithm overcomes the complications that occur in the K-means and K-Harmonic Means algorithms. The core steps of the proposed algorithm are as follows.

Primarily, choose Z elements from dataset DN as single-element clusters. This step follows the same technique as k-means for selecting the k initial points, that is, choosing k random points.

1. begin
2. initialize Dataset DN, Z, C1, C2, . . . , CZ, PrevailingPath=1;
   where D is the dataset,
   N is the size of the current data set,
   Z is the number of clusters to be formed,
   C1, C2, . . . , CZ are the cluster centers,
   PrevailingPath is the total no. of scans over the dataset.
3. do assign the N data points to the nearest Ci;
   if PrevailingPath%2==0
      recompute C1, C2, . . . , CZ using the Harmonic Mean function;
   else
      recompute C1, C2, . . . , CZ using the Arithmetic Mean function;
   increment PrevailingPath by one.
   until no change in C1, C2, . . . , CZ;
4. return C1, C2, . . . , CZ;
5. end

Figure 3. Proposed amalgam algorithm

The final outcome of the experiments illustrates that the proposed algorithm has its own features, specifically in cluster forming. As the proposed algorithm applies two different techniques to identify the mean value of a cluster within a single dataset, the result obtained is aided by the advantages of both strategies. On another facet, it also sorts out the issue of picking the initial points mentioned earlier. The resultant clusters are more dense; owing to this fact, the mean is diminished to a great extent and calculated efficiently. The algorithm has multiple assets in the facets of computation time, number of iterations and effectiveness, which are improved to a greater extent, and the experiments help discover that the clustering results are much superior.

Probe 1: In this probe the maximum item value of the dataset is 997 and the minimum item value is 1.

Figure 1. Product obtained from applying the proposed algorithm on Dataset 1
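The alternation described in the Figure 3 pseudocode can be sketched in Java as follows. This is one possible reading of the pseudocode, reduced to one-dimensional points; the class and method names are illustrative, the sketch assumes positive item values (the harmonic mean needs non-zero reciprocals), and it stops when the assignment of points stops changing, since with two alternating means the centers themselves need not settle to a single value.

```java
import java.util.Arrays;

// 1-D sketch of the proposed amalgam algorithm (Figure 3): even-numbered
// scans recompute centers with the harmonic mean, odd-numbered scans with
// the arithmetic mean. Identifiers are illustrative, not from the paper.
public class AmalgamClustering {
    public static double[] cluster(double[] data, double[] centers) {
        int[] assign = new int[data.length];
        Arrays.fill(assign, -1);            // no point assigned yet
        int prevailingPath = 1;             // total number of scans over the dataset
        boolean changed = true;
        while (changed) {
            changed = false;
            // assign each data point to the nearest center Ci
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int k = 1; k < centers.length; k++)
                    if (Math.abs(data[i] - centers[k]) < Math.abs(data[i] - centers[best]))
                        best = k;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // recompute C1..CZ: harmonic mean on even passes, arithmetic otherwise
            for (int k = 0; k < centers.length; k++) {
                double sum = 0, recip = 0;
                int n = 0;
                for (int i = 0; i < data.length; i++)
                    if (assign[i] == k) { sum += data[i]; recip += 1.0 / data[i]; n++; }
                if (n == 0) continue;       // empty cluster: keep its old center
                centers[k] = (prevailingPath % 2 == 0) ? n / recip  // harmonic mean
                                                       : sum / n;   // arithmetic mean
            }
            prevailingPath++;               // increment PrevailingPath by one
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] centers = cluster(new double[]{1, 2, 3, 10, 11, 12},
                                   new double[]{1, 12});
        System.out.println(Arrays.toString(centers));
    }
}
```

On this toy data the first (arithmetic) scan moves the centers to the cluster averages, and the second (harmonic) scan pulls each center toward the smaller values of its cluster before the assignment stabilizes.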
4. PROBE AND ANALYSIS OF EVENT
Pervasive examinations were executed to evaluate the proposed algorithm. The algorithms were implemented and the datasets were kept in memory rather than in a database. Each dataset encompasses 15000 records. JAVA is used for the implementation; since the heap memory of JAVA is meager, the experiments use only 10 percent of the authentic dataset, resulting in 1500 records for the analysis.

A comparison of the new hybrid algorithm with the conventional K-means algorithm and the KHM algorithm is delineated. For each observation five clusters are created, and the veracity of the clusters is examined against the clusters formed by the classical K-means and KHM algorithms.

Figures 1, 2, 3, 4, 5 and Table 1 parade and prove that the contemporary amalgam algorithm curtails the mean value of each cluster, implying that the elements of the clusters are more tightly bound to each other.

Probe 2: In this probe the maximum item value of the dataset is 9977 and the minimum item value is 6.

Figure 2. Product obtained from applying the proposed algorithm on Dataset 2

Probe 3: In this probe the maximum item value of the dataset is 199 and the minimum item value is 2.
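The 10-percent sampling step described above (1500 of the 15000 in-memory records) could be done as in the following sketch. The paper does not state how the subset was drawn, so keeping every tenth record is an assumption, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: reduce an in-memory dataset to 10 percent of its
// records by keeping every 10th record (the selection rule is assumed).
public class SampleDataset {
    public static List<double[]> tenPercent(List<double[]> records) {
        List<double[]> subset = new ArrayList<>();
        for (int i = 0; i < records.size(); i += 10)
            subset.add(records.get(i));
        return subset;
    }
}
```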

Figure 3. Product obtained from applying the proposed algorithm on Dataset 3

Probe 4: In this probe the maximum and minimum item values of the dataset.

Figure 4. Product obtained from applying the proposed algorithm on Dataset 4

Probe 5: In this probe the maximum item value of the dataset is 498 and the minimum item value is 1.

Figure 5. Product obtained from applying the proposed algorithm on Dataset 5

5. CONCLUSION
This paper submits a contemporary amalgam clustering algorithm established on the K-means and KHM algorithms. From the experiments it is attested that the proposed hybrid algorithm is competent, decisive and dynamic. The analysis was performed using divergent datasets, and the efficiency of the proposed algorithm does not rely on the size, scale or values of the dataset. The hybrid algorithm proves to possess immense assets, with solid outcomes and in choosing the initial points.

Future enhancements may include the study of multi-dimensional datasets and clustering that serves massively huge datasets for analysis. It is also intended to use three mean techniques instead of two.

6. REFERENCES
[1] M. Dunham, "Data Mining: Introduction and Advancements," 2003.
[2] A. Jain, "Algorithms for Clustering."
[3] B. Zhang, M. C. Hsu, Umeshwar Dayal, "K-Harmonic Means - A Data Clustering Algorithm."
[4] S. Ghahraman, Z., Advances in Neural Information Processing Systems 14, Cambridge: MIT Press, 2002, pp. 849-856.