Académique Documents
Professionnel Documents
Culture Documents
2, July 2015
10
I.
INTRODUCTION
Dimension = s = | vocabulary |
Each term, x, in a document or query, y, is given a realvalued weight, wxy. Queries and documents are represented as
s-dimensional vectors:
11
12
{cx }kx =1
n( x) = arg min c ,n e x c n ( x )
x =1
dy q
CosSim(d y , q ) =
=
dy q
i =1
n
x =1
Algorithm K Means
Input
: A set of data items
( wxy wxq )
wxy2
n
x =1
2
wxq
Number of clusters k
Output
random
2. Repeat until coverage
2(a). Assign data items following the nearest neighbor rule,
that is
n
n( x) = arg min c ,n e x c n ( x )
x =1
c (ys ) =
i.e.,
cosine( d, c ) = (d o c) / || d || || c || = (d o c) / || c ||
cosine( c1, c2) = (c1 o c2) / || c1 || || c2 ||
x:n ( x ) = y
ex
ny
2(c). t = t + 1
In the proposed K-Means Clustering algorithm, Cosine
Similarity normally find the distance from every document to
its most close center. Since the document is represented in
13
K Means = n =1 e e exs n
k
Q
0
0
1
1
1
0
0
1
Count
D1 D2
1
0
1
1
1
1
1
0
1
1
0
1
0
0
0
0
D3
0
0
0
0
0
0
1
1
df
1
2
2
1
2
1
1
1
idf
0.4771
0.1760
0.1760
0.4771
0.1760
0.4771
0.4771
0.4771
Weights Wx = sf * xdf
Q
D1
D2
0
0.4771
0
0
0.1760 0.1760
0.1760 0.1760 0.1760
0.4771 0.4771
0
0.1760 0.1760 0.1760
0
0
0.4771
0
0
0
0.4771
0
0
Length Normalization
Co sin eD2 =
Q.D2 x
| Q || D2 |
Co sin eD3 =
Q.D3 x
| Q || D3 |
= 0.6747
Query Qs Length
Q = Sqrt( (0.1760)2+ (0.1760)2 + (0.1760)2 + (0.4771)2 ) =
0.7191
Similarity Measures
Cosine Similarity is used to measure the cosine value of
the angle between two vectors.
Q * D1= (0.1760 * 0.1760) + (0.4771 * 0.4771) + (0.1760
* 0.1760) = 0.2895
Q * D2= (0.1760 * 0.1760) + (0.1760 * 0.1760)
Q * D3 = (0.4771 * 0.4771) = 0.2276
Co sin eD1 =
D3
0
0
0
0
0
0
0.4771
0.4771
Q.D1x
| Q || D1 |
= 0.0619
VIII.
OBSERVATIONS
[2]
[3]
[4]
[5]
[6]
[7]
BIOGRAPHY
Ms. R. Malathi Ravindran received her MCA
degree from Madurai Kamarajar University,
Madurai, India. She has completed her Master of
Philosophy in Periyar University, Salem.
Presently she is working as an Assistant Professor
of MCA in NGM College, Pollachi. She has 10
years of teaching experience. Her area of interest
includes data Mining. Now she is pursuing her
Ph.D in Computer Science in the Research Department of Computer
Science, NGM College, Pollachi under Bharathiar University,
Coimbatore, India.
Dr. Antony Selvadoss Thanamani is presently
working as Associate Professor and Head,
Research Department of Computer Science, NGM
College, Pollachi, Coimbatore, India. He has
published
more
than
100
papers
in
international/national journals and conferences.
He has authored many books on recent trends in
Information Technology. His areas of interest
include E-Learning, Knowledge Management, Data Mining,
Networking, Parallel and Distributed Computing. He has to his
credit 25 years of teaching and research experience. He is a senior
member of International Association of Computer Science and
Information Technology, Singapore and Active member of Computer
Science Society of India, Computer Science Teachers Association,
NewYork.
14