
Evaluation of Document Clustering based on Similarity Rough Set Model

Nguyen Chi Thanh, Yamada Koichi, Unehara Muneyuki
Nagaoka University of Technology, Nagaoka, JAPAN

Keywords: document clustering, rough sets, similarity relation, text mining, vector space model

1. INTRODUCTION

Document clustering is an important data mining technique for efficiently exploring useful information in textual data. It is very useful for organizing and searching large text collections, and for summarizing and navigating search engine results. Document clustering is the process of grouping similar documents into clusters, where the clusters are produced so that they exhibit high intra-cluster similarity and low inter-cluster similarity. Generally speaking, text document clustering methods attempt to group documents into classes or clusters, where each cluster represents a topic that is different from the topics represented by the other clusters.

When documents are represented as term (word) vectors, data mining clustering methods can be applied, so we need a data representation model for text documents. The vector space model is a very widely used representation model for information retrieval. In the vector space model, documents are represented by vectors of weights, where each weight denotes the importance of a word (term) in a document. Based on the vector space model, the Tolerance Rough Set Model (TRSM), a document representation model, was introduced by Ho et al. [3]. TRSM has been successfully applied in information retrieval and document clustering.

In this paper, we propose a new representation model for documents. The new model uses a similarity relation instead of a tolerance relation; therefore we call it the Similarity Rough Set Model (SRSM) for documents. We carry out an experiment on a document collection to evaluate the effectiveness of document clustering with TRSM and SRSM.

The structure of the paper is as follows. In Section 2 we introduce TRSM. In Section 3 we discuss drawbacks of TRSM and present our proposed model for document clustering. Section 4 and Section 5 present the design and results of our experiment on a document collection. Finally, Section 6 concludes with a summary and a discussion of future research.

2. TOLERANCE ROUGH SET MODEL FOR DOCUMENTS

TRSM is a basis for modeling documents and terms in information retrieval, text mining, etc. [3]. It was later applied to document clustering to improve the quality of the similarity measure between documents [4].

TRSM is a model extended from Pawlak's rough set model [6] by using a tolerance relation instead of an equivalence relation. Pawlak's model is called the equivalence rough set model (ERSM) to distinguish it from the tolerance rough set model. ERSM was used in information retrieval to improve document ranking as well as recall (the proportion of relevant documents that are retrieved). It offers a new way to calculate the semantic relationship of words based on an organization of the vocabulary into equivalence classes. However, when ERSM is used in information retrieval, the vocabulary classes have to be identified by subject experts; there is no way to calculate equivalence classes of words automatically. The use of TRSM allows us to calculate tolerance classes of words automatically using several methods. For example, if we use a thesaurus, all the words having a meaning similar to a given word in the thesaurus constitute a tolerance class. If we use the co-occurrence of terms, we can define the tolerance class by the co-occurrence of index terms.

TRSM can be defined based on an extended rough set model introduced in [10], as follows. Let the pair apr = (U, R) be an approximation space, where U is the universe and R ⊆ U × U is a tolerance relation on U. r(x): U → 2^U is an uncertainty function which corresponds to the relation R understood as xRy ⟺ y ∈ r(x); r(x) is the tolerance class of all objects that are considered to have information similar to x. The function r(x) satisfies the following conditions:

Reflexive: x ∈ r(x) for any x ∈ U;
Symmetric: y ∈ r(x) ⟹ x ∈ r(y) for any x, y ∈ U.

Given an arbitrary set X ⊆ U, X can be characterized by a pair of lower and upper approximations as follows:

\underline{apr}(X) = \{ x \in U \mid r(x) \subseteq X \},
\overline{apr}(X) = \{ x \in U \mid r(x) \cap X \neq \emptyset \}.
To define the approximation space (U, R) in document clustering, we choose the universe U as the set T of all index terms, U = T = {t_1, t_2, ..., t_N}. The binary relation R is represented by an uncertainty function I: U → 2^U:

I(t_i) = \{ t_j \mid f_D(t_i, t_j) \ge \theta \} \cup \{ t_i \},

where f_D(t_i, t_j) is the number of documents of the whole document set D in which terms t_i and t_j co-occur, and θ is a co-occurrence threshold. The function I corresponds to the relation R understood as xRy ⟺ y ∈ I(x). With this definition, R is a tolerance relation, because it satisfies the properties of reflexivity and symmetry, and I(t_i) is the tolerance class of term t_i. The lower and upper approximations of any subset X ⊆ T can then be defined as

\underline{apr}(X) = \{ t_i \in T \mid I(t_i) \subseteq X \},
\overline{apr}(X) = \{ t_i \in T \mid I(t_i) \cap X \neq \emptyset \}.
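As an illustration of how tolerance classes and upper approximations can be computed from document-level co-occurrence counts, the following Java sketch (our illustration, not the authors' implementation; documents are given as sets of terms and θ is passed as the parameter theta) builds I(t_i) for a small corpus:

import java.util.*;

/** Sketch (not the authors' code): tolerance classes of terms under TRSM,
 *  built from document-level co-occurrence counts with an absolute threshold theta. */
public class ToleranceClasses {

    /** I(ti) = { tj | fD(ti, tj) >= theta } ∪ { ti }, where fD counts co-occurring documents. */
    static Map<String, Set<String>> toleranceClasses(List<Set<String>> documents, int theta) {
        // fD(ti, tj): number of documents in which ti and tj co-occur
        Map<String, Map<String, Integer>> cooc = new HashMap<>();
        for (Set<String> doc : documents) {
            for (String ti : doc) {
                for (String tj : doc) {
                    if (ti.equals(tj)) continue;
                    cooc.computeIfAbsent(ti, k -> new HashMap<>()).merge(tj, 1, Integer::sum);
                }
            }
        }
        Map<String, Set<String>> classes = new HashMap<>();
        for (Set<String> doc : documents) {
            for (String ti : doc) {
                Set<String> cls = classes.computeIfAbsent(ti, k -> new HashSet<>());
                cls.add(ti);                                  // reflexivity: ti ∈ I(ti)
                for (Map.Entry<String, Integer> e : cooc.getOrDefault(ti, Map.of()).entrySet()) {
                    if (e.getValue() >= theta) cls.add(e.getKey());
                }
            }
        }
        return classes;
    }

    /** Upper approximation of a set of terms X: UA(X) = { ti | I(ti) ∩ X ≠ ∅ }. */
    static Set<String> upperApproximation(Set<String> x, Map<String, Set<String>> classes) {
        Set<String> upper = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : classes.entrySet()) {
            if (!Collections.disjoint(e.getValue(), x)) upper.add(e.getKey());
        }
        return upper;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("cluster", "document", "similar"),
                Set.of("cluster", "document", "rough"),
                Set.of("cluster", "rough", "set"));
        Map<String, Set<String>> classes = toleranceClasses(docs, 2);
        System.out.println(classes);
        System.out.println(upperApproximation(Set.of("document"), classes));
    }
}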
3. SIMILARITY ROUGH SET MODEL FOR DOCUMENTS

The co-occurrence of terms was chosen for document clustering based on TRSM because it gives a meaningful interpretation, in the context of information retrieval, of the dependency and the semantic relation of index terms, and because it is relatively simple and computationally efficient. However, in large collections of documents the numbers of occurrences can differ greatly among terms: there are terms with high frequency and terms with low frequency. In TRSM, the tolerance classes of terms depend on the constant threshold θ. If the frequency of a term is high, then the number of documents in which it co-occurs with other terms is also large, and as a result there are many terms in its tolerance class. Similarly, there may be only a few terms in the tolerance class of a word with low frequency. The size of the tolerance class of a term is therefore affected by its frequency in the collection, which does not reflect the nature of the relationship between the meanings of words.

To resolve this problem of TRSM, we propose a new equation for the function I that uses the relative value of word frequency instead of a constant threshold. The new function depends on a parameter α (0 < α < 1):

I(t_i) = \{ t_j \mid f_D(t_i, t_j) \ge \alpha \cdot f_D(t_i) \} \cup \{ t_i \}, \qquad (1)

where f_D(t_i, t_j) is the number of documents in D in which terms t_i and t_j co-occur, and f_D(t_i) is the number of documents in D in which term t_i occurs. With the new equation, the symmetric property of the word relation no longer holds. For example, suppose α = 0.2, f_D(t_i, t_j) = 5, f_D(t_i) = 19 and f_D(t_j) = 42. In this case,

\alpha \cdot f_D(t_i) = 0.2 \times 19 = 3.8, \qquad \alpha \cdot f_D(t_j) = 0.2 \times 42 = 8.4.

Thus f_D(t_i, t_j) \ge \alpha \cdot f_D(t_i), so t_j \in I(t_i), but f_D(t_j, t_i) < \alpha \cdot f_D(t_j), so t_i \notin I(t_j).

This function corresponds to a similarity relation S ⊆ T × T such that t_j S t_i ⟺ t_j ∈ I(t_i), and I(t_i) is the similarity class of index term t_i; the function means that t_j belongs to the similarity class of t_i if their co-occurrence is greater than the proportion α of f_D(t_i). A similarity relation is a binary relation that is required only to be reflexive; it is not imposed to be symmetric or transitive [9]. Like the tolerance relation, the rough set model can be extended using a similarity relation, and we use generalized approximations based on a similarity relation to propose a new representation model for document clustering.

Slowinski and Vanderpooten proposed new definitions of lower and upper approximations that generalize rough set theory to similarity relations [9]. Let the pair apr = (U, R) be an approximation space, where U is the universe and R ⊆ U × U is a similarity relation on U. r(x): U → 2^U is an uncertainty function which corresponds to the relation R understood as yRx ⟺ y ∈ r(x), which may be read as "y is similar to x"; r(x) is the similarity class of all objects that are considered to have information similar to x. The function r(x) satisfies the reflexive property x ∈ r(x), but it is not necessarily symmetric or transitive. Given an arbitrary set X ⊆ U, X can be characterized by a pair of lower and upper approximations as follows:

\underline{apr}(X) = \{ x \in U \mid r^{-1}(x) \subseteq X \},
\overline{apr}(X) = \bigcup_{x \in X} r(x),

where r^{-1}(x) denotes the class of x under the inverse relation of R, that is, the class of referent objects to which x is similar:

r^{-1}(x) = \{ y \in U \mid xRy \}.

We propose a new model for document representation in document clustering, the similarity rough set model for documents, defined as follows. To define the approximation space (U, R) in document clustering, we choose the universe U as the set T of all index terms, U = T = {t_1, t_2, ..., t_N}. The binary relation R is represented by an uncertainty function I: U → 2^U with I(t_i) = {t_j ∈ U | t_j R t_i}. In the similarity rough set model, I is defined by equation (1), that is,

t_j R t_i \iff f_D(t_i, t_j) \ge \alpha \cdot f_D(t_i).

With this definition, R is a similarity relation because it satisfies only the property of reflexivity, and I(t_i) consists of all terms that are similar to t_i. The lower and upper approximations of any subset X ⊆ T can be defined as

\underline{apr}(X) = \{ t_i \in T \mid I^{-1}(t_i) \subseteq X \},
\overline{apr}(X) = \bigcup_{t_i \in X} I(t_i),

where I^{-1}(t_i) denotes the class of t_i under the inverse relation of R, that is, the class of terms to which t_i is similar:

I^{-1}(t_i) = \{ t_j \in U \mid t_i R t_j \}.
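The following Java sketch (an illustration under our own naming, not the authors' code) builds similarity classes with the relative threshold of equation (1) and shows, using the numbers from the example above, why the resulting relation is not symmetric:

import java.util.*;

/** Sketch (illustrative): similarity classes under SRSM, using the relative threshold
 *  of equation (1): tj ∈ I(ti) iff fD(ti, tj) >= alpha * fD(ti). */
public class SimilarityClasses {

    static Map<String, Set<String>> similarityClasses(List<Set<String>> documents, double alpha) {
        Map<String, Integer> df = new HashMap<>();                 // fD(ti)
        Map<String, Map<String, Integer>> cooc = new HashMap<>();  // fD(ti, tj)
        for (Set<String> doc : documents) {
            for (String ti : doc) {
                df.merge(ti, 1, Integer::sum);
                for (String tj : doc) {
                    if (!ti.equals(tj))
                        cooc.computeIfAbsent(ti, k -> new HashMap<>()).merge(tj, 1, Integer::sum);
                }
            }
        }
        Map<String, Set<String>> classes = new HashMap<>();
        for (String ti : df.keySet()) {
            Set<String> cls = new HashSet<>();
            cls.add(ti);                                           // reflexivity: ti ∈ I(ti)
            double threshold = alpha * df.get(ti);                 // relative threshold
            for (Map.Entry<String, Integer> e : cooc.getOrDefault(ti, Map.of()).entrySet()) {
                if (e.getValue() >= threshold) cls.add(e.getKey());
            }
            classes.put(ti, cls);
        }
        return classes;
    }

    public static void main(String[] args) {
        // The paper's example: alpha = 0.2, fD(ti,tj) = 5, fD(ti) = 19, fD(tj) = 42.
        double alpha = 0.2;
        System.out.println(5 >= alpha * 19);  // true:  5 >= 3.8, so tj ∈ I(ti)
        System.out.println(5 >= alpha * 42);  // false: 5 <  8.4, so ti ∉ I(tj)

        // Tiny corpus demo: "rough" is frequent, so other terms enter I of the rare terms
        // but not vice versa (asymmetry).
        List<Set<String>> docs = List.of(
                Set.of("rough", "set"), Set.of("rough", "cluster"), Set.of("rough", "model"));
        System.out.println(similarityClasses(docs, 0.5));
    }
}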

Because we choose equation (1) to define I, i.e. I(t_i) = {t_j | f_D(t_i, t_j) ≥ α · f_D(t_i)} ∪ {t_i}, the inverse class I^{-1}(t_i) is in this case

I^{-1}(t_i) = \{ t_j \mid f_D(t_i, t_j) \ge \alpha \cdot f_D(t_j) \}.
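A small sketch of the resulting approximations, assuming the similarity classes I(t_i) have already been computed (our illustration; the inverse classes and the lower and upper approximations follow the definitions above):

import java.util.*;

/** Sketch: lower and upper approximations in SRSM, given precomputed similarity classes I(ti).
 *  lower(X) = { ti | I⁻¹(ti) ⊆ X },  upper(X) = ∪_{ti ∈ X} I(ti). */
public class SrsmApproximations {

    /** I⁻¹(ti) = { tj | ti ∈ I(tj) }: the terms to which ti is similar. */
    static Map<String, Set<String>> inverseClasses(Map<String, Set<String>> classes) {
        Map<String, Set<String>> inverse = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : classes.entrySet()) {
            for (String ti : e.getValue()) {
                inverse.computeIfAbsent(ti, k -> new HashSet<>()).add(e.getKey());
            }
        }
        return inverse;
    }

    static Set<String> lower(Set<String> x, Map<String, Set<String>> inverse) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : inverse.entrySet()) {
            if (x.containsAll(e.getValue())) result.add(e.getKey());   // I⁻¹(ti) ⊆ X
        }
        return result;
    }

    static Set<String> upper(Set<String> x, Map<String, Set<String>> classes) {
        Set<String> result = new HashSet<>();
        for (String ti : x) result.addAll(classes.getOrDefault(ti, Set.of(ti)));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> classes = Map.of(
                "rough", Set.of("rough", "set"),
                "set",   Set.of("set"),
                "fuzzy", Set.of("fuzzy", "set"));
        Map<String, Set<String>> inverse = inverseClasses(classes);
        Set<String> doc = Set.of("rough", "set");      // a document viewed as a set of terms
        System.out.println(upper(doc, classes));       // union of I(rough) and I(set)
        System.out.println(lower(doc, inverse));
    }
}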

4. DESIGN OF THE EXPERIMENT


A TRSM document clustering algorithm and an SRSM document clustering algorithm were implemented based on the spherical k-means algorithm [2]. The objective of our experiment is to evaluate document clustering algorithms with TRSM and SRSM. The clustering process is described in Figure 1. Each document is represented by the upper approximation of its set of index terms, in both the TRSM and the SRSM case. Tolerance classes and similarity classes are determined by the co-occurrence of terms. The cosine similarity measure is used to calculate the similarity between two documents.
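For concreteness, cosine similarity over a sparse term-weight representation can be computed as in the following Java sketch (our illustration, not the authors' implementation):

import java.util.*;

/** Sketch: cosine similarity between two documents given as sparse term-weight maps. */
public class CosineSimilarity {

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double w : b.values()) normB += w * w;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = Map.of("rough", 0.8, "set", 0.5, "cluster", 0.3);
        Map<String, Double> d2 = Map.of("rough", 0.6, "document", 0.7, "cluster", 0.4);
        System.out.println(cosine(d1, d2));
    }
}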

Figure 1. Document clustering process. (The figure shows the pipeline: document collection → document representation as terms → tolerance/similarity class generation → approximation generation (lower and upper approximations, LA/UA) → clustering algorithm → clusters of documents.)


The spherical k-means algorithm:

1. Start with an arbitrary partitioning of the document vectors, namely C^{(0)} = {C_1^{(0)}, C_2^{(0)}, ..., C_k^{(0)}}. Let c_1^{(0)}, c_2^{(0)}, ..., c_k^{(0)} denote the centroids associated with the given partitioning. Set the index of iteration t = 0.

2. For each document vector x_i, 1 ≤ i ≤ N, find the centroid closest in cosine similarity to x_i. Now compute the new partitioning C^{(t+1)} induced by the old centroids c_1^{(t)}, c_2^{(t)}, ..., c_k^{(t)}: C_j^{(t+1)} is the set of all document vectors that are closest to the centroid c_j^{(t)}. If a document vector is simultaneously closest to more than one centroid, it is randomly assigned to one of the clusters.

3. Compute the new centroids:

s_j = \sum_{x_i \in C_j^{(t+1)}} x_i, \qquad c_j^{(t+1)} = \frac{s_j}{\lVert s_j \rVert}, \qquad 1 \le j \le k,

so that c_j^{(t+1)} is the normalized sum (equivalently, the normalized mean m_j^{(t+1)}) of the document vectors in cluster C_j^{(t+1)}. If some stopping criterion is met, then set C_j^* = C_j^{(t+1)} and c_j^* = c_j^{(t+1)} for 1 ≤ j ≤ k, and exit. Otherwise, increment t by 1 and go to step 2 above. In our implementation, the stopping criterion is met when the centroids from the previous iteration are identical to those generated in the current iteration.
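To make the procedure concrete, the following Java sketch (a simplified dense-vector illustration, not the authors' implementation) follows the steps above: documents are unit vectors, assignment uses cosine similarity (a dot product for unit vectors), and each centroid is the normalized sum of its cluster's vectors:

import java.util.*;

/** Sketch of spherical k-means over unit-length document vectors. */
public class SphericalKMeans {

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double[] normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] u = v.clone();
        if (norm > 0) for (int i = 0; i < u.length; i++) u[i] /= norm;
        return u;
    }

    /** Returns the cluster index of each document vector. */
    static int[] cluster(double[][] docs, int k, int maxIter, long seed) {
        int n = docs.length, dim = docs[0].length;
        Random rnd = new Random(seed);
        int[] assign = new int[n];
        for (int i = 0; i < n; i++) assign[i] = rnd.nextInt(k);   // step 1: arbitrary partitioning
        for (int iter = 0; iter < maxIter; iter++) {
            // step 3 (for the current partition): centroids = normalized sums of member vectors
            double[][] centroids = new double[k][dim];
            for (int i = 0; i < n; i++)
                for (int d = 0; d < dim; d++) centroids[assign[i]][d] += docs[i][d];
            for (int j = 0; j < k; j++) centroids[j] = normalize(centroids[j]);
            // step 2: reassign each document to the centroid closest in cosine similarity
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestSim = -Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double sim = dot(docs[i], centroids[j]);
                    if (sim > bestSim) { bestSim = sim; best = j; }
                }
                if (best != assign[i]) { assign[i] = best; changed = true; }
            }
            if (!changed) break;   // stop when the partition (and hence the centroids) no longer changes
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] docs = {
                normalize(new double[]{1, 0.1, 0}), normalize(new double[]{0.9, 0.2, 0}),
                normalize(new double[]{0, 0.1, 1}), normalize(new double[]{0.1, 0, 0.8})};
        System.out.println(Arrays.toString(cluster(docs, 2, 100, 42)));
    }
}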

In the TRSM and SRSM algorithms, we use vectors of upper approximations of documents instead of the original document vectors. We use the tf-idf weighting scheme to calculate the weights of terms in document vectors, and the term weighting method is extended to define the weights of terms in the upper approximations of documents. It ensures that each term that is in the upper approximation of a document but does not belong to that document has a weight smaller than the weight of any term belonging to that document. The weight a_{ij} of term t_i in document d_j is then defined as follows:

a_{ij} = f_{ij} \cdot \log\frac{N}{f_D(t_i)} \quad \text{if } t_i \in d_j,

a_{ij} = \min_{t_h \in d_j} a_{hj} \cdot \frac{\log\left(N / f_D(t_i)\right)}{1 + \log\left(N / f_D(t_i)\right)} \quad \text{if } t_i \in \overline{apr}(d_j) \setminus d_j,

where f_{ij} is the frequency of term t_i in document d_j, f_D(t_i) is the number of documents containing term t_i, and N is the total number of documents. Normalization is then applied to the document vectors.
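A minimal sketch of this weighting, assuming the term frequencies, document frequencies and the upper-approximation terms of a document are already available (the method and variable names are ours, not the authors'):

import java.util.*;

/** Sketch: weights for terms in a document and for terms that are only in its upper approximation.
 *  Upper-approximation-only terms get a weight strictly below the smallest in-document weight. */
public class UpperApproxWeights {

    /**
     * @param tf        term -> frequency f_ij of the term in document dj
     * @param upperOnly terms in the upper approximation of dj but not in dj
     * @param df        term -> f_D(t_i), number of documents containing the term
     * @param n         N, total number of documents
     * @return term -> weight a_ij (unnormalized; vector normalization is applied afterwards)
     */
    static Map<String, Double> weights(Map<String, Integer> tf, Set<String> upperOnly,
                                       Map<String, Integer> df, int n) {
        Map<String, Double> a = new HashMap<>();
        // case 1: ti ∈ dj  ->  a_ij = f_ij * log(N / f_D(ti))
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) n / df.get(e.getKey()));
            a.put(e.getKey(), e.getValue() * idf);
        }
        double minInDoc = a.values().stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
        // case 2: ti ∈ UA(dj) \ dj  ->  a_ij = min_h a_hj * log(N/f_D(ti)) / (1 + log(N/f_D(ti)))
        for (String ti : upperOnly) {
            double idf = Math.log((double) n / df.get(ti));
            a.put(ti, minInDoc * idf / (1.0 + idf));           // idf/(1+idf) < 1, so weight < minInDoc
        }
        return a;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = Map.of("rough", 3, "cluster", 1);
        Set<String> upperOnly = Set.of("fuzzy");
        Map<String, Integer> df = Map.of("rough", 10, "cluster", 40, "fuzzy", 25);
        System.out.println(weights(tf, upperOnly, df, 1010));
    }
}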

The program is written in Java, using the Word Vector Tool library to create word vector representations of text documents in the vector space model. The Word Vector Tool is provided with YALE, a free open-source environment for KDD and machine learning [5]. According to a poll by KDnuggets, a leading internet portal for knowledge discovery and data mining, YALE (now called RapidMiner) was among the top three data mining tools in 2007 and the leading open-source solution. The Word Vector Tool provides operators for document pre-processing and representation, such as stopword removal, word stemming, document vector generation and term weighting. First, stopwords are removed from the set of terms; the Word Vector Tool uses a list of 452 non-content-bearing stopwords that are removed from the documents. To remove prefixes and suffixes, we use the Porter algorithm as implemented in Snowball, a framework for writing stemming algorithms created by Martin Porter [7]. The Porter algorithm is the most popular stemming algorithm because of its simplicity and elegance [1].

To evaluate the document clustering results and to compare the TRSM and SRSM methods, we use three clustering quality measures: entropy, F measure and mutual information. Different clustering quality measures may rank clustering algorithms differently, but if one algorithm performs better than another on many of these measures, we can have some confidence that it is the better algorithm on that data set.
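For reference, the following Java sketch (our illustration; the definitions are common formulations of entropy and the clustering F measure that are consistent with the values reported in Section 5, and the names are ours) computes two of these measures from a cluster-by-category contingency table such as Tables 5-8 below:

/** Sketch: entropy and F measure of a clustering, from a cluster-by-category contingency table. */
public class ClusterQuality {

    /** Weighted average over clusters of the category entropy inside each cluster (natural log). */
    static double entropy(int[][] table) {            // rows: clusters, cols: categories
        int total = 0;
        for (int[] row : table) for (int v : row) total += v;
        double result = 0;
        for (int[] row : table) {
            int size = 0;
            for (int v : row) size += v;
            if (size == 0) continue;
            double h = 0;
            for (int v : row) {
                if (v == 0) continue;
                double p = (double) v / size;
                h -= p * Math.log(p);
            }
            result += (double) size / total * h;
        }
        return result;
    }

    /** F measure: for each category take the best F1 over clusters, weighted by category size. */
    static double fMeasure(int[][] table) {
        int total = 0;
        int cats = table[0].length;
        int[] catSize = new int[cats];
        int[] clusterSize = new int[table.length];
        for (int i = 0; i < table.length; i++)
            for (int j = 0; j < cats; j++) {
                catSize[j] += table[i][j];
                clusterSize[i] += table[i][j];
                total += table[i][j];
            }
        double f = 0;
        for (int j = 0; j < cats; j++) {
            double best = 0;
            for (int i = 0; i < table.length; i++) {
                if (table[i][j] == 0) continue;
                double precision = (double) table[i][j] / clusterSize[i];
                double recall = (double) table[i][j] / catSize[j];
                best = Math.max(best, 2 * precision * recall / (precision + recall));
            }
            f += (double) catSize[j] / total * best;
        }
        return f;
    }

    public static void main(String[] args) {
        // Contingency table of Table 5 (TRSM, theta = 18); this yields approximately
        // entropy = 0.202 and F = 0.951, in line with the values reported in Table 4.
        int[][] table5 = {{0, 22, 312}, {371, 11, 0}, {7, 278, 9}};
        System.out.printf("entropy = %.3f, F = %.3f%n", entropy(table5), fMeasure(table5));
    }
}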


5. EXPERIMENTAL RESULT
Our test collection is a data set of abstracts of papers from several IEEE journals in different fields. We formed a collection of 1010 documents from IEEE Transactions on Knowledge and Data Engineering (378 abstracts), IEEE Transactions on Biomedical Engineering (311 abstracts) and IEEE Transactions on Nanotechnology (321 abstracts), giving a data set with 3 categories. We call these categories KDE, BIO and NANO. After removing stopwords and stemming words, we eliminated high-frequency words that appear in more than 15% of the documents (i.e., in more than 151 documents). This operation not only removes non-content-bearing high-frequency words [8] but also reduces the dimensionality of the document vectors. After preprocessing (stemming, stopword elimination, and high-frequency word pruning), we have 5690 terms in the document collection. Figure 2 shows the distribution of word frequency in the document collection.


After the frequencies of words are obtained, tolerance classes and similarity classes are created based on the co-occurrence of terms in the collection. Table 2 shows the average and maximum sizes of the similarity classes and tolerance classes of terms in the document collection when using the SRSM and TRSM algorithms with different threshold values.

Table 2. Size of tolerance and similarity classes.

TRSM                          SRSM
θ     Avg      Max            α      Avg     Max
2     114.31   1229           0.20   74.47   314
4     36.96    669            0.25   61.34   306
6     17.48    434            0.30   48.76   238
8     9.75     302            0.35   37.81   180
10    5.93     216            0.40   37.32   180
12    3.85     154            0.45   35.95   180
14    2.64     118            0.50   35.90   180
16    1.94     86             0.55   21.84   113

Figure 2. Word frequency in the document collection


With 5690 words we created 1010 document vectors using the tf-idf weighting scheme; each document vector has 5690 dimensions. In terms of the words contained in these documents, Table 1 shows the top ten most common words in the data set. The most frequently occurring word has rank 1, the second most frequent word has rank 2, and so on.

Table 1. Top 10 most common words in the data set.

Rank   Word           Frequency
1      increas        150
2      number         148
3      requir         146
4      set            144
4      design         144
4      relat          144
7      order          143
8      characterist   138
9      depend         133
9      signal         133

Figure 3. Size of tolerance and similarity classes

(Figure 2 plots the number of words at each frequency value; Figure 3 plots the average and maximum sizes of the tolerance and similarity classes against the thresholds θ and α, as listed in Table 2.)

In Table 2, we can see that the gap between the sizes of tolerance classes is very large. As discussed in Section 3, a high-frequency term co-occurs with other terms in many documents, so its tolerance class contains many terms, while a low-frequency word may have only a few terms in its tolerance class; the size of a term's tolerance class is thus governed by its frequency in the collection rather than by the relationship between word meanings. This is the problem of TRSM that has been solved with SRSM. Another problem with TRSM is that the sizes of tolerance classes depend strongly on the threshold value. As Figure 3 shows, the curve of the maximum tolerance class size is very steep and drops quickly as the threshold θ increases, which means the maximum sizes of tolerance classes change considerably for different values of θ. In contrast, the curves of the maximum and average similarity class sizes are comparatively stable.

The following tables show the clustering results with TRSM for θ = 18, which produces the best result with TRSM, and for θ = 6, which produces the worst result with TRSM.

Table 5. Clustering result with TRSM and θ = 18.

            KDE   BIO   NANO
Cluster 1     0    22    312
Cluster 2   371    11      0
Cluster 3     7   278      9

Table 6. Clustering result with TRSM and θ = 6.

            KDE   BIO   NANO
Cluster 1     1    77    298
Cluster 2   375   228     23
Cluster 3     2     6      0

We used our clustering algorithm to form 3 clusters from the test collection. Table 4 shows the evaluation of the clustering results of TRSM clustering with different values of θ, and of SRSM clustering with different values of α, using three quality measures: entropy, mutual information and F measure. When θ = 18, the entropy of the clustering result is 0.202, the mutual information is 0.406 and the F measure is 0.951, the best result with the TRSM algorithm. We get the best result with SRSM when α = 0.35: the entropy is 0.173, the mutual information is 0.419 and the F measure is 0.961. The result of SRSM is better than the result of TRSM on all three quality measures. Furthermore, there are only three values of the parameter α for which the SRSM results are not better than the best result of TRSM, and even then the differences are small.

The following tables show the clustering results with SRSM for α = 0.35, which produces the best result with SRSM, and for α = 0.25, which produces the worst result with SRSM.


Table 7. Clustering result with SRSM and α = 0.35.

            KDE   BIO   NANO
Cluster 1     0    18    314
Cluster 2   374    10      2
Cluster 3     4   283      5

Table 8. Clustering result with SRSM and α = 0.25.

            KDE   BIO   NANO
Cluster 1     0    20    306
Cluster 2   369    10      9
Cluster 3     9   281      6

We can see that the clusters can be identified with the three categories KDE, BIO and NANO, and that the SRSM results have more documents assigned to the correct categories.

Table 4. Evaluation of clustering results.

TRSM
θ     Entropy   Mutual information   F measure
2     0.682     0.188                0.696
4     0.673     0.192                0.704
6     0.693     0.183                0.701
8     0.682     0.188                0.682
10    0.649     0.203                0.699
12    0.643     0.206                0.707
14    0.649     0.203                0.707
16    0.364     0.333                0.895
18    0.202     0.406                0.951
20    0.202     0.406                0.951

SRSM
α     Entropy   Mutual information   F measure
0.20  0.212     0.402                0.951
0.25  0.231     0.393                0.946
0.30  0.215     0.400                0.949
0.35  0.173     0.419                0.961
0.40  0.179     0.417                0.959
0.45  0.183     0.415                0.960
0.50  0.190     0.412                0.958
0.55  0.174     0.419                0.958
0.60  0.182     0.416                0.956
0.65  0.196     0.408                0.952

6. CONCLUSION
In this paper, we have presented the Similarity Rough Set Model, our proposed model for document clustering. SRSM extends the vector space model using a generalized definition of the rough set model based on a similarity relation [9]. We developed a program using TRSM and SRSM to cluster documents and, to evaluate the two models in document clustering, formed a document collection consisting of 1010 documents from three IEEE journals. Our experimental results show that the quality of clustering with SRSM is better than that with TRSM on three evaluation measures: entropy, mutual information, and F measure. Furthermore, the clustering quality with SRSM is less affected by the value of the parameter (θ for TRSM, α for SRSM) than with TRSM. There is a big difference between the best result of TRSM and its other results, whereas for SRSM the differences between clustering results with different parameter values are not large. This is important because we still do not know how to determine these parameters in advance so as to obtain the optimal result. With SRSM we therefore have a better chance of getting a near-optimal result with an arbitrary value of α, while with TRSM we have to try many values of the threshold θ to obtain a good result; and even if we obtain the best result with TRSM, there is still a considerable possibility that it is not better than a result obtained with SRSM and an arbitrary value of α. In future work, we will continue to improve the efficiency and scalability of the algorithm. Another challenge is how to determine the parameter α for specific application domains.

REFERENCES

[1] Baeza-Yates, R., Ribeiro-Neto, B., (1999) Modern Information Retrieval, Addison-Wesley, New York.
[2] Dhillon, I.S., Fan, J., Guan, Y., Efficient clustering of very large document collections, (2001) Data Mining for Scientific and Engineering Applications.
[3] Ho, T.B., Funakoshi, K., Information retrieval using rough sets, (1997) Journal of Japanese Society for Artificial Intelligence, 13 (3), pp. 424-433.
[4] Ho, T.B., Nguyen, N.B., Nonhierarchical document clustering based on a tolerance rough set model, (2002) International Journal of Intelligent Systems, 17 (2), pp. 199-212.
[5] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T., YALE: Rapid prototyping for complex data mining tasks, (2006) Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935-940.
[6] Pawlak, Z., Rough sets, (1982) Int. J. of Information and Computer Sciences, 11 (5), pp. 341-356.
[7] Porter, M., (2001) Snowball: A Language for Stemming Algorithms, http://snowball.tartarus.org/texts/introduction.html.
[8] Salton, G., McGill, M.J., (1983) Introduction to Modern Information Retrieval, McGraw-Hill Book Company.
[9] Slowinski, R., Vanderpooten, D., A generalized definition of rough approximations based on similarity, (2000) IEEE Transactions on Knowledge and Data Engineering, 12 (2), pp. 331-336.
[10] Yao, Y.Y., Wong, S.K.M., Lin, T.Y., A review of rough set models, (1997) Rough Sets and Data Mining: Analysis for Imprecise Data, pp. 47-73.
