Nguyen Chi Thanh, Yamada Koichi, Unehara Muneyuki
Nagaoka University of Technology, Nagaoka, JAPAN

KEYWORDS: document clustering, rough sets, similarity relation, text mining, vector space model
1. INTRODUCTION

Document clustering is an important data mining technique for efficiently exploring useful information in textual data. It is very useful for organizing and searching large text collections, as well as for summarizing and navigating search engine results. Document clustering is the process of grouping similar documents into clusters, where the clusters are produced so that they exhibit high intra-cluster similarity and low inter-cluster similarity. Generally speaking, text document clustering methods attempt to group documents into classes or clusters, where each cluster represents some topic that is different from the topics represented by the other clusters.

When documents are represented as term (word) vectors, data mining clustering methods can be applied. We need a data representation model to deal with text documents. The vector space model is a widely used representation model for information retrieval: documents are represented by vectors of weights, where each weight denotes the importance of a word (term) in a document. Based on the vector space model, the Tolerance Rough Set Model (TRSM), a document representation model introduced by Ho et al. [3], has been proposed. TRSM has been successfully applied in information retrieval and document clustering.

In this paper, we propose a new representation model for documents. The new model uses a similarity relation instead of a tolerance relation; therefore we call it the Similarity Rough Set Model (SRSM) for documents. We carry out an experiment on a document collection to evaluate the efficiency of document clustering with TRSM and SRSM.

The structure of the paper is as follows. In Section 2 we introduce TRSM. In Section 3 we discuss drawbacks of TRSM and present our proposed model for document clustering. Sections 4 and 5 present the design and results of our experiment on a document collection.
Finally, Section 6 concludes with a summary and a discussion of future research.

2. TOLERANCE ROUGH SET MODEL FOR DOCUMENTS

TRSM is a basis to model documents and terms for information retrieval, text mining, etc. [3]. It was later applied to document clustering to improve the quality of the similarity measure between documents [4].

TRSM is a model extended from Pawlak's rough set model [6] by using a tolerance relation instead of an equivalence relation. Pawlak's model is called the equivalence rough set model (ERSM) to distinguish it from the tolerance rough set model. ERSM was used in information retrieval to improve document ranking as well as recall (the proportion of relevant documents that are retrieved). It offers a new way to calculate the semantic relationship of words based on an organization of the vocabulary into equivalence classes. However, when ERSM was used in information retrieval, the vocabulary classes had to be identified by subject experts; there was no way to calculate equivalence classes of words automatically. TRSM allows us to calculate tolerance classes of words automatically using several methods. For example, if we use a thesaurus, all the words having a meaning similar to a given word in the thesaurus constitute a tolerance class; if we use the co-occurrence of terms, we can define the tolerance class by the co-occurrence of index terms.

TRSM can be defined based on an extended rough set model introduced in [10], as follows. Let the pair apr = (U, R) be an approximation space, where U is the universe and R ⊆ U × U is a tolerance relation on U. The uncertainty function r(x): U → 2^U corresponds to the relation R, understood as xRy ⟺ y ∈ r(x); r(x) is the tolerance class of all objects that are considered to have information similar to x. The function r(x) satisfies the conditions:

  Reflexivity: x ∈ r(x) for any x ∈ U
  Symmetry: y ∈ r(x) ⟺ x ∈ r(y) for any x, y ∈ U

Given an arbitrary set X ⊆ U, X can be characterized by a pair of lower and upper approximations:

  apr_(X) = {x ∈ U | r(x) ⊆ X},
  apr¯(X) = {x ∈ U | r(x) ∩ X ≠ ∅}.

To define the approximation space (U, R) in document clustering, we choose the universe U as the set T of all index terms, U = T = {t1, t2, ..., tN}. The binary relation R is represented by an uncertainty function I_θ: U → 2^U,

  I_θ(ti) = {tj | fD(ti, tj) ≥ θ} ∪ {ti},

where fD(ti, tj) is the number of documents in D in which the terms ti and tj co-occur, and D is the whole set of documents. The function I_θ corresponds to the relation R understood as xRy ⟺ y ∈ I_θ(x). Under this definition, R is a tolerance relation because it satisfies the properties of reflexivity and symmetry, and I_θ(ti) is the tolerance class of term ti. The lower and upper approximations of any subset X ⊆ T can then be defined as

  apr_(X) = {ti ∈ T | I_θ(ti) ⊆ X},
  apr¯(X) = {ti ∈ T | I_θ(ti) ∩ X ≠ ∅}.

3. SIMILARITY ROUGH SET MODEL FOR DOCUMENTS

The co-occurrence of terms was chosen for document clustering based on TRSM because it gives a meaningful interpretation, in the context of information retrieval, of the dependency and the semantic relation of index terms, and because it is relatively simple and computationally efficient. However, in large collections of documents, occurrence counts may differ greatly among terms: there are terms with high frequency and terms with low frequency. In TRSM, the tolerance class of a term depends on the constant threshold θ. If the frequency of a term is high, the number of documents in which it co-occurs with other terms is also large, and as a result there are many terms in its tolerance class. Conversely, there may be only a few terms in the tolerance class of a low-frequency word. Thus the size of the tolerance class of a term is affected by its frequency in the collection, which does not reflect the nature of the relationship between the meanings of words.

To resolve this problem of TRSM, we propose a new equation for the function I using the relative value of word frequency instead of a constant threshold. The new function depends on a parameter α (0 < α < 1):

  I(ti) = {tj | fD(ti, tj) ≥ α·fD(ti)} ∪ {ti},   (1)

where fD(ti, tj) is the number of documents in D in which the terms ti and tj co-occur, and fD(ti) is the number of documents in D in which the term ti occurs. With this new equation, the symmetry of the word relation no longer holds. For example, suppose α = 0.2, fD(ti, tj) = 5, fD(ti) = 19 and fD(tj) = 42. In this case,

  α·fD(ti) = 0.2 × 19 = 3.8,
  α·fD(tj) = 0.2 × 42 = 8.4.

Thus fD(ti, tj) ≥ α·fD(ti), so tj ∈ I(ti); but fD(ti, tj) < α·fD(tj), so ti ∉ I(tj).

This function corresponds to a similarity relation S ⊆ T × T such that tjSti ⟺ tj ∈ I(ti), and I(ti) is the similarity class of index term ti. It means that tj belongs to the similarity class of ti if their co-occurrence is greater than a proportion of fD(ti). A similarity relation is a binary relation that requires only reflexivity; it is not required to be symmetric or transitive [9]. Like the tolerance relation, the rough set model can be extended using a similarity relation.

We use generalized approximations based on a similarity relation to propose a new representation model for document clustering. Slowinski and Vanderpooten proposed new definitions of lower and upper approximations that generalize rough set theory to similarity relations [9]. Let the pair apr = (U, R) be an approximation space, where U is the universe and R ⊆ U × U is a similarity relation on U. The uncertainty function r(x): U → 2^U corresponds to the relation R, understood as yRx ⟺ y ∈ r(x), which might represent that y is similar to x; r(x) is the similarity class of all objects that are considered to have information similar to x. The function r(x) satisfies reflexivity, x ∈ r(x), but is not necessarily symmetric or transitive. Given an arbitrary set X ⊆ U, X can be characterized by a pair of lower and upper approximations:

  apr_(X) = {x ∈ U | r⁻¹(x) ⊆ X},
  apr¯(X) = ∪_{x ∈ X} r(x),

where r⁻¹(x) denotes the inverse relation of R, i.e., the class of referent objects to which x is similar:

  r⁻¹(x) = {y ∈ U | xRy}.

We propose a new model for document representation in document clustering, the similarity rough set model for documents, defined as follows. To define the approximation space (U, R) in document clustering, we choose the universe U as the set T of all index terms, U = T = {t1, t2, ..., tN}. The binary relation R is represented by an uncertainty function I: U → 2^U, I(ti) = {tj ∈ U | tjRti}. In the similarity rough set model, I is defined by equation (1), so that

  tjRti ⟺ fD(ti, tj) ≥ α·fD(ti).

Under this definition, R is a similarity relation because it satisfies only the property of reflexivity. I(ti) consists of all terms that are similar to ti. The lower and upper approximations of any subset X ⊆ T can be defined as

  apr_(X) = {ti ∈ T | I⁻¹(ti) ⊆ X},
  apr¯(X) = ∪_{ti ∈ X} I(ti),

where I⁻¹(ti) denotes the inverse relation of R, i.e., the class of referent objects to which ti is similar:

  I⁻¹(ti) = {tj ∈ U | tiRtj}.

Because we choose equation (1) to define I, I⁻¹(ti) is defined as

  I⁻¹(ti) = {tj | fD(ti, tj) ≥ α·fD(tj)}.

4. EXPERIMENT DESIGN

[Figure 1: the clustering process — Document Collection → Document representation → Approximation generation (LA: lower approximation, UA: upper approximation) → Clustering algorithm.]

The clustering algorithm works as follows.

1. Start with an initial partitioning C(0) = {C1(0), C2(0), ..., Ck(0)}. Let c1(0), c2(0), ..., ck(0) denote the centroids associated with the given partitioning. Set the iteration index t = 0.

2. For each document vector xi, 1 ≤ i ≤ N, find the centroid closest in cosine similarity to xi, and compute the new partitioning C(t+1) induced by the old centroids c1(t), c2(t), ..., ck(t): Cj(t+1) is the set of all document vectors that are closest to the centroid cj(t). If a document vector is simultaneously closest to more than one centroid, it is randomly assigned to one of those clusters.

3. Compute the new centroids:

  cj(t+1) = sj / ‖sj‖, 1 ≤ j ≤ k, where sj = Σ_{xi ∈ Cj(t+1)} xi,

so that cj(t+1) has the direction of the mean of the document vectors in cluster Cj(t+1). If some stopping criterion is met, set Cj* = Cj(t+1) and cj* = cj(t+1); otherwise increment t by 1 and go to step 2. In our implementation, the stopping criterion is that the centroids from the previous iteration are identical to those generated in the current iteration.

In the TRSM and SRSM algorithms, we use vectors of the upper approximations of documents instead of the document vectors themselves. We use the tf-idf weighting scheme to calculate the weights of terms in document vectors, and the term weighting method is extended to define the weights of terms in the upper approximations of documents. It ensures that each term in the upper approximation of a document, but not belonging to that document, has a weight smaller than the weight of any term belonging to that document.

To remove prefixes and suffixes, we use the Porter algorithm as implemented in Snowball, a framework for writing stemming algorithms created by Martin Porter [7]. The Porter algorithm is the most popular stemming algorithm because of its simplicity and elegance [1]. To evaluate the document clustering results and to compare TRSM with SRSM, we use three clustering quality measures: entropy, F measure and mutual information. Different quality measures may rank clustering algorithms differently, but if one algorithm performs better than another on many of these measures, we can say with some confidence that it is the better algorithm on that data set.
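The iteration above can be sketched in a few lines. This is a minimal illustration of cosine-similarity k-means on unit-normalized document vectors, not the exact implementation used in our experiments; the function name, the random initialization from k sampled documents, and the tie-breaking (first match rather than random) are our own choices.

```python
import numpy as np

def spherical_kmeans(X, k, seed=0, max_iter=100):
    """Cluster the rows of X (document vectors) by cosine similarity.

    Rows are L2-normalized so that a dot product equals cosine similarity.
    """
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initial centroids: k distinct documents chosen at random.
    C = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each document to the centroid with highest cosine.
        new_labels = (X @ C.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):  # stopping criterion: no change
            break
        labels = new_labels
        # Step 3: new centroid = normalized sum of the cluster's documents
        # (same direction as the cluster mean); eps guards empty clusters.
        C = np.vstack([X[labels == j].sum(axis=0) for j in range(k)])
        C = C / np.maximum(np.linalg.norm(C, axis=1, keepdims=True), 1e-12)
    return labels
```

In the actual experiments the rows of X would be the tf-idf-weighted upper-approximation vectors described above rather than plain document vectors.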
5. EXPERIMENTAL RESULTS

Our test collection is a data set of abstracts of papers from several IEEE journals in different fields. We formed a collection of 1010 documents from IEEE Transactions on Knowledge and Data Engineering (378 abstracts), IEEE Transactions on Biomedical Engineering (311 abstracts) and IEEE Transactions on Nanotechnology (321 abstracts), giving a data set with three categories, which we call KDE, BIO and NANO.

After removing stopwords and stemming, we eliminated high-frequency words that appear in more than 15% of the documents (more than 151 documents). This operation not only removes non-content-bearing high-frequency words [8], but also reduces the dimensionality of the document vectors. After preprocessing (stemming, stop-word elimination and high-frequency word pruning), 5690 terms remain in the document collection. Figure 2 shows the distribution of word frequency in the document collection, and Table 1 shows the top ten most common words in the data set; the most frequently occurring word has rank 1, the second most frequent word has rank 2, and so on.
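The construction of tolerance and similarity classes from term co-occurrence, as defined in Sections 2 and 3, can be sketched as follows. This is an illustrative implementation over a toy corpus; the function names are ours, and each document is reduced to its set of index terms.

```python
from itertools import combinations
from collections import Counter

def co_occurrence(docs):
    """For each unordered term pair, count the documents in which both
    terms occur (f_D(ti, tj)); also count each term's document
    frequency (f_D(ti))."""
    pair_df, term_df = Counter(), Counter()
    for terms in docs:
        terms = set(terms)
        term_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return pair_df, term_df

def tolerance_class(ti, pair_df, term_df, theta):
    """TRSM: I_theta(ti) = {tj | f_D(ti, tj) >= theta} ∪ {ti}.
    Symmetric: the fixed threshold theta is the same for both terms."""
    return {ti} | {tj for tj in term_df if tj != ti
                   and pair_df[frozenset((ti, tj))] >= theta}

def similarity_class(ti, pair_df, term_df, alpha):
    """SRSM, equation (1): I(ti) = {tj | f_D(ti, tj) >= alpha * f_D(ti)} ∪ {ti}.
    Not symmetric: the threshold scales with f_D(ti)."""
    return {ti} | {tj for tj in term_df if tj != ti
                   and pair_df[frozenset((ti, tj))] >= alpha * term_df[ti]}
```

With counts like the paper's example (fD(ti, tj) = 5, fD(ti) = 19, fD(tj) = 42, α = 0.2), `similarity_class` puts tj in I(ti) but not ti in I(tj), which is exactly the asymmetry SRSM exploits.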
[Figure 2: distribution of word frequency in the document collection (frequency vs. number of words).]

After the frequencies of words are obtained, tolerance classes and similarity classes are created based on the co-occurrence of terms in the collection. Table 2 shows the average and maximum sizes of the similarity classes and tolerance classes of terms in the document collection when using the SRSM and TRSM algorithms.

[Figure 3: average and maximum sizes of similarity classes and tolerance classes of terms as the threshold varies from 2 to 14 (series: similarity class average size, similarity class maximum size, tolerance class average size, tolerance class maximum size).]
In Table 2, we can see that the gap between the sizes of tolerance classes is very big. As discussed in Section 3, the size of the tolerance class of a term is affected by its frequency in the collection, which does not reflect the nature of the relationship between the meanings of words; this problem of TRSM is solved by SRSM. Another problem with TRSM is that the sizes of the tolerance classes depend strongly on the threshold value. As we can see in Figure 3, the curve of the maximum tolerance class size is very steep: it drops quickly as the threshold θ increases.

The following tables show the clustering results with TRSM for θ = 18, which produces the best result with TRSM, and θ = 6, which produces the worst result with TRSM.
[Tables: distribution of documents over clusters (confusion matrices) for TRSM with θ = 18 and θ = 6.]

The following tables show the clustering results with SRSM for α = 0.35, which produces the best result with SRSM, and for the value of α that produces the worst result with SRSM.

[Tables: distribution of documents over clusters (confusion matrices) for SRSM.]

Even in the cases where the results from SRSM with these values are not better than the best result of TRSM, the difference is not large. We can see that the clusters can be identified with the three categories KDE, BIO and NANO, and that the results from SRSM have more documents assigned to the correct categories.
Clustering quality with TRSM for different values of θ:

θ     Entropy   Mutual information   F measure
2     0.682     0.188                0.696
4     0.673     0.192                0.704
6     0.693     0.183                0.701
8     0.682     0.188                0.682
10    0.649     0.203                0.699
12    0.643     0.206                0.707
14    0.649     0.203                0.707
16    0.364     0.333                0.895
18    0.202     0.406                0.951
20    0.202     0.406                0.951

Clustering quality with SRSM for different values of α:

α     Entropy   Mutual information   F measure
0.20  0.212     0.402                0.951
0.25  0.231     0.393                0.946
0.30  0.215     0.400                0.949
0.35  0.173     0.419                0.961
0.40  0.179     0.417                0.959
0.45  0.183     0.415                0.960
0.50  0.190     0.412                0.958
0.55  0.174     0.419                0.958
0.60  0.182     0.416                0.956
0.65  0.196     0.408                0.952
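The entropy and mutual information columns above can be computed, under commonly used definitions, as sketched below; the paper does not spell out its exact formulas, so these are one standard formulation (weighted per-cluster class entropy, and mutual information between the class and cluster assignments, both in bits), with function names of our own choosing.

```python
import math
from collections import Counter

def clustering_entropy(classes, clusters):
    """Weighted average, over clusters, of the entropy of the true class
    distribution inside each cluster. Lower is better (0 = pure clusters)."""
    n = len(classes)
    total = 0.0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        size = len(members)
        ent = -sum((cnt / size) * math.log2(cnt / size)
                   for cnt in Counter(members).values())
        total += (size / n) * ent
    return total

def mutual_information(classes, clusters):
    """I(class; cluster) = sum p(c,k) * log2(p(c,k) / (p(c) p(k))).
    Higher is better."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    pc, pk = Counter(classes), Counter(clusters)
    return sum((cnt / n) * math.log2((cnt / n) / ((pc[c] / n) * (pk[k] / n)))
               for (c, k), cnt in joint.items())
```

A perfect clustering has entropy 0 and mutual information equal to the entropy of the class distribution; assigning everything to one cluster gives mutual information 0.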
6. CONCLUSION
In this paper, we have presented the Similarity Rough Set Model (SRSM), our proposed model for document clustering. SRSM extends the vector space model using a generalized definition of the rough set model based on a similarity relation [9]. We developed a program using TRSM and SRSM to cluster documents, and formed a document collection consisting of 1010 documents from three IEEE journals to evaluate the two models in document clustering. Our experimental results show that the quality of the clustering with SRSM is better than that with TRSM on all three evaluation measures: entropy, mutual information and F measure.

Furthermore, the clustering quality with SRSM is less sensitive to the value of its parameter (α) than TRSM is to its threshold (θ). There is a big difference between the best result of TRSM and its other results, whereas for SRSM the differences between clustering results with different parameter values are small. This is important because we still do not know how to determine these parameters so as to obtain the optimum result. With SRSM we therefore have a better chance of getting a near-optimum result with an arbitrary value of α, while with TRSM we have to try many values of θ to obtain a good result; and even the best result obtainable with TRSM may well be no better than an SRSM result with an arbitrary value of α. In future work, we will continue to improve the efficiency and scalability of the algorithm. Another challenge is how to determine the parameter α for specific application domains.
REFERENCES
[1] Baeza-Yates, R., Ribeiro-Neto, B., (1999) Modern information retrieval, Addison-Wesley, New York.
[2] Dhillon, I.S., Fan, J., Guan, Y., Efficient clustering of very large document collections, (2001) Data Mining for Scientific and Engineering Applications.
[3] Ho, T.B., Funakoshi, K., Information retrieval using rough sets, (1997) Journal of Japanese Society for Artificial Intelligence, 13 (3), pp. 424-433.
[4] Ho, T.B., Nguyen, N.B., Nonhierarchical document clustering based on a tolerance rough set model, (2002) International Journal of Intelligent Systems, 17 (2), pp. 199-212.
[5] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T., YALE: Rapid prototyping for complex data mining tasks, (2006) Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935-940.
[6] Pawlak, Z., Rough sets, (1982) Int. J. of Information and Computer Sciences, 11 (5), pp. 341-356.
[7] Porter, M., (2001) Snowball: A Language for Stemming Algorithms, http://snowball.tartarus.org/texts/introduction.html.
[8] Salton, G., McGill, M.J., (1983) Introduction to modern information retrieval, McGraw-Hill Book Company.
[9] Slowinski, R., Vanderpooten, D., A generalized definition of rough approximations based on similarity, (2000) IEEE Transactions on Knowledge and Data Engineering, 12 (2), pp. 331-336.
[10] Yao, Y.Y., Wong, S.K.M., Lin, T.Y., A review of rough set models, (1997) Rough Sets and Data Mining: Analysis for Imprecise Data, pp. 47-73.