
A Combined Feature Selection Method for Chinese Text Categorization

Xiang Zhang1,2, Mingquan Zhou3, Guohua Geng1, Na Ye2


1 College of Information Science and Technology, Northwest University, Xi'an, China
2 College of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an, China
3 College of Information Science and Technology, Beijing Normal University, Beijing, China
zhangxiang1001@126.com

Abstract—Feature selection is an important step in Chinese text categorization. However, traditional Chinese feature selection methods are based on the conditional independence assumption, so the resulting feature subsets contain many redundant features. In this paper a combined feature selection method for Chinese text is proposed, built on regularized mutual information (RMI) and Distribution Information among Classes (DI). Feature selection is performed in two steps. In the first step, the Distribution Information algorithm removes features that are irrelevant to the text categories; in the second step, redundant features are eliminated by regularized mutual information. The experimental results show that this combined feature selection method can improve the quality of classification.

Keywords- feature selection; regularized mutual information; distribution information among classes; feature redundancy; Chinese text categorization

I. INTRODUCTION

The task of text categorization is to automatically assign predefined labels to previously unseen documents. Used in document indexing, e-mail filtering, web browsing, and personal information agents, text categorization is an active and important research area where machine learning and information retrieval intersect. The terms appearing in documents are treated as features, and one major difficulty in text categorization is the large dimension of the feature space. Therefore, feature selection plays a very important role in text categorization. However, traditional feature selection methods are based on the conditional independence assumption: they only consider the relevance between a term and the category and neglect the relevance between terms. Because of this, there are many redundant features in the selected subset [1].

In this paper a combined feature selection method for Chinese text is proposed. The method combines Distribution Information among Classes with regularized mutual information and performs feature selection in two steps. In the first step, the Distribution Information among Classes algorithm removes features that are irrelevant to the text categories; in the second step, redundant features are eliminated by regularized mutual information. We compare three methods, IG, MI and our method, on the Sogou collection (http://www.sogou.com/labs/dl/c.html); the experimental results show that the combined feature selection achieves higher performance on the Sogou collection than the other two methods.

The rest of this paper is organized as follows. In Section 2, related works are summarized. In Section 3, we discuss feature relevance and the measure of feature redundancy. The new feature selection algorithm is described in Section 4. In Section 5, we present the experimental setup and results. Conclusions and ideas for further research are given at the end.

II. RELATED WORKS

Feature selection, an important step after preprocessing in text categorization, is complicated by the non-structured format of raw text and the semi-structured format of e-mails and web pages. No matter which method is used to represent documents, the original feature set is high-dimensional. Therefore, before classifier induction, feature selection is often used to reduce the feature set T to a much smaller subset T'; the set T' is called the reduced feature set. A variety of feature selection techniques have been tested for text categorization, among which Information Gain, chi-square, Document Frequency, Bi-Normal Separation and Odds Ratio were reported to be the most effective [2]. These techniques are defined as follows:

1. Document frequency: features are selected by their frequency in the document collection, using a threshold.

2. Information gain: given a set of categories $C = \{c_i\}_{i=1}^{m}$, the information gain of term $x$ is

$$IG(x) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(x)\sum_{i=1}^{m} P(c_i \mid x)\log P(c_i \mid x) + P(\bar{x})\sum_{i=1}^{m} P(c_i \mid \bar{x})\log P(c_i \mid \bar{x}) \qquad (1)$$

3. Mutual information: the mutual information between term $x$ and class $c$ is given by

$$MI(x, c) = \log \frac{P(x \wedge c)}{P(x)\,P(c)} \qquad (2)$$

For a set of categories $C = \{c_i\}_{i=1}^{m}$, the mutual information of term $x$ is calculated as


$$MI(x, C) = \sum_{i=1}^{m} \log \frac{P(x \wedge c_i)}{P(x)\,P(c_i)} \qquad (3)$$

4. Chi-square statistic ($\chi^2$, CHI):

$$\chi^2(t_k, c_i) = \frac{\left[P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)\right]^2}{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)} \qquad (4)$$

5. Bi-Normal Separation (BNS):

$$BNS(t_k, c_i) = F^{-1}(P(t_k \mid c_i)) - F^{-1}(P(t_k \mid \bar{c}_i)) \qquad (5)$$

where $F$ is the cumulative probability function of the standard normal distribution.

6. Odds Ratio:

$$OR(t_k, c_i) = \frac{P(t_k \mid c_i)\,(1 - P(t_k \mid \bar{c}_i))}{(1 - P(t_k \mid c_i))\,P(t_k \mid \bar{c}_i)} \qquad (6)$$
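To make the scoring concrete, the following is a minimal Python sketch (not from the original paper) that computes the chi-square score of equation (4) for every term and category from a binary document-term matrix; the array names `X` and `y` are illustrative assumptions.

```python
import numpy as np

def chi_square_scores(X, y):
    """Chi-square score per eq. (4) for each (class, term) pair.

    X: (n_docs, n_terms) binary numpy matrix, X[d, k] = 1 if term k occurs in doc d.
    y: (n_docs,) array of integer class labels.
    """
    n_docs, n_terms = X.shape
    classes = np.unique(y)
    scores = np.zeros((len(classes), n_terms))
    p_t = X.mean(axis=0)                              # P(t_k)
    for idx, c in enumerate(classes):
        in_c = (y == c)
        p_c = in_c.mean()                             # P(c_i)
        p_tc = X[in_c].sum(axis=0) / n_docs           # P(t_k, c_i)
        p_t_notc = X[~in_c].sum(axis=0) / n_docs      # P(t_k, not c_i)
        p_nott_c = p_c - p_tc                         # P(not t_k, c_i)
        p_nott_notc = (1 - p_c) - p_t_notc            # P(not t_k, not c_i)
        num = (p_tc * p_nott_notc - p_t_notc * p_nott_c) ** 2
        den = p_t * (1 - p_t) * p_c * (1 - p_c) + 1e-12   # guard against division by zero
        scores[idx] = num / den
    return scores
```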

In practice, feature selection is performed by keeping the top-scoring features, using either a predefined threshold on the feature score or a fixed percentage of all available features. However, the above methods are based on the conditional independence assumption: they only consider the relevance between a term and the category and neglect the relevance between terms. Because of this, there are many redundant features in the selected subset.
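As an illustration of this selection step, here is a small sketch (our own, assuming a per-term score vector such as the chi-square scores above aggregated over classes) that keeps either the k best features or all features above a threshold.

```python
import numpy as np

def select_top_k(term_scores, k):
    """Indices of the k highest-scoring terms."""
    return np.argsort(term_scores)[::-1][:k]

def select_by_threshold(term_scores, threshold):
    """Indices of all terms whose score exceeds the threshold."""
    return np.flatnonzero(term_scores > threshold)

# Example: aggregate per-class scores by their maximum, then keep the top 1000 terms.
# term_scores = chi_square_scores(X, y).max(axis=0)
# selected = select_top_k(term_scores, 1000)
```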

III. MEASURE OF FEATURE REDUNDANCY

In this section we present a method to measure the relevance between terms [3, 4]. Entropy is a central concept in communication and information theory; many measures are based on it, as it quantifies the uncertainty of a random variable. As far as a dataset is concerned, entropy can be used as a measurement of the dataset's degree of impurity or irregularity, and the degree of irregularity depends on how strongly the data elements in the set are related. The entropy of a variable $X$ is defined as

$$H(X) = \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)} \qquad (7)$$

and the entropy of $X$ after observing the values of another variable $Y$ is defined as

$$H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j) \qquad (8)$$

where $P(x_i)$ is the prior probability of each value of $X$, and $P(x_i \mid y_j)$ is the posterior probability of $X$ given the values of $Y$. Mutual Information (MI) was introduced in information theory to better describe the relation between variables; it is a measurement of the statistical relevance between two random variables $X$ and $Y$ [5]. The MI of variables $X$ and $Y$ is defined as

$$MI(X, Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \qquad (9)$$

The larger $MI(X, Y)$ is, the more closely the two variables are related. When $X$ and $Y$ are highly relevant, $MI(X, Y)$ is greater than 0 and $X$ and $Y$ are mutually redundant. When $X$ and $Y$ are independent of each other, $MI(X, Y)$ is equal to 0. When $X$ and $Y$ are complementarily relevant, $MI(X, Y)$ is less than 0.

Because mutual information tends to favor features with more distinct values, the calculation should be normalized by entropy. Therefore we choose the regularized mutual information (RMI) to measure the redundancy between features. The RMI of variables $X$ and $Y$ is defined as

$$RMI(X, Y) = \frac{2\,MI(X, Y)}{H(X) + H(Y)} \qquad (10)$$

RMI(X, Y) compensates for this shortcoming of mutual information and limits the value range to [0, 1]. A value of 1 indicates that one feature can completely predict the other, so the two features are completely relevant and are redundant with respect to each other. A value of 0 indicates that $X$ and $Y$ are mutually independent.
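To ground equations (7)-(10), the following is a small sketch (an illustration of ours, not code from the paper) that computes the RMI of two discrete feature vectors, for example binary term-occurrence columns.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) per eq. (7), using log base 2."""
    n = len(x)
    probs = np.array([c / n for c in Counter(x).values()])
    return -np.sum(probs * np.log2(probs))

def mutual_information(x, y):
    """MI(X, Y) per eq. (9) for two discrete sequences of equal length."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def rmi(x, y):
    """Regularized mutual information per eq. (10), bounded in [0, 1]."""
    h = entropy(x) + entropy(y)
    return 0.0 if h == 0 else 2.0 * mutual_information(x, y) / h

# Example: two identical binary feature columns are fully redundant.
# rmi([0, 1, 1, 0], [0, 1, 1, 0])  -> 1.0
```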

IV. DIRMI FEATURE SELECTION ALGORITHM

The separability criterion of features among classes and inside a class is widely used for feature selection in pattern recognition; it is essentially the average distance between the feature vectors of all classes. Under the Euclidean distance, the separability criterion among classes and inside a class is defined as

$$J_d(x) = s_b + s_w \qquad (11)$$

where

$$s_b = \sum_{i=1}^{c} P_i (m_i - m)^T (m_i - m) \qquad (12)$$

$$s_w = \sum_{i=1}^{c} P_i \frac{1}{n_i} \sum_{k=1}^{n_i} (x_k^{(i)} - m_i)^T (x_k^{(i)} - m_i) \qquad (13)$$

$J_d(x)$ is the average distance between feature vectors over all classes; $m_i$ denotes the mean vector of sample set $i$, and $m$ denotes the mean vector of all samples. $s_b$ is the Distribution Information among Classes and $s_w$ is the Distribution Inside a Class. Generally, the bigger $J_d(x)$ is, the better the separability, so features with bigger Distribution Information among Classes (DI) should be selected. From the $s_b$ formula, the Distribution Information among Classes of a text feature is defined as

$$s_b(t_k) = \frac{\dfrac{1}{c-1}\sum_{i=1}^{c}\left(tf_i(t_k) - \dfrac{1}{c}\sum_{i=1}^{c} tf_i(t_k)\right)^2}{\dfrac{1}{c}\sum_{i=1}^{c} tf_i(t_k)} \qquad (14)$$

where $c$ is the total number of classes and $tf_i(t_k)$ is the frequency of feature $t_k$ in class $i$. The bigger $s_b(t_k)$ is, the more relevant the feature $t_k$ is to the categories. We therefore modify the tf-idf weight as

$$weight(t_k) = tf \cdot idf \cdot s_b \qquad (15)$$
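The following sketch (our own illustration; the function and variable names are assumptions) computes the class distribution score of equation (14) and the modified weight of equation (15) from per-class term frequencies.

```python
import numpy as np

def distribution_information(tf_per_class):
    """s_b(t_k) per eq. (14).

    tf_per_class: (n_classes, n_terms) matrix of term frequencies per class.
    """
    c = tf_per_class.shape[0]
    mean_tf = tf_per_class.mean(axis=0)                        # (1/c) * sum_i tf_i(t_k)
    var_tf = ((tf_per_class - mean_tf) ** 2).sum(axis=0) / (c - 1)
    return var_tf / (mean_tf + 1e-12)                          # guard against division by zero

def di_weight(tf, idf, s_b):
    """weight = tf * idf * s_b per eq. (15)."""
    return tf * idf * s_b
```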

The main idea of the DIRMI feature selection algorithm is divided into two steps. In the first step, the weight of each feature in the dataset is calculated by formula (15), and irrelevant features whose weights are below a threshold are removed. In the second step, the RMI between features is calculated and redundant features are removed. The DIRMI feature selection procedure is described as follows:

Input: training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, feature set $F = \{F_1, F_2, \ldots, F_k\}$, candidate subset size $k'$, relevance threshold $\delta$.
Output: feature subset $F'$.
Step 1: initialize the feature subset, $F' = \emptyset$.
Step 2: calculate the weights of $F = \{F_1, F_2, \ldots, F_k\}$ by $weight = tf \cdot idf \cdot s_b$, obtaining the weight set $W[i]$, $i = 1, 2, \ldots, k$, and sort $W[i]$ in decreasing order.
Step 3: select the $k'$ features with the largest weights as members of the candidate feature subset $F' = \{F'_1, F'_2, \ldots, F'_{k'}\}$, where $k' < k$. // features that are irrelevant to the classes are removed and the candidate subset is obtained
Step 4: for $i = 1$ to $k' - 1$ do
    for $j = i + 1$ to $k'$ do
        Step 4.1: calculate $RMI(F'_i, F'_j)$.
        Step 4.2: if $RMI(F'_i, F'_j) > \delta$ then $F' = F' - \{F'_j\}$. // if the two features are redundant with each other, the feature with the smaller weight is removed
Step 5: output the feature subset $F'$.
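As a concrete illustration of the two DIRMI steps, here is a hedged Python sketch (our own, assuming a document-term count matrix `X`, class labels `y`, idf values, and the helper functions `distribution_information` and `rmi` sketched above); it is not the authors' implementation.

```python
import numpy as np

def dirmi_select(X, y, idf, k_prime, delta):
    """Two-step DIRMI feature selection sketch.

    X: (n_docs, n_terms) term-frequency matrix; y: class labels;
    idf: (n_terms,) idf values; k_prime: candidate subset size;
    delta: RMI redundancy threshold. Returns indices of selected features.
    """
    classes = np.unique(y)
    tf_per_class = np.vstack([X[y == c].sum(axis=0) for c in classes])

    # Step 1: DI-based weights (eqs. 14-15); keep the k' highest-weighted features.
    s_b = distribution_information(tf_per_class)
    weights = X.sum(axis=0) * idf * s_b          # collection-level tf used for ranking
    candidates = list(np.argsort(weights)[::-1][:k_prime])

    # Step 2: drop the lower-weighted feature of any pair whose RMI exceeds delta.
    selected = candidates[:]
    binary = (X > 0).astype(int)
    for i in range(len(candidates) - 1):
        if candidates[i] not in selected:
            continue
        for j in range(i + 1, len(candidates)):
            if candidates[j] not in selected:
                continue
            if rmi(binary[:, candidates[i]], binary[:, candidates[j]]) > delta:
                selected.remove(candidates[j])
    return selected
```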

V. EXPERIMENTAL RESULTS

A. Dataset and evaluation

The effectiveness of feature selection is usually measured in terms of the classic IR notions of precision, recall and F1, defined as follows:

$$recall = \frac{a}{a + c} \qquad (16)$$

$$precision = \frac{a}{a + b} \qquad (17)$$

$$F1 = \frac{2 \cdot recall \cdot precision}{recall + precision} \qquad (18)$$

where $a$ is the number of documents correctly assigned to the category, $b$ is the number of documents incorrectly assigned to the category, and $c$ is the number of documents incorrectly rejected from the category.

The dataset of this study comes from the Sogou Lab (http://www.sogou.com/labs/dl/c.html). We select 10000 news reports from the Sogou collection as document profiles, covering ten categories: IT, economics, education, health, military, tour, auto, sport, culture, and job advertisements. Each category consists of 1000 documents, of which 700 are used as the training dataset to build the classification model and 300 as the testing dataset.

B. Experimental results and analysis

The experiment has two goals: one is to compare the performance of our algorithm with traditional algorithms such as IG and MI; the other is to compare the performance of the classifier with different numbers of features. We use ICTCLAS (http://www.ictclas.org/) for Chinese word segmentation. Stop words are eliminated, synonymous words are merged, and words whose frequency is lower than 10 are removed. Finally each document is represented by the vector space model (VSM). In the experiment kNN is chosen to build the classifier, with $k = 45$. Text similarity is computed with the cosine function, defined as follows:

$$sim(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^2\right)\left(\sum_{k=1}^{M} W_{jk}^2\right)}} \qquad (19)$$
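For illustration, a minimal kNN classifier over VSM weight vectors using the cosine similarity of equation (19) might look like the following sketch (our assumption; k = 45 as in the experiment).

```python
import numpy as np

def cosine_similarity(d_i, d_j):
    """sim(d_i, d_j) per eq. (19) for two weight vectors."""
    denom = np.sqrt((d_i ** 2).sum() * (d_j ** 2).sum())
    return 0.0 if denom == 0 else float(d_i @ d_j) / denom

def knn_classify(doc, train_docs, train_labels, k=45):
    """Assign the majority label among the k most similar training documents."""
    sims = np.array([cosine_similarity(doc, t) for t in train_docs])
    nearest = np.argsort(sims)[::-1][:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```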

Three algorithms, IG, MI and our method, are used to select features from the dataset, and the test set is then classified. The results of the experiments are shown in Table I, Table II and Table III.
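Before the comparison tables, here is a short sketch (ours, with assumed variable names) of how the per-category precision, recall and F1 of equations (16)-(18) can be computed from predicted and true labels.

```python
def category_scores(true_labels, predicted_labels, category):
    """Precision, recall and F1 for one category, per eqs. (16)-(18)."""
    pairs = list(zip(true_labels, predicted_labels))
    a = sum(1 for t, p in pairs if t == category and p == category)  # correctly assigned
    b = sum(1 for t, p in pairs if t != category and p == category)  # incorrectly assigned
    c = sum(1 for t, p in pairs if t == category and p != category)  # incorrectly rejected
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return precision, recall, f1
```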
TABLE I. COMPARISON OF PRECISION

Algorithm   Number of features
            300      500      1000     1500     2000
IG          0.756    0.761    0.775    0.782    0.789
MI          0.543    0.549    0.553    0.559    0.564
DIRMI       0.832    0.838    0.846    0.854    0.862

TABLE II. COMPARISON OF RECALL

Algorithm   Number of features
            300      500      1000     1500     2000
IG          0.703    0.711    0.715    0.724    0.729
MI          0.515    0.524    0.531    0.538    0.546
DIRMI       0.826    0.834    0.838    0.841    0.846

TABLE III. COMPARISON OF F1

Algorithm   Number of features
            300      500      1000     1500     2000
IG          0.729    0.735    0.744    0.752    0.758
MI          0.529    0.536    0.542    0.548    0.555
DIRMI       0.829    0.836    0.842    0.847    0.854

From the tables above we can see that, with the same number of features and the same classifier, the results of DIRMI feature selection are better than those of IG and MI. The reason is that the DIRMI algorithm removes redundant features better than the other two algorithms: with the same number of features, the features selected by DIRMI contain more information about the text and represent the text categories very well, whereas the other feature selection methods do not consider the relevance between features, so too many redundant features remain in the feature subset. Within the range of 300 to 2000 features, the results of the MI algorithm remain below 60%, because low-frequency words are very easily selected by the MI algorithm.

VI. CONCLUSION AND FUTURE WORKS

The effect of text categorization depends to a great extent on the effect of feature selection. Traditional Chinese feature selection methods are based on the conditional independence assumption, so there are many redundant features in the selected subset. To solve this problem, a combined feature selection method for Chinese text is proposed, built on regularized mutual information (RMI) and Distribution Information among Classes. The experimental results show that this combined feature selection can improve the quality of classification. Our future work is to improve the efficiency of the feature selection method and to apply the approach to the automatic classification of large-scale web pages.

ACKNOWLEDGMENT

The work is supported by the National Natural Science Foundation of China under Grant No. 60573179.

REFERENCES

[1] S. Cooper, "Some inconsistencies and misnomers in probabilistic information retrieval," in Proceedings of the 14th ACM SIGIR International Conference on Research and Development in Information Retrieval, 1991.
[2] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[3] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205-1224, 2004.
[4] R. Kohavi and G. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1-2, pp. 273-324, 1997.
[5] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, 1994, pp. 121-129.
