College of Information Science and Technology, Northwest University, Xi'an, China
College of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an, China
College of Information Science and Technology, Beijing Normal University, Beijing, China
zhangxiang1001@126.com
Abstract—Feature selection is an important step in Chinese text categorization. However, traditional Chinese feature selection methods are based on the conditional-independence assumption, so the selected feature subsets contain many redundant features. In this paper a combined feature selection method for Chinese text is proposed, built on regularized mutual information (RMI) and Distribution Information among Classes (DI). It performs feature selection in two steps: in the first step, the Distribution Information algorithm removes features that are irrelevant to the text categories; in the second step, redundant features are eliminated by regularized mutual information. The experimental results show that this combined feature selection method improves classification quality.

Keywords—feature selection; regularized mutual information; distribution information among classes; feature redundancy; Chinese text categorization
I. INTRODUCTION

The task of text categorization is to automatically assign predefined labels to previously unseen documents. Used in document indexing, e-mail filtering, web browsing, and personal information agents, text categorization is an active and important research area where machine learning and information retrieval intersect. The terms appearing in documents are treated as features, and one major difficulty in text categorization is the large dimension of the feature space; feature selection therefore plays a very important role. However, traditional feature selection methods are based on the conditional-independence assumption: they consider only the relevance between a term and a category and neglect the relevance between terms. Because of this, the selected feature subsets contain many redundant features [1]. In this paper a combined feature selection method for Chinese text is proposed. The method uses Distribution Information among Classes and regularized mutual information, and performs feature selection in two steps. In the first step, the Distribution Information among Classes algorithm removes features that are irrelevant to the text categories; in the second step, redundant features are eliminated by regularized mutual information. We compare three methods, IG, MI, and ours, on the Sogou collection (http://www.sogou.com/labs/dl/c.html); the experimental results show that the combined feature selection achieves higher performance on the Sogou collection than the other two methods.

The rest of this paper is organized as follows. Section 2 summarizes related work. Section 3 discusses feature relevance and the measure of feature redundancy. Section 4 presents the new feature selection algorithm. Section 5 presents the experimental setup and results. Conclusions and ideas for further research are given at the end.

II. RELATED WORKS
Feature selection, an important step after preprocessing in text categorization, is complicated by the unstructured format of raw text and the semi-structured format of e-mails and web pages. No matter which method is used to represent documents, the feature set contains a very large number of high-dimensional features. Therefore, before classifier induction, feature selection is used to reduce the feature set F to a much smaller subset F' (|F'| << |F|); F' is called the reduced feature set. A variety of feature selection techniques have been tested for text categorization; Information Gain, chi-square, Document Frequency, Bi-Normal Separation, and Odds Ratio were reported to be the most effective [2]. These feature selection techniques are defined as follows:

1. Document frequency: features are selected by their frequency in the document collection, with a threshold.

2. Information gain: given a set of categories C = {c_i}, i = 1, ..., m, the information gain of term x is given by

IG(x) = −Σ_{i=1}^{m} P(c_i) log P(c_i) + P(x) Σ_{i=1}^{m} P(c_i|x) log P(c_i|x) + P(x̄) Σ_{i=1}^{m} P(c_i|x̄) log P(c_i|x̄)   (1)
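As a concrete illustration, Eq. (1) can be estimated from document counts. The function below is a minimal sketch; the helper name and the count-based probability estimates are assumptions for illustration, not the paper's implementation:

```python
import math

def information_gain(docs_with_term, docs_per_class, total_docs):
    """Information gain of a term (Eq. 1), estimated from document counts.

    docs_with_term[i] -- number of documents of class c_i containing the term
    docs_per_class[i] -- total number of documents of class c_i
    total_docs        -- total number of documents in the collection
    """
    n_term = sum(docs_with_term)           # documents containing the term
    p_x = n_term / total_docs              # P(x)
    p_not_x = 1 - p_x                      # P(x-bar)
    ig = 0.0
    for n_ci_x, n_ci in zip(docs_with_term, docs_per_class):
        p_ci = n_ci / total_docs
        if p_ci > 0:
            ig -= p_ci * math.log2(p_ci)   # -sum P(c_i) log P(c_i)
        if n_term > 0 and n_ci_x > 0:
            p = n_ci_x / n_term            # P(c_i | x)
            ig += p_x * p * math.log2(p)
        n_ci_notx = n_ci - n_ci_x
        n_notx = total_docs - n_term
        if n_notx > 0 and n_ci_notx > 0:
            p = n_ci_notx / n_notx         # P(c_i | x-bar)
            ig += p_not_x * p * math.log2(p)
    return ig
```

A term that occurs in every document of one class and nowhere else carries the full class entropy as gain, while a term spread evenly across classes gains nothing.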
3. Mutual information: for a set of categories C = {c_i}, i = 1, ..., m, the mutual information of term x is calculated by

MI(x, C) = Σ_{i=1}^{m} log [ P(x ∧ c_i) / (P(x) P(c_i)) ]   (3)
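Eq. (3) can be evaluated directly from the joint and marginal probabilities. A minimal sketch follows; the helper name is assumed, and the base-2 logarithm is an illustrative choice:

```python
import math

def mutual_information(p_x_and_c, p_x, p_c):
    """Mutual information of a term x with a set of categories (Eq. 3).

    p_x_and_c[i] -- joint probability P(x AND c_i)
    p_x          -- marginal probability P(x)
    p_c[i]       -- marginal probability P(c_i)
    """
    # Sum the log-ratio over categories; zero joint probabilities are skipped.
    return sum(math.log2(pxc / (p_x * pc))
               for pxc, pc in zip(p_x_and_c, p_c) if pxc > 0)
```

When the term and every category are independent, each ratio is 1 and the sum is 0, matching the discussion of MI in Section 3.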
4. Chi-square (CHI): using the two-way contingency table of a term t_k and a category c_i (A = documents in c_i containing t_k, B = documents outside c_i containing t_k, C = documents in c_i not containing t_k, D = documents outside c_i not containing t_k, N = A + B + C + D), the chi-square statistic is

χ²(t_k, c_i) = N (AD − CB)² / [ (A + C)(B + D)(A + B)(C + D) ]   (4)

5. Bi-Normal Separation (BNS):

BNS(t_k, c_i) = | F⁻¹(P(t_k | c_i)) − F⁻¹(P(t_k | c̄_i)) |   (5)

where F is the cumulative probability function of the standard normal distribution.

6. Odds Ratio:

OR(t_k, c_i) = [ P(t_k | c_i) (1 − P(t_k | c̄_i)) ] / [ (1 − P(t_k | c_i)) P(t_k | c̄_i) ]   (6)

Actual feature selection is performed by picking the top-scoring features, using either a predefined threshold on the feature score or a fixed percentage of all available features. However, the above methods are based on the conditional-independence assumption: they consider only the relevance between a term and a category and neglect the relevance between terms. Because of this, the selected feature subsets contain many redundant features.

III. MEASURE OF FEATURE REDUNDANCY

In this section we present a method to measure the relevance between terms [3, 4]. Entropy is a central concept in communication and information theory, and many measures are based on it; entropy measures the uncertainty of a random variable. For a dataset, entropy can be used to measure the dataset's degree of impurity or irregularity, which depends on how strongly the data elements in the set are related. The entropy of a variable X is defined as

H(X) = Σ_{x∈X} p(x) log₂ (1/p(x))   (7)

and the conditional entropy of X given Y as

H(X|Y) = −Σ_j p(y_j) Σ_i p(x_i | y_j) log₂ p(x_i | y_j)   (8)

where P(x_i) is the prior probability of each value of X, and P(x_i | y_j) is the posterior probability of X given the values of Y.

Mutual Information (MI) was introduced in information theory to better describe the relation between things. MI is a measure of the statistical relevance between two random variables X and Y [5]. The MI of variables X and Y is defined as

MI(X, Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]   (9)

The greater MI(X, Y) is, the more closely the two variables are related. When X and Y are highly relevant, MI(X, Y) is greater than 0 and the two features are mutually redundant. When X and Y are independent of each other, MI(X, Y) equals 0. When X and Y are complementarily relevant, MI(X, Y) is less than 0.

Because mutual information is inclined to favor features with more values, the calculation should be normalized by entropy. Therefore we choose RMI (Regularized Mutual Information) to measure the redundancy between features. The RMI of variables X and Y is defined as

RMI(X, Y) = 2 MI(X; Y) / (H(X) + H(Y))   (10)

RMI(X, Y) compensates for this shortcoming of mutual information and limits the value range to [0, 1]. A value of 1 indicates that one feature can completely predict the other, i.e., the two features are completely relevant and are redundant with each other. A value of 0 indicates that X and Y are mutually independent.

IV. DIRMI FEATURE SELECTION ALGORITHM

The separability criterion of features among classes and inside a class, widely used for feature selection in pattern recognition, is essentially the average distance between the feature vectors of all classes. Under the Euclidean distance it is defined as

J_d(x) = s_b + s_w   (11)

s_b = Σ_{i=1}^{c} P_i (m_i − m)ᵀ (m_i − m)   (12)

s_w = Σ_{i=1}^{c} P_i (1/n_i) Σ_{k=1}^{n_i} (x_k^(i) − m_i)ᵀ (x_k^(i) − m_i)   (13)

J_d(x) is the average distance between feature vectors across all classes; m_i denotes the mean vector of sample set i, and m denotes the mean vector over all sample sets. s_b is the Distribution Information among Classes and s_w is the Distribution Information inside a Class. Generally, the bigger J_d(x) is, the better the separability, so features with bigger Distribution Information among Classes (DI) should be selected. From the s_b formula, the Distribution Information among Classes of a text feature t_k is defined as

s_b(t_k) = [ (1/(c − 1)) Σ_{i=1}^{c} ( tf_i(t_k) − (1/c) Σ_{j=1}^{c} tf_j(t_k) )² ] / [ (1/c) Σ_{i=1}^{c} tf_i(t_k) ]   (14)

where c is the total number of classes and tf_i(t_k) is the frequency of feature t_k in class i. The bigger s_b is, the more relevant feature t_k is to its class. We redefine the tf-idf weight as

weight(t_k) = tf(t_k) · idf(t_k) · s_b(t_k)   (15)
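Under the definitions above, the two-step selection (DI weight for relevance, RMI for redundancy) can be sketched in Python as follows. The data layout, helper names, and count-based probability estimates are illustrative assumptions, not the paper's implementation:

```python
import math
from collections import Counter

def distribution_information(tf_per_class):
    """Distribution Information among Classes of a feature (Eq. 14).
    tf_per_class[i] -- frequency of the feature in class i."""
    c = len(tf_per_class)
    mean_tf = sum(tf_per_class) / c
    if mean_tf == 0:
        return 0.0
    variance = sum((tf - mean_tf) ** 2 for tf in tf_per_class) / (c - 1)
    return variance / mean_tf

def entropy(values):
    """H(X) over a sequence of discrete observations (Eq. 7)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def rmi(xs, ys):
    """Regularized mutual information of two discrete features (Eq. 10)."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    mi = sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in joint.items())
    hx, hy = entropy(xs), entropy(ys)
    return 2 * mi / (hx + hy) if hx + hy > 0 else 0.0

def dirmi_select(weights, feature_columns, k_prime, threshold):
    """Two-step selection: keep the k' highest-weighted features, then drop
    the lighter feature of any pair whose RMI exceeds the threshold.

    weights         -- {feature_name: tf * idf * s_b weight (Eq. 15)}
    feature_columns -- {feature_name: list of discrete values per document}
    """
    # Step 1: candidate subset = the k' features with the biggest DI weights.
    candidates = sorted(weights, key=weights.get, reverse=True)[:k_prime]
    # Step 2: pairwise RMI filter; candidates are in decreasing-weight order,
    # so the later feature of a redundant pair is the lighter one and is dropped.
    selected = []
    for f in candidates:
        if all(rmi(feature_columns[f], feature_columns[g]) <= threshold
               for g in selected):
            selected.append(f)
    return selected
```

Because the candidate list is sorted by decreasing weight, whenever a pair exceeds the redundancy threshold it is the smaller-weight feature that is removed, matching Step 4.2 of the algorithm below.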
The main idea of the DIRMI feature selection algorithm is divided into two steps. First, the weight of each feature in the dataset is calculated by formula (15), and irrelevant features whose weights fall below the threshold are removed. Second, the RMI of each pair of remaining features is calculated and redundant features are removed. DIRMI is described as follows:

Input: training set S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_n)}, feature set F = {F_1, F_2, ..., F_k}, subset size k', relevance threshold.
Output: feature subset F'.

Step 1: initialize the feature subset, F' = ∅.
Step 2: calculate the weights of F = {F_1, F_2, ..., F_k} by weight = tf · idf · s_b; obtain the weight set W[i], i = 1, 2, ..., k, and sort W[i] in decreasing order.
Step 3: select the k' features with the biggest weights as the candidate feature subset F' = {F'_1, F'_2, ..., F'_{k'}}, where k' < k. // Features irrelevant to the classes are removed and the candidate subset is obtained.
Step 4: For i = 1 to k' − 1 Do
  For j = i + 1 to k' Do
    Step 4.1: calculate RMI(F'_i, F'_j).
    Step 4.2: if RMI(F'_i, F'_j) exceeds the threshold, then F' = F' − {F'_j}. // If two features are redundant with each other, the feature with the smaller weight is removed.
Step 5: output the feature subset F'.

V. EXPERIMENTAL RESULTS
A. Performance measure and dataset

Classification performance is measured per category from the following counts:
a - the number of documents correctly assigned to this category;
b - the number of documents incorrectly assigned to this category;
c - the number of documents incorrectly rejected from this category.
Precision is then a / (a + b) and recall is a / (a + c).

The dataset of this study is from the Sogou Lab (http://www.sogou.com/labs/dl/c.html). We select 10000 news reports from the Sogou collection as document profiles, covering ten categories: IT, economics, education, health, military, tour, auto, sport, culture, and job ads. Each category consists of 1000 documents; 700 documents are used as the training dataset to build the classification model and 300 documents as the testing dataset.

B. Experimental results and analysis

There are two goals in the experiment. One is to compare the performance of our algorithm with traditional algorithms such as IG and MI. The other is to compare the performance of the classifier with different numbers of features. We use ICTCLAS (http://www.ictclas.org/) for Chinese word segmentation. Stop words are eliminated and synonymous words are merged. Words whose frequency is lower than 10 are removed. Finally each document is represented in the vector space model (VSM). In the experiment kNN is chosen to build the classifier, with k = 45. Text similarity is computed with the cosine function, defined as follows:
sim(d_i, d_j) = [ Σ_{k=1}^{M} W_ik · W_jk ] / sqrt( ( Σ_{k=1}^{M} W_ik² ) ( Σ_{k=1}^{M} W_jk² ) )
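The cosine similarity above and the kNN classifier used in the experiment can be sketched as follows; the helper names and data layout are assumptions for illustration:

```python
import math

def cosine_similarity(w_i, w_j):
    """Cosine similarity between two VSM weight vectors (formula above)."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    norm = (math.sqrt(sum(a * a for a in w_i))
            * math.sqrt(sum(b * b for b in w_j)))
    return dot / norm if norm else 0.0

def knn_classify(doc, training, k=45):
    """Label a document by majority vote among its k most similar
    training documents; training is a list of (weight_vector, label) pairs."""
    neighbors = sorted(training,
                       key=lambda t: cosine_similarity(doc, t[0]),
                       reverse=True)[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)
```

With k = 45, as in the experiment, the predicted category is the majority label among the 45 training documents with the highest cosine similarity to the test document.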