The Algorithm Research of Genetic Algorithm Combining with Text Feature Selection Method
Yuping Fang a, Ken Chen b, Chenhong Luo a,*
a Yunnan Normal University, College of Vocational and Technical Education, Kunming, Yunnan 650032, China
b Yunnan Normal University, Department of Computer Science, Kunming, Yunnan 650032, China
Abstract
The traditional methods of text feature selection are analyzed and their respective advantages and disadvantages are compared in detail. Exploiting the self-optimizing character of the genetic algorithm, an improved text feature selection scheme is proposed. First, a common text feature selection method (DF, IG, MI, CHI) is used to select candidate text features; the candidates are then screened by a genetic algorithm; finally, the feature items best suited to text classification are selected. The experimental results show that the performance is significantly improved.
Keywords: Feature Selection; Dimensionality Reduction; Evaluation Function; Genetic Algorithm
1. Introduction
How to effectively organize and manage information, and how to find the information that users need quickly, accurately and comprehensively, are major challenges in the current information science and technology fields. Text classification refers to the process in which texts are assigned to predefined categories according to their content under a given classification system. It is an important component of text mining [1] and plays a significant role in improving the speed and accuracy of text retrieval. Text classification includes three steps: building the vector model of the text, text feature selection, and classifier training [2]. In order to balance computational time against classification accuracy, feature selection has to be performed, striving to reduce dimensionality without damaging classification performance.
September 2012
Yuping Fang et al / Journal of Computational Science and Engineering 1:1 (2012) 00090013
2. Traditional Text Feature Selection Methods

2.1 Document Frequency (DF)

DF(t) = (the number of documents in which feature t appears) / (the total number of documents in the training set)   (1)
DF is the simplest evaluation function; its greatest advantage is the small amount of calculation it requires. The theoretical assumption behind the DF evaluation function is that features which occur with lower frequency contain less information, but this assumption is obviously incomplete. Therefore DF is generally not used directly in practical applications, but serves as a baseline evaluation function.

2.2 Information Gain (IG)

IG is a feature selection method widely used in the machine learning field. From the information-theoretic point of view, it partitions the sample space by the presence or absence of each candidate feature, and then filters and selects the effective features according to how much information is gained. IG can be expressed as:

IG(t) = -Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i|t) log P(C_i|t) + P(t̄) Σ_i P(C_i|t̄) log P(C_i|t̄)   (2)

In the formula, P(C_i|t) indicates the probability that a text belongs to category C_i when feature t appears in the text; P(C_i|t̄) indicates the probability that a text belongs to C_i when t does not appear in the text; P(C_i) indicates the probability of occurrence of category C_i; and P(t) indicates the probability that t appears in the texts of the training set.

2.3 Mutual Information (MI)

MI is a concept from information theory used to measure the degree of interdependence between two signals in a message. In the field of feature selection, the mutual information between feature t and category C_i reflects the relevance between the feature and the category. A feature t that appears with high probability in one category and with low probability in the other categories receives a higher mutual information score. MI can be expressed as:

MI(t, C_i) = log [ P(t, C_i) / (P(t) P(C_i)) ]   (3)

where each symbol has the same meaning as in formula (2) above.
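As a minimal sketch of the three scores above (assuming a corpus represented as tokenized documents with category labels; the function and variable names are illustrative, not from the paper):

```python
import math
from typing import Dict, List

def document_frequency(corpus: List[List[str]]) -> Dict[str, float]:
    """DF(t): fraction of documents in which term t appears, cf. equation (1)."""
    n_docs = len(corpus)
    counts: Dict[str, int] = {}
    for doc in corpus:
        for term in set(doc):  # count each term at most once per document
            counts[term] = counts.get(term, 0) + 1
    return {t: c / n_docs for t, c in counts.items()}

def information_gain(docs: List[List[str]], labels: List[str], term: str) -> float:
    """IG(t) = H(C) - H(C | t present/absent), cf. equation (2)."""
    n = len(docs)

    def entropy(lbls: List[str]) -> float:
        h = 0.0
        for c in set(lbls):
            p = lbls.count(c) / len(lbls)
            h -= p * math.log2(p)
        return h

    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    h_cond = 0.0
    for part in (with_t, without_t):
        if part:
            h_cond += len(part) / n * entropy(part)
    return entropy(labels) - h_cond

def mutual_information(docs: List[List[str]], labels: List[str],
                       term: str, category: str) -> float:
    """MI(t, C_i) = log[P(t, C_i) / (P(t) P(C_i))], cf. equation (3)."""
    n = len(docs)
    joint = sum(1 for d, l in zip(docs, labels) if term in d and l == category)
    t_count = sum(1 for d in docs if term in d)
    c_count = labels.count(category)
    if joint == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log2(joint * n / (t_count * c_count))

# Toy corpus: term "x" perfectly indicates category "A".
docs = [["x"], ["x"], ["y"], ["y"]]
labels = ["A", "A", "B", "B"]
```

On this toy corpus, "x" attains the maximum information gain of 1 bit, illustrating why IG ranks category-discriminating terms highly.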
2.4 χ² Statistic (CHI)

χ²(t, C_i) = N(AD - CB)² / ((A + C)(B + D)(A + B)(C + D))   (4)

where A is the frequency with which feature t and a document of category C_i appear together; B is the frequency with which feature t appears but a document of category C_i does not; C is the frequency with which a document of category C_i appears but feature t does not; D is the frequency with which neither feature t nor a document of category C_i appears; and N is the total number of texts. The CHI method assumes that the non-independence relationship between feature t and the text category follows a χ² distribution with one degree of freedom. It is based on the following assumption: high-frequency words, whether in the specified category or in other categories, help to determine whether an article belongs to category C_i.

The basic idea of the four feature selection methods is the same: calculate a statistical measure for every feature word, set a threshold T, filter out the features whose measure is smaller than T, and keep the rest as effective features. Table 1 shows the respective advantages and disadvantages of the four feature selection methods.
Table 1. The comparisons of the advantages and disadvantages of feature selection methods

Document frequency (DF)
  Advantage: Low computational complexity, capable of large-scale classification tasks; the simplest feature selection method.
  Disadvantage: Does not meet the widely accepted theory of information retrieval, since it ignores the role of low-frequency words.

Information gain (IG)
  Advantage: A feature selection method widely used in machine learning.
  Disadvantage: If a word does not occur, the effectiveness of information gain is greatly reduced; computing the statistics is expensive.

Mutual information (MI)
  Advantage: Considers low-frequency words, which carry some amount of information.
  Disadvantage: Easily leads to over-learning; does not consider negative correlation; ignores the dispersion and concentration characteristics of the indicators, resulting in over-fitting to individual features.

χ² statistic (CHI)
  Advantage: Considers the "negative correlation" of words; good classification results.
  Disadvantage: Computing the statistics is expensive, and it is ineffective for low-frequency words.
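Given the contingency counts A, B, C, D defined for formula (4), the χ² score reduces to simple arithmetic. A minimal sketch of the standard formulation (the function name is illustrative):

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """chi^2(t, C_i) = N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), cf. equation (4).

    a: docs of category C_i containing t        b: docs not in C_i containing t
    c: docs of category C_i without t           d: docs not in C_i without t
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate contingency table
    return n * (a * d - c * b) ** 2 / denom

# A term that co-occurs strongly with one category scores high:
score = chi_square(40, 10, 10, 40)
```

Note the symmetry in (AD - CB)²: a term strongly *absent* from a category ("negative correlation") also scores high, which is the advantage CHI holds over MI in Table 1.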
3. Combining the genetic algorithm with the traditional text feature selection methods
Genetic algorithms imitate the biological process of natural selection and evolution; a genetic algorithm is a random, population-based global optimization algorithm. It encodes the problem (parameters) to be solved, generates an initial solution group in the solution space, and gradually evolves toward the global optimum through genetic variation. As a mature method, the genetic algorithm has been discussed extensively in the literature; for further details please refer to [1, 2, 3]. The experimental idea in this article is: use a traditional text feature selection method (DF, IG, MI, CHI) to select the text features, then use the genetic algorithm to filter them, and finally select the feature items which suit text classification.

3.1 The combined algorithm of genetic algorithm and text feature selection

Input: the set of entries obtained after word segmentation.
Output: the text feature set.
Algorithm description:
[T1]. Use the Chinese Academy of Sciences word segmentation system to segment the text and obtain the entry set T;
[T2]. Apply the traditional feature selection formulas (1), (2), (3) and (4) to T; the result is T1;
[T3]. Use T1's entries as the coding of the genetic algorithm: an entry that appears is 1, an entry that does not appear is 0, yielding chromosomes over {0, 1};
[T4]. Use the TF-IDF formula to compute the weights:

w_i = tf_i × log(N / n_i)   (5)

where N is the number of all documents, n_i is the number of documents which contain the term t_i, and tf_i indicates the frequency with which term t_i appears in document d;
[T5]. Compute the fitness function:
fit(s_i) = log Σ_{i=1, j=1}^{n} ( Σ_k (w_ik - w_jk)² )^(1/2)   (6)
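Steps [T3]-[T5] can be sketched as follows. This is a minimal illustration, not the authors' implementation: the fitness follows the reconstructed equation (6) (log of summed pairwise Euclidean distances between document weight vectors restricted to the selected terms), and the selection, crossover and mutation settings (truncation selection, one-point crossover, mutation rate) are illustrative assumptions:

```python
import math
import random
from typing import List

def tfidf_weight(tf: int, n_docs: int, doc_freq: int) -> float:
    """w_i = tf_i * log(N / n_i), cf. equation (5)."""
    return tf * math.log(n_docs / doc_freq)

def fitness(weights: List[List[float]]) -> float:
    """log of summed pairwise Euclidean distances, cf. equation (6)."""
    total = 0.0
    for wi in weights:
        for wj in weights:
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(wi, wj)))
    return math.log(total) if total > 0 else float("-inf")

def ga_select(tfidf: List[List[float]], n_generations: int = 20,
              pop_size: int = 10, p_mut: float = 0.05, seed: int = 0) -> List[int]:
    """Evolve binary chromosomes over candidate terms; bit k == 1 keeps term k."""
    rng = random.Random(seed)
    n_terms = len(tfidf[0])

    def restrict(chrom: List[int]) -> List[List[float]]:
        # Keep only the weight columns selected by the chromosome.
        return [[w for w, bit in zip(doc, chrom) if bit] for doc in tfidf]

    pop = [[rng.randint(0, 1) for _ in range(n_terms)] for _ in range(pop_size)]
    for _ in range(n_generations):
        scored = sorted(pop, key=lambda c: fitness(restrict(c)), reverse=True)
        survivors = scored[: pop_size // 2]           # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_terms)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ (rng.random() < p_mut) for b in child]  # bit-flip mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda c: fitness(restrict(c)))

# Toy TF-IDF matrix: 3 documents, 3 candidate terms.
tfidf_mat = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0]]
best = ga_select(tfidf_mat, n_generations=5, pop_size=6, seed=1)
```

The design choice worth noting is that the fitness rewards chromosomes whose selected terms spread the documents apart in weight space, which is a reasonable proxy for class separability.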
4. Experimental Results and Analysis

For the identification of text features, three of the most common evaluation indicators were applied in the test: Precision (P), Recall Rate (R) and the aggregative indicator value (F). Their definitions are as follows:

(1) Feature identification precision, the ratio of identified character strings that are indeed feature words:
P = (the number of feature words which are correctly identified) / (the total number of feature words determined by the system) × 100%
(2) Feature identification recall rate, the ratio of text feature words in the corpus that are identified:
R = (the number of feature words which are correctly identified) / (the total number of feature words in the corpus) × 100%
(3) Aggregative indicator F, combining precision and recall:

F = 2 × P × R / (P + R)
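The three indicators reduce to two ratios and their harmonic mean; a minimal sketch (function and parameter names are illustrative):

```python
from typing import Tuple

def precision_recall_f(correct: int, identified: int, in_corpus: int) -> Tuple[float, float, float]:
    """P and R as percentages; F = 2PR / (P + R), their harmonic mean."""
    p = correct / identified * 100  # precision: correct / all identified
    r = correct / in_corpus * 100   # recall: correct / all in corpus
    f = 2 * p * r / (p + r)
    return p, r, f

# e.g. 80 correctly identified out of 100 system outputs, 160 feature words in the corpus
p, r, f = precision_recall_f(80, 100, 160)
```

Because F is a harmonic mean, it always lies between P and R and is pulled toward the lower of the two, penalizing lopsided systems.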
The test system, implemented in the VC++ language, used the corpus described above. The results are shown below.
Table 3. The Comparison of Experiment Results

Feature Selecting Algorithm | Aggregative Indicator F
Document Frequency (DF)     | 73.48
Information Gain (IG)       | 85.67
Mutual Information (MI)     | 80.74
χ² Statistic (CHI)          | 84.49
Genetic Algorithm (GA)      | 80.25
DF + GA                     | 77.35
IG + GA                     | 89.25
MI + GA                     | 83.81
CHI + GA                    | 86.00
The experiments show that, on the evaluation indicators, the combination of the genetic algorithm with a feature selection method filters significantly better than either a feature selection method alone or the genetic algorithm alone. However, it costs a large amount of time; this is an aspect to be improved in future research. Among the four combinations, IG + GA performs best, followed by CHI + GA; MI + GA performs worst; DF + GA runs fastest.
Acknowledgements

This work is supported by the Humanities and Social Science Foundation of the Ministry of Education under Grant No. 09YJC870001, and the Natural Science Foundation of the Yunnan Education Department of China under Grant No. 2011Y315.
References:
[1] Zhaoqi Bian, Xuegong Zhang. Pattern Recognition [M]. Beijing: Tsinghua University Press, 2000.
[2] Jianchao Xu, Ming Hu. Chinese Web text characteristics of acquisition and classification [J]. Computer Engineering, 2005, 31(8): 24-26.
[3] Miettinen K, Neittaanmaki P, Makela M M. Evolutionary Algorithms in Engineering and Computer Science [M]. New York: Wiley, 1999.
[4] Guoliang Chen, Xufa Wang, Zhenquan Zhuang, et al. Genetic Algorithm and Its Application [M]. Beijing: People's Posts and Telecommunications Press, 1996.
[5] McCallum, Andrew Kachites. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering [DB/OL]. http://www.cs.cmu.edu/~mccallum/bow. 1996 (2003-02-08).
Biography

Yuping Fang, born 1977, female, Han nationality, from Dali, Yunnan; Master's degree, lecturer. Main research directions: natural language processing, text mining, etc. Email: fangyuping728@sina.com