
Journal of Computational Science and Engineering

www.jcseuk.com

2050-2311/Copyright 2012 IE Enterprises Ltd. All rights reserved

Journal of Computational Science and Engineering, Vol. 1, No. 1, pp. 0009-0013, 2012

Research on a Genetic Algorithm Combined with Text Feature Selection Methods
Yuping Fang a,*, Ken Chen b, Chenhong Luo a

a Yunnan Normal University, College of Vocational and Technical Education, Kunming, Yunnan 650032, China
b Yunnan Normal University, Department of Computer Science, Kunming, Yunnan 650032, China

Abstract

The traditional methods of text feature selection were analyzed and their respective advantages and disadvantages were compared in detail. Based on the self-optimizing characteristic of the genetic algorithm, an improved text feature selection scheme was proposed. First, a common text feature selection method (DF, IG, MI or CHI) was used to select candidate text features; the candidates were then screened by a genetic algorithm, and the feature items that best suit text classification were finally selected. The experimental results show that the performance has been significantly improved.
Keywords: Feature Selection; Dimensionality Reduction; Evaluation Function; Genetic Algorithm

1. Introduction
How to organize and manage information effectively, and how to find the information that users need quickly, accurately and comprehensively, are major challenges in the current information science and technology fields. Text classification refers to the process in which texts are assigned to predefined categories according to their content under a given classification system. It is an important component of text mining [1] and plays a significant role in improving the speed and accuracy of text retrieval. Text classification includes three steps: building the vector model of the text, text feature selection, and classifier training [2]. In order to balance computation time against classification accuracy, feature selection has to be performed, with the goal of reducing dimensionality without damaging the classification performance.

* Corresponding author: Yuping Fang. E-mail address: fangyuping728@sina.com



2. Text feature selection


Feature selection is a dimensionality reduction measure applied to the original feature space, i.e. selecting, from a group of candidate features, the most effective features that contribute most to the text information, to form an optimal feature subset. From the optimization point of view, the feature selection process is actually a process of optimal feature combination. The goal of text feature selection is to achieve the same or better classification results with fewer features; therefore, among individuals with the same expressive ability, the fewer features the better. The statistical methods currently used for feature selection include the feature frequency and document frequency methods, which are based on frequency, and mutual information, information gain, expected cross entropy, the chi-square statistic, correlation coefficients, weight of text evidence, etc., which are based on information theory. The four methods used in this paper are introduced below.

2.1 Document Frequency (DF)

Document frequency can be expressed as:

DF(t) = (the number of documents in which feature t appears) / (the total number of documents in the training set)    (1)

It is the simplest evaluation function, and its greatest advantage is the small amount of calculation it requires. The theoretical assumption behind the DF evaluation function is that features which appear with lower frequency contain less information, but this assumption is obviously incomplete. Therefore, DF is generally not used directly in practical applications, but serves as a baseline evaluation function.

2.2 Information Gain (IG)

IG is a feature selection method widely used in the machine learning field. From the information theory point of view, it partitions the sample space by each candidate feature, then filters and selects the effective features according to how much information is gained. IG can be expressed as:

IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})    (2)

In the formula, P(c_i | t) indicates the probability that a text belongs to category c_i when feature t appears in the text; P(c_i | \bar{t}) indicates the probability that a text belongs to category c_i when feature t does not appear in the text; P(c_i) indicates the probability of occurrence of category c_i; P(t) indicates the probability that feature t appears in the texts of the training set.

2.3 Mutual Information (MI)

MI is a concept from information theory used to measure the degree of interdependence between two signals in a message. In the field of feature selection, the mutual information between feature t and category c_i reflects the relevance between the feature and the category. A feature t which appears with high probability in one category and with low probability in the other categories receives higher mutual information. MI can be expressed as:

MI(t, c_i) = \log \frac{P(t, c_i)}{P(t) P(c_i)}    (3)

where each symbol has the same meaning as in formula (2) above.

2.4 Chi-square Statistic (CHI)

\chi^2(t, c_i) = \frac{N (AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}    (4)

where A is the number of documents in which feature t and category c_i occur together; B is the number of documents in which feature t appears but which do not belong to category c_i; C is the number of documents which belong to category c_i but in which feature t does not appear; D is the number of documents in which neither feature t nor category c_i occurs; and N is the total number of texts. The CHI method assumes that the non-independent relationship between feature t and the text category is similar to a chi-square distribution with one degree of freedom. It is based on the following assumption: high-frequency words that occur in the specified category or in other categories of text help to determine whether an article belongs to category c_i.

The basic idea of the four feature selection methods is to calculate a statistical measure for every feature word, set a threshold T, filter out the features whose measure is smaller than T, and keep the rest as effective features. Table 1 shows the respective advantages and disadvantages of the four feature selection methods.
Table 1. Comparison of the advantages and disadvantages of the feature selection methods

Document frequency (DF)
Advantage: Low computational complexity, capable of handling large-scale classification tasks; it is the simplest feature selection method.
Disadvantage: Does not agree with the widely accepted theory of information retrieval; ignores the role of high-frequency words.

Information gain (IG)
Advantage: A feature selection method widely used in machine learning.
Disadvantage: When a word does not occur, the effectiveness of the information gain is greatly reduced; the statistics are expensive to compute.

Mutual information (MI)
Advantage: Takes low-frequency words, which carry a certain amount of information, into account.
Disadvantage: Easily leads to over-learning; negative correlation is not considered; ignores the dispersion and concentration of features across categories, resulting in over-fitting to individual features.

Chi-square statistic (CHI)
Advantage: Considers the "negative correlation" of words; good classification results.
Disadvantage: Statistics are expensive to compute and ineffective for low-frequency words.
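The count-based evaluation functions above can be computed directly from document counts. The following is a minimal C++ sketch, given only as an illustration under stated assumptions: the toy contingency counts, the struct `Counts` and the helper names such as `chiSquare` are hypothetical and not taken from the paper. It computes the document frequency of formula (1), the count-based estimate of the mutual information of formula (3), and the chi-square statistic of formula (4).

```cpp
// Minimal sketch of the count-based evaluation functions (DF, MI, CHI).
// A, B, C, D follow the definitions used with formula (4): A = documents of
// class c containing t, B = documents of other classes containing t,
// C = documents of class c without t, D = the rest.
#include <cmath>
#include <iostream>

struct Counts {
    double A, B, C, D;                 // contingency counts for (t, c)
    double N() const { return A + B + C + D; }
};

// Formula (1): fraction of training documents that contain feature t.
double documentFrequency(const Counts& k) {
    return (k.A + k.B) / k.N();
}

// Formula (3): MI = log( P(t,c) / (P(t)P(c)) ), estimated from the counts
// as log( A*N / ((A+C)*(A+B)) ).
double mutualInformation(const Counts& k) {
    if (k.A == 0.0) return 0.0;        // avoid log(0) for absent features
    return std::log(k.A * k.N() / ((k.A + k.C) * (k.A + k.B)));
}

// Formula (4): chi-square statistic between feature t and class c.
double chiSquare(const Counts& k) {
    double num = k.N() * std::pow(k.A * k.D - k.C * k.B, 2.0);
    double den = (k.A + k.C) * (k.B + k.D) * (k.A + k.B) * (k.C + k.D);
    return den == 0.0 ? 0.0 : num / den;
}

int main() {
    Counts k{40, 10, 60, 390};         // hypothetical counts for one (t, c) pair
    std::cout << "DF  = " << documentFrequency(k) << "\n"
              << "MI  = " << mutualInformation(k) << "\n"
              << "CHI = " << chiSquare(k)        << "\n";
}
```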

3. Combining the genetic algorithm with the traditional text feature selection methods
Genetic algorithms imitate the biological process of natural selection and evolution; a genetic algorithm is a random, population-based global optimization algorithm. It encodes the problem (its parameters) to be solved, generates an initial group of solutions in the solution space, and gradually evolves towards the global optimal solution through genetic operations. As a relatively mature method, the genetic algorithm has been discussed in a large body of literature; for further details please refer to the literature [1, 2, 3]. The experimental idea of this article is: use a traditional text feature selection method (DF, IG, MI, CHI) to select candidate text features, then use the genetic algorithm to filter them, and eventually select the feature items which suit text classification (a sketch of this screening stage is given after step [T8]).

3.1 The combined algorithm of the genetic algorithm and the text feature selection methods

Input: the set of entries obtained after word segmentation.
Output: the text feature set.
Algorithm description:

[T1]. Use the Chinese Academy of Sciences word segmentation system to segment the text and obtain the entry set T.

[T2]. Apply equations (1), (2), (3) and (4) to T to perform the traditional text feature selection; the result is T1.

[T3]. Use the entries of T1 as the encoding of the genetic algorithm: an entry that appears is 1, an entry that does not appear is 0, giving a chromosome over the set {0, 1}.

[T4]. Use the TF-IDF formula to calculate the weights:

w_i = tf_i \times \log \frac{N}{n_i}    (5)

where N is the number of all documents, n_i is the number of documents which contain the term t_i, and tf_i indicates the frequency with which the term t_i appears in document d.

[T5]. Fitness function:


fit(s_i) = \log \sum_{i=1, j=1}^{n} \frac{\left(\sum_{k} t_{ik} t_{jk}\right) + 1}{\sqrt{\sum_{k} t_{ik}^2 \cdot \sum_{k} t_{jk}^2}}    (6)

where t_{ik} and t_{jk} are elements of the vectors T_i and T_j.


[T6]. Selection operator. In this paper the roulette selection method is used; the basic idea is that the probability of an individual being selected is directly proportional to its fitness:

P(s_i) = \frac{fit(s_i)}{\sum_{j=1}^{M} fit(s_j)}    (7)

where M is the size of the population.

[T7]. Crossover operator. This article adopts the insert crossover method; the specific algorithm is as follows: 1) randomly select the parent samples and determine the insertion point and the gene fragment; 2) insert the gene fragment; 3) delete duplicate genes.

[T8]. Mutation operator. In this algorithm, a chromosome is selected at random; according to the entry weights, one gene (i.e. one feature word) is chosen by the roulette method and deleted, and a gene which is not in the chromosome is randomly selected from the glossary and put at that location, thus forming a new member of the next generation.
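A minimal C++ sketch of the screening stage described in steps [T3]-[T8] is given below. It is an illustration under stated assumptions, not the paper's VC++ implementation: the fitness is a stand-in (total weight of the selected terms with a small size penalty) rather than formula (6), a copied-segment crossover replaces the insert crossover, and the weight-guided mutation is simplified to dropping one selected term and adding one unselected term; the random weights stand in for the TF-IDF values of step [T4].

```cpp
// Sketch of GA screening over a binary chromosome of candidate terms.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

using Chromosome = std::vector<int>;   // 1 = term selected, 0 = not selected

std::mt19937 rng{42};

// Stand-in fitness (not formula (6)): reward the total weight of the
// selected terms and penalise the number of selected terms.
double fitness(const Chromosome& c, const std::vector<double>& weight) {
    double total = 0.0; int chosen = 0;
    for (size_t k = 0; k < c.size(); ++k)
        if (c[k]) { total += weight[k]; ++chosen; }
    return std::max(0.01, total - 0.05 * chosen);   // keep fitness positive
}

// Step [T6]: roulette-wheel selection, probability proportional to fitness.
size_t roulette(const std::vector<double>& fit) {
    double sum = std::accumulate(fit.begin(), fit.end(), 0.0);
    std::uniform_real_distribution<double> u(0.0, sum);
    double r = u(rng), acc = 0.0;
    for (size_t i = 0; i < fit.size(); ++i) {
        acc += fit[i];
        if (r <= acc) return i;
    }
    return fit.size() - 1;
}

// Step [T7] (simplified): copy a random segment of one parent into the other.
Chromosome crossover(const Chromosome& a, const Chromosome& b) {
    std::uniform_int_distribution<size_t> pos(0, a.size() - 1);
    size_t lo = pos(rng), hi = pos(rng);
    if (lo > hi) std::swap(lo, hi);
    Chromosome child = b;
    for (size_t k = lo; k <= hi; ++k) child[k] = a[k];
    return child;
}

// Step [T8] (simplified): drop one selected term and add one unselected term.
void mutate(Chromosome& c) {
    std::vector<size_t> on, off;
    for (size_t k = 0; k < c.size(); ++k) (c[k] ? on : off).push_back(k);
    if (on.empty() || off.empty()) return;
    c[on[std::uniform_int_distribution<size_t>(0, on.size() - 1)(rng)]] = 0;
    c[off[std::uniform_int_distribution<size_t>(0, off.size() - 1)(rng)]] = 1;
}

int main() {
    const size_t terms = 20, popSize = 30, generations = 50;
    std::vector<double> weight(terms);            // hypothetical TF-IDF weights
    std::uniform_real_distribution<double> w(0.0, 1.0);
    for (auto& x : weight) x = w(rng);

    std::vector<Chromosome> pop(popSize, Chromosome(terms));
    std::bernoulli_distribution bit(0.5);
    for (auto& c : pop) for (auto& g : c) g = bit(rng) ? 1 : 0;

    for (size_t gen = 0; gen < generations; ++gen) {
        std::vector<double> fit(popSize);
        for (size_t i = 0; i < popSize; ++i) fit[i] = fitness(pop[i], weight);
        std::vector<Chromosome> next;
        while (next.size() < popSize) {
            Chromosome child = crossover(pop[roulette(fit)], pop[roulette(fit)]);
            mutate(child);
            next.push_back(child);
        }
        pop.swap(next);
    }
    std::cout << "best fitness after evolution: "
              << fitness(*std::max_element(pop.begin(), pop.end(),
                     [&](const Chromosome& x, const Chromosome& y) {
                         return fitness(x, weight) < fitness(y, weight);
                     }), weight) << "\n";
}
```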

4. Experimental results and conclusion


The test used 8,200 articles divided into five categories: politics, military affairs, entertainment, education and livelihood. 3,500 of the articles were used as the training set and the rest as the test set. Document Frequency (DF), Information Gain (IG), Mutual Information (MI) and the chi-square statistic (CHI) were tested under the Rainbow system. The genetic algorithm was implemented as a VC++ program. Average accuracy, which is also the evaluation measure of the Rainbow system, was used as the evaluation indicator; it is the mean of the correct classification rates obtained from repeated experiments on the given training set. The experimental results are shown in Table 2.
Table 2. Comparison of the results of the feature selection methods

Feature selecting method        Average accuracy (%)
Document Frequency (DF)         33.58
Information Gain (IG)           85.36
Mutual Information (MI)         31.25
Chi-square Statistic (CHI)      52.14
Genetic Algorithm (GA)          78.28
DF+GA                           41.54
IG+GA                           88.65
MI+GA                           42.65
CHI+GA                          65.32

As for the identification of text features, three of the most common evaluation indicators were applied in the test: Precision (P), Recall Rate (R) and the aggregative indicator value (F). Their definitions are as follows:

(1) Feature identification precision, the proportion of the identified character strings that are indeed feature words:

P = (the number of feature words which are correctly identified) / (the total number of feature words determined by the system) × 100%

(2) Feature identification recall rate, the proportion of the text feature words that are identified:

R = (the number of feature words which are correctly identified) / (the total number of feature words in the corpus) × 100%

(3) Aggregative indicator value F:

F = \frac{2 P R}{P + R}

The test system was also implemented as a VC++ program and used the same corpus as above. The results are shown in Table 3.
Table 3. Comparison of the experimental results

Feature selecting algorithm     Precision    Recall rate    Aggregative indicator F
Document Frequency (DF)         78.12        69.36          73.48
Information Gain (IG)           88.54        82.98          85.67
Mutual Information (MI)         82.11        79.41          80.74
Chi-square Statistic (CHI)      87.32        81.84          84.49
Genetic Algorithm (GA)          80.39        80.12          80.25
DF+GA                           79.68        75.15          77.35
IG+GA                           89.94        88.57          89.25
MI+GA                           84.28        83.35          83.81
CHI+GA                          88.64        83.51          86.00
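As a quick consistency check of the F definition, the aggregative indicator in Table 3 can be recomputed from the precision and recall columns. The short C++ sketch below does this for the Document Frequency row, using the values taken directly from Table 3.

```cpp
// Worked check of the aggregative indicator: F = 2PR / (P + R).
// P and R are the Document Frequency row of Table 3 (in %).
#include <iostream>

int main() {
    double P = 78.12, R = 69.36;
    double F = 2.0 * P * R / (P + R);
    std::cout << "F = " << F << " %\n";   // prints roughly 73.48
}
```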

The experiments show that, in terms of the evaluation indicators, the filtering process is significantly improved by combining the genetic algorithm with a feature selection method, compared with using either a feature selection method or a genetic algorithm alone. However, the combination costs a large amount of time; this is an aspect that needs to be improved through future research. Among the combinations of the four selection methods with the genetic algorithm, the effect of IG+GA is the best, followed by CHI+GA; the effect of MI+GA is the poorest, and DF+GA runs fastest.

Acknowledgements

This work is supported by the Humanity and Science Foundation of the Ministry of Education under Grant No. 09YJC870001 and the Natural Science Foundation of the Yunnan Education Department of China under Grant No. 2011Y315.

References:
[1] Zhaoqi Bian, Xuegong Zhang. Pattern Recognition [M]. Beijing: Tsinghua University Press, 2000.
[2] Jianchao Xu, Ming Hu. Chinese Web text characteristics of acquisition and classification [J]. Computer Engineering, 2005, 31(8): 24-26.
[3] Miettinen K, Neittaanmaki P, Makela M M. Evolutionary Algorithms in Engineering and Computer Science [M]. New York: Wiley, 1999.
[4] Guoliang Chen, Xufa Wang, Zhenquan Zhuang, et al. Genetic Algorithm and Its Application [M]. Beijing: People's Posts and Telecommunications Press, 1996.
[5] McCallum, Andrew Kachites. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering [DB/OL]. http://www.cs.cmu.edu/~mccallum/bow, 1996 (2003-02-08).

Biography: Yuping Fang, born in 1977, female, Han nationality, from Dali, Yunnan; Master's degree, lecturer. Main research directions: natural language processing, text mining, etc. Email: fangyuping728@sina.com
