
International Journal of Computational Intelligence and Information Security, August 2011, Vol. 2, No. 8

Integrated Genetic Algorithm in Data Mining: A Review


Pradeep Singh Raghav1, Parmalik Kumar2, Nikhil Singh3
Computer Science Dept., Patel College of Science & Technology, Bhopal 462023, India
Pradeep.raghav11@gmail.com, Parmalik83@gmail.com, Nikhil30_singh@yahoo.co.in

Abstract
Data mining is a method for extracting features and knowledge from databases. In this paper, we survey the application of integrated genetic algorithms in data mining and set out the basic idea and the key design questions of a data mining algorithm based on the genetic algorithm, namely knowledge rule coding, fitness function definition and knowledge rule expression.

Keywords: Genetic Algorithm, Data Mining, Knowledge Expression, Knowledge Rule

1. Introduction
Data mining is the process of extracting previously unknown and useful information and knowledge from large amounts of incomplete, uncertain, fuzzy and random data. Similar processes include knowledge discovery, data analysis, data fusion and decision-making support [1]. The types of knowledge discovered by data mining (DM) include generalized, characteristic, contrast, correlative, predictive and deviation knowledge, while classification, clustering, dimension reduction, pattern recognition, visualization, decision trees, genetic algorithms and uncertainty handling are the tools and methods normally used for the discovery [2].

1.2 Data mining classes
Data mining commonly involves four classes of tasks [17]:

Association rule learning searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

Clustering is the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.

Classification is the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.

Regression attempts to find a function which models the data with the least error.
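To make the market basket example concrete, the following is a minimal Python sketch that counts how often pairs of products are bought together and estimates the confidence of a rule such as "bread implies milk". The baskets, product names and function names are invented for illustration; they are assumptions and do not come from the surveyed literature.

```python
from collections import Counter
from itertools import combinations

def pair_support(transactions):
    """Return the fraction of transactions that contain each unordered item pair."""
    counts = Counter()
    for basket in transactions:
        # Count each pair once per basket, in a canonical (sorted) order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Estimate how often `consequent` appears in baskets that contain `antecedent`."""
    with_a = [b for b in transactions if antecedent in b]
    if not with_a:
        return 0.0
    return sum(consequent in b for b in with_a) / len(with_a)

# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]
```

Pairs with high support are candidates for "frequently bought together" rules, and confidence indicates how reliable a directed rule is. Real association rule miners prune the search space rather than enumerating every pair as this sketch does.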

2. Description of Genetic Algorithm


2.1 Genetic algorithm
The genetic algorithm (GA) simulates the process of biological evolution to carry out an optimizing search; it is essentially a random search method based on that simulation [6]. Its basic principle is to map the objective function to be optimized onto the fitness of a biological population in its environment, to map candidate solutions onto the individuals of that population, and to treat the search for an optimal solution as analogous to the evolution of the population [7]. Three features distinguish the genetic algorithm from traditional optimization algorithms: high robustness, global search capability and inherent parallelism.

GAs are search algorithms based on the mechanics of natural selection and genetics; they apply survival of the fittest among string structures. GAs have been shown to be effective and robust in searching very large spaces in a wide range of applications. They are particularly suitable for multi-parameter optimization problems with an objective function subject to numerous hard and soft constraints. Financial applications of GAs are growing, with successful applications in trading systems, stock selection, portfolio selection, bankruptcy prediction, credit evaluation and budget allocation.

The main idea of GAs is to start with a population of solutions to a problem and to try to produce new generations of solutions that are better than the previous ones. GAs operate through a simple cycle consisting of four stages: initialization, selection, crossover and mutation. Figure 1 shows the basic steps of genetic algorithms.

Figure 1. Basic steps of genetic algorithms

In the initialization stage, a population of genetic structures (called chromosomes), randomly distributed in the solution space, is selected as the starting point of the search. These chromosomes can be encoded using a variety of schemes, including binary strings, real numbers or rules. After the initialization stage, each chromosome is evaluated using a user-defined fitness function. The goal of the fitness function is to encode the performance of the chromosome numerically. For real-world applications of optimization methods such as GAs, the choice of the fitness function is the most critical step.

The mating convention for reproduction is such that only the high-scoring members preserve and propagate their worthy characteristics from generation to generation and thereby help to continue the search for an optimal solution. Chromosomes with high performance may be chosen for replication several times, whereas poorly performing structures may not be chosen at all. Such a selective process causes the best-performing chromosomes to occupy an increasingly large proportion of the population over time.

Crossover forms new offspring from two randomly selected 'good parents'. It operates by swapping corresponding segments of the string representations of the parents and extends the search for new solutions in far-reaching directions. Crossover occurs only with some probability. Many types of crossover can be performed, including one-point, two-point and uniform crossover.

Mutation is a GA mechanism in which we randomly choose a member of the population and change one randomly chosen bit in its bit-string representation. Although reproduction and crossover produce many new strings, they do not introduce any new information into the population at the bit level. If the mutant member is feasible, it replaces the member that was mutated in the population. The presence of mutation ensures that the probability of reaching any point in the search space is never zero.
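The four-stage cycle described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an implementation from any of the cited works: chromosomes are bit lists, selection is fitness-proportionate (roulette wheel), crossover is one-point, and all names and default parameter values (`run_ga`, `p_cross`, `p_mut` and so on) are hypothetical.

```python
import random

def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length bit lists at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, rng):
    """Flip one randomly chosen bit, returning a new list."""
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

def select(population, fitness, rng):
    """Fitness-proportionate selection of one parent (assumes fitness >= 0)."""
    weights = [fitness(ind) + 1e-9 for ind in population]  # avoid all-zero weights
    return rng.choices(population, weights=weights, k=1)[0]

def run_ga(fitness, n_bits=20, pop_size=30, generations=60,
           p_cross=0.8, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    # Initialization: random chromosomes scattered over the search space.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1 = select(population, fitness, rng)
            p2 = select(population, fitness, rng)
            # Crossover occurs only with probability p_cross.
            if rng.random() < p_cross:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            # Each child is mutated with probability p_mut.
            for child in (c1, c2):
                if rng.random() < p_mut:
                    child = mutate(child, rng)
                offspring.append(child)
        population = offspring[:pop_size]
        # Remember the best chromosome seen in any generation.
        best = max(population + [best], key=fitness)
    return best

def count_ones(bits):
    """Toy fitness function: the number of 1 bits in the chromosome."""
    return sum(bits)
```

Calling `run_ga(count_ones)` searches for a bit string with many 1s. The toy fitness stands in for the user-defined fitness function that the text identifies as the most critical design choice.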


2.2 Genome coding
An important process in GA is coding, that is, the transformation of a solution of the optimization problem from its parameter form into a genome code string. Taking the coding of a knowledge classification rule as an example, the genetic code string of a rule (individual) is a binary code built according to the features of the rule [8]. A rule is divided into two parts, the rule characteristic (premise) and the rule sort (conclusion), for example:

IF PRICE(moderate, low) QUALITY(high, normal) SERVICE(good) THEN CLASS=acc

The characteristic attributes are assumed to be discrete (any numerical attribute is discretized first). If a discrete attribute has k possible values, then k bits are allocated to it in the binary string, one bit per value: 0 means that the value does not appear in the disjunction, and 1 means that it does. To simplify the algorithm, the sort (class) of an individual is expressed as a contiguous binary code for the sort attribute. In the rule above, if the value field of the characteristic attribute PRICE is {high, moderate, low}, that of QUALITY is {high, normal, poor}, that of SERVICE is {good, bad}, and the value field of the sort attribute is {uacc, acc, good, vgood}, coded as 00, 01, 10 and 11 respectively, then the rule is expressed as the genome (binary string) 0111101001, and rules correspond to genomes one to one. Likewise, the binary string 0011001011 corresponds to the rule:

IF PRICE(low) QUALITY(high) SERVICE(good) THEN CLASS=vgood

If the code of a characteristic attribute is all 1s, then that attribute does not affect the validity of the rule, whatever its value.
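The coding scheme can be sketched in Python as follows. The attribute domains and class codes are those of the example in the text; the helper names `encode_rule` and `decode_rule` and the dictionary-based rule representation are assumptions made for illustration.

```python
# Attribute domains, in the bit order used by the example in the text.
PREMISE = [
    ("PRICE", ["high", "moderate", "low"]),
    ("QUALITY", ["high", "normal", "poor"]),
    ("SERVICE", ["good", "bad"]),
]
CLASSES = ["uacc", "acc", "good", "vgood"]  # coded 00, 01, 10, 11

def encode_rule(premise, cls):
    """Encode a rule as a bit string.

    `premise` maps an attribute name to the set of values it allows;
    an attribute that is absent is unconstrained and is coded as all 1s.
    """
    bits = ""
    for name, domain in PREMISE:
        allowed = premise.get(name, set(domain))
        bits += "".join("1" if value in allowed else "0" for value in domain)
    return bits + format(CLASSES.index(cls), "02b")

def decode_rule(bits):
    """Invert encode_rule: return (premise dict, class name)."""
    premise, pos = {}, 0
    for name, domain in PREMISE:
        chunk = bits[pos:pos + len(domain)]
        pos += len(domain)
        if "0" in chunk:  # an all-1s chunk places no constraint on the attribute
            premise[name] = {v for v, b in zip(domain, chunk) if b == "1"}
    return premise, CLASSES[int(bits[pos:pos + 2], 2)]

# The first rule from the text: 3 + 3 + 2 premise bits plus 2 class bits.
example = encode_rule(
    {"PRICE": {"moderate", "low"}, "QUALITY": {"high", "normal"}, "SERVICE": {"good"}},
    "acc",
)
```

Because every genome is a fixed-length string of 10 bits, the crossover and mutation operators of the previous section can be applied to rules directly.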
As an example of an all-1s attribute code, if PRICE is coded as 111 and the whole rule is coded as 1111001010, the rule means that if the commodity quality is high and the service is good, then the commodity is classed as good regardless of price (the final two bits, 10, encode CLASS=good). Conversely, the rule that a commodity with bad service is unacceptable, that is, IF SERVICE(bad) THEN CLASS=uacc, corresponds to the binary code string 1111110100.

2.3 Definition of the fitness function
When GA is used to mine classification rules from a knowledge database, good rules are retained and act as the parent-generation rules for reproduction, crossover and mutation until the optimized rule group is discovered. A good rule is one that matches well with the instances in the test data set (also called records, which include characteristic and class attributes), and the fitness function should reflect the degree of matching between the rule and the data set. Three important parameters of a rule are considered in the definition of the fitness function: accuracy, utility and coverage. They are described below, where U is a test data set and e is an instance (element) of it.

Accuracy: The accuracy of a rule ri measures the degree of matching between the instances in the test data set U and the rule, as given by the following formula:

$$\mathrm{Accuracy}(r_i) = \frac{|u_{r_i}|}{|u_{r_i}| + |u'_{r_i}|}$$

In which $u_{r_i}$ is the subset of the test data set U whose elements (instances) fully match the rule ri to be evolved, and $|u_{r_i}|$ is the cardinality of that subset. $u'_{r_i}$ is also a subset of U, in which only the characteristic part (premise) of each element matches the rule ri, and $|u'_{r_i}|$ is its cardinality. Evidently, the higher the accuracy of a rule, the more the rule can be trusted. If the accuracy is 1, then the rule always holds within the test data set U: wherever the condition of the rule holds, its conclusion also holds.

Utility: Within the test data set U, an instance e may match several rules to be evolved, and each of those rules then has a different, measurable utility. If an instance e in U matches only one rule in the current population, then the utility of that rule is 1. If an instance e matches m rules in the current population, then each of those rules has a utility of 1/m, as indicated by the following formula:

$$\mathrm{Utility}(r_i) = \frac{\delta(r_i, e)}{\sum_{r \in P} \delta(r, e)}, \qquad \delta(r, e) = \begin{cases} 1 & \text{if the instance } e \text{ successfully matches the rule } r \\ 0 & \text{otherwise} \end{cases}$$

In which U and e are defined as above, while r is a rule within the current population P.

Coverage: The coverage of the rule ri is the number of instances e within the test data set U that match the premise part (characteristic part) of ri, as indicated by the following formula, where $u_{r_i}$ and $u'_{r_i}$ are the subsets defined under Accuracy:

$$\mathrm{Coverage}(r_i) = |u_{r_i}| + |u'_{r_i}|$$
Evidently, the higher the coverage of a rule, the more general the rule is. Since the values of all three parameters affect the fitness of a rule, higher values of each are desirable.

Before the fitness function of a rule is fixed, the relationships between rules should be analyzed. There are three possible relationships between rules: contain, contradictory and redundant.

Contain:
IF PRICE(normal, low) SERVICE(good) THEN CLASS=acc
IF SERVICE(good) THEN CLASS=acc

Contradictory:
IF PRICE(high) QUALITY(poor) SERVICE(bad) THEN CLASS=acc
IF PRICE(high) QUALITY(poor) SERVICE(bad) THEN CLASS=uacc

Redundant: the simplest case, in which two rules are identical.

The relationships between rules should be considered in the design of the fitness function, and contained, contradictory and redundant rules should be deleted to ensure that every parent-generation rule selected is a good rule. The algorithm to evaluate rule fitness is as follows:

Step 1: Let U be a test data set and P a rule (genome) population of size N;
Step 2: Calculate Accuracy(ri) and Utility(ri) for each rule ri (i = 1, 2, ..., N) in the current population P;
Step 3: Arrange the rules ri in descending order of the product of Accuracy(ri) and Utility(ri), and write the result into the sort table;
Step 4: Calculate the Coverage(rj) of the current rule rj in the sort table (initially j = 0, the first rule), and set fitness(rj) = Coverage(rj) * Accuracy(rj) * Utility(rj);
Step 5: Delete the instances covered by the rule rj from U, i.e. $U = U - (u_{r_j} \cup u'_{r_j})$;
Step 6: If rj is the last rule in the sort table, the fitness of all rules has been calculated and the algorithm exits; otherwise set j = j + 1 and go to Step 4.
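The six steps can be sketched in Python as follows. Several points are assumptions rather than facts from the source: a rule is a pair (premise dictionary, class name), an instance is a dictionary with a "CLASS" key, accuracy is taken as the share of premise-matching instances that also carry the rule's class, "matches" in the utility definition is read as a full match, and the per-instance share 1/m is averaged over the instances a rule matches to obtain one utility value per rule. All function names are hypothetical.

```python
def premise_matches(rule, e):
    """True if instance e satisfies every attribute constraint in the premise."""
    premise, _ = rule
    return all(e[attr] in allowed for attr, allowed in premise.items())

def fully_matches(rule, e):
    """True if e satisfies the premise and also carries the rule's class."""
    return premise_matches(rule, e) and e["CLASS"] == rule[1]

def accuracy(rule, data):
    """Share of premise-matching instances that also match the conclusion."""
    covered = [e for e in data if premise_matches(rule, e)]
    if not covered:
        return 0.0
    return sum(fully_matches(rule, e) for e in covered) / len(covered)

def coverage(rule, data):
    """Number of instances whose attributes match the premise."""
    return sum(premise_matches(rule, e) for e in data)

def utility(rule, rules, data):
    """ASSUMPTION: average of 1/m over the instances this rule fully matches,
    where m is the number of rules in `rules` (which must include `rule`)
    that fully match the same instance."""
    shares = [1 / sum(fully_matches(r, e) for r in rules)
              for e in data if fully_matches(rule, e)]
    return sum(shares) / len(shares) if shares else 0.0

def evaluate_fitness(rules, data):
    # Step 2: accuracy and utility of every rule on the full data set.
    base = [accuracy(r, data) * utility(r, rules, data) for r in rules]
    # Step 3: sort table, descending by accuracy * utility.
    order = sorted(range(len(rules)), key=lambda i: base[i], reverse=True)
    remaining = list(data)
    fitness = [0.0] * len(rules)
    for j in order:
        # Step 4: coverage is measured on the instances not yet covered.
        fitness[j] = coverage(rules[j], remaining) * base[j]
        # Step 5: remove the instances covered by this rule.
        remaining = [e for e in remaining if not premise_matches(rules[j], e)]
    return fitness  # Step 6: the loop ends after the last rule in the table.

# Hypothetical two-rule population and four-instance data set.
rules = [({"SERVICE": {"good"}}, "acc"), ({"SERVICE": {"bad"}}, "uacc")]
data = [
    {"SERVICE": "good", "CLASS": "acc"},
    {"SERVICE": "good", "CLASS": "acc"},
    {"SERVICE": "good", "CLASS": "uacc"},
    {"SERVICE": "bad", "CLASS": "uacc"},
]
```

Because covered instances are removed before the next rule is scored, a rule that only repeats what a better-ranked rule already covers receives a low coverage and therefore a low fitness.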

3. Genetic Algorithms related works

3.1 GA for Data Mining in Web based EHS
The authors of [11] show how to apply genetic algorithms to the data mining of student information obtained in a Web-based Educational Adaptive Hypermedia System. The objective is to obtain interesting association rules so that the teacher can improve the performance of the system. To check the proposed algorithm, they use a Web-based course developed for medical students. They first describe the proposed methodology, then the specific characteristics of the course and the information obtained about the students, and finally the implemented genetic algorithm, the rules discovered and the conclusions.

Having tested their genetic algorithms on the Web-based hypermedia course and shown that they can produce potentially useful results, the authors plan to apply them to this and other adaptive hypermedia courses to test how they can be used to improve the adaptive features. The preliminary results are promising and show that the genetic algorithm is a good alternative for extracting a small set of comprehensible rules, which is important in the context of data mining. The authors are developing several courses with the AHA system and expect to obtain interesting rules to improve course adaptation. They chose the AHA system because it is a generic model of an adaptive hypermedia system with a high degree of adaptation. They are also developing a more sophisticated evolutionary algorithm based on genetic programming to obtain more complex and interesting rules [11].

3.2 Data Mining in Banking Management


The work in [12] introduces the concepts of data mining technology and customer relationship management and analyzes the advantages and disadvantages of decision trees and neural networks. It argues that a fusion of the decision tree and neural network algorithms is needed in bank customer management systems, and explains in detail how the fusion algorithm is applied to customer relationship management in the banking system. Its keywords are data mining, customer relationship management, decision tree and neural network.

The authors study the decision tree algorithm and the neural network and compare the two. Based on the complementarity of decision trees and neural networks, they propose that a decision tree can be used to construct the input layer of a neural network to create a better algorithm. The algorithm still has shortcomings, such as the difficulty of determining the hidden layers of the neural network, which need to be improved in future study [12].

3.3 Effective DM by IG Algorithm [13]


Dividing a data set into a training set and a test set is a fundamental component of the pre-processing phase of data mining (DM), and the choice of the training set is an important factor in deriving good classification rules. The traditional approach to association rule mining (ARM) divides the dataset into a training set and a test set using statistical methods. The authors of [13] highlight the weaknesses of the existing approach and propose a new methodology that employs a genetic algorithm (GA) in the process. In their approach, the original dataset is divided into sample and validation sets, and GA is then used to find an appropriate split of the sample set into training and test sets. Their experiments demonstrate that using the resulting training set as the input to an association rule mining algorithm generates high-accuracy classification rules. The rules are tested on the validation set for accuracy, and the authors report very satisfactory results that demonstrate the applicability and effectiveness of the approach.

The work focuses specifically on the appropriate split for deriving the best classification rule set from the output of an association rule mining model. The data-splitting problem for the ARM technique is broken into the following tasks: the entire dataset should represent the population as closely as possible; different samples are simulated by using the traditional methods to split the dataset into (a) a sample set and (b) a validation set; and each such sample set should be divided into training and test sets efficiently, so that (a) the training set represents the true classification relationships in the sample dataset as much as possible without overfitting, and (b) the test set detects overtraining of the trained classification model as much as possible.

The advantages are: 1) the primary problem of simulating the sample and validation sets of the real world is addressed; and 2) one approach is identified to define the right data split for efficient ARM, namely to distribute the classification relationships between the splits evenly. Experimental validation verifies that the CARM model using TRS derived from such a split performs with good accuracy on the validation set.
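One way to picture a GA-driven split is to let each chromosome be a bit mask over the sample set, with 1 meaning "training" and 0 meaning "test". The sketch below is an assumption made for illustration, not the encoding or fitness function of [13]: it scores a mask by how evenly the class labels are distributed between the two splits, loosely following the idea of distributing classification relationships evenly. The function names and the convention that the label is the last field of a record are hypothetical.

```python
from collections import Counter

def split_by_mask(records, mask):
    """Decode a chromosome: bit 1 sends a record to training, bit 0 to test."""
    train = [r for r, bit in zip(records, mask) if bit == 1]
    test = [r for r, bit in zip(records, mask) if bit == 0]
    return train, test

def class_balance_fitness(records, mask, label=lambda r: r[-1]):
    """Return 1 minus the total variation distance between the class
    distributions of the two splits (1.0 = identical, 0.0 = disjoint)."""
    train, test = split_by_mask(records, mask)
    if not train or not test:
        return 0.0  # a split with an empty side is useless
    p_train = Counter(map(label, train))
    p_test = Counter(map(label, test))
    classes = set(p_train) | set(p_test)
    gap = sum(abs(p_train[c] / len(train) - p_test[c] / len(test))
              for c in classes)
    return 1.0 - gap / 2.0

# Hypothetical records: (identifier, class label).
records = [("a", "x"), ("b", "x"), ("c", "y"), ("d", "y")]
```

A fitness of this kind could be plugged into an ordinary GA loop over bit masks; a closer imitation of the surveyed approach would instead score each mask by the accuracy of the rules mined from the resulting training set.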

3.4 A data mining software tool integrating genetic fuzzy systems [14]


The work in [14] introduces the software tool KEEL for assessing evolutionary algorithms (EAs) for data mining problems, including regression, classification, clustering and pattern mining. It includes a large collection of genetic fuzzy system (GFS) algorithms based on different approaches: Pittsburgh, Michigan, IRL and GCCL. It allows a complete analysis of any genetic fuzzy system in comparison with existing ones, and includes a statistical test module for the comparison. The use of KEEL is illustrated through the analysis of one case study.

KEEL assesses EAs for DM problems, paying special attention to the GFS algorithms integrated in the tool. It relieves researchers of much technical work and allows them to focus on the analysis of their new GFS algorithms in comparison with existing ones. Moreover, the tool enables researchers with a basic knowledge of fuzzy logic and evolutionary computation to apply GFSs to their work. The case study illustrates the functionality of KEEL and the process of setting up an experiment. In this case, the results were contrasted through statistical analysis (pairwise comparisons with Wilcoxon's test), which showed that the SLAVE algorithm clearly outperforms the Chi-RW algorithm at a high level of significance (p = 0.0196).

The KEEL software tool is being continuously updated and improved. Its authors report that they are developing a new set of GFSs and a test tool for applying parametric and non-parametric tests to any set of data, data visualization tools for the on-line and off-line modules, and a data set repository that includes the data set partitions and algorithm results on these data sets, the KEEL-dataset.

3.5 Data Mining Using IG Algorithm [17]


Sequential pattern mining is the process of finding relationships between occurrences of sequential events, in order to determine whether the occurrences follow any specific order. The extraction of sequential patterns is not polynomial in execution time. Existing algorithms for sequential pattern mining can guarantee optimum solutions, but they do not take into account the time taken to reach them. The authors of [17] propose a new algorithm based on genetic concepts which gives a possibly non-optimal solution, but in a reasonable (polynomial) execution time. They apply the genetic algorithm to find frequent sequences in a telecommunication database in order to help telecommunication companies identify country codes that are related to one another. The companies can then identify the countries whose calls occur in a specific order and offer a discount on calls to these countries. The SPT-GA algorithm exploits the ability of evolutionary algorithms to discover the best rules in a short time with meaningful results.

3.6 IGA with CRF for Prediction [15]


Question informers play an important role in enhancing question classification for factual question answering. Previous work has used conditional random fields (CRF) to identify question informer spans. However, in CRF-based models, the selection of a feature subset is a key issue in improving the accuracy of question informer prediction. The authors of [15] propose a hybrid approach that integrates genetic algorithms (GA) with CRF to optimize feature subset selection in CRF-based question informer prediction models. Their experimental results show that the proposed hybrid GA_CRF model improves on the accuracy of traditional CRF models: by using GA to optimize the selection of the feature subset, the F-score improves from 88.9% to 93.87% and the number of features is reduced from 105 to 40.
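Feature subset selection maps naturally onto a bit-string chromosome: one bit per candidate feature, with 1 meaning that the feature is kept. The sketch below shows only this representation and how a fitness value could be obtained from it. The function names, the feature names and the `toy_evaluate` stand-in are hypothetical; in a setting like that of [15], the evaluation step would train a CRF on the chosen features and return its F-score.

```python
def selected_features(mask, feature_names):
    """Decode a chromosome into the list of features whose bit is 1."""
    return [name for name, bit in zip(feature_names, mask) if bit]

def subset_fitness(mask, feature_names, evaluate):
    """Score a chromosome by evaluating a model built on its feature subset."""
    chosen = selected_features(mask, feature_names)
    if not chosen:
        return 0.0  # an empty subset cannot train a model
    return evaluate(chosen)

def toy_evaluate(chosen):
    """Hypothetical stand-in for 'train a model and return its score':
    rewards two useful features and mildly penalises every other feature."""
    useful = {"word", "pos_tag"}
    hits = len(useful & set(chosen))
    return hits / (len(useful) + 0.1 * (len(chosen) - hits))

# Hypothetical candidate features.
names = ["word", "pos_tag", "length", "is_capitalised"]
```

With this representation, the standard crossover and mutation operators add or drop features, and selection favours subsets that score well, which is how a GA can both raise accuracy and shrink the feature set.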

3.7 GPA for Classification Task [16]


The focus of [16] is the application of the genetic programming (GP) framework to knowledge discovery in databases, more precisely to the task of classification. Genetic programming has certain advantages that make it suitable for data mining, such as the robustness of the algorithm and its convenient structure for rule generation. The study concentrates on one type of parallel genetic algorithm, the cellular (diffusion) model. Emphasis is placed on improving the efficiency and scalability of the data mining algorithm, which can be achieved by integrating the algorithm with databases and employing a cellular framework. A cellular model of genetic programming that exploits SQL queries is implemented and applied to the classification task, and the results are compared with those of other machine learning algorithms.

The implemented cellular GP algorithm achieved promising results on the datasets used. GP proved a suitable and convenient model for encoding classification rules because of its tree representation. The cellular approach appears to be an interesting parallel genetic model with certain advantages: faster convergence and better results, a simplified selection method with less calculation, prevention of the loss of good solutions, and a structure convenient for parallel implementation. However, further investigation and testing should be performed with larger datasets and parallel SQL servers. Furthermore, a comparison of the cellular model with other parallel GAs on a similar problem could bring interesting results.
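The cellular (diffusion) model can be illustrated with a short Python sketch of one generation. It is a simplification under stated assumptions, not the algorithm of [16]: individuals are bit lists rather than GP trees, no SQL is involved, the grid is toroidal with a four-cell neighbourhood, and the names `neighbours` and `cellular_step` are hypothetical.

```python
import random

def neighbours(r, c, rows, cols):
    """Up, down, left and right cells on a grid that wraps around its edges."""
    return [((r - 1) % rows, c), ((r + 1) % rows, c),
            (r, (c - 1) % cols), (r, (c + 1) % cols)]

def cellular_step(grid, fitness, rng, p_mut=0.05):
    """Compute one generation; the input grid is left unchanged."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            me = grid[r][c]
            # Selection is local: mate with the fittest neighbouring cell.
            mate = max((grid[i][j] for i, j in neighbours(r, c, rows, cols)),
                       key=fitness)
            cut = rng.randrange(1, len(me))
            child = me[:cut] + mate[cut:]
            if rng.random() < p_mut:
                k = rng.randrange(len(child))
                child = child[:k] + [1 - child[k]] + child[k + 1:]
            # The child replaces the cell only if it is at least as fit,
            # so a good solution is never overwritten by a worse one.
            if fitness(child) >= fitness(me):
                new_grid[r][c] = child
    return new_grid
```

Each cell only needs to see its immediate neighbours, which is what makes the model convenient for parallel implementation: cells can be updated independently from a snapshot of the previous generation.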

4. Conclusion
Genetic algorithms are original systems based on the presumed functioning of living organisms. The method differs from classical optimization algorithms in four ways:
1. It uses an encoding of the parameters, not the parameters themselves.
2. It works on a population of points, not a single point.
3. It uses only the values of the function to be optimized, not its derivatives or other auxiliary knowledge.
4. It uses probabilistic transition functions, not deterministic ones.

It is important to understand that such an algorithm is not guaranteed to succeed. The system is stochastic, and a genetic pool may be too far from the solution or, for example, overly fast convergence may halt the process of evolution. These algorithms are nevertheless extremely efficient and are used in fields as diverse as the stock exchange, production scheduling and the programming of assembly robots in the automotive industry.

References
[1] Tubao Ho, Trongdung Nguyen, Ducdung Nguyen, Saori Kawasaki, Visualization Support for User Centered Model Selection in Knowledge Discovery and Data Mining, International Journal of Artificial Intelligence Tools, 2001.
[2] Susan E. George, A Visualization and Design Tool (AVID) for Data Mining with the Self-Organizing Feature Map, International Journal of Artificial Intelligence Tools, 2000, 9(3), 369-375.
[3] Tzung-Pei Hong, Chan-Sheng Kuo, Sheng-Chai Chi, Trade-off Between Computation Time and Number of Rules for Fuzzy Mining from Quantitative Data, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2001, 9(5), 587-604.
[4] Vladimir Estivill-Castro, Jianhua Yang, Clustering Web Visitors by Base, Robust and Convergent Algorithms, International Journal of Foundations of Computer Science, 2002, 13(4), 497-520.
[5] R. Félix, T. Ushio, Binary Encoding of Discernibility Patterns to Find Minimal Coverings, International Journal of Software Engineering and Knowledge Engineering, 2002.
[6] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, 1989.
[7] Ai Lirong, He Huacan, Summarise on Genetic Algorithms, Journal of Computer Application and Research, 1997.
[8] Xiao Yong, Chen Yiyun, Constructing Decision Trees by Using Genetic Algorithm, Journal of Computer Research and Development, 1998.
[9] Xiaomin Zhong, Eugene Santos, Directing Genetic Algorithms for Probabilistic Reasoning through Reinforcement Learning, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2000.
[10] Gary William Grewal, Thomas Charles Wilson, An Enhanced Genetic Algorithm for Solving the High-Level Synthesis Problems of Scheduling, Allocation, and Binding, International Journal of Computational Intelligence and Applications, 2001.
[11] Cristóbal Romero, Sebastián Ventura, Carlos de Castro, Wendy Hall, Using Genetic Algorithms for Data Mining in Web based Educational Hypermedia Systems, 2004.
[12] Zhao Li Ping, Shu Qi Liang, Data Mining Application in Banking-Customer Relationship Management, International Conference on Computer Application and System Modeling (ICCASM 2010).
[13] Janaki Gopalan, Erkan Korkmaz, Reda Alhaj, Effective Data Mining by Integrating Genetic Algorithm into the Data Preprocessing Phase, IEEE, 2005.
[14] Jesus Alcala-Fdez, Salvador García, Francisco Jose Berlanga, Alberto Fernandez, Luciano Sanchez, M. J. del Jesus and Francisco Herrera, KEEL: A data mining software tool integrating genetic fuzzy systems, 3rd International Workshop on Genetic and Evolving Fuzzy Systems, Witten-Bommerholz, Germany, March 2008.
[15] Min-Yuh Day, Chun-Hung Lu, Chorng-Shyong Ong, Shih-Hung Wu, Integrating Genetic Algorithms with Conditional Random Fields to Enhance Question Informer Prediction, IEEE, 2006.
[16] Alexandra Takac, Cellular Genetic Programming Algorithm Applied to Classification Task, 2006.
[17] Mourad Ykhlef, Yousuf Aldukhayyil and Muath Alfawzan, Mining Sequential Patterns in Telecommunication Database Using Genetic Algorithm, 2007.
