1. Introduction
Data mining is the process of extracting previously unknown and useful information and knowledge from large amounts of incomplete, uncertain, noisy and random data. Related processes include knowledge discovery, data analysis, data fusion and decision-making support [1]. The types of knowledge discovered by DM include generalized, characteristic, contrast, correlative, predictive and deviation knowledge, while classification, clustering, dimensionality reduction, pattern recognition, visualization, decision trees, genetic algorithms and uncertainty handling are commonly used as the tools and methods for discovery [2].

1.2 Data mining classes

Data mining commonly involves four classes of tasks [17]. Association rule learning searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits; using association rule learning, it can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis. Clustering is the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data. Classification is the task of generalizing known structure to apply to new data; for example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayes classification, neural networks and support vector machines. Regression attempts to find a function which models the data with the least error.
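The market basket idea above rests on two measures, support and confidence. A minimal sketch in Python (the transaction data and function names are illustrative, not from the paper):

```python
# Hypothetical transaction data for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions matching the antecedent, the fraction
    that also contain the consequent."""
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / ante

print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```

A rule such as "bread => milk" is kept when both its support and confidence exceed user-chosen thresholds.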
International Journal of Computational Intelligence and Information Security, August 2011 Vol. 2, No. 8

with successful applications in trading systems, stock selection, portfolio selection, bankruptcy prediction, credit evaluation and budget allocation. The main idea of GAs is to start with a population of solutions to a problem and attempt to produce new generations of solutions that are better than the previous ones. GAs operate through a simple cycle consisting of four stages: initialization, selection, crossover, and mutation. Figure 1 shows the basic steps of genetic algorithms.
Figure 1. Basic steps of genetic algorithms

In the initialization stage, a population of genetic structures (called chromosomes), randomly distributed in the solution space, is selected as the starting point of the search. These chromosomes can be encoded using a variety of schemes, including binary strings, real numbers or rules. After the initialization stage, each chromosome is evaluated using a user-defined fitness function. The goal of the fitness function is to numerically encode the performance of the chromosome. For real-world applications of optimization methods such as GAs, the choice of the fitness function is the most critical step. The mating convention for reproduction is such that only high-scoring members preserve and propagate their worthy characteristics from generation to generation, thereby helping to continue the search for an optimal solution. Chromosomes with high performance may be chosen for replication several times, whereas poor-performing structures may not be chosen at all. Such a selective process causes the best-performing chromosomes in the population to occupy an increasingly larger proportion of the population over time. Crossover forms new offspring from two randomly selected 'good parents'. It operates by swapping corresponding segments of a string representation of the parents, extending the search for new solutions in far-reaching directions, and occurs only with some probability. There are many types of crossover: one-point, two-point, and uniform. Mutation is a GA mechanism in which we randomly choose a member of the population and change one randomly chosen bit in its bit-string representation. Although reproduction and crossover produce many new strings, they do not introduce any new information into the population at the bit level. If the mutant member is feasible, it replaces the member that was mutated in the population.
The presence of mutation ensures that the probability of reaching any point in the search space is never zero.
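The four-stage cycle described above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation; the fitness function, rates and population size are assumptions chosen for the example:

```python
import random

random.seed(0)

GENOME_LEN = 20
POP_SIZE = 30
CROSSOVER_RATE = 0.8
MUTATION_RATE = 0.02
GENERATIONS = 50

def fitness(chrom):
    # Toy fitness: count of 1-bits ("OneMax"). A real application
    # substitutes a problem-specific scoring function here.
    return sum(chrom)

def select(pop):
    # Tournament selection: the better of two random members wins.
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # One-point crossover, applied only with some probability.
    if random.random() < CROSSOVER_RATE:
        cut = random.randrange(1, GENOME_LEN)
        return p1[:cut] + p2[cut:]
    return p1[:]

def mutate(chrom):
    # Flip each bit with a small probability.
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit
            for bit in chrom]

# Initialization: random chromosomes scattered over the solution space.
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print(fitness(best))  # approaches GENOME_LEN as the population converges
```

Because the transition functions are probabilistic, repeated runs with different seeds give different trajectories, but selection pressure reliably drives the population toward high-fitness chromosomes.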
2.2 Genome coding

An important process in a GA is coding, that is, the transfer from the parameter form of a solution of the optimized problem to its expression as a genome code string. Taking the coding of classification rules as an example, the genetic code string for a rule (individual) is realized with a binary code according to the structure of the classification rule [8]. A rule is divided into two parts, the rule characteristic (premise) and the rule sort (conclusion), for example: IF PRICE(moderate, low) QUALITY(high, normal) SERVICE(good) THEN CLASS=acc. Suppose that each characteristic attribute is discrete (any numerical attribute would be discretized). If a discrete attribute has k possible values, then k bits are allocated in the binary string, one bit per value: 0 means the value is absent from the disjunction, 1 means it is present. To simplify the algorithm, the sort of an individual is expressed as a contiguous binary string over the sort attribute. For the rule above, if the value field of the characteristic attribute PRICE is {high, moderate, low}, the value field of QUALITY is {high, normal, poor}, the value field of SERVICE is {good, bad}, and the values of the sort attribute {uacc, acc, good, vgood} correspond to the codes 00, 01, 10 and 11 respectively, then the rule is expressed by the genome (binary string) 0111101001, with rules corresponding to genomes one to one. Likewise, the binary string 0011001011 corresponds to the rule: IF PRICE(low) QUALITY(high) SERVICE(good) THEN CLASS=vgood. If a characteristic attribute's code consists entirely of 1s, then that attribute does not affect the validity of the rule, whatever its value.
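A minimal sketch of this coding scheme, assuming the attribute value orderings given above (the encode function name is illustrative):

```python
# Attribute value orderings as given in the text; the class labels
# {uacc, acc, good, vgood} map to 00, 01, 10, 11.
FEATURES = {
    "PRICE": ["high", "moderate", "low"],
    "QUALITY": ["high", "normal", "poor"],
    "SERVICE": ["good", "bad"],
}
CLASSES = ["uacc", "acc", "good", "vgood"]

def encode(rule_values, cls):
    """rule_values maps each attribute to the set of values it accepts.
    An attribute omitted from rule_values accepts every value (all 1s)."""
    bits = ""
    for attr, values in FEATURES.items():
        bits += "".join("1" if v in rule_values.get(attr, values) else "0"
                        for v in values)
    # Class part: index of the label as a 2-bit binary number.
    bits += format(CLASSES.index(cls), "02b")
    return bits

# IF PRICE(moderate, low) QUALITY(high, normal) SERVICE(good) THEN CLASS=acc
print(encode({"PRICE": {"moderate", "low"},
              "QUALITY": {"high", "normal"},
              "SERVICE": {"good"}}, "acc"))   # 0111101001

# PRICE omitted, so it is coded as 111 and does not constrain the rule:
print(encode({"QUALITY": {"high"}, "SERVICE": {"good"}}, "good"))  # 1111001010
```

Decoding simply reverses the mapping: each 1-bit reinstates the corresponding attribute value in the premise disjunction, and the final two bits index the class label.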
For example, if PRICE is coded as 111 and the rule code is 1111001010, then the meaning of the rule is: if the commodity quality is high and the service is good, then the commodity class is good, regardless of price. Conversely, the rule "if the commodity service is bad, then the commodity is unacceptable", that is, IF SERVICE(bad) THEN CLASS=uacc, corresponds to the binary code string 1111110100.

2.3 Definition of the fitness function

Mining classification rules from the knowledge database with a GA proceeds by finding good rules, which act as the parent generation for reproduction, crossover and mutation until an optimized rule group is discovered. A good rule is one with a high degree of matching between the rule and the instances in the test data set (also called records, comprising characteristic and class attributes), and the fitness function shall reflect this degree of matching between rule and data set. In the definition of the fitness function there are three important parameters to consider for a rule: accuracy, utility and coverage, defined as follows (assume that U is a test data set and e is an instance (element) within it).

Accuracy: The accuracy of a rule ri measures the degree of matching between instances in the test data set U and the rule, as given by the following formula:
Accuracy(ri) = |u_ri| / (|u_ri| + |u'_ri|)

in which u_ri is the subset of the test data set U whose elements (instances) match the whole rule ri to be evolved (premise and conclusion), and |u_ri| is the cardinality of that subset; u'_ri is the subset of U whose elements match only the characteristic part (premise) of ri, and |u'_ri| is its cardinality. Apparently, the higher the rule's accuracy, the higher the trust in the rule. If the accuracy is 1, then the rule is constantly true within the test data set U: wherever the condition of the rule holds, the conclusion also holds.

Utility: Within the test data set U, an instance e may match several rules to be evolved, and each rule's utility measures its share of such matches. If an instance e within U matches only one rule to be evolved within the current population, then that rule receives a utility contribution of 1 for e. If an instance e matches m rules to be evolved within the current population, then each of those rules receives a utility contribution of 1/m, as indicated by the following formula:
Utility(ri) = Σ_{e ∈ U} [ δ(ri, e) / Σ_{r ∈ P} δ(r, e) ]
in which U and e are defined as above, r ranges over the rules of the current population P, and δ(r, e) equals 1 if instance e matches rule r and 0 otherwise.

Coverage: The coverage of the rule ri is the number of instances e within the test data set U that match the premise part (characteristic part) of ri to be evolved, as indicated by the following formula:
Coverage(ri) = |u_ri| + |u'_ri|
Apparently, the higher the rule's coverage, the better its generality. As the values of all three parameters affect the rule's fitness, we want all of them to be high. When fixing the fitness function of the rule, the relationships between rules shall be analyzed first. There are three possible relationships between rules: containment, contradiction and redundancy.

Containment:
IF PRICE (moderate, low) SERVICE (good) THEN CLASS=acc
IF SERVICE (good) THEN CLASS=acc

Contradictory:
IF PRICE (high) QUALITY (poor) SERVICE (bad) THEN CLASS=acc
IF PRICE (high) QUALITY (poor) SERVICE (bad) THEN CLASS=uacc

Redundant: simply two identical rules.

These relationships shall be considered in the design of the fitness function, and contained, contradictory and redundant rules shall be deleted to ensure that only good rules are selected as the parent generation. The following algorithm evaluates rule fitness:

Step 1: Assume that U is a test data set and P is a rule (genome) population of size N;
Step 2: Calculate Accuracy(ri) and Utility(ri) for each rule ri (i = 1, 2, ..., N) within the current population P;
Step 3: Arrange the rules ri in descending order of the product Accuracy(ri) * Utility(ri), writing the result into the sort table;
Step 4: Calculate the Coverage(rj) of the current rule in the sort table, and set fitness(rj) = Coverage(rj) * Accuracy(rj) * Utility(rj) (initially j = 0);
Step 5: Delete from U the instances covered by rule rj, i.e. U = U − u_rj − u'_rj;
Step 6: If rj is the last rule in the sort table, exit, as the fitness of all rules has been calculated; otherwise set j = j + 1 and go to Step 4.
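The evaluation procedure above can be sketched as follows. This is a sketch under stated assumptions, not the paper's exact implementation: matches_premise and matches_rule are assumed helper predicates that test an instance against a rule's premise, or against both premise and conclusion.

```python
def accuracy(rule, U, matches_premise, matches_rule):
    # |u| / (|u| + |u'|): full matches over all premise matches.
    full = sum(1 for e in U if matches_rule(rule, e))
    premise_only = sum(1 for e in U
                       if matches_premise(rule, e) and not matches_rule(rule, e))
    covered = full + premise_only
    return full / covered if covered else 0.0

def utility(rule, population, U, matches_rule):
    # Each matched instance contributes 1/m, where m is the number
    # of rules in the current population that it matches.
    total = 0.0
    for e in U:
        if matches_rule(rule, e):
            m = sum(1 for r in population if matches_rule(r, e))
            total += 1.0 / m
    return total

def coverage(rule, U, matches_premise):
    # Number of instances whose characteristic part matches the premise.
    return sum(1 for e in U if matches_premise(rule, e))

def evaluate_fitness(population, U, matches_premise, matches_rule):
    # Steps 2-3: score once on the full test set, sort descending.
    acc = {id(r): accuracy(r, U, matches_premise, matches_rule) for r in population}
    util = {id(r): utility(r, population, U, matches_rule) for r in population}
    scored = sorted(population, key=lambda r: acc[id(r)] * util[id(r)],
                    reverse=True)
    # Steps 4-6: assign fitness greedily, deleting covered instances
    # so later rules are scored only on what remains.
    remaining = list(U)
    fit = {}
    for rule in scored:
        fit[id(rule)] = (coverage(rule, remaining, matches_premise)
                         * acc[id(rule)] * util[id(rule)])
        remaining = [e for e in remaining if not matches_premise(rule, e)]
    return fit
```

The greedy deletion in the loop rewards rules that cover instances no earlier (better-scoring) rule already covers, which discourages redundant and contained rules without an explicit pairwise comparison.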
3. Genetic Algorithms: related works
3.1 GA for Data Mining in Web-based EHS
In this paper we show how to apply genetic algorithms for data mining of student information obtained in a Web-based Educational Adaptive Hypermedia System. The objective is to obtain interesting association rules so that the teacher can improve the performance of the system. In order to check the proposed algorithm, we used a Web-based course developed for use by medical students. First we describe the proposed methodology, then the specific characteristics of the course and the information obtained about the students, followed by the implemented genetic algorithm, the rules discovered and the conclusions. Having tested our genetic algorithms on the Web-based hypermedia course described above, and shown that they can produce potentially useful results, we plan to apply them to this and other adaptive hypermedia courses to test how they can be used to improve the adaptive features. The preliminary results reported in this paper are promising and show that our genetic algorithm is a good alternative for extracting a small set of comprehensible rules, which is important in the context of data mining. We are now developing
several courses with the AHA system and we expect to obtain interesting rules to improve course adaptation. We have chosen the AHA system because it is a generic model of an adaptive hypermedia system with a high degree of adaptation. We are also developing a more sophisticated evolutionary algorithm with genetic programming, to obtain more complex and interesting rules [11].
between the splits evenly. Using experimental validations, it is verified that the CARM model using a TRS derived from such a split performs with good accuracy on the validation set.
subset selection in CRF-based question informer prediction models. We have proposed a hybrid approach that integrates a Genetic Algorithm (GA) with a Conditional Random Field (CRF) to optimize feature subset selection in a CRF-based model for question informer prediction. The experimental results show that the proposed hybrid GA_CRF model improves the accuracy of question informer prediction over the traditional CRF model. By using the GA to optimize the selection of the feature subset, we can improve the F-score from 88.9% to 93.87% and reduce the number of features from 105 to 40.
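GA-based feature subset selection of this kind treats the chromosome as a binary mask over the feature set. The sketch below illustrates only the GA side; the evaluate function is a placeholder assumption standing in for "train a CRF on the selected features and return its F-score", which is far too expensive to reproduce here:

```python
import random

random.seed(1)

N_FEATURES = 105   # size of the original feature set in the cited work
POP_SIZE = 20
GENERATIONS = 30

def evaluate(mask):
    # Placeholder fitness: rewards a hypothetical set of informative
    # features and penalizes noise features. A real system would train
    # and score the downstream model (e.g. a CRF) here.
    informative = set(range(0, N_FEATURES, 3))
    selected = {i for i, bit in enumerate(mask) if bit}
    return len(selected & informative) - 0.5 * len(selected - informative)

def offspring(p1, p2):
    cut = random.randrange(1, N_FEATURES)   # one-point crossover
    child = p1[:cut] + p2[cut:]
    child[random.randrange(N_FEATURES)] ^= 1  # single-bit mutation
    return child

population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=evaluate, reverse=True)
    parents = population[:POP_SIZE // 2]        # truncation selection
    population = parents + [offspring(*random.sample(parents, 2))
                            for _ in range(POP_SIZE - len(parents))]

best = max(population, key=evaluate)
print(sum(best), evaluate(best))  # size of the chosen subset and its score
```

The evolved mask both raises the placeholder score and shrinks the subset, mirroring the paper's twin outcomes of higher F-score and fewer features.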
4. Conclusion
Genetic algorithms are original systems based on the supposed functioning of living organisms. The method is very different from classical optimization algorithms in that it:
1. Uses an encoding of the parameters, not the parameters themselves.
2. Works on a population of points, not a single one.
3. Uses only the values of the function to be optimized, not its derivatives or other auxiliary knowledge.
4. Uses probabilistic transition rules, not deterministic ones.
It is important to understand that the functioning of such an algorithm does not guarantee success. We are in a stochastic system, and a genetic pool may be too far from the solution, or, for example, too fast a convergence
may halt the process of evolution. These algorithms are nevertheless extremely efficient, and are used in fields as diverse as the stock exchange, production scheduling and the programming of assembly robots in the automotive industry.
References
[1] Tubao Ho, Trongdung Nguyen, Ducdung Nguyen, Saori Kawasaki, Visualization Support for User-Centered Model Selection in Knowledge Discovery and Data Mining, International Journal of Artificial Intelligence Tools, 2001.
[2] Susan E. George, A Visualization and Design Tool (AVID) for Data Mining with the Self-Organizing Feature Map, International Journal of Artificial Intelligence Tools, 2000, 9(3), 369-375.
[3] Tzung-Pei Hong, Chan-Sheng Kuo, Sheng-Chai Chi, Trade-off Between Computation Time and Number of Rules for Fuzzy Mining from Quantitative Data, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2001, 9(5), 587-604.
[4] Vladimir Estivill-Castro, Jianhua Yang, Clustering Web Visitors by Fast, Robust and Convergent Algorithms, International Journal of Foundations of Computer Science, 2002, 13(4), 497-520.
[5] R. Félix, T. Ushio, Binary Encoding of Discernibility Patterns to Find Minimal Coverings, International Journal of Software Engineering and Knowledge Engineering, 2002.
[6] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, 1989.
[7] Ai Lirong, He Huacan, A Summary of Genetic Algorithms, Journal of Computer Application and Research, 1997.
[8] Xiao Yong, Chen Yiyun, Constructing Decision Trees by Using Genetic Algorithm, Journal of Computer Research and Development, 1998.
[9] Xiaomin Zhong, Eugene Santos, Directing Genetic Algorithms for Probabilistic Reasoning through Reinforcement Learning, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2000.
[10] Gary William Grewal, Thomas Charles Wilson, An Enhanced Genetic Algorithm for Solving the High-Level Synthesis Problems of Scheduling, Allocation, and Binding, International Journal of Computational Intelligence and Applications, 2001.
[11] Cristóbal Romero, Sebastián Ventura, Carlos de Castro, Wendy Hall, Using Genetic Algorithms for Data Mining in Web-based Educational Hypermedia Systems, 2004.
[12] Zhao Li Ping, Shu Qi Liang, Data Mining Application in Banking-Customer Relationship Management, International Conference on Computer Application and System Modeling (ICCASM 2010), 2010.
[13] Janaki Gopalan, Erkan Korkmaz, Reda Alhajj, Effective Data Mining by Integrating Genetic Algorithm into the Data Preprocessing Phase, IEEE, 2005.
[14] Jesús Alcalá-Fdez, Salvador García, Francisco José Berlanga, Alberto Fernández, Luciano Sánchez, M. J. del Jesus, Francisco Herrera, KEEL: A Data Mining Software Tool Integrating Genetic Fuzzy Systems, 3rd International Workshop on Genetic and Evolving Fuzzy Systems, Witten-Bommerholz, Germany, March 2008.
[15] Min-Yuh Day, Chun-Hung Lu, Chorng-Shyong Ong, Shih-Hung Wu, Integrating Genetic Algorithms with Conditional Random Fields to Enhance Question Informer Prediction, IEEE, 2006.
[16] Alexandra Takac, Cellular Genetic Programming Algorithm Applied to Classification Task, 2006.
[17] Mourad Ykhlef, Yousuf Aldukhayyil, Muath Alfawzan, Mining Sequential Patterns in Telecommunication Database Using Genetic Algorithm, 2007.