
Efficient Data Mining Using a Distribution-Based Tree Algorithm

Vijayalakshmi.R

Abstract— Data mining techniques are applied in building software fault prediction models for improving software quality. Early identification of high-risk modules can direct quality enhancement efforts to modules that are likely to have a high number of faults. The objective of this paper is to reduce the number of data sets for proper and efficient use of processor memory. The paper discusses the improvement of memory management for a wireless environmental monitoring system using a distribution-based decision tree algorithm. For the analysis, continuous distribution methods are considered, and the Gaussian distribution is compared with Averaging. An efficient data mining scheme for wireless sensor networks using decision tree algorithms is thereby obtained.

Key Words— Averaging, Data mining, Decision tree, Data sets, Gaussian distribution.

Vijayalakshmi.R, Department of Electrical Engineering, College of Engineering, Guindy, Anna University, Chennai, Mob: 9715373764 (e-mail: ervijayalakshmi@ymail.com)

I. INTRODUCTION

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information — information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified [1]. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data mining consists of five major elements: extracting, transforming, and loading transaction data onto the data warehouse system; storing and managing the data in a multidimensional database system; providing data access to business analysts and information technology professionals; analyzing the data with application software; and presenting the data in a useful format, such as a graph or table.

Different levels of analysis are available:
Artificial neural networks: non-linear predictive models that learn through training and resemble biological neural networks in structure.
Genetic algorithms: optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). Both are decision tree techniques used for classification of a dataset; they provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome [2]. CART segments a dataset by creating two-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
Nearest neighbor method: a technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest-neighbor technique.
Rule induction: the extraction of useful if-then rules from data based on statistical significance.
Data visualization: the visual interpretation of complex relationships in multidimensional data, using graphics tools to illustrate data relationships.

II. DECISION TREES

Decision trees are often used in classification and prediction [3]. They are a simple yet powerful way of representing knowledge. The models produced by decision trees are represented in the form of a tree structure, in which a leaf node indicates the class of the examples. Instances are classified by sorting them down the tree from the root node to some leaf node. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments [5]. These segments form an inverted decision tree that originates with a root node at the top of the tree. The object of analysis is reflected in this root node as a simple, one-dimensional display in the decision tree interface. The name of the field of data that is the object of analysis is usually displayed, along with the spread or distribution of the values contained in that field.

Fig 1: Illustration of decision tree

A. FEATURES OF DECISION TREE

Decision trees are white boxes: they generate simple, understandable rules. You can look into the trees, clearly understand each and every split, see the impact of that split, and even compare it to alternative splits. Decision trees are non-parametric: no specific data distribution is necessary. They easily handle continuous and categorical variables, and they handle missing values as easily as any normal value of a variable. Elegant tweaking is possible: you can choose the depth of the trees, the minimum number of observations needed for a split or for a leaf, and the number of leaves per split (in the case of multilevel target variables). Decision trees are among the best independent-variable selection algorithms: they are fast and, unlike simple correlations with the target variable, they also take the interactions between variables into account. Weak learners are valuable when many of them are combined in ensembles: ensemble methods such as bagging, boosting, random forests, and tree nets become very powerful algorithms when the individual models are weak learners. Decision trees identify subgroups: each terminal or intermediate leaf in a decision tree can be seen as a subgroup or segment of the population. Decision trees run fast even with many observations and variables, and they easily handle unbalanced datasets, for example 0.1% positive targets and 99.9% negative ones.

B. GAUSSIAN DISTRIBUTION

The Gaussian distribution (the "bell-shaped curve," which is symmetrical about the mean) is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions. In general, the normal distribution provides a good model for a random variable when: (1) there is a strong tendency for the variable to take a central value; (2) positive and negative deviations from this central value are equally likely; and (3) the frequency of deviations falls off rapidly as the deviations become larger. As an underlying mechanism that produces the Gaussian distribution, we can think of an infinite number of independent random (binomial) events that bring about the values of a particular variable. For example, there are probably a nearly infinite number of factors that determine a person's height (thousands of genes, nutrition, diseases, etc.); thus, height can be expected to be normally distributed in the population. The Gaussian distribution function is determined by the following formula:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

where μ is the mean and σ is the standard deviation of the population.

III. ALGORITHMS

A. AVERAGING

A straightforward way to deal with uncertain information is to replace each pdf with its expected value, thus effectively converting the data tuples into point-valued tuples. This reduces the problem back to that for point-valued data, and hence traditional decision tree algorithms such as ID3 and C4.5 [3] can be reused. We call this approach Averaging (AVG). We use an algorithm based on C4.5; a brief description follows. AVG is a greedy algorithm that builds a tree top-down. When processing a node, we examine a set of tuples S. The algorithm starts with the root node and with S being the set of all training tuples.

Algorithm steps:
Step 1: Get input from the dataset, which contains n data points.
Step 2: Split the dataset into groups according to our requirement.
Step 3: Calculate the average of each group using the formula: sum of the numbers in the group / number of data points in the group.
Step 4: Change the group lengths and go to Step 2.
Step 5: Calculate the accuracy percentage.

B. DISTRIBUTION BASED TREE ALGORITHM

The key to building a good decision tree is a good choice of an attribute Ajn and a split point zn for each node n. After an attribute Ajn and a split point zn have been chosen for a node n, we have to split the set of tuples S into a number of groups. For the Gaussian distribution we use s as the standard deviation. In both cases, the pdf is generated using s sample points in the interval. Using this method, we transform a data set with point values.

Algorithm steps:
Step 1: Get input from the dataset, which contains n data points.
Step 2: Split the dataset into groups according to our requirement.
Step 3: Calculate the Gaussian distribution of each group.
Step 4: Change the group lengths and go to Step 2.
Step 5: Calculate the accuracy percentage.
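As an illustrative sketch (not the authors' exact implementation), the grouping, averaging, and Gaussian computations described in the algorithm steps might look as follows in Java. The group length and sensor readings here are hypothetical placeholders:

```java
import java.util.Arrays;

public class GroupStats {
    // Averaging step: mean of one group = sum of values / group size.
    static double mean(double[] group) {
        double sum = 0;
        for (double v : group) sum += v;
        return sum / group.length;
    }

    // Population standard deviation of one group, used as sigma for the Gaussian pdf.
    static double stdDev(double[] group, double mu) {
        double ss = 0;
        for (double v : group) ss += (v - mu) * (v - mu);
        return Math.sqrt(ss / group.length);
    }

    // Gaussian distribution function f(x) = (1 / (sigma*sqrt(2*pi))) * exp(-(x-mu)^2 / (2*sigma^2)).
    static double gaussianPdf(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
    }

    // Splitting step: cut the dataset into consecutive groups of a chosen length.
    static double[][] split(double[] data, int groupLen) {
        int nGroups = (data.length + groupLen - 1) / groupLen;
        double[][] groups = new double[nGroups][];
        for (int g = 0; g < nGroups; g++) {
            int from = g * groupLen;
            int to = Math.min(from + groupLen, data.length);
            groups[g] = Arrays.copyOfRange(data, from, to);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sensor readings (e.g., temperature samples).
        double[] data = {21.0, 22.5, 23.0, 24.5, 20.0, 25.0};
        for (double[] group : split(data, 3)) {
            double mu = mean(group);
            double sigma = stdDev(group, mu);
            System.out.printf("mean=%.2f sigma=%.2f pdf(mu)=%.4f%n",
                    mu, sigma, gaussianPdf(mu, mu, sigma));
        }
    }
}
```

Changing the group length and re-running corresponds to the "change the group lengths and go to Step 2" loop in both algorithms.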


Fig 2: Flow chart of the proposed system

IV. RESULTS AND DISCUSSION

The simulation of the data mining is carried out in the NetBeans IDE, and a dataset of temperature, pressure, and humidity readings is considered. The main code is developed in Java, and the performance graph is obtained.

The project is created using Java Swing and is developed and compiled in the IDE (Fig 3: NetBeans IDE). The results obtained are shown in the figures below (Fig 4: Login page; Fig 5: Main page). The tree generated as the result of the Gaussian distribution is given in Fig 6 (Generation of tree using Gaussian distribution). From this tree, the highest-valued node is used to make decisions; thus the decision-making algorithm is successfully implemented.

V. PERFORMANCE EVALUATION

The columns are designed for each type of calculation and the results are obtained separately, giving a clear view for analysis and performance evaluation.
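Step 5 of both algorithms reports an accuracy percentage. As a minimal sketch of how such a figure might be computed (the labels below are hypothetical, not the paper's dataset):

```java
public class AccuracyDemo {
    // Accuracy percentage = correctly classified records / total records * 100.
    static double accuracyPercent(int[] predicted, int[] actual) {
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == actual[i]) correct++;
        }
        return 100.0 * correct / predicted.length;
    }

    public static void main(String[] args) {
        // Hypothetical class labels for five test records.
        int[] actual    = {1, 0, 1, 1, 0};
        int[] predicted = {1, 0, 0, 1, 0};
        System.out.println("Accuracy: " + accuracyPercent(predicted, actual) + "%"); // 80.0%
    }
}
```

Computing this percentage once per method, with the same groups, is what allows the Averaging and Gaussian results to be compared in the graph.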

From this graph, it is clear that the Gaussian distribution based tree algorithm gives higher efficiency than the Averaging method.

VI. CONCLUSION

The distribution based tree algorithm and the Averaging method for building decision trees that reduce the amount of data in datasets have been implemented. It is found that the method in this paper achieves remarkably higher accuracies when suitable pdfs are used. It is therefore recommended that data be collected and stored with the pdf information intact.

ACKNOWLEDGMENT

I would like to express my sincere appreciation and gratitude to my guide, Dr. P. Vanaja Ranjan, Professor, Department of Electrical and Electronics Engineering, Anna University, for her guidance, constant encouragement, and support.

REFERENCES
[1] S. Tsang, B. Kao, K.Y. Yip, W.-S. Ho, and S.D. Lee, "Decision Trees for Uncertain Data," IEEE Trans. Knowledge and Data Eng., pp. 441-444, Jan. 2011.
[2] M. Umanol, H. Okamoto, I. Hatono, H. Tamura, F. Kawachi, S. Umedzu, and J. Kinoshita, "Fuzzy Decision Trees by Fuzzy ID3 Algorithm and Its Application to Diagnosis Systems," Proc. IEEE Conf. Fuzzy Systems, IEEE World Congress on Computational Intelligence, vol. 3, pp. 2113-2118, June 1994.
[3] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[4] Decision Trees for Business Intelligence and Data Mining: Using SAS Enterprise Miner.
[5] http://www.the-data-mine.com.

Ms. Vijayalakshmi R. completed her Bachelor of Engineering in Computer Science and Engineering at the Government College of Engineering, Thirunelveli. She is pursuing her Master of Engineering in Embedded System Technologies at the College of Engineering, Guindy. Her fields of interest include data structures and data mining.
