
International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 3, March 2013)

Optimization of C4.5 Decision Tree Algorithm for Data Mining

Gaurav L. Agrawal1, Prof. Hitesh Gupta2
PG Student, Department of CSE, PCST, Bhopal, India
Head of Department CSE, PCST, Bhopal, India
Abstract-- Data mining is a new technology that has been successfully applied in many fields. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is mainly used for model classification and prediction; classification is a form of data analysis that extracts models describing important data classes. C4.5 is one of the most classic classification algorithms in data mining, but when it is used in mass calculations its efficiency is very low. In this paper, we propose a C4.5 algorithm improved by the use of L'Hospital's Rule, which simplifies the calculation process and improves the efficiency of the decision-making algorithm. We aim to implement the algorithms in a space-effective manner, with the response time of the application promoted as the performance measure. Our system aims to implement these algorithms and graphically compare their complexities and efficiencies.

Keywords- C4.5 algorithm, Data mining, Decision tree, ID3 algorithm, L'Hospital's Rule.

I. INTRODUCTION

Decision trees are built of nodes, branches and leaves that indicate the variables, conditions, and outcomes, respectively. The most predictive variable is placed at the top node of the tree. The operation of the decision trees considered here is based on the C4.5 algorithm. The algorithm makes the clusters at each node gradually purer by progressively reducing the disorder (impurity) in the original data set. Disorder and impurity can be measured by the well-established measures of entropy and information gain. One of the most significant advantages of decision trees is that the knowledge they capture can be extracted and represented in the form of classification (if-then) rules: each rule represents a unique path from the root to a leaf. In operations research, specifically in decision analysis, a decision tree (or tree diagram) is a decision support tool used to identify the strategy most likely to reach a goal. Another use of trees is as a descriptive means for calculating conditional probabilities.

A decision tree is a flow-chart-like tree structure, where each branch represents an outcome of a test and each leaf node represents a class. The attribute with the highest information gain is chosen as the test attribute for the current node; this attribute minimizes the information needed to classify the samples.

In this paper, we analyze several decision tree classification algorithms currently in use, including the ID3 [4] and C4.5 [2] algorithms as well as some of the algorithms [3] [5] [6] that improved on them. When these classification algorithms are used in data processing, we find that their efficiency is very low and that they can cause excessive memory consumption. On this basis, and with large quantities of data in mind, we put forward an improvement to the efficiency of the C4.5 algorithm that uses L'Hospital's Rule to simplify the calculation process by an approximate method. This improved algorithm has no essential impact on the outcome of decision making, but it can greatly improve efficiency and reduce memory use, so it is better suited to processing large data collections. The rest of the paper is organized as follows.

A. Decision Tree induction:
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node in a tree is the root node. Suppose, for example, that a tree represents the concept buys_computer; that is, it predicts whether a customer is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce non-binary trees.

C4.5 adopts a greedy (i.e., nonbacktracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner. Most algorithms for decision tree induction follow such a top-down approach, which starts with a training set of tuples and their associated class labels; the training set is recursively partitioned into smaller subsets as the tree is built.
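To make the flow-chart structure concrete, the buys_computer example above can be sketched as a small nested-dictionary tree. The attributes and split values below (age, student) are illustrative assumptions, not data from the paper:

```python
# A minimal sketch of a decision tree as nested dicts, assuming illustrative
# attributes (age, student) for the buys_computer concept described above.
# Internal nodes test an attribute; leaves hold a class label.

tree = {
    "attribute": "age",
    "branches": {
        "youth":       {"attribute": "student",
                        "branches": {"yes": {"label": "buys"},
                                     "no":  {"label": "does_not_buy"}}},
        "middle_aged": {"label": "buys"},
        "senior":      {"label": "does_not_buy"},
    },
}

def classify(node, sample):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while "label" not in node:
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node["label"]

print(classify(tree, {"age": "youth", "student": "yes"}))  # prints: buys
```

Each root-to-leaf path of this structure corresponds directly to one if-then classification rule of the kind described above.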

II. METHODOLOGY

Steps of the system:
1. Select a dataset as input to the algorithm for processing.
2. Select the classifiers.
3. Calculate the entropy, information gain and gain ratio of the attributes.
4. Process the given input dataset according to the defined C4.5 data mining algorithm.
5. Process the given input dataset according to the defined improved C4.5 data mining algorithm.
6. The data to be input to the tree generation mechanism is supplied by the C4.5 and improved C4.5 processors; the tree generator then generates the tree for the C4.5 and improved C4.5 decision tree algorithms.

III. DATA MINING AND KNOWLEDGE DISCOVERY

A. Attribute Selection Measure:
The attribute selection measure provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples. If the splitting attribute is continuous-valued, or if we are restricted to binary trees, then either a split point or a splitting subset, respectively, must also be determined as part of the splitting criterion. The tree node created for a partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly. The two most popular attribute selection measures are information gain and gain ratio [12].

Let S be a set consisting of data samples. Suppose the class label attribute has m distinct values defining m distinct classes Ci (for i = 1, ..., m), and let Si be the number of samples of S in class Ci. The expected information needed to classify a given sample is given by the equation

I(S1, S2, ..., Sm) = − Σ_{i=1..m} pi · log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci, estimated by Si/|S|. Note that a log function to base 2 is used because the information is encoded in bits.

B. Classifiers:
In [17], in order to mine the data, the well-known data mining tool WEKA was used. The data there has numeric attributes with only the classification as nominal, placing it in the category of labeled data sets. It is therefore necessary to perform supervised data mining on the target data set. This narrowed the choice down to the few classifiers that can handle numeric data as well as give a classification (amongst a predefined set of classifications); hence selecting C4.5 decision tree learning became the obvious choice. Attribute evaluation was also performed in order to find the gain ratio and ranking of each attribute in the decision tree learning. In cases where data mining could not produce any suitable result for some data set, the correlation coefficient was computed to investigate whether a relation between attributes exists.

C. Entropy:
Entropy is the minimum number of bits of information needed to encode the classification of an arbitrary member of S. Let attribute A have v distinct values a1, ..., av. Attribute A can be used to partition S into v subsets S1, S2, ..., Sv, where Sj contains those samples in S that have value aj of A. If A were selected as the test attribute, these subsets would correspond to the branches grown from the node containing the set S. Let Sij be the number of samples of class Ci in subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by the equation

E(A) = Σ_{j=1..v} ((S1j + S2j + ... + Smj) / |S|) · I(S1j, ..., Smj)

The first term acts as the weight of the jth subset: the number of samples in the subset divided by the total number of samples in S. The smaller the entropy value, the greater the purity of the subset partitions, where

I(S1j, ..., Smj) = − Σ_{i=1..m} pij · log2(pij)

and pij is the probability that a sample in Sj belongs to class Ci.

D. Information Gain:
Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. More precisely, the information gain Gain(A) of an attribute A, relative to the collection of examples S, is given by the equation

Gain(A) = I(S1, S2, ..., Sm) − E(A)

In other words, Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A. The algorithm computes the information gain of each attribute, and the attribute with the highest information gain is chosen as the test attribute for the given set.

E. Gain ratio:
The gain ratio [12] differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning.
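The entropy, information gain and gain ratio measures defined in this section can be computed directly. The following sketch (function names and the example counts are our assumptions, not the paper's) does so for a two-valued attribute over a binary-class sample set:

```python
from math import log2

def entropy(counts):
    """I(S1,...,Sm): expected information to classify, from per-class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_gain_and_ratio(partitions):
    """partitions[j] holds the per-class counts [S1j, ..., Smj] for value aj of A.
    Returns (Gain(A), Gain Ratio(A)) following the equations in the text."""
    n = sum(sum(p) for p in partitions)
    class_totals = [sum(col) for col in zip(*partitions)]   # S1, ..., Sm
    e_a = sum(sum(p) / n * entropy(p) for p in partitions)  # E(A)
    gain = entropy(class_totals) - e_a                      # Gain(A)
    split_info = entropy([sum(p) for p in partitions])      # entropy of subset sizes
    return gain, (gain / split_info if split_info else 0.0)

# Example: a two-valued attribute splitting 9 positive / 5 negative samples
gain, ratio = info_gain_and_ratio([[6, 2], [3, 3]])  # gain ≈ 0.048
```

Note how every call involves a log2 per term; this is exactly the cost that the improvement described later removes.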

The gain ratio is defined as

Gain Ratio(A) = Gain(A) / Split Info(A)

where the split information, Split Info(A) = − Σ_{j=1..v} (|Sj|/|S|) · log2(|Sj|/|S|), is the information generated by splitting S into its v partitions [12]. The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that as the split information approaches 0, the ratio becomes unstable. A constraint is added to avoid this, whereby the information gain of the test selected must be large: at least as great as the average gain over all tests examined.

IV. C4.5 ALGORITHM

C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. Many scholars have made various improvements to the decision tree algorithm, but the problem is that these algorithms need to scan and sort the data collection several times during construction of the tree, and processing speed drops greatly when the data set is so large that it cannot fit in memory. At present there is some literature on improving the efficiency of decision tree classification algorithms. For example, Wei Zhao and Jamming Su in the literature [7] proposed improvements to the ID3 algorithm which simplify the information gain by use of Taylor's formula. But this improvement is more suitable for small amounts of data, so it is not particularly effective on large data sets.

Because large amounts of data must be handled, a variety of decision tree classification algorithms have been considered. The advantages of the C4.5 algorithm are significant, so it is a natural choice, but its efficiency must be improved to meet the dramatic increase in demand for processing large amounts of data.

A. Pseudo Code [16]:
1. Check for the base cases.
2. For each attribute a, calculate:
   i. the normalized information gain from splitting on attribute a.
3. Select the best attribute a, the one with the highest normalized information gain.
4. Create a decision node that splits on the best a as the root node.
5. Recurse on the sublists obtained by splitting on the best a, and add those nodes as children of the node.

B. Improvements from ID3 algorithm:
C4.5 made a number of improvements to ID3. Some of these are as follows:

1. Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.

2. Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.

3. Handling attributes with differing costs.

4. Pruning trees after creation - C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes.

A. The improvement
The C4.5 algorithm [8] [9] generates a decision tree through learning from a training set, in which each example is structured in terms of attribute-value pairs. The current attribute node is the one with the maximum calculated rate of information gain, and the root node of the decision tree is obtained in this way. Having studied the process carefully, we find that the selection of the test attribute at each node involves logarithmic calculations, and that the same calculations are performed repeatedly; the efficiency of decision tree generation suffers when the dataset is large. We also find, after studying the calculation process carefully, that the antilogarithm in the logarithmic calculation is usually small, so the process can be simplified by using L'Hospital's Rule, as follows.

If f(x) and g(x) satisfy:
(1) lim_{x→x0} f(x) and lim_{x→x0} g(x) are both zero or both ∞;
(2) in the deleted neighbourhood of the point x0, both f'(x) and g'(x) exist and g'(x) ≠ 0;
(3) lim_{x→x0} f'(x)/g'(x) exists or is ∞;
then

lim_{x→x0} f(x)/g(x) = lim_{x→x0} f'(x)/g'(x)
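The five pseudo-code steps above can be sketched as a recursive builder. This is an illustrative reading of the steps on categorical attributes only (the toy weather data and all names are our assumptions); it omits C4.5's continuous-attribute, missing-value and pruning features:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information I(S1,...,Sm) over a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Steps 2-3: pick the attribute with the highest normalized information
    gain (information gain divided by split information)."""
    base, n = entropy(labels), len(labels)
    def score(a):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        gain = base - sum(len(g) / n * entropy(g) for g in groups.values())
        split = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
        return gain / split if split else 0.0
    return max(attributes, key=score)

def build_tree(rows, labels, attributes):
    """Steps 1, 4, 5: base cases, split on the best attribute, recurse."""
    if len(set(labels)) == 1:            # base case: node is pure
        return labels[0]
    if not attributes:                   # base case: nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    rest = [x for x in attributes if x != a]
    node = {"attribute": a, "branches": {}}
    for v in {row[a] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == v]
        srows, slabels = zip(*sub)
        node["branches"][v] = build_tree(list(srows), list(slabels), rest)
    return node

def predict(node, sample):
    """Walk branches until a leaf label is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

rows = [
    {"outlook": "sunny",    "windy": "false"},
    {"outlook": "sunny",    "windy": "true"},
    {"outlook": "overcast", "windy": "false"},
    {"outlook": "rain",     "windy": "false"},
    {"outlook": "rain",     "windy": "true"},
]
labels = ["no", "no", "yes", "yes", "no"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

On this toy data the builder reproduces every training label, and each recursive call re-evaluates the logarithmic scores from scratch, which is precisely the repeated cost that the improvement below targets.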


Applying L'Hospital's Rule with x approaching zero:

lim_{x→0} ln(1 − x) / (−x) = lim_{x→0} [ln(1 − x)]' / (−x)' = lim_{x→0} (−1/(1 − x)) / (−1) = lim_{x→0} 1/(1 − x) = 1

viz. ln(1 − x) = −x (as x approaches zero), i.e.

ln(1 − x) ≈ −x (when x is quite small)

Suppose c = 2; that is, there are only two categories in the basic definition of the C4.5 algorithm. Each candidate attribute's information gain is calculated and the one with the largest information gain is selected as the root. Suppose that in the sample set S the number of positive examples is p and the number of negative examples is n, and let N = p + n. Then

E(S, A) = Σ_{j=1..v} ((pj + nj) / (p + n)) · I(s1j, s2j)

in which pj and nj are, respectively, the numbers of positive and negative examples in the jth subset of the sample set. So Gain Ratio(A) can be written as

Gain-Ratio(A) = Gain(A) / I(A) = (I(p, n) − E(S, A)) / I(A) = [I(p, n) − {(S1/N)·I(S11, S12) + (S2/N)·I(S21, S22)}] / I(S1, S2)

where:
S1: the number of examples in which A is positive;
S2: the number of examples in which A is negative;
S11: the number of examples in which A is positive and the class is positive;
S12: the number of examples in which A is positive and the class is negative;
S21: the number of examples in which A is negative and the class is positive;
S22: the number of examples in which A is negative and the class is negative.

Going on with the simplification we can get:

Gain-Ratio(S, A) = [(p/N)·log2(p/N) + (n/N)·log2(n/N) − {(S1/N)·[(S11/S1)·log2(S11/S1) + (S12/S1)·log2(S12/S1)] + (S2/N)·[(S21/S2)·log2(S21/S2) + (S22/S2)·log2(S22/S2)]}] / {(S1/N)·log2(S1/N) + (S2/N)·log2(S2/N)}

In the equation above, each term in both the numerator and the denominator contains a logarithmic calculation and N. Dividing the numerator and the denominator by log2(e) simultaneously (which turns every log2 into ln), and multiplying both by N simultaneously, we get the equation:

Gain-Ratio(S, A) = [p·ln(p/N) + n·ln(n/N) − {S11·ln(S11/S1) + S12·ln(S12/S1) + S21·ln(S21/S2) + S22·ln(S22/S2)}] / {S1·ln(S1/N) + S2·ln(S2/N)}

Because p/N + n/N = 1, we can replace p/N and n/N with 1 − n/N and 1 − p/N respectively (and likewise within each subset, and for S1/N and S2/N); then we get the equation:

Gain-Ratio(S, A) = [p·ln(1 − n/N) + n·ln(1 − p/N) − {S11·ln(1 − S12/S1) + S12·ln(1 − S11/S1) + S21·ln(1 − S22/S2) + S22·ln(1 − S21/S2)}] / {S1·ln(1 − S2/N) + S2·ln(1 − S1/N)}

Because we know that ln(1 − x) ≈ −x, every term collapses and the common factors cancel, so we get:

Gain-Ratio(S, A) ≈ [p·n/N − (S11·S12)/S1 − (S21·S22)/S2] / (S1·S2 / N)
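As a numeric sanity check on the derivation, the following sketch (our code, not part of the paper) compares the exact two-class gain ratio, computed with logarithms, against the simplified expression (p·n/N − S11·S12/S1 − S21·S22/S2)/(S1·S2/N) for one set of counts:

```python
from math import log2

def exact_gain_ratio(s11, s12, s21, s22):
    """Gain ratio via the original logarithmic formulas (two classes)."""
    def entropy(a, b):
        total = a + b
        return -sum(c / total * log2(c / total) for c in (a, b) if c)
    s1, s2 = s11 + s12, s21 + s22      # subset sizes for the two values of A
    p, n = s11 + s21, s12 + s22        # positive / negative class totals
    big = p + n
    gain = entropy(p, n) - (s1 / big * entropy(s11, s12)
                            + s2 / big * entropy(s21, s22))
    return gain / entropy(s1, s2)      # divide by split information I(S1, S2)

def simplified_gain_ratio(s11, s12, s21, s22):
    """The paper's approximation: no logarithms, only +, -, *, /."""
    s1, s2 = s11 + s12, s21 + s22
    p, n = s11 + s21, s12 + s22
    big = p + n
    return (p * n / big - s11 * s12 / s1 - s21 * s22 / s2) / (s1 * s2 / big)

exact = exact_gain_ratio(8, 2, 3, 7)    # ≈ 0.191
approx = simplified_gain_ratio(8, 2, 3, 7)  # ≈ 0.25
```

On this example the two values agree in sign and rough magnitude; since the same approximation is applied to every candidate attribute, the intent is to preserve the ranking of attributes rather than the exact value of the ratio.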

In the expression above, Gain-Ratio(S, A) involves only addition, subtraction, multiplication and division, with no logarithmic calculation, so the computing time is much shorter than for the original expression. What's more, the simplification can be extended to the multi-class case.

B. Reasonable arguments for the improvement:
In the improvement of C4.5 above, no item is added or removed; only approximate calculation is used when the information gain ratio is computed. Moreover, the antilogarithm in the logarithmic calculation is a probability, which is less than 1. To facilitate the presentation of the improved calculation, only two categories are considered in this article, where the probabilities are somewhat larger than in the multi-class case; the probabilities become smaller as the number of categories grows, which further justifies the approximation. Furthermore, the approximate calculation is guaranteed by L'Hospital's Rule, so the improvement is reasonable.

C. Comparison of the complexity:
To calculate Gain-Ratio(S, A), the C4.5 algorithm's complexity is mainly concentrated in E(S) and E(S, A). When we compute E(S), each probability value needs to be calculated first, which takes O(n) time; then each one is multiplied and accumulated, which takes O(log2 n) time, so the complexity is O(n log2 n). Again, in the calculation of E(S, A) the complexity is O(n(log2 n)^2), so the total complexity of Gain-Ratio(S, A) is O(n(log2 n)^2).

The improved C4.5 algorithm involves only the original data and only addition, subtraction, multiplication and division operations. It therefore needs only one scan to obtain the totals, followed by some simple calculations, so the total complexity is O(n).

VI. CONCLUSION AND FUTURE WORK
In this paper we studied the C4.5 algorithm and an improved C4.5 algorithm that raises the performance of the existing algorithm, saving time through the use of L'Hospital's Rule and increasing efficiency considerably. We can not only speed up the growing of the decision tree, but also generate better rule information. With the improved algorithm, we can get faster and more effective results without any change in the final decision, and the presented algorithm constructs a clearer and more understandable decision tree; the efficiency of classification is greatly improved. In future work we will verify the algorithm on different large datasets which are publicly available in the UCI machine learning repository.

REFERENCES
[1] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. China Machine Press, 2006.
[2] S. F. Chen, Z. Q. Chen, Artificial Intelligence in Knowledge Engineering [M]. Nanjing: Nanjing University Press, 1997.
[3] Z. Z. Shi, Senior Artificial Intelligence [M]. Beijing: Science Press, 1998.
[4] D. Jiang, Information Theory and Coding [M]. China University of Science and Technology Press, 2001.
[5] M. Zhu, Data Mining [M]. Hefei: China University of Science and Technology Press, 2002. 67-72.
[6] A. P. Engelbrecht, A new pruning heuristic based on variance analysis of sensitivity information [J]. IEEE Trans. on Neural Networks, 2001, 12(6): 1386-1399.
[7] N. Kwak, C. H. Choi, Input feature selection for classification problems [J]. IEEE Trans. on Neural Networks, 2002, 13(1): 143-159.
[8] J. R. Quinlan, Induction of decision trees [J]. Machine Learning, 1986.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[10] UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html
[11] UCI Machine Learning Repository – http://mlearn.icsuci.edu/database
[12] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition. Morgan Kaufmann Publishers.
[13] Chen Jin, Luo De-lin, Mu Fen-xiang, An improved ID3 decision tree algorithm. Xiamen University, 2009.
[14] Rong Cao, Lizhen Xu, Improved C4.5 decision tree algorithm for the analysis of sales. Southeast University, Nanjing 211189, China, 2009.
[15] Huang Ming, Niu Wenying, Liang Xu, An improved decision tree classification algorithm based on ID3 and the application in score analysis. Dalian Jiaotong University, 2009.
[16] Surbhi Hardikar, Ankur Shrivastava and Vijay Choudhary, Comparison between ID3 and C4.5 in Contrast to IDS. VSRD-IJCSIT, Vol. 2 (7), 2012.
[17] Khalid Ibnal Asad, Tanvir Ahmed, Md. Saiedur Rahman, Movie Popularity Classification based on Inherent Movie Attributes using C4.5, PART and Correlation Coefficient. IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision, 2012.