
2017 11th International Conference on Intelligent Systems and Control (ISCO)

An Approach for Classification using Simple CART Algorithm in Weka

Dr. Neeraj Bhargava, Professor, School of System Sc. & Engg., MDS University, Ajmer, India, profneerajbhargava@gmail.com
Sonia Dayma, School of System Sc. & Engg., MDS University, Ajmer, India, soniadayma786@gmail.com
Abishek Kumar, School of System Sc. & Engg., MDS University, Ajmer, India, ap481998@gmail.com
Pramod Singh, School of System Sc. & Engg., MDS University, Ajmer, India, pramodrathore88@gmail.com

Abstract- Decision trees are commonly applied in data mining to build a model that predicts the value of a target (dependent) variable from several input (independent) variables. CART algorithms are mainly used in medicine, statistics, and similar fields. Predicting a heart attack is a complex task for medical practitioners because it requires experience and knowledge. The first part of the paper introduces the CART algorithm, its applications, and its classification method. The second part presents a real-world dataset of male patients taken for further analysis.

Keywords- CART, DM, Algo, Predictor, Weka tool

Introduction
The CART (Classification and Regression Tree) algorithm is applied in WEKA and used in data mining to classify the variables of a dataset; the tree describes the "class" of the target variable. A regression tree is used where the target variable is continuous, and the tree is then used to make the decision. Here the CART algorithm is applied to a dataset of heart disease in male patients. The data mining task is to predict the class from selected attributes of the dataset: age, chest pain, rest blood pressure, blood sugar, rest electrocardiograph, maximum heart rate, exercise angina, and disease. The analysis examines the confusion matrix and determines which attribute is the best predictor for a correct diagnosis. The algorithm may help experts to predict possible heart attacks from patient datasets [1].

I. EASE OF USE
The CART algorithm is used in agriculture, biomedicine, plant disease studies, pharmacology, medical research, financial analysis, and many other fields where a dataset is available for decision making [3].

II. INTRODUCTION
The paper aims to predict possible heart attack patients from the dataset using data mining techniques and to determine which model gives the highest percentage of correct predictions for the diagnoses.

III. METHODOLOGY
This section describes the methodology, with emphasis on the techniques applied to the dataset [5][7]. These techniques are required for the prediction of heart disease in patients.

A. Heart Disease Data Set
The patient data set is compiled from data collected from medical practitioners. Only seven predictor attributes from the database are considered for the heart disease predictions. The following attributes and values are considered.

TABLE I. DATASET OF HEART DISEASE

Attribute        Type      Values
Age              Nominal   Min=20-40, Avg=41-60, Upper=61-80
Chest Pain       Nominal   asympt, atyp_angina, non_anginal, typ_angina
Rest b press     Numeric   Good=120/80 mm, Average=129-139, Poor=179/90
Blood Sugar      Nominal   F=False, T=True
Rest Electro     Nominal   normal, left_vent_hyper, st_t_wave_abnormality
Max Heart Rate   Nominal   A<=100, B>100&<=150, C>150
Exercise Angina  Nominal   yes, no
Disease          Nominal   positive, negative

There are eight attributes in the dataset. Every attribute except disease is treated as a main cause of heart disease, and disease is the class to be predicted. Each cause is grouped into predefined categories so that the results can be produced efficiently [8][11].
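As a minimal sketch of the categorisation used for the dataset, the following Python functions map a raw age and a raw maximum heart rate onto the nominal categories listed in the dataset table (Min/Avg/Upper and A/B/C). The function names are hypothetical, and rejecting ages outside 20-80 is an assumption, since the paper does not say how such values were handled. The rest blood pressure categories are left out because the table does not give complete ranges for them.

```python
def age_category(age_years: int) -> str:
    """Map an age in years to the Age categories of the dataset table."""
    if 20 <= age_years <= 40:
        return "Min"
    if 41 <= age_years <= 60:
        return "Avg"
    if 61 <= age_years <= 80:
        return "Upper"
    # Assumption: ages outside the listed ranges are rejected, not clamped.
    raise ValueError(f"age {age_years} is outside the ranges listed in the table")


def max_heart_rate_category(beats_per_minute: float) -> str:
    """Map a maximum heart rate to the categories A (<=100), B (>100 and <=150), C (>150)."""
    if beats_per_minute <= 100:
        return "A"
    if beats_per_minute <= 150:
        return "B"
    return "C"


# Hypothetical patient, used only to show the calls.
example = {"age": age_category(52), "max_heart_rate": max_heart_rate_category(162)}
```

Keeping the bin edges in one place like this makes it easy to check that every record falls into exactly one category before the file is loaded into Weka.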

978-1-5090-2717-0/17/$31.00 ©2017 IEEE


B. Visualization of Heart Patients
The heart data set is analyzed visually, attribute by attribute, to show the distribution of values.

Figure 1. AGE
Figure 2. CHEST_PAIN
Figure 3. Rest blood Pressure
Figure 4. Blood_sugar
Figure 5. Rest_electro
Figure 7. Exercice_angina
Figure 8. Disease

The visualizations in the figures above show each attribute name and its values in the dataset. The number of bars in each graph equals the number of categories of the attribute, and the height of each bar is the number of instances in that category. For example, the age attribute has three categories, Avg (average), Upper, and Min (minimum), with 149, 48, and 12 instances respectively.

C. Performance of the Classifier CART
Table II shows the experimental results used to evaluate the performance of CART on the heart patient dataset.

TABLE II. PERFORMANCE OF CLASSIFIER CART

Classifier  Evaluation Criteria               Value
CART        Timing to build model (in Sec)    0.08
            Correctly Classified Instances    167
            Incorrectly Classified Instances  42
            Accuracy                          79.9043%

IV. RESULT

A. Final Result
When the algorithm is applied to the dataset, it produces the result shown in Figure 9. The output contains the dataset analysis: the total number of instances, the correctly and incorrectly classified instances, the classification accuracy measures, the detailed accuracy measures, and the confusion matrix.

=== Run information ===

Scheme:       weka.classifiers.trees.SimpleCart -S 1 -M 2.0 -N 5 -C 1.0
Relation:     heart-disease-male
Instances:    209
Attributes:   8
              age
              chest_pain
              rest_bpress
              blood_sugar
              rest_electro
              max_heart_rate
              exercice_angina
              disease
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

CART Decision Tree

chest_pain=(atyp_angina)|(non_anginal): negative(88.0/13.0)
chest_pain!=(atyp_angina)|(non_anginal)

Figure 9. Result

Figure 9 shows the relation information, including the relation name, the number of instances, and the attributes. In 10-fold cross-validation the dataset is split into 10 equal slices (folds); each fold is held out once for testing while the model is trained on the other nine, and the results of the ten runs are combined.
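The 10-fold procedure can be illustrated with a short, self-contained Python sketch. It is not Weka's implementation: the function names, the round-robin dealing, and the seed are assumptions made for illustration. The class counts (92 positive, 117 negative) are the row totals of the confusion matrix reported in this paper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=1):
    """Return k lists of row indices with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, k=10, seed=1):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Class counts taken from the confusion matrix rows: 70 + 22 and 20 + 97.
labels = ["positive"] * 92 + ["negative"] * 117
splits = list(cross_validation_splits(labels))
```

In each of the ten rounds a classifier would be trained on `train` and evaluated on `test`; the counts of correct and incorrect predictions are then summed over the rounds, which is how a single confusion matrix covering all 209 instances arises.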

|  exercice_angina=(no)
|  |  age=(Avg)|(Min)
|  |  |  max_heart_rate=(A)|(B): negative(17.0/10.0)
|  |  |  max_heart_rate!=(A)|(B)
|  |  |  |  rest_bpress=(Poor): negative(3.0/1.0)
|  |  |  |  rest_bpress!=(Poor): positive(7.0/1.0)
|  |  age!=(Avg)|(Min): positive(7.0/2.0)
|  exercice_angina!=(no): positive(54.0/6.0)

Number of Leaf Nodes: 6
Size of the Tree: 11

Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         167      79.9043 %
Incorrectly Classified Instances        42      20.0957 %
Kappa statistic                          0.5913
Mean absolute error                      0.2779
Root mean squared error                  0.3869
Relative absolute error                 56.3624 %
Root relative squared error             77.9345 %
Total Number of Instances              209

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.761    0.171    0.778      0.761   0.769      0.838     positive
               0.829    0.239    0.815      0.829   0.822      0.838     negative
Weighted Avg.  0.799    0.209    0.799      0.799   0.799      0.838

=== Confusion Matrix ===

  a  b   <-- classified as
 70 22 |  a = positive
 20 97 |  b = negative

Figure 10. Final Results

Figure 10 shows the decision tree built by the classifier, including the root node, child nodes, and leaf nodes with their predicted values. The tree has 6 leaf nodes and a size of 11.

Figure 11. Final Results (the "Detailed Accuracy By Class" and "Confusion Matrix" parts of the output above)

Figure 11 shows the detailed accuracy and the confusion matrix for the classes in the dataset. The detailed accuracy is expressed through measures such as TP (True Positive) rate, FP (False Positive) rate, Precision, Recall, and F-Measure. These measures are calculated from the confusion matrix as shown in Figure 12.

TP rate = TP / (TP + FN)
    Positive: 70 / (70 + 22) = 0.761
    Negative: 97 / (97 + 20) = 0.829
FP rate = FP / (FP + TN)
    Positive: 20 / (20 + 97) = 0.171
    Negative: 22 / (22 + 70) = 0.239
Precision = TP / (TP + FP)
    Positive: 70 / (70 + 20) = 0.778
    Negative: 97 / (97 + 22) = 0.815
Recall = TP / (TP + FN)
    Positive: 70 / (70 + 22) = 0.761
    Negative: 97 / (97 + 20) = 0.829
F-Measure = (2 * recall * precision) / (recall + precision)
    Positive: (2 * 0.761 * 0.778) / (0.761 + 0.778) = 0.769
    Negative: (2 * 0.829 * 0.815) / (0.829 + 0.815) = 0.822

Note: with "positive" as the class of interest, TP = 70, FN = 22, FP = 20, and TN = 97; for the "negative" class the roles are swapped. Recall is the same as TP rate.

Figure 12. Measures calculated from the confusion matrix

B. Stratified Cross-Validation
The stratified cross-validation summary consists of essential classification accuracy measures such as the kappa statistic, MAE (Mean Absolute Error), and RMSE (Root Mean Squared Error). These measures indicate how accurate the algorithm is.

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         167      79.9043 %
Incorrectly Classified Instances        42      20.0957 %
Kappa statistic                          0.5913
Mean absolute error                      0.2779
Root mean squared error                  0.3869
Relative absolute error                 56.3624 %
Root relative squared error             77.9345 %
Total Number of Instances              209

Figure 13. Stratified Cross-Validation
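The per-class measures, the accuracy, and the kappa statistic reported in the Weka output can be recomputed from the confusion matrix alone. The sketch below does this in plain Python; the function names and the dictionary layout are assumptions made for illustration, and kappa is computed with the standard Cohen's kappa formula (observed agreement against the agreement expected by chance).

```python
# Confusion matrix from the Weka output: outer key = actual class, inner key = predicted class.
matrix = {
    "positive": {"positive": 70, "negative": 22},
    "negative": {"positive": 20, "negative": 97},
}


def per_class_measures(matrix, cls):
    """Compute TP rate (recall), FP rate, precision and F-measure for one class."""
    classes = list(matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls][other] for other in classes if other != cls)
    fp = sum(matrix[other][cls] for other in classes if other != cls)
    tn = sum(matrix[a][p] for a in classes for p in classes if a != cls and p != cls)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix."""
    classes = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    observed = sum(matrix[c][c] for c in classes) / total
    # Chance agreement: product of actual and predicted totals per class.
    expected = sum(
        sum(matrix[c].values()) * sum(matrix[a][c] for a in classes)
        for c in classes
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)


positive = per_class_measures(matrix, "positive")
negative = per_class_measures(matrix, "negative")
accuracy, kappa = accuracy_and_kappa(matrix)
```

Rounded to three or four decimals, the results agree with the "Detailed Accuracy By Class" rows and with the accuracy and kappa values in the summary.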

Figure 13 reports the numbers of correctly and incorrectly classified instances in the dataset together with the error measures. The errors are calculated with the predefined formulas discussed previously.

C. Detailed Accuracy by Class
The detailed accuracy measures, TP rate, FP rate, Precision, Recall, F-Measure, and ROC area, are estimated per class, i.e. for positive and negative, the two classes of the heart disease dataset.

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.761    0.171    0.778      0.761   0.769      0.838     positive
               0.829    0.239    0.815      0.829   0.822      0.838     negative
Weighted Avg.  0.799    0.209    0.799      0.799   0.799      0.838

Figure 14. Detailed Accuracy by class

Figure 14 shows the detailed accuracy by class. How the major measures are calculated is described briefly in the previous section.

D. Confusion Matrix
The main outcome of the algorithm's result is the confusion matrix, which is also commonly called a contingency table. In our case there are two classes, Positive and Negative, and therefore a 2x2 confusion matrix; in general the matrix can be arbitrarily large. The number of correctly classified instances is the sum of the diagonal entries of the matrix; all other entries are incorrectly classified instances.

=== Confusion Matrix ===

  a  b   <-- classified as
 70 22 |  a = positive
 20 97 |  b = negative

Figure 15. Confusion Matrix

E. Rules Interpretation
By applying the algorithm to the dataset, some rules are generated that help to predict the cause of heart disease.

JRIP rules:
===========

(exercice_angina = yes) => disease=positive (72.0/12.0)
(chest_pain = asympt) and (age = Upper) => disease=positive (9.0/2.0)
(chest_pain = typ_angina) => disease=positive (6.0/2.0)
 => disease=negative (122.0/21.0)

Number of Rules : 4

Figure 16. Rules Interpretation

These rules are generated by JRip and help the machine to take the final decision. They are not a final or exact result; they may vary on another machine or as the dataset changes. The decision tree generated by the classifier depends on these rules.

• Rule 1: If (Exercice_angina=yes) Then Disease=Positive.

• Rule 2: If (Chest_pain=asympt) And (age=Upper) Then Disease=Positive.

• Rule 3: If (Chest_pain=typ_angina) Then Disease=Positive (in some cases).

• Rule 4: Otherwise, when none of the rules above applies, Disease=Negative (in many cases).

F. Tree Visualization
In the section above, all the rules are interpreted. These rules are visualized in tree form. The tree visualization is the simplest way to understand the conditions and their results.

Figure 17. Tree Visualization
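The four JRip rules form an ordered decision list: the rules are tried from top to bottom and the first one whose condition holds decides the class, with the last rule acting as the default. A minimal Python sketch of that reading follows; the function name and the dictionary keys are hypothetical and simply mirror the attribute names in the Weka output.

```python
def jrip_predict(patient: dict) -> str:
    """Apply the four reported JRip rules in order; the first matching rule wins."""
    # Rule 1: (exercice_angina = yes) => positive
    if patient["exercice_angina"] == "yes":
        return "positive"
    # Rule 2: (chest_pain = asympt) and (age = Upper) => positive
    if patient["chest_pain"] == "asympt" and patient["age"] == "Upper":
        return "positive"
    # Rule 3: (chest_pain = typ_angina) => positive
    if patient["chest_pain"] == "typ_angina":
        return "positive"
    # Rule 4: default rule, reached only when no earlier rule matched.
    return "negative"


# Hypothetical patients, used only to show the calls.
patients = [
    {"exercice_angina": "yes", "chest_pain": "non_anginal", "age": "Avg"},
    {"exercice_angina": "no", "chest_pain": "asympt", "age": "Upper"},
    {"exercice_angina": "no", "chest_pain": "atyp_angina", "age": "Min"},
]
predictions = [jrip_predict(p) for p in patients]
```

Because the order matters, a patient with exercise-induced angina is classified by Rule 1 regardless of chest pain type or age.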

In Figure 17 the root node is 'exercice_angina' and its child node is 'chest_pain'. If a person's 'exercice_angina' value is 'yes', the person falls into the 'positive' category. The annotation (72.0/12.0) means that 72 instances are covered by this rule and 12 of them are misclassified. If the value is 'no', the condition on the 'chest_pain' attribute is checked next. Figure 17 shows the decision tree built by the classifier, including the root node, child nodes, and leaf nodes with their predicted values; the tree has 6 leaf nodes and a size of 11. This work applied the Simple CART algorithm in practice: the "Disease" attribute was analyzed, a confusion matrix was generated, and the essential measures such as TP rate, FP rate, Precision, and Recall were computed. Other attributes can be analyzed in the same way, which may help to solve more complex situations or queries in the prediction of heart attacks.

CONCLUSION
This research applied a data mining algorithm (Simple CART) in an experiment to predict heart attacks and to compare it with the best available methods of prediction. The experiment can serve as an important tool for physicians to predict risky cases in practice and advise accordingly. The model obtained from the classification will be able to answer more complex queries in the prediction of heart attacks. The predictive accuracy achieved by the Simple CART algorithm suggests that the parameters used are reliable indicators for predicting the presence of heart disease.

REFERENCES
[1] S. Aruna, Dr S.P. Rajagopalan and L. V. Nandakishore, "An Empirical Comparison Of Supervised learning algorithms in Disease ion", Vol. 1, August 2011.
[2] Leonard Gordon, "Using Classification and Regression Trees (CART) in SAS Enterprise Miner For Applications in Public Health", Paper 089-2013, 2013.
[3] Vikas Chaurasia, Saurabh Pal, "Early Prediction of Heart Diseases Using Data Mining Techniques", Vol. 1, 2013.
[4] Thenmozhi, P. Deepika, M. Meiyappasamy, "Different Data Mining Techniques Involved in Heart Disease Prediction: A Survey", Volume 3, ISSN No. 2277-8179, September 2014.
[5] Sushilkumar Kalmegh, "Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News", Vol. 2, Issue 2, February 2015.
[6] T. Miranda Lakshmi, A. Martin, R. Mumtaj Begum, Dr. V. Prasanna Venkatesan, "An Analysis on Performance of Decision Tree Algorithms using Student's Qualitative Data", I.J. Modern Education and Computer Science, 2013, 5, 18-27.
[7] Jyoti Rohilla, Preeti Gulia, "Analysis of Data Mining Techniques for Diagnosing Heart Disease", Volume 5, Issue 7, July 2015.
[8] Hlaudi Daniel Masethe, Mosima Anna Masethe, "Prediction of Heart Disease using Classification Algorithms", Vol II, WCECS 2014, 22-24 October, 2014, San Francisco, USA.
[9] Nidhi Bhatla, Kiran Jyoti, "An Analysis of Heart Disease Prediction using Different Data Mining Techniques", International Journal of Engineering and Technology, Vol. 1, Issue 8, 2012.
[10] Chaitrali S. Dangare and Sulabha S. Apte, "Improved Study Of Heart Disease Prediction Using Data Mining Classification Techniques", International Journal Of Computer Applications, Vol. 47, No. 10, pp. 0975-888, 2012.
[11] Atul Kumar Pandey, Prabhat Pandey, K.L. Jaiswal and Ashok Kumar Sen, "A Heart Disease Prediction Model using Decision Tree", IOSR Journal of Computer Engineering, Vol. 12, Issue 6, (Jul. - Aug. 2013), pp. 83-86.
[12] Tina R. Patil, Mrs. S.S. Sherekar, "Performance Analysis of Naive Bayes and J48 Classification algorithm for Data Classification", International Journal Of Computer Science and Applications, Vol. 6, No. 2, Apr 2013.
