
An Approach for Classification using Simple CART Algorithm in Weka

Dr. Neeraj Bhargava
Professor, School of System Sc. & Engg.
MDS University, Ajmer, India
profneerajbhargava@gmail.com

Sonia Dayma
School of System Sc. & Engg.
MDS University, Ajmer, India
soniadayma786@gmail.com

Abishek Kumar
School of System Sc. & Engg.
MDS University, Ajmer, India
ap481998@gmail.com

Pramod Singh
School of System Sc. & Engg.
MDS University, Ajmer, India
pramodrathore88@gmail.com

Abstract: A decision tree is normally applied in data mining in order to produce a model that predicts the value of a target (dependent) variable based on several input (independent) variables. CART algorithms are mainly used in medicine, statistics and related fields. For heart disease patients it is difficult for medical practitioners to predict a heart attack, as this is a complex task that requires experience and knowledge. The first part of the paper introduces the CART algorithm together with its applications and its classification method. The second part presents a real-world dataset of male patients taken for further analysis.
Keywords: CART, DM, Algo, Predictor, Weka tool

The CART (Classification and Regression Tree) algorithm is applied in WEKA and used in data mining to classify the variables of a dataset; the tree describes the class of the variable used in the dataset. A regression tree is built when the target variable is continuous, and the tree is used to make the decision. Here the CART algorithm is applied to a dataset of heart disease in male patients. The data mining task is to predict the class based on selected attributes of the dataset, such as age, chest pain, resting blood pressure, blood sugar, resting electrograph, maximum heart rate, exercise angina and disease. The confusion matrix is analyzed to determine which attribute is the best predictor for a correct diagnosis. This algorithm may help experts to predict possible heart attacks from patient datasets [1].
I. EASE OF USE

The CART algorithm is used in agriculture, biomedicine, plant disease detection, pharmacology, medical research, financial analysis and many other fields where a dataset is available for making decisions [3].
II. INTRODUCTION

This paper focuses on predicting the possible number of heart attack patients from the dataset using data mining techniques, and determines which model gives the highest percentage of correct predictions for the diagnoses.

III. METHODOLOGY

This section describes the methodology, with emphasis on the various techniques applied to the dataset [5][7]. These techniques are required for the prediction of heart disease in patients.
A. Heart Disease Data Set
The patient dataset is compiled from data collected from medical practitioners. Only 7 attributes from the database are considered for the predictions required for heart disease. The following attributes with their nominal values are considered.
TABLE I. DATASET OF HEART DISEASE

Attribute         Type      Value
Age               Nominal   Min = 20-40, Avg = 41-60, Upper = 61-80
Chest Pain        Nominal   asympt, atyp_angina, non_anginal, typ_angina
Rest b press      Numeric   Good = 120/80 mm, Average = 129-139, Poor = 179/90
Blood Sugar       Nominal   F = False, T = True
Rest Electro      Nominal   normal, left_vent_hyper, st_t_wave_abnormality
Max Heart Rate    Nominal   A <= 100, B > 100 & <= 150, C > 150
Exercise Angina   Nominal   yes, no
Disease           Nominal   positive, negative

There are eight main attributes in the dataset. Every attribute except disease represents a possible cause of heart disease, and each cause is categorized into predefined measures. These measures are categorized to make the result efficient [8][11].
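Before any classifier is applied, a dataset like this is normally stored as an ARFF file and loaded into Weka. The following is a minimal sketch of that step using the Weka Java API; the file name heart_disease_male.arff is an assumption derived from the relation name reported later in the run information.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadHeartData {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; the file name is assumed from the relation
        // name "heart_disease_male" shown in the Weka run information.
        Instances data = DataSource.read("heart_disease_male.arff");

        // The last attribute, "disease", is the class to be predicted.
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Instances:  " + data.numInstances());   // expected: 209
        System.out.println("Attributes: " + data.numAttributes());  // expected: 8
        System.out.println("Class:      " + data.classAttribute().name());
    }
}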
B. Visualization of Heart Patients
The heart dataset is analyzed visually using the different attributes, showing the distribution of their values.

Figure 1. AGE
Figure 2. CHEST_PAIN
Figure 3. Rest blood Pressure
Figure 4. Blood_sugar
Figure 5. Rest_electro
Figure 6. max_heart_rate
Figure 7. Exercice_angina
Figure 8. Disease

These visualizations show, for each attribute, its name and its values according to the dataset. The number of bars in each graph corresponds to the number of categories of the attribute, and the heights vary with the number of instances in each category. For example, the age attribute has three major categories, Avg (average), Upper and Min (minimum), with 149, 48 and 12 instances respectively residing in these categories.

C. Performance of the Classifier CART
This subsection shows the experimental results used to evaluate the performance of CART on the heart patient dataset.

TABLE II. PERFORMANCE OF CLASSIFIER CART

Classifier: CART
Evaluation Criteria                  Value
Time to build the model (in sec)     0.08
Correctly Classified Instances       167
Incorrectly Classified Instances     42
Accuracy                             79.9043 %

IV. RESULT

A. Final Result
When the algorithm is applied to the dataset, the result shown in Figure 9 is produced. It consists of information about the dataset analysis, such as the total number of instances, the classified and unclassified instances, the classification accuracy measures, the detailed accuracy measures and the confusion matrix.
=== Run information ===

Scheme: weka.classifiers.trees.SimpleCart -S 1 -M 2.0 -N 5 -C 1.0
Relation: heart_disease_male
Instances: 209
Attributes: 8
  age
  chest_pain
  rest_bpress
  blood_sugar
  rest_electro
  max_heart_rate
  exercice_angina
  disease
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

CART Decision Tree

chest_pain=(atyp_angina)|(non_anginal): negative(88.0/13.0)
chest_pain!=(atyp_angina)|(non_anginal)
|  exercice_angina=(no)
|  |  age=(Avg)|(Min)
|  |  |  max_heart_rate=(A)|(B): negative(17.0/10.0)
|  |  |  max_heart_rate!=(A)|(B)
|  |  |  |  rest_bpress=(Poor): negative(3.0/1.0)
|  |  |  |  rest_bpress!=(Poor): positive(7.0/1.0)
|  |  age!=(Avg)|(Min): positive(7.0/2.0)
|  exercice_angina!=(no): positive(54.0/6.0)

Number of Leaf Nodes: 6
Size of the Tree: 11

Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances      167      79.9043 %
Incorrectly Classified Instances     42      20.0957 %
Kappa statistic                       0.5913
Mean absolute error                   0.2779
Root mean squared error               0.3869
Relative absolute error              56.3624 %
Root relative squared error          77.9345 %
Total Number of Instances           209

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.761    0.171     0.778     0.761    0.769      0.838   positive
                0.829    0.239     0.815     0.829    0.822      0.838   negative
Weighted Avg.   0.799    0.209     0.799     0.799    0.799      0.838

=== Confusion Matrix ===

   a   b   <-- classified as
  70  22 |  a = positive
  20  97 |  b = negative

Figure 9. Root Result

Figure 9 shows the above run information, including the relation name, the number of instances and the attributes. The 10-fold cross-validation means that the dataset is split into 10 equal slices; the algorithm is then trained and tested 10 times, each time using a different slice as the test set and the remaining slices for training.

TP rate = TP / (TP + FN)
  positive: 70 / (70 + 22) = 0.761
  negative: 97 / (97 + 20) = 0.829

FP rate = FP / (FP + TN)
  positive: 20 / (20 + 97) = 0.171
  negative: 22 / (22 + 70) = 0.239

Precision = TP / (TP + FP)
  positive: 70 / (70 + 20) = 0.778
  negative: 97 / (97 + 22) = 0.815

Recall = TP / (TP + FN)
  positive: 70 / (70 + 22) = 0.761
  negative: 97 / (97 + 20) = 0.829

F-Measure = (2 * Recall * Precision) / (Recall + Precision)
  positive: (2 * 0.761 * 0.778) / (0.761 + 0.778) = 0.769
  negative: (2 * 0.829 * 0.815) / (0.829 + 0.815) = 0.822

Note: for the positive class TP = 70, TN = 97, FN = 22 and FP = 20; for the negative class the two classes swap roles, so its TP is 97 and its FP is 22. Recall is the same as the TP rate.
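These worked examples can be checked with a short, self-contained piece of code. The sketch below (not part of the original experiment) recomputes the measures for the positive class directly from the confusion matrix values TP = 70, FN = 22, FP = 20 and TN = 97.

public class ConfusionMetrics {
    public static void main(String[] args) {
        // Counts for the "positive" class taken from the confusion matrix above.
        double tp = 70, fn = 22, fp = 20, tn = 97;

        double tpRate    = tp / (tp + fn);                 // 70/92  = 0.761
        double fpRate    = fp / (fp + tn);                 // 20/117 = 0.171
        double precision = tp / (tp + fp);                 // 70/90  = 0.778
        double recall    = tpRate;                         // recall is the same as TP rate
        double fMeasure  = 2 * precision * recall / (precision + recall);  // 0.769

        System.out.printf("TP rate=%.3f FP rate=%.3f precision=%.3f recall=%.3f F-measure=%.3f%n",
                tpRate, fpRate, precision, recall, fMeasure);
    }
}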

Figure 12. Confusion Matrix

Figure10. Final Results


Figure 10 shows the decision tree built by the classifier; it includes the root node, the child nodes and the leaf nodes with their possible/predicted values. For example, the number of leaf nodes is 6 and the size of the tree is 11.

B. Stratified Cross-Validation
Stratified cross-validation provides some essential classification accuracy measures, such as the kappa statistic, the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error). These measures reflect the accuracy of the algorithm.
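A minimal sketch of how this stratified 10-fold cross-validation and its measures could be reproduced through the Weka Java API is shown below. It assumes a Weka installation in which SimpleCart is available under the class name weka.classifiers.trees.SimpleCart reported in the run information (in recent Weka releases this class is provided by the simpleCART package), and the ARFF file name is an assumption taken from the relation name.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.SimpleCart;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class CartCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_disease_male.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);                // class = disease

        // Same options as in the run information: -S 1 -M 2.0 -N 5 -C 1.0
        SimpleCart cart = new SimpleCart();
        cart.setOptions(Utils.splitOptions("-S 1 -M 2.0 -N 5 -C 1.0"));

        // Stratified 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(cart, data, 10, new Random(1));

        System.out.println(eval.toSummaryString("\n=== Summary ===\n", false));
        System.out.printf("Kappa statistic: %.4f%n", eval.kappa());
        System.out.printf("MAE:  %.4f%n", eval.meanAbsoluteError());
        System.out.printf("RMSE: %.4f%n", eval.rootMeanSquaredError());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}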


Figure11. Final Results


Figure 11 shows the detailed accuracy and the confusion matrix produced on the basis of the classes in the dataset. The detailed accuracy is measured by major measures such as the TP (True Positive) rate, FP (False Positive) rate, Precision, Recall and F-Measure. These measures are calculated from the confusion matrix shown in Figure 12.

Figure13. Stratified Cross-Validation


Figure 13 shows some important measures derived from the correctly and incorrectly classified instances in the dataset, namely the errors. The errors are calculated with the predefined formulas discussed previously.

C. Detailed Accuracy by Class


The detailed accuracy measures, such as TP rate, FP rate, Precision, Recall, F-Measure and ROC area, are estimated per class, i.e. for the positive and negative classes of the heart disease dataset.


Figure14. Detailed Accuracy by class


Figure 14 shows the detailed accuracy by class. How all the major measures are calculated has been described briefly above.
D. Confusion Matrix
The central element of the algorithm's result is the confusion matrix, commonly called a contingency table. In our case there are two classes, positive and negative, so the confusion matrix is 2x2; in general the matrix can be arbitrarily large. The number of correctly classified instances is the sum of the diagonal elements of the matrix; all other entries are incorrectly classified. Here, 70 + 97 = 167 instances are correctly classified and 22 + 20 = 42 are incorrectly classified, giving an accuracy of 167/209 = 79.9 %.

Figure 15. Confusion Matrix

Figure 15 shows the confusion matrix produced by the classifier; how it is used to calculate the major measures has been described previously.
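The diagonal-sum rule described above can be written as a short self-contained sketch; the matrix values are those reported for Simple CART.

public class ConfusionMatrixAccuracy {
    public static void main(String[] args) {
        // Confusion matrix reported for Simple CART (rows = actual class).
        int[][] cm = { { 70, 22 },    // a = positive
                       { 20, 97 } };  // b = negative

        int correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++) {
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
                if (i == j) correct += cm[i][j];   // diagonal = correctly classified
            }
        }
        System.out.printf("correct=%d incorrect=%d accuracy=%.4f%%%n",
                correct, total - correct, 100.0 * correct / total);
        // prints: correct=167 incorrect=42 accuracy=79.9043%
    }
}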

E. Rules Interpretation
By applying the algorithm to the dataset, some rules are generated that are helpful for predicting the correct cause of heart disease. These rules are produced by the JRip rule learner and help the machine to take the final decision. They are not a final or exact result; they may vary on another machine or when the dataset changes. The decision tree generated by the classifier depends on these rules. A sketch of how such a rule set can be obtained with Weka's JRip is given after the rules below.

Figure 16. Rules Interpretation

Rule 1: If (Exercice_angina = yes) Then Disease = Positive.
Rule 2: If (Chest_pain = asympt) And (Age = Upper) Then Disease = Positive.
Rule 3: If (Chest_pain = typ_angina) Then Disease = Positive (in some cases).
Rule 4: If (Chest_pain = typ_angina) Then Disease = Negative (in many cases).

F. Tree Visualization
In the above section all the rules were interpreted. These rules can also be visualized in tree form, and the tree visualization is the simplest way to understand the conditions and their results.


Figure17. Tree Visualization


In Figure 17 the root node is exercice_angina and its child node is chest_pain. If a person's exercice_angina value is yes, that person falls into the positive category; (72.0/12.0) means that 72 instances reach this leaf and 12 of them are misclassified. The same applies for the value no, after which the condition on the chest_pain attribute is checked. Figure 17 shows the decision tree built by the classifier; it includes the root node, the child nodes and the leaf nodes with their possible/predicted values. For example, the number of leaf nodes is 6 and the size of the tree is 11. In this experiment the Simple CART algorithm was applied in practice: the disease attribute was analyzed, a confusion matrix was generated, and all the essential measures such as TP rate, FP rate, Precision and Recall were computed. In the same way other attributes can be analyzed, which may help to solve more complex situations or queries in the prediction of heart attack disease.
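As an illustration of how the trained model could answer such queries, the sketch below trains Simple CART on the full dataset and classifies one hypothetical new patient. The attribute names come from the run information and the nominal labels from Table I and the tree output, but their exact spelling in the real ARFF file, as well as the use of the newer DenseInstance API, are assumptions.

import weka.classifiers.trees.SimpleCart;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictOnePatient {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_disease_male.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);                // class = disease

        SimpleCart cart = new SimpleCart();
        cart.buildClassifier(data);
        System.out.println(cart);   // prints the tree with its leaves and size

        // A hypothetical new patient, described with labels assumed from Table I
        // and the printed tree; attributes not set here are left missing.
        DenseInstance patient = new DenseInstance(data.numAttributes());
        patient.setDataset(data);   // attach the header information
        patient.setValue(data.attribute("chest_pain"), "asympt");
        patient.setValue(data.attribute("exercice_angina"), "no");
        patient.setValue(data.attribute("age"), "Avg");
        patient.setValue(data.attribute("max_heart_rate"), "A");
        patient.setValue(data.attribute("rest_bpress"), "Poor");

        double label = cart.classifyInstance(patient);
        System.out.println("Predicted disease: " + data.classAttribute().value((int) label));
    }
}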
CONCLUSION

This research undertook an experiment on the application of a data mining algorithm (Simple CART) in order to predict heart attacks and to compare the best available methods of prediction. The experiment can serve as an important tool for physicians to predict risky cases in practice and to advise patients accordingly. The model obtained from the classification will be able to answer more complex queries in the prediction of heart attack diseases. The predictive accuracy determined by the Simple CART algorithm suggests that the parameters used are reliable indicators for predicting the presence of heart disease.
REFERENCES
[1] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore, An Empirical Comparison of Supervised Learning Algorithms in Disease Detection, Vol. 1, August 2011.
[2] Leonard Gordon, Using Classification and Regression Trees (CART) in SAS Enterprise Miner for Applications in Public Health, Paper 089-2013, 2013.
[3] Vikas Chaurasia, Saurabh Pal, Early Prediction of Heart Diseases Using Data Mining Techniques, Vol. 1, 2013.
[4] Thenmozhi, P. Deepika, M. Meiyappasamy, Different Data Mining Techniques Involved in Heart Disease Prediction: A Survey, Volume 3, ISSN No. 2277-8179, September 2014.
[5] Sushilkumar Kalmegh, Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News, Vol. 2, Issue 2, February 2015.
[6] T. Miranda Lakshmi, A. Martin, R. Mumtaj Begum, V. Prasanna Venkatesan, An Analysis on Performance of Decision Tree Algorithms using Students' Qualitative Data, I.J. Modern Education and Computer Science, 2013, 5, 18-27.
[7] Jyoti Rohilla, Preeti Gulia, Analysis of Data Mining Techniques for Diagnosing Heart Disease, Volume 5, Issue 7, July 2015.
[8] Hlaudi Daniel Masethe, Mosima Anna Masethe, Prediction of Heart Disease using Classification Algorithms, Vol. II, WCECS 2014, 22-24 October 2014, San Francisco, USA.
[9] Nidhi Bhatla, Kiran Jyoti, An Analysis of Heart Disease Prediction using Different Data Mining Techniques, International Journal of Engineering and Technology, Vol. 1, Issue 8, 2012.
[10] Chaitrali S. Dangare and Sulabha S. Apte, Improved Study of Heart Disease Prediction Using Data Mining Classification Techniques, International Journal of Computer Applications, Vol. 47, No. 10, ISSN 0975-8887, 2012.
[11] Atul Kumar Pandey, Prabhat Pandey, K.L. Jaiswal and Ashok Kumar Sen, A Heart Disease Prediction Model using Decision Tree, IOSR Journal of Computer Engineering, Vol. 12, Issue 6, Jul.-Aug. 2013, pp. 83-86.
[12] Tina R. Patil, S.S. Sherekar, Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification, International Journal of Computer Science and Applications, Vol. 6, No. 2, April 2013.
