
An Approach for Classification using Simple CART Algorithm in Weka

Dr. Neeraj Bhargava
Professor, School of System Sc. & Engg.
MDS University, Ajmer, India
profneerajbhargava@gmail.com

Sonia Dayma
School of System Sc. & Engg.
MDS University, Ajmer, India
soniadayma786@gmail.com

Abishek Kumar
School of System Sc. & Engg.
MDS University, Ajmer, India
ap481998@gmail.com

Pramod Singh
School of System Sc. & Engg.
MDS University, Ajmer, India
pramodrathore88@gmail.com

Abstract: A decision tree is normally applied in data mining in order to produce a model that predicts the value of a target (dependent) variable based on several input (independent) variables. CART algorithms are mainly used in medicine, statistics and related fields. For heart disease patients it is difficult for medical practitioners to predict a heart attack, as this is a complex task that requires experience and knowledge. The first part of the paper introduces the CART algorithm together with its applications and its classification method. The second part presents a real-world dataset of male patients taken for further analysis.
Keywords: CART, DM, Algo, Predictor, Weka tool

The CART (Classification and Regression Tree) algorithm is applied in WEKA and used in data mining to classify the variables of a dataset; the tree describes the class of the variable used in the dataset. A regression tree is built when the target variable is continuous, and the tree is used to make the decision. Here the CART algorithm is applied to a dataset of heart disease in male patients. The data mining task is to predict the class based on selected attributes of the dataset, such as age, chest pain, resting blood pressure, blood sugar, resting electrograph, maximum heart rate, exercise angina and disease. The confusion matrix is analyzed to determine which attribute is the best predictor for a correct diagnosis. This algorithm may help experts to predict possible heart attacks from patient datasets [1].
I. EASE OF USE

The CART algorithm is used in agriculture, biomedicine, plant disease detection, pharmacology, medical research, financial analysis and many other fields where a dataset is available for making decisions [3].
II. INTRODUCTION

This paper focuses on predicting the possible number of heart attack patients from the dataset using data mining techniques, and determines which model gives the highest percentage of correct predictions for the diagnoses.

III. METHODOLOGY

This section describes the methodology, with emphasis on the various techniques applied to the dataset [5][7]. These techniques are required for the prediction of heart disease in patients.
A. Heart Disease Data Set
The patient dataset is compiled from data collected from medical practitioners. Only 7 attributes from the database are considered for the predictions required for heart disease. The following attributes with their nominal values are considered.
TABLE I. DATASET OF HEART DISEASE

Attribute         Type      Value
Age               Nominal   Min = 20-40, Avg = 41-60, Upper = 61-80
Chest Pain        Nominal   asympt, atyp_angina, non_anginal, typ_angina
Rest b press      Numeric   Good = 120/80 mm, Average = 129-139, Poor = 179/90
Blood Sugar       Nominal   F = False, T = True
Rest Electro      Nominal   normal, left_vent_hyper, st_t_wave_abnormality
Max Heart Rate    Nominal   A <= 100, B > 100 & <= 150, C > 150
Exercise Angina   Nominal   yes, no
Disease           Nominal   positive, negative

There are eight main attributes in the dataset. Every attribute except disease represents a possible cause of heart disease, and each cause is categorized into predefined measures. These measures are categorized to make the result efficient [8][11].
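Before any classifier is applied, a dataset like this is normally stored as an ARFF file and loaded into Weka. The following is a minimal sketch of that step using the Weka Java API; the file name heart_disease_male.arff is an assumption derived from the relation name reported later in the run information.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadHeartData {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; the file name is assumed from the relation
        // name "heart_disease_male" shown in the Weka run information.
        Instances data = DataSource.read("heart_disease_male.arff");

        // The last attribute, "disease", is the class to be predicted.
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Instances:  " + data.numInstances());   // expected: 209
        System.out.println("Attributes: " + data.numAttributes());  // expected: 8
        System.out.println("Class:      " + data.classAttribute().name());
    }
}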
B. Visualization of Heart Patients
The heart dataset is analyzed visually using the different attributes, showing the distribution of their values.

Figure 1. AGE
Figure 2. CHEST_PAIN
Figure 3. Rest blood Pressure
Figure 4. Blood_sugar
Figure 5. Rest_electro
Figure 6. max_heart_rate
Figure 7. Exercice_angina
Figure 8. Disease

These visualizations show, for each attribute, its name and its values according to the dataset. The number of bars in each graph corresponds to the number of categories of the attribute, and the heights vary with the number of instances in each category. For example, the age attribute has three major categories, Avg (average), Upper and Min (minimum), with 149, 48 and 12 instances respectively residing in these categories.

C. Performance of the Classifier CART
This subsection shows the experimental results used to evaluate the performance of CART on the heart patient dataset.

TABLE II. PERFORMANCE OF CLASSIFIER CART

Classifier: CART
Evaluation Criteria                  Value
Time to build the model (in sec)     0.08
Correctly Classified Instances       167
Incorrectly Classified Instances     42
Accuracy                             79.9043 %

IV. RESULT

A. Final Result
When the algorithm is applied to the dataset, the result shown in Figure 9 is produced. It consists of information about the dataset analysis, such as the total number of instances, the classified and unclassified instances, the classification accuracy measures, the detailed accuracy measures and the confusion matrix.
=== Run information ===

Scheme: weka.classifiers.trees.SimpleCart -S 1 -M 2.0 -N 5 -C 1.0
Relation: heart_disease_male
Instances: 209
Attributes: 8
  age
  chest_pain
  rest_bpress
  blood_sugar
  rest_electro
  max_heart_rate
  exercice_angina
  disease
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

CART Decision Tree

chest_pain=(atyp_angina)|(non_anginal): negative(88.0/13.0)
chest_pain!=(atyp_angina)|(non_anginal)
|  exercice_angina=(no)
|  |  age=(Avg)|(Min)
|  |  |  max_heart_rate=(A)|(B): negative(17.0/10.0)
|  |  |  max_heart_rate!=(A)|(B)
|  |  |  |  rest_bpress=(Poor): negative(3.0/1.0)
|  |  |  |  rest_bpress!=(Poor): positive(7.0/1.0)
|  |  age!=(Avg)|(Min): positive(7.0/2.0)
|  exercice_angina!=(no): positive(54.0/6.0)

Number of Leaf Nodes: 6
Size of the Tree: 11

Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances      167      79.9043 %
Incorrectly Classified Instances     42      20.0957 %
Kappa statistic                       0.5913
Mean absolute error                   0.2779
Root mean squared error               0.3869
Relative absolute error              56.3624 %
Root relative squared error          77.9345 %
Total Number of Instances           209

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.761    0.171     0.778     0.761    0.769      0.838   positive
                0.829    0.239     0.815     0.829    0.822      0.838   negative
Weighted Avg.   0.799    0.209     0.799     0.799    0.799      0.838

=== Confusion Matrix ===

   a   b   <-- classified as
  70  22 |  a = positive
  20  97 |  b = negative

Figure 9. Root Result

Figure 9 shows the above run information, including the relation name, the number of instances and the attributes. The 10-fold cross-validation means that the dataset is split into 10 equal slices; the algorithm is then trained and tested 10 times, each time using a different slice as the test set and the remaining slices for training.

TP rate = TP / (TP + FN)
  positive: 70 / (70 + 22) = 0.761
  negative: 97 / (97 + 20) = 0.829

FP rate = FP / (FP + TN)
  positive: 20 / (20 + 97) = 0.171
  negative: 22 / (22 + 70) = 0.239

Precision = TP / (TP + FP)
  positive: 70 / (70 + 20) = 0.778
  negative: 97 / (97 + 22) = 0.815

Recall = TP / (TP + FN)
  positive: 70 / (70 + 22) = 0.761
  negative: 97 / (97 + 20) = 0.829

F-Measure = (2 * Recall * Precision) / (Recall + Precision)
  positive: (2 * 0.761 * 0.778) / (0.761 + 0.778) = 0.769
  negative: (2 * 0.829 * 0.815) / (0.829 + 0.815) = 0.822

Note: for the positive class TP = 70, TN = 97, FN = 22 and FP = 20; for the negative class the two classes swap roles, so its TP is 97 and its FP is 22. Recall is the same as the TP rate.
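These worked examples can be checked with a short, self-contained piece of code. The sketch below (not part of the original experiment) recomputes the measures for the positive class directly from the confusion matrix values TP = 70, FN = 22, FP = 20 and TN = 97.

public class ConfusionMetrics {
    public static void main(String[] args) {
        // Counts for the "positive" class taken from the confusion matrix above.
        double tp = 70, fn = 22, fp = 20, tn = 97;

        double tpRate    = tp / (tp + fn);                 // 70/92  = 0.761
        double fpRate    = fp / (fp + tn);                 // 20/117 = 0.171
        double precision = tp / (tp + fp);                 // 70/90  = 0.778
        double recall    = tpRate;                         // recall is the same as TP rate
        double fMeasure  = 2 * precision * recall / (precision + recall);  // 0.769

        System.out.printf("TP rate=%.3f FP rate=%.3f precision=%.3f recall=%.3f F-measure=%.3f%n",
                tpRate, fpRate, precision, recall, fMeasure);
    }
}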

Figure 12. Confusion Matrix

Figure10. Final Results


Figure 10 shows the decision tree built by the classifier; it includes the root node, the child nodes and the leaf nodes with their possible/predicted values. For example, the number of leaf nodes is 6 and the size of the tree is 11.

B. Stratified Cross-Validation
Stratified cross-validation provides some essential classification accuracy measures, such as the kappa statistic, the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error). These measures reflect the accuracy of the algorithm.
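A minimal sketch of how this stratified 10-fold cross-validation and its measures could be reproduced through the Weka Java API is shown below. It assumes a Weka installation in which SimpleCart is available under the class name weka.classifiers.trees.SimpleCart reported in the run information (in recent Weka releases this class is provided by the simpleCART package), and the ARFF file name is an assumption taken from the relation name.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.SimpleCart;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class CartCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_disease_male.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);                // class = disease

        // Same options as in the run information: -S 1 -M 2.0 -N 5 -C 1.0
        SimpleCart cart = new SimpleCart();
        cart.setOptions(Utils.splitOptions("-S 1 -M 2.0 -N 5 -C 1.0"));

        // Stratified 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(cart, data, 10, new Random(1));

        System.out.println(eval.toSummaryString("\n=== Summary ===\n", false));
        System.out.printf("Kappa statistic: %.4f%n", eval.kappa());
        System.out.printf("MAE:  %.4f%n", eval.meanAbsoluteError());
        System.out.printf("RMSE: %.4f%n", eval.rootMeanSquaredError());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}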


Figure11. Final Results


Figure 11 shows the detailed accuracy and the confusion matrix produced on the basis of the classes in the dataset. The detailed accuracy is measured by major measures such as the TP (True Positive) rate, FP (False Positive) rate, Precision, Recall and F-Measure. These measures are calculated from the confusion matrix shown in Figure 12.

Figure13. Stratified Cross-Validation


Figure 13 shows some important measures derived from the correctly and incorrectly classified instances in the dataset, namely the errors. The errors are calculated with the predefined formulas discussed previously.

C. Detailed Accuracy by Class


The detailed accuracy measures, such as TP rate, FP rate, Precision, Recall, F-Measure and ROC area, are estimated per class, i.e. for the positive and negative classes of the heart disease dataset.


Figure14. Detailed Accuracy by class


Figure 14 shows the detailed accuracy by class. How all the major measures are calculated has been described briefly above.
D. Confusion Matrix
The central element of the algorithm's result is the confusion matrix, commonly called a contingency table. In our case there are two classes, positive and negative, so the confusion matrix is 2x2; in general the matrix can be arbitrarily large. The number of correctly classified instances is the sum of the diagonal elements of the matrix; all other entries are incorrectly classified. Here, 70 + 97 = 167 instances are correctly classified and 22 + 20 = 42 are incorrectly classified, giving an accuracy of 167/209 = 79.9 %.

Figure 15. Confusion Matrix

Figure 15 shows the confusion matrix produced by the classifier; how it is used to calculate the major measures has been described previously.
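The diagonal-sum rule described above can be written as a short self-contained sketch; the matrix values are those reported for Simple CART.

public class ConfusionMatrixAccuracy {
    public static void main(String[] args) {
        // Confusion matrix reported for Simple CART (rows = actual class).
        int[][] cm = { { 70, 22 },    // a = positive
                       { 20, 97 } };  // b = negative

        int correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++) {
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
                if (i == j) correct += cm[i][j];   // diagonal = correctly classified
            }
        }
        System.out.printf("correct=%d incorrect=%d accuracy=%.4f%%%n",
                correct, total - correct, 100.0 * correct / total);
        // prints: correct=167 incorrect=42 accuracy=79.9043%
    }
}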

E. Rules Interpretation
By applying the algorithm to the dataset, some rules are generated that are helpful for predicting the correct cause of heart disease. These rules are produced by the JRip rule learner and help the machine to take the final decision. They are not a final or exact result; they may vary on another machine or when the dataset changes. The decision tree generated by the classifier depends on these rules. A sketch of how such a rule set can be obtained with Weka's JRip is given after the rules below.

Figure 16. Rules Interpretation

Rule 1: If (Exercice_angina = yes) Then Disease = Positive.
Rule 2: If (Chest_pain = asympt) And (Age = Upper) Then Disease = Positive.
Rule 3: If (Chest_pain = typ_angina) Then Disease = Positive (in some cases).
Rule 4: If (Chest_pain = typ_angina) Then Disease = Negative (in many cases).

F. Tree Visualization
In the above section all the rules were interpreted. These rules can also be visualized in tree form, and the tree visualization is the simplest way to understand the conditions and their results.


Figure17. Tree Visualization


In Figure 17 the root node is exercice_angina and its child node is chest_pain. If a person's exercice_angina value is yes, that person falls into the positive category; (72.0/12.0) means that 72 instances reach this leaf and 12 of them are misclassified. The same applies for the value no, after which the condition on the chest_pain attribute is checked. Figure 17 shows the decision tree built by the classifier; it includes the root node, the child nodes and the leaf nodes with their possible/predicted values. For example, the number of leaf nodes is 6 and the size of the tree is 11. In this experiment the Simple CART algorithm was applied in practice: the disease attribute was analyzed, a confusion matrix was generated, and all the essential measures such as TP rate, FP rate, Precision and Recall were computed. In the same way other attributes can be analyzed, which may help to solve more complex situations or queries in the prediction of heart attack disease.
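As an illustration of how the trained model could answer such queries, the sketch below trains Simple CART on the full dataset and classifies one hypothetical new patient. The attribute names come from the run information and the nominal labels from Table I and the tree output, but their exact spelling in the real ARFF file, as well as the use of the newer DenseInstance API, are assumptions.

import weka.classifiers.trees.SimpleCart;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictOnePatient {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("heart_disease_male.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);                // class = disease

        SimpleCart cart = new SimpleCart();
        cart.buildClassifier(data);
        System.out.println(cart);   // prints the tree with its leaves and size

        // A hypothetical new patient, described with labels assumed from Table I
        // and the printed tree; attributes not set here are left missing.
        DenseInstance patient = new DenseInstance(data.numAttributes());
        patient.setDataset(data);   // attach the header information
        patient.setValue(data.attribute("chest_pain"), "asympt");
        patient.setValue(data.attribute("exercice_angina"), "no");
        patient.setValue(data.attribute("age"), "Avg");
        patient.setValue(data.attribute("max_heart_rate"), "A");
        patient.setValue(data.attribute("rest_bpress"), "Poor");

        double label = cart.classifyInstance(patient);
        System.out.println("Predicted disease: " + data.classAttribute().value((int) label));
    }
}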
CONCLUSION

This research undertook an experiment on the application of a data mining algorithm (Simple CART) in order to predict heart attacks and to compare the best available methods of prediction. The experiment can serve as an important tool for physicians to predict risky cases in practice and to advise patients accordingly. The model obtained from the classification will be able to answer more complex queries in the prediction of heart attack diseases. The predictive accuracy determined by the Simple CART algorithm suggests that the parameters used are reliable indicators for predicting the presence of heart disease.
REFERENCES
[1] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore, An Empirical Comparison of Supervised Learning Algorithms in Disease Detection, Vol. 1, August 2011.
[2] Leonard Gordon, Using Classification and Regression Trees (CART) in SAS Enterprise Miner for Applications in Public Health, Paper 089-2013, 2013.
[3] Vikas Chaurasia, Saurabh Pal, Early Prediction of Heart Diseases Using Data Mining Techniques, Vol. 1, 2013.
[4] Thenmozhi, P. Deepika, M. Meiyappasamy, Different Data Mining Techniques Involved in Heart Disease Prediction: A Survey, Volume 3, ISSN No. 2277-8179, September 2014.
[5] Sushilkumar Kalmegh, Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News, Vol. 2, Issue 2, February 2015.
[6] T. Miranda Lakshmi, A. Martin, R. Mumtaj Begum, V. Prasanna Venkatesan, An Analysis on Performance of Decision Tree Algorithms using Students' Qualitative Data, I.J. Modern Education and Computer Science, 2013, 5, 18-27.
[7] Jyoti Rohilla, Preeti Gulia, Analysis of Data Mining Techniques for Diagnosing Heart Disease, Volume 5, Issue 7, July 2015.
[8] Hlaudi Daniel Masethe, Mosima Anna Masethe, Prediction of Heart Disease using Classification Algorithms, Vol. II, WCECS 2014, 22-24 October 2014, San Francisco, USA.
[9] Nidhi Bhatla, Kiran Jyoti, An Analysis of Heart Disease Prediction using Different Data Mining Techniques, International Journal of Engineering and Technology, Vol. 1, Issue 8, 2012.
[10] Chaitrali S. Dangare and Sulabha S. Apte, Improved Study of Heart Disease Prediction Using Data Mining Classification Techniques, International Journal of Computer Applications, Vol. 47, No. 10, ISSN 0975-8887, 2012.
[11] Atul Kumar Pandey, Prabhat Pandey, K.L. Jaiswal and Ashok Kumar Sen, A Heart Disease Prediction Model using Decision Tree, IOSR Journal of Computer Engineering, Vol. 12, Issue 6, Jul.-Aug. 2013, pp. 83-86.
[12] Tina R. Patil, S.S. Sherekar, Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification, International Journal of Computer Science and Applications, Vol. 6, No. 2, April 2013.
