

Cardiac data mining (CDM); organization and predictive analytics on biomedical
(cardiac) data
M. Musa Bilal, Masood Hussain, Iqra Basharat, and Mamuna Fatima

Citation: AIP Conference Proceedings 1559, 260 (2013); doi: 10.1063/1.4825018


View online: http://dx.doi.org/10.1063/1.4825018
View Table of Contents: http://scitation.aip.org/content/aip/proceeding/aipcp/1559?ver=pdfcov
Published by the AIP Publishing


This article is copyrighted as indicated in the article. Reuse of AIP content is subject to the terms at: http://scitation.aip.org/termsconditions. Downloaded to IP:
111.68.101.200 On: Mon, 08 Dec 2014 07:26:58
Cardiac Data Mining (CDM);
Organization and Predictive Analytics on
Biomedical (Cardiac) Data
M. Musa Bilal, Masood Hussain, Iqra Basharat, Mamuna Fatima
Department of Computer Engineering, College of E&ME (NUST), Peshawar Road, Rawalpindi, Pakistan

Abstract – Data mining and data analytics have been of immense importance to many different fields as we have witnessed the evolution of data science over recent years. Biostatistics and medical informatics have proved to be the foundation of many modern biological theories and analysis techniques. These are the fields that apply data mining practices, along with statistical models, to discover hidden trends in data comprising biological experiments or procedures on different entities. The objective of this research study is to develop a system for the efficient extraction, transformation and loading of such data from cardiologic procedure reports provided by the Armed Forces Institute of Cardiology. It also aims to devise a model for the predictive analysis and classification of this data into classes of importance to cardiologists around the world, including the prediction of patient impressions and other key features.

Key words – Data mining, bioinformatics, classification of cardiovascular patients, predictive analysis of heart disease data

INTRODUCTION
Data mining is the computational process of discovering patterns and hidden information in large data sets. Its main goal is to extract information from a data set and transform it into an understandable structure for further use. In the fields of bioinformatics and biostatistics, researchers apply data mining practices and statistical models to discover hidden trends in data comprising biological experiments or procedures on different entities. Within the biological domain, cardiology is a branch that attracts the attention of many biologists and data scientists around the world.
Although bioinformatics and the biomedical sciences have developed considerably in recent years, cardiovascular disease (CVD) remains the largest contributor to deaths worldwide. In the last decade, the quality of health care organizations has improved and healthcare costs have been reduced. However, a plain database of patients' records is not very informative or supportive in tracking patients' medical histories and diseases [1]. To cope with this issue, the authors of [1] present the REMIND framework, a probabilistic framework for Reliable Extraction and Meaningful Inference from Non-structured Data that creates high-quality structured medical data automatically.
The field of statistics is now very commonly used in analyzing medical data. Different statistical systems and software packages are used in health care research organizations to analyze patients' medical data and deduce important knowledge from it [2].

2013 International Symposium on Computational Models for Life Sciences


AIP Conf. Proc. 1559, 260-269 (2013); doi: 10.1063/1.4825018
© 2013 AIP Publishing LLC 978-0-7354-1187-6/$30.00

The aim of this research is to thoroughly study and analyze past patient reports from the nuclear medicine department of the Armed Forces Institute of Cardiology (AFIC) and to frame a system that can efficiently and accurately store and analyze the data extracted from those reports. In this study, we extracted, transformed and loaded the data from past reports into our designed database, and carried out analysis on that data by building different exploratory, classification and predictive models. This analysis involves classifying patients into different categories depending on the impressions deduced from their past history and current scan findings.
Moreover, we applied different clustering techniques to the data set to categorize the patients into distinct clusters and to extract useful features. These features can then be employed in making decisions over the data set in our analysis; the classification models use them as classifiers. The analysis also involves predicting the values of different intermediate statistics involved in the procedures carried out.

LITERATURE REVIEW
A great deal of work in the literature has been carried out on the analysis, classification and prediction of biomedical data using data mining techniques, and cardiology and cardiac data analysis is a field of utmost importance that cannot be ignored.
Biomedical data can be broken down into different categories. There has been work on the gene expressions and patterns of various diseases, including approaches developed to detect patterns in various images using segmentation, feature extraction and neural networks [4][6][7]. Biomedical data analysis has also been performed on data extracted from medical devices such as electroencephalographic (EEG) and magnetoencephalographic (MEG) recordings [5], and on structured and unstructured data from various sources such as medical journals. Biomedical text mining has been used extensively for disease gene discovery [8]. Extensive work has been done on semantic biomedical data mining using the common practices of the CRISP-DM model, and many different semantic subgroup discovery algorithms have been developed that aid future work on gene pattern recognition [3].
Most of the work in cardiology has been done on image segmentation, pattern recognition, scan correlation and feature extraction. The images commonly used in this analysis are angiography scans, CT-angiography scans, thallium scans, etc. Most of this work deals with the diagnosis of different cardiac problems, including the identification of wall motion abnormalities within developed classification images of the heart [9] and the diagnosis of cardiovascular images using echocardiography for the automatic assessment of a patient's left ventricular ejection fraction (EF) [10].
Data mining techniques have also been applied to ECG-generated data, mostly for the efficient identification of cardiac abnormalities, for instance by feature selection from compressed ECG signals using clustering [11]. ECG signals have also been subjected to adaptive data analysis for the identification of cardiovascular diseases [12] and for their early prediction [13]. Some work has also been done on predicting cardiovascular risk factors using a decision tree algorithm.

Unfortunately, not much work has been done on the analysis of clinical cardiologic data. Some research, however, has been done on improving cardiac care using clinical cardiac data. When dealing with clinical data, the major challenge is to structure all the available information into a single, suitable format. Another challenge is to remove any data redundancies that may lead to an inaccurate or biased analysis. Once the data was structured and inaccuracies and redundancies were removed using probabilistic models, different machine learning algorithms were applied and their accuracy in making inferences on future data was monitored [14].
Different heart disease systems have been developed that greatly help medical practitioners make intelligent guesses. An intelligent heart disease prediction system was developed using three data mining techniques (Naïve Bayes, neural networks and decision trees) [15]. The results of this study show that Naïve Bayes is more effective in identifying patients with heart disease, whereas decision trees give better results in predicting patients with no heart disease.

METHODOLOGY
The well-known CRISP-DM methodology has been used in this research. The CRISP-DM reference model in Figure 1 [18] presents the whole data mining project life cycle in six phases. Every stage takes certain inputs, performs some tasks and produces outputs [16].

FIGURE 1. CRISP-DM Process Model [3]


The research study consists of four main phases, namely:
• data collection and target database design;
• data analysis (data cleansing, pre-processing, analysis, model building and training);
• application design (a platform for efficient storage and automated analysis; exploratory analysis and classification of the intended data);
• reporting the results of the automated analysis.
All four phases of the project were completed in the planned time span. The resources used within the four phases include the following languages and tools.
TABLE 1. Tools and languages used in the research study

Tool / Language                         Usage
IBM SPSS Statistics                     data handling, cleansing and pre-processing
WEKA                                    exploratory data analysis and model building
R language for Statistical Computing    exploratory data analysis, model building and generating graphs
Microsoft SQL Server 2008               data modeling and target database design
MATLAB 2010                             handling data, applying logic, writing scripts and generating distributions
Microsoft Visual Studio 2010; C#.NET    developing the application, building the real-time classification model and applying logic
SAP Crystal Reports 2010                generating reports over the results

Data Collection and Understanding


The data understanding stage of the CRISP-DM methodology involved the collection and understanding of our research domain: patient reports and procedural test reports from the Armed Forces Institute of Cardiology (AFIC), Pakistan. Knowledge acquisition and understanding of the patients' medical data and of the biological terms and procedures were developed in close collaboration with senior cardiologists at AFIC.

Data Extraction, Transformation and Load


In this phase, the collected data was extracted from patients' reports and a database was designed to hold the processed data. The collected data was in unstructured report form that could not be used for machine learning, so it had to be transformed into structured form. For this purpose, mapping tables were used in the database. The IBM SPSS Statistics suite was used to preprocess and transform the different segments of the data to make them suitable for data mining at a later stage. This preprocessing stage was carried out recursively as needed, aiding the exploratory analysis stage and the improvement of the classification/predictive model.
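The mapping-table step can be sketched as follows. This is a hypothetical illustration: the field names, report phrases and impression codes below are invented for the example, not taken from the AFIC data.

```python
# Hypothetical sketch: free-text report phrases are mapped to fixed
# impression codes via a mapping table before loading into the database.
IMPRESSION_MAP = {
    "severe reversible ischaemia": "critical",
    "mild reversible ischaemia": "moderate",
    "normal perfusion": "fair",
}

def transform_record(raw):
    """Map one raw report record to a structured row; unknown phrases are flagged."""
    return {
        "patient_id": raw["patient_id"],
        "impression": IMPRESSION_MAP.get(raw["impression"].strip().lower(), "unmapped"),
    }

row = transform_record({"patient_id": "P001", "impression": " Normal perfusion "})
print(row["impression"])  # fair
```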
The preprocessing stage involved the following steps:
• Data normalization: visualizing the distribution trend in the important attributes and transforming them toward a normal distribution using techniques such as the natural log, power, min-max and Box-Cox transformations, according to the nature of the data.
• Data standardization: transforming the data into a range with fixed upper and lower bounds, normally between 0 and 1.
• Data binning: transforming nominal data into categorical data, or transforming wider categories into a narrower range.
The tool used for data modeling and database design was Microsoft SQL Server. For data preprocessing and normalization we used IBM SPSS Statistics and MATLAB.
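Two of the transforms listed above, min-max standardization and the natural-log transform, can be sketched as follows; the sample values are illustrative, not AFIC data.

```python
import math

def min_max(xs, lo=0.0, hi=1.0):
    """Rescale values into a fixed [lo, hi] range (the standardization step)."""
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

def log_transform(xs):
    """Natural-log transform used to pull a right-skewed attribute
    toward a more normal shape (values must be positive)."""
    return [math.log(x) for x in xs]

vals = [35.0, 55.0, 60.0, 72.0]  # illustrative attribute values
print(min_max(vals))  # first element 0.0, last element 1.0
```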

Exploratory Analyses
Correlation Analysis
Correlation analysis was performed, and a correlation chart was plotted for the various attributes of the data set, to analyze the correlations and dependencies between attributes. Pearson correlation was used: Pearson's correlation coefficient treats the data in a quantitative way, whereas its counterpart, Spearman's rank correlation coefficient, treats the same data in a qualitative way, operating on ranks rather than raw values [17].
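For reference, Pearson's coefficient for two attribute vectors x and y is r = cov(x, y) / (σx σy); a minimal implementation:

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related attributes give r = 1.0
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```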

FIGURE 2. Correlation Chart on Training Data (Generated in R language)

Clustering Analysis
To extract a relevant and dominant feature space, a clustering technique was applied to the data set. The attributes/features that are more dominant in defining the cluster boundaries will also play a dominant role when treated as classifiers in the classification/prediction process.
The clustering method used here is k-means. Deciding the value of k in the k-means algorithm is a recurrent problem in clustering and is distinct from the process of actually solving the clustering problem. The optimal choice of k is often ambiguous: increasing k always reduces the within-cluster error, at greater computational cost. A favorable method for finding k therefore balances maximum compression of the data (a single cluster) against maximum accuracy (assigning each data point to its own cluster). Accordingly, the elbow approach based on one-way analysis of variance was used [19]: for each candidate number of clusters, the percentage of variance explained, i.e. the ratio of the between-group variance to the total variance, is computed, as plotted in Figure 3.

FIGURE 3. Percentage of Variance (ANOVA)


For ease of presentation, reference value one on the x-axis corresponds to 3 clusters. We chose the number of clusters at which the slope of the graph in Figure 3 starts to decline, which in the figure is 6 clusters. The resulting clusters can be visualized in Figure 4.
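The percentage-of-variance quantity behind the elbow plot can be sketched as a one-way-ANOVA ratio. This illustrative 1-D version assumes cluster assignments are already given; the study's actual feature space is multivariate.

```python
def explained_variance(clusters):
    """Between-group SS / total SS for clusters of 1-D points: the
    one-way-ANOVA quantity plotted against k in the elbow method."""
    points = [p for c in clusters for p in c]
    grand = sum(points) / len(points)
    total_ss = sum((p - grand) ** 2 for p in points)
    between_ss = sum(len(c) * ((sum(c) / len(c)) - grand) ** 2 for c in clusters)
    return between_ss / total_ss

# Two well-separated groups: almost all variance lies between clusters,
# so the ratio is close to 1 and adding more clusters gains little.
tight = explained_variance([[1.0, 1.1, 0.9], [10.0, 10.1, 9.9]])
print(round(tight, 3))
```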

FIGURE 4. Cluster Visualization with reference to Protocol applied i.e. Bruce, Adenosine or Thallium
Infusion

Feature Selection
Once the optimal number of clusters was decided, we carried out clustering by separating the data set into the specified number of distinct clusters. We made a comparative study of k-means, k-medoids and fuzzy k-means clustering for this purpose. The attributes selected through this process of feature extraction are given below.

FIGURE 5. Features extracted from Clustering

Pre-processing Technique Based upon Binning and Relative Probabilities of Impressions
We know that the Resting_LVEF value is the attribute carrying the highest information content. The probabilities of the impressions {critical, risk, moderate, fair} within different ranges of Resting_LVEF values remain almost the same. The steps used for this technique are presented in Figure 6.

FIGURE 6. Steps for techniques

This adds a new attribute to the original data set, which acts as a clue for the model.
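The derived-attribute step can be sketched as follows. The LVEF cut-points and bin labels here are invented for illustration; they are not the thresholds used in the study.

```python
# Hypothetical sketch: Resting_LVEF is binned, and each record gains a
# bin label that the classifier can later use as a "clue" attribute.
def lvef_bin(resting_lvef):
    """Assign an illustrative bin label to a Resting_LVEF percentage."""
    if resting_lvef < 35:
        return "low"
    elif resting_lvef < 50:
        return "borderline"
    else:
        return "normal"

records = [{"resting_lvef": 30}, {"resting_lvef": 62}]
for r in records:
    r["lvef_bin"] = lvef_bin(r["resting_lvef"])  # new attribute added to the data set
print([r["lvef_bin"] for r in records])  # ['low', 'normal']
```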

Classification
After this step, classification is carried out using different classification algorithms. A comparative use of the following algorithms was made in this study.

Radial Basis Function Network


Radial basis function network classification is known to work well for multivariate classification [20], especially when the data has multiple classes that show ambiguous attribute trends. Our data has four impression classes, with classifiers coming from different medical procedure results and patient history; many of these classifiers are totally uncorrelated.
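The hidden units of such a network respond through a Gaussian radial basis function; a minimal sketch of a single unit's activation (the center and width would be learned from the training data):

```python
import math

def rbf_activation(x, center, width):
    """Gaussian radial basis unit: response decays with squared distance from the center."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-dist2 / (2 * width ** 2))

# A point at a hidden unit's center activates it fully; distant points barely do.
print(rbf_activation([1.0, 2.0], [1.0, 2.0], 1.0))  # 1.0
```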

Support Vector Machines using Polynomial Kernel


Support vector machines (SVMs) are a well-known classification technique giving good accuracy in predicting unknown classes. Techniques for feature selection and SVM parameter optimization are known to improve classification accuracy. The most common approach in the literature to optimizing the classification performance of SVMs is tuning the misclassification penalty and the kernel parameters, frequently jointly.
Polynomial, linear and RBF kernels are the most common kernels in the literature and have achieved the best results, whereas other well-known kernels have attained poorer results [21].
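The polynomial kernel has the standard form K(x, y) = (x·y + c)^d; a minimal sketch (the per-feature scaling factor gamma that implementations such as LIBSVM also support is omitted here for simplicity):

```python
def poly_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel K(x, y) = (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

# dot([1,2],[3,4]) = 11, plus coef0 = 12, squared = 144
print(poly_kernel([1.0, 2.0], [3.0, 4.0], degree=2))  # 144.0
```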

EXPERIMENTAL RESULTS
After going through all of these steps, the results improved incrementally due to the different factors catered for in the techniques used.

FIGURE 7. Results generated in WEKA on 10-Fold Cross Validation

FIGURE 8. Confusion Matrix after using derived attributes

FIGURE 9. Detailed Classification Statistics on 10 Fold Cross Validation

Platform (Application Development)
After completing the analysis and model building, we come to the stage of building the platform, the complete package for the intended user. This working unit aims to provide the cardiologist with a complete facility to carry out the following tasks:
1. Record and manage incoming patient data (historical, procedural).
2. Import/export patient data for analysis.
3. Perform exploratory analysis on the imported data (generating graphs and visualizations such as histograms, box plots and scatter plots) to aid future studies and research.
4. Perform classification and predictive analysis on any incoming data using the built model.
5. Generate reports from the analyses and classifications carried out in the former step.
Following are screen shots of the application.

CONCLUSION
The main goal of data mining is to extract hidden information from unstructured and seemingly unknown data. Healthcare organizations such as hospitals hold huge amounts of data that need to be filtered for use in the development of better clinical procedures. The objective of this study was to develop a system for the efficient extraction, transformation and loading of such data from cardiac procedure reports provided by the Armed Forces Institute of Cardiology, and to devise a model for the predictive analysis and classification of this data into important classes.
We analyzed the data and built the classification model using kernel support vector machines with the polynomial kernel. This model was recreated in the .NET application using the open-source LIBSVM implementation [12]. The exploratory analysis during the analysis stage was carried out in R and WEKA; later, during the development of the exploratory analysis engine in the .NET application, the RDotNet library was used to establish communication between R packages and commands and the .NET framework. Reporting of the stored patient data and results was then done using SAP Crystal Reports.
This application can prove to be of great help to different cardiologic studies and surveys; moreover, it can be a step towards better and more sophisticated biological data analysis tools. There are off-the-shelf tools on the market for data analysis over financial data, social sciences and images, but biological data analysis, more commonly termed biostatistics, has not evolved as far yet.

FUTURE WORK
The four categories of patient impressions in this domain can be further expanded into many other features, which together add up to form a particular impression. Once the impression category has been rightly predicted, further predictions within that category can be made by employing different models.

ACKNOWLEDGEMENTS
We would like to acknowledge the Armed Forces Institute of Cardiology, Pakistan, for providing the medical datasets used to carry out this research. We are thankful to Major General Shahab Naqvi, Commandant of E&ME College; Colonel Mohsin Sheikh, Head of Department (Nuclear Cardiology), AFIC; and Dr. Shoab Ahmed Khan of E&ME College for their support and guidance throughout the course of this study.

REFERENCES
1. Rao, R. B., Krishnan, S., & Niculescu, R. S. (2006). Data mining for improved cardiac care. ACM SIGKDD Explorations Newsletter, 8(1), 3-10.
2. Kajabadi, A., Saraee, M. H., & Asgari, S. (2009, October). Data mining cardiovascular risk factors. In Application of Information and Communication Technologies, AICT 2009, International Conference on (pp. 1-5). IEEE.
3. Lavrač, N. (2012). Advances in Data Mining for Biomedical Research.
4. Osman, M. K., Ahmad, F., & Saad, Z. (2010). A Genetic Algorithm-Neural Network Approach for Mycobacterium Tuberculosis Detection in Ziehl-Neelsen Stained Tissue Slide Images.
5. Vigário, R., Särelä, J., Valpola, H., & Oja, E. Biomedical Data Analysis.
6. Ahmed, W. M. (2008). Knowledge representation and data mining for biological imaging. Purdue University Cytometry Laboratories, Bindley Bioscience Center, 1203 W. State Street, West Lafayette, IN 47907, USA.
7. Development of Multiscale Biological Image Data Analysis: Review of 2006 International Workshop on Multiscale Biological Imaging, Data Mining and Informatics, Santa Barbara, USA (BII06).
8. ElShal, S., Davis, J., & Moreau, Y. (2013). Biomedical text mining for disease gene discovery. EMBnet.journal, 25 June 2013.
9. Sychra, J. J., Pavel, D. G., & Olea, E. (1988). Classification Images of Cardiac Wall Motion Abnormalities.
10. Rao, R. B., Fung, G., & Krishnapuram, B. (2010). Mining Medical Images.
11. Sufi, F., & Khalil, I. (2010). Diagnosis of Cardiovascular Abnormalities from Compressed ECG: A Data Mining-Based Approach.
12. Islam, M. R., Ahmad, S., Hirose, K., & Molla, M. K. I. (2010). Data Adaptive Analysis of ECG Signals for Cardiovascular Disease Diagnosis.
13. Kamaruddin, N. H., Murugappan, M., & Omar, M. I. (2012). Early Prediction of Cardiovascular Diseases Using ECG Signal: Review.
14. Rao, R. B., Krishnan, S., & Niculescu, R. S. (2006). Data mining for improved cardiac care.
15. Palaniappan, S., & Awang, R. (2008). Intelligent heart disease prediction system using data mining techniques. IJCSNS International Journal of Computer Science and Network Security, 8(8).
16. Wirth, R., & Hipp, J. CRISP-DM: Towards a Standard Process Model for Data Mining.
17. Hauke, J., & Kossowski, T. (2011). Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data.
18. Mirabadi, A., & Sharifian, S. (2010). Application of association rules in Iranian Railways (RAI) accident data analysis. Safety Science, 48(10), 1427-1435. ISSN 0925-7535.
19. Kaur, P., Goyal, M., & Lu, J. (2013). Pricing Analysis in Online Auctions Using Clustering and Regression Tree Approach.
20. Albrecht, S., Busch, J., Kloppenburg, M., Metze, F., & Tavan, P. (2000). Generalized radial basis function networks for classification and novelty detection: self-organization of optimal Bayesian decision.
21. Gaspar, P., Carbonell, J., & Oliveira, J. L. (2012). On the parameter optimization of Support Vector Machines for binary classification. Journal of Integrative Bioinformatics, 9(3), 201.
