
IJRECS @ June-July 2015, V-3, I-1
ISSN-2321-5485 (Online) | ISSN-2321-5784 (Print)
Data Mining and Knowledge Discovery for Big Data in Health Informatics

A. Upender¹, B. Sunitha Devi²

¹M.Tech Student, Dept of CSE, CMR Institute of Technology, Kandlakoya, Medchal, TS, India.
²Associate Professor, Dept of CSE, CMR Institute of Technology, Kandlakoya, Medchal, TS, India.
ABSTRACT: Big Data is a term used to identify datasets that, due to their large size and complexity, cannot be managed with current methodologies or data mining software tools. Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. Big Data mining is the capability of extracting useful information from these large datasets or streams of data; because of their volume, variability, and velocity, this was not possible before. The Big Data challenge is becoming one of the most exciting opportunities for the coming years. This study paper covers what Big Data is, data mining, data mining with Big Data, challenging issues, and related work.
Keywords: Big Data, Data Mining, Challenging Issues, Datasets, Data Mining Algorithms.
1. INTRODUCTION

Recent years have witnessed a dramatic increase in our ability to collect data from various sensors and devices, in different formats, from independent or connected applications. This data flood has outpaced our capability to process, analyze, store, and understand these datasets. Consider Internet data: the web pages indexed by Google numbered around one million in 1998, quickly reached one billion in 2000, and had already exceeded one trillion by 2008. This rapid expansion is accelerated by the dramatic increase in acceptance of social networking applications, such as Facebook, Twitter, and Weibo, which allow users to create content freely and amplify the already huge Web volume. Furthermore, with mobile phones becoming the sensory gateway to real-time data on people from different aspects, the vast amount of data that mobile carriers can potentially process to improve our daily life has significantly outpaced our past CDR (call data record)-based processing, which served billing purposes only.
It can be foreseen that Internet of Things (IoT) applications will raise the scale of data to an unprecedented level. People and devices (from home coffee machines to cars, to buses, railway stations, and airports) are all loosely connected. Trillions of such connected components will

generate a huge data ocean, and valuable information must be discovered from the data to help improve quality of life
and make our world a better place. For example, after we get up every morning, in order to optimize our commute time to work and complete the optimization before we arrive at the office, the system needs to process information ranging from traffic, weather, construction, and police activities to our calendar schedules, and perform deep optimization under tight time constraints. In all these applications, we face significant challenges in leveraging the vast amount of data, including challenges in (1) system capabilities, (2) algorithmic design, and (3) business models.
As an example of the interest that Big Data is attracting in the data mining community, the grand theme of this year's KDD conference was "Mining the Big Data". There was also a specific workshop, BigMine [12], on that topic: the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. Both events successfully brought together people from academia and industry to present their most recent work related to these Big Data issues and to exchange ideas and thoughts. These events are important in order to advance this Big

224422
442244

www.ijrecs.com

ISSN-2321-5784 (Print)
Data challenge, which is considered one of the most exciting opportunities in the years to come.
Today is the era of Google. Whatever is unknown to us, we Google it, and in fractions of a second we get a number of links as a result. This is a good example of the processing of Big Data. Big Data is not anything different from our regular notion of data; "big" is just a keyword used with the data to identify collected datasets that, due to their large size and complexity, cannot be managed with our current methodologies or data mining software tools. As another example, the first strike of Anna Hazare triggered a large number of tweets within 2 hours. Among all these tweets, the special comments that generated the most discussions actually revealed the public interests. Such online discussions provide a new means to sense public interests and generate feedback in real time, and are more appealing than generic media such as radio or TV broadcasting. This example demonstrates the rise of Big Data applications: data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable time.
CONTRIBUTED ARTICLES
We selected four contributions that together show very significant state-of-the-art research in Big Data mining and provide a broad overview of the field and its forecast to the future. Other significant work in Big Data mining can be found in the main conferences, such as KDD, ICDM, or ECML-PKDD, or in journals such as Data Mining and Knowledge Discovery or Machine Learning.
- Scaling Big Data Mining Infrastructure: The Twitter Experience, by Jimmy Lin and Dmitriy Ryaboy (Twitter, Inc.). This paper presents insights about Big Data mining infrastructures and the experience of doing analytics at Twitter. It shows that, given the current state of data mining tools, it is not straightforward to perform analytics: most of the time is consumed in preparatory work for the application of data mining methods and in turning preliminary models into robust solutions.
- Mining Heterogeneous Information Networks: A Structural Analysis Approach, by Yizhou Sun (Northeastern University) and Jiawei Han (University of Illinois at Urbana-Champaign). This paper shows that mining heterogeneous information networks is a new and promising research frontier in Big Data mining. It considers interconnected, multi-typed data, including typical relational database data, as heterogeneous information networks. These semi-structured heterogeneous information network models leverage the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data.
- Big Graph Mining: Algorithms and Discoveries, by U Kang and Christos Faloutsos (Carnegie Mellon University). This paper presents an overview of mining big graphs, focusing on the use of the Pegasus tool and showing some findings on the Web graph and the Twitter social network. The paper gives inspirational future research directions for big graph mining.
- Mining Large Streams of User Data for Personalized Recommendations, by Xavier Amatriain (Netflix). This paper presents some lessons learned from the Netflix Prize and discusses the recommender and personalization techniques used at Netflix. It discusses recent important problems and future research directions. Section 4 contains an interesting discussion about whether we need more data or better models to improve our learning methodology.

TOOLS: OPEN SOURCE REVOLUTION
The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter, and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure deals with Hadoop and other related software, such as:
Apache Hadoop [3]: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks, which use the output of the maps to obtain the final result of the job.
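As a rough illustration of the MapReduce flow just described (not Hadoop's actual Java API), the following Python sketch simulates the map, shuffle, and reduce phases of a word-count job in memory; all function names are illustrative.

```python
from collections import defaultdict

# Minimal in-memory simulation of a MapReduce word-count job.
# Hadoop runs map and reduce tasks in parallel across a cluster;
# here the phases run sequentially for clarity.

def map_phase(documents):
    """Map task: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce task: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    docs = ["big data needs big tools", "data mining finds knowledge in data"]
    print(reduce_phase(shuffle_phase(map_phase(docs))))
    # {'big': 2, 'data': 3, 'needs': 1, 'tools': 1, ...}
```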
Apache Hadoop related projects [10]: Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper, Apache Cassandra, Cascading, Scribe, and many others.
Apache S4: a platform for processing continuous data streams. S4 is designed specifically for managing data streams; S4 applications are built by combining streams and processing elements in real time.
Storm [9]: software for streaming data-intensive
distributed applications, similar to S4, and developed by
Nathan Marz at Twitter.
In Big Data Mining, there are many open source initiatives.
The most popular are the following:
Apache Mahout [4]: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering, and frequent pattern mining.
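To make the algorithm families concrete, here is a minimal k-means clustering loop in plain Python, the kind of algorithm Mahout scales out over Hadoop; this is a local sketch, not Mahout's API.

```python
import random

def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points: a toy stand-in for clustering at scale."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k), key=lambda i: (x - centroids[i][0]) ** 2
                                            + (y - centroids[i][1]) ** 2)
            clusters[j].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids

points = [(1, 1), (1.2, 0.8), (8, 8), (8.2, 7.9), (0.9, 1.1)]
print(kmeans(points, k=2))
```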
R [7]: an open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large datasets.
MOA [5]: stream data mining open source software for performing data mining in real time. It has implementations of classification, regression, clustering, frequent itemset mining, and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework [6, 8] provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android, and Storm. SAMOA [1] is a new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
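The defining constraint in stream mining tools like MOA is that each example is inspected once, with bounded memory. A common evaluation scheme is prequential ("test-then-train") evaluation; the sketch below applies it to a simple online perceptron. The names are illustrative, not MOA's API.

```python
def prequential(stream, lr=0.1):
    """Test-then-train over a stream of (features, label), label in {0, 1}.
    Each example is seen once: first for testing, then for updating the model."""
    weights, bias, correct, seen = None, 0.0, 0, 0
    for features, label in stream:
        if weights is None:
            weights = [0.0] * len(features)
        # Test first: predict with the current model.
        score = sum(w * x for w, x in zip(weights, features)) + bias
        prediction = 1 if score > 0 else 0
        correct += (prediction == label)
        seen += 1
        # Then train: perceptron-style update on the same example.
        error = label - prediction
        if error:
            weights = [w + lr * error * x for w, x in zip(weights, features)]
            bias += lr * error
    return correct / seen  # online (prequential) accuracy

# Toy stream: label is 1 when the first feature dominates.
stream = [([1.0, 0.1], 1), ([0.2, 0.9], 0), ([0.8, 0.3], 1), ([0.1, 1.0], 0)] * 50
print(prequential(iter(stream)))
```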
Vowpal Wabbit: an open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. Via parallel learning, it can exceed the throughput of any single machine's network interface when doing linear learning.
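Two ideas behind Vowpal Wabbit's speed are the hashing trick (features are hashed directly into a fixed-size weight vector, so no vocabulary dictionary is kept) and online stochastic gradient descent. The following is a toy Python rendition of those two ideas, not VW's actual implementation.

```python
import math

NUM_WEIGHTS = 2 ** 20          # fixed-size weight vector, as in the hashing trick
weights = [0.0] * NUM_WEIGHTS

def hash_features(tokens):
    """Map raw tokens to weight indices without storing a vocabulary."""
    return [hash(tok) % NUM_WEIGHTS for tok in tokens]

def train_example(tokens, label, lr=0.5):
    """One SGD step of logistic regression on a single example (label in {0, 1})."""
    idx = hash_features(tokens)
    margin = sum(weights[i] for i in idx)
    p = 1.0 / (1.0 + math.exp(-max(min(margin, 35), -35)))  # clipped sigmoid
    g = p - label                 # gradient of the log loss w.r.t. the margin
    for i in idx:
        weights[i] -= lr * g
    return p

# Toy usage: spam-like vs. normal token sets.
for _ in range(20):
    train_example(["buy", "cheap", "pills"], 1)
    train_example(["meeting", "at", "noon"], 0)
print(train_example(["cheap", "pills"], 1))  # predicted probability is now high
```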

More specific to Big Graph mining, we found the following open source tools:
Pegasus: a big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs. See the paper by U Kang and Christos Faloutsos in this issue.
GraphLab: a high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records, which are stored as vertices in a large distributed data-graph. Algorithms in GraphLab are expressed as vertex-programs, which are executed in parallel on each vertex.
2. BIG DATA AND DATA MINING

Big Data is nothing but data, available from heterogeneous, autonomous sources, in extremely large amounts, which gets updated in fractions of a second. For example, consider the data stored on the servers of Facebook: most of us use Facebook daily, uploading various types of information and photos. All of this data gets stored in the data warehouses on Facebook's servers. This data is nothing but Big Data, so called due to its complexity. Another example is the storage of photos at Flickr. These are good real-time examples of Big Data. Yet another example of Big Data would be the readings taken from an electronic microscope of the universe. Now the term data mining: finding the exact useful information or knowledge in the collected data, for future actions, is nothing but data mining.
So, collectively, the term Big Data mining denotes a close-up view, with lots of detailed information, of a Big Data set, as shown in Fig. 1 below.

3. KEY FEATURES OF BIG DATA
The features of Big Data are: it is huge in size; the data keeps changing from time to time; its data sources are in different phases; it is free from the influence, guidance, or control of any single party; and it is very complex in nature, and thus hard to handle.
It is huge in nature because data from various sources is collected together. If we consider the example of Facebook, large numbers of people upload their data in various types, such as text, images, or videos. People also keep changing their data continuously. This tremendous, continuously changing stock of data is stored in a warehouse, and such large storage of data requires a large area for actual implementation. As the size is too large, no one is capable of controlling it alone; Big Data needs to be controlled by dividing it into groups.
Due to its large size, decentralized control, and different data sources with different types, Big Data becomes very complex and harder to handle. We cannot manage it with the local tools that we use for managing regular data in real time. For major Big Data-related applications, such as Google, Flickr, and Facebook, a large number of server farms are deployed all over the world to ensure nonstop services and quick responses for local markets.
4. CHALLENGING ISSUES IN DATA MINING WITH BIG DATA

There are three sectors in which the challenges for Big Data arise: the mining platform, privacy, and the design of mining algorithms.

Basically, Big Data is stored at different places, and the data volumes keep increasing as data is generated continuously, so collecting all the data stored at different places is expensive. Moreover, if we use typical data mining methods (those used for mining small-scale data on our personal computer systems) for mining Big Data, this becomes an obstacle, because typical methods require the data to be loaded into main memory, and even a very large main memory cannot hold all of it.
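One standard way around this main-memory bottleneck is to process the data out of core, one bounded chunk at a time, keeping only small running aggregates in memory. A minimal sketch under that assumption (the file name and the statistics computed are illustrative):

```python
def running_stats(path, chunk_size=100_000):
    """Compute count and mean over a huge file of numbers, one chunk at a time,
    so memory use stays bounded no matter how large the file is."""
    count, total = 0, 0.0
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(float(line))
            if len(chunk) >= chunk_size:
                count += len(chunk)
                total += sum(chunk)
                chunk.clear()          # only aggregates survive, not the raw data
        count += len(chunk)            # flush the final partial chunk
        total += sum(chunk)
    return count, (total / count if count else 0.0)

# Usage (hypothetical file of one number per line):
# print(running_stats("measurements.txt"))
```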
Maintaining privacy is one of the main aims of data mining algorithms. Presently, to mine information from Big Data, parallel-computing-based models such as MapReduce are used. In such approaches, large datasets are divided into a number of subsets, and mining algorithms are applied to those subsets; finally, summation algorithms are applied to the results of the mining algorithms to meet the goal of Big Data mining. In this whole procedure, privacy statements are easily broken, because we divide the single Big Data set into a number of smaller datasets.
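A minimal sketch of this divide-mine-aggregate pattern: each partition is mined for frequent items independently, and the partial counts are then summed. The data and threshold are illustrative; it also hints at why purely local results can mislead, which the elephant analogy below makes vivid.

```python
from collections import Counter

def mine_partition(transactions):
    """'Mining algorithm' applied to one subset: count item occurrences."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts

def aggregate(partial_counts, min_support):
    """'Summation algorithm': merge partial results into global frequent items."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return {item: n for item, n in total.items() if n >= min_support}

partitions = [
    [{"milk", "bread"}, {"milk"}],          # subset mined on node 1
    [{"bread", "eggs"}, {"milk", "eggs"}],  # subset mined on node 2
]
partials = [mine_partition(p) for p in partitions]
# 'eggs' is frequent globally even though no single partition shows it twice.
print(aggregate(partials, min_support=2))   # {'milk': 3, 'bread': 2, 'eggs': 2}
```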

While designing such algorithms, we face various challenges. As shown in Figure 2, a group of blind men is observing a giant elephant, and each one tries to draw his own conclusion about what the thing actually is. One says the thing is a hose; another says it is a tree or a pipe. In fact, each of them is observing only some part of the giant elephant, not the whole, so the result of each blind person's prediction is something different from what it actually is.
Similarly, when we divide Big Data into a number of subsets and apply mining algorithms to those subsets, the results of those mining algorithms, when collected together, will not always point us to the actual result we want.
5. RELATED WORK

At the level of the mining platform sector, parallel programming models like MapReduce are at present being used for the analysis and mining of data. MapReduce is a batch-oriented parallel computing model, and there is still a certain gap in performance compared with relational databases. Improving the performance of MapReduce and enhancing the real-time nature of large-scale data processing have received a significant amount of attention, with MapReduce parallel programming being applied to many machine learning and data mining algorithms. Data mining algorithms usually need to scan through the training data to obtain the statistics needed to solve or optimize a model.
For those who intend to hire a third party, such as auditors, to process their data, it is very important to have efficient and effective access to the data. In such cases, the user's privacy restrictions may include conditions like no local copies and no downloading allowed. For this, a privacy-preserving public auditing mechanism has been proposed for large-scale data storage [1]. This public-key-based mechanism is used to enable third-party auditing, so users can safely allow a third party to analyze their data without breaching the security settings or compromising data privacy.
In the case of the design of data mining algorithms, knowledge evolution is a common phenomenon in real-world systems, and as the problem statement differs, the knowledge will differ accordingly. For example, when we go to the doctor for treatment, the doctor's treatment program continuously adjusts to the condition of the patient. Similarly, the knowledge extracted from data must evolve. For this, Wu [2], [15], [14] proposed and established the theory of local pattern analysis, which has laid a foundation for global knowledge discovery in multi-source data mining. This theory provides a solution not only for the problem of full search, but also for finding global models that traditional mining methods cannot find.
6. FORECAST TO THE FUTURE

There are many important future challenges in Big Data management and analytics that arise from the nature of the data: large, diverse, and evolving [16]. These are some of the challenges that researchers and practitioners will have to deal with during the next years:

Analytics architecture. It is not yet clear what an optimal architecture for an analytics system that deals with historic data and with real-time data at the same time should look like. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general, extensible, allowing ad hoc queries, requiring minimal maintenance, and debuggable.
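A toy sketch of the Lambda architecture's query path: the batch layer periodically recomputes views from the master dataset (the Hadoop role), the speed layer keeps an incremental view of recent events (the Storm role), and a query merges both. All class and method names are illustrative.

```python
from collections import Counter

class LambdaCounter:
    """Toy event counter split into a batch view and a real-time (speed) view."""

    def __init__(self):
        self.master = []              # immutable master dataset (batch layer input)
        self.batch_view = Counter()   # recomputed periodically, e.g. by Hadoop
        self.speed_view = Counter()   # incremental view of recent events, e.g. Storm

    def ingest(self, event):
        self.master.append(event)     # every event lands in the master dataset
        self.speed_view[event] += 1   # and is reflected immediately in the speed layer

    def run_batch(self):
        """Slow, complete recomputation; afterwards the speed view is reset."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, event):
        # Serving layer: merge precomputed batch results with real-time deltas.
        return self.batch_view[event] + self.speed_view[event]

lc = LambdaCounter()
for e in ["click", "click", "view"]:
    lc.ingest(e)
lc.run_batch()
lc.ingest("click")                    # arrives after the last batch run
print(lc.query("click"))              # 3 = batch (2) + speed (1)
```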
Statistical significance. It is important to achieve significant statistical results and not be fooled by randomness. As Efron explains in his book on large-scale inference [10], it is easy to go wrong with huge datasets and thousands of questions to answer at once.
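As a concrete instance of the multiple-testing problem Efron warns about: with many simultaneous hypotheses, many p-values fall below 0.05 by chance alone. A Bonferroni correction is the simplest guard; the sketch below demonstrates it on simulated null data (the counts in the comments are what this simulation typically produces).

```python
import random

random.seed(0)
NUM_TESTS, ALPHA = 10_000, 0.05

# Simulate p-values for tests where the null hypothesis is actually true:
# under the null, p-values are uniform on [0, 1].
p_values = [random.random() for _ in range(NUM_TESTS)]

naive = sum(p < ALPHA for p in p_values)
bonferroni = sum(p < ALPHA / NUM_TESTS for p in p_values)

print(f"naive 'discoveries' among pure noise: {naive}")   # roughly 500
print(f"after Bonferroni correction: {bonferroni}")       # almost always 0
```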
Distributed mining. Many data mining techniques are not trivial to parallelize. To obtain distributed versions of some methods, a lot of research is needed, with practical and theoretical analysis, to provide new methods.
Time evolving data. Data may be evolving over time, so it is important that Big Data mining techniques be able to adapt and, in some cases, to detect change first. For example, the data stream mining field has very powerful techniques for this task [13].
Compression. When dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression, where we do not lose anything, and sampling, where we choose the data that is most representative. Using compression, we may spend more time to use less space, so we can consider it a transformation from time to space.
Using sampling, we lose information, but the gains in space may be orders of magnitude. For example, Feldman et al. [12] use coresets to reduce the complexity of Big Data problems. Coresets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.
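On the sampling side, reservoir sampling is a classic single-pass way to keep a uniform random sample of fixed size k from a stream of unknown length; a minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length,
    using O(k) memory and a single pass."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Item i is kept with probability k / (i + 1), preserving uniformity.
            j = random.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=5))
```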
Visualization. A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, such as the photographs, infographics, and essays in the beautiful book "The Human Face of Big Data".

Hidden Big Data. Large quantities of useful data are getting lost, since much new data is untagged, file-based, and unstructured. The 2012 IDC study on Big Data [11] explains that in 2012, 23% (643 exabytes) of the digital universe would have been useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.
CONCLUSION
Big Data is going to continue growing during the next years, and each data scientist will have to manage a much larger amount of data every year. This data is going to be more diverse, larger, and faster. We discussed some insights into the topic, and what we consider the main concerns and the main challenges for the future. Big Data is becoming the new final frontier for scientific data research and for business applications. We are at the beginning of a new era in which Big Data mining will help us to discover knowledge that no one has discovered before. Everybody is warmly invited to participate in this intrepid journey.
REFERENCES
[1] C. Wang, S.S.M. Chow, Q. Wang, K. Ren, and W. Lou, "Privacy-Preserving Public Auditing for Secure Cloud Storage," IEEE Trans. Computers, vol. 62, no. 2, pp. 362-375, Feb. 2013.
[2] X. Wu and S. Zhang, "Synthesizing High-Frequency Rules from Different Data Sources," IEEE Trans. Knowledge and Data Eng., vol. 15, no. 2, pp. 353-367, Mar./Apr. 2003.
[3] X. Wu, C. Zhang, and S. Zhang, "Database Classification for Multi-Database Mining," Information Systems, vol. 30, no. 1, pp. 71-88, 2005.
[4] K. Su, H. Huang, X. Wu, and S. Zhang, "A Logical Framework for Identifying Quality Knowledge from Different Data Sources," Decision Support Systems, vol. 42, no. 3, pp. 1673-1683, 2006.
[5] E.Y. Chang, H. Bai, and K. Zhu, "Parallel Algorithms for Mining Large-Scale Rich-Media Data," Proc. 17th ACM Int'l Conf. Multimedia (MM '09), pp. 917-918, 2009.
[6] D. Howe et al., "Big Data: The Future of Biocuration," Nature, vol. 455, pp. 47-50, Sept. 2008.
[7] A. Labrinidis and H. Jagadish, "Challenges and Opportunities with Big Data," Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
[8] Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining," J. Cryptology, vol. 15, no. 3, pp. 177-206, 2002.
[9] Storm, http://storm-project.net.
[10] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill, 2011.
[11] J. Gantz and D. Reinsel, IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. December 2012.
[12] D. Feldman, M. Schmidt, and C. Sohler, "Turning Big Data into Tiny Data: Constant-Size Coresets for k-Means, PCA and Projective Clustering," in SODA, 2013.
[13] Apache Hadoop, http://hadoop.apache.org.
[14] J. Gantz and D. Reinsel, IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. December 2012.
[15] Gartner, http://www.gartner.com/it-glossary/bigdata.
[16] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big Data, Big Business: Bridging the Gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7-11, New York, NY, USA, 2012. ACM.
