
Spring 2015

CMPE 226 Database Systems


Big Data in Health Care
Project Report Team 3
Anushree Sonni
009400534
anushree.narasimhamurthysonni@sjsu.edu

Yeshwanth Ravindra
009318400
yeshwanth.ravindrababu@sjsu.edu

Karthik Kolichelimi Venkatrao
009299693
Karthik.kolichelimivenkatrao@sjsu.edu

Department of Computer Engineering, San José State University

1 Washington Sq, San Jose, CA 95192


Abstract: In a medical emergency, people are often unsure about which medical center to choose, do not know the nearest center offering the required service, and are unaware of which centers in their vicinity provide the best diagnostic service at a reasonable fee. These questions and more are answered by our application, which proves to be a handy solution for people in need. In this project we focus mainly on how one large piece of medical data can be analyzed into various statistics by combining many different frameworks and tools. This medical big data conceals a wealth of information and, when harnessed appropriately, can provide meaningful, valuable and helpful statistics.

I. INTRODUCTION
In today's ever-expanding world, every individual needs to know about the various emerging medical advancements and technologies and about the offerings of the medical centers around them. Although a couple of websites are available for referencing medical data, there is a strong need for a dedicated application that mines the large data sets behind these sites. One could of course keep up with the latest news regularly, but it is better to have a tool that provides every possible analysis of, and all information about, the medical centers a button-click away.
The data used as input here includes hospital-specific charges for more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG).
The project focuses on various factors and mainly on three different types of users: the Government, the Public and the Hospital. The Government can see statistics on the charges and the tax reimbursement to be issued. The Public can choose a DRG based on cost, location and the service provided. The Hospital administration can check the financial statements of the DRGs to monitor its income and finances; this also provides valuable information on its competitors.
The intent of the project is to analyze this unstructured medical data, which can be analyzed based on the Provider's Name, Address, City and State, and can be referenced using the Zip code of a particular area.
This analysis will be helpful in finding out:
1. The average costs incurred per hospital within a particular state or region, so that patients can look for an affordable hospital when needed.
2. The number of patients within a hospital being treated for a similar disease.
3. How widespread a disease is, and in which area or region.
The outcome of the analysis is represented visually via a user-friendly, interactive dashboard, which also explains the analysis via drill-down and drill-through.
Through this analysis a user can easily find the most affordable hospital, categorized by region or state. It is very simple to find out about a widespread disease the user should be alert to in their area, and the user can check the popularity of a hospital based on its number of discharges.
II. TOOLS/SOFTWARE USED IN THE PROJECT
For the successful implementation of the project, the following tools and approaches are needed.
The project implementation phase involves:
1. Project Initialization:
This involves obtaining the raw file in CSV form from the Data.CMS.gov site. Once the data is received, a cleaning action is performed on the sheet, as sketched below.
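For instance, a first look at the raw sheet in R might resemble the following; the CSV is assumed to have been downloaded from Data.CMS.gov, and the file name is hypothetical.

# Load the raw IPPS extract obtained from Data.CMS.gov (file name is hypothetical)
raw <- read.csv("Medicare_IPPS_raw.csv", stringsAsFactors = FALSE)
str(raw)    # inspect the column names and types before cleaning
head(raw)   # preview a few rows to spot commas and dollar signs in the charge columns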

Figure 1: End to End Flow

1.1 System Resources:
1.1.1 Minimum resources:
While there are no guarantees on the minimum resources required by Hadoop daemons, the community attempts not to increase requirements within a minor release. We can use GNU/Linux, Microsoft Windows, Apple Mac OS X, Solaris and other operating systems on which Apache Hadoop is known to work reasonably well.
1.1.2 Framework: Apache Hadoop
Apache Hadoop is an open source software framework for the distributed storage and processing of big data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks amongst the nodes in the cluster. For processing the data, Hadoop Map/Reduce ships the computation to the nodes that hold the required data, and the nodes then process the data in parallel. This approach leverages data locality. The cleansed DGV file is transferred to the Hadoop file system.
1.1.3 Database: HIVE
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
1.1.4 Technologies Used:
Languages: R
R is a free software programming language that is widely used among statisticians and data miners. R is an implementation of the S programming language combined with lexical scoping semantics; S was created by John Chambers while at Bell Labs.
In this project, R is used to categorize the DRGs into three levels, HIGH, MEDIUM and LOW, based on expense. Some aggregation features of R, such as MIN, MAX, MEAN and summary, are also exploited, as sketched below.

Figure 2: R function
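A minimal sketch of this categorization follows; the ipps data frame and its Average.Covered.Charges column are hypothetical names for the cleansed data set, and using terciles of the charge distribution as the HIGH/MEDIUM/LOW cut-offs is an assumption rather than the project's exact rule.

# ipps is assumed to be a data frame holding the cleansed data set,
# with a numeric Average.Covered.Charges column (names are assumptions)
charges <- ipps$Average.Covered.Charges

# Aggregation features used in the project
min(charges); max(charges); mean(charges)
summary(charges)

# Categorize each record as LOW / MEDIUM / HIGH by expense,
# here using the terciles of the charge distribution as cut points
cuts <- quantile(charges, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
ipps$expense_level <- cut(charges, breaks = cuts,
                          labels = c("LOW", "MEDIUM", "HIGH"),
                          include.lowest = TRUE)
table(ipps$expense_level)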

Tableau is business intelligence software that allows anyone to easily connect to data and then visualize and create interactive, sharable dashboards. It is easy enough that any Excel user can learn it, yet powerful enough to handle even the most complex analytical problems, and securely sharing your findings with others takes only seconds. The result is BI software that you can trust to actually deliver answers to the people who need them.
Today's organizations need efficient, scalable and easily deployable business intelligence tools in order to accomplish their goals. All too often, onboarding a new BI tool is an effort of weeks, months or even years, and maintenance is punctuated by a never-ending stream of user requests and expensive consulting bills.
Tableau takes a different approach. Installing Tableau Desktop takes minutes. Once it is installed, anyone can connect to data with a click and create interactive, analytical dashboards. Sharing dashboards is just as easy: simply publish them to Tableau Server (on-premise) or Tableau Online (Tableau Server in the cloud). Even large enterprise deployments can be achieved with ease using Tableau's Drive methodology.

The dashboards are published to the Tableau Public server, so that anyone in need of the information can easily access it.

1.2 Hardware
1.2.1 Architecture:

Figure 3: Hadoop Master Slave Architecture

In Hadoop, every machine is categorized into one of three main roles: Slave node, Master node or Client machine.
Slave Node:
Every slave machine in Hadoop consists of a Task tracker and a Data node. The job of the Task tracker is to process the small pieces of work given to the node, while the Data node manages the data stored on the node. As the data requirement increases, more machines of this pattern are added, thus forming a cluster.
Master Node:
The master node contains the Job tracker and the Name node. The task of the Job tracker is to accept a job from the client, break it into smaller pieces and assign those pieces to the Slave nodes. The Name node keeps track of which Data node each piece of data is located on; whenever a client wants to write or read a file, it talks to the Name node to find out the location. The Job tracker and the Name node are responsible for detecting failures in the Task trackers and Data nodes respectively.
The Task tracker and Job tracker form the MapReduce component, while the Name node and Data nodes form the Hadoop Distributed File System (HDFS) component of Hadoop.
Client Machine:
The task of the Client machine is to describe how the data has to be processed (the MapReduce job), to load the data into HDFS, and then to fetch the results once the job is done.

Figure 4: HDFS Architecture

HDFS is designed to work on commodity hardware such as personal computers; however, Hadoop is mainly run on servers. HDFS works well with large data by offering fast access to application data, and it is also very reliable, as it is highly fault tolerant.
There is only one Namenode in an HDFS cluster. The Namenode is the master server in the cluster; it is in charge of the file system namespace and controls file access by clients. Closing, opening or renaming a file is a duty executed by the Namenode, as is assigning blocks to Datanodes.
The number of Datanodes is defined by the number of nodes in the cluster. Datanodes manage the data of the node to which they are attached. User data is stored in files which, as organized by HDFS, are fragmented into one or more blocks that are then stored on Datanodes. When a read or write request is received from a client, the Datanode takes charge of the operation. A Datanode can also create, delete and replicate blocks when instructed by the Namenode. The Namenode and Datanode are software components designed to work on commodity machines. HDFS is designed and written in Java, so any machine that can run Java can run the Namenode and Datanode software.
The file system is designed in a similar fashion to other existing file systems. The Namenode records any changes to the file system and keeps track of how many times a file has been replicated. HDFS maintains the number of times a file is replicated according to a number specified by the application. HDFS replicates files in order to maintain high fault tolerance. Files are stored as a sequence of blocks; the block size and the number of replicas are specified by the application, and replication of blocks is handled by the Namenode.
MapReduce is a software framework that assists in writing programs that handle large amounts of data across thousands of nodes. It is divided into two parts: map and reduce. The Map part distributes the work that needs to be processed onto separate nodes, and the Reduce part takes the output of the Map phase and produces a single output. Pairing MapReduce with HDFS works well because HDFS provides high bandwidth across a large cluster.
SQL Server Reporting Services is a server-based reporting
platform that provides comprehensive reporting functionality
for a variety of data sources. Reporting Services includes a
complete set of tools to create, manage, and deliver reports,
and APIs that enable developers to integrate or extend data
and report processing in custom applications. Reporting
Services tools work within the Microsoft Visual Studio
environment and are fully integrated with SQL Server tools
and components.

HIVE:
The main components of Hive are:
a) External Interfaces - Hive provides both user interfaces, such as a command line interface (CLI) and a web UI, and application programming interfaces (APIs) such as JDBC and ODBC.
b) The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (such as Java) can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers such as JDBC (Java) and ODBC (C++), and scripting drivers written in PHP, Perl, Python, etc.
c) The Metastore is the system catalog. All other components of Hive interact with the metastore.
d) The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution. On receiving the HiveQL statement from the Thrift server or other interfaces, it creates a session handle which is later used to keep track of statistics such as execution time, number of output rows, etc.
e) The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates the statement into a plan consisting of a DAG of map-reduce jobs.
f) The Driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order. Hive currently uses Hadoop as its execution engine.
g) Database - a namespace for tables. The database "default" is used for tables with no user-supplied database name.
h) Table - The metadata for a table contains the list of columns and their types, the owner, and storage and SerDe information. It can also contain any user-supplied key/value data; this facility can be used to store table statistics in the future. Storage information includes the location of the table's data in the underlying file system, data formats and bucketing information. SerDe metadata includes the implementation class of the serializer and deserializer methods and any supporting information required by that implementation. All of this information can be provided when the table is created.
i) Partition - Each partition can have its own columns and SerDe and storage information. This can be used in the future to support schema evolution in a Hive warehouse.

Figure 5: HIVE Architecture Diagram

1.3 System Configuration:
We need to configure the system properly to get the most optimal performance from our installation. The different system configurations are explained below.
1.3.1 Apache Hadoop
For the Apache Hadoop installation we need a stable Ubuntu Linux release (14.04.1 LTS, the latest release as of 4 November 2014), the Hadoop tar file (version 1.2.1) to install Hadoop, the latest Java version, and SSH.
1.3.2 Tableau
Microsoft Windows Server 2012, 2012 R2, 2008, 2008 R2, or 2003 R2 SP2 or higher; Windows 8 or 7
32-bit or 64-bit versions of Windows
Minimum of a Pentium 4 or AMD Opteron processor
32-bit color depth recommended

2. Project Operation:
This is the practical management of the project. Here, project inputs are transformed into outputs to achieve the immediate objectives.

Figure 6 below shows how the data set is input and analyzed, and how the output is obtained, categorized and refined on user demand, for instance for a particular region or state. The user can further search for average Medicare payments as well as total payments.

Figure 6: Data Flow

III. PROJECT DATA USED
The data input for our project is taken from Data.CMS.gov (https://data.cms.gov/Medicare/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3) and consists of unstructured data (medical records) and structured data (location-specific data). The data set is provided by CMS and was updated on 06/02/14. The original FY2011 data file has been updated to include a new column, "Average Medicare Payment." The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent more than 7 million discharges, or 60 percent of total Medicare IPPS discharges. Hospitals determine what they will charge for items and services provided to patients, and these charges are the amounts the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deductible amount and DRG outlier amount.
For these DRGs, average charges, average total payments, and average Medicare payments are calculated at the individual hospital level. Users will be able to make comparisons between the amounts charged by individual hospitals within local markets, and nationwide, for services that might be furnished in connection with a particular inpatient stay.
The definitions of the terms used in the data set are as follows:
DRG: Code and description identifying the DRG. DRGs are a classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.
Provider ID: Provider identifier billing for inpatient hospital services.
Provider Name: Name of the provider.
Provider Street Address: Street address at which the provider is physically located.
Provider City: City in which the provider is physically located.
Provider State: State in which the provider is physically located.
Provider Zip Code: Zip code in which the provider is physically located.
Hospital Referral Region Description: HRR in which the provider is physically located.
Total Discharges: The number of discharges billed by the provider for inpatient hospital services.
Average Covered Charges: The provider's average charge for services covered by Medicare for all discharges in the DRG. These will vary from hospital to hospital because of differences in hospital charge structures.
Average Total Payments: The average of Medicare payments to the provider for the DRG, including the DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Also included are the co-payment and deductible amounts that the patient is responsible for.
Average Medicare Payments: The average amount Medicare pays to the provider for the DRG; unlike Average Total Payments, it excludes the co-payment and deductible amounts that the patient is responsible for.
Data-Set Snapshots:

Figure 7: Inpatient Prospective Payment - Part 1

Figure 8: Inpatient Prospective Payment - Part 2

IV. BIG DATA RELATED IMPLEMENTATION PROCESS
The raw file had a few hiccups, such as commas and dollar signs in some of the columns, which had to be removed in order to load the data into the intended columns in HIVE. We therefore had to cleanse the data before importing it into the Hadoop file structure, as sketched below.

Figure 9: Cleaned Data

Hadoop Map/Reduce ships the computation to the nodes that hold the required data, and the nodes then process the data in parallel. This approach leverages data locality. The cleansed DGV file is transferred to the Hadoop file system.

Figure 10: DGV Data on Hadoop
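The cleansing and transfer steps just described might look like the following in R; the file names, the affected column names and the HDFS path are assumptions, not the exact ones used in the project.

# Read the raw IPPS extract (file name is hypothetical)
raw <- read.csv("Medicare_IPPS_raw.csv", stringsAsFactors = FALSE)

# Strip dollar signs and thousands separators from the charge columns
# (column names are assumptions about the raw layout)
money_cols <- c("Average.Covered.Charges", "Average.Total.Payments", "Average.Medicare.Payments")
for (col in money_cols) {
  raw[[col]] <- as.numeric(gsub("[$,]", "", raw[[col]]))
}

# Write the cleansed file and copy it into HDFS (local and HDFS paths are hypothetical)
write.csv(raw, "Medicare_IPPS_clean.csv", row.names = FALSE, quote = FALSE)
system("hadoop fs -mkdir /user/team3/ipps")
system("hadoop fs -put Medicare_IPPS_clean.csv /user/team3/ipps/")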

The Reduce part takes the output of the Map phase and produces a single output. Pairing MapReduce with HDFS works well because HDFS provides high bandwidth across a large cluster.

Figure 11: Map Reduce on Hadoop

A table is created in Hive with a structure matching the columns in the raw file. The data is then loaded from the Hadoop file system into the table, as sketched below.

Figure 12: DGV Data on HIVE
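A sketch of how the table creation and load might be issued from R by shelling out to the Hive CLI follows; the table name, the (abridged) column list and the HDFS path are illustrative assumptions.

# HiveQL statements passed to the Hive CLI via hive -e
# (table name, columns and HDFS path are illustrative assumptions)
create_stmt <- "
CREATE TABLE IF NOT EXISTS ipps_provider (
  drg_definition STRING,
  provider_id INT,
  provider_name STRING,
  provider_state STRING,
  total_discharges INT,
  average_covered_charges DOUBLE,
  average_total_payments DOUBLE
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"

load_stmt <- "LOAD DATA INPATH '/user/team3/ipps/Medicare_IPPS_clean.csv' INTO TABLE ipps_provider;"

system(paste("hive -e", shQuote(paste(create_stmt, load_stmt))))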

We have used Flume to capture log data about Ebola from Twitter. Based on this data we can analyze various factors, such as the top 5 countries tweeting about it and the places it has affected, and we can analyze the sentiment of the tweets, whether they are positive, negative or neutral about Ebola. So far, the tweets regarding Ebola have been captured by Flume and transferred to the Hadoop File System.

Figure 13: Ebola Tweets on HDFS

V. PROJECT INPUT AND OUTPUT


The traditional IT infrastructure has not been able to satisfy people in this new era of big analytics. As a result, many enterprises are turning to open source projects such as the R statistical programming language and Hadoop as a better response to this unmet commercial need.
Hadoop is an Apache product that performs parallel processing of data across multiple systems using a programmable model. It mainly consists of HDFS and HBASE for storage and MapReduce for distributed computing. R is a free software environment for statistical computing as well as for visual representation of data; it is applied in diverse fields such as classification, scoring, finding relationships, characterization, ranking and clustering.

HDFS Overview:
Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. A Hadoop cluster interconnected over a network provides the scalability, with the power to withstand failures without data loss. HDFS is the means of storage in Hadoop. R objects as well as other models can be stored in HDFS and later retrieved by a MapReduce job; the MapReduce job also writes its results back to HDFS once it finishes executing. These results are later inspected and analyzed by means of the R language, which makes this an essential functional unit in the process.
In order to provide a friendlier environment when working with HDFS, there are several layers on top of it, one of which is HBASE. HBASE essentially provides table structures similar to those of databases and helps open up the Hadoop framework to the R programmer.


Figure 14: HBASE Overview

MapReduce Data Reduction: The MapReduce framework is the processing pillar of the Hadoop environment. The framework applies specific procedures to an enormous data set, fragments the problem and the data, and runs them in parallel. The outcomes of these operations are written to HDFS/HBASE and can later be analyzed using the R language.
R code can be integrated with MapReduce jobs. This type of implementation expands the kinds and sizes of analytics that can be applied to very large datasets. In this process the model is pushed to the task nodes of the Hadoop cluster; the executing MapReduce job then loads the model into R on each task node. Data can be either aggregated or processed row by row as required, and the results are then stored on HDFS.
Visual representation of the datasets assists in understanding the data. Thus a binning algorithm in R is executed as a MapReduce job, and the output of this process can be used as input to an R client to render the representation.
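A minimal sketch of what such an R MapReduce job could look like, assuming the RHadoop rmr2 package is available and the cleansed file is already in HDFS; the HDFS path and the column positions are assumptions.

library(rmr2)

# Read the cleansed CSV from HDFS and compute the average covered charge per state
# (HDFS path and column positions are assumptions about the cleansed layout)
result <- mapreduce(
  input        = "/user/team3/ipps/Medicare_IPPS_clean.csv",
  input.format = make.input.format("csv", sep = ","),
  map = function(k, v) {
    # column 6: provider state, column 10: average covered charges (assumed positions)
    keyval(v[[6]], as.numeric(v[[10]]))
  },
  reduce = function(state, charges) {
    keyval(state, mean(charges, na.rm = TRUE))
  }
)

# Pull the reduced output back into the R client for inspection or plotting
avg_charge_by_state <- from.dfs(result)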

Figure 15: MapReduce R

VI. PROJECT EXPERIMENTATION RESULTS GAINED
Data Representation:
The information is presented using a simple yet very effective, interactive and user-friendly dashboard. Business dashboards can be presented in sundry forms and in various dimensions. In this project there are 15 reports integrated into 5 dashboards. Key performance indicators (KPIs) are used in the dashboards to indicate the performance of the Medicare system; the strategy can then be refined based on these indicators.
The key elements that play a crucial role in designing the dashboard are:
It is simple and communicates easily.
It has the least possible distractions, leading to less confusion.
It supports the organization with handy and meaningful data.
The dashboard is integrated into a simple HTML page while adhering to the elements above and using the selected KPIs. Integration features such as drill-down and drill-through let users obtain detailed and accurate information on the DRGs.

Steps followed while designing the dashboard:
1. Defining the KPIs to observe:
A lot of information is available in the Inpatient Prospective Payment System data. The following KPIs are used to display the data:
Average Covered Charges
Average Total Payments
Average Medicare Payments
Total Discharges
2. Visualizing the data:
After defining the KPIs, the next step is to represent the KPIs using charts. Diverse charts such as pie charts, bar charts, maps, bubble charts and tables are used.

Figure 16: Summary across DGV MIN, MAX, MEAN

Figure 17: Details of DGV selected - DRILL THROUGH

Figure 18: Details of DGV selected - DRILL DOWN (State, City, Street, Zip code)

3. The auto update of the Dashboard:
This is where the database comes into play, so that each time the dashboard is used it shows effective, up-to-date information. The frequency at which new data becomes available, and how important the information is, are the key factors considered while designing this part.

4. Export Features:
In order to preserve the information, an export feature is integrated into the dashboard. With its aid the dashboard can be exported to Excel, an image, a text file (data) or a PDF.

Figure 19: Export Features

VII. PROJECT RESULT ANALYSIS
The chart below represents the data visualization with respect to the states of the country. It is the categorization of the DRGs across the USA. In the following example we see the DRGs for Utah: the pie chart shows the DRG categories in that state and the bubble chart shows the cities in the state.

Figure 20: Categorize the DRG STATE wise (UTAH)

Figure 21: Details of DRG city wise (Salt Lake City) - DRILL THROUGH

The above chart shows the drill-through of the DRG state wise. We can see the bar graph plotted for the DRG vs Average Total Payments.
The bar chart below shows the Average Total Payments vs DRG. The red line in the middle indicates the target the hospital needs to achieve in order to earn a certain revenue. This helps the hospital administration set targets and see where they need to improve and where they are doing well.

Figure 22: Target vs Actual across hospital

The bar chart below represents the safe areas to live in based on the discharge rates: cities within a region vs discharge rates. The user can see which hospital to go to based on these statistics and make an informed decision. The user can choose a hospital in an area based on the number of discharges in the locality. When we click on a particular area, the values also change dynamically, giving the user the best experience to make the right decision.

Figure 23: Safe areas to live in

The bar chart below shows the Total Covered Charges vs the cities in the region. We can evaluate the insurance coverage with respect to each city in the region. This can help find a place close to the user where the charges levied by the hospital for the respective DRG are covered. Higher values indicate higher insurance amounts paid as the coverage charges. The user can make a decision based on these inputs to choose the right hospital for the DRG.

Figure 24: Safe areas to live in

VIII. FUTURE SCOPE
Ebola, an infectious and generally fatal disease, has so far infected more than 6,500 people and claimed more than 3,000 lives worldwide, according to the latest numbers from the WHO. That puts the fatality rate at around 47 percent. Emerging technologies are becoming extremely important in the fight against Ebola and in the effort to stop the further spread of the disease. Big Data technology paves the way for vast swathes of information to be combined and refined from a variety of sources while eliminating extraneous information along the way.
The future scope of our tool is to analyze data on the deadly Ebola disease, which requires being able to gather unstructured data as soon as it is generated, by any number of organizations from across the globe. Using information gathered from a wide range of sources, such as social media, updates from hospitals, and flight records and information, authorities can develop novel insights into where and how to respond. This helps not only in saving lives; it can also make sure that resources are allotted according to priorities.
We have used Flume to capture log data about Ebola on Twitter. Based on this data we can analyze various factors, such as the top 5 countries tweeting about it and the places it has affected, and we can analyze the sentiment of the tweets, whether they are positive, negative or neutral about Ebola. So far, the tweets regarding Ebola have been captured by Flume and transferred to the Hadoop File System.

Figure 25: Ebola Tweets on HDFS

IX. PROJECT SUMMARY
The Inpatient Prospective Payment System (IPPS) Provider Summary was analyzed to identify various statistics based on city, state and other criteria. This information was presented on a website created to enable users to easily access the data for their area.
After studying the project requirements and the given data
set, all the column fields were defined and analyzed. The
given data input consists of structured and unstructured data.
The medical data is unstructured data and the hospital location
(street, state, zip code) is the structured data.
Since we had both kinds of data to handle, we used technologies such as Hadoop to store both the structured and the unstructured data. In addition, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it can scale without limits.
We have used the R statistical tool to analyze the data; since it is an effective, open source project, inspection of the code was much easier. R is an interactive language, and it promotes experimentation and exploration. This project is a good platform on which to demonstrate the use of R and its benefits. The minimum system requirements were used to get optimal performance.
The data representation was done using a simple yet very effective, interactive and user-friendly dashboard built with Tableau. We chose Tableau dashboards as our data representation tool because they are helpful not only in summarizing the details but also in covering the key features needed for the project. The reports are deployed to the cloud, so anyone in need of the information can easily access them.
In short, these tools helped us take smart and fast decisions while respecting the time constraint. After the successful implementation of our project, users are able to easily select the type of DRG they desire for their preferred state, region, city, Zip code and many more search criteria, which makes it very simple and cost effective for users to look up the expenses that will be incurred based on their selection.
Moreover, users are able to look up the number of patients being treated in a hospital for a similar disease, so they can easily check how widespread the disease is and how effective the treatment in a hospital is. The disease-prevalent areas were also identified, which helps alert the people nearby.

REFERENCES
[1] David, S., "The Marriage of Hadoop and R: Revolution Analytics at Hadoop World", http://www.r-bloggers.com/the-marriage-of-hadoop-and-r-revolution-analytics-at-hadoop-world/, November 11, 2011.
[2] Revolution Analytics, "Advanced Big Data Analytics with R and Hadoop", 2011.
[3] "Inpatient Prospective Payment System (IPPS) Provider Summary for the Top 100 Diagnosis-Related Groups (DRG)", http://www.revolutionanalytics.com/sites/default/files/r-and-hadoop-big-data-analytics.pdf, 2011.
[4] Bart, "Creating a Business Dashboard in R", http://www.r-bloggers.com/creating-a-business-dashboard-in-r/, March 28, 2013.
[5] Katie, "The Importance of Dashboards", http://www.thetingleyadvantage.com/2013/06/the-importance-of-dashboards.html, June 20, 2013.
[6] Dhruba Borthakur, "The Hadoop Distributed File System: Architecture and Design", The Apache Software Foundation, http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf, 2007.
[7] Amirtha, T., "Why the R Programming Language is Good for Business", http://www.fastcolabs.com/3030063/why-the-r-programming-language-is-good-for-business/, May 5, 2014.
[8] Anderson, T., "Implementing Process", UTS: Project Management, http://www.projects.uts.edu.au/stepbystep/implementing.html, April 6, 2006.
[9] "Welcome to Apache Hadoop", Apache Software Foundation, http://hadoop.apache.org/, April 10, 2014.
[10] "Hadoop-common", Apache Hadoop Common GitHub page, https://github.com/apache/Hadoop-common.
[11] "Managing Hadoop Projects: What You Need to Know to Succeed", TechTarget, http://searchcloudcomputing.techtarget.com/definition/MapReduce, Feb 8, 2010.
[12] Murthy, A., "Apache Hadoop YARN - Background and an Overview", http://hortonworks.com/blog/apache-Hadoop-yarn-background-and-an-overview/, August 7, 2012.
[13] Loughran, S., "PoweredBy", https://wiki.apache.org/Hadoop/PoweredBy, Feb 16, 2014.
[14] "Hadoop Tutorial 1 - What is Hadoop?", http://zerotoprotraining.com/index.php?mode=video&id=323.
[15] "Hadoop", http://www.edevzone.com/hdfs/.
[16] "How to crunch your data stored in HDFS?", http://blog.octo.com/en/how-to-crunch-your-data-stored-in-hdfs/.
[17] "Get started on Hadoop", http://hortonworks.com/tutorials/.
[18] "Hadoop Distributed File System (HDFS) Introduction", http://hortonworks.com/Hadoop/hdfs/.
[19] "Adventures in Data", http://bigdata.wordpress.com/2010/03/22/security-in-Hadoop-part-1/.
[20] "Understanding Hadoop Clusters and the Network", http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/.
[21] "HDFS Architecture", http://hadoop.apache.org/docs/r0.19.0/hdfs_design.html, April 21.
