Yeshwanth Ravindra
009400534
yeshwanth.ravindrababu@sjsu.edu

009318400
anushree.narasimhamurthysonni@sjsu.edu
I. INTRODUCTION
In today's ever-expanding world, every individual needs to be aware of emerging medical advancements and technologies, and of the services offered by the medical centers around them. Although a few websites are available for referencing medical data, there is a strong need for a dedicated application to mine the large data sets behind these sites. One could of course keep up with the latest news, but it is better to have a tool that provides every possible analysis of, and all available information about, medical centers at the click of a button.
The data provided here under the Data Input includes hospital-specific charges for more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare at a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG).
The project focuses on various factors and mainly on three different types of users: the Government, the Public and the Hospital. The Government can see the statistics of the charges and the tax reimbursement to be issued. The Public can choose a DRG based on cost, location and the service provided. The Hospital administration can check the financial performance of their institution.
1.1 System Resources:
1.1.1 Minimum resources:
While there are no guarantees on the minimum resources
required by Hadoop daemons, the community attempts to not
increase requirements within a minor release.
Apache Hadoop is known to work reasonably well on operating systems such as GNU/Linux, Microsoft Windows, Apple Mac OS X and Solaris.
1.1.2 Framework: Apache Hadoop
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks amongst the nodes in the cluster. To process the data, Hadoop Map/Reduce ships the computation to the nodes that hold the required data, and the nodes then process the data in parallel. This approach leverages data locality. The cleansed DRG file is transferred to the Hadoop file system.
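As a brief illustration (the local file name and HDFS path here are assumptions, not the project's actual ones), this transfer can be driven from an R session by shelling out to the Hadoop CLI:

    # Sketch: copy the locally cleansed CSV into HDFS via the Hadoop CLI.
    # File name and HDFS path are illustrative assumptions.
    local_file <- "inpatient_charges_cleansed.csv"
    hdfs_dir   <- "/user/project/medicare"

    system(paste("hdfs dfs -mkdir -p", hdfs_dir))             # create target dir
    system(paste("hdfs dfs -put -f", local_file, hdfs_dir))   # upload the file
    system(paste("hdfs dfs -ls", hdfs_dir))                   # verify the upload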
1.1.3 Database: HIVE
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
1.1.4 Technologies Used:
Languages: R
R is a free software programming language that is widely used among statisticians and data miners. R is an implementation of the S programming language combined with lexical scoping semantics; S was created by John Chambers while at Bell Labs.
R is used to categorize the DRGs into three levels, HIGH, MEDIUM and LOW, based on expense. Aggregation features of R such as min, max, mean and summary are exploited.
Figure 2: R function
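As a minimal sketch of what such a function might look like, the snippet below labels each DRG by expense level. The column name and the tercile breakpoints are illustrative assumptions, not the project's exact logic:

    # Sketch: label each DRG as HIGH / MEDIUM / LOW by its average payment.
    # The column name and tercile breakpoints are illustrative assumptions.
    categorize_drg <- function(df, col = "Average.Total.Payments") {
      x <- df[[col]]
      # The aggregation features mentioned above: min, max, mean, summary.
      cat("min:", min(x), " max:", max(x), " mean:", mean(x), "\n")
      print(summary(x))
      # Split the payment range at its terciles into three expense levels.
      breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))
      df$ExpenseLevel <- cut(x, breaks = breaks,
                             labels = c("LOW", "MEDIUM", "HIGH"),
                             include.lowest = TRUE)
      df
    }

    # Hypothetical usage:
    # drg <- categorize_drg(read.csv("inpatient_charges_cleansed.csv"))
    # table(drg$ExpenseLevel)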
1.2 Hardware
1.2.1 Architecture:
Pairing of MapReduce with HDFS works well because HDFS
provides high bandwidth across a large cluster.
SQL Server Reporting Services is a server-based reporting
platform that provides comprehensive reporting functionality
for a variety of data sources. Reporting Services includes a
complete set of tools to create, manage, and deliver reports,
and APIs that enable developers to integrate or extend data
and report processing in custom applications. Reporting
Services tools work within the Microsoft Visual Studio
environment and are fully integrated with SQL Server tools
and components.
HIVE:
The main components of Hive are:
a) External Interfaces - Hive provides both user interfaces
like command line (CLI) and web UI, and application
programming interfaces (API) like JDBC and
ODBC.
b) The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in PHP, Perl, Python, etc.
c) The Metastore is the system catalog. All other components
of Hive interact with the metastore.
d) The Driver manages the life cycle of a HiveQL statement
during compilation, optimization and execution. On receiving
the HiveQL statement, from the thrift server or other
interfaces, it creates a session handle which is later used to
keep track of statistics like execution time, number of output
rows, etc.
e) The Compiler is invoked by the driver upon receiving
a HiveQL statement. The compiler translates this statement
into a plan which consists of a DAG of map reduce jobs.
f) The driver submits the individual map-reduce jobs from
the DAG to the Execution Engine in a topological order. Hive
currently uses Hadoop as its execution engine.
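To make interface (a) concrete, the sketch below issues a HiveQL query from R over JDBC using the RJDBC package. The table name, driver jar path and connection URL are assumptions for illustration:

    # Sketch: run a HiveQL aggregation from R through the Hive JDBC driver.
    # Table name, jar path and connection URL are illustrative assumptions.
    library(RJDBC)

    drv  <- JDBC("org.apache.hive.jdbc.HiveDriver",
                 classPath = "/opt/hive/lib/hive-jdbc-standalone.jar")
    conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default")

    # Average total payments per DRG, most expensive first.
    res <- dbGetQuery(conn, "
      SELECT drg_definition, AVG(average_total_payments) AS avg_payment
      FROM inpatient_charges
      GROUP BY drg_definition
      ORDER BY avg_payment DESC
      LIMIT 10")

    head(res)
    dbDisconnect(conn)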
1.3.2 Tableau
The dashboards are published to the Tableau Public server, so that anyone who is in need of the information can easily access it.
2. Project Operation:
This is the practical management of the project. Here, project inputs are transformed into outputs to achieve the immediate objectives.
Figure 8 below shows how the data set is ingested and analyzed: the output is obtained, categorized and refined on user demand, for example restricted to a particular region or state. The user can further search for average Medicare payments as well as total payments.
The raw file had a few hiccups, such as commas and dollar signs in some columns, which had to be removed so that the data would land in its intended columns in HIVE. We therefore had to cleanse the data before importing it into the Hadoop file system.
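A minimal sketch of this cleansing step in R is shown below; the file and column names are assumptions:

    # Sketch: strip dollar signs and thousands-separator commas so every
    # value lands in its intended column when loaded into HIVE.
    # File and column names are illustrative assumptions.
    raw <- read.csv("inpatient_charges_raw.csv", stringsAsFactors = FALSE)

    money_cols <- c("Average.Covered.Charges",
                    "Average.Total.Payments",
                    "Average.Medicare.Payments")

    for (col in money_cols) {
      raw[[col]] <- as.numeric(gsub("[$,]", "", raw[[col]]))  # "$1,234.5" -> 1234.5
    }

    write.csv(raw, "inpatient_charges_cleansed.csv", row.names = FALSE)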
The reduce part of the job then takes the output of the Map phase and produces a single consolidated output.
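A minimal sketch of such a job written from R with the rmr2 package follows; the package choice and column names are assumptions, and the project's actual jobs may differ:

    # Sketch: mean Average Total Payments per DRG as a MapReduce job (rmr2).
    # Package choice and column names are illustrative assumptions.
    library(rmr2)

    charges <- read.csv("inpatient_charges_cleansed.csv")
    input   <- to.dfs(charges)                  # stage the data frame in HDFS

    # Map: emit one (DRG, payment) pair per row of each data-frame chunk.
    map_fn    <- function(k, v) keyval(v$DRG.Definition, v$Average.Total.Payments)
    # Reduce: collapse each DRG's payments to their mean.
    reduce_fn <- function(k, vv) keyval(k, mean(vv))

    out     <- mapreduce(input = input, map = map_fn, reduce = reduce_fn)
    results <- from.dfs(out)                    # read the result back from HDFS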
HDFS Overview:
Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. A Hadoop cluster interconnected over a network provides the scaling factor, with the power to withstand failures without data loss. HDFS is the means of storage in Hadoop. R objects, as well as other models, can be stored in HDFS and later retrieved using a MapReduce job. The MapReduce jobs even write their results back to HDFS once they are done with execution. These results are later inspected and analyzed by means of the R language, which makes HDFS an essential functional unit in the process.
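As a small illustration of storing and retrieving an R object in HDFS, the sketch below uses the rmr2 helpers; the fitted model, column names and HDFS path are all assumptions:

    # Sketch: persist a fitted R model in HDFS and read it back later.
    # The model, columns and HDFS path are illustrative assumptions.
    library(rmr2)

    model <- lm(Average.Total.Payments ~ Total.Discharges, data = charges)

    # to.dfs serializes the R object into HDFS and returns a handle to it.
    handle <- to.dfs(model, output = "/user/project/models/payments_lm")

    # A later session or MapReduce job can deserialize the same object.
    restored <- values(from.dfs("/user/project/models/payments_lm"))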
To provide a friendlier environment for working with HDFS, several layers sit on top of it; one of these is HBASE. HBASE essentially provides table structures similar to databases and helps open up the Hadoop framework to the R programmer.
1. Defining the KPIs:
The key performance indicators are first defined from the data set; examples include Average Total Payments, Average Medicare Payments and Total Discharges.
2. Visualizing the data:
After defining the KPIs, the next step is to represent them using charts. Diverse chart types such as pie charts, bar charts, maps, bubble charts and tables are used.
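The project renders these charts in Tableau; purely as an illustration of the same KPIs, a quick version can be sketched in base R (the column names, including the ExpenseLevel label from the earlier categorization, are assumptions):

    # Sketch: a pie chart and a bar chart over the assumed KPI columns.
    drg <- read.csv("inpatient_charges_cleansed.csv")

    # Pie chart: share of DRGs in each expense level (HIGH / MEDIUM / LOW).
    pie(table(drg$ExpenseLevel), main = "DRGs by expense level")

    # Bar chart: the ten DRGs with the highest average total payments.
    avg <- sort(tapply(drg$Average.Total.Payments, drg$DRG.Definition, mean),
                decreasing = TRUE)[1:10]
    barplot(avg, las = 2, cex.names = 0.6,
            main = "Average Total Payments by DRG (top 10)")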
VII. PROJECT RESULT ANALYSIS
The chart below represents the data visualization with respect to each state in the country; it is the categorization of the DRGs across the USA. In the following example, we see the DRGs for Utah: a pie chart of the DRG categories in that state and a bubble chart of the cities within the state.
Figure 17: Details of selected DRG (drill-through)
Figure 18: Details of selected DRG (drill-down: state, city, street, zip code)
4. Export Features:
In order to preserve the information, an export feature is integrated into the dashboard. With its aid, the dashboard can be exported to Excel, an image, a text file (data) or a PDF.
The above chart shows the drill-through of the DRGs state-wise. We can see the bar graph plotted for DRG vs. Average Total Payments.
The bar chart below shows Average Total Payments vs. DRG. The red line in the middle indicates the target the hospital must achieve to earn a certain revenue. This helps the hospital administration to set targets and see where they need to improve and where they are doing well.
The bar chart below represents safe areas to live in based on discharge rates, plotting cities within a region against their discharge rates. The user can see from these statistics which hospital to go to and can make an informed decision, choosing a hospital by area based on the number of discharges in the locality. When a particular area is clicked, the values change dynamically, giving the user the best experience to make the right decision.
VIII. FUTURE SCOPE
Ebola, an infectious and generally fatal disease, has infected more than 6,500 people and claimed more than 3,000 lives worldwide so far, according to the latest numbers from the WHO. That puts the fatality rate at around 47 percent. Emerging technologies are becoming extremely important in the fight against Ebola, striving to stop the further spread of the disease. Big Data technology paves the way for vast swaths of information to be combined and refined from a variety of sources while eliminating extraneous information along the way.
The future scope of our tool is to analyze data on the deadly Ebola disease, which requires being able to gather unstructured data as soon as it is generated, by any number of organizations from across the globe. Using information gathered from a wide range of sources, such as social media, reports from hospitals, and flight records and information, authorities can develop novel insights into where and how to respond. This not only helps save lives, it can also ensure that resources are allotted according to priorities.
We have used Flume to capture data log files regarding Ebola on Twitter. Based on this data, we can analyze various factors, such as the top 5 countries tweeting about it and the places that have been affected, and we can analyze the sentiment of the tweeters, whether they are positive, negative or neutral about Ebola. So far, the tweets regarding Ebola have been captured by Flume and transferred to the Hadoop File System.
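A minimal sketch of the planned sentiment step is given below; the word lists and the tweet file pulled from HDFS are assumptions:

    # Sketch: score captured Ebola tweets as positive / negative / neutral
    # using simple word lists. File path and word lists are assumptions.
    positive <- c("recover", "cure", "hope", "improve", "safe")
    negative <- c("death", "outbreak", "fear", "spread", "fatal")

    score_tweet <- function(text) {
      words <- strsplit(tolower(text), "[^a-z]+")[[1]]
      s <- sum(words %in% positive) - sum(words %in% negative)
      if (s > 0) "positive" else if (s < 0) "negative" else "neutral"
    }

    # Tweets exported from the Flume sink in HDFS, e.g. via "hdfs dfs -get".
    tweets <- readLines("ebola_tweets.txt")
    table(vapply(tweets, score_tweet, character(1)))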
The disease-prevalent areas were also identified, which helps alert the people nearby.