Abstract-Businesses have come to rely on their data warehouse as a key component in their Information Technology infrastructure. The costs of the architecture to support
these environments are significant. Therefore, choosing the
wrong architecture can be a very costly decision. However,
considerable confusion exists in relation to MapReduce and
Parallel Database Management Systems (DBMS). In the past,
MapReduce has been presented variously as a replacement for
Parallel DBMS, as an additional tool that works alongside
Parallel DBMS, and as an inferior tool. This paper will
consider the broader themes
of the paradigms rather than the specific implementations of
MapReduce and Parallel DBMS. It will discuss MapReduce
and Parallel Database Management Systems as competing
and complementary paradigms. The aim of this paper is to
provide a high-level comparison between MapReduce and
Parallel DBMS, providing a selection of criteria which can
be used to choose between MapReduce and Parallel DBMS for
a particular enterprise application.
Keywords-MapReduce; Parallel Database Management Systems
I. INTRODUCTION
In 2008, the world's servers processed 9.57 zettabytes of
information. For every worker, approximately three
terabytes of information are created every year [1]. These high
volumes of data have significant potential to improve understanding, leading to scientific breakthroughs and business
process improvements. Examples of these improvements
include personalized genome sequencing, extracting real-time trends from business analytics, and social-network-based
recommendations. However, the amount of data collected is
now outpacing improvements in data storage technology [2].
The traditional data analysis approach is to load the data into
a database and use a query language to perform the
analysis. MapReduce has been presented as an alternative
method, with implementations in leading IT companies such
as Google and Facebook.
MapReduce is a method for processing large volumes of
distributed data, allowing the use of shared-nothing clusters.
In distributed architectures, a shared-nothing cluster is one
where the nodes of the cluster share no common disk
space, CPU, or memory. The program
flow of MapReduce can be subdivided into a number of
ISBN: 978-1-61208-246-2
TA-
This query will load the data from the Hadoop file,
called mydfsfile.txt, into the database table Tab1. The UDF
is called udfLoadHadoop. This function takes
the Hadoop file as a parameter and integrates the data using
the Hadoop NameNode metadata. The NameNode identifies
which nodes in the Hadoop file system contain the required
data. The UDF performs calculations based on the file size
and the number of parallel nodes to determine which data
belongs to each node. It then requests that data using the
NameNode [13].
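The partitioning step described above can be illustrated with a minimal sketch. The text only states that the calculation depends on the file size and the number of parallel nodes; the even split into contiguous byte ranges shown here is an assumption, not the formula used in [13], and the function name split_byte_ranges is hypothetical. Record boundaries are ignored for simplicity.

```python
def split_byte_ranges(file_size: int, num_nodes: int) -> list[tuple[int, int]]:
    """Assign each node one contiguous, half-open (start, end) byte range.

    Assumption: the file is split as evenly as possible; any remainder
    bytes are spread over the first nodes. This is an illustrative
    scheme, not the calculation from the cited work.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    if file_size < 0:
        raise ValueError("file_size must be non-negative")

    base, extra = divmod(file_size, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes each take one additional byte.
        length = base + (1 if node < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


if __name__ == "__main__":
    # Hypothetical example: a 10-byte file read by 3 parallel nodes.
    for node, (start, end) in enumerate(split_byte_ranges(10, 3)):
        print(f"node {node}: bytes [{start}, {end})")
```

Under this assumed scheme, each node can derive its own range from the same two inputs, so no coordination between nodes is needed before requesting the data.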
VI. RESULTS
This section will provide a comparison of MapReduce and
parallel DBMS, across a broad range of criteria:
Data Volume:
Parallel DBMS - Has been used for data volumes
on the order of petabytes;
MapReduce - Has been used for data volumes
on the order of petabytes;
Cost:
Parallel DBMS - Enterprise-level toolset. This is
an expensive investment;
MapReduce - This is an open-source-based solution.
The investment is considered inexpensive;
Fault Tolerance:
Parallel DBMS - Transaction Level, cannot survive
node failure;
MapReduce - Fault tolerant, designed to survive
multiple node failures;
Users:
Parallel DBMS - Can be used by multiple user
types, from Business Users using reporting tools,
through SQL novices and expert users;
Unstructured data: If the primary data source is unstructured then the cost of transforming it and loading
it into a parallel DBMS is prohibitive. Based on this,
MapReduce would be a good candidate;
Cost: If cost is the main driver for the organization,
then MapReduce is the better candidate. Parallel DBMS
systems are considered enterprise level tools, but this
comes at a high cost;
User skill level: If the organization has an available
pool of highly skilled developers, then MapReduce is a
good option. In addition, if the organization is one in
which control of the data is important, then MapReduce
is also the better candidate;
Structured data: If the data is structured and will continue to be so for the foreseeable future, then parallel
DBMS would be a good fit;
Enterprise Level Support: If enterprise level support is
important to the organization, then the parallel DBMS
vendors would be the preferred option. Although there
are companies that offer this for MapReduce (for example, Cloudera),
the parallel DBMS vendors have been providing
this support for a longer time;
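The selection criteria above can be read as a simple checklist. The following is a minimal sketch of that reading, in which each criterion is recorded under the paradigm it favours. The class and function names are hypothetical, and listing the criteria without weighting them is an assumption; the paper does not prescribe a scoring method.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical description of an organization's situation."""
    data_is_structured: bool
    cost_is_main_driver: bool
    has_skilled_developers: bool
    control_of_data_important: bool
    needs_enterprise_support: bool


def criteria_in_favour(req: Requirements) -> dict[str, list[str]]:
    """List which of the criteria favour each paradigm (unweighted)."""
    favours: dict[str, list[str]] = {"MapReduce": [], "Parallel DBMS": []}

    # Structured data favours parallel DBMS; unstructured favours MapReduce.
    if req.data_is_structured:
        favours["Parallel DBMS"].append("structured data")
    else:
        favours["MapReduce"].append("unstructured data")

    if req.cost_is_main_driver:
        favours["MapReduce"].append("cost")
    if req.has_skilled_developers:
        favours["MapReduce"].append("user skill level")
    if req.control_of_data_important:
        favours["MapReduce"].append("control of data")
    if req.needs_enterprise_support:
        favours["Parallel DBMS"].append("enterprise level support")
    return favours


if __name__ == "__main__":
    example = Requirements(
        data_is_structured=False,
        cost_is_main_driver=True,
        has_skilled_developers=True,
        control_of_data_important=False,
        needs_enterprise_support=True,
    )
    for paradigm, reasons in criteria_in_favour(example).items():
        print(paradigm, "<-", ", ".join(reasons) or "none")
```

The sketch deliberately returns the reasons rather than a single verdict, since the relative importance of each criterion depends on the organization.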
REFERENCES
[1] R. B. J.E. Short and C. Baru, "How Much Information? 2010:
Report on Enterprise Server Information,"
http://hmi.ucsd.edu/pdf/HMI_2010_EnterpriseReport_Jan_2011.pdf, Jan.
2011, [Online; accessed 27-Nov.-2012].
[9] W. Fang, B. He, Q. Luo, and N. Govindaraju, "Mars: Accelerating MapReduce with graphics processors," Parallel and
Distributed Systems, IEEE Transactions on, vol. 22, no. 4,
pp. 608-620, Apr. 2011.
[10] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka,
S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive: a
warehousing solution over a map-reduce framework," Proc.
VLDB Endow., vol. 2, no. 2, pp. 1626-1629, Aug. 2009,
[Online; accessed 27-Nov.-2012].
[11] D. DeWitt and J. Gray, "Parallel database systems: the future
of high performance database systems," Commun. ACM,
vol. 35, no. 6, pp. 85-98, Jun. 1992, [Online; accessed 27-Nov.-2012].
[12] M. O'Sullivan, Teradata,
http://www.teradata.com/newsrelease.aspx?id=7243, Apr. 2012, [Online; accessed
27-Nov.-2012].
[13] Y. Xu, P. Kostamaa, and L. Gao, "Integrating Hadoop and
parallel DBMS," in Proceedings of the 2010 international
conference on Management of data, ser. SIGMOD '10. New
York, NY, USA: ACM, 2010, pp. 969-974, [Online; accessed
27-Nov.-2012].