
© 2014, IJCSE All Rights Reserved

International Journal of Computer Sciences and Engineering (Open Access)
Research Paper, Volume-2, Issue-8, pp. 35-38, E-ISSN: 2347-2693
A Study Using PI on: Sorting Structured Big Data in a Distributed
Environment Using Apache Hadoop MapReduce
R. Murugesh¹* (corresponding author) and I. Meenatchi²
¹*Department of Computer Science and Engineering, Pondicherry Engineering College, Pondicherry, India
²Department of Computer Science and Engineering, Pondicherry University, Pondicherry, India
Received: 14 August 2014 Revised: 19 August 2014 Accepted: 25 August 2014 Published: 31 August 2014
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data
sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce
function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in
this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a
large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the
program's execution across a set of machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the
resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is
highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers
find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand
MapReduce jobs are executed on Google's clusters every day. Due to the exponential growth of information technology, a
tremendous amount of data is generated, and it is important to extract meaningful information from scattered big data. Data
mining therefore plays a vital role in many of these applications [1]. To speed up the mining process, we turn to parallel and
distributed processing. However, distributed sorting is not an easy task, since in a distributed environment irregular and
imbalanced computation loads may greatly degrade overall performance. Load balance among processors is thus very
important to parallel and distributed mining.
In this work, a new sorting approach is proposed using the MapReduce technique over the Hadoop framework for distributed
processing. A popular free implementation is Apache Hadoop. The Hadoop stack is a data processing platform that combines
elements of databases, data integration tools, and parallel coding environments into a new and interesting mix. One advantage
Hadoop has over data integration tools is that it is accessible to a variety of programming languages, which means it can be
used for arbitrary parallel coding, such as complex analytics.
Keywords: Big data, Structured big data, Hadoop, HDFS, MapReduce, PI.
I. INTRODUCTION
MapReduce is a technique for processing large data sets in a
distributed environment such as a Hadoop cluster. A MapReduce
program comprises a Map() procedure that performs
filtering and a Reduce() procedure that performs a
summarizing operation [2]. MapReduce libraries have been
written in many programming languages, with different levels
of optimization.
Data to be sorted are often very large, on the order of gigabytes,
terabytes, or petabytes. This raises the need for fast retrieval of
data in sorted order.
Map step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes.
A worker node may do this again in turn, leading to a multi-
level tree structure. The worker node processes the smaller
problem, and passes the answer back to its master node.
Reduce step: The master node then collects the answers to
all the sub-problems and combines them in some way to
form the output, the answer to the problem it was originally
trying to solve.
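The master/worker pattern just described can be sketched in a single process. The following word-count example is purely illustrative (the function names and data are my own, not Hadoop's API): the "master" divides the input into sub-problems, "workers" run map_step on each one, and the answers are combined by reduce_step.

```python
# Minimal single-process sketch of the map and reduce steps described above.
# A real Hadoop job distributes this work across many nodes.
from collections import defaultdict

def map_step(chunk):
    """Worker: emit (word, 1) pairs for one sub-problem (a chunk of text)."""
    return [(word, 1) for word in chunk.split()]

def reduce_step(word, counts):
    """Worker: combine all intermediate values for one key."""
    return (word, sum(counts))

def run_job(documents):
    # Master: divide the input into sub-problems and hand them to mappers.
    intermediate = defaultdict(list)
    for chunk in documents:
        for key, value in map_step(chunk):
            intermediate[key].append(value)
    # Master: collect the answers and combine them into the final output.
    return dict(reduce_step(k, v) for k, v in intermediate.items())

result = run_job(["big data", "big cluster"])
print(result)  # {'big': 2, 'data': 1, 'cluster': 1}
```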
II. COMPUTATION ALGORITHM [3]
1. Prepare the Map() input: the "MapReduce system"
designates Map processors, assigns the K1 input key
value each processor would work on, and provides that
processor with all the input data associated with that key
value.
2. Run the user-provided Map() code: Map() is run exactly
once for each K1 key value, generating output organized
by key values K2.
3. "Shuffle" the Map output to the Reduce processors: the
MapReduce system designates Reduce processors,
assigns the K2 key value each processor would work on,
and provides that processor with all the Map-generated
data associated with that key value.
4. Run the user-provided Reduce() code: Reduce() is run
exactly once for each K2 key value produced by the
Map step.
5. Produce the final output: the MapReduce system
collects all the Reduce output, and sorts it by K2 to
produce the final outcome.
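The five phases above can be condensed into an illustrative single-process model. This is not Hadoop's implementation; all names (mapreduce, map_fn, reduce_fn) are my own, and the K1/K2 keys mirror the notation used in the steps above.

```python
# Illustrative sketch of the five phases: prepare K1 inputs, run Map()
# once per K1 key, shuffle by K2, run Reduce() once per K2 key, and
# produce the final output sorted by K2.
from itertools import groupby

def mapreduce(inputs, map_fn, reduce_fn):
    # Phases 1-2: run Map() exactly once for each K1 key value.
    intermediate = []
    for k1, v1 in inputs.items():
        intermediate.extend(map_fn(k1, v1))
    # Phase 3: shuffle, grouping all Map output by its K2 key value.
    intermediate.sort(key=lambda kv: kv[0])
    grouped = {k2: [v for _, v in group]
               for k2, group in groupby(intermediate, key=lambda kv: kv[0])}
    # Phase 4: run Reduce() exactly once for each K2 key value.
    # Phase 5: collect the output, sorted by K2.
    return [(k2, reduce_fn(k2, vals)) for k2, vals in sorted(grouped.items())]

# Example: count words grouped by their first letter.
out = mapreduce(
    {"doc1": "apple ant bee", "doc2": "bear axe"},
    map_fn=lambda k1, text: [(w[0], 1) for w in text.split()],
    reduce_fn=lambda k2, vals: sum(vals),
)
print(out)  # [('a', 3), ('b', 2)]
```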
III. STRUCTURED BIG DATA
The term structured data refers to data that is identifiable
because it is organized in a structure. The most common
form of structured data or structured data records (SDR) is a
database where specific information is stored based on a
methodology of columns and rows.
Structured data is also searchable by data type within
content. Structured data is understood by computers and is
also efficiently organized for human readers.
Some types of structured data analysis include regression
analysis, Bayesian analysis, cluster analysis, combinatorial
data analysis, geometric data analysis, and topological data
analysis.
IV. APACHE HADOOP AND TECHNICAL COMPONENTS [4]
Apache Hadoop is an open-source software
framework that supports data-intensive distributed
applications, licensed under the Apache v2 license. It
supports the running of applications on large clusters of
commodity hardware. Hadoop was derived from Google's
MapReduce and Google File System (GFS) papers.
A Hadoop stack is made up of a number of technical
components. They include:
Hadoop Distributed File System (HDFS): The default
storage layer in any given Hadoop cluster;
Name Node: The node in a Hadoop cluster that provides
the client information on where in the cluster particular
data is stored and if any nodes fail;
Secondary Node: A backup to the Name Node, it
periodically replicates and stores data from the Name
Node in case the Name Node fails;
Job Tracker: The node in a Hadoop cluster that initiates
and coordinates MapReduce jobs, or the processing of
the data.
Slave Nodes: The grunts of any Hadoop cluster, slave
nodes store data and take direction to process it from the
Job Tracker.
V. HDFS [5]
A Hadoop cluster comprises a single master and multiple
slave or worker nodes. The JobTracker is the service
within Hadoop that farms out MapReduce tasks to
specific nodes in the cluster, ideally the nodes that have
the data, or at least are in the same rack.

A TaskTracker is a node in the cluster that accepts tasks -
Map, Reduce and Shuffle operations - from a JobTracker.
The master node consists of a JobTracker, TaskTracker,
NameNode, and DataNode. A slave or worker node acts
as both a DataNode and TaskTracker.
In a larger cluster, HDFS is managed through a
dedicated NameNode server that hosts the file system index,
and a secondary NameNode that can generate snapshots
of the NameNode's memory structures, thus preventing
file system corruption and reducing loss of data.
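The JobTracker's placement preference described above (ideally the node holding the data, or at least one in the same rack) can be modeled with a toy scheduler. Everything here, including the node names, the rack map, and the pick_node helper, is illustrative and not part of Hadoop's API.

```python
# Toy model of data-locality-aware task placement: prefer a free node
# that stores the data block, then a free node in the same rack as a
# replica, and only then any free node.
def pick_node(block_replicas, racks, free_nodes):
    """block_replicas: nodes holding the block; racks: node -> rack id."""
    # 1. Data-local: a free node that already stores the block.
    for node in free_nodes:
        if node in block_replicas:
            return node, "data-local"
    # 2. Rack-local: a free node in the same rack as some replica.
    replica_racks = {racks[n] for n in block_replicas}
    for node in free_nodes:
        if racks[node] in replica_racks:
            return node, "rack-local"
    # 3. Otherwise: fall back to any free node.
    return free_nodes[0], "off-rack"

racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(pick_node({"n1"}, racks, ["n2", "n3"]))  # ('n2', 'rack-local')
```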

VI. SORTING USING MAPREDUCE
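No figure survives for this section in the source, so as an illustration only: one common way to sort with MapReduce (popularized by Hadoop's TeraSort example) is to sample the input to choose splitter keys, range-partition the map output so each reducer owns a disjoint key range, and concatenate the reducers' sorted outputs into a globally sorted result. A minimal single-process sketch, with all names my own:

```python
# Range-partitioned sort in the MapReduce style. Each "reducer" owns a
# disjoint, ordered key range, so concatenating their sorted outputs
# yields a globally sorted list.
import bisect
import random

def sample_splitters(data, num_reducers, sample_size=100):
    """Sample the input to pick cut points that balance reducer load."""
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    step = len(sample) // num_reducers
    return [sample[i * step] for i in range(1, num_reducers)]

def mapreduce_sort(data, num_reducers=4):
    splitters = sample_splitters(data, num_reducers)
    # Map/partition step: route each key to the reducer owning its range.
    partitions = [[] for _ in range(num_reducers)]
    for key in data:
        partitions[bisect.bisect_left(splitters, key)].append(key)
    # Reduce step: each reducer sorts only its own partition.
    out = []
    for part in partitions:
        out.extend(sorted(part))
    return out

data = [random.randrange(10**6) for _ in range(5000)]
print(mapreduce_sort(data) == sorted(data))  # True
```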


VII. TEST CASES OF PI
The number π is a mathematical constant, the ratio of
a circle's circumference to its diameter, approximately equal
to 3.14159.
It has been represented by the Greek letter "π" since the mid-
18th century.
Being an irrational number, π cannot be expressed exactly as
a common fraction, although fractions such as 22/7 and other
rational numbers are commonly used to approximate it.
Consequently, its decimal representation never ends and
never settles into a permanent repeating pattern. The digits
appear to be randomly distributed although, to date, no proof
of this has been discovered.
Default actual value of Pi = 3.141592

Test Case | No. of Nodes | Samples per Node | Time (seconds) | Result   | Accuracy
----------+--------------+------------------+----------------+----------+---------
    1     |       5      |        10        |     23.842     | 3.280000 | 104.40%
    2     |       5      |        50        |     28.892     | 3.168000 | 100.84%
    3     |      10      |        50        |     25.905     | 3.160000 | 100.58%
    4     |      30      |        50        |     85.748     | 3.152000 | 100.33%
    5     |      50      |       100       |     57.196     | 3.141600 | 100.00%
    6     |     100      |       100       |    112.722     | 3.140800 |  99.97%
    7     |     100      |      1000       |    160.731     | 3.141200 |  99.98%
    8     |     100      |       500       |    118.291     | 3.142480 | 100.02%
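For context on the test cases above: Hadoop's bundled pi example estimates π by a (quasi-)Monte Carlo method in which each map task throws sample points into the unit square and counts how many land inside the inscribed quarter circle, and the reduce step sums the counts, giving π ≈ 4 × inside / total. The sketch below is a single-process version of that computation using plain pseudo-random sampling rather than Hadoop's Halton sequence; all names are illustrative.

```python
# Single-process sketch of the MapReduce pi estimator: one map_task per
# "node", then a reduce step that aggregates the counts.
import random

def map_task(num_samples, seed):
    """One map task: count sample points inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def estimate_pi(num_nodes, samples_per_node):
    # Reduce step: sum the per-task counts and scale by the square's area.
    inside = sum(map_task(samples_per_node, seed) for seed in range(num_nodes))
    return 4.0 * inside / (num_nodes * samples_per_node)

print(estimate_pi(num_nodes=100, samples_per_node=10000))  # close to 3.14159
```

As the table suggests, accuracy improves with the total number of samples (nodes × samples per node), while wall-clock time grows with cluster coordination overhead.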

VIII. SIMULATION RESULTS
[Figures: simulation result charts for the test cases above; not recoverable from the source.]
IX. CONCLUSION AND FUTURE WORK
In this work, we have explored the performance of
computing an approximate value of PI using Hadoop MapReduce
across several test cases, and collected structured big data.
Building on this, our future work will focus on single-dimensional
and multidimensional sorting of structured big data.
ACKNOWLEDGMENT
This research was done using resources provided by the
Apache Hadoop MapReduce project and the Intel IT Center
planning guide "Getting Started with Big Data".
REFERENCES
[1] Mahesh Maurya, Sunita Mahajan, "Performance Analysis of
MapReduce Programs on Hadoop Cluster", World Congress on
Information and Communication Technologies, pp. 506-510,
March 2012.
[2] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified
Data Processing on Large Clusters", Google, Inc., USENIX
Association, OSDI '04: 6th Symposium on Operating Systems
Design and Implementation, 2004.
[3] http://en.wikipedia.org/wiki/MapReduce
[4] http://en.wikipedia.org/wiki/Apache_Hadoop
[5] Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big
Data Problem Using Hadoop and Map Reduce", 2012 Nirma
University International Conference on Engineering (NUiCONE-2012),
06-08 December 2012.
