Oct, 2013-14
Certificate
This is to certify that the Seminar Report titled "Hadoop & HDFS" is submitted towards the partial fulfillment of the requirement of the seminar course in Semester VII, B.E. (Computer Technology), degree awarded by Rashtrasant Tukdoji Maharaj Nagpur University, Nagpur.
Seminar Guide
Mr. P. DHAVAN

Seminar Coordinator
Mrs. P. DESHKAR

Mr. A. R. PATIL BHAGAT

Date: Oct 2013
Place: Nagpur
Abstract
Nowadays we encounter huge amounts of data, be it from Facebook or Twitter. This huge amount of data being generated every day is known as Big Data. Due to the vastness of the data, we have to find a way to analyse it in order to make any sense out of it. This analysis can be done using Hadoop. Apache Hadoop is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. This report focuses on the understanding of Apache Hadoop and HDFS. The Hadoop Distributed File System (HDFS), a key component of Hadoop, is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size.
Table of Contents

1.0 Introduction
2.0 Background Knowledge
    2.0.1 Big Data around the world
    2.0.2 Use of Cluster Architecture for parallel processing
3.0 Everything to know about Big Data
    3.0.1 What is Big Data?
    3.0.2 What does Big Data look like?
    3.0.3 The value of Big Data
3.1 Apache Hadoop & some implementations
    3.1.1 The Hadoop Distributed File System (HDFS)
        3.1.1.1 Features of HDFS
        3.1.1.2 Architecture of HDFS
        3.1.1.3 Filesystem Namespace
        3.1.1.4 Data Organization and replication
    3.1.2 Some implementations of Hadoop
        3.1.2.1 The Oracle implementation
        3.1.2.2 The Dell implementation
4.0 Advantages and Limitations
    4.0.1 Advantages
    4.0.2 Limitations
5.0 Applications
6.0 Future Scope
7.0 Conclusion
References
List of Figures

Fig 1: Overview of Cluster Architecture
Fig 2: Technicians working on a large Linux cluster at the Chemnitz University of Technology, Germany
Fig 3: Value of big Data - Wipro infographic
Fig 4: HDFS Architecture
Fig 5: Feeding Hadoop Data to the Oracle Database
Fig 6: Oracle grid engine
Fig 7: Dell Hadoop design
Fig 8: Dell network implementation
1.0 Introduction
Big Data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. When the volumes of data are larger than conventional relational database infrastructures can cope with, processing options break down broadly into massively parallel processing architectures (data warehouses or databases) and Apache Hadoop-based solutions. Hadoop provides a distributed file system (HDFS) and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth simply by adding commodity servers. Hadoop therefore plays an important role in managing Big Data and making sense out of it.
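The MapReduce paradigm mentioned above can be illustrated with a toy word count. The sketch below is a single-process Python simulation of the map, shuffle, and reduce phases, not Hadoop's real (Java) API; the input splits and function names are made up for illustration:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values (here, sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

# Two "input splits" standing in for blocks processed on different hosts.
splits = ["big data needs big tools", "hadoop handles big data"]
pairs = chain.from_iterable(map_phase(s) for s in splits)
counts = reduce_phase(shuffle_phase(pairs))
print(counts["big"])  # "big" appears three times across both splits
```

In real Hadoop, the map and reduce functions run in parallel on the nodes that hold the data, and the framework performs the shuffle over the network.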
Examples of Big Data in the private sector include the generation of 2.5 petabytes of data an hour from 1 million customer transactions at Walmart [3]. Big Data is thus everywhere, and the age of Big Data is upon us. Successfully exploiting the value in Big Data requires experimentation and exploration.
To effectively harness the power of Big Data, we need an architecture that is distributed and supports parallel processing. It should cater to three needs:
1. Volume: It should be able to handle the extensive volume of Big Data.
2. Speed: It should be able to process and analyze data as fast as possible.
3. Cost: All this should come at minimum cost.
A computer cluster (or Cluster Architecture) consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. Each node has its own cores, memory and disks. Computer clusters emerged as a result of the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing. The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows users to treat the cluster as, by and large, one cohesive computing unit.
Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.[4]
Fig 2: Technicians working on a large Linux cluster at the Chemnitz University of Technology, Germany
The advantages of cluster architecture are:
1. Modular and Scalable: It is easier to expand the system without bringing down the application that runs on top of the cluster.
2. Data Locality: Data can be processed by the cores collocated in the same node or rack, minimizing any transfer over the network.
3. Parallelization: A higher degree of parallelism is achieved via the simultaneous execution of separate portions of a program on different processors.
4. Less cost: Clusters are built on the principle of commodity hardware: it is cheaper to have more low-performance, low-cost hardware working in parallel than fewer high-performance, high-cost machines.
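The parallelization advantage described above can be sketched with a toy example. The following Python snippet is an illustration, not real cluster software: it splits a dataset into per-"node" partitions, processes them concurrently, and combines the partial results, much as a cluster combines per-node work. The partitioning scheme and node count are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    """Stand-in for the work a single node would do on its local data."""
    return sum(partition)

def cluster_sum(data, nodes=4):
    # Split the data into one partition per simulated node (data locality:
    # each worker touches only its own partition).
    partitions = [data[i::nodes] for i in range(nodes)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(process_partition, partitions))
    # A final aggregation step combines the per-node partial results.
    return sum(partials)

print(cluster_sum(list(range(100))))  # 4950, same answer as a single machine
```

The answer is identical to a sequential computation; only the work is divided, which is the essence of the scale-out approach.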
However, managing a cluster has a few overheads:
1. Complexity: Administering a cluster of N machines significantly increases the complexity of using the cluster.
2. More Storage: As data is replicated to protect from failure, cluster architecture requires more storage capacity.
3. Data Distribution and Task Scheduling: When a large multi-user cluster needs to access very large amounts of data, task scheduling and data distribution become a challenge. Given that in a complex application environment the performance of each job depends on the characteristics of the underlying cluster, mapping tasks onto CPU cores and GPU devices provides significant challenges.[5]
4. Careful Management and Design: Massive parallel processing needs careful design, and automatic parallelization of programs continues to remain a technical challenge. The development and debugging of parallel programs on a
web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data: the list goes on. Are these all really the same thing? To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of Big Data. They're a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.
1. Volume: Many factors contribute to the increase in data volume: transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, and so on. This volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.
2. Velocity: The importance of data's velocity, the increasing rate at which data flows into an organization, has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast-moving data to their advantage. Why is that so? The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider. Those who
are able to quickly utilize that information, by recommending additional purchases, for instance, gain competitive advantage. It's not just the velocity of the incoming data that's the issue: it's possible to stream fast-moving data into bulk storage for later batch processing, for example. The importance lies in the speed of the feedback loop, taking data from input through to decision.[6]
3. Variety: Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse and doesn't fall into neat relational structures. It could be text from social networks, image data, or a raw feed directly from a sensor source. None of these things come ready for integration into an application. Even on the web, where computer-to-computer communication ought to bring some guarantees, the reality of data is messy. A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application.
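As a small illustration of extracting ordered meaning from unstructured input, the sketch below parses a web-server log line into a structured record. The log format, field names, and regular expression here are illustrative assumptions, not a fixed standard:

```python
import re

# Named groups give each extracted field an ordered, queryable name.
LOG_PATTERN = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) \S+" (?P<status>\d{3})'
)

def parse_log_line(line):
    """Extract ordered meaning (a dict of fields) from a messy source line."""
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None

line = '203.0.113.7 - - [10/Oct/2013:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(line)
print(record["status"])  # a structured field pulled from unstructured text
```

At Big Data scale, this kind of extraction would run as a parallel job over millions of such lines, but the principle is the same.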
Until now, there was no practical way to harvest this opportunity. Today, managing Big Data can be done effectively and with ease, thanks to the emergence of state-of-the-art solutions like Apache Hadoop.
3.1.1.1 Features of HDFS [10]
Highly fault-tolerant: Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Suitable for applications with large data sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It provides high aggregate data bandwidth and scales to thousands of nodes in a single cluster, supporting tens of millions of files in a single instance.
Streaming access to file system data: Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.
Portability across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
3.1.1.2 Architecture of HDFS
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by HDFS clients, and a number of DataNodes, usually one per node in the cluster. The DataNodes manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and these blocks are stored in DataNodes.
NameNode: Keeps an image of the entire file system namespace and the file Blockmap in memory. When the NameNode starts up, it reads the FsImage and EditLog from its local file system, updates the FsImage with the EditLog information, and then stores a copy of the FsImage on the file system as a checkpoint. Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash[11].
DataNodes: A DataNode stores data in files in its local file system. The DataNode has no knowledge of the HDFS file system; it stores each block of HDFS data in a separate file. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. When the file system starts up, each DataNode generates a list of all its HDFS blocks and sends this report, called a Blockreport, to the NameNode[11].
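The division of labour between the NameNode and DataNodes can be sketched as a conceptual model. The Python below is an illustration of the roles only, not Hadoop's actual implementation; the class, method, and block names are made up:

```python
class DataNode:
    """Stores blocks locally and reports them; knows nothing of the namespace."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block_id -> block contents

    def store_block(self, block_id, data):
        self.blocks[block_id] = data

    def block_report(self):
        """The list of held block ids a DataNode sends to the NameNode."""
        return list(self.blocks)

class NameNode:
    """Keeps only metadata: a map from block id to the DataNodes holding it."""
    def __init__(self):
        self.block_map = {}         # block_id -> list of DataNode names

    def receive_block_report(self, datanode):
        for block_id in datanode.block_report():
            self.block_map.setdefault(block_id, []).append(datanode.name)

dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.store_block("blk_1", b"...")
dn2.store_block("blk_1", b"...")    # a replica of the same block
nn = NameNode()
for dn in (dn1, dn2):
    nn.receive_block_report(dn)
print(nn.block_map["blk_1"])        # both DataNodes report holding blk_1
```

The key point the sketch captures is that user data never flows through the NameNode; it holds only the namespace and block locations rebuilt from Blockreports.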
3.1.1.3 Filesystem Namespace HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode[11].
3.1.1.4 Data Organization and replication [10]
HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly[11].
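The block layout described above can be sketched numerically. The snippet below chops a hypothetical 200 MB file into 64 MB blocks and assigns replicas with a naive round-robin policy; real HDFS placement is rack-aware, and the DataNode names here are made up:

```python
BLOCK_SIZE = 64 * 1024 * 1024       # the typical HDFS block size cited above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block; only the last block may be smaller."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

def place_replicas(num_blocks, datanodes, replication=3):
    """Naive round-robin replica placement (illustration only)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
print(len(blocks))                  # 4 blocks: 64 + 64 + 64 + 8 MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

With the default replication factor of 3, the 200 MB file consumes roughly 600 MB of raw cluster storage, which is the storage overhead noted among the cluster drawbacks earlier.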
2) Drive Enterprise-Level Operational Efficiency of Hadoop Infrastructure with Oracle Grid Engine[12]
All the computing resources allocated to a Hadoop cluster are used exclusively by Hadoop, which can result in underutilized resources when Hadoop is not running. Oracle Grid Engine enables a cluster of computers to run Hadoop alongside other data-oriented compute application models. The benefit of this approach is that you don't have to maintain a dedicated cluster for running only Hadoop applications.
3.1.2.2 The Dell Implementation
1) Hadoop design implementation[13]
The representation is broken down into the Hadoop use cases such as Compute, Storage, and Database workloads. Each workload has specific characteristics for operations, deployment, architecture, and management.
Fig 7: Dell Hadoop design
Top-of-rack (ToR) switches in a network architecture connect directly to the DataNodes and allow for all inter-node communication within the Hadoop environment. Hadoop networks should utilize inexpensive components that are employed in a way that maximizes performance for DataNode communication.
Fig 8: Dell network implementation
Within the Hadoop software ecosystem, there are several benchmark tools included that can be used for these comparisons.
1. Teragen: Utilizes the parallel framework within Hadoop to quickly create large data sets that can be manipulated.
2. Terasort: Reads the data created by Teragen into the system's physical memory, sorts it, and writes it back out to the HDFS.
3. Teravalidate: Ensures that the data produced by Terasort is accurate, without any errors.
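The three benchmarks can be mimicked in miniature to show what each step checks. This is an in-process Python analogue, not the actual Hadoop jobs; the record count, key range, and seed are arbitrary:

```python
import random

def teragen(n, seed=42):
    """Generate n random records (Teragen's job, in miniature)."""
    rng = random.Random(seed)
    return [rng.randrange(10**6) for _ in range(n)]

def terasort(records):
    """Sort the generated records (what Terasort does across the cluster)."""
    return sorted(records)

def teravalidate(records):
    """Check the output is globally ordered, as Teravalidate does."""
    return all(a <= b for a, b in zip(records, records[1:]))

data = teragen(1000)
assert teravalidate(terasort(data))   # sorted output validates cleanly
print(len(terasort(data)))            # all 1000 records survive the sort
```

In a real run, the interesting measurement is the wall-clock time of the Terasort step, since it exercises the cluster's disks, network, and MapReduce framework together.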
nodes, very little, if any, work is required for that same program to run on a much larger amount of hardware.[14]
4.0.2 Limitations:
1. The programming model is very restrictive, and the lack of central data can be frustrating.
2. The software is still rough and under active development; e.g., HDFS only recently added support for append operations.[14]
3. Joins of multiple datasets are tricky and slow; often, the entire dataset gets copied in the process.
4. Cluster management is hard: debugging, distributing software, and collecting logs are all difficult.
5. There is still a single master, which requires care and may limit scaling.
6. Managing job flow isn't trivial when intermediate data should be kept.
7. Multiple copies of the already big data are created.[15]
8. Limited SQL support.[15]
9. Inefficient execution, as HDFS has no notion of a query optimizer.[15]
10. Lack of the skills necessary for handling Hadoop.[15]
Its biggest cluster: 4500 nodes (2 x 4-CPU boxes with 4 x 1 TB disks and 16 GB RAM). Other companies using Hadoop include Adobe, Amazon, The New York Times, and Hewlett-Packard, and the list keeps on increasing.
7.0 Conclusion
Apache Hadoop is 100% open source, and it pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it can scale without limits. With Hadoop, no data is too big. This report focuses on the various properties, challenges and opportunities of Big Data. It also discusses how we can use Apache Hadoop to manage Big Data and extract value out of it. A detailed discussion has been presented on Hadoop and its underlying technology. In today's hyper-connected world where more and more data is being created every day, the need
for Hadoop is no longer a question. The only question now is how to take advantage of it best.
References
[1] Randal E. Bryant, Randy H. Katz, Edward D. Lazowska, "Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society", Version 8, December 22, 2008. Available: http://www.cra.org/ccc/docs/init/Big_Data.pdf
[2] "What is Big Data? Bringing Big Data to the enterprise" [Online]. Available: http://www-01.ibm.com/software/data/bigdata/
[3] "A Comprehensive List of Big Data Statistics" [Online]. Available: http://wikibon.org/blog/big-data-statistics/
[4] D. A. Bader and R. Pennington, "Cluster Computing: Applications", The International Journal of High Performance Computing, 15(2):181-185, May 2001. Available: http://www.cc.gatech.edu/~bader/papers/ijhpca.pdf
[5] K. Shirahata, "Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters", in Cloud Computing Technology and Science (CloudCom), Nov. 30 - Dec. 3, 2010, pages 733-740 [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5708524
[6] "What Is Big Data?", O'Reilly Radar, January 11, 2012 [Online]. Available: http://radar.oreilly.com/2012/01/what-is-big-data.html
[7] "Big Data", Wipro [Online]. Available: http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data
[8] "Hadoop at Yahoo!", Yahoo! Developer Network [Online]. Available: http://developer.yahoo.com/hadoop/
[9] Owen O'Malley, "Introduction to Hadoop" [Online]. Available: http://wiki.apache.org/hadoop/
[10] "HDFS Architecture", Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System". Available: http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
[12] "Oracle and Hadoop Overview" [Online]. Available: http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-hadoop-oracle-194542.pdf
[13] "Introduction to Hadoop", Dell [Online]. Available: http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf
[14] "Yahoo! Hadoop Tutorial", Yahoo! Developer Network (YDN) [Online]. Available: http://developer.yahoo.com/hadoop/tutorial/module1.html#comparison
[15] "Hadoop's Limitations for Big Data Analytics", ParAccel Inc. [Online]. Available: http://www.paraccel.com/resources/Whitepapers/Hadoop-Limitations-for-Big-Data-ParAccel-Whitepaper.pdf
[16] "Hadoop Wiki: Powered By" [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy