
TABLE OF CONTENTS

LIST OF FIGURES..................................................................................................................4
ABSTRACT..............................................................................................................................6
Chapter 1: Project Introduction
1.1 Introduction to Big Data.....................................................................................................7
1.2 Problem Statement.............................................................................................................9
1.3 Scope..................................................................................................................................9
1.4 Project Features..................................................................................................................9
1.5 Organization.......................................................................................................................9
Chapter 2: Literature Survey................................................................................................12
Chapter 3: System Analysis...................................................................................................14
3.1 Introduction......................................................................................................................14
3.2 System Study....................................................................................................................15
3.2.1 Existing System........................................................................................................15
3.2.2 Proposed System......................................................................................................15
3.2.3 Hadoop.....................................................................................................................16
3.3 Feasibility Study...............................................................................................................17
3.3.1 Introduction to Java API...........................................................................................18
3.4 Objectives.........................................................................................................................19
3.5 Technology Used..............................................................................................................20
Chapter 4: System Requirements.........................................................................................22
4.1 Introduction......................................................................................................................22
4.2 System Requirements.......................................................................................................25
4.2.1 Software Requirements............................................................................................25
4.2.2 Hardware Requirements...........................................................................................26
4.3 Conclusion........................................................................................................................26
Chapter 5: System Design.....................................................................................................27
5.1 Introduction......................................................................................................................27
5.2 HDFS Architecture...........................................................................................................27
5.3 Modules............................................................................................................................30
5.4 UML Diagrams.................................................................................................................31
5.4.1 Sequence Diagram....................................................................................................31
5.4.2 Class Diagram..........................................................................................................32
5.5 Conclusion........................................................................................................................32
Chapter 6: Implementation...................................................................................................33
6.1 Introduction......................................................................................................................33
6.2 Map Reduce......................................................................................................................33
6.3 HDFS Shell Commands....................................................................................................38
6.4 Sample Code.....................................................................................................................45
6.5 Conclusion........................................................................................................................46
Chapter 7: Screenshots..........................................................................................................47
Chapter 8: Testing and Validation........................................................................................69
8.1 Testing..............................................................................................................................69
8.2 Types of Testing................................................................................................................70
8.2.1 White Box Testing....................................................................................................70
8.2.2 Black Box Testing....................................................................................................70
8.2.3 Alpha Testing............................................................................................................70
8.2.4 Beta Testing..............................................................................................................70
8.3 Path Testing......................................................................................................................71
Chapter 9: Conclusion and Future Scope.............................................................................72
9.1 Conclusion........................................................................................................................72
9.1.1 Selecting a Project in Hadoop...................................................................................72
9.1.2 Rethinking and Adapting to Existing Hadoop..........................................................72
9.1.3 Path Availability.......................................................................................................73
9.1.4 Services Insight and Operations...............................................................................73
9.1.5 Adopt Lean and Agile Integration Principles............................................................73
9.2 Future Scope of Hadoop...................................................................................................74
Chapter 10: References..........................................................................................................75

LIST OF FIGURES
Figure No.    Name of Figure                                  Page No.

1.1           Structure of big data
1.2           Data analysis
1.3           3 Vs of big data
3.1           Evolution of big data                           16
3.2           Objectives                                      19
5.1           Architecture of HDFS                            28
5.2           Architecture of client-server modules           30
5.3           Sequence diagram                                31
5.4           Class diagram                                   32
7.1           cat command execution                           47
7.2           copyToLocal command execution                   48
7.3           cp command execution                            49
7.4           du command execution                            50
7.5           dus command execution                           51
7.6           expunge command execution                       52
7.7           get command execution                           53
7.8           getmerge command execution                      54
7.9           ls command execution                            55
7.10          lsr command execution                           56
7.11          mkdir command execution                         57
7.12          moveFromLocal command execution                 58
7.13          mv command execution                            59
7.14          put command execution                           60
7.15          rm command execution                            61
7.16          rmr command execution                           62
7.17          stat command execution                          63
7.18          tail command execution                          64
7.19          test command execution                          65
7.20          text command execution                          66
7.21          touchz command execution                        67

ABSTRACT
Hadoop is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware. It is an open-source framework for processing, storing and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce, a programming model developed by Google in the early 2000s for indexing the Web. It was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. Hadoop clusters run on inexpensive commodity hardware, so projects can scale out without breaking the bank. Hadoop is now a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology. The fundamental concept is this: rather than banging away at one huge block of data with a single machine, Hadoop breaks Big Data up into multiple parts so that each part can be processed and analyzed at the same time. Hadoop is used for searching, log processing, recommendation systems, analytics, video and image analysis, and data retention. It benefits from being a top-level Apache Foundation project with a large and active user base, mailing lists, user groups, very active development, and strong development teams.
Hadoop is a popular open-source implementation of MapReduce for the analysis of large datasets. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem. This filesystem, HDFS, is written in Java and designed for portability across heterogeneous hardware and software platforms. This report analyzes the performance of HDFS and uncovers several performance issues. First, architectural bottlenecks exist in the Hadoop implementation that result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Second, portability limitations prevent the Java implementation from exploiting features of the native platform. Third, HDFS implicitly makes portability assumptions about how the native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior. This report investigates the root causes of these performance bottlenecks in order to evaluate the trade-offs between portability and performance in the Hadoop distributed filesystem.

CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION TO BIG DATA
DEFINITIONS:
Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.
- Teradata Magazine article, 2011
Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.
- The McKinsey Global Institute, 2012
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.
- Wikipedia, 2014
Big data consists of datasets that exceed the boundaries and sizes of normal processing capabilities, forcing you to take a non-traditional approach.

Fig 1.1: Structure of big data

Fig 1.2: Data analysis


For the past two decades most business analytics have been created using structured data
extracted from operational systems and consolidated into a data warehouse. Big data
dramatically increases both the number of data sources and the variety and volume of data that is
useful for analysis. A high percentage of this data is often described as multi-structured to
distinguish it from the structured operational data used to populate a data warehouse. In most
organizations, multi-structured data is growing at a considerably faster rate than structured data.
Two important data management trends for processing big data are relational DBMS
products optimized for analytical workloads (often called analytic RDBMSs, or ADBMSs) and
non-relational systems for processing multi-structured data. In a 2001 research report and related
lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and
opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity
(speed of data in and out), and variety (range of data types and sources). Not only Gartner but many other industries continue to use this "3Vs" model for describing big data.
Later, in 2012, Gartner updated big data's definition as follows: "Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, a fourth V, "Veracity", has been added by some organizations to describe it.
Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big
data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many
petabytes of data.
Big data can also be defined as "a large volume of unstructured data which cannot be handled by standard database management systems like DBMS, RDBMS or ORDBMS".

The 3Vs Of Big Data:

Fig 1.3: The 3Vs of big data

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.

Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.

Convert 350 billion annual meter readings to better predict power consumption.

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud,
big data must be used as it streams into your enterprise in order to maximize its value.

Scrutinize 5 million trade events created each day to identify potential fraud.

Analyze 500 million daily call detail records in real-time to predict customer churn faster.

Variety: Big data is any type of data - structured and unstructured data such as text, sensor data,
audio, video, click streams, log files and more. New insights are found when analyzing these
data types together.

Monitor 100s of live video feeds from surveillance cameras to target points of interest.

1.2 PROBLEM STATEMENT


Relational database systems and desktop statistics and visualization packages often have difficulty handling big data. The work instead requires "massively parallel software running on tens, hundreds, or even thousands of servers". What is considered big data varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. It is a very difficult task for an organization to maintain such large amounts of data, which keep on increasing day by day.

1.3 SCOPE
The big data concept is used mainly to manage bulk data, which can give us a good idea of the factors that matter, such as the preferences of a customer, when launching a new product into the market and making it successful. It has a wide range of applications, such as internet pricing, out-of-home advertising, retail habits, and politics.

1.4 PROJECT FEATURES


This project illustrates all the commands that are used on the Hadoop platform. It shows how the commands work, demonstrates Hadoop file-system interaction implemented using the Java application program interface, and shows the implementation of the default file system in Hadoop.

1.5 ORGANIZATION
CHAPTER 1: In this chapter the basic introduction to the project is given, along with its uses and the reason why it was undertaken.


CHAPTER 2: In this chapter Literature survey is explained. The literature survey gives a detailed
study of projects with similar activity or technology.
From the literature survey we come to know about how the other similar projects were
implemented and problems faced in these projects.
CHAPTER 3: In this chapter system analysis is explained. We also give the description of the
language that is used in this project. We explain about the advantages and disadvantages of using
the language.
CHAPTER 4: In this chapter system requirements and analysis is explained and also the
classification of requirements has been done and the functional requirements & non-functional
requirements.
CHAPTER-5: In this chapter the system design is explained with the help of different
diagrammatic representation like ER, UML etc.
CHAPTER-6: In this chapter the implementation of the system is explained by explaining the basic concepts. The sample code is also explained, which describes the functionality of the system.
CHAPTER-7: In this chapter the screenshots are presented. The screenshots give a clear description of how the project is implemented.
CHAPTER-8: This chapter includes the different types of testing, the validation part of coding
with the brief description and also different test cases are defined.
CHAPTER-9: This chapter includes the conclusion of the project.
CHAPTER-10: It includes the references for all concepts that have been used in this project.


CHAPTER 2: LITERATURE SURVEY


Applications frequently require more resources than are available on an inexpensive machine.
Many organizations find themselves with business processes that no longer fit on a single cost
effective computer. A simple but expensive solution has been to buy specialty machines that have
a lot of memory and many CPUs. This solution scales as far as what is supported by the fastest
machines available, and usually the only limiting factor is the budget. An alternative solution is
to build a high-availability cluster. Such a cluster typically attempts to look like a single
machine, and typically requires very specialized installation and administration services. Many
high-availability clusters are proprietary and expensive.
Hadoop supports the Map Reduce model, which was introduced by Google as a method of solving a class of petascale problems with large clusters of inexpensive machines. The model is based on two distinct steps for an application:
Map: An initial ingestion and transformation step, in which individual input records can be
processed in parallel.
Reduce: An aggregation or summarization step, in which all associated records must be
processed together by a single entity.
The core concept of Map Reduce in Hadoop is that input may be split into logical chunks, and
each chunk may be initially processed independently, by a map task. The results of these
individual processing chunks can be physically partitioned into distinct sets, which are then
sorted. Each sorted chunk is passed to a reduce task. Figure 1-1 illustrates how the Map Reduce
model works.
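As a concrete illustration of these two steps, the classic word-count job can be sketched against the Hadoop Map Reduce Java API (the newer org.apache.hadoop.mapreduce interfaces). This listing is not part of the original report; the class name and the input and output paths passed on the command line are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each input record (a line of text) is processed independently
    // and emitted as (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: all counts for the same word arrive together, sorted by key,
    // and are summarized by a single reduce call.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}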
The Hadoop Distributed File System
HDFS is a file system designed for Map Reduce jobs that read input in large chunks, process it, and write potentially large chunks of output. HDFS does not handle random access particularly well. For reliability, file data is simply mirrored to multiple storage nodes. This is referred to as replication in the Hadoop community. As long as at least one replica


of a data chunk is available, the consumer of that data will not know of storage server failures.
HDFS services are provided by two processes:
Name Node handles management of the file system metadata, and provides management and
control services.
Data Node provides block storage and retrieval services. There will be one Name Node process
in an HDFS file system, and this is a single point of failure. Hadoop Core provides recovery and
automatic backup of the Name Node, but no hot failover services.
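To make the replication idea concrete, the short sketch below (not part of the original report) uses the HDFS FileSystem Java API to ask the Name Node for the replication factor and block locations of a file; the file path is only an example.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name / fs.defaultFS from the configuration files on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/hadoop/sample.txt");   // example path only
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // The Name Node reports which Data Nodes hold a replica of each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}

As long as at least one of the hosts printed for a block remains reachable, a reader of the file never notices that the other replicas are unavailable.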


CHAPTER 3: SYSTEM ANALYSIS


In the previous chapter we looked at the literature survey regarding big data. In system analysis we give a description of the language that is being used and discuss its advantages and disadvantages.

3.1 INTRODUCTION
Apache Hadoop is an open source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built
from commodity hardware. All the modules in Hadoop are designed with a fundamental
assumption that hardware failures are commonplace and thus should be automatically handled
in software by the framework.
The core of Apache Hadoop consists of a storage part (HDFS) and a processing part (Map Reduce). Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop Map Reduce transfers packaged code for nodes to process in parallel, based on the data each node needs to process. This approach takes advantage of data locality, with nodes manipulating the data they have on hand, to allow the data to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking.
The base Apache Hadoop framework is composed of the following modules:

Hadoop Common, which contains libraries and utilities needed by other Hadoop modules;

Hadoop Distributed File System (HDFS), a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;

Hadoop YARN, a resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.

3.2 SYSTEM STUDY


In system study we make an analysis of present system and the proposed system and get to know
the reason behind the need of a new technology in the place of the existing one.
3.2.1 EXISTING SYSTEM
It incentivizes more collection of data and longer retention of it. If any and all data sets might turn out to prove useful for discovering some obscure but valuable correlation, you might as well collect them and hold on to them. In the long run, the more useful big data proves to be, the stronger this incentivizing effect will be; but in the short run it almost doesn't matter; the current buzz over the idea is enough to do the trick.
Many (perhaps most) people are not aware of how much information is being collected
(for example, that stores are tracking their purchases over time), let alone how it is being used
(scrutinized for insights into their lives).
Big data can further tilt the playing field toward big institutions and away from individuals.
In economic terms, it accentuates the information asymmetries of big companies over other
economic actors and allows for people to be manipulated.
3.2.2 PROPOSED SYSTEM
Big data is an all-encompassing term for any collection of data sets so large and complex
that it becomes difficult to process using traditional data processing applications. The challenges
include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy
violations. The trend to larger data sets is due to the additional information derivable from
analysis of a single large set of related data, as compared to separate smaller sets with the same
total amount of data, allowing correlations to be found to "spot business trends, prevent diseases,
combat crime and so on."
3.2.3 HADOOP
Hadoop is a free, Java-based programming framework that supports the processing of large
data sets in a distributed computing environment. It is part of the Apache project sponsored by
the Apache Software Foundation.
Hadoop makes it possible to run applications on systems with thousands of nodes involving
thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes

and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative. Hadoop was inspired by Google's Map Reduce, a software framework in
which an application is broken down into numerous small parts. Any of these parts (also called
fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop creator, named
the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem
consists of the Hadoop kernel, Map Reduce, the Hadoop distributed file system (HDFS) and a
number of related projects such as Apache Hive, HBase and ZooKeeper.

Fig 3.1: Evolution of big data


The Hadoop framework is used by major players including Google, Yahoo and IBM, largely
for applications involving search engines and advertising. The preferred operating systems are
Windows and Linux, but Hadoop can also work with BSD and OS X.

3.3 FEASIBILITY STUDY


Overview of HDFS
HDFS has many similarities with other distributed file systems, but is different in several
respects. One noticeable difference is HDFS's write-once-read-many model that relaxes


concurrency control requirements, simplifies data coherency, and enables high-throughput access.
Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing
logic near the data rather than moving the data to the application space.
HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the
end of a stream, and byte streams are guaranteed to be stored in the order written.
HDFS has many goals. Here are some of the most notable:

Fault tolerance by detecting faults and applying quick, automatic recovery

Data access via Map Reduce streaming

Simple and robust coherency model

Processing logic close to the data, rather than the data close to the processing logic

Portability across heterogeneous commodity hardware and operating systems

Scalability to reliably store and process large amounts of data

Economy by distributing data and processing across clusters of commodity personal computers.
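The write-once-read-many model above can be demonstrated with a small Java sketch; it is illustrative only (paths are placeholders) and assumes a reachable HDFS configured on the classpath. A single writer creates the file and appends bytes to the end of the stream; once the stream is closed, any number of readers see the bytes in the order they were written.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/hadoop/notes.txt");      // placeholder path

        // Single writer: bytes can only be appended at the end of the stream.
        FSDataOutputStream out = fs.create(path);
        out.writeBytes("first line\n");
        out.writeBytes("second line\n");
        out.close();

        // Many readers: the file streams back in exactly the order written.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
        fs.close();
    }
}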

3.3.1 INTRODUCTION TO JAVA API


The Java Client API is an open source API for creating applications that use MarkLogic Server for document and search operations. Developers can easily take advantage of the advanced capabilities for persistence and search of unstructured documents that MarkLogic Server provides. The capabilities provided by the Java API include:

Insert, update, or remove documents and document metadata. For details, see Document Operations.

Query text and lexicon values. For details, see Searching.

Configure persistent and dynamic query options. For details, see Query Options.

Apply transformations to new content and search results. For details, see Content Transformations.

Extend the Java API to expose custom capabilities you install on MarkLogic Server. For details, see Extending the Java API.


When working with the Java API, you first create a manager for the type of document or operation you want to perform on the database (for instance, a JSONDocumentManager to write and read JSON documents, or a QueryManager to search the database). To write or read the content for a database operation, you use standard Java APIs such as InputStream, DOM, SAX, JAXB, and Transformer, as well as open-source APIs such as JDOM and Jackson.
The Java API provides a handle (a kind of adapter) as a uniform interface for content representation. As a result, you can use APIs as different as InputStream and DOM to provide content for one read() or write() method. In addition, you can extend the Java API so you can use the existing read() or write() methods with new APIs that provide useful representations for your content.
This chapter covers a number of basic architectural aspects of the Java API, including fundamental structures such as database clients, managers, and handles used in almost every program you will write with it. Before starting to code, you need to understand these structures and the concepts behind them.
The Java API co-exists with the previously developed Java XCC, as they are intended for different use cases.
A Java developer can use the Java API to quickly become productive in their existing Java environment, using the Java interfaces for search, facets, and document management. It is also possible to use its extension mechanism to invoke XQuery, so as both to leverage a development team's XQuery expertise and to enable MarkLogic Server functionality not implemented by the Java API.


XCC provides a lower-level interface for running remote or ad hoc XQuery. While it
provides significant flexibility, it also has a somewhat steeper learning curve for developers who
are unfamiliar with XQuery. You may want to think of XCC as being similar to ODBC or JDBC;
a low level API for sending query language directly to the server, while the Java Client API is a
higher level API for working with database constructs in Java. In terms of performance, the Java
API is very similar to Java XCC for compatible queries. The Java API is a very thin wrapper
over a REST API with negligible overhead. Because it is REST-based, you should minimize network distance for best performance.
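The manager-and-handle pattern described above can be sketched as follows. This listing is illustrative only and is not part of the project code: the host, port, and credentials are placeholders, and the factory and authentication signatures differ between MarkLogic Java Client API releases, so they should be checked against the version actually installed.

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.Authentication;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

public class MarkLogicHandleExample {
    public static void main(String[] args) {
        // Connection details are placeholders (older-style factory call; newer releases use a SecurityContext).
        DatabaseClient client = DatabaseClientFactory.newClient(
                "localhost", 8000, "admin", "admin", Authentication.DIGEST);

        // A manager for the type of document being worked with (here, JSON documents).
        JSONDocumentManager docMgr = client.newJSONDocumentManager();

        // A handle adapts a plain Java String to the API's uniform content interface.
        StringHandle writeHandle =
                new StringHandle("{\"project\":\"hadoop-hdfs-commands\"}").withFormat(Format.JSON);
        docMgr.write("/example/project.json", writeHandle);

        // Reading uses the same handle abstraction in the other direction.
        StringHandle readHandle = docMgr.read("/example/project.json", new StringHandle());
        System.out.println(readHandle.get());

        client.release();
    }
}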

3.4 OBJECTIVES
Why objectives first? Well, there's a natural tendency to drop into the planning phase before you've thought out what you're trying to do in what I like to call "big animal pictures". I call this descending into the weeds before you've got a general idea of what you need to do. Projects, big data or otherwise, generally begin with determining your objectives and then breaking down the resources and tasks needed to complete the project.

Fig 3.2: Objectives


Now we move on to the heart of the question: how can we determine and isolate the propagation mode of company news, from the reporting of financial news in Reuters to tweets about that information? Naturally, we also wanted to explore the different aspects of a tweet that might make it more or less influential. There are a number of tools available to measure some aspect of social authority, but for this project we focused on the following:

The volume, velocity, and acceleration of tweets generated after a news article reports financial information.

The social authority (or influence) of the tweeter, as indicated by his or her Klout score and number of followers.

Finally, once we had all this data, how would we determine (algorithmically and singularly) the impact both sources had on a company's stock price?
Based on what we just covered, here are our four objectives:

Table 3.3: Our Four Objectives

3.5 TECHNOLOGY USED


Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include A/B testing, crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, time series analysis and visualization. Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation, such as multilinear subspace learning.

Additional technologies being applied to big data include massively parallel processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.
Some, but not all, MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS. DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi.
The practitioners of big data analytics processes are generally hostile to slower shared
storage, preferring direct-attached storage (DAS) in its various forms from solid state drive
(SSD) to high capacity SATA disk buried inside parallel processing nodes. The perception of
shared storage architectures, Storage Area Network (SAN) and Network-Attached Storage (NAS),
is that they are relatively slow, complex, and expensive. These qualities are not consistent with
big data analytics systems that thrive on system performance, commodity infrastructure, and low
cost.
Real or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on a spinning disk at the other end of an FC SAN connection is not, and the cost of a SAN at the scale needed for analytics applications is very much higher than that of other storage techniques.
There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.

CHAPTER 4: SYSTEM REQUIREMENTS


4.1 Introduction
Before you begin the Big Data Extensions deployment tasks, make sure that your system meets
all of the prerequisites. Big Data Extensions requires that you install and configure vSphere, and
that your environment meets minimum resource requirements. You must also make sure that you
have licenses for the VMware components of your deployment.
vSphere Requirements

Before you can install Big Data Extensions, you must have set up the following VMware products.

Install vSphere 5.0 (or later) Enterprise or Enterprise Plus.

Note: The Big Data Extensions graphical user interface is only supported when using the vSphere Web Client 5.1 or later. If you install Big Data Extensions on vSphere 5.0, you must perform all administrative tasks using the command-line interface.

When installing Big Data Extensions on vSphere 5.1 or later, you must use VMware Single Sign-On authentication. When logging in to vSphere 5.1 or later, you pass authentication to the Single Sign-On service, which you can configure with multiple identity sources such as Active Directory and OpenLDAP. The password is exchanged for a security token, which is used to access vSphere components.

Enable the vSphere Network Time Protocol on the ESXi hosts so that time-dependent processes occur in sync across hosts.
Cluster Settings

Configure your cluster with the following settings.

Enable vSphere HA and vSphere DRS.

Enable Host Monitoring.

Enable Admission Control and set the desired policy. The default policy is to tolerate one host failure.

Set the virtual machine restart priority to High.

Set virtual machine monitoring to Virtual Machine and Application Monitoring.

Set the monitoring sensitivity to High.

Enable vMotion and Fault Tolerance logging.

All hosts in the cluster must have hardware VT enabled in the BIOS.

The management network VMkernel port must have vMotion and Fault Tolerance logging enabled.

Network Settings

Big Data Extensions deploys clusters on a single network. Virtual machines are deployed with one NIC, which is attached to a specific port group. The environment determines how this port group is configured and which network backs the port group.
Either a vSwitch or a vSphere Distributed Switch (vDS) can be used to provide the port group backing a Serengeti cluster. A vDS acts as a single virtual switch across all attached hosts, while a vSwitch is per-host and requires the port group to be configured manually.
When configuring your network for use with Big Data Extensions, the following ports must be open as listening ports.

Ports 8080 and 8443 are used by the Big Data Extensions plug-in user interface.

Port 22 is used by SSH clients.

To prevent having to open a network firewall port to access Hadoop services, log in to a client node from which you can access your cluster.

To connect to the Internet (for example, to create an internal Yum repository from which to install Hadoop distributions), you may use a proxy.

Direct Attached Storage

Direct Attached Storage should be attached and configured on the physical controller to present each disk separately to the operating system. This configuration is commonly described as Just A Bunch Of Disks (JBOD). You must create VMFS datastores on Direct Attached Storage using the following disk drive recommendations.

8-12 disk drives per host. The more disk drives per host, the better the performance.

1-1.5 disk drives per processor core.

7,200 RPM Serial ATA disk drives.



Resource Requirements for the vSphere Management Server

Resource pool with at least 27.5 GB RAM.

40 GB or more (recommended) disk space for the management server and Hadoop template virtual disks.

Resource Requirements for the Hadoop Cluster

Datastore free space not less than the total size needed by the Hadoop cluster, plus swap disks for each Hadoop node equal to the memory size requested.

Network configured across all relevant ESX or ESXi hosts, with connectivity to the network in use by the management server.

HA enabled for the master node if HA protection is needed. You must use shared storage in order to use HA or FT to protect the Hadoop master node.

Hardware Requirements

Host hardware must be listed in the VMware Compatibility Guide. To run at optimal performance, install your vSphere and Big Data Extensions environment on the following hardware.

Dual quad-core CPUs or greater that have Hyper-Threading enabled. If you can estimate your computing workload, consider using a more powerful CPU.

Use High Availability (HA) and dual power supplies for the node's host machine.

4-8 GB of memory per processor core, with 6% overhead for virtualization.

Tested Host and Virtual Machine Support

The following is the maximum host and virtual machine support that has been confirmed to successfully run with Big Data Extensions.

45 physical hosts running a total of 182 virtual machines.

128 virtual ESXi hosts deployed on 45 physical hosts, running 256 virtual machines.

4.2. SYSTEM REQUIREMENTS:


4.2.1 Software Requirements
When enterprise executives try to wrap their minds around the challenges of Big Data, two
things quickly become evident: Big Data will require Big Infrastructure in one form or another,
but it will also require new levels of management and analysis to turn that data into valuable
knowledge.
Too often, however, the latter part of that equation gets all the attention, resulting in
situations in which all the tools are put in place to coordinate and interpret massive reams of
data only to get bogged down in endless traffic bottlenecks and resource allocation issues.
However, Big Infrastructure usually requires Big Expenditures, so it makes sense to formulate
a plan now for the kinds of volumes that are expected to become the common workloads of
the very near future.
To some, that means the enterprise will have to adopt more of the technologies and
architectures that currently populate the high-performance computing (HPC) world of
scientific and educational facilities. As ZDNet's Larry Dignan pointed out this month, companies like Univa are adapting platforms like Oracle Grid Engine to enterprise environments.

Company CEO Gary Tyreman notes that it's one thing to build a pilot Hadoop environment, but
quite another to scale it to enterprise levels. Clustering technologies and even high-end
appliances will go a long way toward getting the enterprise ready to truly tackle the challenges of
Big Data.

Integrated hardware and software platforms are also making a big push for the enterprise
market. Teradata just introduced the Unified Data Environment and Unified Data
Architecture solutions that seek to dismantle the data silos that keep critical disparate data sets
apart. Uniting key systems like the Aster and Apache Hadoop releases with new tools like
Viewpoint, Connector and Vital Infrastructure, and wrapped in the new Warehouse Appliance
2700 and Aster Big Analytics appliances, the platforms aim for nothing less than complete,
seamless integration and analysis of accumulated enterprise knowledge.

4.2.2 Hardware Requirements:


As I mentioned, though, none of this will come on the cheap. Gartner predicts that Big Data will
account for $28 billion in IT spending this year alone, rising to $34 billion next year and
consuming about 10 percent of total capital outlays. Perhaps most ominously, nearly half of Big
Data budgets will go toward social network analysis and content analytics, while only a small
fraction will find its way to increasing data functionality. It seems, then, that the vast majority of
enterprises are seeking to repurpose existing infrastructure to the needs of Big Data. It will be
interesting to note whether future studies will illuminate the success or failure of that strategy.

Indeed, as application performance management (APM) firm OpTier notes in a recent


analysis of Big Data trends, the primary challenge isn't simply to drill into large data volumes for
relevant information, but to do it quickly enough so that its value can be maximized. And on

this front, the IT industry as a whole is sorely lacking. Fortunately, speeding up the process is not
only a function of bigger and better hardware. Improved data preparation and contextual storage
practices can go a long way toward making data easier to find, retrieve and analyze, much the
same way that wide area networks can be improved through optimization rather than by adding
bandwidth.

In short, then, enterprises will need to shore up infrastructure to handle increased


volumes of traffic, but as long as that foundation is in place, many of the tools needed to make
sense of it all are already available. However, the downside is that this will not be an optional
undertaking. As in sports, success in business is usually a matter of inches, and organizations of
all stripes are more than willing to invest in substantial infrastructure improvements to gain an
edge, even a small one.

4.3 CONCLUSION:
These are the hardware and software requirements that are needed for Hadoop to run. In the next
chapter the system design will be discussed.

CHAPTER 5: SYSTEM DESIGN


5.1 INTRODUCTION
In the previous chapter we have seen the requirements that are necessary. In this phase we look at the system design for a better understanding of the technology. It describes the physical organization of the system and is demonstrated with the help of UML diagrams, block diagrams, and other pictorial representations.


5.2 HDFS ARCHITECTURE


HDFS is composed of interconnected clusters of nodes where files and directories reside. An HDFS cluster contains a single name node, known as the Name Node, which manages the file system namespace and regulates client access to files. In addition, data nodes (Data Nodes) store data as blocks within files.
Name nodes and data nodes
Within HDFS, a given name node manages file system namespace operations like opening,
closing, and renaming files and directories. A name node also maps data blocks to data nodes,
which handle read and write requests from HDFS clients. Data nodes also create, delete, and
replicate data blocks according to instructions from the governing name node.
As the Figure illustrates, each cluster contains one name node. This design facilitates a simplified
model for managing each namespace and arbitrating data distribution.
Relationships between name nodes and data nodes
Name nodes and data nodes are software components designed to run in a decoupled manner on
commodity machines across heterogeneous operating systems. HDFS is built using the Java
programming language; therefore, any machine that supports the Java programming language
can run HDFS. A typical installation cluster has a dedicated machine that runs a name node and
possibly one data node. Each of the other machines in the cluster runs one data node.
The figure below illustrates the high-level architecture of HDFS.


Fig 5.1: Architecture of HDFS
Communications protocols
All HDFS communication protocols build on the TCP/IP protocol. HDFS clients connect to a
Transmission Control Protocol (TCP) port opened on the name node, and then communicate with
the name node using a proprietary Remote Procedure Call (RPC)-based protocol. Data nodes talk
to the name node using a proprietary block-based protocol.
Data nodes continuously loop, asking the name node for instructions. A name node can't connect
directly to a data node; it simply returns values from functions invoked by a data node. Each data
node maintains an open server socket so that client code or other data nodes can read or write
data. The host or port for this server socket is known by the name node, which provides the
information to interested clients or other data nodes. See the Communications protocols sidebar
for more about communication between data nodes, name nodes, and clients.
The name node maintains and administers changes to the file system namespace.


File system namespace


HDFS supports a traditional hierarchical file organization in which a user or an application can
create directories and store files inside them. The file system namespace hierarchy is similar to
most other existing file systems; you can create, rename, relocate, and remove files.
HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service (S3).
HDFS provides interfaces for applications to move themselves closer to where the data is located, as described in the following section.
Application interfaces into HDFS
You can access HDFS in many different ways. HDFS provides a native Java application
programming interface (API) and a native C-language wrapper for the Java API. In addition, we
can use a web browser to browse HDFS files.
The applications described in Table 1 are also available to interface with HDFS.

Table 1. Applications that can interface with HDFS

Application                    Description

FileSystem (FS) shell          A command-line interface similar to common Linux and UNIX shells (bash, csh, etc.) that allows interaction with HDFS data.

DFSAdmin                       A command set that you can use to administer an HDFS cluster.

fsck                           A subcommand of the Hadoop command/application. You can use the fsck command to check for inconsistencies with files, such as missing blocks, but you cannot use the fsck command to correct these inconsistencies.

Name nodes and data nodes      These have built-in web servers that let administrators check the current status of a cluster.
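Besides the shell, DFSAdmin, fsck, and web interfaces listed in Table 1, applications can talk to HDFS directly through the native Java API mentioned above. The short sketch below is illustrative only: the name node URI matches the single-node setup used later in this report, and the directory path is a placeholder. It asks the name node for the metadata of each entry in a directory.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Connect to the name node; the URI is an example for a single-node cluster.
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), new Configuration());

        // listStatus is a namespace operation answered by the name node.
        for (FileStatus entry : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(entry.getLen() + "\t" + entry.getPath());
        }
        fs.close();
    }
}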


5.3 MODULES
Modules are the components that operate on the big data. They are
1. Client
2. Server

The client is the end user who sends a request to the server to get a task accomplished.

The server returns a response to the client for the request that was made.

The figure below illustrates the architecture and working of the client and server modules.

Fig 5.2: Architecture of the client-server modules

5.4 UML DIAGRAMS


UML diagrams help us to understand the functioning of the system easily through a pictorial representation.
5.4.1 UML sequence diagram illustrating the operation of commands in Hadoop

Fig 5.3: Sequence diagram for the operation of commands in Hadoop


5.4.2. UML class diagram illustrating the Hadoop operation


Fig 5.4: Class diagram of the Hadoop operation

5.5 CONCLUSION
Thus the UML diagrams help us in understanding the operation of the system in Hadoop. In the next chapter we discuss the implementation part of the project.


CHAPTER 6: IMPLEMENTATION
6.1 Introduction to HDFS commands:
Previously, we saw how we could install Apache Hadoop on 32-bit Ubuntu and how to run a sample program. Now that we have achieved this, let's explore further. In this section, we will see how to run basic HDFS commands.

6.2 Map Reduce and HDFS


Hadoop is a distributed OS. Like any OS, Hadoop needs its own file system. For a distributed file system, the important thing to understand is that there is an abstraction between its representation to the programmer and its actual mount point in the physical system. For our setup of a single-node cluster using only test data, this will be OK. But if you are going to install a Hadoop distribution on a large cluster using an installer that has a default mount point, please take caution early on. It is possible that your data size may increase later and your system may run out of disk space.
Moving on:

Start your Ubuntu setup and start the Hadoop system. You may need to switch user to
Hadoop and then navigate to $HADOOP_HOME/bin to run the start-all.sh script. Verify that
Hadoop has started by running the jps command and checking for the Name Node, Data Node, and other processes.
1. Starting point
Go to the prompt and type Hadoop. You will see the following response.
hadoop@sumod-hadoop:~$ hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format       format the DFS filesystem
secondarynamenode      run the DFS secondary namenode
namenode               run the DFS namenode
datanode               run a DFS datanode
dfsadmin               run a DFS admin client
mradmin                run a Map-Reduce admin client
fsck                   run a DFS filesystem checking utility
fs                     run a generic filesystem user client


We will focus on the fs command for now. We have used the jar command and the fs command earlier to some extent.


2. Using the fs command
At the prompt, type hadoop fs to get a list of options for fs command. You can see that the
options for hadoop fs command look a lot like familiar Unix commands such as ls, mv, cp etc.
Remember that you should always use hadoop fs -<command> to run the command. We will
now see some commands to be used with hadoop fs
-ls: This command will list the contents of an HDFS directory. Note that / here means the root of HDFS and not of your local file system.
hadoop@sumod-hadoop:~$ hadoop fs -ls /
Found 2 items
drwxr-xr-x   hadoop supergroup          0 2012-09-09 14:39 /app
drwxr-xr-x   hadoop supergroup          0 2012-09-08 16:49 /user

hadoop@sumod-hadoop:~$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x   hadoop supergroup          0 2012-09-08 16:49 /user/hadoop
hadoop@sumod-hadoop:~$
We have only hadoop as the user on our hadoop system. It is also the de-facto superuser of the
hadoop system in our setup. You can use the -lsr option to see the directory listing recursively.
Of course, any HDFS directory can be given as a starting point.
-mkdir This command can be used to create a directory in HDFS.

hadoop@sumod-hadoop:~$ hadoop fs -mkdir /user/hadoop/test


hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop
Found 3 items
drwxr-xr-x   hadoop supergroup          0 2012-09-23 15:46 /user/hadoop/test
drwxr-xr-x   hadoop supergroup          0 2012-09-08 16:36 /user/hadoop/wcinput
drwxr-xr-x   hadoop supergroup          0 2012-09-08 16:49 /user/hadoop/wcoutput

mkdir is particularly useful when you have multiple users on your hadoop system. It really
helps to have separate user directories on HDFS, the same way as on a UNIX system. So remember
to create HDFS directories for your UNIX users who need to access Hadoop as well.
-count This command will list the count of directories, files and list file size and file name
hadoop@sumod-hadoop:~$ hadoop fs -count /user/hadoop
6            9349507 hdfs://localhost:54310/user/hadoop

-touchz This command is used to create a file of 0 length. This is similar to the Unix
touch command.

hadoop@sumod-hadoop:~$ hadoop fs -touchz /user/hadoop/test/temp.txt


hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop/test
Found 1 items
-rw-r--r-- 1 hadoop supergroup          0 2012-09-23 16:00 /user/hadoop/test/temp.txt
hadoop@sumod-hadoop:~$


-cp and -mv: These commands operate like regular Unix commands to copy and
rename a file.

hadoop@sumod-hadoop:~$ hadoop fs -cp /user/hadoop/test/temp.txt /user/hadoop/test/temp1.txt
hadoop@sumod-hadoop:~$ hadoop fs -mv /user/hadoop/test/temp.txt /user/hadoop/test/temp2.txt
hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop/test
Found 2 items
-rw-r--r-- 1 hadoop supergroup          0 2012-09-23 16:04 /user/hadoop/test/temp1.txt
-rw-r--r-- 1 hadoop supergroup          0 2012-09-23 16:03 /user/hadoop/test/temp2.txt
hadoop@sumod-hadoop:~$

-put and -copyFromLocal: These commands are used to put files from the local file system to the destination file system. The difference is that -put allows reading from stdin, while -copyFromLocal allows only a local file reference as a source.

hadoop@sumod-hadoop:~$ hadoop fs -put hdfs.txt hdfs://localhost:54310/user/hadoop/hdfs.txt
hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop
Found 7 items
-rw-r--r-- 1 hadoop supergroup          0 2012-09-24 10:02 /user/hadoop/hdfs.txt

To use stdin see the following example. On Ubuntu, you need to press CTRL+D to stop entering
the file.
hadoop@sumod-hadoop:~$ hadoop fs -put - /user/hadoop/sample.txt
this is a sample text file.
hadoop@sumod-hadoop:~$ hadoop fs -cat /user/hadoop/sample.txt
this is a sample text file.
hadoop@sumod-hadoop:~$
I have now removed the files to show an example of using copyFromLocal. Let's try copying multiple files this time.
hadoop@sumod-hadoop:~$ hadoop fs -copyFromLocal hdfs.txt temp.txt /user/hadoop/

hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop
Found 5 items
-rw-r--r-- 1 hadoop supergroup          0 2012-09-24 10:14 /user/hadoop/hdfs.txt
-rw-r--r-- 1 hadoop supergroup    1089803 2012-09-24 10:14 /user/hadoop/temp.txt

-get and -copyToLocal: These commands are used to copy files from HDFS to the
local file system.

hadoop@sumod-hadoop:~$ hadoop fs -get /user/hadoop/hdfs.txt .
hadoop@sumod-hadoop:~$ ls
Desktop  Documents  Downloads  hdfs.txt  Music  Pictures  Public  Templates  #temp.txt#  Videos
hadoop@sumod-hadoop:~$ hadoop fs -copyToLocal /user/hadoop/temp.txt
Usage: java FsShell [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
hadoop@sumod-hadoop:~$ hadoop fs -copyToLocal /user/hadoop/temp.txt .
hadoop@sumod-hadoop:~$ ls
Desktop  Documents  Downloads  hdfs.txt  Music  Pictures  Public  Templates  temp.txt  #temp.txt#  Videos
hadoop@sumod-hadoop:~$

6.3 Hadoop Shell Commands:


1. DFShell
2. cat
3. chgrp
4. chmod
5. chown
6. copyFromLocal
7. copyToLocal
8. cp
9. du
10. dus
11. expunge
12. get
13. getmerge
14. ls
15. lsr
16. mkdir
17. moveFromLocal
18. mv
19. put
20. rm
21. rmr
22. setrep
23. stat
24. tail
25. test
26. text
27. touchz

1. DFShell
The HDFS shell is invoked by bin/hadoop dfs <args>. All the HDFS shell commands take path
URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and
for the local file system the scheme is file. The scheme and authority are optional. If not
specified, the default scheme specified in the configuration is used. An HDFS file or directory
such as /parent/child can be specified as hdfs://namenode:namenodeport/parent/child or simply
as /parent/child (given that your configuration is set to point to namenode:namenodeport). Most
of the commands in HDFS shell behave like corresponding UNIX commands. Differences are
described with each of the commands. Error information is sent to stderr and the output is sent to
stdout.
2. cat

Usage: hadoop dfs -cat URI [URI ...]
Copies source paths to stdout.
Example:
hadoop dfs -cat hdfs://host1:port1/file1 hdfs://host2:port2/file2
hadoop dfs -cat file:///file3 /user/hadoop/file4
Exit Code:
Returns 0 on success and -1 on error.
3. chgrp
Usage: hadoop dfs -chgrp [-R] GROUP URI [URI ...]
Change group association of files. With -R, make the change recursively through the directory
structure. The user must be the owner of files, or else a super-user. Additional information is in
the Permissions User Guide.
4. chmod
Usage: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Change the permissions of files. With -R, make the change recursively through the directory
structure. The user must be the owner of the file, or else a super-user. Additional information is
in the Permissions User Guide.
5. chown
Usage: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
Change the owner of files. With -R, make the change recursively through the directory structure.
The user must be a super-user. Additional information is in the Permissions User Guide.
6. copyFromLocal
Usage: hadoop dfs -copyFromLocal <localsrc> URI
Similar to put command, except that the source is restricted to a local file reference.

7. copyToLocal
Usage: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file reference.
8. cp
Usage: hadoop dfs -cp URI [URI ...] <dest>
Copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
Example:
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
9. du
Usage: hadoop dfs -du URI [URI ...]
Displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file.
Example:
hadoop dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://host:port/user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.

10. dus
Usage: hadoop dfs -dus <args>
Displays a summary of file lengths.


11. expunge
Usage: hadoop dfs -expunge
Empty the Trash. Refer to HDFS Design for more information on Trash feature.
12. get
Usage: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copy files to the local file system. Files that fail the CRC check may be copied with the
-ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
hadoop dfs -get /user/hadoop/file localfile
hadoop dfs -get hdfs://host:port/user/hadoop/file localfile
Exit Code:
Returns 0 on success and -1 on error.
13. getmerge
Usage: hadoop dfs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates files in src into the
destination local file. Optionally addnl can be set to enable adding a newline character at the end
of each file.

14. ls
Usage: hadoop dfs -ls <args>
For a file, returns stat on the file with the following format:
filename <number of replicas> filesize modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children as in Unix. A directory is listed as:
dirname <dir> modification_time modification_time permissions userid groupid
Example:
hadoop dfs -ls /user/hadoop/file1 /user/hadoop/file2 hdfs://host:port/user/hadoop/dir1 /nonexistentfile
Exit Code:
Returns 0 on success and -1 on error.
15. lsr
Usage: hadoop dfs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.
16. mkdir
Usage: hadoop dfs -mkdir <paths>
Takes path URIs as argument and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop dfs -mkdir hdfs://host1:port1/user/hadoop/dir hdfs://host2:port2/user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
17. moveFromLocal
Usage: hadoop dfs -moveFromLocal <src> <dst>
Displays a "not implemented" message.
18. mv
Usage: hadoop dfs -mv URI [URI ] <dest>
Moves files from source to destination. This command allows multiple sources as well in which
case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:
hadoop dfs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop dfs -mv hdfs://host:port/file1 hdfs://host:port/file2 hdfs://host:port/file3 hdfs://host:port/dir1
Exit Code:
Returns 0 on success and -1 on error.
19. put
Usage: hadoop dfs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads
input from stdin and writes to destination filesystem.
hadoop dfs -put localfile /user/hadoop/hadoopfile
hadoop dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop dfs -put localfile hdfs://host:port/hadoop/hadoopfile
hadoop dfs -put - hdfs://host:port/hadoop/hadoopfile
Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
20. rm
Usage: hadoop dfs -rm URI [URI ]
Delete files specified as args. This command does not delete directories recursively; refer to rmr for
recursive deletes.
Example:
hadoop dfs -rm hdfs://host:port/file /user/hadoop/emptydir
Exit Code:
Returns 0 on success and -1 on error.
21. rmr
Usage: hadoop dfs -rmr URI [URI ]
Recursive version of delete.
Example:
hadoop dfs -rmr /user/hadoop/dir
hadoop dfs -rmr hdfs://host:port/user/hadoop/dir

Exit Code:
Returns 0 on success and -1 on error.
22. setrep
Usage: hadoop dfs -setrep [-R] [-w] <rep> <path>
Changes the replication factor of a file. The -R option recursively changes the replication factor of
all files within a directory, and -w waits for the replication to complete.
Example:
hadoop dfs -setrep -w 3 -R /user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.
23. stat
Usage: hadoop dfs -stat URI [URI ]
Returns the stat information on the path.
Example:
hadoop dfs -stat path
Exit Code:
Returns 0 on success and -1 on error.
24. tail
Usage: hadoop dfs -tail [-f] URI
Displays last kilobyte of the file to stdout. -f option can be used as in Unix.
Example:
hadoop dfs -tail pathname
Exit Code : Returns 0 on success and -1 on error.

25. test
Usage: hadoop dfs -test -[ezd] URI
Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true.
-d check to see if the path is a directory. Return 1 if true, else return 0.


Example:
hadoop dfs -test -e filename
26. text
Usage: hadoop dfs -text <src>
Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.
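Example (illustrative path):
hadoop dfs -text /user/hadoop/file1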
27. touchz
Usage: hadoop dfs -touchz URI [URI ]
Create a file of zero length.
Example:
hadoop dfs -touchz pathname
Exit Code:
Returns 0 on success and -1 on error.
6.4 Java Code Program:
package cse.GNITC;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class ImplementCreateHDFSDir {
public static void main(String[] args) throws IOException ,URISyntaxException
{
String FolderName = args[0] ;
System.out.println("Folder name is : " + args[0]);
//1. Get the Configuration instance
Configuration conf = new Configuration();
//2. Add Configuration files to the object
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));
//3. Get the instance of the HDFS
FileSystem FS = FileSystem.get(new URI("hdfs://localhost:54310"), conf);
//4. Get the folder name from the parameter
String FN = FolderName.substring(FolderName.lastIndexOf('/') + 1, FolderName.length());
System.out.println("FolderName name is : " + FN);
Path folderpath = new Path(FolderName);
//5. Check if the folder already exists
if (FS.exists(folderpath)) {
System.out.println("Folder by name " + FolderName + " already exists");
return;
}
FS.mkdirs(folderpath);
FS.close();
}
}

OUTPUT:

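The following is a hedged companion sketch, not part of the project code above: it uses the same FileSystem API to list the contents of the newly created directory and confirm that mkdirs succeeded. The class name is illustrative, and the NameNode URI and configuration file paths simply mirror the assumptions made in ImplementCreateHDFSDir.

package cse.GNITC;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class ListHDFSDir {
    public static void main(String[] args) throws IOException, URISyntaxException {
        String folderName = args[0]; // directory to list, e.g. the one created above
        Configuration conf = new Configuration();
        // Reuse the same configuration files as ImplementCreateHDFSDir
        conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), conf);
        // listStatus returns one FileStatus entry per direct child of the directory
        for (FileStatus status : fs.listStatus(new Path(folderName))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}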

6.5 CONCLUSION
In this way, the code is implemented; the results are presented in the next chapter in the form of
screenshots.

CHAPTER 7: SCREENSHOTS
1. CAT COMMAND


Fig 7.1 CAT Command

Usage: hadoop dfs -cat URI [URI ]


Copies source paths to stdout.
Example:
hadoop dfs -cat hdfs://host1:port1/file1 hdfs://host2:port2/file2
hadoop dfs -cat file:///file3 /user/hadoop/file4


2. COPYTOLOCAL COMMAND

Fig 7.2 copytoLocal Command

Usage: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>


Similar to get command, except that the destination is restricted to a local file reference.


3. CP COMMAND

Fig 7.3 cp Command


Usage: hadoop dfs -cp URI [URI ] <dest>
Copy files from source to destination. This command allows multiple sources as well in which
case the destination must be a directory.
Example:

hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2

hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

Exit Code:
Returns 0 on success and -1 on error.

4. DU COMMAND

Fig 7.4 du command

Usage: hadoop dfs -du URI [URI ]


Displays aggregate length of files contained in the directory or the length of a file in case
its just a file.
Example:
hadoop dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://host:port/user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.


5. DUS COMMAND

Fig 7.5 dus Command

Usage: hadoop dfs -dus <args>


Displays a summary of file lengths.


6. EXPUNGE COMMAND

Fig 7.6 expunge command

Usage: hadoop dfs -expunge


Empty the Trash. Refer to HDFS Design for more information on Trash feature.


7. GET COMMAND

Fig 7.7 get Command

Usage: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>


Copy files to the local file system. Files that fail the CRC check may be copied with the
-ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
hadoop dfs -get /user/hadoop/file localfile
hadoop dfs -get hdfs://host:port/user/hadoop/file localfile


8. GETMERGE COMMAND

Fig 7.8 GETMERGE Command


Usage: hadoop dfs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates files in src into the
destination local file. Optionally addnl can be set to enable adding a newline character at the end
of each file.


9. LS COMMAND

Fig 7.9 ls Command


Usage: hadoop dfs -ls <args>
For a file returns stat on the file with the following format:
filename <number of replicas> filesize modification_date modification_time permissions
userid groupid
For a directory it returns list of its direct children as in unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions userid groupid
Example:
hadoop dfs -ls /user/hadoop/file1 /user/hadoop/file2 hdfs://host:port/user/hadoop/dir1 /nonexistentfile
Exit Code:
Returns 0 on success and -1 on error.


10. LSR COMMAND

Fig 7.10 lsr Command


Usage: hadoop dfs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.


11. MKDIR COMMAND

Fig 7.11 mkdir command


Usage: hadoop dfs -mkdir <paths>
Takes path URIs as argument and creates directories. The behavior is much like unix mkdir -p,
creating parent directories along the path.
Example:

hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

hadoop dfs -mkdir hdfs://host1:port1/user/hadoop/dir hdfs://host2:port2/user/hadoop/dir


12. MOVEFROMLOCAL COMMAND

Fig 7.12 movefromLocal Command


Usage: hadoop dfs -moveFromLocal <src> <dst>
In this Hadoop version the command is not implemented and only displays a "not implemented" message.


13. MV COMMAND

Fig 7.13 mv Command

Usage: hadoop dfs -mv URI [URI ] <dest>


Moves files from source to destination. This command allows multiple sources as well in
which case the destination needs to be a directory. Moving files across filesystems is not
permitted.
Example:
hadoop dfs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop dfs -mv hdfs://host:port/file1 hdfs://host:port/file2 hdfs://host:port/file3 hdfs://host:port/dir1
Exit Code:
Returns 0 on success and -1 on error.


14. PUT COMMAND

Fig 7.14 put command

Usage: hadoop dfs -put <localsrc> ... <dst>


Copy single src, or multiple srcs from local file system to the destination filesystem. Also
reads input from stdin and writes to destination filesystem.

hadoop dfs -put localfile /user/hadoop/hadoopfile


hadoop dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop dfs -put localfile hdfs://host:port/hadoop/hadoopfile
hadoop dfs -put - hdfs://host:port/hadoop/hadoopfile

Reads the input from stdin.


Exit Code:
Returns 0 on success and -1 on error.

15. RM COMMAND

Fig 7.15 rm Command


Usage: hadoop dfs -rm URI [URI ]
Delete files specified as args. This command does not delete directories recursively; refer to rmr for
recursive deletes.
Example:
hadoop dfs -rm hdfs://host:port/file /user/hadoop/emptydir
Exit Code:
Returns 0 on success and -1 on error.


16. RMR COMMAND

Fig 7.16 rmr command

Usage: hadoop dfs -rmr URI [URI ]


Recursive version of delete.
Example:
hadoop dfs -rmr /user/hadoop/dir
hadoop dfs -rmr hdfs://host:port/user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.


17. STAT COMMAND

Fig 7.17 stat Command

Usage: hadoop dfs -stat URI [URI ]


Returns the stat information on the path.
Example:
hadoop dfs -stat path
Exit Code:
Returns 0 on success and -1 on error.


18. TAIL COMMAND

Fig 7.18 tail command


Usage: hadoop dfs -tail [-f] URI
Displays last kilobyte of the file to stdout. -f option can be used as in Unix.


19. TEST COMMAND

Fig 7.19 Test Command


Usage: hadoop dfs -test -[ezd] URI
Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true.
-d check to see if the path is a directory. Return 1 if true, else return 0.
Example:
hadoop dfs -test -e filename


20. TEXT COMMAND

Fig 7.20 text command

Usage: hadoop dfs -text <src>


Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.


21. TOUCHZ COMMAND

Fig 7.21 touchz Command

Usage: hadoop dfs -touchz URI [URI ]


Create a file of zero length.
Example:
hadoop dfs -touchz pathname


7.2 Screenshot of Java Code Output:

Fig 7.22 Java Output

CHAPTER 8: TESTING AND VALIDATION



Software testing is a critical element of software quality assurance and represents the ultimate
review of specification, design and coding. Testing presents an interesting anomaly for the
software engineer.

Testing Objectives include:


Testing is a process of executing a program with the intent of finding an error. A good test case is
one that has a high probability of finding an as-yet-undiscovered error. A successful test is one that
uncovers such an error.

Testing principles:
All tests should be traceable to end-user requirements. Tests should be planned long before
testing begins. Testing should begin on a small scale and progress towards testing in the large.
Exhaustive testing is not possible. To be most effective, testing should be conducted by an
independent third party.

Testing strategies:
A strategy for software testing integrates software test cases into a series of well-planned steps
that result in the successful construction of software. Software testing is part of a broader topic
that is often referred to as verification and validation. Verification refers to the set of activities that
ensure that the software correctly implements a specific function. Validation refers to the set of
activities that ensure that the software that has been built is traceable to customer requirements.

8.1 Testing:
Testing is a process of executing a program with the intent of finding an error. Testing presents an
interesting anomaly for the software engineer. The goal of software testing is to convince the
system developer and customers that the software is good enough for operational use. Testing is
a process intended to build confidence in the software.

8.2 Types of testing:


The various types of testing are:
1. White Box Testing
2. Black Box Testing
3. Alpha Testing
4. Beta Testing

8.2.1 White box testing:


It is also called glass-box testing. It is a test-case design method that uses the control structure
of the procedural design to derive test cases. Using white-box testing methods, the software
engineer can derive test cases that:
1. Guarantee that all independent paths within a module have been exercised at least once.
2. Exercise all logical decisions on their true and false sides.
8.2.2 Black box testing:
It is also called behavioural testing. It focuses on the functional requirements of the software. It is
a complementary approach that is likely to uncover a different class of errors than white-box
testing. Black-box testing enables a software engineer to derive sets of input conditions that will
fully exercise all functional requirements for a program.
8.2.3 Alpha testing:
Alpha testing is the software prototype stage when the software is first able to run. It will not
have all the intended functionality, but it will have core functions and will be able to accept
inputs and generate outputs. An alpha test usually takes place in the developer's offices on a
separate system.
8.2.4 Beta testing:
The beta test is a live application of the software in an environment that cannot be controlled
by the developer. The beta test is conducted at one or more customer sites by the end user of the
software.

8.3 Path testing:


The established flow-graph technique with cyclomatic complexity was used to derive test cases
for all the functions. The main steps in deriving test cases were:
1. Use the design of the code and draw the corresponding flow graph.
2. Determine the cyclomatic complexity of the resultant flow graph, using the formula
V(G) = E - N + 2, or
V(G) = P + 1, or
V(G) = number of regions,
where V(G) is the cyclomatic complexity, E is the number of edges, N is the number of flow-graph
nodes, and P is the number of predicate nodes.
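For example (hypothetical numbers, not measured from the project code), a flow graph with E = 9 edges, N = 7 nodes and P = 3 predicate nodes gives V(G) = 9 - 7 + 2 = 4, which agrees with V(G) = P + 1 = 4, so at least four independent-path test cases would be needed.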

CHAPTER 9: CONCLUSION AND FUTURE SCOPE


9.1 CONCLUSION

9.1.1 Select the Right Projects for Hadoop Implementation


Choose projects that fit Hadoop's strengths and minimize its disadvantages. Enterprises use
Hadoop in data-science applications for log analysis, data mining, machine learning and image
processing involving unstructured or raw data. Hadoop's lack of a fixed schema works particularly
well for answering ad-hoc queries and exploratory "what if" scenarios. The Hadoop Distributed File
System (HDFS) and MapReduce address growth in enterprise data volumes from terabytes to
petabytes and more, and the increasing variety of complex multi-dimensional data from disparate
sources.
For applications that require faster velocity for real-time or right-time data processing, Hadoop
does have important speed limitations, although Apache HBase adds a distributed column-oriented
database on top of HDFS and there is work in the Hadoop community to support stream
processing. Likewise, compared to an enterprise data warehouse, current-generation Apache
Hadoop does not offer a comparable level of feature sophistication to mandate deterministic
query response times, balance mixed workloads, define role- and group-based user access, or
place limits on individual queries.

9.1.2 Rethink and Adapt Existing Architectures to Hadoop


For most organizations, Hadoop is one extension or component of a broader data architecture.
Hadoop can serve as a data bag for data aggregation and pre-processing before loading into a
data warehouse. At the same time, organizations can offload data from an enterprise data
warehouse into Hadoop to create virtual sandboxes for use by data analysts.
As part of your multi-year data architecture roadmap, be ready to accommodate changes from
Hadoop and other technologies that impact Hadoop deployment. Devise an architecture and tools
to efficiently implement the data processing pipeline and provision the data to production. Start
small and grow incrementally with a data platform and architecture that enable you to build once
and deploy wherever it makes sense, using Hadoop or other systems, on premises or in the
cloud.
9.1.3 Plan Availability of Skills and Resources Before You Get Started
One of the constraints on deploying Hadoop is the shortage of trained personnel. There are many
projects and sub-projects in the Apache ecosystem, making it difficult to stay abreast of all of the
changes. Consider a platform approach to hide the complexity of the underlying technologies from
analysts and other line-of-business users.
9.1.4 Prepare to Deliver Trusted Data for Areas That Impact Business Insight and
Operations
Compared to the decades of feature development by relational and transactional systems,
current-generation Hadoop offers fewer capabilities to track metadata, enforce data governance,
verify data authenticity, or comply with regulations to secure customer non-public information.
The Hadoop community will continue to introduce improvements and additions (for example,
HCatalog is designed for metadata management), but it takes time for those features to be
developed, tested, and validated for integration with third-party software. Hadoop is not a
replacement for master data management (MDM): lumping data from disparate sources into a
Hadoop data bag does not by itself solve broader business or compliance problems with
inconsistent, incomplete or poor quality data that may vary by business unit or by geography.
You can anticipate that data will require cleansing and matching for reporting and analysis.
Consider your end-to-end data processing pipeline, and determine your needs for security,
cleansing, matching, integration, delivery and archiving. Adhere to a data governance program to
deliver authoritative and trustworthy data to the business, and adopt metadata-driven audits to
add transparency and increase efficiency in development.
9.1.5 Adopt Lean and Agile Integration Principles


To transfer data between Hadoop and other elements of your data architecture, the HDFS API
provides the core interface for loading or extracting data. Other useful tools include Chukwa,
Scribe or Flume for the collection of log data, and Sqoop for data loading from or to relational
databases. Hive enables ad-hoc query and analysis of data in HDFS using a SQL-like interface.
Informatica PowerCenter version 9.1 includes connectivity for HDFS, to load data into Hadoop
or extract data from Hadoop.
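As a minimal sketch (assuming the single-node NameNode address hdfs://localhost:54310 used earlier in this report, and illustrative command-line paths), the HDFS API mentioned above can be driven from Java to load a local file into Hadoop, mirroring the hadoop dfs -put shell command:

package cse.GNITC;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class LoadFileIntoHDFS {
    public static void main(String[] args) throws IOException, URISyntaxException {
        // args[0] = local source file, args[1] = HDFS destination path (both illustrative)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), conf);
        // copyFromLocalFile performs the same transfer as "hadoop dfs -put"
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        fs.close();
    }
}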

9.2 FUTURE SCOPE OF HADOOP


Advantages:
1. Distributes data and computation.
2. Tasks are independent.
3. Linear scaling in the ideal case; Hadoop is designed to run on cheap, commodity hardware.
4. HDFS stores large amounts of information.
5. HDFS has a simple and robust coherency model.
6. HDFS integrates well with Hadoop MapReduce, allowing data to be read and computed upon
locally when possible.


REFERENCES & BIBLIOGRAPHY

1. http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
2. http://theglobaljournals.com/gra/file.php?val=February_2013_1360851170_47080_37.pdf
3. http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html
4. http://www.bigdatacompanies.com/5-big-disadvantages-of-hadoop-for-big-data/
5. http://www.havoozacademy.org
6. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6223552
