TABLE OF CONTENTS
LIST OF FIGURES .......................................................... 4
ABSTRACT ................................................................. 6
Chapter 1: Project Introduction
1.1 Introduction to Big Data ............................................. 7
1.2 Problem Statement .................................................... 9
1.3 Scope ................................................................ 9
1.4 Project Features ..................................................... 9
1.5 Organization ......................................................... 9
Chapter 2: Literature Survey ............................................. 12
Chapter 3: System Analysis ............................................... 14
3.1 Introduction ......................................................... 14
3.2 System Study ......................................................... 15
3.2.1 Existing System .................................................... 15
3.2.2 Proposed System .................................................... 15
3.2.3 Hadoop ............................................................. 16
3.3 Feasibility Study .................................................... 17
3.3.1 Introduction to Java API ........................................... 18
3.4 Objectives ........................................................... 19
3.5 Technology Used ...................................................... 20
Chapter 4: System Requirements ........................................... 22
4.1 Introduction ......................................................... 22
4.2 System Requirements .................................................. 25
4.2.1 Software Requirements .............................................. 25
4.2.2 Hardware Requirements .............................................. 26
4.3 Conclusion ........................................................... 26
Chapter 5: System Design ................................................. 27
5.1 Introduction ......................................................... 27
5.2 HDFS Architecture .................................................... 27
5.3 Modules .............................................................. 30
5.4 UML Diagrams ......................................................... 31
5.4.1 Sequence Diagram ................................................... 31
5.4.2 Class Diagram ...................................................... 32
5.5 Conclusion ........................................................... 32
Chapter 6: Implementation ................................................ 33
6.1 Introduction ......................................................... 33
6.2 MapReduce ............................................................ 33
6.3 HDFS Shell Commands .................................................. 38
6.4 Sample Code .......................................................... 45
6.5 Conclusion ........................................................... 46
Chapter 7: Screenshots ................................................... 47
Chapter 8: Testing and Validation ........................................ 69
8.1 Testing .............................................................. 69
8.2 Types of Testing ..................................................... 70
8.2.1 White Box Testing .................................................. 70
8.2.2 Black Box Testing .................................................. 70
8.2.3 Alpha Testing ...................................................... 70
8.2.4 Beta Testing ....................................................... 70
8.3 Path Testing ......................................................... 71
Chapter 9: Conclusion and Future Scope ................................... 72
9.1 Conclusion ........................................................... 72
9.1.1 Selecting a Project in Hadoop ...................................... 72
9.1.2 Rethinking and Adapting to Existing Hadoop ......................... 72
9.1.3 Plan Availability of Skills and Resources .......................... 73
9.1.4 Prepare to Deliver Trusted Data for Business Insight and Operations  73
LIST OF FIGURES
Figure No.   Name of Figure                     Page No.
1.1
1.2          Data Analysis
1.3          3 Vs of Big Data
3.1                                             16
3.2          Objectives                         19
5.1          Architecture of HDFS               28
5.2                                             30
5.3          Sequence Diagram                   31
5.4          Class Diagram                      32
7.1          cat command execution              47
7.2          copyToLocal command execution      48
7.3          cp command execution               49
7.4          du command execution               50
7.5          dus command execution              51
7.6          expunge command execution          52
7.7          get command execution              53
7.8          getmerge command execution         54
7.9          ls command execution               55
7.10                                            56
7.11                                            57
7.12                                            58
7.13         mv command execution               59
7.14                                            60
7.15         rm command execution               61
7.16                                            62
7.17                                            63
7.18                                            64
7.19                                            65
7.20                                            66
7.21                                            67
ABSTRACT
Hadoop is a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware. Hadoop is an open-source framework for processing, storing and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce, a programming model developed by Google in the early 2000s for indexing the Web. It was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. Hadoop clusters run on inexpensive commodity hardware, so projects can scale out without breaking the bank. Hadoop is now a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology. The fundamental concept: rather than banging away at one huge block of data with a single machine, Hadoop breaks Big Data up into multiple parts so that each part can be processed and analyzed at the same time. Why is Hadoop used for searching, log processing, recommendation systems, analytics, video and image analysis, and data retention? Because it is a top-level Apache Software Foundation project with a large and active user base, mailing lists, user groups, very active development, and strong development teams.
Hadoop is a popular open-source implementation of MapReduce for the analysis of large
datasets. To manage storage resources across the cluster, Hadoop uses a distributed user-level
filesystem. This filesystem, HDFS, is written in Java and designed for portability across
heterogeneous hardware and software platforms. This paper analyzes the performance of HDFS
and uncovers several performance issues. First, architectural bottlenecks exist in the Hadoop
implementation that result in inefficient HDFS usage due to delays in scheduling new
MapReduce tasks. Second, portability limitations prevent the Java implementation from
exploiting features of the native platform. Third, HDFS implicitly makes portability assumptions
about how the native platform manages storage resources, even though native filesystems and
I/O schedulers vary widely in design and behavior. This paper investigates the root causes of
these performance bottlenecks in order to evaluate tradeoffs between portability and performance
in the Hadoop distributed filesystem.
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION TO BIG DATA
DEFINITIONS:
Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.
- Teradata Magazine article, 2011
Big data refers to data sets whose size is beyond the ability of typical database software tools to
capture, store, manage and analyze.
- The McKinsey Global Institute, 2012
Big data is a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools.
- Wikipedia, 2014.
Datasets that exceed the boundaries and sizes of normal processing capabilities, forcing you to take a non-traditional approach.
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing to enable enhanced decision making, insight discovery and process optimization.
- Gartner, 2012
Additionally, a new V, "Veracity", is added by some organizations to describe it.
Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big
data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many
petabytes of data.
Big data can also be defined as "Big data is a large volume unstructured data which
cannot be handled by standard database management systems like DBMS, RDBMS or
ORDBMS".
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
Convert 350 billion annual meter readings to better predict power consumption.
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud,
big data must be used as it streams into your enterprise in order to maximize its value.
Scrutinize 5 million trade events created each day to identify potential fraud.
Analyze 500 million daily call detail records in real-time to predict customer churn faster.
Variety: Big data is any type of data - structured and unstructured data such as text, sensor data,
audio, video, click streams, log files and more. New insights are found when analyzing these
data types together.
Monitor 100s of live video feeds from surveillance cameras to target points of interest.
1.3 SCOPE
The big data concept is mainly used to manage bulk data, and it can give us a better idea of the factors, such as the preferences of a customer, that are responsible for making a new product launched into the market successful. It has a wide range of applications, such as net pricing, out-of-home advertising, retail habits and politics.
1.5 ORGANIZATION
CHAPTER 1: In this chapter the basic introduction to the project is given, along with its uses and the reason why it is introduced.
CHAPTER 2: In this chapter Literature survey is explained. The literature survey gives a detailed
study of projects with similar activity or technology.
From the literature survey we come to know how other similar projects were implemented and the problems faced in those projects.
CHAPTER 3: In this chapter system analysis is explained. We also describe the language used in this project and explain the advantages and disadvantages of using it.
CHAPTER 4: In this chapter system requirements and analysis are explained, and the requirements are classified into functional and non-functional requirements.
CHAPTER 5: In this chapter the system design is explained with the help of different diagrammatic representations such as ER and UML diagrams.
CHAPTER 6: In this chapter the implementation of the system is explained by describing the basic concepts. The sample code is also explained, which shows the functionality of the implementation.
CHAPTER 7: In this chapter screenshots are presented. The screenshots give a clear description of how the project is implemented.
CHAPTER 8: This chapter includes the different types of testing and the validation part of the coding, with a brief description, and also defines different test cases.
CHAPTER-9: This chapter includes the conclusion of the project.
CHAPTER 10: It includes the references for all concepts that have been used in this project.
of a data chunk is available, the consumer of that data will not know of storage server failures.
HDFS services are provided by two processes:
Name Node handles management of the file system metadata, and provides management and
control services.
Data Node provides block storage and retrieval services. There will be one Name Node process
in an HDFS file system, and this is a single point of failure. Hadoop Core provides recovery and
automatic backup of the Name Node, but no hot failover services.
CHAPTER 3: SYSTEM ANALYSIS
3.1 INTRODUCTION
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built
from commodity hardware. All the modules in Hadoop are designed with a fundamental
assumption that hardware failures are commonplace and thus should be automatically handled
in software by the framework.
The core of Apache Hadoop consists of a storage part (HDFS) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce transfers packaged code for nodes to process in parallel, based on the data each node needs to process. This approach takes advantage of data locality, with nodes manipulating the data they have on hand, allowing the data to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking.
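As a concrete illustration of this split-and-process model, the following is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API. The class name and the input/output paths passed on the command line are illustrative assumptions, not code taken from this project.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: runs in parallel on each input split, close to where the block is stored.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    // Reducer: receives all counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory (assumed)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (assumed)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each mapper works only on the block of data its node already holds, which is exactly the data-locality behaviour described above.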
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common, which contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS), a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
3.2 SYSTEM STUDY
In the system study we analyze the present system and the proposed system, and get to know the reason why a new technology is needed in place of the existing one.
3.2.1 EXISTING SYSTEM
It incentivizes more collection of data and longer retention of it. If any and all data sets
might turn out to prove useful for discovering some obscure but valuable correlation, you might
as well collect it and hold on to it. In the long run, the more useful big data proves to be, the stronger this incentivizing effect will be; but in the short run it almost doesn't matter; the current buzz over the idea is enough to do the trick.
Many (perhaps most) people are not aware of how much information is being collected
(for example, that stores are tracking their purchases over time), let alone how it is being used
(scrutinized for insights into their lives).
Big data can further tilt the playing field toward big institutions and away from individuals.
In economic terms, it accentuates the information asymmetries of big companies over other
economic actors and allows for people to be manipulated.
3.2.2 PROPOSED SYSTEM
Big data is an all-encompassing term for any collection of data sets so large and complex
that it becomes difficult to process using traditional data processing applications. The challenges
include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy
violations. The trend to larger data sets is due to the additional information derivable from
analysis of a single large set of related data, as compared to separate smaller sets with the same
total amount of data, allowing correlations to be found to "spot business trends, prevent diseases,
combat crime and so on."
3.2.3 HADOOP
Hadoop is a free, Java-based programming framework that supports the processing of large
data sets in a distributed computing environment. It is part of the Apache project sponsored by
the Apache Software Foundation.
Hadoop makes it possible to run applications on systems with thousands of nodes involving
thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes
15
and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative. Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper.
Processing logic close to the data, rather than the data close to the processing logic
Configure persistent and dynamic query options. For details, see Query Options.
Apply transformations to new content and search results. For details, see Content
Transformations.
Extend the Java API to expose custom capabilities you install on MarkLogic Server.
XCC provides a lower-level interface for running remote or ad hoc XQuery. While it
provides significant flexibility, it also has a somewhat steeper learning curve for developers who
are unfamiliar with XQuery. You may want to think of XCC as being similar to ODBC or JDBC: a low-level API for sending query language directly to the server, while the Java Client API is a higher-level API for working with database constructs in Java. In terms of performance, the Java
API is very similar to Java XCC for compatible queries. The Java API is a very thin wrapper
over a REST API with negligible overhead. Because it is REST-based, minimize network
distance for best performance.
3.4 OBJECTIVES
Why objectives first? Well, there's a natural tendency to drop into the planning phase before you've thought out what you're trying to do in what I like to call "big animal pictures". I call this descending into the weeds before you've got a general idea of what you need to do. Projects, big data or otherwise, generally begin with determining your objectives and then breaking down the resources and tasks needed to complete the project.
Fig 3.2: Objectives
Now we moved on to the heart of the question: how can we determine and isolate the propagation mode of company news, from the reporting of financial news in Reuters to tweets about that information? Naturally, we also wanted to explore the different aspects of a tweet that might make it more or less influential. There are a number of tools available to measure some aspect of social authority, but for this project we focused on the following:
The volume, velocity, and acceleration of tweets generated after a news article reports financial information.
The social authority (or influence) of the tweeter, as indicated by his or her Klout score and number of followers.
Finally, once we had all this data, how would we determine (algorithmically) what influence each of these sources (singularly) had on a company's stock price?
Based on what we just covered, here are our four objectives:
sourcing, data algorithms, machine learning, natural language processing, signal processing, time series analysis and visualization.
Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation, such as multilinear subspace learning.
Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.
Some, but not all, MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS. DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi.
The practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures (storage area network, SAN, and network-attached storage, NAS) is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
Real or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on a spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than that of other storage techniques.
There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.
Before you can install Big Data Extensions, you must have set up the following VMware products.
When using Big Data Extensions on vSphere 5.0, you must perform all administrative tasks using the command-line interface.
When installing Big Data Extensions on vSphere 5.1 or later, you must use vCenter Single Sign-On authentication. When logging in to vSphere 5.1 or later, you pass authentication to vCenter Single Sign-On, which you can configure with multiple identity sources such as Active Directory and OpenLDAP; the password is exchanged for a security token which is used to access vSphere components.
Enable the vSphere Network Time Protocol on the ESXi hosts so that time-dependent processes occur in sync across hosts.
Cluster Settings: Enable Admission Control and set the desired policy (the default policy is to tolerate one host failure), and enable monitoring. The Management Network VMkernel port has vMotion and Fault Tolerance logging enabled.
Network Settings: Ports 8080 and 8443 are used by the Big Data Extensions plug-in user interface and web services; the log repository may be reached through a proxy.
Direct Attached Storage: 8-12 disk drives per host. The more disk drives per host, the better the performance.
Resource Requirements for the management server: 40 GB or more (recommended) disk space for the management server and the Hadoop template.
Resource Requirements for the Hadoop cluster: Datastore free space not less than the total size needed by the Hadoop cluster, plus swap disks for each Hadoop node equal to the memory size requested. A network configured across all relevant ESX or ESXi hosts, with connectivity to the network in use by the management server.
Hardware Requirements: Use High Availability (HA) and dual power supplies for the node's host machine.
Tested Host and Virtual Machine Support: the maximum host and virtual machine support that applies to virtualization.
Univa is bringing its Grid Engine to enterprise environments.
Company CEO Gary Tyreman notes that it's one thing to build a pilot Hadoop environment, but
quite another to scale it to enterprise levels. Clustering technologies and even high-end
appliances will go a long way toward getting the enterprise ready to truly tackle the challenges of
Big Data.
Integrated hardware and software platforms are also making a big push for the enterprise
market. Teradata just introduced the Unified Data Environment and Unified Data
Architecture solutions that seek to dismantle the data silos that keep critical disparate data sets
apart. Uniting key systems like the Aster and Apache Hadoop releases with new tools like
Viewpoint, Connector and Vital Infrastructure, and wrapped in the new Warehouse Appliance
2700 and Aster Big Analytics appliances, the platforms aim for nothing less than complete,
seamless integration and analysis of accumulated enterprise knowledge.
On this front, the IT industry as a whole is sorely lacking. Fortunately, speeding up the process is not
only a function of bigger and better hardware. Improved data preparation and contextual storage
practices can go a long way toward making data easier to find, retrieve and analyze, much the
same way that wide area networks can be improved through optimization rather than by adding
bandwidth.
4.3 CONCLUSION:
These are the hardware and software requirements that are needed for Hadoop to run. In the next
chapter the system design will be discussed.
CHAPTER 5: SYSTEM DESIGN
5.2 HDFS ARCHITECTURE
Fig 5.1: Architecture of HDFS
Communications protocols
All HDFS communication protocols build on the TCP/IP protocol. HDFS clients connect to a
Transmission Control Protocol (TCP) port opened on the name node, and then communicate with
the name node using a proprietary Remote Procedure Call (RPC)-based protocol. Data nodes talk
to the name node using a proprietary block-based protocol.
Data nodes continuously loop, asking the name node for instructions. A name node can't connect
directly to a data node; it simply returns values from functions invoked by a data node. Each data
node maintains an open server socket so that client code or other data nodes can read or write
data. The host or port for this server socket is known by the name node, which provides the
information to interested clients or other data nodes. See the Communications protocols sidebar
for more about communication between data nodes, name nodes, and clients.
The name node maintains and administers changes to the file system namespace.
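Application code does not normally speak these protocols directly; it goes through Hadoop's FileSystem API, which hides the name node and data node interactions. The short sketch below is only an illustration; the hdfs://localhost:54310 URI and the /user/hadoop/example.txt path are assumptions borrowed from the single-node setup that appears later in this report.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // The client first contacts the name node named in this URI (an assumed
        // single-node address); block data then flows to and from the data nodes.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:54310"), conf);

        Path file = new Path("/user/hadoop/example.txt");   // assumed example path

        // Write: the name node records the file in its namespace and tells the
        // client which data nodes should receive the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read: block locations come from the name node, bytes come from data nodes.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}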
Interface                    Description
FileSystem (FS) shell        A command-line interface similar to common Linux and UNIX shells.
DFSAdmin                     A command set for administering an HDFS cluster.
Fsck                         A utility that checks the health of the file system and reports missing files or blocks.
Name nodes and data nodes    These have built-in web servers that let administrators check the current status of a cluster.
5.3 MODULES
Modules are nothing but the components that operate on the big data. They are:
1. Client
2. Server
The client is the end user who sends a request to the server to get a task accomplished.
The server returns a response to the client for the request that was made.
The figure below illustrates the architecture of the working of the client and server modules.
5.4 UML DIAGRAMS
UML diagrams help us to understand the functioning of the system easily through a pictorial representation.
5.5 CONCLUSION
Thus the UML diagrams help us understand the operation of the system in Hadoop. In the next chapter we discuss the implementation part of the project.
CHAPTER 6: IMPLEMENTATION
6.1 Introduction to HDFS commands:
In the last post, we saw how we could install Apache Hadoop on 32-bit Ubuntu and how
to run a sample program. Now that we have achieved this, lets explore further. In this post, we
will see how to run basic HDFS commands.
Start your Ubuntu setup and start the Hadoop system. You may need to switch user to
Hadoop and then navigate to $HADOOP_HOME/bin to run the start-all.sh script. Verify that
Hadoop has started by running JPS command and checking Name Node, Data Node etc.
1. Starting point
Go to the prompt and type hadoop. You will see the following response.
hadoop@sumod-hadoop:~$ hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format
secondarynamenode
namenode
datanode
dfsadmin
mradmin
fsck
fs
hadoop@sumod-hadoop:~$
We have only hadoop as the user on our hadoop system. It is also the de facto superuser of the hadoop system in our setup. You can give the option -lsr to see the directory listing recursively.
Of course, any HDFS directory can be given as a starting point.
-mkdir This command can be used to create a directory in HDFS.
mkdir is particularly useful when you have multiple users on your hadoop system. It really
helps to have separate user directories on HDFS, the same way as on a UNIX system. So remember
to create HDFS directories for your UNIX users who need to access Hadoop as well.
-count This command will list the count of directories and files, along with the content size and the path name.
hadoop@sumod-hadoop:~$ hadoop fs -count /user/hadoop
6 9349507 hdfs://localhost:54310/user/hadoop
-touchz This command is used to create a file of 0 length. This is similar to the Unix
touch command.
-cp and -mv These commands operate like regular Unix commands to copy and
rename a file.
-put and -copyFromLocal These commands are used to put files from the local file system to the destination file system. The difference is that -put allows reading from stdin, while -copyFromLocal allows only a local file reference as a source.
To use stdin see the following example. On Ubuntu, you need to press CTRL+D to stop entering
the file.
hadoop@sumod-hadoop:~$ hadoop fs -put - /user/hadoop/sample.txt
this is a sample text file.
hadoop@sumod-hadoop:~$ hadoop fs -cat /user/hadoop/sample.txt
this is a sample text file.
hadoop@sumod-hadoop:~$
I have now removed the files to show an example of using copyFromLocal. Let's try copying multiple files this time.
hadoop@sumod-hadoop:~$ hadoop fs -copyFromLocal hdfs.txt temp.txt /user/hadoop/
-get and -copyToLocal These commands are used to copy files from HDFS to the local file system.
hadoop@sumod-hadoop:~$ hadoop fs -copyToLocal /user/hadoop/temp.txt
Usage: java FsShell [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
hadoop@sumod-hadoop:~$ hadoop fs -copyToLocal /user/hadoop/temp.txt .
hadoop@sumod-hadoop:~$ ls
Desktop  Downloads  Music  Public  #temp.txt#  Videos
The HDFS shell provides the following commands:
1. DFShell
2. cat
3. chgrp
4. chmod
5. chown
6. copyFromLocal
7. copyToLocal
8. cp
9. du
10. dus
11. expunge
12. get
13. getmerge
14. ls
15. lsr
16. mkdir
17. moveFromLocal
18. mv
19. put
20. rm
21. rmr
22. setrep
23. stat
24. tail
25. test
26. text
27. touchz
1. DFShell
The HDFS shell is invoked by bin/hadoop dfs <args>. All the HDFS shell commands take path
URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and
for the local file system the scheme is file. The scheme and authority are optional. If not
specified, the default scheme specified in the configuration is used. An HDFS file or directory
such as /parent/child can be specified as hdfs://namenode:namenodeport/parent/child or simply
as /parent/child (given that your configuration is set to point to namenode:namenodeport). Most
of the commands in HDFS shell behave like corresponding UNIX commands. Differences are
described with each of the commands. Error information is sent to stderr and the output is sent to
stdout.
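The same shell can also be driven from Java through the org.apache.hadoop.fs.FsShell class, which is what the bin/hadoop dfs command invokes under the hood. The sketch below is illustrative only, and the /user/hadoop path is an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class FsShellFromJava {
    public static void main(String[] args) throws Exception {
        // Runs the equivalent of: hadoop dfs -ls /user/hadoop
        Configuration conf = new Configuration();
        int exitCode = ToolRunner.run(new FsShell(conf), new String[] {"-ls", "/user/hadoop"});
        System.exit(exitCode);   // 0 on success, -1 on error, as described above
    }
}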
2. cat
7. copyToLocal
Usage: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file reference.
8. cp
Usage: hadoop dfs -cp URI [URI ] <dest>
Copy files from source to destination. This command allows multiple sources as well in
which case the destination must be a directory.
Example:
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
9. du
Usage: hadoop dfs -du URI [URI ]
Displays aggregate length of files contained in the directory or the length of a file in case its
just a file.
Example:
hadoop dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://host:port/user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.
10. dus
Usage: hadoop dfs -dus <args>
Displays a summary of file lengths.
11. expunge
Usage: hadoop dfs -expunge
Empty the Trash. Refer to HDFS Design for more information on Trash feature.
12. get
Usage: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copy files to the local file system. Files that fail the CRC check may be copied with the
-ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
hadoop dfs -get /user/hadoop/file localfile
hadoop dfs -get hdfs://host:port/user/hadoop/file localfile
Exit Code:
Returns 0 on success and -1 on error.
13. getmerge
Usage: hadoop dfs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates files in src into the
destination local file. Optionally addnl can be set to enable adding a newline character at the end
of each file.
14. ls
Usage: hadoop dfs -ls <args>
For a file returns stat on the file with the following format:
filename <number of replicas> filesize modification_date
modification_time permissions userid groupid
For a directory it returns list of its direct children as in unix. A directory is listed as:
dirname <dir> modification_time modification_time permissions
userid groupid
Example:
hadoop dfs -ls hdfs://host:port/file3 hdfs://host:port/dir1
Exit Code:
Returns 0 on success and -1 on error.
19. put
Usage: hadoop dfs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads
input from stdin and writes to destination filesystem.
hadoop dfs -put localfile /user/hadoop/hadoopfile
hadoop dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop dfs -put localfile hdfs://host:port/hadoop/hadoopfile
hadoop dfs -put - hdfs://host:port/hadoop/hadoopfile
Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
20. rm
Usage: hadoop dfs -rm URI [URI ]
Delete files specified as args. This does not delete non-empty directories; refer to rmr for recursive deletes.
Example:
hadoop dfs -rm hdfs://host:port/file /user/hadoop/emptydir
Exit Code:
Returns 0 on success and -1 on error.
21. rmr
Usage: hadoop dfs -rmr URI [URI ]
Recursive version of delete.
Example:
hadoop dfs -rmr /user/hadoop/dir
hadoop dfs -rmr hdfs://host:port/user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
22. setrep
Usage: hadoop dfs -setrep [-R] <path>
Changes the replication factor of a file. -R option is for recursively increasing the replication
factor of files within a directory.
Example:
hadoop dfs -setrep -w 3 -R /user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.
23. stat
Usage: hadoop dfs -stat URI [URI ]
Returns the stat information on the path.
Example:
hadoop dfs -stat path
Exit Code:
Returns 0 on success and -1 on error.
24. tail
Usage: hadoop dfs -tail [-f] URI
Displays last kilobyte of the file to stdout. -f option can be used as in Unix.
Example:
hadoop dfs -tail pathname
Exit Code : Returns 0 on success and -1 on error.
25. test
Usage: hadoop dfs -test -[ezd] URI
Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true
OUTPUT:
6.5 CONCLUSION
In this way, the code is implemented, and the results will be observed in the next chapter in the form of screenshots.
CHAPTER 7: SCREENSHOTS
1. CAT COMMAND
2. COPYTOLOCAL COMMAND
3. CP COMMAND
4. DU COMMAND
5. DUS COMMAND
6. EXPUNGE COMMAND
7. GET COMMAND
8. GETMERGE COMMAND
9. LS COMMAND
13. MV COMMAND
15. RM COMMAND
CHAPTER 8: TESTING AND VALIDATION
Software Testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding. Testing presents an interesting anomaly for the software engineer.
Testing principles:
All tests should be traceable to end user requirements. Tests should be planned long before testing begins. Testing should begin on a small scale and progress towards testing in the large. Exhaustive testing is not possible. To be most effective, testing should be conducted by an independent third party.
Testing strategies:
A Strategy for software testing integrates software test cases into a series of well-planned steps
that result in the successful construction of software. Software testing is part of a broader topic that is often referred to as Verification and Validation. Verification refers to the set of activities that ensure that the software correctly implements a specific function. Validation refers to the set of activities that ensure that the software that has been built is traceable to customer requirements.
8.1 Testing:
Testing is a process of executing a program with the intent of finding an error. Testing presents an interesting anomaly for the software engineer. The goal of the software testing is to convince
system developer and customers that the software is good enough for operational use. Testing is
a process intended to build confidence in the software.
8.3 Path Testing:
Basis path testing is carried out using the following steps:
1. Use the design of the code and draw the corresponding flow graph.
2. Determine the cyclomatic complexity of the resultant flow graph using one of the formulas
V(G) = E - N + 2, or
V(G) = P + 1, or
V(G) = number of regions,
where V(G) is the cyclomatic complexity, E is the number of edges, N is the number of flow graph nodes, and P is the number of predicate nodes.
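As a purely illustrative, assumed example (the numbers are not taken from this project's code): for a flow graph with E = 11 edges, N = 9 nodes and P = 3 predicate nodes, the formulas agree:
V(G) = E - N + 2 = 11 - 9 + 2 = 4
V(G) = P + 1 = 3 + 1 = 4
so four linearly independent paths would need to be identified and covered by test cases.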
CHAPTER 9: CONCLUSION AND FUTURE SCOPE
9.1 CONCLUSION
to efficiently implement the data processing pipeline and provision the data to production. Start
small and grow incrementally with a data platform and architecture that enable you to build once
and deploy wherever it makes sense using Hadoop or other systems, on premise or in the
cloud.
9.1.3 Plan Availability of Skills and Resources Before You Get Started
One of the constraints of deploying Hadoop is the lack of enough trained personnel resources.
There are many projects and sub-projects in the Apache ecosystem, making it difficult to stay
abreast of all of the changes. Consider a platform approach to hide the complexity of the
underlying technologies from analysts and other line of business users.
9.1.4 Prepare to Deliver Trusted Data for Areas That Impact Business Insight and
Operations
Compared to the decades of feature development by relational and transactional systems,
current-generation Hadoop offers fewer capabilities to track metadata, enforce data governance,
verify data authenticity, or comply with regulations to secure customer non-public information.
The Hadoop community will continue to introduce improvements and additions (for example, HCatalog is designed for metadata management), but it takes time for those features to be developed, tested, and validated for integration with third-party software. Hadoop is not a
replacement for master data management (MDM): lumping data from disparate sources into a
Hadoop data bag does not by itself solve broader business or compliance problems with
inconsistent, incomplete or poor quality data that may vary by business unit or by geography.
You can anticipate that data will require cleansing and matching for reporting and analysis.
Consider your end-to-end data processing pipeline, and determine your needs for security,
cleansing, matching, integration, delivery and archiving. Adhere to a data governance program to
deliver authoritative and trustworthy data to the business, and adopt metadata-driven audits to
add transparency and increase efficiency in development.
9.1.5 Adopt Lean and Agile Integration Principles
To transfer data between Hadoop and other elements of your data architecture, the HDFS API
provides the core interface for loading or extracting data. Other useful tools include Chukwa,
Scribe or Flume for the collection of log data, and Sqoop for data loading from or to relational
databases. Hive enables ad hoc query and analysis of data in HDFS using a SQL interface.
Informatica PowerCenter version 9.1 includes connectivity for HDFS, to load data into Hadoop
or extract data from Hadoop.
CHAPTER 10: REFERENCES
1. http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
2. http://theglobaljournals.com/gra/file.php?val=February_2013_1360851170_47080_37.pdf
3. http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html
4. http://www.bigdatacompanies.com/5-big-disadvantages-of-hadoop-for-big-data/
5. http://www.havoozacademy.org
6. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6223552