

Working with Big Data

 Introduction to Big Data
What is Hadoop
 Need for HDFS and MapReduce in Hadoop
 Why RDBMS Can't Be Used
 Google File System (GFS)
 Hadoop Distributed File System(HDFS)
 Building Blocks of Hadoop
 Introducing and Configuring Hadoop Cluster
 Configuring XML Files
Introduction to Big Data

Today we are surrounded by data. Large amounts of data are generated
from many sources. Consider the following:
 Facebook hosts more than 240 billion photos and handles 695,000 status updates,
growing its data at a rate of 7 petabytes per month
 On Twitter, 98,000+ tweets are posted every second
 A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time.
With many thousands of flights per day, the data generated reaches many petabytes
 The New York Stock Exchange generates about 4-5 terabytes of new trade data per day
 Every hour, Walmart handles more than 1 million customer transactions
 Every day we create 2.5 quintillion bytes of data
 90% of the data in the world today has been created in the last two years alone
The exponential growth of data poses challenges to companies
such as Google, Amazon, Yahoo and so on.
All these companies are interested in storing and analysing such
large amounts of data, because it adds significant value to their business.
Big Data analysis helps organizations identify new
opportunities, which in turn lead to higher profits and happier customers.
What is Big Data
Big Data is a term for datasets that are so large and complex that
traditional data processing applications are inadequate to handle them.
According to Gartner, "Big data is high-volume, high-velocity and/or
high-variety information assets that demand cost-effective,
innovative forms of information processing that enable enhanced
insight, decision making, and process automation."
Characteristics of Big Data (5 V’s)
Elements of Big Data:
Volume (Scale): It refers to the amount of data generated by organizations or
individuals. Dataset sizes range from terabytes to zettabytes.
For example: the total data stored on the Internet has crossed around 1 yottabyte.
Velocity (Speed): It refers to the rate at which data is generated, captured and processed.
Sources of high-velocity data are: social media, portable devices, IT devices, etc.
Variety (Different Data Formats): It refers to the type and nature of data. Data
generated from multiple sources is heterogeneous in nature.
It includes:
◦ Structured Data - it follows a pre-defined schema, Ex:- RDBMS
◦ Semi-Structured Data - it has a schema-less or self-describing structure
◦ Unstructured Data - there is no identifiable structure to this kind of data,
Ex:- e-mail messages, Word documents, videos, photos, audio files and so on
Veracity (Trustworthiness) - It refers to the inconsistency and uncertainty of data,
i.e. whether the obtained data is consistent and correct or not.
Value - After taking the 4 V's into account there comes one more V, which stands
for Value. Bulk data having no value is of no good to the company, unless
you turn it into something useful.
Sources of Big Data
 Social Data - refers to information collected from various social networking sites
and online portals. Sources: Facebook, Twitter and LinkedIn
 Machine Data - refers to information collected from RFID chips, bar code scanners
and sensors. Sources: RFID chips, GPS devices
 Transactional Data - information collected from online shopping sites and
retailers. Sources: retail websites like eBay and Amazon
Challenges of Big Data
 Capturing
 Storing
 Analysis
 Searching
 Sharing
 Transferring
 Visualization
 Querying and Information Privacy
Big Data Analytics
Big data analytics is the process of examining large amounts of data to discover hidden patterns,
market trends and consumer preferences, for the benefit of organizational decision making.
It is the process of extracting useful information by analysing different types of
big data sets.
Types of Analytics:
1. Descriptive Analytics - what happened in the business
2. Predictive Analytics - what could happen
3. Prescriptive Analytics- what should we do
Problems with Traditional Approach
1. In the traditional approach, the main issue was handling the heterogeneity of
data, i.e. structured, semi-structured and unstructured. An RDBMS (SQL)
focuses mostly on structured data, such as XML documents or database tables
that conform to a particular schema. But most big data is semi-structured
or unstructured, which SQL cannot work with.
2. Scale-Up Instead of Scale-Out:
 Scaling a commercial relational database is expensive because it uses vertical
scalability (scale up)
 Vertical scalability means adding capacity to a single machine
 With horizontally scalable (scale-out) systems, it is possible to add more
machines to the cluster
 Hadoop uses horizontal scalability, hence it is cost effective
3. Data Storage and Analysis:
◦ The problem is simple: although the storage capacities of hard disk
drives have increased, the disk access (transfer) speed, the speed at
which data can be read from a drive, has not kept pace, so processing
takes enormous time.
◦ For example, terabyte drives are common nowadays, but at a transfer
speed of around 100 MB/s it takes more than two and a half hours to
read all the data from one drive. One way to reduce the time is
to read from multiple disks at once; working in parallel across 100
drives, we could read the data in under two minutes.
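The arithmetic above can be checked with a short sketch (assuming 1 TB = 1,000,000 MB and the 100 MB/s transfer rate quoted in the text):

```python
# Back-of-the-envelope: time to read 1 TB at 100 MB/s,
# serially vs. striped across 100 disks read in parallel.
DRIVE_SIZE_MB = 1_000_000        # 1 TB expressed in MB
TRANSFER_RATE_MBPS = 100         # typical disk transfer speed

serial_seconds = DRIVE_SIZE_MB / TRANSFER_RATE_MBPS
print(f"One disk:  {serial_seconds / 3600:.1f} hours")    # ~2.8 hours

NUM_DISKS = 100
parallel_seconds = serial_seconds / NUM_DISKS
print(f"100 disks: {parallel_seconds / 60:.1f} minutes")  # ~1.7 minutes
```

This is exactly the bet Hadoop makes: trade many cheap disks read in parallel for the transfer-speed limit of any single drive.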
4. Declarative Query Instead of Functional Programming:
• SQL is fundamentally a high-level declarative language: we query data by
stating the result we want. Under MapReduce (in Hadoop), we specify the
actual steps to process the data
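To illustrate the contrast, the classic word-count job can be sketched in the MapReduce style in plain Python (a toy model, no Hadoop required; map emits (word, 1) pairs and reduce sums them, which is how the steps are spelled out explicitly rather than declared as in SQL):

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(pairs))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In real Hadoop the framework shuffles the map output between the two phases and runs each phase on many nodes; the programmer writes only these two functions.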

5. Availability of data:
• Data in an RDBMS must be relational; Big Data need not be relational
• In traditional relational databases you have to pre-process the data before using it;
availability of data matters more for Hadoop
• Relational data is often normalized to remove redundancy
The solution to the above problems is Hadoop. It provides a
reliable data storage and analysis system.
The storage is provided by HDFS and the analysis by MapReduce.
What is Hadoop
1. Hadoop is an open-source distributed processing framework that
manages data processing and storage for big data applications
2. It is designed to run on clusters of commodity hardware (it
doesn't require expensive, highly reliable hardware)
3. It is a project of the Apache Software Foundation; it was created by
computer scientists Doug Cutting and Mike Cafarella
4. The emphasis is on high throughput of data access rather than
low latency of data access
5. Hadoop provides an interface for applications to move
themselves closer to where the data is located
6. After Google published technical papers detailing its Google
File System (GFS) and MapReduce programming framework in
2003 and 2004, respectively, Cutting and Cafarella modified
earlier technology plans and developed a Java-based
MapReduce implementation and a file system modeled on
Google's, called the "Hadoop Framework"
Features of Hadoop
1. It is completely open source and written in Java
2. Computing power: it uses a distributed computing framework designed
to provide rapid data access across the nodes in a cluster
3. Cost effective: it doesn't require an expensive, highly reliable network,
i.e. it runs on clusters of commodity servers and can scale up to support
thousands of hardware nodes and massive amounts of data
4. Stores and processes huge amounts of any kind of data, quickly: it handles
unstructured and semi-structured data
5. Fault tolerant: applications can continue to run if individual nodes fail
6. Storage capacity: scales linearly; cost is not exponential
7. Reliable: data is replicated and failed tasks are re-run, so there is no
need to maintain separate backups of data
8. Used for batch processing (not for online [real-time] analytical processing)
Hadoop Ecosystem
Components of Hadoop
Hadoop Framework includes the following four modules:
1. Hadoop Distributed File System (HDFS) - a distributed file system that provides high-throughput
access and storage for application data
2. Hadoop MapReduce (programming aspect) - a YARN-based system for parallel
processing of large data sets
3. Hadoop YARN - a framework for job scheduling and cluster resource management
4. Hadoop Common - the Java libraries and utilities required by other Hadoop modules.
These libraries provide file system and OS-level abstractions and contain the necessary Java
files and scripts required to start Hadoop
File system
 A file system is used to control how data is stored and retrieved. Without a file system,
information placed in a storage area would be one large body of data with no way to tell where
one piece of information stops and the next begins.
 By separating the data into pieces and giving each piece a name, the information is easily
isolated and identified. Taking its name from the way paper-based information systems are
named, each group of data is called a "file".
 The structure and logic rules used to manage the groups of information are called a "file system".
 The file system manages access to both the content of files and the metadata about those files.
 It is responsible for arranging storage space; reliability, efficiency, and tuning with regard to the
physical storage medium are important design considerations.
Distributed File System(DFS)
 A distributed file system is a client/server-based application that allows clients to access and
process data stored on the server as if it were on their own computer.
 When a user accesses a file on the server, the server sends the user a copy of the file, which
is cached on the user's computer while the data is being processed and is then returned to the server.
 A distributed file system (DFS) makes files distributed across multiple servers appear
to users as if they reside in one place on the network.
Benefits of a DFS include:
 Resource management and accessibility
 Fault tolerance
 Workload management
Google File System(GFS)
• Google File System is a proprietary distributed file system
developed by Google for its own use.
• It is designed to provide efficient, reliable data access using large
clusters of commodity hardware.
• GFS was implemented especially to meet the rapidly growing
demands of Google's data processing needs.
• A new version of the Google File System, codenamed Colossus,
was released in 2010.
GFS Features Include:
Fault tolerance
Critical data replication
Automatic and efficient data recovery
High aggregate throughput
Reduced client and master interaction because of the large chunk size
Namespace management and locking
High availability
Traditional Design Issues of a DFS
 Performance
 Scalability
 Reliability
 Availability
Google File System(GFS) Architecture
• A GFS cluster consists of multiple nodes. These nodes are divided into two types:
a Master node
a large number of Chunk Servers
• Each file is divided into fixed-size chunks of 64 megabytes, and the chunk servers
store these chunks as Linux files on local disks.
• Each chunk is assigned a unique 64-bit label by the master node at the time of
creation, to maintain the logical mapping from files to constituent chunks.
• Each chunk is replicated several times throughout the network, with the minimum
being 3 (the replication factor), and even more for files that are in high demand.
• Files are stored in hierarchical directories identified by path names.
• The master server does not usually store the actual chunks, but rather all the
metadata associated with the chunks, such as the namespace, access control
data, and the tables mapping 64-bit labels to chunk locations.
• All this metadata kept by the master server is updated by receiving updates from
the chunk servers (heart-beat messaging).
• GFS client code linked into each application implements the file system API
and communicates with the master and chunk servers to read or write data
on behalf of the application. Clients interact with the master for metadata
operations, but all data-bearing communication goes directly to the chunk servers.
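The master's bookkeeping described above can be sketched as a toy model (not the real GFS API; the chunk size is shrunk to bytes for illustration, labels are drawn at random, and the server names are made up):

```python
import random

CHUNK_SIZE = 64          # bytes here for illustration; 64 MB in real GFS
REPLICATION_FACTOR = 3   # GFS minimum number of replicas per chunk
chunk_servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]  # hypothetical servers

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Split a file's bytes into fixed-size chunks, as GFS does."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def assign_metadata(chunks):
    """Master's role: give each chunk a unique 64-bit label and pick
    REPLICATION_FACTOR distinct chunk servers to hold its replicas."""
    table = {}
    for chunk in chunks:
        label = random.getrandbits(64)                    # 64-bit chunk handle
        replicas = random.sample(chunk_servers, REPLICATION_FACTOR)
        table[label] = replicas
    return table

data = b"x" * 200                  # a 200-byte "file" -> 4 chunks
chunks = split_into_chunks(data)
metadata = assign_metadata(chunks)
print(len(chunks))                 # 4 chunks of up to 64 bytes each
for label, replicas in metadata.items():
    print(f"chunk {label:#x} -> {replicas}")
```

Note that, as in the text, only the label-to-location table lives on the master; the chunk bytes themselves would sit on the chunk servers.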
Hadoop Distributed File System(HDFS)
HDFS is a distributed file system for Hadoop, designed for storing very large files with streaming
data access patterns, running on cluster of commodity hardware
Features of HDFS:
 It has a master-slave architecture
 HDFS stores files in blocks with a default size of 128 MB, much larger than the
4-32 KB blocks seen in most file systems
 HDFS is optimized for throughput over latency
 It is very efficient at streaming read requests for large files but poor at seek
requests for many small files
 The built-in servers of the Namenode and Datanodes help users easily check the status of the cluster
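Block-based storage can be illustrated with quick arithmetic, assuming the 128 MB default block size mentioned above:

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def hdfs_blocks(file_size_mb):
    """Return the number of HDFS blocks a file occupies
    and the size of the (possibly partial) last block."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    last_block = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB
    return num_blocks, last_block

# A 300 MB file occupies 3 blocks: 128 MB + 128 MB + 44 MB.
# A partial final block consumes only its actual size on disk.
print(hdfs_blocks(300))  # (3, 44)
```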
HDFS Architecture
Building Blocks or Daemons of Hadoop

o On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the
different servers in our network. These daemons have specific roles; some exist only on one server, some exist
across multiple servers. The daemons include:
 NameNode (Master Node)
 DataNode (Slave Node)
 Secondary NameNode (Checkpoint Node)
 Job Tracker
 Task Tracker
Note: A daemon is a computer program that runs as a background process, rather than being under the
direct control of an interactive user
NameNode
 A Hadoop cluster consists of a single Namenode
 The Namenode is the master of HDFS that directs the slave
Datanodes to perform low-level I/O tasks
 The Namenode doesn't store any user data or perform any computation
 It is the bookkeeper of HDFS, i.e. it keeps track of file metadata:
how files are broken down into file blocks, which nodes store
those blocks, and the overall health of the distributed file system
 It regulates client access to files

 The Namenode executes file system operations like opening, closing and
renaming files and directories
 It determines the mapping of blocks to Datanodes
 There is unfortunately a negative aspect of the Namenode: it is a single
point of failure of the Hadoop cluster
 For any of the other daemons, if their host nodes fail for software or hardware
reasons, the Hadoop cluster will likely continue to function smoothly, or you
can quickly restart it. Not so for the NameNode.
DataNode
 Each slave node in the cluster is a Datanode, which performs reading and writing of HDFS blocks to actual
files on the local file system
 When a client wants to read or write an HDFS file, the file is broken into blocks and the Namenode tells
the client on which Datanode each block resides, so that the client communicates directly with the
Datanodes to process the local files
 A Datanode may further communicate with other Datanodes to replicate its data blocks for redundancy
 HDFS maintains 3 replicas of each block, placed on three different nodes. This ensures
that if any node crashes, you are still able to read the file
 Upon initialization, each Datanode informs the Namenode which blocks it currently stores.
After this mapping is complete, Datanodes continually report to the Namenode regarding
local changes
Namenode and Datanode Interaction in HDFS
 During normal operation, each Datanode sends heartbeats to the Namenode
to confirm that the Datanode is operating correctly
 The default heartbeat interval is three seconds. If the Namenode doesn't receive a
heartbeat from a Datanode in ten minutes, the Namenode considers the
Datanode to be out of service. The Namenode then schedules the creation of new
replicas of that Datanode's blocks on other nodes
 The NameNode keeps track of the file metadata: which files are in the system and
how each file is broken down into blocks. The DataNodes provide backup storage of the
blocks and constantly report to the NameNode to keep the metadata in an updated state
Datanode Sends Heartbeats to Namenode
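The heartbeat/timeout logic can be sketched as a simplified model (the real decision lives inside Hadoop's Java Namenode; the 3-second and 10-minute values come from the text above, and the node names are hypothetical):

```python
HEARTBEAT_INTERVAL = 3    # seconds, HDFS default
DEAD_NODE_TIMEOUT = 600   # 10 minutes without a heartbeat

class NamenodeMonitor:
    """Tracks the last heartbeat time of each Datanode."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, datanode, now):
        """Record a heartbeat from a Datanode at time `now` (seconds)."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Datanodes silent longer than the timeout are out of service;
        the Namenode would then re-replicate their blocks elsewhere."""
        return [dn for dn, t in self.last_seen.items()
                if now - t > DEAD_NODE_TIMEOUT]

monitor = NamenodeMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=595)   # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=601))  # ['dn2']
```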
Secondary Namenode
• The Namenode stores the HDFS file system information in a file named fsimage
• Updates to the file system do not update the fsimage file directly, but
instead are logged into an edits file
• When restarting, the Namenode reads the fsimage and then applies all
the changes from the log file to bring the system up to date in memory. This
process takes time
• The Secondary Namenode's job is not to be secondary to the Namenode, but
to periodically read the file system change log and apply the changes to
the fsimage file, thus bringing it up to date
• Since the Namenode is a single point of failure for a Hadoop cluster,
the Secondary Namenode's snapshots help to minimize the
downtime and loss of data. Hence the Secondary Namenode
is treated as a Checkpoint node.
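The checkpoint idea can be sketched as a toy model of an fsimage plus an edit log (the real file and record formats are Hadoop internals; the paths and operations here are made up for illustration):

```python
def apply_edit(fsimage, edit):
    """Replay one logged change onto the in-memory file-system image."""
    op, path = edit
    files = set(fsimage)
    if op == "create":
        files.add(path)
    elif op == "delete":
        files.discard(path)
    return files

def checkpoint(fsimage, edit_log):
    """Secondary Namenode's job: fold the edit log into the fsimage
    so a restarting Namenode has little left to replay."""
    for edit in edit_log:
        fsimage = apply_edit(fsimage, edit)
    return fsimage, []   # fresh image, truncated log

fsimage = {"/a.txt", "/b.txt"}
edit_log = [("create", "/c.txt"), ("delete", "/a.txt")]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(sorted(fsimage))   # ['/b.txt', '/c.txt']
```

The point of the design shows up in the last line: after a checkpoint the edit log is empty, so a Namenode restart only has to load the compact image instead of replaying a long history of changes.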
Job Tracker
• The Job Tracker is the liaison between a client application and Hadoop
• The Job Tracker runs on a server as a master node of the cluster
• Once code is submitted, the Job Tracker determines the execution plan:
• Which files to process
• Assigning nodes to different tasks
• Monitoring all running tasks
• If any task fails, the Job Tracker automatically relaunches the task on a different node
• There is only one Job Tracker daemon per Hadoop cluster
Task Tracker
 The Job Tracker is the master for the overall execution of a MapReduce job, and the Task
Trackers manage the execution of individual tasks on each slave node
 Each Task Tracker is responsible for executing the individual tasks that the Job Tracker assigns
 Although there is a single Task Tracker per slave node, each Task Tracker can spawn multiple
JVMs to handle many map or reduce tasks in parallel
 One responsibility of the Task Tracker is to constantly communicate with the Job Tracker. If the
Job Tracker fails to receive a heartbeat from a Task Tracker within a specified amount of time,
it will assume the Task Tracker has crashed and will resubmit the corresponding tasks to other
nodes in the cluster
Job and Task Tracker Interaction