
IF 7202 Data Science and Analytics
UNIT V
FRAMEWORKS AND VISUALIZATION
HDFS

KARTHIKA .RN
Research Scholar-CT
Introduction To Big Data
Big data is a buzzword, or catch-phrase, meaning a massive
volume of both structured and unstructured data that is so
large it is difficult to process using traditional database and
software techniques.
Systematic investigations of big data are commonly summarized
into the 7 Vs:
o Volume
o Velocity
o Variety
o Veracity
o Visualization
o Value
o Variability
Expanding 3Vs
Introduction To Hadoop
Hadoop (also known as Apache Hadoop) is an open source,
Java-based programming framework that supports the
processing of large data sets in a distributed computing
environment.
Components Of Hadoop

The four components form the basic Hadoop framework:


Hadoop Common- A set of common libraries and utilities used by
other Hadoop modules.
MapReduce- Executes a wide range of analytic functions by
analyzing datasets in parallel before reducing the results.
The Map job distributes a query to different nodes.
The Reduce gathers the results and resolves them into a single
value.
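The Map and Reduce steps described above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for this example, and each input chunk merely stands in for the data held by one node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, between the Map and Reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: resolve all values for one key into a single value."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for the part of the data set held by one node.
    emitted = []
    for chunk in chunks:
        emitted.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is distributed"]))
```

In a real cluster the map calls would run in parallel on different nodes; here they run in a loop only to keep the sketch self-contained.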
Components Of Hadoop
HDFS- The default storage layer for Hadoop.
YARN- Present in version 2.0 onwards, YARN is the cluster
management layer of Hadoop.
Prior to 2.0, MapReduce was responsible for cluster management
as well as processing.
The inclusion of YARN means you can run multiple applications in
Hadoop (so you're no longer limited to MapReduce), which all share
common cluster management.
1st Generation Hadoop: Batch Focus
Hadoop 1 Architecture
JobTracker
Manage Cluster Resources & Job Scheduling
TaskTracker
Per-node agent

Manage Tasks
Hadoop 1 Limitations
Lacks Support for Alternate Paradigms and Services
Forces everything to look like MapReduce
Iterative applications in MapReduce are 10x slower
Scalability
Max Cluster size ~5,000 nodes
Max concurrent tasks ~40,000
Availability
Failure Kills Queued & Running Jobs
Hard partition of resources into map and reduce slots
Non-optimal Resource Utilization
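The effect of a hard partition into map and reduce slots can be illustrated with a little arithmetic. The sketch below is only an illustration: the slot counts and the workload are hypothetical numbers, and the two helper functions are invented for this example rather than taken from Hadoop.

```python
def slots_in_use(pending_maps, pending_reduces, map_slots, reduce_slots):
    """Fixed partition: map tasks use only map slots, reduce tasks only reduce slots."""
    return min(pending_maps, map_slots) + min(pending_reduces, reduce_slots)


def containers_in_use(pending_maps, pending_reduces, total_containers):
    """Shared pool: any free unit of capacity can run any pending task."""
    return min(pending_maps + pending_reduces, total_containers)


if __name__ == "__main__":
    # Hypothetical node with 4 map slots + 2 reduce slots = 6 units of capacity.
    # Hypothetical map-heavy phase: 10 map tasks waiting, no reduce tasks yet.
    fixed = slots_in_use(10, 0, map_slots=4, reduce_slots=2)
    pooled = containers_in_use(10, 0, total_containers=6)
    print(f"fixed slots busy: {fixed}/6, shared pool busy: {pooled}/6")
```

With these assumed numbers the two reduce slots sit idle during the map-heavy phase, which is the kind of non-optimal utilization the slide refers to.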
Hadoop as Next-Gen Platform
Hadoop 2 - YARN Architecture
ResourceManager (RM)
Central agent - Manages and allocates cluster resources
NodeManager (NM)
Per-node agent - Manages and enforces node resource allocations
ApplicationMaster (AM)
Per-application - Manages application lifecycle and task scheduling
[Figure: YARN architecture. Components: Client, Resource Manager, Node Manager, App Mstr, Container. Message flows: Job Submission, Node Status, Resource Request, MapReduce Status.]
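The division of labour between the three YARN roles can be sketched as a toy model. This is an assumption-laden illustration, not the YARN API: the classes below only borrow the role names, memory is the single resource tracked, and the first-fit allocation policy, node names, and sizes are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: enforces the resource limit of its own node."""
    name: str
    capacity_mb: int
    used_mb: int = 0

    def launch_container(self, memory_mb):
        if self.used_mb + memory_mb > self.capacity_mb:
            return False  # request would exceed this node's allocation
        self.used_mb += memory_mb
        return True


@dataclass
class ResourceManager:
    """Central agent: decides which node serves each resource request."""
    nodes: list = field(default_factory=list)

    def allocate(self, memory_mb):
        for node in self.nodes:  # first-fit policy, chosen only for simplicity
            if node.launch_container(memory_mb):
                return node.name
        return None  # no node has enough free memory


@dataclass
class ApplicationMaster:
    """Per-application agent: requests containers for its own tasks."""
    app_name: str
    rm: ResourceManager

    def run_tasks(self, task_count, memory_mb):
        return [self.rm.allocate(memory_mb) for _ in range(task_count)]


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 1024)])
    am = ApplicationMaster("wordcount", rm)
    print(am.run_tasks(task_count=4, memory_mb=1024))
```

In this toy run the fourth request returns None because both hypothetical nodes are full; a real ResourceManager would queue the request instead.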
YARN Application Lifecycle
YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place

Interact with that data in MULTIPLE WAYS

with Predictable Performance and Quality of Service


5 Key Benefits of YARN
YARN Eco-system
Applications Powered by YARN
Apache Giraph - Graph Processing
Apache Hama - BSP
Apache Hadoop MapReduce - Batch
Apache Tez - Batch/Interactive
Apache S4 - Stream Processing
Apache Samza - Stream Processing
Apache Storm - Stream Processing
Apache Spark - Iterative applications
Elastic Search - Scalable Search
Cloudera Llama - Impala on YARN
DataTorrent - Data Analysis
HOYA - HBase on YARN
Frameworks Powered By YARN
Apache Twill
REEF by Microsoft
Spring support for Hadoop 2
There's an app for that... YARN App Marketplace!
Apache YARN
The Data Operating System for Hadoop 2.0
Flexible - Enables other purpose-built data processing models beyond
MapReduce (batch), such as interactive and streaming
Efficient - Increase processing IN Hadoop on the same hardware while
providing predictable performance & quality of service
Shared - Provides a stable, reliable, secure foundation and shared
operational services across multiple workloads
Hadoop Distributed File System
Hadoop is ideal for storing large amounts of data, like
terabytes and petabytes, and uses HDFS as its storage
system.
HDFS lets you connect nodes contained within clusters
over which data files are distributed.
HDFS can access and store the data files as one seamless
file system. Access to data files is handled in
a streaming manner, meaning that applications or
commands are executed directly using the MapReduce
processing model.
HDFS is fault tolerant and provides high-throughput
access to large data sets.
Overview of HDFS
HDFS has many similarities with other distributed file
systems, but is different in several respects.
One noticeable difference is HDFS's write-once-read-many
model, which relaxes concurrency control
requirements, simplifies data coherency, and enables
high-throughput access.
Another distinctive attribute of HDFS is the viewpoint that it is
usually better to locate processing logic near the data
rather than moving the data to the application space.
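The idea of placing processing near the data can be sketched as a scheduling preference. The function below is a hypothetical illustration, not Hadoop's scheduler: pick_node and the node names are invented, and the only rule modelled is "prefer a free node that already stores the data".

```python
def pick_node(nodes_with_data, free_nodes):
    """Prefer a free node that already stores the data; otherwise use any free node."""
    for node in free_nodes:
        if node in nodes_with_data:
            return node, "local"   # logic moves to the data, no transfer needed
    if free_nodes:
        return free_nodes[0], "remote"  # data must be moved to the logic
    return None, "none"


if __name__ == "__main__":
    # Hypothetical cluster: the data lives on n1 and n3; n2 and n3 are idle.
    print(pick_node({"n1", "n3"}, ["n2", "n3"]))
    # Same data, but only n2 is idle, so the data has to travel.
    print(pick_node({"n1", "n3"}, ["n2"]))
```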
Overview of HDFS
HDFS rigorously restricts data writing to one writer
at a time. Bytes are always appended to the end of a
stream, and byte streams are guaranteed to be stored
in the order written.
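The single-writer, append-only rule can be modelled with a small in-memory class. This is a toy sketch under stated assumptions, not the HDFS client API: AppendOnlyFile, its method names, and the client identifiers are all invented for illustration.

```python
class AppendOnlyFile:
    """Toy model of a file that allows one writer at a time and appends only."""

    def __init__(self):
        self._data = bytearray()
        self._writer = None  # client currently allowed to write, if any

    def open_for_write(self, client):
        if self._writer is not None:
            raise PermissionError(f"{self._writer} is already writing this file")
        self._writer = client

    def append(self, client, payload):
        if client != self._writer:
            raise PermissionError(f"{client} is not the current writer")
        self._data.extend(payload)  # bytes always go to the end of the stream

    def close(self, client):
        if client == self._writer:
            self._writer = None

    def read(self):
        # Any number of readers may read; bytes come back in the order written.
        return bytes(self._data)


if __name__ == "__main__":
    f = AppendOnlyFile()
    f.open_for_write("client-a")
    f.append("client-a", b"first,")
    f.append("client-a", b"second")
    print(f.read())
    try:
        f.open_for_write("client-b")  # second writer is rejected
    except PermissionError as err:
        print("rejected:", err)
```

Because there is no method that overwrites or inserts in the middle, readers never have to coordinate with the writer beyond seeing a stream that only grows.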
Goals Of HDFS
Fault tolerance by detecting faults and applying quick,
automatic recovery.
Data access via MapReduce streaming.
Simple and robust coherency model.
Processing logic close to the data, rather than the data
close to the processing logic.
Portability across heterogeneous commodity hardware
and operating systems.
Goals Of HDFS
Scalability to reliably store and process large amounts of
data.
Economy by distributing data and processing across
clusters of commodity personal computers.
Efficiency by distributing data and logic to process it in
parallel on nodes where data is located.
Reliability by automatically maintaining multiple copies
of data and automatically redeploying processing logic in
the event of failures.
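The reliability goal (multiple copies, automatic recovery) can be sketched as follows. This is a simplified illustration, not HDFS's placement algorithm: real HDFS placement is rack-aware, whereas this sketch uses round-robin; the function names, block names, and node names are assumptions, and the replication factor of 3 mirrors the usual HDFS default.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def re_replicate(placement, failed, nodes, replication=3):
    """After a node failure, copy under-replicated blocks to surviving nodes."""
    alive = [n for n in nodes if n != failed]
    for holders in placement.values():
        holders[:] = [n for n in holders if n != failed]  # drop the lost copy
        for candidate in alive:
            if len(holders) >= replication:
                break
            if candidate not in holders:
                holders.append(candidate)  # create a new copy elsewhere
    return placement


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]  # hypothetical four-node cluster
    placement = place_replicas(["b0", "b1"], nodes)
    print("before failure:", placement)
    print("after n2 fails:", re_replicate(placement, "n2", nodes))
```

The sketch assumes the cluster has at least as many nodes as the replication factor; with fewer nodes a block would simply stay under-replicated.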
Commonly used HDFS Commands
mkdir: Creates a directory at the given path.
syntax: bin/hdfs dfs -mkdir <path>
ls: Lists the files at a given path.
syntax: bin/hdfs dfs -ls <args>
cp: Copies files from source to destination. This
command also allows multiple sources, in which
case the destination must be a directory.
syntax: bin/hdfs dfs -cp <source> <dest>
put: Copies local files into the distributed file system.
syntax: bin/hdfs dfs -put <path>/* /filename(ip)
Copy the output files from the distributed file system to
the local file system and examine them:
(i) bin/hdfs dfs -get /output output
(ii) cat output/*
(or)
View the output files on the distributed file system:
bin/hdfs dfs -cat /output/*
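The commands above can also be scripted. The sketch below assembles the same `hdfs dfs` invocations from Python and, by default, only prints them (a dry run) so it can be tried without a cluster. The helper names hdfs_dfs and run, the /input directory, and the local_data/part1.txt file are assumptions made for illustration; only the -mkdir, -put, -ls and -cat subcommands come from the slides.

```python
import shlex
import subprocess

HDFS_BIN = "bin/hdfs"  # path used in the commands above; adjust for your installation


def hdfs_dfs(*args):
    """Build the argument list for one `hdfs dfs` command."""
    return [HDFS_BIN, "dfs", *args]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if the command fails


if __name__ == "__main__":
    # /input and local_data/part1.txt are hypothetical paths for illustration.
    steps = [
        hdfs_dfs("-mkdir", "/input"),
        hdfs_dfs("-put", "local_data/part1.txt", "/input"),
        hdfs_dfs("-ls", "/input"),
        hdfs_dfs("-cat", "/input/part1.txt"),
    ]
    for cmd in steps:
        run(cmd)
```

Passing dry_run=False executes each command through subprocess, which requires a working Hadoop installation at the assumed path.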
