Analytics
UNIT V
FRAMEWORKS AND VISUALIZATION
HDFS
KARTHIKA .RN
Research Scholar-CT
Introduction To Big Data
Big data is a buzzword, or catch-phrase, meaning a massive
volume of both structured and unstructured data that is so
large it is difficult to process using traditional database and
software techniques.
Systematic studies of big data commonly summarize its
characteristics as the 7 Vs:
o Volume
o Velocity
o Variety
o Veracity
o Visualization
o Value
o Variability
Expanding 3Vs
Introduction To Hadoop
Hadoop (also known as Apache Hadoop) is an open source,
Java-based programming framework that supports the
processing of large data sets in a distributed computing
environment.
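Hadoop's original processing model is MapReduce. The following is a minimal single-process sketch of that model in plain Python, not Hadoop's actual Java API. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Each phase only needs its own slice of the data, which is what lets a framework spread the work across a cluster.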
Components Of Hadoop
Manage Tasks
Hadoop 1 Limitations
Lacks Support for Alternate Paradigms and Services
Forces everything to look like MapReduce
Iterative applications in MapReduce are 10x slower
Scalability
Max Cluster size ~5,000 nodes
Max concurrent tasks ~40,000
Availability
Failure Kills Queued & Running Jobs
Hard partition of resources into map and reduce slots
Non-optimal Resource Utilization
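The utilization problem caused by the hard map/reduce slot partition can be illustrated with a small calculation. This is a toy model under stated assumptions, not Hadoop code: every task needs exactly one slot, and the function names and the slot counts in the example are hypothetical.

```python
def running_with_fixed_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Tasks that can run when map and reduce slots cannot be exchanged."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def running_with_shared_pool(map_tasks, reduce_tasks, total_slots):
    """Tasks that can run when any free slot may host any task."""
    return min(map_tasks + reduce_tasks, total_slots)


def utilization(running, capacity):
    """Fraction of the cluster capacity that is doing work."""
    return running / capacity


if __name__ == "__main__":
    # Hypothetical cluster: 10 slots, split 6 map / 4 reduce.
    # Workload: 10 pending map tasks and no reduce tasks yet.
    fixed = running_with_fixed_slots(10, 0, map_slots=6, reduce_slots=4)
    shared = running_with_shared_pool(10, 0, total_slots=10)
    print("fixed slots:", utilization(fixed, 10))
    print("shared pool:", utilization(shared, 10))
```

In this example the four reduce slots sit idle while map tasks wait, which is the kind of waste the slot partition produces.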
Hadoop as Next-Gen Platform
Hadoop 2 - YARN Architecture
ResourceManager (RM)
Central agent - Manages and allocates cluster resources
NodeManager (NM)
Per-Node agent
ApplicationMaster (App Mstr)
Manages application lifecycle and task scheduling
[Figure: YARN architecture diagram. Labels: Node Manager, App Mstr, MapReduce Status, Job Submission, Node Status, Resource Request]
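The division of roles in the YARN architecture can be sketched as a toy model. This is an illustration only and does not use the real YARN API: the class and method names below are hypothetical, memory is the only resource considered, and the first-fit placement rule is an assumption, not YARN's actual scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: reports how much memory its node still has free."""
    name: str
    free_mb: int

    def node_status(self):
        return {"node": self.name, "free_mb": self.free_mb}


@dataclass
class ResourceManager:
    """Central agent: allocates cluster resources across all nodes."""
    nodes: list = field(default_factory=list)

    def allocate(self, request_mb):
        """Return the name of the first node that fits the request, or None."""
        for node in self.nodes:
            if node.free_mb >= request_mb:
                node.free_mb -= request_mb
                return node.name
        return None


@dataclass
class ApplicationMaster:
    """Per-application agent: requests resources for each of its tasks."""
    rm: ResourceManager

    def run_tasks(self, task_sizes_mb):
        """Send one resource request per task and record where it was placed."""
        return [(size, self.rm.allocate(size)) for size in task_sizes_mb]


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("n1", 2048), NodeManager("n2", 1024)])
    am = ApplicationMaster(rm)
    for size, node in am.run_tasks([1024, 1024, 1024, 1024]):
        print(size, "MB ->", node)
    print([n.node_status() for n in rm.nodes])
```

The point of the split is that the ResourceManager only tracks resources, while scheduling the tasks of a particular application is left to that application's own master.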
YARN Application Lifecycle
YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place