HADOOP

Big data commodity CLUSTER
Apache open-source framework
Distributed computing
Manages big data
Cluster of commodity computers
Big data : Volume , Variety , Velocity
Variety : Structured , Semi-structured , Quasi-structured , Unstructured
Commodity : Cheap , Scalable , Flexible , Fault tolerant
HADOOP COMPONENT (NODE)
Classification by Node
Master Node / Secondary Master Node : Locates data and tasks
Slave Node : Stores data , Runs tasks
Figure : Master Node , Secondary Master Node , and four Slave Nodes
HADOOP COMPONENT
Classification by Job
HDFS (Hadoop Distributed File System)
Master : Name Node , Secondary Name Node
Slave : Data Node
Map Reduce
Map Reduce V1 (used by Hadoop 1)
Master : Job Tracker
Slave : Task Tracker
Map Reduce V2 / YARN (used by Hadoop 2)
Master : Resource Manager
Slave : Node Manager , App Master , Container
HADOOP V.1 VS HADOOP V.2
HDFS
Stores large data by splitting it into blocks
Stores each block on Data Nodes (3 copies by default)
Suitable data
Unstructured data
Immutable data
No random-access writes (files are written once and appended to) ; reads are tuned for sequential streaming
Data Nodes communicate with each other to rebalance data, move copies around, and keep data replicated
HDFS cannot be mounted directly by the OS
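
The block splitting and replication described above can be made concrete with a little arithmetic. This is a minimal sketch, assuming a 128 MiB block size (a common Hadoop 2 default that these slides do not state) and the default of 3 copies; `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Estimate how one file would be laid out in HDFS.

    block_bytes is an assumed block size (not given in the slides);
    replication=3 matches the default number of copies.
    """
    # Ceiling division: the last block may be only partly filled.
    blocks = -(-file_bytes // block_bytes)
    return {
        "blocks": blocks,                        # logical blocks in the file
        "block_replicas": blocks * replication,  # block copies held by Data Nodes
        "raw_bytes": file_bytes * replication,   # disk space used across the cluster
    }


one_gib = 1024**3
print(hdfs_footprint(one_gib))
```

Under these assumptions, a 1 GiB file becomes 8 blocks, stored as 24 block copies that occupy 3 GiB of raw disk across the Data Nodes.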
HDFS in HADOOP V2
HDFS Federation
HDFS in HADOOP V2
Fault Tolerance
MAP REDUCE
Programming model
Processing large data sets
Batch processing (No real-time processing)
MAP : Processes a key/value pair to generate a set of intermediate key/value pairs
REDUCE : Merges all intermediate values associated with the same intermediate key
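
The MAP and REDUCE definitions above can be illustrated with a word count run entirely in memory. This is a sketch of the programming model only, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and the grouping step that Hadoop performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(_offset, line):
    # MAP: one input pair (offset, line) -> intermediate (word, 1) pairs.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Group intermediate values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    # REDUCE: merge all values that share the same intermediate key.
    return word, sum(counts)


def word_count(lines):
    intermediate = [
        pair
        for offset, line in enumerate(lines)
        for pair in map_phase(offset, line)
    ]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(w, c) for w, c in sorted(grouped.items()))


print(word_count(["big data big cluster", "data node"]))
```

Each phase only sees key/value pairs, which is what lets a real cluster run many map and reduce tasks on different nodes at the same time.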
MAP REDUCE WORK FLOW
MAP REDUCE V.1
MAP REDUCE V.2 / YARN
MAP REDUCE V1 VS YARN
Fault tolerance
Scalability
Compatibility with Map Reduce V1
Support for workloads other than Map Reduce
HADOOP V.1 VS HADOOP V.2
HADOOP V.1
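MASTER n1 : Job Tracker , Name Node , Secondary Name Node
SLAVE n2 : Task Tracker , Data Node
SLAVE n3 : Task Tracker , Data Node
Layers : Map Reduce V1 (Job Tracker , Task Tracker) , HDFS (Name Node , Secondary Name Node , Data Node)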
HADOOP V.2 / YARN

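MASTER n1 : Resource Manager , Name Node , Secondary Name Node
SLAVE n2 : Node Manager , Data Node
SLAVE n3 : Node Manager , Data Node
App Master and Container : Run on the slave nodes, managed by the Node Managers
Layers : YARN (Resource Manager , Node Manager , App Master , Container) , HDFS (Name Node , Secondary Name Node , Data Node)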
HADOOP ECOSYSTEM
Oozie : Workflow
Pig : Scripting
Mahout : Machine learning
R Connectors : Statistics
Hive : SQL Query
HBase : Columnar Store
Zookeeper : Coordination
Flume : Log Collector
Sqoop : Data Exchange
MAP REDUCE PROGRAMMING LANGUAGE
Java : Good for speed, control, binary data, and working with existing Java or MapReduce libraries.
Pipes : Good for working with existing C++ libraries.
Streaming : Good for writing MapReduce programs in scripting languages (PHP, Perl, Bash, etc.).
Dumbo (Python) , Happy (Jython) , Wukong (Ruby) , mrtoolkit (Ruby) : Good for Python/Ruby programmers who want quick results and are comfortable with the MapReduce abstraction.
Pig , Hive , Cascading : Good for higher-level abstractions, joins, and nested data.
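
As an illustration of the Streaming option, a mapper and reducer can be written as plain line filters. Hadoop Streaming passes records to scripts on standard input and reads results from standard output, one tab-separated key/value pair per line, and sorts the mapper output by key before the reducer sees it. The sketch below follows that convention but runs locally on an in-memory list; the names `mapper` and `reducer` and the sample input are assumptions made for the example, and `sorted()` stands in for the framework's sort step.

```python
from itertools import groupby


def mapper(lines):
    # Emit one "word<TAB>1" record per word, as a Streaming mapper would print.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    # Input must already be sorted by key, so equal words are adjacent.
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the cluster: map, sort by key, then reduce.
sample = ["big data big cluster", "data node"]
for record in reducer(sorted(mapper(sample))):
    print(record)
```

On a cluster, the two functions would live in separate scripts that loop over `sys.stdin` and print each record, and the scripts would be passed to the Streaming job as its mapper and reducer.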
CLOUDERA
Open-source Apache Hadoop distribution
Proprietary Cloudera management
Deployment
Configuration
Service management
Service and host monitoring
Security management
Diagnostics (log search , events , etc.)
Extensibility APIs
Supports many Hadoop ecosystem components
HADOOP MANAGEMENT PORT
50070 : Name Node web UI (HDFS)
8088 : Resource Manager web UI (nodes and applications)
50090 : Secondary Name Node web UI
50075 : Data Node web UI
7180 : Cloudera Manager web UI
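
The port list above can be turned into a small lookup for building web UI addresses and checking whether a service is reachable. This is a sketch under stated assumptions: the service names in `DEFAULT_WEB_PORTS` are my reading of the slide's labels (for example, "Hadoop Server" taken to mean the Name Node UI), the host name `n1` is borrowed from the cluster diagrams, and both helper functions are hypothetical.

```python
import socket

# Port numbers from the list above; service names are an interpretation.
DEFAULT_WEB_PORTS = {
    "NameNode": 50070,
    "ResourceManager": 8088,
    "SecondaryNameNode": 50090,
    "DataNode": 50075,
    "ClouderaManager": 7180,
}


def web_ui_urls(host, ports=DEFAULT_WEB_PORTS):
    # Build one HTTP address per service for the given host.
    return {name: f"http://{host}:{port}/" for name, port in ports.items()}


def is_listening(host, port, timeout=1.0):
    # True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, url in web_ui_urls("n1").items():
    print(name, url)
```

In a real cluster each service listens only on the node that runs it, so `is_listening` would be called with the appropriate host for each entry rather than a single master host.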
FLUME
Collecting, aggregating, and moving large amounts of log data
Simple
Flexible architecture
Robust
Fault tolerant
Reliability
FLUME
Source
Channel
Sink
Setting multi-agent flow
Multiplexing the flow
Consolidation
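
The Source, Channel, and Sink roles listed above can be pictured with a toy in-memory pipeline. This is not the Flume API and not a Flume configuration: `Channel`, `run_source`, and `run_sink` are hypothetical names, and the log lines are made-up sample data. The sketch only shows how a channel buffers events between the component that produces them and the component that delivers them.

```python
import queue


class Channel:
    """Bounded buffer that sits between a source and a sink."""

    def __init__(self, capacity=100):
        self._events = queue.Queue(maxsize=capacity)

    def put(self, event):
        self._events.put(event)

    def take(self):
        # Return the next event, or None when the channel is empty.
        try:
            return self._events.get_nowait()
        except queue.Empty:
            return None


def run_source(lines, channel):
    # Source: wrap each incoming log line in an event and hand it to the channel.
    for line in lines:
        channel.put({"body": line})


def run_sink(channel, store):
    # Sink: drain the channel and deliver event bodies to the destination.
    while (event := channel.take()) is not None:
        store.append(event["body"])


channel = Channel()
collected = []
run_source(["GET /index.html 200", "GET /missing 404"], channel)
run_sink(channel, collected)
print(collected)
```

Because the source and sink only share the channel, either side can be swapped or duplicated, which is the idea behind the multi-agent, multiplexing, and consolidation flows named above.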
DEMO
Hadoop (3 nodes) + WordCount MapReduce + Flume
Cloudera (3 nodes) + WordCount MapReduce + Flume
Figure : Demo cluster with MASTER n1 and SLAVE n2 , n3 , n4 , running YARN (Resource Manager , Node Manager) , HDFS (Name Node , Secondary Name Node , Data Node) , and FLUME
COMPARE
Columns : HADOOP , CLOUDERA - FREE , CLOUDERA ENTERPRISE
Node Limit : Unlimited , Unlimited , Unlimited
Other rows compared : GUI , Diagnostic , Core Cloudera Manager Features , Advanced Cloudera Manager Features , Cloudera Navigator , CDH , Cloudera Support , High Availability
COMPARE
HADOOP
CLOUDERA - FREE
Requires a Linux expert
GUI
Difficult configuration
Difficult diagnosis
Difficult to find errors
Difficult authorization
Difficult modification
YARN bug
Ensures Hadoop ecosystem compatibility
Partner solution with Splunk
BIG DATA COMPUTING ISSUE
Cloud Bigtable by Google
Apache Spark
