HADOOP

Big data commodity CLUSTER

HADOOP

Apache Open source Framework

Distributed Computing

Manage Big data

Cluster of Commodity Computer

Bigdata

Volume

Variety : Structure , Semi-Structure , Quasi-Structure , UnStructured

Velocity

Commodity

Cheap

Scalable

Flexible

Fault Tolerance

HADOOP COMPONENT (NODE)

Classification by Node

Master Node / Secondary Master Node

Locate data and task

Slave Node

Store data

Run task

[Diagram: Master Node, Secondary Master Node, and four Slave Nodes]

HADOOP COMPONENT

Classification by Job

HDFS (Hadoop Distributed File System)

Master : Name Node , Secondary Name Node

Slave : Data Node

Map Reduce

Map Reduce V1 (used by Hadoop 1)

Master : Job Tracker

Slave : Task Tracker

Map Reduce V2 / YARN (used by Hadoop 2)

Master : Resource Manager

Slave : Node Manager , App Master , Container

HADOOP V.1 VS HADOOP V.2

HDFS

Stores large data by splitting it into blocks

Stores blocks on DataNodes (3 copies by default)

Suitable data

Unstructured data

Immutable data

No random-access writes; optimized for sequential (streaming) reads

DataNodes can talk to each other to rebalance data, move copies around, and keep data replicated

HDFS cannot be mounted directly by the OS
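
The block splitting and replication described on this slide can be sketched as a toy model. This is not the HDFS API: `split_into_blocks`, `place_replicas`, the tiny block size, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS uses much larger blocks and its own placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a hypothetical policy chosen for simplicity.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [
            nodes[(start + k) % len(nodes)] for k in range(replication)
        ]
    return placement


# A 10-byte "file" with a 4-byte block size yields three blocks.
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

With three copies of every block on distinct nodes, losing any single node in this model still leaves two copies of each block available.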

HDFS in HADOOP V2

HDFS Federation

HDFS in HADOOP V2

Fault Tolerance

MAP REDUCE

Programming model

Processing large data sets

Batch processing (No real-time processing)

MAP

processes a key/value pair to generate a set of intermediate key/value pairs

REDUCE

merges all intermediate values associated with the same intermediate key
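
The MAP and REDUCE definitions above can be illustrated with a small in-memory word count. This is a sketch of the programming model only, not Hadoop code: `map_words`, `reduce_counts`, and `run_job` are hypothetical names, and the grouping step stands in for the shuffle that the framework performs between the two phases.

```python
from collections import defaultdict


def map_words(_key, line):
    """MAP: one input key/value pair -> a list of intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """REDUCE: merge all intermediate values that share one intermediate key."""
    return word, sum(counts)


def run_job(records):
    """Run map, group intermediate values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_words(key, value):
            groups[k].append(v)
    return dict(reduce_counts(k, vs) for k, vs in sorted(groups.items()))


# Input records are (line offset, line text) pairs, an assumed input format.
result = run_job([(0, "big data big cluster"), (1, "data node")])
```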

MAP REDUCE WORK FLOW

MAP REDUCE V.1

MAP REDUCE V.2 / YARN

MAP REDUCE V1 VS YARN

Fault tolerance

Scalability

Compatibility with Map Reduce V1

Support for workloads other than Map Reduce

HADOOP V.1 VS HADOOP V.2

HADOOP V.1

MASTER n1 : Job Tracker (Map Reduce V1) , NameNode , Secondary NameNode (HDFS)

SLAVE n2 : Task Tracker , DataNode

SLAVE n3 : Task Tracker , DataNode

HADOOP V.2 / YARN

MASTER n1 : Resource Manager (YARN) , NameNode , Secondary NameNode (HDFS)

SLAVE n2 : Node Manager , DataNode

SLAVE n3 : Node Manager , DataNode

App Master and Container run on the slave nodes

HADOOP ECOSYSTEM

Oozie : Workflow

Pig : Scripting

Mahout : Machine learning

R Connectors : Statistics

Hive : SQL Query

HBase : Columnar Store

Zookeeper : Coordination

Flume : Log Collector

Sqoop : Data Exchange

MAP REDUCE PROGRAMMING LANGUAGE

Java : Good for speed, control, binary data, and working with existing Java or MapReduce libraries.

Pipes : Good for working with existing C++ libraries.

Streaming : Good for writing MapReduce programs in scripting languages.

Dumbo (Python), Happy (Jython), Wukong (Ruby), mrtoolkit (Ruby) : Good for Python/Ruby programmers who want quick results and are comfortable with the MapReduce abstraction.

Pig, Hive, Cascading : Good for higher-level abstractions, joins, and nested data.

PHP, Perl, Bash, etc.
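
As an illustration of the Streaming option, here is a minimal sketch of a mapper and reducer written in a scripting language. In Hadoop Streaming the two steps are separate programs that read lines from standard input and write tab-separated key/value lines to standard output, with the mapper output sorted by key before it reaches the reducer. The job itself (counting log lines per level), the sample lines, and the `simulate` helper are assumptions made for this example.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'LEVEL<TAB>1' per log line; the level is assumed to be the first token."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield f"{fields[0]}\t1"


def reducer(sorted_lines):
    """Sum counts of consecutive lines sharing a key (input must be sorted by key)."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def simulate(lines):
    """Local stand-in for the map -> sort -> reduce pipeline."""
    return list(reducer(sorted(mapper(lines))))


sample = ["ERROR disk full", "INFO started", "ERROR timeout", ""]
output = simulate(sample)
```

In a real Streaming job each function would live in its own script, looping over `sys.stdin` and printing each result line.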

CLOUDERA

Open-source Apache Hadoop distribution

Proprietary Cloudera management

Deployment

Configuration

Service management

Service and host monitoring

Security management

Diagnostics (Log search , Event , etc)

Extensibility APIs

Supports many Hadoop ecosystem components

HADOOP MANAGEMENT PORT

50070 : NameNode web UI (HDFS)

8088 : Resource Manager web UI (nodes and applications)

50090 : Secondary NameNode web UI

50075 : DataNode web UI

7180 : Cloudera Manager
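
A small helper can turn the port list into web UI addresses for a given host. This is a sketch under assumptions: the port numbers are the ones listed on the slide, read here as Hadoop 2 era defaults (they are configurable and differ in other versions), and the service keys, the `web_ui_url` function, and the host name `n1` are hypothetical.

```python
# Ports as listed on the slide; real clusters may override them.
DEFAULT_WEB_PORTS = {
    "namenode": 50070,
    "resourcemanager": 8088,
    "secondarynamenode": 50090,
    "datanode": 50075,
    "cloudera_manager": 7180,
}


def web_ui_url(host: str, service: str) -> str:
    """Build the HTTP address of a service's management page on `host`."""
    return f"http://{host}:{DEFAULT_WEB_PORTS[service]}/"


# Example: management pages on a master node assumed to be named n1.
urls = {svc: web_ui_url("n1", svc) for svc in ("namenode", "resourcemanager")}
```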

FLUME

Collecting, aggregating, and moving large amounts of log data

Simple

Flexible architecture

Robust

Fault tolerant

Reliability

FLUME

Source

Channel

Sink

Setting multi-agent flow

Multiplexing the flow

Consolidation
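
The Source / Channel / Sink pipeline can be pictured with a toy in-memory model. This is not the Flume API or its configuration format: the `Channel` class, `run_source`, `run_sink`, and the event dictionary layout are assumptions chosen only to show how a channel buffers events between the component that receives them and the component that delivers them.

```python
from collections import deque


class Channel:
    """Buffer between a source and a sink (in-memory and unbounded for simplicity)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None when the channel is empty."""
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: wrap each incoming log line in an event and hand it to the channel."""
    for line in lines:
        channel.put({"body": line})


def run_sink(channel, destination):
    """Sink: drain the channel and deliver event bodies to the destination."""
    while (event := channel.take()) is not None:
        destination.append(event["body"])


channel = Channel()
delivered = []
run_source(["log line 1", "log line 2"], channel)
run_sink(channel, delivered)
```

In this picture, a multi-agent flow would correspond to one agent's sink feeding another agent's source, and consolidation to several sources feeding one downstream channel.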

DEMO
HADOOP 3 Node + Wordcount mapreduce + FLUME

Cloudera 3 Node + wordcount mapreduce + FLUME

MASTER n1 : Resource Manager (YARN) , NameNode , Secondary NameNode (HDFS)

SLAVE n2 : Node Manager , DataNode

SLAVE n3 : Node Manager , DataNode

SLAVE n4

[Diagram: FLUME is shown four times across the nodes]

COMPARE

Editions compared : HADOOP , CLOUDERA - FREE , CLOUDERA ENTERPRISE

Node Limit : Unlimited (all three)

Features compared : CDH , Core Cloudera Manager Features , Advanced Cloudera Manager Features , Cloudera Navigator , Cloudera Support , GUI , Diagnostic , High Availability

COMPARE

HADOOP

Requires a Linux expert

Difficult configuration

Difficult diagnosis

Difficult to find errors

Difficult authorization

Difficult modification

CLOUDERA - FREE

GUI

Yarn Bug

Ensures Hadoop ecosystem compatibility

Partner solution with Splunk

BIG DATA COMPUTING ISSUE

Cloud Bigtable by Google

Apache Spark
