Hadoop Architecture

Hadoop Cluster Configuration and Data Loading

Hadoop MapReduce framework

Advanced MapReduce

Pig and Pig Latin

Hive and HiveQL

Advanced Hive, NoSQL Databases and HBase

Advanced HBase and ZooKeeper

Hadoop 2.0, MRv2 and YARN

Hadoop Project Environment and Apache Oozie

About Modules

What is Big Data, Hadoop Architecture, Hadoop ecosystem components, Hadoop Storage: HDFS, H
Server Roles: NameNode, Secondary NameNode, and DataNode, Anatomy of File Write and Read.

Hadoop Cluster Architecture, Hadoop Cluster Configuration files, Hadoop Cluster Modes, Multi-Nod
Cluster, MapReduce Job execution, Common Hadoop Shell commands, Data Loading Techniques: FL
Project: Data Loading.

Hadoop Data Types, Hadoop MapReduce paradigm, Map and Reduce tasks, MapReduce Execution F
Formats (Input Splits and Records, Text Input, Binary Input, Multiple Inputs), Output Formats (TextO
Project: MapReduce Programming.

Counters, Custom Writables, Unit Testing: JUnit and MRUnit testing framework, Error Handling, Tuni
MapReduce programming and error handling.

Installing and Running Pig, Grunt, Pig's Data Model, Pig Latin, Developing & Testing Pig Latin Script
Hadoop Project: Pig Scripting.

Hive Architecture and Installation, Comparison with Traditional Database, HiveQL: Data Types, Ope
and External Tables, Partitions and Buckets, Storage Formats, Importing Data, Altering Tables, Drop
Aggregating, MapReduce Scripts, Joins & Subqueries, Views, Map-side and Reduce-side Joins to optimiz

Data manipulation with Hive, User Defined Functions, Appending Data into existing Hive Table, Cus
Scripting, HBase: Introduction to HBase, Client APIs and their features, Available Client, HBase Arc
Advanced Usage, Schema Design, Advanced Indexing, Coprocessors, Hadoop Project: HBase tables
Implementation, Consistency, Sessions, and States.

Fair and Capacity, Hadoop 2.0 New Features: NameNode High Availability, HDFS Federation, MRv2,
existing MRv1 code to MRv2, Programming in YARN framework.

In this module, you will understand how multiple Hadoop ecosystem components work together in
problems. We will discuss multiple data sets and specifications of the project. This module will also
Hadoop Jobs.

Introduction to Hadoop and its Architecture
Limitations of traditional large scale systems
Compare Hadoop with traditional systems
Understanding Hadoop Architecture
Hadoop Daemons: NameNode, DataNode, JobTracker, TaskTracker
Setting up Hadoop Single Node and Multi-Node Cluster using Oracle VirtualBox
Linux VM installation on Windows / Mac / Linux for Hadoop cluster using Oracle VirtualBox
Preparing nodes for Hadoop and VM settings (Java, passwordless SSH, network settings, etc.)
Basic Linux commands
Hadoop Deployment: Single Node
Hadoop configuration files and running Hadoop services
Important Web URLs and Logs for Hadoop
Run HDFS and Linux commands
Hadoop Deployment: Clustered Mode
Understanding Hadoop Distributed File System
Design Goals
Blocks, FS Image and Edit Logs
Rack-Awareness in Hadoop
Replica Placement and Selection Policies
Hadoop File System Shell Commands
Safe Mode in HDFS
Hadoop DFSAdmin Commands
File Read / Write Anatomy in HDFS
Hadoop NameNode and DataNodes Directory Structure
Name and Space Quota in HDFS
HDFS Trash Concept

Understanding Hadoop DFS 2.x Concepts

HDFS High Availability
Configuring HDFS HA with two NameNodes
Automatic and Manual Failover Techniques in HA
MapReduce Programming Framework: Part 1
MapReduce Architecture
Understand the concept of Mappers and Reducers
Anatomy of a MapReduce Program and its Phases
MapReduce Components: Mapper Class, Reducer Class
Splits, Blocks and Record Readers
Understand the concept and need of Combiner and Partitioner

Running and Monitoring MapReduce Jobs
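
The Part 1 topics above (Mapper, Reducer, Combiner, Partitioner) can be previewed with a small in-memory word count. This is only a sketch of the data flow, not Hadoop code: real jobs are normally written against the Java MapReduce API, and every function name here (mapper, combiner, partitioner, reducer, run_job) is hypothetical. The CRC32-based partitioner is an assumed stand-in for a hash partitioner, chosen because it gives the same result on every run.

```python
from collections import defaultdict
import zlib


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Combiner: pre-aggregate one mapper's output locally to cut shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partitioner(key, num_reducers):
    """Partitioner: choose the reducer for a key with a stable hash modulo."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Reduce phase: sum all partial counts that arrived for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    """Run map -> combine -> partition/shuffle -> reduce over in-memory splits."""
    # One bucket per reducer; each bucket groups values by key (the shuffle).
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        mapped = [pair for line in split for pair in mapper(line)]
        for key, value in combiner(mapped):
            buckets[partitioner(key, num_reducers)][key].append(value)

    result = {}
    for bucket in buckets:
        for key in sorted(bucket):  # reducers see their keys in sorted order
            word, total = reducer(key, bucket[key])
            result[word] = total
    return result


if __name__ == "__main__":
    # Each inner list plays the role of one input split (one mapper's records).
    splits = [["big data big cluster"], ["data node data"]]
    print(run_job(splits))
```

Because the combiner runs once per split, the second split sends a single ("data", 2) pair to the shuffle instead of two ("data", 1) pairs, which is the reason a combiner is worth having for a sum-style reduce.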

MapReduce Programming Framework: Part 2
MapReduce Internals
Understanding Input and Output Formats in Hadoop
MapReduce API
Hadoop Data Types
Writing your own MapReduce job
YARN Concepts (MRv2)
Hadoop 1.x Limitations
Design Goals for YARN
YARN Architecture
Components: ResourceManager / NodeManager / ApplicationMaster
Classic vs. YARN
Application Execution Flow
Life-Cycle Management
Schedulers and Queues
Running and Monitoring YARN applications
Job History Server and Web Application Proxy

Apache Hive
What is Hive?
Hive Architecture & Components
Hive Installation
Hive Metastore
Hive Data Model and Data Units
Hive DDL: Create/Show/Drop Database
Hive DDL: Create/Show/Drop Tables
Hive DML: Load Files into Tables
Hive DML: Inserting Data into Tables
Hive SQL: Select, Filter, Join, Group By
Multi-Table Inserts and Joins
Introduction to SerDe, UDF and UDAF
Apache Pig
Pig Installation
Pig Data Types
Pig Architecture
Pig Latin
Pig Relational Operators
Pig Functions

Apache ZooKeeper
What is ZooKeeper?
Installation: Standalone/Clustered Mode
ZooKeeper Command Line, ZNodes and Watches
HDFS HA automatic failover using ZooKeeper
Apache Sqoop
Sqoop Architecture and Installation
Import/Export Data using Sqoop

Apache Flume
Flume Architecture and Installation

Flume Use Cases