
EDUREKA HADOOP CERTIFICATION TRAINING www.edureka.co/big-data-and-hadoop


Agenda

Hadoop Architecture

Secondary NameNode & Checkpointing

NameNode Availability

NameNode Failover Mechanism

HDFS HA Architecture

Backup

Security



HDFS Architecture



HDFS Architecture

NameNode manages all the DataNodes and maintains all the metadata information

NameNode receives heartbeats and block reports from all the DataNodes

Clients first contact the NameNode for file metadata & then perform actual file I/O directly with the DataNodes



Secondary NameNode
and
Checkpointing



Secondary NameNode & Checkpointing
Checkpointing is a process of combining edit logs with the FsImage

Secondary NameNode takes over the responsibility of checkpointing, therefore making the NameNode more available

Allows faster failover, as it prevents edit logs from getting too huge

Checkpointing happens periodically (default: 1 hour)



Secondary NameNode & Checkpointing
Checkpointing-related properties in hdfs-site.xml:

the local directory where the temporary edits to be merged are stored

the number of seconds between two periodic checkpoints
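The XML listing for this slide did not survive extraction; below is a sketch of what such an hdfs-site.xml fragment typically looks like. The property names dfs.namenode.checkpoint.edits.dir and dfs.namenode.checkpoint.period are standard Hadoop configuration keys; the directory path is a placeholder, and 3600 seconds matches the 1-hour default mentioned above.

```xml
<!-- Illustrative hdfs-site.xml fragment; the path below is a placeholder -->
<property>
  <name>dfs.namenode.checkpoint.edits.dir</name>
  <value>/data/hadoop/namesecondary/edits</value>
  <description>Local directory where the temporary edits to be merged are stored</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>Number of seconds between two periodic checkpoints</description>
</property>
```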

EDUREKA HADOOP CERTIFICATION TRAINING www.edureka.co/big-data-and-hadoop


Secondary NameNode & Checkpointing

Manual or Forced Checkpointing:

1. Save the latest metadata to FsImage on the Master Node:


hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

2. Run the manual checkpointing on the Secondary NameNode:


hdfs secondarynamenode -checkpoint force



NameNode Availability
(Single Point of Failure)



NameNode Availability

Availability of the NameNode means we need the NameNode to be always up and running, or available, for executing any Hadoop jobs

In a standard HDFS configuration, the NameNode becomes a Single Point of Failure, i.e. once the NameNode crashes, the whole cluster becomes unavailable

The NameNode can go down because of:

Planned events: maintenance work like software or hardware upgrades

Unplanned events: NameNode crashes because of hardware failure



NameNode Failover



NameNode Failover

1 New NameNode loads the file system namespace image into memory

2 It replays all the edit transactions in the edit log to catch up to the most recent state of the NameSystem

3 Leaves the safe mode once it has received enough block reports from the DataNodes



HDFS HA Architecture
(Faster Failover)



HDFS HA Architecture
Two NameNodes running at the same time:
Active NameNode
Standby NameNode

In case of a NameNode (Active) failure, the other NameNode (Standby) takes over responsibility


Shared Storage



HDFS HA Architecture

Active NameNode and Standby NameNode keep their state in sync with each other using shared storage

Shared storage implementations:

1 Quorum Journal Nodes: a group of separate, lightweight daemons that log a record of any namespace modification

2 NFS (Network File System): a shared storage device that provides access to both the NameNodes for storing namespace modifications


Shared Storage Implementation
using Journal Nodes



HDFS HA Architecture

Active NameNode is responsible for updating the EditLogs on a majority of these JournalNodes

Standby NameNode reads the changes made to the EditLogs and updates its namespace state accordingly

There must be at least three JournalNodes so that modifications can be written to a majority of JournalNodes

For N JournalNodes, the system can tolerate at most (N - 1) / 2 failures
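The tolerance formula above uses integer division, which a few lines of Python make concrete: a 3-node quorum survives 1 failure, and a 5-node quorum survives 2.

```python
def jn_failure_tolerance(n_journalnodes: int) -> int:
    """Maximum JournalNode failures tolerable while a write majority survives."""
    return (n_journalnodes - 1) // 2  # integer division, per (N - 1) / 2

# A 3-node quorum tolerates 1 failure; 5 nodes tolerate 2.
print(jn_failure_tolerance(3))  # 1
print(jn_failure_tolerance(5))  # 2
```

Note that adding a fourth JournalNode does not help: (4 - 1) // 2 is still 1, which is why quorums are deployed in odd sizes.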



Automatic Failover: Zookeeper



Automatic Failover - Zookeeper

Apache Zookeeper:
Highly available service for maintaining small amounts of coordination data, notifying clients of changes in that
data, and monitoring clients for failures.
Provides Automatic failover to the HDFS HA Architecture

Failure detection:
Each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper.
In case of NameNode failure, the ZooKeeper session will expire, notifying the other NameNode that a
failover should be triggered

Active NameNode election:


ZooKeeper provides a simple mechanism to exclusively elect a node as active
If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper
indicating that it should become the next active
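As a sketch of how this is wired up: automatic failover is typically enabled with the standard Hadoop properties below. The property names are real Hadoop configuration keys; the ZooKeeper host names are placeholders.

```xml
<!-- hdfs-site.xml: enable automatic failover via the ZKFailoverController -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<!-- core-site.xml: ZooKeeper quorum used for failover coordination (hosts are placeholders) -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```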





Backup



Backup and Recovery
Some Useful Commands:

1 hdfs dfsadmin -report: to check the status of the cluster and details of the DataNodes

2 hadoop fs -ls <dir_path>: to check the list of files/directories on HDFS

3 hdfs fsck <path> -blocks: to check block information
  hdfs fsck <path> -files: to check file information


Backup

Why Backup?
Risk of Loss of Data



Backup
SOLUTION FOR DATA BACKUP

1 Distributed Copy (distcp): copies data from one cluster to another

hadoop distcp hdfs://<source NN> hdfs://<target NN>

2 Ingesting Data Using Flume: Flume performs parallel data ingestion into both Cluster 1 and Cluster 2


Security - Kerberos



Security

Kerberos is a network authentication protocol created by MIT

Kerberos is used to authenticate both users and services

Most security tools fail to scale and perform in big data environments

Eliminates the need for transmission of passwords across the network and removes the potential threat of an attacker sniffing the network



Security - Kerberos
Terminologies:

Principal:
1 An identity that needs to be verified is referred to as a principal

Realm:
2 Refers to an authentication administrative domain

KDC (Key Distribution Centre):
3 The authentication server in a Kerberos environment



Security - Kerberos
1
Principal:
An identity that needs to be verified is referred to as a principal. It is not necessarily just users; there can be multiple types of identities:

User Principal Names (UPN): refer to users, similar to users in an OS. Example: foobar@EXAMPLE.COM

Service Principal Names (SPN): refer to services accessed by a user, such as a database. Example: hdfs@EXAMPLE.COM

2
Realm:
A realm in Kerberos refers to an authentication administrative domain. Principals are assigned to
specific realms in order to demarcate boundaries and simplify administration.



Security - Kerberos
3
Kerberos Key Distribution Centre:
Key Distribution Centre contains all the information regarding principals and realms

Kerberos Database: the repository of all principals and realms

Authentication Service (AS): replies to the initial authentication request from the client and issues a special ticket known as the TGT

Ticket Granting Service (TGS): validates tickets and issues service tickets


Security - Kerberos

Kerberos KDC (Key Distribution Center): authenticates and authorizes a client; the KDC is not provided by Hadoop

Client: software that desires access to a service, e.g. hadoop fs

Desired Network Service (protected by Kerberos): the service daemon that the client wishes to access, e.g. NameNode, YARN


Kerberos Workflow



Kerberos Workflow

1 Initial user authentication request. This message is directed to the Authentication Server (AS).

2 Reply of the AS to the previous request. It contains the TGT.


Kerberos Workflow

3 Request from the client to the TGS for a service ticket. This packet includes the TGT obtained from the previous message.

4 Reply of the TGS to the previous request. It returns the requested service ticket.



Kerberos Workflow

5 Request that the client sends to an application server to access a service. It contains the service ticket obtained from the TGS with the previous reply.

6 Reply that the application server gives to the client to prove it really is the server the client is expecting.
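The six steps above can be modeled as a toy message flow. This is a sketch only: real Kerberos uses symmetric-key cryptography (e.g. AES), whereas here "encryption" is simulated by tagging a payload with the key name, and all key values and principal names are illustrative.

```python
def encrypt(key, payload):
    # Toy stand-in for symmetric encryption: tag the payload with the key name
    return ("enc", key, payload)

def decrypt(key, blob):
    tag, used_key, payload = blob
    assert tag == "enc" and used_key == key, "wrong key"
    return payload

# Long-term secrets known to the KDC (illustrative values)
USER_KEY = "user-secret"
SERVICE_KEY = "hdfs-service-secret"
TGS_KEY = "tgs-secret"

def as_exchange(principal):
    # Steps 1-2: client contacts the AS and receives a TGT
    session_key = "session-key-1"
    tgt = encrypt(TGS_KEY, {"principal": principal, "session": session_key})
    return encrypt(USER_KEY, session_key), tgt

def tgs_exchange(tgt, service):
    # Steps 3-4: client presents the TGT, TGS returns a service ticket
    info = decrypt(TGS_KEY, tgt)  # TGS validates the TGT
    svc_session = "session-key-2"
    ticket = encrypt(SERVICE_KEY, {"principal": info["principal"],
                                   "service": service,
                                   "session": svc_session})
    return encrypt(info["session"], svc_session), ticket

def access_service(ticket):
    # Steps 5-6: service validates the ticket and grants access
    info = decrypt(SERVICE_KEY, ticket)
    return f"{info['principal']} granted access to {info['service']}"

enc_session, tgt = as_exchange("foobar@EXAMPLE.COM")
session = decrypt(USER_KEY, enc_session)   # client recovers the session key
_, ticket = tgs_exchange(tgt, "hdfs")
print(access_service(ticket))              # foobar@EXAMPLE.COM granted access to hdfs
```

The key point the sketch illustrates: the client's password (USER_KEY) never crosses the network after the first exchange; only tickets do.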



Important Kerberos
Client Commands



Important Kerberos Client Commands

1 kinit: used to request a TGT from Kerberos

2 klist: used to see your current tickets

3 kdestroy: used to explicitly delete your tickets, though they will expire on their own after several hours



Steps to Setup Kerberos



Steps to Setup Kerberos

1. Set KDC hostname and realm in all Hadoop nodes

2. Create Kerberos principals

3. Create and deploy Kerberos keytab files

4. Shut down all Hadoop daemons

5. Enable Hadoop security

6. Configure HDFS security options

7. Configure MapReduce security options

8. Restart Hadoop daemons

9. Verify that everything works



Kerberos Demo



Thank You
Questions/Queries/Feedback

