
Cloudera Administrator Training
for Apache Hadoop

© Copyright 2010–2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201306

Introduction
Chapter 1

Course Chapters

Course Introduction
  – Introduction

Introduction to Apache Hadoop
  – The Case for Apache Hadoop
  – HDFS
  – Getting Data Into HDFS
  – MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
  – Planning Your Hadoop Cluster
  – Hadoop Installation and Initial Configuration
  – Installing and Configuring Hive, Impala, and Pig
  – Hadoop Clients
  – Cloudera Manager
  – Advanced Cluster Configuration
  – Hadoop Security

Cluster Operations and Maintenance
  – Managing and Scheduling Jobs
  – Cluster Maintenance
  – Cluster Maintenance and Troubleshooting

Course Conclusion and Appendices
  – Conclusion
  – Kerberos Configuration
  – Configuring HDFS Federation

Chapter Topics

Introduction (Course Introduction)

• About This Course
• About Cloudera
• Course Logistics

Course Objectives

During this course, you will learn:
• The core technologies of Hadoop
• How to populate HDFS from external sources
• How to plan your Hadoop cluster hardware and software
• How to deploy a Hadoop cluster
• What issues to consider when installing Pig, Hive, and Impala
• What issues to consider when deploying Hadoop clients
• How Cloudera Manager can simplify Hadoop administration
• How to configure HDFS for high availability
• What issues to consider when implementing Hadoop security

Course Objectives (cont'd)

During this course, you will learn:
• How to schedule jobs on the cluster
• How to maintain your cluster
• How to monitor, troubleshoot, and optimize the cluster

Chapter Topics

Introduction (Course Introduction)

• About This Course
• About Cloudera
• Course Logistics

About Cloudera

• The leader in Apache Hadoop-based software and services
• Founded by leading experts on Hadoop from Facebook, Yahoo, Google, and Oracle
• Provides support, consulting, training, and certification for Hadoop users
• Staff includes committers to virtually all Hadoop projects
• Many authors of industry-standard books on Apache Hadoop projects
  – Tom White, Eric Sammer, Lars George, etc.

About Cloudera (cont'd)

• Customers include many key users of Hadoop
  – Allstate, AOL Advertising, Box, CBS Interactive, eBay, Experian, Groupon, Macys.com, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, …
• Cloudera public training:
  – Cloudera Developer Training for Apache Hadoop
  – Cloudera Administrator Training for Apache Hadoop
  – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  – Cloudera Training for Apache HBase
  – Introduction to Data Science: Building Recommender Systems
  – Cloudera Essentials for Apache Hadoop
• Onsite and custom training is also available

CDH

• CDH (Cloudera's Distribution including Apache Hadoop)
  – 100% open source, enterprise-ready distribution of Hadoop and related projects
  – The most complete, tested, and widely deployed distribution of Hadoop
  – Integrates all the key Hadoop ecosystem projects
  – Available as RPMs and Ubuntu/Debian/SuSE packages or as a tarball

Cloudera Standard

• Cloudera Standard
  – Free download including CDH and Cloudera Manager
  – Supports an unlimited number of nodes
• End-to-end administration for Hadoop
  – Automated cluster deployment
  – Centralized administration
  – Cluster monitoring and diagnostic tools

Cloudera Enterprise

• Cloudera Enterprise
  – Subscription product including CDH and Cloudera Manager
• Includes support
• Extra Manager features
  – Rolling upgrades
  – SNMP support
  – LDAP integration
  – Etc.
• Add-on support modules
  – Impala, HBase, Backup and Disaster Recovery, Cloudera Navigator

Chapter Topics

Introduction (Course Introduction)

• About This Course
• About Cloudera
• Course Logistics

Logistics

• Course start and end times
• Lunch
• Breaks
• Restrooms
• Can I come in early/stay late?
• Certification

Introductions

• About your instructor
• About you
  – Experience with Hadoop?
  – Experience as a System Administrator?
  – What platform(s) do you use?
  – Expectations from the course?

The Case for Apache Hadoop
Chapter 2

The Case for Apache Hadoop

In this chapter, you will learn:
• Why Hadoop is needed
• What problems Hadoop solves
• What comprises Hadoop and the Hadoop ecosystem

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

The Data Deluge

• We are generating more data than ever
  – Financial transactions
  – Sensor networks
  – Server logs
  – Analytics
  – E-mail and text messages
  – Social media

The Data Deluge (cont'd)

• And we are generating data faster than ever
  – Automation
  – Ubiquitous internet connectivity
  – User-generated content
• For example, every day
  – Twitter processes 340 million messages
  – Amazon S3 storage adds more than one billion objects
  – Facebook users generate 2.7 billion comments and "Likes"

Data is Value

• This data has many valuable applications
  – Marketing analysis
  – Product recommendations
  – Demand forecasting
  – Fraud detection
  – And many, many more…
• We must process it to extract that value

Data Processing Scalability

• How can we process all that information?
• There are actually two problems
  – Large-scale data storage
  – Large-scale data analysis

Disk Capacity and Price

• We are generating more data than ever before
• Fortunately, the size and cost of storage has kept pace
  – Capacity has increased while price has decreased

  Year    Capacity (GB)    Cost per GB (USD)
  1997    2.1              $157
  2004    200              $1.05
  2012    3,000            $0.05

Disk Capacity and Performance

• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates have not kept pace with capacity

  Year    Capacity (GB)    Transfer Rate (MB/s)    Disk Read Time
  1997    2.1              16.6                    126 seconds
  2004    200              56.5                    59 minutes
  2012    3,000            210                     3 hours, 58 minutes
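
The read times in the table follow directly from dividing capacity by transfer rate, as this quick check shows:

```python
# Sequential read time for a full disk: capacity / transfer rate.
# Uses 1 GB = 1000 MB, matching the table's round numbers.
def read_time_seconds(capacity_gb, rate_mb_per_s):
    return capacity_gb * 1000 / rate_mb_per_s

for year, cap, rate in [(1997, 2.1, 16.6), (2004, 200, 56.5), (2012, 3000, 210)]:
    print(year, round(read_time_seconds(cap, rate)), "seconds")
```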

Data Access is the Bottleneck

• Although we can process data more quickly, accessing it is slow
  – This is true for both reads and writes
• For example, reading a single 3 TB disk takes almost four hours
  – We cannot process the data until we have read it
  – We are limited by the speed of a single disk
• We will see Hadoop's solution in a few moments
  – But first we will examine how we process large amounts of data

Monolithic Computing

• Traditionally, computation has been processor-bound
  – Intense processing on small amounts of data
• For decades, the goal was a bigger, more powerful machine
  – Faster processor, more RAM
• This approach has limitations
  – High cost
  – Limited scalability

The Case for Distributed Systems

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, we didn't try to grow a larger ox."
  – Grace Hopper, early advocate of distributed computing

Distributed Computing

• Modern large-scale processing is distributed across machines
  – Often hundreds or thousands of nodes
  – Common frameworks include MPI, PVM, and Condor
• Focuses on distributing the processing workload
  – Powerful compute nodes
  – Separate systems for data storage
  – Fast network connections to connect them

Distributed Computing Processing Pattern

• Typical processing pattern
  – Step 1: Copy input data from storage to compute node
  – Step 2: Perform necessary processing
  – Step 3: Copy output data back to storage
• This works fine with relatively small amounts of data
  – That is, where step 2 dominates overall runtime

Data Processing Bottleneck

• That pattern doesn't scale with large amounts of data
  – More time spent copying data than actually processing it
  – Getting data to the processors is the bottleneck
• Grows worse as more compute nodes are added
  – They are competing for the same bandwidth
  – Compute nodes become starved for data

Complexity of Distributed Computing

• Distributed systems pay for scalability by adding complexity
• Much of this complexity involves
  – Availability
  – Data consistency
  – Event synchronization
  – Bandwidth limitations
  – Partial failure
  – Cascading failures
• These are often more difficult than the original problem
  – Error handling often accounts for the majority of the code

Distributed Versus Local

"Failure is the defining difference between distributed and local programming"
  – Ken Arnold, CORBA designer

System Requirements: Failure Handling

• Failure is inevitable
  – We should strive to handle it well
• An ideal solution should have (at least) these properties

  Failure-Handling Properties of an Ideal Distributed System
  Automatic      Job can still complete without manual intervention
  Transparent    Tasks assigned to a failed component are picked up by others
  Graceful       Failure results only in a proportional loss of load capacity
  Recoverable    That capacity is reclaimed when the component is later replaced
  Consistent     Failure does not produce corruption or invalid results

More System Requirements

• Linear horizontal scalability
  – Adding new nodes should add proportional load capacity
  – Avoid contention by using a "shared nothing" architecture
  – Must be able to expand the cluster at a reasonable cost
• Jobs run in relative isolation
  – Results must be independent of other jobs running concurrently
  – Although performance can be affected by other jobs
• Simple programming model
  – Should support a widely used language
  – The API must be relatively easy to learn
• Hadoop addresses these requirements

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

Hadoop: A Radical Solution

• Traditional distributed computing frequently involves
  – Complex programming requiring explicit synchronization
  – Expensive, specialized fault-tolerant hardware
  – High-performance storage systems with built-in redundancy
• Hadoop takes a radically different approach
  – Inspired by Google's GFS and MapReduce architecture
  – This new approach addresses the problems described earlier

Hadoop Scalability

• Hadoop aims for linear horizontal scalability
  – Cross-communication among nodes is minimal
  – Just add nodes to increase cluster capacity and performance
• Clusters are built from industry-standard hardware
  – Widely available and relatively inexpensive servers
  – You can "scale out" later when the need arises

Solution: Data Access Bottleneck

• Recap: separate storage and compute systems create a bottleneck
  – More time can be spent copying data than processing it
• Solution: store and process data on the same machines
  – This is why adding nodes increases capacity and performance
• Optimization: use intelligent job scheduling (data locality)
  – Hadoop tries to process data on the same machine that stores it
  – This improves performance and conserves bandwidth
  – "Bring the computation to the data"

Solution: Disk Performance Bottleneck

• Recap: a single disk has great capacity but poor performance
• Solution: use multiple disks in parallel
  – The transfer rate of one disk might be 210 megabytes/second
  – Almost four hours to read 3 TB of data
  – 1,000 such disks in parallel can transfer 210 gigabytes/second
  – Less than 15 seconds to read 3 TB of data
• Colocated storage and processing makes this solution feasible
  – 250-node cluster with 4 disks per node = 1,000 disks
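
The aggregate numbers above can be verified with a few lines of arithmetic:

```python
# Aggregate transfer rate of many disks working in parallel.
disks = 250 * 4                                # 250 nodes, 4 disks each
rate_mb_per_s = disks * 210                    # each disk moves ~210 MB/s
seconds_for_3tb = 3_000_000 / rate_mb_per_s    # 3 TB expressed in MB
print(disks, rate_mb_per_s, round(seconds_for_3tb, 1))
```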

Solution: Complex Processing Code

• Recap: distributed programming is very difficult
  – Often done in C or FORTRAN using complex libraries
• Solution: use a popular language and a high-level API
  – MapReduce code is typically written in Java (like Hadoop itself)
  – It is possible to write MapReduce in nearly any language
• The MapReduce programming model simplifies processing
  – Deal with one record (key/value pair) at a time
  – Complex details are abstracted away
  – No file I/O
  – No networking code
  – No synchronization
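
The one-record-at-a-time model can be sketched in plain Python (an illustration of the programming model only, not the Hadoop API): map emits key/value pairs, a sort groups them by key, and reduce aggregates each group.

```python
# Word count expressed in the MapReduce model.
from itertools import groupby
from operator import itemgetter

def map_fn(offset, line):
    # Called once per input record (here: a line keyed by its offset)
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def run(records):
    mapped = [kv for key, value in records for kv in map_fn(key, value)]
    mapped.sort(key=itemgetter(0))      # the "shuffle and sort" phase
    output = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reduce_fn(word, (count for _, count in group)))
    return output

print(run([(0, "the cat sat"), (1, "the dog")]))
# [('cat', 1), ('dog', 1), ('sat', 1), ('the', 2)]
```

Note that `map_fn` and `reduce_fn` never touch files, sockets, or locks; the framework (here, the tiny `run` driver) handles all of that.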

Solution: Fault Tolerance

• Recap: distributed systems often use expensive components
  – In order to minimize the possibility of failure
• Solution: realize that failure is inevitable
  – And instead try to minimize the effect of failure
  – Hadoop satisfies all the requirements we discussed earlier
• Machine failure is a regular occurrence
  – A server might have a mean time between failures (MTBF) of 5 years (~1,825 days)
  – That equates to about one failure per day in a 2,000-node cluster
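
The one-failure-per-day figure follows from simple expected-value arithmetic:

```python
# Expected daily failures: number of nodes divided by per-node MTBF in days.
nodes = 2000
mtbf_days = 5 * 365          # ~1,825 days
failures_per_day = nodes / mtbf_days
print(round(failures_per_day, 2))   # ~1.1 failures per day
```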

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

Core Hadoop Components

• Hadoop is a system for large-scale data processing
• Hadoop provides two main components to achieve this
  – Data storage: HDFS
  – Data processing: MapReduce
• Plus the infrastructure needed to make them work, including
  – Filesystem utilities
  – Job scheduling and monitoring
  – Web UI

The Hadoop Ecosystem

• Many related tools integrate with Hadoop
  – Data analysis: Hive, Pig
  – Machine learning: Mahout
  – Database integration: Sqoop
  – Workflow management: Oozie
  – Cluster management: Cloudera Manager
• These are not considered "core Hadoop"
  – Rather, they are part of the "Hadoop ecosystem"
  – Many are also open source Apache projects
  – We will learn about several of these later in the course

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

Hands-On Exercise: Installing Hadoop – Networking Setup

• Cloud training environment – four EC2 instances plus the Get2EC2 virtual machine (VM)
  – Start the Get2EC2 VM
  – Configure the local hosts file with the EC2 elastic IP addresses
  – Configure the /etc/hosts file and change host names of the EC2 instances
  – Verify that everything is working correctly
  – Start a SOCKS5 proxy server on the Get2EC2 VM
• Local training environment – four VMware VMs
  – Start the VMs
  – Configure IP addresses
  – Configure the /etc/hosts file and change host names
  – Log out and log in to all four virtual machines
  – Verify that everything is working correctly

Hands-On Exercise: Installing Hadoop – Networking Setup (cont'd)

• At the end of the day:
  – Cloud training environment
    – Exit from all active SSH sessions (including the SOCKS5 proxy server)
    – Suspend the Get2EC2 VM
  – Local training environment
    – Suspend all four VMs
• When you come back in the morning:
  – Cloud training environment
    – Restart the Get2EC2 VM
    – Restart SSH sessions using the connect_to scripts
    – Restart the SOCKS5 proxy server
  – Local training environment
    – Restart all four VMs

Hands-On Exercise: Installing Hadoop – Deployment

• In this Hands-On Exercise, you will install Hadoop in pseudo-distributed mode
• Please refer to the Hands-On Exercise Manual
• Pseudo-distributed mode deployment after exercise completion

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

Essential Points

• We are generating more data – and faster – than ever before
• We can store and process the data, but there are problems using existing techniques
  – Accessing the data from disk is a bottleneck
  – In distributed systems, getting data to the processors is a bottleneck
• Hadoop eliminates the bottlenecks by storing and processing data on the same machine
• Hadoop consists of two core components, HDFS (storage) and MapReduce (processing), and an ecosystem of related tools

Conclusion

In this chapter, you have learned:
• Why Hadoop is needed
• What problems Hadoop solves
• What comprises Hadoop and the Hadoop ecosystem

HDFS
Chapter 3

HDFS

In this chapter, you will learn:
• What features HDFS provides
• How HDFS reads and writes files
• How the NameNode uses memory
• How Hadoop provides file security
• How to use the NameNode Web UI
• How to use the Hadoop File Shell

Chapter Topics

HDFS (Introduction to Apache Hadoop)

• HDFS Features
• Writing and Reading Files
• NameNode Memory Considerations
• Overview of HDFS Security
• Using the NameNode Web UI
• Using the Hadoop File Shell
• Hands-On Exercise: Working with HDFS
• Conclusion

HDFS: The Hadoop Distributed File System

• Based on Google's GFS (Google File System)
• Provides redundant storage for massive amounts of data
  – Using industry-standard hardware
• At load time, data is distributed across all nodes
  – Provides for efficient MapReduce processing (more later)

HDFS Features

• High performance
• Fault tolerance
• Relatively simple centralized management
  – Master/slave architecture
• Security
  – Two levels from which to choose
• Optimized for MapReduce processing
  – Data locality
• Scalability

HDFS Design Assumptions

• High component failure rates
  – Inexpensive components fail all the time
• A "modest" number of HUGE files
  – Just a few million
  – Each file likely to be 100 MB or larger
  – Multi-gigabyte files typical
• Files are write-once
• Large streaming reads
  – Not random access
• Favor high sustained throughput over low latency

HDFS Blocks

• When a file is added to HDFS, it is split into blocks
• This is a similar concept to native filesystems
  – HDFS uses a much larger block size
  – Default block size is 64 MB (configurable)

For example, a 150 MB input file is split into:
  Block #1 (64 MB)
  Block #2 (64 MB)
  Block #3 (remaining 22 MB)
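
The split above is just fixed-size division with a smaller final block, as this small sketch of the arithmetic shows:

```python
# Split a file of the given size into HDFS-style fixed-size blocks.
def split_into_blocks(size_mb, block_mb=64):
    blocks = []
    while size_mb > 0:
        blocks.append(min(block_mb, size_mb))  # the last block may be smaller
        size_mb -= block_mb
    return blocks

print(split_into_blocks(150))  # [64, 64, 22]
```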

HDFS Replication

• These blocks are replicated to nodes throughout the cluster
  – Based on the replication factor (default is three)
• Replication increases reliability and performance
  – Reliability: data can tolerate loss of all but one replica
  – Performance: more opportunities for data locality

For example, with three blocks replicated across five nodes:
  Node A has blocks: 1, 2
  Node B has blocks: 2, 3
  Node C has blocks: 1, 3
  Node D has block: 1
  Node E has blocks: 2, 3
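
Counting replicas in that placement confirms each block meets the default replication factor of three:

```python
# Count replicas per block from the placement shown above.
placement = {
    "A": {1, 2},
    "B": {2, 3},
    "C": {1, 3},
    "D": {1},
    "E": {2, 3},
}
replicas = {}
for node, blocks in placement.items():
    for block in blocks:
        replicas[block] = replicas.get(block, 0) + 1
print(replicas)  # every block is stored three times
```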

Classical HDFS Architecture

• The architecture of HDFS has recently been improved
  – More resilient
  – Better scalability
• These changes are only available in recent releases
  – Such as Cloudera's CDH4
• Many still run earlier releases in production
  – We will discuss the older architecture first
  – Then we will cover how it has changed

Classical HDFS Architectural Overview

• There are three daemons in "classical" HDFS
  – NameNode (master)
  – Secondary NameNode (master)
  – DataNode (slave)

(Diagram: a NameNode and a Secondary NameNode above a row of DataNodes.)

The NameNode

• The NameNode stores all metadata
  – Information about file locations in HDFS
  – Information about file ownership and permissions
  – Names of the individual blocks
  – Locations of the blocks
• Metadata is stored on disk and read when the NameNode daemon starts up
  – Filename is fsimage
  – Note: block locations are not stored in fsimage
• Changes to the metadata are made in RAM
  – Changes are also written to a log file on disk called edits
  – Full details later

The Slave Nodes

• Actual contents of the files are stored as blocks on the slave nodes
• Blocks are simply files on the slave nodes' underlying filesystem
  – Named blk_xxxxxxx
  – Nothing on the slave node provides information about what underlying file the block is a part of
  – That information is only stored in the NameNode's metadata
• Each block is stored on multiple different nodes for redundancy
  – Default is three replicas
• Each slave node runs a DataNode daemon
  – Controls access to the blocks
  – Communicates with the NameNode

The Secondary NameNode: Caution!

• The Secondary NameNode is not a failover NameNode!
  – It performs memory-intensive administrative functions for the NameNode
  – The NameNode keeps information about files and blocks (the metadata) in memory
  – The NameNode writes metadata changes to an edit log
  – The Secondary NameNode periodically combines a prior snapshot of the file system metadata and edit log into a new snapshot
  – The new snapshot is transmitted back to the NameNode
• The Secondary NameNode should run on a separate machine in a large installation
  – It requires as much RAM as the NameNode

File System Metadata Snapshot and Edit Log

• The fsimage file contains a file system metadata snapshot
  – It is not updated at every write
  – This would be very slow
• When an HDFS client performs a write operation, it is recorded in the primary NameNode's edit log
  – The edits file
  – The NameNode's in-memory representation of the file system metadata is also updated
• Applying all changes in the edits file during a NameNode restart could take a long time
  – The file could also grow to be huge

Checkpointing the File System Metadata

• The Secondary NameNode periodically checkpoints the NameNode's in-memory file system data
  1. Tells the NameNode to roll its edits file
  2. Retrieves fsimage and edits from the NameNode
  3. Loads fsimage into memory and applies the changes from the edits file
  4. Creates a new, consolidated fsimage file
  5. Sends the new fsimage file back to the primary NameNode
  6. The NameNode replaces the old fsimage file with the new one, replaces the old edits file with the new one it created in step 1, and updates the fstime file to record the checkpoint time
• By default, checkpointing occurs once an hour or if the edits file grows larger than 64 MB
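
The heart of the checkpoint (step 3) is replaying logged operations onto a snapshot. Here is a toy model of that idea, with metadata as a plain dict and edits as a list of operations; this is purely illustrative and has nothing to do with the actual HDFS on-disk formats:

```python
# Toy model: replay an edit log against a metadata snapshot (fsimage).
def apply_edits(fsimage, edits):
    meta = dict(fsimage)              # work on a copy of the snapshot
    for op, path, value in edits:
        if op == "create":
            meta[path] = value
        elif op == "delete":
            meta.pop(path, None)
    return meta                       # the new, consolidated snapshot

old = {"/users/a": "blk_1"}
log = [("create", "/users/b", "blk_2"), ("delete", "/users/a", None)]
print(apply_edits(old, log))  # {'/users/b': 'blk_2'}
```

Once the consolidated snapshot exists, the old edit log can be discarded, which is exactly why the checkpoint keeps the edits file from growing without bound.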

Single Point of Failure

• Each Hadoop cluster has a single NameNode
  – The Secondary NameNode is not a failover NameNode
• The NameNode is a single point of failure (SPOF)
• In practice, this is not a major issue
  – HDFS will be unavailable until the NameNode is replaced
  – There is very little risk of data loss for a properly managed system
• Recovering from a failed NameNode is relatively easy
  – We will discuss this process in detail later

Contemporary HDFS Architecture

• The preceding slides described the "classic" architecture
  – Versions of HDFS in CDH3 and earlier
• HDFS now has High Availability and Federation features
  – Initially developed on the Apache Hadoop 0.23 branch
  – Now part of Hadoop 2.x
  – Available in Cloudera's distribution, starting with CDH4

HDFS High Availability

• HDFS High Availability addresses the NameNode SPOF
• Two NameNodes: one active and one standby
  – The standby NameNode takes over when the active NameNode fails
  – The standby NameNode also does checkpointing (a Secondary NameNode is no longer needed)

(Diagram: an active NameNode and a standby NameNode, each connected to all of the DataNodes.)

HDFS"FederaEon"

! Federa/on%improves%the%scalability%of%HDFS%
– In"CDH3"and"earlier,"there"could"only"be"a"single"NameNode"
– FederaEon"allows"for"mulEple"independent"NameNodes"
! Each%NameNode%manages%a%namespace-volume%
– Client/side"mount"tables"define"the"overall"view
– NN1"might"provide"/users"and"NN2"might"provide"/reports"

users reports

engineering finance marketing sales inventory TPS

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#20%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! Wri/ng%and%Reading%Files%
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#21%
Anatomy"of"a"File"Write"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#22%
Anatomy"of"a"File"Write"(cont’d)"

1.  Client%connects%to%the%NameNode%
2.  NameNode%places%an%entry%for%the%file%in%its%metadata,%returns%the%
block%name%and%list%of%DataNodes%to%the%client%
3.  Client%connects%to%the%first%DataNode%and%starts%sending%data%
4.  As%data%is%received%by%the%first%DataNode,%it%connects%to%the%second%and%
starts%sending%data%
5.  Second%DataNode%similarly%connects%to%the%third%
6.  ack%packets%from%the%pipeline%are%sent%back%to%the%client%
7.  Client%reports%to%the%NameNode%when%the%block%is%wrieen%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#23%
Anatomy"of"a"File"Write"(cont’d)"

! If%a%DataNode%in%the%pipeline%fails%
– The"pipeline"is"closed"
– A"new"pipeline"is"opened"with"the"two"good"nodes"
– The"data"conEnues"to"be"wri>en"to"the"two"good"nodes"in"the"pipeline"
– The"NameNode"will"realize"that"the"block"is"under/replicated,"and"will"
re/replicate"it"to"another"DataNode"
! As%the%blocks%are%wrieen,%a%checksum%is%also%calculated%and%wrieen%
– Used"to"ensure"the"integrity"of"the"data"when"it"is"later"read"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#24%
Hadoop"is"‘Rack/aware’"

! Hadoop%understands%the%concept%of%‘rack%awareness’%
– The"idea"of"where"nodes"are"located,"relaEve"to"one"another"
– Helps"the"JobTracker"to"assign"tasks"to"nodes"closest"to"the"data"
– Helps"the"NameNode"determine"the"‘closest’"block"to"a"client"during"
reads"
– In"reality,"this"should"perhaps"be"described"as"being"‘switch/aware’"
! HDFS%replicates%data%blocks%on%nodes%on%different%racks%%
– Provides"extra"data"security"in"case"of"catastrophic"hardware"failure ""
! Rack#awareness%is%determined%by%a%user#defined%script%
– See"later"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#25%
HDFS"Block"ReplicaEon"Strategy"

! First%copy%of%the%block%is%placed%on%the%same%node%as%the%client%
– If"the"client"is"not"part"of"the"cluster,"the"first"block"is"placed"on"a"
random"node"
– System"tries"to"find"one"which"is"not"too"busy"
! Second%copy%of%the%block%is%placed%on%a%node%residing%on%a%different%rack%
! Third%copy%of%the%block%is%placed%on%different%node%in%the%same%rack%as%the%
second%copy%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#26%
Anatomy"of"a"File"Read"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#27%
Anatomy"of"a"File"Read"(cont’d)"

1.  Client%connects%to%the%NameNode%
2.  NameNode%returns%the%name%and%loca/ons%of%the%first%few%blocks%of%
the%file%
–  Block"locaEons"are"returned"closest/first"
3.  Client%connects%to%the%first%of%the%DataNodes,%and%reads%the%block%
!  If%the%DataNode%fails%during%the%read,%the%client%will%seamlessly%connect%
to%the%next%one%in%the%list%to%read%the%block%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#28%
Dealing"With"Data"CorrupEon"

! As%the%DataNode%is%reading%the%block,%it%also%calculates%the%checksum%
! ‘Live’%checksum%is%compared%to%the%checksum%created%when%the%block%was%
stored%
! If%they%differ,%the%client%reads%from%the%next%DataNode%in%the%list%
– The"NameNode"is"informed"that"a"corrupted"version"of"the"block"has"
been"found"
– The"NameNode"will"then"re/replicate"that"block"elsewhere"
! The%DataNode%verifies%the%checksums%for%blocks%on%a%regular%basis%to%
avoid%‘bit%rot’%
– Default"is"every"three"weeks"aier"the"block"was"created"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#29%
Data"Reliability"and"Recovery"

! DataNodes%send%heartbeats-to%the%NameNode%
– Every"three"seconds"
! Aher%a%period%without%any%heartbeats,%a%DataNode%is%assumed%to%be%lost%
– NameNode"determines"which"blocks"were"on"the"lost"node"
– NameNode"finds"other"DataNodes"with"copies"of"these"blocks"
– These"DataNodes"are"instructed"to"copy"the"blocks"to"other"nodes"
– Three/fold"replicaEon"is"acEvely"maintained"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#30%
The"NameNode"Is"Not"a"Bo>leneck"

! Note:%the%data%never%travels%via%a%NameNode%
– For"writes"
– For"reads"
– During"re/replicaEon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#31%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode%Memory%Considera/ons%
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#32%
NameNode:"Memory"AllocaEon"

! When%a%NameNode%is%running,%all%metadata%is%held%in%RAM%for%fast%
response%
! Each%‘item’%consumes%150#200%bytes%of%RAM%
! Items:%
– Filename,"permissions,"etc."
– Block"informaEon"for"each"block"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#33%
NameNode:"Memory"AllocaEon"(cont’d)"

! Why%HDFS%prefers%fewer,%larger%files:%
– Consider"1GB"of"data,"HDFS"block"size"128MB"
– Stored"as"1"x"1GB"file"
– Name:"1"item"
– Blocks:"8"items"
– Total"items:"9"
– Stored"as"1000"x"1MB"files"
– Names:"1000"items"
– Blocks:"1000"items"
– Total"items:"2000"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#34%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview%of%HDFS%Security%
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#35%
HDFS"File"Permissions""

! Files%in%HDFS%have%an%owner,%a%group,%and%permissions%
– Very"similar"to"Unix"file"permissions"
! File%permissions%are%read%(r),%write%(w)%and%execute%(x)%for%each%of%owner,%
group,%and%other%
– x"is"ignored"for"files"
– For"directories,"x"means"that"its"children"can"be"accessed"
! HDFS%permissions%are%designed%to%stop%good%people%doing%foolish%things%
– Not"to"stop"bad"people"doing"bad"things!"
– HDFS"believes"you"are"who"you"tell"it"you"are"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#36%
Hadoop"Security"Overview"

! Hadoop’s%security%has%had%authoriza/on%for%some%/me%
– The"ability"to"allow"people"to"do"some"things"but"not"others"
– Example:"file"permissions"
! Authen/ca/on%has%historically%been%rela/vely%weak%
– AuthorizaEon"requires"you"to"first"idenEfy"the"user"
– Hadoop’s"default"mechanism"for"doing"this"is"easily"defeated"
– Intended"to"prevent"stupid"mistakes"made"by"honest"people"
! Hadoop%can%now%op/onally%enforce%strong%authen/ca/on%
– Via"integraEon"with"Kerberos"
! We%will%cover%Hadoop%security%in%much%more%depth%later%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#37%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using%the%NameNode%Web%UI%
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#38%
NameNode"Web"UI"

! The%NameNode%exposes%its%Web%UI%on%port%50070%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#39%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using%the%Hadoop%File%Shell%
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#40%
Accessing"HDFS"via"the"Command"Line"

! HDFS%is%not%a%general%purpose%filesystem%
– Not"built"into"the"OS,"so"only"specialized"tools"can"access"it"
! End%users%typically%access%HDFS%via%the%hadoop fs%command%
– AcEons"are"specified"with"subcommands"(prefixed"with"a"minus"sign)"
– Most"subcommands"are"similar"to"corresponding"UNIX"commands"
! Display%the%contents%of%the%/user/fred/sales.txt%file%

$ hadoop fs -cat /user/fred/sales.txt

! Create%a%directory%(below%the%root)%called%reports%

$ hadoop fs -mkdir /reports

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#41%
Copying"Local"Data"To"and"From"HDFS"

! Remember%that%HDFS%is%dis/nct%from%your%local%filesystem%
– The"hadoop fs –put"command"copies"local"files"to"HDFS"
– The"hadoop fs –get"fetches"a"local"copy"of"a"file"from"HDFS"

Hadoop Cluster

$ hadoop fs -put sales.txt /reports


Client Machine

$ hadoop fs -get /reports/sales.txt

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#42%
More"hadoop fs"Command"Examples""

! Copy%file%input.txt%from%local%disk%to%the%user’s%directory%in%HDFS%

$ hadoop fs -put input.txt input.txt

– This"will"copy"the"file"to"/user/username/input.txt
! Get%a%directory%lis/ng%of%the%HDFS%root%directory%

$ hadoop fs -ls /

! Delete%the%file%/reports/sales.txt%

$ hadoop fs –rm /reports/sales.txt

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#43%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"file"shell"
!! Hands#On%Exercise:%Working%with%HDFS%
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#44%
Hands/On"Exercise:"Working"With"HDFS"

! In%this%Hands#On%Exercise,%you%will%copy%a%large%file%into%HDFS%and%explore%
the%results%in%the%NameNode%Web%UI%and%in%Linux%%
! Please%refer%to%the%Hands#On%Exercise%Manual%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#45%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#46%
EssenEal"Points"

! HDFS%distributes%large%blocks%of%data%across%a%set%of%machines%to%support%
MapReduce%data%locality%
– AssumpEon"is"that"the"typical"file"size"on"a"Hadoop"cluster"is"large"
– Block"size"is"configurable,"default"is"64MB"
! HDFS%provides%fault%tolerance%with%built#in%data%redundancy%
– The"number"of"replicas"per"block"is"configurable,"default"is"3"
! The%NameNode%daemon%maintains%all%HDFS%metadata%in%memory%and%
stores%it%on%disk%
– In"a"non/HA"configuraEon,"the"NameNode"works"with"a"Secondary"
NameNode"that"provides"administraEve"services"
– In"an"HA"configuraEon,"the"Standby"NameNode"is"configured"in"case"of"
a"NameNode"failure"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#47%
Conclusion

In this chapter, you have learned:
! What features HDFS provides
! How HDFS reads and writes files
! How the NameNode uses memory
! How Hadoop provides file security
! How to use the NameNode Web UI
! How to use the Hadoop File Shell
GeAng"Data"Into"HDFS"
Chapter"4"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"1$
Course"Chapters"
!! IntroducHon" Course"IntroducHon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
Introduc.on$to$Apache$Hadoop$
!! Ge6ng$Data$Into$HDFS$
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaHon"and"IniHal"ConfiguraHon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,"Installing,"and"
!! Hadoop"Clients"
Configuring"a"Hadoop"Cluster"
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraHon"
!! Hadoop"Security"
!! Managing"and"Scheduling"Jobs"
Cluster"OperaHons"and"
!! Cluster"Maintenance"
Maintenance"
!! Cluster"Maintenance"and"TroubleshooHng"
!! Conclusion"
!! Kerberos"ConfiguraHon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaHon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"2$
GeAng"Data"Into"HDFS"

In$this$chapter,$you$will$learn:$
! How$to$import$data$into$HDFS$with$Flume$$
! How$to$import$data$into$HDFS$with$Sqoop$
! What$REST$interfaces$Hadoop$provides$
! Best$prac.ces$for$impor.ng$data$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"3$
GeAng"Data"Into"HDFS"

! In$the$last$chapter,$you$learned$how$to$use$the$hadoop fs$command$to$
copy$data$into$and$out$of$HDFS$
! In$this$chapter,$we$will$explore$several$Hadoop$ecosystem$tools$for$
accessing$HDFS$data$from$outside$of$Hadoop$
– We"can"only"provide"a"brief"overview"of"these"tools;"consult"the"
documentaHon"at"http://archive.cloudera.com/docs"for"
full"details"on"installaHon"and"configuraHon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"4$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! Inges.ng$Data$From$External$Sources$With$Flume$$
!! Hands/On"Exercise:"Using"Flume"
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST"Interfaces"
!! Best"PracHces"for"ImporHng"Data"
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"5$
Flume:"Basics"

! Flume$is$a$distributed,$reliable,$available$service$for$efficiently$moving$
large$amounts$of$data$as$it$is$produced$
– Ideally"suited"to"gathering"logs"from"mulHple"systems"and"inserHng"
them"into"HDFS"as"they"are"generated"
! Flume$is$an$open$source$Apache$project$
– IniHally"developed"by"Cloudera"
– Included"in"CDH"
! Flume’s$design$goals:$
– Reliability"
– Scalability"
– Extensibility"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"6$
Flume:"Usage"Pa>erns"

! Flume$is$typically$used$to$ingest$log$files$from$real".me$systems$such$as$
Web$servers,$firewalls,$and$mailservers$into$HDFS$
! Currently$in$use$in$many$large$organiza.ons,$inges.ng$millions$of$events$
per$day$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"7$
Flume:"High/Level"Overview"

Agent$$ Agent$ Agent$ Agent$

encrypt$

Optionally pre-process incoming data:


Agent$ Agent$ perform transformations, suppressions,
metadata enrichment
compress$ batch$
Each agent can be configured with an
encrypt$ in-memory or durable channel.

Writes to multiple HDFS file formats


(text, SequenceFile, JSON, Avro,
others) Agent(s)$
Parallelized writes across many
collectors – as much write throughput
as required

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"8$
Flume"Agent"CharacterisHcs"

! Each$Flume$agent$has$a$source$and$a$sink$
! Source$
– Tells"the"node"where"to"receive"data"from"
! Sink$
– Tells"the"node"where"to"send"data"to"
! Channel$
– A"queue"between"the"Source"and"Sink"
– Can"be"in"memory"only"or"‘Durable’"
– Durable"channels"will"not"lose"data"if"power"is"lost"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"9$
Flume’s"Design"Goals:"Reliability"

! Channels$provide$Flume’s$reliability$
! Memory$Channel$
– Data"will"be"lost"if"power"is"lost"
! Disk"based$Channel$
– Disk/base"queue"guarantees"durability"of"data"in"face"of"a"power"loss"
! Data$transfer$between$Agents$and$Channels$is$transac.onal$
– A"failed"data"transfer"to"a"downstream"agent"rolls"back"and"retries"
! Can$configure$mul.ple$Agents$with$the$same$task$
– e.g.,"2"Agents"doing"the"job"of"1"‘collector’"–"if"one"agent"fails"then"
upstream"agents"would"fail"over"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"10$
Flume’s"Design"Goals:"Scalability"

! Scalability$
– The"ability"to"increase"system"performance"linearly"–"or"be>er"–"by"
adding"more"resources"to"the"system"
– Flume"scales"horizontally"
– As"load"increases,"more"machines"can"be"added"to"the"
configuraHon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"11$
Flume’s"Design"Goals:"Extensibility"

! Extensibility$
– The"ability"to"add"new"funcHonality"to"a"system"
! Flume$can$be$extended$by$adding$Sources$and$Sinks$to$exis.ng$storage$
layers$or$data$pla^orms$
– General"Sources"include"data"from"files,"syslog,"and"standard"output"
from"any"Linux"process"
– General"Sinks"include"files"on"the"local"filesystem"or"HDFS"
– Developers"can"write"their"own"Sources"or"Sinks"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"12$
Installing"Flume"

! Flume$is$available$as$a$tarball,$RPM$or$Debian$package$
! Flume$tarball$
– extract"the"tar"file"
– set"the"FLUME_CONF_DIR"environment"variable"

$ cd /usr/local/ && sudo tar -zxvf <path_to_flume-ng-1.1.0-cdh4.0.0b2.tar.gz>


$ export FLUME_CONF_DIR=/usr/local/flume-ng/conf

! Flume$RPM$or$Debian$Package$

# Ubuntu and other Debian systems


$ sudo apt-get install flume-ng-agent flume-ng-doc
# RedHat-compatible systems
$ sudo yum install flume-ng-agent flume-ng-doc
# SUSE systems
$ sudo zypper install flume-ng-agent flume-ng-doc

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"13$
Configuring Flume

! Configure Agent Nodes on the machine(s) generating the data
! The flume.properties file defines the sources, sinks, channels, and flow within an agent
– Copy the Flume template property file conf/flume-conf.properties.template to conf/flume.properties and edit it

# List the sources, sinks and channels for the agent
<agent>.sources = <Source>
<agent>.sinks = <Sink>
<agent>.channels = <Channel1> <Channel2>

# Set the channels for the source
<agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# Set the channel for the sink
<agent>.sinks.<Sink>.channel = <Channel1>
Agent"ConfiguraHon"Example"

!  tail -f of$a$file$as$the$source,$downstream$node$as$the$sink$

tail1.sources = src1
tail1.channels = ch1
tail1.sinks = sink1 sink2
tail1.sinkgroups = sg1

tail1.sources.src1.type = exec
tail1.sources.src1.command = tail -F /tmp/access_log
tail1.sources.src1.channels = ch1

tail1.channels.ch1.type = memory
tail1.channels.ch1.capacity = 500

tail1.sinks.sink1.type = avro
tail1.sinks.sink1.hostname = localhost
tail1.sinks.sink1.port = 6000
tail1.sinks.sink1.batch-size = 1
tail1.sinks.sink1.channel = ch1
""

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"15$
Agent"ConfiguraHon"Example"(cont'd)"

! Upstream$node$as$the$source,$HDFS$as$the$sink$

collector1.sources = src1
collector1.channels = ch1
collector1.sinks = sink1

collector1.sources.src1.type = avro
collector1.sources.src1.bind = localhost
collector1.sources.src1.port = 6000
collector1.sources.src1.channels = ch1

collector1.channels.ch1.type = memory
collector1.channels.ch1.capacity = 500

collector1.sinks.sink1.type = hdfs
collector1.sinks.sink1.hdfs.path = /collector_dir
collector1.sinks.sink1.hdfs.filePrefix = access_log
collector1.sinks.sink1.channel = ch1
""
""

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"16$
StarHng"Flume"Agents"

! Star.ng$an$agent$using$the$/etc/init.d$script$
– Can"be"used"to"start"Flume"agents"automaHcally"upon"reboot"
– Default"agent"name"is"agent
– Default"agent"configuraHon"file"is"/etc/flume-ng/conf/
flume.conf
– Start"with"sudo service flume-ng-agent start
! Star.ng$an$agent$from$the$command$line$
– Specify"the"agent"name"and"the"agent"configuraHon"file"in"the"
command"line"
– For"example:"
– flume-ng agent --conf-file \
/etc/hadoop/conf/flume-conf.properties \
--name tail-agent

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"17$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands"On$Exercise:$Using$Flume$
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST"Interfaces"
!! Best"PracHces"for"ImporHng"Data"
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"18$
Hands/On"Exercise:"Using"Flume"

! In$this$Hands"On$Exercise,$you$will$create$a$simple$Flume$configura.on$to$
store$dynamically"generated$data$in$HDFS$
! Please$refer$to$the$Hands"On$Exercise$Manual$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"19$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands/On"Exercise:"Using"Flume"
!! Inges.ng$Data$From$Rela.onal$Databases$With$Sqoop$
!! REST"Interfaces"
!! Best"PracHces"for"ImporHng"Data"
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"20$
What"is"Sqoop?"

! Sqoop$is$“the$SQL"to"Hadoop$database$import$tool”$
– Open/source"Apache"project"
– Originally"developed"at"Cloudera"
– Included"in"CDH"
! Designed$to$import$data$from$RDBMSs$(Rela.onal$Database$Management$
Systems)$into$HDFS$
– Can"also"send"data"from"HDFS"to"an"RDBMS"
! Uses$JDBC$(Java$Database$Connec.vity)$to$connect$to$the$RDBMS$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"21$
How"Does"Sqoop"Work?"

! Sqoop$examines$each$table$and$automa.cally$generates$a$Java$class$to$
import$data$into$HDFS$
! It$then$creates$and$runs$a$Map"only$MapReduce$job$to$import$the$data$
– By"default,"four"Mappers"connect"to"the"RDBMS"
– Each"imports"a"quarter"of"the"data"
$ Hadoop Cluster

Submit MapReduce Jobs


Map-Only
Tasks

Access Table
Definitions and
RDBMS /
Generate Classes
Data Warehouse /
Import and
Document-Based
Export Data
System

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"22$
Sqoop"Features"

! Imports$a$single$table,$or$all$tables$in$a$database$
! Can$specify$which$rows$to$import$
– Via"a"WHERE"clause"
! Can$specify$which$columns$to$import$
! Can$provide$an$arbitrary$SELECT$statement$
! Sqoop$can$automa.cally$create$a$Hive$table$based$on$the$imported$data$
! Supports$incremental$imports$of$data$
! Can$export$data$from$HDFS$to$a$database$table$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"23$
Sqoop"Connectors"

! Custom$Sqoop$connectors$exist$for$higher"speed$import$from$some$
RDBMSs$and$other$systems$
– Use"a"system’s"naHve"protocols"to"access"data"rather"than"JDBC"
– Provides"much"faster"performance"
– Typically"developed"by"the"third/party"RDBMS"vendor"
– SomeHmes"in"collaboraHon"with"Cloudera"
! Current$systems$supported$by$custom$connectors$include:$
– Netezza"
– Teradata"
– Oracle"Database"(connector"developed"with"Quest"Sokware)"
– Microsok"SQL"Server"
! Others$are$in$development$
! Custom$connectors$are$oien$not$open$source,$but$are$free$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"24$
Sqoop"Usage"Examples"

! List$all$databases$
sqoop list-databases --username fred -P \
--connect jdbc:mysql://dbserver.example.com/

! List$all$tables$in$the$world$database$
$sqooplist-tables --username fred -P \
--connect jdbc:mysql://dbserver.example.com/world

! Import$all$tables$in$the$world$database$
$sqoopimport-all-tables --username fred --password derf \
--connect jdbc:mysql://dbserver.example.com/world

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"25$
Sqoop"2"–"Sqoop"as"a"Service"

! New$version$of$Sqoop$can$be$run$as$a$service$on$a$centrally"available$
machine$
Metadata
Repository

Hadoop Cluster

Sqoop2
Server Map-Only
Tasks

...
Access Table
Definitions and Import and
Browser Export Data
Generate Classes

RDBMS /
Data Warehouse /
Document-Based
System

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"26$
Sqoop"2"–"Sqoop"as"a"Service"(cont’d)"

Func.onality$ Sqoop$ Sqoop2


Installa.on$and$ •  Connectors$and$JDBC$drivers$ •  Connectors$and$JDBC$drivers$are$
Configura.on$ are$installed$on$every$client$ installed$on$the$Sqoop2$server$
•  Local$configura.on$requires$ •  Requires$database$connec.vity$
root$privileges$ for$the$Sqoop2$server$
•  Database$connec.vity$
required$for$every$client$
Client$Interface$ CLI$only$ CLI,$Web$UI,$REST$$
Security$ Every$invoca.on$requires$ Administrator$specifies$creden.als$
creden.als$to$RDBMS$ when$crea.ng$server"side$
Connec.on$objects$
Resource$ No$resource$management$ Administrator$can$limit$the$number$
Management$ of$connec.ons$to$the$RDBMS$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"27$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands/On"Exercise:"Using"Flume"
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST$Interfaces$
!! Best"PracHces"for"ImporHng"Data"
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"28$
WebHDFS"

! Provides$an$HTTP/HTTPS$REST$interface$to$HDFS$
– Supports"both"reads"and"writes"from/to"HDFS"
– Can"be"accessed"from"within"a"program"or"script"
– Can"be"used"via"command/line"tools"such"as"curl"or"wget
! Simple$to$deploy$
– Modify"one"parameter"in"the"Hadoop"configuraHon"
– Restart"the"NameNode"and"DataNodes"
! Requires$client$access$to$every$DataNode$in$the$cluster$
! Does$not$support$HDFS$HA$deployments$
$

REST: REpresentational State Transfer


©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"29$
H>pFS"

! Provides$an$HTTP/HTTPS$REST$interface$to$HDFS$
– The"interface"is"idenHcal"to"the"WebHDFS"REST"interface"
! Slightly$more$complex$than$WebHDFS$to$deploy$
– Install"and"configure"an"H>pFS"server"
– Enable"proxy"access"to"HDFS"for"an"httpfs"user""
– Restart"the"NameNode(s)"
! Requires$client$access$to$the$HkpFS$server$only$
– The"H>pFS"server"then"accesses"HDFS"
! Supports$HDFS$HA$deployments$
$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"30$
WebHDFS/HttpFS REST Interface Examples

! These examples will work with either WebHDFS or HttpFS
– For WebHDFS, specify the NameNode host and port (default: 50070)
– For HttpFS, specify the HttpFS server and port (default: 14000)
! Open and get the shakespeare.txt file

$ curl -i -L "http://host:port/webhdfs/v1/user/training/input/shakespeare.txt?op=OPEN&user.name=training"

! Make the mydir directory

$ curl -i -X PUT "http://host:port/webhdfs/v1/user/training/mydir?op=MKDIRS&user.name=training"
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands/On"Exercise:"Using"Flume"
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST"Interfaces"
!! Best$Prac.ces$for$Impor.ng$Data$
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"32$
What"Do"Others"See"As"Data"Is"Imported?"

! When$a$client$starts$to$write$data$to$HDFS,$the$NameNode$marks$the$file$
as$exis.ng,$but$being$of$zero$size$
– Other"clients"will"see"that"as"an"empty"file"
! Aier$each$block$is$wriken,$other$clients$will$see$that$block$
– They"will"see"the"file"growing"as"it"is"being"created,"one"block"at"a"Hme"
! This$is$typically$not$a$good$idea$
– Other"clients"may"begin"to"process"a"file"as"it"is"being"wri>en"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"33$
ImporHng"Data:"Best"PracHces"

! Best$prac.ce$is$to$import$data$into$a$temporary$directory$
! Aier$the$file$is$completely$wriken,$move$data$to$the$target$directory$
– This"is"an"atomic"operaHon"
– Happens"very"quickly"since"it"merely"requires"an"update"of"the"
NameNode’s"metadata"
! Many$organiza.ons$standardize$on$a$directory$structure$such$as$
– /incoming/<import_job_name>/<files>"
– /for_processing/<import_job_name>/<files>"
– /completed/<import_job_name>/<files>
! It$is$the$job’s$responsibility$to$move$the$files$from$for_processing$to$
completed$aier$the$job$has$finished$successfully$
! Discussion$point:$your$best$prac.ces?$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"34$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands/On"Exercise:"Using"Flume"
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST"Interfaces"
!! Best"PracHces"for"ImporHng"Data"
!! Hands"On$Exercise:$Impor.ng$Data$With$Sqoop$
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"35$
Hands/On"Exercise:"ImporHng"Data"Using"Sqoop"

! In$this$Hands"On$Exercise,$you$will$import$data$from$a$rela.onal$database$
using$Sqoop$
! Please$refer$to$the$Hands"On$Exercise$Manual$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"36$
Chapter Topics

Getting Data Into HDFS | Introduction to Apache Hadoop

!! Ingesting Data From External Sources With Flume
!! Hands-On Exercise: Using Flume
!! Ingesting Data From Relational Databases With Sqoop
!! REST Interfaces
!! Best Practices for Importing Data
!! Hands-On Exercise: Importing Data With Sqoop
!! Conclusion

Essential Points

! You can install Flume agents on systems such as Web servers and mail
servers to extract, optionally transform, and pass data down to HDFS
– Flume scales extremely well and is in production use at many large
organizations
! Flume uses the terms source, sink, and channel to describe its actors
– A source is where an agent receives data from
– A sink is where an agent sends data to
– A channel is a queue between a source and a sink
! Using Sqoop, you can import data from a relational database into HDFS
! A REST interface is available for accessing HDFS
– To use the REST interface, you must have enabled WebHDFS or
deployed HttpFS
– The REST interface is identical whether you use WebHDFS or HttpFS
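As a concrete illustration of the REST interface, the hypothetical helper below builds a WebHDFS-style URL (host, port, and user here are placeholders; 50070 was the usual NameNode Web port in this generation of CDH, and all WebHDFS/HttpFS operations live under the /webhdfs/v1 prefix):

```python
def webhdfs_url(host, path, op, port=50070, user="hdfs"):
    """Build a WebHDFS/HttpFS REST URL for the given file operation."""
    return "http://{0}:{1}/webhdfs/v1{2}?op={3}&user.name={4}".format(
        host, port, path, op, user)

# A directory listing could then be fetched with any HTTP client, e.g.:
# curl -i "http://nn01:50070/webhdfs/v1/user/hdfs?op=LISTSTATUS&user.name=hdfs"
```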

Conclusion

In this chapter, you have learned:
! How you can import data into HDFS with Flume
! How you can import data into HDFS with Sqoop
! What REST interfaces Hadoop provides
! Best practices for importing data

MapReduce
Chapter 5

Course Chapters

!! Introduction (Course Introduction)

!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce
(Introduction to Apache Hadoop)

!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security
(Planning, Installing, and Configuring a Hadoop Cluster)

!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting
(Cluster Operations and Maintenance)

!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation
(Course Conclusion and Appendices)

MapReduce

In this chapter, you will learn:
! What MapReduce is
! What features MapReduce provides
! What the basic concepts of MapReduce are
! What the architecture of MapReduce is
! What features MapReduce version 2 provides
! How MapReduce handles failure
! How to use the JobTracker Web UI
Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

What Is MapReduce?

! MapReduce is a programming model
– Neither platform- nor language-specific
– Record-oriented data processing (key and value)
– Facilitates task distribution across multiple nodes
! Where possible, each node processes data stored on that node
! Consists of two developer-created phases
– Map
– Reduce
! In between Map and Reduce is the shuffle and sort
– Sends data from the Mappers to the Reducers

MapReduce: The Big Picture

(diagram)

What Is MapReduce? (cont'd)

! The process can be considered as being similar to a Unix pipeline

cat /my/log | grep '\.html' | sort | uniq -c > /my/outfile

(In this analogy, grep corresponds to Map, sort to the shuffle and sort, and uniq -c to Reduce)

What Is MapReduce? (cont'd)

! Key concepts to keep in mind with MapReduce:
– The Mapper works on an individual record at a time
– The Reducer aggregates results from the Mappers
– The intermediate keys produced by the Mapper are the keys on which
the aggregation will be based

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

Features of MapReduce

! Automatic parallelization and distribution
! Fault-tolerance
! Status and monitoring tools
! A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in other languages using Hadoop Streaming
! MapReduce abstracts all the 'housekeeping' away from the developer
– Developer can concentrate simply on writing the Map and Reduce
functions

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

MapReduce: Basic Concepts

! Each Mapper processes a single input split from HDFS
– Often a single HDFS block
! Hadoop passes the developer's Map code one record at a time
! Each record has a key and a value
! Intermediate data is written by the Mapper to local disk
! During the shuffle and sort phase, all the values associated with the same
intermediate key are transferred to the same Reducer
– The developer specifies the number of Reducers
! Reducer is passed each key and a list of all its values
– Keys are passed in sorted order
! Output from the Reducers is written to HDFS

MapReduce: A Simple Example

! WordCount is the 'Hello, World!' of Hadoop

Map
// assume input is a set of text files
// k is a byte offset
// v is the line for that offset
let map(k, v) =
  foreach word in v:
    emit(word, 1)

MapReduce: A Simple Example (cont'd)

! Sample input to the Mapper:

1202 the cat sat on the mat
1225 the aardvark sat on the sofa

! Intermediate data produced:

(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1),
(mat, 1), (the, 1), (aardvark, 1), (sat, 1),
(on, 1), (the, 1), (sofa, 1)

MapReduce: A Simple Example (cont'd)

! Input to the Reducer:

(aardvark, [1])
(cat, [1])
(mat, [1])
(on, [1, 1])
(sat, [1, 1])
(sofa, [1])
(the, [1, 1, 1, 1])

MapReduce: A Simple Example (cont'd)

Reduce
// k is a word, vals is a list of 1s
let reduce(k, vals) =
  sum = 0
  foreach (v in vals):
    sum = sum + v
  emit(k, sum)

MapReduce: A Simple Example (cont'd)

! Output from the Reducer, written to HDFS:

(aardvark, 1)
(cat, 1)
(mat, 1)
(on, 2)
(sat, 2)
(sofa, 1)
(the, 4)
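The complete WordCount flow — Map, shuffle and sort, Reduce — can be simulated in a few lines of Python (a sketch of the programming model only; it says nothing about Hadoop's distributed execution):

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit (word, 1) for every word in every input record."""
    for _offset, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    """Group all values for the same intermediate key; keys come out sorted."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reducer: sum the list of counts for each word."""
    for key, values in grouped:
        yield (key, sum(values))

records = [(1202, "the cat sat on the mat"),
           (1225, "the aardvark sat on the sofa")]
result = dict(reduce_phase(shuffle_and_sort(map_phase(records))))
```

Running it on the two sample records reproduces the Reducer output shown on this slide.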

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

MapReduce Terminology

! Job
– Consists of a Mapper, a Reducer, and a list of inputs
! Task
– An individual unit of work
– A job is broken down into many tasks
– Each task is either a Map task or a Reduce task
! Client
– What the user runs to submit a job
– Also refers to the machine on which this program runs

Architectural Overview

! There are two daemons in "classical" MapReduce
– JobTracker (master) – exactly one per cluster
– TaskTracker (slave) – one or more per cluster
! Slave nodes run both a TaskTracker and a DataNode daemon

(Diagram: one JobTracker coordinating four TaskTrackers)

MapReduce Job Submission

! A client submits a job to the JobTracker
– JobTracker assigns a job ID
– Client calculates the input splits for the job
– Client adds job code and configuration to HDFS

(Diagram: a client submitting a job to the JobTracker, which coordinates the TaskTrackers)

MapReduce Job Submission (cont'd)

! The JobTracker creates a Map task for each input split
– TaskTrackers send periodic "heartbeats" to the JobTracker
– These heartbeats also signal readiness to run tasks
– JobTracker then assigns tasks to these TaskTrackers

(Diagram: a TaskTracker sends a heartbeat (1) to the JobTracker, which replies with a task assignment (2))

MapReduce Job Submission (cont'd)

! The TaskTracker then forks a new JVM to run the task
– This isolates the TaskTracker from bugs or faulty code
– A single instance of task execution is called a task attempt
– Status info is periodically sent back to the JobTracker

(Diagram: a TaskTracker forking a child JVM for a task attempt)

JobTracker High Availability

! Two JobTrackers: one active and one standby
– Standby JT takes over when the active JT fails
! After a failover, the new JobTracker restarts all jobs that were running
when the failover occurred
! Available for "classical" MapReduce only

(Diagram: an active JobTracker and a standby JobTracker serving the same set of TaskTrackers)

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

MapReduce Version 2

! The "classical" MapReduce architecture has one JobTracker
– Must coordinate with every TaskTracker
– Poses a scalability limit in large clusters
! Hadoop is being re-architected to overcome this
– Work being done in the 2.x branch of Apache Hadoop
– This is called MapReduce version 2 (MRv2)
– Also known as "YARN"
! You can run either MRv1 (original) or MRv2 with CDH4
– Running MRv1 and MRv2 on the same cluster is not supported
– It will degrade performance and may result in an unstable cluster
– MRv2 is strongly discouraged for production at this time

MRv2 System Architecture

! In MRv2, there is a single Resource Manager per cluster
– Contains Scheduler and Applications Manager subcomponents
– An "application" is a MapReduce job
! Each slave node runs a Node Manager
– Monitors and manages resource usage for that node

(Diagram: a Resource Manager, containing the Scheduler and Application Manager, coordinating Node Managers on the slave nodes)

MRv2 System Architecture (cont'd)

! Slave nodes run individual tasks similar to MRv1
! For each job, one slave node is the Application Master
– Manages the application lifecycle
– Negotiates resource "containers" from the Resource Manager
– Monitors tasks running on the other slave nodes

(Diagram: one Application Master per job running on a Node Manager, negotiating with the Resource Manager)

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

MapReduce Failure Recovery

! Task processes send heartbeats to the TaskTracker
! TaskTrackers send heartbeats to the JobTracker
! Any task that fails to report in 10 minutes is assumed to have failed
– Its JVM is killed by the TaskTracker
! Any task that throws an exception is said to have failed
! Failed tasks are reported to the JobTracker by the TaskTracker
! The JobTracker reschedules any failed tasks
– It tries to avoid rescheduling the task on the same TaskTracker where it
previously failed
! If a task fails four times, the whole job fails
MapReduce Failure Recovery (cont'd)

! Any TaskTracker that fails to report in 10 minutes is assumed to have
crashed
– All tasks on the node are restarted elsewhere
– Any TaskTracker reporting a high number of failed tasks is blacklisted,
to prevent the node from blocking the entire job
– There is also a 'global blacklist' for TaskTrackers which fail on
multiple jobs
! The JobTracker manages the state of each job
– Partial results of failed tasks are ignored
Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

The JobTracker Web UI

! The JobTracker exposes its Web UI on port 50030

Drilling Down to Individual Jobs

! Clicking on an individual job name will reveal more information about that
job

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

Hands-On Exercise: Running a MapReduce Job

! In this Hands-On Exercise, you will run a MapReduce job on your pseudo-
distributed Hadoop cluster
! Please refer to the Hands-On Exercise Manual

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

Essential Points

! A MapReduce job has three phases
– The Map phase, in which Mappers take HDFS data as input and produce
intermediate data
– Shuffle and sort, which sends data to Reducers
– The Reduce phase, in which Reducers aggregate the results from the
Mappers
! Developers write code to process and aggregate data in the Map and
Reduce phases
! Shuffle and sort is built in to the Hadoop framework
! The Hadoop JobTracker and TaskTracker daemons start and manage Map
tasks, shuffle and sort, and Reduce tasks

Conclusion

In this chapter, you have learned:
! What MapReduce is
! What features MapReduce provides
! What the basic concepts of MapReduce are
! What the architecture of MapReduce is
! What features MapReduce version 2 provides
! How MapReduce handles failure
! How to use the JobTracker Web UI

Planning Your Hadoop Cluster
Chapter 6

Course Chapters

!! Introduction (Course Introduction)

!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce
(Introduction to Apache Hadoop)

!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security
(Planning, Installing, and Configuring a Hadoop Cluster)

!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting
(Cluster Operations and Maintenance)

!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation
(Course Conclusion and Appendices)

Planning Your Hadoop Cluster

In this chapter, you will learn:
! What issues to consider when planning your Hadoop cluster
! What types of hardware are typically used for Hadoop nodes
! How to optimally configure your network topology
! How to select the right operating system and Hadoop distribution
! How to plan for cluster management
Chapter Topics

Planning Your Hadoop Cluster | Planning, Installing, and Configuring a Hadoop Cluster

!! General Planning Considerations
!! Choosing the Right Hardware
!! Network Considerations
!! Configuring Nodes
!! Planning for Cluster Management
!! Conclusion

Basic Cluster Configuration

(diagram)

Thinking About the Problem

! Hadoop can run on a single machine
– Great for testing, developing
– Obviously not practical for large amounts of data
! Many people start with a small cluster and grow it as required
– Perhaps initially just four or six nodes
– As the volume of data grows, more nodes can easily be added
! Ways of deciding when the cluster needs to grow
– Increasing amount of computation power needed
– Increasing amount of data which needs to be stored
– Increasing amount of memory needed to process tasks

Cluster Growth Based on Storage Capacity

! Basing your cluster growth on storage capacity is often a good method to
use
! Example:
– Data grows by approximately 3TB per week
– HDFS set up to replicate each block three times
– Therefore, 9TB of extra storage space required per week
– Plus some overhead – say, 30%
– Assuming machines with 12 x 3TB hard drives, this equates to a new
machine required every three weeks
– Alternatively: Two years of data – 2.7 petabytes – will require
approximately 83 machines
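The arithmetic in this example is easy to check (a quick sketch; the 30% overhead, 3x replication, and 12 x 3TB drive configuration are the assumptions stated in the example):

```python
raw_growth_tb_per_week = 3.0       # new data arriving per week
replication = 3                    # HDFS block replication factor
overhead = 1.3                     # ~30% extra for temporary/intermediate data

needed_per_week = raw_growth_tb_per_week * replication * overhead  # 11.7 TB/week
node_capacity_tb = 12 * 3.0                                        # 36 TB per machine
weeks_per_new_machine = node_capacity_tb / needed_per_week         # ~3.1 weeks
```

So one new slave node is needed roughly every three weeks, as stated above.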

Chapter Topics

Planning Your Hadoop Cluster | Planning, Installing, and Configuring a Hadoop Cluster

!! General Planning Considerations
!! Choosing the Right Hardware
!! Network Considerations
!! Configuring Nodes
!! Planning for Cluster Management
!! Conclusion

Classifying Nodes

! Nodes can be classified as either 'slave nodes' or 'master nodes'
! Slave nodes run DataNode plus TaskTracker daemons
! Master nodes run either a NameNode daemon, a Secondary NameNode
daemon, or a JobTracker daemon
– On smaller clusters, NameNode and JobTracker are often run on the
same machine
– Sometimes even the Secondary NameNode is on the same machine as the
NameNode and JobTracker
– Important that at least one copy of the NameNode's metadata is
stored on a separate machine (see later)

Slave Nodes: Recommended Configurations

! Typical configurations for slave nodes
– Midline – deep storage, 1Gb Ethernet
– 12 x 3TB SATA II hard drives, in a non-RAID, JBOD* configuration
– 2 x 6-core 2.9GHz CPUs, 15MB cache
– 64GB RAM
– 2 x 1 Gigabit Ethernet
– High-end – high memory, spindle dense, 10Gb Ethernet
– 24 x 1TB Nearline/MDL SAS hard drives, in a non-RAID, JBOD*
configuration
– 2 x 6-core 2.9GHz CPUs, 15MB cache
– 96GB RAM
– 1 x 10 Gigabit Ethernet

* JBOD: Just a Bunch Of Disks

Slave Nodes: More Details (CPU)

! Hex-core CPUs are now commonly available
! Hyper-threading and quick-path interconnect (QPI) should be enabled
! Hadoop nodes are seldom CPU-bound
– They are typically disk- and network-I/O bound
– Therefore, top-of-the-range CPUs are usually not necessary

Slave Nodes: More Details (RAM)

! Slave node configuration specifies the maximum number of Map and
Reduce tasks that can run simultaneously on that node
! Each Map or Reduce task will take 2GB to 4GB of RAM
! Slave nodes should not be using virtual memory
! Ensure you have enough RAM to run all tasks, plus overhead for the
DataNode and TaskTracker daemons, plus the operating system
! Rule of thumb:
  Total number of tasks = 1.5 x number of processor cores
– This is a starting point, and should not be taken as a definitive setting
for all clusters
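The rule of thumb can be turned into a quick capacity check (a starting-point sketch; the 1.5 factor and 2-4GB per task come from the guidance above, while the 8GB daemon/OS headroom is an illustrative assumption):

```python
def max_task_slots(cores):
    """Rule of thumb: total Map + Reduce slots = 1.5 x processor cores."""
    return int(cores * 1.5)

def ram_needed_gb(cores, gb_per_task=4, daemon_and_os_gb=8):
    """RAM to run every slot at once, plus assumed headroom for the
    DataNode/TaskTracker daemons and the operating system."""
    return max_task_slots(cores) * gb_per_task + daemon_and_os_gb
```

For a 2 x 6-core slave node this suggests 18 task slots and, at the worst case of 4GB per task, around 80GB of RAM — consistent with the 64-96GB configurations recommended earlier.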

Slave Nodes: More Details (Disk)

! Hadoop's architecture impacts disk space requirements
– By default, HDFS data is replicated three times
– Temporary data storage typically requires 20-30% of a cluster's raw disk
capacity
! In general, more spindles (disks) is better
! In practice, we see anywhere from four to 24 disks per node
! Use 3.5" disks
– Faster, cheaper, higher capacity than 2.5" disks
! 7,200 RPM SATA/SATA II drives are fine
– No need to buy 15,000 RPM drives
! 8 x 1.5TB drives is likely to be better than 6 x 2TB drives
– Different tasks are more likely to be accessing different disks

Slave Nodes: More Details (Disk) (cont'd)

! A good practical maximum is 36TB per slave node
– More than that will result in massive network traffic if a node dies and
block re-replication must take place

Slave Nodes: Why Not RAID?

! Slave nodes do not benefit from using RAID* storage
– HDFS provides built-in redundancy by replicating blocks across multiple
nodes
– RAID striping (RAID 0) is actually slower than the JBOD configuration
used by HDFS
– RAID 0 read and write operations are limited by the speed of the
slowest disk in the RAID array
– Disk operations on JBOD are independent, so the average speed is
greater than that of the slowest disk
– One test by Yahoo showed JBOD performing between 30% and 50%
faster than RAID 0, depending on the operations being performed

* RAID: Redundant Array of Inexpensive Disks

What About Virtualization?

! Virtualization will incur performance and reliability penalties, including:
– Network contention
– Typically, remote rather than local disks
– And often just one volume even if the disk is local
– Typically, no way to guarantee rack placement of nodes
– Possible to have all three replicas of a block on virtual machines on the
same physical host
! Wherever possible, we recommend dedicated physical hardware for your
Hadoop cluster
– Perfectly reasonable to use virtualization for proof-of-concept clusters,
or where data center restrictions mean the use of dedicated hardware
is impossible

What About Blade Servers?

! Blade servers are not recommended
– Failure of a blade chassis results in many nodes being unavailable
– Individual blades usually have very limited hard disk capacity
– Network interconnection between the chassis and top-of-rack switch
can become a bottleneck

Node Failure

! Slave nodes are expected to fail at some point
– This is an assumption built into Hadoop
– The NameNode will automatically re-replicate blocks that were on the failed
node to other nodes in the cluster, retaining the 3x replication
requirement
– The JobTracker will automatically re-assign tasks that were running on failed
nodes
! Master nodes are single points of failure if not configured for HA
– If the NameNode goes down, the cluster is inaccessible
– If the JobTracker goes down, no jobs can run on the cluster
– All currently running jobs will fail
! Configure the NameNode for HA when running production workloads
! Consider configuring the JobTracker for HA for MRv1 workloads

Master Node Hardware Recommendations

! Carrier-class hardware
– Not commodity hardware
! Dual power supplies
! Dual Ethernet cards
– Bonded to provide failover
! RAIDed hard drives
! Reasonable amount of RAM
– 24GB for clusters of 20 nodes or less
– 48GB for clusters of up to 300 nodes
– 96GB for larger clusters

Chapter Topics

Planning Your Hadoop Cluster | Planning, Installing, and Configuring a Hadoop Cluster

!! General Planning Considerations
!! Choosing the Right Hardware
!! Network Considerations
!! Configuring Nodes
!! Planning for Cluster Management
!! Conclusion

General Network Considerations

! Hadoop is very bandwidth-intensive!
– Often, all nodes are communicating with each other at the same time
! Use dedicated switches for your Hadoop cluster
! Nodes are connected to a top-of-rack switch
! Nodes should be connected at a minimum speed of 1Gb/sec
! Consider 10Gb/sec connections in the following cases:
– Clusters storing very large amounts of data
– Clusters in which typical MapReduce jobs produce large amounts of
intermediate data

General Network Considerations (cont'd)

! Racks are interconnected via core switches
! Core switches should connect to top-of-rack switches at 10Gb/sec or
faster
! Beware of oversubscription in top-of-rack and core switches
! Consider bonded Ethernet to mitigate against failure
! Consider redundant top-of-rack and core switches

Hostname Resolution

! When configuring Hadoop, you will be required to identify various nodes
in Hadoop's configuration files
! Use hostnames, not IP addresses, to identify nodes when configuring
Hadoop
! Use either DNS or /etc/hosts for hostname resolution
! Forward and reverse lookups must work correctly whether you are using
DNS or /etc/hosts for name resolution
– Hadoop performs both forward and reverse lookups
– If the results do not match, major problems can occur
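The forward/reverse requirement can be verified before installation with a short diagnostic (a sketch using Python's socket module; run it on each node for every hostname that appears in your configuration files):

```python
import socket

def resolution_is_consistent(hostname):
    """Forward-resolve the name, reverse-resolve the resulting address,
    and confirm the round trip leads back to the same host."""
    try:
        ip = socket.gethostbyname(hostname)                      # forward lookup
        rev_name, aliases, addresses = socket.gethostbyaddr(ip)  # reverse lookup
    except OSError:
        return False
    return hostname in [rev_name] + aliases and ip in addresses
```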

Chapter Topics

Planning Your Hadoop Cluster | Planning, Installing, and Configuring a Hadoop Cluster

!! General Planning Considerations
!! Choosing the Right Hardware
!! Network Considerations
!! Configuring Nodes
!! Planning for Cluster Management
!! Conclusion

Operating System Recommendations

! Choose an OS you're comfortable administering
! CentOS: geared towards servers rather than individual workstations
– Conservative about package versions
– Very widely used in production
! Red Hat Enterprise Linux (RHEL): Red Hat-supported analog to CentOS
– Includes support contracts, for a price
! In production, we often see a mixture of RHEL and CentOS machines
– Often RHEL on master nodes, CentOS on slaves

Operating System Recommendations (cont'd)

! Ubuntu: very popular distribution, based on Debian
– Both desktop and server versions available
– Try to use an LTS (Long Term Support) version
! SuSE: popular distribution, especially in Europe
– Cloudera provides CDH packages for SuSE

Configuring the System

! Do not use Linux's LVM (Logical Volume Manager) to make all your disks
appear as a single volume
– As with RAID 0, this limits speed to that of the slowest disk
! Check the machines' BIOS* settings
– BIOS settings may not be configured for optimal performance
– For example, if you have SATA drives make sure IDE emulation is not
enabled
! Test disk I/O speed with hdparm -t
– Example:
  hdparm -t /dev/sda1
– You should see speeds of 70MB/sec or more
– Anything less is an indication of possible problems

* BIOS: Basic Input/Output System

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#27%
Configuring"The"System"(cont’d)"

! Hadoop%has%no%specific%disk%parCConing%requirements%
– Use"whatever"parEEoning"system"makes"sense"to"you"
! Mount%disks%with%the%noatime%opCon%
! Common%directory%structure%for%data%mount%points:%
"/data/<n>/dfs/nn
/data/<n>/dfs/dn
/data/<n>/dfs/snn
/data/<n>/mapred/local
! Reduce%the%swappiness*of%the%system%
– Set"vm.swappiness"to"0"or"5"in"/etc/sysctl.conf

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#28%
Filesystem"ConsideraEons"

! Cloudera%recommends%the%ext3%and%ext4%filesystems%
– ext4"is"more"commonly"used"on"new"clusters"
! XFS%provides%some%performance%benefit%during%kickstart%
– It"formats"in"0"seconds,"vs."several"minutes"for"each"disk"with"ext3/ext4"
! Currently,%more%tesCng%is%done%at%Cloudera%on%ext3/ext4%than%XFS%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#29%
OperaEng"System"Parameters"

! Increase%the%nofile%ulimit%for%the%mapred%and%hdfs%users%to%at%least%
32K%
– SePng"is"in"/etc/security/limits.conf
! Disable%IPv6%
! Disable%SELinux%if%possible%
– Incurs"a"7/10%"performance"penalty"on"a"Hadoop"cluster"
– ConfiguraEon"is"non/trivial"
! Install%and%configure%the%ntp%daemon%
– Ensures"the"Eme"on"all"nodes"is"synchronized"
– Important"for"HBase,"ZooKeeper,"Kerberos"
– Useful"when"using"logs"to"debug"problems"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#30%
Java"Virtual"Machine"(JVM)"RecommendaEons"

! Always%use%the%official%Oracle%JDK%(http://java.com/)%
– Hadoop"is"complex"soaware,"and"oaen"exposes"bugs"in"other"JDK"
implementaEons"
! Version%1.6%is%recommended%
– CDH"4"is"cerEfied"with"1.6.0_31"
– Any"later"maintenance"release"should"be"acceptable"for"producEon"
– Minimum"supported"version"is"1.6.0_8"
! Version%1.7%is%supported%starCng%from%CDH%4.2%
– There"are"restricEons"
– Refer"to"the"CDH"Release"Notes"for"details"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#31%
Which"Version"of"Hadoop?"

! CDH4.x%includes:%%
– MapReduce"v1"from"Hadoop"0.20.2"
– Stable,"recommended"for"producEon"systems"
– MapReduce"v2"and"YARN"from"Hadoop"2.x"
– SEll"experimental/alpha"
– HDFS"from"Hadoop"2.x"
– Stable,"recommended"for"producEon"systems"
– Provides"HDFS"High"Availability"
! Includes%well%over%1,000%patches%and%bugfixes%
– Including"many"developed"by"Cloudera"for"our"Support"customers"
! Ensures%interoperability%between%different%Ecosystem%projects%
! Provided%in%RPM,%Ubuntu%and%SuSE%package,%and%tarball%formats%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#32%
Chapter"Topics"

Planning,%Installing,%and%Configuring%
Planning%Your%Hadoop%Cluster%
a%Hadoop%Cluster%

!! General"Planning"ConsideraEons"
!! Choosing"the"Right"Hardware"
!! Network"ConsideraEons"
!! Configuring"Nodes"
!! Planning%for%Cluster%Management%
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#33%
Managing"Large"Clusters"

! Each%node%in%the%cluster%requires%its%own%configuraCon%files%
! Managing%a%small%cluster%is%relaCvely%easy%
– Log"in"to"each"machine"to"make"changes"
– Manually"change"configuraEon"files"
! As%the%cluster%grows%larger,%management%becomes%more%complex%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#34%
Configuration Management Tools

! Configuration management tools allow you to manage the configuration of multiple machines at once
– Can update files, restart daemons, or even reboot machines automatically where necessary
! Recommendations:
– Use configuration management tools
  – Start using such tools while the cluster is small!
– Use Cloudera Manager
  – The free edition supports clusters with an unlimited number of nodes
! Popular open source tools for managing the configuration include Puppet and Chef
– Many others exist
– Many commercial tools also exist

Chapter"Topics"

Planning,%Installing,%and%Configuring%
Planning%Your%Hadoop%Cluster%
a%Hadoop%Cluster%

!! General"Planning"ConsideraEons"
!! Choosing"the"Right"Hardware"
!! Network"ConsideraEons"
!! Configuring"Nodes"
!! Planning"for"Cluster"Management"
!! Conclusion%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#36%
EssenEal"Points"

! Master%nodes%run%the%NameNode,%Secondary%NameNode,%and%JobTracker%
– Provision"with"carrier/class"hardware"and"lots"of"RAM"
! Slave%nodes%run%DataNodes%and%TaskTrackers%
– Provision"with"industry/standard"hardware"
– Consider"your"data"storage"growth"rate"when"planning"current"and"
future"cluster"size"
! RAID%(on%slave%nodes)%and%virtualizaCon%can%incur%a%performance%penalty%
! Make%sure%that%forward%and%reverse%domain%lookups%work%when%
configuring%a%cluster%
! When%planning%your%cluster,%don’t%forget%about%configuraCon%
management%and%monitoring%tools%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#37%
Conclusion

In this chapter, you have learned:
! What issues to consider when planning your Hadoop cluster
! What types of hardware are typically used for Hadoop nodes
! How to optimally configure your network topology
! How to select the right operating system and Hadoop distribution
! How to plan for cluster management

Hadoop"InstallaBon"and""
IniBal"ConfiguraBon"
Chapter"7"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#1%
Course"Chapters"
!! IntroducBon" Course"IntroducBon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducBon"to"Apache"Hadoop"
!! GePng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop%Installa1on%and%Ini1al%Configura1on%
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,%Installing,%and%
!! Hadoop"Clients"
Configuring%a%Hadoop%Cluster%
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraBon"
!! Hadoop"Security"
!! Managing"and"Scheduling"Jobs"
Cluster"OperaBons"and"
!! Cluster"Maintenance"
Maintenance"
!! Cluster"Maintenance"and"TroubleshooBng"
!! Conclusion"
!! Kerberos"ConfiguraBon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaBon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#2%
Hadoop"InstallaBon"and"IniBal"ConfiguraBon"

In%this%chapter,%you%will%learn:%
! The%different%installa1on%configura1ons%available%in%Hadoop%
! How%to%install%Hadoop%
! How%to%specify%the%Hadoop%configura1on%
! How%to%configure%HDFS%
! How%to%configure%MapReduce%
! How%to%locate%and%configure%Hadoop%log%files%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#3%
Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment%Types%
!! Installing"Hadoop"
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#4%
Hadoop’s"Different"Deployment"Modes"

! Hadoop%can%be%configured%to%run%in%three%different%modes%
– LocalJobRunner"
– Pseudo/distributed"
– Fully"distributed"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#5%
LocalJobRunner Mode

! In LocalJobRunner mode, no daemons run
! Everything runs in a single Java Virtual Machine (JVM)
! Hadoop uses the machine's standard filesystem for data storage
– Not HDFS
! Suitable for testing MapReduce programs while unit testing
– Developers can trace code in MapReduce jobs within an IDE

Pseudo-Distributed Mode

! In pseudo-distributed mode, all daemons run on the local machine
– Each runs in its own JVM (Java Virtual Machine)
! Hadoop uses HDFS to store data (by default)
! Useful to simulate a cluster on a single machine
! Convenient for debugging programs before launching them on the 'real' cluster

Fully-Distributed Mode

! In fully-distributed mode, Hadoop daemons run on a cluster of machines
! HDFS is used to distribute data amongst the nodes
! Unless you are running a small cluster (less than 10 or 20 nodes), the NameNode and JobTracker daemons should each run on dedicated nodes
– For small clusters, it's acceptable for both to run on the same physical node

Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing%Hadoop%
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#9%
Deploying"on"MulBple"Machines"

! If%you%are%installing%mul1ple%machines,%use%some%kind%of%automated%
deployment%
– Red"Hat’s"Kickstart"
– Debian"Fully"AutomaBc"InstallaBon"
– Dell"Crowbar"
– …"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#10%
RPM/Package vs Tarballs

! CDH is available in multiple formats
– RPMs for Red Hat-style Linux distributions (RHEL, CentOS)
– Packages for Ubuntu and SuSE Linux
– Parcels for installation via Cloudera Manager (see later)
– As a tarball
! RPMs/packages include some features not in the tarball
– Automatic creation of mapred and hdfs users
– init scripts to automatically start the Hadoop daemons
  – Although these are not activated by default
– Configures the 'alternatives' system to allow multiple configurations on the same machine
! Strong recommendation: use the RPMs/packages whenever possible

RPM"InstallaBon"Steps"

1.  Install%the%Oracle%JDK%
2.  Add%the%Cloudera%repository%
3.  Install%the%RPMs%for%the%func1onality%required%on%each%host%in%the%
cluster%
4.  Edit%the%Hadoop%configura1on%files%
– Example:""
sudo yum install hadoop-hdfs-namenode install
5.  Create%required%directories%
6.  Format%HDFS%
sudo –u hdfs hdfs namenode -format

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#12%
StarBng"the"Hadoop"Daemons"

! CDH%installed%from%package%or%RPM%includes%init%scripts%to%start%the%
daemons%as%services%
–  sudo service hadoop-hdfs-namenode start
–  sudo service hadoop-hdfs-secondarynamenode start
–  sudo service hadoop-hdfs-datanode start
–  sudo service hadoop-0.20-mapreduce-jobtracker start
–  sudo service hadoop-0.20-mapreduce-tasktracker start"

! If%you%have%installed%Hadoop%manually,%or%from%the%CDH%tarball,%you%will%
have%to%start%the%daemons%manually%
! Not%all%daemons%run%on%each%machine%
– NameNode,"JobTracker"/"One"per"cluster,"unless"running"in"an"HA"
configuraBon"
– Secondary"NameNode"/"One"per"cluster"in"a"non/HA"environment"
– DataNode,"TaskTracker"/"On"each"data"node"in"the"cluster"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#13%
An"Aside:"SSH"

! Note%that%most%tutorials%tell%you%to%create%a%passwordless%SSH%login%on%
each%machine%
! This%is%not%necessary%for%the%opera1on%of%Hadoop%
– Hadoop"does"not"use"SSH"for"any"of"its"internal"communicaBons"
! ssh%is%only%required%if%you%intend%to%use%the%start-all.sh%and%stop-
all.sh%scripts%included%with%Hadoop%
– Cloudera"recommends"not"using"these"scripts"
– If"you"do"not"use"these"scripts,"you"do"not"need"to"configure"the"
masters"and"slaves"files"
! Passwordless%SSH%is%configured%in%your%lab%environment%as%a%convenience%%
– It"is"not"required"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#14%
Verify"the"InstallaBon"

! To%verify%that%everything%has%started%correctly,%check%by%running%an%
example%job:%
– Copy"files"into"Hadoop"for"input"
hadoop fs –mkdir config-files
hadoop fs -put /etc/hadoop/conf/*.xml config-files
– Run"an"example"MapReduce"job"
hadoop jar /usr/lib/hadoop-0.20-mapreduce/\
hadoop-examples.jar grep config-files output 'dfs[a-z.]+'
– View"the"output"
–  hadoop fs -cat output/part-00000 | head

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#15%
Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing"Hadoop"
!! Specifying%the%Hadoop%Configura1on%
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#16%
Hadoop’s"ConfiguraBon"Files"

! Each%machine%in%the%Hadoop%cluster%has%its%own%set%of%configura1on%files%
! Configura1on%files%all%reside%in%Hadoop’s%conf%directory%
– Typically"/etc/hadoop/conf
! Most%of%the%configura1on%files%are%wrieen%in%XML%
! Upon%startup,%the%Hadoop%daemons%access%the%configura1on%files%
– Afer"modifying"configuraBon"parameters,"you"must"restart"Hadoop"
daemons"for"your"changes"to"take"effect"
%
%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#17%
Hadoop’s"ConfiguraBon"Files"–"Overview"

File Type%of%Configura1on
core-site.xml Core%
hdfs-site.xml HDFS%
mapred-site.xml MapReduce%
hadoop-policy.xml Access%control%policies%
log4j.properties Logging%
hadoop-metrics.properties, Metrics%
hadoop-metrics2.properties
include,%exclude%(file%names%are% Host%inclusion/exclusion%in%a%cluster%
configurable)%
allocations.xml%(file%name%is% FairScheduler%%
configurable)%
masters,%slaves Scripted%startup%(not%recommended)%
hadoop-env.sh Environment%variables%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#18%
Environment Setup: hadoop-env.sh

! hadoop-env.sh sets environment variables necessary for Hadoop to run
– HADOOP_CLASSPATH
– HADOOP_HEAPSIZE
– HADOOP_LOG_DIR
– HADOOP_PID_DIR
– JAVA_HOME
! Values are sourced into all Hadoop control scripts and therefore the Hadoop daemons
! If you need to set environment variables, do it here to ensure that they are passed through to the control scripts

Environment Setup: hadoop-env.sh (cont'd)

! HADOOP_HEAPSIZE
– Controls the heap size for all the Hadoop daemons
– Default 1GB
– Best practice: set the heap for individual daemons instead
! HADOOP_NAMENODE_OPTS
– Java options for the NameNode
– At least 4GB: -Xmx4g
! HADOOP_JOBTRACKER_OPTS
– Java options for the JobTracker
– At least 4GB: -Xmx4g
! HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS
– Set to 1GB each: -Xmx1g

Sample"XML"ConfiguraBon"File"

! Sample%configura1on%file%(mapred-site.xml)%

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"
href="configuration.xsl">
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#21%
Configuration Value Precedence

! Configuration parameters can be specified more than once
! The highest-precedence value takes priority
! Precedence order (lowest to highest):
– *-site.xml on the slave node
– *-site.xml on the client machine
– Values set explicitly in the Job object for a MapReduce job
! If a value in a configuration file is marked as final, it overrides all others

<property>
  <name>some.property.name</name>
  <value>somevalue</value>
  <final>true</final>
</property>

Recommended Parameter Values

! There are many different parameters which can be set
! Defaults are documented at http://archive.cloudera.com/cdh4/cdh/4/hadoop
– Click the links under Configuration
! Hadoop is still a young system
– 'Best practices' and optimal values change as more and more organizations deploy Hadoop in production
! Here we present some of the key parameters, and suggest recommended values
– Based on our experiences working with clusters ranging from a few nodes up to 1,500+

Aside:"Deprecated"ProperBes"

! Many%proper1es%have%recently%been%renamed%in%Apache%Hadoop%
– This"change"affects"CDH4,"but"not"earlier"versions"
– Some"examples"below"(see"documentaBon"for"complete"list)"
"
Deprecated%Property%Name
" New%Property%Name
"
dfs.data.dir dfs.datanode.data.dir
"
dfs.http.address dfs.namenode.http-address
"
fs.default.name fs.defaultFS
"
! In%CDH4,%you%can%use%either%the%old%or%new%name%
– The"old"property"names"are"now"deprecated,"but"sBll"work"
– We"will"use"the"old"names,"except"for"CDH4/specific"features"
%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#24%
Reading"ConfiguraBon"Changes"

! Cluster%daemons%generally%need%to%be%restarted%to%read%in%changes%to%their%
configura1on%files%
! DataNodes%do%not%need%to%be%restarted%if%only%NameNode%parameters%
were%changed%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#25%
Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing"Hadoop"
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing%Ini1al%HDFS%Configura1on%
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#26%
hdfs-site.xml

! The single most important configuration value on your entire cluster, used by the NameNode:

dfs.name.dir    Where on the local filesystem the NameNode stores its
                metadata. A comma-separated list. Default is
                ${hadoop.tmp.dir}/dfs/name.

! Loss of a NameNode's metadata will result in the effective loss of all the data in its namespace
– Although the blocks will remain, there is no way of reconstructing the original files without the metadata
! This must be at least two disks (or a RAID volume) on the NameNode, plus an NFS mount elsewhere on the network
– Failure to set this correctly will result in eventual loss of your cluster's data

hdfs-site.xml (cont'd)

! A NameNode will write to the edit log in all directories in dfs.name.dir synchronously
! If a directory in the list disappears, the NameNode will continue to function
– It will ignore that directory until it is restarted
! Recommended mount options for the NFS mount point:
  tcp,soft,intr,timeo=10,retrans=10
– Soft mount, so the NameNode will not hang if the mount point disappears
– Will retry transactions 10 times before they are deemed to have failed (timeo is measured in tenths of a second)
! Note: no space between the comma and the next directory name in the list!
– Example: /disk1/dfs/nn,/disk2/dfs/nn

hdfs-site.xml (cont'd)

dfs.block.size    The block size for new files, in bytes. Default is 67108864
                  (64MB). Recommended: 134217728 (128MB). Note: this is a
                  client-side setting.

dfs.data.dir      Where on the local filesystem a DataNode stores its blocks.
                  Can be a comma-separated list of directories (no spaces
                  between the comma and the path); round-robin writes to the
                  directories (no redundancy). Used by DataNodes; can be
                  different on each DataNode.

hdfs-site.xml (cont'd)

dfs.http.address            The address and port used for the NameNode Web UI.
                            Example: <your_namenode>:50070. Used by the
                            NameNode.

dfs.replication             The number of times each block should be replicated
                            when a file is written. Default: 3. Recommended: 3.
                            Note: this is a client-side parameter.

dfs.datanode.du.reserved    The amount of space on each volume which cannot be
                            used for HDFS block storage. Default: 0.
                            Recommended: at least 10GB. Used by DataNodes.

core-site.xml

fs.default.name    The name of the default filesystem. Usually includes the
                   filesystem type, plus the NameNode's hostname and port
                   number. Example: hdfs://<your_namenode>:8020/
                   Used by every machine which needs access to the cluster,
                   including all nodes running Hadoop daemons.

core-site.xml (cont'd)

hadoop.tmp.dir    Base temporary directory, both on the local disk and in
                  HDFS. Default is /tmp/hadoop-${user.name}. Used by all
                  nodes.

                  This parameter is used to derive defaults for numerous
                  other configuration parameters. For example, the default
                  value for dfs.data.dir is
                  file://${hadoop.tmp.dir}/dfs/data

Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing"Hadoop"
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing%Ini1al%MapReduce%Configura1on%
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#33%
mapred-site.xml

mapred.job.tracker        Hostname and port of the JobTracker. Example:
                          my_job_tracker:8021. Used by the JobTracker,
                          TaskTrackers, and clients.

mapred.child.java.opts    Java options passed to the TaskTracker child
                          processes. Default is -Xmx200m (200MB of heap
                          space). Recommendation: increase to 512MB or 1GB,
                          depending on the requirements from your developers.
                          Used by TaskTrackers.

mapred.child.ulimit       Maximum virtual memory in KB allocated to any child
                          process of the TaskTracker. If specified, set to
                          1.5x the value of mapred.child.java.opts. Used by
                          TaskTrackers.

mapred-site.xml (cont'd)

mapred.local.dir     The local directory where MapReduce stores its
                     intermediate data files. May be a comma-separated list
                     of directories on different devices. Recommendation:
                     list directories on all disks, and set
                     dfs.datanode.du.reserved (in hdfs-site.xml) such that
                     approximately 25% of the total disk capacity cannot be
                     used by HDFS. Example: for a node with 4 x 1TB disks,
                     set dfs.datanode.du.reserved to 250GB. Used by
                     TaskTrackers.

mapred.system.dir    The HDFS directory where MapReduce stores shared files
                     during a job run. Example: /mapred/system. Used by
                     TaskTrackers.

mapred-site.xml (cont'd)

mapreduce.jobtracker.staging.root.dir    The root directory in HDFS below which users' job files
                                         are stored. This should be set to the value of the
                                         directory under which user directories are stored.
                                         Recommendation: /user. Used by TaskTrackers.

mapred-site.xml (cont'd)

mapred.tasktracker.map.tasks.maximum       Number of Map tasks which can be run simultaneously
                                           by the TaskTracker. Used by TaskTrackers.

mapred.tasktracker.reduce.tasks.maximum    Number of Reduce tasks which can be run simultaneously
                                           by the TaskTracker. Used by TaskTrackers.

! Rule of thumb: the total number of Map + Reduce tasks on a node should be approximately 1.5x the number of processor cores on that node
– Assuming there is enough RAM on the node
– This should be monitored
  – If the node is not processor- or I/O-bound, increase the total number of tasks
– Typical distribution: 60% Map tasks, 40% Reduce tasks, or 70% Map tasks, 30% Reduce tasks

Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing"Hadoop"
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop%Logging%
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#38%
Hadoop"Log"Files"

Type%of%Log Descrip1on
Daemon% InformaBonal,"warning,"and"error"messages"
generated"by"Hadoop"daemons."Each"Hadoop"
daemon"produces"two"log"files:"
•  .log"–"First"port"of"call"when"diagnosing"
problems""
•  .out"–"CombinaBon"of"stdout"and"stderr"
during"daemon"startup,"doesn’t"usually"contain"
much"output""
Task% stdout,"stderr,"and"syslog"output"generated"
by"MapReduce"applicaBons."
Job%Configura1on% Job"configuraBon"sePngs"specified"by"the"
developer."
Job%History% Job"summary"and"counters,"task"summary"and"
analysis,"stack"traces"for"any"thrown"excepBons,"
URLs"to"navigate"to"the"task"logs."

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#39%
Daemon"Logs"/"LocaBon"

Type%of% Loca1on
Daemon%Log
MRv1% Default"directory:"/var/log/hadoop-0.20-mapreduce
(JobTracker,% (Set"HADOOP_LOG_DIR"in"hadoop-env.sh"to"configure)"
TaskTracker)% Default"log"file"names:"hadoop-hadoop-<daemon>-
<hostname>.{log|out}
Example:"/var/log/hadoop-0.20-mapreduce/hadoop-
hadoop-tasktracker-elephant.log
HDFS%and%MRv2% Default"directory:"/var/log/hadoop-<component>
(Set"HADOOP_LOG_DIR"in"hadoop-env.sh"to"configure)"
Default"log"file"names:"hadoop-<component>-<daemon>-
<hostname>.{log|out}
Example:"/var/log/hadoop-hdfs/hadoop-hdfs-
datanode-tiger.log

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#40%
Daemon"Logs"/"RetenBon"

Type%of% Default%Reten1on
Daemon%Log
All%.out%Files% Rotated"when"daemon"restarts,"five"files"retained"
MRv1%.log% Rotated"daily"
Files% Cannot"limit"file"size"or"the"number"of"files"kept"
Provide"your"own"scripts"to"compress,"archive,"delete"logs"
HDFS%and% Maximum"size"of"generated"log"files:"256MB""
MRv2%.log% Number"of"files"retained:"20"
Files% Maximum"disk"space"for"logs:"5GB"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#41%
Daemon"Logs"–"RetenBon"ConfiguraBon"

! The%Hadoop%daemons%use%file*appenders*for%log%messages%
– File"appenders"deliver"log"events"to"files"
– Hadoop’s"file"appenders"are"configured"in"/etc/hadoop/conf/
log4j.properties
! Hadoop%uses%two%default%file%appenders%
– HDFS"and"MRv2"daemons"–"RollingFileAppender"(RFA)"
– Maximum"size"of"generated"log"files"and"number"of"files"retained"
are"configurable"
– MRv1"daemons"–""DailyRollingFileAppender"(DRFA)"
– RotaBon"frequency"is"configurable"
! To%change%the%default%RFA%configura1on%
– Change"hadoop.log.maxfilesize"and"
hadoop.log.maxbackupindex"in"log4j.properties"
– Restart"the"Hadoop"daemons"
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#42%
Daemon"Logs"–"Changing"the"Log"Level"

! Use%the%hadoop daemonlog -setlevel%command%


– hadoop daemonlog –setlevel
<daemon_host>:<daemon_HTTP_port> <class> <level>
! Valid%log%levels:%FATAL,%ERROR,%WARN,%INFO,%DEBUG,%TRACE%
! New%log%level%does%not%persist%aier%a%daemon%restarts

Daemon Port Class%


NameNode" 50070" org.apache.hadoop.hdfs.server.
namenode.NameNode
Secondary" 50090" org.apache.hadoop.hdfs.server.
NameNode" namenode.SecondaryNameNode
DataNode% 50075" org.apache.hadoop.hdfs.server.
datanode.DataNode
JobTracker% 50030" org.apache.hadoop.mapred.JobTracker
TaskTracker% 50060" org.apache.hadoop.mapred.TaskTracker

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#43%
Daemon"Logs"–"Changing"the"Log"Level"(cont’d)"

! Use%the%logLevel%Web%UI%
– http://<daemon_host>:<daemon_HTTP_port>/logLevel
! New%log%level%does%not%persist%aier%a%daemon%restarts%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#44%
Daemon"Logs"–"Changing"the"Log"Level"(cont’d)"

! The%persistent%log%level%is%configured%in%%
/etc/hadoop/conf/log4j.properties
! Default%log%level%configured%by%hadoop.root.logger
– Default"is"INFO
! Log%level%can%be%set%for%any%specific%Hadoop%daemon%with%
log4j.logger.class.name=LEVEL
– Example:"
log4j.logger.org.apache.hadoop.mapred.JobTracker=INFO

! Restart%the%Hadoop%daemon%to%make%the%log%level%change%take%effect%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#45%
Task"Logs"

Type%of%Task% Loca1on
Log
MRv1% Default"directory:"/var/log/hadoop-0.20-mapreduce/
userlogs
(Set"HADOOP_LOG_DIR"in"hadoop-env.sh"to"configure)"
Contains"symbolic"links"to"paths"under"mapred.local.dir
Default"retenBon:"24"hours"
Configure"retenBon"with"mapred.userlog.retain.hours
MRv2% Default"directory:"/var/log/hadoop-yarn/userlogs
(Set"HADOOP_LOG_DIR"in"hadoop-env.sh"to"configure)"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#46%
Job"Logs"

Job%Log%File Contents Loca1on%and%Reten1on%


Job%Configura1on% Job"configuraBon"sePngs" ${hadoop.log.dir}/<job_id>_
XML%File% specified"by"the"developer" conf.xml
(Set"HADOOP_LOG_DIR"in"hadoop-
env.sh"to"configure)"
RetenBon:"mapred.jobtracker.
retirejob.interval"milliseconds"
Default:"1"day"(24"*"60"*"60"*"1000)"
Job%History%on% Job"summary"and"counters" hadoop.job.history.location
Local%Disk% Task"summary"and"analysis" Default"locaBon:""
Stack"traces"for"any"thrown" ${hadoop.log.dir}/history
excepBons" RetenBon:"30"days""
URLs"to"navigate"to"task"logs" "

Job%History%in% Same"as"Job"History"on"local" hadoop.job.history.user.


HDFS% disk" location
Default"locaBon:""
<job_output_directory>/_logs/
history
RetenBon:"As"long"as"the"output"
directory"exists"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#47%
Job Logs (cont’d)

! The JobTracker also keeps the data logged to the job logs in memory for a
  limited time
! You can access the job history from the command line:

  mapred job -history all <job_output_dir_in_HDFS>

Hands-On Exercise: Installing a Hadoop Cluster

! In this Hands-On Exercise, you will create a Hadoop cluster with four
  instances
! Please refer to the Hands-On Exercise Manual
! Cluster deployment after exercise completion:

Essential Points

! Hadoop can run in three modes: local, pseudo-distributed, and distributed
! After installing Hadoop from CDH, you can use the service command to start,
  stop, and check the status of Hadoop daemons
  – Example: sudo service hadoop-hdfs-datanode start
! Hadoop’s configuration resides in the /etc/hadoop/conf directory by default
  – Most configuration properties are name/value pairs in XML files
  – Example:
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1g</value>
    </property>
! Hadoop daemons produce log files with extensions .log and .out
  – The .log file is the first place to look when you run into problems

Conclusion

In this chapter, you have learned:
! The different installation configurations available in Hadoop
! How to install Hadoop
! How to specify the Hadoop configuration
! How to configure HDFS
! How to configure MapReduce
! How to locate and configure Hadoop log files

Installing and Configuring
Hive, Impala, and Pig
Chapter 8

Installing and Configuring Hive, Impala, and Pig

In this chapter, you will learn:
! Hive features and basic configuration
! Impala features and basic configuration
! Pig features and installation

Note

! Note that this chapter does not go into any significant detail about Hive,
  Impala, or Pig
! Our intention is to draw your attention to issues System Administrators
  will need to deal with if users request that these products be installed
  – The Cloudera Data Analyst Training: Using Pig, Hive, and Impala with
    Hadoop course provides detailed information about how to use these
    three components

Chapter Topics

Installing and Configuring Hive,     Planning, Installing, and Configuring
Impala, and Pig                      a Hadoop Cluster

!! Hive
!! Impala
!! Pig
!! Hands-On Exercise: Querying HDFS With Hive and Cloudera Impala
!! Conclusion

Using Hive to Query Large Datasets

! Hive is an open source Apache project
  – Originally developed at Facebook
! Motivation: many data analysts are very familiar with SQL (the Structured
  Query Language)
  – The de facto standard for querying data in Relational Database
    Management Systems (RDBMSs)
! Data analysts tend to be far less familiar with programming languages
  such as Java
! Hive provides a way to query data in HDFS using a SQL-like language

Sample Hive Query

SELECT * FROM movies m
JOIN scores s
  ON (m.id = s.movie_id)
WHERE m.year > 1995
ORDER BY m.name DESC
LIMIT 50;

What Hive Provides

! Hive allows users to query data using HiveQL, a language very similar to
  standard SQL
! Hive turns HiveQL queries into standard MapReduce jobs
  – Automatically runs the jobs, and displays the results to the user
! Note that Hive is not an RDBMS!
  – Results take several seconds, minutes, or even hours to be produced
  – Not possible to modify the data using HiveQL
  – UPDATE and DELETE are not supported

The Hive Metastore

! A ‘Table’ in Hive represents an HDFS directory
  – Hive interprets all files in the directory as the contents of the table
  – Hive stores information about how the rows and columns are delimited
    within the files in the Hive Metastore
! By default, Hive uses a Metastore on the user’s local machine
  – The Metastore uses Apache Derby, a Java-based RDBMS
! A shared Metastore is a database in an RDBMS such as MySQL
  – If multiple users will be running Hive, the System Administrator should
    configure a shared Metastore for all users
  – On production systems you will always configure a shared Metastore

The Hive Metastore Service

! The Hive Metastore service is optional, but recommended when you use a
  shared Metastore
! Hive clients make calls to the Hive Metastore service using the Thrift API
! The Hive Metastore service connects to the Metastore using JDBC

  Client --(Thrift API)--> Hive Metastore Service --(JDBC)--> RDBMS

! A variety of clients access the shared Metastore by calling the Hive
  Metastore service, including:
  – The Hive CLI
  – Cloudera Impala
  – Beeswax, a Hive UI used by Hue

Hive Deployment

(Diagram: the Hive CLI submits MapReduce jobs (Mappers and Reducers) to the
Hadoop cluster, and creates or accesses Hive schema definitions in a local
or shared Metastore.)

Installing and Configuring Hive

! Hive runs on a user’s machine
  – Not on the Hadoop cluster itself
  – A user can set up Hive with no System Administrator input
  – Using the standard Hive command-line or Web-based interface
  – If users will be running JDBC-based clients, you can run Hive as a
    service on a centrally-available machine
! To install, run sudo yum install hive
! Configure the client so Hive can access the Hadoop cluster:
  – If the client has core-site.xml and mapred-site.xml configured to
    access the Hadoop cluster, Hive will use the configuration from those
    files
  – If not, configure fs.default.name and mapred.job.tracker
    in /etc/hive/conf/hive-site.xml

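For the second case, a minimal hive-site.xml sketch might look like the following; the host names and ports are placeholders, not values taken from this course’s cluster:

```xml
<?xml version="1.0"?>
<!-- /etc/hive/conf/hive-site.xml (MRv1 client; hosts are examples) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```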
Creating and Configuring a Shared Metastore

1.  Create a database in your RDBMS
    –  CREATE DATABASE metastore;
2.  Import the Hive metastore schema into the database
    –  SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-version.mysql.sql;
3.  Create a database user with appropriate privileges
    –  CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
    –  GRANT SELECT,INSERT,UPDATE,DELETE ON metastore.* TO 'hiveuser'@'%';

Creating and Configuring a Shared Metastore (cont’d)

4.  Install and start the Hive Metastore service:
    –  sudo yum install hive-metastore
    –  sudo service hive-metastore start
5.  Modify hive-site.xml on each user’s machine to refer to the shared
    Metastore

Sample hive-site.xml Properties for a MySQL Shared Metastore

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://elephant:3306/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://192.168.123.1:9083</value>
</property>

HiveServer2 – Hive as a Service

(Diagram: the Beeline CLI and JDBC/ODBC applications connect to the
HiveServer2 server, which submits MapReduce jobs (Mappers and Reducers) to
the Hadoop cluster and accesses a shared Metastore.)

HiveServer2 – Hive as a Service (cont’d)

Functionality      Hive CLI                          HiveServer2
Installation and   •  Local Metastore supported      •  Local Metastore not supported
Configuration      •  If using a shared Metastore,   •  JDBC drivers to access the
                      JDBC drivers to access the        shared Metastore are installed
                      shared Metastore are              on the HiveServer2 server
                      installed on every client
                   •  Local configuration requires
                      root privileges
Client Interface   Hive CLI                          Beeline CLI, JDBC- and ODBC-based
                                                     applications
Security           •  If using a shared Metastore,   •  Credentials to the shared
                      every invocation requires         Metastore RDBMS are stored on
                      credentials to the RDBMS          the server
                   •  Supports Kerberos              •  Supports Kerberos authentication
                      authentication                    (the original HiveServer did not)

What Is Impala?

! Impala also allows users to query data in HDFS using HiveQL
! An open source project created at Cloudera
! Impala uses the same shared Metastore that Hive uses
! Unlike Hive, Impala does not turn queries into MapReduce jobs
  – Impala queries run on an additional set of daemons that run on the
    Hadoop cluster
    – Often referred to as ‘Impala Servers’
  – Impala queries run significantly faster than Hive queries
    – Tests show improvements of 10x to 50x or more

Deploying Impala

! Impala Servers should reside on each DataNode host
  – Can coexist with TaskTrackers if you run Impala and MapReduce
  – Memory usage is configurable
! A single Impala State Store daemon
  – Very lightweight
  – Co-locate with the NameNode, Secondary NameNode, or another server

(Diagram: each DataNode host in the cluster runs an Impala Server; the
Impala State Store Server is co-located with the NameNode.)

Installing, Configuring, and Starting Impala: Overview of Steps

1.  Install the Impala software*
2.  Configure HDFS for optimal Impala performance*
3.  Restart all the DataNodes
4.  If necessary, configure hive-site.xml for a shared Hive Metastore
5.  Copy hive-site.xml, core-site.xml, hdfs-site.xml, and
    log4j.properties to /etc/impala/conf on all the hosts that will run
    Impala Servers
6.  Configure startup options in /etc/default/impala on all the hosts that
    will run impalad or the Impala State Store Server*
7.  Start the Impala Servers and the Impala State Store Server*

* Described later in this section

Installing Impala Software

! Install the Impala Server on all hosts running DataNodes:
  – sudo yum install impala-server
! Install the Impala State Store Server on a single host:
  – sudo yum install impala-state-store
! Install the Impala shell on clients:
  – sudo yum install impala-shell

Configuring HDFS for Optimal Impala Performance

! Configure HDFS to perform short-circuit reads. This configuration is
  mandatory. In hdfs-site.xml:

  dfs.client.read.shortcircuit          Allows daemons to read directly off
                                        their host's disks instead of having
                                        to open a socket to talk to
                                        DataNodes. Required value: true.
  dfs.domain.socket.path                Short-circuit reads use a UNIX
                                        domain socket, which requires a path.
                                        Recommended value:
                                        /var/run/hadoop-hdfs/dn._PORT.
  dfs.client.file-block-storage-        Timeout on a call to get the
  locations.timeout                     locations of required file blocks.
                                        Recommended value: 3000.

! Note: These configuration parameters are for CDH 4.2 and higher. For CDH
  4.1, refer to the documentation for an alternate configuration

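Assembled from the table above, the corresponding hdfs-site.xml additions might read as follows for CDH 4.2 and higher; the values follow the required and recommended settings, and _PORT is a literal placeholder substituted by the DataNode:

```xml
<!-- Additions inside <configuration> in hdfs-site.xml on each DataNode -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>
```

Remember to restart all the DataNodes after making this change.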
Typical Changes to Impala Startup Options

! In /etc/default/impala:

  IMPALA_STATE_STORE_HOST          The host running the Impala State Store
                                   Server. Default value: 127.0.0.1.
                                   Required value: host name or IP address
                                   of the host running the Impala State
                                   Store Server
  -mem_limit argument in the       The amount of system memory available to
  export IMPALA_SERVER_ARGS        Impala. Default value: 100%. Set as
  statement                        needed based on system activity.
                                   Examples: -mem_limit=70%,
                                   -mem_limit=31GB

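Taken together, the two settings above might appear in /etc/default/impala roughly as follows. The host name, the placement of the -state_store_host flag, and the memory limit are illustrative, so check the file shipped with your installation rather than copying this verbatim:

```shell
# /etc/default/impala (excerpt; values are examples, not defaults)
IMPALA_STATE_STORE_HOST=statestore.example.com

# Pass the state store location and a memory cap to each Impala Server
export IMPALA_SERVER_ARGS=" \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -mem_limit=70%"
```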
Starting Impala Software

! Impala Servers:
  – sudo service impala-server start
! Impala State Store Server:
  – sudo service impala-state-store start
! impala-shell client:
  – impala-shell

What Is Pig?

! Pig is another high-level abstraction on top of MapReduce
! An open source Apache project
  – Originally created at Yahoo
! Provides a scripting language known as Pig Latin
  – Composed of operations that are applied to the input data to produce
    output
  – The language will be relatively easy to learn for people experienced
    in Perl, Ruby, or other scripting languages
  – Easy to write complex tasks such as joins of multiple datasets
  – Under the covers, Pig Latin scripts are converted to MapReduce jobs

Sample Pig Script

movies = LOAD '/data/films' AS
    (id:int, name:chararray, year:int);
ratings = LOAD '/data/ratings' AS
    (movie_id:int, user_id:int, score:int);
jnd = JOIN movies BY id, ratings BY movie_id;
recent = FILTER jnd BY year > 1995;
srtd = ORDER recent BY name DESC;
justafew = LIMIT srtd 50;
STORE justafew INTO '/data/pigoutput';

Installing Pig

! Pig runs as a client-side application
  – There is nothing extra to install on the cluster
! To install, run sudo yum install pig
! Configure the client so Pig can access the Hadoop cluster:
  – If the client has core-site.xml and mapred-site.xml configured to
    access the Hadoop cluster, Pig will use the configuration from those
    files
  – If not, configure fs.default.name and mapred.job.tracker
    in /etc/pig/conf/pig.properties

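For the second case, the equivalent pig.properties entries might look like this; the host names and ports are placeholders, not values from this course’s cluster:

```properties
# /etc/pig/conf/pig.properties (excerpt; hosts are examples)
fs.default.name=hdfs://namenode.example.com:8020
mapred.job.tracker=jobtracker.example.com:8021
```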
Hands-On Exercise: Querying HDFS With Hive and Cloudera Impala

! In this Hands-On Exercise, you will install and configure Hive and Impala
  on your cluster and run queries
! Please refer to the Hands-On Exercise Manual
! Cluster deployment after exercise completion:

Essential Points

! Hive is a high-level abstraction on top of MapReduce
  – Runs MapReduce jobs on Hadoop based on HiveQL statements
! Impala runs a separate set of daemons from MapReduce to process HiveQL
  queries
  – One State Store Server
  – Impala Server daemons are co-located with DataNodes
! Hive and Impala use a common metastore to store metadata such as column
  names and data types
  – For Hive, the metastore can be single-user
  – For Impala, the metastore must be shareable
! Pig is another high-level abstraction on top of MapReduce
  – Runs MapReduce jobs on Hadoop based on Pig Latin statements

Conclusion

In this chapter, you have learned:
! Hive features and basic configuration
! Impala features and basic configuration
! Pig features and installation

Hadoop Clients
Chapter 9

Hadoop Clients

In this chapter, you will learn:
! What Hadoop clients are
! How to install and configure Hadoop clients
! How to install and configure Hue
! How Hue authenticates and authorizes user access

Chapter Topics

Hadoop Clients                       Planning, Installing, and Configuring
                                     a Hadoop Cluster

!! What Are Hadoop Clients?
!! Installing and Configuring Hadoop Clients
!! Installing and Configuring Hue
!! Hue Authentication and Authorization
!! Hands-On Exercise: Using Hue to Control Hadoop User Access
!! Conclusion

Hadoop Client

(Diagram: a client uses the Hadoop APIs to access functionality within a
Hadoop cluster, namely the components running on it: HDFS and MapReduce.)

! A Hadoop client requires
  – The Hadoop API
  – Configuration to connect to Hadoop components

Examples of Hadoop Clients

! Command-line clients
  – The hadoop shell (hadoop fs, hadoop job)
  – The Hive shell
  – The Pig shell
  – The Sqoop command-line interface
! Server daemons
  – Flume agent
  – Centralized Sqoop and Hive servers
  – Oozie
! Mapper and Reducer tasks

Deployment Example – End Users’ Systems as Hadoop Clients

(Diagram: end users’ systems connect directly to the JobTracker and
NameNode in the Hadoop cluster.)

Deployment Example – Servers as Hadoop Clients

(Diagram: end users’ browsers connect to intermediary servers, which in
turn connect to the JobTracker and NameNode in the Hadoop cluster.)

Installing Commonly-Used Hadoop Clients

Client                        Installation Command
                              (Run With sudo yum install)
Hadoop Shell and MapReduce    hadoop-client
Hive Shell                    hive
Pig Shell                     pig
Sqoop Shell                   sqoop
Flume Agent                   flume-ng-agent
Centralized Hive Server       hive-server2
Centralized Sqoop Server      sqoop2
Oozie                         oozie

Installing the Hadoop APIs on Clients

! Hadoop client installers automatically include any required Hadoop APIs
  as dependencies if they are not already installed

==========================================================================
Package Arch Version Repository
==========================================================================
Installing:
hadoop-client x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
Installing for dependencies:
bigtop-jsvc x86_64 1.0.10-1.cdh4.2.0.p0.13.el6 Cloudera-cdh4
bigtop-utils noarch 0.4+502-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
hadoop x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
hadoop-0.20-mapreduce
x86_64 0.20.2+1341-1.cdh4.2.0.p0.21.el6 Cloudera-cdh4
hadoop-hdfs x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
hadoop-mapreduce x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
hadoop-yarn x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
zookeeper noarch 3.4.5+14-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4

Configuring Hadoop Clients

! Clients must be configured to identify the Hadoop cluster
  – NameNode location
  – JobTracker location
! You can simply copy the Hadoop configuration files to clients
  – Or a subset of the Hadoop configuration
! You will need additional client-specific configuration for some clients
  – For example, hive-site.xml, pig.properties,
    flume-conf.properties, etc.
! Cloudera Manager handles client configuration
  – Automatically deploys configurations to CM-managed hosts in the
    Gateway role
  – Makes configuration available for download for non-CM-managed clients

Hue

(Diagram: a browser connects to the Hue Server, which hosts the Hue
applications: Hive UI, Impala UI, File Browser, Job Browser, Job Designer,
Oozie Workflow Editor, and Shell UI.)

Installing, Starting, Accessing, and Configuring Hue

! Install Hue
  – sudo yum install hue
! Set an arbitrary value for secret_key in the [desktop] section of
  hue.ini
  – Used to hash Hue sessions
! Start the Hue Server
  – sudo service hue start
! Access Hue from a browser
  – http://<hue_server>:8888
! Configure the Hue applications
  – Modify /etc/hue/hue.ini
  – Restart the Hue Server for configuration changes to take effect

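As a sketch, the secret_key step above amounts to one entry in /etc/hue/hue.ini; the key shown here is an arbitrary example and should be replaced with your own long random string:

```ini
[desktop]
  # Arbitrary random string used to hash Hue session cookies
  secret_key=0Qr7mC4sX9kV2tZ6bN1yW8uJ5aH3eL0p
```

Restart the Hue Server after editing the file so the change takes effect.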
Configuring Hue Applications

Hue Application   Required Configuration
Hive UI           •  Install Hive on the machine running the Hue Server
                  •  Copy hive-site.xml into /etc/hive/conf on the host
                     running the Hue Server
                  •  Restart the Hue Server
                  •  Run a Hive query from the Hive UI in Hue as a test
Impala UI         •  Prerequisite: a shared Hive metastore
                  •  Install Hive on the machine running the Hue Server
                  •  Copy hive-site.xml into /etc/impala/conf on the host
                     running the Hue Server; it must be configured for the
                     shared metastore
                  •  Copy /etc/default/impala to the host running the Hue
                     Server
                  •  Configure server_host in the [impala] section of
                     hue.ini
                  •  Restart the Hue Server
                  •  Run an Impala query from the Impala UI in Hue as a test

Configuring Hue Applications (cont’d)

Hue Application   Required Configuration
File Browser      •  Prerequisite: either WebHDFS is enabled on your
                     cluster, or HttpFS has been installed and configured.
                     Note that HttpFS is required for HDFS HA deployments.
                  •  Configure webhdfs_url in the [hadoop] /
                     [[hdfs_clusters]] / [[[default]]] section of hue.ini.
                     You configure the same parameter, webhdfs_url,
                     regardless of whether you are using WebHDFS or HttpFS.
                  •  Restart the Hue Server
                  •  Browse HDFS from the Hue File Browser as a test

Configuring Hue Applications (cont’d)

Hue Application   Required Configuration
Job Browser       •  Install the Hue plug-in on the JobTracker host
                  •  Configure mapred-site.xml for the Hue plug-in
                  •  Restart the JobTracker and the TaskTrackers
                  •  Configure jobtracker_host in the [hadoop] /
                     [[mapred_clusters]] / [[[default]]] section of hue.ini
                  •  Restart the Hue Server
                  •  Browse a job from the Hue Job Browser as a test

Configuring Hue Applications (cont’d)

Hue Application       Required Configuration
Job Designer,         •  Prerequisite: Oozie is installed and the Oozie
Oozie Editor             server is running
                      •  Configure oozie_url in the [liboozie] section of
                         hue.ini
                      •  Set submit_to to true in the [hadoop] /
                         [[mapred_clusters]] / [[[default]]] section of
                         hue.ini
                      •  Restart the Hue Server
                      •  Design a job and create an Oozie workflow as a test
Command-Line Shells   •  Prerequisite: for shells you want users to be able
                         to run from Hue, install and configure client
                         software on the Hue machine
                      •  Add Unix user IDs on the Hue machine for Hue users
                         who will run shells
                      •  Start the shells from Hue as a test

First User Login

(Diagram: the Hue Server stores login credentials in the Hue database.
Default: SQLite; MySQL, PostgreSQL, and Oracle are also supported.)

! The first user to log in to Hue automatically receives superuser privileges
! Superusers can be added and removed
! The first user’s login credentials are stored in the Hue database

Defining Users and Groups With the Hue User Admin App

! Use Hue itself to maintain Hue users
! All user information, including credentials, is stored in the Hue database

Accessing Users and Groups From LDAP – Option 1

! The administrator configures Hue to access the LDAP directory
! The administrator then syncs users and groups from LDAP to the Hue database
! Hue authenticates users by accessing credentials from the Hue database

Accessing Users and Groups From LDAP – Option 2

! The administrator configures Hue to access the LDAP directory
! Hue authenticates users by accessing credentials from LDAP

Restricting Access to Hue Applications

! Select a group in the Hue User Admin application
! Edit the permissions for the group
  – Default permissions allow group members to access every Hue application

Hands-On Exercise: Using Hue to Control Hadoop User Access

! In this Hands-On Exercise, you will install and configure Hue, then configure a working environment for analysts with the following capabilities:
  – Submitting Hive and Impala queries
  – Browsing the HDFS file system
  – Browsing Hadoop jobs
  – Using the HBase shell
! Please refer to the Hands-On Exercise Manual

Hands-On Exercise: Using Hue to Control Hadoop User Access – Deployment

! Servers deployed on your cluster after exercise completion:

(Diagram: cluster deployment after the exercise)
Chapter Topics

Hadoop Clients (Planning, Installing, and Configuring a Hadoop Cluster)

!! What Are Hadoop Clients?
!! Installing and Configuring Hadoop Clients
!! Installing and Configuring Hue
!! Hue Authentication and Authorization
!! Hands-On Exercise: Using Hue to Control Hadoop User Access
!! Conclusion
Essential Points

! Hadoop clients access functionality within a Hadoop cluster using the Hadoop API
! Hadoop clients require Hadoop configuration settings
  – End users with access to the configuration settings can modify them
  – Client centralization using servers such as Hue, HiveServer2, and Sqoop2 can reduce the need to give the Hadoop configuration settings to end users
! Hue provides many capabilities to end users
  – Browsing HDFS files
  – Designing, submitting, and browsing MapReduce jobs
  – Editing and submitting Hive and Impala queries
  – Designing, running, and viewing the status of Oozie workflows
! Hue users are authenticated; access to functionality can be restricted
  – Hue provides LDAP integration
Conclusion

In this chapter, you have learned:
! What Hadoop clients are
! How to install and configure Hadoop clients
! How to install and configure Hue
! How Hue authenticates and authorizes user access

Cloudera Manager
Chapter 10
Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation
Cloudera Manager

In this chapter, you will learn:
! Why Cloudera Manager is useful for administering Hadoop clusters
! What features Cloudera Manager offers
! Which daemons are deployed with Cloudera Manager
! How to install Cloudera Manager
! How to install a Hadoop cluster using Cloudera Manager
! How to perform basic administration tasks using Cloudera Manager
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Cluster Management

! Apache Hadoop is a complex system
! Configuring a small, ‘proof-of-concept’ cluster is not too complex
! Dealing with a larger cluster is much more difficult
  – Deploying configuration changes
  – Verifying host configuration
  – Hadoop configuration
  – Software versions
! Lots of configuration options to optimize cluster performance
  – Many of these are not well documented
  – Best practices are still being determined as usage of Hadoop grows
! Cluster configuration is considered to be something of a ‘dark art’
! The configuration files can become complex, and are easy to break
Cluster Maintenance

! A cluster has a large number of hosts
  – Multiple services running on each host
  – Difficult to know the status of everything
! Inevitable issues will arise with hardware and software
! Hardware
  – Cloudera Manager improves hardware failure recovery with easy configuration of High-Availability HDFS
  – Decommissioning hosts for planned downtime is easier
! Services
  – As the cluster grows, moving services to new hosts is easier with Cloudera Manager
  – View all service health statuses from a single screen
Monitoring Performance

! As a cluster grows, you will want to modify configuration values and monitor their effect on cluster performance
  – For example, compare the performance of similar jobs before and after the change
  – This can be difficult using only open-source products
! Keeping track of your cluster’s performance can be difficult
  – There are many elements to monitor
  – Retaining information over time becomes a challenge
! Performance can degrade
  – Perhaps only certain hosts slow down
  – A job may slow down
Managing Access to the Cluster

! Typically, you will only want certain individuals to have access to the cluster
! You may want to restrict what different people can do
  – Some may have access only to HDFS
  – Others can run jobs on the cluster
  – Yet others are administrators and can create new users
! CDH3 and above include Kerberos support to provide much better security than standard Hadoop
  – Allows integration with your existing authentication systems such as LDAP and Active Directory
  – This can be difficult to implement and maintain
  – Cloudera Manager makes implementing Kerberos support much easier
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
What Is Cloudera Manager?

! Cloudera Manager is an application designed to answer the needs of enterprise-level users of Hadoop
  – Install Hadoop software on nodes
  – Commission and configure services on a cluster
  – Monitor cluster activity
  – Produce reports of cluster usage
  – Manage users and groups with access to the cluster
Automated Deployments

! Automatically installs and configures services on hosts
! Allows modification of configuration parameters
  – Recommends ‘best practice’ values for many parameters
! Makes it easy to start and stop master and slave instances
! Provides a simple way to retrieve the configuration files required for client machines to access the cluster
Cloudera Manager Services

! Cloudera Manager controls a wide range of Hadoop and Hadoop ‘ecosystem’ services:
  – HDFS
  – MapReduce
  – Hive
  – Impala
  – Pig
  – Flume
  – Mahout
  – Oozie
  – Sqoop
  – ZooKeeper
  – Hue
  – HBase
  – Cloudera Search
Host Monitoring and Reporting

! Provides a ‘dashboard’ view of cluster status
! Real-time host monitoring of
  – Entire cluster
  – Individual services
  – Individual instances
! Active status monitoring
  – Sends emails when an issue occurs
! Reporting of previous events and data

Monitoring and Diagnosing Cluster Workloads

(Screenshot)

Viewing Service Health and Performance

(Screenshot)

Running Reports on System Performance and Usage

(Screenshot)

Gathering and Viewing Hadoop Logs

(Screenshot)

Tracking Events from Across the Cluster

(Screenshot)

Obtaining Host-Level Snapshots

(Screenshot)
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Cloudera Manager Editions

! Cloudera Manager comes in two versions
! Standard Edition
  – Free download
  – Manage a cluster of any size
! Enterprise Core
  – Includes extra features and functionality for enterprise and production clusters
    – Rolling upgrades
    – SNMP support
    – LDAP integration
    – Configuration history and rollbacks
    – Operational reports
  – Existing Free Edition installs can be easily upgraded to Enterprise
  – Contact sales@cloudera.com for licensing information
Cloudera Manager Editions (cont’d)

! Enterprise Edition add-ons:
  – BDR
    – Backup and disaster recovery
  – Navigator
    – Data audit for HDFS, HBase, and Hive
    – Access management
  – RTD
    – Support for HBase
  – RTQ
    – Support for Impala
! A 60-day free trial of the full Enterprise edition is available
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Basic Topology of Cloudera Manager

! Each cluster node runs the Cloudera Manager Agent
  – Starts and stops Hadoop daemons
  – Collects statistics
! The Cloudera Manager Server runs on a machine outside the cluster
  – Runs the service monitor and activity monitor daemons
  – Stores information about the cluster in a database
    – Available host machines in the cluster
    – Services, roles, and configurations assigned to each host
  – Communicates with the agents via HTTP or HTTPS
    – Agents send heartbeats to the server periodically
  – Sends configuration information and commands to the agents

Basic Topology of Cloudera Manager (cont’d)

(Diagram: the Cloudera Manager Server communicating with Agents on each cluster node)
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Cloudera Manager Requirements and Installation
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Cloudera Manager 4 Requirements

! Supported operating systems
  – Red Hat Enterprise Linux/CentOS 5.7, 6.2, 6.4 (64-bit)
  – SUSE Linux Enterprise Server 11 Service Pack 1 or later (64-bit)
  – Debian 6.0 (Squeeze) (64-bit)
  – Ubuntu 10.04, 12.04 (64-bit)
! Supported browsers
  – Internet Explorer 9
  – Google Chrome
  – Safari 5
  – Firefox 11

Cloudera Manager 4 Requirements (cont’d)

! Supported databases
  – MySQL 5.0, 5.1, 5.5
  – Oracle 10g Release 2, Oracle 11g Release 2
  – PostgreSQL 8.1, 8.3, 8.4, 9.1
! CDH
  – CDH 3u1 or later
  – CDH 4.0 or later
Installing Cloudera Manager

! Download
  – Go to http://www.cloudera.com/downloads
  – Download the Cloudera Manager installer
  – Run the installer from the command line
! All subsequent configuration is done through the Web UI
  – Access via http://<manager_host>:7180
! Documentation
  – For detailed information about installing Cloudera Manager, refer to http://docs.cloudera.com/
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Packages or Parcels

! Cloudera Manager will install the selected Hadoop services on cluster nodes for you
! Installation is via packages or Parcels
! Packages
  – Standard RPMs, Ubuntu packages, etc.
! Parcels
  – Cloudera Manager-specific
  – Allows multiple versions of Hadoop to be present on a node simultaneously
    – Although only one will be running at any given time
  – Allows easy upgrading with minimal downtime
  – Allows easy rolling upgrades (Enterprise edition)
No Internet Access Required for Cluster Nodes

! Often, cluster nodes do not have Internet access
  – Security or other reasons
! Cloudera Manager can install CDH on cluster nodes using a local CDH repository
  – Simply mirror the CDH package or Parcel repository on a local machine
  – Configure Cloudera Manager to use the local repository during installation and upgrade
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Administering Services

! Interacting with a service includes all instances of that service
  – You can interact with an individual service
  – You can interact with all services for an entire cluster
! Cloudera Manager can add and remove services
  – Will make recommendations for hosts on which to run the services
Administering Hosts

! Add hosts
  – Host(s) will be managed by Cloudera Manager
  – All CDH RPMs or Parcels will be installed
! Delete role instance
  – Host(s) will not be managed by Cloudera Manager
! Host inspector
  – Verifies all settings, configurations, and software versions
! Role Groups allow groups of machines in the cluster to share configurations
  – Example: generation 1 and generation 2 machines in a heterogeneous cluster may support different numbers of Mappers and Reducers per host
Making Configuration Changes

! Make changes to an entire service, host, or instance
! Cloudera Manager makes intelligent recommendations about configuration values
  – Will verify the configuration values
! Cloudera Manager makes it easier to propagate changes and restart affected services

Making a Configuration Change

(Screenshot)

Restarting After a Configuration Change

(Screenshot)
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Essential Points

! Cloudera Manager makes installing, managing, and monitoring a Hadoop cluster much simpler
! Standard edition can manage clusters of any size
! Enterprise edition has some extra features, such as SNMP support and support for rolling upgrades
! A free, 60-day trial license of the full Enterprise Edition is available
Conclusion

In this chapter, you have learned:
! Why Cloudera Manager is useful for administering Hadoop clusters
! What features Cloudera Manager offers
! Which versions of Cloudera Manager are available
! Which daemons are deployed with Cloudera Manager
! How to install Cloudera Manager
! How to install a Hadoop cluster using Cloudera Manager
! How to perform basic administration tasks using Cloudera Manager

Advanced Cluster Configuration
Chapter 11
Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation
Advanced Cluster Configuration

In this chapter, you will learn:
! How to perform advanced configuration of Hadoop
! How to configure port numbers used by Hadoop
! How to explicitly include or exclude hosts
! How to configure HDFS rack awareness
! How to enable HDFS high availability
Chapter Topics

Advanced Cluster Configuration (Planning, Installing, and Configuring a Hadoop Cluster)

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion
Advanced Configuration Parameters

! We have previously covered Hadoop’s basic properties
  – Essentially, the minimum needed to set up a working cluster
! We will now discuss some additional properties
! These generally fall into one of several categories
  – Optimization and performance tuning
  – Capacity management
  – Access control
! The configuration recommendations in this section are baselines
  – Use them as starting points, then adjust as required by the job mix in your environment
hdfs-site.xml

dfs.namenode.handler.count
  The number of threads the NameNode uses to handle RPC requests from DataNodes. Default: 10. Recommended: ln(number of cluster nodes) * 20. Symptoms of this being set too low: ‘connection refused’ messages in DataNode logs as they try to transmit block reports to the NameNode. Used by the NameNode.

dfs.datanode.failed.volumes.tolerated
  The number of volumes allowed to fail before the DataNode takes itself offline, ultimately resulting in all of its blocks being re-replicated. Default: 0, but often increased on machines with several disks. Used by DataNodes.
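The ln(number of cluster nodes) * 20 sizing rule above is simple to compute. A small sketch (the helper function name is ours, not part of Hadoop; it floors the result at the Hadoop default of 10):

```python
import math

def recommended_handler_count(num_nodes):
    """Baseline for dfs.namenode.handler.count: ln(nodes) * 20,
    never below the Hadoop default of 10."""
    return max(10, int(math.log(num_nodes) * 20))

# Example sizings for clusters of different scales
print(recommended_handler_count(50))   # mid-sized cluster
print(recommended_handler_count(500))  # larger cluster
```

The same rule is recommended later in this chapter for mapred.job.tracker.handler.count, so one helper covers both baselines.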
core-site.xml

fs.trash.interval
  When a file is deleted, it is placed in a .Trash directory in the user’s home directory, rather than being immediately deleted. It is purged from HDFS after the number of minutes specified. Default: 0 (disabled). Recommended: 1440 (one day). Used by clients and the NameNode.

io.file.buffer.size
  Determines how much data is buffered during read and write operations. Should be a power-of-two multiple of the hardware page size. Default: 4096. Recommendation: 65536 (64KB). Used by clients and all nodes running Hadoop daemons.
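Applied in core-site.xml, the two recommendations above would look roughly like this (a sketch; adjust the values to your environment):

```xml
<!-- core-site.xml fragment: enable trash (one day) and a 64KB I/O buffer -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
</property>
```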
core-site.xml (cont’d)

io.compression.codecs
  List of compression codecs that Hadoop can use for file compression. Used by clients and all nodes running Hadoop daemons. If you are using another compression codec, add it here.
  Default: org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.DeflateCodec, org.apache.hadoop.io.compress.SnappyCodec
mapred-site.xml

mapred.job.tracker.handler.count
  Number of threads used by the JobTracker to respond to heartbeats from the TaskTrackers. Default: 10. Recommendation: ln(number of cluster nodes) * 20. Used by the JobTracker.

mapred.reduce.parallel.copies
  Number of TaskTrackers a Reducer can connect to in parallel to transfer its data. Default: 5. Recommendation: ln(number of cluster nodes) * 4, with a floor of 10. Used by TaskTrackers.

tasktracker.http.threads
  The number of HTTP threads in the TaskTracker which the Reducers use to retrieve data. Default: 40. Recommendation: 80. Used by TaskTrackers.

mapred-site.xml (cont’d)

mapred.reduce.slowstart.completed.maps
  The percentage of Map tasks which must be completed before the JobTracker will schedule Reducers on the cluster. Default: 0.05 (5 percent). Recommendation: 0.8 (80 percent). Used by the JobTracker.
mapred-site.xml (cont’d)

mapred.map.tasks.speculative.execution
  Whether to allow speculative execution for Map tasks. Default: true. Recommendation: true. Used by the JobTracker.

mapred.reduce.tasks.speculative.execution
  Whether to allow speculative execution for Reduce tasks. Default: true. Recommendation: false. Used by the JobTracker.

! If a task is running significantly more slowly than the average speed of tasks for that job, speculative execution may occur
  – Another attempt to run the same task is instantiated on a different node
  – The results from the first completed task are used
  – The slower task is killed
mapred-site.xml (cont’d)

mapred.compress.map.output
  Determines whether intermediate data from Mappers should be compressed before transfer across the network. Default: false. Recommendation: true. Used by TaskTrackers.

mapred.map.output.compression.codec
  The compression codec used to compress intermediate data from Mappers. Default: org.apache.hadoop.io.compress.DefaultCodec. Recommended: org.apache.hadoop.io.compress.SnappyCodec. Used by TaskTrackers.
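The two intermediate-compression recommendations above, expressed as a mapred-site.xml sketch:

```xml
<!-- mapred-site.xml fragment: compress Mapper output with Snappy -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```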
mapred-site.xml (cont’d)

io.sort.mb
  The size of the buffer in RAM on the Mapper to which the Mapper initially stores its Key/Value pairs before they are written to disk. Default: 100MB. Recommendation: 128MB, if the child heap size is 1GB. This allocation comes out of the task’s JVM heap space. Used by TaskTrackers.

io.sort.factor
  The number of streams to merge at once when sorting files. Default: 10. Recommendation: 64. Used by TaskTrackers.
Chapter Topics

Advanced Cluster Configuration (Planning, Installing, and Configuring a Hadoop Cluster)

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion
Common Hadoop Ports

! Hadoop daemons each provide a Web-based user interface
  – Useful for both users and system administrators
! Expose information on a variety of different ports
  – Port numbers are configurable, although there are defaults for most
! Hadoop also uses various ports for components of the system to communicate with each other
Web UI Ports for Users

Daemon                Default Port   Configuration parameter
NameNode              50070          dfs.http.address
DataNode              50075          dfs.datanode.http.address
Secondary NameNode    50090          dfs.secondary.http.address
JobTracker            50030          mapred.job.tracker.http.address
TaskTracker           50060          mapred.task.tracker.http.address
Hadoop Ports for Administrators

Daemon       Default Port                 Configuration Parameter              Used for
NameNode     8020                         fs.default.name                      Filesystem metadata operations
DataNode     50010                        dfs.datanode.address                 DFS data transfer
DataNode     50020                        dfs.datanode.ipc.address             Block metadata operations and recovery
JobTracker   Usually 8021, 9001, or 8012  mapred.job.tracker                   Job submission, TaskTracker heartbeats
TaskTracker  Usually 8021, 9001, or 8012  mapred.task.tracker.report.address   Communicating with child jobs
Chapter Topics

Advanced Cluster Configuration (Planning, Installing, and Configuring a Hadoop Cluster)

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion
Host ‘include’ and ‘exclude’ Files

! Optionally, specify dfs.hosts in hdfs-site.xml to point to a file listing hosts which are allowed to connect to a NameNode and act as DataNodes
  – Similarly, mapred.hosts points to a file which lists hosts allowed to connect to the JobTracker as TaskTrackers
! Both files are optional
  – If omitted, any host may connect and act as a DataNode/TaskTracker
  – This is a possible security/data integrity issue
! The NameNode can be forced to reread the dfs.hosts file with hadoop dfsadmin -refreshNodes
! The JobTracker can be forced to reread the mapred.hosts file with hadoop mradmin -refreshNodes
Host ‘include’ and ‘exclude’ Files (cont’d)

! It is possible to explicitly prevent one or more hosts from acting as DataNodes
  – Create a dfs.hosts.exclude property, and specify a filename
  – List the names of all the hosts to exclude in that file
  – These hosts will then not be allowed to connect to the NameNode
  – This is often used if you intend to decommission nodes (see later)
  – Run hadoop dfsadmin -refreshNodes to make the NameNode re-read the file
! Similarly, mapred.hosts.exclude can be used to specify a file listing hosts which may not connect to the JobTracker
  – Run hadoop mradmin -refreshNodes to make the JobTracker re-read the file
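A sketch of what the include and exclude configuration might look like in hdfs-site.xml (the file paths are placeholders; each listed file contains one hostname per line):

```xml
<!-- hdfs-site.xml fragment: include and exclude lists for DataNodes -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/allowed-hosts</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/excluded-hosts</value>
</property>
```

After editing either file, run hadoop dfsadmin -refreshNodes so the NameNode re-reads them.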
Chapter Topics

Advanced Cluster Configuration (Planning, Installing, and Configuring a Hadoop Cluster)

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion
Rack Topology Awareness

! Recall that HDFS is ‘rack aware’
  – Distributes blocks based on hosts’ locations
  – Administrator supplies a script which tells Hadoop which rack a node is in
    – Should return a hierarchical ‘rack ID’ for each argument it is passed
    – Rack ID is of the form /datacenter/rack
      •  Example: /datactr1/rack40
    – Script can use a flat file, database, etc.
  – Script name is in topology.script.file.name in core-site.xml
  – If this is blank (default), Hadoop returns a value of /default-rack for all nodes
Sample Rack Topology Script

! A sample rack topology script:

#!/usr/bin/env python
import sys

DEFAULT_RACK = "/datacenter1/default-rack"
HOST_RACK_FILE = "/etc/hadoop/conf/host-rack.map"

host_rack = {}
for line in open(HOST_RACK_FILE):
    (host, rack) = line.split()
    host_rack[host] = rack

for host in sys.argv[1:]:
    if host in host_rack:
        print host_rack[host]
    else:
        print DEFAULT_RACK

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"23#
Sample"Rack"Topology"Script"(cont’d)"

! The#/etc/hadoop/conf/host-rack.map#file:#

host1 /datacenter1/rack1
host2 /datacenter1/rack1
host3 /datacenter1/rack1
host4 /datacenter1/rack1
host5 /datacenter1/rack2
host6 /datacenter1/rack2
host7 /datacenter1/rack2
host8 /datacenter1/rack2
...

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"24#
Naming Machines to Aid Rack Awareness

! A common scenario is to name your hosts in such a way that the Rack
Topology Script can easily determine their location
– Example: a host called r1m32
– 32nd machine in Rack 1
– The Rack Topology Script can simply deconstruct the machine name and
then return the rack awareness information
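Under such a convention the lookup file disappears entirely, because the script can derive the rack from the hostname itself. A minimal sketch, assuming an r&lt;rack&gt;m&lt;machine&gt; naming pattern and a single /datacenter1 prefix (both are assumptions for illustration, not part of the course cluster):

```python
import re
import sys

# Assumed convention: hostnames like "r1m32" = machine 32 in rack 1.
HOST_PATTERN = re.compile(r"^r(\d+)m\d+$")

def rack_for(host):
    # Strip any domain suffix first ("r1m32.example.com" -> "r1m32").
    match = HOST_PATTERN.match(host.split(".")[0])
    if match:
        return "/datacenter1/rack" + match.group(1)
    # Fall back for hosts that do not follow the naming convention.
    return "/datacenter1/default-rack"

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Hadoop invokes the topology script with one or more hostnames or IP addresses as arguments and reads one rack ID per line from its standard output.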

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"25#
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Advanced Cluster Configuration

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"26#
HDFS High Availability Overview

! A single NameNode is a single point of failure
! Two ways a NameNode can result in HDFS downtime
– Unexpected NameNode crash (rare)
– Planned maintenance of NameNode (more common)
! HDFS High Availability (HA) eliminates this SPOF
– Available in CDH4 (or related Apache Hadoop 0.23.x and 2.x)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"27#
HDFS High Availability Architecture

! HDFS High Availability uses a pair of NameNodes
– One Active and one Standby
– Clients only contact the Active NameNode
– DataNodes heartbeat in to both NameNodes
– Active NameNode writes its metadata to a quorum of JournalNodes
– Standby NameNode reads from the JournalNodes to remain in sync
with the Active NameNode

(Diagram: DataNodes report to both the Active and Standby NameNodes. Each
NameNode runs a Quorum Journal Manager, through which the Active NameNode
writes to, and the Standby NameNode reads from, three JournalNodes.)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"28#
HDFS High Availability Architecture (cont’d)

! Active NameNode writes edits to the JournalNodes
– Software to do this is the Quorum Journal Manager (QJM)
– Built in to the NameNode
– Waits for a success acknowledgment from the majority of JournalNodes
– Majority commit means a single crashed or lagging JournalNode
will not impact NameNode latency
– Uses the Paxos algorithm to ensure reliability even if edits are being
written as a JournalNode fails
! Note that there is no Secondary NameNode when implementing HDFS
High Availability
– The Standby NameNode periodically performs checkpointing
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"29#
Failover

! Only one NameNode must be active at any given time
– The other is in standby mode
! The standby maintains a copy of the active NameNode’s state
– So it can take over when the active NameNode goes down
! Two types of failover
– Manual (detected and initiated by a user)
– Automatic (detected and initiated by HDFS itself)
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"30#
Automatic Failover

! Automatic failover is based on Apache ZooKeeper
– A coordination service system also used by HBase
– An open source Apache project
– One of the components in CDH
! A daemon called the ZooKeeper Failover Controller (ZKFC) runs on each
NameNode machine
! ZooKeeper needs a quorum of nodes
– Typical installations use three or five nodes
– Low resource usage
– Can install alongside existing master daemons

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"31#
HDFS HA With Automatic Failover – Deployment

(Diagram: a ZooKeeper ensemble, with instances typically residing on master
nodes, serves two ZooKeeper Failover Controllers. Each Failover Controller
must reside on the same host as the NameNode it monitors. The Active and
Standby NameNodes each run a Quorum Journal Manager that connects to three
JournalNodes, which also typically reside on master nodes. DataNodes report
to both NameNodes.)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"32#
Fencing

! It is essential that exactly one NameNode be active
– Possibility of “split-brain syndrome” otherwise
! The Quorum Journal Manager ensures that a previously-active NameNode
cannot corrupt the NameNode metadata
– However, it is possible in some circumstances that it could report ‘stale’
data to a client
– For this reason, it is good practice to fence the NameNode
– Isolates it, shuts it down
! When configuring HDFS HA, you must specify one or more fencing
methods
– The methods can be shell scripts or SSH
– In order for failover to occur, one of the methods must run successfully

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"33#
HDFS HA Deployment Overview

1.  Modify the Hadoop configuration
2.  Install and start the JournalNodes
3.  Configure and start a ZooKeeper ensemble – automatic failover only
4.  Initialize the shared edits directory if converting from a non-HA
deployment
5.  Install, bootstrap, and start the Standby NameNode
6.  Install, format, and start the ZooKeeper failover controllers – automatic
failover only
7.  Restart DataNodes and the MapReduce daemons

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"34#
Modifying the core-site.xml File for HDFS HA

fs.default.name        Change from <host>:<port> to a logical URI to the
                       HDFS file system that specifies a NameService ID
                       defined in hdfs-site.xml. This URI defines a
                       virtual NameNode and transparently resolves to
                       the Active NameNode. Example: hdfs://mycluster

ha.zookeeper.quorum    Comma-delimited list of all ZooKeeper nodes in
                       the quorum. Specify fully-qualified host names
                       (FQHNs) and port numbers. Example:
                       elephant.example.com:2181,
                       tiger.example.com:2181,
                       horse.example.com:2181
                       Used by ZooKeeper Failover Controllers.
                       Required for automatic failover only.
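In XML form, the two settings above might look like this in core-site.xml (values taken from the examples in the table):

```xml
<!-- core-site.xml excerpt: HDFS HA client-side settings (illustrative values) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://mycluster</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>elephant.example.com:2181,tiger.example.com:2181,horse.example.com:2181</value>
</property>
```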

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"35#
Modifying the hdfs-site.xml File for HDFS HA

dfs.nameservices             A unique NameService ID. Example: mycluster
                             Used by NameNodes and clients.

dfs.ha.namenodes.<NSID>      A comma-separated list of the two NameNodes
For example:                 in the cluster. Example: nn1,nn2
dfs.ha.namenodes.mycluster   Used by NameNodes and clients.
                             (<NSID> = NameService ID)

dfs.namenode.rpc-address.    The RPC address of a NameNode in the
<NSID>.<NNID>                cluster. Two entries required – one for each
For example:                 NameNode. Specify FQHN and port number.
dfs.namenode.rpc-address.    Example: elephant.example.com:8020
mycluster.nn1                Used by NameNodes and clients.
dfs.namenode.rpc-address.
mycluster.nn2
                             (<NNID> = NameNode ID)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"36#
Modifying the hdfs-site.xml File for HDFS HA (cont’d)

dfs.namenode.http-address.   The HTTP address of a NameNode in the
<NSID>.<NNID>                cluster. Two entries required – one for each
For example:                 NameNode. Specify FQHN and port number.
dfs.namenode.http-address.   Example: elephant.example.com:50070
mycluster.nn1                Used by NameNodes.
dfs.namenode.http-address.
mycluster.nn2

dfs.namenode.shared.edits.   A semicolon-delimited list of the
dir                          JournalNodes. Specifies the location written
                             by the Active NameNode and read by the
                             Standby NameNode in order to keep them
                             synchronized. Example:
                             qjournal://elephant:8485;tiger:8485;
                             horse:8485/mycluster
                             Used by NameNodes.

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"37#
Modifying the hdfs-site.xml File for HDFS HA (cont’d)

dfs.journalnode.edits.dir    The location on JournalNodes where edits and
                             other state information should be stored.
                             Example: /disk1/dfs/jn
                             Used by JournalNodes.

dfs.client.failover.proxy.   The name of the Java class used to contact
provider.<NSID>              the currently active NameNode. Currently,
For example:                 only one class is supported in CDH. Example:
dfs.client.failover.proxy.   org.apache.hadoop.hdfs.server.namenode.ha.
provider.mycluster           ConfiguredFailoverProxyProvider
                             Used by clients.

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"38#
Modifying the hdfs-site.xml File for HDFS HA (cont’d)

dfs.ha.fencing.methods       Specifies one or more methods, separated by
                             newlines, to fence the Active NameNode
                             during failover. This parameter must be
                             specified and has no default value. Examples:
                             sshfence(tiger:22)
                             shell(/path/to/fence-nn.sh --target_host=tiger)
                             shell(/bin/true)
                             Used by NameNodes and ZooKeeper Failover
                             Controllers.

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"39#
Modifying the hdfs-site.xml File for HDFS HA (cont’d)

dfs.ha.fencing.ssh.          Location of the SSH private key used for
private-key-files            SSH-based fencing. Must be readable by the
                             hdfs user account. Allows automation of ssh
                             (no passphrase). Only needed if you are
                             using the sshfence fencing method. Example:
                             /home/hdfs/.ssh/id_rsa
                             Used by NameNodes and ZooKeeper Failover
                             Controllers.

dfs.ha.automatic-failover.   Specifies whether automatic failover is
enabled                      enabled. Example: true
                             Used by NameNodes, ZooKeeper Failover
                             Controllers, and clients.
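Pulling the tables together, a minimal hdfs-site.xml for the example cluster might look like the sketch below. The nn2 addresses (tiger.example.com) are assumptions for illustration, since the tables only show nn1’s addresses:

```xml
<!-- hdfs-site.xml excerpt: HDFS HA sketch with illustrative values -->
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name>
          <value>elephant.example.com:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name>
          <value>tiger.example.com:8020</value></property>   <!-- assumed host -->
<property><name>dfs.namenode.http-address.mycluster.nn1</name>
          <value>elephant.example.com:50070</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn2</name>
          <value>tiger.example.com:50070</value></property>  <!-- assumed host -->
<property><name>dfs.namenode.shared.edits.dir</name>
          <value>qjournal://elephant:8485;tiger:8485;horse:8485/mycluster</value></property>
<property><name>dfs.journalnode.edits.dir</name><value>/disk1/dfs/jn</value></property>
<property><name>dfs.client.failover.proxy.provider.mycluster</name>
          <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
<property><name>dfs.ha.fencing.methods</name><value>sshfence(tiger:22)</value></property>
<property><name>dfs.ha.fencing.ssh.private-key-files</name>
          <value>/home/hdfs/.ssh/id_rsa</value></property>
<property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
```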

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"40#
Installing, Configuring, and Starting the JournalNodes

On each host that will run a JournalNode:
! Install the JournalNode
– sudo yum install hadoop-hdfs-journalnode
! Create the shared edits directory on the JournalNode
– At the location specified as dfs.journalnode.edits.dir
– Owned by the hdfs user
! Start the JournalNode
– sudo service hadoop-hdfs-journalnode start

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"41#
Configuring and Starting the ZooKeeper Suite – Automatic
Failover Only

On each host that will run a ZooKeeper server:
! Create a directory for ZooKeeper data
– Owned by the hdfs user
! Create a file in the ZooKeeper data directory that has a single number
identifier
– The identifier must be a different number on each ZooKeeper node
– Example: echo 1 > /disk1/dfs/zk/myid
! Create a ZooKeeper configuration file (example on next slide)
! Start the ZooKeeper server
– sudo /usr/lib/zookeeper/bin/zkServer.sh start \
/etc/hadoop/conf/zoo.cfg

# ©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"42#
Configuring and Starting the ZooKeeper Suite – Automatic
Failover Only (cont’d)
! Example ZooKeeper configuration file:

tickTime=2000
dataDir=/disk1/dfs/zk
clientPort=2181
initLimit=5
syncLimit=2
server.1=elephant.example.com:2888:3888
server.2=tiger.example.com:2888:3888
server.3=horse.example.com:2888:3888

(The server identifier after “server.” matches the number in each node’s
myid file.)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"43#
Initializing the Shared Edits Directories

! This step is required only if you are converting an existing non-HA HDFS
deployment to an HDFS HA deployment
! Run the following command
– sudo -u hdfs hdfs namenode -initializeSharedEdits
! After initializing these directories, you can restart your existing
NameNode
! Run the hdfs haadmin -getServiceState command to verify that
the existing NameNode is not active yet
– sudo -u hdfs hdfs haadmin -getServiceState nn1
– In this example, nn1 is the NameNode ID of the existing NameNode
– The command should return Standby

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"44#
Installing, Bootstrapping, and Starting the Standby NameNode

! Install the second NameNode
! Bootstrap it
– sudo -u hdfs hdfs namenode -bootstrapStandby
! Start it
! Run the hdfs haadmin -getServiceState command to verify that
the new NameNode is not active yet
– sudo -u hdfs hdfs haadmin -getServiceState nn2
– In this example, nn2 is the NameNode ID of the new Standby
NameNode
– The command should return Standby
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"45#
Installing, Formatting, and Starting the ZooKeeper Failover
Controller – Automatic Failover Only
! Install the ZooKeeper Failover Controllers on the same hosts as the
NameNodes
– sudo yum install --assumeyes hadoop-hdfs-zkfc
!  Initialize the current NameNode state in ZooKeeper
– sudo -u hdfs hdfs zkfc -formatZK
!  Start the ZooKeeper Failover Controllers
– sudo service hadoop-hdfs-zkfc start
– Starting the ZooKeeper Failover Controllers will bring up one of your
NameNodes
!  Restart the DataNodes and the MapReduce daemons
!  Restart the Active NameNode and then check the NameNode service state
to test automatic failover

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"46#
Configuring HDFS High Availability With Cloudera Manager

! Cloudera Manager makes it very easy to enable High Availability
– 7 complex steps to enable High Availability manually
– Or, click once and fill in a few screens with Cloudera Manager

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"47#
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Advanced Cluster Configuration

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"48#
Hands-On Exercise: Configuring HDFS for High Availability

! In this Hands-On Exercise, you will modify your Hadoop cluster by adding
and configuring servers required for HDFS high availability
! Please refer to the Hands-On Exercise Manual
! Cluster deployment after exercise completion:

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"49#
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Advanced Cluster Configuration

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"50#
Essential Points

! Use dfs.datanode.du.reserved to configure the minimum amount of
non-HDFS disk space on volumes
– Default is 0 – HDFS is allowed to consume all available disk space on
slave node volumes
– Non-HDFS disk space is needed for MapReduce intermediate output,
task logs, and daemon logs
! You can configure Hadoop to explicitly include a list of machines in a
cluster, or to exclude machines from a cluster
– Inclusion list provides a degree of security
– Exclusion list is used during node decommissioning
! Hadoop provides a rack awareness capability, which should be
implemented to increase data locality
! HDFS can be configured for high availability with an automatic failover
capability
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"51#
Conclusion

In this chapter, you have learned:
! How to perform advanced configuration of Hadoop
! How to configure port numbers used by Hadoop
! How to explicitly include or exclude hosts
! How to configure HDFS rack awareness
! How to enable HDFS high availability

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"52#
Hadoop Security
Chapter 12

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#1$
Course"Chapters"
!! IntroducCon" Course"IntroducCon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducCon"to"Apache"Hadoop"
!! GeOng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaCon"and"IniCal"ConfiguraCon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,$Installing,$and$
!! Hadoop"Clients"
Configuring$a$Hadoop$Cluster$
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraCon"
!! Hadoop$Security$
!! Managing"and"Scheduling"Jobs"
Cluster"OperaCons"and"
!! Cluster"Maintenance"
Maintenance"
!! Cluster"Maintenance"and"TroubleshooCng"
!! Conclusion"
!! Kerberos"ConfiguraCon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaCon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#2$
Hadoop Security

In this chapter you will learn:
! Why security is important for Hadoop
! How Hadoop's security model evolved
! What Kerberos is and how it relates to Hadoop
! What to consider when securing Hadoop

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#3$
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Hadoop Security

!! Why Hadoop Security Is Important
!! Hadoop’s Security System Concepts
!! What Kerberos Is and How it Works
!! Securing a Hadoop Cluster with Kerberos
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#4$
Why Hadoop Security Is Important

! Laws governing data privacy
– Particularly important for healthcare and finance industries
! Export control regulations for defense information
! Protection of proprietary research data
! Company policies
– Different teams in a company have different needs
! Setting up multiple clusters is a common solution
– One cluster may contain protected data, another cluster does not

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#5$
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Hadoop Security

!! Why Hadoop Security Is Important
!! Hadoop’s Security System Concepts
!! What Kerberos Is and How it Works
!! Securing a Hadoop Cluster with Kerberos
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#6$
Defining Important Security Terms

! Security
– Computer security is a very broad topic
– Access control is the area most relevant to Hadoop
– We’ll therefore focus on authentication and authorization
! Authentication
– Confirming the identity of a participant
– Typically done by checking credentials (username/password)
! Authorization
– Determining whether a participant is allowed to perform an action
– Typically done by checking an access control list

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#7$
Types of Hadoop Security

! Support for HDFS file ownership and permissions
– Provides only modest protection
– User/group authentication is easily subverted (client-side)
– Mainly intended to guard against accidental deletions/overwrites
! Enhanced security with Kerberos
– Provides strong authentication of both clients and servers
– Tasks can be run under a job submitter’s own account
– This enhanced security is optional (disabled by default)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#8$
Hadoop Security Design Considerations

! Hadoop security does not provide
– Encryption for data transmitted on the wire
– Encryption for data stored on disk
! The security of a cluster is enhanced by isolation
– It should ideally be on its own network
– Access to nodes/network should be limited for untrusted users
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#9$
Chapter"Topics"

Planning,$Installing,$and$Configuring$
Hadoop$Security$
a$Hadoop$Cluster$

!! Why"Hadoop"Security"Is"Important""
!! Hadoop’s"Security"System"Concepts"
!! What$Kerberos$Is$and$How$it$Works$$
!! Securing"a"Hadoop"Cluster"with"Kerberos"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#10$
Kerberos Exchange Participants

! Kerberos involves messages exchanged among three parties
– Client
– The server providing a desired network service
– The Kerberos Key Distribution Center (KDC)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#11$
Kerberos"Exchange"ParCcipants"(cont’d)"

Kerberos KDC
(Key Distribution
Center)

Client

Desired Network
Service
(Protected by
Kerberos)

! The client is software that desires access to a service
– The hadoop fs command is one example of a client

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#12$
Kerberos"Exchange"ParCcipants"(cont’d)"

Kerberos KDC
(Key Distribution
Center)

Client

Desired Network
Service
(Protected by
Kerberos)

! This is the service the client wishes to access
– For Hadoop, this will be a service daemon (such as NN or JT)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#13$
Kerberos"Exchange"ParCcipants"(cont’d)"

Kerberos KDC
(Key Distribution
Center)

Client

Desired Network
Service
(Protected by
Kerberos)

! The Kerberos server (KDC) authenticates and authorizes a client
! The KDC is neither part of nor provided by Hadoop
– Most Linux distributions come with the MIT Kerberos KDC

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#14$
General Kerberos Concepts

(Diagram: 1. The client authenticates to, and is authorized by, the Kerberos
KDC. 2. The authenticated and authorized client requests the desired network
service. 3. The service validates the client's token.)

! Kerberos is a standard network security protocol
– Currently at version 5 (RFC 4120)
– Services protected by Kerberos don’t directly authenticate the client

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#15$
General Kerberos Concepts (cont’d)

! Authenticated status is cached
– You don’t need to explicitly submit credentials with each request
! Passwords are not sent across the network
– Instead, passwords are used to compute encryption keys
– The Kerberos protocol uses encryption extensively
! Timestamps are an essential part of Kerberos
– Make sure you synchronize system clocks (NTP)
! It’s important that reverse lookups work correctly

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#16$
Kerberos Terminology

! Knowing a few terms will help you with the documentation
! Realm
– A group of machines participating in a Kerberos network
– Identified by an uppercase domain (EXAMPLE.COM)
! Principal
– A unique identity which can be authenticated by Kerberos
– Can identify either a host or an individual user
– Every user in a secure cluster will have a Kerberos principal
! Keytab file
– A file that stores Kerberos principals and associated keys
– Allows non-interactive access to services protected by Kerberos

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#17$
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Hadoop Security

!! Why Hadoop Security Is Important
!! Hadoop’s Security System Concepts
!! What Kerberos Is and How it Works
!! Securing a Hadoop Cluster with Kerberos
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#18$
Hadoop Security Setup Prerequisites

! Working Hadoop cluster
– Installing CDH from packages is strongly advised!
– Ensure your cluster actually works before trying to secure it!
! Working Kerberos KDC server
! Kerberos client libraries installed on all Hadoop nodes

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#19$
Configuring Hadoop Security

! Hadoop security configuration is a specialized topic
! Many specifics depend on
– Version of Hadoop and related programs
– Type of Kerberos server used (Active Directory or MIT)
– Operating system and distribution
! You must follow instructions exactly
– There is little room for misconfiguration
– Mistakes often result in vague “access denied” errors
– May need to work around version-specific bugs
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#20$
Configuring Hadoop Security (cont’d)

! For these reasons, we can’t cover this in depth during class
– We do provide an overview in Appendix A
! See the “CDH Security Guide” for detailed instructions
– Available at http://www.cloudera.com/
– Be sure to read the guide corresponding to your version of CDH
! The CDH Security Guide documents manual configuration
– Configuring Hadoop with Kerberos involves many tedious steps
– Cloudera Manager (Enterprise) can automate many of them

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#21$
Securing Related Services

! There are many “ecosystem” tools that interact with Hadoop
! Most require minor configuration changes for a secure cluster
– Such as specifying Kerberos principals or keytab file paths
! Exact configuration details vary with each tool
– See documentation for details
! Some require no configuration changes at all
– Such as Pig and Sqoop

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#22$
Active Directory Integration

! Microsoft’s Active Directory (AD) is an enterprise directory service
– Used to manage user accounts for a Microsoft Windows network
! Recall that every Hadoop user must have a Kerberos principal
– It can be tedious to set up all these accounts
– Many organizations would prefer to use AD for Hadoop users
! Cloudera’s recommended approach
– Run a local MIT Kerberos KDC
– Create all service principals (like hdfs and mapred) in this realm
– But not principals corresponding to actual users in AD
– Set up a one-way cross-realm trust between MIT Kerberos and AD
– Hadoop’s KDC will then accept user accounts from AD
! Instructions can be found in Cloudera’s CDH Security Guide

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#23$
Cloudera Manager’s Support for Security

! See appendix for an overview of the manual configuration process
! Many tasks must be performed on every cluster node
– Create Kerberos principals and keytab files
– Copy to conf directory
– Set proper file ownership and permissions
– Edit configuration files to specify principals, paths, etc.
! These are tedious and error-prone, especially for large clusters
! It’s worth noting that Cloudera Manager automates these steps

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#24$
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Hadoop Security

!! Why Hadoop Security Is Important
!! Hadoop’s Security System Concepts
!! What Kerberos Is and How it Works
!! Securing a Hadoop Cluster with Kerberos
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#25$
Essential Points

! CDH1 and CDH2 provided security with HDFS permissions
– Mainly intended to guard against accidental file deletion and overwrite
! Starting with CDH3, Kerberos security became available
– Manual configuration requires many steps
– Consider using Cloudera Manager
! Hadoop does not provide:
– Encrypted transmission over the wire
– Data encryption on disk

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#26$
Conclusion

In this chapter you have learned:
! Why security is important for Hadoop
! How Hadoop's security model evolved
! What Kerberos is and how it relates to Hadoop
! What to consider when securing Hadoop

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#27$
Managing and Scheduling Jobs
Chapter 13

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#1$
Course"Chapters"
!! IntroducDon" Course"IntroducDon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducDon"to"Apache"Hadoop"
!! GePng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaDon"and"IniDal"ConfiguraDon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,"Installing,"and"
!! Hadoop"Clients"
Configuring"a"Hadoop"Cluster"
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraDon"
!! Hadoop"Security"
!! Managing$and$Scheduling$Jobs$
Cluster$Opera;ons$and$
!! Cluster"Maintenance"
Maintenance$
!! Cluster"Maintenance"and"TroubleshooDng"
!! Conclusion"
!! Kerberos"ConfiguraDon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaDon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#2$
Managing and Scheduling Jobs

In this chapter, you will learn:
! How to view and stop jobs running on a cluster
! The options available for scheduling Hadoop jobs
! How to configure the Fair Scheduler

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#3$
Chapter Topics

Cluster Operations and Maintenance:
Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#4$
Displaying Running Jobs

! To view all jobs running on the cluster, use mapred job -list

[training@localhost ~]$ mapred job -list
1 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0008  1      1320210148487  training  NORMAL    NA

Displaying All Jobs

! To display all jobs, including completed jobs, use
mapred job -list all

[training@localhost ~]$ mapred job -list all
7 jobs submitted
States are:
Running : 1  Succeded : 2  Failed : 3  Prep : 4
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0004  2      1320177624627  training  NORMAL    NA
job_201110311158_0005  2      1320177864702  training  NORMAL    NA
job_201110311158_0006  2      1320209627260  training  NORMAL    NA
job_201110311158_0007  2      1320210018614  training  NORMAL    NA
job_201110311158_0008  2      1320210148487  training  NORMAL    NA
job_201110311158_0001  2      1320097902546  training  NORMAL    NA
job_201110311158_0003  2      1320099376966  training  NORMAL    NA

Displaying All Jobs (cont’d)

! Note that states are displayed as numeric values
– 1: Running
– 2: Succeeded
– 3: Failed
– 4: In preparation
– 5: (undocumented) Killed
! Easy to write a cron job that periodically lists (for example) all failed jobs,
running a command such as
– mapred job -list all | grep '<tab>3<tab>'
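Such a filter can be sketched as follows. This is a minimal sketch: canned, tab-separated sample output (with illustrative job IDs) stands in for `mapred job -list all` so the filter is runnable without a cluster; in a real cron job you would pipe the command's output into the same awk filter and mail the result.

```shell
# Sketch: extract failed jobs (state 3 in the second, tab-separated column).
# The sample output below stands in for `mapred job -list all`.
sample_output=$(printf 'job_201110311158_0004\t2\t1320177624627\ttraining\tNORMAL\tNA\njob_201110311158_0005\t3\t1320177864702\ttraining\tNORMAL\tNA')

# awk is more robust than grep here: it tests exactly the state column
failed_jobs=$(printf '%s\n' "$sample_output" | awk -F'\t' '$2 == 3 { print $1 }')
echo "$failed_jobs"
```

The awk version avoids a pitfall of the literal-tab grep: it cannot accidentally match a "3" in another column.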

Displaying the Status of an Individual Job

! mapred job -status <job_id> provides status about an individual job
– Completion percentage
– Values of counters
– System counters and user-defined counters
! Note: the job name is not displayed!
– The Web user interface is the most convenient way to view more
details about an individual job
– More details later

Killing a Job

! It is important to note that once a user has submitted a job, they cannot
stop it just by hitting CTRL-C on their terminal
– This stops job output appearing on the user’s console
– The job is still running on the cluster!

Killing a Job (cont’d)

! To kill a job, use mapred job -kill <job_id>

[training@localhost ~]$ mapred job -list
1 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0009  1      1320210791739  training  NORMAL    NA

[training@localhost ~]$ mapred job -kill job_201110311158_0009
Killed job job_201110311158_0009

[training@localhost ~]$ mapred job -list
0 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo

Stopping MapReduce Jobs From the Web UI

! By default, the JobTracker Web UI is read-only
– Job information is displayed, but the job cannot be controlled in any
way
! It is possible to set the UI to allow jobs, or individual Map or Reduce tasks,
to be killed
– Add the following property to core-site.xml

<property>
  <name>webinterface.private.actions</name>
  <value>true</value>
</property>

– Restart the JobTracker

Stopping Jobs From the Web UI (cont’d)

! The Web UI will now include an ‘actions’ column for each task
– And an overall option to kill entire jobs

Stopping Jobs From the Web UI (cont’d)

! Caution: anyone with access to the Web UI can now manipulate running
jobs!
– Best practice: make this available only to administrative users
! Better to use the command line to stop jobs

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Hands-On Exercise: Managing Jobs

! In this Hands-On Exercise, you will start and kill jobs from the command
line
! Please refer to the Hands-On Exercise Manual

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Job Scheduling Basics

! A Hadoop job is composed of
– An unordered set of Map tasks, which have locality preferences
– An unordered set of Reduce tasks
! Tasks are scheduled by the JobTracker
– They are then scheduled/launched by TaskTrackers
– One TaskTracker per node
– Each TaskTracker has a fixed number of slots for Map and Reduce tasks
– This may differ per node – a node with a powerful processor may
have more slots than one with a slower CPU
– TaskTrackers report the availability of free task slots to the JobTracker
on the Master node
! Scheduling a job requires assigning Map and Reduce tasks to available
Map and Reduce task slots

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

The FIFO Scheduler

! The default Hadoop job scheduler is FIFO
– First In, First Out
! Given two jobs A and B, submitted in that order, all Map tasks from job A
are scheduled before any Map tasks from job B are considered
– Similarly for Reduce tasks
! Order of task execution within a job may be shuffled around

A1 A2 A3 A4 B1 B2 B3 …

Priorities in the FIFO Scheduler

! The FIFO Scheduler supports assigning priorities to jobs
– Priorities are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
– Set with the mapred.job.priority property
– May be changed from the command line as the job is running
– hadoop job -set-priority <job_id> <priority>
– All work in each queue is processed before moving on to the next

All higher-priority tasks are run first, if they exist, before any
lower-priority tasks are started, regardless of submission order:

C1 C2 C3 (High Priority)  then  A1 A2 A3 A4 B1 B2 B3 … (Normal Priority)

Priorities in the FIFO Scheduler: Problems

! Problem: Job A may have 2,000 tasks; Job B may have 20
– Even if Job B has a higher priority than Job A, Job B will not make any
progress until Job A has nearly finished
– Completion time should be proportional to job size
! Users with a poor understanding of the system may flag all their jobs as
HIGH_PRIORITY
– Thus starving other jobs of processing time
! The ‘all or nothing’ nature of the scheduler makes sharing a cluster between
production jobs with SLAs and interactive users challenging

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Goals of the Fair Scheduler

! The Fair Scheduler is designed to allow multiple users to share the cluster
simultaneously
! Should allow short interactive jobs to coexist with long production jobs
! Should allow resources to be controlled proportionally
! Should ensure that the cluster is efficiently utilized
! This is the scheduler recommended by Cloudera for production use

The Fair Scheduler: Basic Concepts

! Each job is assigned to a pool
– Default assignment is one pool per username
! Jobs may be assigned to arbitrarily-named pools
– Such as “production”
! Physical slots are not bound to any specific pool
! Each pool gets an even share of the available task slots

Pool Creation

! By default, pools are created dynamically based on the username
submitting the job
– No configuration necessary
! Jobs can be sent to designated pools (e.g., “production”)
– Pools can be defined in a configuration file (see later)
– Pools may have a minimum number of mappers and reducers defined
Adding Pools Readjusts the Share of Slots

! If Charlie now submits a job in a new pool, shares of slots are adjusted

Determining the Fair Share

! The fair share of task slots assigned to the pool is based on:
– The actual number of task slots available across the cluster
– The demand from the pool
– The number of tasks eligible to run
– The minimum share, if any, configured for the pool
– The fair share of each other active pool
! The fair share for a pool will never be higher than the actual demand
! Pools are filled up to their minimum share, assuming cluster capacity
! Excess cluster capacity is spread across all pools
– The aim is to maintain the most even loading possible
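The water-filling behavior described above can be sketched in a few lines of Python. This is a deliberate simplification of the real algorithm: it assumes equal pool weights, ignores preemption, and the pool names and slot counts are illustrative.

```python
def fair_shares(total_slots, pools):
    """Water-filling sketch of Fair Scheduler slot allocation.

    pools maps a pool name to {'demand': <tasks eligible to run>,
    'min_share': <optional minimum slot guarantee>}.
    """
    # 1. A pool's share is never higher than its demand, so fill each
    #    pool only up to min(min_share, demand).
    alloc = {name: min(p.get('min_share', 0), p['demand'])
             for name, p in pools.items()}
    remaining = total_slots - sum(alloc.values())
    # 2. Spread excess capacity one slot at a time onto the pool that
    #    currently holds the fewest slots ("most even loading").
    while remaining > 0:
        eligible = [n for n, p in pools.items() if alloc[n] < p['demand']]
        if not eligible:
            break
        poorest = min(eligible, key=lambda n: alloc[n])
        alloc[poorest] += 1
        remaining -= 1
    return alloc


# 30 slots; Production has a 20-slot minimum share
print(fair_shares(30, {
    'production': {'demand': 100, 'min_share': 20},
    'alice': {'demand': 100},
    'bob': {'demand': 100},
}))
# → {'production': 20, 'alice': 5, 'bob': 5}
```

Note that when Production has no demand, the same function gives all 30 slots to Alice and Bob evenly, matching the "Production queue empty" example on the next slides.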

Example Minimum Share Allocation

! First, fill Production up to its 20-slot minimum guarantee
! Then distribute the remaining 10 slots evenly across Alice and Bob

Example Allocation 2: Production Queue Empty

! Production has no demand, so no slots are reserved
! All slots are allocated evenly across Alice and Bob

Example Allocation 3: minShares Exceed Slots

! The minShare of Production and Research exceeds available capacity
! minShares are scaled down pro rata to match actual slots
! No slots remain for users without a minShare (i.e., Bob)

Example 4: minShare < Fair Share

! Production filled to minShare
! Remaining 25 slots distributed across all pools
! The Production pool gets more than its minShare, to maintain fairness

Pools With Weights

! Instead of (or in addition to) setting minShare, pools can be assigned a
weight
! Pools with higher weight get more slots during free slot allocation
! ‘Even water glass height’ analogy:
– Think of the weight as controlling the ‘width’ of the glass
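The analogy can be made concrete with a small sketch. This is an illustration only: the real scheduler interleaves weighting with minShare handling and preemption, and the pool names and numbers below are made up.

```python
def split_by_weight(free_slots, weights):
    """Sketch: divide free slots across pools in proportion to weight.

    Largest-remainder rounding keeps the total equal to free_slots.
    """
    total_weight = sum(weights.values())
    # exact (fractional) entitlement of each pool
    exact = {n: free_slots * w / total_weight for n, w in weights.items()}
    alloc = {n: int(x) for n, x in exact.items()}
    leftover = free_slots - sum(alloc.values())
    # hand leftover slots to the pools with the largest fractional part
    for n in sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)[:leftover]:
        alloc[n] += 1
    return alloc


# Bob's glass is twice as 'wide', so his pool receives roughly twice the slots
print(split_by_weight(25, {'alice': 1.0, 'bob': 2.0, 'production': 1.0}))
# → {'alice': 6, 'bob': 13, 'production': 6}
```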

Example: Pool With Double Weight

! Production filled to minShare (5)
! Remaining 25 slots distributed across pools
! Bob’s pool gets two slots instead of one during each round

Multiple Jobs Within A Pool

! A pool exists if it has one or more jobs in it
! So far, we’ve only described how slots are assigned to pools
– We need to determine how jobs are scheduled within a given pool

Job Scheduling Within a Pool

! Within a pool, resources are fair-scheduled across all jobs
– This is achieved via another instance of the Fair Scheduler
! It is possible to enforce FIFO scheduling within a pool
– May be appropriate for jobs that would compete for external
bandwidth, for example
! Pools can have a maximum number of concurrent jobs configured
! The weight of a job within a pool is determined by its priority (NORMAL,
HIGH, etc.)
! Assignment of a slot to a pool can be delayed
– The Fair Scheduler lets free slots on a host remain open for a short time
if no queued tasks prefer to run on that host
– This increases the overall data locality hit ratio

Preemption in the Fair Scheduler

! If shares are imbalanced, pools which are over their fair share may not
assign new tasks when their old ones complete
– Eventually, as tasks complete, free slots will become available
– Those free slots will be used by pools which were under their fair share
– This may not be acceptable in a production environment, where tasks
take a long time to complete
! Two types of preemption are supported
– minShare preemption
– Fair share preemption

minShare Preemption

! Pools with a minimum share configured are operating on an SLA (Service
Level Agreement)
– Waiting for tasks from other pools to finish may not be appropriate
! Pools which are below their minimum guaranteed share can kill the
newest tasks from other pools to reap slots
– Can then use those slots for their own tasks
– Ensures that the minimum share will be delivered within a timeout
window

Fair Share Preemption

! Pools not receiving their fair share can kill tasks from other pools
– A pool will kill the newest task(s) in an over-share pool to forcibly make
room for starved pools
! Fair share preemption is used conservatively
– A pool must be operating at less than 50% of its fair share for 10
minutes before it can preempt tasks from other pools

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

The Capacity Scheduler: Basic Concepts

! Each job is assigned to a queue
– Analogous to a Fair Scheduler pool
! Queues are assigned a percentage of the cluster’s slots
– Similar to a Fair Scheduler pool’s minShare
! A queue’s unused capacity is not given away to other queues
– Unlike the Fair Scheduler, in which pools with lower usage can give
away their slots
! Jobs within queues are FIFO ordered
– Similar to the FIFO Scheduler; you can enable prioritization within
queues
! Slot allocation can be based on tasks’ memory usage

Choosing a Scheduler

! Match the scheduler to the requirement:
– Learning tool or proof of concept: FIFO Scheduler
– Pool utilization varies, so it is desirable that pools give away
resources when they are not in use: Fair Scheduler
– Jobs within a pool need to make equal progress: Fair Scheduler
– Data locality makes a significant difference in job run-time
performance: Fair Scheduler
– Pool utilization has little fluctuation: Capacity Scheduler
– Jobs have a high degree of variance in memory utilization:
Capacity Scheduler

! In practice, the Fair Scheduler is far more widely used than the Capacity
Scheduler

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Steps to Configure the Fair Scheduler

1. Enable the Fair Scheduler
2. Configure Scheduler parameters
3. Configure pools

Enabling the Fair Scheduler

! In mapred-site.xml on the JobTracker, specify the scheduler to use:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

! Identify the pool configuration file:

<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/allocations.xml</value>
</property>

Scheduler Parameters in mapred-site.xml

! mapred.fairscheduler.poolnameproperty
– Specifies which job configuration property is used to determine the
pool that a job belongs in. Default is user.name (i.e., one pool per
user). Other options include group.name and mapred.job.queue.name
! mapred.fairscheduler.sizebasedweight
– Makes a pool’s weight proportional to log(demand) of the pool.
Default: false
! mapred.fairscheduler.weightadjuster
– Specifies a WeightAdjuster implementation that tunes job weights
dynamically. Default is blank; can be set to
org.apache.hadoop.mapred.NewJobWeightBooster
! mapred.fairscheduler.preemption
– Enables preemption in the Fair Scheduler. Set to true if you have
pools that must operate on an SLA. Default is false

Configuring Pools

! The allocations configuration file must exist, and contain an
<allocations> entity
! <pool> entities can contain minMaps, minReduces, maxMaps,
maxReduces, maxRunningJobs, weight,
minSharePreemptionTimeout, schedulingMode
! <user> entities (optional) can contain maxRunningJobs
– Limits the number of simultaneous jobs a user can run
! The userMaxJobsDefault entity (optional)
– Maximum number of jobs for any user without a specified limit
! System-wide and per-pool timeouts can be set

Very Basic Pool Configuration

! The allocations configuration file must exist, and contain at least this:

<?xml version="1.0"?>
<allocations>
</allocations>

Example: Limit Users to Three Jobs Each

! Limit max jobs for any user: specify userMaxJobsDefault

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>

Example: Allow One User More Jobs

! If a user needs more than the standard maximum number of jobs, create a
<user> entity

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <user name="bob">
    <maxRunningJobs>6</maxRunningJobs>
  </user>
</allocations>

Example: Add a Fair Share Timeout

! Set a preemption timeout

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <user name="bob">
    <maxRunningJobs>6</maxRunningJobs>
  </user>
  <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
</allocations>

Example: Create a ‘production’ Pool

! Pools are created by adding <pool> entities

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
</allocations>

Example: Add an SLA to the Pool

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
  </pool>
</allocations>

Example: Create a FIFO Pool

! FIFO pools are useful for jobs which are, for example, bandwidth-intensive

<?xml version="1.0"?>
<allocations>
  <pool name="bandwidth_intensive">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <schedulingMode>FIFO</schedulingMode>
  </pool>
</allocations>

! Note: <schedulingMode>FAIR</schedulingMode> would use fair
scheduling within the pool (the default)

Monitoring Pools and Allocations

! The Fair Scheduler exposes a status page in the JobTracker Web user
interface at http://<job_tracker_host>:50030/scheduler
– Allows you to inspect pools and allocations
! Any changes to the pool configuration file (e.g., allocations.xml) will
automatically be reloaded by the running scheduler
– The scheduler detects a timestamp change on the file
– Waits five seconds after the change was detected, then reloads the
file
– If the scheduler cannot parse the XML in the configuration file, it
will log a warning and continue to use the previous configuration

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Hands-On Exercise: Using The Fair Scheduler

! In this Hands-On Exercise, you will run jobs in different pools
! Please refer to the Hands-On Exercise Manual

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Essential Points

! You cannot kill a Hadoop job from the terminal by using CTRL-C
– Use mapred job -kill <job_id>
– Or, enable the JobTracker Web UI to kill jobs
! Hadoop provides three job schedulers
– The FIFO Scheduler
– Has serious limitations and should not be used in production
– The Fair Scheduler and Capacity Scheduler
– Allow resources to be controlled proportionally
– Ensure that a cluster is used efficiently
! The Fair Scheduler is the most commonly-used scheduler
– Efficient when pool utilization varies
– Delayed task assignment leads to better data locality

Conclusion

In this chapter, you have learned:
! How to view and stop jobs running on a cluster
! The options available for scheduling multiple jobs on the same cluster
! How to configure the Fair Scheduler

Cluster Maintenance
Chapter 14

Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation

Cluster Maintenance

In this chapter, you will learn:
! How to check the status of HDFS
! How to copy data between clusters
! How to add and remove nodes
! How to rebalance the cluster
! How to upgrade your cluster

Chapter Topics

Cluster Operations and Maintenance – Cluster Maintenance

!! Checking HDFS Status
!! Hands-On Exercise: Breaking The Cluster
!! Copying Data Between Clusters
!! Adding and Removing Cluster Nodes
!! Rebalancing the Cluster
!! Hands-On Exercise: Verifying The Cluster’s Self-Healing Features
!! Cluster Upgrading
!! Conclusion

Checking for Corruption in HDFS

! hdfs fsck checks for missing or corrupt data blocks
– Unlike the system fsck, it does not attempt to repair errors
! Can be configured to list all files
– Also all blocks for each file, all block locations, all racks
! Examples:
hdfs fsck /
hdfs fsck / -files
hdfs fsck / -files -blocks
hdfs fsck / -files -blocks -locations
hdfs fsck / -files -blocks -locations -racks

Checking for Corruption in HDFS (cont’d)

! Good idea to run hdfs fsck as a regular cron job that e-mails the
results to administrators
– Choose a low-usage time to run the check
! The -move option moves corrupted files to /lost+found
– A corrupted file is one where all replicas of a block are missing
! The -delete option deletes corrupted files
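The decision logic of such a cron job can be sketched as below. This is a minimal sketch: canned report text stands in for the real `hdfs fsck /` output so it runs without a cluster, and the mail step is only indicated in a comment (the recipient would be your choice).

```shell
# Sketch: flag an fsck report that needs attention.
# The canned text stands in for: report=$(hdfs fsck /)
report='Status: CORRUPT
 CORRUPT FILES: 2
 MISSING BLOCKS: 3'

# fsck prints "Status: HEALTHY" when no blocks are missing or corrupt
if printf '%s\n' "$report" | grep -q 'Status: HEALTHY'; then
  verdict=HEALTHY
else
  verdict=CORRUPT   # here you would e-mail $report to administrators
fi
echo "$verdict"
```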

Using dfsadmin

! The hdfs dfsadmin command provides a number of administrative
features, including:
! List information about HDFS on a per-DataNode basis

$ hdfs dfsadmin -report

! Re-read the dfs.hosts and dfs.hosts.exclude files

$ hdfs dfsadmin -refreshNodes

Using dfsadmin (cont’d)

! Manually set the filesystem to ‘safe mode’
– The NameNode starts up in safe mode
– Read-only – no changes can be made to the metadata
– Does not replicate or delete blocks
– Leaves safe mode when the (configured) minimum percentage of
blocks satisfy the minimum replication condition

$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -safemode leave

– Can also block until safe mode is exited
– Useful for shell scripts
– hdfs dfsadmin -safemode wait

Using dfsadmin (cont’d)

! Saves the NameNode metadata to disk and resets the edit log
– Must be in safe mode

$ hdfs dfsadmin -saveNamespace

– More on this later

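Put together, the safe-mode and saveNamespace commands form a short checkpoint sequence. The sketch below only echoes the commands so it is safe to run anywhere; on a real cluster you would change `run` to execute them (which requires HDFS superuser privileges, and a read-only cluster while in safe mode).

```shell
# Sketch of a manual metadata checkpoint sequence. `run` echoes each
# command instead of executing it; change it to run() { "$@"; } to
# execute for real on a cluster.
run() { echo "+ $*"; }

run hdfs dfsadmin -safemode enter
run hdfs dfsadmin -saveNamespace
run hdfs dfsadmin -safemode leave
```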
Chapter Topics

Cluster Operations and Maintenance – Cluster Maintenance

!! Checking HDFS Status
!! Hands-On Exercise: Breaking The Cluster
!! Copying Data Between Clusters
!! Adding and Removing Cluster Nodes
!! Rebalancing the Cluster
!! Hands-On Exercise: Verifying The Cluster’s Self-Healing Features
!! Cluster Upgrading
!! Conclusion

Hands-On Exercise: Breaking the Cluster

! In this Hands-On Exercise, you will introduce some problems into the
cluster
! Please refer to the Hands-On Exercise Manual

Chapter Topics

Cluster Operations and Maintenance – Cluster Maintenance

!! Checking HDFS Status
!! Hands-On Exercise: Breaking The Cluster
!! Copying Data Between Clusters
!! Adding and Removing Cluster Nodes
!! Rebalancing the Cluster
!! Hands-On Exercise: Verifying The Cluster’s Self-Healing Features
!! Cluster Upgrading
!! Conclusion

Copying Data

! Hadoop clusters can hold massive amounts of data
! A frequent requirement is to back up the cluster for disaster recovery
! Ultimately, this is not a Hadoop problem!
– It’s a ‘managing huge amounts of data’ problem
! The cluster could be backed up to tape, etc., if necessary
– Custom software may be needed
Copying Data with distcp

! distcp copies data within a cluster, or between clusters
– Used to copy large amounts of data
– Turns the copy procedure into a MapReduce job
! Copies files or entire directories
– Files previously copied will be skipped
– Note that the only check for duplicate files is that the file’s name
and size are identical

distcp Examples

! Copying data from one cluster to another
– hadoop distcp hdfs://nn1:8020/path/to/src \
hdfs://nn2:8020/path/to/dest

! Copying data within the same cluster
– hadoop distcp /path/to/src /path/to/dest

! Copying data from one cluster to another when the clusters are running
different versions of Hadoop
– HA HDFS example using HttpFS
– hadoop distcp hdfs://mycluster/path/to/src \
webhdfs://httpfs-svr:14000/path/to/dest
– Non-HA HDFS example using WebHDFS
– hadoop distcp hdfs://nn1:8020/path/to/src \
webhdfs://nn2:50070/path/to/dest

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#15$
Copying"Data:"Best"PracCces"

! In$prac4ce,$many$organiza4ons$do$not$copy$data$between$clusters$
! Instead,$they$write$their$data$to$two$clusters$as$it$is$being$imported$
– This"is"ocen"more"efficient"
– Not"necessary"to"run"all"MapReduce"jobs"on"the"backup"cluster"
– As"long"as"the"source"data"is"available,"all"derived"data"can"be"
regenerated"later"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#16$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding$and$Removing$Cluster$Nodes$
!! Rebalancing"the"Cluster"
!! Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"
!! Cluster"Upgrading"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#17$
Adding"Cluster"Nodes"

! To$add$nodes$to$the$cluster:$$
1.  Add"the"names"of"the"nodes"to"the"‘include’"file(s),"if"you"are"using"
this"method"to"explicitly"list"allowed"nodes"
–  The"file(s)"referred"to"by"dfs.hosts"(and"mapred.hosts"if"
that"has"been"used)
2.  Update"your"rack"awareness"script"with"the"new"informaCon"
3.  Update"the"NameNode"with"this"new"informaCon"
–  hdfs dfsadmin –refreshNodes
4.  Update"the"JobTracker"with"this"new"informaCon"
–  hadoop mradmin –refreshNodes"
5.  Start"the"new"DataNode"and"TaskTracker"instances"
6.  Check"that"the"new"DataNodes"and"TaskTrackers"appear"in"the"Web"
UI"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#18$
Adding"Nodes:"Points"to"Note"

! A$NameNode$will$not$‘favor’$a$new$node$added$to$the$cluster$
– It"will"not"prefer"to"write"blocks"to"the"node"rather"than"to"other"nodes"
! This$is$by$design$
– The"assumpCon"is"that"new"data"is"more"likely"to"be"processed"by"
MapReduce"jobs"
– If"all"new"blocks"were"wri>en"to"the"new"node,"this"would"impact"data"
locality"for"MapReduce"jobs"
– Would"also"create"a"‘hot"spot’"when"wriCng"new"data"to"the"cluster"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#19$
Removing"Cluster"Nodes"

!  To$remove$nodes$from$the$cluster:$
1.  Add"the"names"of"the"nodes"to"the"‘exclude’"file(s)"
–  The"file(s)"referred"to"by"dfs.hosts.exclude"(and"
mapred.hosts.exclude"if"that"has"been"used)
2.  Update"the"JobTracker"with"the"revised"set"of"nodes"
–  hadoop mradmin –refreshNodes"
3.  Update"the"NameNode"with"the"new"set"of"DataNodes"
–  hdfs dfsadmin -refreshNodes
–  The"NameNode"UI"will"show"the"admin"state"change"to"
‘Decommission"In"Progress’"for"affected"DataNodes"
–  When"all"DataNodes"report"their"state"as"‘Decommissioned’,"all"
the"blocks"will"have"been"replicated"elsewhere"
4.  Shut"down"the"decommissioned"nodes"
5.  Remove"the"nodes"from"the"‘include’"and"‘exclude’"files"and"update"
the"NameNode"as"above"
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#20$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding"and"Removing"Cluster"Nodes"
!! Rebalancing$the$Cluster$
!! Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"
!! Cluster"Upgrading"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#21$
Cluster"Rebalancing"

! An$HDFS$cluster$can$become$‘unbalanced’$
– Some"nodes"have"much"more"data"on"them"than"others"
– Example:"add"a"new"node"to"the"cluster"
– Even"acer"adding"some"files"to"HDFS,"this"node"will"have"far"less"
data"than"the"others"
– During"MapReduce"processing,"this"node"will"use"much"more"
network"bandwidth"as"it"retrieves"data"from"other"nodes"
! Clusters$can$be$rebalanced$using$the$hdfs balancer u4lity$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#22$
Using"hdfs balancer

!  hdfs balancer$reviews$data$block$placement$on$nodes$and$adjusts$
blocks$to$ensure$all$nodes$are$within$x%$u4liza4on$of$each$other$
! U4liza4on$is$defined$as$amount$of$data$storage$used$
! x$is$known$as$the$threshold$
! A$node$is$under#u4lized$if$its$u4liza4on$is$less$than$(average$u4liza4on$#$
threshold)$
! A$node$is$over#u4lized$if$its$u4liza4on$is$more$than$(average$u4liza4on$+$
threshold)$
! Note:$hdfs balancer$does$not$consider$block$placement$on$individual$
disks$on$a$node$
– Only"the"uClizaCon"of"the"node"as"a"whole"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#23$
Using"hdfs balancer"(cont’d)"

! Syntax:$
– hdfs balancer -threshold x
! Threshold$is$op4onal$
– Defaults"to"10"(i.e.,"10%"difference"in"uClizaCon"between"nodes)"
! Rebalancing$can$be$canceled$at$any$4me$
– Interrupt"the"command"with"Ctrl/C"
! Bandwidth$usage$can$be$controlled$by$seeng$the$property$
dfs.balance.bandwidthPerSec$in$hdfs-site.xml
– Specifies"a"bandwidth"in"bytes/sec"(not"bits/sec)"that"each"DataNode"
can"use"for"rebalancing"
– Default"is"1048576"(1MB/sec)"
– RecommendaCon:"approx."0.1"x"network"speed"
– e.g.,"for"a"1Gbps"network,"10MB/sec"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#24$
When"To"Rebalance"

! Rebalance$immediately$afer$adding$new$nodes$to$the$cluster$
! Rebalance$during$non#peak$usage$4mes$
– Rebalancing"does"not"interfere"with"any"exisCng"MapReduce"jobs"
– However,"it"does"use"bandwidth"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#25$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding"and"Removing"Cluster"Nodes"
!! Rebalancing"the"Cluster"
!! Hands#On$Exercise:$Verifying$The$Cluster’s$Self#Healing$Features$
!! Cluster"Upgrading"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#26$
Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"

! In$this$Hands#On$Exercise,$you$will$verify$that$the$cluster$has$recovered$
from$the$problems$you$introduced$in$the$last$exercise$
! Please$refer$to$the$Hands#On$Exercise$Manual$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#27$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding"and"Removing"Cluster"Nodes"
!! Rebalancing"the"Cluster"
!! Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"
!! Cluster$Upgrading$
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#28$
Upgrading"Socware:"When"to"Upgrade?"

! Cloudera$updates$CDH$regularly$
! Format$example:$CDH$4.3.1$

Minor"patch"
release"
Major"version" Major"version"
number" update""

! Major$versions:$every$12$to$18$months$
! Updates:$every$three$to$four$months$
! Patch$releases:$when$necessary$
! We$recommend$that$you$upgrade$when$a$new$version$update$is$released$
! Cloudera$supports$a$major$version$for$at$least$one$year$afer$a$subsequent$
release$is$available$
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#29$
Upgrading"Socware:"Procedures"

! Sofware$upgrade$procedure$is$fully$documented$on$the$Cloudera$Web$
site$
! General$steps:$
1.  Stop"the"MapReduce"cluster"
2.  Stop"HDFS"cluster"
3.  Install"the"new"version"of"Hadoop"
4.  Start"the"NameNode"with"the"-upgrade"opCon"
5.  Monitor"the"HDFS"cluster"unCl"it"reports"that"the"upgrade"is"
complete"
6.  Start"the"MapReduce"cluster"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#30$
Upgrading"Socware:"Procedures"(cont’d)"

! Once$the$upgraded$cluster$has$been$running$for$a$few$days$with$no$
problems,$finalize$the$upgrade$by$running$
hdfs dfsadmin -finalizeUpgrade
– DataNodes"delete"their"previous"version"working"directories,"then"the"
NameNode"does"the"same"
! If$you$encounter$problems,$you$can$roll$back$an$(unfinalized)$upgrade$by$
stopping$the$cluster,$then$star4ng$the$old$version$of$HDFS$with$the$$
-rollback$op4on$
! Note$that$this$upgrade$procedure$is$required$when$HDFS$data$structures$
or$RPC$communica4on$format$change$
– For"example,"from"CDH3"to"CDH4"
! Probably$not$required$for$minor$version$changes$
– But"see"the"documentaCon"for"definiCve"informaCon!"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#31$
Upgrading"Socware:"Procedures"(cont’d)"

! Note$that$Cloudera$Manager$makes$cluster$upgrades$extremely$simple$
– Cloudera"Manager"Enterprise"EdiCon"also"provides"for"rolling"cluster"
upgrades"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#32$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding"and"Removing"Cluster"Nodes"
!! Rebalancing"the"Cluster"
!! Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"
!! Cluster"Upgrading"
!! Conclusion$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#33$
EssenCal"Points"

! You$can$check$the$status$of$HDFS$with$the$hdfs fsck$command$
– Reports"problems"but"does"not"repair"them"
! You$can$use$the$distcp$command$to$copy$data$within$a$cluster$or$
between$clusters$
! Use$hdfs dfsadmin –refreshNodes$to$add$new$nodes$to$a$cluster$
or$to$remove$nodes$when$decommissioning$them$
– You"also"need"to"update"your"rack"awareness"script"
! Use$hdfs balancer$to$adjust$block$placement$across$HDFS,$so$as$to$
ensure$beker$u4liza4on$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#34$
Conclusion"

In$this$chapter,$you$have$learned:$
! How$to$check$the$status$of$HDFS$
! How$to$copy$data$between$clusters$
! How$to$add$and$remove$nodes$
! How$to$rebalance$the$cluster$
! How$to$upgrade$your$cluster$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#35$
Cluster"Monitoring"and"TroubleshooBng"
Chapter"15"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#1$
Course"Chapters"
!! IntroducBon" Course"IntroducBon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducBon"to"Apache"Hadoop"
!! GePng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaBon"and"IniBal"ConfiguraBon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,"Installing,"and"
!! Hadoop"Clients"
Configuring"a"Hadoop"Cluster"
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraBon"
!! Hadoop"Security"
!! Managing"and"Scheduling"Jobs"
Cluster$Opera7ons$and$
!! Cluster"Maintenance"
Maintenance$
!! Cluster$Monitoring$and$Troubleshoo7ng$
!! Conclusion"
!! Kerberos"ConfiguraBon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaBon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#2$
Cluster"Monitoring"and"TroubleshooBng"

In$this$chapter,$you$will$learn:$
! What$general$system$condi7ons$to$monitor$
! How$to$monitor$a$Hadoop$cluster$
! Some$techniques$for$troubleshoo7ng$problems$on$a$Hadoop$cluster$
! Some$common$misconfigura7ons,$and$their$resolu7ons$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#3$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General$System$Monitoring$
!! Monitoring"Hadoop"Clusters"
!! TroubleshooBng"Hadoop"Clusters"
!! Common"MisconfiguraBons"
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#4$
Monitoring"Hadoop"Clusters"

! You$should$use$a$monitoring$tool$to$warn$you$of$poten7al$or$actual$
problems$on$individual$machines$in$the$cluster$
! Cloudera$Manager$provides$Hadoop$cluster$monitoring$with$no$addi7onal$
configura7on$required$
– We"recommend"using"Cloudera"Manager"to"monitor"Hadoop"clusters"
! Hadoop$exposes$data$that$lets$you$integrate$cluster$monitoring$into$many$
exis7ng$monitoring$tools$
– JMX"broadcasts"
– Metrics"sinks"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#5$
Items"to"Monitor"

! Monitor$the$Hadoop$daemons$
– Alert"an"operator"if"a"daemon"goes"down"
– Check"can"be"done"with"
service hadoop-0.20-daemon_name status
! Monitor$disks$and$disk$par77ons$
– Alert"immediately"if"a"disk"fails"
– Send"a"warning"when"a"disk"reaches"80%"capacity"
– Send"a"criBcal"alert"when"a"disk"reaches"90%"capacity"
! Monitor$CPU$usage$on$master$nodes$
– Send"an"alert"on"excessive"CPU"usage"
– Slave"nodes"will"o`en"reach"100%"usage"
– This"is"not"a"problem"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#6$
Items"to"Monitor"(cont’d)"

! Monitor$swap$on$all$nodes$
– Alert"if"the"swap"parBBon"starts"to"be"used"
– Memory"allocaBon"is"overcommi>ed"
! Monitor$network$transfer$speeds$
! Monitor$HDFS$health$
– HA"configuraBon"
– Check"the"size"of"the"edit"logs"on"the"JournalNodes"
– Monitor"for"failovers"
– Non/HA"configuraBon"
– Check"the"age"of"the"fsimage"file"and/or"check"the"size"of"the"
edits"file"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#7$
Log"File"Growth"–"Daemon"Logs"

! MRv1$daemon$logs$
– Monitor"to"avoid"out/of/space"errors"
– MRv1"daemons"log"using"DRFA,"so"logs"are"not"deleted"
– Write"scripts"to"compress,"archive,"and"delete"MRv1"daemon"logs"
– Or,"reconfigure"MRv1"daemons"to"log"using"RFA"
!  HDFS$and$MRv2$daemon$logs$$
– HDFS"and"MRv2"daemons"log"using"RFA,"so"logs"are"rotated"
– Configure"appropriate"size"and"retenBon"policies"in"log4j""

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#8$
Log"File"Growth"–"Task"Logs"

! Cau7on:$inexperienced$developers$will$oWen$create$large$task$logs$from$
their$jobs$
– Data"wri>en"to"stdout/stderr
– Data"wri>en"using"log4j"from"within"the"code"
! Large$task$logs$can$run$your$slave$nodes$out$of$disk$space$
– The"log"files"are"wri>en"to"local"disk"–"not"HDFS"–"on"the"slave"nodes"
– Ensure"you"have"enough"room"on"the"parBBon"for"logs"
– Monitor"to"ensure"that"developers"are"not"logging"excessively"
"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#9$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring$Hadoop$Clusters$
!! TroubleshooBng"Hadoop"Clusters"
!! Common"MisconfiguraBons"
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#10$
Metrics"CollecBon"in"CDH4"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#11$
Metrics"Exposure"in"CDH4"

! You$can$configure$Hadoop$to$publish$Hadoop$metrics$sinks$or$to$broadcast$
over$JMX$ports$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#12$
Accessing"CDH4"Metrics"From"Monitoring"So`ware"

! In$addi7on$to$configuring$Hadoop$to$expose$metrics,$you$must$configure$your$
monitoring$soWware$to$access$either$JMX$ports$or$a$Hadoop$metrics$sink$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#13$
Hadoop"Metrics"Frameworks"in"CDH4""

Metrics2$Framework$ Original$Metrics$Framework$
Supports"most"CDH4"daemons,"including" Supports"the"MapReduce"1"JobTracker"and"
NameNode,"DataNode," TaskTracker"daemons"
SecondaryNameNode,""YARN"daemons,"map"
tasks,"and"reduce"tasks"
Allows"filtering"of"metrics"published"to"a"sink" Allows"filtering"of"metrics"published"to"a"sink"
by"context"(jvm,"dfs,"mapred,"rpc,"etc.)," by"context"only"
daemon,"source,"record,"or"metrics"name"
Reads"metrics"sink"configuraBon"from"the"" Reads"metrics"sink"configuraBon"from"the""
/etc/hadoop/conf/ /etc/hadoop/conf/
hadoop-metrics2.properties"file" hadoop-metrics.properties"file"
Exposes"all"metrics"to"MBeans" Exposes"a"limited"number"of"metrics"to"
MBeans"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#14$
More"About"Metrics"Contexts"

! jvm$
– StaBsBcs"from"the"JVM"including"memory"usage,"thread"counts,"
garbage"collecBon"informaBon"
– All"Hadoop"daemons"use"this"context"
! dfs$
– NameNode"capacity,"number"of"files,"under/replicated"blocks"
! mapred$
– JobTracker"informaBon,"similar"to"that"found"on"the"JobTracker’s"Web"
status"page"
! rpc$
– For"Remote"Procedure"Calls"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#15$
Configuring"Hadoop"to"Publish"Metrics"Using"the"Metrics2"
Framework""
/etc/hadoop/conf/hadoop-metrics2.properties"

# Example: Publish metrics for the DataNode, NameNode, and


# SecondaryNameNode to a metrics sink for Ganglia 3.1
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.
GangliaSink31
# Default sampling period (seconds)
*.period=10
# Ganglia host and gmond port number
namenode.sink.ganglia.servers=192.168.141.160:8649
datanode.sink.ganglia.servers=192.168.141.160:8649
secondarynamenode.sink.ganglia.servers=192.168.141.160:8649

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#16$
Configuring"Hadoop"to"Publish"Metrics"for"MapReduce"1"
Daemons"Using"the"Original"Metrics"Framework"
/etc/hadoop/conf/hadoop-metrics.properties"

# Example 1: Publish metrics for the mapred context to a


# Ganglia 3.1 metrics sink
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# Default sampling period (seconds)
mapred.period=10
# Ganglia host and gmond port number
mapred.servers=192.168.141.160:8649

# Example 2: Publish metrics for the jvm context to a file sink


jvm.class=org.apache.hadoop.metrics.file.FileContext
# Default sampling period (seconds)
jvm.period=10
# Output File
jvm.fileName=/tmp/jvmmetrics.out

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#17$
Configuring"Hadoop"to"Broadcast"JMX"Metrics"

/etc/hadoop/conf/hadoop-env.sh"

# Example: Broadcast JMX metrics for the NameNode to port 8006


# with no security options enabled
export HADOOP_NAMENODE_OPTS=$HADOOP_NAMENODE_OPTS \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.port=8006

! JMX$ports$can$be$opened$on$for$the$NameNode,$SecondaryNameNode,$
DataNode,$ResourceManager,$NodeManager,$MRAppMaster$daemons$
and$for$map$and$reduce$tasks$
! JMX$ports$can$also$be$opened$on$MR1$JobTracker$and$TaskTracker$
daemons$but$broadcast$only$a$limited$amount$of$metrics$$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#18$
CharBng"Metrics"with"Cloudera"Manager"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#19$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring"Hadoop"Clusters"
!! Troubleshoo7ng$Hadoop$Clusters$
!! Common"MisconfiguraBons"
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#20$
TroubleshooBng:"The"Challenges"

! An$overt$symptom$is$a$poor$predictor$of$the$root$cause$of$a$failure$
! Errors$show$up$far$from$the$cause$of$the$problem$
! Clusters$have$a$lot$of$components$
! Example:$
– Symptom:"A"Hadoop"job"that"previously"was"able"to"run"now"fails"to"run"
– Cause:"Disk"space"on"many"nodes"has"filled"up,"so"intermediate"data"
cannot"be"copied"to"Reducers"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#21$
Common"Sources"of"Problems"

! Misconfigura7on$
! Hardware$failure$
! Resource$exhaus7on$
– Not"enough"disk"
– Not"enough"RAM"
– Not"enough"network"bandwidth"
! Inability$to$reach$hosts$on$the$network$
– Naming"issues"
– Network"hardware"issues"
– Network"delays"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#22$
Gathering"InformaBon"About"Problems"

! Are$there$any$issues$in$the$environment?$
! What$about$dependent$components?$
– MapReduce"jobs"depend"on"the"JobTracker,"which"depends"on"the"
underlying"OS"
! Is$there$any$predictability$to$the$failures?$
– All"from"the"same"job?"
– All"from"the"same"TaskTracker?"
! Is$this$a$resource$problem?$
– Have"you"received"an"alert"from"your"monitoring"system?"
! What$do$the$logs$say?$
! What$does$the$CDH$documenta7on/the$Cloudera$Knowledge$Base/your$
favorite$Web$search$engine$say$about$the$problem?$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#23$
General"Rule:"Start"Broad,"Then"Narrow"the"Scope"

JobTracker Task failing Error in the job


Yes
Log on all nodes? Cluster-wide misconfiguration
Problem in the stack

No

TaskTracker Task failing Yes Node misconfiguration


Log on the same Node resource exhausted
node? Problem in the stack

No

Task Attempt
Log /
Run strace

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#24$
Avoiding"Problems"to"Begin"With"

! Misconfigura7on$
– Start"with"recommended"configuraBon"values"
– Don’t"rely"on"Hadoop’s"defaults!"
– Understand"the"precedence"of"overrides"
– Control"your"clients’"ability"to"make"configuraBon"changes"
– Test"changes"before"puPng"them"into"producBon"
– Look"for"changes"when"deploying"new"releases"of"Hadoop""
– Automate"management"of"the"configuraBon"
! Hardware$failure$and$exhaus7on$
– Monitor"your"systems"
– Benchmark"systems"to"understand"their"impact"on"your"cluster"
! Hostname$resolu7on$
– Test"forward"and"reverse"DNS"lookups"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#25$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring"Hadoop"Clusters"
!! TroubleshooBng"Hadoop"Clusters"
!! Common$Misconfigura7ons$
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#26$
Common"MisconfiguraBons:"IntroducBon"

! 35%$of$Cloudera$support$7ckets$are$due$to$misconfigura7ons$
! In$this$sec7on,$we$will$explore$some$of$the$most$common$Hadoop$
misconfigura7ons$and$suggested$solu7ons$
! Note$that$these$are$just$some$of$the$issues$you$could$run$in$to$on$a$cluster$
! Also$note$that$these$are$possible$causes$and$resolu7ons$
– The"problems"could"be"caused"by"many"other"issues"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#27$
Map/Reduce"Task"Out"Of"Memory"Error"

FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask
$MapOutputBuffer.<init>(MapTask.java:781)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

$
! Symptom$
– A"task"fails"to"run"
! Possible$causes$
– Poorly"coded"Mapper"or"Reducer"
– Map"or"Reduce"task"has"run"out"of"memory"
– A"memory"leak"in"the"code"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#28$
Map/Reduce"Task"Out"Of"Memory"Error"(cont’d)"

! Possible$resolu7on$
– Increase"size"of"RAM"allocated"in"mapred.child.java.opts
– Ensure"io.sort.mb"is"smaller"than"RAM"allocated"in"
mapred.child.java.opts
– Require"the"developer"to"recode"a"poorly/wri>en"Mapper"or"Reducer"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#29$
JobTracker"Out"Of"Memory"Error"

ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:


java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:
122)

! Symptom$
– The"JobTracker"crashes"
! Cause$
– JobTracker"has"exceeded"allocated"memory"
! Possible$resolu7ons$
– Increase"JobTracker’s"memory"allocaBon"
– Reduce"mapred.jobtracker.completeuserjobs.maximum
– Amount"of"job"history"held"in"JobTracker’s"RAM"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#30$
Too"Many"Fetch"Failures"

INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for


output of task:

! Symptoms$
– Reduce"tasks"fail"
– Map"tasks"need"to"be"rerun"
– Jobs"are"delayed"by"hours"
! Cause$
– Reducers"are"failing"to"fetch"intermediate"data"from"a"TaskTracker"
where"a"Map"process"ran"
– Too"many"of"these"failures"will"cause"a"TaskTracker"to"be"blacklisted"
"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#31$
Too"Many"Fetch"Failures"(cont’d)"

! Possible$resolu7ons$
– Increase"tasktracker.http.threads
– Decrease"mapred.reduce.parallel.copies
– Tune"mapred.slowstart.completed.maps

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#32$
Not"Able"To"Place"Enough"Replicas"

WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to


place enough replicas

! Symptom$
– Inadequate"replicaBon"or"job"failure"
! Possible$causes$
– DataNodes"do"not"have"enough"xciever"threads"
– Note:"yes,"the"configuraBon"opBon"is"misspelled!"
– Fewer"available"DataNodes"available"than"the"replicaBon"factor"of"the"
blocks"
! Possible$resolu7ons$
– Increase"dfs.datanode.max.xcievers"to"4096"
– Check"replicaBon"factor"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#33$
No"Such"File"or"Directory"

ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker


because ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)

! Symptom$
– The"TaskTracker"cannot"be"started"on"a"node"
! Possible$causes$
– TaskTracker"disk"space"is"full"
– Insufficient"permissions"on"a"directory"the"TaskTracker"writes"to"
– A"disk"has"gone"bad"
! Possible$resolu7ons$
– Increase"dfs.datanode.du.reserved"to"at"least"10%"of"the"disk"
– Set"permissions"for"mapred.local.dir"to"755,"with"owner"mapred

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#34$
Where"Did"My"File"Go?"

hadoop fs –rm –r data


hadoop fs -ls /user/training/.Trash

! Symptom$
– User"cannot"recover"an"accidentally"deleted"file"from"the"trash"
! Possible$causes$
– Trash"is"not"enabled"
– Trash"interval"is"set"too"low"
! Possible$resolu7on$
– Set"fs.trash.interval"to"1440"or"higher"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#35$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring"Hadoop"Clusters"
!! TroubleshooBng"Hadoop"Clusters"
!! Common"MisconfiguraBons"
!! Hands#On$Exercises:$Troubleshoo7ng$Challenge$
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#36$
TroubleshooBng"Challenge:"Heap"of"Trouble"

! In$this$Troubleshoo7ng$Challenge,$you$will$recreate$a$problem$scenario,$
diagnose$the$problem,$and,$if$you$have$7me,$fix$the$problem$
! Your$instructor$will$provide$direc7on$as$you$go$through$the$
troubleshoo7ng$process$
! Please$refer$to$the$Hands#On$Exercise$Manual$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#37$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring"Hadoop"Clusters"
!! TroubleshooBng"Hadoop"Clusters"
!! Common"MisconfiguraBons"
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#38$
EssenBal"Points"

! Be$sure$to$monitor$your$Hadoop$cluster$
– Hadoop"daemons,"disk"usage,"CPU"usage,"swap,"network"usage,"and"
HDFS"health"
! Cloudera$Manager$provides$Hadoop$cluster$monitoring$
! The$Hadoop$metrics$framework$exposes$metrics$to$let$you$integrate$with$
other$monitoring$frameworks$
! Troubleshoo7ng$Hadoop$problems$is$a$challenge,$because$symptoms$do$
not$always$point$to$the$source$of$problems$
! Follow$best$prac7ces$for$configura7on$management,$benchmarking,$and$
monitoring$and$you$will$avoid$many$problems$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#39$
Conclusion"

In$this$chapter,$you$have$learned:$
! What$general$system$condi7ons$to$monitor$
! How$to$monitor$a$Hadoop$cluster$
! Some$techniques$for$troubleshoo7ng$problems$on$a$Hadoop$cluster$
! Some$common$misconfigura7ons,$and$their$resolu7ons$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#40$
Conclusion"
Chapter"16"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#1$
Course"Chapters"
!! IntroducBon" Course"IntroducBon"
!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducBon"to"Apache"Hadoop"
!! GeSng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaBon"and"IniBal"ConfiguraBon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
!! Hadoop"Clients" Planning,"Installing,"and"
!! Cloudera"Manager" Configuring"a"Hadoop"Cluster"
!! Advanced"Cluster"ConfiguraBon"
!! Hadoop"Security"
!! Managing"and"Scheduling"Jobs"
Cluster"OperaBons"and"
!! Cluster"Maintenance"
Maintenance"
!! Cluster"Monitoring"and"TroubleshooBng"
!! Conclusion$
!! Kerberos"ConfiguraBon" Course$Conclusion$and$Appendices$
!! Configuring"HDFS"FederaBon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#2$
Conclusion

During this course, you have learned:
! The core technologies of Hadoop
! How to populate HDFS from external sources
! How to plan your Hadoop cluster hardware and software
! How to deploy a Hadoop cluster
! What issues to consider when installing Pig, Hive, and Impala
! What issues to consider when deploying Hadoop clients
! How Cloudera Manager can simplify Hadoop administration
! How to configure HDFS for high availability
! What issues to consider when implementing Hadoop security
Conclusion (cont'd)

During this course, you have learned:
! How to schedule jobs on the cluster
! How to maintain your cluster
! How to monitor, troubleshoot, and optimize the cluster
Next Steps

! Cloudera offers a number of other training courses, including:
– Hadoop Essentials
– Cloudera Developer Training for Apache Hadoop
– Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
– Cloudera Training for Apache HBase
– Introduction to Data Science: Building Recommender Systems
– Custom courses
! Cloudera also provides consultancy and troubleshooting services
– Please ask your instructor for more information
Class Evaluation

! Please take a few minutes to complete the class evaluation
– Your instructor will show you how to access the online form
Certification Exam

! This course helps to prepare you for the Cloudera Certified Administrator for Apache Hadoop exam
! For more information about Cloudera certification, refer to http://university.cloudera.com/certification.html
Thank You!

! Thank you for attending this course
! If you have any further questions or comments, please feel free to contact us
– Full contact details are on our Web site at http://www.cloudera.com/
Kerberos Configuration
Appendix A
Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Monitoring and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration (this appendix)
!! Configuring HDFS Federation
Kerberos Message Exchange

! There are three phases required for a client to access a service
– Authentication
– Authorization
– Service request
! These are illustrated over the next several slides
– This is an overview of the important points for Kerberos 5
– See RFC 4120 if you're interested in more detail
Kerberos Message Exchange (cont'd)

[Diagram: Authentication Phase, step 1. The client sends a request to the Authentication Service (AS) within the Kerberos Key Distribution Center (KDC); the KDC also hosts the Ticket Granting Service (TGS). The Desired Network Service is shown separately.]

! Client sends a Ticket-Granting Ticket (TGT) request to AS
Kerberos Message Exchange (cont'd)

[Diagram: Authentication Phase, step 2. The AS replies to the client.]

! AS checks database to authenticate client
– Authentication typically done by checking LDAP/Active Directory
– If valid, AS sends Ticket-Granting Ticket (TGT) to client
Kerberos Message Exchange (cont'd)

[Diagram: Authorization Phase, step 3. The client sends a request to the Ticket Granting Service (TGS).]

! Client uses this TGT to request a service ticket from TGS
– A service ticket is validation that a client can access a service
Kerberos Message Exchange (cont'd)

[Diagram: Authorization Phase, step 4. The TGS replies to the client.]

! TGS verifies whether client is permitted to use requested service
– If access granted, TGS sends service ticket to client
Kerberos Message Exchange (cont'd)

[Diagram: Service Request Phase. The client communicates directly with the Desired Network Service.]

! Client can then use the service
– Service can validate client with info from the service ticket
Important Kerberos Client Commands

! The kinit program is used to request a TGT from Kerberos
! Use klist to see your current tickets
! Use kdestroy to explicitly delete your tickets
– Though they'll expire on their own after several hours
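For example, a typical session looks like this (alice is a placeholder principal name, not an account used elsewhere in this course):

```
$ kinit alice@MYREALM.COM
Password for alice@MYREALM.COM:
$ klist
$ kdestroy
```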

Configuring Hadoop Security

! Hadoop security configuration is a specialized topic
! Many specifics depend on
– Version of Hadoop and related programs
– Type of Kerberos server used (Active Directory or MIT)
– Operating system and distribution
Hadoop Security Setup Assumptions

! The following covers the essential points, assuming use of
– CDH3u3, installed from packages
– CentOS 5.6
– MIT Kerberos server (krb5-1.6.1)
! See the "CDH Security Guide" for detailed instructions
– Available at http://www.cloudera.com/
– Be sure to read the one corresponding to your version of CDH
! The following gives an overview of manual configuration
– If using Kerberos, you should consider using Cloudera Manager
– Cloudera Manager (Enterprise) greatly simplifies this process
Hadoop Security Setup Prerequisites

! Working Hadoop cluster
– Installing CDH3 from packages is strongly advised!
! Working Kerberos KDC server
! Kerberos client libraries installed on all Hadoop nodes
Hadoop Security Setup Overview

! The main steps for securing a cluster are to
– Install two extra CDH packages
– Enable strong encryption in Java
– Set KDC hostname and realm on all Hadoop nodes
– Create Kerberos principals
– Create and deploy Kerberos keytab files
– Shut down all Hadoop daemons
– Enable Hadoop security
– Configure HDFS security options
– Configure MapReduce security options
– Restart Hadoop daemons
– Verify that everything works
Install Extra CDH Packages

! Install hadoop-0.20-sbin and hadoop-0.20-native
– On every node in the cluster
– Again, installing from packages is strongly advised
Enable Strong Encryption in Java

! Install JCE Unlimited Strength Jurisdiction Policy Files
– Distributed as a ZIP file from Oracle's Web site
! For every node in your Hadoop cluster
– Locate the JRE's lib/security directory
– Rename existing policy JARs in that directory
– Add new policy JARs extracted from the downloaded ZIP file
Set the KDC Hostname and Realm

! On every node in your Hadoop cluster
– Edit /etc/krb5.conf
– Set the hostname and realm name of the KDC
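For example, a minimal /etc/krb5.conf might look like the following; the realm name and KDC hostname shown here are placeholders for your own values:

```
[libdefaults]
    default_realm = MYREALM.COM

[realms]
    MYREALM.COM = {
        kdc = kdc01.example.com
        admin_server = kdc01.example.com
    }
```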
Create Kerberos Principals

! Create these Kerberos principals on every Hadoop cluster node
– host/myhost.example.com@MYREALM.COM
– hdfs/myhost.example.com@MYREALM.COM
– mapred/myhost.example.com@MYREALM.COM
! These must contain a fully-qualified hostname
– And this must be the hostname of the current node
! Replace MYREALM.COM with your actual realm name
! Example command for creating the host principal on a node:

# kadmin.local -q "addprinc -randkey host/node4.example.com"
Create and Deploy Kerberos Keytab Files

! On every Hadoop node in your cluster
– Use kadmin to create a keytab for the hdfs principal
– Use kadmin to create a keytab for the mapred principal
– Add entries to both files for the host principal
– See CDH3 Security Guide for specific command options
– Deploy the keytab files to Hadoop's conf directory
– Then set ownership and permissions to protect them
– These steps are shown in the commands below

$ sudo mv hdfs.keytab mapred.keytab /etc/hadoop/conf/
$ sudo chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
$ sudo chown mapred:hadoop /etc/hadoop/conf/mapred.keytab
$ sudo chmod 400 /etc/hadoop/conf/*.keytab
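The keytab-creation step that precedes those commands might look roughly like this on one node. This is a hedged sketch: xst option support (notably -norandkey) varies between Kerberos versions, so follow the CDH Security Guide for the exact commands:

```
# kadmin.local -q "xst -norandkey -k hdfs.keytab hdfs/node4.example.com host/node4.example.com"
# kadmin.local -q "xst -norandkey -k mapred.keytab mapred/node4.example.com host/node4.example.com"
```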

Enable Hadoop Security

! Shut down all Hadoop daemons on all cluster nodes
! Edit core-site.xml to enable Hadoop security:
– Properties must be specified on every machine in your cluster!

Property Name                      Value
hadoop.security.authentication     kerberos
hadoop.security.authorization      true
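Expressed as core-site.xml entries, the table above becomes:

```xml
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

The property tables on the following slides translate into hdfs-site.xml and mapred-site.xml entries in exactly the same way.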

Configure HDFS Security

! Edit hdfs-site.xml to add NameNode properties:
– Again, these must be specified on every machine in your cluster!

Property Name                            Value
dfs.block.access.token.enabled           true
dfs.https.address                        <NAMENODE_HOSTNAME>:50475
dfs.https.port                           50475
dfs.namenode.keytab.file                 /etc/hadoop/conf/hdfs.keytab
dfs.namenode.kerberos.principal          hdfs/_HOST@YOUR-REALM.COM
dfs.namenode.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
Configure HDFS Security (cont'd)

! Edit hdfs-site.xml to add Secondary NameNode properties:
– Again, these must be specified on every machine in your cluster

Property Name                                      Value
dfs.secondary.https.address                        <SECONDARY_NN_HOSTNAME>:50495
dfs.secondary.https.port                           50495
dfs.secondary.namenode.keytab.file                 /etc/hadoop/conf/hdfs.keytab
dfs.secondary.namenode.kerberos.principal          hdfs/_HOST@YOUR-REALM.COM
dfs.secondary.namenode.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
Configure HDFS Security (cont'd)

! Edit hdfs-site.xml to add DataNode properties:
– Again, they must be specified on every machine in your cluster

Property Name                            Value
dfs.datanode.data.dir.perm               700
dfs.datanode.address                     0.0.0.0:1004
dfs.datanode.http.address                0.0.0.0:1006
dfs.datanode.keytab.file                 /etc/hadoop/conf/hdfs.keytab
dfs.datanode.kerberos.principal          hdfs/_HOST@YOUR-REALM.COM
dfs.datanode.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
Start NameNode and Verify Authentication

! Start your NameNode
– Then check the NameNode's logs
– You should see a message that says authentication with your Kerberos realm was successful
Perform Basic NameNode Testing

! You can now verify basic HDFS access
! Try to run an HDFS command from another node
– Such as hadoop fs -ls /user/training
! This command might fail
– If you have not already authenticated with Kerberos
! Use the kinit command to authenticate, if needed
– For example, kinit bob will authenticate user 'bob'
– Type the correct password for that account when prompted
– The HDFS command should now succeed
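Put together, authenticating and then retrying the command looks like this (bob and MYREALM.COM are placeholder values):

```
$ kinit bob
Password for bob@MYREALM.COM:
$ hadoop fs -ls /user/training
```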

Start DataNodes and Protect /tmp Directory

! Start your DataNodes
– Logs should show successful authentication (like NameNode did)
! Set "sticky bit" on the HDFS /tmp directory
– Optional, but strongly recommended

$ sudo -u hdfs hadoop fs -chmod 1777 /tmp
Start and Verify Secondary NameNode

! Recommend changing fs.checkpoint.period temporarily
– Set it to a low number (such as 180 seconds) for now
! Start your Secondary NameNode
– Logs should show successful authentication
– Logs should also show that the 2NN Web server uses Kerberos
! Logs should eventually show successful checkpointing
– You can now set fs.checkpoint.period to its original value
– But don't forget to restart the Secondary NameNode
Configure MapReduce Security

! Edit mapred-site.xml to add JobTracker properties:
– They must be specified on every machine in your cluster

Property Name                                    Value
mapreduce.jobtracker.kerberos.principal          mapred/_HOST@YOUR-REALM.COM
mapreduce.jobtracker.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
mapreduce.jobtracker.keytab.file                 /etc/hadoop/conf/mapred.keytab
Configure MapReduce Security (cont'd)

! Edit mapred-site.xml to add TaskTracker properties:
– Again, they must be specified on every machine in your cluster

Property Name                                     Value
mapreduce.tasktracker.kerberos.principal          mapred/_HOST@YOUR-REALM.COM
mapreduce.tasktracker.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
mapreduce.tasktracker.keytab.file                 /etc/hadoop/conf/mapred.keytab
Configure MapReduce Security (cont'd)

! Edit mapred-site.xml to add TaskController properties:
– Again, they must be specified on every machine in your cluster

Property Name                          Value
mapred.task.tracker.task-controller    org.apache.hadoop.mapred.LinuxTaskController
mapreduce.tasktracker.group            mapred
Configure MapReduce Security (cont'd)

! Edit taskcontroller.cfg to add TaskController properties:
– Again, they must be specified on every machine in your cluster
– The format of this file is propertyname=value, one per line
– The min.user.id value may vary by OS distribution

Property Name                  Value
hadoop.log.dir                 /var/log/hadoop
mapreduce.tasktracker.group    mapred
banned.users                   mapred,hdfs,bin
min.user.id                    500
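In the propertyname=value format described above, the resulting taskcontroller.cfg would contain:

```
hadoop.log.dir=/var/log/hadoop
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=500
```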

Start and Test MapReduce Daemons

! Start JobTracker
– Check the logs to ensure successful authentication with realm
! Start a TaskTracker
– Check the logs to ensure successful authentication with realm
– If so, start all other TaskTrackers
! Test MapReduce by running a job
– Such as the Pi estimator job in the Hadoop examples JAR
– This will fail if you haven't got a valid ticket

$ hadoop jar /usr/lib/hadoop/hadoop*examples.jar pi 5 8
Troubleshooting Hadoop Security

! If problems arise, check the following
– File permissions of keytab files and taskcontroller binary
– File ownership of keytab files and taskcontroller binary
– Strong encryption policy files for Java are correctly configured
– Kerberos principals include fully-qualified domain names
– User submitting jobs has valid tickets (use klist to show them)
– System time is synchronized on each node
– Reverse name resolution works correctly
! Also see Appendix A (Troubleshooting) of the CDH Security Guide
– More solutions to many common problems listed there
Configuring HDFS Federation
Appendix B
Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Monitoring and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation (this appendix)
HDFS Federation

! HDFS federation allows a cluster to have multiple NameNodes
– Each manages a namespace volume
– Client-side mounts define overall view (similar to /etc/fstab)
! Namespace volumes (and NameNodes) are independent
– They do not communicate with one another
! Benefits of HDFS federation
– Scalability
– Performance
– Isolation
! The material in this section applies only to CDH4
– And the equivalent Apache Hadoop versions (0.23.x)
! Fair to say that HDFS federation is not often used in production
How Federation Works

! Each NameNode manages a namespace volume, consisting of
– Metadata (file names, permissions, etc.)
– Block pool (all blocks corresponding to files in that volume)
! A NameNode manages the block pool
– But blocks themselves are still stored on DataNodes
– DataNodes in a cluster store blocks for all volumes

[Diagram: two namespace volumes shown side by side. One holds metadata entries accounting, engineering, and marketing above a block pool containing blocks 0-4; the other holds asia, australia, and europe above a block pool containing blocks 5-9.]
Cluster ID

! Each HDFS filesystem is now associated with a Cluster ID
– Either specified or auto-generated when you format HDFS
– All NameNodes in a given cluster must have the same cluster ID
– The cluster ID is visible in the NameNode's Web UI
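To specify the cluster ID yourself, pass it when formatting. This sketch follows the Apache HDFS federation documentation, though the exact flag spelling can vary between Hadoop versions:

```
$ sudo -u hdfs hdfs namenode -format -clusterId mycluster
```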
Federation Configuration

! Each volume within a cluster will have its own NameService ID
– An arbitrary (but unique) identifier that you select
– Used to define several volume-specific properties
! Clients and DataNodes must know about the volumes
– Edit hdfs-site.xml
– Add a new property, dfs.nameservices
– Value is a comma-delimited list of all NameService IDs

<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
Federation Configuration (cont'd)

! For each NameNode-specific property in hdfs-site.xml
– Create a new property for each namespace volume
– Append a dot and the NameService ID

<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>namenode01.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>namenode02.example.com:8020</value>
</property>
Federation Configuration (cont'd)

! Repeat this for volume-specific Secondary NameNode properties
– Usually just dfs.namenode.secondary.http-address
! Propagate configuration changes throughout the cluster
! If setting up a new HDFS installation
– Subsequent steps match those of an unfederated installation
! If adding federation to an existing installation
– Do not format the NameNode
Client-Side Mounts

! HDFS federation allows for multiple NameNodes
– Clients mount one or more of these
! Clients use ViewFS to compose a view of HDFS
– The concept is similar to /etc/fstab on Linux
– This URI typically contains the cluster ID, as shown below
! Edit core-site.xml and set the fs.defaultFS property
– This property was formerly called fs.default.name

<property>
  <name>fs.defaultFS</name>
  <value>viewfs://mycluster</value>
</property>
Client-Side Mounts (cont'd)

! Define a new property in core-site.xml for each mount point
– The property name follows the pattern
  fs.viewfs.mounttable.<CLUSTERID>.link.<MOUNTPOINT>
– The property value will follow this pattern
  hdfs://<NAMENODE_HOST:PORT>/<PATH>
! Let's look at how we can achieve the following…

[Diagram: a directory tree rooted at /, with users (containing engineering, finance, marketing, and sales) and reports (containing inventory and TPS).]
Client-Side Mounts (cont'd)

<property>
  <name>fs.viewfs.mounttable.mycluster.link./users</name>
  <value>hdfs://namenode01.example.com:8020/users</value>
</property>
<property>
  <name>fs.viewfs.mounttable.mycluster.link./reports</name>
  <value>hdfs://namenode02.example.com:8020/reports</value>
</property>

[Diagram: the /users subtree (engineering, finance, marketing, sales) is served by namenode01.example.com; the /reports subtree (inventory, TPS) is served by namenode02.example.com.]
