
Cloudera Administrator Training
for Apache Hadoop

© Copyright 2010–2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201306

Introduction
Chapter 1

Course Chapters

Course Introduction
  – Introduction

Introduction to Apache Hadoop
  – The Case for Apache Hadoop
  – HDFS
  – Getting Data Into HDFS
  – MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
  – Planning Your Hadoop Cluster
  – Hadoop Installation and Initial Configuration
  – Installing and Configuring Hive, Impala, and Pig
  – Hadoop Clients
  – Cloudera Manager
  – Advanced Cluster Configuration
  – Hadoop Security

Cluster Operations and Maintenance
  – Managing and Scheduling Jobs
  – Cluster Maintenance
  – Cluster Maintenance and Troubleshooting

Course Conclusion and Appendices
  – Conclusion
  – Kerberos Configuration
  – Configuring HDFS Federation

Chapter Topics

Introduction (Course Introduction)

• About This Course
• About Cloudera
• Course Logistics

Course Objectives

During this course, you will learn:
• The core technologies of Hadoop
• How to populate HDFS from external sources
• How to plan your Hadoop cluster hardware and software
• How to deploy a Hadoop cluster
• What issues to consider when installing Pig, Hive, and Impala
• What issues to consider when deploying Hadoop clients
• How Cloudera Manager can simplify Hadoop administration
• How to configure HDFS for high availability
• What issues to consider when implementing Hadoop security

Course Objectives (cont'd)

During this course, you will learn:
• How to schedule jobs on the cluster
• How to maintain your cluster
• How to monitor, troubleshoot, and optimize the cluster

Chapter Topics

Introduction (Course Introduction)

• About This Course
• About Cloudera
• Course Logistics

About Cloudera

• The leader in Apache Hadoop-based software and services
• Founded by leading experts on Hadoop from Facebook, Yahoo, Google, and Oracle
• Provides support, consulting, training, and certification for Hadoop users
• Staff includes committers to virtually all Hadoop projects
• Many authors of industry-standard books on Apache Hadoop projects
  – Tom White, Eric Sammer, Lars George, etc.

About Cloudera (cont'd)

• Customers include many key users of Hadoop
  – Allstate, AOL Advertising, Box, CBS Interactive, eBay, Experian, Groupon, Macys.com, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, …
• Cloudera public training:
  – Cloudera Developer Training for Apache Hadoop
  – Cloudera Administrator Training for Apache Hadoop
  – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  – Cloudera Training for Apache HBase
  – Introduction to Data Science: Building Recommender Systems
  – Cloudera Essentials for Apache Hadoop
• Onsite and custom training is also available

CDH

• CDH (Cloudera's Distribution including Apache Hadoop)
  – 100% open source, enterprise-ready distribution of Hadoop and related projects
  – The most complete, tested, and widely deployed distribution of Hadoop
  – Integrates all the key Hadoop ecosystem projects
  – Available as RPMs and Ubuntu/Debian/SuSE packages or as a tarball

Cloudera Standard

• Cloudera Standard
  – Free download including CDH and Cloudera Manager
  – Supports an unlimited number of nodes
• End-to-end administration for Hadoop
  – Automated cluster deployment
  – Centralized administration
  – Cluster monitoring and diagnostic tools

Cloudera Enterprise

• Cloudera Enterprise
  – Subscription product including CDH and Cloudera Manager
• Includes support
• Extra Manager features
  – Rolling upgrades
  – SNMP support
  – LDAP integration
  – Etc.
• Add-on support modules
  – Impala, HBase, Backup and Disaster Recovery, Cloudera Navigator

Chapter Topics

Introduction (Course Introduction)

• About This Course
• About Cloudera
• Course Logistics

Logistics

• Course start and end times
• Lunch
• Breaks
• Restrooms
• Can I come in early/stay late?
• Certification

Introductions

• About your instructor
• About you
  – Experience with Hadoop?
  – Experience as a System Administrator?
  – What platform(s) do you use?
  – Expectations from the course?

The Case for Apache Hadoop
Chapter 2

The Case for Apache Hadoop

In this chapter, you will learn:
• Why Hadoop is needed
• What problems Hadoop solves
• What comprises Hadoop and the Hadoop ecosystem

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

The Data Deluge

• We are generating more data than ever
  – Financial transactions
  – Sensor networks
  – Server logs
  – Analytics
  – E-mail and text messages
  – Social media

The Data Deluge (cont'd)

• And we are generating data faster than ever
  – Automation
  – Ubiquitous internet connectivity
  – User-generated content
• For example, every day
  – Twitter processes 340 million messages
  – Amazon S3 storage adds more than one billion objects
  – Facebook users generate 2.7 billion comments and "Likes"

Data is Value

• This data has many valuable applications
  – Marketing analysis
  – Product recommendations
  – Demand forecasting
  – Fraud detection
  – And many, many more…
• We must process it to extract that value

Data Processing Scalability

• How can we process all that information?
• There are actually two problems
  – Large-scale data storage
  – Large-scale data analysis

Disk Capacity and Price

• We are generating more data than ever before
• Fortunately, the size and cost of storage has kept pace
  – Capacity has increased while price has decreased

  Year    Capacity (GB)    Cost per GB (USD)
  1997    2.1              $157
  2004    200              $1.05
  2012    3,000            $0.05

Disk Capacity and Performance

• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates have not kept pace with capacity

  Year    Capacity (GB)    Transfer Rate (MB/s)    Disk Read Time
  1997    2.1              16.6                    126 seconds
  2004    200              56.5                    59 minutes
  2012    3,000            210                     3 hours, 58 minutes
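
The read times in the table follow directly from dividing capacity by transfer rate, as this quick check shows:

```python
# Sequential read time for a full disk: capacity / transfer rate.
# Uses 1 GB = 1000 MB, matching the table's round numbers.
def read_time_seconds(capacity_gb, rate_mb_per_s):
    return capacity_gb * 1000 / rate_mb_per_s

for year, cap, rate in [(1997, 2.1, 16.6), (2004, 200, 56.5), (2012, 3000, 210)]:
    print(year, round(read_time_seconds(cap, rate)), "seconds")
```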

Data Access is the Bottleneck

• Although we can process data more quickly, accessing it is slow
  – This is true for both reads and writes
• For example, reading a single 3 TB disk takes almost four hours
  – We cannot process the data until we have read it
  – We are limited by the speed of a single disk
• We will see Hadoop's solution in a few moments
  – But first we will examine how we process large amounts of data

Monolithic Computing

• Traditionally, computation has been processor-bound
  – Intense processing on small amounts of data
• For decades, the goal was a bigger, more powerful machine
  – Faster processor, more RAM
• This approach has limitations
  – High cost
  – Limited scalability

The Case for Distributed Systems

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, we didn't try to grow a larger ox."
  – Grace Hopper, early advocate of distributed computing

Distributed Computing

• Modern large-scale processing is distributed across machines
  – Often hundreds or thousands of nodes
  – Common frameworks include MPI, PVM, and Condor
• Focuses on distributing the processing workload
  – Powerful compute nodes
  – Separate systems for data storage
  – Fast network connections to connect them

Distributed Computing Processing Pattern

• Typical processing pattern
  – Step 1: Copy input data from storage to compute node
  – Step 2: Perform necessary processing
  – Step 3: Copy output data back to storage
• This works fine with relatively small amounts of data
  – That is, where step 2 dominates overall runtime

Data Processing Bottleneck

• That pattern doesn't scale with large amounts of data
  – More time spent copying data than actually processing it
  – Getting data to the processors is the bottleneck
• Grows worse as more compute nodes are added
  – They are competing for the same bandwidth
  – Compute nodes become starved for data

Complexity of Distributed Computing

• Distributed systems pay for scalability by adding complexity
• Much of this complexity involves
  – Availability
  – Data consistency
  – Event synchronization
  – Bandwidth limitations
  – Partial failure
  – Cascading failures
• These are often more difficult than the original problem
  – Error handling often accounts for the majority of the code

Distributed Versus Local

"Failure is the defining difference between distributed and local programming"
  – Ken Arnold, CORBA designer

System Requirements: Failure Handling

• Failure is inevitable
  – We should strive to handle it well
• An ideal solution should have (at least) these properties

  Failure-Handling Properties of an Ideal Distributed System
  Automatic      Job can still complete without manual intervention
  Transparent    Tasks assigned to a failed component are picked up by others
  Graceful       Failure results only in a proportional loss of load capacity
  Recoverable    That capacity is reclaimed when the component is later replaced
  Consistent     Failure does not produce corruption or invalid results

More System Requirements

• Linear horizontal scalability
  – Adding new nodes should add proportional load capacity
  – Avoid contention by using a "shared nothing" architecture
  – Must be able to expand the cluster at a reasonable cost
• Jobs run in relative isolation
  – Results must be independent of other jobs running concurrently
  – Although performance can be affected by other jobs
• Simple programming model
  – Should support a widely used language
  – The API must be relatively easy to learn
• Hadoop addresses these requirements

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

Hadoop: A Radical Solution

• Traditional distributed computing frequently involves
  – Complex programming requiring explicit synchronization
  – Expensive, specialized fault-tolerant hardware
  – High-performance storage systems with built-in redundancy
• Hadoop takes a radically different approach
  – Inspired by Google's GFS and MapReduce architecture
  – This new approach addresses the problems described earlier

Hadoop Scalability

• Hadoop aims for linear horizontal scalability
  – Cross-communication among nodes is minimal
  – Just add nodes to increase cluster capacity and performance
• Clusters are built from industry-standard hardware
  – Widely available and relatively inexpensive servers
  – You can "scale out" later when the need arises

Solution: Data Access Bottleneck

• Recap: separate storage and compute systems create a bottleneck
  – More time can be spent copying data than processing it
• Solution: store and process data on the same machines
  – This is why adding nodes increases capacity and performance
• Optimization: use intelligent job scheduling (data locality)
  – Hadoop tries to process data on the same machine that stores it
  – This improves performance and conserves bandwidth
  – "Bring the computation to the data"

Solution: Disk Performance Bottleneck

• Recap: a single disk has great capacity but poor performance
• Solution: use multiple disks in parallel
  – The transfer rate of one disk might be 210 megabytes/second
  – Almost four hours to read 3 TB of data
  – 1,000 such disks in parallel can transfer 210 gigabytes/second
  – Less than 15 seconds to read 3 TB of data
• Colocated storage and processing makes this solution feasible
  – 250-node cluster with 4 disks per node = 1,000 disks
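
The aggregate numbers above can be verified with a few lines of arithmetic:

```python
# Aggregate transfer rate of many disks working in parallel.
disks = 250 * 4                                # 250 nodes, 4 disks each
rate_mb_per_s = disks * 210                    # each disk moves ~210 MB/s
seconds_for_3tb = 3_000_000 / rate_mb_per_s    # 3 TB expressed in MB
print(disks, rate_mb_per_s, round(seconds_for_3tb, 1))
```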

Solution: Complex Processing Code

• Recap: distributed programming is very difficult
  – Often done in C or FORTRAN using complex libraries
• Solution: use a popular language and a high-level API
  – MapReduce code is typically written in Java (like Hadoop itself)
  – It is possible to write MapReduce in nearly any language
• The MapReduce programming model simplifies processing
  – Deal with one record (key/value pair) at a time
  – Complex details are abstracted away
  – No file I/O
  – No networking code
  – No synchronization
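
The one-record-at-a-time model can be sketched in plain Python (an illustration of the programming model only, not the Hadoop API): map emits key/value pairs, a sort groups them by key, and reduce aggregates each group.

```python
# Word count expressed in the MapReduce model.
from itertools import groupby
from operator import itemgetter

def map_fn(offset, line):
    # Called once per input record (here: a line keyed by its offset)
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def run(records):
    mapped = [kv for key, value in records for kv in map_fn(key, value)]
    mapped.sort(key=itemgetter(0))      # the "shuffle and sort" phase
    output = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reduce_fn(word, (count for _, count in group)))
    return output

print(run([(0, "the cat sat"), (1, "the dog")]))
# [('cat', 1), ('dog', 1), ('sat', 1), ('the', 2)]
```

Note that `map_fn` and `reduce_fn` never touch files, sockets, or locks; the framework (here, the tiny `run` driver) handles all of that.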

Solution: Fault Tolerance

• Recap: distributed systems often use expensive components
  – In order to minimize the possibility of failure
• Solution: realize that failure is inevitable
  – And instead try to minimize the effect of failure
  – Hadoop satisfies all the requirements we discussed earlier
• Machine failure is a regular occurrence
  – A server might have a mean time between failures (MTBF) of 5 years (~1,825 days)
  – That equates to about one failure per day in a 2,000-node cluster
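
The one-failure-per-day figure follows from simple expected-value arithmetic:

```python
# Expected daily failures: number of nodes divided by per-node MTBF in days.
nodes = 2000
mtbf_days = 5 * 365          # ~1,825 days
failures_per_day = nodes / mtbf_days
print(round(failures_per_day, 2))   # ~1.1 failures per day
```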

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

Core Hadoop Components

• Hadoop is a system for large-scale data processing
• Hadoop provides two main components to achieve this
  – Data storage: HDFS
  – Data processing: MapReduce
• Plus the infrastructure needed to make them work, including
  – Filesystem utilities
  – Job scheduling and monitoring
  – Web UI

The Hadoop Ecosystem

• Many related tools integrate with Hadoop
  – Data analysis: Hive, Pig
  – Machine learning: Mahout
  – Database integration: Sqoop
  – Workflow management: Oozie
  – Cluster management: Cloudera Manager
• These are not considered "core Hadoop"
  – Rather, they are part of the "Hadoop ecosystem"
  – Many are also open source Apache projects
  – We will learn about several of these later in the course

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

Hands-On Exercise: Installing Hadoop – Networking Setup

• Cloud training environment – four EC2 instances plus the Get2EC2 virtual machine (VM)
  – Start the Get2EC2 VM
  – Configure the local hosts file with the EC2 elastic IP addresses
  – Configure the /etc/hosts file and change host names of the EC2 instances
  – Verify that everything is working correctly
  – Start a SOCKS5 proxy server on the Get2EC2 VM
• Local training environment – four VMware VMs
  – Start the VMs
  – Configure IP addresses
  – Configure the /etc/hosts file and change host names
  – Log out and log in to all four virtual machines
  – Verify that everything is working correctly

Hands-On Exercise: Installing Hadoop – Networking Setup (cont'd)

• At the end of the day:
  – Cloud training environment
    – Exit from all active SSH sessions (including the SOCKS5 proxy server)
    – Suspend the Get2EC2 VM
  – Local training environment
    – Suspend all four VMs
• When you come back in the morning:
  – Cloud training environment
    – Restart the Get2EC2 VM
    – Restart SSH sessions using the connect_to scripts
    – Restart the SOCKS5 proxy server
  – Local training environment
    – Restart all four VMs

Hands-On Exercise: Installing Hadoop – Deployment

• In this Hands-On Exercise, you will install Hadoop in pseudo-distributed mode
• Please refer to the Hands-On Exercise Manual
• Pseudo-distributed mode deployment after exercise completion

Chapter Topics

The Case for Apache Hadoop (Introduction to Apache Hadoop)

• Why Hadoop?
• Fundamental Concepts
• Core Hadoop Components
• Hands-On Exercise: Installing Hadoop
• Conclusion

Essential Points

• We are generating more data – and faster – than ever before
• We can store and process the data, but there are problems using existing techniques
  – Accessing the data from disk is a bottleneck
  – In distributed systems, getting data to the processors is a bottleneck
• Hadoop eliminates the bottlenecks by storing and processing data on the same machine
• Hadoop consists of two core components, HDFS (storage) and MapReduce (processing), and an ecosystem of related tools

Conclusion

In this chapter, you have learned:
• Why Hadoop is needed
• What problems Hadoop solves
• What comprises Hadoop and the Hadoop ecosystem

HDFS
Chapter 3

HDFS

In this chapter, you will learn:
• What features HDFS provides
• How HDFS reads and writes files
• How the NameNode uses memory
• How Hadoop provides file security
• How to use the NameNode Web UI
• How to use the Hadoop File Shell

Chapter Topics

HDFS (Introduction to Apache Hadoop)

• HDFS Features
• Writing and Reading Files
• NameNode Memory Considerations
• Overview of HDFS Security
• Using the NameNode Web UI
• Using the Hadoop File Shell
• Hands-On Exercise: Working with HDFS
• Conclusion

HDFS: The Hadoop Distributed File System

• Based on Google's GFS (Google File System)
• Provides redundant storage for massive amounts of data
  – Using industry-standard hardware
• At load time, data is distributed across all nodes
  – Provides for efficient MapReduce processing (more later)

HDFS Features

• High performance
• Fault tolerance
• Relatively simple centralized management
  – Master/slave architecture
• Security
  – Two levels from which to choose
• Optimized for MapReduce processing
  – Data locality
• Scalability

HDFS Design Assumptions

• High component failure rates
  – Inexpensive components fail all the time
• A "modest" number of HUGE files
  – Just a few million
  – Each file likely to be 100 MB or larger
  – Multi-gigabyte files typical
• Files are write-once
• Large streaming reads
  – Not random access
• Favor high sustained throughput over low latency

HDFS Blocks

• When a file is added to HDFS, it is split into blocks
• This is a similar concept to native filesystems
  – HDFS uses a much larger block size
  – Default block size is 64 MB (configurable)

For example, a 150 MB input file is split into:
  Block #1 (64 MB)
  Block #2 (64 MB)
  Block #3 (remaining 22 MB)
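
The split above is just fixed-size division with a smaller final block, as this small sketch of the arithmetic shows:

```python
# Split a file of the given size into HDFS-style fixed-size blocks.
def split_into_blocks(size_mb, block_mb=64):
    blocks = []
    while size_mb > 0:
        blocks.append(min(block_mb, size_mb))  # the last block may be smaller
        size_mb -= block_mb
    return blocks

print(split_into_blocks(150))  # [64, 64, 22]
```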

HDFS Replication

• These blocks are replicated to nodes throughout the cluster
  – Based on the replication factor (default is three)
• Replication increases reliability and performance
  – Reliability: data can tolerate loss of all but one replica
  – Performance: more opportunities for data locality

For example, with three blocks replicated across five nodes:
  Node A has blocks: 1, 2
  Node B has blocks: 2, 3
  Node C has blocks: 1, 3
  Node D has block: 1
  Node E has blocks: 2, 3
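
Counting replicas in that placement confirms each block meets the default replication factor of three:

```python
# Count replicas per block from the placement shown above.
placement = {
    "A": {1, 2},
    "B": {2, 3},
    "C": {1, 3},
    "D": {1},
    "E": {2, 3},
}
replicas = {}
for node, blocks in placement.items():
    for block in blocks:
        replicas[block] = replicas.get(block, 0) + 1
print(replicas)  # every block is stored three times
```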

Classical HDFS Architecture

• The architecture of HDFS has recently been improved
  – More resilient
  – Better scalability
• These changes are only available in recent releases
  – Such as Cloudera's CDH4
• Many still run earlier releases in production
  – We will discuss the older architecture first
  – Then we will cover how it has changed

Classical HDFS Architectural Overview

• There are three daemons in "classical" HDFS
  – NameNode (master)
  – Secondary NameNode (master)
  – DataNode (slave)

(Diagram: a NameNode and a Secondary NameNode above a row of DataNodes.)

The NameNode

• The NameNode stores all metadata
  – Information about file locations in HDFS
  – Information about file ownership and permissions
  – Names of the individual blocks
  – Locations of the blocks
• Metadata is stored on disk and read when the NameNode daemon starts up
  – Filename is fsimage
  – Note: block locations are not stored in fsimage
• Changes to the metadata are made in RAM
  – Changes are also written to a log file on disk called edits
  – Full details later

The Slave Nodes

• Actual contents of the files are stored as blocks on the slave nodes
• Blocks are simply files on the slave nodes' underlying filesystem
  – Named blk_xxxxxxx
  – Nothing on the slave node provides information about what underlying file the block is a part of
  – That information is only stored in the NameNode's metadata
• Each block is stored on multiple different nodes for redundancy
  – Default is three replicas
• Each slave node runs a DataNode daemon
  – Controls access to the blocks
  – Communicates with the NameNode

The Secondary NameNode: Caution!

• The Secondary NameNode is not a failover NameNode!
  – It performs memory-intensive administrative functions for the NameNode
  – The NameNode keeps information about files and blocks (the metadata) in memory
  – The NameNode writes metadata changes to an edit log
  – The Secondary NameNode periodically combines a prior snapshot of the file system metadata and edit log into a new snapshot
  – The new snapshot is transmitted back to the NameNode
• The Secondary NameNode should run on a separate machine in a large installation
  – It requires as much RAM as the NameNode

File System Metadata Snapshot and Edit Log

• The fsimage file contains a file system metadata snapshot
  – It is not updated at every write
  – This would be very slow
• When an HDFS client performs a write operation, it is recorded in the primary NameNode's edit log
  – The edits file
  – The NameNode's in-memory representation of the file system metadata is also updated
• Applying all changes in the edits file during a NameNode restart could take a long time
  – The file could also grow to be huge

Checkpointing the File System Metadata

• The Secondary NameNode periodically checkpoints the NameNode's in-memory file system data
  1. Tells the NameNode to roll its edits file
  2. Retrieves fsimage and edits from the NameNode
  3. Loads fsimage into memory and applies the changes from the edits file
  4. Creates a new, consolidated fsimage file
  5. Sends the new fsimage file back to the primary NameNode
  6. The NameNode replaces the old fsimage file with the new one, replaces the old edits file with the new one it created in step 1, and updates the fstime file to record the checkpoint time
• By default, checkpointing occurs once an hour or if the edits file grows larger than 64 MB
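
The heart of the checkpoint (step 3) is replaying logged operations onto a snapshot. Here is a toy model of that idea, with metadata as a plain dict and edits as a list of operations; this is purely illustrative and has nothing to do with the actual HDFS on-disk formats:

```python
# Toy model: replay an edit log against a metadata snapshot (fsimage).
def apply_edits(fsimage, edits):
    meta = dict(fsimage)              # work on a copy of the snapshot
    for op, path, value in edits:
        if op == "create":
            meta[path] = value
        elif op == "delete":
            meta.pop(path, None)
    return meta                       # the new, consolidated snapshot

old = {"/users/a": "blk_1"}
log = [("create", "/users/b", "blk_2"), ("delete", "/users/a", None)]
print(apply_edits(old, log))  # {'/users/b': 'blk_2'}
```

Once the consolidated snapshot exists, the old edit log can be discarded, which is exactly why the checkpoint keeps the edits file from growing without bound.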

Single Point of Failure

• Each Hadoop cluster has a single NameNode
  – The Secondary NameNode is not a failover NameNode
• The NameNode is a single point of failure (SPOF)
• In practice, this is not a major issue
  – HDFS will be unavailable until the NameNode is replaced
  – There is very little risk of data loss for a properly managed system
• Recovering from a failed NameNode is relatively easy
  – We will discuss this process in detail later

Contemporary HDFS Architecture

• The preceding slides described the "classic" architecture
  – Versions of HDFS in CDH3 and earlier
• HDFS now has High Availability and Federation features
  – Initially developed on the Apache Hadoop 0.23 branch
  – Now part of Hadoop 2.x
  – Available in Cloudera's distribution, starting with CDH4

HDFS High Availability

• HDFS High Availability addresses the NameNode SPOF
• Two NameNodes: one active and one standby
  – The standby NameNode takes over when the active NameNode fails
  – The standby NameNode also does checkpointing (a Secondary NameNode is no longer needed)

(Diagram: an active NameNode and a standby NameNode, each connected to all of the DataNodes.)

HDFS"FederaEon"

! Federa/on%improves%the%scalability%of%HDFS%
– In"CDH3"and"earlier,"there"could"only"be"a"single"NameNode"
– FederaEon"allows"for"mulEple"independent"NameNodes"
! Each%NameNode%manages%a%namespace-volume%
– Client/side"mount"tables"define"the"overall"view
– NN1"might"provide"/users"and"NN2"might"provide"/reports"

users reports

engineering finance marketing sales inventory TPS

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#20%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! Wri/ng%and%Reading%Files%
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#21%
Anatomy"of"a"File"Write"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#22%
Anatomy"of"a"File"Write"(cont’d)"

1.  Client%connects%to%the%NameNode%
2.  NameNode%places%an%entry%for%the%file%in%its%metadata,%returns%the%
block%name%and%list%of%DataNodes%to%the%client%
3.  Client%connects%to%the%first%DataNode%and%starts%sending%data%
4.  As%data%is%received%by%the%first%DataNode,%it%connects%to%the%second%and%
starts%sending%data%
5.  Second%DataNode%similarly%connects%to%the%third%
6.  ack%packets%from%the%pipeline%are%sent%back%to%the%client%
7.  Client%reports%to%the%NameNode%when%the%block%is%wrieen%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#23%
Anatomy"of"a"File"Write"(cont’d)"

! If%a%DataNode%in%the%pipeline%fails%
– The"pipeline"is"closed"
– A"new"pipeline"is"opened"with"the"two"good"nodes"
– The"data"conEnues"to"be"wri>en"to"the"two"good"nodes"in"the"pipeline"
– The"NameNode"will"realize"that"the"block"is"under/replicated,"and"will"
re/replicate"it"to"another"DataNode"
! As%the%blocks%are%wrieen,%a%checksum%is%also%calculated%and%wrieen%
– Used"to"ensure"the"integrity"of"the"data"when"it"is"later"read"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#24%
Hadoop"is"‘Rack/aware’"

! Hadoop%understands%the%concept%of%‘rack%awareness’%
– The"idea"of"where"nodes"are"located,"relaEve"to"one"another"
– Helps"the"JobTracker"to"assign"tasks"to"nodes"closest"to"the"data"
– Helps"the"NameNode"determine"the"‘closest’"block"to"a"client"during"
reads"
– In"reality,"this"should"perhaps"be"described"as"being"‘switch/aware’"
! HDFS%replicates%data%blocks%on%nodes%on%different%racks%%
– Provides"extra"data"security"in"case"of"catastrophic"hardware"failure ""
! Rack#awareness%is%determined%by%a%user#defined%script%
– See"later"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#25%
HDFS"Block"ReplicaEon"Strategy"

! First%copy%of%the%block%is%placed%on%the%same%node%as%the%client%
– If"the"client"is"not"part"of"the"cluster,"the"first"block"is"placed"on"a"
random"node"
– System"tries"to"find"one"which"is"not"too"busy"
! Second%copy%of%the%block%is%placed%on%a%node%residing%on%a%different%rack%
! Third%copy%of%the%block%is%placed%on%different%node%in%the%same%rack%as%the%
second%copy%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#26%
Anatomy"of"a"File"Read"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#27%
Anatomy"of"a"File"Read"(cont’d)"

1.  Client%connects%to%the%NameNode%
2.  NameNode%returns%the%name%and%loca/ons%of%the%first%few%blocks%of%
the%file%
–  Block"locaEons"are"returned"closest/first"
3.  Client%connects%to%the%first%of%the%DataNodes,%and%reads%the%block%
!  If%the%DataNode%fails%during%the%read,%the%client%will%seamlessly%connect%
to%the%next%one%in%the%list%to%read%the%block%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#28%
Dealing"With"Data"CorrupEon"

! As%the%DataNode%is%reading%the%block,%it%also%calculates%the%checksum%
! ‘Live’%checksum%is%compared%to%the%checksum%created%when%the%block%was%
stored%
! If%they%differ,%the%client%reads%from%the%next%DataNode%in%the%list%
– The"NameNode"is"informed"that"a"corrupted"version"of"the"block"has"
been"found"
– The"NameNode"will"then"re/replicate"that"block"elsewhere"
! The%DataNode%verifies%the%checksums%for%blocks%on%a%regular%basis%to%
avoid%‘bit%rot’%
– Default"is"every"three"weeks"aier"the"block"was"created"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#29%
Data"Reliability"and"Recovery"

! DataNodes%send%heartbeats-to%the%NameNode%
– Every"three"seconds"
! Aher%a%period%without%any%heartbeats,%a%DataNode%is%assumed%to%be%lost%
– NameNode"determines"which"blocks"were"on"the"lost"node"
– NameNode"finds"other"DataNodes"with"copies"of"these"blocks"
– These"DataNodes"are"instructed"to"copy"the"blocks"to"other"nodes"
– Three/fold"replicaEon"is"acEvely"maintained"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#30%
The"NameNode"Is"Not"a"Bo>leneck"

! Note:%the%data%never%travels%via%a%NameNode%
– For"writes"
– For"reads"
– During"re/replicaEon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#31%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode%Memory%Considera/ons%
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#32%
NameNode:"Memory"AllocaEon"

! When%a%NameNode%is%running,%all%metadata%is%held%in%RAM%for%fast%
response%
! Each%‘item’%consumes%150#200%bytes%of%RAM%
! Items:%
– Filename,"permissions,"etc."
– Block"informaEon"for"each"block"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#33%
NameNode:"Memory"AllocaEon"(cont’d)"

! Why%HDFS%prefers%fewer,%larger%files:%
– Consider"1GB"of"data,"HDFS"block"size"128MB"
– Stored"as"1"x"1GB"file"
– Name:"1"item"
– Blocks:"8"items"
– Total"items:"9"
– Stored"as"1000"x"1MB"files"
– Names:"1000"items"
– Blocks:"1000"items"
– Total"items:"2000"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#34%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview%of%HDFS%Security%
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#35%
HDFS"File"Permissions""

! Files%in%HDFS%have%an%owner,%a%group,%and%permissions%
– Very"similar"to"Unix"file"permissions"
! File%permissions%are%read%(r),%write%(w)%and%execute%(x)%for%each%of%owner,%
group,%and%other%
– x"is"ignored"for"files"
– For"directories,"x"means"that"its"children"can"be"accessed"
! HDFS%permissions%are%designed%to%stop%good%people%doing%foolish%things%
– Not"to"stop"bad"people"doing"bad"things!"
– HDFS"believes"you"are"who"you"tell"it"you"are"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#36%
Hadoop"Security"Overview"

! Hadoop’s%security%has%had%authoriza/on%for%some%/me%
– The"ability"to"allow"people"to"do"some"things"but"not"others"
– Example:"file"permissions"
! Authen/ca/on%has%historically%been%rela/vely%weak%
– AuthorizaEon"requires"you"to"first"idenEfy"the"user"
– Hadoop’s"default"mechanism"for"doing"this"is"easily"defeated"
– Intended"to"prevent"stupid"mistakes"made"by"honest"people"
! Hadoop%can%now%op/onally%enforce%strong%authen/ca/on%
– Via"integraEon"with"Kerberos"
! We%will%cover%Hadoop%security%in%much%more%depth%later%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#37%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using%the%NameNode%Web%UI%
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#38%
NameNode"Web"UI"

! The%NameNode%exposes%its%Web%UI%on%port%50070%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#39%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using%the%Hadoop%File%Shell%
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#40%
Accessing"HDFS"via"the"Command"Line"

! HDFS%is%not%a%general%purpose%filesystem%
– Not"built"into"the"OS,"so"only"specialized"tools"can"access"it"
! End%users%typically%access%HDFS%via%the%hadoop fs%command%
– AcEons"are"specified"with"subcommands"(prefixed"with"a"minus"sign)"
– Most"subcommands"are"similar"to"corresponding"UNIX"commands"
! Display%the%contents%of%the%/user/fred/sales.txt%file%

$ hadoop fs -cat /user/fred/sales.txt

! Create%a%directory%(below%the%root)%called%reports%

$ hadoop fs -mkdir /reports

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#41%
Copying"Local"Data"To"and"From"HDFS"

! Remember%that%HDFS%is%dis/nct%from%your%local%filesystem%
– The"hadoop fs –put"command"copies"local"files"to"HDFS"
– The"hadoop fs –get"fetches"a"local"copy"of"a"file"from"HDFS"

Hadoop Cluster

$ hadoop fs -put sales.txt /reports


Client Machine

$ hadoop fs -get /reports/sales.txt

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#42%
More"hadoop fs"Command"Examples""

! Copy%file%input.txt%from%local%disk%to%the%user’s%directory%in%HDFS%

$ hadoop fs -put input.txt input.txt

– This"will"copy"the"file"to"/user/username/input.txt
! Get%a%directory%lis/ng%of%the%HDFS%root%directory%

$ hadoop fs -ls /

! Delete%the%file%/reports/sales.txt%

$ hadoop fs –rm /reports/sales.txt

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#43%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"file"shell"
!! Hands#On%Exercise:%Working%with%HDFS%
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#44%
Hands/On"Exercise:"Working"With"HDFS"

! In%this%Hands#On%Exercise,%you%will%copy%a%large%file%into%HDFS%and%explore%
the%results%in%the%NameNode%Web%UI%and%in%Linux%%
! Please%refer%to%the%Hands#On%Exercise%Manual%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#45%
Chapter"Topics"

HDFS% Introduc/on%to%Apache%Hadoop%

!! HDFS"Features"
!! WriEng"and"Reading"Files"
!! NameNode"Memory"ConsideraEons"
!! Overview"of"HDFS"Security"
!! Using"the"NameNode"Web"UI"
!! Using"the"Hadoop"File"Shell"
!! Hands/On"Exercise:"Working"with"HDFS"
!! Conclusion%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#46%
EssenEal"Points"

! HDFS%distributes%large%blocks%of%data%across%a%set%of%machines%to%support%
MapReduce%data%locality%
– AssumpEon"is"that"the"typical"file"size"on"a"Hadoop"cluster"is"large"
– Block"size"is"configurable,"default"is"64MB"
! HDFS%provides%fault%tolerance%with%built#in%data%redundancy%
– The"number"of"replicas"per"block"is"configurable,"default"is"3"
! The%NameNode%daemon%maintains%all%HDFS%metadata%in%memory%and%
stores%it%on%disk%
– In"a"non/HA"configuraEon,"the"NameNode"works"with"a"Secondary"
NameNode"that"provides"administraEve"services"
– In"an"HA"configuraEon,"the"Standby"NameNode"is"configured"in"case"of"
a"NameNode"failure"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#47%
Conclusion

In this chapter, you have learned:
! What features HDFS provides
! How HDFS reads and writes files
! How the NameNode uses memory
! How Hadoop provides file security
! How to use the NameNode Web UI
! How to use the Hadoop File Shell
GeAng"Data"Into"HDFS"
Chapter"4"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"1$
Course"Chapters"
!! IntroducHon" Course"IntroducHon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
Introduc.on$to$Apache$Hadoop$
!! Ge6ng$Data$Into$HDFS$
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaHon"and"IniHal"ConfiguraHon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,"Installing,"and"
!! Hadoop"Clients"
Configuring"a"Hadoop"Cluster"
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraHon"
!! Hadoop"Security"
!! Managing"and"Scheduling"Jobs"
Cluster"OperaHons"and"
!! Cluster"Maintenance"
Maintenance"
!! Cluster"Maintenance"and"TroubleshooHng"
!! Conclusion"
!! Kerberos"ConfiguraHon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaHon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"2$
GeAng"Data"Into"HDFS"

In$this$chapter,$you$will$learn:$
! How$to$import$data$into$HDFS$with$Flume$$
! How$to$import$data$into$HDFS$with$Sqoop$
! What$REST$interfaces$Hadoop$provides$
! Best$prac.ces$for$impor.ng$data$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"3$
GeAng"Data"Into"HDFS"

! In$the$last$chapter,$you$learned$how$to$use$the$hadoop fs$command$to$
copy$data$into$and$out$of$HDFS$
! In$this$chapter,$we$will$explore$several$Hadoop$ecosystem$tools$for$
accessing$HDFS$data$from$outside$of$Hadoop$
– We"can"only"provide"a"brief"overview"of"these"tools;"consult"the"
documentaHon"at"http://archive.cloudera.com/docs"for"
full"details"on"installaHon"and"configuraHon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"4$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! Inges.ng$Data$From$External$Sources$With$Flume$$
!! Hands/On"Exercise:"Using"Flume"
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST"Interfaces"
!! Best"PracHces"for"ImporHng"Data"
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"5$
Flume:"Basics"

! Flume$is$a$distributed,$reliable,$available$service$for$efficiently$moving$
large$amounts$of$data$as$it$is$produced$
– Ideally"suited"to"gathering"logs"from"mulHple"systems"and"inserHng"
them"into"HDFS"as"they"are"generated"
! Flume$is$an$open$source$Apache$project$
– IniHally"developed"by"Cloudera"
– Included"in"CDH"
! Flume’s$design$goals:$
– Reliability"
– Scalability"
– Extensibility"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"6$
Flume:"Usage"Pa>erns"

! Flume$is$typically$used$to$ingest$log$files$from$real".me$systems$such$as$
Web$servers,$firewalls,$and$mailservers$into$HDFS$
! Currently$in$use$in$many$large$organiza.ons,$inges.ng$millions$of$events$
per$day$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"7$
Flume:"High/Level"Overview"

Agent$$ Agent$ Agent$ Agent$

encrypt$

Optionally pre-process incoming data:


Agent$ Agent$ perform transformations, suppressions,
metadata enrichment
compress$ batch$
Each agent can be configured with an
encrypt$ in-memory or durable channel.

Writes to multiple HDFS file formats


(text, SequenceFile, JSON, Avro,
others) Agent(s)$
Parallelized writes across many
collectors – as much write throughput
as required

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"8$
Flume"Agent"CharacterisHcs"

! Each$Flume$agent$has$a$source$and$a$sink$
! Source$
– Tells"the"node"where"to"receive"data"from"
! Sink$
– Tells"the"node"where"to"send"data"to"
! Channel$
– A"queue"between"the"Source"and"Sink"
– Can"be"in"memory"only"or"‘Durable’"
– Durable"channels"will"not"lose"data"if"power"is"lost"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"9$
Flume’s"Design"Goals:"Reliability"

! Channels$provide$Flume’s$reliability$
! Memory$Channel$
– Data"will"be"lost"if"power"is"lost"
! Disk"based$Channel$
– Disk/base"queue"guarantees"durability"of"data"in"face"of"a"power"loss"
! Data$transfer$between$Agents$and$Channels$is$transac.onal$
– A"failed"data"transfer"to"a"downstream"agent"rolls"back"and"retries"
! Can$configure$mul.ple$Agents$with$the$same$task$
– e.g.,"2"Agents"doing"the"job"of"1"‘collector’"–"if"one"agent"fails"then"
upstream"agents"would"fail"over"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"10$
Flume’s"Design"Goals:"Scalability"

! Scalability$
– The"ability"to"increase"system"performance"linearly"–"or"be>er"–"by"
adding"more"resources"to"the"system"
– Flume"scales"horizontally"
– As"load"increases,"more"machines"can"be"added"to"the"
configuraHon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"11$
Flume’s"Design"Goals:"Extensibility"

! Extensibility$
– The"ability"to"add"new"funcHonality"to"a"system"
! Flume$can$be$extended$by$adding$Sources$and$Sinks$to$exis.ng$storage$
layers$or$data$pla^orms$
– General"Sources"include"data"from"files,"syslog,"and"standard"output"
from"any"Linux"process"
– General"Sinks"include"files"on"the"local"filesystem"or"HDFS"
– Developers"can"write"their"own"Sources"or"Sinks"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"12$
Installing"Flume"

! Flume$is$available$as$a$tarball,$RPM$or$Debian$package$
! Flume$tarball$
– extract"the"tar"file"
– set"the"FLUME_CONF_DIR"environment"variable"

$ cd /usr/local/ && sudo tar -zxvf <path_to_flume-ng-1.1.0-cdh4.0.0b2.tar.gz>


$ export FLUME_CONF_DIR=/usr/local/flume-ng/conf

! Flume$RPM$or$Debian$Package$

# Ubuntu and other Debian systems


$ sudo apt-get install flume-ng-agent flume-ng-doc
# RedHat-compatible systems
$ sudo yum install flume-ng-agent flume-ng-doc
# SUSE systems
$ sudo zypper install flume-ng-agent flume-ng-doc

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"13$
Configuring Flume

! Configure Agent Nodes on the machine(s) generating the data
! The flume.properties file defines the sources, sinks, channels, and flow within an agent
– Copy the Flume template property file conf/flume-conf.properties.template to conf/flume.properties and edit it

# List the sources, sinks and channels for the agent
<agent>.sources = <Source>
<agent>.sinks = <Sink>
<agent>.channels = <Channel1> <Channel2>

# Set the channels for the source
<agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# Set the channel for the sink
<agent>.sinks.<Sink>.channel = <Channel1>
Agent"ConfiguraHon"Example"

!  tail -f of$a$file$as$the$source,$downstream$node$as$the$sink$

tail1.sources = src1
tail1.channels = ch1
tail1.sinks = sink1 sink2
tail1.sinkgroups = sg1

tail1.sources.src1.type = exec
tail1.sources.src1.command = tail -F /tmp/access_log
tail1.sources.src1.channels = ch1

tail1.channels.ch1.type = memory
tail1.channels.ch1.capacity = 500

tail1.sinks.sink1.type = avro
tail1.sinks.sink1.hostname = localhost
tail1.sinks.sink1.port = 6000
tail1.sinks.sink1.batch-size = 1
tail1.sinks.sink1.channel = ch1
""

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"15$
Agent"ConfiguraHon"Example"(cont'd)"

! Upstream$node$as$the$source,$HDFS$as$the$sink$

collector1.sources = src1
collector1.channels = ch1
collector1.sinks = sink1

collector1.sources.src1.type = avro
collector1.sources.src1.bind = localhost
collector1.sources.src1.port = 6000
collector1.sources.src1.channels = ch1

collector1.channels.ch1.type = memory
collector1.channels.ch1.capacity = 500

collector1.sinks.sink1.type = hdfs
collector1.sinks.sink1.hdfs.path = /collector_dir
collector1.sinks.sink1.hdfs.filePrefix = access_log
collector1.sinks.sink1.channel = ch1
""
""

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"16$
StarHng"Flume"Agents"

! Star.ng$an$agent$using$the$/etc/init.d$script$
– Can"be"used"to"start"Flume"agents"automaHcally"upon"reboot"
– Default"agent"name"is"agent
– Default"agent"configuraHon"file"is"/etc/flume-ng/conf/
flume.conf
– Start"with"sudo service flume-ng-agent start
! Star.ng$an$agent$from$the$command$line$
– Specify"the"agent"name"and"the"agent"configuraHon"file"in"the"
command"line"
– For"example:"
– flume-ng agent --conf-file \
/etc/hadoop/conf/flume-conf.properties \
--name tail-agent

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"17$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands"On$Exercise:$Using$Flume$
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST"Interfaces"
!! Best"PracHces"for"ImporHng"Data"
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"18$
Hands/On"Exercise:"Using"Flume"

! In$this$Hands"On$Exercise,$you$will$create$a$simple$Flume$configura.on$to$
store$dynamically"generated$data$in$HDFS$
! Please$refer$to$the$Hands"On$Exercise$Manual$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"19$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands/On"Exercise:"Using"Flume"
!! Inges.ng$Data$From$Rela.onal$Databases$With$Sqoop$
!! REST"Interfaces"
!! Best"PracHces"for"ImporHng"Data"
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"20$
What"is"Sqoop?"

! Sqoop$is$“the$SQL"to"Hadoop$database$import$tool”$
– Open/source"Apache"project"
– Originally"developed"at"Cloudera"
– Included"in"CDH"
! Designed$to$import$data$from$RDBMSs$(Rela.onal$Database$Management$
Systems)$into$HDFS$
– Can"also"send"data"from"HDFS"to"an"RDBMS"
! Uses$JDBC$(Java$Database$Connec.vity)$to$connect$to$the$RDBMS$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"21$
How"Does"Sqoop"Work?"

! Sqoop$examines$each$table$and$automa.cally$generates$a$Java$class$to$
import$data$into$HDFS$
! It$then$creates$and$runs$a$Map"only$MapReduce$job$to$import$the$data$
– By"default,"four"Mappers"connect"to"the"RDBMS"
– Each"imports"a"quarter"of"the"data"
$ Hadoop Cluster

Submit MapReduce Jobs


Map-Only
Tasks

Access Table
Definitions and
RDBMS /
Generate Classes
Data Warehouse /
Import and
Document-Based
Export Data
System

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"22$
Sqoop"Features"

! Imports$a$single$table,$or$all$tables$in$a$database$
! Can$specify$which$rows$to$import$
– Via"a"WHERE"clause"
! Can$specify$which$columns$to$import$
! Can$provide$an$arbitrary$SELECT$statement$
! Sqoop$can$automa.cally$create$a$Hive$table$based$on$the$imported$data$
! Supports$incremental$imports$of$data$
! Can$export$data$from$HDFS$to$a$database$table$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"23$
Sqoop"Connectors"

! Custom$Sqoop$connectors$exist$for$higher"speed$import$from$some$
RDBMSs$and$other$systems$
– Use"a"system’s"naHve"protocols"to"access"data"rather"than"JDBC"
– Provides"much"faster"performance"
– Typically"developed"by"the"third/party"RDBMS"vendor"
– SomeHmes"in"collaboraHon"with"Cloudera"
! Current$systems$supported$by$custom$connectors$include:$
– Netezza"
– Teradata"
– Oracle"Database"(connector"developed"with"Quest"Sokware)"
– Microsok"SQL"Server"
! Others$are$in$development$
! Custom$connectors$are$oien$not$open$source,$but$are$free$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"24$
Sqoop"Usage"Examples"

! List$all$databases$
sqoop list-databases --username fred -P \
--connect jdbc:mysql://dbserver.example.com/

! List$all$tables$in$the$world$database$
$sqooplist-tables --username fred -P \
--connect jdbc:mysql://dbserver.example.com/world

! Import$all$tables$in$the$world$database$
$sqoopimport-all-tables --username fred --password derf \
--connect jdbc:mysql://dbserver.example.com/world

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"25$
Sqoop"2"–"Sqoop"as"a"Service"

! New$version$of$Sqoop$can$be$run$as$a$service$on$a$centrally"available$
machine$
Metadata
Repository

Hadoop Cluster

Sqoop2
Server Map-Only
Tasks

...
Access Table
Definitions and Import and
Browser Export Data
Generate Classes

RDBMS /
Data Warehouse /
Document-Based
System

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"26$
Sqoop"2"–"Sqoop"as"a"Service"(cont’d)"

Func.onality$ Sqoop$ Sqoop2


Installa.on$and$ •  Connectors$and$JDBC$drivers$ •  Connectors$and$JDBC$drivers$are$
Configura.on$ are$installed$on$every$client$ installed$on$the$Sqoop2$server$
•  Local$configura.on$requires$ •  Requires$database$connec.vity$
root$privileges$ for$the$Sqoop2$server$
•  Database$connec.vity$
required$for$every$client$
Client$Interface$ CLI$only$ CLI,$Web$UI,$REST$$
Security$ Every$invoca.on$requires$ Administrator$specifies$creden.als$
creden.als$to$RDBMS$ when$crea.ng$server"side$
Connec.on$objects$
Resource$ No$resource$management$ Administrator$can$limit$the$number$
Management$ of$connec.ons$to$the$RDBMS$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"27$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands/On"Exercise:"Using"Flume"
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST$Interfaces$
!! Best"PracHces"for"ImporHng"Data"
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"28$
WebHDFS"

! Provides$an$HTTP/HTTPS$REST$interface$to$HDFS$
– Supports"both"reads"and"writes"from/to"HDFS"
– Can"be"accessed"from"within"a"program"or"script"
– Can"be"used"via"command/line"tools"such"as"curl"or"wget
! Simple$to$deploy$
– Modify"one"parameter"in"the"Hadoop"configuraHon"
– Restart"the"NameNode"and"DataNodes"
! Requires$client$access$to$every$DataNode$in$the$cluster$
! Does$not$support$HDFS$HA$deployments$
$

REST: REpresentational State Transfer


©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"29$
H>pFS"

! Provides$an$HTTP/HTTPS$REST$interface$to$HDFS$
– The"interface"is"idenHcal"to"the"WebHDFS"REST"interface"
! Slightly$more$complex$than$WebHDFS$to$deploy$
– Install"and"configure"an"H>pFS"server"
– Enable"proxy"access"to"HDFS"for"an"httpfs"user""
– Restart"the"NameNode(s)"
! Requires$client$access$to$the$HkpFS$server$only$
– The"H>pFS"server"then"accesses"HDFS"
! Supports$HDFS$HA$deployments$
$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"30$
WebHDFS/HttpFS REST Interface Examples

! These examples will work with either WebHDFS or HttpFS
– For WebHDFS, specify the NameNode host and port (default: 50070)
– For HttpFS, specify the HttpFS server and port (default: 14000)
! Open and get the shakespeare.txt file

$ curl -i -L "http://host:port/webhdfs/v1/user/training/input/shakespeare.txt?op=OPEN&user.name=training"

! Make the mydir directory

$ curl -i -X PUT "http://host:port/webhdfs/v1/user/training/mydir?op=MKDIRS&user.name=training"
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands/On"Exercise:"Using"Flume"
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST"Interfaces"
!! Best$Prac.ces$for$Impor.ng$Data$
!! Hands/On"Exercise:"ImporHng"Data"With"Sqoop"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"32$
What"Do"Others"See"As"Data"Is"Imported?"

! When$a$client$starts$to$write$data$to$HDFS,$the$NameNode$marks$the$file$
as$exis.ng,$but$being$of$zero$size$
– Other"clients"will"see"that"as"an"empty"file"
! Aier$each$block$is$wriken,$other$clients$will$see$that$block$
– They"will"see"the"file"growing"as"it"is"being"created,"one"block"at"a"Hme"
! This$is$typically$not$a$good$idea$
– Other"clients"may"begin"to"process"a"file"as"it"is"being"wri>en"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"33$
ImporHng"Data:"Best"PracHces"

! Best$prac.ce$is$to$import$data$into$a$temporary$directory$
! Aier$the$file$is$completely$wriken,$move$data$to$the$target$directory$
– This"is"an"atomic"operaHon"
– Happens"very"quickly"since"it"merely"requires"an"update"of"the"
NameNode’s"metadata"
! Many$organiza.ons$standardize$on$a$directory$structure$such$as$
– /incoming/<import_job_name>/<files>"
– /for_processing/<import_job_name>/<files>"
– /completed/<import_job_name>/<files>
! It$is$the$job’s$responsibility$to$move$the$files$from$for_processing$to$
completed$aier$the$job$has$finished$successfully$
! Discussion$point:$your$best$prac.ces?$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"34$
Chapter"Topics"

Ge6ng$Data$Into$HDFS$ Introduc.on$to$Apache$Hadoop$

!! IngesHng"Data"From"External"Sources"With"Flume""
!! Hands/On"Exercise:"Using"Flume"
!! IngesHng"Data"From"RelaHonal"Databases"With"Sqoop"
!! REST"Interfaces"
!! Best"PracHces"for"ImporHng"Data"
!! Hands"On$Exercise:$Impor.ng$Data$With$Sqoop$
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"35$
Hands/On"Exercise:"ImporHng"Data"Using"Sqoop"

! In$this$Hands"On$Exercise,$you$will$import$data$from$a$rela.onal$database$
using$Sqoop$
! Please$refer$to$the$Hands"On$Exercise$Manual$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 4"36$
Chapter Topics

Getting Data Into HDFS | Introduction to Apache Hadoop

!! Ingesting Data From External Sources With Flume
!! Hands-On Exercise: Using Flume
!! Ingesting Data From Relational Databases With Sqoop
!! REST Interfaces
!! Best Practices for Importing Data
!! Hands-On Exercise: Importing Data With Sqoop
!! Conclusion

Essential Points

! You can install Flume agents on systems such as Web servers and mail
servers to extract, optionally transform, and pass data down to HDFS
– Flume scales extremely well and is in production use at many large
organizations
! Flume uses the terms source, sink, and channel to describe its actors
– A source is where an agent receives data from
– A sink is where an agent sends data to
– A channel is a queue between a source and a sink
! Using Sqoop, you can import data from a relational database into HDFS
! A REST interface is available for accessing HDFS
– To use the REST interface, you must have enabled WebHDFS or
deployed HttpFS
– The REST interface is identical whether you use WebHDFS or HttpFS
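As a concrete illustration of the REST interface, the hypothetical helper below builds a WebHDFS-style URL (host, port, and user here are placeholders; 50070 was the usual NameNode Web port in this generation of CDH, and all WebHDFS/HttpFS operations live under the /webhdfs/v1 prefix):

```python
def webhdfs_url(host, path, op, port=50070, user="hdfs"):
    """Build a WebHDFS/HttpFS REST URL for the given file operation."""
    return "http://{0}:{1}/webhdfs/v1{2}?op={3}&user.name={4}".format(
        host, port, path, op, user)

# A directory listing could then be fetched with any HTTP client, e.g.:
# curl -i "http://nn01:50070/webhdfs/v1/user/hdfs?op=LISTSTATUS&user.name=hdfs"
```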

Conclusion

In this chapter, you have learned:
! How you can import data into HDFS with Flume
! How you can import data into HDFS with Sqoop
! What REST interfaces Hadoop provides
! Best practices for importing data

MapReduce
Chapter 5

Course Chapters

!! Introduction (Course Introduction)

!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce
(Introduction to Apache Hadoop)

!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security
(Planning, Installing, and Configuring a Hadoop Cluster)

!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting
(Cluster Operations and Maintenance)

!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation
(Course Conclusion and Appendices)

MapReduce

In this chapter, you will learn:
! What MapReduce is
! What features MapReduce provides
! What the basic concepts of MapReduce are
! What the architecture of MapReduce is
! What features MapReduce version 2 provides
! How MapReduce handles failure
! How to use the JobTracker Web UI
Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

What Is MapReduce?

! MapReduce is a programming model
– Neither platform- nor language-specific
– Record-oriented data processing (key and value)
– Facilitates task distribution across multiple nodes
! Where possible, each node processes data stored on that node
! Consists of two developer-created phases
– Map
– Reduce
! In between Map and Reduce is the shuffle and sort
– Sends data from the Mappers to the Reducers

MapReduce: The Big Picture

(diagram)

What Is MapReduce? (cont'd)

! The process can be considered as being similar to a Unix pipeline

cat /my/log | grep '\.html' | sort | uniq -c > /my/outfile

(In this analogy, grep corresponds to Map, sort to the shuffle and sort, and uniq -c to Reduce)

What Is MapReduce? (cont'd)

! Key concepts to keep in mind with MapReduce:
– The Mapper works on an individual record at a time
– The Reducer aggregates results from the Mappers
– The intermediate keys produced by the Mapper are the keys on which
the aggregation will be based

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

Features of MapReduce

! Automatic parallelization and distribution
! Fault-tolerance
! Status and monitoring tools
! A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in other languages using Hadoop Streaming
! MapReduce abstracts all the 'housekeeping' away from the developer
– Developer can concentrate simply on writing the Map and Reduce
functions

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

MapReduce: Basic Concepts

! Each Mapper processes a single input split from HDFS
– Often a single HDFS block
! Hadoop passes the developer's Map code one record at a time
! Each record has a key and a value
! Intermediate data is written by the Mapper to local disk
! During the shuffle and sort phase, all the values associated with the same
intermediate key are transferred to the same Reducer
– The developer specifies the number of Reducers
! Reducer is passed each key and a list of all its values
– Keys are passed in sorted order
! Output from the Reducers is written to HDFS

MapReduce: A Simple Example

! WordCount is the 'Hello, World!' of Hadoop

Map
// assume input is a set of text files
// k is a byte offset
// v is the line for that offset
let map(k, v) =
  foreach word in v:
    emit(word, 1)

MapReduce: A Simple Example (cont'd)

! Sample input to the Mapper:

1202 the cat sat on the mat
1225 the aardvark sat on the sofa

! Intermediate data produced:

(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1),
(mat, 1), (the, 1), (aardvark, 1), (sat, 1),
(on, 1), (the, 1), (sofa, 1)

MapReduce: A Simple Example (cont'd)

! Input to the Reducer:

(aardvark, [1])
(cat, [1])
(mat, [1])
(on, [1, 1])
(sat, [1, 1])
(sofa, [1])
(the, [1, 1, 1, 1])

MapReduce: A Simple Example (cont'd)

Reduce
// k is a word, vals is a list of 1s
let reduce(k, vals) =
  sum = 0
  foreach (v in vals):
    sum = sum + v
  emit(k, sum)

MapReduce: A Simple Example (cont'd)

! Output from the Reducer, written to HDFS:

(aardvark, 1)
(cat, 1)
(mat, 1)
(on, 2)
(sat, 2)
(sofa, 1)
(the, 4)
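The complete WordCount flow — Map, shuffle and sort, Reduce — can be simulated in a few lines of Python (a sketch of the programming model only; it says nothing about Hadoop's distributed execution):

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit (word, 1) for every word in every input record."""
    for _offset, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    """Group all values for the same intermediate key; keys come out sorted."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reducer: sum the list of counts for each word."""
    for key, values in grouped:
        yield (key, sum(values))

records = [(1202, "the cat sat on the mat"),
           (1225, "the aardvark sat on the sofa")]
result = dict(reduce_phase(shuffle_and_sort(map_phase(records))))
```

Running it on the two sample records reproduces the Reducer output shown on this slide.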

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

MapReduce Terminology

! Job
– Consists of a Mapper, a Reducer, and a list of inputs
! Task
– An individual unit of work
– A job is broken down into many tasks
– Each task is either a Map task or a Reduce task
! Client
– What the user runs to submit a job
– Also refers to the machine on which this program runs

Architectural Overview

! There are two daemons in "classical" MapReduce
– JobTracker (master) – exactly one per cluster
– TaskTracker (slave) – one or more per cluster
! Slave nodes run both a TaskTracker and a DataNode daemon

(Diagram: one JobTracker coordinating four TaskTrackers)

MapReduce Job Submission

! A client submits a job to the JobTracker
– JobTracker assigns a job ID
– Client calculates the input splits for the job
– Client adds job code and configuration to HDFS

(Diagram: a client submitting a job to the JobTracker, which coordinates the TaskTrackers)

MapReduce Job Submission (cont'd)

! The JobTracker creates a Map task for each input split
– TaskTrackers send periodic "heartbeats" to the JobTracker
– These heartbeats also signal readiness to run tasks
– JobTracker then assigns tasks to these TaskTrackers

(Diagram: a TaskTracker sends a heartbeat (1) to the JobTracker, which replies with a task assignment (2))

MapReduce Job Submission (cont'd)

! The TaskTracker then forks a new JVM to run the task
– This isolates the TaskTracker from bugs or faulty code
– A single instance of task execution is called a task attempt
– Status info is periodically sent back to the JobTracker

(Diagram: a TaskTracker forking a child JVM for a task attempt)

JobTracker High Availability

! Two JobTrackers: one active and one standby
– Standby JT takes over when the active JT fails
! After a failover, the new JobTracker restarts all jobs that were running
when the failover occurred
! Available for "classical" MapReduce only

(Diagram: an active JobTracker and a standby JobTracker serving the same set of TaskTrackers)

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

MapReduce Version 2

! The "classical" MapReduce architecture has one JobTracker
– Must coordinate with every TaskTracker
– Poses a scalability limit in large clusters
! Hadoop is being re-architected to overcome this
– Work being done in the 2.x branch of Apache Hadoop
– This is called MapReduce version 2 (MRv2)
– Also known as "YARN"
! You can run either MRv1 (original) or MRv2 with CDH4
– Running MRv1 and MRv2 on the same cluster is not supported
– It will degrade performance and may result in an unstable cluster
– MRv2 is strongly discouraged for production at this time

MRv2 System Architecture

! In MRv2, there is a single Resource Manager per cluster
– Contains Scheduler and Applications Manager subcomponents
– An "application" is a MapReduce job
! Each slave node runs a Node Manager
– Monitors and manages resource usage for that node

(Diagram: a Resource Manager, containing the Scheduler and Application Manager, coordinating Node Managers on the slave nodes)

MRv2 System Architecture (cont'd)

! Slave nodes run individual tasks similar to MRv1
! For each job, one slave node is the Application Master
– Manages the application lifecycle
– Negotiates resource "containers" from the Resource Manager
– Monitors tasks running on the other slave nodes

(Diagram: one Application Master per job running on a Node Manager, negotiating with the Resource Manager)

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

MapReduce Failure Recovery

! Task processes send heartbeats to the TaskTracker
! TaskTrackers send heartbeats to the JobTracker
! Any task that fails to report in 10 minutes is assumed to have failed
– Its JVM is killed by the TaskTracker
! Any task that throws an exception is said to have failed
! Failed tasks are reported to the JobTracker by the TaskTracker
! The JobTracker reschedules any failed tasks
– It tries to avoid rescheduling the task on the same TaskTracker where it
previously failed
! If a task fails four times, the whole job fails
MapReduce Failure Recovery (cont'd)

! Any TaskTracker that fails to report in 10 minutes is assumed to have
crashed
– All tasks on the node are restarted elsewhere
– Any TaskTracker reporting a high number of failed tasks is blacklisted,
to prevent the node from blocking the entire job
– There is also a 'global blacklist' for TaskTrackers which fail on
multiple jobs
! The JobTracker manages the state of each job
– Partial results of failed tasks are ignored
Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

The JobTracker Web UI

! The JobTracker exposes its Web UI on port 50030

Drilling Down to Individual Jobs

! Clicking on an individual job name will reveal more information about that
job

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

Hands-On Exercise: Running a MapReduce Job

! In this Hands-On Exercise, you will run a MapReduce job on your pseudo-
distributed Hadoop cluster
! Please refer to the Hands-On Exercise Manual

Chapter Topics

MapReduce | Introduction to Apache Hadoop

!! What Is MapReduce?
!! Features of MapReduce
!! Basic Concepts
!! Architectural Overview
!! MapReduce Version 2
!! Failure Recovery
!! Using the JobTracker Web UI
!! Hands-On Exercise: Running a MapReduce Job
!! Conclusion

Essential Points

! A MapReduce job has three phases
– The Map phase, in which Mappers take HDFS data as input and produce
intermediate data
– Shuffle and sort, which sends data to Reducers
– The Reduce phase, in which Reducers aggregate the results from the
Mappers
! Developers write code to process and aggregate data in the Map and
Reduce phases
! Shuffle and sort is built in to the Hadoop framework
! The Hadoop JobTracker and TaskTracker daemons start and manage Map
tasks, shuffle and sort, and Reduce tasks

Conclusion

In this chapter, you have learned:
! What MapReduce is
! What features MapReduce provides
! What the basic concepts of MapReduce are
! What the architecture of MapReduce is
! What features MapReduce version 2 provides
! How MapReduce handles failure
! How to use the JobTracker Web UI

Planning Your Hadoop Cluster
Chapter 6

Course Chapters

!! Introduction (Course Introduction)

!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce
(Introduction to Apache Hadoop)

!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security
(Planning, Installing, and Configuring a Hadoop Cluster)

!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting
(Cluster Operations and Maintenance)

!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation
(Course Conclusion and Appendices)

Planning Your Hadoop Cluster

In this chapter, you will learn:
! What issues to consider when planning your Hadoop cluster
! What types of hardware are typically used for Hadoop nodes
! How to optimally configure your network topology
! How to select the right operating system and Hadoop distribution
! How to plan for cluster management
Chapter Topics

Planning Your Hadoop Cluster | Planning, Installing, and Configuring a Hadoop Cluster

!! General Planning Considerations
!! Choosing the Right Hardware
!! Network Considerations
!! Configuring Nodes
!! Planning for Cluster Management
!! Conclusion

Basic Cluster Configuration

(diagram)

Thinking About the Problem

! Hadoop can run on a single machine
– Great for testing, developing
– Obviously not practical for large amounts of data
! Many people start with a small cluster and grow it as required
– Perhaps initially just four or six nodes
– As the volume of data grows, more nodes can easily be added
! Ways of deciding when the cluster needs to grow
– Increasing amount of computation power needed
– Increasing amount of data which needs to be stored
– Increasing amount of memory needed to process tasks

Cluster Growth Based on Storage Capacity

! Basing your cluster growth on storage capacity is often a good method to
use
! Example:
– Data grows by approximately 3TB per week
– HDFS set up to replicate each block three times
– Therefore, 9TB of extra storage space required per week
– Plus some overhead – say, 30%
– Assuming machines with 12 x 3TB hard drives, this equates to a new
machine required every three weeks
– Alternatively: Two years of data – 2.7 petabytes – will require
approximately 83 machines
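The arithmetic in this example is easy to check (a quick sketch; the 30% overhead, 3x replication, and 12 x 3TB drive configuration are the assumptions stated in the example):

```python
raw_growth_tb_per_week = 3.0       # new data arriving per week
replication = 3                    # HDFS block replication factor
overhead = 1.3                     # ~30% extra for temporary/intermediate data

needed_per_week = raw_growth_tb_per_week * replication * overhead  # 11.7 TB/week
node_capacity_tb = 12 * 3.0                                        # 36 TB per machine
weeks_per_new_machine = node_capacity_tb / needed_per_week         # ~3.1 weeks
```

So one new slave node is needed roughly every three weeks, as stated above.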

Chapter Topics

Planning Your Hadoop Cluster | Planning, Installing, and Configuring a Hadoop Cluster

!! General Planning Considerations
!! Choosing the Right Hardware
!! Network Considerations
!! Configuring Nodes
!! Planning for Cluster Management
!! Conclusion

Classifying Nodes

! Nodes can be classified as either 'slave nodes' or 'master nodes'
! Slave nodes run DataNode plus TaskTracker daemons
! Master nodes run either a NameNode daemon, a Secondary NameNode
daemon, or a JobTracker daemon
– On smaller clusters, NameNode and JobTracker are often run on the
same machine
– Sometimes even the Secondary NameNode is on the same machine as the
NameNode and JobTracker
– Important that at least one copy of the NameNode's metadata is
stored on a separate machine (see later)

Slave Nodes: Recommended Configurations

! Typical configurations for slave nodes
– Midline – deep storage, 1Gb Ethernet
– 12 x 3TB SATA II hard drives, in a non-RAID, JBOD* configuration
– 2 x 6-core 2.9GHz CPUs, 15MB cache
– 64GB RAM
– 2 x 1 Gigabit Ethernet
– High-end – high memory, spindle dense, 10Gb Ethernet
– 24 x 1TB Nearline/MDL SAS hard drives, in a non-RAID, JBOD*
configuration
– 2 x 6-core 2.9GHz CPUs, 15MB cache
– 96GB RAM
– 1 x 10 Gigabit Ethernet

* JBOD: Just a Bunch Of Disks

Slave Nodes: More Details (CPU)

! Hex-core CPUs are now commonly available
! Hyper-threading and quick-path interconnect (QPI) should be enabled
! Hadoop nodes are seldom CPU-bound
– They are typically disk- and network-I/O bound
– Therefore, top-of-the-range CPUs are usually not necessary

Slave Nodes: More Details (RAM)

! Slave node configuration specifies the maximum number of Map and
Reduce tasks that can run simultaneously on that node
! Each Map or Reduce task will take 2GB to 4GB of RAM
! Slave nodes should not be using virtual memory
! Ensure you have enough RAM to run all tasks, plus overhead for the
DataNode and TaskTracker daemons, plus the operating system
! Rule of thumb:
  Total number of tasks = 1.5 x number of processor cores
– This is a starting point, and should not be taken as a definitive setting
for all clusters
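The rule of thumb can be turned into a quick capacity check (a starting-point sketch; the 1.5 factor and 2-4GB per task come from the guidance above, while the 8GB daemon/OS headroom is an illustrative assumption):

```python
def max_task_slots(cores):
    """Rule of thumb: total Map + Reduce slots = 1.5 x processor cores."""
    return int(cores * 1.5)

def ram_needed_gb(cores, gb_per_task=4, daemon_and_os_gb=8):
    """RAM to run every slot at once, plus assumed headroom for the
    DataNode/TaskTracker daemons and the operating system."""
    return max_task_slots(cores) * gb_per_task + daemon_and_os_gb
```

For a 2 x 6-core slave node this suggests 18 task slots and, at the worst case of 4GB per task, around 80GB of RAM — consistent with the 64-96GB configurations recommended earlier.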

Slave Nodes: More Details (Disk)

! Hadoop's architecture impacts disk space requirements
– By default, HDFS data is replicated three times
– Temporary data storage typically requires 20-30% of a cluster's raw disk
capacity
! In general, more spindles (disks) is better
! In practice, we see anywhere from four to 24 disks per node
! Use 3.5" disks
– Faster, cheaper, higher capacity than 2.5" disks
! 7,200 RPM SATA/SATA II drives are fine
– No need to buy 15,000 RPM drives
! 8 x 1.5TB drives is likely to be better than 6 x 2TB drives
– Different tasks are more likely to be accessing different disks

Slave Nodes: More Details (Disk) (cont'd)

! A good practical maximum is 36TB per slave node
– More than that will result in massive network traffic if a node dies and
block re-replication must take place

Slave Nodes: Why Not RAID?

! Slave nodes do not benefit from using RAID* storage
– HDFS provides built-in redundancy by replicating blocks across multiple
nodes
– RAID striping (RAID 0) is actually slower than the JBOD configuration
used by HDFS
– RAID 0 read and write operations are limited by the speed of the
slowest disk in the RAID array
– Disk operations on JBOD are independent, so the average speed is
greater than that of the slowest disk
– One test by Yahoo showed JBOD performing between 30% and 50%
faster than RAID 0, depending on the operations being performed

* RAID: Redundant Array of Inexpensive Disks

What About Virtualization?

! Virtualization will incur performance and reliability penalties, including:
– Network contention
– Typically, remote rather than local disks
– And often just one volume even if the disk is local
– Typically, no way to guarantee rack placement of nodes
– Possible to have all three replicas of a block on virtual machines on the
same physical host
! Wherever possible, we recommend dedicated physical hardware for your
Hadoop cluster
– Perfectly reasonable to use virtualization for proof-of-concept clusters,
or where data center restrictions mean the use of dedicated hardware
is impossible

What About Blade Servers?

! Blade servers are not recommended
– Failure of a blade chassis results in many nodes being unavailable
– Individual blades usually have very limited hard disk capacity
– Network interconnection between the chassis and top-of-rack switch
can become a bottleneck

Node Failure

! Slave nodes are expected to fail at some point
– This is an assumption built into Hadoop
– The NameNode will automatically re-replicate blocks that were on the failed
node to other nodes in the cluster, retaining the 3x replication
requirement
– The JobTracker will automatically re-assign tasks that were running on failed
nodes
! Master nodes are single points of failure if not configured for HA
– If the NameNode goes down, the cluster is inaccessible
– If the JobTracker goes down, no jobs can run on the cluster
– All currently running jobs will fail
! Configure the NameNode for HA when running production workloads
! Consider configuring the JobTracker for HA for MRv1 workloads

Master Node Hardware Recommendations

! Carrier-class hardware
– Not commodity hardware
! Dual power supplies
! Dual Ethernet cards
– Bonded to provide failover
! RAIDed hard drives
! Reasonable amount of RAM
– 24GB for clusters of 20 nodes or less
– 48GB for clusters of up to 300 nodes
– 96GB for larger clusters

Chapter Topics

Planning Your Hadoop Cluster | Planning, Installing, and Configuring a Hadoop Cluster

!! General Planning Considerations
!! Choosing the Right Hardware
!! Network Considerations
!! Configuring Nodes
!! Planning for Cluster Management
!! Conclusion

General Network Considerations

! Hadoop is very bandwidth-intensive!
– Often, all nodes are communicating with each other at the same time
! Use dedicated switches for your Hadoop cluster
! Nodes are connected to a top-of-rack switch
! Nodes should be connected at a minimum speed of 1Gb/sec
! Consider 10Gb/sec connections in the following cases:
– Clusters storing very large amounts of data
– Clusters in which typical MapReduce jobs produce large amounts of
intermediate data

General Network Considerations (cont'd)

! Racks are interconnected via core switches
! Core switches should connect to top-of-rack switches at 10Gb/sec or
faster
! Beware of oversubscription in top-of-rack and core switches
! Consider bonded Ethernet to mitigate against failure
! Consider redundant top-of-rack and core switches

Hostname Resolution

! When configuring Hadoop, you will be required to identify various nodes
in Hadoop's configuration files
! Use hostnames, not IP addresses, to identify nodes when configuring
Hadoop
! Use either DNS or /etc/hosts for hostname resolution
! Forward and reverse lookups must work correctly whether you are using
DNS or /etc/hosts for name resolution
– Hadoop performs both forward and reverse lookups
– If the results do not match, major problems can occur
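The forward/reverse requirement can be verified before installation with a short diagnostic (a sketch using Python's socket module; run it on each node for every hostname that appears in your configuration files):

```python
import socket

def resolution_is_consistent(hostname):
    """Forward-resolve the name, reverse-resolve the resulting address,
    and confirm the round trip leads back to the same host."""
    try:
        ip = socket.gethostbyname(hostname)                      # forward lookup
        rev_name, aliases, addresses = socket.gethostbyaddr(ip)  # reverse lookup
    except OSError:
        return False
    return hostname in [rev_name] + aliases and ip in addresses
```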

Chapter Topics

Planning Your Hadoop Cluster | Planning, Installing, and Configuring a Hadoop Cluster

!! General Planning Considerations
!! Choosing the Right Hardware
!! Network Considerations
!! Configuring Nodes
!! Planning for Cluster Management
!! Conclusion

Operating System Recommendations

! Choose an OS you're comfortable administering
! CentOS: geared towards servers rather than individual workstations
– Conservative about package versions
– Very widely used in production
! Red Hat Enterprise Linux (RHEL): Red Hat-supported analog to CentOS
– Includes support contracts, for a price
! In production, we often see a mixture of RHEL and CentOS machines
– Often RHEL on master nodes, CentOS on slaves

Operating System Recommendations (cont'd)

! Ubuntu: very popular distribution, based on Debian
– Both desktop and server versions available
– Try to use an LTS (Long Term Support) version
! SuSE: popular distribution, especially in Europe
– Cloudera provides CDH packages for SuSE

Configuring the System

! Do not use Linux's LVM (Logical Volume Manager) to make all your disks
appear as a single volume
– As with RAID 0, this limits speed to that of the slowest disk
! Check the machines' BIOS* settings
– BIOS settings may not be configured for optimal performance
– For example, if you have SATA drives make sure IDE emulation is not
enabled
! Test disk I/O speed with hdparm -t
– Example:
  hdparm -t /dev/sda1
– You should see speeds of 70MB/sec or more
– Anything less is an indication of possible problems

* BIOS: Basic Input/Output System

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#27%
Configuring"The"System"(cont’d)"

! Hadoop%has%no%specific%disk%parCConing%requirements%
– Use"whatever"parEEoning"system"makes"sense"to"you"
! Mount%disks%with%the%noatime%opCon%
! Common%directory%structure%for%data%mount%points:%
"/data/<n>/dfs/nn
/data/<n>/dfs/dn
/data/<n>/dfs/snn
/data/<n>/mapred/local
! Reduce%the%swappiness*of%the%system%
– Set"vm.swappiness"to"0"or"5"in"/etc/sysctl.conf

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#28%
Filesystem"ConsideraEons"

! Cloudera%recommends%the%ext3%and%ext4%filesystems%
– ext4"is"more"commonly"used"on"new"clusters"
! XFS%provides%some%performance%benefit%during%kickstart%
– It"formats"in"0"seconds,"vs."several"minutes"for"each"disk"with"ext3/ext4"
! Currently,%more%tesCng%is%done%at%Cloudera%on%ext3/ext4%than%XFS%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#29%
OperaEng"System"Parameters"

! Increase%the%nofile%ulimit%for%the%mapred%and%hdfs%users%to%at%least%
32K%
– SePng"is"in"/etc/security/limits.conf
! Disable%IPv6%
! Disable%SELinux%if%possible%
– Incurs"a"7/10%"performance"penalty"on"a"Hadoop"cluster"
– ConfiguraEon"is"non/trivial"
! Install%and%configure%the%ntp%daemon%
– Ensures"the"Eme"on"all"nodes"is"synchronized"
– Important"for"HBase,"ZooKeeper,"Kerberos"
– Useful"when"using"logs"to"debug"problems"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#30%
Java"Virtual"Machine"(JVM)"RecommendaEons"

! Always%use%the%official%Oracle%JDK%(http://java.com/)%
– Hadoop"is"complex"soaware,"and"oaen"exposes"bugs"in"other"JDK"
implementaEons"
! Version%1.6%is%recommended%
– CDH"4"is"cerEfied"with"1.6.0_31"
– Any"later"maintenance"release"should"be"acceptable"for"producEon"
– Minimum"supported"version"is"1.6.0_8"
! Version%1.7%is%supported%starCng%from%CDH%4.2%
– There"are"restricEons"
– Refer"to"the"CDH"Release"Notes"for"details"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#31%
Which"Version"of"Hadoop?"

! CDH4.x%includes:%%
– MapReduce"v1"from"Hadoop"0.20.2"
– Stable,"recommended"for"producEon"systems"
– MapReduce"v2"and"YARN"from"Hadoop"2.x"
– SEll"experimental/alpha"
– HDFS"from"Hadoop"2.x"
– Stable,"recommended"for"producEon"systems"
– Provides"HDFS"High"Availability"
! Includes%well%over%1,000%patches%and%bugfixes%
– Including"many"developed"by"Cloudera"for"our"Support"customers"
! Ensures%interoperability%between%different%Ecosystem%projects%
! Provided%in%RPM,%Ubuntu%and%SuSE%package,%and%tarball%formats%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#32%
Chapter"Topics"

Planning,%Installing,%and%Configuring%
Planning%Your%Hadoop%Cluster%
a%Hadoop%Cluster%

!! General"Planning"ConsideraEons"
!! Choosing"the"Right"Hardware"
!! Network"ConsideraEons"
!! Configuring"Nodes"
!! Planning%for%Cluster%Management%
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#33%
Managing"Large"Clusters"

! Each%node%in%the%cluster%requires%its%own%configuraCon%files%
! Managing%a%small%cluster%is%relaCvely%easy%
– Log"in"to"each"machine"to"make"changes"
– Manually"change"configuraEon"files"
! As%the%cluster%grows%larger,%management%becomes%more%complex%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#34%
Configuration Management Tools

! Configuration management tools allow you to manage the configuration of multiple machines at once
– Can update files, restart daemons, or even reboot machines automatically where necessary
! Recommendations:
– Use configuration management tools
  – Start using such tools while the cluster is small!
– Use Cloudera Manager
  – The free edition supports clusters with an unlimited number of nodes
! Popular open source tools for managing the configuration include Puppet and Chef
– Many others exist
– Many commercial tools also exist

Chapter"Topics"

Planning,%Installing,%and%Configuring%
Planning%Your%Hadoop%Cluster%
a%Hadoop%Cluster%

!! General"Planning"ConsideraEons"
!! Choosing"the"Right"Hardware"
!! Network"ConsideraEons"
!! Configuring"Nodes"
!! Planning"for"Cluster"Management"
!! Conclusion%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#36%
EssenEal"Points"

! Master%nodes%run%the%NameNode,%Secondary%NameNode,%and%JobTracker%
– Provision"with"carrier/class"hardware"and"lots"of"RAM"
! Slave%nodes%run%DataNodes%and%TaskTrackers%
– Provision"with"industry/standard"hardware"
– Consider"your"data"storage"growth"rate"when"planning"current"and"
future"cluster"size"
! RAID%(on%slave%nodes)%and%virtualizaCon%can%incur%a%performance%penalty%
! Make%sure%that%forward%and%reverse%domain%lookups%work%when%
configuring%a%cluster%
! When%planning%your%cluster,%don’t%forget%about%configuraCon%
management%and%monitoring%tools%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#37%
Conclusion

In this chapter, you have learned:
! What issues to consider when planning your Hadoop cluster
! What types of hardware are typically used for Hadoop nodes
! How to optimally configure your network topology
! How to select the right operating system and Hadoop distribution
! How to plan for cluster management

Hadoop"InstallaBon"and""
IniBal"ConfiguraBon"
Chapter"7"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#1%
Course"Chapters"
!! IntroducBon" Course"IntroducBon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducBon"to"Apache"Hadoop"
!! GePng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop%Installa1on%and%Ini1al%Configura1on%
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,%Installing,%and%
!! Hadoop"Clients"
Configuring%a%Hadoop%Cluster%
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraBon"
!! Hadoop"Security"
!! Managing"and"Scheduling"Jobs"
Cluster"OperaBons"and"
!! Cluster"Maintenance"
Maintenance"
!! Cluster"Maintenance"and"TroubleshooBng"
!! Conclusion"
!! Kerberos"ConfiguraBon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaBon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#2%
Hadoop"InstallaBon"and"IniBal"ConfiguraBon"

In%this%chapter,%you%will%learn:%
! The%different%installa1on%configura1ons%available%in%Hadoop%
! How%to%install%Hadoop%
! How%to%specify%the%Hadoop%configura1on%
! How%to%configure%HDFS%
! How%to%configure%MapReduce%
! How%to%locate%and%configure%Hadoop%log%files%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#3%
Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment%Types%
!! Installing"Hadoop"
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#4%
Hadoop’s"Different"Deployment"Modes"

! Hadoop%can%be%configured%to%run%in%three%different%modes%
– LocalJobRunner"
– Pseudo/distributed"
– Fully"distributed"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#5%
LocalJobRunner Mode

! In LocalJobRunner mode, no daemons run
! Everything runs in a single Java Virtual Machine (JVM)
! Hadoop uses the machine's standard filesystem for data storage
– Not HDFS
! Suitable for testing MapReduce programs while unit testing
– Developers can trace code in MapReduce jobs within an IDE

Pseudo-Distributed Mode

! In pseudo-distributed mode, all daemons run on the local machine
– Each runs in its own JVM (Java Virtual Machine)
! Hadoop uses HDFS to store data (by default)
! Useful to simulate a cluster on a single machine
! Convenient for debugging programs before launching them on the 'real' cluster

Fully-Distributed Mode

! In fully-distributed mode, Hadoop daemons run on a cluster of machines
! HDFS is used to distribute data amongst the nodes
! Unless you are running a small cluster (less than 10 or 20 nodes), the NameNode and JobTracker daemons should each run on dedicated nodes
– For small clusters, it's acceptable for both to run on the same physical node

Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing%Hadoop%
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#9%
Deploying"on"MulBple"Machines"

! If%you%are%installing%mul1ple%machines,%use%some%kind%of%automated%
deployment%
– Red"Hat’s"Kickstart"
– Debian"Fully"AutomaBc"InstallaBon"
– Dell"Crowbar"
– …"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#10%
RPM/Package vs Tarballs

! CDH is available in multiple formats
– RPMs for Red Hat-style Linux distributions (RHEL, CentOS)
– Packages for Ubuntu and SuSE Linux
– Parcels for installation via Cloudera Manager (see later)
– As a tarball
! RPMs/packages include some features not in the tarball
– Automatic creation of mapred and hdfs users
– init scripts to automatically start the Hadoop daemons
  – Although these are not activated by default
– Configures the 'alternatives' system to allow multiple configurations on the same machine
! Strong recommendation: use the RPMs/packages whenever possible

RPM"InstallaBon"Steps"

1.  Install%the%Oracle%JDK%
2.  Add%the%Cloudera%repository%
3.  Install%the%RPMs%for%the%func1onality%required%on%each%host%in%the%
cluster%
4.  Edit%the%Hadoop%configura1on%files%
– Example:""
sudo yum install hadoop-hdfs-namenode install
5.  Create%required%directories%
6.  Format%HDFS%
sudo –u hdfs hdfs namenode -format

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#12%
StarBng"the"Hadoop"Daemons"

! CDH%installed%from%package%or%RPM%includes%init%scripts%to%start%the%
daemons%as%services%
–  sudo service hadoop-hdfs-namenode start
–  sudo service hadoop-hdfs-secondarynamenode start
–  sudo service hadoop-hdfs-datanode start
–  sudo service hadoop-0.20-mapreduce-jobtracker start
–  sudo service hadoop-0.20-mapreduce-tasktracker start"

! If%you%have%installed%Hadoop%manually,%or%from%the%CDH%tarball,%you%will%
have%to%start%the%daemons%manually%
! Not%all%daemons%run%on%each%machine%
– NameNode,"JobTracker"/"One"per"cluster,"unless"running"in"an"HA"
configuraBon"
– Secondary"NameNode"/"One"per"cluster"in"a"non/HA"environment"
– DataNode,"TaskTracker"/"On"each"data"node"in"the"cluster"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#13%
An"Aside:"SSH"

! Note%that%most%tutorials%tell%you%to%create%a%passwordless%SSH%login%on%
each%machine%
! This%is%not%necessary%for%the%opera1on%of%Hadoop%
– Hadoop"does"not"use"SSH"for"any"of"its"internal"communicaBons"
! ssh%is%only%required%if%you%intend%to%use%the%start-all.sh%and%stop-
all.sh%scripts%included%with%Hadoop%
– Cloudera"recommends"not"using"these"scripts"
– If"you"do"not"use"these"scripts,"you"do"not"need"to"configure"the"
masters"and"slaves"files"
! Passwordless%SSH%is%configured%in%your%lab%environment%as%a%convenience%%
– It"is"not"required"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#14%
Verify"the"InstallaBon"

! To%verify%that%everything%has%started%correctly,%check%by%running%an%
example%job:%
– Copy"files"into"Hadoop"for"input"
hadoop fs –mkdir config-files
hadoop fs -put /etc/hadoop/conf/*.xml config-files
– Run"an"example"MapReduce"job"
hadoop jar /usr/lib/hadoop-0.20-mapreduce/\
hadoop-examples.jar grep config-files output 'dfs[a-z.]+'
– View"the"output"
–  hadoop fs -cat output/part-00000 | head

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#15%
Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing"Hadoop"
!! Specifying%the%Hadoop%Configura1on%
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#16%
Hadoop’s"ConfiguraBon"Files"

! Each%machine%in%the%Hadoop%cluster%has%its%own%set%of%configura1on%files%
! Configura1on%files%all%reside%in%Hadoop’s%conf%directory%
– Typically"/etc/hadoop/conf
! Most%of%the%configura1on%files%are%wrieen%in%XML%
! Upon%startup,%the%Hadoop%daemons%access%the%configura1on%files%
– Afer"modifying"configuraBon"parameters,"you"must"restart"Hadoop"
daemons"for"your"changes"to"take"effect"
%
%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#17%
Hadoop’s"ConfiguraBon"Files"–"Overview"

File Type%of%Configura1on
core-site.xml Core%
hdfs-site.xml HDFS%
mapred-site.xml MapReduce%
hadoop-policy.xml Access%control%policies%
log4j.properties Logging%
hadoop-metrics.properties, Metrics%
hadoop-metrics2.properties
include,%exclude%(file%names%are% Host%inclusion/exclusion%in%a%cluster%
configurable)%
allocations.xml%(file%name%is% FairScheduler%%
configurable)%
masters,%slaves Scripted%startup%(not%recommended)%
hadoop-env.sh Environment%variables%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#18%
Environment Setup: hadoop-env.sh

! hadoop-env.sh sets environment variables necessary for Hadoop to run
– HADOOP_CLASSPATH
– HADOOP_HEAPSIZE
– HADOOP_LOG_DIR
– HADOOP_PID_DIR
– JAVA_HOME
! Values are sourced into all Hadoop control scripts and therefore the Hadoop daemons
! If you need to set environment variables, do it here to ensure that they are passed through to the control scripts

Environment Setup: hadoop-env.sh (cont'd)

! HADOOP_HEAPSIZE
– Controls the heap size for all the Hadoop daemons
– Default 1GB
– Best practice: set the heap for individual daemons instead
! HADOOP_NAMENODE_OPTS
– Java options for the NameNode
– At least 4GB: -Xmx4g
! HADOOP_JOBTRACKER_OPTS
– Java options for the JobTracker
– At least 4GB: -Xmx4g
! HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS
– Set to 1GB each: -Xmx1g

Sample"XML"ConfiguraBon"File"

! Sample%configura1on%file%(mapred-site.xml)%

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"
href="configuration.xsl">
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#21%
Configuration Value Precedence

! Configuration parameters can be specified more than once
! The highest-precedence value takes priority
! Precedence order (lowest to highest):
– *-site.xml on the slave node
– *-site.xml on the client machine
– Values set explicitly in the Job object for a MapReduce job
! If a value in a configuration file is marked as final, it overrides all others

<property>
  <name>some.property.name</name>
  <value>somevalue</value>
  <final>true</final>
</property>

Recommended Parameter Values

! There are many different parameters which can be set
! Defaults are documented at http://archive.cloudera.com/cdh4/cdh/4/hadoop
– Click the links under Configuration
! Hadoop is still a young system
– 'Best practices' and optimal values change as more and more organizations deploy Hadoop in production
! Here we present some of the key parameters, and suggest recommended values
– Based on our experiences working with clusters ranging from a few nodes up to 1,500+

Aside:"Deprecated"ProperBes"

! Many%proper1es%have%recently%been%renamed%in%Apache%Hadoop%
– This"change"affects"CDH4,"but"not"earlier"versions"
– Some"examples"below"(see"documentaBon"for"complete"list)"
"
Deprecated%Property%Name
" New%Property%Name
"
dfs.data.dir dfs.datanode.data.dir
"
dfs.http.address dfs.namenode.http-address
"
fs.default.name fs.defaultFS
"
! In%CDH4,%you%can%use%either%the%old%or%new%name%
– The"old"property"names"are"now"deprecated,"but"sBll"work"
– We"will"use"the"old"names,"except"for"CDH4/specific"features"
%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#24%
Reading"ConfiguraBon"Changes"

! Cluster%daemons%generally%need%to%be%restarted%to%read%in%changes%to%their%
configura1on%files%
! DataNodes%do%not%need%to%be%restarted%if%only%NameNode%parameters%
were%changed%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#25%
Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing"Hadoop"
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing%Ini1al%HDFS%Configura1on%
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#26%
hdfs-site.xml

! The single most important configuration value on your entire cluster, used by the NameNode:

dfs.name.dir    Where on the local filesystem the NameNode stores its
                metadata. A comma-separated list. Default is
                ${hadoop.tmp.dir}/dfs/name.

! Loss of a NameNode's metadata will result in the effective loss of all the data in its namespace
– Although the blocks will remain, there is no way of reconstructing the original files without the metadata
! This must be at least two disks (or a RAID volume) on the NameNode, plus an NFS mount elsewhere on the network
– Failure to set this correctly will result in eventual loss of your cluster's data

hdfs-site.xml (cont'd)

! A NameNode will write to the edit log in all directories in dfs.name.dir synchronously
! If a directory in the list disappears, the NameNode will continue to function
– It will ignore that directory until it is restarted
! Recommended mount options for the NFS mount point:
  tcp,soft,intr,timeo=10,retrans=10
– Soft mount, so the NameNode will not hang if the mount point disappears
– Will retry transactions 10 times before they are deemed to have failed (timeo is measured in tenths of a second)
! Note: no space between the comma and the next directory name in the list!
– Example: /disk1/dfs/nn,/disk2/dfs/nn

hdfs-site.xml (cont'd)

dfs.block.size    The block size for new files, in bytes. Default is 67108864
                  (64MB). Recommended: 134217728 (128MB). Note: this is a
                  client-side setting.

dfs.data.dir      Where on the local filesystem a DataNode stores its blocks.
                  Can be a comma-separated list of directories (no spaces
                  between the comma and the path); round-robin writes to the
                  directories (no redundancy). Used by DataNodes; can be
                  different on each DataNode.

hdfs-site.xml (cont'd)

dfs.http.address            The address and port used for the NameNode Web UI.
                            Example: <your_namenode>:50070. Used by the
                            NameNode.

dfs.replication             The number of times each block should be replicated
                            when a file is written. Default: 3. Recommended: 3.
                            Note: this is a client-side parameter.

dfs.datanode.du.reserved    The amount of space on each volume which cannot be
                            used for HDFS block storage. Default: 0.
                            Recommended: at least 10GB. Used by DataNodes.

core-site.xml

fs.default.name    The name of the default filesystem. Usually includes the
                   filesystem type, plus the NameNode's hostname and port
                   number. Example: hdfs://<your_namenode>:8020/
                   Used by every machine which needs access to the cluster,
                   including all nodes running Hadoop daemons.

core-site.xml (cont'd)

hadoop.tmp.dir    Base temporary directory, both on the local disk and in
                  HDFS. Default is /tmp/hadoop-${user.name}. Used by all
                  nodes.

                  This parameter is used to derive defaults for numerous
                  other configuration parameters. For example, the default
                  value for dfs.data.dir is
                  file://${hadoop.tmp.dir}/dfs/data

Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing"Hadoop"
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing%Ini1al%MapReduce%Configura1on%
!! Hadoop"Logging"
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#33%
mapred-site.xml

mapred.job.tracker        Hostname and port of the JobTracker. Example:
                          my_job_tracker:8021. Used by the JobTracker,
                          TaskTrackers, and clients.

mapred.child.java.opts    Java options passed to the TaskTracker child
                          processes. Default is -Xmx200m (200MB of heap
                          space). Recommendation: increase to 512MB or 1GB,
                          depending on the requirements from your developers.
                          Used by TaskTrackers.

mapred.child.ulimit       Maximum virtual memory in KB allocated to any child
                          process of the TaskTracker. If specified, set to
                          1.5x the value of mapred.child.java.opts. Used by
                          TaskTrackers.

mapred-site.xml (cont'd)

mapred.local.dir     The local directory where MapReduce stores its
                     intermediate data files. May be a comma-separated list
                     of directories on different devices. Recommendation:
                     list directories on all disks, and set
                     dfs.datanode.du.reserved (in hdfs-site.xml) such that
                     approximately 25% of the total disk capacity cannot be
                     used by HDFS. Example: for a node with 4 x 1TB disks,
                     set dfs.datanode.du.reserved to 250GB. Used by
                     TaskTrackers.

mapred.system.dir    The HDFS directory where MapReduce stores shared files
                     during a job run. Example: /mapred/system. Used by
                     TaskTrackers.

mapred-site.xml (cont'd)

mapreduce.jobtracker.staging.root.dir    The root directory in HDFS below which users' job files
                                         are stored. This should be set to the value of the
                                         directory under which user directories are stored.
                                         Recommendation: /user. Used by TaskTrackers.

mapred-site.xml (cont'd)

mapred.tasktracker.map.tasks.maximum       Number of Map tasks which can be run simultaneously
                                           by the TaskTracker. Used by TaskTrackers.

mapred.tasktracker.reduce.tasks.maximum    Number of Reduce tasks which can be run simultaneously
                                           by the TaskTracker. Used by TaskTrackers.

! Rule of thumb: the total number of Map + Reduce tasks on a node should be approximately 1.5x the number of processor cores on that node
– Assuming there is enough RAM on the node
– This should be monitored
  – If the node is not processor- or I/O-bound, increase the total number of tasks
– Typical distribution: 60% Map tasks, 40% Reduce tasks, or 70% Map tasks, 30% Reduce tasks

Chapter"Topics"

Hadoop%Installa1on%and%Ini1al% Planning,%Installing,%and%Configuring%
Configura1on% a%Hadoop%Cluster%

!! Deployment"Types"
!! Installing"Hadoop"
!! Specifying"the"Hadoop"ConfiguraBon"
!! Performing"IniBal"HDFS"ConfiguraBon"
!! Performing"IniBal"MapReduce"ConfiguraBon"
!! Hadoop%Logging%
!! Hands/On"Exercise:"Installing"A"Hadoop"Cluster"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#38%
Hadoop"Log"Files"

Type%of%Log Descrip1on
Daemon% InformaBonal,"warning,"and"error"messages"
generated"by"Hadoop"daemons."Each"Hadoop"
daemon"produces"two"log"files:"
•  .log"–"First"port"of"call"when"diagnosing"
problems""
•  .out"–"CombinaBon"of"stdout"and"stderr"
during"daemon"startup,"doesn’t"usually"contain"
much"output""
Task% stdout,"stderr,"and"syslog"output"generated"
by"MapReduce"applicaBons."
Job%Configura1on% Job"configuraBon"sePngs"specified"by"the"
developer."
Job%History% Job"summary"and"counters,"task"summary"and"
analysis,"stack"traces"for"any"thrown"excepBons,"
URLs"to"navigate"to"the"task"logs."

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#39%
Daemon"Logs"/"LocaBon"

Type%of% Loca1on
Daemon%Log
MRv1% Default"directory:"/var/log/hadoop-0.20-mapreduce
(JobTracker,% (Set"HADOOP_LOG_DIR"in"hadoop-env.sh"to"configure)"
TaskTracker)% Default"log"file"names:"hadoop-hadoop-<daemon>-
<hostname>.{log|out}
Example:"/var/log/hadoop-0.20-mapreduce/hadoop-
hadoop-tasktracker-elephant.log
HDFS%and%MRv2% Default"directory:"/var/log/hadoop-<component>
(Set"HADOOP_LOG_DIR"in"hadoop-env.sh"to"configure)"
Default"log"file"names:"hadoop-<component>-<daemon>-
<hostname>.{log|out}
Example:"/var/log/hadoop-hdfs/hadoop-hdfs-
datanode-tiger.log

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#40%
Daemon"Logs"/"RetenBon"

Type%of% Default%Reten1on
Daemon%Log
All%.out%Files% Rotated"when"daemon"restarts,"five"files"retained"
MRv1%.log% Rotated"daily"
Files% Cannot"limit"file"size"or"the"number"of"files"kept"
Provide"your"own"scripts"to"compress,"archive,"delete"logs"
HDFS%and% Maximum"size"of"generated"log"files:"256MB""
MRv2%.log% Number"of"files"retained:"20"
Files% Maximum"disk"space"for"logs:"5GB"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#41%
Daemon"Logs"–"RetenBon"ConfiguraBon"

! The%Hadoop%daemons%use%file*appenders*for%log%messages%
– File"appenders"deliver"log"events"to"files"
– Hadoop’s"file"appenders"are"configured"in"/etc/hadoop/conf/
log4j.properties
! Hadoop%uses%two%default%file%appenders%
– HDFS"and"MRv2"daemons"–"RollingFileAppender"(RFA)"
– Maximum"size"of"generated"log"files"and"number"of"files"retained"
are"configurable"
– MRv1"daemons"–""DailyRollingFileAppender"(DRFA)"
– RotaBon"frequency"is"configurable"
! To%change%the%default%RFA%configura1on%
– Change"hadoop.log.maxfilesize"and"
hadoop.log.maxbackupindex"in"log4j.properties"
– Restart"the"Hadoop"daemons"
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#42%
Daemon"Logs"–"Changing"the"Log"Level"

! Use%the%hadoop daemonlog -setlevel%command%


– hadoop daemonlog –setlevel
<daemon_host>:<daemon_HTTP_port> <class> <level>
! Valid%log%levels:%FATAL,%ERROR,%WARN,%INFO,%DEBUG,%TRACE%
! New%log%level%does%not%persist%aier%a%daemon%restarts

Daemon Port Class%


NameNode" 50070" org.apache.hadoop.hdfs.server.
namenode.NameNode
Secondary" 50090" org.apache.hadoop.hdfs.server.
NameNode" namenode.SecondaryNameNode
DataNode% 50075" org.apache.hadoop.hdfs.server.
datanode.DataNode
JobTracker% 50030" org.apache.hadoop.mapred.JobTracker
TaskTracker% 50060" org.apache.hadoop.mapred.TaskTracker

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#43%
Daemon"Logs"–"Changing"the"Log"Level"(cont’d)"

! Use%the%logLevel%Web%UI%
– http://<daemon_host>:<daemon_HTTP_port>/logLevel
! New%log%level%does%not%persist%aier%a%daemon%restarts%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#44%
Daemon"Logs"–"Changing"the"Log"Level"(cont’d)"

! The%persistent%log%level%is%configured%in%%
/etc/hadoop/conf/log4j.properties
! Default%log%level%configured%by%hadoop.root.logger
– Default"is"INFO
! Log%level%can%be%set%for%any%specific%Hadoop%daemon%with%
log4j.logger.class.name=LEVEL
– Example:"
log4j.logger.org.apache.hadoop.mapred.JobTracker=INFO

! Restart%the%Hadoop%daemon%to%make%the%log%level%change%take%effect%

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#45%
Task"Logs"

Type%of%Task% Loca1on
Log
MRv1% Default"directory:"/var/log/hadoop-0.20-mapreduce/
userlogs
(Set"HADOOP_LOG_DIR"in"hadoop-env.sh"to"configure)"
Contains"symbolic"links"to"paths"under"mapred.local.dir
Default"retenBon:"24"hours"
Configure"retenBon"with"mapred.userlog.retain.hours
MRv2% Default"directory:"/var/log/hadoop-yarn/userlogs
(Set"HADOOP_LOG_DIR"in"hadoop-env.sh"to"configure)"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#46%
Job"Logs"

Job%Log%File Contents Loca1on%and%Reten1on%


Job%Configura1on% Job"configuraBon"sePngs" ${hadoop.log.dir}/<job_id>_
XML%File% specified"by"the"developer" conf.xml
(Set"HADOOP_LOG_DIR"in"hadoop-
env.sh"to"configure)"
RetenBon:"mapred.jobtracker.
retirejob.interval"milliseconds"
Default:"1"day"(24"*"60"*"60"*"1000)"
Job%History%on% Job"summary"and"counters" hadoop.job.history.location
Local%Disk% Task"summary"and"analysis" Default"locaBon:""
Stack"traces"for"any"thrown" ${hadoop.log.dir}/history
excepBons" RetenBon:"30"days""
URLs"to"navigate"to"task"logs" "

Job%History%in% Same"as"Job"History"on"local" hadoop.job.history.user.


HDFS% disk" location
Default"locaBon:""
<job_output_directory>/_logs/
history
RetenBon:"As"long"as"the"output"
directory"exists"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#47%
Job Logs (cont’d)

! The JobTracker also keeps the data logged to the job logs in memory for a
  limited time
! You can access the job history from the command line:

  mapred job -history all <job_output_dir_in_HDFS>

Hands-On Exercise: Installing a Hadoop Cluster

! In this Hands-On Exercise, you will create a Hadoop cluster with four
  instances
! Please refer to the Hands-On Exercise Manual
! Cluster deployment after exercise completion:

Essential Points

! Hadoop can run in three modes: local, pseudo-distributed, and distributed
! After installing Hadoop from CDH, you can use the service command to start,
  stop, and check the status of Hadoop daemons
  – Example: sudo service hadoop-hdfs-datanode start
! Hadoop’s configuration resides in the /etc/hadoop/conf directory by default
  – Most configuration properties are name/value pairs in XML files
  – Example:
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1g</value>
    </property>
! Hadoop daemons produce log files with extensions .log and .out
  – The .log file is the first place to look when you run into problems

Conclusion

In this chapter, you have learned:
! The different installation configurations available in Hadoop
! How to install Hadoop
! How to specify the Hadoop configuration
! How to configure HDFS
! How to configure MapReduce
! How to locate and configure Hadoop log files

Installing and Configuring
Hive, Impala, and Pig
Chapter 8

Installing and Configuring Hive, Impala, and Pig

In this chapter, you will learn:
! Hive features and basic configuration
! Impala features and basic configuration
! Pig features and installation

Note

! Note that this chapter does not go into any significant detail about Hive,
  Impala, or Pig
! Our intention is to draw your attention to issues System Administrators
  will need to deal with if users request that these products be installed
  – The Cloudera Data Analyst Training: Using Pig, Hive, and Impala with
    Hadoop course provides detailed information about how to use these
    three components

Chapter Topics

Installing and Configuring Hive,     Planning, Installing, and Configuring
Impala, and Pig                      a Hadoop Cluster

!! Hive
!! Impala
!! Pig
!! Hands-On Exercise: Querying HDFS With Hive and Cloudera Impala
!! Conclusion

Using Hive to Query Large Datasets

! Hive is an open source Apache project
  – Originally developed at Facebook
! Motivation: many data analysts are very familiar with SQL (the Structured
  Query Language)
  – The de facto standard for querying data in Relational Database
    Management Systems (RDBMSs)
! Data analysts tend to be far less familiar with programming languages
  such as Java
! Hive provides a way to query data in HDFS using a SQL-like language

Sample Hive Query

SELECT * FROM movies m
JOIN scores s
  ON (m.id = s.movie_id)
WHERE m.year > 1995
ORDER BY m.name DESC
LIMIT 50;

What Hive Provides

! Hive allows users to query data using HiveQL, a language very similar to
  standard SQL
! Hive turns HiveQL queries into standard MapReduce jobs
  – Automatically runs the jobs, and displays the results to the user
! Note that Hive is not an RDBMS!
  – Results take several seconds, minutes, or even hours to be produced
  – Not possible to modify the data using HiveQL
  – UPDATE and DELETE are not supported

The Hive Metastore

! A ‘Table’ in Hive represents an HDFS directory
  – Hive interprets all files in the directory as the contents of the table
  – Hive stores information about how the rows and columns are delimited
    within the files in the Hive Metastore
! By default, Hive uses a Metastore on the user’s local machine
  – The Metastore uses Apache Derby, a Java-based RDBMS
! A shared Metastore is a database in an RDBMS such as MySQL
  – If multiple users will be running Hive, the System Administrator should
    configure a shared Metastore for all users
  – On production systems you will always configure a shared Metastore

The Hive Metastore Service

! The Hive Metastore service is optional, but recommended when you use a
  shared Metastore
! Hive clients make calls to the Hive Metastore service using the Thrift API
! The Hive Metastore service connects to the Metastore using JDBC

  Client --(Thrift API)--> Hive Metastore Service --(JDBC)--> RDBMS

! A variety of clients access the shared Metastore by calling the Hive
  Metastore service, including:
  – The Hive CLI
  – Cloudera Impala
  – Beeswax, a Hive UI used by Hue

Hive Deployment

(Diagram: the Hive CLI submits MapReduce jobs (Mappers and Reducers) to the
Hadoop cluster, and creates or accesses Hive schema definitions in a local
or shared Metastore.)

Installing and Configuring Hive

! Hive runs on a user’s machine
  – Not on the Hadoop cluster itself
  – A user can set up Hive with no System Administrator input
  – Using the standard Hive command-line or Web-based interface
  – If users will be running JDBC-based clients, you can run Hive as a
    service on a centrally-available machine
! To install, run sudo yum install hive
! Configure the client so Hive can access the Hadoop cluster:
  – If the client has core-site.xml and mapred-site.xml configured to
    access the Hadoop cluster, Hive will use the configuration from those
    files
  – If not, configure fs.default.name and mapred.job.tracker
    in /etc/hive/conf/hive-site.xml

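For the second case, a minimal hive-site.xml sketch might look like the following; the host names and ports are placeholders, not values taken from this course’s cluster:

```xml
<?xml version="1.0"?>
<!-- /etc/hive/conf/hive-site.xml (MRv1 client; hosts are examples) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```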
Creating and Configuring a Shared Metastore

1.  Create a database in your RDBMS
    –  CREATE DATABASE metastore;
2.  Import the Hive metastore schema into the database
    –  SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-version.mysql.sql;
3.  Create a database user with appropriate privileges
    –  CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
    –  GRANT SELECT,INSERT,UPDATE,DELETE ON metastore.* TO 'hiveuser'@'%';

Creating and Configuring a Shared Metastore (cont’d)

4.  Install and start the Hive Metastore service:
    –  sudo yum install hive-metastore
    –  sudo service hive-metastore start
5.  Modify hive-site.xml on each user’s machine to refer to the shared
    Metastore

Sample hive-site.xml Properties for a MySQL Shared Metastore

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://elephant:3306/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://192.168.123.1:9083</value>
</property>

HiveServer2 – Hive as a Service

(Diagram: the Beeline CLI and JDBC/ODBC applications connect to the
HiveServer2 server, which submits MapReduce jobs (Mappers and Reducers) to
the Hadoop cluster and accesses a shared Metastore.)

HiveServer2 – Hive as a Service (cont’d)

Functionality      Hive CLI                          HiveServer2
Installation and   •  Local Metastore supported      •  Local Metastore not supported
Configuration      •  If using a shared Metastore,   •  JDBC drivers to access the
                      JDBC drivers to access the        shared Metastore are installed
                      shared Metastore are              on the HiveServer2 server
                      installed on every client
                   •  Local configuration requires
                      root privileges
Client Interface   Hive CLI                          Beeline CLI, JDBC- and ODBC-based
                                                     applications
Security           •  If using a shared Metastore,   •  Credentials to the shared
                      every invocation requires         Metastore RDBMS are stored on
                      credentials to the RDBMS          the server
                   •  Supports Kerberos              •  Supports Kerberos authentication
                      authentication                    (the original HiveServer did not)

What Is Impala?

! Impala also allows users to query data in HDFS using HiveQL
! An open source project created at Cloudera
! Impala uses the same shared Metastore that Hive uses
! Unlike Hive, Impala does not turn queries into MapReduce jobs
  – Impala queries run on an additional set of daemons that run on the
    Hadoop cluster
    – Often referred to as ‘Impala Servers’
  – Impala queries run significantly faster than Hive queries
    – Tests show improvements of 10x to 50x or more

Deploying Impala

! Impala Servers should reside on each DataNode host
  – Can coexist with TaskTrackers if you run Impala and MapReduce
  – Memory usage is configurable
! A single Impala State Store daemon
  – Very lightweight
  – Co-locate with the NameNode, Secondary NameNode, or another server

(Diagram: each DataNode host in the cluster runs an Impala Server; the
Impala State Store Server is co-located with the NameNode.)

Installing, Configuring, and Starting Impala: Overview of Steps

1.  Install the Impala software*
2.  Configure HDFS for optimal Impala performance*
3.  Restart all the DataNodes
4.  If necessary, configure hive-site.xml for a shared Hive Metastore
5.  Copy hive-site.xml, core-site.xml, hdfs-site.xml, and
    log4j.properties to /etc/impala/conf on all the hosts that will run
    Impala Servers
6.  Configure startup options in /etc/default/impala on all the hosts that
    will run impalad or the Impala State Store Server*
7.  Start the Impala Servers and the Impala State Store Server*

* Described later in this section

Installing Impala Software

! Install the Impala Server on all hosts running DataNodes:
  – sudo yum install impala-server
! Install the Impala State Store Server on a single host:
  – sudo yum install impala-state-store
! Install the Impala shell on clients:
  – sudo yum install impala-shell

Configuring HDFS for Optimal Impala Performance

! Configure HDFS to perform short-circuit reads. This configuration is
  mandatory. In hdfs-site.xml:

  dfs.client.read.shortcircuit          Allows daemons to read directly off
                                        their host's disks instead of having
                                        to open a socket to talk to
                                        DataNodes. Required value: true.
  dfs.domain.socket.path                Short-circuit reads use a UNIX
                                        domain socket, which requires a path.
                                        Recommended value:
                                        /var/run/hadoop-hdfs/dn._PORT.
  dfs.client.file-block-storage-        Timeout on a call to get the
  locations.timeout                     locations of required file blocks.
                                        Recommended value: 3000.

! Note: These configuration parameters are for CDH 4.2 and higher. For CDH
  4.1, refer to the documentation for an alternate configuration

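Assembled from the table above, the corresponding hdfs-site.xml additions might read as follows for CDH 4.2 and higher; the values follow the required and recommended settings, and _PORT is a literal placeholder substituted by the DataNode:

```xml
<!-- Additions inside <configuration> in hdfs-site.xml on each DataNode -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>
```

Remember to restart all the DataNodes after making this change.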
Typical Changes to Impala Startup Options

! In /etc/default/impala:

  IMPALA_STATE_STORE_HOST          The host running the Impala State Store
                                   Server. Default value: 127.0.0.1.
                                   Required value: host name or IP address
                                   of the host running the Impala State
                                   Store Server
  -mem_limit argument in the       The amount of system memory available to
  export IMPALA_SERVER_ARGS        Impala. Default value: 100%. Set as
  statement                        needed based on system activity.
                                   Examples: -mem_limit=70%,
                                   -mem_limit=31GB

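Taken together, the two settings above might appear in /etc/default/impala roughly as follows. The host name, the placement of the -state_store_host flag, and the memory limit are illustrative, so check the file shipped with your installation rather than copying this verbatim:

```shell
# /etc/default/impala (excerpt; values are examples, not defaults)
IMPALA_STATE_STORE_HOST=statestore.example.com

# Pass the state store location and a memory cap to each Impala Server
export IMPALA_SERVER_ARGS=" \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -mem_limit=70%"
```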
Starting Impala Software

! Impala Servers:
  – sudo service impala-server start
! Impala State Store Server:
  – sudo service impala-state-store start
! impala-shell client:
  – impala-shell

What Is Pig?

! Pig is another high-level abstraction on top of MapReduce
! An open source Apache project
  – Originally created at Yahoo
! Provides a scripting language known as Pig Latin
  – Composed of operations that are applied to the input data to produce
    output
  – The language will be relatively easy to learn for people experienced
    in Perl, Ruby, or other scripting languages
  – Easy to write complex tasks such as joins of multiple datasets
  – Under the covers, Pig Latin scripts are converted to MapReduce jobs

Sample Pig Script

movies = LOAD '/data/films' AS
    (id:int, name:chararray, year:int);
ratings = LOAD '/data/ratings' AS
    (movie_id:int, user_id:int, score:int);
jnd = JOIN movies BY id, ratings BY movie_id;
recent = FILTER jnd BY year > 1995;
srtd = ORDER recent BY name DESC;
justafew = LIMIT srtd 50;
STORE justafew INTO '/data/pigoutput';

Installing Pig

! Pig runs as a client-side application
  – There is nothing extra to install on the cluster
! To install, run sudo yum install pig
! Configure the client so Pig can access the Hadoop cluster:
  – If the client has core-site.xml and mapred-site.xml configured to
    access the Hadoop cluster, Pig will use the configuration from those
    files
  – If not, configure fs.default.name and mapred.job.tracker
    in /etc/pig/conf/pig.properties

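For the second case, the equivalent pig.properties entries might look like this; the host names and ports are placeholders, not values from this course’s cluster:

```properties
# /etc/pig/conf/pig.properties (excerpt; hosts are examples)
fs.default.name=hdfs://namenode.example.com:8020
mapred.job.tracker=jobtracker.example.com:8021
```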
Hands-On Exercise: Querying HDFS With Hive and Cloudera Impala

! In this Hands-On Exercise, you will install and configure Hive and Impala
  on your cluster and run queries
! Please refer to the Hands-On Exercise Manual
! Cluster deployment after exercise completion:

Essential Points

! Hive is a high-level abstraction on top of MapReduce
  – Runs MapReduce jobs on Hadoop based on HiveQL statements
! Impala runs a separate set of daemons from MapReduce to process HiveQL
  queries
  – One State Store Server
  – Impala Server daemons are co-located with DataNodes
! Hive and Impala use a common metastore to store metadata such as column
  names and data types
  – For Hive, the metastore can be single-user
  – For Impala, the metastore must be shareable
! Pig is another high-level abstraction on top of MapReduce
  – Runs MapReduce jobs on Hadoop based on Pig Latin statements

Conclusion

In this chapter, you have learned:
! Hive features and basic configuration
! Impala features and basic configuration
! Pig features and installation

Hadoop Clients
Chapter 9

Hadoop Clients

In this chapter, you will learn:
! What Hadoop clients are
! How to install and configure Hadoop clients
! How to install and configure Hue
! How Hue authenticates and authorizes user access

Chapter Topics

Hadoop Clients                       Planning, Installing, and Configuring
                                     a Hadoop Cluster

!! What Are Hadoop Clients?
!! Installing and Configuring Hadoop Clients
!! Installing and Configuring Hue
!! Hue Authentication and Authorization
!! Hands-On Exercise: Using Hue to Control Hadoop User Access
!! Conclusion

Hadoop Client

(Diagram: a client uses the Hadoop APIs to access functionality within a
Hadoop cluster, namely the components running on it: HDFS and MapReduce.)

! A Hadoop client requires
  – The Hadoop API
  – Configuration to connect to Hadoop components

Examples of Hadoop Clients

! Command-line clients
  – The hadoop shell (hadoop fs, hadoop job)
  – The Hive shell
  – The Pig shell
  – The Sqoop command-line interface
! Server daemons
  – Flume agent
  – Centralized Sqoop and Hive servers
  – Oozie
! Mapper and Reducer tasks

Deployment Example – End Users’ Systems as Hadoop Clients

(Diagram: end users’ systems connect directly to the JobTracker and
NameNode in the Hadoop cluster.)

Deployment Example – Servers as Hadoop Clients

(Diagram: end users’ browsers connect to intermediary servers, which in
turn connect to the JobTracker and NameNode in the Hadoop cluster.)

Installing Commonly-Used Hadoop Clients

Client                        Installation Command
                              (Run With sudo yum install)
Hadoop Shell and MapReduce    hadoop-client
Hive Shell                    hive
Pig Shell                     pig
Sqoop Shell                   sqoop
Flume Agent                   flume-ng-agent
Centralized Hive Server       hive-server2
Centralized Sqoop Server      sqoop2
Oozie                         oozie

Installing the Hadoop APIs on Clients

! Hadoop client installers automatically include any required Hadoop APIs
  as dependencies if they are not already installed

==========================================================================
Package Arch Version Repository
==========================================================================
Installing:
hadoop-client x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
Installing for dependencies:
bigtop-jsvc x86_64 1.0.10-1.cdh4.2.0.p0.13.el6 Cloudera-cdh4
bigtop-utils noarch 0.4+502-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
hadoop x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
hadoop-0.20-mapreduce
x86_64 0.20.2+1341-1.cdh4.2.0.p0.21.el6 Cloudera-cdh4
hadoop-hdfs x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
hadoop-mapreduce x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
hadoop-yarn x86_64 2.0.0+922-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4
zookeeper noarch 3.4.5+14-1.cdh4.2.0.p0.12.el6 Cloudera-cdh4

Configuring Hadoop Clients

! Clients must be configured to identify the Hadoop cluster
  – NameNode location
  – JobTracker location
! You can simply copy the Hadoop configuration files to clients
  – Or a subset of the Hadoop configuration
! You will need additional client-specific configuration for some clients
  – For example, hive-site.xml, pig.properties,
    flume-conf.properties, etc.
! Cloudera Manager handles client configuration
  – Automatically deploys configurations to CM-managed hosts in the
    Gateway role
  – Makes configuration available for download for non-CM-managed clients

Hue

(Diagram: a browser connects to the Hue Server, which hosts the Hue
applications: Hive UI, Impala UI, File Browser, Job Browser, Job Designer,
Oozie Workflow Editor, and Shell UI.)

Installing, Starting, Accessing, and Configuring Hue

! Install Hue
  – sudo yum install hue
! Set an arbitrary value for secret_key in the [desktop] section of
  hue.ini
  – Used to hash Hue sessions
! Start the Hue Server
  – sudo service hue start
! Access Hue from a browser
  – http://<hue_server>:8888
! Configure the Hue applications
  – Modify /etc/hue/hue.ini
  – Restart the Hue Server for configuration changes to take effect

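As a sketch, the secret_key step above amounts to one entry in /etc/hue/hue.ini; the key shown here is an arbitrary example and should be replaced with your own long random string:

```ini
[desktop]
  # Arbitrary random string used to hash Hue session cookies
  secret_key=0Qr7mC4sX9kV2tZ6bN1yW8uJ5aH3eL0p
```

Restart the Hue Server after editing the file so the change takes effect.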
Configuring Hue Applications

Hue Application   Required Configuration
Hive UI           •  Install Hive on the machine running the Hue Server
                  •  Copy hive-site.xml into /etc/hive/conf on the host
                     running the Hue Server
                  •  Restart the Hue Server
                  •  Run a Hive query from the Hive UI in Hue as a test
Impala UI         •  Prerequisite: a shared Hive metastore
                  •  Install Hive on the machine running the Hue Server
                  •  Copy hive-site.xml into /etc/impala/conf on the host
                     running the Hue Server; it must be configured for the
                     shared metastore
                  •  Copy /etc/default/impala to the host running the Hue
                     Server
                  •  Configure server_host in the [impala] section of
                     hue.ini
                  •  Restart the Hue Server
                  •  Run an Impala query from the Impala UI in Hue as a test

Configuring Hue Applications (cont’d)

Hue Application   Required Configuration
File Browser      •  Prerequisite: either WebHDFS is enabled on your
                     cluster, or HttpFS has been installed and configured.
                     Note that HttpFS is required for HDFS HA deployments.
                  •  Configure webhdfs_url in the [hadoop] /
                     [[hdfs_clusters]] / [[[default]]] section of hue.ini.
                     You configure the same parameter, webhdfs_url,
                     regardless of whether you are using WebHDFS or HttpFS.
                  •  Restart the Hue Server
                  •  Browse HDFS from the Hue File Browser as a test

Configuring Hue Applications (cont’d)

Hue Application   Required Configuration
Job Browser       •  Install the Hue plug-in on the JobTracker host
                  •  Configure mapred-site.xml for the Hue plug-in
                  •  Restart the JobTracker and the TaskTrackers
                  •  Configure jobtracker_host in the [hadoop] /
                     [[mapred_clusters]] / [[[default]]] section of hue.ini
                  •  Restart the Hue Server
                  •  Browse a job from the Hue Job Browser as a test

Configuring Hue Applications (cont’d)

Hue Application       Required Configuration
Job Designer,         •  Prerequisite: Oozie is installed and the Oozie
Oozie Editor             server is running
                      •  Configure oozie_url in the [liboozie] section of
                         hue.ini
                      •  Set submit_to to true in the [hadoop] /
                         [[mapred_clusters]] / [[[default]]] section of
                         hue.ini
                      •  Restart the Hue Server
                      •  Design a job and create an Oozie workflow as a test
Command-Line Shells   •  Prerequisite: for shells you want users to be able
                         to run from Hue, install and configure client
                         software on the Hue machine
                      •  Add Unix user IDs on the Hue machine for Hue users
                         who will run shells
                      •  Start the shells from Hue as a test

First User Login

(Diagram: the Hue Server stores login credentials in the Hue database.
Default: SQLite; MySQL, PostgreSQL, and Oracle are also supported.)

! The first user to log in to Hue automatically receives superuser privileges
! Superusers can be added and removed
! The first user’s login credentials are stored in the Hue database

Defining Users and Groups With the Hue User Admin App

! Use Hue itself to maintain Hue users
! All user information, including credentials, is stored in the Hue database

Accessing Users and Groups From LDAP – Option 1

! The administrator configures Hue to access the LDAP directory
! The administrator then syncs users and groups from LDAP to the Hue database
! Hue authenticates users by accessing credentials from the Hue database

Accessing Users and Groups From LDAP – Option 2

! The administrator configures Hue to access the LDAP directory
! Hue authenticates users by accessing credentials from LDAP

Restricting Access to Hue Applications

! Select a group in the Hue User Admin application
! Edit the permissions for the group
  – Default permissions allow group members to access every Hue application

Hands-On Exercise: Using Hue to Control Hadoop User Access

! In this Hands-On Exercise, you will install and configure Hue, then configure a working environment for analysts with the following capabilities:
  – Submitting Hive and Impala queries
  – Browsing the HDFS file system
  – Browsing Hadoop jobs
  – Using the HBase shell
! Please refer to the Hands-On Exercise Manual

Hands-On Exercise: Using Hue to Control Hadoop User Access – Deployment

! Servers deployed on your cluster after exercise completion:

(Diagram: cluster deployment after the exercise)
Chapter Topics

Hadoop Clients (Planning, Installing, and Configuring a Hadoop Cluster)

!! What Are Hadoop Clients?
!! Installing and Configuring Hadoop Clients
!! Installing and Configuring Hue
!! Hue Authentication and Authorization
!! Hands-On Exercise: Using Hue to Control Hadoop User Access
!! Conclusion
Essential Points

! Hadoop clients access functionality within a Hadoop cluster using the Hadoop API
! Hadoop clients require Hadoop configuration settings
  – End users with access to the configuration settings can modify them
  – Client centralization using servers such as Hue, HiveServer2, and Sqoop2 can reduce the need to give the Hadoop configuration settings to end users
! Hue provides many capabilities to end users
  – Browsing HDFS files
  – Designing, submitting, and browsing MapReduce jobs
  – Editing and submitting Hive and Impala queries
  – Designing, running, and viewing the status of Oozie workflows
! Hue users are authenticated; access to functionality can be restricted
  – Hue provides LDAP integration
Conclusion

In this chapter, you have learned:
! What Hadoop clients are
! How to install and configure Hadoop clients
! How to install and configure Hue
! How Hue authenticates and authorizes user access

Cloudera Manager
Chapter 10
Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation
Cloudera Manager

In this chapter, you will learn:
! Why Cloudera Manager is useful for administering Hadoop clusters
! What features Cloudera Manager offers
! Which daemons are deployed with Cloudera Manager
! How to install Cloudera Manager
! How to install a Hadoop cluster using Cloudera Manager
! How to perform basic administration tasks using Cloudera Manager
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Cluster Management

! Apache Hadoop is a complex system
! Configuring a small, ‘proof-of-concept’ cluster is not too complex
! Dealing with a larger cluster is much more difficult
  – Deploying configuration changes
  – Verifying host configuration
  – Hadoop configuration
  – Software versions
! Lots of configuration options to optimize cluster performance
  – Many of these are not well documented
  – Best practices are still being determined as usage of Hadoop grows
! Cluster configuration is considered to be something of a ‘dark art’
! The configuration files can become complex, and are easy to break
Cluster Maintenance

! A cluster has a large number of hosts
  – Multiple services running on each host
  – Difficult to know the status of everything
! Inevitable issues will arise with hardware and software
! Hardware
  – Cloudera Manager improves hardware failure recovery with easy configuration of High-Availability HDFS
  – Decommissioning hosts for planned downtime is easier
! Services
  – As the cluster grows, moving services to new hosts is easier with Cloudera Manager
  – View all service health statuses from a single screen
Monitoring Performance

! As a cluster grows, you will want to modify configuration values and monitor their effect on cluster performance
  – For example, compare the performance of similar jobs before and after the change
  – This can be difficult using only open-source products
! Keeping track of your cluster’s performance can be difficult
  – There are many elements to monitor
  – Retaining information over time becomes a challenge
! Performance can degrade
  – Perhaps only certain hosts slow down
  – A job may slow down
Managing Access to the Cluster

! Typically, you will only want certain individuals to have access to the cluster
! You may want to restrict what different people can do
  – Some may have access only to HDFS
  – Others can run jobs on the cluster
  – Yet others are administrators and can create new users
! CDH3 and above include Kerberos support to provide much better security than standard Hadoop
  – Allows integration with your existing authentication systems such as LDAP and Active Directory
  – This can be difficult to implement and maintain
  – Cloudera Manager makes implementing Kerberos support much easier
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
What Is Cloudera Manager?

! Cloudera Manager is an application designed to answer the needs of enterprise-level users of Hadoop
  – Install Hadoop software on nodes
  – Commission and configure services on a cluster
  – Monitor cluster activity
  – Produce reports of cluster usage
  – Manage users and groups with access to the cluster
Automated Deployments

! Automatically installs and configures services on hosts
! Allows modification of configuration parameters
  – Recommends ‘best practice’ values for many parameters
! Makes it easy to start and stop master and slave instances
! Provides a simple way to retrieve the configuration files required for client machines to access the cluster
Cloudera Manager Services

! Cloudera Manager controls a wide range of Hadoop and Hadoop ‘ecosystem’ services:
  – HDFS
  – MapReduce
  – Hive
  – Impala
  – Pig
  – Flume
  – Mahout
  – Oozie
  – Sqoop
  – ZooKeeper
  – Hue
  – HBase
  – Cloudera Search
Host Monitoring and Reporting

! Provides a ‘dashboard’ view of cluster status
! Real-time host monitoring of
  – Entire cluster
  – Individual services
  – Individual instances
! Active status monitoring
  – Sends emails when an issue occurs
! Reporting of previous events and data

Monitoring and Diagnosing Cluster Workloads

(Screenshot)

Viewing Service Health and Performance

(Screenshot)

Running Reports on System Performance and Usage

(Screenshot)

Gathering and Viewing Hadoop Logs

(Screenshot)

Tracking Events from Across the Cluster

(Screenshot)

Obtaining Host-Level Snapshots

(Screenshot)
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Cloudera Manager Editions

! Cloudera Manager comes in two versions
! Standard Edition
  – Free download
  – Manage a cluster of any size
! Enterprise Core
  – Includes extra features and functionality for enterprise and production clusters
    – Rolling upgrades
    – SNMP support
    – LDAP integration
    – Configuration history and rollbacks
    – Operational reports
  – Existing Free Edition installs can be easily upgraded to Enterprise
  – Contact sales@cloudera.com for licensing information
Cloudera Manager Editions (cont’d)

! Enterprise Edition add-ons:
  – BDR
    – Backup and disaster recovery
  – Navigator
    – Data audit for HDFS, HBase, and Hive
    – Access management
  – RTD
    – Support for HBase
  – RTQ
    – Support for Impala
! A 60-day free trial of the full Enterprise edition is available
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Basic Topology of Cloudera Manager

! Each cluster node runs the Cloudera Manager Agent
  – Starts and stops Hadoop daemons
  – Collects statistics
! The Cloudera Manager Server runs on a machine outside the cluster
  – Runs the service monitor and activity monitor daemons
  – Stores information about the cluster in a database
    – Available host machines in the cluster
    – Services, roles, and configurations assigned to each host
  – Communicates with the agents via HTTP or HTTPS
    – Agents send heartbeats to the server periodically
  – Sends configuration information and commands to the agents

Basic Topology of Cloudera Manager (cont’d)

(Diagram: the Cloudera Manager Server communicating with Agents on each cluster node)
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Cloudera Manager Requirements and Installation
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Cloudera Manager 4 Requirements

! Supported operating systems
  – Red Hat Enterprise Linux/CentOS 5.7, 6.2, 6.4 (64-bit)
  – SUSE Linux Enterprise Server 11 Service Pack 1 or later (64-bit)
  – Debian 6.0 (Squeeze) (64-bit)
  – Ubuntu 10.04, 12.04 (64-bit)
! Supported browsers
  – Internet Explorer 9
  – Google Chrome
  – Safari 5
  – Firefox 11

Cloudera Manager 4 Requirements (cont’d)

! Supported databases
  – MySQL 5.0, 5.1, 5.5
  – Oracle 10g Release 2, Oracle 11g Release 2
  – PostgreSQL 8.1, 8.3, 8.4, 9.1
! CDH
  – CDH 3u1 or later
  – CDH 4.0 or later
Installing Cloudera Manager

! Download
  – Go to http://www.cloudera.com/downloads
  – Download the Cloudera Manager installer
  – Run the installer from the command line
! All subsequent configuration is done through the Web UI
  – Access via http://<manager_host>:7180
! Documentation
  – For detailed information about installing Cloudera Manager, refer to http://docs.cloudera.com/
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Packages or Parcels

! Cloudera Manager will install the selected Hadoop services on cluster nodes for you
! Installation is via packages or Parcels
! Packages
  – Standard RPMs, Ubuntu packages, etc.
! Parcels
  – Cloudera Manager-specific
  – Allows multiple versions of Hadoop to be present on a node simultaneously
    – Although only one will be running at any given time
  – Allows easy upgrading with minimal downtime
  – Allows easy rolling upgrades (Enterprise edition)
No Internet Access Required for Cluster Nodes

! Often, cluster nodes do not have Internet access
  – Security or other reasons
! Cloudera Manager can install CDH on cluster nodes using a local CDH repository
  – Simply mirror the CDH package or Parcel repository on a local machine
  – Configure Cloudera Manager to use the local repository during installation and upgrade
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Administering Services

! Interacting with a service includes all instances of that service
  – You can interact with an individual service
  – You can interact with all services for an entire cluster
! Cloudera Manager can add and remove services
  – Will make recommendations for hosts on which to run the services
Administering Hosts

! Add hosts
  – Host(s) will be managed by Cloudera Manager
  – All CDH RPMs or Parcels will be installed
! Delete role instance
  – Host(s) will not be managed by Cloudera Manager
! Host inspector
  – Verifies all settings, configurations, and software versions
! Role Groups allow groups of machines in the cluster to share configurations
  – Example: generation 1 and generation 2 machines in a heterogeneous cluster may support different numbers of Mappers and Reducers per host
Making Configuration Changes

! Make changes to an entire service, host, or instance
! Cloudera Manager makes intelligent recommendations about configuration values
  – Will verify the configuration values
! Cloudera Manager makes it easier to propagate changes and restart affected services

Making a Configuration Change

(Screenshot)

Restarting After a Configuration Change

(Screenshot)
Chapter Topics

Cloudera Manager (Planning, Installing, and Configuring a Hadoop Cluster)

!! The Motivation for Cloudera Manager
!! Cloudera Manager Features
!! Standard and Enterprise Versions
!! Cloudera Manager Topology
!! Installing Cloudera Manager
!! Installing Hadoop Using Cloudera Manager
!! Performing Basic Administration Tasks Using Cloudera Manager
!! Conclusion
Essential Points

! Cloudera Manager makes installing, managing, and monitoring a Hadoop cluster much simpler
! Standard edition can manage clusters of any size
! Enterprise edition has some extra features, such as SNMP support and support for rolling upgrades
! A free, 60-day trial license of the full Enterprise Edition is available
Conclusion

In this chapter, you have learned:
! Why Cloudera Manager is useful for administering Hadoop clusters
! What features Cloudera Manager offers
! Which versions of Cloudera Manager are available
! Which daemons are deployed with Cloudera Manager
! How to install Cloudera Manager
! How to install a Hadoop cluster using Cloudera Manager
! How to perform basic administration tasks using Cloudera Manager

Advanced Cluster Configuration
Chapter 11
Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation
Advanced Cluster Configuration

In this chapter, you will learn:
! How to perform advanced configuration of Hadoop
! How to configure port numbers used by Hadoop
! How to explicitly include or exclude hosts
! How to configure HDFS rack awareness
! How to enable HDFS high availability
Chapter Topics

Advanced Cluster Configuration (Planning, Installing, and Configuring a Hadoop Cluster)

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion
Advanced Configuration Parameters

! We have previously covered Hadoop’s basic properties
  – Essentially, the minimum needed to set up a working cluster
! We will now discuss some additional properties
! These generally fall into one of several categories
  – Optimization and performance tuning
  – Capacity management
  – Access control
! The configuration recommendations in this section are baselines
  – Use them as starting points, then adjust as required by the job mix in your environment
hdfs-site.xml

dfs.namenode.handler.count
  The number of threads the NameNode uses to handle RPC requests from DataNodes. Default: 10. Recommended: ln(number of cluster nodes) * 20. Symptoms of this being set too low: ‘connection refused’ messages in DataNode logs as they try to transmit block reports to the NameNode. Used by the NameNode.

dfs.datanode.failed.volumes.tolerated
  The number of volumes allowed to fail before the DataNode takes itself offline, ultimately resulting in all of its blocks being re-replicated. Default: 0, but often increased on machines with several disks. Used by DataNodes.
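The ln(number of cluster nodes) * 20 sizing rule above is simple to compute. A small sketch (the helper function name is ours, not part of Hadoop; it floors the result at the Hadoop default of 10):

```python
import math

def recommended_handler_count(num_nodes):
    """Baseline for dfs.namenode.handler.count: ln(nodes) * 20,
    never below the Hadoop default of 10."""
    return max(10, int(math.log(num_nodes) * 20))

# Example sizings for clusters of different scales
print(recommended_handler_count(50))   # mid-sized cluster
print(recommended_handler_count(500))  # larger cluster
```

The same rule is recommended later in this chapter for mapred.job.tracker.handler.count, so one helper covers both baselines.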
core-site.xml

fs.trash.interval
  When a file is deleted, it is placed in a .Trash directory in the user’s home directory, rather than being immediately deleted. It is purged from HDFS after the number of minutes specified. Default: 0 (disabled). Recommended: 1440 (one day). Used by clients and the NameNode.

io.file.buffer.size
  Determines how much data is buffered during read and write operations. Should be a power-of-two multiple of the hardware page size. Default: 4096. Recommendation: 65536 (64KB). Used by clients and all nodes running Hadoop daemons.
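Applied in core-site.xml, the two recommendations above would look roughly like this (a sketch; adjust the values to your environment):

```xml
<!-- core-site.xml fragment: enable trash (one day) and a 64KB I/O buffer -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
</property>
```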
core-site.xml (cont’d)

io.compression.codecs
  List of compression codecs that Hadoop can use for file compression. Used by clients and all nodes running Hadoop daemons. If you are using another compression codec, add it here.
  Default: org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.DeflateCodec, org.apache.hadoop.io.compress.SnappyCodec
mapred-site.xml

mapred.job.tracker.handler.count
  Number of threads used by the JobTracker to respond to heartbeats from the TaskTrackers. Default: 10. Recommendation: ln(number of cluster nodes) * 20. Used by the JobTracker.

mapred.reduce.parallel.copies
  Number of TaskTrackers a Reducer can connect to in parallel to transfer its data. Default: 5. Recommendation: ln(number of cluster nodes) * 4, with a floor of 10. Used by TaskTrackers.

tasktracker.http.threads
  The number of HTTP threads in the TaskTracker which the Reducers use to retrieve data. Default: 40. Recommendation: 80. Used by TaskTrackers.

mapred-site.xml (cont’d)

mapred.reduce.slowstart.completed.maps
  The percentage of Map tasks which must be completed before the JobTracker will schedule Reducers on the cluster. Default: 0.05 (5 percent). Recommendation: 0.8 (80 percent). Used by the JobTracker.
mapred-site.xml (cont’d)

mapred.map.tasks.speculative.execution
  Whether to allow speculative execution for Map tasks. Default: true. Recommendation: true. Used by the JobTracker.

mapred.reduce.tasks.speculative.execution
  Whether to allow speculative execution for Reduce tasks. Default: true. Recommendation: false. Used by the JobTracker.

! If a task is running significantly more slowly than the average speed of tasks for that job, speculative execution may occur
  – Another attempt to run the same task is instantiated on a different node
  – The results from the first completed task are used
  – The slower task is killed
mapred-site.xml (cont’d)

mapred.compress.map.output
  Determines whether intermediate data from Mappers should be compressed before transfer across the network. Default: false. Recommendation: true. Used by TaskTrackers.

mapred.map.output.compression.codec
  The compression codec used to compress intermediate data from Mappers. Default: org.apache.hadoop.io.compress.DefaultCodec. Recommended: org.apache.hadoop.io.compress.SnappyCodec. Used by TaskTrackers.
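The two intermediate-compression recommendations above, expressed as a mapred-site.xml sketch:

```xml
<!-- mapred-site.xml fragment: compress Mapper output with Snappy -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```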
mapred-site.xml (cont’d)

io.sort.mb
  The size of the buffer in RAM on the Mapper to which the Mapper initially stores its Key/Value pairs before they are written to disk. Default: 100MB. Recommendation: 128MB, if the child heap size is 1GB. This allocation comes out of the task’s JVM heap space. Used by TaskTrackers.

io.sort.factor
  The number of streams to merge at once when sorting files. Default: 10. Recommendation: 64. Used by TaskTrackers.
Chapter Topics

Advanced Cluster Configuration (Planning, Installing, and Configuring a Hadoop Cluster)

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion
Common Hadoop Ports

! Hadoop daemons each provide a Web-based user interface
  – Useful for both users and system administrators
! Expose information on a variety of different ports
  – Port numbers are configurable, although there are defaults for most
! Hadoop also uses various ports for components of the system to communicate with each other
Web UI Ports for Users

Daemon                Default Port   Configuration parameter
NameNode              50070          dfs.http.address
DataNode              50075          dfs.datanode.http.address
Secondary NameNode    50090          dfs.secondary.http.address
JobTracker            50030          mapred.job.tracker.http.address
TaskTracker           50060          mapred.task.tracker.http.address
Hadoop Ports for Administrators

Daemon       Default Port                 Configuration Parameter              Used for
NameNode     8020                         fs.default.name                      Filesystem metadata operations
DataNode     50010                        dfs.datanode.address                 DFS data transfer
DataNode     50020                        dfs.datanode.ipc.address             Block metadata operations and recovery
JobTracker   Usually 8021, 9001, or 8012  mapred.job.tracker                   Job submission, TaskTracker heartbeats
TaskTracker  Usually 8021, 9001, or 8012  mapred.task.tracker.report.address   Communicating with child jobs
Chapter Topics

Advanced Cluster Configuration (Planning, Installing, and Configuring a Hadoop Cluster)

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion
Host ‘include’ and ‘exclude’ Files

! Optionally, specify dfs.hosts in hdfs-site.xml to point to a file listing hosts which are allowed to connect to a NameNode and act as DataNodes
  – Similarly, mapred.hosts points to a file which lists hosts allowed to connect to the JobTracker as TaskTrackers
! Both files are optional
  – If omitted, any host may connect and act as a DataNode/TaskTracker
  – This is a possible security/data integrity issue
! The NameNode can be forced to reread the dfs.hosts file with hadoop dfsadmin -refreshNodes
! The JobTracker can be forced to reread the mapred.hosts file with hadoop mradmin -refreshNodes
Host ‘include’ and ‘exclude’ Files (cont’d)

! It is possible to explicitly prevent one or more hosts from acting as DataNodes
  – Create a dfs.hosts.exclude property, and specify a filename
  – List the names of all the hosts to exclude in that file
  – These hosts will then not be allowed to connect to the NameNode
  – This is often used if you intend to decommission nodes (see later)
  – Run hadoop dfsadmin -refreshNodes to make the NameNode re-read the file
! Similarly, mapred.hosts.exclude can be used to specify a file listing hosts which may not connect to the JobTracker
  – Run hadoop mradmin -refreshNodes to make the JobTracker re-read the file
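A sketch of what the include and exclude configuration might look like in hdfs-site.xml (the file paths are placeholders; each listed file contains one hostname per line):

```xml
<!-- hdfs-site.xml fragment: include and exclude lists for DataNodes -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/allowed-hosts</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/excluded-hosts</value>
</property>
```

After editing either file, run hadoop dfsadmin -refreshNodes so the NameNode re-reads them.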
Chapter Topics

Advanced Cluster Configuration (Planning, Installing, and Configuring a Hadoop Cluster)

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion
Rack Topology Awareness

! Recall that HDFS is ‘rack aware’
  – Distributes blocks based on hosts’ locations
  – Administrator supplies a script which tells Hadoop which rack a node is in
    – Should return a hierarchical ‘rack ID’ for each argument it is passed
    – Rack ID is of the form /datacenter/rack
      •  Example: /datactr1/rack40
    – Script can use a flat file, database, etc.
  – Script name is in topology.script.file.name in core-site.xml
  – If this is blank (default), Hadoop returns a value of /default-rack for all nodes
Sample Rack Topology Script

! A sample rack topology script:

#!/usr/bin/env python
import sys

DEFAULT_RACK = "/datacenter1/default-rack"
HOST_RACK_FILE = "/etc/hadoop/conf/host-rack.map"

host_rack = {}
for line in open(HOST_RACK_FILE):
    (host, rack) = line.split()
    host_rack[host] = rack

for host in sys.argv[1:]:
    if host in host_rack:
        print host_rack[host]
    else:
        print DEFAULT_RACK

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"23#
Sample"Rack"Topology"Script"(cont’d)"

! The#/etc/hadoop/conf/host-rack.map#file:#

host1 /datacenter1/rack1
host2 /datacenter1/rack1
host3 /datacenter1/rack1
host4 /datacenter1/rack1
host5 /datacenter1/rack2
host6 /datacenter1/rack2
host7 /datacenter1/rack2
host8 /datacenter1/rack2
...

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"24#
Naming Machines to Aid Rack Awareness

! A common scenario is to name your hosts in such a way that the Rack
Topology Script can easily determine their location
– Example: a host called r1m32
– 32nd machine in Rack 1
– The Rack Topology Script can simply deconstruct the machine name and
then return the rack awareness information
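Under such a convention the lookup file disappears entirely, because the script can derive the rack from the hostname itself. A minimal sketch, assuming an r&lt;rack&gt;m&lt;machine&gt; naming pattern and a single /datacenter1 prefix (both are assumptions for illustration, not part of the course cluster):

```python
import re
import sys

# Assumed convention: hostnames like "r1m32" = machine 32 in rack 1.
HOST_PATTERN = re.compile(r"^r(\d+)m\d+$")

def rack_for(host):
    # Strip any domain suffix first ("r1m32.example.com" -> "r1m32").
    match = HOST_PATTERN.match(host.split(".")[0])
    if match:
        return "/datacenter1/rack" + match.group(1)
    # Fall back for hosts that do not follow the naming convention.
    return "/datacenter1/default-rack"

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Hadoop invokes the topology script with one or more hostnames or IP addresses as arguments and reads one rack ID per line from its standard output.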

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"25#
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Advanced Cluster Configuration

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"26#
HDFS High Availability Overview

! A single NameNode is a single point of failure
! Two ways a NameNode can result in HDFS downtime
– Unexpected NameNode crash (rare)
– Planned maintenance of NameNode (more common)
! HDFS High Availability (HA) eliminates this SPOF
– Available in CDH4 (or related Apache Hadoop 0.23.x and 2.x)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"27#
HDFS High Availability Architecture

! HDFS High Availability uses a pair of NameNodes
– One Active and one Standby
– Clients only contact the Active NameNode
– DataNodes heartbeat in to both NameNodes
– Active NameNode writes its metadata to a quorum of JournalNodes
– Standby NameNode reads from the JournalNodes to remain in sync
with the Active NameNode

(Diagram: DataNodes report to both the Active and Standby NameNodes. Each
NameNode runs a Quorum Journal Manager, through which the Active NameNode
writes to, and the Standby NameNode reads from, three JournalNodes.)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"28#
HDFS High Availability Architecture (cont’d)

! Active NameNode writes edits to the JournalNodes
– Software to do this is the Quorum Journal Manager (QJM)
– Built in to the NameNode
– Waits for a success acknowledgment from the majority of JournalNodes
– Majority commit means a single crashed or lagging JournalNode
will not impact NameNode latency
– Uses the Paxos algorithm to ensure reliability even if edits are being
written as a JournalNode fails
! Note that there is no Secondary NameNode when implementing HDFS
High Availability
– The Standby NameNode periodically performs checkpointing
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"29#
Failover

! Only one NameNode must be active at any given time
– The other is in standby mode
! The standby maintains a copy of the active NameNode’s state
– So it can take over when the active NameNode goes down
! Two types of failover
– Manual (detected and initiated by a user)
– Automatic (detected and initiated by HDFS itself)
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"30#
Automatic Failover

! Automatic failover is based on Apache ZooKeeper
– A coordination service system also used by HBase
– An open source Apache project
– One of the components in CDH
! A daemon called the ZooKeeper Failover Controller (ZKFC) runs on each
NameNode machine
! ZooKeeper needs a quorum of nodes
– Typical installations use three or five nodes
– Low resource usage
– Can install alongside existing master daemons

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"31#
HDFS HA With Automatic Failover – Deployment

(Diagram: a ZooKeeper ensemble, with instances typically residing on master
nodes, serves two ZooKeeper Failover Controllers. Each Failover Controller
must reside on the same host as the NameNode it monitors. The Active and
Standby NameNodes each run a Quorum Journal Manager that connects to three
JournalNodes, which also typically reside on master nodes. DataNodes report
to both NameNodes.)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"32#
Fencing

! It is essential that exactly one NameNode be active
– Possibility of “split-brain syndrome” otherwise
! The Quorum Journal Manager ensures that a previously-active NameNode
cannot corrupt the NameNode metadata
– However, it is possible in some circumstances that it could report ‘stale’
data to a client
– For this reason, it is good practice to fence the NameNode
– Isolates it, shuts it down
! When configuring HDFS HA, you must specify one or more fencing
methods
– The methods can be shell scripts or SSH
– In order for failover to occur, one of the methods must run successfully

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"33#
HDFS HA Deployment Overview

1.  Modify the Hadoop configuration
2.  Install and start the JournalNodes
3.  Configure and start a ZooKeeper ensemble – automatic failover only
4.  Initialize the shared edits directory if converting from a non-HA
deployment
5.  Install, bootstrap, and start the Standby NameNode
6.  Install, format, and start the ZooKeeper failover controllers – automatic
failover only
7.  Restart DataNodes and the MapReduce daemons

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"34#
Modifying the core-site.xml File for HDFS HA

fs.default.name        Change from <host>:<port> to a logical URI to the
                       HDFS file system that specifies a NameService ID
                       defined in hdfs-site.xml. This URI defines a
                       virtual NameNode and transparently resolves to
                       the Active NameNode. Example: hdfs://mycluster

ha.zookeeper.quorum    Comma-delimited list of all ZooKeeper nodes in
                       the quorum. Specify fully-qualified host names
                       (FQHNs) and port numbers. Example:
                       elephant.example.com:2181,
                       tiger.example.com:2181,
                       horse.example.com:2181
                       Used by ZooKeeper Failover Controllers.
                       Required for automatic failover only.
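In XML form, the two settings above might look like this in core-site.xml (values taken from the examples in the table):

```xml
<!-- core-site.xml excerpt: HDFS HA client-side settings (illustrative values) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://mycluster</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>elephant.example.com:2181,tiger.example.com:2181,horse.example.com:2181</value>
</property>
```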

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"35#
Modifying the hdfs-site.xml File for HDFS HA

dfs.nameservices             A unique NameService ID. Example: mycluster
                             Used by NameNodes and clients.

dfs.ha.namenodes.<NSID>      A comma-separated list of the two NameNodes
For example:                 in the cluster. Example: nn1,nn2
dfs.ha.namenodes.mycluster   Used by NameNodes and clients.
                             (<NSID> = NameService ID)

dfs.namenode.rpc-address.    The RPC address of a NameNode in the
<NSID>.<NNID>                cluster. Two entries required – one for each
For example:                 NameNode. Specify FQHN and port number.
dfs.namenode.rpc-address.    Example: elephant.example.com:8020
mycluster.nn1                Used by NameNodes and clients.
dfs.namenode.rpc-address.
mycluster.nn2
                             (<NNID> = NameNode ID)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"36#
Modifying the hdfs-site.xml File for HDFS HA (cont’d)

dfs.namenode.http-address.   The HTTP address of a NameNode in the
<NSID>.<NNID>                cluster. Two entries required – one for each
For example:                 NameNode. Specify FQHN and port number.
dfs.namenode.http-address.   Example: elephant.example.com:50070
mycluster.nn1                Used by NameNodes.
dfs.namenode.http-address.
mycluster.nn2

dfs.namenode.shared.edits.   A semicolon-delimited list of the
dir                          JournalNodes. Specifies the location written
                             by the Active NameNode and read by the
                             Standby NameNode in order to keep them
                             synchronized. Example:
                             qjournal://elephant:8485;tiger:8485;
                             horse:8485/mycluster
                             Used by NameNodes.

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"37#
Modifying the hdfs-site.xml File for HDFS HA (cont’d)

dfs.journalnode.edits.dir    The location on JournalNodes where edits and
                             other state information should be stored.
                             Example: /disk1/dfs/jn
                             Used by JournalNodes.

dfs.client.failover.proxy.   The name of the Java class used to contact
provider.<NSID>              the currently active NameNode. Currently,
For example:                 only one class is supported in CDH. Example:
dfs.client.failover.proxy.   org.apache.hadoop.hdfs.server.namenode.ha.
provider.mycluster           ConfiguredFailoverProxyProvider
                             Used by clients.

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"38#
Modifying the hdfs-site.xml File for HDFS HA (cont’d)

dfs.ha.fencing.methods       Specifies one or more methods, separated by
                             newlines, to fence the Active NameNode
                             during failover. This parameter must be
                             specified and has no default value. Examples:
                             sshfence(tiger:22)
                             shell(/path/to/fence-nn.sh --target_host=tiger)
                             shell(/bin/true)
                             Used by NameNodes and ZooKeeper Failover
                             Controllers.

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"39#
Modifying the hdfs-site.xml File for HDFS HA (cont’d)

dfs.ha.fencing.ssh.          Location of the SSH private key used for
private-key-files            SSH-based fencing. Must be readable by the
                             hdfs user account. Allows automation of ssh
                             (no passphrase). Only needed if you are
                             using the sshfence fencing method. Example:
                             /home/hdfs/.ssh/id_rsa
                             Used by NameNodes and ZooKeeper Failover
                             Controllers.

dfs.ha.automatic-failover.   Specifies whether automatic failover is
enabled                      enabled. Example: true
                             Used by NameNodes, ZooKeeper Failover
                             Controllers, and clients.
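Pulling the tables together, a minimal hdfs-site.xml for the example cluster might look like the sketch below. The nn2 addresses (tiger.example.com) are assumptions for illustration, since the tables only show nn1’s addresses:

```xml
<!-- hdfs-site.xml excerpt: HDFS HA sketch with illustrative values -->
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name>
          <value>elephant.example.com:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name>
          <value>tiger.example.com:8020</value></property>   <!-- assumed host -->
<property><name>dfs.namenode.http-address.mycluster.nn1</name>
          <value>elephant.example.com:50070</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn2</name>
          <value>tiger.example.com:50070</value></property>  <!-- assumed host -->
<property><name>dfs.namenode.shared.edits.dir</name>
          <value>qjournal://elephant:8485;tiger:8485;horse:8485/mycluster</value></property>
<property><name>dfs.journalnode.edits.dir</name><value>/disk1/dfs/jn</value></property>
<property><name>dfs.client.failover.proxy.provider.mycluster</name>
          <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
<property><name>dfs.ha.fencing.methods</name><value>sshfence(tiger:22)</value></property>
<property><name>dfs.ha.fencing.ssh.private-key-files</name>
          <value>/home/hdfs/.ssh/id_rsa</value></property>
<property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
```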

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"40#
Installing, Configuring, and Starting the JournalNodes

On each host that will run a JournalNode:
! Install the JournalNode
– sudo yum install hadoop-hdfs-journalnode
! Create the shared edits directory on the JournalNode
– At the location specified as dfs.journalnode.edits.dir
– Owned by the hdfs user
! Start the JournalNode
– sudo service hadoop-hdfs-journalnode start

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"41#
Configuring and Starting the ZooKeeper Suite – Automatic
Failover Only

On each host that will run a ZooKeeper server:
! Create a directory for ZooKeeper data
– Owned by the hdfs user
! Create a file in the ZooKeeper data directory that has a single number
identifier
– The identifier must be a different number on each ZooKeeper node
– Example: echo 1 > /disk1/dfs/zk/myid
! Create a ZooKeeper configuration file (example on next slide)
! Start the ZooKeeper server
– sudo /usr/lib/zookeeper/bin/zkServer.sh start \
/etc/hadoop/conf/zoo.cfg

# ©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"42#
Configuring and Starting the ZooKeeper Suite – Automatic
Failover Only (cont’d)
! Example ZooKeeper configuration file:

tickTime=2000
dataDir=/disk1/dfs/zk
clientPort=2181
initLimit=5
syncLimit=2
server.1=elephant.example.com:2888:3888
server.2=tiger.example.com:2888:3888
server.3=horse.example.com:2888:3888

(The server identifier after “server.” matches the number in each node’s
myid file.)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"43#
Initializing the Shared Edits Directories

! This step is required only if you are converting an existing non-HA HDFS
deployment to an HDFS HA deployment
! Run the following command
– sudo -u hdfs hdfs namenode -initializeSharedEdits
! After initializing these directories, you can restart your existing
NameNode
! Run the hdfs haadmin -getServiceState command to verify that
the existing NameNode is not active yet
– sudo -u hdfs hdfs haadmin -getServiceState nn1
– In this example, nn1 is the NameNode ID of the existing NameNode
– The command should return Standby

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"44#
Installing, Bootstrapping, and Starting the Standby NameNode

! Install the second NameNode
! Bootstrap it
– sudo -u hdfs hdfs namenode -bootstrapStandby
! Start it
! Run the hdfs haadmin -getServiceState command to verify that
the new NameNode is not active yet
– sudo -u hdfs hdfs haadmin -getServiceState nn2
– In this example, nn2 is the NameNode ID of the new Standby
NameNode
– The command should return Standby
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"45#
Installing, Formatting, and Starting the ZooKeeper Failover
Controller – Automatic Failover Only
! Install the ZooKeeper Failover Controllers on the same hosts as the
NameNodes
– sudo yum install --assumeyes hadoop-hdfs-zkfc
!  Initialize the current NameNode state in ZooKeeper
– sudo -u hdfs hdfs zkfc -formatZK
!  Start the ZooKeeper Failover Controllers
– sudo service hadoop-hdfs-zkfc start
– Starting the ZooKeeper Failover Controllers will bring up one of your
NameNodes
!  Restart the DataNodes and the MapReduce daemons
!  Restart the Active NameNode and then check the NameNode service state
to test automatic failover

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"46#
Configuring HDFS High Availability With Cloudera Manager

! Cloudera Manager makes it very easy to enable High Availability
– 7 complex steps to enable High Availability manually
– Or, click once and fill in a few screens with Cloudera Manager

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"47#
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Advanced Cluster Configuration

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"48#
Hands-On Exercise: Configuring HDFS for High Availability

! In this Hands-On Exercise, you will modify your Hadoop cluster by adding
and configuring servers required for HDFS high availability
! Please refer to the Hands-On Exercise Manual
! Cluster deployment after exercise completion:

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"49#
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Advanced Cluster Configuration

!! Advanced Configuration Parameters
!! Configuring Hadoop Ports
!! Explicitly Including and Excluding Hosts
!! Configuring HDFS for Rack Awareness
!! Configuring HDFS High Availability
!! Hands-On Exercise: Configuring HDFS for High Availability
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"50#
Essential Points

! Use dfs.datanode.du.reserved to configure the minimum amount of
non-HDFS disk space on volumes
– Default is 0 – HDFS is allowed to consume all available disk space on
slave node volumes
– Non-HDFS disk space is needed for MapReduce intermediate output,
task logs, and daemon logs
! You can configure Hadoop to explicitly include a list of machines in a
cluster, or to exclude machines from a cluster
– Inclusion list provides a degree of security
– Exclusion list is used during node decommissioning
! Hadoop provides a rack awareness capability, which should be
implemented to increase data locality
! HDFS can be configured for high availability with an automatic failover
capability
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"51#
Conclusion

In this chapter, you have learned:
! How to perform advanced configuration of Hadoop
! How to configure port numbers used by Hadoop
! How to explicitly include or exclude hosts
! How to configure HDFS rack awareness
! How to enable HDFS high availability

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"52#
Hadoop Security
Chapter 12

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#1$
Course"Chapters"
!! IntroducCon" Course"IntroducCon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducCon"to"Apache"Hadoop"
!! GeOng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaCon"and"IniCal"ConfiguraCon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,$Installing,$and$
!! Hadoop"Clients"
Configuring$a$Hadoop$Cluster$
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraCon"
!! Hadoop$Security$
!! Managing"and"Scheduling"Jobs"
Cluster"OperaCons"and"
!! Cluster"Maintenance"
Maintenance"
!! Cluster"Maintenance"and"TroubleshooCng"
!! Conclusion"
!! Kerberos"ConfiguraCon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaCon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#2$
Hadoop Security

In this chapter you will learn:
! Why security is important for Hadoop
! How Hadoop's security model evolved
! What Kerberos is and how it relates to Hadoop
! What to consider when securing Hadoop

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#3$
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Hadoop Security

!! Why Hadoop Security Is Important
!! Hadoop’s Security System Concepts
!! What Kerberos Is and How it Works
!! Securing a Hadoop Cluster with Kerberos
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#4$
Why Hadoop Security Is Important

! Laws governing data privacy
– Particularly important for healthcare and finance industries
! Export control regulations for defense information
! Protection of proprietary research data
! Company policies
– Different teams in a company have different needs
! Setting up multiple clusters is a common solution
– One cluster may contain protected data, another cluster does not

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#5$
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Hadoop Security

!! Why Hadoop Security Is Important
!! Hadoop’s Security System Concepts
!! What Kerberos Is and How it Works
!! Securing a Hadoop Cluster with Kerberos
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#6$
Defining Important Security Terms

! Security
– Computer security is a very broad topic
– Access control is the area most relevant to Hadoop
– We’ll therefore focus on authentication and authorization
! Authentication
– Confirming the identity of a participant
– Typically done by checking credentials (username/password)
! Authorization
– Determining whether a participant is allowed to perform an action
– Typically done by checking an access control list

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#7$
Types of Hadoop Security

! Support for HDFS file ownership and permissions
– Provides only modest protection
– User/group authentication is easily subverted (client-side)
– Mainly intended to guard against accidental deletions/overwrites
! Enhanced security with Kerberos
– Provides strong authentication of both clients and servers
– Tasks can be run under a job submitter’s own account
– This enhanced security is optional (disabled by default)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#8$
Hadoop Security Design Considerations

! Hadoop security does not provide
– Encryption for data transmitted on the wire
– Encryption for data stored on disk
! The security of a cluster is enhanced by isolation
– It should ideally be on its own network
– Access to nodes/network should be limited for untrusted users
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#9$
Chapter"Topics"

Planning,$Installing,$and$Configuring$
Hadoop$Security$
a$Hadoop$Cluster$

!! Why"Hadoop"Security"Is"Important""
!! Hadoop’s"Security"System"Concepts"
!! What$Kerberos$Is$and$How$it$Works$$
!! Securing"a"Hadoop"Cluster"with"Kerberos"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#10$
Kerberos Exchange Participants

! Kerberos involves messages exchanged among three parties
– Client
– The server providing a desired network service
– The Kerberos Key Distribution Center (KDC)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#11$
Kerberos"Exchange"ParCcipants"(cont’d)"

Kerberos KDC
(Key Distribution
Center)

Client

Desired Network
Service
(Protected by
Kerberos)

! The client is software that desires access to a service
– The hadoop fs command is one example of a client

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#12$
Kerberos"Exchange"ParCcipants"(cont’d)"

Kerberos KDC
(Key Distribution
Center)

Client

Desired Network
Service
(Protected by
Kerberos)

! This is the service the client wishes to access
– For Hadoop, this will be a service daemon (such as NN or JT)

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#13$
Kerberos"Exchange"ParCcipants"(cont’d)"

Kerberos KDC
(Key Distribution
Center)

Client

Desired Network
Service
(Protected by
Kerberos)

! The Kerberos server (KDC) authenticates and authorizes a client
! The KDC is neither part of nor provided by Hadoop
– Most Linux distributions come with the MIT Kerberos KDC

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#14$
General Kerberos Concepts

(Diagram: 1. The client authenticates to, and is authorized by, the Kerberos
KDC. 2. The authenticated and authorized client requests the desired network
service. 3. The service validates the client's token.)

! Kerberos is a standard network security protocol
– Currently at version 5 (RFC 4120)
– Services protected by Kerberos don’t directly authenticate the client

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#15$
General Kerberos Concepts (cont’d)

! Authenticated status is cached
– You don’t need to explicitly submit credentials with each request
! Passwords are not sent across the network
– Instead, passwords are used to compute encryption keys
– The Kerberos protocol uses encryption extensively
! Timestamps are an essential part of Kerberos
– Make sure you synchronize system clocks (NTP)
! It’s important that reverse lookups work correctly

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#16$
Kerberos Terminology

! Knowing a few terms will help you with the documentation
! Realm
– A group of machines participating in a Kerberos network
– Identified by an uppercase domain (EXAMPLE.COM)
! Principal
– A unique identity which can be authenticated by Kerberos
– Can identify either a host or an individual user
– Every user in a secure cluster will have a Kerberos principal
! Keytab file
– A file that stores Kerberos principals and associated keys
– Allows non-interactive access to services protected by Kerberos

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#17$
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Hadoop Security

!! Why Hadoop Security Is Important
!! Hadoop’s Security System Concepts
!! What Kerberos Is and How it Works
!! Securing a Hadoop Cluster with Kerberos
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#18$
Hadoop Security Setup Prerequisites

! Working Hadoop cluster
– Installing CDH from packages is strongly advised!
– Ensure your cluster actually works before trying to secure it!
! Working Kerberos KDC server
! Kerberos client libraries installed on all Hadoop nodes

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#19$
Configuring Hadoop Security

! Hadoop security configuration is a specialized topic
! Many specifics depend on
– Version of Hadoop and related programs
– Type of Kerberos server used (Active Directory or MIT)
– Operating system and distribution
! You must follow instructions exactly
– There is little room for misconfiguration
– Mistakes often result in vague “access denied” errors
– May need to work around version-specific bugs
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#20$
Configuring Hadoop Security (cont’d)

! For these reasons, we can’t cover this in depth during class
– We do provide an overview in Appendix A
! See the “CDH Security Guide” for detailed instructions
– Available at http://www.cloudera.com/
– Be sure to read the guide corresponding to your version of CDH
! The CDH Security Guide documents manual configuration
– Configuring Hadoop with Kerberos involves many tedious steps
– Cloudera Manager (Enterprise) can automate many of them

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#21$
Securing Related Services

! There are many “ecosystem” tools that interact with Hadoop
! Most require minor configuration changes for a secure cluster
– Such as specifying Kerberos principals or keytab file paths
! Exact configuration details vary with each tool
– See documentation for details
! Some require no configuration changes at all
– Such as Pig and Sqoop

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#22$
Active Directory Integration

! Microsoft’s Active Directory (AD) is an enterprise directory service
– Used to manage user accounts for a Microsoft Windows network
! Recall that every Hadoop user must have a Kerberos principal
– It can be tedious to set up all these accounts
– Many organizations would prefer to use AD for Hadoop users
! Cloudera’s recommended approach
– Run a local MIT Kerberos KDC
– Create all service principals (like hdfs and mapred) in this realm
– But not principals corresponding to actual users in AD
– Set up a one-way cross-realm trust between MIT Kerberos and AD
– Hadoop’s KDC will then accept user accounts from AD
! Instructions can be found in Cloudera’s CDH Security Guide

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#23$
Cloudera Manager’s Support for Security

! See appendix for an overview of the manual configuration process
! Many tasks must be performed on every cluster node
– Create Kerberos principals and keytab files
– Copy to conf directory
– Set proper file ownership and permissions
– Edit configuration files to specify principals, paths, etc.
! These are tedious and error-prone, especially for large clusters
! It’s worth noting that Cloudera Manager automates these steps

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#24$
Chapter Topics

Planning, Installing, and Configuring a Hadoop Cluster:
Hadoop Security

!! Why Hadoop Security Is Important
!! Hadoop’s Security System Concepts
!! What Kerberos Is and How it Works
!! Securing a Hadoop Cluster with Kerberos
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#25$
Essential Points

! CDH1 and CDH2 provided security with HDFS permissions
– Mainly intended to guard against accidental file deletion and overwrite
! Starting with CDH3, Kerberos security became available
– Manual configuration requires many steps
– Consider using Cloudera Manager
! Hadoop does not provide:
– Encrypted transmission over the wire
– Data encryption on disk

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#26$
Conclusion

In this chapter you have learned:
! Why security is important for Hadoop
! How Hadoop's security model evolved
! What Kerberos is and how it relates to Hadoop
! What to consider when securing Hadoop

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#27$
Managing and Scheduling Jobs
Chapter 13

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#1$
Course"Chapters"
!! IntroducDon" Course"IntroducDon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducDon"to"Apache"Hadoop"
!! GePng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaDon"and"IniDal"ConfiguraDon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,"Installing,"and"
!! Hadoop"Clients"
Configuring"a"Hadoop"Cluster"
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraDon"
!! Hadoop"Security"
!! Managing$and$Scheduling$Jobs$
Cluster$Opera;ons$and$
!! Cluster"Maintenance"
Maintenance$
!! Cluster"Maintenance"and"TroubleshooDng"
!! Conclusion"
!! Kerberos"ConfiguraDon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaDon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#2$
Managing and Scheduling Jobs

In this chapter, you will learn:
! How to view and stop jobs running on a cluster
! The options available for scheduling Hadoop jobs
! How to configure the Fair Scheduler

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#3$
Chapter Topics

Cluster Operations and Maintenance:
Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#4$
Displaying Running Jobs

! To view all jobs running on the cluster, use mapred job -list

[training@localhost ~]$ mapred job -list
1 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0008  1      1320210148487  training  NORMAL    NA

Displaying All Jobs

! To display all jobs, including completed jobs, use
mapred job -list all

[training@localhost ~]$ mapred job -list all
7 jobs submitted
States are:
Running : 1  Succeded : 2  Failed : 3  Prep : 4
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0004  2      1320177624627  training  NORMAL    NA
job_201110311158_0005  2      1320177864702  training  NORMAL    NA
job_201110311158_0006  2      1320209627260  training  NORMAL    NA
job_201110311158_0007  2      1320210018614  training  NORMAL    NA
job_201110311158_0008  2      1320210148487  training  NORMAL    NA
job_201110311158_0001  2      1320097902546  training  NORMAL    NA
job_201110311158_0003  2      1320099376966  training  NORMAL    NA

Displaying All Jobs (cont’d)

! Note that states are displayed as numeric values
– 1: Running
– 2: Succeeded
– 3: Failed
– 4: In preparation
– 5: (undocumented) Killed
! Easy to write a cron job that periodically lists (for example) all failed jobs,
running a command such as
– mapred job -list all | grep '<tab>3<tab>'
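Such a filter can be sketched as follows. This is a minimal sketch: canned, tab-separated sample output (with illustrative job IDs) stands in for `mapred job -list all` so the filter is runnable without a cluster; in a real cron job you would pipe the command's output into the same awk filter and mail the result.

```shell
# Sketch: extract failed jobs (state 3 in the second, tab-separated column).
# The sample output below stands in for `mapred job -list all`.
sample_output=$(printf 'job_201110311158_0004\t2\t1320177624627\ttraining\tNORMAL\tNA\njob_201110311158_0005\t3\t1320177864702\ttraining\tNORMAL\tNA')

# awk is more robust than grep here: it tests exactly the state column
failed_jobs=$(printf '%s\n' "$sample_output" | awk -F'\t' '$2 == 3 { print $1 }')
echo "$failed_jobs"
```

The awk version avoids a pitfall of the literal-tab grep: it cannot accidentally match a "3" in another column.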

Displaying the Status of an Individual Job

! mapred job -status <job_id> provides status about an individual job
– Completion percentage
– Values of counters
– System counters and user-defined counters
! Note: the job name is not displayed!
– The Web user interface is the most convenient way to view more
details about an individual job
– More details later

Killing a Job

! It is important to note that once a user has submitted a job, they cannot
stop it just by hitting CTRL-C on their terminal
– This stops job output appearing on the user’s console
– The job is still running on the cluster!

Killing a Job (cont’d)

! To kill a job, use mapred job -kill <job_id>

[training@localhost ~]$ mapred job -list
1 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0009  1      1320210791739  training  NORMAL    NA

[training@localhost ~]$ mapred job -kill job_201110311158_0009
Killed job job_201110311158_0009

[training@localhost ~]$ mapred job -list
0 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo

Stopping MapReduce Jobs From the Web UI

! By default, the JobTracker Web UI is read-only
– Job information is displayed, but the job cannot be controlled in any
way
! It is possible to set the UI to allow jobs, or individual Map or Reduce tasks,
to be killed
– Add the following property to core-site.xml

<property>
  <name>webinterface.private.actions</name>
  <value>true</value>
</property>

– Restart the JobTracker

Stopping Jobs From the Web UI (cont’d)

! The Web UI will now include an ‘actions’ column for each task
– And an overall option to kill entire jobs

Stopping Jobs From the Web UI (cont’d)

! Caution: anyone with access to the Web UI can now manipulate running
jobs!
– Best practice: make this available only to administrative users
! Better to use the command line to stop jobs

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Hands-On Exercise: Managing Jobs

! In this Hands-On Exercise, you will start and kill jobs from the command
line
! Please refer to the Hands-On Exercise Manual

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Job Scheduling Basics

! A Hadoop job is composed of
– An unordered set of Map tasks, which have locality preferences
– An unordered set of Reduce tasks
! Tasks are scheduled by the JobTracker
– They are then scheduled/launched by TaskTrackers
– One TaskTracker per node
– Each TaskTracker has a fixed number of slots for Map and Reduce tasks
– This may differ per node – a node with a powerful processor may
have more slots than one with a slower CPU
– TaskTrackers report the availability of free task slots to the JobTracker
on the Master node
! Scheduling a job requires assigning Map and Reduce tasks to available
Map and Reduce task slots

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

The FIFO Scheduler

! The default Hadoop job scheduler is FIFO
– First In, First Out
! Given two jobs A and B, submitted in that order, all Map tasks from job A
are scheduled before any Map tasks from job B are considered
– Similarly for Reduce tasks
! Order of task execution within a job may be shuffled around

A1 A2 A3 A4 B1 B2 B3 …

Priorities in the FIFO Scheduler

! The FIFO Scheduler supports assigning priorities to jobs
– Priorities are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
– Set with the mapred.job.priority property
– May be changed from the command line as the job is running
– hadoop job -set-priority <job_id> <priority>
– All work in each queue is processed before moving on to the next

All higher-priority tasks are run first, if they exist, before any
lower-priority tasks are started, regardless of submission order:

C1 C2 C3 (High Priority)  then  A1 A2 A3 A4 B1 B2 B3 … (Normal Priority)

Priorities in the FIFO Scheduler: Problems

! Problem: Job A may have 2,000 tasks; Job B may have 20
– Even if Job B has a higher priority than Job A, Job B will not make any
progress until Job A has nearly finished
– Completion time should be proportional to job size
! Users with a poor understanding of the system may flag all their jobs as
HIGH_PRIORITY
– Thus starving other jobs of processing time
! The ‘all or nothing’ nature of the scheduler makes sharing a cluster between
production jobs with SLAs and interactive users challenging

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Goals of the Fair Scheduler

! The Fair Scheduler is designed to allow multiple users to share the cluster
simultaneously
! Should allow short interactive jobs to coexist with long production jobs
! Should allow resources to be controlled proportionally
! Should ensure that the cluster is efficiently utilized
! This is the scheduler recommended by Cloudera for production use

The Fair Scheduler: Basic Concepts

! Each job is assigned to a pool
– Default assignment is one pool per username
! Jobs may be assigned to arbitrarily-named pools
– Such as “production”
! Physical slots are not bound to any specific pool
! Each pool gets an even share of the available task slots

Pool Creation

! By default, pools are created dynamically based on the username
submitting the job
– No configuration necessary
! Jobs can be sent to designated pools (e.g., “production”)
– Pools can be defined in a configuration file (see later)
– Pools may have a minimum number of mappers and reducers defined
Adding Pools Readjusts the Share of Slots

! If Charlie now submits a job in a new pool, shares of slots are adjusted

Determining the Fair Share

! The fair share of task slots assigned to the pool is based on:
– The actual number of task slots available across the cluster
– The demand from the pool
– The number of tasks eligible to run
– The minimum share, if any, configured for the pool
– The fair share of each other active pool
! The fair share for a pool will never be higher than the actual demand
! Pools are filled up to their minimum share, assuming cluster capacity
! Excess cluster capacity is spread across all pools
– The aim is to maintain the most even loading possible
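The water-filling behavior described above can be sketched in a few lines of Python. This is a deliberate simplification of the real algorithm: it assumes equal pool weights, ignores preemption, and the pool names and slot counts are illustrative.

```python
def fair_shares(total_slots, pools):
    """Water-filling sketch of Fair Scheduler slot allocation.

    pools maps a pool name to {'demand': <tasks eligible to run>,
    'min_share': <optional minimum slot guarantee>}.
    """
    # 1. A pool's share is never higher than its demand, so fill each
    #    pool only up to min(min_share, demand).
    alloc = {name: min(p.get('min_share', 0), p['demand'])
             for name, p in pools.items()}
    remaining = total_slots - sum(alloc.values())
    # 2. Spread excess capacity one slot at a time onto the pool that
    #    currently holds the fewest slots ("most even loading").
    while remaining > 0:
        eligible = [n for n, p in pools.items() if alloc[n] < p['demand']]
        if not eligible:
            break
        poorest = min(eligible, key=lambda n: alloc[n])
        alloc[poorest] += 1
        remaining -= 1
    return alloc


# 30 slots; Production has a 20-slot minimum share
print(fair_shares(30, {
    'production': {'demand': 100, 'min_share': 20},
    'alice': {'demand': 100},
    'bob': {'demand': 100},
}))
# → {'production': 20, 'alice': 5, 'bob': 5}
```

Note that when Production has no demand, the same function gives all 30 slots to Alice and Bob evenly, matching the "Production queue empty" example on the next slides.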

Example Minimum Share Allocation

! First, fill Production up to its 20-slot minimum guarantee
! Then distribute the remaining 10 slots evenly across Alice and Bob

Example Allocation 2: Production Queue Empty

! Production has no demand, so no slots are reserved
! All slots are allocated evenly across Alice and Bob

Example Allocation 3: minShares Exceed Slots

! The minShare of Production and Research exceeds available capacity
! minShares are scaled down pro rata to match actual slots
! No slots remain for users without a minShare (i.e., Bob)

Example 4: minShare < Fair Share

! Production filled to minShare
! Remaining 25 slots distributed across all pools
! The Production pool gets more than its minShare, to maintain fairness

Pools With Weights

! Instead of (or in addition to) setting minShare, pools can be assigned a
weight
! Pools with higher weight get more slots during free slot allocation
! ‘Even water glass height’ analogy:
– Think of the weight as controlling the ‘width’ of the glass
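The analogy can be made concrete with a small sketch. This is an illustration only: the real scheduler interleaves weighting with minShare handling and preemption, and the pool names and numbers below are made up.

```python
def split_by_weight(free_slots, weights):
    """Sketch: divide free slots across pools in proportion to weight.

    Largest-remainder rounding keeps the total equal to free_slots.
    """
    total_weight = sum(weights.values())
    # exact (fractional) entitlement of each pool
    exact = {n: free_slots * w / total_weight for n, w in weights.items()}
    alloc = {n: int(x) for n, x in exact.items()}
    leftover = free_slots - sum(alloc.values())
    # hand leftover slots to the pools with the largest fractional part
    for n in sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)[:leftover]:
        alloc[n] += 1
    return alloc


# Bob's glass is twice as 'wide', so his pool receives roughly twice the slots
print(split_by_weight(25, {'alice': 1.0, 'bob': 2.0, 'production': 1.0}))
# → {'alice': 6, 'bob': 13, 'production': 6}
```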

Example: Pool With Double Weight

! Production filled to minShare (5)
! Remaining 25 slots distributed across pools
! Bob’s pool gets two slots instead of one during each round

Multiple Jobs Within A Pool

! A pool exists if it has one or more jobs in it
! So far, we’ve only described how slots are assigned to pools
– We need to determine how jobs are scheduled within a given pool

Job Scheduling Within a Pool

! Within a pool, resources are fair-scheduled across all jobs
– This is achieved via another instance of the Fair Scheduler
! It is possible to enforce FIFO scheduling within a pool
– May be appropriate for jobs that would compete for external
bandwidth, for example
! Pools can have a maximum number of concurrent jobs configured
! The weight of a job within a pool is determined by its priority (NORMAL,
HIGH, etc.)
! Assignment of a slot to a pool can be delayed
– The Fair Scheduler lets free slots on a host remain open for a short time
if no queued tasks prefer to run on that host
– This increases the overall data locality hit ratio

Preemption in the Fair Scheduler

! If shares are imbalanced, pools which are over their fair share may not
assign new tasks when their old ones complete
– Eventually, as tasks complete, free slots will become available
– Those free slots will be used by pools which were under their fair share
– This may not be acceptable in a production environment, where tasks
take a long time to complete
! Two types of preemption are supported
– minShare preemption
– Fair share preemption

minShare Preemption

! Pools with a minimum share configured are operating on an SLA (Service
Level Agreement)
– Waiting for tasks from other pools to finish may not be appropriate
! Pools which are below their minimum guaranteed share can kill the
newest tasks from other pools to reap slots
– Can then use those slots for their own tasks
– Ensures that the minimum share will be delivered within a timeout
window

Fair Share Preemption

! Pools not receiving their fair share can kill tasks from other pools
– A pool will kill the newest task(s) in an over-share pool to forcibly make
room for starved pools
! Fair share preemption is used conservatively
– A pool must be operating at less than 50% of its fair share for 10
minutes before it can preempt tasks from other pools

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

The Capacity Scheduler: Basic Concepts

! Each job is assigned to a queue
– Analogous to a Fair Scheduler pool
! Queues are assigned a percentage of the cluster’s slots
– Similar to a Fair Scheduler pool’s minShare
! A queue’s unused capacity is not given away to other queues
– Unlike the Fair Scheduler, in which pools with lower usage can give
away their slots
! Jobs within queues are FIFO ordered
– Similar to the FIFO Scheduler; you can enable prioritization within
queues
! Slot allocation can be based on tasks’ memory usage

Choosing a Scheduler

! Match the scheduler to the requirement:
– Learning tool or proof of concept: FIFO Scheduler
– Pool utilization varies, so it is desirable that pools give away
resources when they are not in use: Fair Scheduler
– Jobs within a pool need to make equal progress: Fair Scheduler
– Data locality makes a significant difference in job run-time
performance: Fair Scheduler
– Pool utilization has little fluctuation: Capacity Scheduler
– Jobs have a high degree of variance in memory utilization:
Capacity Scheduler

! In practice, the Fair Scheduler is far more widely used than the Capacity
Scheduler

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Steps to Configure the Fair Scheduler

1. Enable the Fair Scheduler
2. Configure Scheduler parameters
3. Configure pools

Enabling the Fair Scheduler

! In mapred-site.xml on the JobTracker, specify the scheduler to use:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

! Identify the pool configuration file:

<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/allocations.xml</value>
</property>

Scheduler Parameters in mapred-site.xml

! mapred.fairscheduler.poolnameproperty
– Specifies which job configuration property is used to determine the
pool that a job belongs in. Default is user.name (i.e., one pool per
user). Other options include group.name and mapred.job.queue.name
! mapred.fairscheduler.sizebasedweight
– Makes a pool’s weight proportional to log(demand) of the pool.
Default: false
! mapred.fairscheduler.weightadjuster
– Specifies a WeightAdjuster implementation that tunes job weights
dynamically. Default is blank; can be set to
org.apache.hadoop.mapred.NewJobWeightBooster
! mapred.fairscheduler.preemption
– Enables preemption in the Fair Scheduler. Set to true if you have
pools that must operate on an SLA. Default is false

Configuring Pools

! The allocations configuration file must exist, and contain an
<allocations> entity
! <pool> entities can contain minMaps, minReduces, maxMaps,
maxReduces, maxRunningJobs, weight,
minSharePreemptionTimeout, schedulingMode
! <user> entities (optional) can contain maxRunningJobs
– Limits the number of simultaneous jobs a user can run
! The userMaxJobsDefault entity (optional)
– Maximum number of jobs for any user without a specified limit
! System-wide and per-pool timeouts can be set

Very Basic Pool Configuration

! The allocations configuration file must exist, and contain at least this:

<?xml version="1.0"?>
<allocations>
</allocations>

Example: Limit Users to Three Jobs Each

! Limit max jobs for any user: specify userMaxJobsDefault

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>

Example: Allow One User More Jobs

! If a user needs more than the standard maximum number of jobs, create a
<user> entity

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <user name="bob">
    <maxRunningJobs>6</maxRunningJobs>
  </user>
</allocations>

Example: Add a Fair Share Timeout

! Set a preemption timeout

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <user name="bob">
    <maxRunningJobs>6</maxRunningJobs>
  </user>
  <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
</allocations>

Example: Create a ‘production’ Pool

! Pools are created by adding <pool> entities

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
</allocations>

Example: Add an SLA to the Pool

<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
  </pool>
</allocations>

Example: Create a FIFO Pool

! FIFO pools are useful for jobs which are, for example, bandwidth-intensive

<?xml version="1.0"?>
<allocations>
  <pool name="bandwidth_intensive">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <schedulingMode>FIFO</schedulingMode>
  </pool>
</allocations>

! Note: <schedulingMode>FAIR</schedulingMode> would use fair
scheduling within the pool (the default)

Monitoring Pools and Allocations

! The Fair Scheduler exposes a status page in the JobTracker Web user
interface at http://<job_tracker_host>:50030/scheduler
– Allows you to inspect pools and allocations
! Any changes to the pool configuration file (e.g., allocations.xml) will
automatically be reloaded by the running scheduler
– The scheduler detects a timestamp change on the file
– Waits five seconds after the change was detected, then reloads the
file
– If the scheduler cannot parse the XML in the configuration file, it
will log a warning and continue to use the previous configuration

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Hands-On Exercise: Using The Fair Scheduler

! In this Hands-On Exercise, you will run jobs in different pools
! Please refer to the Hands-On Exercise Manual

Chapter Topics

Cluster Operations and Maintenance – Managing and Scheduling Jobs

!! Managing Running Jobs
!! Hands-On Exercise: Managing Jobs
!! Scheduling Hadoop Jobs
!! The FIFO Scheduler
!! The Fair Scheduler
!! The Capacity Scheduler
!! Configuring the Fair Scheduler
!! Hands-On Exercise: Using The Fair Scheduler
!! Conclusion

Essential Points

! You cannot kill a Hadoop job from the terminal by using CTRL-C
– Use mapred job -kill <job_id>
– Or, enable the JobTracker Web UI to kill jobs
! Hadoop provides three job schedulers
– The FIFO Scheduler
– Has serious limitations and should not be used in production
– The Fair Scheduler and Capacity Scheduler
– Allow resources to be controlled proportionally
– Ensure that a cluster is used efficiently
! The Fair Scheduler is the most commonly-used scheduler
– Efficient when pool utilization varies
– Delayed task assignment leads to better data locality

Conclusion

In this chapter, you have learned:
! How to view and stop jobs running on a cluster
! The options available for scheduling multiple jobs on the same cluster
! How to configure the Fair Scheduler

Cluster Maintenance
Chapter 14

Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Maintenance and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation

Cluster Maintenance

In this chapter, you will learn:
! How to check the status of HDFS
! How to copy data between clusters
! How to add and remove nodes
! How to rebalance the cluster
! How to upgrade your cluster

Chapter Topics

Cluster Operations and Maintenance – Cluster Maintenance

!! Checking HDFS Status
!! Hands-On Exercise: Breaking The Cluster
!! Copying Data Between Clusters
!! Adding and Removing Cluster Nodes
!! Rebalancing the Cluster
!! Hands-On Exercise: Verifying The Cluster’s Self-Healing Features
!! Cluster Upgrading
!! Conclusion

Checking for Corruption in HDFS

! hdfs fsck checks for missing or corrupt data blocks
– Unlike the system fsck, it does not attempt to repair errors
! Can be configured to list all files
– Also all blocks for each file, all block locations, all racks
! Examples:
hdfs fsck /
hdfs fsck / -files
hdfs fsck / -files -blocks
hdfs fsck / -files -blocks -locations
hdfs fsck / -files -blocks -locations -racks

Checking for Corruption in HDFS (cont’d)

! Good idea to run hdfs fsck as a regular cron job that e-mails the
results to administrators
– Choose a low-usage time to run the check
! The -move option moves corrupted files to /lost+found
– A corrupted file is one where all replicas of a block are missing
! The -delete option deletes corrupted files
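The decision logic of such a cron job can be sketched as below. This is a minimal sketch: canned report text stands in for the real `hdfs fsck /` output so it runs without a cluster, and the mail step is only indicated in a comment (the recipient would be your choice).

```shell
# Sketch: flag an fsck report that needs attention.
# The canned text stands in for: report=$(hdfs fsck /)
report='Status: CORRUPT
 CORRUPT FILES: 2
 MISSING BLOCKS: 3'

# fsck prints "Status: HEALTHY" when no blocks are missing or corrupt
if printf '%s\n' "$report" | grep -q 'Status: HEALTHY'; then
  verdict=HEALTHY
else
  verdict=CORRUPT   # here you would e-mail $report to administrators
fi
echo "$verdict"
```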

Using dfsadmin

! The hdfs dfsadmin command provides a number of administrative
features, including:
! List information about HDFS on a per-DataNode basis

$ hdfs dfsadmin -report

! Re-read the dfs.hosts and dfs.hosts.exclude files

$ hdfs dfsadmin -refreshNodes

Using dfsadmin (cont’d)

! Manually set the filesystem to ‘safe mode’
– The NameNode starts up in safe mode
– Read-only – no changes can be made to the metadata
– Does not replicate or delete blocks
– Leaves safe mode when the (configured) minimum percentage of
blocks satisfy the minimum replication condition

$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -safemode leave

– Can also block until safe mode is exited
– Useful for shell scripts
– hdfs dfsadmin -safemode wait

Using dfsadmin (cont’d)

! Saves the NameNode metadata to disk and resets the edit log
– Must be in safe mode

$ hdfs dfsadmin -saveNamespace

– More on this later

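Put together, the safe-mode and saveNamespace commands form a short checkpoint sequence. The sketch below only echoes the commands so it is safe to run anywhere; on a real cluster you would change `run` to execute them (which requires HDFS superuser privileges, and a read-only cluster while in safe mode).

```shell
# Sketch of a manual metadata checkpoint sequence. `run` echoes each
# command instead of executing it; change it to run() { "$@"; } to
# execute for real on a cluster.
run() { echo "+ $*"; }

run hdfs dfsadmin -safemode enter
run hdfs dfsadmin -saveNamespace
run hdfs dfsadmin -safemode leave
```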
Chapter Topics

Cluster Operations and Maintenance – Cluster Maintenance

!! Checking HDFS Status
!! Hands-On Exercise: Breaking The Cluster
!! Copying Data Between Clusters
!! Adding and Removing Cluster Nodes
!! Rebalancing the Cluster
!! Hands-On Exercise: Verifying The Cluster’s Self-Healing Features
!! Cluster Upgrading
!! Conclusion

Hands-On Exercise: Breaking the Cluster

! In this Hands-On Exercise, you will introduce some problems into the
cluster
! Please refer to the Hands-On Exercise Manual

Chapter Topics

Cluster Operations and Maintenance – Cluster Maintenance

!! Checking HDFS Status
!! Hands-On Exercise: Breaking The Cluster
!! Copying Data Between Clusters
!! Adding and Removing Cluster Nodes
!! Rebalancing the Cluster
!! Hands-On Exercise: Verifying The Cluster’s Self-Healing Features
!! Cluster Upgrading
!! Conclusion

Copying Data

! Hadoop clusters can hold massive amounts of data
! A frequent requirement is to back up the cluster for disaster recovery
! Ultimately, this is not a Hadoop problem!
– It’s a ‘managing huge amounts of data’ problem
! The cluster could be backed up to tape, etc., if necessary
– Custom software may be needed
Copying Data with distcp

! distcp copies data within a cluster, or between clusters
– Used to copy large amounts of data
– Turns the copy procedure into a MapReduce job
! Copies files or entire directories
– Files previously copied will be skipped
– Note that the only check for duplicate files is that the file’s name
and size are identical

distcp Examples

! Copying data from one cluster to another
– hadoop distcp hdfs://nn1:8020/path/to/src \
hdfs://nn2:8020/path/to/dest

! Copying data within the same cluster
– hadoop distcp /path/to/src /path/to/dest

! Copying data from one cluster to another when the clusters are running
different versions of Hadoop
– HA HDFS example using HttpFS
– hadoop distcp hdfs://mycluster/path/to/src \
webhdfs://httpfs-svr:14000/path/to/dest
– Non-HA HDFS example using WebHDFS
– hadoop distcp hdfs://nn1:8020/path/to/src \
webhdfs://nn2:50070/path/to/dest

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#15$
Copying"Data:"Best"PracCces"

! In$prac4ce,$many$organiza4ons$do$not$copy$data$between$clusters$
! Instead,$they$write$their$data$to$two$clusters$as$it$is$being$imported$
– This"is"ocen"more"efficient"
– Not"necessary"to"run"all"MapReduce"jobs"on"the"backup"cluster"
– As"long"as"the"source"data"is"available,"all"derived"data"can"be"
regenerated"later"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#16$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding$and$Removing$Cluster$Nodes$
!! Rebalancing"the"Cluster"
!! Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"
!! Cluster"Upgrading"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#17$
Adding"Cluster"Nodes"

! To$add$nodes$to$the$cluster:$$
1.  Add"the"names"of"the"nodes"to"the"‘include’"file(s),"if"you"are"using"
this"method"to"explicitly"list"allowed"nodes"
–  The"file(s)"referred"to"by"dfs.hosts"(and"mapred.hosts"if"
that"has"been"used)
2.  Update"your"rack"awareness"script"with"the"new"informaCon"
3.  Update"the"NameNode"with"this"new"informaCon"
–  hdfs dfsadmin –refreshNodes
4.  Update"the"JobTracker"with"this"new"informaCon"
–  hadoop mradmin –refreshNodes"
5.  Start"the"new"DataNode"and"TaskTracker"instances"
6.  Check"that"the"new"DataNodes"and"TaskTrackers"appear"in"the"Web"
UI"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#18$
Adding"Nodes:"Points"to"Note"

! A$NameNode$will$not$‘favor’$a$new$node$added$to$the$cluster$
– It"will"not"prefer"to"write"blocks"to"the"node"rather"than"to"other"nodes"
! This$is$by$design$
– The"assumpCon"is"that"new"data"is"more"likely"to"be"processed"by"
MapReduce"jobs"
– If"all"new"blocks"were"wri>en"to"the"new"node,"this"would"impact"data"
locality"for"MapReduce"jobs"
– Would"also"create"a"‘hot"spot’"when"wriCng"new"data"to"the"cluster"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#19$
Removing"Cluster"Nodes"

!  To$remove$nodes$from$the$cluster:$
1.  Add"the"names"of"the"nodes"to"the"‘exclude’"file(s)"
–  The"file(s)"referred"to"by"dfs.hosts.exclude"(and"
mapred.hosts.exclude"if"that"has"been"used)
2.  Update"the"JobTracker"with"the"revised"set"of"nodes"
–  hadoop mradmin –refreshNodes"
3.  Update"the"NameNode"with"the"new"set"of"DataNodes"
–  hdfs dfsadmin -refreshNodes
–  The"NameNode"UI"will"show"the"admin"state"change"to"
‘Decommission"In"Progress’"for"affected"DataNodes"
–  When"all"DataNodes"report"their"state"as"‘Decommissioned’,"all"
the"blocks"will"have"been"replicated"elsewhere"
4.  Shut"down"the"decommissioned"nodes"
5.  Remove"the"nodes"from"the"‘include’"and"‘exclude’"files"and"update"
the"NameNode"as"above"
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#20$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding"and"Removing"Cluster"Nodes"
!! Rebalancing$the$Cluster$
!! Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"
!! Cluster"Upgrading"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#21$
Cluster"Rebalancing"

! An$HDFS$cluster$can$become$‘unbalanced’$
– Some"nodes"have"much"more"data"on"them"than"others"
– Example:"add"a"new"node"to"the"cluster"
– Even"acer"adding"some"files"to"HDFS,"this"node"will"have"far"less"
data"than"the"others"
– During"MapReduce"processing,"this"node"will"use"much"more"
network"bandwidth"as"it"retrieves"data"from"other"nodes"
! Clusters$can$be$rebalanced$using$the$hdfs balancer u4lity$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#22$
Using"hdfs balancer

!  hdfs balancer$reviews$data$block$placement$on$nodes$and$adjusts$
blocks$to$ensure$all$nodes$are$within$x%$u4liza4on$of$each$other$
! U4liza4on$is$defined$as$amount$of$data$storage$used$
! x$is$known$as$the$threshold$
! A$node$is$under#u4lized$if$its$u4liza4on$is$less$than$(average$u4liza4on$#$
threshold)$
! A$node$is$over#u4lized$if$its$u4liza4on$is$more$than$(average$u4liza4on$+$
threshold)$
! Note:$hdfs balancer$does$not$consider$block$placement$on$individual$
disks$on$a$node$
– Only"the"uClizaCon"of"the"node"as"a"whole"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#23$
Using"hdfs balancer"(cont’d)"

! Syntax:$
– hdfs balancer -threshold x
! Threshold$is$op4onal$
– Defaults"to"10"(i.e.,"10%"difference"in"uClizaCon"between"nodes)"
! Rebalancing$can$be$canceled$at$any$4me$
– Interrupt"the"command"with"Ctrl/C"
! Bandwidth$usage$can$be$controlled$by$seeng$the$property$
dfs.balance.bandwidthPerSec$in$hdfs-site.xml
– Specifies"a"bandwidth"in"bytes/sec"(not"bits/sec)"that"each"DataNode"
can"use"for"rebalancing"
– Default"is"1048576"(1MB/sec)"
– RecommendaCon:"approx."0.1"x"network"speed"
– e.g.,"for"a"1Gbps"network,"10MB/sec"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#24$
When"To"Rebalance"

! Rebalance$immediately$afer$adding$new$nodes$to$the$cluster$
! Rebalance$during$non#peak$usage$4mes$
– Rebalancing"does"not"interfere"with"any"exisCng"MapReduce"jobs"
– However,"it"does"use"bandwidth"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#25$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding"and"Removing"Cluster"Nodes"
!! Rebalancing"the"Cluster"
!! Hands#On$Exercise:$Verifying$The$Cluster’s$Self#Healing$Features$
!! Cluster"Upgrading"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#26$
Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"

! In$this$Hands#On$Exercise,$you$will$verify$that$the$cluster$has$recovered$
from$the$problems$you$introduced$in$the$last$exercise$
! Please$refer$to$the$Hands#On$Exercise$Manual$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#27$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding"and"Removing"Cluster"Nodes"
!! Rebalancing"the"Cluster"
!! Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"
!! Cluster$Upgrading$
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#28$
Upgrading"Socware:"When"to"Upgrade?"

! Cloudera$updates$CDH$regularly$
! Format$example:$CDH$4.3.1$

Minor"patch"
release"
Major"version" Major"version"
number" update""

! Major$versions:$every$12$to$18$months$
! Updates:$every$three$to$four$months$
! Patch$releases:$when$necessary$
! We$recommend$that$you$upgrade$when$a$new$version$update$is$released$
! Cloudera$supports$a$major$version$for$at$least$one$year$afer$a$subsequent$
release$is$available$
©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#29$
Upgrading"Socware:"Procedures"

! Sofware$upgrade$procedure$is$fully$documented$on$the$Cloudera$Web$
site$
! General$steps:$
1.  Stop"the"MapReduce"cluster"
2.  Stop"HDFS"cluster"
3.  Install"the"new"version"of"Hadoop"
4.  Start"the"NameNode"with"the"-upgrade"opCon"
5.  Monitor"the"HDFS"cluster"unCl"it"reports"that"the"upgrade"is"
complete"
6.  Start"the"MapReduce"cluster"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#30$
Upgrading"Socware:"Procedures"(cont’d)"

! Once$the$upgraded$cluster$has$been$running$for$a$few$days$with$no$
problems,$finalize$the$upgrade$by$running$
hdfs dfsadmin -finalizeUpgrade
– DataNodes"delete"their"previous"version"working"directories,"then"the"
NameNode"does"the"same"
! If$you$encounter$problems,$you$can$roll$back$an$(unfinalized)$upgrade$by$
stopping$the$cluster,$then$star4ng$the$old$version$of$HDFS$with$the$$
-rollback$op4on$
! Note$that$this$upgrade$procedure$is$required$when$HDFS$data$structures$
or$RPC$communica4on$format$change$
– For"example,"from"CDH3"to"CDH4"
! Probably$not$required$for$minor$version$changes$
– But"see"the"documentaCon"for"definiCve"informaCon!"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#31$
Upgrading"Socware:"Procedures"(cont’d)"

! Note$that$Cloudera$Manager$makes$cluster$upgrades$extremely$simple$
– Cloudera"Manager"Enterprise"EdiCon"also"provides"for"rolling"cluster"
upgrades"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#32$
Chapter"Topics"

Cluster$Opera4ons$$
Cluster$Maintenance$
and$Maintenance$

!! Checking"HDFS"Status"
!! Hands/On"Exercise:"Breaking"The"Cluster"
!! Copying"Data"Between"Clusters"
!! Adding"and"Removing"Cluster"Nodes"
!! Rebalancing"the"Cluster"
!! Hands/On"Exercise:"Verifying"The"Cluster’s"Self/Healing"Features"
!! Cluster"Upgrading"
!! Conclusion$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#33$
EssenCal"Points"

! You$can$check$the$status$of$HDFS$with$the$hdfs fsck$command$
– Reports"problems"but"does"not"repair"them"
! You$can$use$the$distcp$command$to$copy$data$within$a$cluster$or$
between$clusters$
! Use$hdfs dfsadmin –refreshNodes$to$add$new$nodes$to$a$cluster$
or$to$remove$nodes$when$decommissioning$them$
– You"also"need"to"update"your"rack"awareness"script"
! Use$hdfs balancer$to$adjust$block$placement$across$HDFS,$so$as$to$
ensure$beker$u4liza4on$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#34$
Conclusion"

In$this$chapter,$you$have$learned:$
! How$to$check$the$status$of$HDFS$
! How$to$copy$data$between$clusters$
! How$to$add$and$remove$nodes$
! How$to$rebalance$the$cluster$
! How$to$upgrade$your$cluster$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#35$
Cluster"Monitoring"and"TroubleshooBng"
Chapter"15"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#1$
Course"Chapters"
!! IntroducBon" Course"IntroducBon"

!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducBon"to"Apache"Hadoop"
!! GePng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaBon"and"IniBal"ConfiguraBon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
Planning,"Installing,"and"
!! Hadoop"Clients"
Configuring"a"Hadoop"Cluster"
!! Cloudera"Manager"
!! Advanced"Cluster"ConfiguraBon"
!! Hadoop"Security"
!! Managing"and"Scheduling"Jobs"
Cluster$Opera7ons$and$
!! Cluster"Maintenance"
Maintenance$
!! Cluster$Monitoring$and$Troubleshoo7ng$
!! Conclusion"
!! Kerberos"ConfiguraBon" Course"Conclusion"and"Appendices"
!! Configuring"HDFS"FederaBon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#2$
Cluster"Monitoring"and"TroubleshooBng"

In$this$chapter,$you$will$learn:$
! What$general$system$condi7ons$to$monitor$
! How$to$monitor$a$Hadoop$cluster$
! Some$techniques$for$troubleshoo7ng$problems$on$a$Hadoop$cluster$
! Some$common$misconfigura7ons,$and$their$resolu7ons$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#3$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General$System$Monitoring$
!! Monitoring"Hadoop"Clusters"
!! TroubleshooBng"Hadoop"Clusters"
!! Common"MisconfiguraBons"
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#4$
Monitoring"Hadoop"Clusters"

! You$should$use$a$monitoring$tool$to$warn$you$of$poten7al$or$actual$
problems$on$individual$machines$in$the$cluster$
! Cloudera$Manager$provides$Hadoop$cluster$monitoring$with$no$addi7onal$
configura7on$required$
– We"recommend"using"Cloudera"Manager"to"monitor"Hadoop"clusters"
! Hadoop$exposes$data$that$lets$you$integrate$cluster$monitoring$into$many$
exis7ng$monitoring$tools$
– JMX"broadcasts"
– Metrics"sinks"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#5$
Items"to"Monitor"

! Monitor$the$Hadoop$daemons$
– Alert"an"operator"if"a"daemon"goes"down"
– Check"can"be"done"with"
service hadoop-0.20-daemon_name status
! Monitor$disks$and$disk$par77ons$
– Alert"immediately"if"a"disk"fails"
– Send"a"warning"when"a"disk"reaches"80%"capacity"
– Send"a"criBcal"alert"when"a"disk"reaches"90%"capacity"
! Monitor$CPU$usage$on$master$nodes$
– Send"an"alert"on"excessive"CPU"usage"
– Slave"nodes"will"o`en"reach"100%"usage"
– This"is"not"a"problem"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#6$
Items"to"Monitor"(cont’d)"

! Monitor$swap$on$all$nodes$
– Alert"if"the"swap"parBBon"starts"to"be"used"
– Memory"allocaBon"is"overcommi>ed"
! Monitor$network$transfer$speeds$
! Monitor$HDFS$health$
– HA"configuraBon"
– Check"the"size"of"the"edit"logs"on"the"JournalNodes"
– Monitor"for"failovers"
– Non/HA"configuraBon"
– Check"the"age"of"the"fsimage"file"and/or"check"the"size"of"the"
edits"file"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#7$
Log"File"Growth"–"Daemon"Logs"

! MRv1$daemon$logs$
– Monitor"to"avoid"out/of/space"errors"
– MRv1"daemons"log"using"DRFA,"so"logs"are"not"deleted"
– Write"scripts"to"compress,"archive,"and"delete"MRv1"daemon"logs"
– Or,"reconfigure"MRv1"daemons"to"log"using"RFA"
!  HDFS$and$MRv2$daemon$logs$$
– HDFS"and"MRv2"daemons"log"using"RFA,"so"logs"are"rotated"
– Configure"appropriate"size"and"retenBon"policies"in"log4j""

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#8$
Log"File"Growth"–"Task"Logs"

! Cau7on:$inexperienced$developers$will$oWen$create$large$task$logs$from$
their$jobs$
– Data"wri>en"to"stdout/stderr
– Data"wri>en"using"log4j"from"within"the"code"
! Large$task$logs$can$run$your$slave$nodes$out$of$disk$space$
– The"log"files"are"wri>en"to"local"disk"–"not"HDFS"–"on"the"slave"nodes"
– Ensure"you"have"enough"room"on"the"parBBon"for"logs"
– Monitor"to"ensure"that"developers"are"not"logging"excessively"
"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#9$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring$Hadoop$Clusters$
!! TroubleshooBng"Hadoop"Clusters"
!! Common"MisconfiguraBons"
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#10$
Metrics"CollecBon"in"CDH4"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#11$
Metrics"Exposure"in"CDH4"

! You$can$configure$Hadoop$to$publish$Hadoop$metrics$sinks$or$to$broadcast$
over$JMX$ports$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#12$
Accessing"CDH4"Metrics"From"Monitoring"So`ware"

! In$addi7on$to$configuring$Hadoop$to$expose$metrics,$you$must$configure$your$
monitoring$soWware$to$access$either$JMX$ports$or$a$Hadoop$metrics$sink$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#13$
Hadoop"Metrics"Frameworks"in"CDH4""

Metrics2$Framework$ Original$Metrics$Framework$
Supports"most"CDH4"daemons,"including" Supports"the"MapReduce"1"JobTracker"and"
NameNode,"DataNode," TaskTracker"daemons"
SecondaryNameNode,""YARN"daemons,"map"
tasks,"and"reduce"tasks"
Allows"filtering"of"metrics"published"to"a"sink" Allows"filtering"of"metrics"published"to"a"sink"
by"context"(jvm,"dfs,"mapred,"rpc,"etc.)," by"context"only"
daemon,"source,"record,"or"metrics"name"
Reads"metrics"sink"configuraBon"from"the"" Reads"metrics"sink"configuraBon"from"the""
/etc/hadoop/conf/ /etc/hadoop/conf/
hadoop-metrics2.properties"file" hadoop-metrics.properties"file"
Exposes"all"metrics"to"MBeans" Exposes"a"limited"number"of"metrics"to"
MBeans"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#14$
More"About"Metrics"Contexts"

! jvm$
– StaBsBcs"from"the"JVM"including"memory"usage,"thread"counts,"
garbage"collecBon"informaBon"
– All"Hadoop"daemons"use"this"context"
! dfs$
– NameNode"capacity,"number"of"files,"under/replicated"blocks"
! mapred$
– JobTracker"informaBon,"similar"to"that"found"on"the"JobTracker’s"Web"
status"page"
! rpc$
– For"Remote"Procedure"Calls"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#15$
Configuring"Hadoop"to"Publish"Metrics"Using"the"Metrics2"
Framework""
/etc/hadoop/conf/hadoop-metrics2.properties"

# Example: Publish metrics for the DataNode, NameNode, and


# SecondaryNameNode to a metrics sink for Ganglia 3.1
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.
GangliaSink31
# Default sampling period (seconds)
*.period=10
# Ganglia host and gmond port number
namenode.sink.ganglia.servers=192.168.141.160:8649
datanode.sink.ganglia.servers=192.168.141.160:8649
secondarynamenode.sink.ganglia.servers=192.168.141.160:8649

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#16$
Configuring"Hadoop"to"Publish"Metrics"for"MapReduce"1"
Daemons"Using"the"Original"Metrics"Framework"
/etc/hadoop/conf/hadoop-metrics.properties"

# Example 1: Publish metrics for the mapred context to a


# Ganglia 3.1 metrics sink
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# Default sampling period (seconds)
mapred.period=10
# Ganglia host and gmond port number
mapred.servers=192.168.141.160:8649

# Example 2: Publish metrics for the jvm context to a file sink


jvm.class=org.apache.hadoop.metrics.file.FileContext
# Default sampling period (seconds)
jvm.period=10
# Output File
jvm.fileName=/tmp/jvmmetrics.out

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#17$
Configuring"Hadoop"to"Broadcast"JMX"Metrics"

/etc/hadoop/conf/hadoop-env.sh"

# Example: Broadcast JMX metrics for the NameNode to port 8006


# with no security options enabled
export HADOOP_NAMENODE_OPTS=$HADOOP_NAMENODE_OPTS \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.port=8006

! JMX$ports$can$be$opened$on$for$the$NameNode,$SecondaryNameNode,$
DataNode,$ResourceManager,$NodeManager,$MRAppMaster$daemons$
and$for$map$and$reduce$tasks$
! JMX$ports$can$also$be$opened$on$MR1$JobTracker$and$TaskTracker$
daemons$but$broadcast$only$a$limited$amount$of$metrics$$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#18$
CharBng"Metrics"with"Cloudera"Manager"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#19$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring"Hadoop"Clusters"
!! Troubleshoo7ng$Hadoop$Clusters$
!! Common"MisconfiguraBons"
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#20$
TroubleshooBng:"The"Challenges"

! An$overt$symptom$is$a$poor$predictor$of$the$root$cause$of$a$failure$
! Errors$show$up$far$from$the$cause$of$the$problem$
! Clusters$have$a$lot$of$components$
! Example:$
– Symptom:"A"Hadoop"job"that"previously"was"able"to"run"now"fails"to"run"
– Cause:"Disk"space"on"many"nodes"has"filled"up,"so"intermediate"data"
cannot"be"copied"to"Reducers"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#21$
Common"Sources"of"Problems"

! Misconfigura7on$
! Hardware$failure$
! Resource$exhaus7on$
– Not"enough"disk"
– Not"enough"RAM"
– Not"enough"network"bandwidth"
! Inability$to$reach$hosts$on$the$network$
– Naming"issues"
– Network"hardware"issues"
– Network"delays"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#22$
Gathering"InformaBon"About"Problems"

! Are$there$any$issues$in$the$environment?$
! What$about$dependent$components?$
– MapReduce"jobs"depend"on"the"JobTracker,"which"depends"on"the"
underlying"OS"
! Is$there$any$predictability$to$the$failures?$
– All"from"the"same"job?"
– All"from"the"same"TaskTracker?"
! Is$this$a$resource$problem?$
– Have"you"received"an"alert"from"your"monitoring"system?"
! What$do$the$logs$say?$
! What$does$the$CDH$documenta7on/the$Cloudera$Knowledge$Base/your$
favorite$Web$search$engine$say$about$the$problem?$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#23$
General"Rule:"Start"Broad,"Then"Narrow"the"Scope"

JobTracker Task failing Error in the job


Yes
Log on all nodes? Cluster-wide misconfiguration
Problem in the stack

No

TaskTracker Task failing Yes Node misconfiguration


Log on the same Node resource exhausted
node? Problem in the stack

No

Task Attempt
Log /
Run strace

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#24$
Avoiding"Problems"to"Begin"With"

! Misconfigura7on$
– Start"with"recommended"configuraBon"values"
– Don’t"rely"on"Hadoop’s"defaults!"
– Understand"the"precedence"of"overrides"
– Control"your"clients’"ability"to"make"configuraBon"changes"
– Test"changes"before"puPng"them"into"producBon"
– Look"for"changes"when"deploying"new"releases"of"Hadoop""
– Automate"management"of"the"configuraBon"
! Hardware$failure$and$exhaus7on$
– Monitor"your"systems"
– Benchmark"systems"to"understand"their"impact"on"your"cluster"
! Hostname$resolu7on$
– Test"forward"and"reverse"DNS"lookups"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#25$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring"Hadoop"Clusters"
!! TroubleshooBng"Hadoop"Clusters"
!! Common$Misconfigura7ons$
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#26$
Common"MisconfiguraBons:"IntroducBon"

! 35%$of$Cloudera$support$7ckets$are$due$to$misconfigura7ons$
! In$this$sec7on,$we$will$explore$some$of$the$most$common$Hadoop$
misconfigura7ons$and$suggested$solu7ons$
! Note$that$these$are$just$some$of$the$issues$you$could$run$in$to$on$a$cluster$
! Also$note$that$these$are$possible$causes$and$resolu7ons$
– The"problems"could"be"caused"by"many"other"issues"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#27$
Map/Reduce"Task"Out"Of"Memory"Error"

FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask
$MapOutputBuffer.<init>(MapTask.java:781)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

$
! Symptom$
– A"task"fails"to"run"
! Possible$causes$
– Poorly"coded"Mapper"or"Reducer"
– Map"or"Reduce"task"has"run"out"of"memory"
– A"memory"leak"in"the"code"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#28$
Map/Reduce"Task"Out"Of"Memory"Error"(cont’d)"

! Possible$resolu7on$
– Increase"size"of"RAM"allocated"in"mapred.child.java.opts
– Ensure"io.sort.mb"is"smaller"than"RAM"allocated"in"
mapred.child.java.opts
– Require"the"developer"to"recode"a"poorly/wri>en"Mapper"or"Reducer"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#29$
JobTracker"Out"Of"Memory"Error"

ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:


java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:
122)

! Symptom$
– The"JobTracker"crashes"
! Cause$
– JobTracker"has"exceeded"allocated"memory"
! Possible$resolu7ons$
– Increase"JobTracker’s"memory"allocaBon"
– Reduce"mapred.jobtracker.completeuserjobs.maximum
– Amount"of"job"history"held"in"JobTracker’s"RAM"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#30$
Too"Many"Fetch"Failures"

INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for


output of task:

! Symptoms$
– Reduce"tasks"fail"
– Map"tasks"need"to"be"rerun"
– Jobs"are"delayed"by"hours"
! Cause$
– Reducers"are"failing"to"fetch"intermediate"data"from"a"TaskTracker"
where"a"Map"process"ran"
– Too"many"of"these"failures"will"cause"a"TaskTracker"to"be"blacklisted"
"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#31$
Too"Many"Fetch"Failures"(cont’d)"

! Possible$resolu7ons$
– Increase"tasktracker.http.threads
– Decrease"mapred.reduce.parallel.copies
– Tune"mapred.slowstart.completed.maps

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#32$
Not"Able"To"Place"Enough"Replicas"

WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to


place enough replicas

! Symptom$
– Inadequate"replicaBon"or"job"failure"
! Possible$causes$
– DataNodes"do"not"have"enough"xciever"threads"
– Note:"yes,"the"configuraBon"opBon"is"misspelled!"
– Fewer"available"DataNodes"available"than"the"replicaBon"factor"of"the"
blocks"
! Possible$resolu7ons$
– Increase"dfs.datanode.max.xcievers"to"4096"
– Check"replicaBon"factor"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#33$
No"Such"File"or"Directory"

ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker


because ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)

! Symptom$
– The"TaskTracker"cannot"be"started"on"a"node"
! Possible$causes$
– TaskTracker"disk"space"is"full"
– Insufficient"permissions"on"a"directory"the"TaskTracker"writes"to"
– A"disk"has"gone"bad"
! Possible$resolu7ons$
– Increase"dfs.datanode.du.reserved"to"at"least"10%"of"the"disk"
– Set"permissions"for"mapred.local.dir"to"755,"with"owner"mapred

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#34$
Where"Did"My"File"Go?"

hadoop fs –rm –r data


hadoop fs -ls /user/training/.Trash

! Symptom$
– User"cannot"recover"an"accidentally"deleted"file"from"the"trash"
! Possible$causes$
– Trash"is"not"enabled"
– Trash"interval"is"set"too"low"
! Possible$resolu7on$
– Set"fs.trash.interval"to"1440"or"higher"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#35$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring"Hadoop"Clusters"
!! TroubleshooBng"Hadoop"Clusters"
!! Common"MisconfiguraBons"
!! Hands#On$Exercises:$Troubleshoo7ng$Challenge$
!! Conclusion"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#36$
TroubleshooBng"Challenge:"Heap"of"Trouble"

! In$this$Troubleshoo7ng$Challenge,$you$will$recreate$a$problem$scenario,$
diagnose$the$problem,$and,$if$you$have$7me,$fix$the$problem$
! Your$instructor$will$provide$direc7on$as$you$go$through$the$
troubleshoo7ng$process$
! Please$refer$to$the$Hands#On$Exercise$Manual$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#37$
Chapter"Topics"

Cluster$Monitoring$$ Cluster$Opera7ons$$
and$Troubleshoo7ng$ and$Maintenance$

!! General"System"Monitoring"
!! Monitoring"Hadoop"Clusters"
!! TroubleshooBng"Hadoop"Clusters"
!! Common"MisconfiguraBons"
!! Hands/On"Exercises:"TroubleshooBng"Challenge"
!! Conclusion$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#38$
EssenBal"Points"

! Be$sure$to$monitor$your$Hadoop$cluster$
– Hadoop"daemons,"disk"usage,"CPU"usage,"swap,"network"usage,"and"
HDFS"health"
! Cloudera$Manager$provides$Hadoop$cluster$monitoring$
! The$Hadoop$metrics$framework$exposes$metrics$to$let$you$integrate$with$
other$monitoring$frameworks$
! Troubleshoo7ng$Hadoop$problems$is$a$challenge,$because$symptoms$do$
not$always$point$to$the$source$of$problems$
! Follow$best$prac7ces$for$configura7on$management,$benchmarking,$and$
monitoring$and$you$will$avoid$many$problems$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#39$
Conclusion"

In$this$chapter,$you$have$learned:$
! What$general$system$condi7ons$to$monitor$
! How$to$monitor$a$Hadoop$cluster$
! Some$techniques$for$troubleshoo7ng$problems$on$a$Hadoop$cluster$
! Some$common$misconfigura7ons,$and$their$resolu7ons$

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#40$
Conclusion"
Chapter"16"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#1$
Course"Chapters"
!! IntroducBon" Course"IntroducBon"
!! The"Case"for"Apache"Hadoop"
!! HDFS"
IntroducBon"to"Apache"Hadoop"
!! GeSng"Data"Into"HDFS"
!! MapReduce"

!! Planning"Your"Hadoop"Cluster"
!! Hadoop"InstallaBon"and"IniBal"ConfiguraBon"
!! Installing"and"Configuring"Hive,"Impala,"and"Pig"
!! Hadoop"Clients" Planning,"Installing,"and"
!! Cloudera"Manager" Configuring"a"Hadoop"Cluster"
!! Advanced"Cluster"ConfiguraBon"
!! Hadoop"Security"
!! Managing"and"Scheduling"Jobs"
Cluster"OperaBons"and"
!! Cluster"Maintenance"
Maintenance"
!! Cluster"Monitoring"and"TroubleshooBng"
!! Conclusion$
!! Kerberos"ConfiguraBon" Course$Conclusion$and$Appendices$
!! Configuring"HDFS"FederaBon"

©"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#2$
Conclusion

During this course, you have learned:
! The core technologies of Hadoop
! How to populate HDFS from external sources
! How to plan your Hadoop cluster hardware and software
! How to deploy a Hadoop cluster
! What issues to consider when installing Pig, Hive, and Impala
! What issues to consider when deploying Hadoop clients
! How Cloudera Manager can simplify Hadoop administration
! How to configure HDFS for high availability
! What issues to consider when implementing Hadoop security
Conclusion (cont'd)

During this course, you have learned:
! How to schedule jobs on the cluster
! How to maintain your cluster
! How to monitor, troubleshoot, and optimize the cluster
Next Steps

! Cloudera offers a number of other training courses, including:
– Hadoop Essentials
– Cloudera Developer Training for Apache Hadoop
– Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
– Cloudera Training for Apache HBase
– Introduction to Data Science: Building Recommender Systems
– Custom courses
! Cloudera also provides consultancy and troubleshooting services
– Please ask your instructor for more information
Class Evaluation

! Please take a few minutes to complete the class evaluation
– Your instructor will show you how to access the online form
Certification Exam

! This course helps to prepare you for the Cloudera Certified Administrator for Apache Hadoop exam
! For more information about Cloudera certification, refer to http://university.cloudera.com/certification.html
Thank You!

! Thank you for attending this course
! If you have any further questions or comments, please feel free to contact us
– Full contact details are on our Web site at http://www.cloudera.com/
Kerberos Configuration
Appendix A
Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Monitoring and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration (this appendix)
!! Configuring HDFS Federation
Kerberos Message Exchange

! There are three phases required for a client to access a service
– Authentication
– Authorization
– Service request
! These are illustrated over the next several slides
– This is an overview of the important points for Kerberos 5
– See RFC 4120 if you're interested in more detail
Kerberos Message Exchange (cont'd)

[Diagram: Authentication Phase, step 1. The client sends a request to the Authentication Service (AS) within the Kerberos Key Distribution Center (KDC); the KDC also hosts the Ticket Granting Service (TGS). The Desired Network Service is shown separately.]

! Client sends a Ticket-Granting Ticket (TGT) request to AS
Kerberos Message Exchange (cont'd)

[Diagram: Authentication Phase, step 2. The AS replies to the client.]

! AS checks database to authenticate client
– Authentication typically done by checking LDAP/Active Directory
– If valid, AS sends Ticket-Granting Ticket (TGT) to client
Kerberos Message Exchange (cont'd)

[Diagram: Authorization Phase, step 3. The client sends a request to the Ticket Granting Service (TGS).]

! Client uses this TGT to request a service ticket from TGS
– A service ticket is validation that a client can access a service
Kerberos Message Exchange (cont'd)

[Diagram: Authorization Phase, step 4. The TGS replies to the client.]

! TGS verifies whether client is permitted to use requested service
– If access granted, TGS sends service ticket to client
Kerberos Message Exchange (cont'd)

[Diagram: Service Request Phase. The client communicates directly with the Desired Network Service.]

! Client can then use the service
– Service can validate client with info from the service ticket
Important Kerberos Client Commands

! The kinit program is used to request a TGT from Kerberos
! Use klist to see your current tickets
! Use kdestroy to explicitly delete your tickets
– Though they'll expire on their own after several hours
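For example, a typical session looks like this (alice is a placeholder principal name, not an account used elsewhere in this course):

```
$ kinit alice@MYREALM.COM
Password for alice@MYREALM.COM:
$ klist
$ kdestroy
```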

Configuring Hadoop Security

! Hadoop security configuration is a specialized topic
! Many specifics depend on
– Version of Hadoop and related programs
– Type of Kerberos server used (Active Directory or MIT)
– Operating system and distribution
Hadoop Security Setup Assumptions

! The following covers the essential points, assuming use of
– CDH3u3, installed from packages
– CentOS 5.6
– MIT Kerberos server (krb5-1.6.1)
! See the "CDH Security Guide" for detailed instructions
– Available at http://www.cloudera.com/
– Be sure to read the one corresponding to your version of CDH
! The following gives an overview of manual configuration
– If using Kerberos, you should consider using Cloudera Manager
– Cloudera Manager (Enterprise) greatly simplifies this process
Hadoop Security Setup Prerequisites

! Working Hadoop cluster
– Installing CDH3 from packages is strongly advised!
! Working Kerberos KDC server
! Kerberos client libraries installed on all Hadoop nodes
Hadoop Security Setup Overview

! The main steps for securing a cluster are to
– Install two extra CDH packages
– Enable strong encryption in Java
– Set KDC hostname and realm on all Hadoop nodes
– Create Kerberos principals
– Create and deploy Kerberos keytab files
– Shut down all Hadoop daemons
– Enable Hadoop security
– Configure HDFS security options
– Configure MapReduce security options
– Restart Hadoop daemons
– Verify that everything works
Install Extra CDH Packages

! Install hadoop-0.20-sbin and hadoop-0.20-native
– On every node in the cluster
– Again, installing from packages is strongly advised
Enable Strong Encryption in Java

! Install JCE Unlimited Strength Jurisdiction Policy Files
– Distributed as a ZIP file from Oracle's Web site
! For every node in your Hadoop cluster
– Locate the JRE's lib/security directory
– Rename existing policy JARs in that directory
– Add new policy JARs extracted from the downloaded ZIP file
Set the KDC Hostname and Realm

! On every node in your Hadoop cluster
– Edit /etc/krb5.conf
– Set the hostname and realm name of the KDC
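For example, a minimal /etc/krb5.conf might look like the following; the realm name and KDC hostname shown here are placeholders for your own values:

```
[libdefaults]
    default_realm = MYREALM.COM

[realms]
    MYREALM.COM = {
        kdc = kdc01.example.com
        admin_server = kdc01.example.com
    }
```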
Create Kerberos Principals

! Create these Kerberos principals on every Hadoop cluster node
– host/myhost.example.com@MYREALM.COM
– hdfs/myhost.example.com@MYREALM.COM
– mapred/myhost.example.com@MYREALM.COM
! These must contain a fully-qualified hostname
– And this must be the hostname of the current node
! Replace MYREALM.COM with your actual realm name
! Example command for creating the host principal on a node:

# kadmin.local -q "addprinc -randkey host/node4.example.com"
Create and Deploy Kerberos Keytab Files

! On every Hadoop node in your cluster
– Use kadmin to create a keytab for the hdfs principal
– Use kadmin to create a keytab for the mapred principal
– Add entries to both files for the host principal
– See CDH3 Security Guide for specific command options
– Deploy the keytab files to Hadoop's conf directory
– Then set ownership and permissions to protect them
– These steps are shown in the commands below

$ sudo mv hdfs.keytab mapred.keytab /etc/hadoop/conf/
$ sudo chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
$ sudo chown mapred:hadoop /etc/hadoop/conf/mapred.keytab
$ sudo chmod 400 /etc/hadoop/conf/*.keytab
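The keytab-creation step that precedes those commands might look roughly like this on one node. This is a hedged sketch: xst option support (notably -norandkey) varies between Kerberos versions, so follow the CDH Security Guide for the exact commands:

```
# kadmin.local -q "xst -norandkey -k hdfs.keytab hdfs/node4.example.com host/node4.example.com"
# kadmin.local -q "xst -norandkey -k mapred.keytab mapred/node4.example.com host/node4.example.com"
```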

Enable Hadoop Security

! Shut down all Hadoop daemons on all cluster nodes
! Edit core-site.xml to enable Hadoop security:
– Properties must be specified on every machine in your cluster!

Property Name                      Value
hadoop.security.authentication     kerberos
hadoop.security.authorization      true
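Expressed as core-site.xml entries, the table above becomes:

```xml
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

The property tables on the following slides translate into hdfs-site.xml and mapred-site.xml entries in exactly the same way.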

Configure HDFS Security

! Edit hdfs-site.xml to add NameNode properties:
– Again, these must be specified on every machine in your cluster!

Property Name                            Value
dfs.block.access.token.enabled           true
dfs.https.address                        <NAMENODE_HOSTNAME>:50475
dfs.https.port                           50475
dfs.namenode.keytab.file                 /etc/hadoop/conf/hdfs.keytab
dfs.namenode.kerberos.principal          hdfs/_HOST@YOUR-REALM.COM
dfs.namenode.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
Configure HDFS Security (cont'd)

! Edit hdfs-site.xml to add Secondary NameNode properties:
– Again, these must be specified on every machine in your cluster

Property Name                                      Value
dfs.secondary.https.address                        <SECONDARY_NN_HOSTNAME>:50495
dfs.secondary.https.port                           50495
dfs.secondary.namenode.keytab.file                 /etc/hadoop/conf/hdfs.keytab
dfs.secondary.namenode.kerberos.principal          hdfs/_HOST@YOUR-REALM.COM
dfs.secondary.namenode.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
Configure HDFS Security (cont'd)

! Edit hdfs-site.xml to add DataNode properties:
– Again, they must be specified on every machine in your cluster

Property Name                            Value
dfs.datanode.data.dir.perm               700
dfs.datanode.address                     0.0.0.0:1004
dfs.datanode.http.address                0.0.0.0:1006
dfs.datanode.keytab.file                 /etc/hadoop/conf/hdfs.keytab
dfs.datanode.kerberos.principal          hdfs/_HOST@YOUR-REALM.COM
dfs.datanode.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
Start NameNode and Verify Authentication

! Start your NameNode
– Then check the NameNode's logs
– You should see a message that says authentication with your Kerberos realm was successful
Perform Basic NameNode Testing

! You can now verify basic HDFS access
! Try to run an HDFS command from another node
– Such as hadoop fs -ls /user/training
! This command might fail
– If you have not already authenticated with Kerberos
! Use the kinit command to authenticate, if needed
– For example, kinit bob will authenticate user 'bob'
– Type the correct password for that account when prompted
– The HDFS command should now succeed
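Put together, authenticating and then retrying the command looks like this (bob and MYREALM.COM are placeholder values):

```
$ kinit bob
Password for bob@MYREALM.COM:
$ hadoop fs -ls /user/training
```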

Start DataNodes and Protect /tmp Directory

! Start your DataNodes
– Logs should show successful authentication (like NameNode did)
! Set "sticky bit" on the HDFS /tmp directory
– Optional, but strongly recommended

$ sudo -u hdfs hadoop fs -chmod 1777 /tmp
Start and Verify Secondary NameNode

! Recommend changing fs.checkpoint.period temporarily
– Set it to a low number (such as 180 seconds) for now
! Start your Secondary NameNode
– Logs should show successful authentication
– Logs should also show that the 2NN Web server uses Kerberos
! Logs should eventually show successful checkpointing
– You can now set fs.checkpoint.period to its original value
– But don't forget to restart the Secondary NameNode
Configure MapReduce Security

! Edit mapred-site.xml to add JobTracker properties:
– They must be specified on every machine in your cluster

Property Name                                    Value
mapreduce.jobtracker.kerberos.principal          mapred/_HOST@YOUR-REALM.COM
mapreduce.jobtracker.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
mapreduce.jobtracker.keytab.file                 /etc/hadoop/conf/mapred.keytab
Configure MapReduce Security (cont'd)

! Edit mapred-site.xml to add TaskTracker properties:
– Again, they must be specified on every machine in your cluster

Property Name                                     Value
mapreduce.tasktracker.kerberos.principal          mapred/_HOST@YOUR-REALM.COM
mapreduce.tasktracker.kerberos.https.principal    host/_HOST@YOUR-REALM.COM
mapreduce.tasktracker.keytab.file                 /etc/hadoop/conf/mapred.keytab
Configure MapReduce Security (cont'd)

! Edit mapred-site.xml to add TaskController properties:
– Again, they must be specified on every machine in your cluster

Property Name                          Value
mapred.task.tracker.task-controller    org.apache.hadoop.mapred.LinuxTaskController
mapreduce.tasktracker.group            mapred
Configure MapReduce Security (cont'd)

! Edit taskcontroller.cfg to add TaskController properties:
– Again, they must be specified on every machine in your cluster
– The format of this file is propertyname=value, one per line
– The min.user.id value may vary by OS distribution

Property Name                  Value
hadoop.log.dir                 /var/log/hadoop
mapreduce.tasktracker.group    mapred
banned.users                   mapred,hdfs,bin
min.user.id                    500
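In the propertyname=value format described above, the resulting taskcontroller.cfg would contain:

```
hadoop.log.dir=/var/log/hadoop
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=500
```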

Start and Test MapReduce Daemons

! Start JobTracker
– Check the logs to ensure successful authentication with realm
! Start a TaskTracker
– Check the logs to ensure successful authentication with realm
– If so, start all other TaskTrackers
! Test MapReduce by running a job
– Such as the Pi estimator job in the Hadoop examples JAR
– This will fail if you haven't got a valid ticket

$ hadoop jar /usr/lib/hadoop/hadoop*examples.jar pi 5 8
Troubleshooting Hadoop Security

! If problems arise, check the following
– File permissions of keytab files and taskcontroller binary
– File ownership of keytab files and taskcontroller binary
– Strong encryption policy files for Java are correctly configured
– Kerberos principals include fully-qualified domain names
– User submitting jobs has valid tickets (use klist to show them)
– System time is synchronized on each node
– Reverse name resolution works correctly
! Also see Appendix A (Troubleshooting) of the CDH Security Guide
– More solutions to many common problems listed there
Configuring HDFS Federation
Appendix B
Course Chapters

Course Introduction
!! Introduction

Introduction to Apache Hadoop
!! The Case for Apache Hadoop
!! HDFS
!! Getting Data Into HDFS
!! MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
!! Planning Your Hadoop Cluster
!! Hadoop Installation and Initial Configuration
!! Installing and Configuring Hive, Impala, and Pig
!! Hadoop Clients
!! Cloudera Manager
!! Advanced Cluster Configuration
!! Hadoop Security

Cluster Operations and Maintenance
!! Managing and Scheduling Jobs
!! Cluster Maintenance
!! Cluster Monitoring and Troubleshooting

Course Conclusion and Appendices
!! Conclusion
!! Kerberos Configuration
!! Configuring HDFS Federation (this appendix)
HDFS Federation

! HDFS federation allows a cluster to have multiple NameNodes
– Each manages a namespace volume
– Client-side mounts define overall view (similar to /etc/fstab)
! Namespace volumes (and NameNodes) are independent
– They do not communicate with one another
! Benefits of HDFS federation
– Scalability
– Performance
– Isolation
! The material in this section applies only to CDH4
– And the equivalent Apache Hadoop versions (0.23.x)
! Fair to say that HDFS federation is not often used in production
How Federation Works

! Each NameNode manages a namespace volume, consisting of
– Metadata (file names, permissions, etc.)
– Block pool (all blocks corresponding to files in that volume)
! A NameNode manages the block pool
– But blocks themselves are still stored on DataNodes
– DataNodes in a cluster store blocks for all volumes

[Diagram: two namespace volumes shown side by side. One holds metadata entries accounting, engineering, and marketing above a block pool containing blocks 0-4; the other holds asia, australia, and europe above a block pool containing blocks 5-9.]
Cluster ID

! Each HDFS filesystem is now associated with a Cluster ID
– Either specified or auto-generated when you format HDFS
– All NameNodes in a given cluster must have the same cluster ID
– The cluster ID is visible in the NameNode's Web UI
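To specify the cluster ID yourself, pass it when formatting. This sketch follows the Apache HDFS federation documentation, though the exact flag spelling can vary between Hadoop versions:

```
$ sudo -u hdfs hdfs namenode -format -clusterId mycluster
```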
Federation Configuration

! Each volume within a cluster will have its own NameService ID
– An arbitrary (but unique) identifier that you select
– Used to define several volume-specific properties
! Clients and DataNodes must know about the volumes
– Edit hdfs-site.xml
– Add a new property, dfs.nameservices
– Value is a comma-delimited list of all NameService IDs

<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
Federation Configuration (cont'd)

! For each NameNode-specific property in hdfs-site.xml
– Create a new property for each namespace volume
– Append a dot and the NameService ID

<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>namenode01.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>namenode02.example.com:8020</value>
</property>
Federation Configuration (cont'd)

! Repeat this for volume-specific Secondary NameNode properties
– Usually just dfs.namenode.secondary.http-address
! Propagate configuration changes throughout the cluster
! If setting up a new HDFS installation
– Subsequent steps match those of an unfederated installation
! If adding federation to an existing installation
– Do not format the NameNode
Client-Side Mounts

! HDFS federation allows for multiple NameNodes
– Clients mount one or more of these
! Clients use ViewFS to compose a view of HDFS
– The concept is similar to /etc/fstab on Linux
– This URI typically contains the cluster ID, as shown below
! Edit core-site.xml and set the fs.defaultFS property
– This property was formerly called fs.default.name

<property>
  <name>fs.defaultFS</name>
  <value>viewfs://mycluster</value>
</property>
Client-Side Mounts (cont'd)

! Define a new property in core-site.xml for each mount point
– The property name follows the pattern
  fs.viewfs.mounttable.<CLUSTERID>.link.<MOUNTPOINT>
– The property value will follow this pattern
  hdfs://<NAMENODE_HOST:PORT>/<PATH>
! Let's look at how we can achieve the following…

[Diagram: a directory tree rooted at /, with users (containing engineering, finance, marketing, and sales) and reports (containing inventory and TPS).]
Client-Side Mounts (cont'd)

<property>
  <name>fs.viewfs.mounttable.mycluster.link./users</name>
  <value>hdfs://namenode01.example.com:8020/users</value>
</property>
<property>
  <name>fs.viewfs.mounttable.mycluster.link./reports</name>
  <value>hdfs://namenode02.example.com:8020/reports</value>
</property>

[Diagram: the /users subtree (engineering, finance, marketing, sales) is served by namenode01.example.com; the /reports subtree (inventory, TPS) is served by namenode02.example.com.]
